Build a Data Lake with AWS Services (Part 2)

DevOps

Note: This blog post was reviewed using AI for factual correctness and clarity. All content was tested in my private homelab to ensure accuracy.

Welcome back! In Part 1, we explored the architecture of a simple yet powerful AWS-based data lake. Now, in Part 2, we’ll roll up our sleeves and actually provision resources using Terraform. Let’s get building. 🛠️


🪣 S3 Storage for Data

Let’s start with our raw storage layer — S3.

Key things to keep in mind:

  • S3 bucket names must be globally unique.
  • Keep buckets private: set the ACL to private and consider enabling S3 Block Public Access.
  • Use lifecycle rules to optimize cost: frequently accessed data stays in Standard, older data can move to STANDARD_IA or GLACIER.

resource "aws_s3_bucket" "data" {
  bucket = "${var.company_name}-${var.team_name}-data-${var.environment}"
  acl    = "private"
  tags   = local.common.tags

  lifecycle_rule {
    id      = "${var.company_name}-${var.team_name}-data-${var.environment}-rule"
    enabled = true

    transition {
      days          = 60
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}
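
Heads-up: on AWS provider v4 and newer, the inline acl and lifecycle_rule arguments are deprecated in favor of standalone resources. A rough sketch of the equivalent setup under the newer provider (same naming and variables as above):

```hcl
resource "aws_s3_bucket" "data" {
  bucket = "${var.company_name}-${var.team_name}-data-${var.environment}"
  tags   = local.common.tags
  # ACLs are disabled on new buckets by default, which gives you private behavior
}

resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "${var.company_name}-${var.team_name}-data-${var.environment}-rule"
    status = "Enabled"

    # empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 60
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}
```

The same split applies to the EMR log bucket later in this post.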

📣 Data Notifications with SNS

Want to trigger workflows when new data hits your bucket? Easy. Set up an SNS topic and hook it to S3.

resource "aws_sns_topic" "sns_s3_data_notification_queue" {
  name = "${var.team_name}-sns-s3-notification-${var.environment}"
  tags = local.common.tags
}

Now create a topic policy that allows S3 (scoped to our bucket) to publish, and attach it:

data "aws_iam_policy_document" "sns_policy_doc" {
  statement {
    actions = ["SNS:Publish"]
    effect  = "Allow"

    # let the S3 service publish, restricted to our bucket via aws:SourceArn
    principals {
      type        = "Service"
      identifiers = ["s3.amazonaws.com"]
    }

    resources = [aws_sns_topic.sns_s3_data_notification_queue.arn]

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values   = [aws_s3_bucket.data.arn]
    }
  }
}

resource "aws_sns_topic_policy" "default" {
  arn    = aws_sns_topic.sns_s3_data_notification_queue.arn
  policy = data.aws_iam_policy_document.sns_policy_doc.json
}

resource "aws_s3_bucket_notification" "data-bucket-notification" {
  bucket = aws_s3_bucket.data.id

  topic {
    topic_arn = aws_sns_topic.sns_s3_data_notification_queue.arn
    events    = ["s3:ObjectCreated:*"]
  }

  # S3 validates that it can publish to the topic at creation time,
  # so the topic policy must exist first
  depends_on = [aws_sns_topic_policy.default]
}
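
To actually kick off a workflow, subscribe something to the topic — for example, an SQS queue that a downstream job polls. A minimal sketch (the queue name is illustrative; you'd also need an aws_sqs_queue_policy allowing the topic to send messages):

```hcl
resource "aws_sqs_queue" "data_events" {
  name = "${var.team_name}-data-events-${var.environment}"
  tags = local.common.tags
}

resource "aws_sns_topic_subscription" "data_events" {
  topic_arn = aws_sns_topic.sns_s3_data_notification_queue.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.data_events.arn
}
```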

🧠 Spinning Up an EMR Cluster

All of our ETL jobs will run on a shared EMR cluster. Let’s set one up with Spark as our core application:

resource "aws_emr_cluster" "etl_cluster" {
  name          = "${var.team_name}-etl-cluster-${var.environment}"
  release_label = "emr-6.10.0"
  applications  = ["Spark"]

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 5
    name           = "${var.team_name}-etl-cluster-core-${var.environment}"
  }

  tags = merge(local.common.tags, {
    useTag = "etl"
  })
}

🔐 IAM Roles for EMR

EMR clusters need IAM roles to function. First, the service role:

data "aws_iam_policy_document" "emr_service_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type = "Service"
      identifiers = [
        "elasticmapreduce.amazonaws.com",
        "application-autoscaling.amazonaws.com"
      ]
    }
  }
}

resource "aws_iam_role" "emr_service_role" {
  name               = "${var.team_name}-emr-service-role-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.emr_service_assume_role.json
  path               = "/${var.team_name}/"
  tags               = local.common.tags
}

resource "aws_iam_role_policy_attachment" "emr_service_role_attachment" {
  role       = aws_iam_role.emr_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}
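
One thing the cluster also needs (not shown above) is an EC2 instance profile for the cluster nodes — EMR won't launch without one. A sketch using the AWS-managed policy, following the same naming convention:

```hcl
data "aws_iam_policy_document" "emr_ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "emr_ec2_role" {
  name               = "${var.team_name}-emr-ec2-role-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.emr_ec2_assume_role.json
  path               = "/${var.team_name}/"
  tags               = local.common.tags
}

resource "aws_iam_role_policy_attachment" "emr_ec2_attachment" {
  role       = aws_iam_role.emr_ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
}

resource "aws_iam_instance_profile" "emr_ec2_profile" {
  name = "${var.team_name}-emr-ec2-profile-${var.environment}"
  role = aws_iam_role.emr_ec2_role.name
}
```

Wire it into the cluster via ec2_attributes, e.g. instance_profile = aws_iam_instance_profile.emr_ec2_profile.arn.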

Add the role to your EMR cluster:

resource "aws_emr_cluster" "etl_cluster" {
  # ... other arguments from above ...
  service_role = aws_iam_role.emr_service_role.arn
}

📈 Auto-Scaling Configuration

Let’s enable autoscaling so the cluster can grow under load and shrink when idle to save cost. We need a separate autoscaling role:

resource "aws_iam_role" "emr_autoscaling_role" {
  name               = "${var.team_name}-emr-autoscaling-role-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.emr_service_assume_role.json
  path               = "/${var.team_name}/"
  tags               = local.common.tags
}

resource "aws_iam_role_policy_attachment" "emr_autoscaling_attachment" {
  role       = aws_iam_role.emr_autoscaling_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforAutoScalingRole"
}

resource "aws_emr_cluster" "etl_cluster" {
  # ... other arguments from above ...
  autoscaling_role = aws_iam_role.emr_autoscaling_role.arn
}

Now define cluster-wide scaling limits with a managed scaling policy (note that the autoscaling role above is only needed if you attach custom scaling policies to individual instance groups; managed scaling itself doesn't use it):

resource "aws_emr_managed_scaling_policy" "emr_scaling_policy" {
  cluster_id = aws_emr_cluster.etl_cluster.id

  compute_limits {
    unit_type                       = "Instances"
    minimum_capacity_units          = 5
    maximum_capacity_units          = 200
    maximum_ondemand_capacity_units = 40
    maximum_core_capacity_units     = 10
  }
}

📄 EMR Logging to S3

Let’s store logs in a dedicated bucket:

resource "aws_s3_bucket" "emr_log" {
  bucket = "${var.company_name}-${var.team_name}-emr-log-${var.environment}"
  acl    = "private"
  tags   = local.common.tags

  lifecycle_rule {
    id      = "${var.company_name}-${var.team_name}-emr-log-${var.environment}-rule"
    enabled = true
    expiration {
      days = 30
    }
  }
}

Set the log URI in EMR:

resource "aws_emr_cluster" "etl_cluster" {
  # ... other arguments from above ...
  log_uri = "s3://${aws_s3_bucket.emr_log.id}/"
}

That’s a wrap for Part 2! 🎉 In the next part, we’ll connect EMR to Glue for metadata management and look into querying with Athena and Redshift.