Build a Data Lake with AWS Services (Part 2)
Note: This blog post was reviewed using AI for factual correctness and clarity. All content was tested in my private homelab to ensure accuracy.
Welcome back! In Part 1, we explored the architecture of a simple yet powerful AWS-based data lake. Now, in Part 2, we’ll roll up our sleeves and actually provision resources using Terraform. Let’s get building. 🛠️
🪣 S3 Storage for Data
Let’s start with our raw storage layer — S3.
Key things to keep in mind:
- S3 bucket names must be globally unique.
- Set ACLs to private for security.
- Use lifecycle rules to optimize cost: frequently accessed data stays in STANDARD, older data can move to STANDARD_IA or GLACIER.
```hcl
resource "aws_s3_bucket" "data" {
  bucket = "${var.company_name}-${var.team_name}-data-${var.environment}"
  acl    = "private"
  tags   = local.common.tags

  lifecycle_rule {
    id      = "${var.company_name}-${var.team_name}-data-${var.environment}-rule"
    enabled = true

    transition {
      days          = 60
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}
```
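A note on provider versions: the inline acl and lifecycle_rule arguments were deprecated in AWS provider v4 and removed in v5. If you're on a newer provider, a sketch of the equivalent split into separate resources (same bucket, same transitions) might look like this:

```hcl
# AWS provider v4+ moves lifecycle rules into their own resource.
resource "aws_s3_bucket" "data" {
  bucket = "${var.company_name}-${var.team_name}-data-${var.environment}"
  tags   = local.common.tags
}

resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "${var.company_name}-${var.team_name}-data-${var.environment}-rule"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket.
    filter {}

    transition {
      days          = 60
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}
```

You can usually skip an explicit ACL here: newly created buckets are private by default (ACLs are disabled under the BucketOwnerEnforced object-ownership setting).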
📣 Data Notifications with SNS
Want to trigger workflows when new data hits your bucket? Easy. Set up an SNS topic and hook it to S3.
```hcl
resource "aws_sns_topic" "sns_s3_data_notification_queue" {
  name = "${var.team_name}-sns-s3-notification-${var.environment}"
  tags = local.common.tags
}
```
Now create a topic policy that allows the bucket to publish, and attach it:
```hcl
data "aws_iam_policy_document" "sns_policy_doc" {
  statement {
    actions = ["SNS:Publish"]
    effect  = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["*"]
    }

    resources = [aws_sns_topic.sns_s3_data_notification_queue.arn]

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values   = [aws_s3_bucket.data.arn]
    }
  }
}

resource "aws_sns_topic_policy" "default" {
  arn    = aws_sns_topic.sns_s3_data_notification_queue.arn
  policy = data.aws_iam_policy_document.sns_policy_doc.json
}
```
```hcl
resource "aws_s3_bucket_notification" "data_bucket_notification" {
  bucket = aws_s3_bucket.data.id

  topic {
    topic_arn = aws_sns_topic.sns_s3_data_notification_queue.arn
    events    = ["s3:ObjectCreated:*"]
  }
}
```
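To actually kick off a workflow, something has to subscribe to the topic. As one hypothetical example (the queue name and consumer are assumptions, not part of the original setup), an SQS queue that a downstream job polls:

```hcl
# Hypothetical downstream queue; any SQS, Lambda, or HTTPS endpoint works here.
resource "aws_sqs_queue" "data_events" {
  name = "${var.team_name}-data-events-${var.environment}"
  tags = local.common.tags
}

resource "aws_sns_topic_subscription" "data_events" {
  topic_arn = aws_sns_topic.sns_s3_data_notification_queue.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.data_events.arn
}
```

Note that the queue also needs its own aws_sqs_queue_policy allowing sns.amazonaws.com to SendMessage from this topic; that's omitted here for brevity.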
🧠 Spinning Up an EMR Cluster
All of our ETL jobs will run on a shared EMR cluster. Let’s set one up with Spark as our core application:
```hcl
resource "aws_emr_cluster" "etl_cluster" {
  name          = "${var.team_name}-etl-cluster-${var.environment}"
  release_label = "emr-6.10.0"
  applications  = ["Spark"]

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 5
    name           = "${var.team_name}-etl-cluster-core-${var.environment}"
  }

  tags = merge(local.common.tags, {
    useTag = "etl"
  })
}
```
🔐 IAM Roles for EMR
EMR clusters need IAM roles to function. First, the service role:
```hcl
data "aws_iam_policy_document" "emr_service_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "Service"
      identifiers = [
        "elasticmapreduce.amazonaws.com",
        "application-autoscaling.amazonaws.com"
      ]
    }
  }
}

resource "aws_iam_role" "emr_service_role" {
  name               = "${var.team_name}-emr-service-role-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.emr_service_assume_role.json
  path               = "/${var.team_name}/"
  tags               = local.common.tags
}

resource "aws_iam_role_policy_attachment" "emr_service_role_attachment" {
  role       = aws_iam_role.emr_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}
```
Add the role to your EMR cluster:
```hcl
resource "aws_emr_cluster" "etl_cluster" {
  # ...
  service_role = aws_iam_role.emr_service_role.arn
}
```
📈 Auto-Scaling Configuration
Let’s enable autoscaling so the cluster grows with the workload and shrinks back down when idle, keeping costs in check. We need a separate autoscaling role:
```hcl
resource "aws_iam_role" "emr_autoscaling_role" {
  name               = "${var.team_name}-emr-autoscaling-role-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.emr_service_assume_role.json
  path               = "/${var.team_name}/"
  tags               = local.common.tags
}

resource "aws_iam_role_policy_attachment" "emr_autoscaling_attachment" {
  role       = aws_iam_role.emr_autoscaling_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforAutoScalingRole"
}

resource "aws_emr_cluster" "etl_cluster" {
  # ...
  autoscaling_role = aws_iam_role.emr_autoscaling_role.arn
}
```
Now define the scaling limits with a managed scaling policy:
```hcl
resource "aws_emr_managed_scaling_policy" "emr_scaling_policy" {
  cluster_id = aws_emr_cluster.etl_cluster.id

  compute_limits {
    unit_type                       = "Instances"
    minimum_capacity_units          = 5
    maximum_capacity_units          = 200
    maximum_ondemand_capacity_units = 40
    maximum_core_capacity_units     = 10
  }
}
```
📄 EMR Logging to S3
Let’s store logs in a dedicated bucket:
```hcl
resource "aws_s3_bucket" "emr_log" {
  bucket = "${var.company_name}-${var.team_name}-emr-log-${var.environment}"
  acl    = "private"
  tags   = local.common.tags

  lifecycle_rule {
    id      = "${var.company_name}-${var.team_name}-emr-log-${var.environment}-rule"
    enabled = true

    expiration {
      days = 30
    }
  }
}
```
Set the log URI in EMR:
```hcl
resource "aws_emr_cluster" "etl_cluster" {
  # ...
  log_uri = "s3n://${aws_s3_bucket.emr_log.id}/"
}
```
That’s a wrap for Part 2! 🎉 In the next part, we’ll connect EMR to Glue for metadata management and look into querying with Athena and Redshift.