Build a Data Lake with AWS Services (Part 3)
Note: This blog post was reviewed using AI for factual correctness and clarity. All content was tested in my private homelab to ensure accuracy.
Welcome to Part 3 of our AWS Data Lake series! So far, we’ve set up S3 storage, an EMR cluster, and even handled autoscaling and logging. Now it’s time to bring meaning to our data by setting up metadata with AWS Glue and unlocking query power with Athena and Redshift.
📚 Metadata Management with AWS Glue
S3 holds our data, but it doesn’t know what it means. That’s where AWS Glue comes in — acting like a catalog or schema registry.
🧱 Define the Glue Catalog Database
resource "aws_glue_catalog_database" "data_lake" {
  name = "${var.team_name}_data_lake_${var.environment}"
}
This acts like a namespace for all our tables. Simple and essential.
🧪 Create a Table for Structured Data
Assume we have Parquet files in S3 that we want to expose for SQL queries:
resource "aws_glue_catalog_table" "events" {
  name          = "event_data"
  database_name = aws_glue_catalog_database.data_lake.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "classification" = "parquet"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.data.bucket}/events/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
    compressed    = true

    ser_de_info {
      name                  = "parquet"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "event_id"
      type = "string"
    }

    # "timestamp" is a reserved word in Athena DDL, so quote it in queries
    columns {
      name = "timestamp"
      type = "timestamp"
    }

    columns {
      name = "user_id"
      type = "string"
    }
  }

  partition_keys {
    name = "event_date"
    type = "string"
  }
}

Note that in the AWS provider, `columns` and `partition_keys` are repeated blocks, not list attributes.
The `partition_keys` block above partitions the table by `event_date`, which lets Athena prune partitions instead of scanning the whole table — better performance and lower cost.
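Athena won’t see new partition folders in S3 until they’re registered in the Glue catalog. Assuming Hive-style `event_date=YYYY-MM-DD/` prefixes under the table location (the bucket name below is a placeholder), you can register them from Athena:

```sql
-- Register a single new partition explicitly
ALTER TABLE data_lake.event_data
ADD IF NOT EXISTS PARTITION (event_date = '2024-01-01')
LOCATION 's3://your-data-bucket/events/event_date=2024-01-01/';

-- Or scan the table location and pick up all Hive-style partitions at once
MSCK REPAIR TABLE data_lake.event_data;
```

For high-volume ingestion, a Glue crawler or partition projection can automate this instead.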
🔍 Querying with Athena
Now that Glue knows your schema, it’s query time.
Athena is a serverless interactive SQL engine — no setup required!
✅ Enable Athena Output
Athena needs a place to store query results:
resource "aws_s3_bucket" "athena_output" {
  bucket = "${var.company_name}-${var.team_name}-athena-output-${var.environment}"
  tags   = local.common.tags
}

New S3 buckets are private by default, and the inline `acl` argument was deprecated in AWS provider v4, so we omit it here.
Set this bucket as your query result location in the Athena console or using the SDK.
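Rather than setting the result location by hand, you can pin it with an Athena workgroup so every query in the workgroup writes there. A sketch (the workgroup name is arbitrary):

```hcl
resource "aws_athena_workgroup" "analytics" {
  name = "${var.team_name}-analytics-${var.environment}"

  configuration {
    # Override any client-side result settings with the workgroup's
    enforce_workgroup_configuration = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_output.bucket}/results/"
    }
  }

  tags = local.common.tags
}
```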
Then you can run queries like:
SELECT *
FROM data_lake.event_data
WHERE event_date = '2024-01-01';
📊 Integrating with Redshift Spectrum
Already using Redshift? Glue makes external S3 data available to Redshift via Redshift Spectrum.
Steps:
- Define external schema in Redshift (pointing to Glue DB).
- Run SQL queries across Redshift and S3-stored data.
- Enjoy high-performance joins and filters across platforms.
This unlocks hybrid analytical workflows — combining warehouse performance with data lake scale.
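The first step above looks roughly like this in Redshift. The schema name, Glue database name, and IAM role ARN are placeholders (the role needs Glue and S3 read access), and `users` stands in for a hypothetical local Redshift table:

```sql
-- Map the Glue database into Redshift as an external schema
CREATE EXTERNAL SCHEMA spectrum_lake
FROM DATA CATALOG
DATABASE 'team_data_lake_prod'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

-- Then join S3-backed tables with local Redshift tables
SELECT u.plan, COUNT(*) AS events
FROM spectrum_lake.event_data e
JOIN users u ON u.user_id = e.user_id
WHERE e.event_date = '2024-01-01'
GROUP BY u.plan;
```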
🧠 Summary
By now, you’ve got:
- ✅ Data stored efficiently in S3
- ✅ EMR clusters for processing
- ✅ Metadata catalogs with Glue
- ✅ Serverless queries with Athena
- ✅ Optional integration with Redshift
This is a fully operational, modular, and scalable AWS-native data lake.
🛣️ What’s Next?
You might consider:
- Adding CI/CD pipelines for Terraform
- Triggering workflows using EventBridge or Step Functions
- Monitoring jobs and access logs with CloudWatch
- Integrating with SageMaker for ML workflows
Thanks for following along. Now go make your data work for you! 📊🚀