Build a Data Lake with AWS Services (Part 3)

DevOps

Note: This blog post was reviewed using AI for factual correctness and clarity. All content was tested in my private homelab to ensure accuracy.

Welcome to Part 3 of our AWS Data Lake series! So far, we’ve set up S3 storage, an EMR cluster, and even handled autoscaling and logging. Now it’s time to bring meaning to our data by setting up metadata with AWS Glue and unlocking query power with Athena and Redshift.


📚 Metadata Management with AWS Glue

S3 holds our data, but it doesn’t know what that data means. That’s where AWS Glue comes in: its Data Catalog acts as a schema registry for the lake, holding table definitions, column types, and partition layouts for the files sitting in S3.

🧱 Define the Glue Catalog Database

resource "aws_glue_catalog_database" "data_lake" {
  name = "${var.team_name}_data_lake_${var.environment}"
}

This acts like a namespace for all our tables. Simple and essential.


🧪 Create a Table for Structured Data

Assume we have Parquet files in S3 that we want to expose for SQL queries:

resource "aws_glue_catalog_table" "events" {
  name          = "event_data"
  database_name = aws_glue_catalog_database.data_lake.name
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.data.bucket}/events/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
    compressed    = true

    ser_de_info {
      name                  = "parquet"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    # In the AWS provider, columns are repeated nested blocks,
    # not a list attribute
    columns {
      name = "event_id"
      type = "string"
    }

    columns {
      name = "timestamp"
      type = "timestamp"
    }

    columns {
      name = "user_id"
      type = "string"
    }
  }

  # Partition columns are declared here, outside storage_descriptor,
  # and must not be repeated in the columns above
  partition_keys {
    name = "event_date"
    type = "string"
  }
}

The partition_keys block above partitions the table by event_date. Partitioning is one of the biggest performance and cost levers with Athena: queries that filter on the partition column scan only the matching S3 prefixes instead of the whole dataset.
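One caveat: the catalog only knows about partitions that have been registered — dropping new files under s3://…/events/event_date=2024-01-02/ does not make them queryable by itself. Partitions can be registered by a Glue crawler, by running MSCK REPAIR TABLE in Athena, or declared explicitly in Terraform with the aws_glue_partition resource. A minimal sketch of the explicit route (the partition value and resource name are illustrative):

resource "aws_glue_partition" "events_2024_01_01" {
  database_name    = aws_glue_catalog_database.data_lake.name
  table_name       = aws_glue_catalog_table.events.name
  partition_values = ["2024-01-01"]

  # Each partition carries its own storage descriptor pointing at
  # the matching S3 prefix
  storage_descriptor {
    location      = "s3://${aws_s3_bucket.data.bucket}/events/event_date=2024-01-01/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  }
}

Explicit Terraform partitions work for a handful of static partitions; for daily-arriving data, a crawler or scheduled MSCK REPAIR TABLE scales better.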


🔍 Querying with Athena

Now that Glue knows your schema, it’s query time.

Athena is a serverless, interactive SQL engine — no clusters to provision or manage, and you pay per query based on the amount of data scanned.

✅ Enable Athena Output

Athena needs a place to store query results:

# The inline acl argument is deprecated in AWS provider v4+;
# new buckets are private by default
resource "aws_s3_bucket" "athena_output" {
  bucket = "${var.company_name}-${var.team_name}-athena-output-${var.environment}"
  tags   = local.common.tags
}

Set this bucket as your query result location in the Athena console or using the SDK.
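Rather than configuring this by hand, the result location can also be pinned down in Terraform via an Athena workgroup. A sketch, assuming the athena_output bucket above (the workgroup name is illustrative):

resource "aws_athena_workgroup" "data_lake" {
  name = "${var.team_name}-data-lake-${var.environment}"

  configuration {
    # Force every query in this workgroup to use our result bucket,
    # even if the client supplies its own location
    enforce_workgroup_configuration = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_output.bucket}/results/"
    }
  }
}

Enforcing the workgroup configuration keeps ad-hoc clients from scattering query results across arbitrary buckets.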

Then you can run queries like the following (the database name is whatever your interpolated Glue name resolved to; shown here as data_lake for brevity):

SELECT *
FROM data_lake.event_data
WHERE event_date = '2024-01-01';

📊 Integrating with Redshift Spectrum

Already using Redshift? Glue makes external S3 data available to Redshift via Redshift Spectrum.

Steps:

  1. Define external schema in Redshift (pointing to Glue DB).
  2. Run SQL queries across Redshift and S3-stored data.
  3. Enjoy high-performance joins and filters across platforms.

This unlocks hybrid analytical workflows — combining warehouse performance with data lake scale.
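Step 1 above is a single statement in Redshift. A hedged sketch — the schema name, Glue database name, and IAM role ARN are placeholders, and the role must grant Redshift read access to both the Glue catalog and the S3 data bucket:

-- Map the Glue database into Redshift as an external schema
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'myteam_data_lake_prod'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';

-- Query S3-backed data just like a native table
SELECT user_id, count(*) AS events
FROM spectrum.event_data
WHERE event_date = '2024-01-01'
GROUP BY user_id;

From here, spectrum.event_data can be joined directly against native Redshift tables in the same query.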


🧠 Summary

By now, you’ve got:

  • ✅ Data stored efficiently in S3
  • ✅ EMR clusters for processing
  • ✅ Metadata catalogs with Glue
  • ✅ Serverless queries with Athena
  • ✅ Optional integration with Redshift

This is a fully operational, modular, and scalable AWS-native data lake.


🛣️ What’s Next?

You might consider:

  • Adding CI/CD pipelines for Terraform
  • Triggering workflows using EventBridge or Step Functions
  • Monitoring jobs and access logs with CloudWatch
  • Integrating with SageMaker for ML workflows

Thanks for following along. Now go make your data work for you! 📊🚀