Build a Data Lake with AWS Services (Part 1)
Note: This blog post was reviewed using AI for factual correctness and clarity. All content was tested in my private homelab to ensure accuracy.
Data lakes are more than a buzzword — they’re essential infrastructure for organizations dealing with data at scale. While big names like Databricks, Snowflake, and Cloudera offer powerful (and pricey) solutions, many teams already using AWS can build a simple and flexible data lake using native AWS services.
In this post, we’ll walk through the design philosophy, architecture, and service roles in building a data lake entirely on AWS. This is Part 1 of the journey — implementation comes next!
🧠 Design Philosophy: Decouple Everything
A golden rule learned from those before us: separate storage from compute.
Why? Because compute platforms are always evolving. Spark loves HDFS, Redshift has its own storage layer, and who knows what the next hot tool will be? Keeping data in an open, flexible format allows you to switch or add compute tools as needed.
Also, data comes in all shapes and sizes:
- Logs → plain text
- Snapshots → structured files
- Real-time → direct DB queries
With decoupling, you’re free to store it all in the best-fit format while choosing the right tools to process it later.
In AWS terms:
- Storage → S3
- Metadata → Glue
- Compute → EMR, Athena, Notebooks, Redshift
🪣 Storage Layer: Amazon S3
Why S3?
It’s durable, scalable, inexpensive, and battle-tested. But it’s not a full-fledged file system — it’s a flat key-value object store. That means we need to be a bit mindful of how we use it:
⚠️ Tips for Using S3 Wisely:
- Avoid millions of tiny files: Each object costs a separate HTTP request, so per-request overhead dominates with small files, and listing large prefixes becomes painfully slow. Prefer fewer, larger files (and compact small ones periodically).
- Eventual consistency? Not anymore! Since December 2020, S3 provides strong read-after-write consistency for all operations, at no extra cost. Yay!
- “Directories” are an illusion: Keys like `a/b/c/d.txt` look like a folder structure, but S3 treats them as flat keys. You can list or delete using prefixes, but directories don’t have native properties or hierarchy behavior.
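To make the “flat keys, not folders” point concrete, here’s a minimal pure-Python sketch (no AWS calls) of how S3’s `Prefix`/`Delimiter` listing produces the folder illusion. The keys and the `list_prefix` helper are made up for illustration; real S3 does the same grouping via `list_objects_v2(Prefix=..., Delimiter=...)`.

```python
# S3 has no real directories: a bucket is a flat map of key -> object.
# "Folders" only appear when a client groups keys by a delimiter.
keys = [
    "logs/2024/01/app.log",
    "logs/2024/02/app.log",
    "snapshots/users.parquet",
]

def list_prefix(keys, prefix, delimiter="/"):
    """Mimic S3 ListObjectsV2 Prefix/Delimiter grouping (illustrative only)."""
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a "subfolder".
            common_prefixes.add(prefix + rest.split(delimiter)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

print(list_prefix(keys, "logs/2024/"))
# -> ([], ['logs/2024/01/', 'logs/2024/02/'])
```

Notice there are no directory objects anywhere — “logs/2024/01/” exists only because some keys happen to share that prefix, which is also why deleting a “folder” means deleting every key under it.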
🗂️ Metadata Layer: AWS Glue
S3 stores bytes — Glue tells us what those bytes mean.
Glue provides a database-like interface to your S3 data by storing table metadata, such as:
- Column names and types
- File formats (CSV, Parquet, etc.)
- Partitioning details
Glue also connects to external databases to pull in their schema info, letting you join across S3 and RDS/Redshift. It turns your blob storage into something queryable.
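To show what that metadata actually looks like, here’s a sketch of a Glue table definition as a plain Python dict. The field names follow the shape boto3’s `glue.create_table` expects for its `TableInput`, but the table name, bucket, and columns are invented for illustration — adjust them to your own dataset.

```python
# Illustrative Glue table metadata for a Parquet dataset on S3:
# schema, format, location, and partition keys, all in one place.
table_input = {
    "Name": "page_views",                      # hypothetical table name
    "TableType": "EXTERNAL_TABLE",             # data stays in S3; Glue only describes it
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    "StorageDescriptor": {
        "Location": "s3://example-datalake/page_views/",  # hypothetical bucket
        "Columns": [
            {"Name": "user_id", "Type": "bigint"},
            {"Name": "url", "Type": "string"},
        ],
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}

# With boto3 you would register it roughly like this (not run here):
# boto3.client("glue").create_table(DatabaseName="analytics", TableInput=table_input)
```

Once a table like this exists in the catalog, every downstream engine — Athena, EMR, Redshift Spectrum — sees the same schema without re-describing the data.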
🧮 Compute Layer: EMR, Athena, Notebooks, Redshift
Once metadata is in Glue, you unlock a whole suite of AWS services for compute:
- EMR: Launch Spark or Hive clusters in minutes, no manual setup
- Athena: Query your S3 data using SQL — no infrastructure required
- SageMaker/Notebooks: Pull data into Jupyter for analytics or ML
- Redshift Spectrum: Extend Redshift queries to include external S3 data
The beauty: write once to S3, read it everywhere.
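Here’s a small sketch of how that plays out with Athena and hive-style partitioned data. The path, table, and database names are made up; the `hive_partitions` helper just shows how `key=value` path segments map to partition columns, which is what lets Athena scan only the matching S3 prefix.

```python
# A hive-style partitioned S3 layout (illustrative path):
key = "s3://example-datalake/page_views/dt=2024-05-01/part-0000.parquet"

def hive_partitions(path):
    """Extract hive-style key=value partition segments from an S3 path."""
    return dict(
        seg.split("=", 1)
        for seg in path.split("/")
        if "=" in seg
    )

print(hive_partitions(key))  # -> {'dt': '2024-05-01'}

# A matching Athena query. Filtering on the partition column means Athena
# only reads objects under the dt=2024-05-01 prefix, not the whole table.
query = """
SELECT url, COUNT(*) AS views
FROM analytics.page_views
WHERE dt = '2024-05-01'
GROUP BY url
"""
```

The same table is simultaneously readable from a Spark job on EMR, a SageMaker notebook, or Redshift Spectrum — that’s the “write once to S3, read it everywhere” payoff.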
🛠️ Implementation Overview
Everything in this architecture can be defined using Terraform, following an Infrastructure-as-Code approach.
We’ll:
- Define resources for S3, Glue, IAM, EMR, etc.
- Wire them up securely
- Maintain flexibility for optional components
Security matters — but if you’re just prototyping, you can strip things down for faster iteration (shout out to our DevOps friends trying to keep it safe 😉).
Stay tuned for Part 2, where we’ll dive into the Terraform implementation and get this lake up and running!