Three Months In: What It Costs, What Broke, and Whether It Was Worth It
Series: Building a Side Project That Runs Itself
- The Origin Story
- The Static Site Bet
- Kotlin for a CLI Backend
- Multi-Cloud Without the Drama
- The Pipeline That Runs While I Game
- Three Months In: Retrospective (this post)
Stockadora launched in December 2025. It’s now March 2026 — about three and a half months in. Not long enough to call it a success story. Long enough to have a real opinion about the decisions.
This is the retrospective post. I want it to be genuinely honest: what the cost model actually looks like, what actually broke, what I’d do differently, and what the whole thing has meant to me as an engineer. Not a victory lap. Not a post-mortem. Just the view from here.
How the Costs Actually Work
I’m going to talk about costs differently than most side-project posts do, because I think the interesting question isn’t “how much does it cost” — it’s what drives the cost, and how predictably it scales.
Web infrastructure: scales with traffic, almost irrelevant
The hosting stack — CloudFront, S3, Route 53, Athena — is structured such that the costs are either fixed or scale so favorably with traffic that they’re essentially noise.
CloudFront has been running at around $0.05–$0.20 per month. That’s not a typo. A CDN that has served real traffic for months costs less than a pack of gum. This is the entire argument for the static site architecture made concrete: once your site is a pile of HTML files on S3, traffic stops being a cost driver and starts being something you just don’t think about. A content post getting shared and driving a few thousand visits in an afternoon? Same $0.05.
S3 costs have grown steadily as data accumulates — from around $3/month at launch to $7–12/month now. This is expected and benign: the data bucket grows as the Kotlin CLI writes new filing summaries daily. But it won’t grow forever. I have lifecycle rules in place that archive objects older than a year to cheaper storage tiers, so once the project hits its one-year mark, S3 costs should plateau and stay there. After that, the cost scales with SEC filing volume — the same driver as the AI spend — which means it can’t suddenly spike without a corresponding change in how many companies are filing with the SEC. That’s not a number that moves fast.
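For the curious, a lifecycle rule like that is only a few lines of Terraform. This is a generic sketch with placeholder resource names and an assumed archive tier, not my actual config:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "archive-old-filings"
    status = "Enabled"

    # After a year, move objects to a cheaper archive tier.
    transition {
      days          = 365
      storage_class = "GLACIER_IR"
    }
  }
}
```

Glacier Instant Retrieval keeps old filings readable if a rebuild ever needs them, at a fraction of standard storage pricing.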
Athena is under $1/month. Pay-per-query on a once-daily batch job is about as cheap as infrastructure gets.
The design lesson here: web traffic is completely decoupled from infrastructure cost. Viral traffic spikes cost nothing. If anything ever did drive AWS costs meaningfully higher, it would be data volume or query frequency — both of which I control.
AI: scales with SEC filings, not users
This is the interesting one, and the place where the cost model diverges from most web products.
The expensive part of running Stockadora is the AI. Filing summaries, insider trading analysis, news digests — these all burn tokens on every run. And crucially, the token count scales with how many SEC filings were filed that day, not with how many people visited the site.
This makes the cost model unusually predictable. SEC filing volume follows a well-known seasonal rhythm. The heaviest period is January through March — companies racing to file their annual 10-K reports before deadlines. Q1 earnings season in April–May is another peak. The summer lull is real. If you know the SEC filing calendar, you roughly know your AI bill six months in advance.
There are also episodic spikes: major market events that trigger 8-K filings, regulatory deadlines, acquisition announcements that cascade into filings from multiple parties. These add variance, but bounded variance. Nothing about AI costs goes exponential if the product gets popular.
The more interesting cost story is what happened across the life of the project. I started on AWS Bedrock, running Claude Sonnet — a powerful model, but priced accordingly. Monthly AI costs were significant, and the credits absorbed them. When I moved to Vertex AI and Gemini Flash earlier this year, the per-token cost dropped considerably. Gemini Flash is built for high-volume workloads like this: lots of medium-length documents, structured output, predictable schemas. The output quality held up; the cost improved.
The credits chapter
None of this would have been as easy to experiment with if I’d been paying out of pocket from day one.
AWS Activate provided a meaningful credit allocation early on, which funded the MVP and the Bedrock experimentation phase without billing anxiety. Once the site was live and doing something real — actual data, actual pages, actual users — I used that as the basis of my application to Google Cloud for Startups. A working product is a much stronger application than a pitch deck. Approved.
The GCP credits come with an expiration window, which creates an interesting incentive: use them or lose them. I’ve been deliberately increasing AI usage — expanding context windows, running richer analysis passes, experimenting with features I might have deferred if I were watching per-token costs closely. The constraint of “spend these credits productively before they expire” has pushed me to try things I’d have otherwise been too conservative to run. Some of those experiments have become permanent improvements.
The result is a cost structure where I’ve been paying very little in real money, while running a system that would cost a serious amount to operate at scale without the credits. That window won’t last forever, but it’s been the right environment for early experimentation.
What Actually Broke
The silent failure
The worst kind of production incident is one that doesn’t throw an error. A few months in, the SEC updated the URL structure of their full-text search API. The Kotlin CLI’s crawler hit the old path and got 404s. But I hadn’t written the pipeline to treat “no filings found” as a failure — it was just an empty result.
The pipeline ran successfully for two days. It processed zero filings. It wrote zero summaries. The site rebuilt itself faithfully from stale data every night and looked completely normal. I caught it by noticing that the “last updated” timestamps on the site seemed old.
Here’s the uncomfortable part: the code was largely written by AI and reviewed by me. And one thing I learned the hard way is that AI-generated code tends to be more reliable than it should be — in the wrong direction. It handles the happy path exceptionally well. It rarely throws unexpected exceptions. What it can be quietly bad at is distinguishing between “successfully processed nothing” and “failed to find anything.” The crawler returned an empty list with no error, and the pipeline happily moved on.
The two-day gap before I caught it wasn’t negligence — it was camouflage. The pipeline runs Monday through Friday. Zero filings on a given day is completely normal on weekends. So two consecutive weekdays of empty results looked, from the outside, like a weekend. No alarm fired, nothing paged, nothing looked obviously wrong. The failure mode was indistinguishable from expected behavior.
The actual code fix took maybe 20 minutes. What took longer was stepping back and updating my prompting practices — adding explicit guidance about defensive assumptions and failure modes to the context I give AI when writing pipeline code. I spent more time improving the documentation and best practices for future AI-assisted work than I spent fixing the bug itself. That felt like the right investment.
The lesson: explicit assertions about the shape of results matter more than the absence of errors. I added a pipeline step that fails loudly if the filing count is suspiciously low for a weekday. That check has caught two other edge cases since.
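The guard is simple enough to sketch. A minimal Kotlin version, with an illustrative threshold and exception name rather than the actual Stockadora code:

```kotlin
import java.time.DayOfWeek
import java.time.LocalDate

// Illustrative names: the real check lives inside the pipeline and
// uses its own tuned thresholds.
class SuspiciousFilingCountException(message: String) : RuntimeException(message)

/**
 * Zero filings is normal on a weekend; on a weekday it almost certainly
 * means the crawler silently broke. Fail loudly instead of proceeding.
 */
fun assertPlausibleFilingCount(date: LocalDate, filingCount: Int, weekdayMinimum: Int = 10) {
    val weekend = date.dayOfWeek == DayOfWeek.SATURDAY || date.dayOfWeek == DayOfWeek.SUNDAY
    if (!weekend && filingCount < weekdayMinimum) {
        throw SuspiciousFilingCountException(
            "Only $filingCount filings found on $date (${date.dayOfWeek}); " +
                "expected at least $weekdayMinimum on a weekday. Possible silent crawler failure."
        )
    }
}
```

The point is the asymmetry: an empty result on Saturday passes silently, the same result on Monday kills the run before stale data gets rebuilt into the site.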
The model migration
Google deprecated an earlier Gemini model version I was running and announced a cutoff date with reasonable lead time. Not an emergency, but not a free afternoon either. Migration required updating the model identifier, re-testing prompt templates (the newer model had different defaults for structured output), and doing a backfill run for the affected data.
About an afternoon of work. The new model performs better for this use case. I’m annoyed in retrospect that I didn’t stay closer to the latest model version — the migration would have been incremental rather than a one-time catch-up.
Beyond those two: the hosting infrastructure hasn’t had a meaningful incident. CloudFront, S3, Route 53 — boring and reliable in exactly the way I hoped.
What I’d Do Differently
Add the silent failure checks on day one. The EDGAR URL incident was preventable. Any pipeline that processes external data should assert something meaningful about the output before declaring success.
Set billing alerts immediately. It took me until month two to configure cost anomaly alerts on both clouds. Five minutes, eliminates the mental overhead of manually checking dashboards. Non-negotiable from day one on any future project.
Containerize the Kotlin CLI sooner. It runs directly on the self-hosted runner machine today, which means the runner needs the right JDK, the Gradle cache needs to be in a predictable place, and there’s some implicit machine configuration baked in. Docker would make this portable, reproducible, and easier to debug. It’s on the list and I keep not doing it.
What Worked Better Than Expected
The static architecture’s ceiling is extremely high — and the design is more demanding than it looks. This was my first time running a truly backend-free content site at this scale, and I had a quiet assumption going in that static meant simple. It’s not.
What static actually means is pre-computed. Not no-compute. The compute still happens — all of it — just at build time rather than request time. The Kotlin CLI processes filings, calls Gemini, aggregates data, and writes JSON. Astro reads that JSON and renders thousands of HTML pages. Every page a user visits was fully assembled hours before they asked for it. The server doesn’t think; it just serves.
That shift from “compute on request” to “compute on schedule” has a consequence people underestimate: it demands a more careful data model, not a more relaxed one. With a traditional backend, you can afford some fuzziness in your data layer — a missing field can be handled at query time, a new dimension can be added to a response without a deploy. With pre-computation, your data model needs to anticipate everything the frontend will need before the build runs. There are no dynamic fallbacks. If your backend model doesn’t produce what the frontend template expects, you find out during the build, not during a user request — which is actually great for reliability, but it requires discipline upfront.
I ended up with two distinct model layers: a storage model in the backend, shaped around what’s efficient to produce and store — filing metadata, raw Gemini output, aggregated company data; and a frontend model shaped around what Astro templates actually consume — flattened, denormalized, ready to render without any further transformation. The pipeline’s job is to translate between the two. A statically generated financial data site and a dynamically served one require essentially the same modeling work. The only meaningful difference is where the computation runs.
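In Kotlin terms, the split looks something like this. The field names are invented purely to illustrate the shape; the real models carry far more:

```kotlin
// Storage model: shaped around what is efficient to produce and store.
data class StoredFiling(
    val accessionNumber: String,
    val cik: String,
    val formType: String,
    val filedAt: String,        // ISO-8601 date
    val rawSummaryJson: String  // raw model output, kept for reprocessing
)

// Frontend model: flattened and denormalized, ready for a template
// to render with no further transformation.
data class FilingPageModel(
    val title: String,
    val filedAt: String,
    val summaryHtml: String
)

// The translation step runs at build time, so any mismatch between the
// two layers fails the build, not a user request.
fun toPageModel(stored: StoredFiling, companyName: String, summaryHtml: String): FilingPageModel =
    FilingPageModel(
        title = "$companyName ${stored.formType}",
        filedAt = stored.filedAt,
        summaryHtml = summaryHtml
    )
```

Keeping the raw model output in the storage layer is what makes later reprocessing cheap: a template change means re-running the translation, not re-running Gemini.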
What disappears entirely is the runtime infrastructure: no API server, no database connections, no query latency, no “the database fell over” incident category. S3 and CloudFront handle any traffic level I’m likely to reach without a thought from me. The architecture that cost nothing to run at zero traffic is the same architecture at scale.
Gemini output quality held up — because I treated it as an engineering problem, not a magic box.
The framing I used from the start: prompts are code. They live in the repository, they go through review, and they get iterated on systematically. When output quality drifts or a new edge case surfaces, the prompt is the first thing I look at, not the model.
What makes this tractable at low cost is the evaluation loop. I don’t manually review summaries to decide if a prompt change is an improvement. I use other models — mostly Claude, sometimes Codex — to evaluate Gemini’s output against a rubric: does this summary capture the material disclosures? Is the tone appropriate? Does it hallucinate specifics not present in the source? The AI agents run the change-feedback-improve cycle largely on their own, with me setting the workflow and reviewing the conclusions. Gemini Flash’s low cost and fast response time make it practical to run hundreds of evaluation samples per iteration without watching the clock. You can move fast when the feedback loop is cheap.
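The decision step at the end of that loop can be sketched as plain code. The rubric fields mirror the questions above; the names and acceptance threshold are illustrative, and the evaluator that actually produces each score (Claude grading a Gemini summary) sits outside this sketch:

```kotlin
// One evaluated sample: did the summary pass each rubric question?
data class RubricScore(
    val capturesMaterialDisclosures: Boolean,
    val toneAppropriate: Boolean,
    val noHallucinatedSpecifics: Boolean
) {
    fun passed() = capturesMaterialDisclosures && toneAppropriate && noHallucinatedSpecifics
}

/**
 * Accept a prompt change only if the pass rate over an evaluation batch
 * beats the current baseline by a margin (illustrative default of 2%).
 */
fun promptChangeIsImprovement(
    baselinePassRate: Double,
    samples: List<RubricScore>,
    minDelta: Double = 0.02
): Boolean {
    require(samples.isNotEmpty()) { "need at least one evaluated sample" }
    val newRate = samples.count { it.passed() }.toDouble() / samples.size
    return newRate >= baselinePassRate + minDelta
}
```

Once accept/reject is mechanical like this, the agents can run the loop unattended and I only look at the verdicts.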
What this forced on me, usefully, was stepping back from code details into workflow design. And at the workflow level, the interesting question is: what does a good SEC filing analysis actually look like?
I found out the hard way that the naive answer — feed the full filing to the model and ask for a summary — produces mediocre results. SEC filings are long, structured documents full of boilerplate legal language, exhibit indexes, signatures, and repetition. A 10-K can run 200+ pages. Blindly feeding that and hoping for a coherent summary is not a workflow; it’s an optimistic prompt.
So I read several filings myself. Painfully. And I paid attention to how I was actually reading them — what I skimmed, what I stopped at, what I had to read twice. That became the workflow:
- Strip the noise — remove boilerplate, legal exhibits, signature blocks, and formatting artifacts. Convert the remaining content to clean Markdown.
- Select the valuable documents — a filing is often a package of dozens of attachments. Not all of them matter equally. Identify and rank which documents actually contain material information for this filing type.
- Chunk within documents — even within a valuable document, not every section has equal signal. Pick the chunks most likely to contain what an investor actually cares about: risk factors, management discussion, material events.
- Then summarize — by this point the model is working with curated, high-signal input rather than a wall of legal text.
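The first three steps can be sketched as a pipeline. Every heuristic here is a stand-in — keyword matching instead of the real ranking logic — and step four is just the hand-off of the curated chunks to the model:

```kotlin
data class FilingDocument(val name: String, val content: String)

// Step 1: strip the noise. Here, dropping signature-block lines stands in
// for the full boilerplate removal and Markdown conversion.
fun stripNoise(doc: FilingDocument): FilingDocument =
    doc.copy(content = doc.content
        .lineSequence()
        .filterNot { it.contains("SIGNATURE", ignoreCase = true) }
        .joinToString("\n"))

// Step 2: keep only attachments whose names mark them as valuable
// for this filing type.
fun selectValuable(docs: List<FilingDocument>, valuableNames: Set<String>): List<FilingDocument> =
    docs.filter { doc -> valuableNames.any { doc.name.contains(it, ignoreCase = true) } }

// Step 3: within a document, keep the paragraphs most likely to carry
// signal (keyword match stands in for the real selection).
fun selectChunks(doc: FilingDocument, keywords: List<String>): List<String> =
    doc.content.split("\n\n")
        .map { it.trim() }
        .filter { para -> keywords.any { para.contains(it, ignoreCase = true) } }
```

By the time anything reaches the summarization call, the input is a short list of high-signal paragraphs, not a 200-page wall of legal text.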
The output quality is a product of the workflow design as much as the model. The validation layer catches maybe one hallucination per week across hundreds of filings, which feels like a reasonable error rate — but that number is low because the model is working with clean, focused input, not because the model is magic.
The pipeline genuinely runs itself. My honest maintenance average: one to two hours a month. Re-running the occasional failed job, a dependency update here and there, tuning prompts when I notice something off. Spikes happen — the EDGAR incident was a few hours, the model migration was an afternoon — but those are quarterly events.
Recently that number dropped further. I’ve been running an OpenClaw instance on one of my long-running homelab machines, and it’s taken over most of the daily oversight. Every day it checks whether the GitHub Actions workflows completed successfully, reads through the logs, and decides if anything looks wrong. If it spots a problem, it pings me on Slack with a summary and, more often than not, a proposed fix PR already drafted. I review, merge or tweak, and move on.
I’ll be honest: watching an AI agent monitor another AI-powered pipeline and then propose code to fix it gives me a feeling I didn’t expect — somewhere between amusement and genuine excitement. It’s not just automation anymore. It feels like the project has a caretaker. And that caretaker costs me nothing but a machine that was already running.
The Part That Surprised Me Most
I didn’t start Stockadora with a clear business picture. I started it because I wanted to build something real and because there were ideas I’d never had the chance to try in a company context. The constraint of “I want to keep my weekends” turned out to be the best design constraint I’ve ever worked under. It forced everything to be simple enough to leave alone.
But what genuinely surprised me was how much of this project was made possible by AI tooling that didn’t meaningfully exist in this form a year ago.
Every piece of this was touched by AI: Kotlin code written and iterated with Claude Code, content generation powered by Gemini, dependency upgrades handled by GitHub Dependabot surfacing and drafting the PRs, infrastructure changes suggested and reviewed with Claude Code reading the Terraform. I didn’t expect the stack to feel this collaborative with tooling. I expected to do most of the thinking and have AI do boilerplate. What actually happened is that the AI handled problems I would have avoided starting because they were too tedious, which expanded the scope of what I was willing to attempt.
The result is that a solo engineer can run a system like this in production with serious AI workloads and stay under two hours of maintenance a month. That would have seemed implausible to me in early 2025. It’s just how I build now.
Is It Worth It?
The project is young. It’s not generating meaningful revenue. Traffic is growing slowly. There’s no clear path to “this pays for itself” in the near future, and I haven’t lost sleep over that because it was never the goal.
What I got instead: a system running in production that processes real financial data, uses real AI infrastructure, fails in real ways and recovers from them. Experience trying ideas I never had permission to try inside a company. A forcing function that made me learn Terraform deeply, think hard about cost architecture, and figure out how to make a pipeline genuinely self-managing.
The project will keep running. The pipeline is self-driving enough that I can focus on improvements when I feel like it and ignore it when I don’t. AI is powering both the product and the development — when I come back to it to build something new, the tooling picks up where it left off.
That feedback loop — build, ship, let it run, come back with AI assistance and improve it — feels like a different model of side project ownership than I’ve experienced before. I think I’ll be doing this for a long time.