One to Two

Chapter 9: Making It Observable


The very first issues I filed at Apex — issue #1 and issue #2 — were “Monitor Fargate Backups” and “Monitor Fargate Logs.” Day one. Before customers. Before features. Before anything. I knew exactly what to do. I’d been a CTO for years. I’d given talks about operational maturity. I could tell you, in my sleep, that monitoring comes first.

Then I spent the next month building features instead.

I want to be honest about why, because I think the reason is universal and nobody talks about it. Building features feels like progress. You can see a feature. You can demo a feature. You can show a feature to a potential customer and watch their eyes light up. Monitoring is invisible work. Nobody demos their alerting pipeline. Nobody tweets “just shipped: we now know when our database is slow.” The dopamine loop of building visible things is so strong that it drowns out the quieter, more important work of making sure those visible things actually function.

And every day I put it off, it felt less urgent. Nothing had visibly broken. The system seemed fine.

Seemed. That's the word that should terrify you.

I was confusing the absence of visible failure with the presence of health. They are not the same thing. When you’re a solo builder shipping thirty commits a day, observability feels like premature optimization. “I’ll add monitoring once there’s something worth monitoring.” But the moment there’s something worth monitoring is the exact moment you can no longer afford to not have it.

By the time customers showed up, monitoring was barely functional. The usage collector was broken — four million cache tokens just… missing from the data. The pulse reporter tracking instance health died on container restart and never came back. Fleet-wide RPC collection was failing across all twenty-two running instances. Warnings about broken API keys were being collected but never surfaced to anyone who could act on them.

I could build features at thirty commits per day. I could not tell you which of my customers' instances were actually working.

Was this the right call? No. I knew it was wrong on day one, which is why I filed those issues first. The knowledge didn't help because the incentives pointed the other way. Features are visible. Monitoring is invisible. Nobody sees your alerting pipeline until the day everything breaks and everyone sees the absence of it.

And then John Malecki showed up.

John generated nine issues in four days. Nine. Each one revealing a platform weakness I had no idea existed. His agent lost its memory after conversations exceeded a certain length. Health checks came back green when core features were broken. API connections dropped every ten minutes and silently reconnected in a degraded state. The Memory Tab showed all zeros — not because there was no data, but because the container connection had failed and the frontend just shrugged and rendered empty.

John was my monitoring system. A paying customer, doing the observability work my infrastructure should have been doing. He was filing the alerts I should have been receiving. He was testing the failure modes I should have been catching. He was, in the most literal sense, the human being on the other end of the screen discovering that my product was broken in ways I couldn't see.

That is not a viable architecture.

If your first signal that something is broken comes from a customer, you have failed at observability. Not partially. Completely. The customer found the bug, debugged it, filed the ticket — all the work a twenty-dollar-a-month monitoring service should have done. Unlike the monitoring service, the customer can churn.

Observability is the bridge between building and operating — the difference between “I think it’s working” and “I can see it’s working.” Without it, your understanding of the product is a guess that gets staler every hour.

Minimum Viable Observability

Minimum viable observability is simpler than you think. Three things.

Know when it’s down. A health check endpoint and an uptime monitor that pings it. An hour of work. If you don’t have this, you're finding out your system is down when a customer emails you. Or churns.
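
For concreteness, here is roughly what that hour of work looks like. This is a minimal sketch in Python using FastAPI, not Apex's actual code; the endpoint name and the database check are stand-ins for whatever your service actually depends on.

```python
# Minimal health endpoint an uptime monitor can ping every minute.
# Sketch only: database_reachable() is a placeholder for whatever your
# system's critical dependency checks actually are.
from fastapi import FastAPI, Response

app = FastAPI()

def database_reachable() -> bool:
    # Replace with a real, cheap check against your datastore,
    # e.g. "SELECT 1" with a short timeout.
    return True

@app.get("/healthz")
def healthz(response: Response):
    ok = database_reachable()
    if not ok:
        response.status_code = 503
    return {"status": "ok" if ok else "degraded"}
```

Point any free uptime monitor at /healthz and have it alert you when the endpoint stops returning 200. That is the whole hour.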

Know when it's broken but running. This bit me harder than outright downtime. The Memory Tab showing all zeros when the container connection failed — the page loaded, a core feature was silently broken, wrong data presented with total confidence. Log the things that matter: errors with context, external API calls with response times, auth failures, anything touching money or user data. Alert on rates, not individual errors — one error is noise, a spike is signal.
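
In practice, "errors with context" means structured log lines rather than bare stack traces. A sketch using Python's standard logging with JSON written to stdout; the field names and the example event are illustrative, not Apex's schema.

```python
# Structured error logging to stdout: one JSON object per event, with enough
# context to answer "which customer, which instance, which call" later.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": record.created,
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Merge any structured context passed via extra={"ctx": {...}}.
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Usage: log the failure with the context you'll need at 3am.
# The identifiers below are made up for illustration.
log.error("upstream API call failed", extra={"ctx": {
    "customer_id": "cust_123", "instance": "i-42",
    "status": 529, "elapsed_ms": 1840,
}})
```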

Know how it’s being used. At Apex, per-customer usage tracking didn’t exist until issue #153 — ten days after customers started using the product. I couldn’t tell you who was active, who was struggling, who was about to churn. Usage data isn’t just analytics for a dashboard. It's how you know whether the thing you built is the thing people need.
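
The mechanism can be a single table and one insert per request. A sketch of the kind of thing issue #153 eventually added, using sqlite3 purely for illustration; the table and column names are made up.

```python
# Per-customer usage tracking as one events table plus one insert wherever
# a request is handled. Sketch only; swap in your real database.
import sqlite3
import time

conn = sqlite3.connect("usage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS usage_events (
        customer_id TEXT NOT NULL,
        feature     TEXT NOT NULL,
        ts          REAL NOT NULL
    )
""")

def record_usage(customer_id: str, feature: str) -> None:
    # Call this from request handlers; it's one insert, not a platform.
    conn.execute(
        "INSERT INTO usage_events (customer_id, feature, ts) VALUES (?, ?, ?)",
        (customer_id, feature, time.time()),
    )
    conn.commit()

record_usage("cust_123", "memory_tab")
```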

The Silent Failure Pattern

A pattern I saw over and over at Apex, and I want you to watch for it: the silent failure. The system doesn’t crash. It doesn’t throw an error. It doesn't page you. It quietly does the wrong thing. Settings sync reports success while files remain unchanged — issue #164. The pulse reporter dies and nobody notices — issue #9. API key updates save to the database but never push to instances — issue #182. The system is lying. Politely. Confidently. Continuously.

Silent failures aren't unique to AI-built software, but AI-generated code has a specific tendency that makes them more common: it optimizes for the happy path. Errors are caught and dismissed rather than caught and surfaced. Try/catch blocks that swallow exceptions. Error callbacks that return success. “Work” and “work reliably” are different requests, and the AI defaults to the first one.
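
The shape is recognizable once you know to look for it. A contrived Python sketch of the pattern, and of the small change that surfaces the failure instead of hiding it; nothing here is from Apex's codebase.

```python
# The silent-failure shape: the error is caught, discarded, and the caller
# is told everything went fine.
import logging

def sync_settings_quietly(push_to_instance) -> bool:
    try:
        push_to_instance()
    except Exception:
        pass          # swallowed: no log, no alert, no retry
    return True       # reports success whether or not anything happened

# The honest version: surface the failure so it shows up in logs and alerts.
def sync_settings(push_to_instance) -> bool:
    try:
        push_to_instance()
        return True
    except Exception:
        logging.exception("settings sync failed")
        return False
```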

Silent failures are the hardest bugs to catch because every signal you normally rely on — error logs, status codes, crash reports — says everything is fine. The system is confidently wrong.

The fix is defensive: for every critical operation, verify the outcome. Don’t just check the return code. Confirm the thing actually happened. Read back what you wrote. Check that the file changed. Verify the instance received the update. This feels paranoid. It’s not. It's the minimum level of distrust a production system requires.
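
In code, "confirm the thing actually happened" means a read-back step after the write. A sketch under the assumption that the critical operation is a file write; write_config is a hypothetical stand-in for a settings push, a key rotation, or an instance update.

```python
# Write-then-verify: don't trust the return path, re-read the result.
from pathlib import Path

def write_config(path: Path, contents: str) -> None:
    # Stand-in for any critical operation whose outcome you must not assume.
    path.write_text(contents)

def write_config_verified(path: Path, contents: str) -> None:
    write_config(path, contents)
    # Read back what we wrote and fail loudly if it doesn't match.
    if path.read_text() != contents:
        raise RuntimeError(f"config write to {path} did not stick")

write_config_verified(Path("apex-settings.json"), '{"sync": true}')
```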

Cost

A word about cost, because nobody mentions this and it matters a lot when you’re a solo builder watching every dollar. Observability tools get expensive fast. Logging alone can run hundreds of dollars a month if you’re not careful about volume. Metrics platforms charge per host, per custom metric, per million data points. Tracing — if you actually instrument your code properly — generates enormous amounts of data.

At one-to-two, you don’t need Datadog’s enterprise tier. You need tools with sane free tiers and predictable pricing. Uptime monitoring is free or nearly free from a dozen providers. Error tracking through Sentry’s free tier handles a solo product comfortably. Logging means structured logs to stdout, collected by whatever your cloud provider offers — CloudWatch, Cloud Logging, whatever. Don’t pay for a third-party log aggregator until you have a reason. Usage analytics can be a simple database table and a query you run each morning. The goal is seeing, not sophistication. I'd rather have a five-dollar-a-month setup I actually check than a five-hundred-dollar-a-month platform I configured once and forgot about.
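
That morning query can be as plain as this: a sketch against the hypothetical usage_events table from earlier, showing who was active in the last week and who has gone quiet.

```python
# The morning query: per-customer activity over the last 7 days, so you can
# see who's active, who's struggling, and who has stopped showing up.
import sqlite3
import time

conn = sqlite3.connect("usage.db")
week_ago = time.time() - 7 * 24 * 3600

rows = conn.execute(
    """
    SELECT customer_id,
           COUNT(*) AS events,
           MAX(ts)  AS last_seen
    FROM usage_events
    WHERE ts > ?
    GROUP BY customer_id
    ORDER BY events DESC
    """,
    (week_ago,),
).fetchall()

for customer_id, events, last_seen in rows:
    print(customer_id, events, time.ctime(last_seen))
```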

How It Actually Unfolded

Here’s how observability actually unfolded at Apex, and what I’d do again.

The first thing I built — once I finally stopped ignoring my own issue backlog — was a health check endpoint and an uptime ping. An hour of work. Should have been hour one of the entire project. Instead it was week four. The moment it went live, it caught a container restart failure I didn't know was happening. One hour of work, one real bug found immediately. That ratio never gets worse. Monitoring always pays off faster than you expect.

The second week, I wired up error logging with rate-based alerts. Not every error — just the ones that mattered. API failures, database timeouts, auth errors. The key insight was alerting on rates, not individual events. A single 500 error at 3am is noise. Ten 500 errors in five minutes is a pattern. The threshold took some tuning — I started too sensitive and got paged for nothing, then backed off too far and missed a real incident. You find the right level by living with it. There's no formula.
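
The rate-versus-event distinction is simple to implement once you name it. A sketch of a sliding-window counter; the numbers (ten errors, five minutes) come from the example above, not from any real configuration.

```python
# Alert on error *rates*: track timestamps in a sliding window and only
# fire when the count inside the window crosses a threshold.
from __future__ import annotations

import time
from collections import deque

class RateAlert:
    def __init__(self, threshold: int = 10, window_seconds: int = 300):
        self.threshold = threshold
        self.window = window_seconds
        self.events: deque[float] = deque()

    def record_error(self, now: float | None = None) -> bool:
        """Record one error; return True if the alert should fire."""
        now = time.time() if now is None else now
        self.events.append(now)
        # Drop anything older than the window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold

alert = RateAlert(threshold=10, window_seconds=300)
# One 500 at 3am stays quiet; ten in five minutes trips the alert.
```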

Week three was usage tracking, and this was the one that changed how I thought about the product. Once I could see who was using what, how often, and when they stopped, the entire product roadmap shifted. Features I thought were critical were barely touched. Features I’d built as afterthoughts were the ones customers actually used daily. If I’d had this data from day one, I would have built a different product. Not a better product necessarily — but a more honest one. One shaped by what people actually did rather than what I imagined they would do.

By week four, I had a simple dashboard — nothing fancy, just a page that pulled together health status, error rates, and usage numbers. For the first time, I could look at one screen and know whether Apex was working. Not guess. Know. The relief was physical. I hadn't realized how much background anxiety I was carrying from not knowing the state of my own system.
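
The dashboard itself can be a single summary that pulls the three signals together. A sketch in which each helper is a stand-in for the corresponding earlier example; nothing here is Apex's real code.

```python
# The one-screen view: health, error rate, and usage in a single summary.
import json

def current_health() -> str:
    return "ok"        # stand-in for the result of the /healthz-style check

def errors_last_hour() -> int:
    return 0           # stand-in for a count from the rate-alert window

def active_customers_today() -> int:
    return 0           # stand-in for a query on the usage_events table

def status_summary() -> str:
    return json.dumps({
        "health": current_health(),
        "errors_last_hour": errors_last_hour(),
        "active_customers_today": active_customers_today(),
    })

print(status_summary())
```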

Four weeks, less than three days of actual work. I went from blind to seeing. I filed those monitoring issues on day one and didn’t actually build them until it was almost too late. Don’t repeat my mistake. The features can wait a day. The monitoring can't.