Death by a Thousand Logs: The Hidden Operational Costs Killing Your Margins
Jan 12, 2026
10-minute read

By Andy Van Becelaere, Cloud Architect
Part 3 of 3: The $10K Mistake Series
There’s a moment in every startup’s life cycle that I’ve come to recognize instantly. The engineering team has optimized their CloudFront configuration, migrated to HTTP APIs, and put proper caching in front of S3. The AWS bill drops by 30%, maybe 40%. Everyone high-fives. The CFO sends a congratulatory email. And then, three months later, the bill starts creeping back up.
Not dramatically. Just a few hundred dollars here, a few hundred there. Nobody notices at first because the absolute numbers are still lower than before the optimization. But the trend line is unmistakable: costs are growing faster than traffic. Something is quietly eating away at those hard-won savings.
I got a call from a CTO about exactly this situation. They’d implemented everything from Parts 1 and 2 of this series. Their infrastructure costs had dropped from $12,000 to $7,500 per month. Victory, right? Except six months later, they were back up to $9,800 despite traffic only growing by 15%. When I dug into their Cost Explorer, the culprit was hiding in plain sight: CloudWatch Logs had gone from $180/month to $1,850/month. Their logging costs had grown by 10x while nobody was paying attention.
This is the part of AWS cost optimization that nobody talks about because it’s not sexy. It’s not about clever architecture or cutting-edge services. It’s about the operational overhead that accumulates slowly, like plaque in an artery, until one day you realize you’re spending $2,000 per month to store logs that nobody ever reads.
The Logging Cost Nobody Expects
The company I mentioned had made a decision that seemed perfectly reasonable at the time. They’d enabled API Gateway execution logging for debugging during development. Full request and response logging, every header, every query parameter, every response body. It was incredibly useful for troubleshooting issues. They’d planned to dial it back after launch, but launch was chaotic, and nobody ever got around to it.
Fast forward eighteen months, and they were processing 150 million API requests per month. Each request was generating about 2KB of log data: request details, Lambda execution logs, response data, timing information. That’s 300GB of logs per month flowing into CloudWatch. CloudWatch charges $0.50 per GB for ingestion and $0.03 per GB per month for storage. The math was brutal: $150/month for ingestion, plus accumulating storage costs that were now at $54/month and growing every month because they’d never set up log retention policies.
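If you want to sanity-check numbers like these against your own workload, the arithmetic is simple enough to script. Here's a rough sketch in Python using the figures above and the quoted list prices; note that CloudWatch bills log storage on compressed size, which is why the actual storage line on a bill (the $54/month above) lands well under the naive uncompressed estimate.

# Back-of-the-envelope CloudWatch Logs cost model for the workload described above.
REQUESTS_PER_MONTH = 150_000_000
LOG_BYTES_PER_REQUEST = 2 * 1024      # ~2KB of execution-log output per request
INGESTION_PER_GB = 0.50               # CloudWatch Logs ingestion, $/GB
STORAGE_PER_GB_MONTH = 0.03           # CloudWatch Logs storage, $/GB-month

gb_per_month = REQUESTS_PER_MONTH * LOG_BYTES_PER_REQUEST / 1024**3   # ~286 GB

print(f"Ingestion: ${gb_per_month * INGESTION_PER_GB:,.0f}/month")

# With no retention policy, the storage charge grows every month and never stops.
for months_retained in (1, 6, 12, 18):
    storage = gb_per_month * months_retained * STORAGE_PER_GB_MONTH
    print(f"Storage after {months_retained:>2} months: ${storage:,.0f}/month")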
But here’s where it gets worse. They’d also enabled CloudFront access logs because someone had read it was a best practice. CloudFront access logs are free to generate, but they’re stored in S3. At 150 million requests per month, they were generating about 30GB of log files monthly. That’s only $0.69 per month in S3 storage costs, which sounds fine until you realize they’d been running for two years and had accumulated 720GB of logs they’d never looked at. They were paying $16.50/month to store logs from 2024 that had zero business value.
I sat down with their team and asked a simple question: “When was the last time you actually looked at these logs?” The answer was telling. They used CloudWatch Insights maybe once a quarter to debug a specific issue. They’d never opened the CloudFront access logs. Not once. They were spending nearly $2,000 per month on logging infrastructure that provided almost no value.
We implemented a tiered logging strategy based on actual needs. For API Gateway, we switched from execution logs to access logs, which capture the essential information: who called what endpoint, when, and what the response code was, without the verbose request/response bodies. This cut their log volume by 85%. For the remaining execution logs on critical endpoints, we implemented sampling at 10%. If they needed to debug an issue, they’d have enough data points to identify patterns without logging every single request.
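Concretely, the access-log switch is a single stage setting pointing at a log group, with a format string that captures the who, what, and when, and nothing else. Here's a minimal boto3 sketch for an HTTP API (REST APIs have an equivalent stage-level setting); the API ID and log group ARN are placeholders. There's no built-in sampling knob for verbose execution logging, so the 10% sampling lived in the Lambda handlers as a simple random-number guard around the verbose log statement.

import boto3

apigw = boto3.client("apigatewayv2")

# Hypothetical API ID and log group ARN; substitute your own.
apigw.update_stage(
    ApiId="a1b2c3d4e5",
    StageName="$default",
    AccessLogSettings={
        "DestinationArn": "arn:aws:logs:us-east-1:123456789012:log-group:/apigw/access-logs",
        # One compact JSON line per request: caller, route, status, latency. No bodies.
        "Format": (
            '{"requestId":"$context.requestId","ip":"$context.identity.sourceIp",'
            '"routeKey":"$context.routeKey","status":"$context.status",'
            '"latencyMs":"$context.responseLatency"}'
        ),
    },
)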
For CloudWatch, we set up log retention policies. Application logs got 30 days of retention: enough to debug recent issues but not so long that they were paying to store ancient history. Access logs got 90 days because they occasionally needed them for security audits. Everything older than that got automatically deleted. We also exported some historical logs to S3 with the Glacier Deep Archive storage class, which costs $0.00099 per GB per month instead of CloudWatch’s $0.03 per GB. For logs they might need for compliance but would probably never access, this was a 97% cost reduction.
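Retention is a one-call change per log group. Here's a sketch of the kind of script we ran; the log group names are hypothetical, and the loop at the end applies a 30-day default to anything still set to never expire.

import boto3

logs = boto3.client("logs")

# Explicit retention for the groups we care about (names are hypothetical).
logs.put_retention_policy(logGroupName="/aws/lambda/orders-api", retentionInDays=30)
logs.put_retention_policy(logGroupName="/apigw/access-logs", retentionInDays=90)

# Safety net: any group with no retention policy keeps its logs forever.
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            logs.put_retention_policy(
                logGroupName=group["logGroupName"], retentionInDays=30
            )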
For CloudFront access logs, we implemented S3 lifecycle policies to transition logs older than 90 days to Glacier, and delete logs older than one year entirely. We also switched from standard access logs to real-time logs for just their most critical paths, which gave them better debugging capabilities for the 5% of traffic that actually mattered while reducing overall log volume.
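The S3 side is a single lifecycle configuration on the log bucket. A minimal sketch, assuming the hypothetical bucket and prefix names below:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-cloudfront-logs",          # hypothetical log bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-access-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "cf-access-logs/"},
                # After 90 days the logs move to Glacier; after a year they're deleted.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)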
The impact was immediate and dramatic. Their CloudWatch costs dropped from $1,850/month to $285/month. Their S3 log storage costs went from $16.50/month to $2.40/month. Total savings: $1,579/month, or nearly $19,000 per year. And as a bonus, their ability to debug issues actually improved because they were focusing their logging on the endpoints and scenarios that mattered instead of drowning in noise.
The Monitoring Blind Spots
But fixing logging costs revealed another problem. They’d been so focused on keeping their application running that they’d never set up proper cost monitoring. They had CloudWatch alarms for application health (API error rates, Lambda duration, database connections), but nothing that would alert them when their AWS bill was trending in the wrong direction.
I’ve seen this pattern repeatedly. Engineering teams are great at monitoring technical metrics but terrible at monitoring financial ones. They’ll get paged at 3 AM if their API latency spikes by 50ms, but they won’t notice when their logging costs increase by 500% over six months. The result is that cost problems compound silently until someone finally looks at the bill and panics.
We set up a cost monitoring framework that treated AWS spending like any other operational metric. First, we configured AWS Budgets with alerts at 80%, 90%, and 100% of their monthly target. This gave them early warning when costs were trending high. But percentage-based alerts aren’t enough because they don’t tell you what changed or why.
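Here's a sketch of that budget setup with boto3; the account ID, dollar amount, and notification address are placeholders.

import boto3

budgets = boto3.client("budgets")
ACCOUNT_ID = "123456789012"   # hypothetical

# One ACTUAL-spend notification each at 80%, 90%, and 100% of the monthly target.
notifications = [
    {
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": pct,              # percent of the budgeted amount
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
    }
    for pct in (80, 90, 100)
]

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "monthly-aws-spend",
        "BudgetLimit": {"Amount": "8000", "Unit": "USD"},   # hypothetical target
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=notifications,
)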
We also set up CloudWatch alarms for specific cost drivers. If their CloudFront request count increased by more than 20% week-over-week, they got an alert. If their S3 data transfer costs spiked, they got an alert. If their Lambda invocation count doubled overnight, they got an alert. These weren’t budget alerts: they were operational alerts that something had changed in their application behavior.
The most valuable monitoring we implemented was cache hit ratio tracking. We set up a CloudWatch dashboard that showed their CloudFront cache hit ratio by path pattern. When it dropped below 80% for static assets, they got an alert. This caught issues like accidentally deploying non-versioned filenames or misconfigured cache behaviors before they became expensive problems.
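Both kinds of alert are ordinary CloudWatch alarms on CloudFront's built-in metrics. Here's a sketch of two of them with a hypothetical distribution ID and SNS topic. A few caveats: CloudFront publishes its metrics in us-east-1, the CacheHitRate metric requires turning on the distribution's additional metrics (a small extra charge), per-path-pattern breakdowns need real-time logs or custom metrics rather than the built-in ones, and a true week-over-week comparison would use CloudWatch anomaly detection or metric math instead of the static threshold shown here.

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")   # CloudFront metrics live here
DISTRIBUTION_ID = "E1ABCDEFGHIJ"                           # hypothetical
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:cost-alerts"   # hypothetical

cf_dimensions = [
    {"Name": "DistributionId", "Value": DISTRIBUTION_ID},
    {"Name": "Region", "Value": "Global"},
]

# 1. Daily request volume well above the normal baseline (tune the threshold to yours).
cw.put_metric_alarm(
    AlarmName="cloudfront-daily-request-spike",
    Namespace="AWS/CloudFront",
    MetricName="Requests",
    Dimensions=cf_dimensions,
    Statistic="Sum",
    Period=86400,
    EvaluationPeriods=1,
    Threshold=6_000_000,          # e.g. ~20% above a 5M-requests/day norm
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)

# 2. Cache hit ratio below 80%, sustained for three hours rather than a momentary dip.
cw.put_metric_alarm(
    AlarmName="cloudfront-cache-hit-ratio-low",
    Namespace="AWS/CloudFront",
    MetricName="CacheHitRate",
    Dimensions=cf_dimensions,
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)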
We also implemented cost allocation tags across their entire infrastructure. Every resource got tagged with environment, application, team, and cost center. This let them break down their AWS bill by product line and see exactly where money was going. They discovered that their internal admin dashboard, which served maybe 50 users, was costing them $400/month because it was making inefficient API calls and had no caching. They refactored it in an afternoon and saved $350/month.
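Tagging can be done in bulk with the Resource Groups Tagging API; the ARNs and tag values below are hypothetical. One gotcha: user-defined tags only show up in Cost Explorer after you activate them as cost allocation tags in the Billing console.

import boto3

tagger = boto3.client("resourcegroupstaggingapi")

# In practice you'd page through get_resources() and tag in batches; two ARNs shown here.
tagger.tag_resources(
    ResourceARNList=[
        "arn:aws:lambda:us-east-1:123456789012:function:admin-dashboard-api",
        "arn:aws:s3:::example-admin-dashboard-assets",
    ],
    Tags={
        "environment": "production",
        "application": "admin-dashboard",
        "team": "internal-tools",
        "cost-center": "operations",
    },
)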
The monitoring framework cost next to nothing to implement: it was built on inexpensive or free AWS features like CloudWatch alarms, AWS Budgets, and Cost Explorer. But it saved them from future cost surprises and gave them the visibility they needed to make informed architectural decisions.
The Cost-Optimized Reference Architecture
After helping dozens of companies fix these issues, I’ve developed a reference architecture that avoids all the anti-patterns we’ve discussed. This isn’t theoretical: it’s based on real production systems serving millions of users while keeping costs under control.
The frontend is a React SPA built with versioned filenames for all assets. The build process generates files like app.a3f2b1c.js and styles.d4e5f6g.css, which means we can cache them forever without worrying about stale content. These files live in a private S3 bucket with Intelligent-Tiering enabled. Most assets sit in the frequent-access tier because they’re requested constantly, but assets that stop being accessed automatically move to cheaper tiers.
CloudFront sits in front of S3 with Origin Access Control configured, which means users can’t bypass the CDN and hit S3 directly. We have separate cache behaviors for different content types. Versioned assets get one-year TTLs and aggressive caching. The index.html file gets a five-minute TTL because it contains references to the versioned assets and needs to update when we deploy. Compression is enabled for all text-based content, which typically reduces transfer sizes by 70–80%.
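The cache headers get set at upload time as part of the deploy step. A minimal sketch, assuming a hypothetical bucket and the hashed filenames the build produces; the CloudFront behaviors are then configured to respect these origin headers for asset paths and to keep index.html on its short TTL.

import boto3

s3 = boto3.client("s3")
BUCKET = "example-spa-assets"   # hypothetical

# Hashed filenames never change, so browsers and CloudFront can cache them for a year.
s3.upload_file(
    "dist/app.a3f2b1c.js", BUCKET, "assets/app.a3f2b1c.js",
    ExtraArgs={
        "ContentType": "application/javascript",
        "CacheControl": "public, max-age=31536000, immutable",
    },
)

# index.html references the hashed files, so it gets a short TTL and revalidation.
s3.upload_file(
    "dist/index.html", BUCKET, "index.html",
    ExtraArgs={
        "ContentType": "text/html",
        "CacheControl": "public, max-age=300, must-revalidate",
    },
)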
The API layer uses HTTP APIs, not REST APIs, because we don’t need API keys or request validation. We use JWT authorizers for authentication, which HTTP APIs support natively. The APIs point to Lambda functions with appropriate memory and timeout settings: we’ve found that 1024MB is the sweet spot for most workloads because the extra CPU that comes with more memory cuts duration enough to offset the higher per-millisecond price.
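Wiring that up is a couple of API calls. Here's a sketch with hypothetical IDs, using a Cognito user pool as the JWT issuer (any OIDC-compliant provider works).

import boto3

apigw = boto3.client("apigatewayv2")

# Native JWT authorizer on the HTTP API, so there's no Lambda authorizer to pay for.
apigw.create_authorizer(
    ApiId="a1b2c3d4e5",                                  # hypothetical
    Name="jwt-authorizer",
    AuthorizerType="JWT",
    IdentitySource=["$request.header.Authorization"],
    JwtConfiguration={
        "Audience": ["example-client-id"],
        "Issuer": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE",
    },
)

# 1024MB is the usual starting point; tune it against real duration percentiles.
boto3.client("lambda").update_function_configuration(
    FunctionName="orders-api", MemorySize=1024, Timeout=10
)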
For API endpoints that serve semi-static data, things like product catalogs or configuration data that change infrequently, we add caching with short TTLs. HTTP APIs don’t have the built-in stage cache that REST APIs offer, so in practice that means either a CloudFront behavior with a short TTL in front of those routes or a small in-handler cache. A 60-second cache on a product listing endpoint can reduce database load by 95% during traffic spikes while keeping data fresh enough for most use cases.
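Here's a sketch of the in-handler variant; the data-access function is a stand-in for your real query. The Cache-Control header also lets CloudFront, or any shared cache in front of the API, reuse the response for the same 60 seconds.

import json
import time

TTL_SECONDS = 60
_cache = {"data": None, "expires": 0.0}

def load_products_from_database():
    # Hypothetical data access; replace with your real query.
    return [{"sku": "example-123", "price": 42}]

def handler(event, context):
    # Reuse the warm container's copy for up to TTL_SECONDS between database hits.
    now = time.time()
    if _cache["data"] is None or now >= _cache["expires"]:
        _cache["data"] = load_products_from_database()
        _cache["expires"] = now + TTL_SECONDS

    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # A shared cache in front of the API honors this header too.
            "Cache-Control": "public, max-age=60",
        },
        "body": json.dumps(_cache["data"]),
    }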
We don’t route API traffic through CloudFront unless there’s a specific reason to do so, like needing AWS WAF protection or geographic restrictions. For most APIs, pointing directly to API Gateway’s regional endpoint is faster and cheaper. When we do use CloudFront for APIs, we configure it properly with cache behaviors that respect Cache-Control headers and forward all necessary headers and cookies.
For user-generated content like profile photos or document uploads, we use S3 with Intelligent-Tiering and lifecycle policies. Files that haven’t been accessed in 90 days automatically move to Infrequent Access storage. Files older than one year move to Glacier. We serve this content through CloudFront with appropriate cache headers based on how frequently the content changes.
Our logging strategy is surgical. We use API Gateway access logs for request tracking, which give us the essential information without the verbose execution logs. For critical endpoints, we keep verbose logging but sample it at 10% in the handler code. CloudWatch log retention is set to 30 days for application logs and 90 days for access logs. We use CloudWatch Insights for ad-hoc querying when we need to debug issues, but we don’t pay to store logs forever.
For monitoring, we track the metrics that actually matter for cost optimization. Cache hit ratios, request counts by endpoint, data transfer volumes, and Lambda duration percentiles. We have CloudWatch alarms configured for anomalies: if our CloudFront request count spikes by 50% overnight, we want to know immediately. We review our Cost Explorer monthly and look for trends, not just absolute numbers.
The entire architecture is defined in infrastructure-as-code using CloudFormation (CDK) or Terraform. This means we can replicate it across environments, track changes over time, and catch configuration drift before it becomes expensive. Every resource has cost allocation tags, which lets us break down spending by product, team, and environment.
This architecture handles millions of requests per month while keeping costs predictable and manageable. A typical small-to-medium web app running on this architecture costs $500–1,500/month depending on traffic, which is 60–70% less than the same app built with the anti-patterns we’ve discussed.
The Ongoing Practice of Cost Optimization
Here’s the uncomfortable truth about AWS cost optimization: it’s never finished. The cloud is constantly evolving. New services launch with better price-performance ratios. Your application’s usage patterns change. Features that made sense last year might not make sense today. If you treat cost optimization as a one-time project, you’ll end up right back where you started within a year.
The companies that manage their AWS costs effectively have made it part of their operational rhythm. They review Cost Explorer monthly, looking for trends and anomalies. They monitor their cache hit ratios and request patterns. They question architectural decisions regularly. They understand that every inefficiency costs money every single day until they fix it.
They also understand that cost optimization isn’t just about cutting costs: it’s about spending money effectively. Sometimes the right decision is to spend more on infrastructure if it enables faster development or better user experience. The goal isn’t to have the lowest possible AWS bill. The goal is to understand exactly what you’re paying for and make conscious decisions about where to invest.
I’ve worked with companies that spent $50,000 per month on AWS and were getting incredible value, and companies that spent $5,000 per month and were wasting half of it. The difference wasn’t the absolute number: it was whether they understood their costs and were making intentional choices about their architecture.
If you’ve made it through all three parts of this series, you now know the most common and expensive AWS mistakes I see. You know how to fix CloudFront caching issues, how to choose the right API Gateway type, how to serve content efficiently from S3, and how to avoid drowning in logging costs. You have a reference architecture that avoids these pitfalls and a monitoring framework to catch new issues before they become expensive.
The question now is: what are you going to do about it? Pull up your AWS console right now. Look at your Cost Explorer. Find the biggest cost centers and ask yourself if you’re getting value for that spending. Check your CloudFront cache hit ratios. Review your API Gateway types. Look at your CloudWatch Logs costs. I guarantee you’ll find at least one issue that’s costing you money.
Fix it. Then fix the next one. Then set up monitoring so you catch the next issue before it compounds. Treat your AWS infrastructure like the business asset it is: something that requires ongoing attention and optimization, not just initial setup.
Your CFO will thank you. Your engineering team will thank you. And your future self, looking at next year’s AWS bill, will definitely thank you.
This wraps up the three-part series on AWS cost optimization. If you’ve implemented any of these fixes, I’d love to hear your results. Drop a comment with your savings story, or reach out if you’re dealing with a cost issue that doesn’t fit these patterns. I’ve probably seen something similar and can point you in the right direction.
And if you found this series valuable, share it with your team. The best way to avoid these expensive mistakes is to learn from other people’s experiences instead of making them yourself.


