When a normal workflow turns into a $30,000 invoice.
On May 14, The Register covered a story that had been bouncing around Hacker News the day before. A developer was running a local coding-agent setup (Droid talking to LiteLLM, LiteLLM forwarding through AWS Bedrock to Claude Opus) on a normal development workflow. No leaked API key. No crypto miner. No single absurd request. Just an agent doing its job: reading the repo, generating code, iterating.
The gross Bedrock charge for that session ran to $37,901.73. AWS credits absorbed about $8,000 of it. Net cost: $29,875.19.
The cost breakdown is where the engineering lesson lives. The same workflow, with caching actually working through the stack, would have cost roughly $1,000 instead of $30,000. The model was identical. The prompts were close enough to cache. The cache just was not being hit.
The developer had AWS Cost Anomaly Detection configured, with a reasonable threshold. The alert never fired. The Register's reporting names the structural reason: AWS Cost Anomaly Detection does not cover charges billed through AWS Marketplace. Claude on Bedrock is billed through Marketplace. So the alert mechanism could not see the spend it was meant to alarm on, even with the alert configured correctly.
Budget alerts are not a control surface, and "supports prompt caching" is not "is using prompt caching." Both are notification mechanisms, not enforcement mechanisms, and a normal workflow is enough to expose the gap between them.
The first instinct reading this is to file it as user error. The developer should have configured CAD differently, watched the bill daily, used a lower-cost model in the loop. There is a kernel of truth there. Daily bill-watching would have caught it. So would a tighter alert threshold, in a system that worked the way most engineers assume it does.
That frame misses the structural issue. The developer did configure CAD. The threshold was reasonable. The system did not deliver on what it advertised: it did not see Marketplace billing. And one layer up, the same pattern repeats. Every component in the toolchain (Droid, LiteLLM, Bedrock, the Anthropic SDK underneath) advertises support for prompt caching. None of them owned end-to-end caching. The result was a workflow where every layer was "doing its job" and the cache never engaged.
A decade ago, the same pattern hit cloud infrastructure broadly. EC2 was easy to spin up. Lambda was easy to call. The bill at the end of the month was not easy to predict. The industry's response was not "use less cloud." It was IAM, service quotas, billing tags, cost allocation, anomaly detection, FinOps as a discipline. Cloud became cheap to run and cheap to audit, because the operating posture caught up with the operating reality.
AI workloads are at the same inflection point. Premium models behind autonomous or semi-autonomous agents make small configuration problems into large financial events before finance, engineering, or operations notice. The right response is not "use cheaper models." It is the same operational discipline cloud went through: hard controls, telemetry that matches the billing reality, and architecture where the layer closest to the model owns the contract.
Four moves, in the same shape we have applied to AI privacy and AI tool boundaries in prior pieces.
Cache eligibility lives in the prompt prefix. If the prefix is stable across requests, the model returns cached tokens at a fraction of the cost. If anything upstream of the model mutates the prefix between requests, even a timestamp injected into a system prompt, the cache becomes useless and every input token is charged at the uncached rate. The 6.47B-versus-1.67B split in the Bedrock incident is what this failure looks like at scale.
The fix is not "tell every layer to be careful." The fix is that one layer owns prompt assembly, ideally the layer closest to the model endpoint, and upstream layers are forbidden from mutating the prompt in ways that break cache eligibility. This is a code-review rule, not a runtime check. Anti-patterns (timestamps in system prompts, request IDs interpolated into the prefix, tool schemas reordered per request) get caught at PR time, not after the invoice arrives.
Alerts notify; they do not stop. The next request after the alert fires is the same as the request before the alert fired. The system has no idea the alert went out.
Hard ceilings live in the application layer. Per-project token budgets, per-environment request quotas, per-day spend caps. When the ceiling is hit, the next request returns a 429 or a safe error, not a $50 charge. The ceilings are not aspirations; they are refusals. Production kill switches that an operator can trip without rolling code. This is the difference between cost as a notification and cost as a control surface.
For agent workflows specifically, time-boxing complements spend-boxing. An agent that has run for two hours without operator approval should suspend, regardless of how much budget is left. The Bedrock incident was a coding agent that kept running; an agent that had been forced to surface for approval after two hours would have been caught.
The CAD-versus-Marketplace gap is the central data point. Native AWS cost tooling is excellent for native AWS services; it does not consistently cover third-party services billed through AWS Marketplace. Bedrock-as-billing-pass-through is a known blind spot. Bedrock is not unique in this; any cloud where premium models are sold through a marketplace channel has the same structural risk.
The defense is belt and suspenders. Native cloud billing alerts where they work. In-application token counters and spend ceilings for everything the cloud tooling does not see. Daily reconciliation between what the application thinks it spent and what the cloud will eventually bill. The point is not redundancy for redundancy's sake; the point is that no single observability mechanism covers the whole billing surface, and a system that depends on one mechanism is one Marketplace re-routing away from a $30,000 surprise.
Every component in the modern AI stack advertises prompt caching. The Anthropic SDK has caching. Bedrock has caching. LiteLLM has caching. Agent frameworks describe caching support in their feature lists. None of those statements tell you whether your specific workflow, through your specific layer stack, is actually hitting the cache.
The only honest answer comes from telemetry. Log the input token count and the cache-read token count on every request. Compute the cache hit rate over a representative slice of traffic. If the cache-read ratio is less than what your workflow's prompt prefix should produce, something between your application and the model is breaking cache eligibility. Find it. Until it is found, treat the cost as uncached and budget accordingly.
This is the validation step that the Bedrock incident skipped. The developer was running a workflow where caching, if engaged, would have produced massive savings. They assumed the layers were cooperating because the layers advertised cooperation. They were not. A single hour of cache-hit-rate validation would have surfaced it before $30,000 of uncached input tokens accumulated.
When five layers each "support" a feature and none of them owns it, the feature does not work; the same is true of cost controls, which is why "alerts plus credits plus prompt caching support" can compose into a $30,000 invoice from a normal workflow.
If you are running AI workloads in production and the question "could a normal session in our stack cost us five figures we did not budget for" is open, the next move is a 30-minute conversation. Bring the rough toolchain shape (which model, which gateway, which agent framework if any), an estimate of your monthly AI spend, and where your cost telemetry lives today. Within the call we will tell you where the gap most likely sits, what we would tighten first, and whether a deeper SDS engagement is the right next step.
We do not take every engagement, and we will tell you on the call whether we are the right partner.