
Konstantin Semenenko
June 26, 2026
7-minute read
AI keeps getting cheaper, but AI bills keep climbing. The fix is architecture, not a cheaper model - measure cost per outcome, route to smaller models, and cache what repeats to cut production costs by more than half.




The price of AI is falling fast - Andreessen Horowitz tracks the per-token cost of a fixed quality level dropping roughly 10x a year, from $60 per million tokens in late 2021 to about $0.06 three years later. Yet company AI bills keep rising: Menlo Ventures found enterprise spend on large language models climbed from $3.5 billion to $8.4 billion in a single six-month stretch. Cheaper tokens, bigger bills.
This guide closes that gap. The core idea up front: your AI bill is set by architecture and usage, not by the price of a model. You don't capture the falling token price by waiting for it - you capture it by building the structure that turns cheaper tokens into a lower bill instead of into more tokens. Five levers do that. Here they are at a glance, then a step-by-step on each, with real numbers and the order to apply them.
Stacking matters. Caching and batching combine, taking effective input cost on an input-heavy job down by around 95% versus the sticker rate. Routing and caching are independent, so you get both. The point isn't any single trick - it's the compounding stack.
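The stacking arithmetic can be sketched in a few lines. This is an illustration under stated assumptions, not a pricing calculator: the 0.10 cache-read multiplier and the 0.50 batch multiplier are assumed values, and `effective_input_multiplier` is a hypothetical helper name. Check your provider's current rates before using numbers like these.

```python
# Assumed multipliers for illustration only; real rates vary by provider.
CACHE_READ_MULTIPLIER = 0.10  # cached input billed at about a tenth of standard input
BATCH_MULTIPLIER = 0.50       # assumed 50% discount for asynchronous batch jobs


def effective_input_multiplier(cached_fraction: float, batched: bool) -> float:
    """Return the fraction of the sticker input price actually paid."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    # Blend the cached and uncached shares of the input tokens.
    blended = cached_fraction * CACHE_READ_MULTIPLIER + (1.0 - cached_fraction)
    # The batch discount applies on top of the blended rate.
    return blended * (BATCH_MULTIPLIER if batched else 1.0)


if __name__ == "__main__":
    for fraction in (0.0, 0.9, 1.0):
        paid = effective_input_multiplier(fraction, batched=True)
        print(f"cached share {fraction:.0%}: pay {paid:.1%} of sticker input cost")
```

With every input token cached and the job batched, the two discounts multiply to about 5% of the sticker rate, which is where the roughly 95% figure comes from. Real jobs land somewhere short of that because some input is always uncached.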
It's worth knowing the stakes before you start. McKinsey's 2025 State of AI survey, covering nearly 2,000 organizations, found that while most organizations - more than three-quarters - use AI somewhere, only about 6% see meaningful bottom-line impact, and roughly two-thirds haven't begun scaling it. The teams that pull ahead didn't buy a cheaper model - they redesigned the work and measured it.
We build and ship AI products, and the cost question lands at a predictable moment: the pilot worked, leadership said ship it, and three months in the invoice stopped being a rounding error. This is the playbook we use when that happens.
Before optimizing, know what you're paying for. An AI feature's cost has more parts than the model bill, and each behaves differently.
Two properties make this unlike ordinary software cost. It's variable - cost scales with every request, so a feature that costs a couple hundred dollars a month in a pilot can cost six figures at production traffic. And it compounds through usage: cheaper inference - each request you send the model - just invites more of it. That's the Jevons paradox, the old rule that making something cheaper makes people use far more of it, which is why total bills climb even as per-token prices fall. Optimize the architecture and the usage, not just the model you picked.
You can't fix a number you can't see. Most teams start from one figure, the aggregate monthly total from their provider, and that figure tells you nothing about what to change.
Two things to instrument:
Once you can see the breakdown, diagnose before you optimize:
McKinsey's data points the same way: the companies actually making money on AI are the ones measuring it against business results, not the ones staring at a model bill and hoping.
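One way to move from an aggregate monthly total to something you can act on is to tag every model call with the feature it serves and whether it produced the business result, then divide spend by successes. The sketch below assumes that setup. `CallRecord`, `call_cost_usd`, `cost_per_outcome`, and the feature names are all hypothetical, and the prices passed in are placeholders rather than any provider's real rates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class CallRecord:
    feature: str        # hypothetical tag, e.g. "support_bot"
    input_tokens: int
    output_tokens: int
    succeeded: bool     # did this call produce the outcome the feature exists for?


def call_cost_usd(record: CallRecord, input_per_mtok: float,
                  output_per_mtok: float) -> float:
    """Cost of one call, given prices in dollars per million tokens."""
    return (record.input_tokens * input_per_mtok
            + record.output_tokens * output_per_mtok) / 1_000_000


def cost_per_outcome(records: Iterable[CallRecord], input_per_mtok: float,
                     output_per_mtok: float) -> Dict[str, Optional[float]]:
    """Spend per successful outcome for each feature (None if no successes)."""
    spend = defaultdict(float)
    wins = defaultdict(int)
    for record in records:
        spend[record.feature] += call_cost_usd(record, input_per_mtok, output_per_mtok)
        wins[record.feature] += int(record.succeeded)
    return {f: (spend[f] / wins[f] if wins[f] else None) for f in spend}


if __name__ == "__main__":
    log = [
        CallRecord("support_bot", 4_000, 300, True),
        CallRecord("support_bot", 4_200, 350, False),
        CallRecord("summaries", 12_000, 800, True),
    ]
    # Placeholder prices: 3 and 15 dollars per million input/output tokens.
    print(cost_per_outcome(log, input_per_mtok=3.0, output_per_mtok=15.0))
```

A feature that returns `None` is spending money without producing any measured outcome, which is usually the first thing worth investigating.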
This is where the biggest savings hide, and most teams walk straight past them: they send every request - trivial or hard - to their single most expensive model. In almost any real workload the traffic is lopsided. Most of it is simple, and a small model answers those just as well for a fraction of the price. The frontier model only earns its keep on the genuinely hard requests. The price gap makes the waste obvious - across 2026 model menus, the cheapest usable model and the most capable one can be about 100x apart. Running the expensive one on a question a cheap one would have nailed is money straight out the window.
A routing map you can adapt, using Claude tiers as the example (other providers have equivalent small/mid/flagship tiers):
These are example prices (Claude tiers, as of June 2026) to illustrate the tiering - not universal pricing. Every provider has its own rates, tier names, and discounts, and they change often, so check current numbers before you model your own bill.
How it works: a small, cheap classifier - a fast model whose only job is to sort each request into easy or hard - decides which tier it goes to. The numbers are well documented. RouteLLM, an open-source router from the research group LMSYS, published at ICLR 2025, reported cost cuts above 85% on one benchmark while holding 95% of GPT-4's quality. It did that by sending only about a quarter of requests (in the tuned version, one in seven) to the expensive model. The sorter itself is tiny - it decides in under ten milliseconds, nothing next to a model answer that takes hundreds of times longer.
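The shape of a router is simple enough to sketch. The version below is not RouteLLM's method: a keyword-and-length heuristic stands in for the trained classifier purely so the example runs on its own. The hint list, the 150-word threshold, and the handler functions are all hypothetical; in a real system `classify` would call a small model and the handlers would call your provider's API.

```python
from typing import Callable, Dict

# Hypothetical signals that a request is hard; a real router would learn these.
HARD_HINTS = ("prove", "refactor", "analyze", "step by step", "contract")
LONG_REQUEST_WORDS = 150  # assumed threshold, tune against your own traffic


def classify(request: str) -> str:
    """Stand-in classifier: label a request 'easy' or 'hard'."""
    text = request.lower()
    if len(text.split()) > LONG_REQUEST_WORDS:
        return "hard"
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    return "easy"


def route(request: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """Send the request to the handler registered for its tier."""
    return handlers[classify(request)](request)


if __name__ == "__main__":
    # Placeholder handlers; swap in real calls to a small and a flagship model.
    handlers = {
        "easy": lambda r: f"[small model] {r}",
        "hard": lambda r: f"[flagship model] {r}",
    }
    print(route("What are your opening hours?", handlers))
    print(route("Please analyze this contract clause for risk.", handlers))
```

Keeping the classifier and the handlers separate means you can replace the heuristic with a trained model later without touching the code that calls `route`.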
One discipline makes or breaks routing. The savings show up immediately on the bill; the quality cost shows up late, and never on the invoice. Route too aggressively and a slice of answers quietly gets worse, with nothing in the billing report to warn you. So gate every change with an eval - a quick quality check that runs a few hundred real cases before you move more traffic to a cheaper model. Move the cheap-model share up one notch at a time, not all at once.
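The gate-then-ramp discipline can be written down as two small functions. This is a sketch under assumptions: scores are taken to be per-case quality values between 0 and 1 from whatever eval you already run, and the 200-case minimum, the 2-point allowed drop, and the 10-point ramp step are illustrative defaults, not recommendations. `passes_gate` and `next_cheap_share` are hypothetical names.

```python
from statistics import mean
from typing import Sequence


def passes_gate(baseline_scores: Sequence[float], candidate_scores: Sequence[float],
                max_drop: float = 0.02, min_cases: int = 200) -> bool:
    """True if the cheaper setup stays within max_drop of baseline quality."""
    # Refuse to decide on too few cases; a small sample hides regressions.
    if len(baseline_scores) < min_cases or len(candidate_scores) < min_cases:
        return False
    return mean(baseline_scores) - mean(candidate_scores) <= max_drop


def next_cheap_share(current_share: float, gate_passed: bool,
                     step: float = 0.10) -> float:
    """Raise the cheap-model traffic share by one notch only if the gate passed."""
    if not gate_passed:
        return current_share  # hold at the current share and investigate
    return min(1.0, current_share + step)


if __name__ == "__main__":
    baseline = [1.0] * 200
    candidate = [1.0] * 198 + [0.0] * 2   # two regressions in 200 cases
    ok = passes_gate(baseline, candidate)
    print("gate passed:", ok, "next share:", next_cheap_share(0.30, ok))
```

Holding the share when the gate fails, rather than automatically rolling back, is a design choice made here for simplicity; a stricter setup might step the share back down instead.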
[Image: a routing diagram - incoming request hits a fast classifier, simple queries branch to a small model, complex ones to a flagship, both feeding a cost-and-quality dashboard]
Two provider features cut the bill without touching your model choice. Both are underused.
Here's what caching looks like in practice. Take a customer-support bot with a long knowledge base - say fifty pages of product docs - pasted into its prompt, answering a few thousand questions a day. Without caching, it re-reads all fifty pages on every single question. With caching, it reads them once and pays about a tenth of the price for that context on every question after. Same answers, a much smaller bill. Anthropic's own pricing documentation puts those cache reads at a tenth of standard input, and a cache write costs only a little more than sending the same context uncached, so the first repeat already pays it back. Most AI features sit on a steady set of instructions or documents like this, which is why caching is usually the first thing to switch on, not the last.
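The break-even claim is easy to check with arithmetic. The sketch below measures cost in units of "one uncached read of the shared context" so no dollar prices are needed. The 0.10 read multiplier follows the figure quoted above; the 1.25 write multiplier is an assumed value for "a little more than standard input", and the model ignores cache expiry by assuming every request after the first hits the cache. `context_cost_units` is a hypothetical helper.

```python
def context_cost_units(num_requests: int, cached: bool,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> float:
    """Cost of the shared context across requests, in units of one uncached read."""
    if num_requests <= 0:
        return 0.0
    if not cached:
        # Every request pays full price for the same context.
        return float(num_requests)
    # First request writes the cache; assumed: all later requests read from it.
    return write_multiplier + (num_requests - 1) * read_multiplier


if __name__ == "__main__":
    for n in (1, 2, 10, 3000):
        plain = context_cost_units(n, cached=False)
        cached = context_cost_units(n, cached=True)
        print(f"{n:>5} requests: uncached {plain:8.2f}  cached {cached:8.2f}")
```

Under these assumptions a single request costs slightly more with caching, two requests already cost less, and at a few thousand requests a day the context cost approaches a tenth of the uncached figure.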
Which lever matters depends on one ratio: output tokens cost several times more than input (five times, on current Claude pricing). A classification job that sends 5,000 tokens and returns a 50-token label is almost all input - cache it. A chatbot with short prompts and long answers is almost all output - downsize or cap instead.
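To see which side of the ratio a workload sits on, compute the share of its cost that comes from input. The sketch below uses the 5x output-to-input price ratio mentioned above as its default; the chatbot token counts are made-up illustrative values, and `input_cost_share` is a hypothetical helper.

```python
def input_cost_share(input_tokens: int, output_tokens: int,
                     output_to_input_price_ratio: float = 5.0) -> float:
    """Fraction of a request's token cost that comes from input tokens."""
    input_cost = float(input_tokens)  # priced at 1 unit per token
    output_cost = output_tokens * output_to_input_price_ratio
    total = input_cost + output_cost
    if total == 0:
        raise ValueError("request has no tokens")
    return input_cost / total


if __name__ == "__main__":
    # The classification job from the text: 5,000 tokens in, a 50-token label out.
    print(f"classifier: {input_cost_share(5_000, 50):.1%} of cost is input")
    # Hypothetical chatbot turn: short prompt, long answer.
    print(f"chatbot:    {input_cost_share(200, 800):.1%} of cost is input")
```

A share near 100% points to caching and batching; a share near 0% points to shorter outputs or a smaller model.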
The cheapest token is the one you never send.
We'll say this plainly: not every step in an AI feature is an AI problem. On our own builds we're strict about it - the model only touches the parts that genuinely need language or judgment, and plain code handles the rest. It's cheaper, and honestly it breaks less, because a function doesn't have an off day or a creative interpretation.
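As a small illustration of that split, consider pulling an order number out of a customer message. If the ID has a fixed format, a regular expression finds it for free, and the model is only asked when the plain-code path comes up empty. The `ORD-12345` format and the `ask_model` callable are hypothetical stand-ins; this is a sketch of the pattern, not a description of any particular build.

```python
import re
from typing import Callable, Optional

# Hypothetical order ID format: "ORD-" followed by five digits.
ORDER_ID = re.compile(r"\bORD-\d{5}\b")


def extract_order_id(message: str,
                     ask_model: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try plain code first; fall back to the model only when it finds nothing."""
    match = ORDER_ID.search(message)
    if match:
        return match.group(0)  # deterministic path, zero tokens spent
    return ask_model(message)  # model path, paid per request


if __name__ == "__main__":
    def fake_model(message: str) -> Optional[str]:
        # Placeholder for a real model call.
        print("model called")
        return None

    print(extract_order_id("Where is ORD-48213? It was due Monday.", fake_model))
    print(extract_order_id("My package from last week never arrived.", fake_model))
```

The first message never reaches the model. Only the second, which has no machine-readable ID, pays for a model call.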
The largest cost decision happens before any of the above, at the feature level. McKinsey's data shows most organizations stuck in pilots that never reach scale, and many AI features never earn the running cost they carry forever. A demo that impresses in a meeting can be a margin problem in production.
Deciding what not to build is a cost optimization too - the cheapest one available, because a feature you skip costs nothing to run. That decision is most of what discovery is for.
If you're starting from a bill that's climbing faster than usage justifies, this is the order that captures the most, fastest:
None of this is exotic. Bills run away not for lack of clever tricks but because cost discipline gets deferred while the team is shipping, and the architecture hardens around expensive defaults. Token prices will keep falling on their own; whether that reaches your bottom line depends entirely on the structure you put around it.
A fair warning on the numbers in this guide: the figures are real, but they're benchmarks and provider rates, not promises. What you'll actually save depends on your traffic shape (how much of it is simple versus hard), how your prompts are built, how fast you need answers, and how much quality you're willing to trade. That's exactly why measurement comes first and a quality check sits in front of every change.
If you're planning to add AI to your product, the cost structure should be designed before development starts, not patched in after the first big invoice. In discovery, we map where the model is really needed, where plain code is enough, and what the production cost will actually look like.
Book a call with us and we'll walk through it.


