AI cost optimization for business: a practical guide

Konstantin Semenenko

June 26, 2026

AI keeps getting cheaper, but AI bills keep climbing. The fix is architecture, not a cheaper model - measure cost per outcome, route to smaller models, and cache what repeats to cut production costs by more than half.

The price of AI is falling fast - Andreessen Horowitz tracks the per-token cost of a fixed quality level dropping roughly 10x a year, from $60 per million tokens in late 2021 to about $0.06 three years later. Yet company AI bills keep rising: Menlo Ventures found enterprise spend on large language models climbed from $3.5 billion to $8.4 billion in a single six-month stretch. Cheaper tokens, bigger bills.

This guide closes that gap. The core idea up front: your AI bill is set by architecture and usage, not by the price of a model. You don't capture the falling token price by waiting for it - you capture it by building the structure that turns cheaper tokens into a lower bill instead of into more tokens. Five levers do that. Here they are at a glance, then a step-by-step on each, with real numbers and the order to apply them.

The five levers at a glance

#	Lever	What it does	Potential saving	Effort	Make it a priority when...
1	Measure cost per outcome	Shows where money goes; reframes the target	Enables all the rest	Low	Always — do this first
2	Route to the cheapest capable model	Sends easy requests to small models	40–85% in suitable workloads	Medium	You run one model for everything
3	Cache repeated context	Stops re-paying for static prompt parts	Up to 90% on cached input	Low	Same instructions or documents sent every call
4	Batch non-urgent work	Async processing at a standing discount	50% on input + output	Low	Reports, bulk jobs, overnight work
5	Cut tokens at the source	Compress input, cap output, use code	20–40%+ on input; varies	Medium	Long prompts or verbose answers

Stacking matters. Caching and batching combine, taking effective input cost on an input-heavy job down by around 95% versus the sticker rate. Routing and caching are independent, so you get both. The point isn't any single trick - it's the compounding stack.
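
The stacking arithmetic can be checked in a few lines. This is a minimal sketch, assuming the two discounts multiply and ignoring the one-time cache write; the multipliers come from the table above, and the function name is ours, not a provider API.

```python
# Illustrative multipliers from the table above; real rates vary by provider.
CACHE_READ_MULTIPLIER = 0.10   # cached input billed at ~10% of the list price
BATCH_MULTIPLIER = 0.50        # batch jobs billed at 50% of the list price


def effective_input_multiplier(cached_fraction: float, batched: bool) -> float:
    """Fraction of the list input price actually paid per input token."""
    # Blend cached and uncached input, then apply the batch discount on top.
    blended = cached_fraction * CACHE_READ_MULTIPLIER + (1 - cached_fraction) * 1.0
    return blended * (BATCH_MULTIPLIER if batched else 1.0)


fully_cached_batch = effective_input_multiplier(cached_fraction=1.0, batched=True)
print(f"{1 - fully_cached_batch:.0%} off list input price")  # prints "95% off list input price"
```

The 95% figure is the best case, where every input token is a cache hit. With 80% of input cached, the same batch job pays 14% of list price rather than 5%.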

It's worth knowing the stakes before you start. McKinsey's 2025 State of AI survey, covering nearly 2,000 organizations, found that while most organizations - more than three-quarters - use AI somewhere, only about 6% see meaningful bottom-line impact, and roughly two-thirds haven't begun scaling it. The teams that pull ahead didn't buy a cheaper model - they redesigned the work and measured it.

We build and ship AI products, and the cost question lands at a predictable moment: the pilot worked, leadership said ship it, and three months in the invoice stopped being a rounding error. This is the playbook we use when that happens.

Where AI costs actually come from

Before optimizing, know what you're paying for. An AI feature's cost has more parts than the model bill, and each behaves differently.

Cost source	What it is	Usually large when...	First move
Input tokens	Prompt, system instructions, history, retrieved docs	You resend long context every call	Prompt caching
Output tokens	What the model generates (costs ~5x input)	Answers are long or verbose	Cap output / downsize model
Retrieval + storage	The database that finds documents to feed the model	Heavy knowledge-base (RAG) retrieval	Prune the index; cache retrieval
Orchestration	Extra agent steps, tool calls, and retries behind one action	Multi-step agents	Fewer steps; routing; caching

Two properties make this unlike ordinary software cost. It's variable - cost scales with every request, so a feature that costs a couple hundred dollars a month in a pilot can cost six figures at production traffic. And it compounds through usage: when inference (each request you send the model) gets cheaper, teams simply send more of it. That's the Jevons paradox, the old observation that making a resource cheaper to use tends to raise total consumption of it, which is why total bills climb even as per-token prices fall. Optimize the architecture and the usage, not just the model you picked.

Step 1: See where the money actually goes

You can't fix a number you can't see. Most teams start from one figure, the aggregate monthly total from their provider, and that figure tells you nothing about what to change.

Two things to instrument:

Cost attribution by feature, user, model, and request type. This tells you where the money actually goes, so you stop shaving tokens off a prompt that runs a thousand times a month while ignoring one that runs ten million times.
Cost per outcome as your headline metric - not cost per token. Pick the unit that maps to value:

Use case	Track cost per...	Why it beats cost per token
Customer support	resolved ticket	A cheap model that fails and escalates costs more per resolution
Document processing	processed document	Captures retries and multi-step overhead
Sales / lead handling	qualified lead or closed deal	Ties spend directly to revenue
Content generation	published asset	Counts rejected drafts, not just calls
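
Both measurements fit in a small amount of code. The sketch below is illustrative only: the `Call` record, the tier names, and the prices in `PRICES` are assumptions, and a real setup would read token counts from your provider's usage data.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-million-token prices (input, output) in USD; use your provider's rates.
PRICES = {"small": (1.00, 5.00), "flagship": (5.00, 25.00)}


@dataclass
class Call:
    feature: str        # which product feature made the request
    model: str          # key into PRICES
    input_tokens: int
    output_tokens: int


def call_cost(call: Call) -> float:
    """Dollar cost of one model call at the assumed prices."""
    in_price, out_price = PRICES[call.model]
    return (call.input_tokens * in_price + call.output_tokens * out_price) / 1_000_000


def cost_by_feature(calls: list[Call]) -> dict[str, float]:
    """Attribute spend to features instead of reading one aggregate total."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.feature] += call_cost(call)
    return dict(totals)


def cost_per_outcome(total_cost: float, outcomes: int) -> float:
    """Spend divided by successful outcomes, e.g. resolved tickets."""
    if outcomes == 0:
        return float("inf")  # money spent, nothing delivered
    return total_cost / outcomes


calls = [
    Call("support_bot", "flagship", 4_000, 500),   # first attempt, escalated
    Call("support_bot", "flagship", 4_000, 500),   # retry, resolved
    Call("doc_tagger", "small", 2_000, 50),
]
spend = cost_by_feature(calls)
print(spend)
print(cost_per_outcome(spend["support_bot"], outcomes=1))
```

In this sample the support bot's cost per resolved ticket is twice its cost per call, because the failed first attempt still counts toward the one resolution. That is the overhead a per-token view hides.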

Once you can see the breakdown, diagnose before you optimize:

If spend is dominated by...	It means...	Start with
Input tokens	Static context resent every call	Caching (Step 3)
Output tokens	Verbose generation	Cap output / smaller model (Step 4)
Many calls per action	Extra agent steps, tool calls, and retries	Routing + fewer steps (Steps 2, 4)
One model for everything	No routing	Routing (Step 2)

McKinsey's data points the same way: the companies actually making money on AI are the ones measuring it against business results, not the ones staring at a model bill and hoping.

Step 2: Stop paying premium prices for simple requests

This is where the biggest savings hide, and most teams walk straight past them: they send every request - trivial or hard - to their single most expensive model. In almost any real workload the traffic is lopsided. Most of it is simple, and a small model answers those just as well for a fraction of the price. The frontier model only earns its keep on the genuinely hard requests. The price gap makes the waste obvious - across 2026 model menus, the cheapest usable model and the most capable one can be about 100x apart. Running the expensive one on a question a cheap one would have nailed is money straight out the window.

A routing map you can adapt, using Claude tiers as the example (other providers have equivalent small/mid/flagship tiers):

Request type	Examples	Model tier	Claude example, as of June 2026 (input/output per M tokens)
Trivial	Classification, extraction, formatting, routing	Smallest	Haiku 4.5 — ~$1 / $5
Standard	Summarization, translation, drafting, FAQ	Mid	Sonnet 4.6 — ~$3 / $15
Complex	Multi-step reasoning, coding, nuanced analysis	Flagship	Opus 4.8 — ~$5 / $25
Critical	High-stakes decisions, code review	Flagship + human/eval gate	Opus 4.8 + review

These are example prices (Claude tiers, as of June 2026) to illustrate the tiering - not universal pricing. Every provider has its own rates, tier names, and discounts, and they change often, so check current numbers before you model your own bill.

How it works: a small, cheap classifier - a fast model whose only job is to sort each request into easy or hard - decides which tier it goes to. The numbers are well documented. RouteLLM, an open-source router from the LMSYS research group, published at ICLR 2025, reported cost cuts above 85% on one benchmark while holding 95% of GPT-4's quality, sending only about a quarter of requests (in the tuned version, one in seven) to the expensive model. The classifier itself is tiny - it decides in under ten milliseconds, nothing next to a model answer that takes hundreds of times longer.
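
The shape of a router is simple even if the classifier inside it is not. Below is a minimal sketch, not RouteLLM's method: keyword rules stand in for a trained classifier, the model identifiers are placeholders, and `fake_call_model` is a stub where a real provider SDK call would go.

```python
from typing import Callable

# Tier names follow the routing map above; the model identifiers are placeholders.
TIER_MODELS = {"trivial": "small-model", "standard": "mid-model", "complex": "flagship-model"}

# Keyword heuristics standing in for a trained classifier (hypothetical, illustration only).
COMPLEX_HINTS = ("debug", "prove", "refactor", "analyze", "step by step")
TRIVIAL_HINTS = ("classify", "extract", "format", "label")


def classify(request: str) -> str:
    """Sort a request into a tier. A production router would use a learned model."""
    text = request.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in text for hint in TRIVIAL_HINTS) and len(text) < 500:
        return "trivial"
    return "standard"


def route(request: str, call_model: Callable[[str, str], str]) -> tuple[str, str]:
    """Pick the cheapest tier the classifier considers capable, then call it."""
    model = TIER_MODELS[classify(request)]
    return model, call_model(model, request)


def fake_call_model(model: str, request: str) -> str:
    """Stub in place of a real provider SDK call."""
    return f"[{model}] answer"


print(route("Extract the invoice number from this email", fake_call_model))
print(route("Debug this race condition in my worker pool", fake_call_model))
```

Keeping the tier-to-model mapping in one table means a price change or a new model is a one-line edit rather than a code change in every feature.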

One discipline makes or breaks routing. The savings show up immediately on the bill; the quality cost shows up late, and never on the invoice. Route too aggressively and a slice of answers quietly gets worse, with nothing in the billing report to warn you. So gate every change with an eval - a quick quality check that runs a few hundred real cases before you move more traffic to a cheaper model. Move the cheap-model share up one notch at a time, not all at once.
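
An eval gate of this kind can be sketched as follows. The exact-match scoring, the 2-point tolerance, and the 10-point step are assumptions chosen for illustration; real evals often need a grader model or a rubric rather than string equality.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (request, expected answer)


def pass_rate(model_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Share of eval cases where the model's answer matches the expected one."""
    passed = sum(1 for request, expected in cases if model_fn(request).strip() == expected)
    return passed / len(cases)


def next_cheap_share(current_share: float,
                     cheap_fn: Callable[[str], str],
                     baseline_fn: Callable[[str], str],
                     cases: list[EvalCase],
                     max_drop: float = 0.02,
                     step: float = 0.10) -> float:
    """Raise the cheap-model traffic share one notch only if quality holds."""
    drop = pass_rate(baseline_fn, cases) - pass_rate(cheap_fn, cases)
    if drop > max_drop:
        return current_share          # gate fails: keep traffic where it is
    return min(1.0, current_share + step)


# Tiny demo with stub models; real runs would use a few hundred logged requests.
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3+3", "6"), ("color of sky", "blue")]
answers = dict(cases)
print(next_cheap_share(0.2, lambda q: answers[q], lambda q: answers[q], cases))
print(next_cheap_share(0.2, lambda q: "4", lambda q: answers[q], cases))
```

The gate compares the cheap model against the current baseline on the same cases, so a weak eval set penalizes both equally instead of blocking every change.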

[Image: a routing diagram - incoming request hits a fast classifier, simple queries branch to a small model, complex ones to a flagship, both feeding a cost-and-quality dashboard]

Step 3: Reuse instead of paying twice

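Three techniques cut the bill without touching your model choice. Two are provider features you switch on (prompt caching and the batch API); the third, semantic caching, is a small layer you build yourself. All three are underused.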

Technique	Mechanism	Discount	Best for	Watch out for
Prompt caching	Reuses the unchanging front part of the prompt instead of reprocessing it	Cache reads ~10% of input price (90% off)	System prompts, tools, fixed documents	Minimum cacheable length of ~1,024 tokens or more, depending on the model; that front part must be identical each time
Semantic caching	Reuses a stored answer when a new question means the same thing	Avoids the call entirely	FAQs, repeated user questions	Needs a similarity threshold and a quality check
Batch API	Runs async, processed within ~24h	50% off input and output	Reports, bulk classification, overnight jobs	Not for real-time requests
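
Of the three rows, semantic caching is the one you implement yourself, so a sketch helps. The version below is a toy under loud assumptions: a bag-of-words vector stands in for a real embedding model, the 0.8 threshold is arbitrary, and the `SemanticCache` class is hypothetical rather than any library's API.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold            # assumed value; tune it against an eval
        self.entries: list[tuple[Counter, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return a stored answer if a past question is similar enough, else None."""
        query = embed(question)
        best = max(self.entries, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question: str, answer: str) -> None:
        """Store an answer under the vector of the question that produced it."""
        self.entries.append((embed(question), answer))


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link on the login page.")
print(cache.get("how do i reset my password"))   # hit: no model call needed
print(cache.get("What is your refund policy?"))  # miss: call the model, then put()
```

The threshold is where the quality risk lives: set it too low and users get an answer to a different question, which is why the table pairs this technique with a quality check.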

Here's what caching looks like in practice. Take a customer-support bot with a long knowledge base - say fifty pages of product docs - pasted into its prompt, answering a few thousand questions a day. Without caching, it re-reads all fifty pages on every single question. With caching, it reads them once and, as long as questions keep arriving often enough to keep the cache warm, pays about a tenth of the price for that context on every question after. Same answers, a much smaller bill. Anthropic's own pricing documentation puts those cache reads at a tenth of standard input, and a cache write costs only a little more than standard uncached input (1.25x for the default five-minute cache), so the first repeat already pays it back. Most AI features sit on a steady set of instructions or documents like this, which is why caching is usually the first thing to switch on, not the last.
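
The break-even point is easy to verify with arithmetic. This sketch measures the static prefix in units of one uncached read; the 0.10 read multiplier is the figure quoted above, while the 1.25 write multiplier and the assumption that the cache never expires between calls are ours for illustration, so check your provider's pricing page.

```python
# Multipliers relative to the standard input price (assumptions, see the lead-in).
CACHE_WRITE = 1.25
CACHE_READ = 0.10


def prefix_cost(requests: int, cached: bool) -> float:
    """Cost of the static prefix over `requests` calls (requests >= 1), in uncached-read units."""
    if not cached:
        return float(requests)
    # The first call writes the cache; every later call reads it.
    return CACHE_WRITE + CACHE_READ * (requests - 1)


for n in (1, 2, 1000):
    print(n, prefix_cost(n, cached=False), round(prefix_cost(n, cached=True), 2))
```

Under these assumptions caching costs slightly more on a single call (1.25 against 1.0), is already ahead on the second (1.35 against 2.0), and lands near 10% of the uncached cost at a thousand calls.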

Which lever matters depends on where your spend sits once tokens are weighted by price: output tokens cost several times more than input (five times, on current Claude pricing). A classification job that sends 5,000 tokens and returns a 50-token label is almost all input - cache it. A chatbot with short prompts and long answers is almost all output - downsize or cap instead.
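
The two cases above can be put into numbers. In this sketch the 5x output ratio is the Claude figure from the text; the chatbot's 300-token prompt and 800-token answer are assumed values picked for illustration.

```python
def input_share(input_tokens: int, output_tokens: int, output_price_ratio: float = 5.0) -> float:
    """Fraction of a request's cost that comes from input tokens."""
    input_cost = float(input_tokens)                  # in units of one input token's price
    output_cost = output_tokens * output_price_ratio
    return input_cost / (input_cost + output_cost)


print(f"classifier: {input_share(5_000, 50):.0%} input")   # prints "classifier: 95% input"
print(f"chatbot:    {input_share(300, 800):.0%} input")    # prints "chatbot:    7% input"
```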

Step 4: Send the model less work

The cheapest token is the one you never send.

Technique	How	Typical effect	Note
Prompt compression	Trim redundant instructions; use structured formats; summarize low-relevance context	~20–40% fewer input tokens	Validate quality with an eval — over-compression triggers retries
Output capping	Fixed format, token ceiling, max item count	Cuts output spend directly	Only where a process, not a human, reads the output
Deterministic code	Replace non-language steps (validation, lookups, rule-based ranking) with functions	Removes those calls entirely	More reliable than a model for the same job

We'll say this plainly: not every step in an AI feature is an AI problem. On our own builds we're strict about it - the model only touches the parts that genuinely need language or judgment, and plain code handles the rest. It's cheaper, and honestly it breaks less, because a function doesn't have an off day or a creative interpretation.
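
As a sketch of that split, the pipeline below validates and ranks with plain functions and calls the model only to phrase the reply. Every name here is hypothetical (`is_valid_ticket`, `top_articles`, the ticket fields), and keyword overlap is a deliberately crude stand-in for whatever ranking rule fits your data.

```python
from typing import Callable


def is_valid_ticket(ticket: dict) -> bool:
    """Validation: a rule, not a model call."""
    return bool(ticket.get("customer_id")) and len(ticket.get("text", "").strip()) > 0


def top_articles(ticket_text: str, articles: dict[str, set[str]], limit: int = 2) -> list[str]:
    """Rule-based ranking: order help articles by keyword overlap with the ticket."""
    words = set(ticket_text.lower().split())
    ranked = sorted(articles, key=lambda title: len(words & articles[title]), reverse=True)
    return ranked[:limit]


def answer(ticket: dict,
           articles: dict[str, set[str]],
           draft_reply: Callable[[str, list[str]], str]) -> str:
    """Run the deterministic steps in code; reserve the model for drafting."""
    if not is_valid_ticket(ticket):
        return "Please include your customer ID and a description."   # no tokens spent
    context = top_articles(ticket["text"], articles)
    return draft_reply(ticket["text"], context)   # the only step that needs language


articles = {
    "Reset password": {"password", "reset"},
    "Billing": {"invoice", "billing"},
    "Shipping": {"delivery"},
}
print(top_articles("I forgot my password and need a reset", articles))
```

Invalid tickets never reach the model at all, and the ranking step returns the same result every time for the same input, which makes it testable in a way a model call is not.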

Step 5: Decide what's not worth building

The largest cost decision happens before any of the above, at the feature level. McKinsey's data shows most organizations stuck in pilots that never reach scale, and many AI features never earn the running cost they carry forever. A demo that impresses in a meeting can be a margin problem in production.

Lean toward building with AI when...	Lean toward skipping (or non-AI) when...
The task needs language understanding or judgment	A rule or lookup solves it deterministically
Volume justifies the ongoing running cost	It runs rarely and a person can absorb it
Errors are cheap or easily caught	Errors are high-stakes and hard to detect
It moves a real business metric	It's impressive in a demo but changes no metric

Deciding what not to build is a cost optimization too - the cheapest one available, because a feature you skip costs nothing to run. That decision is most of what discovery is for.

A 30-day rollout plan

If you're starting from a bill that's climbing faster than usage justifies, this is the order that captures the most, fastest:

Window	Focus	Concrete action	Expected outcome
Week 1	Visibility	Add cost attribution by feature/model/request; pick a cost-per-outcome metric	A baseline and a diagnosis
Week 2	Quick wins	Turn on prompt caching for static prefixes; move async jobs to batch	First large bill drop, low risk
Week 3	Routing	Add a classifier and a model registry; put a quality check in front of changes; raise the cheap-model share gradually	The biggest structural saving
Week 4	Trim + decide	Compress prompts, cap outputs, replace LLM steps with code; review which features earn their cost	A leaner, governed system

Checklist (quick reference)

Instrument first. Attribution by feature, user, model, request type. Headline metric: cost per outcome.
Route by difficulty. Cheapest capable model per request; a quality check on every change.
Cache the repeated parts. System prompts, tools, fixed documents; reuse stored answers for repeated questions.
Batch what can wait. Async queue for non-urgent work.
Trim input, cap output. Compress where quality holds; limit output where no human reads it.
Replace LLM steps with code wherever the task doesn't need language.
Decide what not to build. The cheapest feature to run is the one you chose not to ship.

None of this is exotic. Bills run away not for lack of clever tricks but because cost discipline gets deferred while the team is shipping, and the architecture hardens around expensive defaults. Token prices will keep falling on their own; whether that reaches your bottom line depends entirely on the structure you put around it.

A fair warning on the numbers in this guide: the figures are real, but they're benchmarks and provider rates, not promises. What you'll actually save depends on your traffic shape (how much of it is simple versus hard), how your prompts are built, how fast you need answers, and how much quality you're willing to trade. That's exactly why measurement comes first and a quality check sits in front of every change.

If you're planning to add AI to your product, the cost structure should be designed before development starts, not patched in after the first big invoice. In discovery, we map where the model is really needed, where plain code is enough, and what the production cost will actually look like.

Book a call with us and we'll walk through it.
