AI cost optimization for business: a practical guide

The price of AI is falling fast - Andreessen Horowitz tracks the per-token cost of a fixed quality level dropping roughly 10x a year, from $60 per million tokens in late 2021 to about $0.06 three years later. Yet company AI bills keep rising: Menlo Ventures found enterprise spend on large language models climbed from $3.5 billion to $8.4 billion in a single six-month stretch. Cheaper tokens, bigger bills.

This guide closes that gap. The core idea up front: your AI bill is set by architecture and usage, not by the price of a model. You don't capture the falling token price by waiting for it - you capture it by building the structure that turns cheaper tokens into a lower bill instead of into more tokens. Five levers do that. Here they are at a glance, then a step-by-step on each, with real numbers and the order to apply them.

The five levers at a glance

| # | Lever | What it does | Potential saving | Effort | Make it a priority when... |
| --- | --- | --- | --- | --- | --- |
| 1 | Measure cost per outcome | Shows where money goes; reframes the target | Enables all the rest | Low | Always (do this first) |
| 2 | Route to the cheapest capable model | Sends easy requests to small models | 40–85% in suitable workloads | Medium | You run one model for everything |
| 3 | Cache repeated context | Stops re-paying for static prompt parts | Up to 90% on cached input | Low | Same instructions or documents sent every call |
| 4 | Batch non-urgent work | Async processing at a standing discount | 50% on input + output | Low | Reports, bulk jobs, overnight work |
| 5 | Cut tokens at the source | Compress input, cap output, use code | 20–40%+ on input; varies | Medium | Long prompts or verbose answers |

Stacking matters. Caching and batching combine, taking effective input cost on an input-heavy job down by around 95% versus the sticker rate. Routing and caching are independent, so you get both. The point isn't any single trick - it's the compounding stack.
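The "around 95%" figure is just the two discounts multiplied together. A minimal sketch of that arithmetic, assuming the cache-read and batch discounts from the table above really do stack multiplicatively on your provider (the multipliers and function names here are illustrative, not any provider's API):

```python
# Illustrative multipliers taken from the levers table; real rates vary by provider.
CACHE_READ_MULTIPLIER = 0.10  # cached input billed at ~10% of the standard input price
BATCH_MULTIPLIER = 0.50       # batch jobs billed at ~50% of the standard price


def effective_input_multiplier(cached: bool, batched: bool) -> float:
    """Fraction of the sticker input price paid per token of cached context."""
    multiplier = 1.0
    if cached:
        multiplier *= CACHE_READ_MULTIPLIER
    if batched:
        multiplier *= BATCH_MULTIPLIER
    return multiplier


def saving_percent(multiplier: float) -> float:
    """Convert a price multiplier into a percentage saving versus sticker price."""
    return round((1.0 - multiplier) * 100, 1)


if __name__ == "__main__":
    for cached, batched in [(True, False), (False, True), (True, True)]:
        m = effective_input_multiplier(cached, batched)
        print(f"cached={cached} batched={batched} saving={saving_percent(m)}%")
```

Note the saving applies only to the portion of input that is actually served from cache; uncached input and all output still pay the batch rate at best.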

It's worth knowing the stakes before you start. McKinsey's 2025 State of AI survey, covering nearly 2,000 organizations, found that while most organizations - more than three-quarters - use AI somewhere, only about 6% see meaningful bottom-line impact, and roughly two-thirds haven't begun scaling it. The teams that pull ahead didn't buy a cheaper model - they redesigned the work and measured it. We build and ship AI products, and the cost question lands at a predictable moment: the pilot worked, leadership said ship it, and three months in the invoice stopped being a rounding error. This is the playbook we use when that happens.

Where AI costs actually come from

Before optimizing, know what you're paying for. An AI feature's cost has more parts than the model bill, and each behaves differently.

| Cost source | What it is | Usually large when... | First move |
| --- | --- | --- | --- |
| Input tokens | Prompt, system instructions, history, retrieved docs | You resend long context every call | Prompt caching |
| Output tokens | What the model generates (costs ~5x input) | Answers are long or verbose | Cap output / downsize model |
| Retrieval + storage | The database that finds documents to feed the model | Heavy knowledge-base (RAG) retrieval | Prune the index; cache retrieval |
| Orchestration | Extra agent steps, tool calls, and retries behind one action | Multi-step agents | Fewer steps; routing; caching |

Two properties make this unlike ordinary software cost. It's variable - cost scales with every request, so a feature that costs a couple hundred dollars a month in a pilot can cost six figures at production traffic. And it compounds through usage: cheaper inference (the work the model does on each request you send it) just invites more requests. That's the Jevons paradox, the old rule that making something cheaper makes people use far more of it, which is why total bills climb even as per-token prices fall. Optimize the architecture and the usage, not just the model you picked.

Step 1: See where the money actually goes

You can't fix a number you can't see. Most teams start from one figure, the aggregate monthly total from their provider, and that figure tells you nothing about what to change.

Two things to instrument:

  1. Cost attribution by feature, user, model, and request type. This tells you where the money actually goes, so you stop shaving tokens off a prompt that runs a thousand times a month while ignoring one that runs ten million times.
  2. Cost per outcome as your headline metric - not cost per token. Pick the unit that maps to value:

| Use case | Track cost per... | Why it beats cost per token |
| --- | --- | --- |
| Customer support | resolved ticket | A cheap model that fails and escalates costs more per resolution |
| Document processing | processed document | Captures retries and multi-step overhead |
| Sales / lead handling | qualified lead or closed deal | Ties spend directly to revenue |
| Content generation | published asset | Counts rejected drafts, not just calls |
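Both instruments can start as a few lines over your call logs. The sketch below is one possible shape, not a prescribed schema: the `CallRecord` fields, the `PRICES` table, and the idea of tagging a call as `resolved` are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str        # product feature that made the call
    model: str          # model tier that served it
    input_tokens: int
    output_tokens: int
    resolved: bool      # did this call complete the business outcome?


# Hypothetical prices in dollars per million tokens: (input, output).
PRICES = {"small": (1.0, 5.0), "flagship": (5.0, 25.0)}


def call_cost(rec: CallRecord) -> float:
    """Dollar cost of a single model call."""
    in_price, out_price = PRICES[rec.model]
    return (rec.input_tokens * in_price + rec.output_tokens * out_price) / 1_000_000


def cost_by_feature(records: list) -> dict:
    """Attribution: total spend grouped by feature."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.feature] += call_cost(rec)
    return dict(totals)


def cost_per_outcome(records: list, feature: str) -> float:
    """Headline metric: all spend on a feature divided by outcomes achieved."""
    relevant = [r for r in records if r.feature == feature]
    outcomes = sum(1 for r in relevant if r.resolved)
    spend = sum(call_cost(r) for r in relevant)
    return spend / outcomes if outcomes else float("inf")


if __name__ == "__main__":
    log = [
        CallRecord("support", "small", 2000, 300, resolved=False),    # failed, escalated
        CallRecord("support", "flagship", 2000, 300, resolved=True),  # retry that resolved
    ]
    print(cost_by_feature(log))
    print(cost_per_outcome(log, "support"))
```

The failed first attempt counts toward the cost of the resolved ticket, which is exactly what cost per token hides.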

Once you can see the breakdown, diagnose before you optimize:

| If spend is dominated by... | It means... | Start with |
| --- | --- | --- |
| Input tokens | Static context resent every call | Caching (Step 3) |
| Output tokens | Verbose generation | Cap output (Step 4) / smaller model (Step 2) |
| Many calls per action | Extra agent steps, tool calls, and retries | Routing + fewer steps (Steps 2, 4) |
| One model for everything | No routing | Routing (Step 2) |

McKinsey's data points the same way: the companies actually making money on AI are the ones measuring it against business results, not the ones staring at a model bill and hoping.

Step 2: Stop paying premium prices for simple requests

This is where the biggest savings hide, and most teams walk straight past them: they send every request - trivial or hard - to their single most expensive model. In almost any real workload the traffic is lopsided. Most of it is simple, and a small model answers those just as well for a fraction of the price. The frontier model only earns its keep on the genuinely hard requests. The price gap makes the waste obvious - across 2026 model menus, the cheapest usable model and the most capable one can be about 100x apart. Running the expensive one on a question a cheap one would have nailed is money straight out the window.

A routing map you can adapt, using Claude tiers as the example (other providers have equivalent small/mid/flagship tiers):

| Request type | Examples | Model tier | Claude example, as of June 2026 (input/output per M tokens) |
| --- | --- | --- | --- |
| Trivial | Classification, extraction, formatting, routing | Smallest | Haiku 4.5: ~$1 / $5 |
| Standard | Summarization, translation, drafting, FAQ | Mid | Sonnet 4.6: ~$3 / $15 |
| Complex | Multi-step reasoning, coding, nuanced analysis | Flagship | Opus 4.8: ~$5 / $25 |
| Critical | High-stakes decisions, code review | Flagship + human/eval gate | Opus 4.8 + review |

These are example prices (Claude tiers, as of June 2026) to illustrate the tiering - not universal pricing. Every provider has its own rates, tier names, and discounts, and they change often, so check current numbers before you model your own bill.

How it works: a small, cheap classifier - a fast model whose only job is to sort each request into easy or hard - decides which tier it goes to. The numbers are well documented. RouteLLM, an open-source router from the LMSYS research group, published at ICLR 2025, reported cost cuts above 85% on one benchmark while holding 95% of GPT-4's quality, sending only about a quarter of requests (one in seven in the tuned version) to the expensive model. The sorter itself is tiny: it decides in under ten milliseconds, negligible next to a model answer that takes hundreds of times longer.
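To make the shape of a router concrete, here is a deliberately naive sketch. The keyword rules stand in for a trained classifier (RouteLLM and similar routers learn this decision from data rather than from keywords), and the tier names and model identifiers are placeholders, not real model IDs.

```python
# Placeholder model identifiers, one per tier of the routing map.
TIER_MODELS = {
    "trivial": "small-model",
    "standard": "mid-model",
    "complex": "flagship-model",
}

# Toy keyword rules standing in for a trained difficulty classifier.
COMPLEX_HINTS = ("debug", "refactor", "prove", "analyze", "step by step")
TRIVIAL_HINTS = ("classify", "extract", "format", "label")


def classify(request: str) -> str:
    """Sort a request into a difficulty tier; complex hints win over trivial ones."""
    text = request.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in text for hint in TRIVIAL_HINTS):
        return "trivial"
    return "standard"


def route(request: str) -> str:
    """Return the model a request should be sent to."""
    return TIER_MODELS[classify(request)]


if __name__ == "__main__":
    for req in [
        "Classify this ticket as billing or technical",
        "Summarize this email thread",
        "Debug this stack trace and refactor the handler",
    ]:
        print(f"{route(req):15s} <- {req}")
```

Keeping the tier-to-model mapping in one table means a price change or a new model is a one-line edit rather than a code hunt.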

One discipline makes or breaks routing. The savings show up immediately on the bill; the quality cost shows up late, and never on the invoice. Route too aggressively and a slice of answers quietly gets worse, with nothing in the billing report to warn you. So gate every change with an eval - a quick quality check that runs a few hundred real cases before you move more traffic to a cheaper model. Move the cheap-model share up one notch at a time, not all at once.
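One way to encode that discipline is a gate that only raises the cheap-model share when the eval holds. This is a sketch under assumptions: the 2-point tolerance and the 10-point step are arbitrary illustrative values, and each entry in the result lists is a pass/fail verdict from whatever eval you already run on real cases.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of eval cases that passed."""
    return sum(results) / len(results) if results else 0.0


def next_cheap_share(
    current_share: float,
    cheap_results: list[bool],
    baseline_results: list[bool],
    max_quality_drop: float = 0.02,  # assumed tolerance: 2 percentage points
    step: float = 0.10,              # assumed notch size: 10 percentage points
) -> float:
    """Raise the cheap-model traffic share by one step only if quality holds."""
    drop = pass_rate(baseline_results) - pass_rate(cheap_results)
    if drop > max_quality_drop:
        return current_share  # hold: the cheap model regressed beyond tolerance
    return min(1.0, round(current_share + step, 2))


if __name__ == "__main__":
    baseline = [True] * 95 + [False] * 5   # flagship passes 95 of 100 cases
    cheap_ok = [True] * 94 + [False] * 6   # cheap model passes 94 of 100
    cheap_bad = [True] * 90 + [False] * 10  # cheap model passes 90 of 100
    print(next_cheap_share(0.2, cheap_ok, baseline))
    print(next_cheap_share(0.2, cheap_bad, baseline))
```

Run the gate on the same eval set each time, so a change in the number reflects the routing change and not a change in the test cases.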

[Image: a routing diagram - incoming request hits a fast classifier, simple queries branch to a small model, complex ones to a flagship, both feeding a cost-and-quality dashboard]

Step 3: Reuse instead of paying twice

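Two provider features, prompt caching and batch processing, cut the bill without touching your model choice, and a third technique you build yourself, semantic caching, skips some calls entirely. All three are underused.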

| Technique | Mechanism | Discount | Best for | Watch out for |
| --- | --- | --- | --- | --- |
| Prompt caching | Reuses the unchanging front part of the prompt instead of reprocessing it | Cache reads ~10% of input price (90% off) | System prompts, tools, fixed documents | Min ~1,024 tokens; that front part must be identical each time |
| Semantic caching | Reuses a stored answer when a new question means the same thing | Avoids the call entirely | FAQs, repeated user questions | Needs a similarity threshold and a quality check |
| Batch API | Runs async, processed within ~24h | 50% off input and output | Reports, bulk classification, overnight jobs | Not for real-time requests |
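Semantic caching is the one row here you implement yourself, so a sketch may help. The version below is a toy: `embed` is a bag-of-words counter standing in for a real embedding model, and the 0.8 threshold is an assumed starting point you would tune against a quality check.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # minimum similarity to reuse an answer
        self.entries = []           # list of (vector, stored answer) pairs

    def put(self, question: str, answer: str) -> None:
        self.entries.append((embed(question), answer))

    def get(self, question: str):
        """Return the stored answer for the closest question, or None on a miss."""
        query = embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("How do I reset my password?", "Use the reset link on the login page.")
    print(cache.get("how do i reset my password"))   # hit: same words
    print(cache.get("What is your refund policy?"))  # miss: falls through to the model
```

On a miss you call the model as usual and `put` the new answer. Set the threshold too low and users get answers to questions they did not ask, which is why the table pairs it with a quality check.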

Here's what caching looks like in practice. Take a customer-support bot with a long knowledge base - say fifty pages of product docs - pasted into its prompt, answering a few thousand questions a day. Without caching, it re-reads all fifty pages on every single question. With caching, it reads them once and pays about a tenth of the price for that context on every question after. Same answers, a much smaller bill. Anthropic's own pricing documentation puts those cache reads at a tenth of standard input, and a cache write costs only a little more than standard uncached input (a 25% premium on the default five-minute cache, at the time of writing), so the first repeat already pays it back. Most AI features sit on a steady set of instructions or documents like this, which is why caching is usually the first thing to switch on, not the last.
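The break-even is easy to check with arithmetic. The sketch below assumes a write multiplier of 1.25x and a read multiplier of 0.10x relative to standard input, which matches Anthropic's published rates for its default cache as far as we know but should be verified against current pricing. The 40,000-token context and $3 per million input price are made-up figures for illustration.

```python
def context_cost(
    calls: int,
    context_tokens: int,
    input_price_per_m: float,
    cached: bool,
    write_multiplier: float = 1.25,  # assumed premium for the first (cache-writing) call
    read_multiplier: float = 0.10,   # assumed price of each later cache read
) -> float:
    """Dollars spent on the static context across `calls` requests.

    Assumes every call after the first hits the cache, i.e. traffic is
    steady enough that the cache never expires between calls.
    """
    per_call = context_tokens * input_price_per_m / 1_000_000
    if not cached or calls == 0:
        return calls * per_call
    return per_call * (write_multiplier + read_multiplier * (calls - 1))


if __name__ == "__main__":
    # Hypothetical: 40,000-token knowledge base, $3 per million input tokens.
    for n in (1, 2, 3000):
        plain = context_cost(n, 40_000, 3.0, cached=False)
        cached = context_cost(n, 40_000, 3.0, cached=True)
        print(f"{n:5d} calls: uncached ${plain:.2f}, cached ${cached:.2f}")
```

Under these assumptions a single call costs slightly more with caching, two calls already cost less, and at 3,000 calls the context bill drops from roughly $360 to roughly $36. If traffic is sparse enough that the cache expires between calls, you pay the write premium repeatedly and the picture changes.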

Which lever matters depends on one ratio: output tokens cost several times more than input (five times, on current Claude pricing). A classification job that sends 5,000 tokens and returns a 50-token label is almost all input - cache it. A chatbot with short prompts and long answers is almost all output - downsize or cap instead.
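A quick way to see which side of that ratio a workload sits on is to compute the input share of its cost. This is a sketch: the 5x price ratio comes from the text, while the 50% cut-off and the chatbot token counts are arbitrary illustrative values.

```python
def input_share(input_tokens: int, output_tokens: int, output_price_ratio: float = 5.0) -> float:
    """Fraction of a request's cost that comes from input tokens.

    Prices are expressed relative to input, so input costs 1 unit per token
    and output costs `output_price_ratio` units per token.
    """
    input_cost = float(input_tokens)
    output_cost = output_tokens * output_price_ratio
    total = input_cost + output_cost
    return input_cost / total if total else 0.0


def first_lever(input_tokens: int, output_tokens: int) -> str:
    """Pick the lever that targets the larger share of the bill."""
    if input_share(input_tokens, output_tokens) >= 0.5:  # assumed cut-off
        return "cache the input"
    return "cap output or downsize the model"


if __name__ == "__main__":
    # The classification job from the text: 5,000 tokens in, 50 tokens out.
    print(round(input_share(5000, 50), 3), first_lever(5000, 50))
    # A hypothetical chatbot turn: short prompt, long answer.
    print(round(input_share(200, 800), 3), first_lever(200, 800))
```

The classification job comes out at about 95% input cost; the chatbot turn at under 5%.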

Step 4: Send the model less work

The cheapest token is the one you never send.

| Technique | How | Typical effect | Note |
| --- | --- | --- | --- |
| Prompt compression | Trim redundant instructions; use structured formats; summarize low-relevance context | ~20–40% fewer input tokens | Validate quality with an eval; over-compression triggers retries |
| Output capping | Fixed format, token ceiling, max item count | Cuts output spend directly | Only where a process, not a human, reads the output |
| Deterministic code | Replace non-language steps (validation, lookups, rule-based ranking) with functions | Removes those calls entirely | More reliable than a model for the same job |

We'll say this plainly: not every step in an AI feature is an AI problem. On our own builds we're strict about it - the model only touches the parts that genuinely need language or judgment, and plain code handles the rest. It's cheaper, and honestly it breaks less, because a function doesn't have an off day or a creative interpretation.
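As an illustration of the split, here is a sketch of a support handler that answers order-status questions with a regular expression and a lookup, and only falls back to the model when plain code cannot help. The order ID format, the `ORDER_STATUS` table, and the `call_model` callback are all hypothetical stand-ins.

```python
import re
from typing import Callable

ORDER_ID = re.compile(r"\b[A-Z]{2}-\d{6}\b")  # hypothetical order ID format

ORDER_STATUS = {"AB-123456": "shipped"}  # stand-in for a database lookup


def handle(message: str, call_model: Callable[[str], str]) -> str:
    """Answer with plain code when a known order ID is present; otherwise use the model."""
    match = ORDER_ID.search(message)
    if match:
        status = ORDER_STATUS.get(match.group())
        if status is not None:
            return f"Order {match.group()} is {status}."
    # No recognizable order ID, or an unknown one: this part needs language.
    return call_model(message)


if __name__ == "__main__":
    fake_model = lambda text: "(model-generated answer)"
    print(handle("Where is AB-123456?", fake_model))
    print(handle("Which plan fits a team of five?", fake_model))
```

The first message never reaches the model, so it costs nothing in tokens and returns the same answer every time.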

Step 5: Decide what's not worth building

The largest cost decision happens before any of the above, at the feature level. McKinsey's data shows most organizations stuck in pilots that never reach scale, and many AI features never earn the running cost they carry forever. A demo that impresses in a meeting can be a margin problem in production.

| Lean toward building with AI when... | Lean toward skipping (or non-AI) when... |
| --- | --- |
| The task needs language understanding or judgment | A rule or lookup solves it deterministically |
| Volume justifies the ongoing running cost | It runs rarely and a person can absorb it |
| Errors are cheap or easily caught | Errors are high-stakes and hard to detect |
| It moves a real business metric | It's impressive in a demo but changes no metric |

Deciding what not to build is a cost optimization too - the cheapest one available, because a feature you skip costs nothing to run. That decision is most of what discovery is for.

A 30-day rollout plan

If you're starting from a bill that's climbing faster than usage justifies, this is the order that captures the most, fastest:

| Window | Focus | Concrete action | Expected outcome |
| --- | --- | --- | --- |
| Week 1 | Visibility | Add cost attribution by feature/model/request; pick a cost-per-outcome metric | A baseline and a diagnosis |
| Week 2 | Quick wins | Turn on prompt caching for static prefixes; move async jobs to batch | First large bill drop, low risk |
| Week 3 | Routing | Add a classifier and a model registry; put a quality check in front of changes; raise the cheap-model share gradually | The biggest structural saving |
| Week 4 | Trim + decide | Compress prompts, cap outputs, replace LLM steps with code; review which features earn their cost | A leaner, governed system |

Checklist (quick reference)

  • Instrument first. Attribution by feature, user, model, request type. Headline metric: cost per outcome.
  • Route by difficulty. Cheapest capable model per request; a quality check on every change.
  • Cache the repeated parts. System prompts, tools, fixed documents; reuse stored answers for repeated questions.
  • Batch what can wait. Async queue for non-urgent work.
  • Trim input, cap output. Compress where quality holds; limit output where no human reads it.
  • Replace LLM steps with code wherever the task doesn't need language.
  • Decide what not to build. The cheapest feature to run is the one you chose not to ship.

None of this is exotic. Bills run away not for lack of clever tricks but because cost discipline gets deferred while the team is shipping, and the architecture hardens around expensive defaults. Token prices will keep falling on their own; whether that reaches your bottom line depends entirely on the structure you put around it.

A fair warning on the numbers in this guide: the figures are real, but they're benchmarks and provider rates, not promises. What you'll actually save depends on your traffic shape (how much of it is simple versus hard), how your prompts are built, how fast you need answers, and how much quality you're willing to trade. That's exactly why measurement comes first and a quality check sits in front of every change.

If you're planning to add AI to your product, the cost structure should be designed before development starts, not patched in after the first big invoice. In discovery, we map where the model is really needed, where plain code is enough, and what the production cost will actually look like.

Book a call and we'll walk through it.
