Generative AI

AI token cost optimization: the playbook that actually moves the bill

Konstantin Semenenko

July 3, 2026

minutes read

The model's price per token is rarely what makes an AI bill big. What moves it: how much traffic escalates to the expensive model, whether your prompt prefix is stable enough to cache, how long the output runs, and how many tokens go to junk you never needed to send. The highest-leverage moves, in order: default to the cheapest model and escalate only failures, cache the repeated prefix, filter junk before the model, shrink and shape the payload, and make retries deliberate. Architecture moves the number far more than model choice.

The single most useful thing to know about AI cost is that the price per token, the number on every pricing page, is rarely what makes your bill big. What moves it is the shape of the system around the model: how much traffic escalates to the expensive model, whether the prompt prefix is stable enough to cache, how long the output runs, how many tokens go to identifiers and junk you never needed to send, and how often a broken format pays for a retry. This is the playbook for each of those, in rough order of impact, and the theme underneath all of it is that architecture moves the bill far more than model choice.

‍

We learned most of this the expensive way, on a single run that billed over $300,000, and wrote up the full breakdown in what a $303,030 AI bill taught us. This is the reusable playbook that came out of it.

‍

Lever 1: default to the cheapest model, escalate only failures

The biggest lever is routing. Most workflows send everything to one capable, expensive model when the large majority of the work could run on a small, cheap one. The move is to default every request to the smallest model that can do the job, with reasoning off, and escalate only the cases that actually fail to the larger model.

‍

The economics are stark, and they held on our own run: the large model can cost dozens of times more per token than the small one, so the escalation rate, the share of traffic that reaches the expensive model, becomes your single biggest cost variable. A workflow that runs 80% of steps on a small model and escalates only the hard 20% costs a fraction of an all-large-model workflow with similar results. Watch the escalation rate like it is the budget, because it is.

‍

Lever 2: cache the repeated prefix

The second lever is prompt caching, and it is the highest-ROI, lowest-risk optimization available. When the same prompt prefix repeats across requests, the provider can reuse the processed form instead of recomputing it, billing cached input at a steep discount, up to 90% off on Anthropic, 50% on OpenAI, with byte-identical output. On a workflow that reuses a large system prompt, this is a straight cut on the largest line item in most AI bills.

‍

The catch is that caching rewards a specific prompt structure, and it inverts the usual advice to keep system prompts short. Put everything stable, format rules, tool definitions, taxonomy, examples, at the front of the prompt in a byte-identical prefix, and push the only changing part, the actual user input, to the very end. Any change to a block invalidates that block and everything after it, so dynamic content must live last. One documented team raised its cache hit rate from 7% to 84% just by moving working memory out of the system prompt, cutting overall cost 59%. The discipline that makes it work: every request has to look alike so the cache actually hits.

‍

Lever 3: filter junk before the model

The cheapest token is the one you never send. Raw inputs are rarely one clean task, they bundle empty fields, duplicates, malformed rows, unsupported content, and material too short to matter. Sending the whole thing pays more for a worse result. The fix is to split inputs into coherent units and drop the obvious junk in plain code, before the model is ever called.

‍

This is the optimization teams skip because it is unglamorous, but rows filtered before the LLM cost zero tokens, and at volume that is real money. Every character you do send becomes tokens you pay for, on every row, on every pass, so pre-filtering is not a nice-to-have, it is the first place to cut.

‍

Lever 4: shrink and shape the payload

After filtering, shrink what remains. Every character you add turns into tokens, on both input and output, so verbose payloads are a recurring tax. Two specific moves matter. On input, drop what the model does not need, the worst offender is often identifiers: a single GUID tokenizes to around 20 tokens, so a list of a thousand rows can spend 20,000 tokens on IDs before a useful word. Lean on source order or short codes instead.

‍

On output, remember that output tokens cost several times more than input and never cache, so waste there is the most expensive kind. Have the model emit a compact protocol, not a pretty schema: two short lines instead of verbose JSON with long field names, with the legend living once in the cached system prompt. One caution learned the hard way: shortest is not the goal, smallest reliable is. A too-clever format that breaks on an edge case triggers a parse failure, and a parse failure is just another paid call to repair it.

‍

Lever 5: make retries deliberate

At scale, retry policy is cost policy. A naive system retries everything, turning one hard record into an unbounded series of paid calls. The discipline is to retry the failures worth retrying, rate limits, respecting the retry timing, and record the ones that are not, bad formatting, empty responses, model-side rejections, rather than repairing them with another paid call. For content-filter failures, one conservative rewrite, then stop.

‍

This matters more than it sounds because retries and loops are where cost hides. An agent that loops 10 to 14 times on an ambiguous query can quietly blow a budget, and nobody notices until the invoice arrives. Bounding retries and loops does double duty: it improves reliability and directly controls cost.

‍

The metric that ties it together

Track cost per accepted result, not cost per token. Cost per token hides everything that actually varies: the retries, the repair calls, the escalations, the loops. Cost per accepted result absorbs all of it into one number that reflects what you actually pay to get one good output. Optimize that, and you are optimizing the real bill instead of the pricing-page number.

‍

And instrument enough to see where the money goes: log model, input tokens, output tokens, and cost per call. You cannot optimize a bill you cannot itemize, which is the whole argument for AI agent observability.

‍

The takeaway

AI token cost optimization is architecture, not model shopping. The levers, in order of impact: route to the cheapest model and escalate only failures, cache the repeated prefix, filter junk before the model, shrink and shape the payload, and make retries deliberate. Measure cost per accepted result, not cost per token, and instrument every call so you can see where the spend goes. Do these and the bill bends far more than any provider switch would move it.

‍

If you are about to put serious volume through a model and want the cost structure designed before the first big invoice, that is where our AI Dev Team work starts.

‍

FAQ

What is the biggest lever for reducing AI token costs? Model routing: default to the cheapest capable model and escalate only the failing cases to a larger one. The escalation rate, the share of traffic hitting the expensive model, is usually the single biggest cost variable.

‍

How much does prompt caching save? Up to 90% on cached input tokens with Anthropic and about 50% with OpenAI, with byte-identical output. On workflows that reuse a large system prompt, it is typically the highest-ROI cost cut available. Combined with batch pricing, savings can reach around 95% on repeated tokens.

‍

How do I structure prompts for caching? Put everything stable (format rules, tool definitions, examples) at the front as a byte-identical prefix, and push the only changing part (the user input) to the end. Any change invalidates that block and everything after it, so dynamic content must live last.

‍

Why is output more expensive than input in AI costs? Output tokens typically cost several times more than input tokens and never cache, so every output token is billed in full every time. That makes output waste the most expensive kind, which is why compact output formats matter.

‍

What metric should I use to track AI cost? Cost per accepted result, not cost per token. Cost per token hides the retries, repair calls, escalations, and loops that actually drive the bill; cost per accepted result absorbs all of it into the real number you pay for one good output.

“You can’t monetize pain. You can only monetize value. The moment users feel cared for, they’ll see paying as an investment in themselves — not a cost.”

News & Insights

View all

You know what you want to build. Let's go ship it.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

AI token cost optimization: the playbook that actually moves the bill

Lever 1: default to the cheapest model, escalate only failures

Lever 2: cache the repeated prefix

Lever 3: filter junk before the model

Lever 4: shrink and shape the payload

Lever 5: make retries deliberate

The metric that ties it together

The takeaway

FAQ

News & Insights

How much does AI save in customer support?

The AI productivity paradox: why time saved isn't money saved

AI ROI by industry: where the returns are highest

You know what you want to build. Let's go ship it.

managed code