May 9, 2026

Where Do Codex's Cached Tokens Come From?

After a Codex run, you might notice a token breakdown like this in your terminal:

Token usage: total=188,614 input=173,815 (+ 5,306,112 cached) output=14,799 (reasoning 2,123)

That 5,306,112 cached figure looks enormous - roughly 30x the uncached input. Here's what's going on.
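To check the ratio yourself, the summary line can be parsed with a small script. This is a minimal sketch that assumes the line has exactly the shape shown above; the `parse_usage` helper and its regular expression are illustrative, not part of Codex.

```python
import re

# The summary line exactly as printed in the example above.
USAGE_LINE = (
    "Token usage: total=188,614 input=173,815 "
    "(+ 5,306,112 cached) output=14,799 (reasoning 2,123)"
)

# Hypothetical pattern matching that one line shape; other Codex versions
# may print the fields differently.
PATTERN = re.compile(
    r"total=(?P<total>[\d,]+)\s+"
    r"input=(?P<input>[\d,]+)\s+"
    r"\(\+\s*(?P<cached>[\d,]+)\s+cached\)\s+"
    r"output=(?P<output>[\d,]+)\s+"
    r"\(reasoning\s+(?P<reasoning>[\d,]+)\)"
)


def parse_usage(line: str) -> dict[str, int]:
    """Return the numeric fields of a token usage line as integers."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized token usage line")
    return {name: int(value.replace(",", "")) for name, value in match.groupdict().items()}


usage = parse_usage(USAGE_LINE)
ratio = usage["cached"] / usage["input"]
print(usage)
print(f"cached / input = {ratio:.1f}x")  # cached / input = 30.5x
```

Note that in this example `total` equals `input + output`; the cached tokens are reported separately and are not folded into the total.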

Prompt Caching 101

OpenAI's prompt caching works by detecting when an API request shares an exact prefix with a recently processed prompt. If a match is found, the model skips recomputing that prefix and reuses cached key/value tensors from the attention layers. The result: up to 80% lower latency and up to 90% cheaper input token costs.

Caching is automatic - no configuration required. It kicks in for prompts of 1,024 tokens or more, with cache hits occurring in increments of 128 tokens.
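The threshold and increment rule can be modeled in a few lines. This is a simplified sketch, assuming a cache hit covers the largest multiple of 128 tokens that fits inside the shared prefix and that nothing is cached below 1,024 tokens; `max_cached_tokens` is a hypothetical helper, and the real service may apply additional conditions such as cache expiry.

```python
MIN_CACHEABLE_TOKENS = 1024  # minimum prompt length quoted above
CACHE_INCREMENT = 128        # cache hits grow in steps of this size


def max_cached_tokens(shared_prefix_tokens: int) -> int:
    """Upper bound on cached tokens for a shared prefix (simplified model)."""
    if shared_prefix_tokens < MIN_CACHEABLE_TOKENS:
        return 0
    # Round down to the nearest multiple of the increment.
    return (shared_prefix_tokens // CACHE_INCREMENT) * CACHE_INCREMENT


for prefix in (1_000, 1_024, 1_100, 1_300, 100_000):
    print(prefix, "->", max_cached_tokens(prefix))
# 1000 -> 0
# 1024 -> 1024
# 1100 -> 1024
# 1300 -> 1280
# 100000 -> 99968
```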

Key constraint: it must be an exact prefix match. Changing even a single character invalidates the cache from that point onward, so an edit near the start of the prompt wipes out nearly all of the reuse.
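The effect of an edit depends entirely on where it lands. The sketch below uses plain integers as stand-in tokens (an assumption; real prompts are tokenized by the model's tokenizer) and measures how much of an old prompt survives as a shared prefix.

```python
from itertools import takewhile


def shared_prefix_len(old: list[int], new: list[int]) -> int:
    """Count the leading tokens that are identical in both prompts."""
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(old, new))
    return sum(1 for _ in matching)


base = list(range(2000))                          # stand-in for a 2,000-token prompt
appended = base + [9001, 9002]                    # new content added at the end
edited_early = [-1] + base[1:]                    # first token changed
edited_late = base[:1500] + [-1] + base[1501:]    # token 1500 changed

print(shared_prefix_len(base, appended))      # 2000: the whole old prompt is reusable
print(shared_prefix_len(base, edited_early))  # 0: nothing is reusable
print(shared_prefix_len(base, edited_late))   # 1500: reusable only up to the edit
```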

How Codex Exploits This

Codex's agent loop is designed around prompt caching as a first-class concern.

The Agent Loop Appends, Never Mutates

Each time Codex calls the model - to make a tool call, interpret a result, decide a next step - it sends the full conversation history again. Rather than modifying earlier messages, the loop strictly appends new content at the end:

[system prompt] [tool defs] [env context] [turn 1] [tool call] [tool result] [turn 2] ...
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                         stable prefix -> cache hit

The old prompt is always an exact prefix of the new one. This is by design.
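A toy version of such a loop makes the prefix property easy to verify. This is an illustrative sketch, not Codex's actual implementation: the message strings and the `run_agent_loop` helper are hypothetical, and no model is called.

```python
def run_agent_loop(steps: int) -> list[tuple[str, ...]]:
    """Return the prompt that would be sent on each model call."""
    # Hypothetical starting history; real messages are structured objects.
    history = ["system prompt", "tool defs", "env context", "user: fix the failing test"]
    prompts = []
    for step in range(steps):
        prompts.append(tuple(history))          # snapshot of what is sent this call
        history.append(f"tool call {step}")     # append only, never rewrite
        history.append(f"tool result {step}")
    return prompts


def is_prefix(old: tuple[str, ...], new: tuple[str, ...]) -> bool:
    """True when every message in `old` appears unchanged at the start of `new`."""
    return len(old) <= len(new) and new[: len(old)] == old


prompts = run_agent_loop(4)
print([len(p) for p in prompts])  # [4, 6, 8, 10]
print(all(is_prefix(a, b) for a, b in zip(prompts, prompts[1:])))  # True
```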

What Lives in the Stable Prefix

System instructions
Tool definitions
Sandbox configuration
Environment context (working directory, approval mode, etc.)

These are kept identical and consistently ordered across all requests within a session. Only new messages get appended.
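One way to keep a prefix byte-identical across requests is to serialize it deterministically. The sketch below is an assumption about how that could be done, not a description of Codex internals: `render_prefix`, the tool names, and the environment keys are all hypothetical.

```python
import json


def render_prefix(system_prompt: str, tools: list[dict], env: dict) -> str:
    """Serialize the stable part of the prompt so equal inputs give equal text."""
    # Fix the tool order, then fix the key order inside every object.
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    payload = {"system": system_prompt, "tools": ordered_tools, "env": env}
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


# Same content, different list order and key order (hypothetical values).
tools_a = [
    {"name": "shell", "description": "run a command"},
    {"name": "apply_patch", "description": "edit files"},
]
tools_b = [
    {"description": "edit files", "name": "apply_patch"},
    {"description": "run a command", "name": "shell"},
]
env_a = {"cwd": "/workspace", "approval_mode": "on-request"}
env_b = {"approval_mode": "on-request", "cwd": "/workspace"}

prefix_a = render_prefix("You are a coding agent.", tools_a, env_a)
prefix_b = render_prefix("You are a coding agent.", tools_b, env_b)
print(prefix_a == prefix_b)  # True
```

Without the sorting, two logically identical configurations could serialize differently and miss the cache.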

Why the Cached Count Gets So Large

In an agentic run, the model might be called dozens or hundreds of times - once per tool call, once to interpret results, once per planning step.

Each of those calls re-sends the full conversation history. But because the earlier portion is always an identical prefix, it hits the cache every time.

So your cached total is cumulative across every inference call in the run. If Codex makes 50 model calls and the cached prefix averages 100k tokens per call:

50 calls x 100k cached tokens = 5,000,000 cached tokens reported

That's why a single Codex run can show millions of cached tokens even when your actual repo context is modest.
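The same arithmetic can be simulated with a prefix that grows on every call, which is closer to what an agent run looks like. This is an idealized sketch: it assumes every previously sent token is a cache hit, ignores the 128-token rounding and cache expiry, and uses made-up sizes.

```python
def simulate_run(calls: int, initial_tokens: int, tokens_added_per_call: int) -> tuple[int, int]:
    """Return (cached_total, uncached_total) summed over all model calls."""
    cached_total = 0
    uncached_total = 0
    prompt_len = 0  # tokens already sent on earlier calls
    for call in range(calls):
        new_tokens = initial_tokens if call == 0 else tokens_added_per_call
        cached_total += prompt_len     # everything sent before is reused
        uncached_total += new_tokens   # only the newly appended part is processed
        prompt_len += new_tokens
    return cached_total, uncached_total


# Hypothetical run: 20k-token starting prompt, 3k new tokens per call, 50 calls.
cached, uncached = simulate_run(calls=50, initial_tokens=20_000, tokens_added_per_call=3_000)
print(f"cached={cached:,} uncached={uncached:,}")  # cached=4,508,000 uncached=167,000
```

The uncached total grows linearly with the number of calls, while the cached total grows roughly quadratically, which is why the two figures diverge so sharply in long runs.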

Breaking Down the Numbers

Field	What it means
`input`	Non-cached input tokens, processed from scratch (billed at the full input rate)
`cached`	Input tokens served from the cache, summed over every model call (billed at a discounted rate)
`output`	Tokens generated by the model
`reasoning`	Internal chain-of-thought tokens, counted within `output` (reasoning models only)

For the example at the top, the model processed ~174k tokens at full cost, but reused ~5.3M tokens from cache - meaning the actual compute was a small fraction of what a naive implementation would require.
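Put as a cache hit rate, using only the figures from the example line:

```python
uncached_input = 173_815   # "input" in the usage line
cached_input = 5_306_112   # "cached" in the usage line

total_input = uncached_input + cached_input
hit_rate = cached_input / total_input

print(f"total input sent: {total_input:,}")  # total input sent: 5,479,927
print(f"cache hit rate:   {hit_rate:.1%}")   # cache hit rate:   96.8%
```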

Implications

Cost: Cached tokens are significantly cheaper. For long agentic sessions with lots of tool calls, this is where most of your savings come from.
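A rough cost comparison shows the scale of the savings. The price below is a placeholder, not a real rate, and the 90% discount is the upper bound quoted earlier; check current pricing for the model you use before relying on the result.

```python
def input_cost(uncached: int, cached: int, price_per_million: float, cached_discount: float) -> float:
    """Input cost in the same currency unit as `price_per_million`."""
    cached_price = price_per_million * (1.0 - cached_discount)
    return (uncached * price_per_million + cached * cached_price) / 1_000_000


PRICE_PER_MILLION = 1.00  # hypothetical price per million input tokens
CACHED_DISCOUNT = 0.90    # assumed discount on cached tokens

with_cache = input_cost(173_815, 5_306_112, PRICE_PER_MILLION, CACHED_DISCOUNT)
# Without caching, every token would be billed at the full rate.
without_cache = input_cost(173_815 + 5_306_112, 0, PRICE_PER_MILLION, CACHED_DISCOUNT)

print(f"with cache:    {with_cache:.4f}")
print(f"without cache: {without_cache:.4f}")
print(f"saved:         {1 - with_cache / without_cache:.0%}")
```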

Latency: Cache hits reduce time-to-first-token. Codex feels snappy mid-session partly because the model isn't reprocessing the same system prompt and tool definitions on every call.

Session structure matters: If you restart a session or change the system prompt, you bust the cache and pay full price again. This is why Codex's architecture keeps the prefix frozen and append-only.

The reasoning tokens: Separate from the above - these are the model's internal scratchpad tokens from chain-of-thought reasoning (produced by reasoning models such as o3 and o4-mini). They aren't cached and don't appear in the visible response, but they are counted within the output total (2,123 of the 14,799 above) and they do consume context window.
