Where Do Codex's Cached Tokens Come From?
After a Codex run, you might notice a token breakdown like this in your terminal:
Token usage: total=188,614 input=173,815 (+ 5,306,112 cached) output=14,799 (reasoning 2,123)
That 5,306,112 cached figure looks enormous - roughly 30x the uncached input. Here's what's going on.
Prompt Caching 101
OpenAI's prompt caching works by detecting when an API request shares an exact prefix with a recently processed prompt. If a match is found, the model skips recomputing that prefix and reuses cached key/value tensors from the attention layers. The result: up to 80% lower latency and up to 90% cheaper input token costs.
Caching is automatic - no configuration required. It kicks in for prompts of 1,024 tokens or more, with cache hits occurring in increments of 128 tokens.
Key constraint: it must be an exact prefix match. Changing even a single character invalidates the cache from that point onward, so an edit near the start of the prompt busts essentially all of it.
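The threshold and increment rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this section, not OpenAI's implementation; the function names are hypothetical, and rounding down to a multiple of 128 is an assumption about how the "increments of 128" rule plays out.

```python
MIN_CACHEABLE_TOKENS = 1024  # minimum prompt length quoted above
CACHE_INCREMENT = 128        # cache-hit granularity quoted above


def shared_prefix_length(previous: list[int], current: list[int]) -> int:
    """Count how many leading token IDs two prompts have in common."""
    count = 0
    for old_token, new_token in zip(previous, current):
        if old_token != new_token:
            break
        count += 1
    return count


def cacheable_tokens(shared_prefix_tokens: int) -> int:
    """Estimate how many shared-prefix tokens could be served from cache.

    Assumption: hits are rounded down to a multiple of CACHE_INCREMENT,
    and nothing is cached below MIN_CACHEABLE_TOKENS.
    """
    if shared_prefix_tokens < MIN_CACHEABLE_TOKENS:
        return 0
    return (shared_prefix_tokens // CACHE_INCREMENT) * CACHE_INCREMENT


# A change at the very first token leaves no shared prefix at all.
old_prompt = list(range(2000))
edited_at_start = [-1] + old_prompt[1:]
appended_only = old_prompt + [9001, 9002]

print(cacheable_tokens(shared_prefix_length(old_prompt, edited_at_start)))
print(cacheable_tokens(shared_prefix_length(old_prompt, appended_only)))
```

Under these assumptions the edited prompt gets no cache benefit, while the appended prompt can reuse the largest multiple of 128 tokens that fits inside the 2,000-token shared prefix.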
How Codex Exploits This
Codex's agent loop is explicitly architected around prompt caching as a first-class concern.
The Agent Loop Appends, Never Mutates
Each time Codex calls the model - to make a tool call, interpret a result, decide a next step - it sends the full conversation history again. Rather than modifying earlier messages, the loop strictly appends new content at the end:
[system prompt] [tool defs] [env context] [turn 1] [tool call] [tool result] [turn 2] ...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stable prefix -> cache hit
The old prompt is always an exact prefix of the new one. This is by design.
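To make the append-only property concrete, here is a minimal sketch. `AppendOnlyHistory` and `is_prefix` are hypothetical names invented for this illustration and are not taken from the Codex codebase; the point is only that appending preserves the prefix relationship while editing an earlier item breaks it.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyHistory:
    """Hypothetical conversation log that only ever grows at the end."""
    items: list[str] = field(default_factory=list)

    def append(self, item: str) -> None:
        self.items.append(item)

    def snapshot(self) -> tuple[str, ...]:
        # An immutable copy of what would be sent to the model right now.
        return tuple(self.items)


def is_prefix(old: tuple[str, ...], new: tuple[str, ...]) -> bool:
    """True when `old` matches the start of `new` item for item."""
    return new[:len(old)] == old


history = AppendOnlyHistory(["system prompt", "tool defs", "env context"])
history.append("turn 1")
first_request = history.snapshot()

history.append("tool call")
history.append("tool result")
second_request = history.snapshot()

# Appending keeps the earlier request as an exact prefix of the later one.
print(is_prefix(first_request, second_request))

# Editing an earlier item instead would break the prefix relationship.
mutated = ("system prompt (edited)",) + second_request[1:]
print(is_prefix(first_request, mutated))
```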
What Lives in the Stable Prefix
- System instructions
- Tool definitions
- Sandbox configuration
- Environment context (working directory, approval mode, etc.)
These are kept identical and consistently ordered across all requests within a session. Only new messages get appended.
Why the Cached Count Gets So Large
In an agentic run, the model might be called dozens or hundreds of times - once per tool call, once to interpret results, once per planning step.
Each of those calls re-sends the full conversation history. But because the earlier portion is always an identical prefix, it hits the cache every time.
So your cached total is cumulative across all inference calls in a single turn. If Codex makes 50 model calls and the cached prefix averages about 100k tokens per call:
50 calls x 100k cached tokens = 5,000,000 cached tokens reported
That's why a single Codex run can show millions of cached tokens even when your actual repo context is modest.
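The same arithmetic can be simulated for a prompt that grows with each call. The sketch below is an idealized model, not a measurement: it assumes the first call is a cold miss, that every later call gets a full cache hit on everything sent in the previous call, and it ignores the 128-token granularity. The function name and the numbers fed into it are illustrative.

```python
def simulate_run(initial_prompt_tokens: int,
                 new_tokens_per_call: int,
                 calls: int) -> tuple[int, int]:
    """Return (cached_total, uncached_total) summed over all calls."""
    cached_total = 0
    uncached_total = 0
    previous_prompt = 0          # nothing is cached before the first call
    prompt = initial_prompt_tokens
    for _ in range(calls):
        cached_total += previous_prompt             # prefix seen last call
        uncached_total += prompt - previous_prompt  # only the new suffix
        previous_prompt = prompt
        prompt += new_tokens_per_call
    return cached_total, uncached_total


# Illustrative inputs: a 20k-token starting prompt that grows by 2k per call.
cached, uncached = simulate_run(20_000, 2_000, 50)
print(f"cached={cached:,} uncached={uncached:,}")
```

With these inputs the cached total comes out around 28x the uncached total, the same order of magnitude as the ratio in the example at the top, even though the prompt never exceeds 118k tokens.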
Breaking Down the Numbers
| Field | What it means |
|---|---|
| `input` | Tokens actually processed this run (billed at full rate) |
| `cached` | Tokens that hit the cache (billed at reduced rate or free) |
| `output` | Tokens generated by the model |
| `reasoning` | Internal chain-of-thought tokens (o-series models only) |
For the example at the top, the model processed ~174k tokens at full cost, but reused ~5.3M tokens from cache - meaning the actual compute was a small fraction of what a naive implementation would require.
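If you want to compute the cache hit ratio yourself, a small parser is enough. The sketch below assumes the usage line has exactly the shape of the sample at the top of this post; the regular expression and function names are written for this illustration, and the real output format may differ between Codex versions.

```python
import re

# Pattern written against the sample line in this post (an assumption).
USAGE_PATTERN = re.compile(
    r"total=(?P<total>[\d,]+)\s+"
    r"input=(?P<input>[\d,]+)\s+"
    r"\(\+\s*(?P<cached>[\d,]+)\s+cached\)\s+"
    r"output=(?P<output>[\d,]+)\s+"
    r"\(reasoning\s+(?P<reasoning>[\d,]+)\)"
)


def parse_usage(line: str) -> dict[str, int]:
    """Extract the numeric fields from a token usage line."""
    match = USAGE_PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized token usage line")
    return {name: int(value.replace(",", ""))
            for name, value in match.groupdict().items()}


def cache_hit_ratio(usage: dict[str, int]) -> float:
    """Share of all prompt tokens that were served from cache."""
    prompt_tokens = usage["input"] + usage["cached"]
    return usage["cached"] / prompt_tokens if prompt_tokens else 0.0


SAMPLE = ("Token usage: total=188,614 input=173,815 "
          "(+ 5,306,112 cached) output=14,799 (reasoning 2,123)")

usage = parse_usage(SAMPLE)
print(f"cache hit ratio: {cache_hit_ratio(usage):.1%}")
```

For the sample line, `total` equals `input` plus `output`, and the cached tokens account for roughly 97% of everything sent to the model as prompt.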
Implications
Cost: Cached tokens are significantly cheaper. For long agentic sessions with lots of tool calls, this is where most of your savings come from.
Latency: Cache hits reduce time-to-first-token. Codex feels snappy mid-session partly because the model isn't reprocessing the same system prompt and tool definitions on every call.
Session structure matters: If you restart a session or change the system prompt, you bust the cache and pay full price again. This is why Codex's architecture keeps the prefix frozen and append-only.
The reasoning tokens: Separate from the above - these are the model's internal scratchpad tokens from chain-of-thought reasoning (visible in o3/o4-mini style models). They're not cached and don't appear in the visible response, but they are reported inside the output count (2,123 of the 14,799 output tokens in the example) and they do consume context window.
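To put the cost point above in rough numbers, the sketch below prices the input side of the example run with and without caching. The per-million-token price is a made-up placeholder, not a real rate, and the 90% discount is simply the "up to 90%" ceiling quoted earlier, so treat the result as an upper bound on savings rather than a bill.

```python
def input_cost(uncached_tokens: int,
               cached_tokens: int,
               price_per_million: float,
               cached_discount: float) -> float:
    """Input-side cost when cached tokens are billed at a discounted rate."""
    full_rate = price_per_million / 1_000_000
    cached_rate = full_rate * (1.0 - cached_discount)
    return uncached_tokens * full_rate + cached_tokens * cached_rate


PRICE_PER_MILLION = 1.00  # hypothetical placeholder price, not a real rate
CACHED_DISCOUNT = 0.90    # best case quoted above ("up to 90%")

# Token counts from the example usage line at the top of the post.
with_cache = input_cost(173_815, 5_306_112, PRICE_PER_MILLION, CACHED_DISCOUNT)
without_cache = input_cost(173_815 + 5_306_112, 0,
                           PRICE_PER_MILLION, CACHED_DISCOUNT)
savings = 1.0 - with_cache / without_cache

print(f"with cache: {with_cache:.2f}  without: {without_cache:.2f}  "
      f"savings: {savings:.0%}")
```

Because the ratio does not depend on the placeholder price, the savings figure stays the same for any per-token rate; only the assumed discount changes it.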
Further Reading
- Unrolling the Codex agent loop - OpenAI's deep dive into how the agent loop is structured
- Prompt Caching 201 - Optimizing for cache hits, including the `prompt_cache_key` parameter
- OpenAI Prompt Caching docs - Reference on retention policies, in-memory vs extended caching