Most agents drag their entire past into every turn. A better default: keep a thin index of what was said hot, and fetch only the few turns you actually need — intact, on demand.
Code: github.com/NirajPandey05/jit_context
There is a quiet assumption baked into how most agents handle memory: that more context is safer than less. If the model might need something, put it in the window. The conversation grows, every prior turn rides along on every new request, and we trust the model to find the part that matters.
That assumption breaks twice. It breaks on cost, because an agent loop re-sends its whole window on every step — a hundred stale turns aren't paid for once, they're paid for on turn 101, 102, and every step after. And it breaks on quality, because models don't read a long window evenly. Relevant facts buried in the middle get underweighted; irrelevant bulk competes for attention with the thing that actually answers the question. Past a point, a bigger context produces a worse answer, not just a costlier one.
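To put rough numbers on the cost half of that claim, here is a minimal sketch of the arithmetic. It assumes every turn is the same size and that each step re-sends every turn so far; the function names and the `tokens_per_turn` figure are illustrative, not taken from the repo.

```python
def full_window_cost(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when step t re-sends all t turns so far."""
    # Step 1 sends 1 turn, step 2 sends 2, ..., step n sends n: a triangular sum.
    return tokens_per_turn * turns * (turns + 1) // 2


def bounded_window_cost(turns: int, tokens_per_turn: int, window_turns: int) -> int:
    """Total input tokens billed when each step sends at most `window_turns` turns."""
    return sum(tokens_per_turn * min(t, window_turns) for t in range(1, turns + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 100 turns of 200 tokens each.
    print(full_window_cost(100, 200))         # grows quadratically with length
    print(bounded_window_cost(100, 200, 10))  # grows linearly once the cap is hit
```

Under these assumptions the full window's cumulative bill grows with the square of the conversation length, while a capped window grows linearly once the cap is reached.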
So the interesting question isn't "how do we fit more in?" It's "how do we keep the window small and dense without losing the one old turn that matters?" This post is the design we built around that question — for the specific case of long conversation history — plus the benchmark we used to keep ourselves honest.
The design borrows directly from how computers have always managed memory that doesn't fit: a small fast tier that's always present, a large slow tier that holds the bulk, and a rule for moving things between them. Virtual memory pages between RAM and disk. We page between the context window and an external store — for attention instead of address space.
Concretely, there are two tiers. The cold store holds every turn at full fidelity, keyed by id — nothing is thrown away. The hot index holds one compact entry per turn: a short summary, a little metadata (entities, whether the turn recorded a decision), and an embedding of that summary. The index is cheap enough to keep in the window permanently; the payloads are not.
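As a rough picture of the two tiers, the sketch below models them as plain Python objects. The class and field names (`IndexEntry`, `ConversationMemory`, `is_decision`) are our own illustrative choices and may not match the repo.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """One hot-index line: small enough to keep in the window."""
    turn_id: int
    summary: str
    entities: list[str] = field(default_factory=list)
    is_decision: bool = False
    embedding: list[float] = field(default_factory=list)  # embedding of the summary


class ConversationMemory:
    """Two tiers: full turns in a cold store, compact entries in a hot index."""

    def __init__(self) -> None:
        self.cold: dict[int, str] = {}          # turn id -> full text, never discarded
        self.index: dict[int, IndexEntry] = {}  # turn id -> compact entry

    def write(self, turn_id: int, text: str, entry: IndexEntry) -> None:
        """Store the full turn and its index entry under the same id."""
        self.cold[turn_id] = text
        self.index[turn_id] = entry

    def fetch(self, turn_ids: list[int]) -> list[str]:
        """Return the requested turns at full fidelity, skipping unknown ids."""
        return [self.cold[t] for t in turn_ids if t in self.cold]

    def index_lines(self) -> list[str]:
        """Render the index as one-line table-of-contents entries, oldest first."""
        return [
            f"#{e.turn_id}{' [decision]' if e.is_decision else ''}: {e.summary}"
            for e in sorted(self.index.values(), key=lambda e: e.turn_id)
        ]
```

The one-liners from `index_lines()` are what stays in the window; `fetch()` is the only path back to the full text.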
That permanence matters more than it looks. Because the index — a table of contents — is always present, the model can always see that something exists even when it hasn't loaded the detail. "There was a decision about the deploy window at turn 14" stays visible as a one-liner. The worst failure mode of any retrieval system is silently dropping a relevant fact so the model never knows to ask for it. An always-on index is the guard against that.
When a new turn arrives and the conversation is long enough to bother, we have to decide which old turns to reheat. We do it in two stages — cheap-and-broad, then precise-and-narrow.
First, a semantic shortlist: embed the current request, rank every index entry by similarity, take the top dozen. This is fast, recall-oriented, and costs no LLM call. Second, a model picks: a small, fast model sees the request and those dozen candidate summaries (never the full text) and returns the handful of turn ids whose full detail is actually worth fetching. Similarity finds plausible candidates; the picker applies judgment the embeddings can't.
new turn ──▶ semantic shortlist ──▶ model picks   ──▶ assemble window
(query)      top-12 by cosine       summaries →       index + picked turns
                                    [ #12, #14 ]      + recent N verbatim
cheap & broad ───────────────────────▶ precise & narrow
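The second stage can be sketched as two small helpers: one builds the picker's prompt from summaries only, the other validates whatever ids come back. The prompt wording and the JSON-list reply format are assumptions made for this illustration, not the repo's actual protocol, and the model call itself is left out.

```python
import json


def build_picker_prompt(query: str, candidates: dict[int, str]) -> str:
    """Show the picker the request and candidate summaries, never the full turns."""
    lines = [f"#{tid}: {summary}" for tid, summary in candidates.items()]
    return (
        "Select the turn ids whose full text is needed to answer the request.\n"
        "Reply with a JSON list of integers.\n\n"
        f"Request: {query}\n\nCandidates:\n" + "\n".join(lines)
    )


def parse_picked_ids(reply: str, candidates: dict[int, str], max_picks: int = 6) -> list[int]:
    """Keep only ids that were actually offered, de-duplicated, capped at max_picks."""
    try:
        ids = json.loads(reply)
    except json.JSONDecodeError:
        return []  # an unparseable reply means nothing extra is fetched
    if not isinstance(ids, list):
        return []
    picked: list[int] = []
    for tid in ids:
        is_int = isinstance(tid, int) and not isinstance(tid, bool)
        if is_int and tid in candidates and tid not in picked:
            picked.append(tid)
    return picked[:max_picks]
```

Filtering the reply against the shortlist means a hallucinated id can never trigger a fetch of a turn that was not a candidate.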
The window we finally send is assembled from four parts: the system prompt, the scoped index (relevant + decision-flagged lines, so the table of contents doesn't itself grow without bound), the handful of retrieved turns at full fidelity, and the most recent few turns kept verbatim — because recency is free and usually relevant.
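A minimal sketch of that assembly step, assuming each part is already rendered as text. The section labels ("Conversation index:" and so on) are hypothetical; the ordering follows the four parts listed above, with the new request appended last.

```python
def assemble(
    system: str,
    index_lines: list[str],
    retrieved: list[str],
    recent: list[str],
    query: str,
) -> str:
    """Join the window parts in a fixed order, skipping any empty section."""
    parts = [system]
    if index_lines:
        parts.append("Conversation index:\n" + "\n".join(index_lines))
    if retrieved:
        parts.append("Retrieved turns:\n" + "\n".join(retrieved))
    if recent:
        parts.append("Recent turns:\n" + "\n".join(recent))
    parts.append(query)
    return "\n\n".join(parts)
```

Short conversations simply pass empty lists for the index and retrieved turns, and the window degrades to system prompt, recent turns, and request.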
Make it concrete. A sixty-turn conversation; back at turn 9 the assistant said "Decision: the deploy window is Tuesday 02:00 UTC. Let's lock that in." Fifty turns of unrelated chatter follow. Now the user asks: "Remind me, what did we settle on for the deploy window?"
A naive recency window keeps the last four turns — all chatter — and the answer is simply gone. Full context keeps everything, answer included, but pays for sixty turns and dilutes the needle among them. Here's what JIT does instead:
query = "what did we settle on for the deploy window?"
shortlist = top_k(embed(query), index, k=12)
picked = picker(query, shortlist, max=6) # → [9]
window = assemble(system, index, fetch([9]), recent_4, query)
The needle is retrieved intact, the answer is exact, and the window is a third of the size. The point isn't the percentage — it's that JIT beat both baselines at once: more accurate than recency (which lost the turn) and far cheaper than full (which kept all sixty).
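For readers who want to run the walkthrough, here is a self-contained toy version. Two loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and `picker` is a keyword-overlap stub standing in for the small model. The conversation content is synthetic.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query_vec: Counter, index: dict[int, str], k: int = 12) -> list[int]:
    """Rank index summaries by similarity to the query and keep the best k ids."""
    ranked = sorted(index, key=lambda tid: cosine(query_vec, embed(index[tid])), reverse=True)
    return ranked[:k]


def picker(query: str, shortlist: list[int], index: dict[int, str], max_picks: int = 6) -> list[int]:
    """Stub for the small model: keep candidates sharing two or more words with the query."""
    q = set(embed(query))
    return [tid for tid in shortlist if len(q & set(embed(index[tid]))) >= 2][:max_picks]


# Sixty turns: the decision sits at turn 9, unrelated chatter everywhere else.
cold = {t: f"Chatter about lunch, item {t}." for t in range(1, 61)}
cold[9] = "Decision: the deploy window is Tuesday 02:00 UTC. Let's lock that in."
index = {t: f"lunch chatter {t}" for t in cold}
index[9] = "decision: deploy window set to Tuesday 02:00 UTC"

query = "Remind me, what did we settle on for the deploy window?"
shortlist = top_k(embed(query), index, k=12)
picked = picker(query, shortlist, index)
recent_4 = [cold[t] for t in range(57, 61)]
window = [cold[t] for t in picked] + recent_4 + [query]
```

In this toy setup only turn 9 shares vocabulary with the request, so it is the single turn fetched from the cold store, and the window holds six items instead of sixty turns.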
The comparison that matters is never "JIT vs. the giant window." It's three-way, because each baseline fails differently:
| approach | accuracy | tokens | the catch |
|---|---|---|---|
| full (everything) | high | 100% | complete but diluted; billed every step |
| recency (last N turns) | ~0% | ~10% | drops anything old, however important |
| jit (index + fetch) | high | ~40–50% | only as good as retrieval recall |
That last catch is the whole game, and it cuts both ways. When retrieval is good, JIT gives you a small dense window: cheaper and higher quality. When retrieval misses, JIT has removed a turn the full window would at least have kept somewhere. So JIT only wins when retrieval recall clears a bar; below it, you've built a slow leak.
A negative result, kept on the record. Our first version injected the entire index on every turn. On long conversations that table of contents grew so large it cost more than the turns it was saving: token "savings" went negative, and got worse the longer the chat. The fix was to scope the injected index to the shortlist plus decision-flagged turns. Only after that did savings turn positive and grow with length. The benchmark existed precisely to catch this; an eval that only ever shows wins isn't measuring anything.
Token savings vs. conversation length, after scoping the index (offline harness):
| turns | 20 | 40 | 80 | 160 | 320 |
|---|---|---|---|---|---|
| scoped index | 3% | 50% | 74% | 87% | 94% |
| full index (every turn) | −30% | −35% | −38% | −40% | −41% |
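The scoping rule itself is small. A sketch, assuming index entries are `(turn_id, summary, is_decision)` tuples and that savings are measured relative to the full window; both the tuple layout and the `savings` helper are illustrative rather than the harness's actual code.

```python
def scope_index(
    entries: list[tuple[int, str, bool]],
    shortlist_ids: list[int],
) -> list[tuple[int, str, bool]]:
    """Keep only entries on the shortlist or flagged as decisions, in original order."""
    keep = set(shortlist_ids)
    return [e for e in entries if e[0] in keep or e[2]]


def savings(full_tokens: int, jit_tokens: int) -> float:
    """Fraction of tokens saved versus the full window; negative means JIT cost more."""
    return 1 - jit_tokens / full_tokens
```

A negative return from `savings` is exactly the failure described above: the injected index plus fetched turns outweighing the history they replaced.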
Self-measured wins on a synthetic generator are a smell test, not proof. To make a claim anyone should believe, you have to run on a public benchmark other systems also report on. For long-term conversational memory, that benchmark is LoCoMo (Maharana et al., presented at ACL 2024, released by snap-research).
LoCoMo is built for exactly the regime JIT targets: very long, multi-session conversations where the answer to a question is some specific thing said far earlier, buried in noise. Its scale is the point: the dataset is ten long conversations averaging roughly 600 turns and 26 sessions each, with about 1,500 question–answer pairs annotated across the set. A short-context trick won't survive it.
What makes it a genuinely useful diagnostic — rather than a single score — is that its questions come in distinct types, and they stress a retrieval-based design very differently:
| question type | what it asks | stresses… |
|---|---|---|
| single-hop | one fact from one turn | pure retrieval (the JIT sweet spot) |
| multi-hop | join facts across disjoint turns | pulling several related turns, not just the top one |
| temporal | ordering & timelines | whether the index kept timestamps |
| open-domain | inference beyond the stated text | reasoning, not just recall |
| adversarial | unanswerable / trap | excluded from scoring by convention |
Following the standard protocol, the adversarial category is excluded so results stay comparable to published numbers. The run itself mirrors the architecture: a write phase ingests the whole conversation into the index and cold store, then a read phase answers each question by assembling a window and handing it to a real model. Two scorers are reported together for defensibility — word-overlap F1 against the reference, and an LLM-as-judge for semantic correctness — and crucially, accuracy is broken out per category.
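For reference, word-overlap F1 is typically computed along the lines below. This is a sketch of the common SQuAD-style token F1; the normalization here (lowercasing, alphanumeric tokens only) is an assumption, and the harness or the LoCoMo reference scorer may normalize differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because F1 only counts shared tokens, a correct but differently worded answer can score low, which is one reason to report a judge score next to it.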
Why the per-category breakdown earns trust. A single aggregate number hides the story. The honest pattern a retrieval-first design shows on LoCoMo is strong single-hop, weaker temporal, and the breakdown tells you why: temporal questions can have perfect evidence recall (you fetched the right turns) yet still miss, because answering needs ordering the summaries may have stripped. That cleanly separates the two failure modes worth distinguishing, "didn't retrieve it" versus "retrieved it but couldn't use it", and points straight at the fix (timestamp-aware indexing) instead of a vague "improve retrieval."
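The split between those two failure modes can be sketched per question, assuming each question comes with a set of gold evidence turn ids. The function names and the three labels are ours, chosen for illustration.

```python
from typing import Optional


def evidence_recall(fetched_ids: list[int], gold_ids: list[int]) -> Optional[float]:
    """Fraction of gold evidence turns that were fetched; None if no gold evidence exists."""
    gold = set(gold_ids)
    if not gold:
        return None
    return len(gold & set(fetched_ids)) / len(gold)


def failure_mode(fetched_ids: list[int], gold_ids: list[int], answer_correct: bool) -> str:
    """Label one question: correct, a retrieval miss, or a miss despite full evidence."""
    if answer_correct:
        return "ok"
    recall = evidence_recall(fetched_ids, gold_ids)
    if recall == 1.0:
        return "retrieved it but couldn't use it"
    return "didn't retrieve it"
```

Counting these labels per category is what separates a retrieval problem from an indexing or reasoning problem.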
The three-way LoCoMo comparison is the experiment that turns the idea into a number you can put next to other memory systems. The rule: only place another system's figure beside yours if it ran through the identical harness; otherwise label it "reported, not directly comparable." Different judge models and answer-matching make cross-paper numbers a trap.
Run across all ten conversations (1,540 non-adversarial QA pairs) with real embeddings (gemini-embedding-001) and a real answerer/judge (gemini-2.5-flash-lite):
| mode | F1 | judge | ret. recall | token cost |
|---|---|---|---|---|
| full | 0.41 | 0.43 | n/a | 100% |
| recency | 0.00 | 0.01 | n/a | ~10% |
| jit | 0.37 | 0.36 | 0.71 | ~50% |
The shape is the thesis made real: JIT reaches about 90% of full-context F1 at roughly half the tokens, while recency collapses to near-zero because the answer is almost never in the most recent turns. That gap between JIT and recency (both a fraction of the full window's tokens, wildly different accuracy) is the entire argument for indexing over truncation.
One caution on reading these: the absolute scores are low because the answerer/judge is a small fast model; full-context F1 is only 0.41. All three modes would rise with a stronger model, so the trustworthy signal is the relative gap (JIT ≈ full, recency ≈ 0), not the 90% ratio in isolation — a ratio of two small numbers is noisy. Lead with the structure, not the percentage.
The per-category breakdown delivers exactly the diagnostic promised above:
| category (jit) | F1 | judge | ret. recall |
|---|---|---|---|
| single-hop | 0.48 | 0.51 | 0.74 |
| multi-hop | 0.28 | 0.26 | 0.73 |
| temporal | 0.22 | 0.10 | 0.69 |
| open-domain | 0.13 | 0.20 | 0.43 |
Single-hop is strongest: pure retrieval, the design's home ground. Temporal is the tell: retrieval recall is a healthy 0.69 (the right turns were mostly fetched), yet judge accuracy is just 0.10. The evidence is present, but the summaries stripped the ordering. Open-domain is weakest on both F1 and retrieval recall, because diffuse inferential context is genuinely hard to surface by similarity. None of this is hidden; it's the map of what to fix next.
"Just-in-time context" as a phrase isn't ours — it traces to Anthropic's 2025 context-engineering guidance, which argued for keeping lightweight identifiers in the window and resolving them to full content at runtime. That broader pattern shows up in several places: skills loaded on activation, large tool results offloaded to handles, sub-agents that take a heavy subtask out of the parent's window.
Most of those are forward-looking — the agent decides what to load as it works. The slice described here is retrospective: indexing the conversation's own past and pulling old turns back when they become relevant. And where the common shortcut for long history is compaction — summarize the transcript and discard the detail — the bet here is the opposite: keep every turn intact in cold storage and fetch it whole, so nothing is lost to a lossy summary made before anyone knew which detail would matter. Selection, not compression; conversation history, not all content. That's the corner of the map this occupies.
Retrieval recall is the whole ballgame. A missed fetch is a silent quality loss the model often can't detect. The always-on index mitigates the worst case but doesn't erase it.
Summarizing at index time is lossy and early. You compress a turn before you know what future question will be asked of it — a real circular dependency. Keeping the full turn fetchable is the hedge, but the index summary still gates whether you ever go fetch it.
Every lazy fetch and picker call is latency. Push indexing off the response path, keep the picker on a small model, cap the shortlist — and measure whether the sub-call's cost eats the savings it's chasing.
Dynamic windows fight prompt caches. A context that changes every turn invalidates a provider's cached prefix. Keep the stable parts first and the freshly retrieved parts last to preserve what cache you can.
None of these sink the idea. They define where it pays off: long-horizon, low-locality conversations, past a length threshold, where the relevant past is neither recent nor predictable. In that regime — and the benchmark exists to tell you whether you're in it — late and little is simply the better default.
The implementation (the proxy, the hybrid retriever, and the LoCoMo harness described here) lives at **github.com/NirajPandey05/jit_context**. It runs offline with no keys for a quick look, and takes real embedder / picker / answerer backends when you want comparable numbers.
Notes & provenance. The token-savings-vs-length figures come from an offline synthetic harness (a smoke test built to be winnable). The headline accuracy table is a real LoCoMo run: all ten conversations, 1,540 non-adversarial QA pairs, real embeddings (gemini-embedding-001) and a real answerer/judge (gemini-2.5-flash-lite); absolute F1 is bounded by that small model, so the relative gaps are the signal. LoCoMo: Maharana et al., "Evaluating Very Long-Term Conversational Memory of LLM Agents," ACL 2024 (snap-research). "Just-in-time context" as a named pattern follows Anthropic's 2025 context-engineering guidance. Comparisons to other memory systems are only valid when run through an identical harness.