GPT-3.5-Turbo's accuracy on multi-document question answering can drop by more than 20 percentage points when the answer sits in the middle of a long prompt instead of at the start or end. Liu et al. (2023) documented this in "Lost in the Middle: How Language Models Use Long Contexts," later published in Transactions of the Association for Computational Linguistics. The edges of your context window are prime real estate. The middle is a graveyard.
This is not a retrieval bug. It is an attention pattern. Transformers apply soft attention across the full sequence, but positional encodings and training distributions bias the model toward recent tokens and salient prefixes. When you stuff a long JSON array or a chunked document into the prompt, the signal dilutes. The model attends to the framing, not the buried row at index 847. The exact cause is still debated, but one plausible explanation is that training data rarely demands mid-span reasoning over very long sequences, so the model never learns to use the center as reliably as the edges.
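You can check how strongly your own model shows this effect by moving a known answer passage through every slot in the prompt and scoring each variant. The sketch below only builds the prompts. The function names, the "Document [n]" layout, and the question format are illustrative assumptions, not the harness from the paper, and the call to your model is left out.

```python
from typing import List


def build_prompt_with_gold_at(
    distractors: List[str], gold: str, position: int, question: str
) -> str:
    """Render a prompt with the gold passage inserted at `position` (0-based)."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    docs = distractors[:position] + [gold] + distractors[position:]
    # Hypothetical layout: numbered documents followed by the question.
    body = "\n\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(docs))
    return f"{body}\n\nQuestion: {question}\nAnswer:"


def position_sweep(distractors: List[str], gold: str, question: str) -> List[str]:
    """Return one prompt per possible slot for the gold passage, start to end."""
    return [
        build_prompt_with_gold_at(distractors, gold, pos, question)
        for pos in range(len(distractors) + 1)
    ]
```

Send each prompt to the model and plot accuracy against slot index. A U-shaped curve means the middle slots are costing you answers.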
Picture a RAG pipeline in rag_engine.py where you dump ten retrieved chunks into a single prompt. You sort by cosine similarity and concatenate. Chunk five holds the exact clause that answers the user, but it sits between chunks four and six. Your generation fails. The fix is not a larger context window. The fix is re-ranking.
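One way to act on this, sketched under assumptions: once each chunk has a relevance score (from cosine similarity or from whatever re-ranker you use), place the strongest chunks at the two edges of the prompt and let the weakest fall into the middle. The function name `reorder_for_edges` and the `(text, score)` input shape are hypothetical, chosen for illustration. This is not the only meaning of re-ranking.

```python
from typing import List, Tuple


def reorder_for_edges(scored_chunks: List[Tuple[str, float]]) -> List[str]:
    """Order chunks so the highest-scoring ones sit at the start and end.

    Ranks 1, 3, 5, ... fill the front in order; ranks 2, 4, 6, ... fill the
    back in reverse, so the weakest chunks end up in the middle.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for rank, (text, _score) in enumerate(ranked):
        if rank % 2 == 0:
            front.append(text)
        else:
            back.append(text)
    return front + back[::-1]


if __name__ == "__main__":
    chunks = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.5)]
    print(reorder_for_edges(chunks))  # ['a', 'c', 'e', 'd', 'b']
```

With five chunks, the best one opens the prompt, the second best closes it, and the weakest lands in the center, where a miss costs the least.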