People Keep Saying RAG is Dead
Honestly, every time a new context window drops, my LinkedIn feed turns into a funeral. "RAG is dead." "Long context killed retrieval." "Just dump everything in the prompt." And I get it — when Gemini gives you 2 million tokens and Claude is sitting at a million, it feels like why would you even bother with vector databases and chunking pipelines and all that headache.
But I've been building these systems for actual clients. Not demo apps. Not Medium tutorial projects. Production systems where things break at 2 AM and someone calls you. And my answer is more complicated than "RAG is dead" or "RAG forever."
The math nobody wants to do
So here is what people skip over. A 1M token query on Claude Opus costs roughly $15 in input tokens alone. Sounds fine for one call, right? Now do 100 queries a day. That is $1,500 daily. $45,000 a month. Finance team will find you. They will have questions.
RAG pulls maybe 2,000 to 10,000 tokens per query. Against a full 1M-token prompt, that is 100 to 500 times fewer input tokens. The difference is not small; it is enormous. We ran into this exact thing at OZ last year — a client wanted to do full-document analysis on legal contracts, and the prototype was beautiful. Dump the whole contract in, ask questions, get answers. Accuracy was great. Then we calculated what would happen when 40 lawyers used it daily, and suddenly the project needed a very different architecture.
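If you want to sanity-check that arithmetic yourself, here is a back-of-the-envelope sketch in Python. The per-million input rate, the query volume, and the token counts are the assumptions from the paragraphs above, not vendor quotes:

```python
# Back-of-the-envelope cost comparison using the numbers above.
# All rates and token counts are assumptions, not vendor pricing.

PRICE_PER_MILLION_INPUT = 15.00   # rough Opus-class input rate, USD
QUERIES_PER_DAY = 100
DAYS_PER_MONTH = 30

def monthly_cost(tokens_per_query: int) -> float:
    """Monthly input-token spend for a given per-query context size."""
    per_query = tokens_per_query / 1_000_000 * PRICE_PER_MILLION_INPUT
    return per_query * QUERIES_PER_DAY * DAYS_PER_MONTH

long_context = monthly_cost(1_000_000)   # dump everything in the prompt
rag_low      = monthly_cost(2_000)       # lean retrieval
rag_high     = monthly_cost(10_000)      # generous retrieval

print(f"long context: ${long_context:,.0f}/mo")               # $45,000/mo
print(f"RAG:          ${rag_low:,.0f}-${rag_high:,.0f}/mo")    # $90-$450/mo
print(f"ratio:        {long_context / rag_high:.0f}-{long_context / rag_low:.0f}x")
```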
The other thing is latency. At 1M tokens you are looking at 20-30 seconds to first token. For an internal tool where someone runs one analysis a day, fine. For a customer-facing chatbot? Nobody is waiting 30 seconds. RAG adds maybe 100-500ms for the vector search. That matters.
Where long context actually wins
I am not saying big context windows are useless. They are incredible for specific things. When you need to reason across an entire document — like understanding how section 4.2 of a contract relates to the amendment in appendix B — RAG will chop that up and lose the connection. Long context keeps everything together. That is real value.
Small, static knowledge bases too. If your company has maybe 200 pages of documentation that doesn't change much, just stuff it all in the context. No vector database to maintain, no embedding pipeline, no chunking strategy to argue about for two sprints. Copy, paste, done. I have seen teams spend three months building a RAG system for a knowledge base that would fit in a single prompt.
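For a sense of how little machinery that takes, here is a minimal sketch. `call_llm`, the `docs/` directory, and the markdown-only glob are stand-ins for whatever your stack actually uses:

```python
# Minimal "just stuff it all in the context" sketch for a small,
# static knowledge base. `call_llm` is a stand-in for your provider's
# SDK; the docs/ path and file layout are hypothetical.
from pathlib import Path

def build_prompt(question: str, docs_dir: str = "docs/") -> str:
    corpus = "\n\n".join(
        f"--- {p.name} ---\n{p.read_text()}"
        for p in sorted(Path(docs_dir).glob("*.md"))
    )
    return (
        "Answer using only the documentation below.\n\n"
        f"{corpus}\n\nQuestion: {question}"
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your provider's SDK here")

# ~200 pages of markdown fits comfortably in a modern context window,
# and there is no index to rebuild when a doc changes: re-read and go.
# answer = call_llm(build_prompt("How do I rotate an API key?"))
```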
Code understanding is another one. When you need to trace how a function call flows through 15 files, long context is better because it sees everything at once. RAG would retrieve individual chunks and miss the connections.
The "lost in the middle" problem is real though
Here is something that doesn't get enough attention. Claude hits about 90% retrieval accuracy on single-needle benchmarks at 1M tokens. Sounds good until you realize that at 32K tokens it was hitting 98-99%. And for multi-needle tasks — where you need to find and connect multiple pieces of information — accuracy drops by 20-40%. That is a lot.
There is this known issue where models are worse at finding information buried in the middle of long contexts. At a million tokens, the "middle" is hundreds of thousands of tokens wide. That is not a minor blind spot.
RAG has its own version of this problem hidden in embedding quality, but at least you can measure and fix it. Bad retrieval is debuggable. A model ignoring page 847 out of 1,500 pages is harder to catch.
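"Debuggable" here means you can put a number on it. A rough sketch of a recall@k check; `search` and the labeled pairs are hypothetical stand-ins for your own vector store and eval set:

```python
# With a handful of labeled queries you can compute recall@k and watch
# it move as you change chunking or embeddings. `search` is a stand-in
# for your vector store's query call; the examples are hypothetical.

LABELED = [
    # (query, id of the chunk that actually answers it)
    ("What is the termination notice period?", "contract_sec_9_2"),
    ("Who owns derivative works?", "contract_sec_4_1"),
]

def search(query: str, k: int) -> list[str]:
    """Stand-in: return the ids of the top-k retrieved chunks."""
    raise NotImplementedError

def recall_at_k(k: int = 5) -> float:
    hits = sum(
        1 for query, gold_id in LABELED
        if gold_id in search(query, k)
    )
    return hits / len(LABELED)

# A number you can track per deploy. There is no equivalent probe for
# "did the model actually read page 847 of the prompt?"
```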
What I actually tell clients
It depends. I know that is the most annoying answer but it is true. For high-value, low-volume analysis tasks — use long context. For anything with more than 100 queries a day or knowledge bases over a million tokens — you still need RAG. For the stuff in between, hybrid approaches work. Use RAG for retrieval, then feed the retrieved chunks plus surrounding context into a larger window for reasoning.
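A sketch of what that hybrid looks like; `search`, `get_neighbors`, the chunk ids, and the window size are placeholders for your own stores and tuning:

```python
# Hybrid sketch: RAG finds the relevant spots, then each hit is
# expanded with its neighboring chunks so the reasoning step sees
# local context. Both helpers are stand-ins for your vector store
# and document store.

def search(query: str, k: int) -> list[str]:
    """Stand-in: top-k chunk ids from the vector index."""
    raise NotImplementedError

def get_neighbors(chunk_id: str, window: int) -> list[str]:
    """Stand-in: the chunk's text plus `window` chunks on either
    side, in document order."""
    raise NotImplementedError

def build_context(query: str, k: int = 5, window: int = 2) -> str:
    seen: dict[str, None] = {}   # ordered de-dup across overlapping hits
    for hit in search(query, k=k):
        for chunk in get_neighbors(hit, window=window):
            seen.setdefault(chunk)
    return "\n\n".join(seen)

# Retrieval keeps the token count down; the expanded window keeps
# cross-references (the section 4.2 / appendix B problem) intact.
```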
The real move right now is not picking one or the other. It is knowing when each one makes sense and building systems flexible enough to use both. The teams saying "RAG is dead" are mostly building demos, and the teams saying "long context is a gimmick" are mostly protecting their existing infrastructure investments. Neither side is being fully honest about it.