Building a RAG-Based PDF Question Answering System: Engineering Decisions, Failures, and Lessons

An AI/ML student built StudyMate AI, a RAG-based PDF question answering system that uses local embeddings and in-memory vector storage. The project overcame initial retrieval failures by adding pre-generated summary chunks and a two-stage retrieval process to handle document-level questions. Key engineering decisions included using HuggingFace's all-MiniLM-L6-v2 embeddings and FAISS for cost-effective local processing.

As an AI/ML student preparing applications for research internships at companies like Google, I wanted to build something that went beyond the typical classifier or fine-tuning demo. I wanted a project that demonstrated systems thinking, not just model calling.

The result was StudyMate AI: a RAG pipeline that lets you upload any PDF and ask questions about it, grounded strictly in the document's content.

This post documents the real engineering decisions I made, the problems I ran into, and what I learned, including the parts that didn't work the first time.

Retrieval-Augmented Generation (RAG) is a pattern where, instead of asking an LLM to answer from memory, you first retrieve relevant context from a knowledge source and inject it into the prompt.

The alternative, fine-tuning an LLM on your documents, is expensive, slow, and overkill for a single-document use case. RAG was the right architectural choice here.

```
PDF
 ↓
PyPDFLoader
 ↓
RecursiveCharacterTextSplitter (chunk_size=800, overlap=100)
 ↓
HuggingFace Embeddings (all-MiniLM-L6-v2, runs locally)
 ↓
FAISS Vector Store (in-memory)
 ↓
Custom Two-Stage Retriever (first page chunks + summary chunks + content chunks + broad search)
 ↓
Groq LLM (llama-3.1-8b-instant)
 ↓
Answer
```

The first decision was how to embed the document chunks. OpenAI's embedding API is the popular choice, but it's pay-per-token; during development and testing, that cost accumulates quickly.
HuggingFace's all-MiniLM-L6-v2 runs locally on your machine, costs nothing, and requires no API key. For a single-user, single-PDF system, the performance tradeoff is negligible. This is the kind of decision that matters at scale: choosing the right tool for the actual constraints, not the most popular one.

Pinecone and Weaviate are the production choices for vector storage. They offer persistence across sessions, horizontal scaling, and multi-user support. None of that was needed here.

FAISS runs in-memory with zero setup cost. For one user processing one PDF at a time, it's the correct tradeoff.

The rule I applied: use the simplest thing that satisfies your actual constraints. Reach for hosted infrastructure when you need persistence, concurrency, or datasets too large for memory, not before.

RetrievalQA to LCEL

During development I noticed LangChain had deprecated `RetrievalQA`. Rather than ignore the warning and ship deprecated code, I migrated to the current LCEL (LangChain Expression Language) chain composition pattern.

The old approach was a black box. The new approach is explicit:

```python
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

retrieval_chain = (
    RunnablePassthrough.assign(context=RunnableLambda(retrieve_with_summary) | format_docs)
    | prompt
    | llm
)
```

Every step is visible: retrieval, formatting, prompting, generation. This matters for debugging and for understanding what the system is actually doing.

This is where it got interesting. After building the basic pipeline, I tested it with:

"What is the main purpose of this document?"

The response:

"I cannot find the answer in the provided documents."

But the document's purpose was clearly stated in the abstract. What went wrong?

Similarity search surfaces locally similar chunks: chunks whose text is semantically close to the query. A query about "purpose" doesn't semantically match individual chunks about methodology or findings, even though the answer exists in the document. Vanilla RAG is optimized for specific factual questions.
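
To make that failure mode concrete, here is a toy sketch of top-k similarity retrieval. It is not the StudyMate AI code: plain word-count vectors stand in for the all-MiniLM-L6-v2 embeddings, and the three chunks, the two queries, and the helper names (`bow`, `cosine`, `top_k`) are all made up for illustration. A real embedding model would produce non-zero scores where this toy produces zeros, but the ranking logic is the same.

```python
import math
from collections import Counter


def bow(text):
    """Word-count vector: a crude, assumed stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = bow(query)
    scored = [(cosine(q, bow(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical chunks, not taken from the real PDF.
chunks = [
    "participants completed a vocabulary test after each session",
    "results show higher test scores for mobile learners",
    "our study examines how mobile apps support vocabulary learning",
]

# A specific factual question shares vocabulary with the chunk that answers it.
specific = top_k("which group had higher test scores", chunks)

# A document-level question shares nothing with any single chunk.
document_level = top_k("what is the main purpose of this document", chunks)

print(specific[0][1])
print([score for score, _ in document_level])
```

In this toy setup the specific question ranks the results chunk first, while the document-level question scores 0.0 against every chunk, including the one that states what the study is about.
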
Document-level questions (purpose, thesis, overview) require a global view of the document that chunk-level retrieval can't provide.

I solved this with two additions:

1. Pre-generated summary chunks

At build time, before any user query, I generate 5 targeted summaries from the first 2,500 characters of the document and store them as special chunks in the vector store:

```python
summaries_to_create = {
    "research_question": "What is the exact research question?",
    "methodology": "Describe the methodology in one sentence.",
    "findings": "What are the main findings in one sentence?",
    "conclusions": "What are the conclusions in one sentence?",
    "limitations": "What are the limitations in one sentence?",
}
```

These give the retriever a global view of the document that similarity search alone can't provide.

2. First-page pinning

Pages 0 and 1 of any academic document almost always contain the abstract and introduction, where purpose and topic live. I pin these as always-included context regardless of the query:

```python
def retrieve_with_summary(inputs):
    # Accept either the chain's input dict or a bare query string.
    query = inputs["input"] if isinstance(inputs, dict) else inputs

    # Stage 1: targeted searches over summary chunks and content chunks.
    summary_results = vector_store.similarity_search(query, k=2, filter={"chunk_type": "summary"})
    content_results = vector_store.similarity_search(query, k=3, filter={"chunk_type": "content"})
    # Stage 2: an unfiltered search as a safety net.
    broad_results = vector_store.similarity_search(query, k=2)

    # Always include chunks from the first two pages (abstract and introduction).
    first_page_chunks = [c for c in chunks if c.metadata.get("page", 99) in [0, 1]]

    # Deduplicate by text while preserving priority order.
    seen, all_docs = set(), []
    for doc in first_page_chunks + summary_results + content_results + broad_results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            all_docs.append(doc)
    return all_docs
```

After this fix, document-level questions worked correctly.

Groq's free tier allows 6,000 tokens per minute (TPM). My initial implementation used `ThreadPoolExecutor` with multiple workers to generate summaries in parallel. The result: all 5 API calls fired within milliseconds of each other, consuming ~5,000 tokens in one second and triggering a 429 error immediately.
```
Rate limit reached: Limit 6000, Used 5881, Requested 3378. Please try again in 32.59s.
```

This is a real distributed systems constraint, and solving it required thinking about the problem like a systems engineer, not just a model user.

Solution:

- `max_workers=1`: sequential generation eliminates the burst
- `time.sleep(35)` between calls: 35s gives a safe buffer above the 32.59s reset window
- `[:2500]` characters: keeps each prompt to ~150 tokens, so 5 summaries stay well within the TPM limit

The tradeoff is ~3 minutes of startup time on the free tier. On Groq's Dev tier (30,000 TPM), the sleep can be removed entirely and workers restored; startup drops to under 10 seconds.

The system prompt strictly instructs the model to refuse answering if the context doesn't support it:

```
Strict Rules:
1. Rely ONLY on the clear facts directly mentioned in the context.
2. Do NOT assume, extrapolate, or bring in outside knowledge.
3. If the context does not contain the answer, reply exactly: "I cannot find the answer in the provided documents."
```

Test result with an out-of-scope question:

```
Q: What is the capital of France?

I cannot find the answer in the provided documents.

Source 1 — The EUROCALL Review, Volume 25, No. 2, September 2017...
Source 2 — ...research question, description of participants...
Source 3 — ...referred their students to electronic or online resources...
```

The system correctly refuses rather than hallucinating, and returns the chunks it did find, making the reasoning transparent.

Vanilla RAG is not enough for document-level questions. Chunk similarity search is optimized for factual, specific queries. Any question requiring a global view of the document (purpose, thesis, summary) needs a separate strategy: pre-summarization, large-k retrieval, or a dedicated summary index.

Rate limits are a systems design problem.
The solution isn't just adding a sleep. It's understanding the constraint (TPM budget), calculating the safe parameters (tokens per call × calls per minute), and designing the pipeline around them.

Read the deprecation warnings. `RetrievalQA` and `langchain-community` both flagged deprecation during development. Ignoring them is technical debt. Evaluating them, deciding when to migrate and when to defer, is engineering judgment.

| Component | Choice |
|---|---|
| Frontend | Streamlit |
| LLM | Groq (llama-3.1-8b-instant) |
| Embeddings | HuggingFace all-MiniLM-L6-v2 |
| Vector Store | FAISS |
| Chain | LangChain LCEL |

Built as part of my AI/ML portfolio while preparing research internship applications. Feedback welcome.
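
A postscript on the rate-limit arithmetic from the lessons above: the "tokens per call × calls per minute" budget can be sketched in a few lines. This is a back-of-the-envelope model, not Groq's actual accounting. It assumes the limit behaves like a simple average rate over a minute, and the function names are my own. The 6,000 TPM limit and the 3,378-token request size are the figures from the 429 message.

```python
def min_interval_seconds(tokens_per_call, tpm_limit):
    """Shortest spacing between calls that keeps average usage within the TPM budget.

    Assumes a simple average-rate model of the limit (an assumption, not
    a documented description of the provider's limiter).
    """
    if tokens_per_call > tpm_limit:
        raise ValueError("a single call exceeds the per-minute budget")
    return 60.0 * tokens_per_call / tpm_limit


def max_calls_per_minute(tokens_per_call, tpm_limit):
    """How many calls of this size fit inside one minute's budget."""
    return tpm_limit // tokens_per_call


FREE_TIER_TPM = 6000    # free-tier limit quoted in the error message
REQUEST_TOKENS = 3378   # "Requested" figure from the same message

interval = min_interval_seconds(REQUEST_TOKENS, FREE_TIER_TPM)
calls = max_calls_per_minute(REQUEST_TOKENS, FREE_TIER_TPM)
print(f"{calls} call(s) per minute, at least {interval:.2f}s apart")
```

Under that assumption, a request of this size fits once per minute and needs roughly 33.78s of spacing, which is consistent with the 35s sleep I settled on.
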