Most senior AI/ML interview prep stops at theory. Backpropagation. Transformer architecture. Framework trivia. That’s not wrong — it’s just not enough anymore.
As AI systems move from demos into production, the bar has shifted. Interviewers at senior and staff levels don’t want to know if you can explain a concept — they want to know what you’d actually do when it breaks at 2am, when the latency doubles overnight, when the retrieval looks right but the answers are wrong.
This article covers 15 real interview questions across RAG systems, hallucination reduction, agentic architecture, production LLM operations, model selection, system design, and research evaluation — each with the answer that actually lands with a senior interviewer, and the reasoning behind why it lands.
You start as an SDE. You stay an SDE. The titles just change — and so do the questions.
What the interviewer is really asking: Do you understand why RAG exists — not just what it stands for?
Answer:
Retrieval-Augmented Generation (RAG) solves a fundamental limitation of LLMs: their knowledge is frozen at training time and they cannot access private or real-time data.
Instead of retraining a model every time data changes (expensive, slow), RAG retrieves relevant documents at query time and injects them into the LLM’s context window. The model then generates a response grounded in that retrieved context.
The three problems RAG solves:
Follow-up the interviewer will ask: “When would you NOT use RAG?”
[Training a model, or even fine-tuning one, requires a lot of resources and clean data. Proceed with care, and make sure you have room for multiple iterations.]
What the interviewer is really asking: Have you actually operated a RAG system, or just read about it? Chunking is where most RAG systems silently fail.
Answer:
Chunking is splitting source documents into pieces before embedding. The strategy you choose directly determines retrieval quality — it’s arguably more impactful than which embedding model you pick.
Fixed-size chunking: Split by token count (e.g. 512 tokens), with optional overlap (e.g. 50 tokens).
Semantic chunking: Split at natural boundaries: paragraphs, sections, sentence clusters.
Hierarchical chunking (parent-child): Store large parent chunks (512–1024 tokens) and small child chunks (128 tokens). Embed child chunks for retrieval, but return the parent chunk to the LLM for full context.
Agentic / late chunking: Chunk at query time based on what the query needs, not at index time.
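To make the fixed-size and parent-child strategies concrete, here is a minimal sketch. It assumes tokens are just a list of strings (a whitespace split stands in for a real tokenizer), and the function names `fixed_size_chunks` and `parent_child_chunks` are hypothetical, not from any library.

```python
from typing import Dict, List, Tuple


def fixed_size_chunks(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def parent_child_chunks(
    tokens: List[str], parent_size: int = 1024, child_size: int = 128
) -> Tuple[List[List[str]], List[Dict]]:
    """Return (parents, children); each child records the index of its parent.

    The idea: embed and search the small children, then hand the matching
    parent to the LLM so it sees the surrounding context.
    """
    parents = fixed_size_chunks(tokens, parent_size, overlap=0)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in fixed_size_chunks(parent, child_size, overlap=0):
            children.append({"parent_id": parent_id, "tokens": child})
    return parents, children


# Usage: a toy document of 10 "tokens".
doc = "a b c d e f g h i j".split()
windows = fixed_size_chunks(doc, size=4, overlap=1)
parents, children = parent_child_chunks(doc, parent_size=4, child_size=2)
```

The sizes are parameters on purpose: the point of a small chunking test is to sweep them against your own retrieval metrics rather than trust a default.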
The answer that lands well in interviews: “Bigger chunks are not automatically better. Large chunks improve recall but hurt precision — the LLM gets the right document but surrounded by noise. Small chunks improve precision but hurt recall — you miss the surrounding context. Hierarchical chunking solves this but costs you indexing complexity.”
[Run a small chunking test before you commit to a strategy; it saves a lot of time later in production. Always benchmark embedding models on your own data so you pick the one that fits your requirements. In RAG, the quality of chunks and embeddings matters most.]
What the interviewer is really asking: Do you know why pure vector search fails in production and what the full retrieval pipeline actually looks like?
Answer:
Dense retrieval (vector search) converts queries and documents to embeddings and finds nearest neighbours. It handles semantic similarity well — “what’s the capital of France” matches “Paris is France’s capital” even with no overlapping keywords.
Sparse retrieval (BM25) is keyword-based. It scores documents by term frequency and inverse document frequency. It handles exact matches, rare terms, product names, and acronyms better than dense retrieval.
Why neither alone is enough:
Hybrid retrieval runs both in parallel and merges results using Reciprocal Rank Fusion (RRF) or a learned combiner.
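Reciprocal Rank Fusion is simple enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of document IDs; the function name is hypothetical, and `k = 60` is the constant commonly used with RRF, not something this article prescribes.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: fuse a dense (vector) ranking with a sparse (BM25) ranking.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF only looks at ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.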
When does reranking earn its cost? A cross-encoder reranker reads each query-document pair together and produces a relevance score. This is far more accurate than bi-encoder embedding similarity but adds 50–300ms depending on the number of candidates.
Reranking earns its latency cost when:
Skip reranking when:
[If you want fast responses, use a smaller model and weigh the cost against the latency. You can't expect a 70B model to respond faster than a 3B or 7B model.]
What the interviewer is really asking: Do you treat evaluation as a one-time task or a continuous system?
Answer:
RAG evaluation has two dimensions: retrieval quality and generation quality. Both need continuous measurement, not a one-time benchmark.
Retrieval metrics:
Generation metrics:
Frameworks worth knowing:
The production mindset: Run evals on a golden dataset continuously. Every change to chunking strategy, embedding model, retriever, or prompt should be measured against the same baseline. Treat RAG eval like you treat test coverage — a regression is a signal, not a surprise.
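A golden-dataset regression check can start very small. The sketch below assumes two common retrieval metrics, recall@k and mean reciprocal rank, and a hypothetical `retrieve(query)` callable that returns ranked document IDs; the dataset shape and the tolerance value are illustrative assumptions.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: List[Dict], retrieve: Callable[[str], List[str]], k: int = 5) -> Dict[str, float]:
    """Average both metrics over every example in the golden dataset."""
    recalls, rranks = [], []
    for example in golden:
        retrieved = retrieve(example["query"])
        recalls.append(recall_at_k(retrieved, example["relevant"], k))
        rranks.append(reciprocal_rank(retrieved, example["relevant"]))
    n = len(golden) or 1
    return {"recall@k": sum(recalls) / n, "mrr": sum(rranks) / n}


def regressions(baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.01) -> List[str]:
    """Names of metrics that dropped below the baseline by more than `tolerance`."""
    return [name for name in baseline if current[name] < baseline[name] - tolerance]


# Usage with a toy retriever backed by a lookup table.
golden = [
    {"query": "q1", "relevant": {"d1", "d2"}},
    {"query": "q2", "relevant": {"d9"}},
]
toy_index = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d9"]}
metrics = evaluate(golden, toy_index.__getitem__, k=2)
failed = regressions({"recall@k": 0.8, "mrr": 0.75}, metrics)
```

Wiring `regressions` into CI is what turns the eval from a report into a gate: a chunking or prompt change that drops a metric fails the build the same way a broken unit test would.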
[Drift is real. Evaluate the deployment on a weekly basis so you always know the quality of its responses.]
What the interviewer is really asking: Do you understand that hallucinations have different root causes — and that most of them aren’t solved by prompting?
Answer:
Hallucinations fall into three categories with different fixes:
1. Retrieval hallucinations: the model never received the correct context. Fix: improve chunking, hybrid retrieval, reranking. The most common cause and the most impactful fix.
2. Parametric hallucinations: the model “knows” something from training that is wrong or outdated. Fix: ground responses in retrieved context, add citation requirements, use confidence thresholds.
3. Reasoning hallucinations: the model has the right context but draws the wrong conclusion. Fix: chain-of-thought prompting, decompose complex queries, use structured output validation.
The answer that separates senior candidates: “A system that says ‘I don’t know’ is more production-ready than one that’s confidently wrong. Refusing to answer when confidence is low is a feature, not a failure. I’d rather instrument a refusal rate metric and work to reduce it than ship hallucinations users can’t detect.”
[Make sure the retrieved context really contains the data the answer presents. Prompting matters as much as the context: make sure the model knows when to say no.]
What the interviewer is really asking: Can you design a system that fails safely under uncertainty?
Answer:
Confidence thresholds work at two levels:
Retrieval confidence: If the top retrieved document’s similarity score is below a threshold (e.g., cosine similarity < 0.75), either:
Generation confidence: Use model logprobs, self-consistency checks, or a separate critic model to score the generated answer’s reliability before returning it to the user.
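A retrieval-confidence gate might look like the sketch below. It assumes plain Python lists as embeddings, uses the 0.75 cosine threshold from the example above, and picks refusal as the fallback; that choice, the `generate` callable, and the function names are illustrative assumptions rather than a prescribed design.

```python
import math
from typing import Callable, Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_or_refuse(
    query_vec: List[float],
    docs: List[Dict],
    generate: Callable[[str], str],
    threshold: float = 0.75,
) -> Dict:
    """Generate only when the best retrieved document clears the threshold."""
    scored = [(cosine_similarity(query_vec, doc["vector"]), doc) for doc in docs]
    if not scored:
        return {"answer": None, "refused": True, "score": 0.0}
    best_score, best_doc = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        # Low retrieval confidence: refuse instead of letting the model guess.
        return {"answer": None, "refused": True, "score": best_score}
    return {"answer": generate(best_doc["text"]), "refused": False, "score": best_score}


# Usage with two toy documents and a stand-in generator.
docs = [
    {"vector": [1.0, 0.0], "text": "Refund policy: 30 days."},
    {"vector": [0.0, 1.0], "text": "Shipping takes 5 days."},
]
confident = answer_or_refuse([1.0, 0.1], docs, generate=lambda ctx: f"Based on: {ctx}")
uncertain = answer_or_refuse([1.0, 1.0], docs, generate=lambda ctx: f"Based on: {ctx}")
```

Returning the score alongside the refusal flag makes it straightforward to log a refusal-rate metric and tune the threshold on real traffic instead of guessing it.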
Graceful degradation pattern:
What the interviewer is really asking: Can you make the right architectural call, or do you reach for agents because they’re trendy?
Answer:
Workflow — a deterministic sequence of steps. The path is fixed regardless of input. Use when: the process is predictable, compliance requires auditability, reliability matters more than flexibility.
Agent — uses an LLM to reason about which action to take next. The path is dynamic. Use when: the next step genuinely depends on the result of the previous step, and you can’t enumerate the decision tree in advance.
Multi-Agent — multiple specialised agents coordinated by an orchestrator. Use when: the problem decomposes naturally into independent specialist roles, and parallelism or specialisation would meaningfully improve quality or speed.
Decision rule: Start with a workflow. Add an agent when you hit a decision point you cannot enumerate. Add multiple agents when specialisation materially improves quality over one general agent.
The trap to call out in interviews: Multi-agent systems introduce coordination overhead, error propagation, and debugging complexity. If a single well-prompted agent solves the problem, use it. Complexity should solve a real problem, not signal architectural sophistication.
[I have worked on many flows where one step needed no information from another agent. If one agent doesn't need to know what the other agents are doing, the effort of a multi-agent setup is not worth it. Keep it simple when you can.]
What the interviewer is really asking: Have you actually operated an agent in production? This is where most agent systems break.
Answer:
Agentic systems fail in ways that don’t exist in traditional software:
Tool call failures:
Infinite loops:
Cost runaway:
[If you forget failure handling and retry limits, you can end up paying LLM costs for calls that were never useful, and you lose user trust as well.]
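The three failure classes above map onto three guardrails: a retry limit per tool call, a loop check, and a cost budget. The sketch below is one way to combine them; the `decide` callable, the action dictionary shape, the per-action `cost` field, and the "identical call seen twice" loop heuristic are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def call_with_retries(tool: Callable, args: Dict, max_retries: int = 2):
    """Call a tool, retrying on any exception up to `max_retries` extra times."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return tool(**args)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError(f"tool failed after {max_retries + 1} attempts") from last_error


def run_agent(decide: Callable[[List], Dict], tools: Dict[str, Callable],
              max_steps: int = 8, max_cost: float = 1.0) -> Dict:
    """Run an agent loop with a step limit, a loop detector, and a cost budget."""
    history: List = []
    seen = set()
    spent = 0.0
    for step in range(max_steps):
        action = decide(history)  # hypothetical: the LLM picks the next action
        spent += action.get("cost", 0.0)
        if spent > max_cost:
            return {"status": "budget_exceeded", "spent": spent}
        if "final" in action:
            return {"status": "done", "answer": action["final"], "steps": step + 1}
        # Heuristic: the exact same tool call twice suggests the agent is stuck.
        signature = (action["tool"], repr(sorted(action["args"].items())))
        if signature in seen:
            return {"status": "loop_detected", "spent": spent}
        seen.add(signature)
        try:
            result = call_with_retries(tools[action["tool"]], action["args"])
        except RuntimeError as exc:
            result = f"error: {exc}"  # surface the failure to the model
        history.append((action, result))
    return {"status": "step_limit", "spent": spent}
```

Every exit path returns a status instead of raising, so the caller can decide whether a stuck or over-budget run becomes a refusal, a fallback, or an alert.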
What the interviewer is really asking: Can you reason systematically under pressure, or do you guess?
Answer:
The process is elimination, not guessing. Work from the most recent change backwards.
Instruments you need before this question is ever relevant:
The answer that separates candidates: “Observability beats optimization. You cannot debug a latency regression without per-stage traces. The first thing I do on a new AI system is instrument every stage separately — retrieval, reranking, context assembly, generation — so I know exactly where time goes before anything breaks.”
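Per-stage instrumentation does not need a tracing platform on day one. The sketch below uses only the standard library; `StageTimer` is a hypothetical helper, and the `time.sleep` calls stand in for real retrieval and generation work.

```python
import time
from contextlib import contextmanager
from typing import Dict


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised, so failures are visible too.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        """Name of the stage that consumed the most time."""
        return max(self.durations, key=self.durations.get)


# Usage: wrap each stage of a request separately.
timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)  # stand-in for a vector search
with timer.stage("generation"):
    time.sleep(0.05)  # stand-in for the LLM call
print(timer.slowest(), timer.durations)
```

In a real service you would export these durations to your metrics system per request; the point is that each stage gets its own number before an incident, not during one.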
[I worked on one incident where responses were arriving after 3-4 minutes, and even then they were not correct. I followed this pattern and found that 429 errors were the culprit. That's when I decided that once a 429 comes back, the request should go to another deployment or a smaller model for a quick response, because exponential backoff can keep you stuck in the retry loop for a long time.]
What the interviewer is really asking: Can you engineer cost efficiency into a production AI system, not just get it to work?
Answer:
Cost in LLM systems comes from three places: input tokens, output tokens, and model tier. Optimise each separately.
Input token reduction:
Output token reduction:
Model routing:
Real numbers to know:
[Prompt optimization and model routing can save you 30% of the cost in many cases. You don't need a large model for small tasks. Spend resources like your pocket money.]
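A rough cost model helps make the routing argument with numbers instead of intuition. In the sketch below the price table is entirely hypothetical (it is not any vendor's real pricing), and routing by task type is just one simple policy chosen for illustration.

```python
from typing import Dict

# Hypothetical prices in dollars per 1,000 tokens. Replace with your vendor's real rates.
PRICES_PER_1K: Dict[str, Dict[str, float]] = {
    "small": {"input": 0.0002, "output": 0.0006},
    "large": {"input": 0.0030, "output": 0.0150},
}

# Assumption: these task types are simple enough for the small model.
SIMPLE_TASKS = {"classification", "extraction", "routing"}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output tokens are priced separately."""
    price = PRICES_PER_1K[model]
    return input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]


def route(task_type: str) -> str:
    """Send simple tasks to the small model and everything else to the large one."""
    return "small" if task_type in SIMPLE_TASKS else "large"


# Usage: compare routing against sending everything to the large model.
requests = [("classification", 800, 20), ("summarization", 3000, 400), ("extraction", 1200, 100)]
all_large = sum(estimate_cost("large", i, o) for _, i, o in requests)
routed = sum(estimate_cost(route(task), i, o) for task, i, o in requests)
print(f"all large: ${all_large:.4f}, routed: ${routed:.4f}")
```

The actual saving depends on your traffic mix and real prices, so treat any percentage as something to measure on your own logs rather than assume.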
What the interviewer is really asking: Do you default to the latest LLM for everything, or do you actually match the tool to the problem?
Answer:
Model selection is an engineering decision, not a prestige decision. The right model is the one that meets the business requirement within acceptable cost, latency, and maintenance constraints.
Decision questions to ask before picking a model:
[If you are troubleshooting problems, a prompt-based system can often solve them without spending a fortune on development or token usage.]
What the interviewer is really asking: Can you make the right customization call, or do you default to the most expensive option?
Answer:
These three approaches address different problems and are not interchangeable.
Prompt engineering:
RAG:
Fine-tuning: [Use fine-tuning only when your data is very different from what these models were trained on. Domain-specific use cases only.]
What the interviewer is really asking: Can you design AI systems for regulated, high-stakes domains — not just demos?
Answer:
Healthcare claims processing has three hard constraints most designs ignore:
Architecture:
Points that differentiate a strong answer:
What the interviewer is really asking: Can you think about AI systems with the same reliability engineering mindset as distributed systems?
Answer:
AI systems have failure modes traditional systems don’t:
Graceful degradation tiers:
Implementation patterns:
[In production, being an overthinker helps. Always ask: if this doesn't work, what happens, and how do I avoid it? For example, always run multiple deployments: if one returns a 429, use another. The first gets room to breathe while the other serves traffic; even a smaller, cheaper model works here for an intermediate answer.]
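The multi-deployment fallback described in the note can be sketched as an ordered list of callables. `RateLimited` is a hypothetical exception standing in for an HTTP 429; real SDKs raise their own error types, so the mapping from a 429 response to this exception is an assumption you would adapt.

```python
from typing import Callable, Dict, List, Tuple


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from a model deployment."""


def complete_with_fallback(prompt: str, deployments: List[Tuple[str, Callable[[str], str]]]) -> Dict:
    """Try each deployment in order and move on immediately when one is rate limited.

    Unlike exponential backoff on a single deployment, the request never waits:
    it either succeeds on the next deployment or fails fast once all are exhausted.
    """
    skipped = []
    for name, call in deployments:
        try:
            return {"deployment": name, "text": call(prompt), "skipped": skipped}
        except RateLimited:
            skipped.append(name)  # give this deployment room to breathe
    raise RateLimited(f"all deployments rate limited: {skipped}")


# Usage: the primary is throttled, so the smaller fallback model answers.
def primary(prompt: str) -> str:
    raise RateLimited("429 from primary")


def small_fallback(prompt: str) -> str:
    return f"[small model] {prompt}"


result = complete_with_fallback("Summarize the ticket.", [("primary", primary), ("small", small_fallback)])
```

Recording which deployments were skipped gives you the signal for the 429 rate per deployment, which is what tells you whether the fallback is an exception or the normal path.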
What the interviewer is really asking: Are you appropriately sceptical of academic results, or do you adopt techniques because they’re new?
Answer:
A paper earning a place in production needs to pass more than a benchmark score.
Evaluation checklist:
1. Problem relevance: Does the technique actually solve a problem we have, or does it solve a slightly different problem on a curated benchmark?
2. Reproducibility: Is the code available? Can we reproduce the reported numbers? Many papers report cherry-picked results. Run it on your own data before drawing conclusions.
3. Real accuracy delta: Is the improvement statistically significant and large enough to matter? A 0.3% improvement on a benchmark rarely translates to user-perceptible quality improvement.
4. Latency and cost impact: Papers rarely benchmark inference cost. A technique that improves accuracy by 5% but doubles inference time is not automatically worth it.
5. Operational complexity: Will this technique require a new infrastructure component? New dependencies? A new serving pattern? Who will maintain it on-call at 2am?
6. Controlled experiment: Run an A/B test against the baseline on production traffic with real users before committing to a rollout.
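For the "statistically significant" part of the checklist, a two-proportion z-test is one common way to compare a baseline and a candidate on a success-rate metric. The sketch below assumes the metric is a simple success count (for example, answers users rated helpful), which is an illustrative choice; the sample numbers are made up.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both arms are all successes or all failures
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Usage: a 0.3-point lift versus a 10-point lift, each on 1,000 requests per arm.
z_small, p_small = two_proportion_z_test(500, 1000, 503, 1000)
z_large, p_large = two_proportion_z_test(500, 1000, 600, 1000)
print(f"small lift: p={p_small:.3f}, large lift: p={p_large:.6f}")
```

With these made-up numbers the small lift is indistinguishable from noise while the large one is not, which is the same point the checklist makes about 0.3% benchmark gains. Significance is still only half the question: the lift also has to be large enough to justify the latency, cost, and on-call burden.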
The line that lands: “Research generates ideas. Production demands evidence. Those are different standards, and conflating them is how teams end up maintaining techniques that never actually moved their metrics.”
[Certain techniques still look good on paper but are not adopted in production systems. Always evaluate the theory against business use cases and outputs with a small POC.]
These questions signal that you think like a senior engineer, not a candidate looking for a job:
“You start as SDE. You stay SDE. The titles just change.”
Senior AI Interviews Don’t Test What You Know. They Test What Breaks at 2am. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.