{"slug": "ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents", "title": "AI System Design Interview Questions: ChatGPT, RAG, LLM Inference, and Agents", "summary": "System design interviews are evolving to include AI-specific questions on topics like ChatGPT, RAG, LLM inference, and AI agents. A developer outlines key differences from traditional systems, such as probabilistic outputs, accelerator constraints, and new metrics like token latency and answer quality. The guide provides a structured approach to designing AI systems, covering components like gateways, model routers, vector stores, and safety filters.", "body_md": "System design interviews are changing.\n\nTraditional questions such as “Design Twitter,” “Design Uber,” and “Design YouTube” are still important. They test whether you understand databases, caching, partitioning, replication, messaging, and high availability.\n\nBut engineers working on modern platforms now encounter a different category of problem:\n\nThese questions still require classical distributed-systems knowledge. An AI product needs APIs, queues, storage, authentication, observability, rate limiting, and reliable deployment.\n\nThe difference is that it also introduces expensive accelerators, probabilistic output, long-running requests, model routing, vector retrieval, prompt construction, safety controls, and quality evaluation.\n\nThis guide explains the most important AI system design interview questions and what a strong candidate should discuss for each.\n\nFor a broader preparation roadmap covering traditional and modern problems, see [64 System Design Interview Questions, Ranked From Easiest to Hardest](https://dev.to/arslan_ah/64-system-design-interview-questions-ranked-from-easiest-to-hardest-260m).\n\nA conventional service usually transforms an input into a deterministic output.\n\nIf a user requests order number 123, the service should retrieve order 123. 
Two identical requests should usually return the same underlying information.\n\nGenerative AI systems behave differently.\n\nA model may produce different responses to the same prompt. A response can be grammatically convincing while being factually wrong. Latency depends on the number of generated tokens. Serving capacity is constrained by accelerator memory, not merely CPU utilization. Product quality may depend on prompts, retrieved context, model versions, safety filters, and external tools.\n\nThis creates several new design dimensions.\n\nTraditional systems are often measured using availability, latency, throughput, and error rate.\n\nAI systems need those metrics, but they also need measures such as:\n\nA system that returns a response in 200 milliseconds is not useful if that response is wrong.\n\nAn ordinary API server may process thousands of lightweight requests per second.\n\nAn LLM request can occupy expensive GPU memory while processing a long prompt and generating hundreds of tokens. The architecture must therefore optimize batching, memory utilization, model placement, and request scheduling.\n\nUsers do not normally wait for an entire answer before seeing anything. 
Tokens are streamed as they are generated.\n\nThis introduces at least two important latency measurements:\n\nA system may have acceptable total latency but still feel slow if the first token takes too long.\n\nAn AI application may depend on:\n\nEach data type has distinct requirements for retention, privacy, freshness, and consistency.\n\nA traditional request may succeed or fail.\n\nAn AI request can technically succeed but produce a low-quality answer, retrieve the wrong documents, call the wrong tool, exceed a cost budget, or violate a safety rule.\n\nThe architecture must detect and respond to these softer failure modes.\n\nBefore considering individual questions, use a consistent interview structure.\n\nAsk what the system is expected to do.\n\nFor example:\n\nWithout this clarification, “Design an AI assistant” is too broad.\n\nEstimate:\n\nAI systems are often constrained by cost as much as by technical capacity.\n\nThe application layer may include:\n\nThe AI layer may include:\n\nKeeping these concerns separate makes the design easier to explain and evolve.\n\nDescribe what happens from the moment a user submits a prompt until the final result is displayed.\n\nA typical path may be:\n\nA strong answer should explain:\n\nThese discussions distinguish a production design from a demo.\n\nA ChatGPT-like system is one of the most comprehensive AI system design questions.\n\nThe functional requirements may include:\n\nA useful high-level architecture contains the following components.\n\nThe gateway handles authentication, request routing, rate limiting, quotas, and basic validation.\n\nLong-running generation requests may use Server-Sent Events or WebSockets to stream tokens to clients.\n\nThis service manages:\n\nConversation metadata can live in a transactional database, while large attachments may be placed in object storage.\n\nModels have finite context windows. 
The context builder decides what information should be included in the next request.\n\nIt may combine:\n\nSimply sending the entire conversation forever is expensive and eventually impossible. Older content may need summarization or selective retrieval.\n\nThe model gateway provides a single interface to multiple model backends.\n\nIt can route requests based on:\n\nA simple request may use a smaller, faster model, while complex reasoning may be routed to a more capable one.\n\nThe scheduler assigns requests to model replicas running on accelerators.\n\nIt should consider:\n\nA naive first-in, first-out scheduler can allow a few extremely long prompts to delay many short requests.\n\nGenerated tokens should be forwarded incrementally to the user.\n\nThe system must also handle:\n\nInput and output policies may detect:\n\nSafety should not be treated as one filter placed at the end. Different checks may be required before retrieval, before tool execution, before inference, and before returning the final response.\n\nAn interviewer may ask:\n\nThe [Design ChatGPT walkthrough](https://www.designgurus.io/course-play/system-design-interview-crash-course/doc/design-chatgpt) provides a structured example of this problem.\n\nRetrieval-augmented generation, or RAG, allows a model to answer using information retrieved from external sources.\n\nA common interview prompt is:\n\nDesign an enterprise assistant that answers employee questions using internal documents and provides citations.\n\nA RAG system has two major paths:\n\nDocuments may come from file uploads, internal wikis, cloud drives, databases, or support systems.\n\nThe ingestion pipeline performs several stages.\n\nFiles must be converted into usable text.\n\nThe system may need parsers for:\n\nThe extraction process should preserve useful metadata such as titles, headings, page numbers, owners, and access permissions.\n\nLong documents are divided into smaller segments.\n\nChunks that are too large may contain 
irrelevant text and consume excessive context. Chunks that are too small may lose meaning.\n\nPossible strategies include:\n\nThere is no universally correct chunk size. It should be tested against representative questions.\n\nEach chunk is converted into a numerical vector using an embedding model.\n\nThe embedding service should be versioned because changing models can require re-embedding the entire corpus.\n\nThe system stores:\n\nA vector index enables semantic retrieval. A traditional inverted index can support keyword retrieval. Many production systems combine both.\n\nWhen a user submits a question:\n\nSemantic retrieval is useful when the query and source use different words with similar meanings.\n\nKeyword retrieval is useful for exact terms such as:\n\nCombining both methods often produces better coverage.\n\nVector similarity may retrieve documents that are generally related but not directly useful.\n\nA reranker can score the top candidates more accurately before they are sent to the LLM. This improves answer quality while keeping the final prompt small.\n\nSecurity is one of the most important parts of enterprise RAG.\n\nA user should never retrieve a document they are not authorized to view. Filtering after the model has already received the document is too late.\n\nPermissions should be enforced during retrieval, with tenant and user identity included in the query path.\n\nThe system must react when:\n\nThe ingestion pipeline may use event-driven updates, periodic crawling, or both.\n\nA RAG system should separately evaluate:\n\nThis separation is important. 
A poor answer can result from failed retrieval even when the model behaves correctly.\n\nThis question focuses less on the product interface and more on the infrastructure that serves models.\n\nA possible prompt is:\n\nDesign a multi-tenant platform that serves several large language models to millions of requests.\n\nThe platform may need to support:\n\nThe gateway exposes a consistent API and performs:\n\nAdmission control is critical. Accepting unlimited work and allowing it to queue indefinitely creates poor latency and can destabilize the system.\n\nThe registry tracks:\n\nRollouts should use immutable versions so requests and incidents can be traced to the exact model that served them.\n\nLoading a large model into GPU memory can take substantial time. The scheduler cannot treat models like lightweight stateless application containers.\n\nIt must decide:\n\nPopular models may remain warm, while rarely used models may accept a cold-start delay.\n\nLLM inference contains two different computational phases.\n\n**Prefill** processes the input prompt and can often benefit from parallel computation.\n\n**Decode** generates tokens sequentially and is usually memory-bandwidth intensive.\n\nSeparating or independently scheduling these phases can improve utilization, but it also adds network and orchestration complexity.\n\nInstead of waiting for a fixed group of requests to finish together, continuous batching adds and removes requests dynamically as generation progresses.\n\nThis improves GPU utilization, especially when responses have different lengths.\n\nThe scheduler must still prevent long requests from starving shorter ones.\n\nThe key-value cache stores intermediate attention state so the model does not recompute the entire prompt for every generated token.\n\nKV-cache management affects:\n\nA shared prompt prefix—such as a large system prompt—may sometimes be cached and reused across compatible requests.\n\nGPU utilization alone may not be sufficient for 
autoscaling.\n\nUseful signals include:\n\nBecause accelerator provisioning may be slow, the platform may need reserved capacity and predictive scaling.\n\nWhen capacity is limited, the system may:\n\nA strong interview answer discusses the quality and cost consequences of each fallback.\n\nAn AI agent does more than produce text. It can plan a sequence of actions, call tools, observe results, update its state, and continue until a goal is completed.\n\nA typical prompt might be:\n\nDesign an enterprise agent that can search internal documents, update tickets, send emails, and request human approval for sensitive actions.\n\nThe orchestrator controls the execution loop:\n\nThe orchestrator—not the model—should enforce hard limits such as maximum steps, timeouts, budgets, and approval requirements.\n\nThe tool registry describes each available capability:\n\nTool definitions should be versioned because changing their schemas can break existing agent behavior.\n\nTool calls should run through controlled executors rather than allowing the model unrestricted access to internal systems.\n\nThe executor handles:\n\nHigh-risk operations should use narrow, purpose-built APIs.\n\nAgents may need several kinds of memory.\n\n**Working memory** contains the current task, observations, and intermediate steps.\n\n**Session memory** preserves information during one user interaction.\n\n**Long-term memory** stores information across sessions.\n\n**External memory** may contain documents retrieved from databases or vector indexes.\n\nNot everything should be stored forever. 
Memory needs explicit retention, privacy, and deletion policies.\n\nActions such as sending payments, deleting data, publishing content, or modifying production systems should not be executed solely because a model requested them.\n\nThe agent can create a proposed action, pause its workflow, and wait for authorized approval.\n\nThe approval record should contain:\n\nAgents may retry actions after timeouts.\n\nWithout idempotency, a retry could send the same email twice, create duplicate tickets, or repeat a transaction.\n\nEvery state-changing tool call should include a stable execution identifier or idempotency key.\n\nA strong candidate should discuss:\n\nThe system should impose:\n\nThe [Grokking Modern AI Fundamentals](https://www.designgurus.io/course/grokking-modern-ai-fundamentals) course can provide additional background on agentic AI, planning, memory, and tool-based behavior.\n\nThe four core questions cover much of the modern AI stack, but interviewers can frame the same concepts in narrower ways.\n\nFocus on tenant isolation, document permissions, RAG, conversation history, model routing, auditability, and data retention.\n\nDiscuss repository indexing, code-aware chunking, low-latency suggestions, context selection, IDE integration, private-code protection, and evaluation of generated code.\n\nCover dataset versioning, offline evaluation, human review, model comparison, prompt experiments, regression detection, and production feedback.\n\nExplain routing between internal and third-party models based on cost, quality, latency, privacy, context length, and availability.\n\nFocus on ingestion, embeddings, vector indexes, hybrid retrieval, filtering, reranking, index updates, and relevance metrics.\n\nDiscuss policy versioning, input and output classification, prompt-injection detection, tool restrictions, personally identifiable information, appeals, and false-positive handling.\n\nCover prompt templates, version control, experiments, rollout, rollback, 
tenant overrides, caching, and compatibility with changing model versions.\n\nAdd image, audio, and document ingestion, media storage, preprocessing, modality-specific models, content safety, and larger payload management.\n\nA weak answer places an LLM box in the center of a diagram and connects it to an API.\n\nA strong answer explains the system around the model.\n\nInterviewers want to see whether you can reason about:\n\nCan you connect the client, application services, retrieval layer, model platform, storage, and observability systems?\n\nCan you compare:\n\nCan the system continue operating when a model, vector index, tool, region, or third-party provider is unavailable?\n\nCan you tell whether a new model or prompt actually improved the product?\n\nCan you protect tenant data, prevent unauthorized retrieval, constrain tool use, and manage sensitive prompts?\n\nCan you estimate and control token usage, accelerator capacity, retrieval cost, storage, and third-party API spending?\n\nThe model is only one component. 
Production readiness comes from the architecture surrounding it.\n\nCandidates new to large-scale architecture should first learn the traditional foundations.\n\n[Grokking System Design Fundamentals](https://www.designgurus.io/course/grokking-system-design-fundamentals) introduces the core building blocks behind scalable systems, including databases, caches, queues, replication, partitioning, and load balancing.\n\nThe original [Grokking the System Design Interview](https://www.designgurus.io/course/grokking-the-system-design-interview) applies those concepts to common interview problems and teaches a structured way to move from requirements to architecture and trade-offs.\n\nThe [System Design Interview Crash Course](https://www.designgurus.io/course/system-design-interview-crash-course) is useful for practicing a consistent interview framework across modern case studies, including a complete ChatGPT design problem.\n\nEngineers preparing for senior and staff-level discussions can continue with [Advanced System Design Interview, Volume II](https://www.designgurus.io/course/grokking-system-design-interview-ii), which emphasizes open-ended problems, failures, and defensible architectural decisions.\n\n[Grokking Scalable Systems for Interviews](https://www.designgurus.io/course/grokking-scalable-systems-for-interviews) is a useful next step for strengthening scalability, observability, fault tolerance, and performance reasoning.\n\nFor every AI design problem, practice three times:\n\nThat third pass is where most of the valuable interview discussion occurs.\n\nAI system design is not a replacement for traditional system design.\n\nIt is traditional system design combined with a new set of constraints.\n\nYou still need to understand APIs, storage, caching, queues, partitioning, replication, security, observability, and fault tolerance.\n\nBut you must now apply those concepts to systems with:\n\nStart with four foundational problems:\n\nMaster the request flow, 
deep dives, failure modes, and trade-offs behind each one.\n\nOnce you can explain those systems clearly, most other AI system design questions become variations of the same underlying building blocks.", "url": "https://wpnews.pro/news/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents", "canonical_source": "https://dev.to/arslan_ah/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents-1doi", "published_at": "2026-06-25 11:38:22+00:00", "updated_at": "2026-06-25 12:13:34.421873+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "generative-ai", "ai-agents", "ai-infrastructure"], "entities": ["ChatGPT", "RAG", "LLM", "AI"], "alternates": {"html": "https://wpnews.pro/news/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents", "markdown": "https://wpnews.pro/news/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents.md", "text": "https://wpnews.pro/news/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents.txt", "jsonld": "https://wpnews.pro/news/ai-system-design-interview-questions-chatgpt-rag-llm-inference-and-agents.jsonld"}}