AI System Design Interview Questions: ChatGPT, RAG, LLM Inference, and Agents


System design interviews are changing.

Traditional questions such as “Design Twitter,” “Design Uber,” and “Design YouTube” are still important. They test whether you understand databases, caching, partitioning, replication, messaging, and high availability.

But engineers working on modern platforms now encounter a different category of problem:

These questions still require classical distributed-systems knowledge. An AI product needs APIs, queues, storage, authentication, observability, rate limiting, and reliable deployment.

The difference is that it also introduces expensive accelerators, probabilistic output, long-running requests, model routing, vector retrieval, prompt construction, safety controls, and quality evaluation.

This guide explains the most important AI system design interview questions and what a strong candidate should discuss for each.

For a broader preparation roadmap covering traditional and modern problems, see 64 System Design Interview Questions, Ranked From Easiest to Hardest.

A conventional service usually transforms an input into a deterministic output.

If a user requests order number 123, the service should retrieve order 123. Two identical requests should usually return the same underlying information. Generative AI systems behave differently.

A model may produce different responses to the same prompt. A response can be grammatically convincing while being factually wrong. Latency depends on the number of generated tokens. Serving capacity is constrained by accelerator memory, not merely CPU utilization. Product quality may depend on prompts, retrieved context, model versions, safety filters, and external tools.

This creates several new design dimensions.

Traditional systems are often measured using availability, latency, throughput, and error rate.

AI systems need those metrics, but they also need measures such as:

A system that returns a response in 200 milliseconds is not useful if that response is wrong.

An ordinary API server may process thousands of lightweight requests per second.

An LLM request can occupy expensive GPU memory while processing a long prompt and generating hundreds of tokens. The architecture must therefore optimize batching, memory utilization, model placement, and request scheduling.

Users do not normally wait for an entire answer before seeing anything. Tokens are streamed as they are generated.

This introduces at least two important latency measurements:

A system may have acceptable total latency but still feel slow if the first token takes too long.
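As a rough illustration, the two measurements most often tracked for streamed responses are time to first token and the average gap between later tokens. The sketch below computes both from token arrival timestamps. The function name and the sample timestamps are hypothetical and only meant to show the arithmetic.

```python
def streaming_latency(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute latency metrics from token arrival timestamps (in seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    # Time to first token: how long the user stares at an empty response.
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Average gap between consecutive tokens after the first one.
    gaps = len(token_times) - 1
    inter_token = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return {"ttft": ttft, "inter_token": inter_token, "total": total}


# Hypothetical stream: first token after 1.5 s, then one token every 0.25 s.
metrics = streaming_latency(0.0, [1.5, 1.75, 2.0, 2.25, 2.5])
```

In this example the total latency is 2.5 seconds, but 1.5 seconds of it is spent before anything is shown, which is the part a user tends to notice.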

An AI application may depend on:

Each data type has distinct requirements for retention, privacy, freshness, and consistency.

A traditional request may succeed or fail.

An AI request can technically succeed but produce a low-quality answer, retrieve the wrong documents, call the wrong tool, exceed a cost budget, or violate a safety rule.

The architecture must detect and respond to these softer failure modes.

Before considering individual questions, use a consistent interview structure.

Ask what the system is expected to do.

For example:

Without this clarification, “Design an AI assistant” is too broad.

Estimate:

AI systems are often constrained by cost as much as by technical capacity.
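A back-of-the-envelope cost estimate can be sketched as follows. Every number here is an assumption chosen for easy arithmetic, including the traffic, the token counts, and the per-million-token prices; none of them come from a real price list.

```python
def monthly_token_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from average tokens per request and token prices."""
    daily_input = requests_per_day * input_tokens
    daily_output = requests_per_day * output_tokens
    daily_cost = (
        daily_input * input_price_per_million
        + daily_output * output_price_per_million
    ) / 1_000_000
    return daily_cost * days


# Hypothetical workload: 1M requests/day, 1,000 input and 500 output tokens each,
# priced at 1.0 and 2.0 currency units per million tokens.
cost = monthly_token_cost(1_000_000, 1_000, 500, 1.0, 2.0)
```

With these assumed inputs the estimate is 2,000 per day, or 60,000 per month, which is the kind of figure worth stating aloud before drawing any architecture.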

The application layer may include:

The AI layer may include:

Keeping these concerns separate makes the design easier to explain and evolve.

Describe what happens from the moment a user submits a prompt until the final result is displayed.

A typical path may be:

A strong answer should explain:

These discussions distinguish a production design from a demo.

A ChatGPT-like system is one of the most comprehensive AI system design questions.

The functional requirements may include:

A useful high-level architecture contains the following components.

The gateway handles authentication, request routing, rate limiting, quotas, and basic validation.

Long-running generation requests may use Server-Sent Events or WebSockets to stream tokens to clients.

This service manages:

Conversation metadata can live in a transactional database, while large attachments may be placed in object storage.

Models have finite context windows. The context builder decides what information should be included in the next request.

It may combine:

Simply sending the entire conversation forever is expensive and eventually impossible. Older content may need summarization or selective retrieval.
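One simple way to keep a request inside the context window is to always include the system prompt and then add the most recent messages until a token budget is reached. The sketch below assumes that policy. The whitespace-based `count_tokens` is a crude stand-in for a real tokenizer, and all names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: a real system would use the model's own tokenizer.
    return len(text.split())


def build_context(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest messages that fit in `budget` tokens."""
    remaining = budget - count_tokens(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the token budget")
    kept: list[str] = []
    # Walk from newest to oldest so recent turns win when space runs out.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > remaining:
            break  # older messages would need summarization or retrieval
        kept.append(message)
        remaining -= cost
    return [system_prompt] + kept[::-1]


history = ["one two three four", "five six", "seven eight nine"]
context = build_context("be helpful", history, budget=8)
```

Here the oldest message is dropped because it no longer fits. A production context builder would more likely replace it with a summary than discard it.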

The model gateway provides a single interface to multiple model backends.

It can route requests based on:

A simple request may use a smaller, faster model, while complex reasoning may be routed to a more capable one.
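A minimal routing rule might look like the sketch below. The model names, the `needs_reasoning` flag, and the 4,000-token threshold are all illustrative assumptions. A real gateway would likely also weigh cost, tenant tier, and backend health.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    needs_reasoning: bool = False


def route(request: Request, small_context_limit: int = 4_000) -> str:
    """Pick a backend name; both names here are hypothetical placeholders."""
    if request.needs_reasoning or request.prompt_tokens > small_context_limit:
        return "large-model"
    return "small-model"
```

For example, `route(Request(200))` selects the small model, while a 10,000-token prompt or a request flagged as needing reasoning goes to the large one.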

The scheduler assigns requests to model replicas running on accelerators.

It should consider:

A naive first-in, first-out scheduler can allow a few extremely long prompts to delay many short requests.
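The effect is easy to see with a little arithmetic. The sketch below assumes a single worker and four requests that arrive at the same moment, with made-up service times in seconds, and compares the average wait under FIFO with a shortest-first order.

```python
def mean_wait(service_times: list[float]) -> float:
    """Average time each request waits before it starts, on a single worker."""
    waited, clock = 0.0, 0.0
    for service_time in service_times:
        waited += clock          # this request waited for everything before it
        clock += service_time
    return waited / len(service_times)


# One very long prompt arrives just ahead of three short ones.
arrivals = [30.0, 1.0, 1.0, 1.0]
fifo_wait = mean_wait(arrivals)            # served in arrival order
sjf_wait = mean_wait(sorted(arrivals))     # shortest request first
```

Under these assumptions the mean wait drops from 23.25 seconds to 1.5 seconds. Pure shortest-first ordering can starve long requests, so a practical scheduler would usually add aging or separate queues.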

Generated tokens should be forwarded incrementally to the user.

The system must also handle:

Input and output policies may detect:

Safety should not be treated as one filter placed at the end. Different checks may be required before retrieval, before tool execution, before inference, and before returning the final response.

An interviewer may ask:

The Design ChatGPT walkthrough provides a structured example of this problem.

Retrieval-augmented generation, or RAG, allows a model to answer using information retrieved from external sources.

A common interview prompt is:

Design an enterprise assistant that answers employee questions using internal documents and provides citations.

A RAG system has two major paths:

Documents may come from file uploads, internal wikis, cloud drives, databases, or support systems.

The ingestion pipeline performs several stages.

Files must be converted into usable text.

The system may need parsers for:

The extraction process should preserve useful metadata such as titles, headings, page numbers, owners, and access permissions.

Long documents are divided into smaller segments.

Chunks that are too large may contain irrelevant text and consume excessive context. Chunks that are too small may lose meaning.

Possible strategies include:

There is no universally correct chunk size. It should be tested against representative questions.
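As one example of a strategy, fixed-size chunks with overlap can be sketched as below. Splitting on whitespace is a simplification, and the `size` and `overlap` values are arbitrary. A real pipeline would more likely count model tokens and respect headings or paragraphs.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h", size=4, overlap=2)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.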

Each chunk is converted into a numerical vector using an embedding model.

The embedding service should be versioned because changing models can require re-embedding the entire corpus.

The system stores:

A vector index enables semantic retrieval. A traditional inverted index can support keyword retrieval. Many production systems combine both.

When a user submits a question:

Semantic retrieval is useful when the query and source use different words with similar meanings.

Keyword retrieval is useful for exact terms such as:

Combining both methods often produces better coverage.
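One common way to merge the two result lists is reciprocal rank fusion, which uses only each document's rank and so avoids comparing scores from different retrievers. The article does not prescribe a fusion method, so treat this as one option. The document IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list receive a larger contribution.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by ID for a deterministic tie-break.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
keyword = ["doc_b", "doc_d", "doc_a"]    # hypothetical keyword-search results
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents that appear in both lists rise to the top, while documents found by only one retriever are still kept as candidates for reranking.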

Vector similarity may retrieve documents that are generally related but not directly useful.

A reranker can score the top candidates more accurately before they are sent to the LLM. This improves answer quality while keeping the final prompt small.

Security is one of the most important parts of enterprise RAG.

A user should never retrieve a document they are not authorized to view. Filtering after the model has already received the document is too late.

Permissions should be enforced during retrieval, with tenant and user identity included in the query path.
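The idea can be sketched with an in-memory candidate list. The `Chunk` fields, tenant names, and group names are hypothetical. In a real deployment the same tenant and group conditions would normally be pushed into the vector store query as a metadata filter rather than applied in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str
    allowed_groups: frozenset
    score: float


def retrieve(candidates: list[Chunk], tenant_id: str, user_groups: set, top_k: int) -> list[Chunk]:
    """Return the best-scoring chunks the caller is allowed to see."""
    visible = [
        chunk for chunk in candidates
        if chunk.tenant_id == tenant_id and chunk.allowed_groups & user_groups
    ]
    # Ranking happens only over authorized chunks, so nothing else can reach the prompt.
    return sorted(visible, key=lambda chunk: chunk.score, reverse=True)[:top_k]


index = [
    Chunk("hr-policy", "acme", frozenset({"hr"}), 0.9),
    Chunk("handbook", "acme", frozenset({"all-staff"}), 0.7),
    Chunk("other-tenant", "globex", frozenset({"all-staff"}), 0.95),
]
results = retrieve(index, "acme", {"all-staff", "eng"}, top_k=5)
```

The highest-scoring chunk belongs to another tenant and the next one to a group the user is not in, so only the handbook is returned.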

The system must react when:

The ingestion pipeline may use event-driven updates, periodic crawling, or both.

A RAG system should separately evaluate:

This separation is important. A poor answer can result from failed retrieval even when the model behaves correctly.

This question focuses less on the product interface and more on the infrastructure that serves models.

A possible prompt is:

Design a multi-tenant platform that serves several large language models to millions of requests.

The platform may need to support:

The gateway exposes a consistent API and performs:

Admission control is critical. Accepting unlimited work and allowing it to queue indefinitely creates poor latency and can destabilize the system.
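A minimal form of admission control is a bounded queue that rejects new work once it is full. The class name and the queue depth below are assumptions for illustration. A real gateway would typically size the limit from measured throughput and return a retryable error such as HTTP 429.

```python
from collections import deque


class AdmissionController:
    """Accept requests only while the backlog stays under a fixed depth."""

    def __init__(self, max_queue_depth: int):
        self.max_queue_depth = max_queue_depth
        self.queue = deque()

    def submit(self, request_id: str) -> bool:
        if len(self.queue) >= self.max_queue_depth:
            return False  # shed load now instead of queueing indefinitely
        self.queue.append(request_id)
        return True

    def next_request(self):
        # Called by a worker when it has capacity; None means the queue is empty.
        return self.queue.popleft() if self.queue else None


controller = AdmissionController(max_queue_depth=2)
accepted = [controller.submit(r) for r in ("r1", "r2", "r3")]
```

The third request is rejected immediately, which gives the client a fast signal to back off rather than a slow timeout.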

The registry tracks:

Rollouts should use immutable versions so requests and incidents can be traced to the exact model that served them.

Loading a large model into GPU memory can take substantial time. The scheduler cannot treat models like lightweight stateless application containers.

It must decide:

Popular models may remain warm, while rarely used models may accept a cold-start delay.

LLM inference contains two different computational phases.

Prefill processes the input prompt and can often benefit from parallel computation.

Decode generates tokens sequentially and is usually memory-bandwidth intensive.

Separating or independently scheduling these phases can improve utilization, but it also adds network and orchestration complexity.

Instead of waiting for a fixed group of requests to finish together, continuous batching adds and removes requests dynamically as generation progresses.

This improves GPU utilization, especially when responses have different lengths.

The scheduler must still prevent long requests from starving shorter ones.
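A toy simulation shows where the utilization gain comes from. It counts decode steps only, ignores prefill, and assumes every running request produces one token per step. The output lengths and the batch size of two are made up.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request's slot is refilled before the next decode step."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per running request
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


output_lengths = [8, 2, 2, 2]
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
```

With these inputs static batching needs 10 decode steps and continuous batching needs 8, because the short requests reuse the slot next to the long one instead of waiting for it to finish.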

The key-value cache stores intermediate attention state so the model does not recompute the entire prompt for every generated token.
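The memory cost of this cache can be estimated directly: each token stores one key vector and one value vector per layer and per key-value head. The sketch below applies that formula to a hypothetical model with 32 layers, 32 key-value heads, a head dimension of 128, and 16-bit values. These figures are assumptions, not the configuration of any particular system.

```python
def kv_cache_bytes(
    tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Estimate KV-cache size for one sequence of `tokens` tokens."""
    # Factor of 2: one key and one value vector per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1, layers=32, kv_heads=32, head_dim=128)
per_request = kv_cache_bytes(4096, layers=32, kv_heads=32, head_dim=128)
```

Under these assumptions each token costs 512 KiB and a single 4,096-token sequence costs 2 GiB, which is why the cache, rather than compute, often limits how many requests fit on one accelerator.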

KV-cache management affects:

A shared prompt prefix, such as a large system prompt, may sometimes be cached and reused across compatible requests.

GPU utilization alone may not be sufficient for autoscaling.

Useful signals include:

Because accelerator provisioning may be slow, the platform may need reserved capacity and predictive scaling.

When capacity is limited, the system may:

A strong interview answer discusses the quality and cost consequences of each fallback.

An AI agent does more than produce text. It can plan a sequence of actions, call tools, observe results, update its state, and continue until a goal is completed.

A typical prompt might be:

Design an enterprise agent that can search internal documents, update tickets, send emails, and request human approval for sensitive actions.

The orchestrator controls the execution loop:

The orchestrator—not the model—should enforce hard limits such as maximum steps, timeouts, budgets, and approval requirements.
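A minimal sketch of such a loop is shown below. The `plan_next_step` and `execute_tool` callables are hypothetical stand-ins for the model call and the tool executor, and the limits are arbitrary. Budgets and approval checks would be enforced in the same place.

```python
import time


def run_agent(plan_next_step, execute_tool, max_steps: int = 5, timeout_s: float = 30.0) -> dict:
    """Run a plan/act/observe loop; limits are enforced here, not by the model."""
    deadline = time.monotonic() + timeout_s
    observations = []
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            return {"status": "timeout", "observations": observations}
        action = plan_next_step(observations)  # the model proposes the next action
        if action is None:                     # the model signals the goal is reached
            return {"status": "done", "observations": observations}
        observations.append(execute_tool(action))
    return {"status": "step_limit", "observations": observations}


# A stand-in planner that never stops on its own still terminates after 3 steps.
result = run_agent(lambda obs: "search", lambda action: f"ran {action}", max_steps=3)
```

Because the step counter and deadline live outside the model, a confused or manipulated model cannot talk its way past them.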

The tool registry describes each available capability:

Tool definitions should be versioned because changing their schemas can break existing agent behavior.

Tool calls should run through controlled executors rather than allowing the model unrestricted access to internal systems.

The executor handles:

High-risk operations should use narrow, purpose-built APIs.

Agents may need several kinds of memory.

Working memory contains the current task, observations, and intermediate steps.

Session memory preserves information during one user interaction.

Long-term memory stores information across sessions.

External memory may contain documents retrieved from databases or vector indexes.

Not everything should be stored forever. Memory needs explicit retention, privacy, and deletion policies.

Actions such as sending payments, deleting data, publishing content, or modifying production systems should not be executed solely because a model requested them.

The agent can create a proposed action, pause its workflow, and wait for authorized approval.

The approval record should contain:

Agents may retry actions after timeouts.

Without idempotency, a retry could send the same email twice, create duplicate tickets, or repeat a transaction.

Every state-changing tool call should include a stable execution identifier or idempotency key.
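The pattern can be sketched with a hypothetical email tool that remembers the result of each key. The class, the key format, and the in-memory dictionary are assumptions. A real executor would persist keys in durable storage shared by all workers.

```python
class EmailTool:
    """Toy state-changing tool that deduplicates calls by idempotency key."""

    def __init__(self):
        self.sent = []        # side effects that actually happened
        self._results = {}    # idempotency key -> result of the first call

    def send(self, idempotency_key: str, to: str, body: str) -> dict:
        if idempotency_key in self._results:
            # A retry returns the recorded result without repeating the side effect.
            return self._results[idempotency_key]
        self.sent.append((to, body))
        result = {"status": "sent", "message_number": len(self.sent)}
        self._results[idempotency_key] = result
        return result


tool = EmailTool()
first = tool.send("task-42/step-3", "a@example.com", "hi")
retry = tool.send("task-42/step-3", "a@example.com", "hi")  # e.g. after a timeout
```

Deriving the key from the task and step, rather than generating a fresh one per attempt, is what makes the retry recognizable as the same action.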

A strong candidate should discuss:

The system should impose:

The Grokking Modern AI Fundamentals course can provide additional background on agentic AI, planning, memory, and tool-based behavior.

The four core questions cover much of the modern AI stack, but interviewers can frame the same concepts in narrower ways.

Focus on tenant isolation, document permissions, RAG, conversation history, model routing, auditability, and data retention.

Discuss repository indexing, code-aware chunking, low-latency suggestions, context selection, IDE integration, private-code protection, and evaluation of generated code.

Cover dataset versioning, offline evaluation, human review, model comparison, prompt experiments, regression detection, and production feedback.

Explain routing between internal and third-party models based on cost, quality, latency, privacy, context length, and availability.

Focus on ingestion, embeddings, vector indexes, hybrid retrieval, filtering, reranking, index updates, and relevance metrics.

Discuss policy versioning, input and output classification, prompt-injection detection, tool restrictions, personally identifiable information, appeals, and false-positive handling.

Cover prompt templates, version control, experiments, rollout, rollback, tenant overrides, caching, and compatibility with changing model versions.

Add image, audio, and document ingestion, media storage, preprocessing, modality-specific models, content safety, and larger payload management.

A weak answer places an LLM box in the center of a diagram and connects it to an API.

A strong answer explains the system around the model.

Interviewers want to see whether you can reason about:

Can you connect the client, application services, retrieval layer, model platform, storage, and observability systems?

Can you compare:

Can the system continue operating when a model, vector index, tool, region, or third-party provider is unavailable?

Can you tell whether a new model or prompt actually improved the product?

Can you protect tenant data, prevent unauthorized retrieval, constrain tool use, and manage sensitive prompts?

Can you estimate and control token usage, accelerator capacity, retrieval cost, storage, and third-party API spending?

The model is only one component. Production readiness comes from the architecture surrounding it.

Candidates new to large-scale architecture should first learn the traditional foundations.

Grokking System Design Fundamentals introduces the core building blocks behind scalable systems, including databases, caches, queues, replication, partitioning, and load balancing.

The original Grokking the System Design Interview applies those concepts to common interview problems and teaches a structured way to move from requirements to architecture and trade-offs.

The System Design Interview Crash Course is useful for practicing a consistent interview framework across modern case studies, including a complete ChatGPT design problem.

Engineers preparing for senior and staff-level discussions can continue with Advanced System Design Interview, Volume II, which emphasizes open-ended problems, failures, and defensible architectural decisions.

Grokking Scalable Systems for Interviews is a useful next step for strengthening scalability, observability, fault tolerance, and performance reasoning.

For every AI design problem, practice three times:

That third pass is where most of the valuable interview discussion occurs.

AI system design is not a replacement for traditional system design.

It is traditional system design combined with a new set of constraints.

You still need to understand APIs, storage, caching, queues, partitioning, replication, security, observability, and fault tolerance.

But you must now apply those concepts to systems with:

Start with four foundational problems:

Master the request flow, deep dives, failure modes, and trade-offs behind each one.

Once you can explain those systems clearly, most other AI system design questions become variations of the same underlying building blocks.

source & further reading

dev.to (original article)