I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned


Most AI projects start at the top of the stack.

You grab an LLM API, wire up a vector database, build a RAG pipeline, and ship. That works — until it doesn't. Until your training job crashes at hour 6. Until your inference cache fills up and nobody knows why. Until a worker dies mid-processing and your embeddings are corrupted.

I wanted to understand what happens below the API layer. So I built the whole thing from scratch.

Over the past few months I built four interconnected systems that form a complete AI infrastructure stack:

VeriStore          → Storage layer (WAL, Raft, crash recovery)
      ↓
llm-serving-cache  → Inference serving (KV cache, GPU memory, routing)
      ↓
Veriflow           → Workload orchestration (training jobs, checkpoints, GPU scheduling)
      ↓
SmartSearch        → AI data pipeline (async ingestion, Kafka, RAG, fault tolerance)

Each layer builds on the one listed before it. Each solves a real problem I kept running into. And each taught me something I couldn't have learned from reading documentation.

GitHub: https://github.com/NasitSony/VeriStore

The first question I wanted to answer: how does data survive a crash?

Not "what does the documentation say" — but what actually happens at the byte level when a process dies mid-write.

VeriStore is a correctness-first key-value storage engine in C++, built from first principles around a write-ahead log (WAL), Raft replication, and crash recovery.

fsync is expensive, but skipping it is dangerous. Group commit is the right tradeoff — batch writes, fsync at boundaries. This is what RocksDB, PostgreSQL, and etcd all do.
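To make the group-commit tradeoff concrete, here is a minimal Python sketch. It is not VeriStore's code (VeriStore is C++); the class name `GroupCommitLog`, the `batch_size` parameter, and the `fsync_calls` counter are all made up for illustration. It batches appends and calls fsync once per batch instead of once per record.

```python
import os


class GroupCommitLog:
    """Append-only log that fsyncs once per batch (hypothetical sketch)."""

    def __init__(self, path, batch_size=8):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._batch_size = batch_size
        self._unsynced = 0       # records written since the last fsync
        self.fsync_calls = 0     # exposed only so the saving is visible

    def append(self, record: bytes) -> None:
        os.write(self._fd, record)
        self._unsynced += 1
        if self._unsynced >= self._batch_size:
            self.flush()         # batch boundary: make the whole batch durable

    def flush(self) -> None:
        if self._unsynced:
            os.fsync(self._fd)
            self.fsync_calls += 1
            self._unsynced = 0

    def close(self) -> None:
        self.flush()             # never leave a partial batch unsynced
        os.close(self._fd)
```

With a batch size of 8, twenty appends cost three fsync calls instead of twenty. A real engine would also hold back the acknowledgement to each writer until the fsync covering its record has returned; this sketch leaves that part out.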

The WAL commit point is everything. An object is valid only if its metadata is committed. This single rule makes crash recovery deterministic — you either have the commit record or you don't.
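The commit-point rule can be sketched in a few lines of Python. The record layout here (a length and CRC32 header followed by a `PUT` or `COMMIT` payload) is an assumption for illustration, not VeriStore's actual on-disk format. Recovery replays the log, stops at the first torn or corrupt record, and keeps only objects whose commit record survived.

```python
import struct
import zlib

# Hypothetical record layout: [length: 4 bytes][crc32: 4 bytes][payload]
# Payload is b"PUT <key> <value>" or b"COMMIT <key>".


def encode_record(payload: bytes) -> bytes:
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload


def recover(log: bytes) -> dict:
    """Replay a log and return only the objects that reached their commit point."""
    pending, committed = {}, {}
    offset = 0
    while offset + 8 <= len(log):
        length, crc = struct.unpack_from("<II", log, offset)
        payload = log[offset + 8 : offset + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: nothing after this point is trusted
        offset += 8 + length
        kind, _, rest = payload.partition(b" ")
        if kind == b"PUT":
            key, _, value = rest.partition(b" ")
            pending[key] = value
        elif kind == b"COMMIT" and rest in pending:
            committed[rest] = pending.pop(rest)
    return committed
```

A `PUT` without a matching `COMMIT`, or a `COMMIT` that was cut off mid-write, leaves the object out of the recovered state. That is the "you either have the commit record or you don't" rule in practice.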

Raft is simpler than it sounds, but the edge cases are brutal. Leader crash during log replication, follower log divergence, split-brain scenarios — each required careful handling.

GitHub: https://github.com/NasitSony/llm-serving-cache

LLM inference is expensive. The prefill step, which processes the prompt, is a major cost, and it is the part that caching can eliminate. If you've seen the same prompt before, you shouldn't have to recompute it.

llm-serving-cache is a control-plane service that tracks where cached attention prefixes live across distributed inference nodes and routes requests to maximize cache reuse.
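As a rough illustration of prefix-aware routing, the sketch below picks the node whose cached prefixes share the most leading tokens with the incoming prompt. The function names and the `node_prefixes` mapping are hypothetical, and it assumes a cached prefix can be reused up to the point where it diverges from the prompt. The real service may use a different lookup structure.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt_tokens, node_prefixes):
    """Return (node, reusable_tokens) for the node with the longest cached prefix.

    node_prefixes maps a node name to the token sequences cached on that node.
    Ties go to the node that sorts first, so routing is deterministic.
    """
    best_node, best_len = None, -1
    for node in sorted(node_prefixes):
        longest = max(
            (shared_prefix_len(prompt_tokens, p) for p in node_prefixes[node]),
            default=0,
        )
        if longest > best_len:
            best_node, best_len = node, longest
    return best_node, max(best_len, 0)
```

A production router would also weigh node load and free memory against cache reuse; this sketch only maximizes reuse.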

Key results from benchmarks:

Scenario	Avg Latency	Hit Rate
No Cache	1405 ms	0%
Prefix Reuse	985 ms	50%
Exact Cache	205 ms	100%
GPU-Aware	843 ms	25%

Exact cache reuse reduces latency by ~85% compared to no cache.

The system models GPU memory as discrete blocks and uses best-fit placement to minimize fragmentation. Under memory pressure, it evicts the oldest inactive requests and retries allocation before rejecting.
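Here is one way the block model could look, as a sketch under assumptions: a single GPU, contiguous allocations, and "best fit" meaning the smallest free run that is large enough. The class and method names are hypothetical, not the project's API.

```python
from collections import OrderedDict


class BlockAllocator:
    """Best-fit allocator over fixed-size blocks, evicting oldest inactive first."""

    def __init__(self, num_blocks):
        self.owner = [None] * num_blocks   # block index -> request id, None if free
        self.inactive = OrderedDict()      # evictable request ids, oldest first

    def _free_runs(self):
        """Return (start, length) for every maximal run of free blocks."""
        runs, start = [], None
        for i, owner in enumerate(self.owner + ["end"]):  # sentinel closes last run
            if owner is None and start is None:
                start = i
            elif owner is not None and start is not None:
                runs.append((start, i - start))
                start = None
        return runs

    def release(self, request_id):
        self.owner = [None if o == request_id else o for o in self.owner]
        self.inactive.pop(request_id, None)

    def mark_inactive(self, request_id):
        self.inactive[request_id] = True

    def allocate(self, request_id, n):
        """Return the start block, or None if the request must be rejected."""
        while True:
            fits = [run for run in self._free_runs() if run[1] >= n]
            if fits:
                start, _ = min(fits, key=lambda run: run[1])  # smallest run that fits
                self.owner[start:start + n] = [request_id] * n
                return start
            if not self.inactive:
                return None                                   # nothing left to evict
            self.release(next(iter(self.inactive)))           # evict oldest, then retry
```

Choosing the smallest run that fits leaves larger runs intact for larger requests, which is the fragmentation argument for best-fit.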

I also validated this against a real Ollama backend running Llama 3.1 8B.

Cache hits matter enormously for prefill, but decode dominates total latency. A warm request still takes ~5.5 seconds because token generation is slow regardless of caching. Real serving optimization needs to address decode efficiency too.

Admission control is more important than caching. Accepting all requests under load causes queue growth and latency explosion. Rejecting excess load with a hard concurrency limit keeps tail latency controlled.
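A hard concurrency limit is small enough to sketch directly. This is an illustrative version using a non-blocking semaphore; `AdmissionController` and `handle` are hypothetical names, and a real server would translate the rejection into something like an HTTP 429.

```python
import threading


class AdmissionController:
    """Reject work beyond a fixed number of in-flight requests."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        # Non-blocking: excess requests are rejected immediately, not queued.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(controller, work):
    """Run work() if a slot is free, otherwise reject without waiting."""
    if not controller.try_admit():
        return "rejected"
    try:
        return work()
    finally:
        controller.release()   # always free the slot, even if work() raises
```

Rejecting instead of queuing is the point: the queue never grows, so requests that are admitted see bounded waiting time.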

Single-request latency is misleading. At concurrency=10, P95 latency was 53.5 seconds — nearly 3× the single-request time. Production serving systems need batching, scheduling, and admission control, not just cache reuse.

GitHub: https://github.com/NasitSony/veriflow-control-plane

The pain that started everything: training jobs that crash at hour 6 with no checkpoint, no retry, and no idea why.

Veriflow is a Kubernetes-based job orchestrator that treats AI training as what it actually is — a distributed systems problem.

The key insight: checkpoints need to be first-class citizens.

Most job runners treat AI training like a simple script: run it, and if it fails, restart from zero. Veriflow models job lifecycle as a state machine with checkpoint-aware retry:

JOB_SUBMITTED → RUN_CREATED → POD_RUNNING
→ CHECKPOINT_SAVED            ← checkpoint URI persisted
→ RUN_FAILED                  ← something went wrong
→ RETRY_TRIGGERED             ← scheduler picks it up
→ TRAINING_RESUMED            ← resumes from checkpoint
→ JOB_SUCCEEDED
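The lifecycle above can be sketched as a transition table. The state names come from the diagram; which transitions are legal, the `Job` class, and the example checkpoint URI are my assumptions for illustration, not Veriflow's actual schema.

```python
# Assumed transition table; only the state names are taken from the diagram.
ALLOWED = {
    "JOB_SUBMITTED": {"RUN_CREATED"},
    "RUN_CREATED": {"POD_RUNNING"},
    "POD_RUNNING": {"CHECKPOINT_SAVED", "RUN_FAILED", "JOB_SUCCEEDED"},
    "CHECKPOINT_SAVED": {"CHECKPOINT_SAVED", "RUN_FAILED", "JOB_SUCCEEDED"},
    "RUN_FAILED": {"RETRY_TRIGGERED"},
    "RETRY_TRIGGERED": {"TRAINING_RESUMED"},
    "TRAINING_RESUMED": {"CHECKPOINT_SAVED", "RUN_FAILED", "JOB_SUCCEEDED"},
    "JOB_SUCCEEDED": set(),
}


class Job:
    def __init__(self):
        self.state = "JOB_SUBMITTED"
        self.checkpoint_uri = None   # last persisted checkpoint, if any

    def transition(self, new_state, checkpoint_uri=None):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "CHECKPOINT_SAVED":
            if not checkpoint_uri:
                raise ValueError("CHECKPOINT_SAVED requires a checkpoint URI")
            self.checkpoint_uri = checkpoint_uri
        self.state = new_state

    def resume_point(self):
        """URI to resume from after a retry; None means start from scratch."""
        return self.checkpoint_uri
```

Because the checkpoint URI is stored on the job itself, a retry can read `resume_point()` and continue from the last checkpoint instead of restarting at step zero.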

The scheduler uses FOR UPDATE SKIP LOCKED in Postgres for concurrency-safe job claiming. I tested it with two concurrent scheduler instances processing 20 burst-submitted jobs and saw zero duplicate dispatches.

GPU-aware placement matches jobs to nodes by GPU type, count, and memory requirements using best-fit allocation. Queue-level fairness and quota enforcement prevent one greedy queue from monopolizing the cluster.

FOR UPDATE SKIP LOCKED is underrated. Most people reach for Redis or a dedicated queue for concurrent job processing. Postgres with SKIP LOCKED handles it correctly — and you get transactions and consistency for free.
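A typical claim query looks roughly like the sketch below. The `jobs` table, its columns, and the status values are hypothetical, not Veriflow's schema, and the `%s` placeholder assumes a psycopg-style DB-API cursor.

```python
# Hypothetical schema: jobs(id, status, claimed_by, submitted_at).
CLAIM_SQL = """
UPDATE jobs
SET status = 'RUNNING', claimed_by = %s
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'QUEUED'
    ORDER BY submitted_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_job(cursor, scheduler_id):
    """Claim one queued job inside the caller's transaction.

    Rows locked by another scheduler are skipped rather than waited on, so two
    schedulers running this concurrently should not claim the same job.
    Returns the job id, or None if nothing is claimable.
    """
    cursor.execute(CLAIM_SQL, (scheduler_id,))
    row = cursor.fetchone()
    return row[0] if row else None
```

The claim and the status update happen in one statement, so committing the transaction publishes both at once, and a rollback returns the job to the queue.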

The scheduler is a control plane, not a cron job. A cron fires and forgets. A control plane continuously reconciles desired state with actual state. This distinction is what makes checkpoint-aware recovery possible.
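The difference from a cron job shows up in the shape of the loop. Below is a minimal sketch of one reconcile pass. The state strings and action names are invented for illustration; a real control plane would run this repeatedly against the database and the Kubernetes API.

```python
def reconcile(desired, actual):
    """Compare desired and actual job states and return the actions to take.

    desired and actual map job id -> state string. The function is pure, so it
    can run on every tick; once the two agree it returns no actions.
    """
    actions = []
    for job_id in sorted(desired):
        want, have = desired[job_id], actual.get(job_id)
        if want == "RUNNING" and have in (None, "FAILED"):
            actions.append(("start_or_resume", job_id))
        elif want == "STOPPED" and have == "RUNNING":
            actions.append(("stop", job_id))
    for job_id in sorted(set(actual) - set(desired)):
        actions.append(("garbage_collect", job_id))   # running but no longer wanted
    return actions
```

A failed job shows up as a gap between desired and actual on the next pass, which is what triggers the checkpoint-aware resume.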

Checkpoint URIs should be in your job spec from day one. Treating them as an afterthought means you'll always restart from scratch when things go wrong.

GitHub: https://github.com/NasitSony/SmartSearch

Most RAG demos show the happy path: ingest document, generate embeddings, search, return results.

SmartSearch asks a different question: what happens when things fail?

The system is built to handle failure scenarios deterministically. Every document moves through an explicit lifecycle with no hidden progress:

PENDING → PROCESSING → READY | FAILED

At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but complex. At-least-once delivery with idempotent writes yields the same end state with far less operational complexity.
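To show why redelivery is harmless when writes are idempotent, here is a small sketch. The `EmbeddingStore` class, the `(doc_id, version)` key, and the message fields are assumptions for illustration; SmartSearch's actual storage and Kafka consumer code are not shown in this post.

```python
class EmbeddingStore:
    """In-memory stand-in for a store with a unique key on (doc_id, version)."""

    def __init__(self):
        self.rows = {}
        self.writes = 0   # counts writes that actually changed state

    def upsert(self, doc_id, version, embedding):
        """Idempotent write: applying the same message twice changes nothing."""
        key = (doc_id, version)
        if key in self.rows:
            return False          # duplicate delivery, already applied
        self.rows[key] = embedding
        self.writes += 1
        return True


def consume(messages, store):
    """Process messages that may be redelivered (at-least-once delivery)."""
    for msg in messages:
        store.upsert(msg["doc_id"], msg["version"], msg["embedding"])
```

If a worker dies after writing but before acknowledging, the broker redelivers the message and the second `upsert` is a no-op, so the store ends up in the same state as a single delivery would have produced.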

Processing age is the most important metric nobody talks about. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age means your pipeline is falling behind — before latency spikes make it obvious.
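One simple way to compute this metric, assuming processing age means the age of the oldest document that has not yet reached READY or FAILED (the field names are hypothetical):

```python
UNFINISHED = ("PENDING", "PROCESSING")


def processing_age(docs, now):
    """Seconds since the oldest unfinished document was submitted.

    docs is an iterable of dicts with 'state' and 'submitted_at' (epoch seconds).
    Returns 0.0 when nothing is waiting.
    """
    waiting = [d["submitted_at"] for d in docs if d["state"] in UNFINISHED]
    return now - min(waiting) if waiting else 0.0
```

Sampled periodically, a value that keeps climbing means the oldest item is not draining, even if the documents that do complete are still fast.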

The visibility invariant matters. A document is searchable if and only if its state is READY. This single rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.
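The invariant is easy to enforce if READY is the last thing written and search filters on it. A toy sketch follows, with a hypothetical `DocumentIndex` class and substring matching standing in for vector search:

```python
class DocumentIndex:
    """Toy index where search only ever sees documents in state READY."""

    def __init__(self):
        self.docs = {}

    def ingest(self, doc_id, text):
        self.docs[doc_id] = {"text": text, "state": "PENDING", "embedding": None}

    def start(self, doc_id):
        self.docs[doc_id]["state"] = "PROCESSING"

    def finish(self, doc_id, embedding):
        doc = self.docs[doc_id]
        doc["embedding"] = embedding   # write the data first
        doc["state"] = "READY"         # flip visibility last

    def fail(self, doc_id):
        doc = self.docs[doc_id]
        doc["embedding"] = None        # discard partial work
        doc["state"] = "FAILED"

    def search(self, term):
        return sorted(
            doc_id
            for doc_id, doc in self.docs.items()
            if doc["state"] == "READY" and term in doc["text"]
        )
```

A worker that dies between `start` and `finish` leaves the document in PROCESSING, so it never appears in results with a missing or partial embedding.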

Building these four systems taught me something that documentation never could:

Every layer of the AI stack is a distributed systems problem.

Most AI engineers work at the top of this stack and treat the layers below as black boxes. That works until scale, failure, or cost forces the question: what's actually happening down there?

Understanding these layers doesn't just make you a better infrastructure engineer. It makes you better at every layer above — because you understand what guarantees you can actually rely on, and what you need to handle yourself.

If you found this useful, all four repos are on GitHub. Stars and feedback welcome!

Source: original article on dev.to