How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

A developer built Sentinel, a real-time fraud detection system that processes 71,000 requests per second with p95 latency under 6 milliseconds. The system handles 7.8 million requests with zero errors by serving XGBoost machine learning inference through Go and ONNX Runtime instead of Python, achieving a 99% cache hit rate through an LRU cache that saves approximately 200 microseconds per decision.

A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go. Fraud detection is a classic hard problem in systems design. You need to: I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me. Transaction Request │ ▼ Go HTTP Server │ ▼ LRU Cache ──── Cache Hit? ──► Return Result < 1ms │ Cache Miss │ ▼ ONNX Runtime │ ▼ XGBoost Model │ ▼ Fraud Score + Decision │ ▼ Prometheus Metrics The key insight: serve ML inference from Go, not Python . I trained XGBoost on the Kaggle Credit Card Fraud Dataset https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud — 284,807 transactions, heavily imbalanced only 0.17% fraud . python import xgboost as xgb from sklearn.model selection import train test split from sklearn.metrics import precision recall curve, auc Handle class imbalance scale pos weight = len y train - y train.sum / y train.sum model = xgb.XGBClassifier n estimators=300, max depth=6, learning rate=0.05, scale pos weight=scale pos weight, eval metric='aucpr', use label encoder=False model.fit X train, y train, eval set= X val, y val , early stopping rounds=20, verbose=False Results: Why PR-AUC over ROC-AUC? Because with imbalanced datasets, ROC-AUC is misleading. PR-AUC punishes you for missing fraud cases. Here's where it gets interesting. Python inference is slow. I needed Go-level performance. python from skl2onnx import convert sklearn from skl2onnx.common.data types import FloatTensorType Export XGBoost → ONNX initial type = 'float input', FloatTensorType None, X train.shape 1 onnx model = convert sklearn model, initial types=initial type with open "fraud model.onnx", "wb" as f: f.write onnx model.SerializeToString ONNX is a universal model format. Once exported, I can serve it from any language — in this case, Go using the onnxruntime-go binding. type InferenceEngine struct { session onnxruntime.Session mu sync.RWMutex } func e InferenceEngine Predict features float32 float64, error { e.mu.RLock defer e.mu.RUnlock input := onnxruntime.NewTensor features outputs, err := e.session.Run input if err = nil { return 0, err } score := outputs 0 .GetData . float32 0 return float64 score , nil } The Go HTTP handler: func h Handler PredictHandler w http.ResponseWriter, r http.Request { var req TransactionRequest if err := json.NewDecoder r.Body .Decode &req ; err = nil { http.Error w, "bad request", 400 return } // Check LRU cache first cacheKey := req.Hash if cached, ok := h.cache.Get cacheKey ; ok { json.NewEncoder w .Encode cached return } // Run ONNX inference features := req.ToFeatures score, err := h.engine.Predict features if err = nil { http.Error w, "inference error", 500 return } result := &PredictionResult{ Score: score, IsFraud: score 0.5, Timestamp: time.Now .UnixNano , } h.cache.Set cacheKey, result json.NewEncoder w .Encode result } The LRU cache was the single biggest performance win. type LRUCache struct { capacity int mu sync.RWMutex items map string list.Element list list.List } func c LRUCache Get key string interface{}, bool { c.mu.RLock defer c.mu.RUnlock if elem, ok := c.items key ; ok { c.list.MoveToFront elem return elem.Value. entry .value, true } return nil, false } Result: 99% cache hit rate, saving ~200μs per decision. In a high-throughput system, 200μs × 71,000 = 14.2 seconds saved per second . That's the compounding power of caching. The hardest part. How do you update an ML model without taking the server down? type ModelManager struct { engine atomic.Pointer InferenceEngine } func m ModelManager HotSwap newModel byte error { newEngine, err := NewInferenceEngine newModel if err = nil { return err } // Atomic swap — zero downtime m.engine.Store newEngine return nil } func m ModelManager GetEngine InferenceEngine { return m.engine.Load } atomic.Pointer from Go 1.19 makes this trivial. No locks. No downtime. The old engine gets garbage collected after the swap. Once you can hot-swap models, A/B testing becomes easy: func h Handler routeRequest r http.Request InferenceEngine { // Hash user ID for consistent routing hash := fnv32 r.Header.Get "X-User-ID" if hash%100 < h.config.ModelBPercentage { return h.modelB.Load } return h.modelA.Load } This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift. Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection: js func d DriftDetector Check features float32 bool { var drift float64 for i, f := range features { deviation := math.Abs float64 f - d.baseline i .Mean normalized := deviation / d.baseline i .Std + 1e-8 drift += normalized } drift /= float64 len features // Alert if drift exceeds threshold return drift d.threshold // threshold: 5e-7 } Load tested with k6 — 200 concurrent VUs, 60 second duration: scenarios: 100.00% 1 scenario, 200 max VUs default: 200 looping VUs for 60s ✓ http req duration.............: avg=4.2ms p 95 =5.8ms ✓ http req failed...............: 0.00% ✓ iterations....................: 4,276,440 ✓ vus...........................: 200 Throughput: 71,274 req/s 71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests. 1. Language choice matters for inference. Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds. 2. Cache aggressively. 99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model. 3. Atomic operations locks for hot paths. atomic.Pointer for model swapping means zero contention on the critical path. 4. Design for deployability from day one. Zero-downtime deploys aren't an afterthought — they're a core architectural requirement. 5. Monitor everything. Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it. github.com/sameer-sde/sentinel https://github.com/sameer-sde/sentinel If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.