{"slug": "how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms", "title": "How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms", "summary": "A developer built Sentinel, a real-time fraud detection system that processes 71,000 requests per second with p95 latency under 6 milliseconds. The system handles 7.8 million requests with zero errors by serving XGBoost machine learning inference through Go and ONNX Runtime instead of Python, achieving a 99% cache hit rate through an LRU cache that saves approximately 200 microseconds per decision.", "body_md": "A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.\n\nFraud detection is a classic hard problem in systems design. You need to:\n\nI built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.\n\n```\nTransaction Request\n        │\n        ▼\n   Go HTTP Server\n        │\n        ▼\n   LRU Cache ──── Cache Hit? 
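──► Return Result (< 1ms)\n        │\n     Cache Miss\n        │\n        ▼\n   ONNX Runtime\n        │\n        ▼\n  XGBoost Model\n        │\n        ▼\n  Fraud Score + Decision\n        │\n        ▼\n  Prometheus Metrics\n```\n\nThe key insight: **serve ML inference from Go, not Python**.\n\nI trained XGBoost on the [Kaggle Credit Card Fraud Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud): 284,807 transactions, heavily imbalanced (only 0.17% fraud).\n\n``` python\nimport xgboost as xgb\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import precision_recall_curve, auc\n\n# X_train, y_train, X_val, y_val are the train/validation splits (creation not shown)\n\n# Handle class imbalance: weight positives by the negative/positive ratio\nscale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()\n\nmodel = xgb.XGBClassifier(\n    n_estimators=300,\n    max_depth=6,\n    learning_rate=0.05,\n    scale_pos_weight=scale_pos_weight,\n    eval_metric='aucpr',\n    # Set here because recent XGBoost releases no longer accept it in fit()\n    early_stopping_rounds=20,\n)\n\nmodel.fit(X_train, y_train,\n    eval_set=[(X_val, y_val)],\n    verbose=False\n)\n```\n\n**Results:**\n\nWhy PR-AUC over ROC-AUC? With only 0.17% positives, ROC-AUC can look excellent even when most fraud alerts are false alarms, because the false positive rate is diluted by the huge number of legitimate transactions. PR-AUC is computed from precision and recall on the fraud class alone, so both missed fraud and false alarms pull it down.\n\nHere's where it gets interesting. Per-request inference from a Python server was too slow for my latency target. I needed Go-level performance.\n\n``` python\nimport xgboost as xgb\nfrom onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost\nfrom skl2onnx import convert_sklearn, update_registered_converter\nfrom skl2onnx.common.data_types import FloatTensorType\nfrom skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes\n\n# skl2onnx has no built-in XGBoost converter, so register the one from onnxmltools\nupdate_registered_converter(\n    xgb.XGBClassifier, 'XGBoostXGBClassifier',\n    calculate_linear_classifier_output_shapes, convert_xgboost,\n    options={'nocl': [True, False], 'zipmap': [True, False, 'columns']},\n)\n\n# Export XGBoost → ONNX\ninitial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]\nonnx_model = convert_sklearn(model, initial_types=initial_type)\nwith open(\"fraud_model.onnx\", \"wb\") as f:\n    f.write(onnx_model.SerializeToString())\n```\n\nONNX is a universal model format. 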
Once exported, I can serve it from **any language** that ONNX Runtime supports; in this case, Go via the `onnxruntime-go` binding.\n\n``` go\ntype InferenceEngine struct {\n    session  *onnxruntime.Session\n    mu       sync.RWMutex\n}\n\n// Predict scores one feature vector. The read lock lets many requests\n// share the session concurrently.\nfunc (e *InferenceEngine) Predict(features []float32) (float64, error) {\n    e.mu.RLock()\n    defer e.mu.RUnlock()\n\n    input := onnxruntime.NewTensor(features)\n    outputs, err := e.session.Run(input)\n    if err != nil {\n        return 0, err\n    }\n\n    score := outputs[0].GetData().([]float32)[0]\n    return float64(score), nil\n}\n```\n\nThe Go HTTP handler:\n\n``` go\nfunc (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {\n    var req TransactionRequest\n    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {\n        http.Error(w, \"bad request\", http.StatusBadRequest)\n        return\n    }\n\n    // Check LRU cache first\n    cacheKey := req.Hash()\n    if cached, ok := h.cache.Get(cacheKey); ok {\n        json.NewEncoder(w).Encode(cached)\n        return\n    }\n\n    // Cache miss: run ONNX inference\n    features := req.ToFeatures()\n    score, err := h.engine.Predict(features)\n    if err != nil {\n        http.Error(w, \"inference error\", http.StatusInternalServerError)\n        return\n    }\n\n    result := &PredictionResult{\n        Score:     score,\n        IsFraud:   score > 0.5,\n        Timestamp: time.Now().UnixNano(),\n    }\n\n    // Store the decision so identical requests skip the model next time\n    h.cache.Set(cacheKey, result)\n    json.NewEncoder(w).Encode(result)\n}\n```\n\nThe LRU cache was the single biggest performance win.\n\n``` go\ntype LRUCache struct {\n    capacity int\n    mu       sync.RWMutex\n    items    map[string]*list.Element\n    list     *list.List\n}\n\nfunc (c *LRUCache) Get(key string) (interface{}, bool) {\n    // MoveToFront mutates the list, so Get needs the write lock,\n    // not a read lock; concurrent readers would otherwise race.\n    c.mu.Lock()\n    defer c.mu.Unlock()\n\n    if elem, ok := c.items[key]; ok {\n        c.list.MoveToFront(elem)\n        return elem.Value.(*entry).value, true\n    }\n    return nil, false\n}\n```\n\n**Result: 99% cache hit rate, saving ~200μs per decision.**\n\nIn a high-throughput 
system, 200μs × 71,000 requests per second = **14.2 seconds of inference time avoided every second** (slightly less after discounting the 1% of requests that miss). That's the compounding power of caching.\n\nThe hardest part: how do you update an ML model without taking the server down?\n\n``` go\ntype ModelManager struct {\n    engine atomic.Pointer[InferenceEngine] // from sync/atomic, Go 1.19+\n}\n\nfunc (m *ModelManager) HotSwap(newModel []byte) error {\n    // Build the new engine fully before publishing it\n    newEngine, err := NewInferenceEngine(newModel)\n    if err != nil {\n        return err // the old engine keeps serving\n    }\n\n    // Atomic swap: zero downtime\n    m.engine.Store(newEngine)\n    return nil\n}\n\nfunc (m *ModelManager) GetEngine() *InferenceEngine {\n    return m.engine.Load()\n}\n```\n\n`atomic.Pointer` from Go 1.19 makes this trivial. No locks. No downtime. Requests that already loaded the old engine finish on it, and the garbage collector reclaims it once the last reference is dropped.\n\nOnce you can hot-swap models, A/B testing becomes easy:\n\n``` go\nfunc (h *Handler) routeRequest(r *http.Request) *InferenceEngine {\n    // Hash user ID for consistent routing: the same user always lands in the same bucket\n    hash := fnv32(r.Header.Get(\"X-User-ID\"))\n    if hash%100 < h.config.ModelBPercentage {\n        return h.modelB.Load()\n    }\n    return h.modelA.Load()\n}\n```\n\nThis lets me gradually roll out new models (5% → 20% → 50% → 100%) while monitoring Prometheus metrics for drift.\n\nModel drift is when real-world data shifts away from your training distribution. 
I implemented lightweight drift detection:\n\n``` go\n// Check reports whether the mean absolute z-score of the features,\n// measured against the training baseline, exceeds the threshold.\nfunc (d *DriftDetector) Check(features []float32) bool {\n    if len(features) == 0 {\n        return false // nothing to compare; also avoids 0/0 below\n    }\n    var drift float64\n    for i, f := range features {\n        deviation := math.Abs(float64(f) - d.baseline[i].Mean)\n        normalized := deviation / (d.baseline[i].Std + 1e-8)\n        drift += normalized\n    }\n    drift /= float64(len(features))\n\n    // Alert if drift exceeds threshold\n    return drift > d.threshold // threshold: 5e-7\n}\n```\n\nLoad tested with k6 at 200 concurrent VUs for 60 seconds:\n\n```\nscenarios: (100.00%) 1 scenario, 200 max VUs\n  default: 200 looping VUs for 60s\n\n✓ http_req_duration.............: avg=4.2ms  p(95)=5.8ms\n✓ http_req_failed...............: 0.00%\n✓ iterations....................: 4,276,440\n✓ vus...........................: 200\n\nThroughput: 71,274 req/s\n```\n\n**71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.**\n\n**1. Language choice matters for inference.**\n\nPython is great for training. Go is great for serving. ONNX bridges the gap, so you get the best of both worlds.\n\n**2. Cache aggressively.**\n\n99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.\n\n**3. Atomic operations > locks for hot paths.**\n\n`atomic.Pointer` for model swapping means zero contention on the critical path.\n\n**4. Design for deployability from day one.**\n\nZero-downtime deploys aren't an afterthought; they're a core architectural requirement.\n\n**5. Monitor everything.**\n\nPrometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.\n\n[github.com/sameer-sde/sentinel](https://github.com/sameer-sde/sentinel)\n\n*If you found this useful, drop a ❤️ and follow for more systems engineering content. 
I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.*", "url": "https://wpnews.pro/news/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms", "canonical_source": "https://dev.to/sameer_ahmed_/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms-205k", "published_at": "2026-06-03 02:17:08+00:00", "updated_at": "2026-06-03 02:42:17.354785+00:00", "lang": "en", "topics": ["machine-learning", "mlops", "ai-infrastructure", "ai-products", "artificial-intelligence"], "entities": ["Sentinel", "XGBoost", "ONNX", "Go", "Kaggle", "Prometheus", "Credit Card Fraud Dataset"], "alternates": {"html": "https://wpnews.pro/news/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms", "markdown": "https://wpnews.pro/news/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms.md", "text": "https://wpnews.pro/news/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms.txt", "jsonld": "https://wpnews.pro/news/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms.jsonld"}}