How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

wpnews.pro

A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.

Fraud detection is a classic hard problem in systems design. You need to:

I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.

Transaction Request
        │
        ▼
   Go HTTP Server
        │
        ▼
   LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
        │
     Cache Miss
        │
        ▼
   ONNX Runtime
        │
        ▼
  XGBoost Model
        │
        ▼
  Fraud Score + Decision
        │
        ▼
  Prometheus Metrics

The key insight: serve ML inference from Go, not Python.

I trained XGBoost on the Kaggle Credit Card Fraud Dataset — 284,807 transactions, heavily imbalanced (only 0.17% fraud).

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc

scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    eval_metric='aucpr',
    use_label_encoder=False
)

model.fit(X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=20,
    verbose=False
)

Results:

Why PR-AUC over ROC-AUC? Because with imbalanced datasets, ROC-AUC is misleading. PR-AUC punishes you for missing fraud cases.

Here's where it gets interesting. Python inference is slow. I needed Go-level performance.

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

ONNX is a universal model format. Once exported, I can serve it from any language — in this case, Go using the onnxruntime-go

binding.

type InferenceEngine struct {
    session  *onnxruntime.Session
    mu       sync.RWMutex
}

func (e *InferenceEngine) Predict(features []float32) (float64, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()

    input := onnxruntime.NewTensor(features)
    outputs, err := e.session.Run(input)
    if err != nil {
        return 0, err
    }

    score := outputs[0].GetData().([]float32)[0]
    return float64(score), nil
}

The Go HTTP handler:

func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
    var req TransactionRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", 400)
        return
    }

    // Check LRU cache first
    cacheKey := req.Hash()
    if cached, ok := h.cache.Get(cacheKey); ok {
        json.NewEncoder(w).Encode(cached)
        return
    }

    // Run ONNX inference
    features := req.ToFeatures()
    score, err := h.engine.Predict(features)
    if err != nil {
        http.Error(w, "inference error", 500)
        return
    }

    result := &PredictionResult{
        Score:     score,
        IsFraud:   score > 0.5,
        Timestamp: time.Now().UnixNano(),
    }

    h.cache.Set(cacheKey, result)
    json.NewEncoder(w).Encode(result)
}

The LRU cache was the single biggest performance win.

type LRUCache struct {
    capacity int
    mu       sync.RWMutex
    items    map[string]*list.Element
    list     *list.List
}

func (c *LRUCache) Get(key string) (interface{}, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if elem, ok := c.items[key]; ok {
        c.list.MoveToFront(elem)
        return elem.Value.(*entry).value, true
    }
    return nil, false
}

Result: 99% cache hit rate, saving ~200μs per decision.

In a high-throughput system, 200μs × 71,000 = 14.2 seconds saved per second. That's the compounding power of caching.

The hardest part. How do you update an ML model without taking the server down?

type ModelManager struct {
    engine atomic.Pointer[InferenceEngine]
}

func (m *ModelManager) HotSwap(newModel []byte) error {
    newEngine, err := NewInferenceEngine(newModel)
    if err != nil {
        return err
    }

    // Atomic swap — zero downtime
    m.engine.Store(newEngine)
    return nil
}

func (m *ModelManager) GetEngine() *InferenceEngine {
    return m.engine.Load()
}

atomic.Pointer

from Go 1.19 makes this trivial. No locks. No downtime. The old engine gets garbage collected after the swap.

Once you can hot-swap models, A/B testing becomes easy:

func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
    // Hash user ID for consistent routing
    hash := fnv32(r.Header.Get("X-User-ID"))
    if hash%100 < h.config.ModelBPercentage {
        return h.modelB.Load()
    }
    return h.modelA.Load()
}

This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.

Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:

func (d *DriftDetector) Check(features []float32) bool {
    var drift float64
    for i, f := range features {
        deviation := math.Abs(float64(f) - d.baseline[i].Mean)
        normalized := deviation / (d.baseline[i].Std + 1e-8)
        drift += normalized
    }
    drift /= float64(len(features))

    // Alert if drift exceeds threshold
    return drift > d.threshold // threshold: 5e-7
}

Load tested with k6 — 200 concurrent VUs, 60 second duration:

scenarios: (100.00%) 1 scenario, 200 max VUs
  default: 200 looping VUs for 60s

✓ http_req_duration.............: avg=4.2ms  p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200

Throughput: 71,274 req/s

71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.

1. Language choice matters for inference.

Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.

2. Cache aggressively.

99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.

3. Atomic operations > locks for hot paths.

atomic.Pointer

for model swapping means zero contention on the critical path.

4. Design for deployability from day one.

Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.

5. Monitor everything.

Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.

github.com/sameer-sde/sentinel

If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.

source & further reading

dev.to — original article Your PDFs Are Eating Your LLM's Tokens for Breakfast Hardcoded Secrets: Why AI Code Fails Your First SOC 2 Audit Starting Google's 5-Day AI Vibe Coding Challenge 🚀

How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

Run your AI side-project on zahid.host