[ARTICLE · art-19812] src=dev.to pub= topic=machine-learning verified=true sentiment=↑ positive

How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

A developer built Sentinel, a real-time fraud detection system that processes 71,000 requests per second with p95 latency under 6 milliseconds. The system handles 7.8 million requests with zero errors by serving XGBoost machine learning inference through Go and ONNX Runtime instead of Python, achieving a 99% cache hit rate through an LRU cache that saves approximately 200 microseconds per decision.

4 min read · Published Jun 3, 2026

A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.

Fraud detection is a classic hard problem in systems design. You need to:

I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.

Transaction Request
        │
        ▼
   Go HTTP Server
        │
        ▼
   LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
        │
     Cache Miss
        │
        ▼
   ONNX Runtime
        │
        ▼
  XGBoost Model
        │
        ▼
  Fraud Score + Decision
        │
        ▼
  Prometheus Metrics

The key insight: serve ML inference from Go, not Python.

I trained XGBoost on the Kaggle Credit Card Fraud Dataset — 284,807 transactions, heavily imbalanced (only 0.17% fraud).

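import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc

# Assumption: the Kaggle CSV is saved locally as creditcard.csv, label column "Class"
df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"]).to_numpy(dtype="float32")
y = df["Class"].to_numpy()
# Assumption: an 80/20 stratified split (the original split is not shown)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Weight fraud cases by the negative-to-positive ratio to offset the imbalance
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    eval_metric='aucpr',
    early_stopping_rounds=20,  # constructor argument since XGBoost 1.6
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# PR-AUC on the validation split
precision, recall, _ = precision_recall_curve(y_val, model.predict_proba(X_val)[:, 1])
print(f"PR-AUC: {auc(recall, precision):.4f}")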

Results:

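Why PR-AUC over ROC-AUC? With only 0.17% fraud, ROC-AUC can look excellent while most alerts are false alarms, because the false positive rate is diluted by the huge number of legitimate transactions. PR-AUC looks only at the fraud class, so it drops when the model misses fraud cases or buries them in false positives.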
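To make the imbalance argument concrete, here is a small worked example in Go. The operating point (80% recall at a 1% false positive rate) is a hypothetical chosen for illustration, not a result from Sentinel; the dataset size and the rounded 0.17% fraud rate come from the article.

```go
package main

import (
	"fmt"
	"math"
)

// Confusion holds the four confusion-matrix counts for one operating point.
type Confusion struct{ TP, FP, FN, TN float64 }

// atOperatingPoint derives counts from a dataset size, a fraud rate, and a
// hypothetical recall and false positive rate.
func atOperatingPoint(total, fraudRate, recall, fpr float64) Confusion {
	pos := math.Round(total * fraudRate)
	neg := total - pos
	tp := math.Round(pos * recall)
	fp := math.Round(neg * fpr)
	return Confusion{TP: tp, FP: fp, FN: pos - tp, TN: neg - fp}
}

func (c Confusion) Precision() float64 { return c.TP / (c.TP + c.FP) }
func (c Confusion) Recall() float64    { return c.TP / (c.TP + c.FN) }
func (c Confusion) FPR() float64       { return c.FP / (c.FP + c.TN) }

func main() {
	// 284,807 transactions at 0.17% fraud; recall and FPR are assumed values.
	c := atOperatingPoint(284807, 0.0017, 0.80, 0.01)
	fmt.Printf("TP=%.0f FP=%.0f precision=%.3f recall=%.3f FPR=%.4f\n",
		c.TP, c.FP, c.Precision(), c.Recall(), c.FPR())
}
```

At this assumed operating point the ROC axes look healthy (80% recall, 1% FPR), yet precision is only about 12%: roughly seven of every eight alerts are false alarms. The precision-recall curve exposes that; the ROC curve does not.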

Here's where it gets interesting. Python inference is slow. I needed Go-level performance.

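from skl2onnx import convert_sklearn, update_registered_converter
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost

# skl2onnx does not ship an XGBoost converter, so register the onnxmltools one first
update_registered_converter(xgb.XGBClassifier, "XGBoostXGBClassifier",
    calculate_linear_classifier_output_shapes, convert_xgboost,
    options={"nocl": [True, False], "zipmap": [True, False, "columns"]})

initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())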

ONNX is a universal model format. Once exported, I can serve it from any language, in this case Go using the onnxruntime-go binding.

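// InferenceEngine wraps one ONNX Runtime session.
// The onnxruntime calls are kept as written in the post; exact names and
// signatures depend on the Go binding and version you use.
type InferenceEngine struct {
    session  *onnxruntime.Session
    mu       sync.RWMutex
}

// Predict scores a single feature vector and returns the fraud score.
func (e *InferenceEngine) Predict(features []float32) (float64, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()

    input := onnxruntime.NewTensor(features)
    outputs, err := e.session.Run(input)
    if err != nil {
        return 0, err
    }
    if len(outputs) == 0 {
        return 0, errors.New("model returned no outputs")
    }

    // Checked assertion: a bare assertion or index would panic on an unexpected output.
    // Which output holds the score depends on how the model was exported.
    scores, ok := outputs[0].GetData().([]float32)
    if !ok || len(scores) == 0 {
        return 0, errors.New("unexpected model output type")
    }
    return float64(scores[0]), nil
}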

The Go HTTP handler:

func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
    var req TransactionRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }
    // http.Error overrides this header on the error path below
    w.Header().Set("Content-Type", "application/json")

    // Check LRU cache first
    cacheKey := req.Hash()
    if cached, ok := h.cache.Get(cacheKey); ok {
        json.NewEncoder(w).Encode(cached)
        return
    }

    // Run ONNX inference
    features := req.ToFeatures()
    score, err := h.engine.Predict(features)
    if err != nil {
        http.Error(w, "inference error", http.StatusInternalServerError)
        return
    }

    result := &PredictionResult{
        Score:     score,
        IsFraud:   score > 0.5,
        Timestamp: time.Now().UnixNano(),
    }

    // Store the decision so identical requests skip inference next time
    h.cache.Set(cacheKey, result)
    json.NewEncoder(w).Encode(result)
}
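The handler relies on req.Hash() and req.ToFeatures(), which the post does not show. A minimal sketch of what they could look like follows. The field layout is an assumption based on the Kaggle dataset's columns (Time, V1 to V28, Amount); the JSON tags and the choice of FNV-1a are hypothetical, not taken from the Sentinel source.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
	"strconv"
)

// TransactionRequest is a hypothetical request shape mirroring the Kaggle
// columns: Time, V1..V28, Amount. The real Sentinel type may differ.
type TransactionRequest struct {
	Time   float32     `json:"time"`
	V      [28]float32 `json:"v"`
	Amount float32     `json:"amount"`
}

// ToFeatures flattens the request into the 30-value vector the model expects.
func (r *TransactionRequest) ToFeatures() []float32 {
	features := make([]float32, 0, 30)
	features = append(features, r.Time)
	features = append(features, r.V[:]...)
	features = append(features, r.Amount)
	return features
}

// Hash builds a cache key from the exact bit pattern of every feature.
func (r *TransactionRequest) Hash() string {
	h := fnv.New64a()
	var buf [4]byte
	for _, f := range r.ToFeatures() {
		binary.LittleEndian.PutUint32(buf[:], math.Float32bits(f))
		h.Write(buf[:]) // hash.Hash.Write never returns an error
	}
	return strconv.FormatUint(h.Sum64(), 16)
}

func main() {
	a := TransactionRequest{Time: 10, Amount: 99.5}
	b := a
	b.Amount = 99.6
	fmt.Println(len(a.ToFeatures()), a.Hash() == a.Hash(), a.Hash() == b.Hash())
}
```

A key built this way only hits when an identical feature vector has been seen before, so the hit rate depends on how often the traffic repeats itself. A 64-bit hash can also collide in principle; storing the features alongside the cached result and comparing on a hit is one way to rule that out.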

The LRU cache was the single biggest performance win.

type LRUCache struct {
    capacity int
    mu       sync.RWMutex
    items    map[string]*list.Element
    list     *list.List
}

func (c *LRUCache) Get(key string) (interface{}, bool) {
    // MoveToFront mutates the list, so Get needs the write lock, not RLock
    c.mu.Lock()
    defer c.mu.Unlock()

    if elem, ok := c.items[key]; ok {
        c.list.MoveToFront(elem)
        return elem.Value.(*entry).value, true
    }
    return nil, false
}
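The post shows Get but not the Set the handler calls, nor the eviction step that makes the cache "least recently used". Here is a self-contained sketch of both, assuming an entry type that stores the key next to the value. NewLRUCache is a hypothetical constructor, and this version uses a plain sync.Mutex because every operation reorders the list.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry keeps the key with the value so eviction can delete the map slot.
type entry struct {
	key   string
	value interface{}
}

type LRUCache struct {
	capacity int
	mu       sync.Mutex
	items    map[string]*list.Element
	list     *list.List // front = most recently used
}

func NewLRUCache(capacity int) *LRUCache {
	return &LRUCache{
		capacity: capacity,
		items:    make(map[string]*list.Element),
		list:     list.New(),
	}
}

func (c *LRUCache) Get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if elem, ok := c.items[key]; ok {
		c.list.MoveToFront(elem)
		return elem.Value.(*entry).value, true
	}
	return nil, false
}

// Set inserts or updates a key and evicts the least recently used entry
// once the cache grows past its capacity.
func (c *LRUCache) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if elem, ok := c.items[key]; ok {
		elem.Value.(*entry).value = value
		c.list.MoveToFront(elem)
		return
	}
	c.items[key] = c.list.PushFront(&entry{key: key, value: value})
	if c.list.Len() > c.capacity {
		oldest := c.list.Back()
		c.list.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := NewLRUCache(2)
	c.Set("a", 1)
	c.Set("b", 2)
	c.Get("a")    // "a" becomes the most recently used entry
	c.Set("c", 3) // over capacity: evicts "b", the least recently used
	_, okA := c.Get("a")
	_, okB := c.Get("b")
	fmt.Println(okA, okB) // true false
}
```

A single mutex serializes all cache access. If that ever shows up in profiles, sharding the cache by key hash is a common next step.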

Result: 99% cache hit rate, saving ~200μs per decision.

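In a high-throughput system that adds up: 200μs × 71,000 requests/s ≈ 14.2 seconds of inference work avoided for every second of wall-clock time (about 14.1 seconds once you count only the 99% of requests that hit the cache). That's the compounding power of caching.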
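The arithmetic behind that figure can be written as a short Go helper. The function name is hypothetical; the inputs are the numbers quoted in the article.

```go
package main

import "fmt"

// savedSecondsPerSecond returns how many seconds of inference work are
// avoided per wall-clock second, given the request rate, the cache hit rate,
// and the time saved on each hit in microseconds.
func savedSecondsPerSecond(rps, hitRate, savedMicros float64) float64 {
	return rps * hitRate * savedMicros / 1e6
}

func main() {
	// Every request counted as a hit: the article's headline figure.
	fmt.Printf("%.2f\n", savedSecondsPerSecond(71000, 1.00, 200))
	// Only the 99% of requests that actually hit the cache.
	fmt.Printf("%.2f\n", savedSecondsPerSecond(71000, 0.99, 200))
}
```

Read as CPU time, about 14 seconds of work per second is roughly what 14 fully busy cores would otherwise spend on inference, assuming the ~200μs saving is pure compute.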

The hardest part. How do you update an ML model without taking the server down?

type ModelManager struct {
    engine atomic.Pointer[InferenceEngine]
}

func (m *ModelManager) HotSwap(newModel []byte) error {
    newEngine, err := NewInferenceEngine(newModel)
    if err != nil {
        return err
    }

    // Atomic swap — zero downtime
    m.engine.Store(newEngine)
    return nil
}

func (m *ModelManager) GetEngine() *InferenceEngine {
    return m.engine.Load()
}

atomic.Pointer, added in Go 1.19, makes this trivial. No locks. No downtime. The old engine becomes eligible for garbage collection once the last in-flight request holding it finishes.
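To see the swap behave under concurrent load, here is a runnable sketch with a stub engine standing in for the ONNX session. The Engine type and its Version field are hypothetical stand-ins; only the atomic.Pointer pattern mirrors the post.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Engine is a stand-in for InferenceEngine; Version replaces the real session.
type Engine struct{ Version int }

func (e *Engine) Predict() int { return e.Version }

type ModelManager struct {
	engine atomic.Pointer[Engine]
}

func (m *ModelManager) HotSwap(next *Engine) { m.engine.Store(next) }
func (m *ModelManager) GetEngine() *Engine   { return m.engine.Load() }

func main() {
	m := &ModelManager{}
	m.HotSwap(&Engine{Version: 1})

	var wg sync.WaitGroup
	var sawInvalid atomic.Bool
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				// Load once per request so the whole request uses one engine.
				e := m.GetEngine()
				if v := e.Predict(); v != 1 && v != 2 {
					sawInvalid.Store(true)
				}
			}
		}()
	}

	// Swap while the readers are running; each reader sees either the old
	// or the new engine, never a partially updated one.
	m.HotSwap(&Engine{Version: 2})
	wg.Wait()

	fmt.Println(sawInvalid.Load(), m.GetEngine().Version) // false 2
}
```

One caveat: if the real engine holds native resources, as an ONNX Runtime session obtained through cgo typically does, the Go garbage collector may not release them by itself. In that case the old engine likely needs an explicit release once its in-flight requests have drained.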

Once you can hot-swap models, A/B testing becomes easy:

func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
    // Hash user ID for consistent routing
    hash := fnv32(r.Header.Get("X-User-ID"))
    if hash%100 < h.config.ModelBPercentage {
        return h.modelB.Load()
    }
    return h.modelA.Load()
}

This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.
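The routing snippet calls an fnv32 helper that the post does not define. The sketch below assumes it is FNV-1a from the standard library's hash/fnv package. The inModelB function and the synthetic user IDs are hypothetical, added to show how the bucket split behaves across the rollout stages.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fnv32 hashes a string with 32-bit FNV-1a (an assumed choice of variant).
func fnv32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s)) // hash.Hash.Write never returns an error
	return h.Sum32()
}

// inModelB reports whether a user falls into the model B bucket for a
// rollout percentage between 0 and 100.
func inModelB(userID string, percentage uint32) bool {
	return fnv32(userID)%100 < percentage
}

func main() {
	const users = 100000
	for _, pct := range []uint32{5, 20, 50, 100} {
		inB := 0
		for i := 0; i < users; i++ {
			if inModelB(fmt.Sprintf("user-%d", i), pct) {
				inB++
			}
		}
		fmt.Printf("target %3d%% -> observed %.1f%%\n", pct, 100*float64(inB)/users)
	}
}
```

Because the bucket is hash%100 compared against a growing threshold, a user routed to model B at 5% stays on model B at 20% and beyond, which keeps each user's experience stable during the rollout. Requests without an X-User-ID header all hash the empty string and land in the same bucket, so they may need separate handling.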

Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:

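// Check returns true when the mean absolute z-score of a feature vector,
// measured against the training baseline, exceeds the threshold.
func (d *DriftDetector) Check(features []float32) bool {
    // Guard against empty input or a feature count that does not match the baseline
    if len(features) == 0 || len(features) != len(d.baseline) {
        return false
    }

    var drift float64
    for i, f := range features {
        deviation := math.Abs(float64(f) - d.baseline[i].Mean)
        normalized := deviation / (d.baseline[i].Std + 1e-8) // epsilon avoids division by zero
        drift += normalized
    }
    drift /= float64(len(features))

    // Alert if drift exceeds threshold
    return drift > d.threshold // threshold: 5e-7
}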
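The detector depends on a per-feature baseline of means and standard deviations that the post does not show being built. Here is a sketch of one way to compute it from training rows and score a request against it. FeatureStats, ComputeBaseline, and Score are hypothetical names, and the toy rows are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// FeatureStats holds the training mean and standard deviation of one feature.
type FeatureStats struct{ Mean, Std float64 }

// ComputeBaseline returns per-feature mean and population standard deviation.
// It assumes every row has the same number of features as the first row.
func ComputeBaseline(rows [][]float32) []FeatureStats {
	if len(rows) == 0 {
		return nil
	}
	n := float64(len(rows))
	stats := make([]FeatureStats, len(rows[0]))
	for _, row := range rows {
		for i, v := range row {
			stats[i].Mean += float64(v) / n
		}
	}
	for _, row := range rows {
		for i, v := range row {
			d := float64(v) - stats[i].Mean
			stats[i].Std += d * d / n // accumulates the variance first
		}
	}
	for i := range stats {
		stats[i].Std = math.Sqrt(stats[i].Std)
	}
	return stats
}

// Score is the mean absolute z-score of a feature vector against the baseline.
func Score(baseline []FeatureStats, features []float32) float64 {
	var drift float64
	for i, f := range features {
		drift += math.Abs(float64(f)-baseline[i].Mean) / (baseline[i].Std + 1e-8)
	}
	return drift / float64(len(features))
}

func main() {
	baseline := ComputeBaseline([][]float32{{1, 10}, {2, 20}, {3, 30}})
	fmt.Printf("at the mean: %.3f\n", Score(baseline, []float32{2, 20}))
	fmt.Printf("shifted:     %.3f\n", Score(baseline, []float32{5, 50}))
}
```

For a feature that is roughly standard normal, the expected absolute z-score of a single in-distribution value is about 0.8, so a useful threshold depends on how scores are aggregated (per request, or averaged over a window) and is best calibrated on held-out traffic.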

Load tested with k6 — 200 concurrent VUs, 60 second duration:

scenarios: (100.00%) 1 scenario, 200 max VUs
  default: 200 looping VUs for 60s

✓ http_req_duration.............: avg=4.2ms  p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200

Throughput: 71,274 req/s

71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.

1. Language choice matters for inference.

Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.

2. Cache aggressively.

99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.

3. Atomic operations > locks for hot paths.

atomic.Pointer for model swapping means zero contention on the critical path.

4. Design for deployability from day one.

Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.

5. Monitor everything.

Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.

github.com/sameer-sde/sentinel

If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.
