# How I Built a Real-Time Fraud Detection System That Handles 71,000 RPS at p95 <6ms

> Source: <https://dev.to/sameer_ahmed_/how-i-built-a-real-time-fraud-detection-system-that-handles-71000-rps-at-p95-6ms-205k>
> Published: 2026-06-03 02:17:08+00:00

A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.

Fraud detection is a classic hard problem in systems design. You need to:

I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.

```
Transaction Request
        │
        ▼
   Go HTTP Server
        │
        ▼
   LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
        │
     Cache Miss
        │
        ▼
   ONNX Runtime
        │
        ▼
  XGBoost Model
        │
        ▼
  Fraud Score + Decision
        │
        ▼
  Prometheus Metrics
```

The key insight: **serve ML inference from Go, not Python**.

I trained XGBoost on the [Kaggle Credit Card Fraud Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) — 284,807 transactions, heavily imbalanced (only 0.17% fraud).

``` python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc

# Handle class imbalance
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    eval_metric='aucpr',
    use_label_encoder=False
)

model.fit(X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=20,
    verbose=False
)
```

**Results:**

Why PR-AUC over ROC-AUC? Because with imbalanced datasets, ROC-AUC is misleading. PR-AUC punishes you for missing fraud cases.

Here's where it gets interesting. Python inference is slow. I needed Go-level performance.

``` python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Export XGBoost → ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("fraud_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

ONNX is a universal model format. Once exported, I can serve it from **any language** — in this case, Go using the `onnxruntime-go`

binding.

```
type InferenceEngine struct {
    session  *onnxruntime.Session
    mu       sync.RWMutex
}

func (e *InferenceEngine) Predict(features []float32) (float64, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()

    input := onnxruntime.NewTensor(features)
    outputs, err := e.session.Run(input)
    if err != nil {
        return 0, err
    }

    score := outputs[0].GetData().([]float32)[0]
    return float64(score), nil
}
```

The Go HTTP handler:

```
func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
    var req TransactionRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", 400)
        return
    }

    // Check LRU cache first
    cacheKey := req.Hash()
    if cached, ok := h.cache.Get(cacheKey); ok {
        json.NewEncoder(w).Encode(cached)
        return
    }

    // Run ONNX inference
    features := req.ToFeatures()
    score, err := h.engine.Predict(features)
    if err != nil {
        http.Error(w, "inference error", 500)
        return
    }

    result := &PredictionResult{
        Score:     score,
        IsFraud:   score > 0.5,
        Timestamp: time.Now().UnixNano(),
    }

    h.cache.Set(cacheKey, result)
    json.NewEncoder(w).Encode(result)
}
```

The LRU cache was the single biggest performance win.

```
type LRUCache struct {
    capacity int
    mu       sync.RWMutex
    items    map[string]*list.Element
    list     *list.List
}

func (c *LRUCache) Get(key string) (interface{}, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    if elem, ok := c.items[key]; ok {
        c.list.MoveToFront(elem)
        return elem.Value.(*entry).value, true
    }
    return nil, false
}
```

**Result: 99% cache hit rate, saving ~200μs per decision.**

In a high-throughput system, 200μs × 71,000 = **14.2 seconds saved per second**. That's the compounding power of caching.

The hardest part. How do you update an ML model without taking the server down?

```
type ModelManager struct {
    engine atomic.Pointer[InferenceEngine]
}

func (m *ModelManager) HotSwap(newModel []byte) error {
    newEngine, err := NewInferenceEngine(newModel)
    if err != nil {
        return err
    }

    // Atomic swap — zero downtime
    m.engine.Store(newEngine)
    return nil
}

func (m *ModelManager) GetEngine() *InferenceEngine {
    return m.engine.Load()
}
```

`atomic.Pointer`

from Go 1.19 makes this trivial. No locks. No downtime. The old engine gets garbage collected after the swap.

Once you can hot-swap models, A/B testing becomes easy:

```
func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
    // Hash user ID for consistent routing
    hash := fnv32(r.Header.Get("X-User-ID"))
    if hash%100 < h.config.ModelBPercentage {
        return h.modelB.Load()
    }
    return h.modelA.Load()
}
```

This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.

Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:

``` js
func (d *DriftDetector) Check(features []float32) bool {
    var drift float64
    for i, f := range features {
        deviation := math.Abs(float64(f) - d.baseline[i].Mean)
        normalized := deviation / (d.baseline[i].Std + 1e-8)
        drift += normalized
    }
    drift /= float64(len(features))

    // Alert if drift exceeds threshold
    return drift > d.threshold // threshold: 5e-7
}
```

Load tested with k6 — 200 concurrent VUs, 60 second duration:

```
scenarios: (100.00%) 1 scenario, 200 max VUs
  default: 200 looping VUs for 60s

✓ http_req_duration.............: avg=4.2ms  p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200

Throughput: 71,274 req/s
```

**71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.**

**1. Language choice matters for inference.**

Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.

**2. Cache aggressively.**

99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.

**3. Atomic operations > locks for hot paths.**

`atomic.Pointer`

for model swapping means zero contention on the critical path.

**4. Design for deployability from day one.**

Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.

**5. Monitor everything.**

Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.

[github.com/sameer-sde/sentinel](https://github.com/sameer-sde/sentinel)

*If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.*