A deep dive into building Sentinel — an ML inference pipeline that processes 7.8M requests with zero errors, using XGBoost, ONNX, and Go.
Fraud detection is a classic hard problem in systems design. You need to:
I built Sentinel to solve all four of these — and in the process learned more about systems engineering than any course ever taught me.
Transaction Request
│
▼
Go HTTP Server
│
▼
LRU Cache ──── Cache Hit? ──► Return Result (< 1ms)
│
Cache Miss
│
▼
ONNX Runtime
│
▼
XGBoost Model
│
▼
Fraud Score + Decision
│
▼
Prometheus Metrics
The key insight: serve ML inference from Go, not Python.
I trained XGBoost on the Kaggle Credit Card Fraud Dataset — 284,807 transactions, heavily imbalanced (only 0.17% fraud).
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()
model = xgb.XGBClassifier(
n_estimators=300,
max_depth=6,
learning_rate=0.05,
scale_pos_weight=scale_pos_weight,
eval_metric='aucpr',
use_label_encoder=False
)
model.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
early_stopping_rounds=20,
verbose=False
)
Results:
Why PR-AUC over ROC-AUC? Because with imbalanced datasets, ROC-AUC is misleading. PR-AUC punishes you for missing fraud cases.
Here's where it gets interesting. Python inference is slow. I needed Go-level performance.
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("fraud_model.onnx", "wb") as f:
f.write(onnx_model.SerializeToString())
ONNX is a universal model format. Once exported, I can serve it from any language — in this case, Go using the onnxruntime-go
binding.
type InferenceEngine struct {
session *onnxruntime.Session
mu sync.RWMutex
}
func (e *InferenceEngine) Predict(features []float32) (float64, error) {
e.mu.RLock()
defer e.mu.RUnlock()
input := onnxruntime.NewTensor(features)
outputs, err := e.session.Run(input)
if err != nil {
return 0, err
}
score := outputs[0].GetData().([]float32)[0]
return float64(score), nil
}
The Go HTTP handler:
func (h *Handler) PredictHandler(w http.ResponseWriter, r *http.Request) {
var req TransactionRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "bad request", 400)
return
}
// Check LRU cache first
cacheKey := req.Hash()
if cached, ok := h.cache.Get(cacheKey); ok {
json.NewEncoder(w).Encode(cached)
return
}
// Run ONNX inference
features := req.ToFeatures()
score, err := h.engine.Predict(features)
if err != nil {
http.Error(w, "inference error", 500)
return
}
result := &PredictionResult{
Score: score,
IsFraud: score > 0.5,
Timestamp: time.Now().UnixNano(),
}
h.cache.Set(cacheKey, result)
json.NewEncoder(w).Encode(result)
}
The LRU cache was the single biggest performance win.
type LRUCache struct {
capacity int
mu sync.RWMutex
items map[string]*list.Element
list *list.List
}
func (c *LRUCache) Get(key string) (interface{}, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
if elem, ok := c.items[key]; ok {
c.list.MoveToFront(elem)
return elem.Value.(*entry).value, true
}
return nil, false
}
Result: 99% cache hit rate, saving ~200μs per decision.
In a high-throughput system, 200μs × 71,000 = 14.2 seconds saved per second. That's the compounding power of caching.
The hardest part. How do you update an ML model without taking the server down?
type ModelManager struct {
engine atomic.Pointer[InferenceEngine]
}
func (m *ModelManager) HotSwap(newModel []byte) error {
newEngine, err := NewInferenceEngine(newModel)
if err != nil {
return err
}
// Atomic swap — zero downtime
m.engine.Store(newEngine)
return nil
}
func (m *ModelManager) GetEngine() *InferenceEngine {
return m.engine.Load()
}
atomic.Pointer
from Go 1.19 makes this trivial. No locks. No downtime. The old engine gets garbage collected after the swap.
Once you can hot-swap models, A/B testing becomes easy:
func (h *Handler) routeRequest(r *http.Request) *InferenceEngine {
// Hash user ID for consistent routing
hash := fnv32(r.Header.Get("X-User-ID"))
if hash%100 < h.config.ModelBPercentage {
return h.modelB.Load()
}
return h.modelA.Load()
}
This lets me gradually roll out new models — 5% → 20% → 50% → 100% — while monitoring Prometheus metrics for drift.
Model drift is when real-world data shifts away from your training distribution. I implemented lightweight drift detection:
func (d *DriftDetector) Check(features []float32) bool {
var drift float64
for i, f := range features {
deviation := math.Abs(float64(f) - d.baseline[i].Mean)
normalized := deviation / (d.baseline[i].Std + 1e-8)
drift += normalized
}
drift /= float64(len(features))
// Alert if drift exceeds threshold
return drift > d.threshold // threshold: 5e-7
}
Load tested with k6 — 200 concurrent VUs, 60 second duration:
scenarios: (100.00%) 1 scenario, 200 max VUs
default: 200 looping VUs for 60s
✓ http_req_duration.............: avg=4.2ms p(95)=5.8ms
✓ http_req_failed...............: 0.00%
✓ iterations....................: 4,276,440
✓ vus...........................: 200
Throughput: 71,274 req/s
71,274 requests per second. p95 at 5.8ms. Zero errors across 7.8M requests.
1. Language choice matters for inference.
Python is great for training. Go is great for serving. ONNX bridges the gap — you get the best of both worlds.
2. Cache aggressively.
99% cache hit rate means only 1% of requests actually hit the model. Your throughput scales with your cache, not your model.
3. Atomic operations > locks for hot paths.
atomic.Pointer
for model swapping means zero contention on the critical path.
4. Design for deployability from day one.
Zero-downtime deploys aren't an afterthought — they're a core architectural requirement.
5. Monitor everything.
Prometheus metrics on every request, drift detection on every prediction. If you can't measure it, you can't improve it.
github.com/sameer-sde/sentinel
If you found this useful, drop a ❤️ and follow for more systems engineering content. I'm a 3rd year CS student at MJCET, Hyderabad — building distributed systems from scratch.