I Built a Fraud Detection System That Catches 99.76% of Fraud — Here's Everything I Learned


There is a number that haunts every fraud detection engineer: 0.13%.

That is the fraud rate in the PaySim dataset: 8,213 fraudulent transactions buried among 6,362,620 transactions in total. It sounds small. It is not. At that ratio, a model that predicts "legitimate" for every single transaction achieves 99.87% accuracy and catches exactly zero fraud.

This is the problem I set out to solve with TrustGuard AI, a course project that turned into one of the most technically demanding things I have built. By the end of it, our deployed XGBoost model achieves AUC-ROC of 0.9995 and Recall of 0.9976, meaning it catches 99.76% of the fraud in the held-out test split of the 6.3 million row dataset. It also explains every single prediction using SHAP, and grounds each fraud alert in real State Bank of Pakistan regulatory documents through a RAG pipeline.

This article is the full story — what worked, what broke, and why accuracy is the wrong metric for fraud detection.

Before writing a single line of code, I want to be clear about why standard accuracy is useless here.

The dataset has 6,362,620 transactions. Of those, 8,213 are fraud. If I build a model that always predicts "legitimate," here is its scorecard:

Accuracy  = 99.87%
Precision = 0
Recall    = 0
F1        = 0

A perfect-looking accuracy score on a model that is completely blind to fraud. This is why TrustGuard optimises for Recall (the share of fraud that is caught), Average Precision / AUPRC (area under the precision-recall curve), and AUC-ROC, not accuracy. On data this imbalanced, accuracy mostly measures the size of the majority class.
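The scorecard above can be reproduced from the two counts the article gives. This is a minimal sketch in plain Python; the helper name `scorecard` is illustrative, and undefined ratios (no positive predictions) are reported as 0.0 by convention.

```python
# Counts quoted in the article.
TOTAL = 6_362_620
FRAUD = 8_213

def scorecard(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, recall and F1, using 0.0 where a ratio is undefined."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# A model that always predicts "legitimate" produces no positives at all,
# so every fraud case is a false negative.
baseline = scorecard(tp=0, fp=0, fn=FRAUD, tn=TOTAL - FRAUD)
print({name: round(value, 4) for name, value in baseline.items()})
```

Accuracy comes out at about 0.9987 while recall, precision and F1 are all zero, which is the whole argument for ignoring accuracy here.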

TrustGuard uses the PaySim synthetic dataset — a mobile money transaction log generated by a multi-agent simulation calibrated against real financial data. It spans 30 simulated days at hourly granularity.

Property	Value
Total Transactions	6,362,620
Fraud Cases	8,213
Fraud Rate	0.13%
Transaction Types	CASH_OUT, TRANSFER, PAYMENT, DEBIT, CASH_IN

One of the first insights from EDA: fraud is not spread across all transaction types. It is confined exclusively to CASH_OUT and TRANSFER. This makes structural sense, because fraud follows the account-drain pattern: transfer funds to a mule account, then cash out. PAYMENT, DEBIT, and CASH_IN are clean.

This single observation shaped the entire feature engineering approach.
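A check like this is a one-line group-by in pandas. The sketch below uses a tiny invented DataFrame rather than PaySim itself, and it assumes the label column is called `isFraud` (the name is not quoted in the article).

```python
import pandas as pd

# Toy stand-in for the transaction log; the rows are invented for illustration.
df = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "DEBIT",
             "CASH_IN", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [1, 1, 0, 0, 0, 0, 0, 0],
})

# Fraud count and fraud rate per transaction type.
by_type = df.groupby("type")["isFraud"].agg(fraud_cases="sum", fraud_rate="mean")

# Transaction types that contain at least one fraud case.
fraud_types = sorted(by_type.index[by_type["fraud_cases"] > 0])
print(by_type)
print(fraud_types)
```

On the real dataset the same group-by would show which types carry any fraud at all, which is what justifies focusing the features on CASH_OUT and TRANSFER.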

At 0.13% fraud rate, SMOTE alone is not enough. Here is why.

With 5-fold cross-validation, some training folds can contain fewer than 10 actual fraud samples. SMOTE generates synthetic minority samples by interpolating between existing ones — but if there are only a handful of real fraud cases in a fold, SMOTE degenerates. The synthetic samples cluster too tightly and the model learns nothing useful.

TrustGuard uses a two-stage imbalance strategy:

Before any train-test split, I apply a deterministic fraud injection step. It takes TRANSFER and CASH_OUT transactions, sets amount = oldbalanceOrg (a full account drain) and newbalanceOrig = 0, and recomputes balanceDiff and amount_ratio.

Result: Fraud rate rises from 0.13% → 1.26%.
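The injection step described above might look roughly like the following. This is a sketch under stated assumptions, not the project's code: the function name `inject_account_drains`, the `n_inject` parameter, the seeded random choice of rows, and the `isFraud` column name are all hypothetical. Only the two assignments and the two feature formulas come from the article.

```python
import numpy as np
import pandas as pd

def inject_account_drains(df: pd.DataFrame, n_inject: int, seed: int = 42) -> pd.DataFrame:
    """Rewrite n_inject legitimate TRANSFER/CASH_OUT rows as full account drains."""
    out = df.copy()
    eligible = out.index[out["type"].isin(["TRANSFER", "CASH_OUT"]) & (out["isFraud"] == 0)]
    rng = np.random.default_rng(seed)  # fixed seed keeps the step deterministic
    chosen = rng.choice(eligible.to_numpy(), size=min(n_inject, len(eligible)), replace=False)
    out.loc[chosen, "amount"] = out.loc[chosen, "oldbalanceOrg"]  # drain the whole balance
    out.loc[chosen, "newbalanceOrig"] = 0.0
    out.loc[chosen, "isFraud"] = 1
    # Recompute the engineered features so they reflect the rewritten rows.
    out["balanceDiff"] = out["oldbalanceOrg"] - out["newbalanceOrig"] - out["amount"]
    out["amount_ratio"] = out["amount"] / (out["oldbalanceOrg"] + 1)
    return out

# Invented toy rows, all legitimate to begin with.
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_IN", "CASH_OUT"],
    "amount": [100.0, 50.0, 20.0, 300.0, 80.0, 10.0],
    "oldbalanceOrg": [1000.0, 400.0, 60.0, 900.0, 500.0, 700.0],
    "newbalanceOrig": [900.0, 350.0, 40.0, 600.0, 580.0, 690.0],
    "isFraud": [0, 0, 0, 0, 0, 0],
})
simulated = inject_account_drains(toy, n_inject=2)
print(simulated[["type", "amount", "newbalanceOrig", "isFraud"]])
print(f"fraud rate: {simulated['isFraud'].mean():.2%}")
```

Working on a copy keeps the original log intact, so the pre-injection and post-injection fraud rates can be compared side by side.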

The ablation study confirmed this was the single most important component in the pipeline. Removing it dropped CV F1 from 0.947 to 0.671 — a 29% relative reduction.

After the 80/20 stratified train-test split, SMOTE (sampling_strategy=0.3) is applied inside an ImbPipeline per cross-validation fold. This is critical: SMOTE is fitted only on the training portion of each fold. The validation fold never sees synthetic samples. This prevents data leakage.

The final training distribution: 23.07% fraud.

Stage	Fraud Rate
Original Dataset	0.13%
After Fraud Simulation	1.26%
After SMOTE (training folds)	23.07%

After cleaning, 12 features go into the model. The two most important are engineered:

balanceDiff = oldbalanceOrg − newbalanceOrig − amount

This detects balance inconsistencies. When the recorded balances are consistent, the origin balance falls by exactly the amount sent and the value is 0. In an account-drain fraud, this value becomes anomalous.

amount_ratio = amount / (oldbalanceOrg + 1)

This approaches 1.0 in full account-drain attacks. For routine transfers it stays near zero.

The ablation confirmed their necessity: removing both dropped Test F1 from 0.5533 to 0.1538 and Test AP from 0.7317 to 0.6061. Without them, the model is nearly blind.
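Both features are simple column arithmetic. Here is a small sketch using the two formulas from the article on three invented rows; the function name `add_engineered_features` and the row labels are illustrative only.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add balanceDiff and amount_ratio using the article's two formulas."""
    out = df.copy()
    out["balanceDiff"] = out["oldbalanceOrg"] - out["newbalanceOrig"] - out["amount"]
    # The +1 keeps the ratio finite when the opening balance is zero.
    out["amount_ratio"] = out["amount"] / (out["oldbalanceOrg"] + 1)
    return out

# Invented examples: a small routine transfer, a full drain, and a row whose
# recorded balances do not match the amount moved.
rows = pd.DataFrame({
    "amount": [50.0, 10_000.0, 2_000.0],
    "oldbalanceOrg": [5_000.0, 10_000.0, 0.0],
    "newbalanceOrig": [4_950.0, 0.0, 0.0],
}, index=["routine", "full_drain", "inconsistent"])

features = add_engineered_features(rows)
print(features[["balanceDiff", "amount_ratio"]].round(4))
```

The routine row has an amount_ratio near zero, the full drain sits just under 1.0, and only the inconsistent row has a nonzero balanceDiff, so the two features pick up different signals.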

All four models were trained identically inside an ImbPipeline(SMOTE → StandardScaler → Classifier) with 5-fold stratified cross-validation.

Cross-Validation Results:

Model	CV F1	CV AUC-ROC
XGBoost	0.949 ± 0.020	1.000 ± 0.000
Neural Network	0.793 ± 0.061	0.999 ± 0.000
Random Forest	0.711 ± 0.007	0.999 ± 0.000
Logistic Regression	0.249 ± 0.003	0.977 ± 0.001

Test Set Results:

Model	Test Recall	Test AUC	Test Avg Precision
XGBoost	0.9976	0.9995	0.9358
Random Forest	0.9976	0.9995	0.8870
Neural Network	0.9732	0.9983	0.7081
Logistic Regression	0.9860	0.9946	0.5567

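XGBoost leads or ties on every metric: Random Forest matches its Test Recall and Test AUC, but not its Test Average Precision. That Average Precision (0.9358) is about 1.7× Logistic Regression's (0.5567) at comparable recall, and its CV F1 (0.949) is about 3.8× Logistic Regression's (0.249). XGBoost was selected for deployment.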

n_estimators	CV F1
100	0.921
200	0.938
300	0.949

More trees combined with a lower learning rate (0.05) generalised better. I chose a max depth of 6 over 8 to avoid overfitting to fold-specific patterns.

A fraud detection system that says "this transaction is fraud" without explaining why is not useful to an analyst — and not acceptable to a regulator.

TrustGuard implements SHAP TreeExplainer, which computes exact Shapley values for each prediction. For every flagged transaction, a waterfall plot shows exactly which features pushed the prediction toward fraud and by how much.

For a sample transaction flagged at 94% fraud probability:

amount_ratio ≈ 1.0 → largest push toward fraud (full drain detected)
type_TRANSFER → second largest push
balanceDiff → third largest push

This tells the analyst: this transaction looks like fraud because it drained an account completely via a transfer operation. That is auditable and defensible.
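To make the idea behind the waterfall plot concrete, here is a brute-force Shapley computation on a toy scoring function. It is not SHAP's TreeExplainer and not the trained model: `toy_score`, its coefficients, and the baseline of all zeros are invented, and the enumeration over coalitions is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

FEATURES = ["amount_ratio", "type_TRANSFER", "balanceDiff"]

def toy_score(x: dict) -> float:
    """Hypothetical fraud score with one interaction term; not the trained model."""
    return (0.04 + 0.5 * x["amount_ratio"] + 0.2 * x["type_TRANSFER"]
            + 0.15 * x["amount_ratio"] * x["type_TRANSFER"] + 0.1 * x["balanceDiff"])

def shapley_values(score, instance: dict, baseline: dict) -> dict:
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(FEATURES)

    def value(coalition: set) -> float:
        # Features in the coalition take the instance's value, the rest the baseline's.
        return score({f: instance[f] if f in coalition else baseline[f] for f in FEATURES})

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi

flagged = {"amount_ratio": 1.0, "type_TRANSFER": 1.0, "balanceDiff": 0.5}
baseline = {"amount_ratio": 0.0, "type_TRANSFER": 0.0, "balanceDiff": 0.0}
phi = shapley_values(toy_score, flagged, baseline)
print(phi)
print(sum(phi.values()), toy_score(flagged) - toy_score(baseline))
```

The contributions add up to the gap between the flagged score and the baseline score, which is the property a waterfall plot visualises; the interaction term is split evenly between the two features involved.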

This is the part most fraud detection tutorials skip entirely.

A model flagging a transaction at 97% is useful. A model that also cites the specific SBP regulatory provision being violated is operationally deployable.

Pipeline architecture:

all-MiniLM-L6-v2 (384-dimensional dense retriever)
ms-marco-MiniLM-L-6-v2 (CrossEncoder)

Results: Average Precision@5 = 0.855 across 10 retrieval queries. Zero hallucinations across all four high-risk transaction evaluations.
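The article does not spell out how the retrieval score is computed, so the sketch below shows two common readings: plain precision@k and a rank-aware average precision@k (here normalised by the number of hits in the top k, which is one of several conventions). The chunk ids and relevance judgments are invented.

```python
def precision_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def average_precision_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Mean of precision@i over the ranks i <= k where a relevant chunk appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical evaluation set: (ranked chunk ids, ids judged relevant) per query.
queries = [
    (["c1", "c7", "c3", "c9", "c4"], {"c1", "c3", "c4", "c7"}),
    (["c2", "c8", "c5", "c6", "c0"], {"c2", "c5"}),
]

mean_p5 = sum(precision_at_k(r, rel) for r, rel in queries) / len(queries)
mean_ap5 = sum(average_precision_at_k(r, rel) for r, rel in queries) / len(queries)
print(f"mean precision@5: {mean_p5:.3f}")
print(f"mean average precision@5: {mean_ap5:.3f}")
```

Averaging either function over a query set gives a single retrieval score; the rank-aware variant rewards putting relevant passages near the top.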

Condition	CV F1	Test F1	Test AP
No Fraud Simulation (SMOTE only)	0.671	0.6247	0.9363
Full Pipeline (baseline)	0.947	0.5533	0.7317
No SMOTE	0.947	0.9132	0.9639
SMOTE ratio = 0.3 (selected)	0.947	0.5557	0.7322
SMOTE ratio = 0.5	0.947	0.5280	0.7180
No Engineered Features	0.636	0.1538	0.6061
With Engineered Features (full)	0.947	0.5533	0.7317

Three takeaways: the Fraud Simulation Engine is irreplaceable (CV F1 falls from 0.947 to 0.671 without it), engineered features are critical (CV F1 0.636 and Test F1 0.1538 without them), and no SMOTE gives better precision but worse robustness.

Metric	Value
AUC-ROC	0.9995
Recall	0.9976
Average Precision (AUPRC)	0.9358
Fraud cases caught (of 8,213)	8,190
Fraud cases missed	23
RAG hallucinations	0
Retrieval Precision@5	0.855

Python · XGBoost · Scikit-learn · imbalanced-learn · SHAP · ChromaDB · sentence-transformers · rank-bm25 · CrossEncoder · GPT-4o-mini · Streamlit · Pandas · NumPy

Live demo: trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app

Source code: github.com/whozahm3d/trustguard-ai-fraud-detection

If you found this useful, the repo is public — feedback, issues, and stars are all welcome.

source & further reading

dev.to: original article
