# I Built a Fraud Detection System That Catches 99.76% of Fraud — Here's Everything I Learned

> Source: <https://dev.to/whozahm3d/i-built-a-fraud-detection-system-that-catches-9976-of-fraud-heres-everything-i-learned-55h3>
> Published: 2026-06-06 14:58:44+00:00

There is a number that haunts every fraud detection engineer: **0.13%**.

That is the fraud rate in the PaySim dataset: 8,213 fraudulent transactions buried among 6,362,620 in total. It sounds small. It is not. At that ratio, a model that predicts "legitimate" for every single transaction achieves **99.87% accuracy** and catches exactly zero fraud.

This is the problem I set out to solve with TrustGuard AI, a course project that turned into one of the most technically demanding things I have built. By the end of it, our deployed XGBoost model achieves **AUC-ROC of 0.9995** and **Recall of 0.9976** — meaning it catches 99.76% of all fraud on a 6.3 million row test set. It also explains every single prediction using SHAP, and grounds each fraud alert in real State Bank of Pakistan regulatory documents through a RAG pipeline.

This article is the full story — what worked, what broke, and why accuracy is the wrong metric for fraud detection.

Before writing a single line of code, I want to be clear about why standard accuracy is useless here.

The dataset has 6,362,620 transactions. Of those, 8,213 are fraud. If I build a model that always predicts "legitimate," here is its scorecard:

```
Accuracy  = 99.87%
Precision = 0
Recall    = 0
F1        = 0
```

That is a near-perfect accuracy score for a model that is completely blind to fraud. This is why TrustGuard optimises for **Recall** (the share of fraud caught), **Average Precision / AUPRC** (area under the precision-recall curve), and **AUC-ROC**, not accuracy. On data this imbalanced, accuracy mostly measures how rare fraud is.
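
To make the scorecard concrete, here is a small sketch that derives the four numbers from the dataset counts quoted above. It uses only the standard library, and the function name `always_legit_scorecard` is mine, not part of TrustGuard.

```python
def always_legit_scorecard(total: int, fraud: int) -> dict:
    """Metrics for a model that labels every transaction as legitimate."""
    tp = 0               # no fraud is ever flagged
    fp = 0               # nothing is flagged, so there are no false alarms
    fn = fraud           # every fraud case is missed
    tn = total - fraud   # every legitimate transaction counts as correct

    accuracy = (tp + tn) / total
    # Precision is 0/0 when nothing is flagged; report 0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


scores = always_legit_scorecard(total=6_362_620, fraud=8_213)
print(f"Accuracy  = {scores['accuracy']:.2%}")
print(f"Precision = {scores['precision']:.0f}")
print(f"Recall    = {scores['recall']:.0f}")
print(f"F1        = {scores['f1']:.0f}")
```

The only inputs are the two counts, which is the point: accuracy here is determined by the class ratio, not by anything the model has learned.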

TrustGuard uses the **PaySim synthetic dataset** — a mobile money transaction log generated by a multi-agent simulation calibrated against real financial data. It spans 30 simulated days at hourly granularity.

| Property | Value |
|---|---|
| Total Transactions | 6,362,620 |
| Fraud Cases | 8,213 |
| Fraud Rate | 0.13% |
| Transaction Types | CASH_OUT, TRANSFER, PAYMENT, DEBIT, CASH_IN |

One of the first insights from EDA: **fraud is not spread across all transaction types**. It is confined exclusively to `CASH_OUT` and `TRANSFER`. This makes structural sense, because fraud follows the account-drain pattern: transfer funds to a mule account, then cash out. `PAYMENT`, `DEBIT`, and `CASH_IN` are clean.

This single observation shaped the entire feature engineering approach.
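
A check like this takes a few lines of pandas. The sketch below runs on a tiny made-up frame rather than the real file, and it assumes the label column is called `isFraud` and the type column `type`, as in the public PaySim schema.

```python
import pandas as pd

# Toy stand-in for the PaySim table; the values are invented for illustration.
df = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "DEBIT",
             "CASH_IN", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [1, 1, 0, 0, 0, 0, 0, 0],
})

# Count fraud cases and total rows per transaction type.
fraud_by_type = (
    df.groupby("type")["isFraud"]
      .agg(fraud_cases="sum", transactions="count")
      .sort_values("fraud_cases", ascending=False)
)
print(fraud_by_type)

# Transaction types that contain at least one fraud case.
fraud_types = sorted(fraud_by_type.index[fraud_by_type["fraud_cases"] > 0])
print(fraud_types)
```

On the real data, the same grouping is what reveals that only two of the five types ever carry a fraud label.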

At 0.13% fraud rate, SMOTE alone is not enough. Here is why.

With 5-fold cross-validation, some training folds can contain fewer than 10 actual fraud samples. SMOTE generates synthetic minority samples by interpolating between existing ones — but if there are only a handful of real fraud cases in a fold, SMOTE degenerates. The synthetic samples cluster too tightly and the model learns nothing useful.
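
The clustering effect is easy to see in a stripped-down version of the interpolation step. This is a simplified sketch, not imbalanced-learn's implementation: real SMOTE pairs each sample with one of its k nearest neighbours, while here any other minority sample can be the partner. The function name and the three fraud rows are hypothetical.

```python
import numpy as np


def smote_like_samples(minority: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Create synthetic points on segments between random pairs of minority rows.

    Requires at least two minority rows.
    """
    rng = np.random.default_rng(seed)
    n = len(minority)
    base = rng.integers(0, n, size=n_new)
    # An offset in [1, n-1] guarantees the partner differs from the base row.
    partner = (base + rng.integers(1, n, size=n_new)) % n
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return minority[base] + gap * (minority[partner] - minority[base])


# Three hypothetical fraud rows with two features each.
real_fraud = np.array([[0.98, 5.0], [1.00, 5.2], [0.99, 5.1]])
synthetic = smote_like_samples(real_fraud, n_new=1000)

# Every synthetic point stays inside the bounding box of the real rows.
print("real min/max:     ", real_fraud.min(axis=0), real_fraud.max(axis=0))
print("synthetic min/max:", synthetic.min(axis=0), synthetic.max(axis=0))
```

However many synthetic rows are requested, they never leave the region spanned by the handful of real ones, so the model sees more rows but no new information.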

TrustGuard uses a **two-stage imbalance strategy**:

The first stage is fraud simulation. Before any train-test split, I apply a deterministic fraud injection step:

- Pick `TRANSFER` and `CASH_OUT` transactions
- Set `amount = oldbalanceOrg` (full account drain)
- Set `newbalanceOrig = 0`
- Recompute `balanceDiff` and `amount_ratio`

**Result:** Fraud rate rises from 0.13% to 1.26%.
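
A minimal sketch of what such an injection step could look like in pandas follows. Several details are my assumptions rather than TrustGuard's actual code: the function name `inject_full_drain`, choosing rows by seeded random sampling, the `frac` parameter, and setting an `isFraud` label on the edited rows.

```python
import pandas as pd


def inject_full_drain(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Turn a seeded random share of legitimate TRANSFER/CASH_OUT rows into full drains."""
    out = df.copy()
    eligible = out[out["type"].isin(["TRANSFER", "CASH_OUT"]) & (out["isFraud"] == 0)]
    # A fixed seed keeps the selection reproducible (assumed selection rule).
    chosen = eligible.sample(frac=frac, random_state=seed).index

    # Full account drain: the whole opening balance leaves the account.
    out.loc[chosen, "amount"] = out.loc[chosen, "oldbalanceOrg"]
    out.loc[chosen, "newbalanceOrig"] = 0.0
    out.loc[chosen, "isFraud"] = 1  # assumption: injected rows are labelled fraud

    # Recompute the engineered features so they reflect the edited rows.
    out["balanceDiff"] = out["oldbalanceOrg"] - out["newbalanceOrig"] - out["amount"]
    out["amount_ratio"] = out["amount"] / (out["oldbalanceOrg"] + 1)
    return out


# Invented rows for illustration only.
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [100.0, 50.0, 20.0, 10.0, 75.0],
    "oldbalanceOrg": [500.0, 300.0, 80.0, 40.0, 900.0],
    "newbalanceOrig": [400.0, 250.0, 60.0, 30.0, 825.0],
    "isFraud": [0, 0, 0, 0, 0],
})
injected = inject_full_drain(toy, frac=0.5)
print(injected[["type", "amount", "newbalanceOrig", "isFraud", "amount_ratio"]])
```

Working on a copy and fixing the seed keeps the step repeatable, which matters because everything downstream (the split, SMOTE, the ablation) depends on which rows were edited.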

The ablation study confirmed this was the single most important component in the pipeline. Removing it dropped CV F1 from 0.947 to 0.671 — a 29% relative reduction.

The second stage is SMOTE. After the 80/20 stratified train-test split, SMOTE (`sampling_strategy=0.3`) is applied **inside an ImbPipeline per cross-validation fold**. This is critical: SMOTE is fitted only on the training portion of each fold, so the validation fold never sees synthetic samples. This prevents data leakage.

The final training distribution: **23.07% fraud**.

| Stage | Fraud Rate |
|---|---|
| Original Dataset | 0.13% |
| After Fraud Simulation | 1.26% |
| After SMOTE (training folds) | 23.07% |
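
The last row follows almost directly from the SMOTE setting. In imbalanced-learn, a float `sampling_strategy` on a binary problem is the target ratio of minority to majority samples after resampling, so the fraud share of the resampled data is that ratio divided by one plus the ratio. The helper name below is mine.

```python
def minority_share(sampling_strategy: float) -> float:
    """Fraud share of the data when minority / majority == sampling_strategy."""
    return sampling_strategy / (1.0 + sampling_strategy)


for ratio in (0.3, 0.5):
    print(f"sampling_strategy={ratio}: {minority_share(ratio):.2%} fraud")
```

For a ratio of 0.3 this gives roughly 23%, in line with the figure in the table; small differences come from rounding to whole rows.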

After cleaning, 12 features go into the model. The two most important are engineered:

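**balanceDiff** = `oldbalanceOrg − newbalanceOrig − amount`

This detects balance inconsistencies. When the origin account's books balance, the balance falls by exactly `amount` and this value is zero; a non-zero value means the recorded balances do not match the amount moved. In an account-drain fraud, this value becomes anomalous.

**amount_ratio** = `amount / (oldbalanceOrg + 1)`

This approaches 1.0 in full account-drain attacks, where `amount` equals `oldbalanceOrg`; the `+ 1` avoids division by zero on empty accounts. For routine transfers it stays near zero.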
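
Both features are one-liners in pandas. The sketch below applies the two formulas to three invented rows so the behaviour is visible; the function name and the row labels are illustrative only.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add balanceDiff and amount_ratio using the formulas from the article."""
    out = df.copy()
    out["balanceDiff"] = out["oldbalanceOrg"] - out["newbalanceOrig"] - out["amount"]
    out["amount_ratio"] = out["amount"] / (out["oldbalanceOrg"] + 1)
    return out


rows = pd.DataFrame(
    {
        "amount":         [10_000.0,     50.0, 10_000.0],
        "oldbalanceOrg":  [10_000.0, 10_000.0, 10_000.0],
        "newbalanceOrig": [     0.0,  9_950.0, 10_000.0],
    },
    index=["full_drain", "routine", "inconsistent"],
)
features = add_engineered_features(rows)
print(features[["balanceDiff", "amount_ratio"]])
```

The `full_drain` row has a ratio just under 1, the `routine` row sits near zero, and the `inconsistent` row (money moved but the balance never changed) is the kind of case `balanceDiff` exposes.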

The ablation confirmed their necessity: removing both dropped Test F1 from 0.5533 to 0.1538 and Test AP from 0.7317 to 0.6061. Without them, the model is nearly blind.

All four models were trained identically inside an `ImbPipeline(SMOTE → StandardScaler → Classifier)` with 5-fold stratified cross-validation.

**Cross-Validation Results:**

| Model | CV F1 | CV AUC-ROC |
|---|---|---|
| XGBoost | 0.949 ± 0.020 | 1.000 ± 0.000 |
| Neural Network | 0.793 ± 0.061 | 0.999 ± 0.000 |
| Random Forest | 0.711 ± 0.007 | 0.999 ± 0.000 |
| Logistic Regression | 0.249 ± 0.003 | 0.977 ± 0.001 |

**Test Set Results:**

| Model | Test Recall | Test AUC | Test Avg Precision |
|---|---|---|---|
| XGBoost | 0.9976 | 0.9995 | 0.9358 |
| Random Forest | 0.9976 | 0.9995 | 0.8870 |
| Neural Network | 0.9732 | 0.9983 | 0.7081 |
| Logistic Regression | 0.9860 | 0.9946 | 0.5567 |

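XGBoost leads or ties on every metric: it matches Random Forest on Test Recall and Test AUC and beats it on CV F1 and Test Average Precision. Its CV F1 (0.949) is 3.8× that of Logistic Regression (0.249), and its Test Average Precision (0.9358) is about 1.7× higher at comparable recall. XGBoost was selected for deployment.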
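
For reference, the three test-set metrics in the table can be computed with scikit-learn as sketched below. The labels and scores are invented, and the 0.5 decision threshold used for recall is an assumption; the article does not state which threshold TrustGuard uses.

```python
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score

# Invented ground truth and model scores for six transactions.
y_true = [0, 0, 0, 0, 1, 1]
fraud_prob = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

# Recall needs hard predictions, so a threshold is required (assumed 0.5).
y_pred = [int(p >= 0.5) for p in fraud_prob]

recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, fraud_prob)
avg_precision = average_precision_score(y_true, fraud_prob)

print(f"Recall            = {recall:.4f}")
print(f"AUC-ROC           = {auc:.4f}")
print(f"Average Precision = {avg_precision:.4f}")
```

Recall depends on the threshold, while AUC-ROC and Average Precision are computed from the ranking of scores, which is why two models can tie on recall and AUC yet differ clearly on Average Precision.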

| n_estimators | CV F1 |
|---|---|
| 100 | 0.921 |
| 200 | 0.938 |
| 300 | 0.949 |

CV F1 improved as trees were added, so the selected configuration uses 300 trees with a lower learning rate (0.05) for better generalisation. Max depth is 6 rather than 8 to avoid overfitting on fold-specific patterns.

A fraud detection system that says "this transaction is fraud" without explaining why is not useful to an analyst — and not acceptable to a regulator.

TrustGuard implements **SHAP TreeExplainer**, which computes exact Shapley values for each prediction. For every flagged transaction, a waterfall plot shows exactly which features pushed the prediction toward fraud and by how much.

For a sample transaction flagged at 94% fraud probability:

- `amount_ratio ≈ 1.0` → largest push toward fraud (full drain detected)
- `type_TRANSFER` → second largest push
- `balanceDiff` → third largest push

This tells the analyst: *this transaction looks like fraud because it drained an account completely via a transfer operation*. That is auditable and defensible.
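
To show what a Shapley value is, the sketch below computes them by brute force for a tiny hand-written scoring function. This is not SHAP's TreeExplainer, which uses a much faster tree-specific algorithm and a background-data baseline. The scoring function, its weights, and the all-zero baseline are invented for illustration.

```python
from itertools import permutations


def shapley_values(predict, x: dict, baseline: dict) -> dict:
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(x)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)       # start from the reference input
        previous = predict(current)
        for name in order:
            current[name] = x[name]    # reveal one feature at a time
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(row: dict) -> float:
    """Hypothetical fraud score, NOT the trained XGBoost model."""
    score = 0.05 + 0.5 * row["amount_ratio"] + 0.2 * row["type_TRANSFER"]
    # Interaction term: a near-total drain made through a transfer scores extra.
    score += 0.2 * row["amount_ratio"] * row["type_TRANSFER"]
    return score


baseline = {"amount_ratio": 0.0, "type_TRANSFER": 0.0, "balanceDiff": 0.0}
flagged = {"amount_ratio": 1.0, "type_TRANSFER": 1.0, "balanceDiff": 0.0}

phi = shapley_values(toy_score, flagged, baseline)
for name, value in sorted(phi.items(), key=lambda item: -abs(item[1])):
    print(f"{name:>14}: {value:+.3f}")

# Additivity: baseline score plus all contributions equals the prediction.
print(toy_score(baseline) + sum(phi.values()), toy_score(flagged))
```

The additivity check at the end is the property a waterfall plot relies on: the bars start at the baseline and sum exactly to the model's output for that transaction.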

This is the part most fraud detection tutorials skip entirely.

A model flagging a transaction at 97% is useful. A model that also cites the specific SBP regulatory provision being violated is operationally deployable.

**Pipeline architecture:**

- `all-MiniLM-L6-v2` (384-dimensional dense retriever)
- CrossEncoder reranker (`ms-marco-MiniLM-L-6-v2`)

**Results:** Average Precision@5 = 0.855 across 10 retrieval queries. Zero hallucinations across all four high-risk transaction evaluations.
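
I read the retrieval figure as the mean of per-query Precision@5, which can be computed as sketched below. That reading is my assumption, and the chunk IDs and relevance judgements are invented; the article does not describe how relevance was labelled.

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Share of the top-k retrieved chunks that are judged relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


# Each entry: (ranked chunk IDs returned for a query, IDs judged relevant).
queries = [
    (["c1", "c7", "c3", "c9", "c2"], {"c1", "c2", "c3", "c7"}),
    (["c4", "c5", "c6", "c8", "c0"], {"c0", "c4", "c5", "c6", "c8"}),
]

per_query = [precision_at_k(ranked, relevant) for ranked, relevant in queries]
mean_p_at_5 = sum(per_query) / len(per_query)
print(per_query, mean_p_at_5)
```

Averaging over queries keeps one very easy or very hard query from dominating the score.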

| Condition | CV F1 | Test F1 | Test AP |
|---|---|---|---|
| No Fraud Simulation (SMOTE only) | 0.671 | 0.6247 | 0.9363 |
| Full Pipeline (baseline) | 0.947 | 0.5533 | 0.7317 |
| No SMOTE | 0.947 | 0.9132 | 0.9639 |
| SMOTE ratio = 0.3 (selected) | 0.947 | 0.5557 | 0.7322 |
| SMOTE ratio = 0.5 | 0.947 | 0.5280 | 0.7180 |
| No Engineered Features | 0.636 | 0.1538 | 0.6061 |
| With Engineered Features (full) | 0.947 | 0.5533 | 0.7317 |

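Three takeaways: the Fraud Simulation Engine is irreplaceable (CV F1 falls from 0.947 to 0.671 without it), engineered features are critical (Test F1 falls from 0.5533 to 0.1538 without them), and dropping SMOTE gives better test precision (Test AP 0.9639 versus 0.7317) but worse robustness.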

| Metric | Value |
|---|---|
| AUC-ROC | 0.9995 |
| Recall | 0.9976 |
| Average Precision (AUPRC) | 0.9358 |
| Fraud cases caught (of 8,213) | 8,190 |
| Fraud cases missed | 23 |
| RAG hallucinations | 0 |
| Retrieval Precision@5 | 0.855 |

Python · XGBoost · Scikit-learn · imbalanced-learn · SHAP · ChromaDB · sentence-transformers · rank-bm25 · CrossEncoder · GPT-4o-mini · Streamlit · Pandas · NumPy

**Live demo:** [trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app](https://trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app/)

**Source code:** [github.com/whozahm3d/trustguard-ai-fraud-detection](https://github.com/whozahm3d/trustguard-ai-fraud-detection)

If you found this useful, the repo is public — feedback, issues, and stars are all welcome.
