{"slug": "i-built-a-fraud-detection-system-that-catches-99-76-of-fraud-here-s-everything-i", "title": "I Built a Fraud Detection System That Catches 99.76% of Fraud — Here's Everything I Learned", "summary": "A developer built TrustGuard AI, a fraud detection system that catches 99.76% of fraudulent transactions on the PaySim dataset, achieving an AUC-ROC of 0.9995 and Recall of 0.9976 on a 6.3 million row test set. The system uses a two-stage imbalance strategy, including deterministic fraud injection and SMOTE, to address the dataset's 0.13% fraud rate where standard accuracy metrics are deceptive. TrustGuard also explains each prediction using SHAP and grounds fraud alerts in regulatory documents through a RAG pipeline.", "body_md": "There is a number that haunts every fraud detection engineer: **0.13%**.\n\nThat is the fraud rate in the PaySim dataset: 8,213 fraudulent transactions buried among 6,362,620 in total. It sounds small. It is not. At that ratio, a model that predicts \"legitimate\" for every single transaction achieves **99.87% accuracy** and catches exactly zero fraud.\n\nThis is the problem I set out to solve with TrustGuard AI, a course project that turned into one of the most technically demanding things I have built. By the end of it, our deployed XGBoost model achieves **AUC-ROC of 0.9995** and **Recall of 0.9976**, meaning it catches 99.76% of all fraud on a 6.3 million row test set. It also explains every single prediction using SHAP, and grounds each fraud alert in real State Bank of Pakistan regulatory documents through a RAG pipeline.\n\nThis article is the full story: what worked, what broke, and why accuracy is the wrong metric for fraud detection.\n\nBefore writing a single line of code, I want to be clear about why standard accuracy is useless here.\n\nThe dataset has 6,362,620 transactions. Of those, 8,213 are fraud. 
If I build a model that always predicts \"legitimate,\" here is its scorecard:\n\n```\nAccuracy  = 99.87%\nPrecision = 0\nRecall    = 0\nF1        = 0\n```\n\nThat is a near-perfect accuracy score on a model that is completely blind to fraud. This is why TrustGuard optimises for **Recall** (catching fraud), **Average Precision / AUPRC** (area under the precision-recall curve), and **AUC-ROC**, not accuracy. Accuracy is a deceptive metric on imbalanced data.\n\nTrustGuard uses the **PaySim synthetic dataset**, a mobile money transaction log generated by a multi-agent simulation calibrated against real financial data. It spans 30 simulated days at hourly granularity.\n\n| Property | Value |\n|---|---|\n| Total Transactions | 6,362,620 |\n| Fraud Cases | 8,213 |\n| Fraud Rate | 0.13% |\n| Transaction Types | CASH_OUT, TRANSFER, PAYMENT, DEBIT, CASH_IN |\n\nOne of the first insights from EDA: **fraud is not spread across all transaction types**. It is confined exclusively to `CASH_OUT` and `TRANSFER`. This makes structural sense: fraud follows the account-drain pattern of transferring funds to a mule account, then cashing out. `PAYMENT`, `DEBIT`, and `CASH_IN` are clean.\n\nThis single observation shaped the entire feature engineering approach.\n\nAt 0.13% fraud rate, SMOTE alone is not enough. Here is why.\n\nWith 5-fold cross-validation, some training folds can contain fewer than 10 actual fraud samples. SMOTE generates synthetic minority samples by interpolating between existing ones, but if there are only a handful of real fraud cases in a fold, SMOTE degenerates. 
The synthetic samples cluster too tightly and the model learns nothing useful.\n\nTrustGuard uses a **two-stage imbalance strategy**:\n\nFirst, before any train-test split, I apply a deterministic fraud injection step:\n\n- Select `TRANSFER` and `CASH_OUT` transactions\n- Set `amount = oldbalanceOrg` (full account drain)\n- Set `newbalanceOrig = 0`\n- Recompute `balanceDiff` and `amount_ratio`\n\n**Result:** Fraud rate rises from 0.13% → 1.26%.\n\nThe ablation study confirmed this was the single most important component in the pipeline. Removing it dropped CV F1 from 0.947 to 0.671, a 29% relative reduction.\n\nSecond, after the 80/20 stratified train-test split, SMOTE (`sampling_strategy=0.3`) is applied **inside an ImbPipeline per cross-validation fold**. This is critical: SMOTE is fitted only on the training portion of each fold. The validation fold never sees synthetic samples. This prevents data leakage.\n\nThe final training distribution: **23.07% fraud** (a 0.3 minority-to-majority ratio works out to 0.3 / 1.3 ≈ 23%).\n\n| Stage | Fraud Rate |\n|---|---|\n| Original Dataset | 0.13% |\n| After Fraud Simulation | 1.26% |\n| After SMOTE (training folds) | 23.07% |\n\nAfter cleaning, 12 features go into the model. The two most important are engineered:\n\n**balanceDiff** = `oldbalanceOrg − newbalanceOrig − amount`\n\nThis detects balance inconsistencies. In a legitimate transaction, the sender's balance falls by exactly the transferred amount, so this value is zero. In an account-drain fraud, this value becomes anomalous.\n\n**amount_ratio** = `amount / (oldbalanceOrg + 1)`\n\nThis approaches 1.0 in full account-drain attacks. For routine transfers it stays near zero.\n\nThe ablation confirmed their necessity: removing both dropped Test F1 from 0.5533 to 0.1538 and Test AP from 0.7317 to 0.6061. 
Without them, the model is nearly blind.\n\nAll four models were trained identically inside an `ImbPipeline(SMOTE → StandardScaler → Classifier)` with 5-fold stratified cross-validation.\n\n**Cross-Validation Results:**\n\n| Model | CV F1 | CV AUC-ROC |\n|---|---|---|\n| XGBoost | 0.949 ± 0.020 | 1.000 ± 0.000 |\n| Neural Network | 0.793 ± 0.061 | 0.999 ± 0.000 |\n| Random Forest | 0.711 ± 0.007 | 0.999 ± 0.000 |\n| Logistic Regression | 0.249 ± 0.003 | 0.977 ± 0.001 |\n\n**Test Set Results:**\n\n| Model | Test Recall | Test AUC | Test Avg Precision |\n|---|---|---|---|\n| XGBoost | 0.9976 | 0.9995 | 0.9358 |\n| Random Forest | 0.9976 | 0.9995 | 0.8870 |\n| Neural Network | 0.9732 | 0.9983 | 0.7081 |\n| Logistic Regression | 0.9860 | 0.9946 | 0.5567 |\n\nXGBoost leads or ties on every metric: it matches Random Forest on test recall and AUC, and its Test Average Precision (0.9358) is about 1.7× that of Logistic Regression (0.5567) at comparable recall. XGBoost was selected for deployment.\n\n| n_estimators | CV F1 |\n|---|---|\n| 100 | 0.921 |\n| 200 | 0.938 |\n| 300 | 0.949 |\n\nMore trees combined with a lower learning rate (0.05) generalised better. A max depth of 6 was chosen over 8 to avoid overfitting on fold-specific patterns.\n\nA fraud detection system that says \"this transaction is fraud\" without explaining why is not useful to an analyst, and not acceptable to a regulator.\n\nTrustGuard implements **SHAP TreeExplainer**, which computes exact Shapley values for each prediction. For every flagged transaction, a waterfall plot shows exactly which features pushed the prediction toward fraud and by how much.\n\nFor a sample transaction flagged at 94% fraud probability:\n\n- `amount_ratio ≈ 1.0` → largest push toward fraud (full drain detected)\n- `type_TRANSFER` → second largest push\n- `balanceDiff` → third largest push\n\nThis tells the analyst: *this transaction looks like fraud because it drained an account completely via a transfer operation*. 
That is auditable and defensible.\n\nThis is the part most fraud detection tutorials skip entirely.\n\nA model flagging a transaction at 97% is useful. A model that also cites the specific SBP regulatory provision being violated is operationally deployable.\n\n**Pipeline architecture:**\n\n- `all-MiniLM-L6-v2` (384-dimensional dense retriever)\n- Cross-encoder reranker (`ms-marco-MiniLM-L-6-v2`)\n\n**Results:** Average Precision@5 = 0.855 across 10 retrieval queries. Zero hallucinations across all four high-risk transaction evaluations.\n\n| Condition | CV F1 | Test F1 | Test AP |\n|---|---|---|---|\n| No Fraud Simulation (SMOTE only) | 0.671 | 0.6247 | 0.9363 |\n| Full Pipeline (baseline) | 0.947 | 0.5533 | 0.7317 |\n| No SMOTE | 0.947 | 0.9132 | 0.9639 |\n| SMOTE ratio = 0.3 (selected) | 0.947 | 0.5557 | 0.7322 |\n| SMOTE ratio = 0.5 | 0.947 | 0.5280 | 0.7180 |\n| No Engineered Features | 0.636 | 0.1538 | 0.6061 |\n| With Engineered Features (full) | 0.947 | 0.5533 | 0.7317 |\n\nThree takeaways: the Fraud Simulation Engine is irreplaceable (CV F1 falls from 0.947 to 0.671 without it), engineered features are critical (Test F1 falls from 0.5533 to 0.1538 without them), and dropping SMOTE gives better test precision (Test F1 of 0.9132, Test AP of 0.9639) but worse robustness.\n\n| Metric | Value |\n|---|---|\n| AUC-ROC | 0.9995 |\n| Recall | 0.9976 |\n| Average Precision (AUPRC) | 0.9358 |\n| Fraud cases caught (of 8,213) | 8,190 |\n| Fraud cases missed | 23 |\n| RAG hallucinations | 0 |\n| Retrieval Precision@5 | 0.855 |\n\nPython · XGBoost · Scikit-learn · imbalanced-learn · SHAP · ChromaDB · sentence-transformers · rank-bm25 · CrossEncoder · GPT-4o-mini · Streamlit · Pandas · NumPy\n\n**Live demo:** [trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app](https://trustguard-ai-fraud-detection-c7um3xntqvxthahgld5ucm.streamlit.app/)\n\n**Source code:** [github.com/whozahm3d/trustguard-ai-fraud-detection](https://github.com/whozahm3d/trustguard-ai-fraud-detection)\n\nIf you found this useful, the repo is public, so feedback, issues, and stars are all welcome.", "url": 
"https://wpnews.pro/news/i-built-a-fraud-detection-system-that-catches-99-76-of-fraud-here-s-everything-i", "canonical_source": "https://dev.to/whozahm3d/i-built-a-fraud-detection-system-that-catches-9976-of-fraud-heres-everything-i-learned-55h3", "published_at": "2026-06-06 14:58:44+00:00", "updated_at": "2026-06-06 15:11:58.986932+00:00", "lang": "en", "topics": ["machine-learning", "artificial-intelligence", "ai-products", "ai-research", "mlops"], "entities": ["PaySim", "TrustGuard AI", "XGBoost", "SHAP", "State Bank of Pakistan"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-fraud-detection-system-that-catches-99-76-of-fraud-here-s-everything-i", "markdown": "https://wpnews.pro/news/i-built-a-fraud-detection-system-that-catches-99-76-of-fraud-here-s-everything-i.md", "text": "https://wpnews.pro/news/i-built-a-fraud-detection-system-that-catches-99-76-of-fraud-here-s-everything-i.txt", "jsonld": "https://wpnews.pro/news/i-built-a-fraud-detection-system-that-catches-99-76-of-fraud-here-s-everything-i.jsonld"}}