Build Your Own Transaction Foundation Model for Financial Intelligence

NVIDIA released a developer example for building a transaction foundation model using accelerated computing, demonstrating a roughly 42% relative lift in fraud detection Average Precision over an XGBoost baseline. The model, pre-trained on unlabeled transaction sequences, can be applied to tasks including fraud detection, credit scoring, and lifetime value prediction.

Every swipe, transfer, and payment on a modern financial network encodes a pattern of human behavior. Transaction data is one of the richest signals an enterprise owns. Yet most production use cases for such tabular data still depend on hand-engineered features and rule sets that are brittle, expensive to maintain, and blind to the sequential structure inside a customer history. Foundation models, pre-trained on large volumes of unlabeled transaction sequences, change this equation by producing general-purpose representations of financial behavior that transfer across a wide array of downstream tasks. A single backbone covers fraud detection, credit scoring, lifetime value prediction, segmentation, personalized recommendations, recurrent-transaction detection, and more.

The industry signal is strong and accelerating. Innovative financial firms are training transformer-based models on billions of transactions, reporting double-digit relative lifts on production-scale tasks while simultaneously streamlining operations. See Stripe’s payments foundation model (https://stripe.com/us/newsroom/news/sessions-2025), Nubank’s NuFormer (https://arxiv.org/abs/2507.23267), Visa’s TransactionGPT (https://arxiv.org/abs/2511.08939), Mastercard’s large tabular model (https://www.mastercard.com/global/en/news-and-trends/stories/2026/mastercard-new-generative-ai-model.html), Revolut’s PRAGMA (https://arxiv.org/abs/2604.08649), Plaid’s transaction foundation model (https://plaid.com/blog/building-transaction-foundation-model-intelligent-finance/), and more.

The NVIDIA Build Your Own Transaction Model developer example (https://build.nvidia.com/nvidia/build-your-own-transaction-foundation-model) walks through how to build a transaction foundation model end-to-end using accelerated computing.

You will progress through five steps in this workflow:

- GPU-accelerated data processing with the NVIDIA CUDA-X library cuDF (https://developer.nvidia.com/topics/ai/data-science/cuda-x-data-science-libraries/cudf)
- Custom tokenization with the NVIDIA CUDA-X libraries cuDF (https://developer.nvidia.com/topics/ai/data-science/cuda-x-data-science-libraries/cudf) and cuML (https://developer.nvidia.com/topics/ai/data-science/cuda-x-data-science-libraries/cuml)
- Transformer decoder model pretraining from scratch with the NVIDIA NeMo AutoModel (https://docs.nvidia.com/nemo/automodel/latest/index.html) open library, part of the NVIDIA NeMo framework (https://github.com/NVIDIA-NeMo/)
- Extracting learned embeddings
- Augmenting a downstream fraud classifier with embeddings

By the end, you will reproduce a roughly 42% relative lift in Average Precision (AP) over a strong XGBoost (https://www.nvidia.com/en-us/glossary/xgboost/) baseline on the IBM TabFormer (https://github.com/IBM/TabFormer) fraud dataset. AP is the area under the precision-recall curve and captures how well the model ranks fraud across all operating thresholds. Figure 1, below, shows the end-to-end pipeline.

Why transformers fit transaction histories

Large language models learn from sequences of words. During pretraining, a model sees text and learns that words, phrases, and sentences carry meaning through order and context. A transaction foundation model applies the same principle to financial behavior. A sequence such as “paycheck deposit, grocery purchase, transit fare, recurring subscription, card-present restaurant payment” carries information that no single transaction row can express alone.

Transformers are well suited to this structure because self-attention can connect events that sit far apart in history. A fraudulent transaction may only look suspicious when paired with a recent travel pattern or a sudden burst of small authorizations.
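
To make the long-range point concrete, here is a minimal sketch of scaled dot-product attention computed for the most recent event in a short history. The two-dimensional event vectors and the event names are hypothetical, learned query/key projections and positional encodings are omitted for brevity, and none of this is taken from the developer example's code. Because the final position in a causal model may attend to every earlier position, no mask is needed for this single query.

```python
import math

# Hypothetical 2-d event vectors: axis 0 ~ "travel-related", axis 1 ~ "routine spend".
history = [
    ("flight booking",   [1.0, 0.0]),  # oldest event
    ("grocery purchase", [0.0, 1.0]),
    ("transit fare",     [0.0, 1.0]),
    ("grocery purchase", [0.0, 1.0]),
    ("subscription",     [0.0, 1.0]),
    ("foreign purchase", [1.0, 0.0]),  # most recent event, used as the query
]

def last_position_attention(vectors):
    """Softmax attention weights of the final position over the whole history."""
    query = vectors[-1]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = last_position_attention([vec for _, vec in history])
for (name, _), weight in zip(history, weights):
    print(f"{name:<17} {weight:.3f}")
```

In this toy setup the oldest event receives the same weight as the query itself and about twice the weight of each intervening routine purchase: the weight follows content similarity, not recency.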
Traditional tabular features can approximate these patterns, but engineers must decide up front which windows, aggregates, and rules to build. A pretrained transformer learns those relationships directly from the sequence.

This approach complements other NVIDIA financial AI workflows, including the NVIDIA AI Blueprint for financial fraud detection (https://developer.nvidia.com/blog/supercharging-fraud-detection-in-financial-services-with-graph-neural-networks/), which uses graph neural networks (GNNs). GNNs capture relationships across connected entities such as accounts, merchants, devices, and transactions. Transaction foundation models focus on behavioral histories within a customer or account sequence. In practice, the two methods produce embeddings with complementary information, so they pair naturally.

Load the data and set a baseline

Notebook `01_dataset_baseline.ipynb` loads the IBM TabFormer dataset (https://github.com/IBM/TabFormer), roughly 24.4M synthetic card transactions with a ~0.12% fraud rate, directly into GPU memory with cuDF (https://developer.nvidia.com/topics/ai/data-science/cuda-x-data-science-libraries/cudf). The dataset is split temporally by cumulative transaction count: the first 80% of transactions by date are used for training, the next 10% for validation, and the final 10% for testing. The splits therefore occupy disjoint, ordered time windows, which prevents data leakage and mirrors real-world production environments.

With the splits in place, the notebook trains an XGBoost classifier with native GPU acceleration (`tree_method="hist"` and `device="cuda"`) on a 1M-row balanced training sample. Evaluation runs on a 100k stratified holdout that preserves the realistic ~0.1% fraud prevalence.

The baseline numbers set the bar for the rest of the tutorial:

- Test ROC-AUC: 0.9885
- Test AP: 0.1238

Pay attention to AP rather than ROC-AUC.
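
To see why AP deserves the attention, the sketch below scores two synthetic rankings of 10,000 transactions containing 10 frauds (a 0.1% rate). The rankings are constructed by hand for illustration and are not TabFormer results; `average_precision`, `roc_auc`, and `ranking` are illustrative helpers written for this sketch, with AP computed as the standard step-wise mean of precision at each positive's rank.

```python
def average_precision(ranked_labels):
    """Step-wise AP: mean of precision@k over the ranks k that hold a positive."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

def roc_auc(ranked_labels):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    negatives_above, correct_pairs = 0, 0
    for label in ranked_labels:
        if label:
            correct_pairs += n_neg - negatives_above
        else:
            negatives_above += 1
    return correct_pairs / (n_pos * n_neg)

def ranking(n_total, n_pos, negatives_first):
    """Labels in descending-score order: some negatives, then all positives, then the rest."""
    return [0] * negatives_first + [1] * n_pos + [0] * (n_total - n_pos - negatives_first)

# Model A ranks 100 legitimate transactions above the frauds; model B ranks only 10 above.
model_a = ranking(10_000, 10, negatives_first=100)
model_b = ranking(10_000, 10, negatives_first=10)

for name, labels in [("A", model_a), ("B", model_b)]:
    print(name, f"ROC-AUC={roc_auc(labels):.4f}", f"AP={average_precision(labels):.4f}")
```

Between the two rankings ROC-AUC moves by less than 0.01, while AP rises from about 0.05 to about 0.33. For a review team that only inspects the top of the list, that is the difference that matters.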
With a fraud rate near 0.1%, ROC-AUC saturates quickly and hides meaningful differences in high-scoring regions. AP summarizes precision across the full recall range and responds to improvements where they matter operationally. Every subsequent model in this tutorial is judged by AP first.

Tokenize transactions on the GPU

General-purpose LLM tokenizers waste capacity on tabular financial data. For example, a byte pair encoding (BPE) tokenizer splits a single transaction into roughly 39 subword tokens, most of which encode commas and dollar signs rather than behavior. Notebook `02_seq_preproc_tokenization.ipynb` introduces a custom domain tokenizer that converts each transaction into roughly 12 semantic tokens with a much smaller vocabulary (6,251 symbols vs. 50,257 for BPE). Beyond raising the information density of each token, this efficiency fits more than 3x as many transactions into a fixed token budget. Practically speaking, a model with a context window of 4,092 tokens can fit a history of ~315 transactions from the domain tokenizer and only ~102 transactions from a BPE tokenizer. Figure 2, below, compares token counts per transaction between the two tokenization methods on the same records.

The domain tokenizer is implemented in `src/tokenizer/financial_pipeline.py` (https://github.com/NVIDIA-AI-Blueprints/transaction-foundation-model/blob/main/src/tokenizer/financial_pipeline.py). This flexible pipeline handles amount binning, merchant hashing, hour-of-day and day-of-week, month, card identity, chip type, ZIP3 and state, and customer identity. Every step runs on the GPU through cuDF.

The tokenizer can be readily adapted to different transaction schemas by adding or replacing individual steps in the modular pipeline. Each step implements a small `BaseTokenizer` (https://github.com/NVIDIA-AI-Blueprints/transaction-foundation-model/blob/main/src/tokenizer/base.py) interface, so extending coverage to new fields such as device ID or beneficiary country takes just a short subclass.
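
The idea of one semantic token per field can be sketched in plain Python. This is an illustration only, not the repository's pipeline: the real implementation runs on cuDF behind the `BaseTokenizer` interface, whose signature is not reproduced here, and the bin edges, bucket count, field names, and token strings below are all hypothetical.

```python
import bisect
import hashlib
from datetime import datetime

# Hypothetical bin edges in dollars; the developer example defines its own binning.
AMOUNT_EDGES = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000]
MERCHANT_BUCKETS = 1024  # hypothetical hash-bucket count

def amount_token(amount):
    """Map a dollar amount to a coarse bin so nearby amounts share a token."""
    return f"AMT_{bisect.bisect_right(AMOUNT_EDGES, abs(amount))}"

def merchant_token(merchant_name):
    """Hash an open-ended merchant name into a fixed number of buckets."""
    digest = hashlib.sha256(merchant_name.encode("utf-8")).hexdigest()
    return f"MERCH_{int(digest, 16) % MERCHANT_BUCKETS}"

def time_tokens(timestamp):
    """Emit hour-of-day, day-of-week (Monday=0), and month tokens."""
    return [f"HOUR_{timestamp.hour}", f"DOW_{timestamp.weekday()}", f"MONTH_{timestamp.month}"]

def tokenize(txn):
    """Turn one transaction dict into a short list of semantic tokens."""
    return [
        amount_token(txn["amount"]),
        merchant_token(txn["merchant"]),
        *time_tokens(txn["timestamp"]),
        f"CHIP_{txn['chip'].upper()}",
        f"ZIP3_{txn['zip'][:3]}",
        f"STATE_{txn['state']}",
    ]

txn = {
    "amount": 42.50,
    "merchant": "Corner Grocery",
    "timestamp": datetime(2020, 3, 14, 18, 5),
    "chip": "swipe",
    "zip": "94103",
    "state": "CA",
}
tokens = tokenize(txn)
print(len(tokens), tokens)
```

Each field costs exactly one token regardless of how long its text form is, which is where the density gain over a subword tokenizer comes from. Supporting a new field amounts to adding one more small function of the same shape.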

Pretrain with NeMo AutoModel

NeMo AutoModel is a PyTorch-native open-source training library under the NVIDIA NeMo Framework, designed to streamline and scale training and fine-tuning for LLMs and VLMs. Notebook `03_foundation_model_training.ipynb` pretrains a decoder-only foundation model on the tokenized corpus using causal language modeling. The objective is simple (predict the next token given every previous token), but the supervision signal is dense. Every position in a sequence contributes a gradient, so a single packed transaction sequence yields thousands of next-event predictions.

The model is a compact Llama decoder defined in `configs/pretrain_financial_decoder.yaml` (https://github.com/NVIDIA-AI-Blueprints/transaction-foundation-model/blob/main/configs/pretrain_financial_decoder.yaml):

- ~29M parameters
- Hidden size 512, 8 transformer layers
- Grouped-Query Attention with 8 query heads and 2 KV heads
- 8,192-token RoPE context window
- SwiGLU activation, RMSNorm, domain vocabulary of 6,251 tokens

NeMo AutoModel handles the rest of the stack. Kick off a single-GPU sanity run:

    python scripts/train_decoder_model.py \
      --config configs/pretrain_financial_decoder.yaml \
      --step_scheduler.max_steps 30

The 30-step demo drops training loss from ln(6251) ≈ 8.74 (the random-guess baseline for this vocabulary) to around 6.0. To scale the same run to eight GPUs, launch the script with `torchrun --nproc-per-node=8` in place of `python`; no changes to the script or distributed boilerplate are required. Multi-node scaling is straightforward as well. NeMo AutoModel wires up FSDP2 sharding, mixed precision, gradient accumulation, and checkpoint consolidation from the YAML.
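
As a quick check on the loss figures quoted for the 30-step run, the snippet below derives the random-guess baseline from the vocabulary size and converts both losses to perplexity. It is a small worked calculation, not part of the developer example; `perplexity` is a helper defined here for illustration.

```python
import math

VOCAB_SIZE = 6251  # domain vocabulary size quoted above

# Cross-entropy of a uniform guess over the vocabulary: -ln(1/V) = ln(V).
random_guess_loss = math.log(VOCAB_SIZE)

def perplexity(loss):
    """exp(loss): the effective number of equally likely next tokens."""
    return math.exp(loss)

print(f"random-guess loss:        {random_guess_loss:.2f}")
print(f"perplexity at that loss:  {perplexity(random_guess_loss):.0f}")
print(f"perplexity at loss 6.00:  {perplexity(6.0):.0f}")
```

Read this way, a drop from 8.74 to about 6.0 shrinks the model's effective choice set for the next token from 6,251 candidates to roughly 400 after only 30 steps.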
Checkpoints land as standard safetensors files, which means the trained backbone loads with a one-liner anywhere Hugging Face Transformers is installed:

    from transformers import AutoModelForCausalLM

    # Load the pretrained backbone from the local checkpoint directory.
    model = AutoModelForCausalLM.from_pretrained("models/decoder-foundation-model")

The repository ships a full checkpoint trained for 3,000 steps, which Notebooks 04 and 05 load; the 30-step run serves only as a demonstration and sanity check. To swap architectures, edit `model._target_` and `model.config._target_` in the YAML. Any Hugging Face-compatible decoder is designed to drop in without training-code changes.

Extract embeddings at scale

Notebook `04_inference_embedding_extraction.ipynb` turns the pretrained backbone into a feature extractor. It loads the checkpoint with `AutoModelForCausalLM`, requests `output_hidden_states=True`, and pools the final hidden layer down to a 512-dim vector per user history.

For decoder-only models with causal attention, only the final position has observed the entire sequence, while earlier positions are blind to later tokens. Last-token pooling therefore picks the most informative location in the sequence. The implementation in `src/decoder_inference.py` (https://github.com/NVIDIA-AI-Blueprints/transaction-foundation-model/blob/main/src/decoder_inference.py) uses the attention mask to find the last non-pad token per row and gathers its hidden state. The extraction loop is a single call:

    embeddings = inference.extract_embeddings_batched(padded_ids, batch_size=1024, show_progress=True)

The notebook extracts and saves train, validation, and test embeddings as .npy files. It also saves a metadata.json describing shapes and row alignment, which Notebook 05 later uses to join embeddings back to the associated raw tabular features.

Figure 3, below, shows a 3D UMAP projection of 50k validation embeddings, colored by merchant industry category and zip code.
Visible clusters in each field indicate that the backbone has learned semantically coherent representations without ever seeing any target labels during pretraining.

Figure 3. 3D UMAP projection of 50,000 validation-set transaction embeddings. Points colored by merchant industry and user zip code each show clear behavioral clusters in the learned representation space.

Measure lift on a downstream task

Notebook `05_xgboost_fraud_detection.ipynb` answers the billion-dollar question: can transaction foundation model embeddings move downstream metrics? It trains three GPU XGBoost classifiers and evaluates all of them on the same 100k stratified test set:

- Raw: 13 hand-engineered tabular features (the baseline from Step 1)
- Embeddings: 512-dim foundation-model vectors compressed to 64d with PCA (~78% variance retained)
- Combined: raw features concatenated with the 64d embeddings, 77d total

Table 1, below, summarizes the test results.

| Model | Features | Test ROC-AUC | Test AP |
|---|---|---|---|
| Raw baseline | 13 | 0.9885 | 0.1238 |
| Embeddings only | 64 | 0.8775 | 0.0123 |
| Combined | 77 | 0.9925 | 0.1755 |

Table 1. Downstream fraud-detection results on the TabFormer temporal test split. The combined model delivers a +0.41% ROC-AUC lift and a +41.76% AP lift over the raw-feature baseline.

The combined model lifts ROC-AUC by 0.41% and AP by 41.76%, both relative to the baseline. That AP delta is the operational win: a review team with fixed daily capacity catches materially more fraud at the same workload. Embeddings encode the user’s transaction history and carry predictive power, but on their own they underperform the baseline. The combined model draws on event-level information from the raw tabular row and on sequence-level historical context from the embeddings learned during pretraining. Figure 4, below, shows the comparison visually.

Figure 4. Side-by-side comparison of test ROC-AUC and test AP for the three downstream models.
The combined model (raw features + foundation-model embeddings) wins on both metrics.

Customize the developer example

The repository is structured so that each component is swappable independently:

- Tokenizer: Adapt the pipeline in `src/tokenizer/` (https://github.com/NVIDIA-AI-Blueprints/transaction-foundation-model/tree/main/src/tokenizer) to any transaction schema by adding or replacing steps. Each step is a small subclass of `BaseTokenizer`, so supporting new fields such as device fingerprint, beneficiary country, and merchant country is a short addition.
- Model architecture: Edit `model._target_` and `model.config._target_` in the training YAML to point at any Hugging Face-compatible decoder. The rest of the training pipeline (NeMo data loader, FSDP2, checkpointing, evaluation) stays put.
- Downstream task: Replace XGBoost with any model that consumes fixed-length feature vectors. Churn prediction, customer segmentation, lifetime value regression, next-best-action ranking, and credit scoring all fit the same embedding-plus-head pattern.

The developer example is also designed to extend to labels other than fraud, which is what makes the backbone foundational. Swap `Is Fraud?` in Step 5, above, for any event label that aligns with the user histories encoded by the backbone.

Get started

You now have a reference path from raw transaction logs to a pretrained foundation model that augments a downstream classifier, accelerated end-to-end with NVIDIA. The three components (a custom tokenizer, a transformer decoder backbone, and an embedding-driven XGBoost head) together deliver a roughly 42% relative AP lift over a strong industry-standard baseline on the TabFormer fraud benchmark.
Visit build.nvidia.com (https://build.nvidia.com/nvidia/build-your-own-transaction-foundation-model) to deploy the notebooks in a GPU-accelerated environment via NVIDIA Launchable (https://brev.nvidia.com/launchable/deploy?launchableID=env-3CBNhPVekCGLYa412iKiXGqMwVJ), or run them in your own environment from the GitHub repository (https://github.com/NVIDIA-AI-Blueprints/transaction-foundation-model).