NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

NVIDIA's Blackwell platform set a new STAC-AI benchmark record for large language model inference in financial services, outperforming prior results on the STAC-AI LANG6 test. The benchmark, developed by the Strategic Technology Analysis Center, evaluates end-to-end retrieval-augmented generation and LLM inference performance using financial datasets based on EDGAR filings. The record demonstrates Blackwell's capability to accelerate AI-driven trading analysis and investment strategy automation for the finance industry.

Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to generate actionable trading insights. These advanced AI systems can process financial news, social media sentiment, earnings reports, and market data to predict stock price movements and automate investment strategies with unprecedented accuracy.

The Strategic Technology Analysis Center (STAC, https://stacresearch.com/) has been developing benchmarks for the workloads key to the financial industry for over 15 years. It developed the STAC-AI benchmark to help companies assess the end-to-end retrieval-augmented generation (RAG, https://www.nvidia.com/en-us/glossary/retrieval-augmented-generation/) and LLM inference (https://www.nvidia.com/en-us/glossary/ai-inference/) pipeline.

This post presents the results achieved on the STAC-AI LANG6 benchmark across multiple NVIDIA platforms. It also shares recommendations on how any user can benchmark NVIDIA TensorRT LLM (https://github.com/NVIDIA/TensorRT-LLM) according to the specifications of their dataset.

STAC-AI LANG6 Inference-Only Benchmark

In the broader context of a RAG pipeline, STAC-AI LANG6 (https://docs.stacresearch.com/ai) is the part of the benchmark that focuses on LLM inference performance. The benchmark tests the hardware and software stack on the Llama 3.1 8B Instruct (https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) and Llama 3.1 70B Instruct (https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) models in combination with the following custom datasets:

- EDGAR4: The prompts ask for summaries of the relationship of a company to one of various physical and financial concepts (such as commodities, currencies, interest rates, and real estate sectors). It uses EDGAR 10-K paragraphs from a single security filing for a single year. The input/output sequence length aims to model medium-length requests.
- EDGAR5: Questions covering several different aspects of a complete 10-K filing. The document is the complete text of a single EDGAR 10-K filing. The input/output sequence length aims to model long-context requests.

These datasets, based on EDGAR filings, model medium- and long-context summarization for financial trading and investment advice use cases. The prompts ask the model to analyze and summarize annual reports (10-K filings) for thousands of public companies over the past five years.

The benchmark also tests two different inference scenarios, batch mode and interactive mode:

- Batch (offline) mode: All requests are given at once, and all responses are collected at once. Only throughput is measured.
- Interactive (online) mode: Requests arrive at pseudo-random times. The mean arrival rate λ (the average number of requests the system receives every second) can be set to model different usage scenarios. The benchmark collects metrics such as reaction time (RT), total words per second (WPS), and Output Rate (WPS/user), but does not set any constraint on them. RT is analogous to time to first token (TTFT) in other benchmarks, and Output Rate to words/second/user. Note that interactive mode does not cover the combination of Llama 3.1 70B Instruct with EDGAR5.

The benchmark checks the quality and word count of the output against a control set of LLM-generated responses.

While other benchmarks allow all preprocessing to be done ahead of time, an important differentiator of STAC-AI is that chat templates must be applied and requests tokenized during inference. Real deployments may prefer to do this work on the server side to protect their system prompts, which places more load on the CPU.
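To make the role of the mean arrival rate λ in interactive mode more concrete, the following minimal sketch generates a request schedule for a load generator. It assumes a Poisson arrival process (exponentially distributed gaps between requests), which is a common way to model pseudo-random arrivals; STAC does not state here which distribution it uses, and the function name `poisson_arrival_times` is hypothetical.

```python
import random


def poisson_arrival_times(rate_per_s: float, num_requests: int, seed: int = 0) -> list[float]:
    """Return cumulative arrival timestamps in seconds.

    Assumption: arrivals follow a Poisson process, so the gaps between
    consecutive requests are exponentially distributed with mean 1 / rate_per_s.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    now = 0.0
    timestamps = []
    for _ in range(num_requests):
        now += rng.expovariate(rate_per_s)  # draw the next inter-arrival gap
        timestamps.append(now)
    return timestamps


# Example: 10,000 requests at a target mean rate of 2 requests per second.
arrivals = poisson_arrival_times(rate_per_s=2.0, num_requests=10_000, seed=42)
observed_rate = len(arrivals) / arrivals[-1]
print(f"target rate: 2.00 req/s, observed rate: {observed_rate:.2f} req/s")
```

Sweeping `rate_per_s` over a range of values is one way to trace out the tradeoff between throughput and interactivity for a given system.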

Hardware and software stack

This post highlights STAC-AI audits run for an on-premises NVIDIA Hopper (https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)-based server submitted by HPE, an on-premises NVIDIA RTX PRO 6000 Blackwell Server Edition (https://www.nvidia.com/en-us/data-center/rtx-pro-6000-blackwell-server-edition/) system submitted by Supermicro and Red Hat, and NVIDIA HGX B200 (https://www.nvidia.com/en-us/data-center/hgx/) on Lambda (https://lambda.ai/1-click-clusters).

- The HPE ProLiant Compute DL384 Gen12 (https://www.hpe.com/us/en/compute/proliant-dl384-gen12.html), powered by the NVIDIA GH200 Grace Hopper Superchip (https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/), provides an efficient single-server solution. For detailed results, refer to the STAC report on HPE ProLiant DL384 Gen12 server with two NVIDIA GH200 NVL2 Superchips (https://docs.stacresearch.com/HPE250907a).
- A cloud-based instance provided by Lambda, based on NVIDIA HGX B200 (https://www.nvidia.com/en-us/data-center/hgx/). The system uses eight NVIDIA Blackwell (https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/) B200 GPUs in an HGX platform, connected with NVIDIA NVLink and NVIDIA NVSwitch (https://www.nvidia.com/en-us/data-center/nvlink/) for high-speed GPU-to-GPU communication. Each NVIDIA Blackwell B200 GPU includes 180 GB of HBM3e memory and 8 TB/s of memory bandwidth for large-model inference. For detailed benchmark results, see the Llama 3.1 8B and Llama 3.1 70B companion STAC report on Lambda 1-Click Cluster Cloud Instance with NVIDIA B200 SXM6 Blackwell Series GPUs (https://docs.stacresearch.com/LMBD260507).
- Another on-premises option is the Supermicro AS -5126GS-TNRT (https://www.supermicro.com/en/products/system/gpu/5u/as-5126gs-tnrt) in the two NVIDIA RTX PRO 6000 Blackwell Server Edition configuration, which pairs two Blackwell GPUs in a single server for AI development and deployment.
  Each RTX PRO 6000 Blackwell GPU includes 96 GB of memory, supplying the node with substantial aggregate GPU memory for larger models, larger batch sizes, or more concurrent jobs within the same system footprint. For details about the results, see the STAC report on Supermicro SuperServer SYS-222C-TN with two NVIDIA RTX PRO 6000 Blackwell Series GPUs (https://www.supermicro.com/thought-leadership/STAC-AI-Audited-Report-SMCI260303.pdf). The full stack was deployed on Red Hat OpenShift, demonstrating that the containerized Kubernetes platform introduces no measurable overhead for GPU-intensive LLM inference workloads.

Because the benchmark requires post-training quantization as part of the benchmarking procedure, the models were quantized using NVIDIA TensorRT Model Optimizer (https://github.com/NVIDIA/Model-Optimizer). To leverage the most performant kernels available for each deployment, the models were quantized to FP8 on NVIDIA Hopper and to NVFP4 (https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) on NVIDIA Blackwell.

To achieve the best performance on both Hopper and Blackwell, the TensorRT LLM inference framework was used for efficient model execution. The quantized models were run using the TensorRT LLM PyTorch runtime for a familiar, native PyTorch development experience while maintaining peak performance.

Benchmarking results on STAC-AI LANG6

Benchmarking results for both batch mode and interactive mode are detailed in this section.

Batch mode

For batch mode, NVIDIA HGX B200, based on NVIDIA Blackwell (https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), delivers significant speedups in all scenarios. Table 1 shows the WPS and requests per second (RPS) achieved.

| Model | Dataset | 2x NVIDIA GH200 144 GB, TensorRT LLM FP8 (WPS) | 2x NVIDIA GH200 144 GB, TensorRT LLM FP8 (RPS) | NVIDIA HGX B200, TensorRT LLM NVFP4 (WPS) | NVIDIA HGX B200, TensorRT LLM NVFP4 (RPS) | 2x NVIDIA RTX PRO 6000, NVFP4 (WPS) | 2x NVIDIA RTX PRO 6000, NVFP4 (RPS) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | EDGAR4 | 8,237 | 51.5 | 52,823 | 311 | 5,500 | 32.9 |
| Llama 3.1 8B | EDGAR5 | 304 | 0.784 | 2,220 | 5.64 | 138 | 0.345 |
| Llama 3.1 70B | EDGAR4 | 1,071 | 6.77 | 12,040 | 76.2 | 831 | 5.26 |
| Llama 3.1 70B | EDGAR5 | 41.4 | 0.119 | 350 | 1.07 | 13 | 0.04 |

Table 1. STAC-AI batch mode results across all model and dataset combinations

More details for both interactive and batch modes are available in the full reports published by STAC.

Single-GPU performance was also derived to account for the different number of GPUs on each system. Although STAC-AI does not measure per-GPU performance, the results shown in Figure 1 illustrate the throughput difference between single GPUs from each of the systems.

Interactive mode

The balance between token economics (dependent on throughput) and user experience (dependent on interactivity metrics such as RT and WPS/user) is a crucial factor in modern LLM inference. Interactive mode showcases the tradeoff across the interactivity-throughput Pareto front by selecting a range of arrival rates.

Interactivity is measured by both RT and WPS/user. To facilitate visualization, the inverse of WPS/user, defined as interword latency (IWL), or \(\frac{1}{\text{WPS/user}}\), is used. The graphs use the 95th percentile of both metrics.

As seen in Figure 2, the NVIDIA HGX B200 system achieves a better tradeoff between throughput and both RT and IWL across the board. IWL (solid, lower is better) and RT (dashed, lower is better) are plotted versus interactive-mode throughput across model/dataset scenarios.

How to benchmark TensorRT LLM with your custom data

While the STAC benchmark uses proprietary data and metrics, you can benchmark TensorRT LLM with a model and a synthetic dataset tailored to your specific dataset characteristics.
This tutorial walks you through quantizing a model, preparing your dataset, and running performance benchmarks, all customized for your use case.

Prerequisites:

- A Docker image that includes TensorRT LLM, for example, TensorRT LLM Release (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release).
- An NVIDIA GPU that is large enough to serve your model at the desired quantization level. You can find a support matrix for quantization in the TensorRT LLM documentation (https://nvidia.github.io/TensorRT-LLM/latest/features/quantization.html#hardware-support-matrix).
- A Hugging Face account and token, along with access to the gated models of Llama 3.1 8B Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) or Llama 3.1 70B Instruct (https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct). You can set the HF_TOKEN environment variable to your token, and all subsequent commands will use this token.

Step 1: Launch the container

The containers maintained by NVIDIA come with all of the required dependencies pre-installed. Change into an empty directory with enough space for the models and their quantized versions. You can start the container on a machine with NVIDIA GPUs with the following command. Make sure you specify your Hugging Face token.

    docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
        --gpus=all \
        -u $(id -u):$(id -g) \
        -e USER=$(id -un) \
        -e HOME=/tmp \
        -e TRITON_CACHE_DIR=/tmp/.triton \
        -e TORCHINDUCTOR_CACHE_DIR=/tmp/.inductor_cache \
        -e HF_HOME=/workspace/model_cache \
        -e HF_TOKEN=<your_huggingface_token> \
        --volume "$(pwd)":/workspace \
        --workdir /workspace \
        nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2

Step 2: Clone the repository

Model quantization reduces model size and improves inference speed. Use NVIDIA Model Optimizer to quantize Llama 3.1 8B Instruct to NVFP4 format.
First, clone the Model Optimizer repository for the quantization example:

    git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git -b 0.37.0

Step 3: Quantize the model

Next, execute the Hugging Face example script with the chosen model and quantization format, in this case Llama 3.1 8B Instruct using NVFP4 quantization.

    bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --quant nvfp4

Step 4: Generate synthetic data

Use the benchmark utility to generate a synthetic dataset with the token distribution needed for a task. This example creates 30,000 requests with a fixed input sequence length of 2,048 tokens and an output sequence length of 128 tokens. If you know the length distribution of your real traffic, nonzero standard deviations approximate it better.

    trtllm-bench \
        --model meta-llama/Llama-3.1-8B-Instruct \
        prepare-dataset \
        --output dataset_2048_128.json \
        token-norm-dist \
        --input-mean 2048 \
        --output-mean 128 \
        --input-stdev 0 \
        --output-stdev 0 \
        --num-requests 30000

Step 5: Run the benchmark

The trtllm-bench command can run the generated requests in an offline fashion, sending all requests at once to the TensorRT LLM runtime (closely matching STAC-AI's batch mode). While some options are available through the CLI, the full LLM API can be accessed through a YAML file (https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/run-benchmark-with-trtllm-serve.html#about-extra-llm-api-options) passed with the --extra_llm_api_options parameter. For the purposes of this example, enable CUDA Graphs padding. To learn about more options, see the TensorRT LLM API Reference (https://nvidia.github.io/TensorRT-LLM/latest/llm-api/reference.html).
Create the options file:

    cat > llm_options.yml << 'EOF'
    cuda_graph_config:
      enable_padding: True
    EOF

Finally, run the benchmark, specifying the model, the dataset, and the options:

    trtllm-bench \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --model_path /workspace/TensorRT-Model-Optimizer/examples/llm_ptq/saved_models_Llama-3_1-8B-Instruct_nvfp4 \
        throughput \
        --dataset dataset_2048_128.json \
        --backend pytorch \
        --extra_llm_api_options llm_options.yml

This outputs various metrics, such as the request throughput and the tokens/second/GPU.

Get started with TensorRT LLM benchmarking

NVIDIA HGX B200 on Lambda significantly advanced performance on the STAC-AI LANG6 benchmark for LLM inference in financial services. NVIDIA Blackwell delivered up to 2.8x the performance of previous architectures, achieving higher throughput while consistently maintaining superior interactivity.

The NVIDIA RTX PRO 6000 Blackwell results highlight the flexibility of the Blackwell platform. Running on Red Hat OpenShift, the two-GPU Supermicro system delivered competitive LLM inference performance. This means organizations can right-size their deployment, from a space- and cost-efficient server to a full-scale data center node, while keeping the performance benefits of Blackwell NVFP4 precision.

Alongside the new record, NVIDIA Hopper continues to deliver strong results for LLM inference workloads. More than three years after its initial release, Hopper remains highly effective in both batch and interactive inference scenarios, maintaining good performance metrics even at high throughput and confirming its continued relevance for financial institutions.

To set up and run your own performance evaluations, explore the TensorRT LLM Benchmarking Guide (https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html).
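The per-GPU comparison mentioned alongside the batch-mode results can be approximated directly from Table 1. The sketch below divides each system's WPS by its GPU count and compares HGX B200 against the GH200 system. It rests on several assumptions: throughput scales linearly across the GPUs in a system, every GPU was used for each result, the GPU counts are 2, 8, and 2 as described in the hardware section, and the HGX B200 entry for Llama 3.1 8B on EDGAR4 is read as 52,823 WPS. The system labels and helper names are hypothetical, and this is not necessarily how the figures in this post were derived, nor is it a STAC-audited metric.

```python
# GPU count per audited system (assumption: all GPUs were used for every result).
GPUS = {"2x GH200": 2, "HGX B200": 8, "2x RTX PRO 6000": 2}

# Batch-mode words per second (WPS) from Table 1, keyed by (model, dataset).
WPS = {
    ("Llama 3.1 8B", "EDGAR4"): {"2x GH200": 8237, "HGX B200": 52823, "2x RTX PRO 6000": 5500},
    ("Llama 3.1 8B", "EDGAR5"): {"2x GH200": 304, "HGX B200": 2220, "2x RTX PRO 6000": 138},
    ("Llama 3.1 70B", "EDGAR4"): {"2x GH200": 1071, "HGX B200": 12040, "2x RTX PRO 6000": 831},
    ("Llama 3.1 70B", "EDGAR5"): {"2x GH200": 41.4, "HGX B200": 350, "2x RTX PRO 6000": 13},
}


def per_gpu(wps_by_system: dict) -> dict:
    """Divide each system's WPS by its GPU count (simple linear normalization)."""
    return {name: wps / GPUS[name] for name, wps in wps_by_system.items()}


def speedup_vs_baseline(wps_by_system: dict, system: str, baseline: str = "2x GH200") -> float:
    """Ratio of per-GPU WPS between a system and the baseline system."""
    normalized = per_gpu(wps_by_system)
    return normalized[system] / normalized[baseline]


speedups = {key: speedup_vs_baseline(row, "HGX B200") for key, row in WPS.items()}
for (model, dataset), ratio in speedups.items():
    print(f"{model:<14} {dataset}: HGX B200 per-GPU WPS is {ratio:.2f}x GH200")
```

Under these assumptions, the largest per-GPU ratio is about 2.8x, for Llama 3.1 70B on EDGAR4, which is in line with the "up to 2.8x" figure quoted in this post.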