Deep Learning Inference: PyTorch, ONNX, and TensorRT Explained

A developer built a custom Inference Optimization Engine on an NVIDIA RTX 4050 GPU to analyze how PyTorch, ONNX, and TensorRT interact with hardware, revealing that model deployment and optimization constitute 80% of the work. The project benchmarks latency and throughput using metrics like P99 to identify bottlenecks, emphasizing that raw model accuracy is insufficient for real-time applications.

If you are learning machine learning, you have probably lived this exact scenario: you spend hours cleaning a dataset, you build a PyTorch model, you run the training loop, and after watching the progress bar for hours, your model finally reaches 95% accuracy.

But if you take that 95% accurate model and deploy it in a real-time web app, a trading algorithm, or a self-driving car, it might take five seconds just to process a single image. The user gets frustrated, the trade is missed, the car crashes.

Why? Because training the model is only 20% of the work. The other 80% is deployment and optimization.

Recently, I built a custom Inference Optimization Engine on an NVIDIA RTX 4050 GPU to understand exactly how PyTorch, ONNX, and TensorRT interact with hardware. We are going to cover everything from the basics to system-level optimizations, memory bandwidth, and the reality of dependency management.

What is inference?

Training: A computationally intensive process in which the model iteratively processes the data in epochs, calculating errors and updating weights and biases through backpropagation to minimize those errors. Training requires significant compute resources (GPUs/TPUs), large amounts of memory, and extended processing time.

Model inference: The execution phase, where new, unseen live data is passed through the trained model to generate an immediate prediction, classification, or output. The model’s parameters are completely “frozen”: no parameter updates, backpropagation, or learning occur. The system applies the already-learned mathematical operations to the incoming data. The system architecture is optimized for low latency (fast response times) and high throughput (processing large volumes of requests efficiently).

What is optimization?

Deep learning models are massive mathematical calculators. Optimization is the engineering process of making the math run faster, use less memory, and cost less in cloud server fees, ideally without sacrificing the accuracy you achieved during training.
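To make the “frozen parameters” point from the inference description concrete, here is a minimal PyTorch sketch. It assumes `torch` is installed, and `TinyNet` and `predict` are hypothetical names for a tiny stand-in network, not the models from the benchmark. `model.eval()` switches layers such as dropout to their inference behaviour, and `torch.inference_mode()` turns off gradient tracking so nothing is recorded for backpropagation.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in classifier; not the article's benchmark model."""

    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # active in training, a no-op in eval mode
            nn.Linear(32, 4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


def predict(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Run a forward pass only: no gradients, no weight updates."""
    model.eval()  # inference behaviour for dropout / batch norm
    with torch.inference_mode():  # skip autograd bookkeeping entirely
        return model(batch)


torch.manual_seed(0)
model = TinyNet()
weights_before = [p.detach().clone() for p in model.parameters()]

batch = torch.randn(8, 16)  # 8 "unseen" samples with 16 features each
logits = predict(model, batch)

# The parameters are untouched and the output carries no gradient history.
weights_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(weights_before, model.parameters())
)
print(logits.shape, logits.requires_grad, weights_unchanged)
```

Because dropout is disabled in eval mode, calling `predict` twice on the same batch returns the same logits, which is the behaviour you want from a deployed model.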
What is benchmarking and why do we do it?

Benchmarking is the act of measuring your model’s performance under controlled conditions to find bottlenecks. Is the GPU too slow? Is the CPU struggling to feed data? Is the motherboard limiting transfer speeds? You cannot fix what you have not measured.

When profiling a deep learning model for production deployment, performance is determined by measurable system metrics rather than framework-level abstractions. Evaluating a model requires analyzing the distribution of execution times under specific hardware constraints.

Latency (execution time): The amount of time it takes for one single piece of data, such as an image or an audio clip, to go through the model and produce an answer. It is measured in milliseconds (ms).

Throughput (volume and capacity): The total amount of data the system can process in a given time frame. It is measured in samples per second or frames per second. It depends heavily on batch size (how many items you process at once). If you process one image at a time, the model starts instantly (low latency). If you wait to group 32 images together, each request takes a little longer to finish (higher latency). However, batching allows the GPU to work much more efficiently.

The problem with averages: why do we use P99 latency?

Averages hide rare but massive delays caused by system hiccups (like the computer pausing to move memory around, or the GPU getting momentarily stuck). Instead, we have to look at percentiles: P99 is the latency that 99% of requests stay at or below, so it marks the boundary of the slowest 1% of requests.

Why this matters: if your model has an average latency of 20 ms, it looks great, but what if 1 out of every 100 requests gets stuck and takes 800 ms? The average completely hides this flaw. P99 forces you to look at that 800 ms delay.

When you build a model in PyTorch, it starts out extremely “heavy”. Let’s understand why.

FP32: When PyTorch trains a model, it uses 32-bit floating-point numbers (FP32) by default. This means every single number inside your model is stored with a long, highly detailed decimal expansion (like 3.14159265…).
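To make “32-bit” concrete, here is a small standard-library sketch showing how much of that decimal a 32-bit float actually keeps, and what 32 bits per number means for memory. For comparison it stores the same value in the 16-bit half-precision format as well. The `round_trip` helper is a hypothetical name, and the 11.7 million parameter count is an illustrative assumption (roughly the size of ResNet18), not a measurement from the benchmark.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


pi_like = 3.14159265

as_fp32 = round_trip(pi_like, "f")  # IEEE 754 single precision, 4 bytes
as_fp16 = round_trip(pi_like, "e")  # IEEE 754 half precision, 2 bytes

print(f"original : {pi_like!r}")
print(f"as FP32  : {as_fp32!r}  (error {abs(as_fp32 - pi_like):.2e})")
print(f"as FP16  : {as_fp16!r}  (error {abs(as_fp16 - pi_like):.2e})")

# Memory needed just to hold the weights, assuming ~11.7 million parameters
# (an illustrative figure, roughly ResNet18-sized).
num_params = 11_700_000
fp32_megabytes = num_params * struct.calcsize("f") / 1e6
fp16_megabytes = num_params * struct.calcsize("e") / 1e6
print(f"FP32 weights: {fp32_megabytes:.1f} MB, FP16 weights: {fp16_megabytes:.1f} MB")
```

FP32 keeps about seven significant decimal digits, while the 16-bit format keeps about three, at half the memory per number.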
We use FP32 during training because the model must make microscopic adjustments to its weights. It is like editing an uncompressed 4K movie: you need every single pixel of detail so you do not ruin the footage.

FP16: Once the model is fully trained, it does not need that microscopic level of detail just to make a final prediction. We can cut the number of bits in half, converting the values to 16-bit numbers (FP16). This is like compressing a massive 4K video file into a 1080p stream.

Tensor Cores: On modern NVIDIA RTX graphics cards, cutting data down to FP16 unlocks a hidden power: Tensor Cores, dedicated hardware units built for fast low-precision matrix math. With a single line of code, model.half(), we can put the model in the fast lane.

To understand why PyTorch models run slowly, we have to look at how Python communicates with the graphics card. It comes down to a problem called framework overhead.

Standard PyTorch (eager mode): When you build a standard PyTorch model, it runs step by step in a mode called “eager mode”. Python reads the first layer of the neural network and sends that specific math problem to the GPU. Then it reads the second layer and sends that one, and so on, one small instruction at a time. The math happening inside the GPU is fast. However, the cost of dispatching every operation separately from Python adds up, and the GPU can sit idle while it waits for its next instruction. This is called “framework overhead.”

Compiled mode (torch.compile): When you use torch.compile(model), PyTorch stops acting step by step. Instead, it looks ahead at your entire neural network before it starts running. If it sees three math operations that always happen right after each other, it melts them together into a single, larger GPU kernel. This is called operator fusion.

When I benchmarked standard PyTorch eager mode on my RTX 4050 with ResNet18, the step-by-step overhead held it back to about 938 images per second. Simply adding that one torch.compile line fused the operations together and raised the speed to over 1,215 images per second, roughly a 30% improvement.
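Throughput figures like the ones above come from timing many repeated forward passes. The sketch below shows one way such a harness could look, using only the standard library. It is an assumption about the general method, not the code from the author's repository: `benchmark`, `percentile`, and `fake_forward_pass` are hypothetical names, and the workload is a plain Python function standing in for a real model.

```python
import math
import time
from typing import Callable, Optional


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(
    run_batch: Callable[[], object],
    batch_size: int,
    warmup: int = 10,
    iterations: int = 100,
    sync: Optional[Callable[[], None]] = None,
) -> dict[str, float]:
    """Time `run_batch` and report latency (ms) and throughput (samples/sec)."""
    for _ in range(warmup):  # warm-up runs are discarded (caches, lazy init)
        run_batch()
    if sync is not None:
        sync()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_batch()
        if sync is not None:
            sync()  # GPU work is asynchronous; wait before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = sum(latencies_ms) / len(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput": batch_size / (mean_ms / 1000.0),
    }


def fake_forward_pass() -> int:
    """Stand-in workload so the sketch runs without a GPU or PyTorch."""
    return sum(i * i for i in range(20_000))


stats = benchmark(fake_forward_pass, batch_size=32, warmup=3, iterations=50)
print({name: round(value, 3) for name, value in stats.items()})

# Why P99 matters: 2% slow requests barely move the median but own the tail.
synthetic = [20.0] * 980 + [800.0] * 20
print(sum(synthetic) / len(synthetic), percentile(synthetic, 50), percentile(synthetic, 99))
```

With a real PyTorch model on a GPU, `run_batch` would wrap the forward pass and `sync` would be `torch.cuda.synchronize`, since CUDA kernels are launched asynchronously. The synthetic example uses 2% slow requests rather than exactly 1%, because with exactly 1 in 100 the spike sits right on the P99 boundary under the nearest-rank definition. The reported figures are also consistent with the throughput formula at a batch size of 32 (32 / 0.03410 s is about 938 samples per second), although the article does not state the batch size explicitly.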
PyTorch is built on Python, which is an amazing but very “heavy” language. To run a PyTorch model, you need to install a massive stack of underlying libraries. If you are running code on a large cloud server, this is fine. But what if you want to deploy AI in a mobile app or on a hyper-fast C++ server? You cannot fit a 2 GB Python environment onto a smart doorbell.

To solve this, ONNX (Open Neural Network Exchange) is used. It is an open standard format for machine learning models. With a single line of code, you can export a PyTorch model into an .onnx file. When you export a model, ONNX captures exactly two things:

1. The raw mathematical flowchart (architecture).
2. The trained numbers (weights).

What is left is a clean, static, universal blueprint. Because it no longer depends on Python, you can hand this file to a C++ program, a Rust server, or even JavaScript in a browser, and as long as each has an ONNX-compatible runtime, they will all know how to read it and execute the math. You have officially freed your model from its training environment.

| Model Topology | Execution Engine / Framework | Hardware Target | Mean Batch Latency (ms) | Throughput (samples/sec) |
| :--- | :--- | :--- | :--- | :--- |
| SmallCNN | PyTorch Eager Mode | CUDA (RTX 4050) | ~14.40 | ~2,222.22 |
| SmallCNN | ONNX Runtime | CUDA (RTX 4050) | 14.55 | 2,197.97 |
| SmallCNN | ONNX Runtime | CPU | 14.68 | 2,179.07 |
| ResNet18 | PyTorch Compiled (max-autotune) | CUDA (RTX 4050) | ~26.33 | ~1,215.34 |
| ResNet18 | PyTorch Eager Mode | CUDA (RTX 4050) | ~34.10 | ~938.41 |
| ResNet18 | ONNX Runtime | CUDA (RTX 4050) | 301.79 | 106.03 |
| ResNet18 | ONNX Runtime | CPU Fallback | 343.79 | 93.08 |
| MobileNetV2 | PyTorch Eager Mode | CUDA (RTX 4050) | ~42.50 | ~752.94 |
| MobileNetV2 | ONNX Runtime | CUDA (RTX 4050) | 95.56 | 334.85 |
| MobileNetV2 | ONNX Runtime | CPU | 96.62 | 331.17 |

We now have our universal file. To run it, we use an engine called ONNX Runtime, an industry-standard inference engine written in high-performance C++.
Here is what happened when I tested my ResNet18 model on the RTX 4050: ONNX Runtime took 301.79 ms per batch (about 106 samples per second), against 34.10 ms (about 938 samples per second) for PyTorch eager mode.

Why did the highly optimized C++ engine run roughly 9 times slower? This is the moment I discovered a trap in ML deployment: the PCIe bottleneck. These results were specific to my benchmark setup and likely indicate data-transfer overheads or a non-optimal ONNX Runtime configuration rather than an inherent limitation of ONNX Runtime itself. The ONNX Runtime CUDA and CPU rows in the results table are also nearly identical, which is consistent with the configuration explanation.

When you feed an image to a basic Python script, that image data lives in your computer’s standard memory (the CPU RAM). But the math has to be done on the graphics card (the GPU VRAM). To process the image, the computer has to copy the data, push it across the PCIe bus connecting your motherboard to your GPU, let the GPU do the math, and then send the answer all the way back across the same path. The GPU was doing the math quickly, but the data was getting stuck in a traffic jam on the way there and back.

For my final test, I tried to use NVIDIA’s ultimate weapon: TensorRT. TensorRT doesn’t just run the model; it inspects your exact graphics card and rebuilds the math specifically for that chip. But when I ran the code, my terminal spat out a massive wall of red text:

Error: libcudnn.so.8 cannot open shared object file

Instead of blazing-fast speeds, the program crashed. This is the infamous “dependency hell.” The TensorRT engine was looking for a very specific, low-level Linux shared library (libcudnn.so.8, part of NVIDIA’s cuDNN) to run the math, but it was either not installed or not on my system’s library search path. The wrench was missing from the toolbox.

But this “failure” taught me a valuable lesson: training a neural network is only the beginning of the machine learning lifecycle. While model architecture, loss functions, and training strategies receive most of the attention, real-world AI systems are ultimately constrained by hardware, memory movement, software dependencies, and deployment environments.
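The `libcudnn.so.8` failure described above is the kind of problem that can be caught before any engine is built. The following is a minimal sketch using only the standard library: it tries to load each shared library by name and reports which ones the dynamic linker cannot find. `REQUIRED_LIBRARIES`, `check_shared_libraries`, and `report` are hypothetical names; the only library name taken from the article is `libcudnn.so.8`.

```python
import ctypes


# libcudnn.so.8 is the file named in the error above; extend as needed.
REQUIRED_LIBRARIES = ["libcudnn.so.8"]


def check_shared_libraries(names: list[str]) -> dict[str, bool]:
    """Return {library name: True if the dynamic linker could load it}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # asks the system loader to find and open it
            results[name] = True
        except OSError:
            results[name] = False  # not installed, or not on the search path
    return results


def report(results: dict[str, bool]) -> list[str]:
    """Print one line per library and return the names that are missing."""
    missing = [name for name, found in results.items() if not found]
    for name, found in results.items():
        print(f"{'ok     ' if found else 'MISSING'}  {name}")
    return missing


missing = report(check_shared_libraries(REQUIRED_LIBRARIES))
if missing:
    # On Linux the usual suspects are a missing package or an
    # LD_LIBRARY_PATH that does not include the CUDA / cuDNN directory.
    print("Fall back to another runtime, or fix the environment first.")
```

Running a check like this at start-up turns a wall of red text in the middle of a benchmark into a one-line message, and gives the program a chance to choose a fallback runtime instead of crashing.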
Building this inference benchmarking suite taught me that performance optimization extends far beyond achieving high accuracy. Concepts such as latency, throughput, FP16 precision, Tensor Cores, PCIe data transfer, and runtime selection can have a significant impact on how a model behaves in production. In several cases, the bottleneck was not the neural network itself, but the surrounding system responsible for moving data and executing the model efficiently.

One of the most valuable lessons came from failure rather than success. While experimenting with TensorRT, a missing cuDNN dependency prevented the engine from running. Debugging that issue highlighted an often-overlooked reality of machine learning engineering: production systems depend on an entire software ecosystem, not just a trained model. Reliable deployment requires reproducible environments, careful dependency management, and robust fallback mechanisms.

Perhaps the biggest takeaway is that optimization should be driven by measurement, not assumptions. Benchmarking revealed where time was actually being spent, exposed hidden bottlenecks, and provided a clear path for improvement. Understanding these system-level behaviors is what transforms a machine learning practitioner into an engineering-minded developer capable of building scalable and efficient AI applications.

As I continue exploring inference optimization, I plan to investigate advanced TensorRT workflows, I/O Binding strategies, quantization techniques, and containerized deployment pipelines. If you have only worked on model training so far, I highly recommend exporting a model to ONNX, profiling it on your own hardware, and analyzing where the time really goes. The insights gained from that process are often far more valuable than another percentage point of model accuracy.

The complete benchmarking framework, source code, and experiments are available on GitHub for anyone interested in reproducing the results or extending the work further.
Deep Learning Inference: PyTorch, ONNX, and TensorRT Explained https://pub.towardsai.net/deep-learning-inference-pytorch-onnx-and-tensorrt-explained-0850d6c794f1 was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.