{"slug": "profiling-in-pytorch-part-1-a-beginner-s-guide-to-torch-profiler", "title": "Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler", "summary": "PyTorch released a beginner's guide to its torch.profiler tool, starting with profiling a simple matrix multiplication and addition operation on an A100 GPU. The guide walks through reading profiler traces and tables to understand the chain from Python calls to CUDA kernels, and examines what changes when using torch.compile. This is the first part of a series aimed at lowering the steep learning curve for profiling neural network code.", "body_md": "# Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler\n\nWhat you cannot profile, you cannot optimize.\n\nWhether you are trying to squeeze more tokens per second out of a Large Language Model (LLM), shave milliseconds off inference, or just understand why your training loop runs slower than the spec sheet promises, the path eventually runs through profiling.\n\nThe catch is that profiling has a **steep** on-ramp. The traces are dense walls of colored rectangles. The events carry intimidating names. Most tutorials assume you can already read them. So even when we *know* we should be profiling, opening a trace can feel like a chore best left for later (or for someone else). This post, and the series it kicks off, is our attempt to lower that on-ramp.\n\nThis is the opening post of **Profiling in PyTorch**, a series where we slowly build the skill of reading profiler traces and use it to drive optimization. 
The plan:\n\n- **Part 1 (this post):** start with the simplest possible operation, a matrix multiplication followed by a bias add, and learn how to read what the profiler hands back.\n- **Part 2:** scale up to `nn.Linear` and a small MLP, use the traces to motivate optimizations, and peek at the `kernels` underneath.\n- **Part 3:** put it all together on Large Language Models with `transformers`.\n\nWe document the journey from a beginner's point of view. No prerequisites apart from basic PyTorch. Treat this as a leisurely read with some \"Aha!\" moments. The structure of the post is intentionally question-led: we open a trace, ask \"wait, why is *that* happening?\", and chase the answer until something clicks. By the end you should know:\n\n- how to set up `torch.profiler` and what it actually hands back,\n- how to read the profiler table and the trace (CPU lane, GPU lane, and the suspicious gaps in between),\n- the chain of events from a Python call all the way down to a CUDA kernel,\n- what changes (and, more interestingly, what does **not** change) when you slap `torch.compile` on top.\n\nBefore we begin, two definitions that will make everything below read better:\n\n- A GPU **kernel** is a program that runs in parallel on many threads of the GPU.\n- The CPU **schedules and launches** these kernels.\n\nYou don't usually have to write GPU kernels yourself; when you use a PyTorch operation, it is automatically translated to one or more kernels that do the job on the GPU.\n\nWith those two ideas in your back pocket, let's start asking questions.\n\nHere is the entire script that we use for the post: [`01_matmul_add.py`](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py). We recommend opening this script in a separate tab and walking through the code step by step. We use the `NVIDIA A100-SXM4-80GB` GPU to run the scripts.\n\n## The matrix multiplication and addition operation\n\nAs correctly [quipped by Dr. 
Sara Hooker](https://youtu.be/7knwihgj0fU?si=uvzGH-J9bsCHP4Nn&t=2199), just as we are primarily made up of water, Deep Neural Networks are primarily made up of matrix multiplies. As fundamental as they are, it would be a shame to start our profiling journey with anything else.\n\n``` python\ndef fn(x, w, b):\n  return torch.add(torch.matmul(x, w), b)\n```\n\nThe matrix addition along with the matrix multiplication mimics how weights and biases interact in a neuron. This addition (pun intended) also paves the way for compilation later in the post.\n\nTo profile, we will be using the `torch.profiler` module. The steps involved are:\n\n- Have the [code to profile ready](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py#L26-L27) (here `def fn`, which wraps the matrix multiplication and matrix addition).\n- [Annotate](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py#L32) the algorithm. While this is completely optional, we recommend doing this. 
The `record_function` context manager annotates our function as `matmul_add`, which makes it easy to find in the traces (as we note later).\n\n``` python\ndef step():\n  with torch.profiler.record_function(\"matmul_add\"):\n    return fn(x, w, b)\n```\n\n- Wrap the code with the `torch.profiler.profile` [context manager](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py#L53-L62)\n\n``` python\nwith torch.profiler.profile(\n  activities=[\n    torch.profiler.ProfilerActivity.CPU,   # the cpu activities\n    torch.profiler.ProfilerActivity.CUDA,  # the gpu activities\n  ],\n) as prof:\n  # it is recommended to run events multiple times to warm up the GPUs\n  for _ in range(5):\n    step()\n    prof.step()\n```\n\n- Export the [profile](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py#L70)\n\n``` python\n# the profiler table (table() returns a string that can be printed or written to a file)\nprof.key_averages().table(sort_by=\"cuda_time_total\", row_limit=15)\n\n# the profiler trace\nprof.export_chrome_trace(trace_path)\n```\n\nThe profiler exports two distinct artifacts:\n\n- The profiler table: Provides a statistical summary of the run. It answers \"What is taking the most time?\", which helps to find hotspots. A hotspot is an event that takes the most time, is a bottleneck of the pipeline, or is triggered many times.\n- The profiler trace: Provides the temporal execution view. It answers \"When and why did an operation happen?\", depicting the activities taking place on the CPU and the GPU. This is helpful when we want to investigate the kernel(s) that were launched, any delays in launching them, any overlap between CPU and GPU activities, etc.\n\nLet's see the two in action with our first execution. 
([Here is the entire 01_matmul_add.py script](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py))\n\nIt is recommended to run this script on a machine with a GPU.\n\n```\nuv run 01_matmul_add.py --size 64\n```\n\nIf you run the above script (on a GPU machine) you will find a folder `traces/01_matmul_add` with the two artifacts:\n\n```\n64_bf16_cold_eager.json\n64_bf16_cold_eager.txt\n```\n\nThe `.txt` file holds the profiler table. Upon opening the file, as shown in Figure 1, one is greeted with a big table whose first column lists the events that were triggered inside the scope of the profiler.\n\nThe other columns report the time the event takes on the CPU, the GPU, or any other device(s) specified in `activities` within `torch.profiler.profile`. Look at which events take the most time, and ask yourself whether that event should in fact take that long. It is also important to look at the column \"# of Calls\", which records how many times the event was triggered.\n\nWhile we are at it, let's also talk about \"Self CPU/CUDA\" vs \"CPU/CUDA total\". The \"Self\" columns measure time spent only inside the event itself, excluding its children. The \"total\" columns include the event and all of its children together. So the \"CPU total\" of `matmul_add` is its own self time plus the time of the child events it triggered. This is an important nuance to note.\n\nIf you look at the last two lines of the table you will notice that the profiler tells us that\n\n```\nSelf CPU time total: 2.314ms\nSelf CUDA time total: 23.104us\n```\n\nThe CPU time is in `ms` while the GPU time is in `us`. To put things in perspective, the time spent on the GPU (the kernel `ampere_bf16_s16816gemm...`) is about 1% of the time spent on the CPU (the `matmul_add` operation). The GPU stays idle most of the time, which is an immediate red flag. 
The reason this happens is that the GPU can compute a small matmul very quickly, so our code spends most of the time preparing the kernels, launching them on the GPU, sending the data to multiply and gathering the results. Such a workload is called *overhead-bound*.\n\nThe easiest way to move out of this regime is to use bigger matrix multiplications.\n\n```\nuv run 01_matmul_add.py --size 4096\n```\n\nThe last two lines in Figure 2 are:\n\n```\nSelf CPU time total: 4.908ms\nSelf CUDA time total: 4.495ms\n```\n\nBoth times are now in ms, which means the GPU spends far more time computing just because we increased the size of the matrix multiplications. If you look at Figure 2 you will also notice that the most CUDA time is now taken by the GPU kernel (`ampere_bf16_s16816gemm_..`) and not by the CPU operation that launched it (`matmul_add`). This means that we were indeed able to move from overhead-bound to compute-bound.\n\nWe now move on to visualising the dispatch chain, which lives inside the `.json` artifacts. You can upload them to [Perfetto UI](https://ui.perfetto.dev) and see the traces, or you can use `uvx trace-util traces -b traces` to generate the Perfetto links directly.\n\n## 64x64 traces\n\nIn Figure 3, we see the profiler trace for the matrix multiplication and addition. Here the bar width indicates the duration of an event, the vertical nesting is the call hierarchy, the CPU lane shows the events that happen on the CPU, and the GPU lane shows the actual kernel executions. The empty spaces are waiting or idle time.\n\nThe script was run with the default configuration:\n\n- size 64: The inputs, weights and biases are sized (64, 64)\n- dtype bf16: The data type is bfloat16\n- no compile: We have not compiled the torch operations\n- no warmup: We have not warmed up the GPU before profiling\n\nWith Perfetto we suggest using the keyboard for quicker access to the trace. 
One could use \"W A S D\" for navigating the trace.\n\nThere are two lanes in Figure 4, one for the CPU activity and one for the GPU activity. In the CPU lane one would notice three profile steps (starting from `ProfilerStep#2`). This comes from the `schedule`.\n\n```\nschedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)\n```\n\nThe `wait` step skips noisy initializations (`ProfilerStep#0`), the `warmup` step runs the profiler but discards what it records (`ProfilerStep#1`), and the `active` steps are what show up in the trace. One can find the schedule being used in the [script here](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py#L58).\n\nLet's put on our detective hats, investigate the trace, and ask some questions.\n\n### Why does the ProfilerStep#2 take so long?\n\nIn Figure 5, we notice that `ProfilerStep#2` takes more time compared to the other steps, and upon looking closely you would see a similar pattern with the `matmul_add` annotation as well. The smoking gun is inside the annotation, not the annotation itself:\n\n| Step | `matmul_add` start | `aten::matmul` start | gap |\n|---|---|---|---|\n| #2 | 138.736 | 366.493 | 227.757 µs |\n| #3 | 517.926 | 523.447 | 5.521 µs |\n| #4 | 610.039 | 614.527 | 4.488 µs |\n\nThat ~228 µs shown in Figure 6 is the \"dead window\" between entering `record_function(\"matmul_add\")` and PyTorch actually dispatching `aten::matmul`. This can happen for multiple reasons, including workspace allocations, [cuBLAS](https://developer.nvidia.com/cublas) (NVIDIA’s proprietary, GPU-accelerated library for performing fundamental linear algebra operations) heuristics, or lazy module loading. 
We can either look away or run [some more warmup steps](https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/01_matmul_add.py#L35-L39) before we profile (which is the standard practice).\n\nIn terms of profiling, warmup means running the events a couple of times before actually profiling them. The one-time pre-work (including the items listed above) is something we do not want to profile. In our example, we have two warmup stages: the first loops over the function before entering the profiler, and the second happens inside the profiler through the schedule's `warmup` argument. In this section, we have enabled the explicit warmup iterations along with the schedule.\n\n```\nuv run 01_matmul_add.py --warmup\n```\n\n[Perfetto Trace for 64x64 with Warmup](https://ui.perfetto.dev/#!/?url=https://huggingface.co/buckets/ariG23498/traces/resolve/01_matmul_add/64_bf16_warm_eager.json)\n\nIn Figure 7 we see that each profile step takes a similar time, but this does not mean we were able to optimize the one-time overheads. We only warmed up the runs so that the overheads were not profiled. Closing this section without a hint at how to reduce them would do injustice to the reader, so here is a [link](https://pytorch.org/blog/accelerating-generative-ai-2/) to read about further optimizing launch overheads.\n\n### Why is there an offset of ~2.5 ms between the CPU and GPU lanes?\n\nIn Figure 8, we see that the CPU and GPU lanes have an offset of around 2.5 ms: this is the delay between the CPU submitting the CUDA kernels and the time they actually start executing. 
One might think the warmup stage combined with the schedule's `wait` and `warmup` should keep the GPU busy and shrink the offset.\n\nTo uncover what is really happening, let's change our schedule a little:\n\n```\n- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)\n+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=3, repeat=1)\n```\n\nFigure 9 shows us that there is an `Activity Buffer Request` in the GPU lane before any operation. Let's zoom in a little more.\n\nUpon zooming into the GPU trace, we notice that the matmul and add kernels for `ProfilerStep#0` (the CPU trace of which is not visible in the Figure) happen one after the other, while the kernels for `ProfilerStep#1` have a window in between. The best explanation for this is that there was an overflow of buffers, and another buffer request (a request to allocate some memory on the GPU VRAM) was issued during the kernel execution.\n\nThe best way to rule out other possibilities is to profile for more iterations and see whether a similar window appears in other parts of the trace. To do that we run with `active=20`.\n\nAs shown in Figure 11, we see a similar trend in `ProfilerStep#1`. This is aligned with our previous findings, and we can reasonably conclude that it was indeed another buffer request.\n\n### The chain of events\n\nIn Figure 12, we see the nested CPU calls. This is an important visualization, where one gets to understand what a chain of dispatch really looks like.\n\nWe begin with `ProfilerStep#<id>`, which encapsulates the profiling step. Because we annotated the step, we see the `matmul_add` row. The `matmul_add` row consists of two `aten` calls, one for matrix multiplication and one for matrix addition.\n\nThe `aten::matmul` is the [ATen-level](https://github.com/pytorch/pytorch/tree/main/aten/src/ATen) dispatch that those user-facing PyTorch matmul calls land on. 
`aten::mm` is the 2D matrix-matrix multiply backend.\n\nIt is very interesting to note how PyTorch calls `aten::bmm` (batched matrix multiplication) if we add the batch axis to our matrices. Let's take a detour and see the `aten::bmm` in action.\n\n```\n- x = torch.randn(args.size, args.size, device=device, dtype=dtype)\n- w = torch.randn(args.size, args.size, device=device, dtype=dtype)\n- b = torch.randn(args.size, args.size, device=device, dtype=dtype)\n\n+ # adding a batch size of 8\n+ x = torch.randn(8, args.size, args.size, device=device, dtype=dtype)\n+ w = torch.randn(8, args.size, args.size, device=device, dtype=dtype)\n+ b = torch.randn(8, args.size, args.size, device=device, dtype=dtype)\n```\n\nIn Figure 13, upon adding the batch axis to the inputs, `aten::matmul` now encapsulates a bunch of other prerequisite CUDA runtime calls along with `aten::bmm` (instead of `aten::mm`). This also hints at the heuristics that cuBLAS needs to run in order to dispatch the right (most suitable) kernel for the program.\n\nIn the rest of the post, we will be working with simple 2D matrices, unless otherwise mentioned.\n\n### Why does matmul have an extra CUDA runtime call?\n\nWe notice that for `aten::mm` there are two CUDA Runtime calls, namely `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (boxed in Figure 14) and `cudaLaunchKernel`, while for `aten::add` there is only the `cudaLaunchKernel`.\n\n`cudaOccupancyMaxActiveBlocksPerMultiprocessor` is a planning call and is purely CPU side. It asks: \"given a kernel function, a chosen block size, and a chosen dynamic shared memory size, how many blocks of this kernel can simultaneously reside on one SM (Streaming Multiprocessor)?\"\n\nThis raises the question: why do we need planning for matmul and not for add?\n\nTo understand this, we have to look at the kernel's resource footprint. 
If you click on the GPU kernels, you will be able to inspect the resource footprint for the respective kernel.\n\nIn Figure 15, we note that for matrix multiplication the `registers per thread` and `shared memory` are dynamic (based on the size of the matrix). cuBLAS ships hundreds of kernel variants, and each has a heuristic-driven launch path that needs runtime information about hardware capacity. The occupancy query is part of that heuristic. Conceptually, we can think of GPU-accelerated matmuls as [working on independent tiles](https://alvinwan.com/how-to-tile-matrix-multiplication/): how many tiles we use and how big each tile needs to be depends on the matrices and the hardware. Modern algorithms are way more complicated than that, but this is still a good reference framework.\n\nFrom Figure 16 we see that the footprint of addition says 32 registers and zero shared memory. That fits trivially. There's nothing to query, because no hardware resource is going to limit occupancy. The kernel is, by design, resource-light.\n\nYou can use this as a quick diagnostic when reading any trace. Scan the CPU lane for `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. Each occurrence flags a \"heavyweight, adaptively launched\" kernel, usually a GEMM (GEneral Matrix Multiplication), conv, or similar. The kernels without a preceding occupancy query are the elementwise/reduction crowd that PyTorch launches mechanically.\n\n### Why is cudaDeviceSynchronize so big (~1.78 ms)?\n\n`cudaDeviceSynchronize` blocks the CPU until all GPU work on this device finishes. The profiler emits this sync at the end of the active window to flush events. Without it, kernel timings would be missing.\n\nA 1.78 ms sync covering 26 µs of real GPU work tells you this run was 98% idle. 
That's the textbook overhead-bound symptom.\n\n## 4096x4096 traces\n\nWe already know from the profiler table analysis (above) that providing bigger matrices to our algorithm moves it out of the overhead-bound regime and into the compute-bound one.\n\nLet's run the command and dive deeper into the traces.\n\n```\nuv run 01_matmul_add.py --size 4096 --warmup\n```\n\n### Why does the same kernel take more time compared to others?\n\nIn Figure 17, we notice that the matmul kernel for `ProfilerStep#3` takes longer on the GPU than in the other steps. This is particularly interesting, because the kernels launched in every step were exactly the same, so a different cuBLAS kernel choice cannot explain the difference. There are no scheduling gaps, the CPU launches are normal, and it is not a profiler artifact.\n\nThe trace in Figure 17 makes a useful point that's easy to miss in idealized examples: kernel runtimes are not constants, even on the same hardware environment running identical code on identical data.\n\nLet's make this more concrete by modifying the script a little. We run the iteration 20 times, capturing each of the steps.\n\n```\n- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)\n+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=20, repeat=1)\n\n- for _ in range(5):\n+ for _ in range(20):\n```\n\nFigure 18 reveals a similar finding. While each kernel was exactly the same, their timings differ. The different compute times can be blamed on a bunch of reasons:\n\n- GPU clocks on idle and boost\n- GPU heating\n- GPU power management\n- Driver side housekeeping\n\nA reader who only saw the average would conclude that a matmul took ~1 ms (mean of 5 = 1084 µs); a reader who looked at the trace would see that the matmul takes ~580 µs except when the GPU throws a fit. Those are very different mental models, and only one of them is correct.\n\n## Let's see some torch compile at work\n\nWorking with `torch.compile` has always amazed us. 
One writes normal eager PyTorch code, but PyTorch tries to capture tensor-heavy regions, turn them into graphs, optimize them, and run generated code. The default backend is usually `TorchInductor`, and the broad pipeline is:\n\n- `TorchDynamo` captures Python execution into an FX graph\n- `AOTAutograd` prepares forward/backward graphs when gradients are involved\n- `Inductor` lowers the graph into optimized CPU or GPU code.\n\nIn this section, we talk about compilation and look at the profiler traces.\n\n```\nuv run 01_matmul_add.py --size 4096 --warmup --compile\n```\n\nThe `args.compile` flag triggers the following code:\n\n``` python\ndef fn(x, w, b):\n  return torch.add(torch.matmul(x, w), b)\n\nfn = torch.compile(fn) if args.compile else fn\n```\n\nIn Figure 19, we see new CPU rows named `Torch-Compiled Region: 0/0`, which tell us that the compiled function is being used.\n\n### Did we fuse the matmul and add kernels into one?\n\nLooking at Figure 20 we ask the question: did we actually fuse the multiplication and addition operations together into one?\n\nThis is operator fusion at the graph level. Inductor took our `torch.add(torch.matmul(x, w), b)` and rewrote it into a single `aten::addmm(b, x, w)` call. The important thing to note here is that it did **not** produce a **new** fused CUDA kernel. The actual GPU work is still `ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_64x3_nn`, the same cuBLAS kernel eager mode used. So the \"fusion\" here is at the dispatcher level, not at the kernel level.\n\nPyTorch provides the `torch.addmm` function, which does in one call what we did in two steps, that is multiply and add. We encourage the reader to look at the traces of this function and share their observations in the comments below!\n\n### torch.compile's runtime architecture\n\nWhile we know in theory what happens when we compile our functions, it is equally important to see it in action. 
Let's look at the CPU-side hierarchy, which reflects `torch.compile`'s runtime architecture.\n\n**TorchDynamo Cache Lookup** is where Dynamo checks that the current call still matches what was compiled: the same input shapes, dtypes, devices, and tensor metadata. If anything mismatched, Dynamo would recompile. This cost is paid every call, even after compilation.\n\n**Torch-Compiled Region** is the wrapper that \"enters\" the compiled version.\n\n**AOTDispatcher Runtime Wrapper Prologue** is AOT Autograd's runtime wrapper. Even though we don't need gradients here, AOTDispatcher is always in the stack handling tensor metadata and view tracking, and it would set up the backward pass if `requires_grad` were true.\n\n**`## Call CompiledFxGraph <hash>`** is where the actual generated code runs. The string after \"CompiledFxGraph\" is the content hash of the FX graph. It's the same across all three active steps, confirming cache hits.\n\nYou can find the generated code on disk under `/tmp/torchinductor_<user>/fxgraph` keyed by this hash, useful when you want to read the Triton/C++ that Inductor actually produced.\n\n### Do the CUDA launches go down by half?\n\nFigure 21: Each compiled step still launches two GPU kernels, a Device-to-Device memcpy and the GEMM.\n\nLooking at the traces in Figure 21, we were really happy to notice only one `cudaLaunchKernel` per step. But this observation directly contradicted what we were seeing in the GPU trace. There were still two kernels being launched per step, namely the `Memcpy DtoD (Device -> Device)` and the GEMM. Going back to the CPU trace, we noticed that we had completely missed the `cudaMemcpyAsync` dispatch.\n\n`addmm` computes `out = α·A·B + β·C`, and cuBLAS's GEMM-with-bias-add epilogue writes into a destination buffer that needs to already contain the bias. An epilogue can be thought of as all the operations that happen *after* a GEMM. 
In the world of deep learning we constantly come up with GEMM epilogues like activations, bias addition, normalization and many more. This is why there are cuBLAS GEMM-with-epilogue kernel variants.\n\nIf you use different `mode`s for `torch.compile` you would notice different kernel variants being launched. You can try it for yourself and add a comment below about your observations!\n\nSo Inductor's generated code does:\n\n- `out = copy(C)` ← that's the DtoD memcpy (32 MB, takes ~33 µs)\n- `out = α·(A·B) + β·out` ← GEMM with `α=β=1`, fusing the bias add into the writeback\n\nThe result is mathematically still the same. The bias add isn't free, as we pay a memcpy upfront plus a slightly more expensive GEMM epilogue.\n\nThe fusion one might have hoped for, where `x·w + b` (here `out = α·A·B + β·C`) collapses into a single kernel with no extra memory traffic, isn't what happened. Inductor preserved the two memory-touching operations; it just relabeled the bias copy as a memcpy and the addition as a GEMM epilogue.\n\nA truly fused implementation would skip the memcpy. That's what FlashAttention-style hand-written kernels do, and what Inductor can do via Triton codegen, but for a `4096×4096 bf16 matmul`, Inductor evidently decided \"use cuBLAS, do the bias via epilogue setup\" was the best path.\n\n### CPU overhead went up, not down\n\nThis is the easiest thing to miss when comparing an eager and a compiled run:\n\n| step | eager dur (ms) | compile dur (ms) |\n|---|---|---|\n| #2 | 0.1 | 0.2 |\n| #3 | 0.07 | 0.1 |\n| #4 | 0.07 | 0.1 |\n\nCompile is roughly 1.4× to 2× more expensive on the CPU per step. That's because every call walks the full Dynamo > AOTAutograd > Inductor stack, on top of the same `aten::addmm` dispatch we have anyway. The compile pipeline is built for ML models with dozens of ops where the per-call overhead amortizes (for a single op it's a tax).\n\n`torch.compile` has a `mode` argument. 
It is for the reader to take home as an assignment to read the documentation and come up with a `mode` that could take the CPU overhead down. 🤗\n\n## Trace reading cheatsheet\n\nA quick reference for the patterns we walked through. The idea is: if you see this in a trace, this is what it usually means.\n\n### Profiler table\n\n| What you see | What it usually means |\n|---|---|\n| `Self CPU time total` ≫ `Self CUDA time total` (CPU in ms, GPU in µs) | Overhead-bound. The CPU spends more time dispatching than the GPU spends computing. Make the work bigger (larger matrices, batched ops) or fuse calls. |\n| `Self CPU time total` ≈ `Self CUDA time total`, both in ms | Compute-bound. The GPU is the bottleneck, which is usually what you want. |\n| One event dominates `CUDA total` | That's your hotspot. Start the optimization there. |\n| One event has a huge `# of Calls` | A potential bottleneck even if each call is cheap. Check whether it can be fused or batched. |\n| `CPU total` ≫ `Self CPU` for a row | Most of the cost lives in children. Drill into the nested events, not the parent. |\n\n### CPU lane\n\n| What you see | What it usually means |\n|---|---|\n| First `ProfilerStep` much wider than the rest | Cold-start overhead: workspace allocation, cuBLAS heuristics, lazy module loading. Add warmup iterations and/or the schedule's `warmup` argument. |\n| Big gap between `record_function(\"...\")` start and the first `aten::*` inside it | Same cold-start tax, just zoomed in. The annotation entered, but the dispatch hadn't happened yet. |\n| `cudaOccupancyMaxActiveBlocksPerMultiprocessor` before a `cudaLaunchKernel` | A heavyweight, adaptively-launched kernel (GEMM, conv, etc.). cuBLAS is asking the driver how many blocks fit on an SM so it can pick a kernel variant. |\n| `cudaLaunchKernel` with no preceding occupancy query | An elementwise or reduction kernel with a fixed, resource-light footprint. Nothing to plan. 
|\n| A long `cudaDeviceSynchronize` at the end of the active window | The profiler flushing events. Its duration is mostly the GPU finishing pending work, not a real CPU cost. A sync covering tiny GPU work is a classic overhead-bound symptom. |\n| A `cudaMemcpyAsync` you didn't write | Often a hidden Device-to-Device copy. Common when `addmm` seeds its destination buffer with the bias before the GEMM epilogue. |\n\n### GPU lane\n\n| What you see | What it usually means |\n|---|---|\n| `Activity Buffer Request` on the GPU lane | The profiler is allocating/refilling its own event buffer. The first one usually accounts for the initial CPU↔GPU lane offset. |\n| A gap between two kernels in a single step | Likely another buffer request mid-execution. Confirm by running more iterations: if it appears only once, it's the profiler, not your code. |\n| The same kernel timing differently across steps | GPU clocks, thermals, power management, driver housekeeping. Read the trace, not just the mean. |\n| A kernel named like `ampere_bf16_s16816gemm_...` | The actual cuBLAS GPU work for a matmul. The kernel name is typically the same in eager and compiled mode for the same shapes/dtypes. |\n| `Memcpy DtoD` immediately before a GEMM | The bias copy for an `addmm` epilogue. The \"fusion\" is at the dispatcher level, not in the kernel. |\n\n### Dispatch chain\n\n| What you see | What it usually means |\n|---|---|\n| `ProfilerStep#N` → `<record_function name>` → `aten::*` → `aten::mm` / `aten::bmm` / `aten::add` | The canonical nested call hierarchy. Self time excludes children; Total time includes them. |\n| `aten::matmul` resolving to `aten::mm` | 2D × 2D matrix multiply. |\n| `aten::matmul` resolving to `aten::bmm` (with extra CUDA runtime calls) | Batched matmul on 3D+ tensors. cuBLAS does more heuristic work to pick the variant. |\n| `aten::addmm(b, x, w)` instead of a separate `aten::add` + `aten::mm` pair | Operator fusion at the dispatcher level. 
The GPU kernel is still the same GEMM, with the bias add folded into the epilogue. |\n\n### torch.compile\n\n| What you see | What it usually means |\n|---|---|\n| A `Torch-Compiled Region: K/M` row in the CPU lane | You're inside a compiled function. |\n| `TorchDynamo Cache Lookup` on every step | Dynamo is verifying shapes/dtypes/devices match the cached compile. Paid on every call, even after compilation. |\n| `AOTDispatcher Runtime Wrapper Prologue` even with no grads | AOTAutograd's runtime wrapper is always in the stack, handling tensor metadata and view tracking. |\n| `## Call CompiledFxGraph <hash>` with the same hash across steps | Cache hits on the generated code. The generated source lives under `/tmp/torchinductor_<user>/fxgraph/<hash>`. |\n| Per-step CPU time higher under `torch.compile` than eager for a tiny op | Expected. The Dynamo → AOTAutograd → Inductor stack is a tax that only amortizes over many ops. |\n\n## Conclusion\n\nWe started with a tiny `matmul + add` and used it as an excuse to learn how to read PyTorch profiler tables and traces. Along the way we picked up a few mental models that travel well to bigger workloads. This was the first stop in the **Profiling in PyTorch** series. 
In the posts that follow, we will gradually leave this two-op toy behind and walk up the ladder of complexity, looking at larger building blocks and, eventually, real models.\n\nThanks to [Noe Flandre](https://huggingface.co/NoeFlandre), [Suvaditya Mukherjee](https://huggingface.co/suvadityamuk), and [Vidit Ostwal](https://huggingface.co/ViditOstwal) for their reviews on the early draft of the post!", "url": "https://wpnews.pro/news/profiling-in-pytorch-part-1-a-beginner-s-guide-to-torch-profiler", "canonical_source": "https://huggingface.co/blog/torch-profiler", "published_at": "2026-05-29 00:00:00+00:00", "updated_at": "2026-05-29 10:42:07.126508+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "neural-networks", "artificial-intelligence", "ai-tools"], "entities": ["PyTorch", "GitHub", "Hugging Face", "Large Language Model"], "alternates": {"html": "https://wpnews.pro/news/profiling-in-pytorch-part-1-a-beginner-s-guide-to-torch-profiler", "markdown": "https://wpnews.pro/news/profiling-in-pytorch-part-1-a-beginner-s-guide-to-torch-profiler.md", "text": "https://wpnews.pro/news/profiling-in-pytorch-part-1-a-beginner-s-guide-to-torch-profiler.txt", "jsonld": "https://wpnews.pro/news/profiling-in-pytorch-part-1-a-beginner-s-guide-to-torch-profiler.jsonld"}}