Score B (65)

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

23 天前13 viewsSource: HuggingFace Blog

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

Published May 29, 2026
Update on GitHub

Thumbnail of the blog post

What you cannot profile, you cannot optimize.

Whether you are trying to squeeze more tokens per second out of a Large Language Model (LLM), shave milliseconds off inference, or just understand why your training loop runs slower than the spec sheet promises, the path eventually runs through profiling.

The catch is that profiling has a steep on-ramp. The traces are dense walls of colored rectangles. The events carry intimidating names. Most tutorials assume you can already read them. So even when we know we should be profiling, opening a trace can feel like a chore best left for later (or for someone else). This post, and the series it kicks off, is our attempt to lower that on-ramp.

This is the opening post of Profiling in PyTorch, a series where we slowly build the skill of reading profiler traces and use it to drive optimization. The plan:

  1. Part 1 (this post): start with the simplest possible operation, a matrix multiplication followed by a bias add, and learn how to read what the profiler hands back.
  2. Part 2: scale up to nn.Linear and a small MLP, use the traces to motivate optimizations, and peek at the kernels underneath.
  3. Part 3: put it all together on Large Language Models with transformers.

We document the journey from a beginner's point of view. No prerequisites apart from basic PyTorch. Treat this as a leisurely read with some "Aha!" moments. The structure of the post is intentionally question-led: we open a trace, ask "wait, why is that happening?", and chase the answer until something clicks. By the end you should know:

  • how to set up torch.profiler and what it actually hands back,
  • how to read the profiler table and the trace (CPU lane, GPU lane, and the suspicious gaps in between),
  • the chain of events from a Python call all the way down to a CUDA kernel,
  • what changes (and, more interestingly, what does not change) when you slap torch.compile on top.

Before we begin, two definitions that will make everything below read better:

  1. A GPU kernel is a program that runs in parallel on many threads of the GPU.
  2. The CPU schedules and launches these kernels.

You don't usually have to write GPU kernels yourself; when you use a PyTorch operation, it is automatically translated to one or more kernels that do the job on GPU.

With those two ideas in your back pocket, let's start asking questions.

Here is the entire script that we use for the post: 01_matmul_add.py. We recommend opening this script in a separate tab and walk through the code step by step. We use the NVIDIA A100-SXM4-80GB GPU to run the scripts.

The matrix multiplication and addition operation

As correctly quipped by Dr. Sara Hooker, just as we are primarily made up of water, Deep Neural Networks are primarily made up of matrix multiplies. As fundamental as they are, it would be a shame to start our profiling journey with anything else.

def fn(x, w, b):
  return torch.add(torch.matmul(x, w), b)

The matrix addition along with the matrix multiplication mimics how weights and biases interact in a neuron. This addition (pun intended) will help us understand how it paves the way for compilation later in the post.

To profile, we will be using the torch.profiler module. The steps involved are:

  1. Have the code to profile ready (here def fn, which wraps the matrix multiplication and matrix addition)
  2. Annotate the algorithm. While this is completely optional, we recommend doing this. The record_function annotates our function as matmul_add, which will be easy to navigate in the traces (as we note later)
def step():
  with torch.profiler.record_function("matmul_add"):
    return fn(x, w, b)
  1. Wrap the code with the torch.profiler.profile context manager
  with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,  
        torch.profiler.ProfilerActivity.CUDA, 
    ],
  ) as prof:
    
    for _ in range(5):
      step()
      prof.step()
  1. Export the profile

prof.key_averages().table(sort_by="cuda_time_total", row_limit=15)


prof.export_chrome_trace(trace_path)

The profiler exports two distinct artifacts:

  1. The profiler table: Provides the statistical summary of the algorithm. It answers "What is taking the most time". This becomes really helpful to figure out hotspots. A hotspot would be events that take the most amount of time, can be a bottleneck of the pipeline, or an event that is triggered a lot of times.
  2. The profiler trace: Provides the temporal execution view. Answers "When and Why an operation happened", depicting the activities taking place on the CPU and the GPU. This is helpful when we want to investigate the kernel(s) that were launched, any delays in launching them, any overlap between CPU and GPU activities, etc.

Let's see the two in action with our first execution. (Here is the entire 01_matmul_add.py script)

It is recommended to run this script on a machine with a GPU.

uv run 01_matmul_add.py --size 64

If you run the above script (on a GPU machine) you will find a folder traces/01_matmul_add with the two artifacts:

64_bf16_cold_eager.json
64_bf16_cold_eager.txt
Profiler table for matmul add on 64 sized matrices
Figure 1: Profiler table for matmul add on 64 sized matrices

The .txt file holds the profiler table. Upon opening the file, as shown in Figure 1, one would be greeted with a big table with the first column consisting of the events that were triggered inside the scope of profile.

The other columns are related to the time the event takes on the CPU or GPU or any other device(s) specified in activities within torch.profiler.profile. Look at which events take the most amount of time, and try to intuitively understand if that event should in fact take that time. It is also important to look at the column "# of Calls" which dictates how many times the event was triggered.

While we are at it, let's also talk about "Self CPU/CUDA" vs "CPU/CUDA total". The "Self" columns measure time spent only inside the event itself, excluding its children. The "total" columns include the event and all of its children together. So if you look at the "CPU total" of matmul_add, it consists of the time it took on self plus the children events it triggered. This is an important nuance to note.

If you look at the last two lines out of the table you would notice that the profiler tells us that

Self CPU time total: 2.314ms
Self CUDA time total: 23.104us

The CPU time is in ms while the GPU time is in us. To put things in perspective, the time spent on GPUs (the kernel ampere_bf16_s16816gemm...) is less than 1% of the time spent on the CPU (the matmul_add operation). The GPU stays idle most of the time, which is an immediate red flag. The reason this happens is that the GPU can compute a small matmul very quickly, so our code spends most of the time preparing the kernels, launching them on the GPU, sending the data to multiply and gathering the results. This concept is known as an overhead-bound algorithm.

The easiest way to move out of this regime is to use bigger matrix multiplications.

uv run 01_matmul_add.py --size 4096 
Profiler table for matmul add algorithm on 4096 sized matrices
Figure 2: Profiler table for matmul add on 4096 sized matrices

The last two lines in Figure 2 are:

Self CPU time total: 4.908ms
Self CUDA time total: 4.495ms

Both times are in ms, which means we have materialized more GPU time just by increasing the size of the matrix multiplications. If you look at Figure 2 you would also notice that the most CUDA time is now taken by the GPU kernel (ampere_bf16_s16816gemm_..) and not by the CPU operation that launched it (matmul_add). This means that we were indeed able to move from overhead bound to compute bound.

We now move into visualising the dispatch chain, which lives inside the .json artifacts. You can upload them to Perfetto UI and see the traces, or you can use uvx trace-util traces -b traces to generate the Perfetto links directly.

64x64 traces

PyTorch profiler trace of a 64×64 bf16 matmul followed by an add on a CUDA GPU
Figure 3: Profiler trace for matmul and add on 64 sized matrices

In Figure 3, we see the profiler trace for the matrix multiplication and addition. Here the bar width indicates the duration of an event, the vertical nesting is the call hierarchy, the CPU lane denotes the events that happen on the CPU, while the GPU lane shows the actual kernel executions. One might also notice the empty spaces which are the waiting or idle time.

The script was run with default configurations which are:

  • size 64: The inputs, weights and biases are sized (64, 64)
  • dtype bf16: The data type is bfloat16
  • no compile: We have not compiled the torch operations
  • no warmup: We have not warmed up the GPU before profiling

With Perfetto we suggest using the keyboard for quicker access to the trace. One could use "W A S D" for navigating the trace.

PyTorch profiler trace with the CPU lane and GPU lane labelled side by side in Perfetto
Figure 4: The CPU and GPU lanes of a PyTorch profiler trace

There are two lanes in Figure 4, one for the CPU activity and one for the GPU activity. In the CPU lane one would notice three profile steps (starting from ProfilerStep#2). This comes from the schedule.

schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)

The wait skips noisy initializations (ProfilerStep#0), warmup runs through the profiler without recording (ProfilerStep#1), and active is what shows up in trace. One can find the schedule being used in the script here.

Let's put on our detective hats and investigate the trace and ask some questions.

Why does the ProfilerStep#2 take so long?

ProfileStep#2 in a PyTorch profiler trace appears wider than ProfileStep#3 and ProfileStep#4
Figure 5: ProfileStep#2 is visibly wider than the steps that follow it

In Figure 5, we notice that ProfileStep#2 takes more time compared to the other steps, and upon looking closely you would see a similar pattern with the matmul_add annotation as well. The smoking gun is inside the annotation, not the annotation itself:

Step matmul_add start aten::matmul start gap
#2 138.736 366.493 227.757 µs
#3 517.926 523.447 5.521 µs
#4 610.039 614.527 4.488 µs
228 microsecond gap between record_function matmul_add and the aten::matmul dispatch in profile step 2
Figure 6: The ~228 µs dead window between record_function("matmul_add") and aten::matmul

That ~228 µs shown in Figure 6 is the "dead window" between entering record_function("matmul_add") and PyTorch actually dispatching aten::matmul. This can happen for multiple reasons, including workspace allocations, cuBLAS (NVIDIA’s proprietary, GPU-accelerated library for performing fundamental linear algebra operations) heuristics, or lazy module loading. We can either look away or run some more warmup steps before we profile (which is the standard)

In terms of profiling, warmup is when you run the events a couple of times before actually profiling it. The pre-work done by the GPU (including the above pointers) are one time efforts which we do not want to profile. In our example, we have two warmup stages, one where we actually loop over the function before entering the profiler, and two inside the profiler which is achieved by the warmup argument. In this section, we have enabled the actual iterations along with the schedule.

uv run 01_matmul_add.py --warmup

Perfetto Trace for 64x64 with Warmup

PyTorch profiler trace after warmup steps where ProfileStep#2 no longer shows cold-start overhead
Figure 7: After warming up, every profile step takes a similar amount of time

In Figure 7 we see that each profile step takes a similar time, but this does not mean we were able to optimize the one time overheads. We warmed up the runs so that the overheads were not profiled. We think that closing this section abruptly without a hint to solving this would do injustice to the reader, so here is a link to read about further optimizing launch overheads.

Why is there an offset of ~2.5 ms between the CPU and GPU lanes?

2.32 millisecond offset between the CPU lane and the GPU lane in a PyTorch profiler trace
Figure 8: The ~2.5 ms offset between the CPU and GPU lanes

In Figure 8, we see that the CPU and GPU lanes have an offset of around 2.5 ms: this is the delay after the CPU submits the CUDA kernels and the time they actually start executing. One might think the warmup stage combined with the schedule's wait and warmup should keep a GPU busy and would diminish the offset.

To uncover what is really happening, let's change our schedule a little:

- schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
+ schedule = torch.profiler.schedule(wait=0, warmup=0, active=3, repeat=1)
PyTorch profiler trace with wait=0 warmup=0 showing an Activity Buffer Request between steps
Figure 9: With wait=0 and warmup=0, the trace reveals an Activity Buffer Request

Figure 9 shows us that there is an Activity Buffer Request in the GPU lane before any operation. Let's zoom in a little more.

gap between matmul and add CUDA kernels caused by profiler buffer request
Figure 10: A gap appears between the matmul and add kernels on profile step 1

Upon zooming into the GPU trace, we notice that the matmul and add kernels for ProfileStep#0 (the CPU trace of which is not visible in the Figure) happen one after the other, while the kernels for ProfileStep#1 have a window in between. The best explanation for this is that there was an overflow of buffers, and another buffer request (a request to allocate some memory on the GPU VRAM) was issued during the kernel execution.

The best way to rule out other possibilities is to profile for more iterations and see whether a similar window appears in other parts of the trace. To do that we run with active=20.

PyTorch profiler trace of 20 active iterations confirming the buffer-request gap only appears once
Figure 11: With 20 active steps the gap only shows up once, confirming it is a buffer request

As shown in Figure 11, we see a similar trend in ProfileStep#1. This is aligned with our previous findings, and we can safely conclude that it was indeed another buffer request.

The chain of events

Read the full original article:

HuggingFace Blog
#性能分析#入门指南