The Microsecond Lie: Why your Go timers are lying about the GPU

The article explains that CPU-side timers in Go are unreliable for measuring GPU kernel execution time because CUDA kernel launches are asynchronous, meaning the CPU only measures the time to enqueue the task rather than the actual GPU computation. By implementing CUDA Events—hardware markers placed directly into the GPU stream—the author discovered that the true GPU compute time was 2.7 times slower than the CPU timer suggested. The piece emphasizes the importance of using hardware events for accurate measurement in Go-based AI infrastructure to avoid misleading performance metrics and optimize real system latency.

TL;DR: I thought my CUDA kernel was running in 160 microseconds. I was wrong. Here is how I used CUDA Events in pure Go to find the real hardware time, and why CPU-side timers are the wrong tool for GPU forensics.

I wrapped my kernel launch in a standard Go time.Since(start) block and saw 162 microseconds. I thought I had built a speed demon. Then I implemented real GPU Events and found the truth.

When you launch a CUDA kernel, the launch is completely asynchronous. The CPU doesn't wait for the GPU to finish; it just puts the task in a queue (a stream) and returns control to your Go program immediately. My 162-microsecond measurement wasn't measuring the math. It was only measuring how long it took the Go runtime to talk to the NVIDIA driver and enqueue the job. The GPU hadn't even finished the first row of the matrix before my timer stopped.

To find the real numbers, I had to implement CUDA Events. These are markers you place directly into the hardware stream. The GPU itself records a timestamp when it reaches the marker, bypassing the CPU clock entirely.

I ran a 10M-element vector addition on an RTX 4070 Ti. Here is what the hardware actually said: the compute time was 2.7x longer than what my CPU timers led me to believe.

Measuring this accurately required adding NewEvent, Record, and ElapsedTime to the gocudrv package. Since we aren't using cgo, I had to bind the cuEventElapsedTime symbol manually and handle the C-to-Go float32 conversion. Here is what the "truth-telling" code looks like now:

    // 1. Create the hardware stopwatches (errors ignored for brevity)
    start, _ := ctx.NewEvent()
    stop, _ := ctx.NewEvent()

    // 2. Place markers in the stream, one on each side of the launch
    start.Record(stream)
    fn.LaunchOn(ctx, stream, cfg, args...)
    stop.Record(stream)

    // 3. Wait for the STOP marker to be reached
    stop.Synchronize(ctx)

    // 4. Get the hardware duration
    duration, _ := start.Elapsed(stop)
    fmt.Printf("Actual GPU time: %v\n", duration)

As we move toward Go-based AI infrastructure, we have to be careful about "Measurement Drift."
If you are building an inference gateway or a real-time image processor in Go, using CPU timers will make your P99s look incredible on paper while your users experience mysterious latency. You can't optimize what you can't measure. If you aren't using hardware events, you are just measuring the speed of your request queue, not the speed of your product.

Now that I have a microsecond-accurate stopwatch, I can finally start optimizing the data path. I'm currently working on CUDA Graphs to reduce that 160µs enqueuing overhead by bundling complex task topologies into a single hardware command. If you're interested in the forensics of low-level Go or want to help build the cgo-free bridge, check out the progress on GitHub.