Optimizing a CUDA FSST decompression kernel Polar Signals storage engineer wrote a CUDA kernel to decompress FSST-encoded strings on GPUs, but initial performance was worse than CPU decompression. Using the company's GPU profiler with PC sampling, they identified warp stalls and memory throttles as bottlenecks. The work is part of Vortex file format development to enable high-throughput GPU decompression and fused query execution. My Journey Optimizing a CUDA Kernel with Polar Signals A storage engineer's descent into warp stalls, memory throttles, and aligned stores For Polar Signals’ latest hackathon, I decided to write a CUDA kernel and use our GPU profiler https://www.polarsignals.com/blog/posts/2026/06/10/nvidia-cuda-pc-sampling to make it as fast as I could. I mostly work on our storage layer and have had? zero GPU programming experience, so this was a great opportunity to learn more about CUDA programming and GPU performance in general. In this blog post, I’ll walk through how I built a string decompression CUDA kernel and how I thought about performance optimizations at each step. The Plan First up was deciding what to build. Following tutorials can be educational but I think the best way to learn is to implement something you actually need in production. Thankfully one of the goals of Vortex, the new file format we’ve been using at Polar Signals https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex , is to ship compressed data to GPUs for high-throughput decompression and fused query execution. These GPU kernels are an active area of development, and there was a need for a kernel to decompress FSST-encoded strings. As a bonus, a lot of the kernel launching scaffolding was already in place, so I could focus specifically on just the kernel. What is FSST FSST https://https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf is a string compression algorithm that learns up to 255 frequently occurring short symbol sequences up to 8 bytes , and replaces each with a 1-byte code. The 256th code is a special “escape” code that denotes that the next byte in the compressed sequence should be interpreted literally. FSST compression is particularly effective for strings that differ slightly but share common patterns e.g. URLs, log lines, JSON . Importantly, FSST allows random-access decompression and as an extension, enables strings to be decompressed in parallel since the context required to decompress any string is the offset, length, and the static shared symbol table. Vortex already has an FSST encoding and implements compression and decompression on the CPU, so I set up a benchmark that decompresses a corpus of 10 million ClickBench-style URLs to use as a reference: All of these benchmarks were run on an NVIDIA DGX Spark. At 20 CPU cores, this number lines up with the decompression throughput of 1-3GB/s per core cited in the paper. Writing a kernel is easy, optimizing it is hard On to CUDA. Vortex uses the cudarc https://docs.rs/cudarc/latest/cudarc/ crate with a nice Rust API so it’s easy to copy over the buffers the CUDA kernel should work on. The initial implementation of the kernel shipped over each buffer from the pre-existing encoding from the CPU to the GPU and has each GPU thread decompress a string at a time. Omitting the boilerplate index computation to determine which strings a thread will decompress, the meat of the work is pretty straightforward: Pretty simple stuff. Writing the kernel had been easier than I expected. I was feeling good until I ran the benchmark: I didn’t expect to hit the 270GB/s theoretical DGX Spark memory bandwidth throughput on the first try, but I also didn’t expect the kernel to perform worse than on-CPU decompression. Profiling CUDA Kernels This seemed like a good time to dogfood our GPU profiler with PC sampling enabled https://www.polarsignals.com/blog/posts/2026/06/10/nvidia-cuda-pc-sampling to see if I could find anything obvious: Note that all profiles in this post are inverted in order to group by stall reason. The profiles produced by PC sampling are very similar to off-CPU profiles: every sample records why progress isn’t happening. The SASS instructions have an associated stall reason that describes why the warp group of 32 threads was stalled. The not issued suffix isolates scheduling cycles where the warp scheduler could not issue an instruction at all. With the help of our MCP server I highly recommend using it if you’re not familiar with the code you’re profiling , I managed to decipher the two main stall reasons: long scoreboard where we’re basically waiting for a memory load before overwhelmingly an ISETP instruction i nteger set p redicate; predicate evaluation on that data dependency and lg throttle which implies we’re issuing way too many l ocal or g lobal memory load/store instructions and are stalled waiting for the instruction queue to drain. This profile illustrates one of the most important lessons I took away about GPU performance engineering: it’s all about memory. I found this talk https://www.youtube.com/watch?v=3l10o0DYJXg by Stephen Jones NVIDIA System Architect to be a wonderful explanation of why GPU performance is all about hiding memory latency, or as he puts it: “Where is my data?” Where is my data? The long scoreboard stall reason is a large part of the profile, and reducing these seems like a question of reducing and/or amortizing memory loads, especially in our hot loop. Two candidates stand out: the byte-at-a-time code load from compressed data and the static symbol table accesses. Packing the compressed data load into as many bytes as possible requires some pointer arithmetic since CUDA does not allow issuing a misaligned multi-byte load. The trick here is to use register scratch space and issue the largest aligned load to get to u64 alignment and load a u64 at a time from then on. Unfortunately, that didn’t seem to move the needle at all, likely because the GPU is fetching a full cache line’s worth of data 64 bytes regardless of whether one or eight bytes were requested, so these loads seem to already be optimized. This change also seems to introduce some previously unseen instructions like BSSY and BSYNC which are used to handle warp divergence threads in a warp doing different things . Given that threads in a warp must fundamentally execute in lockstep, these instructions are not a good sign: The next candidate was putting the symbol table into shared memory, which is cheaper to access than global memory. However, there was no improvement with this change either. As the profile tells us the change seems to have reduced long scoreboard stalls in exchange for mio throttles which in GPU architecture is upstream of the lg throttles and the long scoreboard stalls, so it looks like we just shifted the bottleneck upstream: A little bit of research produces a convincing theory: since the symbol reads are inherently random, multiple threads in a warp request different addresses from shared memory and this causes these accesses to be serialized. This particular issue is known as bank conflict latency https://modal.com/gpu-glossary/perf/bank-conflict . After these changes I came to the conclusion that it’s tough to beat the GPU’s L1 cache-line amortization and random access reads are not a fit for shared memory. I’m very likely missing something, and there might be some optimizations to do around fetching compressed bytes before the hot loop, but at this point I decided to move on to the next bottleneck. Where is my data? try number two Loads were difficult to optimize at least to my untrained eye so the next logical step was moving to optimizing stores. Theoretically, loads are harder to optimize because if you request a specific byte, it is very likely that you will request the next, contiguous byte. GPUs and many other things use this information for implicit optimization. Stores are a different story. When you write a byte, the next byte to be written is unknown. The main local to global memory stores the FSST kernel is doing is writing the decompressed bytes a byte at a time. The optimization for this path is very similar to the aligned loads described above: write the decompressed result byte by byte into registers and emit the widest possible aligned store when enough data has been buffered. The code is here https://github.com/vortex-data/vortex/pull/7776/changes diff-eede4b074f3aff6ab480c206cae3f92afa021f890f71172443944a4719e9fdbeR86 if you’re feeling brave, but the animation below illustrates this idea well enough. This optimization showed a dramatic improvement and finally beats the CPU implementation. This profile comparison also shows that the lg throttle bottleneck was greatly reduced: Warp Divergence and GSST GSST https://dl.acm.org/doi/epdf/10.1145/3719330.3721228 is a paper that modifies the FSST encoding for more efficient execution on the GPU. I didn’t use it as a reference for the optimizations we covered since a goal of the hackathon was to dogfood the Polar Signals GPU Profiler, but there is a lot of overlap between the optimizations mentioned in the paper and the ones we covered. One thing profiling/PC sampling doesn’t show is the warp execution efficiency. As mentioned before, all threads in a warp must execute in lockstep so if one thread has finished execution or will execute a different instruction, it is masked as inactive during an execution cycle which reduces execution efficiency. One further optimization I would not have discovered with profiling is the split kernel optimization from the GSST paper. Since threads are decompressing variable-length strings, some threads in a warp might be inactive after decompressing short strings while others are still decompressing the longer strings. The optimization is to use pre-computed splits in the compressed data and even out the workload across threads to increase execution efficiency. Computing this split metadata on the CPU side and shipping it off to the GPU results in another nice performance boost: However, this improvement is more illustrative than anything else, since the CPU pre-computation is something to be avoided in production scenarios. The whole point of GSST is that this sort of metadata is pre-computed at write time. It’s good motivation to introduce a GPU-optimized encoding though. Final thoughts Writing and optimizing the FSST CUDA kernel in Vortex was both frustrating and fun. GPUs are really finicky but reward you with outstanding performance once you understand the architecture and how it responds to the kernel. I realized that a large part of performance engineering on GPUs is about throughput: figuring out how to turn on the memory taps so the data flows as freely as possible from one side of the GPU to the other while keeping the threads as busy as possible in order to hide the memory latency. The result of the hackathon was good enough to PR https://github.com/vortex-data/vortex/pull/7776 and is now generally available in Vortex. Although introducing GSST is the long term goal, the FSST decompression we covered in this blog post achieves 40% of measured device bandwidth via cuMemcpy, 30% of theoretical and is 3x faster than nvcomp’s zstd https://developer.nvidia.com/nvcomp on the same corpus of data this is fundamentally due to FSST's decompression parallelizability : Thanks for reading and feel free to point out anything I did wrong or could have done better. We’re also still in the early days of GPU profiling, so we'd love to hear how we can make Polar Signals' GPU profiler fit your workflow better. Keep up with Polar Signals Receive new posts, product updates, and insights on performance engineering straight to your inbox.