# Optimizing a Neural Reconstruction Pipeline Using NVIDIA Nsight Developer Tools

> Source: <https://developer.nvidia.com/blog/optimizing-a-neural-reconstruction-pipeline-using-nvidia-nsight-developer-tools/>
> Published: 2026-06-30 16:00:00+00:00

[NVIDIA Omniverse NuRec](https://docs.nvidia.com/nurec/) is a neural reconstruction pipeline for building high-fidelity 3D representations of real-world environments from multisensor data such as cameras and lidar. It is used to reconstruct dynamic scenes captured by [autonomous vehicle (AV)](https://www.nvidia.com/en-us/solutions/autonomous-vehicles/) and robotics platforms into simulation-ready digital environments that can be rendered, replayed, and analyzed inside [NVIDIA Omniverse](https://www.nvidia.com/en-us/omniverse/) and related simulation workflows.

These reconstructions play a critical role in the development of [physical AI](https://www.nvidia.com/en-us/glossary/generative-physical-ai/) and autonomous systems. Engineers can capture a real-world driving or robotics scenario, reconstruct the environment, and then inspect or replay the scene. This enables them to better understand model behavior, validate perception results, generate synthetic viewpoints, or create training data for downstream machine learning workflows.

NuRec combines neural rendering techniques such as Gaussian splatting with GPU-accelerated rendering and simulation pipelines to produce highly realistic scene reconstructions. However, this level of fidelity comes with significant computational cost. Reconstruction and rendering workloads involve large volumes of sensor data, complex PyTorch-based training loops, and highly specialized CUDA kernels that place heavy demands on GPU resources.

This post walks through an example of optimizing the NuRec neural reconstruction pipeline with [NVIDIA Nsight Developer Tools](https://developer.nvidia.com/tools-overview).

## Solving performance optimization challenges

Performance is critical for NuRec workflows because reconstruction turnaround time directly impacts engineering productivity. A common workflow involves identifying an interesting or problematic AV run—for example, a scenario where the perception or planning stack behaved unexpectedly—and launching a reconstruction so engineers can inspect the scene as quickly as possible. Waiting several hours for reconstruction slows iteration and debugging velocity significantly.

At the start of this optimization effort, reconstructing even relatively short captures could take from over an hour to several hours depending on the scene and configuration. The team’s long-term goal is much more ambitious: real-time reconstruction performance, where a 30-second capture can be reconstructed in approximately 30 seconds.

Performance also matters beyond reconstruction itself. Once scenes have been reconstructed, rendering-only workflows may generate massive numbers of frames for reinforcement learning (RL), synthetic data generation (SDG), and large-scale simulation. At this scale, even modest performance improvements can translate directly into substantial reductions in GPU time and infrastructure cost.

To tackle these challenges, NVIDIA profiling and optimization tools were used, primarily [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) and [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute), to analyze the NuRec workload, identify bottlenecks across the software stack, and iteratively optimize both the application-level workflow and the underlying CUDA kernels.

## Profiling and optimization using Nsight Systems

[Nsight Systems](https://developer.nvidia.com/nsight-systems) is a platform profiling tool to help you visualize and understand the performance behavior and resource utilization of workloads, including CPU, GPU, storage, networking, and more. The first step in many performance optimization workflows is to run an Nsight Systems profile to establish a baseline and try to identify some initial bottlenecks or key areas for improvement.

With the goal of optimizing the training loop, we used the Nsight Systems built-in function support and [NVIDIA Tools Extension SDK (NVTX)](https://github.com/NVIDIA/NVTX) included in PyTorch to zoom into a single iteration of the forward pass shown in Figure 1. The initial assumption was that the rendering kernel would take most of the runtime and would be the best starting point for optimization. However, the CUDA HW timeline at the top revealed that for the majority of the time the GPU was underutilized or not used at all. Notice the lack of blue on the top row. The application was also launching many more tiny kernels than expected.

After this initial realization, it was important to drill deeper into the phases of the forward pass to identify where time was being spent and what phases were underutilizing the GPU. Additional NVTX annotations were added to the code to delineate various phases and functions. A new profile (Figure 2) showed that `collect_gaussian_parameters` was taking the majority of the time before rendering even started and is called multiple times in each forward pass.
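Phase-level annotations of this kind can be added from Python with the NVTX bindings that ship with PyTorch (`torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`). The following is a minimal sketch, not the NuRec code: `render` and the bodies of both functions are hypothetical stand-ins, and the `nvtx_range` helper is an assumption that falls back to a no-op when PyTorch or CUDA is unavailable.

```python
import contextlib

try:
    import torch
    # NVTX calls need a CUDA-enabled PyTorch build; otherwise use the no-op path.
    _nvtx = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _nvtx = None


@contextlib.contextmanager
def nvtx_range(name):
    """Open a named NVTX range for the duration of the with-block."""
    if _nvtx is not None:
        _nvtx.range_push(name)
    try:
        yield
    finally:
        if _nvtx is not None:
            _nvtx.range_pop()


def collect_gaussian_parameters(frame):
    # Hypothetical stand-in for the real parameter-collection phase.
    with nvtx_range("collect_gaussian_parameters"):
        with nvtx_range("interpolate"):
            params = [0.5 * (a + b) for a, b in zip(frame, frame[1:])]
    return params


def render(params):
    # Hypothetical stand-in for the rendering phase.
    with nvtx_range("render"):
        return sum(params)


def forward_pass(frame):
    with nvtx_range("forward"):
        return render(collect_gaussian_parameters(frame))


result = forward_pass([0.0, 2.0, 4.0])
```

Nested ranges appear as stacked bars in the NVTX row of the timeline, which makes it straightforward to see which phase the CUDA API calls and kernels beneath them belong to.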

Digging even deeper revealed the `interpolate` function taking the plurality of the time (4.148 ms) and calling many small kernels and memory operations that bogged down the GPU, as seen in the bottom CUDA API row in Figure 3.

We dug into the code under the `interpolate` function and focused on fusing the small kernels and submitting larger chunks of work to the GPU. We were able to condense all of this work into a single kernel, which reduced the `interpolate` function from 4.184 ms to 83.81 µs (Figure 4). This is nearly a 50x speedup.
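As a quick check of the speedup arithmetic, the snippet below uses only the two timings quoted in this paragraph, converted to microseconds. The variable names are illustrative.

```python
# Timings quoted in the paragraph above, both expressed in microseconds.
before_us = 4184.0  # 4.184 ms before fusing the small kernels
after_us = 83.81    # single fused kernel

speedup = before_us / after_us
saved_us = before_us - after_us

print(f"speedup: {speedup:.1f}x, saved per call: {saved_us / 1000.0:.2f} ms")
```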

Next, we identified long `cudaStreamSynchronize` calls (visible as green bars on the timeline) that were delaying the CPU from enqueuing many small kernels while the GPU was active. Once the synchronize call returned, the small kernels were scheduled and launched one at a time, which produced the patchy GPU utilization visible in the top CUDA HW row (Figure 5).

After one synchronization point was removed, others further down the line became the bottleneck. This process continued until enough were removed that the CPU could efficiently enqueue work while the GPU was busy. This allowed the tiny kernels to run compactly because they were no longer CPU launch-time bound.
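The effect of a blocking synchronization on launch-bound kernels can be illustrated with a toy timeline model. This is a simplified sketch under stated assumptions, not a measurement: the launch cost, kernel durations, and the function `gpu_finish_time` are all hypothetical, the CPU is assumed to enqueue kernels one at a time on a single stream, and the GPU runs them in order.

```python
def gpu_finish_time(kernel_us, launch_us, sync_after=()):
    """Return the time (in us) at which the GPU finishes the last kernel.

    kernel_us  -- GPU duration of each kernel, in launch order
    launch_us  -- CPU cost of enqueuing one kernel
    sync_after -- indices of kernels after which the CPU blocks until
                  the GPU is idle (a cudaStreamSynchronize-like wait)
    """
    cpu_t = 0.0     # time at which the CPU has finished its latest enqueue
    gpu_free = 0.0  # time at which the GPU finishes the work enqueued so far
    for i, duration in enumerate(kernel_us):
        cpu_t += launch_us
        # A kernel starts once it has been enqueued and the GPU is free.
        gpu_free = max(cpu_t, gpu_free) + duration
        if i in sync_after:
            cpu_t = max(cpu_t, gpu_free)  # CPU stalls until the GPU drains
    return gpu_free


# Hypothetical workload: one long kernel followed by 100 tiny kernels.
kernels = [1000.0] + [2.0] * 100
with_sync = gpu_finish_time(kernels, launch_us=5.0, sync_after={0})
without_sync = gpu_finish_time(kernels, launch_us=5.0)

print(with_sync, without_sync)
```

In this model, the synchronized run is paced by the 5 µs launch cost of each tiny kernel after the wait, whereas the unsynchronized run hides all of the launch cost behind the long kernel, so the tiny kernels execute back to back.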

Reducing the time spent collecting the parameters and removing the synchronization points that were causing bottlenecks cleared the way for kernel-level optimization. Nsight Systems enables you to identify which kernels are the hottest. The `renderBackward` kernel was clearly the top candidate in this case.

## Kernel optimization using Nsight Compute

[Nsight Compute](https://developer.nvidia.com/nsight-compute) is the best tool for profiling and optimizing individual kernels. It can automatically replay kernels to collect large amounts of performance data at very fine granularities using various types of hardware counters, software patching, and instrumentation. It includes a built-in rule system and guided analysis to help users identify and understand issues.

The `renderBackward` kernel is used in both camera and lidar data processing. Profiling multiple instances of this kernel with Nsight Compute revealed that it has only ~15% occupancy and that the behavior and resource requirements of this kernel differ significantly depending on which of these inputs is being processed.

The longest three `renderBackward` kernels are from lidar data and the other three are from camera data. Despite these differences, both were allocating 167 registers per thread (Figure 8).

Setting the top lidar kernel as an [Nsight Compute baseline](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#id8) and comparing a camera kernel automatically revealed that while both had the vast majority of accesses in shared memory, the camera kernels were making ~75% fewer requests even though both lidar and camera instances of the kernel were allocating the same amount of shared memory per block statically.

Given these behavioral differences between the camera and lidar uses of the `renderBackward` kernel, and the fact that register and shared memory allocations were static and identical for both, the next step was to try splitting the kernel depending on whether it was processing camera or lidar data.

For each version of the kernel, the team experimented with and tuned register allocations using the [`__launch_bounds__`](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html#launch-bounds) qualifier, along with the amount of shared memory allocated per block. The `cudaFuncSetCacheConfig` [runtime API](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION) was used to set the preference of both kernels to have a larger shared memory and smaller L1 cache.

After this testing and optimization, register allocation dropped from 167 to 64 for the lidar kernel and to 128 for the camera kernel, and both were able to run efficiently with about half of the originally allocated shared memory. This raised occupancy from ~15% to between 30% and 50% and significantly reduced overall runtime, with the longest lidar kernel decreasing from 31 ms to 18 ms.
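To see why register count matters so much for occupancy, the sketch below computes a register-only upper bound on theoretical occupancy. The SM limits are assumptions, since the post does not name the GPU: 65,536 registers per SM, 64 resident warps per SM, and registers allocated in units of 8 per thread. These values vary by architecture. The measured occupancy figures in the post also reflect shared memory and block-size limits that this sketch ignores, so the numbers are not expected to match exactly.

```python
def register_limited_occupancy(
    regs_per_thread,
    regs_per_sm=65536,    # assumed register file size per SM
    max_warps_per_sm=64,  # assumed resident-warp limit per SM
    warp_size=32,
    reg_alloc_unit=8,     # assumed per-thread allocation granularity
):
    """Upper bound on occupancy when registers are the only limiting resource."""
    # Round the per-thread register count up to the allocation unit.
    regs = -(-regs_per_thread // reg_alloc_unit) * reg_alloc_unit
    # Number of whole warps whose registers fit in the register file.
    warps = min(max_warps_per_sm, regs_per_sm // (regs * warp_size))
    return warps / max_warps_per_sm


for regs in (167, 128, 64):
    print(regs, f"{register_limited_occupancy(regs):.2%}")
```

Under these assumptions, halving the register count roughly doubles the number of warps that can be resident on an SM, which is the direction of the improvement reported above.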

But there is still room for improvement. The next issue identified, which is being worked on at the time of publication, is long-tail effects in the kernel caused by a workload imbalance. This can be seen in the PM Sampling section of Nsight Compute (Figure 11). The first half of the kernel shows an average of 32 active warps, which then begin to taper off, and for the last several milliseconds there is less than one active warp per cycle. Ideally, all the warps would be active for the entirety of the kernel.
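The cost of such a tail can be estimated with a toy model in which every warp starts at the same time and runs for its own duration. The per-warp durations below are hypothetical and are not taken from the `renderBackward` profile. The function `tail_stats` is an illustrative helper that ignores scheduling and assumes all warps are resident at once.

```python
def tail_stats(warp_work_ms):
    """Summarize a kernel whose warps all start at t=0 with unequal work."""
    kernel_ms = max(warp_work_ms)  # the slowest warp sets the kernel time
    total_work = sum(warp_work_ms)
    avg_active = total_work / kernel_ms           # time-averaged active warps
    balanced_ms = total_work / len(warp_work_ms)  # kernel time if work were even
    return kernel_ms, avg_active, balanced_ms


# Hypothetical imbalance: most warps finish early, one warp runs 4x longer.
work = [10.0] * 24 + [20.0] * 7 + [40.0]
kernel_ms, avg_active, balanced_ms = tail_stats(work)

print(kernel_ms, avg_active, balanced_ms)
```

In this example, 32 warps are launched but only 10.5 are active on average, and redistributing the same work evenly would shorten the kernel from 40 ms to about 13 ms.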

## Get started with NVIDIA Nsight Developer Tools

Performance analysis and optimization is an iterative process that consists of running a profile, identifying a problem, fixing it, and starting again. You can use tools like Nsight Systems and Nsight Compute to make this entire process easier for developing and optimizing on NVIDIA GPUs. Both tools are free—download [Nsight Systems](https://developer.nvidia.com/nsight-systems) and [Nsight Compute](https://developer.nvidia.com/nsight-compute) and try them with your own use case. If you have questions or want to share what you find, leave a comment on the [NVIDIA Developer Forums](https://forums.developer.nvidia.com/c/developer-tools/106).

### Acknowledgments

*Special thanks to NVIDIA contributors Francois Trudel, Joey Lai, and Rodolfo Lima.*