{"slug": "an-interactive-pytorch-debugger-that-looks-deep-inside-your-neural-net", "title": "An interactive PyTorch debugger that looks deep inside your neural net", "summary": "Nansense, a new interactive PyTorch debugger, allows developers to pause training, step batch-by-batch, and time-travel to different epochs while visualizing activations, gradients, weights, and optimizer state. The tool helps diagnose neural network failures by inspecting tensors on demand, avoiding the infeasibility of persisting gigabytes of data per batch.", "body_md": "*Don't guess why your neural network fails to learn. Instead, have a look inside.*\n\n## demo.webm\n\n*Video 1. The main nansense UI. Clicking the layers in the architecture shows activation/gradient maps. The size of the receptive field can be measured by perturbing the input image and measuring the diff. Watched layers collect histograms and min/max activating pixel statistics for interpretability. You can even run deep dream at any point during the training run to visualize what exactly each neuron is looking for.*\n\n*Nansense* is a PyTorch debugger that visualizes activations, gradients, weights, optimizer state and various statistics. 
You can **pause, step batch-by-batch, and time-travel to a different epoch while training**, and see exactly what every layer is doing.\n\nHere's how *nansense* can help:\n\n- **See what is actually going on**. [Visualize activations and gradients](#visualize-activations-and-gradients-throughout-training), [find image patches with minimal or maximal activation for a given channel](#minmax-activation-patches) and [simulate what each neuron is searching for (deep dream)](#simulate-what-a-neuron-is-searching-for-deep-dream)\n- **Spot optimization bottlenecks**. [Discover insufficient receptive fields](#measure-receptive-field-of-a-neuron), [measure neuron death](#investigate-dead-neurons), [discover padding artifacts](#padding-jump-target) and [spot gradient underflow](#spot-gradient-underflow)\n\nYou can easily try out the [examples](#run-examples) yourself. Or wire it into your own training loop. Adding nansense support is just a few lines of code. Here's an example for integrating with [raw PyTorch](#wire-it-into-your-loop-raw-pytorch) and with [Lightning](#wire-it-into-your-loop-pytorch-lightning).\n\nLoggers like Weights & Biases and TensorBoard record scalar curves of loss and accuracy that you scroll through after the run. Nansense works inside the live training loop instead: it pauses so you can step batch-by-batch and time-travel while inspecting the activations, gradients, weights and optimizer state of every layer. You can even run experiments like deep dream or Grad-CAM on the paused model to probe what a given neuron has learned.\n\nPersisting all this data on disk is infeasible, as a single batch of activations and gradients can easily be several gigabytes. Nansense sidesteps that by pausing and inspecting the tensors on demand, instead of writing everything to disk.\n\nA layer's activations (top row) and gradients (bottom row) for a single input. Here, an image of a paraglider passes through an intermediate batch normalization layer. 
Each column is a channel, drawn on a diverging red/blue scale. Step through training to watch what each channel responds to and how strong the backward signal reaching it is.\n\n*Figure 1. An intermediate layer's activations and gradients from an image of a golf ball. Each column is a separate channel. Due to the next layer being a ReLU, the gradient exists only where the activation is positive.*\n\n*Figure 2. Activations of a CIFAR10-trained network layer, with the input shown for comparison as the rightmost image. The augmentation used here zero-pads on the left and bottom of the image, which lights up as strong edge activations on every channel. Maybe use reflection padding next time?*\n\nFor any channel, nansense collects the input patches that drove it to its strongest (and weakest) responses over an epoch. Reading off the gallery is the quickest way to tell what a specific neuron has learned to detect.\n\n*Figure 3. For each of the first 6 channels/neurons in a specific layer, the 4 strongest activating patches from the training set have been collected. The heatmap coloring shows the activation strength. As an example, `CHANNEL 1` and `CHANNEL 4` both seem to be optimized for detecting French horns; however, `CHANNEL 1` is more centered on the instrument itself, while `CHANNEL 4` seems to also be activated by human faces. See also Figure 4.*\n\nDeep dream optimizes the input itself to maximally excite a chosen neuron, synthesizing the pattern it is looking for.\n\n*Figure 4. Deep dream on exactly the same channels/neurons that were used to select maximally activating patches for Figure 3. `CHANNEL 0` creates a lot of vertical red structures, loosely resembling the typical gas station presented in Figure 3. In `CHANNEL 1` we can see yellowish curved structures, picked up from French horns. 
Channels `3` and `5` have circular structures with dots inside, analogous to golf balls.*\n\nAny layer can be visualized this way, but here we use the network's final output layer, where the result is easiest to interpret. On MNIST, it produces ghostly digits from 0 to 9.\n\n*Figure 5. Deep dream on the final layer of a LeNet network on the MNIST dataset.*\n\nThose numbers look strange because deep dream does not necessarily make the features realistic; it maximizes them. A good example is the number 4: there are many different ways you could combine these strokes into a 4, which is why it excites the neuron even more than a typical 4 would.\n\nHere's a visualization of other layers:\n\n## demo_deep_dream.webm\n\nTo measure the receptive field of a neuron, *nansense* supports perturbing a single pixel and watching the diff from the original propagate through the neural network.\n\n*Figure 6. Here we perturb a single pixel of an image, and visualize how the perturbation transmits through the network. As we go deeper down the layers, the diff spreads throughout most of the image, which indicates a reasonably healthy receptive field (at least some part of the network can see the whole image).*\n\n*Nansense* can measure each channel's activation and gradient distribution over a full epoch. This makes it easy to discover optimization problems, such as some neurons being driven to zero.\n\n*Figure 7. The activation histogram of a dead channel in a layer. Apparently all activations are negative, which causes the next ReLU layer to clamp everything to zero. Because this eliminates any gradients, the channel will likely never recover from this state.*\n\nIn low-precision training (fp16), a layer's gradients can collapse into the *subnormal* range (below the dtype's smallest normal value), where precision drains toward zero and the layer's learning quality quietly drops. 
*nansense* checks activations and gradients for NaNs, infinities and this subnormal/overflow band every few batches, and pauses with a warning banner once a meaningful share of a layer's gradient magnitude lands there.\n\nThe examples run with [uv](https://docs.astral.sh/uv/getting-started/installation), a fast Python package manager. `uv` does not pollute your other Python environments, and automatically installs the necessary packages when running a script.\n\n```\n# Install uv:\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n```\n\nPick the dependency group that matches your hardware and pass it as `--group`:\n\n| Group | Hardware |\n|---|---|\n| `cpu` | No GPU, CPU-only, any platform |\n| `cuda-legacy` | Older NVIDIA GPUs: Maxwell, Pascal, Volta (CUDA 12.6) |\n| `cuda` | Current NVIDIA GPUs: Turing through Blackwell (CUDA 13.0) |\n| `rocm` | AMD GPUs (ROCm 7.2) |\n\nThen launch any example; the requirements, datasets and any pretrained networks are downloaded automatically, and the UI serves on `--nansense-port`.\n\n```\n# `examples/standard/main.py` is a good starting point for mnist, cifar10 and imagenette. Use `--dataset` and `--model` for different combinations.\nuv run --group [group] examples/standard/main.py --nansense-port 8080\n\n# More exotic, but harder to interpret tasks:\nuv run --group [group] examples/game_of_life/main.py --nansense-port 8080\nuv run --group [group] examples/audio_keywords/main.py --nansense-port 8080\nuv run --group [group] examples/depth_make3d/main.py --nansense-port 8080\n\n# Multi-input demo: a 5-channel image + a flat stats vector. Shows the input\n# pane's input picker, the `input_transform` for non-RGB images, and the\n# flat-input strip.\nuv run --group [group] examples/multimodal/main.py --nansense-port 8080\n```\n\nA focused browser tab opens automatically at the boxed URL it prints (open it yourself if your environment has no browser); training pauses on the first batch. Drive it from the top bar. 
See the [UI tutorial](#ui-tutorial) for more info.\n\nIf you hit out-of-memory errors, lower `--batch-size`. If training is slow and you have GPU VRAM left, increase `--batch-size`. Both memory use and training speed can be improved with `--dtype bf16` (older GPUs don't support it).\n\n*Figure 8. Main view of the UI, with stepping controls, architecture, individual activations/gradients, inputs and input controls. Each layer in the architecture can be clicked to open the respective layer card.*\n\nWhen a session starts, nansense serves a web page and pauses on the first batch.\nYou drive the run from the top bar: **Step Batch** advances one batch, **Run**\nruns to the end and then pauses, and **Stop** pauses a free-running session. The\ndropdown next to Step Batch steps a whole epoch or up to a custom point.\n\n**Time Travel** jumps back to the start of any cached epoch. It is enabled once\nthe training loop is wrapped in a [restorer](#wire-it-into-your-loop-raw-pytorch),\nwhich checkpoints each epoch start to disk.\n\nThe left pane shows the model as a clickable architecture graph. Click a node to\n**watch** that layer: its activations and gradients appear as a card, and from\nthat point on every batch feeds them into running statistics. Watched views\nrefresh on every pause and, while training runs, on the cadence set under\n*Update frequency* in the settings.\n\nWatching slows down the training and consumes memory, so\nit's generally better to watch only a few layers at a\ntime. Open a watched layer's **stats view** for a closer look:\na histogram of its activation and gradient values over the epoch (down to a\nsingle channel), and a gallery of the input patches that drove each channel to\nits most extreme responses. 
Its **Current batch** phase shows the last captured\nbatch's distribution for *any* layer, watched or not, and the top bar's stats\nbutton pauses or resumes collection without hiding the cards.\n\nEach layer card has an **Experiment** button. On the experiment page, pick a\nmethod (deep dream, or a Captum attribution: Grad-CAM, Neuron Gradient, Neuron\nIntegrated Gradients, Occlusion), set its parameters, and run it on the layer.\nExperiments run between batches, so training must be paused; results show one\ncard per input sample.\n\nThe right sidebar controls which input the layer views are computed from. A\nmodel with several inputs gets an **Input** dropdown to choose which one the\npane shows and perturbs; a non-RGB image needs an `input_transform` to display\n(see the [Python API](#python-api)), and a flat `(N, C)` input shows as a\nclickable per-feature strip. **Select sample in batch** picks which sample of\nthe current batch to show. The\nviews follow the live training batch by default; **Pin** freezes the current\nbatch as a fixed input that nansense re-runs at every update, so you can watch\none input's activations evolve as training proceeds and across time travel, and\n**Forward mode** (Unchanged / Eval / Train) sets how BatchNorm and dropout\nbehave on those re-runs.\n\n**Perturb** lets you click pixels to edit the input; nansense re-runs the model\nand the layer cards switch to the diff, so you can trace a single changed pixel\nthrough the network.\n\nThe settings dialog records any view to an MP4, one frame per visualization\nupdate, written under `nansense_recordings/`. 
Start a recording with a layer\nwatched or an experiment open, then save or discard it from the same dialog.\n\n```\npip install nansense\n```\n\nNote: Install your PyTorch build first (see [pytorch.org]) so your CUDA / ROCm / CPU choice is preserved: nansense bundles `captum` for the experiment page's attribution methods, and captum needs torch ≥ 2.3, so a pre-existing torch keeps `pip` from pulling a default CPU build. `pip install lightning` additionally enables `nansense.lightning`. Runs on Python 3.10–3.14.\n\n``` python\nimport torch\nimport nansense\n\n# Init model, optimizer, criterion, dataloaders\nmodel = ...\noptimizer = ...\ncriterion = ...\ntrain_dl, val_dl = ...\n\n# Setup UI. The schedule is discovered as you train (phase names and batch\n# counts are learned from the loop below); no need to declare them up front.\nsession = nansense.start(model, optimizer=optimizer, port=8080, enabled=True)\n\n# Time travel needs an epoch cache. `session.epochs(50)` iterates like\n# `range(50)` but checkpoints each epoch start; wrap each iteration's body in\n# `with session.restore_point():` so a UI-requested jump can unwind it and\n# re-enter at a different epoch. 
Without this loop, training runs once through\n# and the Time Travel button is disabled.\nfor epoch in session.epochs(50, cache_dir=\".nansense_cache\"):\n    with session.restore_point():\n        # Training batch iteration\n        for inputs, targets in session.batches(train_dl, phase=\"train\"):\n            # Keep zero_grad() at the beginning of the batch: nansense reads .grad\n            # when the batch exits, so zeroing after step() would leave the\n            # weight-gradient views empty.\n            optimizer.zero_grad()\n            loss = criterion(model(inputs), targets)\n            loss.backward()\n            optimizer.step()\n        # Validation batch iteration ...\n\n# Close the UI (the served page stays up for post-mortem browsing)\nsession.close()\n```\n\nSee the [Python API](#python-api) for more information.\n\n``` python\nimport lightning as L\nfrom nansense.lightning import NansenseCallback, fit_with_time_travel\n\n# PyTorch Lightning modules\nmodule = ...\ndatamodule = ...\n\n# `model=\"net\"` is the attribute path to the network inside your LightningModule, e.g. module.net\ncallback = NansenseCallback(port=8080, model=\"net\", enabled=True)\n\n# Time travel consumes the running fit, so the trainer comes from a factory:\n# fit_with_time_travel builds a fresh Trainer for each jump-and-replay attempt.\ntrainer_factory = lambda: L.Trainer(max_epochs=50)\nfit_with_time_travel(trainer_factory, module, datamodule=datamodule, callback=callback)\n```\n\nSee the [Python API](#python-api) for more information.\n\n`nansense.start(model, ...)` creates the `Session` and, when `port=` is given,\nserves the UI. 
The arguments worth knowing:\n\n- `optimizer` (optional): adds per-parameter optimizer state and live hyperparameters to the weights page.\n- `scheduler` (optional): lets time-travel checkpoints restore the LR schedule.\n- `enabled`: `False` makes the session a near-zero-overhead no-op, so you can leave the wiring in place and switch the UI off with one flag.\n- `port` / `host` / `open_browser`: serve the UI immediately (the banner and auto-opened tab are skipped if a concurrent session already holds the port); omit `port` and call `nansense.serve(session, port=...)` separately for finer control.\n- `input_mean` / `input_std`: the input normalization, so images display in their original colors.\n- `input_transform`: a callable mapping a non-RGB image input `(N, C, H, W)` to a displayable `(N, 1|3, H, W)` image in `[0, 1]` (keeping `H × W`); without it, an input whose channel count isn't 1 or 3 shows a hint to add one. A flat `(N, C)` input needs none; it renders as a colormapped strip. For a multi-input model, `input_mean` / `input_std` / `input_transform` each take either one value for all inputs or a `dict` keyed by input name, and the input pane gains a dropdown to pick which input to view and perturb.\n\nIterate each phase with `session.batches(loader, phase=...)`, and call\n`session.close()` when training finishes (the served page stays up for\npost-mortem browsing). For time travel, drive the epoch loop with\n`for epoch in session.epochs(N, cache_dir=...)` (default `.nansense_cache`) and\nwrap each iteration's body in `with session.restore_point():` as shown above.\n\nThe schedule is discovered as you go: phase names and per-phase batch counts are learned while you iterate `session.batches`, so the UI's per-phase progress and boundary stops become exact after the first epoch. 
Pass `phases={\"train\": a, \"val\": b}` to `start()` if you want that precision from the very first epoch; this up-front declaration is optional (it's what the PyTorch Lightning integration uses).\n\nFor **PyTorch Lightning**, attach a `NansenseCallback(model=\"<attr path to the network>\", ...)` to your trainer and run the fit through `fit_with_time_travel`,\nwhich owns the jump-and-replay loop. Both accept the same `port` / `host` /\n`open_browser` / `enabled` / `input_mean` / `input_std` / `input_transform` arguments as `start`.\n\n**Distributed (DDP)** needs no special wiring: call `nansense.start()` on every\nrank (the DDP-wrapped model is unwrapped automatically). Rank 0 serves the UI and\ndrives pausing and stepping; the other ranks follow its pace and fold their data\nshard into the watch-page statistics. See `examples/standard/main.py --distributed`. Keep in mind that DDP support is currently **experimental**.\n\nSee [INTERNALS.md](/kongaskristjan/nansense/blob/main/INTERNALS.md) for how it works under the hood (it's long).", "url": "https://wpnews.pro/news/an-interactive-pytorch-debugger-that-looks-deep-inside-your-neural-net", "canonical_source": "https://github.com/kongaskristjan/nansense", "published_at": "2026-06-29 17:26:55+00:00", "updated_at": "2026-06-29 17:50:40.442689+00:00", "lang": "en", "topics": ["machine-learning", "developer-tools", "neural-networks", "ai-tools"], "entities": ["PyTorch", "Nansense", "Weights & Biases", "TensorBoard", "Lightning", "Grad-CAM"], "alternates": {"html": "https://wpnews.pro/news/an-interactive-pytorch-debugger-that-looks-deep-inside-your-neural-net", "markdown": "https://wpnews.pro/news/an-interactive-pytorch-debugger-that-looks-deep-inside-your-neural-net.md", "text": "https://wpnews.pro/news/an-interactive-pytorch-debugger-that-looks-deep-inside-your-neural-net.txt", "jsonld": 
"https://wpnews.pro/news/an-interactive-pytorch-debugger-that-looks-deep-inside-your-neural-net.jsonld"}}