Training nanoGPT on Slurm with a Nix-Pinned Environment

A researcher using nanoGPT on a MacBook Pro faces dependency failures when moving to a GPU cluster, prompting a solution using Nix and Flox to create reproducible, cross-platform runtime environments for ML/AI workloads that work consistently from local development to production on Slurm and Kubernetes.

A researcher prototypes a model on her MacBook Pro. She uses current-stable PyTorch, a Python interpreter, some standard Python training libraries, a few native libs, and a Conda environment. This is a Python workload, so her single biggest concern isn’t “Can I validate my assumption with a small local experiment?”; it’s “Will all these Python dependencies actually run on my MacBook?” She’s using Conda, so everything should work. Holding her breath, she types a command and kicks off the run.

Huzzah! The model trains … and there’s signal! Elated, she pushes her code and opens a PR.

It’s at precisely this point that anything can happen, because what works locally doesn’t automatically work everywhere else. On the GPU cluster, as soon as the EKS pod comes up, things go sideways. PyTorch wasn’t compiled against the cloud GPU’s CUDA stack. A native extension tries to load libstdc++ using a path that doesn’t exist. The loader expects to read from local disk instead of S3. The job fails.

Fear, Loathing, and ML/AI Handoff

Feel familiar? People who work with ML/AI live this every day, sometimes several times a day. Maybe a job sails through prototyping on a GPU cluster … only to founder during training on Slurm. Or maybe it fails in eval. CI. MLOps. Staging. Maybe it transits all of these before failing in production. The point is the PTSD: the nagging anxiety that it’s going to fail, inexplicably, somewhere downstream.

This article describes a pattern for creating reproducible runtime environments for ML/AI using declared, graph-backed environments (https://flox.dev/blog/standardized-development-environments-explained/) based on Nix and Flox. The same Nix and Flox environments work on Linux and macOS, x86 or ARM, NVIDIA CUDA or Apple Metal/MLX. They travel from model training in local dev to checkpoint validation in eval. They run as-is, pulling in exactly the same dependencies, in CI, MLOps, and production.
The pattern looks like this:

- Teams define GPU-accelerated PyTorch, JAX, TensorRT, etc. as Nix or Flox runtime environments;
- ML/AI researchers define project-specific Nix or Flox environments on top of the appropriate runtime;
- Researchers run Nix or Flox ML/AI stacks on their MacBooks, prototype on NVIDIA DGX nodes, and train models on Slurm. Both Apple Metal/MLX and NVIDIA CUDA platforms get GPU-accelerated libraries;
- MLOps teams use Nix or Flox environments when evaluating + packaging checkpoints for production;
- Platform teams maintain just one environment for Slurm training and Kubernetes prod.

The upshot is that a single GPU-accelerated environment transits the ML/AI software lifecycle without accumulating stage-specific runtime barnacles. Teams can compose modular Nix and Flox environments to create ML/AI stacks bundling CUDA or Metal/MLX for GPU support; PyTorch, JAX, or TensorRT for training or inferencing; project-specific native and Python dependencies; and the code, data pipelines, and tools required to train, package, and ship ML/AI models. This pattern reduces debugging cycles and gives orgs a safe, atomic way to promote new releases, or to roll back if necessary to known-good ones.

A reusable, cross-platform PyTorch runtime

This article uses a PyTorch inferencing stack as its baseline example. But the same pattern works with JAX, TensorRT, and other ML frameworks. It works with model-serving runtimes, distributed training frameworks, batch inference jobs, EDA pipelines, eval harnesses, and RAG/embedding pipelines, too.

Creating a cross-platform, GPU-accelerated PyTorch runtime is straightforward with both Nix and Flox. Each is “declarative” in the sense that teams declare the versions of packages they want to be available in an environment; from there, each tool’s resolving machinery figures out how to make these coexist.
With Nix and Flox, then, a build recipe or runtime environment encodes named inputs, sources, patches, build instructions, toolchains, target systems, and environment variables as derivations (https://nix.dev/manual/nix/2.25/language/derivations); realizing these derivations produces store objects under /nix/store. With Nix and Flox, reproducibility is a function of the declared, resolved dependency graph, lock state, derivation, and closure of the realized store.

With Nix

The Nix equivalent of a cross-platform PyTorch runtime looks like:

This flake defines a cross-platform Python 3.13 + PyTorch base and composes it into context-specific outputs for runtime, training, eval, dev shell, CI shell, and an OCI container image. This gives platform teams a reproducible, cross-platform PyTorch base that works from local development → production.

Nix has excellent container tooling. The flake above defines an OCI image with dockerTools.buildLayeredImage. This tells Nix to build the image from the runtime’s closure, then to write the container’s default command and environment variables into the image metadata. For this example, we define the OCI image as part of the project repo’s Nix flake, but release teams can and do maintain their own downstream flakes. These consume the project flake’s output and emit an OCI image. This second pattern lets application teams expose a runtime closure (https://nix.dev/manual/nix/2.25/glossary.html?highlight=closure#gloss-closure) as the complete set of dependencies for their application. CI can test against the same closure, and downstream release flakes can package it into an OCI image, then tag, sign, scan, and publish it.

With Flox

The Flox equivalent of this environment is less verbose:

That’s it. The Nix flake explicitly defines a series of lifecycle roles (viz., a dev shell, a runtime package, a CI shell, a container image) that the Flox model abstracts.
To take one example, the Flox manifest declares packages, versions, supported systems, environment variables, services, and build recipes; it doesn’t, however, declare entrypoints for specific roles. It doesn’t need to. A Flox environment exposes dev tooling and libraries for local dev by default; activating it with the --mode run flag, or declaring this in the Flox manifest, restricts access to dev tooling and libraries. On Kubernetes, run mode is the default. Flox environments don’t need to declare the lifecycle options (e.g., default command, environment variables, etc.) for OCI images, either: the flox containerize command does this automatically.

Flox primitives like package groups (https://flox.dev/docs/concepts/package-groups#package-groups), priorities (https://flox.dev/docs/concepts/package-groups#3-resolving-file-conflicts-with-priority-and-groups), systems filters (https://flox.dev/docs/concepts/package-groups#cross-platform-split), and outputs (https://flox.dev/docs/tutorials/package-outputs#selecting-package-outputs) abstract common Nix patterns; they don’t map one-for-one to a single Nix primitive. For example, package groups abstract the work of getting packages that resolve against different historical nixpkgs (https://github.com/NixOS/nixpkgs) commits to coexist in the same environment. The packages defined in the manifest above get isolated into Flox package groups so that it’s easy to version and manage them: you can define specific versions without worrying about conflicts with other packages.

Compose a Training Stack

The PyTorch runtime is the foundation. One or more downstream environments can easily consume it. Teams might compose this environment with other environments designed for:

- CUDA development. Nix or Flox environments that define nvcc, cudart, cublas, cudnn, nccl, and other CUDA-specific dependencies. Essential on CUDA, skipped on macOS.
- Model training. An environment declaring Python packages / native libraries used to train models.
- Building + packaging. Linux gets gcc, macOS gets clang; all get cmake + other tools.
- CUDA profiling / performance. NVIDIA’s Nsight Systems and Nsight Compute; PyTorch profiler workflows; CPU and memory profilers; kernel-level performance analysis.
- Offline eval. Eval harnesses, metrics libraries, dataset clients, tokenizers, etc.
- Model packaging. Model export, conversion, quantization, artifact packaging, and metadata tools.
- Model serving. Platforms like llama.cpp, vLLM, NVIDIA Triton, SGLang, Ollama.

Each is a Nix or Flox environment that can use the base PyTorch runtime along with others as an input.

With Nix

The model-training flake consumes the foundational PyTorch runtime flake in inputs. It specifies build and cuda-dev flakes as extra inputs. Linux/NVIDIA CUDA pulls in CUDA dev packages and the GCC stack; users on macOS/Metal or MLX skip CUDA and pull in clang, along with other essential deps.

The flake below declares a model training Nix dev shell. It declares uv and activates a project-local Python virtual environment, pulling in dependencies that run against the PyTorch runtime.

With Flox

The model-training Flox environment composes the foundational PyTorch runtime with separate build and CUDA development environments declared using Flox includes. On Linux/NVIDIA CUDA, the included flox-labs/cuda-dev-essentials environment pulls in CUDA development dependencies, while the flox-labs/build-env environment provides GCC. macOS platforms using Metal or MLX skip CUDA and pull in clang from flox-labs/build-env, plus other platform-appropriate dependencies.

On both macOS and Linux, the manifest hook installs uv, defines cache-backed virtual environment and package-cache paths under $FLOX_ENV_CACHE, activates the project-local Python venv, and installs the model-training Python dependencies on top of the shared PyTorch runtime:

The include section is specific to Flox.
It fulfills a function similar to the Nix flake’s top-level inputs set in that it declares the external inputs (in this case, other Flox manifests) that the composed environment will consume as dependencies. The difference has to do with the unit of composition: Nix fetches and locks the flake inputs, then passes them to the outputs function. The flake author decides how to use those inputs to produce packages, dev shells, apps, containers, or other outputs. By contrast, the include section in the Flox manifest references other manifests, which Flox determines how to merge into the composing manifest. This eliminates the requirement to author and maintain wiring to compose these inputs, at the cost of slightly less control than a Nix flake.

The Nix flake expects to consume its flake inputs from GitHub; this Flox manifest composes remote FloxHub environments. It’s a minor difference, but probably worth calling out.

The end result

With both Nix and Flox, an ML/AI researcher using an M5 MacBook Pro gets the following stack:

While an ML/AI researcher working with CUDA locally or on a GPU cluster gets:

Researchers can prototype or train locally, emitting a checkpoint.pt and a runtime.json, then upload or copy them to an artifact registry or model store. The runtime.json travels with the model.

Batch Model Training

Slurm by itself doesn't address the core challenge of getting every node in a cluster to run against the same ML/AI runtime, with the same dependency graph and the same versions of CUDA, Python, and other finicky dependencies. In the wild, users rely on module load commands to assemble the right CUDA, Python, compilers, libraries, and other dependencies. But module definitions are notorious for drifting across login and compute nodes: the module load cuda/12.8 python/3.12 command can and often does resolve differently across a cluster.
Containerized ML workflows on Slurm often use HPC container runtimes like Singularity or Apptainer; these improve reproducibility, but must be configured for each cluster’s runtime setup, security policy, GPU/MPI settings, and Slurm conventions.

Alternatives like Conda have drawbacks too: Conda environments that bake in a large number of CUDA and Python dependencies can take a long time to resolve. Teams cannot copy a working Conda environment to a new prefix (i.e., path) and expect it to keep working. Unless every node sees the same shared environment path, teams typically need to add a separate packaging or distribution step. So modules, HPC container runtimes, and Conda all add operational layers on top of the ML/AI workload.

nanoGPT on Slurm

A composed Nix or Flox ML stack runs as-is on Slurm clusters, with the proviso that Nix or Flox is available on each node. You can run this stack from a single shared environment accessible cluster-wide (via NFS), or independently on each GPU node. No matter how you do it, Nix and Flox ensure each node gets the same packages and the same runtime environment, with the same env vars and secrets.

The following sub-sections show how this works with an example nanoGPT (https://github.com/karpathy/nanogpt) training job. Both the Nix and Flox environments consume two or more input environments (cross-platform PyTorch; Linux-only CUDA dev; cross-platform Python / general-purpose dev) to compose a single unified ML stack environment.

Allowing for the tutorial-specific config.sh wrapper (see below), Nix and Flox drop into the normal operating model for HPC systems. Submit jobs from the login node with sbatch. Slurm schedules the data prep, training, sampling, and eval jobs using standard sbatch --dependency job rules. But each job script runs its workload with Nix or Flox, so every node in the cluster gets the same pinned runtime.
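The config.sh wrapper mentioned above can be pictured as a small dispatcher. What follows is a minimal sketch under stated assumptions, not the repo's actual script: the run_in_env name and the ENV_MANAGER switch come from the tutorial, while FLOX_REMOTE_ENV, NIX_FLAKE, their default values, and the DRY_RUN switch are hypothetical names introduced here for illustration.

```shell
# Hypothetical config.sh-style dispatcher (POSIX sh).
ENV_MANAGER="${ENV_MANAGER:-flox}"                   # "flox" or "nix"
FLOX_REMOTE_ENV="${FLOX_REMOTE_ENV:-myuser/nanogpt}" # assumed FloxHub owner/name
NIX_FLAKE="${NIX_FLAKE:-.#default}"                  # assumed flake reference
DRY_RUN="${DRY_RUN:-0}"                              # 1 = print instead of run

run_in_env() {
  # Prefix the caller's command with the chosen environment manager.
  case "$ENV_MANAGER" in
    flox) set -- flox activate -r "$FLOX_REMOTE_ENV" -- "$@" ;;
    nix)  set -- nix develop "$NIX_FLAKE" --command "$@" ;;
    *)    echo "unknown ENV_MANAGER: $ENV_MANAGER" >&2; return 2 ;;
  esac
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"   # show the command that would run
  else
    "$@"                 # run the workload inside the pinned environment
  fi
}
```

A job script that sources this file could then call something like run_in_env python train.py; setting DRY_RUN=1 prints the resolved command, which is a cheap way to check the wiring on the login node before any GPU time is spent.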
Getting started

First clone the repo (https://github.com/flox/nanogpt-slurm), then change into the nanogpt-slurm directory and edit the config.sh script.

This defines a run_in_env helper that dispatches to Flox or Nix based on what’s set in ENV_MANAGER. It makes it possible for the same Slurm scripts to run with either Flox or Nix. In a real-world deployment, you wouldn’t need this; rather, you’d pick Nix or Flox and call it directly.

Note: Clone this repo into a filesystem that’s visible across both the login node and the Slurm compute nodes, such as a shared NFS or GPFS mount. The job scripts source config.sh at runtime, so config.sh and any training code referenced by the scripts must be available on the compute node when the job starts.

If your cluster does not provide a shared filesystem, stage the repo onto the compute node yourself, either by cloning it as part of the job script; copying it to node-local scratch; or using sbcast. Another viable pattern is to use the Nix or Flox environment to provide tools like git, Python, CUDA, and PyTorch, and then to create an activation hook that clones a pinned revision of the training repo into $SLURM_TMPDIR before running the training command.

In this repo, every job script sources config.sh and calls:

If you plan to use Nix

- Edit config.sh and switch to Nix mode:

  The flake reference can be any valid flake URL:

- Verify on the login node:

Note: Consider configuring Nix binary substitution before running Slurm jobs at scale. Without a binary cache or shared Nix store, each GPU node must build CUDA dependencies the first time it runs the job. For PyTorch and other ML stacks, this can take an extremely long time: up to several hours. Alternatively, if you use the Nix package manager, you can pull pre-built, pre-patched CUDA-accelerated packages from Flox’s binary cache.
Just add Flox as an extra substituter in nix.conf, like so:

If you plan to use Flox

- Edit config.sh to set your FloxHub username.
- Run flox push to publish the environment to FloxHub. This way the Slurm GPU node pulls it dynamically at runtime.
- Verify on the login node:

Compute nodes pull the environment by name at job start. No further setup is needed on each node.

Running Slurm Jobs with Nix and/or Flox

Once config.sh is configured for either Nix or Flox, submitting Slurm jobs looks the same. Each script in ./jobs/ is a standard Slurm batch script with #SBATCH headers for resources (like GPUs, CPUs, time limits). When a compute node runs the job, it does:

The first job set runs a Shakespeare smoke test to validate the setup. Paste this on the login node:

This trains a small ~10M parameter model and usually finishes in 10 minutes or fewer:

GPT-2 124M pipeline

GPT-2 training uses the same Nix and/or Flox dependencies, just with much longer-running jobs. This workflow requires us to make a decision about which dataset we want to use:

Train on GPU ≈10-20 hours / ≈100GB disk

Before submitting, check batch_size and gradient_accumulation_steps in jobs/gpt2-train.sh against your GPU's VRAM. The defaults target 32 GB (A100, RTX 5090):

The two values should satisfy batch_size × gradient_accumulation_steps ≈ 1024. (Some rows round slightly due to integer constraints.) A smaller batch_size means you need more gradient accumulation steps to compensate, so training takes longer but uses less VRAM.

Once you've tweaked those, submit the whole pipeline at once on the login node:
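As a minimal sketch of what chained submission with sbatch --dependency can look like (this is not the repo's actual command sequence), the helper below submits a list of batch scripts so that each one starts only after the previous one succeeds. The submit_chain function and the SBATCH override variable are hypothetical names introduced for illustration; --parsable and --dependency=afterok:<jobid> are standard sbatch options.

```shell
# Hypothetical helper: submit scripts as a linear afterok chain (POSIX sh).
SBATCH="${SBATCH:-sbatch}"   # overridable so the chain can be rehearsed offline

submit_chain() {
  prev=""
  for script in "$@"; do
    if [ -n "$prev" ]; then
      # Start this job only if the previous job exited successfully.
      out=$("$SBATCH" --parsable --dependency="afterok:$prev" "$script") || return 1
    else
      out=$("$SBATCH" --parsable "$script") || return 1
    fi
    prev=${out%%;*}   # --parsable prints "jobid" or "jobid;cluster"
    echo "submitted $script as job $prev"
  done
}
```

On the login node this might be invoked as submit_chain jobs/gpt2-prepare.sh jobs/gpt2-train.sh jobs/gpt2-sample.sh jobs/gpt2-eval.sh. Only jobs/gpt2-train.sh is named in the text; the other script names are placeholders. With afterok, a failure in data prep leaves the later jobs unsatisfied rather than burning GPU hours on a broken input.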
Promote models like software packages

Working with declared, graph-backed technologies like Nix and Flox gives teams a straightforward path from training → eval → CI → prod. They can reuse modular Nix or Flox environments as inputs for other environments, composing them to create ML/AI stacks.

After batch training, teams can make use of this same pattern to compose modular environments for evaluation, benchmarking, checkpoint packaging, and release gating. Similarly, platform teams can compose their own Nix or Flox environments for staging, serving, canaries, observability, and production rollout. Everybody starts with the Nix/Flox PyTorch and/or CUDA diagnostic environments and composes them with their own use-case-specific environments.

But because Nix and Flox are reproducible build systems, they make it straightforward to package, publish, and pull software, too. (Full disclosure: Flox inherits this virtuous behavior from Nix.) So when training needs to hand off to eval, it can use Nix or Flox to package the checkpoint, model code, tokenizer assets, metadata, and runtime inputs before publishing it to a binary cache or to its private Flox Catalog.

But why do this? Because with declared, graph-backed technologies, all of the transitive dependencies that software needs in order to run get packaged along with it. So instead of saying “Here’s checkpoint.pt; have fun tracking down the right Python, PyTorch, CUDA, tokenizer, and so on”, ML/AI researchers package and publish their model artifacts with the dependency graphs needed to use them.

Eval declares these packages in its own environments, reuses the PyTorch runtime, and gets everything it needs to score the model. CI pulls eval’s scored model package, runs its release gates, and publishes an approved artifact. MLOps registers that artifact, attaches release metadata, and promotes it.
Platform teams can declare the package as an input to a container build, or use Nix/Flox to generate OCI images. Declared technologies like Nix or Flox are not in any sense a panacea for ML work. But this pattern replaces the most error-prone part of the ML/AI lifecycle: the handoff. The result is a more regular promotion cycle: package the artifact, publish it, declare it downstream, and promote by reference.