Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes

NVIDIA released BioNeMo Recipes, a set of training recipes that use Low-Rank Adaptation (LoRA) to fine-tune large biological foundation models like ESM2-3B and Evo2-1B on a single workstation GPU. The recipes enable parameter-efficient fine-tuning for tasks such as protein secondary structure prediction and DNA splice-site classification, reducing compute and memory requirements by training only about 1% of model parameters.

Foundation models (https://www.nvidia.com/en-us/ai-data-science/foundation-models/) are reshaping computational biology. Pretrained on massive corpora of protein or genomic sequences, models such as ESM2 (a protein language model) and Evo 2 (a DNA language model) capture statistical regularities of biological sequences. These learned representations transfer well to a wide range of downstream tasks, including structure prediction, variant effect prediction, and functional annotation.

Yet adapting these models to a specific task is nontrivial: at billions of parameters, full fine-tuning quickly becomes impractical, both in compute and in storage of optimizer state and checkpoints. Low-Rank Adaptation (LoRA, https://arxiv.org/abs/2106.09685) directly addresses this challenge. By keeping the pretrained backbone frozen and training only a small set of low-rank adapter matrices, LoRA can match full fine-tuning quality on many tasks while training ~1% of the parameters, so a billion-parameter model and its adapter state fit on a single workstation GPU.

To reduce the difficulty of building these workflows, NVIDIA BioNeMo Recipes (https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes) provide step-by-step training recipes built on familiar PyTorch, Hugging Face, and Megatron-Bridge (https://github.com/NVIDIA-NeMo/Megatron-Bridge) patterns. Performance-oriented components such as NVIDIA Transformer Engine (TE, https://github.com/NVIDIA/TransformerEngine) and scale-out strategies are integrated where they pay off, but the recipes themselves stay readable.
This post walks through two case studies that show how the same parameter-efficient recipe applies across biological modalities on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/) GPU:

- ESM2-3B plus LoRA for protein secondary structure prediction (PSSP, https://www.sciencedirect.com/science/article/abs/pii/S1093326317304217?via%3Dihub)
- Evo2-1B plus LoRA for DNA splice-site classification

All the source code needed to customize or reproduce these results is available in NVIDIA BioNeMo Recipes (https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes).

How LoRA enables fine-tuning at scale

Before diving into the case studies, a quick refresher on the method. Full fine-tuning is resource-heavy because it requires storing and updating all model parameters and their optimizer states, which quickly becomes impractical as models scale. LoRA is a practical method to fine-tune large pretrained transformers without updating all model parameters or storing optimizer state for them.

The core idea behind LoRA is that instead of updating a dense weight matrix \(W\), LoRA adds a trainable low-rank update \(\Delta W = BA\) in parallel and keeps \(W\) frozen. Here \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), with rank \(r \ll \min(d, k)\). This dramatically reduces the number of trainable parameters and the optimizer memory footprint.

LoRA is parameterized by a small set of hyperparameters that trade off capacity, stability, and cost. The rank \(r\) controls the size of the added low-rank matrices and therefore the number of trainable parameters. The target modules specify which layers receive adapters, with common choices such as attention and MLP projections. For small datasets, LoRA dropout can be enabled as an additional form of regularization.
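As a rough illustration of this refresher, the following pure-Python sketch applies a LoRA-style update to a tiny linear layer and counts parameters. It assumes the conventions of the LoRA paper, which this post does not spell out: the adapted output is \(h = Wx + \frac{\alpha}{r} BAx\), \(A\) starts small and random, and \(B\) starts at zero. The function names and the 2560 x 2560 projection size are hypothetical, chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x). W is frozen; A and B are trainable."""
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, r):
    """Return (parameters in the dense W, trainable parameters in A and B)."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r, zero init
x = [1.0, -2.0, 0.5, 3.0]

# With B = 0 the adapter contributes nothing, so the output matches the frozen layer.
print(lora_forward(W, A, B, x, alpha) == matvec(W, x))

# Parameter accounting for a hypothetical 2560 x 2560 projection at rank 8.
dense, adapter = param_counts(2560, 2560, 8)
print(f"dense={dense:,} adapter={adapter:,}")
```

Because \(B\) is zero at initialization, training starts from the pretrained model's behavior, and only the \(r(d_{in} + d_{out})\) adapter entries per adapted layer need gradients and optimizer state.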
While the following two case studies differ in modality (protein versus DNA), task type (token classification versus sequence classification), and underlying architecture (transformer versus striped Hyena), both use the same LoRA recipe pattern.

ESM2-3B for protein secondary structure prediction

PSSP (https://www.sciencedirect.com/science/article/abs/pii/S1093326317304217?via%3Dihub) is the task of assigning a structural label to each amino acid in a protein sequence. Secondary structure labels describe local backbone conformations, such as helices and strands, without requiring a full 3D structure prediction. For many proteins, these local patterns correlate with functional motifs and global fold organization.

PSSP is a core building block for many downstream applications in biology. Because local structure is strongly correlated with protein function, PSSP can provide useful functional context. In addition, these predictions can inform tertiary structure prediction, solvent accessibility prediction, protein-protein interaction prediction, and structural class- or domain-related prediction.

At a modeling level, PSSP is a token classification problem: the input is an amino-acid sequence, and the output is a structural label for each residue. There are two common evaluation variants, differing only in the label space:

- Q3 (3-state): H (helix), E (strand/sheet), C (coil/loop)
- Q8 (8-state): H (α-helix), B (β-bridge), E (β-strand), G (3₁₀-helix), I (π-helix), T (turn), S (bend), C (coil/other)

ESM2-3B is a 3-billion-parameter protein language model, so full fine-tuning typically requires substantial compute and memory. LoRA makes adaptation practical by training only a small number of additional parameters, while still achieving strong performance on PSSP.
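To make the per-residue framing concrete, here is a small sketch of how Q8 and Q3 accuracy could be computed from predicted and reference label strings. The Q8-to-Q3 mapping below is the commonly used DSSP-style reduction and is an assumption; the Porter 6 splits used in this post may define the reduction differently. The example strings and function names are made up for illustration.

```python
# Assumed DSSP-style reduction from 8 states to 3 (not taken from this post).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "E": "E", "B": "E",             # strand-like states
    "T": "C", "S": "C", "C": "C",   # everything else is treated as coil
}


def q8_to_q3(labels):
    """Map a string of Q8 labels to the corresponding string of Q3 labels."""
    return "".join(Q8_TO_Q3[state] for state in labels)


def per_residue_accuracy(predicted, reference):
    """Fraction of residues whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == t for p, t in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical 10-residue example: one label per amino acid.
reference_q8 = "CHHHGGTEEB"
predicted_q8 = "CHHHHHSEEE"

q8_accuracy = per_residue_accuracy(predicted_q8, reference_q8)
q3_accuracy = per_residue_accuracy(q8_to_q3(predicted_q8), q8_to_q3(reference_q8))
print(f"Q8 accuracy: {q8_accuracy:.2f}, Q3 accuracy: {q3_accuracy:.2f}")
```

In this toy example every Q8 mistake confuses two states that collapse to the same Q3 class, which is one reason Q3 accuracy is typically higher than Q8 accuracy for the same model.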
ESM2 plus PEFT in BioNeMo Recipes (TE-accelerated)

We fine-tuned ESM2-3B for PSSP by adding a lightweight per-residue classification head for Q3/Q8 labels and training LoRA adapters through the PEFT (https://github.com/huggingface/peft) library, while keeping the pretrained backbone weights frozen. For data, we used the curated splits released by the authors of the Porter 6 (https://doi.org/10.3390/ijms26010130) model and report results on their provided test set. To maximize throughput, we enabled TE and sequence packing and ran the full training workflow on one NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU in under one hour.

The following snippet, adapted from the BioNeMo Recipes ESM2 plus PEFT example (https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/esm2_peft_te), shows how to load the TE-compatible ESM2 model and attach LoRA adapters to the fused query/key/value (QKV) projections:

```python
import peft
import torch
from transformers import AutoConfig, AutoModelForTokenClassification

# Load config and token-classification model
# (use a local checkpoint path or HF model ID, e.g. nvidia/esm2_t36_3B_UR50D).
config = AutoConfig.from_pretrained("nvidia/esm2_t36_3B_UR50D", trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    "nvidia/esm2_t36_3B_UR50D",
    config=config,
    trust_remote_code=True,
    dtype="bfloat16",
)

# LoRA adapters on the fused QKV projection; the backbone stays frozen.
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.TOKEN_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    target_modules=["layernorm_qkv"],
    bias="none",
)

peft_model = peft.get_peft_model(model, peft_config)
peft_model.to("cuda", dtype=torch.bfloat16)
```

You can then plug this PEFT model into your training loop. The full recipe includes the dataloader, loss, and optimizer setup.

Table 1 summarizes Q3/Q8 test accuracy for the ESM2-3B plus LoRA model alongside strong published baselines reported in the Porter 6 paper. For ESM2-3B, the table reports the mean score over the top five validation checkpoints.
| Model | Q3 accuracy (%) | Q8 accuracy (%) |
| --- | --- | --- |
| ESM2-3B plus LoRA (top five validation mean) | 84.80 | 74.30 |
| Porter 6 | 84.56 | 74.18 |
| NetSurfP-3.0 | 82.92 | 71.84 |
| SPOT-1D-LM | 84.30 | 74.09 |

Table 1. Q3 and Q8 protein secondary structure prediction accuracy comparing ESM2-3B plus LoRA with published baseline models on the Porter 6 benchmark

On both metrics, the ESM2-3B plus LoRA mean is slightly above the best published baseline in the table (Porter 6), so LoRA fine-tuning reaches accuracy that is competitive with other state-of-the-art PSSP approaches. Figure 2 shows validation loss and accuracy versus fine-tuning steps.

How can sequence packing yield higher utilization and throughput?

Protein datasets typically contain sequences of varying lengths. If they are batched naively (the padded BSHD format), they are padded to the maximum length in the batch and a large fraction of tokens become padding. This wastes compute and memory bandwidth inside attention and MLP layers.

Sequence packing (the packed/flattened THD format, https://developer.nvidia.com/blog/scale-biology-transformer-models-with-pytorch-and-nvidia-bionemo-recipes/) reduces that waste by concatenating only the nonpadding tokens and tracking per-sequence boundaries with cumulative-length metadata. As a result, attention and MLP kernels operate on real tokens rather than padded tokens. For a deeper explanation of how packing works in practice and how it interacts with TE (packed formats), see Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes (https://developer.nvidia.com/blog/scale-biology-transformer-models-with-pytorch-and-nvidia-bionemo-recipes/).

Figure 3 shows throughput (tokens/sec) when fine-tuning with THD versus BSHD. In this setup, switching from BSHD to THD improved tokens/sec by ~5.5x, largely by removing padding overhead. The achieved speedup depends on the sequence length distribution, microbatch size, and GPU.

Beyond throughput, THD packing improves memory efficiency.
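The bookkeeping behind the two layouts can be sketched in plain Python. This is only an illustration of the idea, not the TE or BioNeMo implementation: `pack_sequences`, `unpack_sequence`, `padding_fraction`, and the toy token IDs are hypothetical, and `cu_seqlens` is used here simply as a name for the cumulative-length metadata.

```python
from itertools import accumulate


def padding_fraction(lengths):
    """Fraction of positions that are padding when a batch is padded to its longest sequence."""
    padded_positions = len(lengths) * max(lengths)
    return 1 - sum(lengths) / padded_positions


def pack_sequences(sequences):
    """Concatenate sequences into one flat token list plus cumulative sequence boundaries."""
    flat_tokens = [token for seq in sequences for token in seq]
    cu_seqlens = [0, *accumulate(len(seq) for seq in sequences)]
    return flat_tokens, cu_seqlens


def unpack_sequence(flat_tokens, cu_seqlens, index):
    """Recover one original sequence from the packed layout using its boundaries."""
    return flat_tokens[cu_seqlens[index]:cu_seqlens[index + 1]]


# Toy batch of three tokenized sequences with lengths 3, 1, and 5.
batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13]]
lengths = [len(seq) for seq in batch]

flat_tokens, cu_seqlens = pack_sequences(batch)
print("padded positions:", len(batch) * max(lengths))
print("packed tokens:", len(flat_tokens))
print("sequence boundaries:", cu_seqlens)
print(f"padding fraction if padded: {padding_fraction(lengths):.0%}")
```

The more skewed the length distribution, the larger the padding fraction in the padded layout, which is why the speedup from packing varies from dataset to dataset.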
Packing reduces the amount of activation and attention work spent on padding tokens, so a larger fraction of the GPU memory traffic and compute goes toward useful non-padding tokens. For identical input sequences and batch size, THD typically uses less memory than BSHD because it avoids materializing padded tokens. In practice, that saved headroom is used to increase the number of real tokens processed per step.

Evo2-1B for DNA splice-site classification

Evo 2 (https://github.com/ArcInstitute/evo2) is a generative DNA foundation model trained on genomic sequences spanning all domains of life. Architecturally, it is built on striped Hyena blocks: a mix of state-space-style long-convolution operators and a smaller number of attention layers. This allows it to process long DNA contexts efficiently. Just as ESM2 learns protein "grammar" from amino-acid sequences, Evo 2 learns genomic regularities directly from nucleotide sequences, which transfer to a variety of downstream tasks: variant effect prediction, regulatory element classification, and, of interest here, splice-site identification.

What is splice-site classification?

Splicing is the cellular process that removes introns from pre-mRNA and joins exons together. The boundaries are defined by two short sequence motifs: donor sites (intron starts, typically GT at the 5′ end of the intron) and acceptor sites (intron ends, typically AG at the 3′ end). Identifying these sites from raw DNA is more difficult than just matching the dinucleotide motif. The same GT/AG patterns appear throughout the genome, and only a small fraction are functional splice sites. Useful predictors have to learn longer-range context around the candidate position.

We used the splice_sites_all task from the Nucleotide Transformer downstream-tasks (https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised) dataset.
Each example is a fixed-length 600 bp DNA window, and the label is one of three classes describing the central position: no-splice, acceptor, or donor. The benchmark ships ~30K training and ~3K test examples and is roughly class-balanced. At a modeling level, this is a sequence classification problem: a single label per input sequence, in contrast to the per-token labels in PSSP.

Evo2 plus LoRA in BioNeMo Recipes

We fine-tuned Evo2-1B for splice-site classification by subclassing the Megatron Hyena model to add a small sequence-classification head on top of mean-pooled hidden states. LoRA adapters were then trained on the backbone attention, MLP, and Hyena-mixer projections. The pretrained backbone weights were kept frozen; only the LoRA adapters and the classification head were trained.

To put the LoRA contribution in context, we trained two configurations on the same data and compared them:

- Head-only baseline: Backbone frozen; no adapters, only the classification head is trainable. Total trainable parameters: ~3.7 million (0.33% of the model)
- LoRA plus head: Backbone frozen; LoRA adapters on the listed target modules, classification head trainable. Total trainable parameters: ~16.0 million (1.42% of the model)

Table 2 shows the test accuracy on the held-out ~3K examples.

| Mode | Trainable parameters | Trainable fraction | Test accuracy |
| --- | --- | --- | --- |
| Head only | 3,697,923 | 0.33% | 52.3% |
| LoRA plus head | 15,985,923 | 1.42% | 96.6% |

Table 2. Comparison of trainable parameter counts and splice-site classification accuracy for head-only versus LoRA-adapted Evo2-1B models

The gap is large: with only ~1.4% of the parameters trainable, LoRA recovers nearly all of the signal that the pretrained Evo2 backbone holds about splicing, whereas training a head on mean-pooled features from the frozen backbone alone is far from sufficient. Most of the residual error from the LoRA model is in the donor↔acceptor direction.
This is expected because both motifs share the GT/AG dinucleotide structure and require longer-range context to disambiguate. The full workflow runs end-to-end on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU in about one hour.

The following snippet mirrors the ESM2 example stylistically: it loads the Evo2 backbone, attaches a classification head through a HyenaModel subclass, and configures LoRA adapters on the attention, MLP, and Hyena-mixer projections.

```python
from bionemo.evo2.models.evo2_lora import Evo2LoRA
from evo2_classifier import (
    Hyena1bClassifierProvider,
    HyenaForSequenceClassification,
)

# Backbone provider: a HyenaModel subclass with a small classification head
# (LayerNorm → Linear → GELU → Dropout → Linear) on top of mean-pooled hidden states.
model_provider = Hyena1bClassifierProvider(
    num_classes=3,  # no-splice / acceptor / donor
    classifier_dropout=0.1,
    pool="mean",
)

# LoRA adapters on attention (linear_qkv, linear_proj), MLP (linear_fc1, linear_fc2),
# and the Hyena mixer (dense_projection, dense). The classification head is kept
# trainable via the skip_freeze_modules pattern; everything else is frozen.
peft = Evo2LoRA(
    target_modules=[
        "linear_qkv",
        "linear_proj",
        "linear_fc1",
        "linear_fc2",
        "dense_projection",
        "dense",
    ],
    dim=16,
    alpha=32,
    dropout=0.1,
    skip_freeze_modules=["*classification_head*"],
)
```

The Megatron-Bridge pretrain entry point handles the distributed training, optimizer, scheduler, checkpointing, dataloading, and logging.
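As a sanity check on the classification head described in the snippet above, the arithmetic below counts its parameters under some assumed dimensions. The hidden size of 1920 and an inner Linear layer as wide as the hidden size are assumptions, not values stated in this post; with LayerNorm and Linear layers that each carry a bias, they happen to reproduce the head-only count reported in Table 2. The helper names are hypothetical.

```python
def layernorm_params(width):
    """LayerNorm with a learnable scale and bias per feature."""
    return 2 * width


def linear_params(d_in, d_out):
    """Linear layer with a weight matrix and a bias vector."""
    return d_in * d_out + d_out


HIDDEN_SIZE = 1920   # assumed backbone hidden size (not stated in this post)
NUM_CLASSES = 3      # no-splice / acceptor / donor

# LayerNorm -> Linear -> GELU -> Dropout -> Linear; GELU and Dropout have no parameters.
head_params = (
    layernorm_params(HIDDEN_SIZE)
    + linear_params(HIDDEN_SIZE, HIDDEN_SIZE)
    + linear_params(HIDDEN_SIZE, NUM_CLASSES)
)

# Trainable parameter counts as reported in Table 2.
HEAD_ONLY_TRAINABLE = 3_697_923
LORA_PLUS_HEAD_TRAINABLE = 15_985_923

# Whatever is trainable beyond the head must come from the LoRA adapters.
lora_adapter_params = LORA_PLUS_HEAD_TRAINABLE - HEAD_ONLY_TRAINABLE

print("head parameters under the assumed sizes:", f"{head_params:,}")
print("matches head-only row of Table 2:", head_params == HEAD_ONLY_TRAINABLE)
print("parameters attributable to LoRA adapters:", f"{lora_adapter_params:,}")
```

Under these assumptions the head accounts for all of the head-only trainable parameters, and the LoRA adapters contribute the remaining difference between the two rows of Table 2.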
To launch a fine-tuning run end-to-end, the recipe exposes a CLI:

```bash
torchrun --nproc_per_node=1 evo2_classifier.py \
  --train-jsonl splice_train.jsonl \
  --val-jsonl splice_val.jsonl \
  --test-jsonl splice_test.jsonl \
  --base-ckpt-dir evo2_1b_bf16_mbridge \
  --result-dir splice_run \
  --experiment-name lora_finetune \
  --num-classes 3 \
  --seq-length-tokens 600 \
  --train-iters 1000 \
  --global-batch-size 32 --micro-batch-size 32 \
  --lr 5e-4 --min-lr 5e-5 --warmup-iters 30 \
  --lora-finetune --lora-dim 16 --lora-alpha 32 --lora-dropout 0.1
```

Swap --lora-finetune and increase the batch size to reproduce the head-only baseline. Data, optimizer, scheduler, and evaluation stay the same. For the complete training loop, dataset code, parameter accounting, and evaluation utilities, see the Evo2 LoRA fine-tuning notebook (https://github.com/NVIDIA-BioNeMo/bionemo-framework/tree/main/bionemo-recipes/recipes/evo2_megatron/examples).

Get started fine-tuning biological foundation models

Across two very different biological modalities (proteins with ESM2 and DNA with Evo2), the same parameter-efficient recipe can be used. You can freeze the pretrained backbone, train a small LoRA adapter plus a task-specific head, and recover accuracy that is competitive with full fine-tuning or specialized models, on a single workstation GPU.

For ESM2-3B, LoRA brings PSSP performance into the same range as strong published baselines like Porter 6 and SPOT-1D-LM, while TE and THD sequence packing make training on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/) GPU practical. For Evo2-1B, the same approach lifts splice-site classification from a frozen-backbone baseline of ~52% to ~97% test accuracy while training only ~1.4% of the parameters.
Billion-parameter biological foundation models are now adaptable on modest hardware, provided that the surrounding training stack (TE, Megatron-Bridge, packed sequences, PEFT) is well integrated. NVIDIA BioNeMo Recipes (https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes) are designed to make that integration the default, not the exception.

To get started fine-tuning biological foundation models with LoRA, TE, and scalable PyTorch workflows, check out the NVIDIA BioNeMo Recipes documentation (https://docs.nvidia.com/bionemo-framework/latest/main/recipes/index.html).