Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes


Foundation models are reshaping computational biology. Pretrained on massive corpora of protein or genomic sequences, models such as ESM2 (a protein language model) and Evo 2 (a DNA language model) capture statistical regularities of biological sequences. These regularities transfer well to a wide range of downstream tasks, including structure prediction, variant effect prediction, and functional annotation.

Yet adapting these models to a specific task is nontrivial: at billions of parameters, full fine-tuning quickly becomes impractical, both in compute and storage of optimizer state and checkpoints.

Low-Rank Adaptation (LoRA) directly addresses this challenge. By keeping the pretrained backbone frozen and training only a small set of low-rank adapter matrices, LoRA can match full fine-tuning quality on many tasks while training ~1% of the parameters. As a result, a billion-parameter-scale model and its adapter state can fit on a single workstation GPU.

To reduce the difficulty of building these workflows, NVIDIA BioNeMo Recipes provide step-by-step training recipes built on familiar PyTorch, Hugging Face, and Megatron-Bridge patterns. Performance-oriented components such as NVIDIA Transformer Engine (TE) and scale-out strategies are integrated where they pay off, but the recipes themselves stay readable.

This post walks through two case studies that show how the same parameter-efficient recipe applies across biological modalities on a single NVIDIA RTX 6000 Blackwell Workstation Edition GPU:

- ESM2-3B plus LoRA for protein secondary structure prediction (PSSP)
- Evo2-1B plus LoRA for DNA splice-site classification

All the source code needed to customize or reproduce these results is available in NVIDIA BioNeMo Recipes.

How LoRA enables fine-tuning at scale

Before diving into the case studies, a quick refresher on the method. Full fine-tuning is resource-heavy because it requires storing and updating all model parameters and their optimizer states, which quickly becomes impractical as models scale.

LoRA is a practical method to fine-tune large pretrained transformers without updating, or storing optimizer state for, all model parameters. The core idea is that instead of updating a dense weight matrix W, LoRA keeps W frozen and adds a trainable low-rank update ΔW = BA in parallel, where B and A are two small matrices whose shared inner dimension is the rank r. This dramatically reduces the number of trainable parameters and the optimizer/memory footprint.
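Written out, the adapted layer looks as follows. This is a sketch in the notation of the standard LoRA formulation, not something taken from the recipes: the scaling factor α/r and the dimension names d_in and d_out are assumptions about the exact variant in use, and the constant scale is simply absorbed when the update is written as BA.

```latex
h = W x + \frac{\alpha}{r}\, B A\, x,
\qquad
B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
r \ll \min(d_{\text{in}}, d_{\text{out}})
```

Only A and B receive gradients, so the optimizer tracks r(d_in + d_out) values per adapted matrix instead of d_in · d_out.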

LoRA is parameterized by a small set of hyperparameters that trade off capacity, stability, and cost. The rank (r) controls the size of the added low-rank matrices and therefore the number of trainable parameters. The target modules specify which layers receive adapters, with common choices such as attention and MLP projections. For small datasets, LoRA dropout can be enabled as an additional form of regularization.
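To make the effect of the rank concrete, the following minimal sketch counts the parameters that one adapter adds to a single dense projection. The projection shape used here is hypothetical and chosen only for illustration; it is not read from either model's configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight W (bias ignored)."""
    return d_in * d_out


# Hypothetical projection shape, for illustration only.
d_in, d_out = 2560, 7680

for rank in (4, 8, 16, 32):
    added = lora_param_count(d_in, d_out, rank)
    frozen = dense_param_count(d_in, d_out)
    share = 100 * added / frozen
    print(f"r={rank:>2}: {added:>9,} trainable vs {frozen:,} frozen ({share:.2f}%)")
```

The adapter size grows linearly with the rank, so doubling r doubles both the trainable parameters and the optimizer state associated with them.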

While the following two case studies differ in modality (protein versus DNA), task type (token classification versus sequence classification), and underlying architecture (transformer versus striped Hyena), both use the same LoRA recipe pattern.

ESM2-3B for protein secondary structure prediction

PSSP is the task of assigning a structural label to each amino acid in a protein sequence. Secondary structure labels describe local backbone conformations—helices and strands—without requiring a full 3D structure prediction. For many proteins, these local patterns correlate with functional motifs and global fold organization.

PSSP is a core building block for many downstream applications in biology. Because local structure is strongly correlated with protein function, PSSP can provide useful functional context. In addition, these predictions can inform tertiary structure prediction, solvent accessibility prediction, protein-protein interaction prediction, and structural class- or domain-related prediction.

At a modeling level, PSSP is a token classification problem: the input is an amino-acid sequence, and the output is a structural label for each residue.

There are two common evaluation variants, differing only in the label space:

- Q3 (3-state): H (helix), E (strand/sheet), C (coil/loop)
- Q8 (8-state): H (α-helix), B (β-bridge), E (β-strand), G (3₁₀-helix), I (π-helix), T (turn), S (bend), C (coil/other)
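The relationship between the two label spaces, and the per-residue accuracy that both metrics report, can be sketched as follows. The 8-to-3 mapping below is one commonly used reduction and is an assumption here; the benchmark used in this post may apply a different convention. The label strings are toy data.

```python
# One common reduction from 8 states to 3; other conventions exist.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "C": "C",   # turns, bends, coil
}


def q8_to_q3(labels: str) -> str:
    """Collapse a string of Q8 labels into Q3 labels, one character per residue."""
    return "".join(Q8_TO_Q3[state] for state in labels)


def per_residue_accuracy(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy labels for a 14-residue protein, not taken from any dataset.
reference_q8 = "CHHHHGGTEEEBSC"
predicted_q8 = "CHHHHHHTEEEESC"

q8_acc = per_residue_accuracy(predicted_q8, reference_q8)
q3_acc = per_residue_accuracy(q8_to_q3(predicted_q8), q8_to_q3(reference_q8))
print(f"Q8 accuracy: {q8_acc:.3f}, Q3 accuracy: {q3_acc:.3f}")
```

In this toy case the three Q8 errors confuse states that fall into the same Q3 class, which is one reason Q3 accuracy is typically higher than Q8 accuracy for the same model.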

ESM2-3B is a 3-billion-parameter protein language model, so full fine-tuning typically requires substantial compute and memory. LoRA makes adaptation practical by training only a small number of additional parameters, while still achieving strong performance on PSSP.

ESM2 plus PEFT in BioNeMo Recipes (TE-accelerated)

The team fine-tuned ESM2-3B for PSSP by adding a lightweight per-residue classification head (for Q3/Q8 labels) and training LoRA adapters through the PEFT library, while keeping the pretrained backbone weights frozen. For data, we used the curated splits released by the authors of the Porter 6 model and reported results on their provided test set. To maximize throughput, we enabled TE and sequence packing and ran the full training workflow on one NVIDIA RTX 6000 Blackwell Workstation Edition GPU in under one hour.

The following snippet adapted from the BioNeMo Recipes ESM2 plus PEFT example shows how to load the TE-compatible ESM2 model and attach LoRA adapters to the fused query/key/value (QKV) projections:

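import peft
import torch
from transformers import AutoConfig, AutoModelForTokenClassification

# Load the TE-compatible ESM2-3B checkpoint with a token-classification head.
# trust_remote_code=True runs the model code shipped with the checkpoint.
config = AutoConfig.from_pretrained("nvidia/esm2_t36_3B_UR50D", trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    "nvidia/esm2_t36_3B_UR50D", config=config, trust_remote_code=True, dtype="bfloat16"
)

# LoRA settings: rank-8 adapters on the fused QKV projection only.
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.TOKEN_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    target_modules=["layernorm_qkv"],
    bias="none",
)

# Wrap the model: backbone weights are frozen and the adapters become trainable.
peft_model = peft.get_peft_model(model, peft_config)
peft_model.to("cuda", dtype=torch.bfloat16)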

You can then plug this PEFT model into your training loop. The full recipe includes the data, loss, and optimizer setup.

Table 1 summarizes Q3/Q8 test accuracy for the ESM2-3B plus LoRA model alongside strong published baselines reported in the Porter 6 paper. For ESM2-3B, the table reports the mean score over the top five validation checkpoints.

| Model | Q3 accuracy (%) | Q8 accuracy (%) |
| --- | --- | --- |
| ESM2-3B plus LoRA (top five validation mean) | 84.80 | 74.30 |
| Porter 6 | 84.56 | 74.18 |
| NetSurfP-3.0 | 82.92 | 71.84 |
| SPOT-1D-LM | 84.30 | 74.09 |

Table 1. Q3 and Q8 protein secondary structure prediction accuracy comparing ESM2-3B plus LoRA with published baseline models on the Porter 6 benchmark
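One plausible reading of "mean over the top five validation checkpoints" is to rank checkpoints by validation accuracy, keep the best five, and average their test accuracy. The sketch below implements that reading. The selection rule, the helper name `top_k_mean`, and the record fields are all assumptions, and the numbers are toy values rather than results from the recipe.

```python
from statistics import mean


def top_k_mean(records, k=5, select_by="val_acc", report="test_acc"):
    """Average `report` over the k records with the highest `select_by` value."""
    if len(records) < k:
        raise ValueError(f"need at least {k} checkpoints, got {len(records)}")
    best = sorted(records, key=lambda rec: rec[select_by], reverse=True)[:k]
    return mean(rec[report] for rec in best)


# Toy checkpoint metrics, for illustration only.
checkpoints = [
    {"step": 100, "val_acc": 0.70, "test_acc": 0.69},
    {"step": 200, "val_acc": 0.75, "test_acc": 0.74},
    {"step": 300, "val_acc": 0.80, "test_acc": 0.79},
    {"step": 400, "val_acc": 0.82, "test_acc": 0.81},
    {"step": 500, "val_acc": 0.81, "test_acc": 0.80},
    {"step": 600, "val_acc": 0.83, "test_acc": 0.82},
    {"step": 700, "val_acc": 0.79, "test_acc": 0.78},
]

print(f"top-5 mean test accuracy: {top_k_mean(checkpoints):.4f}")
```

Selecting on validation metrics and reporting on the test set keeps the test set out of checkpoint selection, and averaging several checkpoints reduces the noise of any single one.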

Overall, LoRA fine-tuning reaches accuracy that is competitive with other state-of-the-art PSSP approaches. Figure 2 shows validation loss and accuracy versus fine-tuning steps.

How can sequence packing yield higher utilization and throughput?

Protein datasets typically contain sequences of varying lengths. If they are batched naively (the padded BSHD format), they are padded to the maximum length in the batch and a large fraction of tokens become padding. This wastes compute and memory bandwidth inside attention and MLP layers.

Sequence packing (the packed/flattened THD format) reduces that waste by concatenating only the nonpadding tokens and tracking per-sequence boundaries with cumulative-length metadata. As a result, attention/MLP kernels operate on real tokens rather than padded tokens. For a deeper explanation of how packing works in practice (and how it interacts with TE packed formats), see Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes. Figure 3 shows throughput (tokens/sec) when fine-tuning with THD versus BSHD.

In this setup, switching from BSHD to THD improved tokens/sec by ~5.5x, mostly by removing padding overhead. The speedup you achieve depends on the sequence length distribution, microbatch size, and GPU.

Beyond throughput, THD packing improves memory efficiency. It reduces the amount of activation or attention work spent on padding tokens, so a larger fraction of the GPU memory traffic and compute goes toward useful (non-padding) tokens.

For identical input sequences and batch size, THD typically uses less memory than BSHD because it avoids materializing padded tokens. In practice, that saved headroom is used to increase the number of real tokens processed per step.
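The padding arithmetic behind these gains can be sketched in a few lines. The example below compares how many token slots a padded batch occupies with how many real tokens a packed batch holds, and builds the cumulative-length boundaries that a packed layout needs. The variable name `cu_seqlens` is borrowed from common variable-length attention APIs and is an assumption here, as are the toy sequence lengths.

```python
from itertools import accumulate


def padded_token_count(lengths):
    """Token slots in a padded batch: every row is padded to the longest sequence."""
    return len(lengths) * max(lengths)


def packed_token_count(lengths):
    """Token slots in a packed batch: only real tokens are stored."""
    return sum(lengths)


def cumulative_seq_lengths(lengths):
    """Sequence boundaries inside the packed buffer: [0, l0, l0 + l1, ...]."""
    return [0, *accumulate(lengths)]


# Toy protein lengths, for illustration only.
lengths = [64, 128, 300, 1000]

padded = padded_token_count(lengths)
packed = packed_token_count(lengths)
cu_seqlens = cumulative_seq_lengths(lengths)

print(f"padded slots: {padded}, real tokens: {packed}")
print(f"padding fraction: {1 - packed / padded:.1%}")
print(f"cu_seqlens: {cu_seqlens}")
```

Sequence i occupies positions `cu_seqlens[i]` up to (but not including) `cu_seqlens[i + 1]` in the packed buffer, which is what lets attention stay within each sequence without a padded mask. The more skewed the length distribution, the larger the padding fraction and the more packing helps.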

Evo2-1B for DNA splice-site classification

Evo 2 is a generative DNA foundation model trained on genomic sequences spanning all domains of life. Architecturally it is built on striped Hyena blocks—a mix of state-space-style long-convolution operators and a smaller number of attention layers. This allows it to process long DNA contexts efficiently. Just as ESM2 learns protein “grammar” from amino-acid sequences, Evo2 learns genomic regularities directly from nucleotide sequences, which transfer to a variety of downstream tasks: variant effect prediction, regulatory element classification, and (of interest here) splice-site identification.

What is splice-site classification?

Splicing is the cellular process that removes introns from pre-mRNA and joins exons together. The boundaries are defined by two short sequence motifs: donor sites (intron starts, typically GT) at the 5′ end of the intron and acceptor sites (intron ends, typically AG) at the 3′ end.

Identifying these sites from raw DNA is more difficult than just matching the dinucleotide motif. The same GT/AG patterns appear throughout the genome and only a small fraction are functional splice sites. Useful predictors have to learn longer-range context around the candidate position.
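A quick way to see why motif matching alone is not enough is to count how often the dinucleotides occur in an arbitrary sequence. The sketch below scans a toy sequence for GT and AG; the helper name `find_dinucleotide` and the sequence are illustrative and not taken from the dataset.

```python
def find_dinucleotide(sequence: str, motif: str) -> list[int]:
    """Return every 0-based start index where the two-letter motif occurs."""
    sequence = sequence.upper()
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == motif]


# Toy sequence, for illustration only.
dna = "ACGTAGGTCCAGTTAGGTAAGT"

donor_like = find_dinucleotide(dna, "GT")
acceptor_like = find_dinucleotide(dna, "AG")

print(f"GT candidates at {donor_like}")
print(f"AG candidates at {acceptor_like}")
```

Under a uniform random model, a given dinucleotide appears at roughly 1 in 16 positions, so a window of a few hundred bases contains dozens of GT and AG matches. A classifier therefore has to rely on the surrounding context rather than on the motif itself.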

We used the splice_sites_all task from the Nucleotide Transformer downstream-tasks dataset. Each example is a fixed-length 600 bp DNA window, and the label is one of three classes describing the central position—no-splice, acceptor, or donor. The benchmark ships ~30K training / ~3K test examples and is roughly class-balanced.

At a modeling level, this is a sequence classification problem: a single label per input sequence, in contrast to the per-token labels in PSSP.

Evo2 plus LoRA in BioNeMo Recipes

The team fine-tuned Evo2-1B for splice-site classification by subclassing the Megatron Hyena model to add a small sequence-classification head on top of mean-pooled hidden states. LoRA adapters were then trained on the backbone attention, MLP, and Hyena-mixer projections. The pretrained backbone weights were kept frozen; only the LoRA adapters and the classification head were trained.
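The pooling-plus-head arithmetic can be illustrated without any framework. The sketch below averages per-position hidden vectors into one vector and applies a linear layer to get one logit per class. It is a simplified stand-in, not the recipe's implementation: the function names, shapes, and values are hypothetical, and details such as dropout are omitted.

```python
def mean_pool(hidden_states):
    """Average (seq_len x hidden) per-position vectors into one vector of size hidden."""
    seq_len = len(hidden_states)
    if seq_len == 0:
        raise ValueError("cannot pool an empty sequence")
    hidden = len(hidden_states[0])
    return [sum(row[j] for row in hidden_states) / seq_len for j in range(hidden)]


def linear_head(pooled, weight, bias):
    """Compute logits[c] = dot(weight[c], pooled) + bias[c] for every class c."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]


# Toy values: 3 positions, hidden size 2, 3 classes.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]

pooled = mean_pool(hidden_states)
logits = linear_head(pooled, weight, bias)
predicted_class = max(range(len(logits)), key=logits.__getitem__)

print(f"pooled: {pooled}, logits: {logits}, predicted class: {predicted_class}")
```

Because pooling collapses the sequence dimension, the head sees a single vector per input, which matches the one-label-per-sequence framing of the task.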

To put the LoRA contribution in context, we trained two configurations on the same data and compared them:

- Head-only baseline: Backbone frozen; no adapters, only the classification head is trainable. Total trainable parameters: ~3.7 million (0.33% of the model)
- LoRA plus head: Backbone frozen; LoRA adapters on the listed target modules, classification head trainable. Total trainable parameters: ~16.0 million (1.42% of the model)

Table 2 shows the test accuracy on the held-out 3K examples.

| Mode | Trainable parameters | Trainable fraction | Test accuracy |
| --- | --- | --- | --- |
| Head only | 3,697,923 | 0.33% | 52.3% |
| LoRA plus head | 15,985,923 | 1.42% | 96.6% |

Table 2. Comparison of trainable parameter counts and splice-site classification accuracy for head-only versus LoRA-adapted Evo2-1B models
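The numbers in Table 2 allow a small amount of parameter accounting. The sketch below assumes the classification head is identical in both configurations, so that the difference in trainable parameters is attributable to the LoRA adapters. The implied model size is only approximate because the published fractions are rounded.

```python
# Values copied from Table 2.
head_only_trainable = 3_697_923
lora_plus_head_trainable = 15_985_923
head_only_fraction = 0.0033
lora_plus_head_fraction = 0.0142

# Assumption: both runs use the same head, so the difference is the adapters.
adapter_params = lora_plus_head_trainable - head_only_trainable

# Approximate total model size implied by each rounded fraction.
implied_total_from_head = head_only_trainable / head_only_fraction
implied_total_from_lora = lora_plus_head_trainable / lora_plus_head_fraction

accuracy_gain_points = 96.6 - 52.3

print(f"LoRA adapter parameters: {adapter_params:,}")
print(f"implied total parameters: {implied_total_from_head:,.0f} and {implied_total_from_lora:,.0f}")
print(f"accuracy gain: {accuracy_gain_points:.1f} percentage points")
```

Under that assumption, about 12.3 million adapter parameters account for the jump of roughly 44 percentage points in test accuracy.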

The gap is large: with only ~1.4% of the parameters trainable, LoRA recovers nearly all of the signal that the pretrained Evo2 backbone holds about splicing, whereas a classification head on pooled, frozen features is far from sufficient. Most of the residual error from the LoRA model is in the donor↔acceptor direction. This is expected because both motifs share the GT/AG dinucleotide structure and require longer-range context to disambiguate.

The full workflow runs end-to-end on a single NVIDIA RTX 6000 Blackwell Workstation Edition GPU in about one hour.

The following snippet mirrors the ESM2 example stylistically: it loads the Evo2 backbone, attaches a classification head through a Hyena Model subclass, and configures LoRA adapters on the attention, MLP, and Hyena-mixer projections.

from bionemo.evo2.models.evo2_lora import Evo2LoRA
from evo2_classifier import (
    Hyena1bClassifierProvider,
    HyenaForSequenceClassification,
)

model_provider = Hyena1bClassifierProvider(
    num_classes=3,            # no-splice / acceptor / donor
    classifier_dropout=0.1,
    pool="mean",
)

peft = Evo2LoRA(
    target_modules=[
        "linear_qkv", "linear_proj",
        "linear_fc1", "linear_fc2",
        "dense_projection", "dense",
    ],
    dim=16,
    alpha=32,
    dropout=0.1,
    skip_freeze_modules=["*classification_head*"],
)

The Megatron-Bridge pretrain entry point handles the distributed training, optimizer, scheduler, checkpointing, data, and logging.

To launch a fine-tuning run end-to-end, the recipe exposes a CLI:

torchrun --nproc_per_node=1 evo2_classifier.py \
    --train-jsonl splice_train.jsonl \
    --val-jsonl   splice_val.jsonl \
    --test-jsonl  splice_test.jsonl \
    --base-ckpt-dir evo2_1b_bf16_mbridge \
    --result-dir   splice_run \
    --experiment-name lora_finetune \
    --num-classes 3 \
    --seq-length-tokens 600 \
    --train-iters 1000 \
    --global-batch-size 32 --micro-batch-size 32 \
    --lr 5e-4 --min-lr 5e-5 --warmup-iters 30 \
    --lora-finetune --lora-dim 16 --lora-alpha 32 --lora-dropout 0.1

Swap out the --lora-finetune option and increase the batch size to reproduce the head-only baseline. Data, optimizer, scheduler, and evaluation stay the same.
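The batch-size flags in the command above determine how many samples the run sees. The sketch below works through that arithmetic using the usual Megatron-style relationship between global batch size, micro-batch size, and data-parallel size. It assumes a data-parallel size of 1 (one process, no model parallelism) and uses the approximate training-set size quoted earlier.

```python
def gradient_accumulation_steps(global_batch: int, micro_batch: int, data_parallel_size: int) -> int:
    """Micro-batches each rank processes before one optimizer step."""
    per_pass = micro_batch * data_parallel_size
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro batch x data-parallel size")
    return global_batch // per_pass


# Values from the CLI example; data_parallel_size=1 is an assumption.
global_batch_size = 32
micro_batch_size = 32
data_parallel_size = 1
train_iters = 1000

accum_steps = gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size)
samples_seen = train_iters * global_batch_size

# "~30K" training examples is approximate, so the epoch count is too.
approx_train_examples = 30_000
approx_epochs = samples_seen / approx_train_examples

print(f"gradient accumulation steps: {accum_steps}")
print(f"samples seen: {samples_seen:,} (about {approx_epochs:.2f} epochs)")
```

With these settings each optimizer step consumes a single micro-batch, and the full run amounts to roughly one pass over the training data.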

For the complete training loop, dataset code, parameter accounting, and evaluation utilities, see the Evo2 LoRA fine-tuning notebook.

Get started fine-tuning biological foundation models

Across two very different biological modalities—proteins with ESM2 and DNA with Evo2—the same parameter-efficient recipe can be used. You can freeze the pretrained backbone, train a small LoRA adapter plus a task-specific head, and recover accuracy that is competitive with full fine-tuning or specialized models, on a single workstation GPU.

For ESM2-3B, LoRA brings PSSP performance into the same range as strong published baselines like Porter 6 and SPOT-1D-LM, while TE and THD sequence packing make training on a single NVIDIA RTX 6000 Blackwell Workstation Edition GPU practical. For Evo2-1B, the same approach lifts splice-site classification from a frozen-backbone baseline of ~52% to ~97% test accuracy while training only ~1.4% of the parameters.

Billion-parameter biological foundation models are now adaptable on modest hardware, provided that the surrounding training stack (TE, Megatron-Bridge, packed sequences, PEFT) is well integrated. NVIDIA BioNeMo Recipes are designed to make that integration the default, not the exception.

To get started fine-tuning biological foundation models with LoRA, TE, and scalable PyTorch workflows, check out the NVIDIA BioNeMo Recipes.

source & further reading

developer.nvidia.com (original article)