[AI] Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook


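Full-parameter fine-tuning of a large language model is a luxury. Even for an 8B model like Llama 3, updating all weights in 16-bit precision with an Adam-style optimizer takes on the order of 100 GB of GPU memory for weights, gradients, and optimizer states, which means a multi-GPU server that is out of reach for many mid-sized teams and startups.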

To lower these hardware barriers, Parameter-Efficient Fine-Tuning (PEFT) methods were developed, with LoRA and QLoRA emerging as the dominant approaches. They allow developers to train multi-billion-parameter models on a single 24 GB GPU (such as an RTX 3090, RTX 4090, or A10G), typically with little loss in quality compared to full fine-tuning.

This article dissects the mathematics behind low-rank adaptation, details how to build production-grade Axolotl configurations, and uses Unsloth to accelerate training loops.

During domain-specific fine-tuning (e.g., text-to-SQL or medical terminology), parameter weight updates do not occupy the full parameter space; they exhibit a very low intrinsic rank. Instead of updating the massive original weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA freezes $W_0$ and models the weight updates $\Delta W$ as the product of two extremely low-rank matrices $B$ and $A$ of rank $r$ ($r \ll \min(d, k)$):

$$\Delta W = B \cdot A$$

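Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, so only $r(d + k)$ parameters are trained instead of $d \cdot k$.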

        LoRA Layer Forward Pass:

             Input x 
             ┌───┴───┐
             │       │
             ▼       ▼
          ┌─────┐ ┌─────┐
          │     │ │  A  │ (Rank r, Gaussian initialized)
          │ W_0 │ └─────┘
          │     │    │ (r-dimensional vector)
          │(Frozen)  ▼
          │     │ ┌─────┐
          │     │ │  B  │ (Rank r, Zero initialized)
          └─────┘ └─────┘
             │       │
             ▼       ▼
            h_W     h_LoRA * (alpha / r)
             └───┬───┘
                 ▼
              Output y

For a given input $x$, the output activation $y$ is computed as:

$$y = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} (B A x)$$

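Where $\alpha$ is a constant scaling factor (lora_alpha in the configurations below) and $r$ is the adapter rank. Because $B$ is initialized to zero, $\Delta W = 0$ at the first step and training starts from the unmodified base model.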
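The forward pass above can be sketched in a few lines of plain Python. This is a toy illustration with tiny hand-picked dimensions, not how any library implements it; the helper names matvec and lora_forward are hypothetical. It uses the initialization from the diagram (Gaussian A, zero B) to show that the adapter changes nothing until B has been updated.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """Compute y = W0 x + (alpha / r) * B (A x)."""
    base = matvec(W0, x)                 # frozen path, d-dimensional
    low_rank = matvec(B, matvec(A, x))   # x -> r dimensions -> d dimensions
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

random.seed(0)
d, k, r, alpha = 6, 4, 2, 4              # toy sizes, chosen only for illustration
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [random.gauss(0, 1) for _ in range(k)]

y_init = lora_forward(W0, A, B, x, alpha, r)
print(y_init == matvec(W0, x))  # True: with B = 0 the output equals the base output

B[0][0] = 1.0                   # pretend a training step changed one entry of B
y_trained = lora_forward(W0, A, B, x, alpha, r)
print(y_trained[0] - y_init[0]) # only the first output coordinate has moved
```

Only A and B would receive gradients in real training; W0 is never written to.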

Introduced by Tim Dettmers and co-authors in 2023, QLoRA (Quantized Low-Rank Adaptation) takes memory efficiency a step further by quantizing the base model weights $W_0$ to a highly compressed 4-bit representation, while keeping the trainable LoRA adapter weights in 16-bit precision.
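As a rough back-of-the-envelope check on what 4-bit storage buys, the sketch below compares weight memory at 16 and 4 bits per parameter. It assumes roughly 8.03 billion parameters for Llama 3 8B and treats every parameter as quantized; in practice embeddings and a few other tensors usually stay in higher precision, so real footprints are somewhat larger. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights in GB (10^9 bytes), ignoring activations and optimizer state."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8.03e9  # approximate size of Llama 3 8B (assumption for illustration)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
```

The difference of about 12 GB is what frees up room for activations, adapter gradients, and optimizer state on a 24 GB card.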

QLoRA relies on three key mathematical and systems innovations:

The first is the 4-bit NormalFloat (NF4) data type. Neural network weights are approximately normally distributed around zero. Standard linear quantization schemes (like INT4) space their quantization levels uniformly, which wastes precision on the sparse tails of the distribution.

NF4 addresses this by establishing non-linear quantization intervals such that each bin contains an equal number of expected parameters (equal information entropy):

$$\int_{q_i}^{q_{i+1}} \mathcal{N}(x; 0, 1)\,dx = \text{const}$$

This retains more information from the original FP16 weights than FP4 or INT4 can at the same storage cost of 4 bits per parameter.
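The equal-mass idea can be sketched with the standard library alone. The code below places 16 levels at the centers of 16 equal-probability bins of a standard normal, rescales them to [-1, 1], and quantizes one block with absmax scaling. It is a simplified illustration, not the NF4 table shipped in bitsandbytes: the real data type is built asymmetrically so that zero is represented exactly. All function names are hypothetical.

```python
from statistics import NormalDist

def equal_mass_levels(n_levels: int = 16) -> list[float]:
    """Levels at the centers of n equal-probability bins of N(0, 1), rescaled to [-1, 1]."""
    std_normal = NormalDist(0.0, 1.0)
    # Bin i covers probability mass [i/n, (i+1)/n]; take the quantile at its midpoint.
    raw = [std_normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    max_abs = max(abs(v) for v in raw)
    return [v / max_abs for v in raw]

def quantize_block(weights: list[float], levels: list[float]) -> tuple[list[int], float]:
    """Absmax-scale one block to [-1, 1] and map each weight to its nearest level index."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantize_block(codes: list[int], absmax: float, levels: list[float]) -> list[float]:
    """Look up each 4-bit code and undo the absmax scaling."""
    return [levels[c] * absmax for c in codes]

levels = equal_mass_levels()
block = [0.02, -0.15, 0.4, -0.03, 0.0, 0.11]   # made-up weights for illustration
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

print([round(l, 3) for l in levels])
print(codes)
print([round(v, 3) for v in restored])
```

The printed levels are packed tightly around zero and spread out toward the ends, which is the opposite of what a uniform INT4 grid would give.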

In standard quantization, weight blocks are scaled using a 32-bit float constant. With a block size of 64, this scaling constant introduces an overhead of $32 / 64 = 0.5$ bits per parameter.

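Double Quantization quantizes these scaling constants themselves from 32-bit floats to 8-bit floats with a block size of 256. This lowers the overhead from 0.5 to roughly 0.127 bits per parameter ($8/64 + 32/(64 \cdot 256)$).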
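The overhead arithmetic can be checked directly. The sketch below uses the block sizes stated above (64 weights per first-level block, 256 first-level constants per second-level block); the 8-billion parameter count in the last step is an assumed round number for illustration.

```python
def scale_overhead_bits(scale_bits: int, block_size: int) -> float:
    """Bits per parameter spent on one scaling constant shared by a block."""
    return scale_bits / block_size

# Single quantization: one FP32 scale per 64 weights.
single = scale_overhead_bits(32, 64)

# Double quantization: one 8-bit scale per 64 weights, plus one FP32
# second-level scale per 256 first-level scales (that is, per 64 * 256 weights).
double = scale_overhead_bits(8, 64) + scale_overhead_bits(32, 64 * 256)

saved_bits = single - double
saved_gb = saved_bits * 8e9 / 8 / 1e9   # for an assumed 8 billion parameters

print(single, double, saved_bits)       # 0.5 0.126953125 0.373046875
print(f"about {saved_gb:.2f} GB saved")
```

The saving looks small per parameter, but on a 24 GB card a few hundred megabytes can be the margin that lets a longer sequence fit.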

During training with long sequence lengths or large batches, sudden spikes in memory allocation can exceed physical VRAM limits, triggering out-of-memory (OOM) crashes.

Paged Optimizers leverage CUDA Unified Memory to automatically swap (page) optimizer states between GPU VRAM and CPU RAM during peak memory phases, gracefully slowing down training rather than crashing.

Axolotl is a robust framework for LLM fine-tuning, offering native integration with FlashAttention-2, DeepSpeed, and PyTorch FSDP.

Here is a complete production-ready qlora_llama3_8b.yml configuration optimized for a single NVIDIA A10G (24GB VRAM):

base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: PreTrainedTokenizerFast

load_in_8bit: false
load_in_4bit: true
gptq: false

bf16: true
fp16: false
tf32: true

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: ./temp_cleaned_dataset.jsonl
    type: alpaca
    shards: 10  # trains on one of 10 shards (a tenth of the data); remove to use the full dataset
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.05
output_dir: ./lora-llama3-8b-output

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
flash_attention: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
weight_decay: 0.01
max_grad_norm: 1.0

save_steps: 100
eval_steps: 100
logging_steps: 10
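As a sanity check on the adapter settings in this configuration, the sketch below counts how many parameters become trainable with lora_r: 16 on all seven target modules. The layer shapes are those of the published Llama 3 8B architecture and are hard-coded here as assumptions; each adapted matrix contributes r × (in_features + out_features) parameters.

```python
# Llama 3 8B shapes, hard-coded from the published model config (assumptions here).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
N_LAYERS = 32          # num_hidden_layers
R = 16                 # lora_r

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
trainable = per_layer * N_LAYERS
adapter_mb = trainable * 2 / 1e6   # 2 bytes per parameter at 16-bit precision

print(f"{trainable:,} trainable parameters, about {adapter_mb:.0f} MB in 16-bit")
# 41,943,040 trainable parameters, about 84 MB in 16-bit
```

That is roughly half a percent of the base model, which is why gradients and optimizer state for the adapter fit comfortably next to the 4-bit weights.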

While Axolotl is highly configurable, the standard PyTorch backward passes for attention and MLP layers leave performance on the table. Unsloth rewrites these steps as hand-written OpenAI Triton kernels, with reported speedups of up to 3x and memory reductions of up to 60%, depending on model and hardware.

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 4096 # Limit context length to optimize speed on 24GB GPUs
dtype = None # Auto-detect (Float16 or Bfloat16)
load_in_4bit = True # Enable 4-bit quantization

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0, # Unsloth is optimized for dropout = 0
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Memory-optimized gradient checkpointing
    random_state = 3407,
)

alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    outputs      = examples["output"]
    texts = []
    for inst, out in zip(instructions, outputs):
        text = alpaca_prompt.format(inst, out) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts }

dataset = load_dataset("json", data_files="temp_cleaned_dataset.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Set to True to pack short sequences and speed up training
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 120, # Number of training steps for test run
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()

model.save_pretrained("lora_model_adapter")
tokenizer.save_pretrained("lora_model_adapter")
print("Training complete! Model saved.")

Fine-tuning via LoRA outputs a directory of adapter weights (typically 50 MB to 500 MB). To run high-performance inference serving with engines like vLLM, you should merge these adapter matrices back into the 16-bit base model weights.

from unsloth import FastLanguageModel

# Point model_name at the saved adapter directory; Unsloth reads the base model
# name from the adapter config and loads base weights plus adapter together.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model_adapter",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = False, # Load the base model in 16-bit so the merge targets native 16-bit weights
)

print("Merging weights and saving to disk...")
model.save_pretrained_merged("merged_model_fp16", tokenizer, save_method = "merged_16bit")
print("Merge complete! Ready for vLLM serving.")

The output in merged_model_fp16 is a standalone 16-bit Hugging Face model directory ready to be loaded by vllm serve.
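Conceptually, the merge folds the scaled low-rank product into the base weight, $W_{merged} = W_0 + \frac{\alpha}{r} B A$, so that inference needs a single dense matrix and no adapter code path. The toy example below shows that step on hand-written matrices; the helper names are hypothetical and this is not what Unsloth runs internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W0, A, B, alpha, r):
    """Return W0 + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    delta = matmul(B, A)   # d x k update reconstructed from the two small factors
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]

# Tiny hand-written example: d = 2, k = 3, r = 1, alpha = 2.
W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]      # r x k
B = [[0.5], [-1.0]]        # d x r

W_merged = merge_lora(W0, A, B, alpha=2, r=1)
print(W_merged)  # [[2.0, 2.0, 5.0], [-2.0, -3.0, -6.0]]
```

After merging, the model has the same shape and parameter count as the original base model, which is why a generic serving engine can load it without knowing LoRA was involved.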

Supervised Fine-Tuning teaches your model formatting styles and conversational behavior. However, complex multi-step reasoning tasks benefit from training on explicit, structured reasoning steps.

In Part 4: Task & Knowledge Distillation, we explore how to extract reasoning traces (Chain of Thought - CoT) from larger teacher models like

This post was originally published on my blog at Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook.

Hi, I'm Lê Tuấn Anh (vesviet) 👋

I am a Senior Go Backend Architect & Distributed Systems Engineer with 17+ years of experience building high-traffic platforms (25M+ requests/month).

If you enjoyed this deep-dive, let's connect on LinkedIn or explore my consulting services at tanhdev.com/hire.
