[AI] Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook

A developer details practical QLoRA fine-tuning using Axolotl and Unsloth, explaining how parameter-efficient methods like LoRA and QLoRA enable training multi-billion parameter models on a single consumer GPU. The post covers the mathematics of low-rank adaptation, production Axolotl configurations, and Unsloth's acceleration of training loops.

← Series hub https://dev.to/series/slm-playbook/ ← Previous https://dev.to/series/slm-playbook/part-2-sft-data-engineering/ | Next → https://dev.to/series/slm-playbook/part-4-knowledge-distillation-r1/ Full-parameter fine-tuning of a large language model is a luxury. For even an 8B model like Llama 3, updating all weights in 16-bit precision requires massive clusters far beyond the reach of mid-sized teams or startups. To resolve these hardware barriers, Parameter-Efficient Fine-Tuning PEFT methods were developed, with LoRA and QLoRA emerging as the dominant paradigms. They allow developers to train multi-billion parameter models on a single consumer GPU like an RTX 3090, 4090, or A10G while maintaining near-zero performance degradation compared to full tuning. This article dissects the mathematics behind low-rank adaptation, details how to build production-grade Axolotl configurations, and uses Unsloth to accelerate training loops. During domain-specific fine-tuning e.g., text-to-SQL or medical terminology , parameter weight updates do not occupy the full parameter space; they exhibit a very low intrinsic rank . Instead of updating the massive original weight matrix $W 0 \in \mathbb{R}^{d \times k}$, LoRA freezes $W 0$ and models the weight updates $\Delta W$ as the product of two extremely low-rank matrices $B$ and $A$ of rank $r$ $r \ll \min d, k $ : $$\Delta W = B \cdot A$$ Where: LoRA Layer Forward Pass: Input x ┌───┴───┐ │ │ ▼ ▼ ┌─────┐ ┌─────┐ │ │ │ A │ Rank r, Gaussian initialized │ W 0 │ └─────┘ │ │ │ r-dimensional vector │ Frozen ▼ │ │ ┌─────┐ │ │ │ B │ Rank r, Zero initialized └─────┘ └─────┘ │ │ ▼ ▼ h W h LoRA alpha / r └───┬───┘ ▼ Output y For a given input $x$, the output activation $y$ is computed as: $$y = W 0 x + \Delta W x = W 0 x + \frac{\alpha}{r} B A x $$ Where: Introduced by Tim Dettmers in 2023, QLoRA Quantized Low-Rank Adaptation takes memory efficiency a step further by quantizing the base model weights $W 0$ to a highly compressed 4-bit representation, while keeping the active LoRA adapter weights in 16-bit precision. QLoRA relies on three key mathematical and systems innovations: Neural network weights naturally follow a zero-centered normal distribution. Standard linear quantization schemes like INT4 allocate quantization bins uniformly, wasting precision at the sparse tails of the distribution. NF4 addresses this by establishing non-linear quantization intervals such that each bin contains an equal number of expected parameters equal information entropy : $$\int {q i}^{q {i+1}} \mathcal{N} 0, 1 dx = \text{const}$$ This preserves the maximum information of the original FP16 weights, matching FP4/INT4 precision while cutting model weight size to 4 bits per parameter. In standard quantization, weight blocks are scaled using a 32-bit float constant. With a block size of 64, this scaling constant introduces an overhead of $32 / 64 = 0.5$ bits per parameter. Double Quantization quantizes these scaling constants themselves from 32-bit floats to 8-bit floats with a block size of 256. During training with long sequence lengths or large batches, sudden gradient allocation spikes can exceed physical VRAM limits, triggering OOM crashes. Paged Optimizers leverage CUDA Unified Memory to automatically swap page optimizer states between GPU VRAM and CPU RAM during peak memory phases, gracefully slowing down training rather than crashing. Axolotl is a robust framework for LLM fine-tuning, offering native integration with FlashAttention-2, DeepSpeed, and PyTorch FSDP. Here is a complete production-ready qlora llama3 8b.yml configuration optimized for a single NVIDIA A10G 24GB VRAM : Model & Training Mode Config base model: meta-llama/Meta-Llama-3-8B-Instruct model type: LlamaForCausalLM tokenizer type: PreTrainedTokenizerFast Enable QLoRA 4-bit NF4 Quantization load in 8bit: false load in 4bit: true gptq: false Precision settings bf16: true fp16: false tf32: true LoRA Adapter Configuration adapter: qlora lora r: 16 lora alpha: 32 lora dropout: 0.05 lora target modules: - q proj - k proj - v proj - o proj - gate proj - up proj - down proj Dataset Configurations datasets: - path: ./temp cleaned dataset.jsonl type: alpaca shards: 10 dataset prepared path: ./last run prepared val set size: 0.05 output dir: ./lora-llama3-8b-output Memory & Speed Optimizations sequence len: 8192 sample packing: true pad to sequence len: true flash attention: true Hyperparameters gradient accumulation steps: 4 micro batch size: 2 num epochs: 3 optimizer: paged adamw 8bit lr scheduler: cosine learning rate: 0.0002 weight decay: 0.01 max grad norm: 1.0 Checkpointing & Logs save steps: 100 eval steps: 100 logging steps: 10 While Axolotl is highly configurable, standard PyTorch backward passes for attention layers leave performance on the table. Unsloth rewrites the attention and MLP backward steps in raw OpenAI Triton , achieving a 3x speedup while reducing memory consumption by 60% . python import torch from unsloth import FastLanguageModel from datasets import load dataset from trl import SFTTrainer from transformers import TrainingArguments max seq length = 4096 Limit context length to optimize speed on 24GB GPUs dtype = None Auto-detect Float16 or Bfloat16 load in 4bit = True Enable 4-bit quantization 1. Initialize model and tokenizer via Unsloth model, tokenizer = FastLanguageModel.from pretrained model name = "meta-llama/Meta-Llama-3-8B-Instruct", max seq length = max seq length, dtype = dtype, load in 4bit = load in 4bit, 2. Add optimized LoRA adapters model = FastLanguageModel.get peft model model, r = 16, target modules = "q proj", "k proj", "v proj", "o proj", "gate proj", "up proj", "down proj" , lora alpha = 32, lora dropout = 0, Unsloth is optimized for dropout = 0 bias = "none", use gradient checkpointing = "unsloth", Memory-optimized gradient checkpointing random state = 3407, 3. Format SFT Prompts Alpaca style alpaca prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: {} Response: {}""" def formatting prompts func examples : instructions = examples "instruction" outputs = examples "output" texts = for inst, out in zip instructions, outputs : text = alpaca prompt.format inst, out + tokenizer.eos token texts.append text return { "text" : texts } Load semantic deduplicated dataset from Part 2 dataset = load dataset "json", data files="temp cleaned dataset.jsonl", split="train" dataset = dataset.map formatting prompts func, batched = True 4. Setup SFT Trainer trainer = SFTTrainer model = model, tokenizer = tokenizer, train dataset = dataset, dataset text field = "text", max seq length = max seq length, dataset num proc = 2, packing = False, Set to True to pack short sequences and speed up training args = TrainingArguments per device train batch size = 2, gradient accumulation steps = 4, warmup steps = 10, max steps = 120, Number of training steps for test run learning rate = 2e-4, fp16 = not torch.cuda.is bf16 supported , bf16 = torch.cuda.is bf16 supported , logging steps = 1, optim = "adamw 8bit", weight decay = 0.01, lr scheduler type = "linear", seed = 3407, output dir = "outputs", , Execute training run trainer stats = trainer.train 5. Save model adapter weights model.save pretrained "lora model adapter" tokenizer.save pretrained "lora model adapter" print "Training complete Model saved." Fine-tuning via LoRA outputs a directory of adapter weights typically 50MB - 500MB . To run high-performance inference serving with engines like vLLM, you should merge these adapter matrices back into the 16-bit base model weights. python from unsloth import FastLanguageModel Load the base model and model adapter in native 16-bit model, tokenizer = FastLanguageModel.from pretrained model name = "meta-llama/Meta-Llama-3-8B-Instruct", max seq length = 4096, dtype = None, load in 4bit = False, Must be False to export back to native 16-bit float model.load adapter "lora model adapter" Execute weights merge and save to disk print "Merging weights and saving to disk..." model.save pretrained merged "merged model fp16", tokenizer, save method = "merged 16bit" print "Merge complete Ready for vLLM serving." The output in merged model fp16 is a standalone 16-bit Hugging Face model directory ready to be loaded by vllm serve . Supervised Fine-Tuning instructs your model on formatting styles and conversational behavior. However, complex, multi-step logical operations Reasoning benefit from structured channelling of reasoning steps. In Part 4: Task & Knowledge Distillation https://dev.to/series/slm-playbook/part-4-knowledge-distillation-r1/ , we explore how to extract reasoning traces Chain of Thought - CoT from larger teacher models like {{< author-cta }} This post was originally published on my blog at Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook. Hi, I'm Lê Tuấn Anh vesviet 👋 I am a Senior Go Backend Architect & Distributed Systems Engineer with 17+ years of experience building high-traffic platforms 25M+ requests/month . If you enjoyed this deep-dive, let's connect on LinkedIn or explore my consulting services at tanhdev.com/hire.