# [AI] Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook

> Source: <https://dev.to/vesviet/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook-37d1>
> Published: 2026-06-30 23:39:00+00:00

[← Series hub](https://dev.to/series/slm-playbook/)

[← Previous](https://dev.to/series/slm-playbook/part-2-sft-data-engineering/) | [Next →](https://dev.to/series/slm-playbook/part-4-knowledge-distillation-r1/)

Full-parameter fine-tuning of a large language model is a luxury. For even an 8B model like Llama 3, updating all weights in 16-bit precision requires massive clusters far beyond the reach of mid-sized teams or startups.

To resolve these hardware barriers, **Parameter-Efficient Fine-Tuning (PEFT)** methods were developed, with **LoRA** and **QLoRA** emerging as the dominant paradigms. They allow developers to train multi-billion parameter models on a single consumer GPU (like an RTX 3090, 4090, or A10G) while maintaining near-zero performance degradation compared to full tuning.

This article dissects the mathematics behind low-rank adaptation, details how to build production-grade **Axolotl** configurations, and uses **Unsloth** to accelerate training loops.

During domain-specific fine-tuning (e.g., text-to-SQL or medical terminology), parameter weight updates do not occupy the full parameter space; they exhibit a very low **intrinsic rank**. Instead of updating the massive original weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA freezes $W_0$ and models the weight updates $\Delta W$ as the product of two extremely low-rank matrices $B$ and $A$ of rank $r$ ($r \ll \min(d, k)$):

$$\Delta W = B \cdot A$$

Where:

```
        LoRA Layer Forward Pass:

             Input x 
             ┌───┴───┐
             │       │
             ▼       ▼
          ┌─────┐ ┌─────┐
          │     │ │  A  │ (Rank r, Gaussian initialized)
          │ W_0 │ └─────┘
          │     │    │ (r-dimensional vector)
          │(Frozen)  ▼
          │     │ ┌─────┐
          │     │ │  B  │ (Rank r, Zero initialized)
          └─────┘ └─────┘
             │       │
             ▼       ▼
            h_W     h_LoRA * (alpha / r)
             └───┬───┘
                 ▼
              Output y
```

For a given input $x$, the output activation $y$ is computed as:

$$y = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} (B A x)$$

Where:

Introduced by Tim Dettmers in 2023, **QLoRA (Quantized Low-Rank Adaptation)** takes memory efficiency a step further by quantizing the base model weights $W_0$ to a highly compressed **4-bit** representation, while keeping the active LoRA adapter weights in 16-bit precision.

QLoRA relies on three key mathematical and systems innovations:

Neural network weights naturally follow a zero-centered normal distribution. Standard linear quantization schemes (like INT4) allocate quantization bins uniformly, wasting precision at the sparse tails of the distribution.

NF4 addresses this by establishing non-linear quantization intervals such that **each bin contains an equal number of expected parameters (equal information entropy)**:

$$\int_{q_i}^{q_{i+1}} \mathcal{N}(0, 1) dx = \text{const}$$

This preserves the maximum information of the original FP16 weights, matching FP4/INT4 precision while cutting model weight size to 4 bits per parameter.

In standard quantization, weight blocks are scaled using a 32-bit float constant. With a block size of 64, this scaling constant introduces an overhead of $32 / 64 = 0.5$ bits per parameter.

Double Quantization quantizes **these scaling constants themselves** from 32-bit floats to 8-bit floats with a block size of 256.

During training with long sequence lengths or large batches, sudden gradient allocation spikes can exceed physical VRAM limits, triggering OOM crashes.

Paged Optimizers leverage CUDA Unified Memory to automatically swap (page) optimizer states between GPU VRAM and CPU RAM during peak memory phases, gracefully slowing down training rather than crashing.

**Axolotl** is a robust framework for LLM fine-tuning, offering native integration with FlashAttention-2, DeepSpeed, and PyTorch FSDP.

Here is a complete production-ready `qlora_llama3_8b.yml`

configuration optimized for a single NVIDIA A10G (24GB VRAM):

```
# Model & Training Mode Config
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: PreTrainedTokenizerFast

# Enable QLoRA (4-bit NF4 Quantization)
load_in_8bit: false
load_in_4bit: true
gptq: false

# Precision settings
bf16: true
fp16: false
tf32: true

# LoRA Adapter Configuration
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Dataset Configurations
datasets:
  - path: ./temp_cleaned_dataset.jsonl
    type: alpaca
    shards: 10
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.05
output_dir: ./lora-llama3-8b-output

# Memory & Speed Optimizations
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
flash_attention: true

# Hyperparameters
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
weight_decay: 0.01
max_grad_norm: 1.0

# Checkpointing & Logs
save_steps: 100
eval_steps: 100
logging_steps: 10
```

While Axolotl is highly configurable, standard PyTorch backward passes for attention layers leave performance on the table. **Unsloth** rewrites the attention and MLP backward steps in raw **OpenAI Triton**, achieving a **3x speedup** while reducing memory consumption by **60%**.

``` python
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 4096 # Limit context length to optimize speed on 24GB GPUs
dtype = None # Auto-detect (Float16 or Bfloat16)
load_in_4bit = True # Enable 4-bit quantization

# 1. Initialize model and tokenizer via Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# 2. Add optimized LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0, # Unsloth is optimized for dropout = 0
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Memory-optimized gradient checkpointing
    random_state = 3407,
)

# 3. Format SFT Prompts (Alpaca style)
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    outputs      = examples["output"]
    texts = []
    for inst, out in zip(instructions, outputs):
        text = alpaca_prompt.format(inst, out) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts }

# Load semantic deduplicated dataset from Part 2
dataset = load_dataset("json", data_files="temp_cleaned_dataset.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True)

# 4. Setup SFT Trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Set to True to pack short sequences and speed up training
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 120, # Number of training steps for test run
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# Execute training run
trainer_stats = trainer.train()

# 5. Save model adapter weights
model.save_pretrained("lora_model_adapter")
tokenizer.save_pretrained("lora_model_adapter")
print("Training complete! Model saved.")
```

Fine-tuning via LoRA outputs a directory of adapter weights (typically 50MB - 500MB). To run high-performance inference serving with engines like vLLM, you should merge these adapter matrices back into the 16-bit base model weights.

``` python
from unsloth import FastLanguageModel

# Load the base model and model adapter in native 16-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = False, # Must be False to export back to native 16-bit float
)
model.load_adapter("lora_model_adapter")

# Execute weights merge and save to disk
print("Merging weights and saving to disk...")
model.save_pretrained_merged("merged_model_fp16", tokenizer, save_method = "merged_16bit")
print("Merge complete! Ready for vLLM serving.")
```

The output in `merged_model_fp16`

is a standalone 16-bit Hugging Face model directory ready to be loaded by `vllm serve`

.

Supervised Fine-Tuning instructs your model on formatting styles and conversational behavior. However, complex, multi-step logical operations (Reasoning) benefit from structured channelling of reasoning steps.

In [ Part 4: Task & Knowledge Distillation](https://dev.to/series/slm-playbook/part-4-knowledge-distillation-r1/), we explore how to extract reasoning traces (Chain of Thought - CoT) from larger teacher models like

{{< author-cta >}}

*This post was originally published on my blog at Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook.*

**Hi, I'm Lê Tuấn Anh (vesviet) 👋**

*I am a Senior Go Backend Architect & Distributed Systems Engineer with 17+ years of experience building high-traffic platforms (25M+ requests/month).*

*If you enjoyed this deep-dive, let's connect on LinkedIn or explore my consulting services at tanhdev.com/hire.*