{"slug": "ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook", "title": "[AI] Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook", "summary": "A developer details practical QLoRA fine-tuning using Axolotl and Unsloth, explaining how parameter-efficient methods like LoRA and QLoRA enable training multi-billion parameter models on a single consumer GPU. The post covers the mathematics of low-rank adaptation, production Axolotl configurations, and Unsloth's acceleration of training loops.", "body_md": "Full-parameter fine-tuning of a large language model is a luxury. Even for an 8B model like Llama 3, updating all weights in 16-bit precision with AdamW takes roughly 12 to 16 bytes per parameter for weights, gradients, and optimizer states, or on the order of 100 GB of GPU memory before activations. That means a multi-GPU node, which is beyond the reach of many mid-sized teams and startups.\n\nTo lower these hardware barriers, **Parameter-Efficient Fine-Tuning (PEFT)** methods were developed, with **LoRA** and **QLoRA** emerging as the dominant approaches. They allow developers to train multi-billion parameter models on a single 24GB GPU (like an RTX 3090, 4090, or A10G), typically with little loss in quality compared to full fine-tuning.\n\nThis article dissects the mathematics behind low-rank adaptation, details how to build production-grade **Axolotl** configurations, and uses **Unsloth** to accelerate training loops.\n\nDuring domain-specific fine-tuning (e.g., text-to-SQL or medical terminology), the weight updates do not need the full parameter space; they have a very low **intrinsic rank**. 
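Instead of updating the large original weight matrix $W_0 \\in \\mathbb{R}^{d \\times k}$, LoRA freezes $W_0$ and models the weight update $\\Delta W$ as the product of two low-rank matrices $B$ and $A$ of rank $r$ ($r \\ll \\min(d, k)$):\n\n$$\\Delta W = B \\cdot A$$\n\nwhere $B \\in \\mathbb{R}^{d \\times r}$ and $A \\in \\mathbb{R}^{r \\times k}$. $A$ is initialized from a Gaussian and $B$ is initialized to zero, so $\\Delta W = 0$ at the start of training. Only $r(d + k)$ parameters are trained instead of $d \\cdot k$: for $d = k = 4096$ and $r = 16$, that is 131,072 instead of about 16.8 million, or roughly 0.8%.\n\n```\n        LoRA Layer Forward Pass:\n\n             Input x \n             ┌───┴───┐\n             │       │\n             ▼       ▼\n          ┌─────┐ ┌─────┐\n          │     │ │  A  │ (Rank r, Gaussian initialized)\n          │ W_0 │ └─────┘\n          │     │    │ (r-dimensional vector)\n          │(Frozen)  ▼\n          │     │ ┌─────┐\n          │     │ │  B  │ (Rank r, Zero initialized)\n          └─────┘ └─────┘\n             │       │\n             ▼       ▼\n            h_W     h_LoRA * (alpha / r)\n             └───┬───┘\n                 ▼\n              Output y\n```\n\nFor a given input $x$, the output activation $y$ is computed as:\n\n$$y = W_0 x + \\frac{\\alpha}{r} \\Delta W x = W_0 x + \\frac{\\alpha}{r} B A x$$\n\nwhere $\\alpha$ is a constant scaling factor and $r$ is the adapter rank, so the adapter output is scaled by $\\alpha / r$ before it is added to the output of the frozen path.\n\nIntroduced by Tim Dettmers et al. in 2023, **QLoRA (Quantized Low-Rank Adaptation)** takes memory efficiency a step further by quantizing the frozen base model weights $W_0$ to a compressed **4-bit** representation, while keeping the trainable LoRA adapter weights in 16-bit precision.\n\nQLoRA relies on three key innovations: 4-bit NormalFloat (NF4) quantization, Double Quantization, and Paged Optimizers.\n\nThe weights of a trained neural network are approximately normally distributed around zero. 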
Uniform quantization schemes (like INT4) space their levels evenly, so they spend levels on the sparse tails of the distribution and leave too few near zero, where most weights lie.\n\nNF4 addresses this by rescaling each block of weights to $[-1, 1]$ and placing its 16 quantization levels at quantiles of the standard normal distribution, so that **each bin is expected to hold an equal share of the weights**:\n\n$$\\int_{q_i}^{q_{i+1}} \\mathcal{N}(x;\\, 0, 1)\\, dx = \\text{const}$$\n\nAt the same 4 bits per parameter, this retains more information from the original 16-bit weights than FP4 or INT4.\n\nIn block-wise quantization, each block of weights carries a 32-bit float scaling constant. With a block size of 64, this constant adds $32 / 64 = 0.5$ bits per parameter.\n\nDouble Quantization quantizes **these scaling constants themselves** from 32-bit floats to 8-bit floats with a block size of 256. The overhead drops to $8 / 64 + 32 / (64 \\cdot 256) \\approx 0.127$ bits per parameter, which saves about 0.37 GB on an 8B-parameter model.\n\nDuring training with long sequence lengths or large batches, sudden memory spikes (for example from gradient checkpointing) can exceed physical VRAM limits, triggering out-of-memory (OOM) crashes.\n\nPaged Optimizers use CUDA Unified Memory to automatically page optimizer states between GPU VRAM and CPU RAM during these peaks, so training slows down instead of crashing.\n\n**Axolotl** is a framework for LLM fine-tuning with built-in support for FlashAttention-2, DeepSpeed, and PyTorch FSDP.\n\nHere is a complete production-ready `qlora_llama3_8b.yml` configuration optimized for a single NVIDIA A10G (24GB VRAM):\n\n```\n# Model & Training Mode Config\nbase_model: meta-llama/Meta-Llama-3-8B-Instruct\nmodel_type: LlamaForCausalLM\ntokenizer_type: PreTrainedTokenizerFast\n\n# Enable QLoRA (4-bit NF4 Quantization)\nload_in_8bit: false\nload_in_4bit: true\ngptq: false\n\n# Precision settings\nbf16: true\nfp16: false\ntf32: true\n\n# LoRA Adapter Configuration\nadapter: qlora\nlora_r: 16\nlora_alpha: 32\nlora_dropout: 0.05\nlora_target_modules:\n  - q_proj\n  - k_proj\n  - v_proj\n  - o_proj\n  - gate_proj\n  - 
up_proj\n  - down_proj\n\n# Dataset Configurations\ndatasets:\n  - path: ./temp_cleaned_dataset.jsonl\n    type: alpaca\n    shards: 10\ndataset_prepared_path: ./last_run_prepared\nval_set_size: 0.05\noutput_dir: ./lora-llama3-8b-output\n\n# Memory & Speed Optimizations\nsequence_len: 8192\nsample_packing: true\npad_to_sequence_len: true\nflash_attention: true\n\n# Hyperparameters\ngradient_accumulation_steps: 4\nmicro_batch_size: 2\nnum_epochs: 3\noptimizer: paged_adamw_8bit\nlr_scheduler: cosine\nlearning_rate: 0.0002\nweight_decay: 0.01\nmax_grad_norm: 1.0\n\n# Checkpointing & Logs\nsave_steps: 100\neval_steps: 100\nlogging_steps: 10\n```\n\nWhile Axolotl is highly configurable, the standard PyTorch backward passes for attention and MLP layers leave performance on the table. **Unsloth** rewrites these backward steps as hand-written **OpenAI Triton** kernels. In favorable setups this gives up to a **3x speedup** and about **60%** lower memory consumption, with the exact gains depending on the model, sequence length, and GPU.\n\n``` python\nimport torch\nfrom unsloth import FastLanguageModel\nfrom datasets import load_dataset\nfrom trl import SFTTrainer\nfrom transformers import TrainingArguments\n\nmax_seq_length = 4096 # Limit context length to optimize speed on 24GB GPUs\ndtype = None # Auto-detect (Float16 or Bfloat16)\nload_in_4bit = True # Enable 4-bit quantization\n\n# 1. Initialize model and tokenizer via Unsloth\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n    model_name = \"meta-llama/Meta-Llama-3-8B-Instruct\",\n    max_seq_length = max_seq_length,\n    dtype = dtype,\n    load_in_4bit = load_in_4bit,\n)\n\n# 2. 
Add optimized LoRA adapters\nmodel = FastLanguageModel.get_peft_model(\n    model,\n    r = 16,\n    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n                      \"gate_proj\", \"up_proj\", \"down_proj\"],\n    lora_alpha = 32,\n    lora_dropout = 0, # Unsloth is optimized for dropout = 0\n    bias = \"none\",\n    use_gradient_checkpointing = \"unsloth\", # Memory-optimized gradient checkpointing\n    random_state = 3407,\n)\n\n# 3. Format SFT Prompts (Alpaca style)\nalpaca_prompt = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Response:\n{}\"\"\"\n\ndef formatting_prompts_func(examples):\n    instructions = examples[\"instruction\"]\n    outputs      = examples[\"output\"]\n    texts = []\n    for inst, out in zip(instructions, outputs):\n        text = alpaca_prompt.format(inst, out) + tokenizer.eos_token\n        texts.append(text)\n    return { \"text\" : texts }\n\n# Load semantic deduplicated dataset from Part 2\ndataset = load_dataset(\"json\", data_files=\"temp_cleaned_dataset.jsonl\", split=\"train\")\ndataset = dataset.map(formatting_prompts_func, batched = True)\n\n# 4. 
Setup SFT Trainer\ntrainer = SFTTrainer(\n    model = model,\n    tokenizer = tokenizer,\n    train_dataset = dataset,\n    dataset_text_field = \"text\",\n    max_seq_length = max_seq_length,\n    dataset_num_proc = 2,\n    packing = False, # Set to True to pack short sequences and speed up training\n    args = TrainingArguments(\n        per_device_train_batch_size = 2,\n        gradient_accumulation_steps = 4,\n        warmup_steps = 10,\n        max_steps = 120, # Number of training steps for test run\n        learning_rate = 2e-4,\n        fp16 = not torch.cuda.is_bf16_supported(),\n        bf16 = torch.cuda.is_bf16_supported(),\n        logging_steps = 1,\n        optim = \"adamw_8bit\",\n        weight_decay = 0.01,\n        lr_scheduler_type = \"linear\",\n        seed = 3407,\n        output_dir = \"outputs\",\n    ),\n)\n\n# Execute training run\ntrainer_stats = trainer.train()\n\n# 5. Save model adapter weights\nmodel.save_pretrained(\"lora_model_adapter\")\ntokenizer.save_pretrained(\"lora_model_adapter\")\nprint(\"Training complete! Model saved.\")\n```\n\nFine-tuning via LoRA outputs a directory of adapter weights (typically 50MB - 500MB). To run high-performance inference serving with engines like vLLM, you should merge these adapter matrices back into the 16-bit base model weights.\n\n``` python\nfrom unsloth import FastLanguageModel\n\n# Load the base model and model adapter in native 16-bit\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n    model_name = \"meta-llama/Meta-Llama-3-8B-Instruct\",\n    max_seq_length = 4096,\n    dtype = None,\n    load_in_4bit = False, # Must be False to export back to native 16-bit float\n)\nmodel.load_adapter(\"lora_model_adapter\")\n\n# Execute weights merge and save to disk\nprint(\"Merging weights and saving to disk...\")\nmodel.save_pretrained_merged(\"merged_model_fp16\", tokenizer, save_method = \"merged_16bit\")\nprint(\"Merge complete! 
Ready for vLLM serving.\")\n```\n\nThe output in `merged_model_fp16` is a standalone 16-bit Hugging Face model directory ready to be loaded by `vllm serve`.\n\nSupervised Fine-Tuning teaches your model formatting styles and conversational behavior. However, complex multi-step reasoning benefits from training on explicitly structured reasoning steps.\n\nIn [Part 4: Task & Knowledge Distillation](https://dev.to/series/slm-playbook/part-4-knowledge-distillation-r1/), we explore how to extract reasoning traces (Chain of Thought, CoT) from larger teacher models like\n\n*This post was originally published on my blog at Practical QLoRA Fine-tuning: Axolotl & Unsloth | SLM Playbook.*\n\n**Hi, I'm Lê Tuấn Anh (vesviet) 👋**\n\n*I am a Senior Go Backend Architect & Distributed Systems Engineer with 17+ years of experience building high-traffic platforms (25M+ requests/month).*\n\n*If you enjoyed this deep-dive, let's connect on LinkedIn or explore my consulting services at tanhdev.com/hire.*", "url": "https://wpnews.pro/news/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook", "canonical_source": "https://dev.to/vesviet/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook-37d1", "published_at": "2026-06-30 23:39:00+00:00", "updated_at": "2026-07-01 00:18:56.260550+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-infrastructure", "developer-tools"], "entities": ["Axolotl", "Unsloth", "LoRA", "QLoRA", "Llama 3", "Tim Dettmers", "FlashAttention-2", "DeepSpeed"], "alternates": {"html": "https://wpnews.pro/news/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook", "markdown": "https://wpnews.pro/news/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook.md", "text": "https://wpnews.pro/news/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook.txt", "jsonld": "https://wpnews.pro/news/ai-practical-qlora-fine-tuning-axolotl-unsloth-slm-playbook.jsonld"}}