How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab

Liquid AI's LFM2 model can be fine-tuned using QLoRA and DPO in a complete open-source workflow on Google Colab. The process loads the base LFM2 checkpoint with 4-bit quantization, trains a lightweight LoRA adapter using TRL and PEFT, and extends the workflow with DPO to improve response preference using chosen and rejected answers. The resulting pipeline moves from a base LFM2 model to a supervised fine-tuned, preference-aligned checkpoint ready for testing or deployment.

In this tutorial, we fine-tune Liquid AI's LFM2 model (https://github.com/Liquid4All/leap-finetune) through a complete open-source workflow. We start by loading the base LFM2 checkpoint with QLoRA, prepare a chat-style supervised fine-tuning dataset, train a lightweight LoRA adapter using TRL and PEFT, and then merge the adapter back into the model. We also extend the workflow with DPO to show how chosen and rejected answers can steer response preference. At the end, we have a practical pipeline that moves from a base LFM2 model to an SFT-tuned, preference-aligned checkpoint, ready for further testing or deployment.

```python
# Colab cell: the leading "!" runs pip as a shell command.
!pip install -q -U "transformers>=4.55" "trl>=0.12" "peft>=0.13" "datasets>=2.20" "accelerate>=0.34" bitsandbytes

import torch, gc
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Main settings for the run.
MODEL_ID = "LiquidAI/LFM2-1.2B"
USE_4BIT = True
RUN_DPO = True
SFT_SAMPLES = 500
SFT_STEPS = 60
DPO_STEPS = 40
MAX_LEN = 1024

# Prefer bf16 when the GPU supports it, otherwise fall back to fp16.
BF16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if BF16 else torch.float16

assert torch.cuda.is_available(), "No GPU detected: set Runtime > Change runtime type > GPU"
print(f"GPU: {torch.cuda.get_device_name(0)} | dtype={DTYPE} | 4bit={USE_4BIT}")
```

We install all the required libraries for fine-tuning LFM2 inside Google Colab. We import the core tools from Transformers, TRL, PEFT, datasets, bitsandbytes, and PyTorch. We also define the main training settings, check that a GPU is available, and select bf16 or fp16 precision depending on what the GPU supports.
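As a rough illustration of why the `USE_4BIT` switch matters on a Colab GPU, the sketch below estimates weight storage for a model of about 1.2 billion parameters at different precisions. The parameter count is an assumption read off the "1.2B" in the checkpoint name, and the helper `weight_memory_gib` is hypothetical. The estimate covers weights only; activations, optimizer state, quantization constants, and any layers that stay unquantized are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 1.2e9  # assumption: taken from the "1.2B" in the checkpoint name

estimates = {
    "fp32": weight_memory_gib(N_PARAMS, 32),
    "bf16/fp16": weight_memory_gib(N_PARAMS, 16),
    "nf4 (4-bit)": weight_memory_gib(N_PARAMS, 4),
}

# Print a small comparison table of the three precisions.
for name, gib in estimates.items():
    print(f"{name:>12}: {gib:.2f} GiB")
```

Under these assumptions, 4-bit storage needs about a quarter of the memory of bf16 or fp16 weights, which leaves more headroom for activations and the LoRA optimizer state.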
```python
def load_base(four_bit: bool):
    """Load the base checkpoint, optionally quantized to 4-bit NF4."""
    quant_cfg = None
    if four_bit:
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=DTYPE,
        )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",
        dtype=DTYPE,
        quantization_config=quant_cfg,
    )
    model.config.use_cache = False  # the KV cache is not needed during training
    return model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = load_base(USE_4BIT)

@torch.no_grad()
def chat(m, user_msg, system=None, max_new_tokens=200):
    """Generate a single reply using the tokenizer's chat template."""
    msgs = ([{"role": "system", "content": system}] if system else []) + \
           [{"role": "user", "content": user_msg}]
    inputs = tokenizer.apply_chat_template(
        msgs,
        add_generation_prompt=True,
        return_tensors="pt",
        tokenize=True,
        return_dict=True,
    ).to(m.device)
    m.config.use_cache = True  # enable the KV cache for generation only
    out = m.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        min_p=0.15,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.pad_token_id,
    )
    m.config.use_cache = False
    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)

PROBE = "Explain what makes the LFM2 architecture good for on-device AI, in 2 sentences."
print("\n=== BASELINE (before fine-tuning) ===\n", chat(model, PROBE))
```

We load the LFM2 base model with optional 4-bit quantization to reduce GPU memory usage. We prepare the tokenizer, set the padding token, and define a chat function for testing model responses. We then run a baseline prompt to compare the model's behavior before and after fine-tuning.
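The `chat` helper samples with `temperature=0.3` and `min_p=0.15`. To make those two knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering: tokens whose probability falls below `min_p` times the top token's probability are dropped, and the rest are renormalized. The function `min_p_filter` and the toy logits are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
import math

def min_p_filter(logits, min_p=0.15, temperature=0.3):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens at least min_p times as likely as the most likely token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

toy_logits = [2.0, 1.8, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
filtered = min_p_filter(toy_logits)
print([round(p, 3) for p in filtered])
```

With these toy values, the low temperature sharpens the distribution so much that only the two highest-scoring tokens survive the filter, which is why the sampled outputs stay fairly focused even though `do_sample=True`.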
```python
# Take a small slice of a chat-formatted dataset and keep only the messages.
sft_ds = load_dataset("HuggingFaceTB/smoltalk", "all", split=f"train[:{SFT_SAMPLES}]")
sft_ds = sft_ds.select_columns(["messages"])
print("\nSFT example messages:", sft_ds[0]["messages"][:2])

lora_sft = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

sft_cfg = SFTConfig(
    output_dir="outputs/sft/lfm2_demo",
    max_length=MAX_LEN,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    max_steps=SFT_STEPS,
    logging_steps=10,
    save_strategy="no",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=BF16,
    fp16=not BF16,
    optim="paged_adamw_8bit" if USE_4BIT else "adamw_torch",
    packing=False,
    report_to="none",
)

sft_trainer = SFTTrainer(
    model=model,
    args=sft_cfg,
    train_dataset=sft_ds,
    peft_config=lora_sft,
    processing_class=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("outputs/sft/lfm2_adapter")  # saves the LoRA adapter only
print("\n=== AFTER SFT ===\n", chat(sft_trainer.model, PROBE))
```

We load a chat-formatted supervised fine-tuning dataset and keep only the messages column. We configure LoRA for lightweight adapter-based training and define the SFT training settings. We then train the model with SFT, save the LoRA adapter, and test the model's response after SFT.

```python
# Free the quantized training model before loading an unquantized copy.
del sft_trainer, model
gc.collect(); torch.cuda.empty_cache()

# Reload the base model in bf16/fp16 (not 4-bit) so the adapter can be merged into it.
base_fp16 = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", dtype=DTYPE)
sft_merged = PeftModel.from_pretrained(base_fp16, "outputs/sft/lfm2_adapter").merge_and_unload()
sft_merged.save_pretrained("outputs/sft/lfm2_merged")
tokenizer.save_pretrained("outputs/sft/lfm2_merged")
print("Merged SFT model saved -> outputs/sft/lfm2_merged")
```

We clear the earlier training objects from memory to free GPU resources. We reload the base LFM2 model in fp16 or bf16 and attach the trained SFT LoRA adapter. We then merge the adapter into the base model and save the merged SFT checkpoint for the next stage.
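To show what merging an adapter into the base model means numerically, the sketch below folds a toy LoRA update into a weight matrix, assuming the standard LoRA scaling `W' = W + (alpha / r) * B @ A`. It then checks that the merged weight gives the same output as the separate base-plus-adapter path. The helpers `matmul` and `merge_lora` and all matrix values are hypothetical, and the real merge operates on tensors layer by layer rather than on Python lists.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(w, a, b, alpha, r):
    """Fold a LoRA update into the base weight: W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy shapes: W is 2x3 and the rank is 1, so B is 2x1 and A is 1x3 (values are made up).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B = [[2.0], [1.0]]
ALPHA, R = 2, 1  # same alpha / r ratio as lora_alpha=32, r=16
x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Merged path: a single matrix multiply with the folded weight.
merged = merge_lora(W, A, B, ALPHA, R)
y_merged = matmul(merged, x)

# Adapter path: base output plus the scaled low-rank correction.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
y_adapter = [[base_out[i][0] + (ALPHA / R) * lora_out[i][0]] for i in range(len(W))]

print(y_merged, y_adapter)
```

Because both paths agree, the merged checkpoint can be served without PEFT installed and without the extra adapter computation at inference time.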
```python
if RUN_DPO:
    # Three toy preference examples, repeated 20 times to form a small demo dataset.
    pref_rows = [
        {"prompt": [{"role": "user", "content": "Reply to a customer whose order is late."}],
         "chosen": [{"role": "assistant", "content": "I'm sorry your order is delayed. I've checked your tracking and it will arrive within 2 days. Here's a 10% credit for the inconvenience."}],
         "rejected": [{"role": "assistant", "content": "Orders are sometimes late. Please wait."}]},
        {"prompt": [{"role": "user", "content": "Summarize the benefit of edge AI in one line."}],
         "chosen": [{"role": "assistant", "content": "Edge AI runs models locally, giving low latency, offline reliability, and stronger privacy."}],
         "rejected": [{"role": "assistant", "content": "Edge AI is AI on the edge of things and it is good."}]},
        {"prompt": [{"role": "user", "content": "Decline a meeting politely."}],
         "chosen": [{"role": "assistant", "content": "Thanks for the invite, but I have a conflict then. Could we find another slot this week?"}],
         "rejected": [{"role": "assistant", "content": "No."}]},
    ] * 20
    pref_ds = Dataset.from_list(pref_rows)

    lora_dpo = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
        task_type="CAUSAL_LM", target_modules="all-linear",
    )

    dpo_cfg = DPOConfig(
        output_dir="outputs/dpo/lfm2_demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,
        max_length=MAX_LEN,
        max_prompt_length=512,
        max_steps=DPO_STEPS,
        logging_steps=10,
        save_strategy="no",
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        bf16=BF16,
        fp16=not BF16,
        report_to="none",
    )

    # With a PEFT config and ref_model=None, TRL uses the model with the
    # adapter disabled as the frozen reference policy.
    dpo_trainer = DPOTrainer(
        model=sft_merged,
        ref_model=None,
        args=dpo_cfg,
        train_dataset=pref_ds,
        processing_class=tokenizer,
        peft_config=lora_dpo,
    )
    dpo_trainer.train()

    # Merge the DPO adapter and save the final checkpoint.
    final = dpo_trainer.model.merge_and_unload()
    final.save_pretrained("outputs/final/lfm2_sft_dpo")
    tokenizer.save_pretrained("outputs/final/lfm2_sft_dpo")
    print("\n=== AFTER SFT + DPO ===\n", chat(final, PROBE))  # probe the merged model
    print("Final model saved -> outputs/final/lfm2_sft_dpo")

print("\nDone. Compare the BASELINE vs AFTER-SFT(+DPO) outputs above.")
```

We optionally run DPO on a small set of prompts, each paired with a chosen and a rejected response. We configure another LoRA adapter for preference tuning and train the SFT-merged model with DPO. Finally, we merge the DPO adapter, save the final model checkpoint, and compare the result against the earlier outputs.

In conclusion, we built a full fine-tuning pipeline for LFM2 using only open-source tools: Transformers, TRL, PEFT, datasets, and bitsandbytes. We used QLoRA to make training efficient on Colab GPUs, applied supervised fine-tuning to chat-formatted data, merged the trained adapter into the base model, and optionally refined the model further through DPO. The workflow gives a clear view of how modern LLM fine-tuning works in practice, from loading the model to producing a final checkpoint that can be compared against the original baseline and prepared for deployment.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.