95. Fine-Tuning LLMs: Make a General Model Do Your Specific Job

Fine-tuning adapts a general pre-trained language model to perform a specific task by continuing its training on a smaller, task-specific dataset, allowing it to gain deep domain knowledge and understand required formats without losing its broad language capabilities. The article outlines three main approaches: full fine-tuning (updating all weights for best results but at high cost), feature extraction (freezing the model and training only a new head), and parameter-efficient methods like LoRA (adding small trainable modules). It emphasizes that high-quality, task-specific data is more critical than the model itself for successful fine-tuning.

A general language model knows a little about everything. It knows some medicine. Some law. Some code. Some cooking. But it doesn't know your specific domain deeply. It doesn't know your company's tone, your product's terminology, or your task's format. Fine-tuning fixes this. You take a pretrained model that already understands language and specialize it for your specific task with a fraction of the data and compute you'd need to train from scratch. This post covers how to do it properly. What You'll Learn Here - What fine-tuning actually does to a pretrained model - The three types of fine-tuning and when to use each - Preparing datasets for instruction fine-tuning - Full fine-tuning with the HuggingFace Trainer - Evaluating fine-tuned models properly - Catastrophic forgetting and how to avoid it - Tips that actually make a difference What Fine-Tuning Does A pretrained LLM has learned a general representation of language from billions of tokens. Its weights encode grammar, facts, reasoning patterns, and world knowledge. Fine-tuning continues training on a smaller, task-specific dataset. The model adapts its weights slightly to specialize. The key word is slightly. You don't want to destroy the general knowledge. You want to build on it. Pretrained model: - Knows language deeply - Broad but shallow domain knowledge - No concept of your task format After fine-tuning: - Still knows language - Deep knowledge of your domain - Understands your task format - Responds in your required style The weights change. But not completely. A well-fine-tuned model retains its general capabilities while gaining task-specific expertise. Three Types of Fine-Tuning Type 1: Full Fine-Tuning Update all weights. Best results. Expensive. Needs lots of data. Risk of catastrophic forgetting. Type 2: Feature Extraction Frozen backbone Freeze the pretrained model. Only train a new head classification layer, etc. . Fast. Needs very little data. Limited adaptation. Type 3: Parameter-Efficient Fine-Tuning LoRA, adapters Add small trainable modules. Freeze most of the model. Train only a tiny fraction of parameters. Best of both worlds. Covered deeply in Post 96. Type 1: Full fine-tuning for param in model.parameters : param.requires grad = True all params update Type 2: Frozen backbone for param in model.base model.parameters : param.requires grad = False freeze backbone only classifier head trains Type 3: LoRA simplified Covered in Post 96 Dataset Preparation Good data beats a good model almost every time. This is where most fine-tuning projects live or die. For classification fine-tuning: python from datasets import Dataset, DatasetDict import pandas as pd Your labeled data data = { 'text': "The patient presented with acute chest pain radiating to the left arm.", "The quarterly earnings exceeded analyst expectations by 15%.", "The defendant claims he was not present at the scene of the crime.", "Treatment with metformin reduced HbA1c levels significantly.", "Revenue growth was driven by strong performance in cloud services.", "The prosecution presented DNA evidence linking the suspect to the crime.", "MRI results showed no signs of cerebral hemorrhage.", "Operating margins expanded by 200 basis points year over year.", "The jury found the defendant not guilty on all counts.", "The patient was discharged after a three-day hospitalization.", , 'label': 0, 1, 2, 0, 1, 2, 0, 1, 2, 0 0=medical, 1=finance, 2=legal } df = pd.DataFrame data Train/val split from sklearn.model selection import train test split train df, val df = train test split df, test size=0.2, random state=42, stratify=df 'label' train dataset = Dataset.from pandas train df.reset index drop=True val dataset = Dataset.from pandas val df.reset index drop=True dataset = DatasetDict {'train': train dataset, 'validation': val dataset} print dataset For instruction fine-tuning making a model follow prompts : python Instruction format used by most modern LLMs def format instruction example : return f""" Instruction: {example 'instruction' } Input: {example 'input' } Response: {example 'output' }""" Example instruction dataset instruction data = { 'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.', 'input': 'Patient reports persistent cough and shortness of breath for 3 weeks.', 'output': 'symptom' }, { 'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.', 'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.', 'output': 'treatment' }, { 'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.', 'input': 'Confirmed diagnosis of type 2 diabetes mellitus based on HbA1c of 7.8%.', 'output': 'diagnosis' }, for example in instruction data: print format instruction example print "-" 50 Data Quality Checklist Before fine-tuning, verify your data: python import pandas as pd import numpy as np def audit dataset df, text col='text', label col='label' : print "=" 50 print "DATASET AUDIT REPORT" print "=" 50 Size print f"\nTotal examples: {len df :,}" Class distribution print f"\nClass distribution:" dist = df label col .value counts normalize=True for label, pct in dist.items : count = df label col .value counts label print f" Class {label}: {count} {pct:.1%} " Imbalance check max class = dist.max min class = dist.min ratio = max class / min class if ratio 5: print f" WARNING: Imbalance ratio {ratio:.1f}x. Consider oversampling or class weights." Text length lengths = df text col .str.len print f"\nText length:" print f" Min: {lengths.min }" print f" Max: {lengths.max }" print f" Median: {lengths.median :.0f}" print f" Mean: {lengths.mean :.0f}" Long texts warning if lengths.max 512 4: rough estimate of 512 tokens print f" WARNING: Some texts may exceed token limits. Check truncation strategy." Duplicates n dupes = df text col .duplicated .sum if n dupes 0: print f"\n WARNING: {n dupes} duplicate texts found. Remove before training." Missing values missing = df.isnull .sum .sum if missing 0: print f"\n WARNING: {missing} missing values found." else: print f"\nNo missing values." print "=" 50 audit dataset pd.DataFrame data Full Fine-Tuning for Sequence Classification python from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding, EarlyStoppingCallback import evaluate import numpy as np import torch model name = 'distilbert-base-uncased' num labels = 3 label names = 'medical', 'finance', 'legal' id2label = {i: l for i, l in enumerate label names } label2id = {l: i for i, l in enumerate label names } Tokenizer tokenizer = AutoTokenizer.from pretrained model name def tokenize function examples : return tokenizer examples 'text' , truncation=True, padding=False, DataCollator will pad dynamically max length=256 tokenized train = train dataset.map tokenize function, batched=True tokenized val = val dataset.map tokenize function, batched=True Model model = AutoModelForSequenceClassification.from pretrained model name, num labels=num labels, id2label=id2label, label2id=label2id Metrics accuracy = evaluate.load 'accuracy' f1 metric = evaluate.load 'f1' def compute metrics eval pred : logits, labels = eval pred predictions = np.argmax logits, axis=-1 acc = accuracy.compute predictions=predictions, references=labels 'accuracy' f1 = f1 metric.compute predictions=predictions, references=labels, average='weighted' 'f1' return {'accuracy': acc, 'f1': f1} Training arguments training args = TrainingArguments output dir='./checkpoints/domain classifier', Training schedule num train epochs=5, per device train batch size=8, per device eval batch size=16, Optimization learning rate=2e-5, weight decay=0.01, warmup ratio=0.1, warmup for 10% of steps lr scheduler type='cosine', cosine decay after warmup Evaluation evaluation strategy='epoch', save strategy='epoch', load best model at end=True, metric for best model='f1', greater is better=True, Logging logging steps=10, logging dir='./logs', report to='none', Efficiency fp16=torch.cuda.is available , mixed precision on GPU dataloader num workers=0, Trainer trainer = Trainer model=model, args=training args, train dataset=tokenized train, eval dataset=tokenized val, tokenizer=tokenizer, data collator=DataCollatorWithPadding tokenizer , compute metrics=compute metrics, callbacks= EarlyStoppingCallback early stopping patience=2 Train print "Starting fine-tuning..." trainer.train Evaluate results = trainer.evaluate print f"\nFinal Results:" print f" Accuracy: {results 'eval accuracy' :.3f}" print f" F1: {results 'eval f1' :.3f}" Evaluating a Fine-Tuned Model Properly Accuracy alone isn't enough. Look at per-class performance, confusion matrix, and error cases. python from sklearn.metrics import classification report, confusion matrix import matplotlib.pyplot as plt import seaborn as sns import torch Get predictions on validation set model.eval all preds = all labels = val dataloader = trainer.get eval dataloader with torch.no grad : for batch in val dataloader: batch = {k: v.to model.device for k, v in batch.items } outputs = model batch preds = torch.argmax outputs.logits, dim=-1 all preds.extend preds.cpu .numpy all labels.extend batch 'labels' .cpu .numpy Classification report print "Classification Report:" print classification report all labels, all preds, target names=label names Confusion matrix cm = confusion matrix all labels, all preds plt.figure figsize= 7, 5 sns.heatmap cm, annot=True, fmt='d', cmap='Blues', xticklabels=label names, yticklabels=label names plt.ylabel 'True Label' plt.xlabel 'Predicted Label' plt.title 'Confusion Matrix - Fine-tuned DistilBERT' plt.tight layout plt.savefig 'fine tune confusion.png', dpi=100 plt.show Error analysis: look at what the model gets wrong errors = texts = val df 'text' .tolist for i, pred, true in enumerate zip all preds, all labels : if pred = true: errors.append { 'text': texts i , 'true': label names true , 'predicted': label names pred } print f"\nErrors {len errors } out of {len all labels } :" for e in errors: print f"\n True: {e 'true' }, Predicted: {e 'predicted' }" print f" Text: '{e 'text' :80 }...'" Error analysis is often the most valuable step. Understanding why the model gets specific examples wrong tells you what data to add next. Catastrophic Forgetting: The Real Risk When you fine-tune on a small dataset, the model can forget what it learned during pretraining. Weights move too far from their pretrained values. General capabilities degrade. Signs of catastrophic forgetting: 1. Model performs well on your task but fails on general text 2. Perplexity on general text spikes 3. Model generates incoherent text outside your domain Prevent it with: 1. Low learning rate 2e-5 is usually safe for BERT-based models training args safe = TrainingArguments learning rate=2e-5, not 1e-3 or 1e-4 weight decay=0.01, L2 regularization warmup ratio=0.1, num train epochs=3, not 50 output dir='./safe ft' 2. Freeze early layers they contain general language knowledge def freeze early layers model, n frozen layers=4 : Freeze embedding layers for param in model.distilbert.embeddings.parameters : param.requires grad = False Freeze first n transformer layers for layer in model.distilbert.transformer.layer :n frozen layers : for param in layer.parameters : param.requires grad = False trainable = sum p.numel for p in model.parameters if p.requires grad total = sum p.numel for p in model.parameters print f"Trainable: {trainable:,} / {total:,} {trainable/total:.1%} " freeze early layers model, n frozen layers=4 3. Use a small dataset? Consider LoRA Post 96 instead of full fine-tuning Instruction Fine-Tuning a Generative Model For causal LLMs GPT-style , you format the data as prompts and completions. python from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer from datasets import Dataset import torch Load a small generative model model name = 'gpt2' tokenizer = AutoTokenizer.from pretrained model name tokenizer.pad token = tokenizer.eos token model = AutoModelForCausalLM.from pretrained model name model.config.use cache = False required for gradient checkpointing Instruction dataset instructions = { 'prompt': " Instruction:\nSummarize this in one sentence.\n\n Input:\nMachine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to parse data, learn from it, and make informed decisions.\n\n Response:\n", 'completion': "Machine learning allows computers to learn from data and make decisions without explicit programming." }, { 'prompt': " Instruction:\nSummarize this in one sentence.\n\n Input:\nThe Eiffel Tower, located in Paris, France, was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair and stands 330 meters tall.\n\n Response:\n", 'completion': "The Eiffel Tower is a 330-meter structure in Paris built in 1889 as the entrance arch for the World's Fair." }, Tokenize: concatenate prompt + completion, mask prompt in loss def tokenize instruction example, max length=256 : full text = example 'prompt' + example 'completion' + tokenizer.eos token tokenized = tokenizer full text, max length=max length, truncation=True, padding='max length', return tensors='pt' input ids = tokenized 'input ids' 0 labels = input ids.clone Mask the prompt tokens in loss we only want to train on completions prompt ids = tokenizer example 'prompt' , return tensors='pt' 'input ids' 0 prompt len = len prompt ids labels :prompt len = -100 -100 is ignored in CrossEntropyLoss return { 'input ids': input ids, 'attention mask': tokenized 'attention mask' 0 , 'labels': labels } tokenized data = tokenize instruction ex for ex in instructions Convert to dataset import torch class InstructionDataset torch.utils.data.Dataset : def init self, data : self.data = data def len self : return len self.data def getitem self, idx : return self.data idx train ds = InstructionDataset tokenized data Fine-tune training args = TrainingArguments output dir='./instruct model', num train epochs=3, per device train batch size=1, gradient accumulation steps=4, effective batch size = 4 learning rate=2e-5, warmup steps=10, logging steps=5, save steps=50, report to='none', fp16=torch.cuda.is available trainer = Trainer model=model, args=training args, train dataset=train ds, trainer.train print "Instruction fine-tuning complete" Testing Your Fine-Tuned Model python Test the fine-tuned generative model model.eval def generate response prompt, max new tokens=100, temperature=0.7 : inputs = tokenizer prompt, return tensors='pt' .to model.device with torch.no grad : output = model.generate inputs, max new tokens=max new tokens, temperature=temperature, do sample=True, pad token id=tokenizer.eos token id generated = output 0 inputs 'input ids' .shape 1 : return tokenizer.decode generated, skip special tokens=True Test prompt test prompt = """ Instruction: Summarize this in one sentence. Input: Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information using connectionist approaches to computation. Response: """ response = generate response test prompt print f"Generated response:\n{response}" Fine-Tuning Best Practices Summary of what actually works best practices = { 'learning rate': { 'BERT-based classification ': '2e-5 to 5e-5', 'GPT-based generation ': '1e-5 to 3e-5', 'Frozen backbone': '1e-3 to 1e-4 for head only' }, 'batch size': { 'recommendation': '16 or 32 if memory allows', 'small GPU': 'batch=4 + gradient accumulation=4' }, 'epochs': { 'BERT classification': '2 to 4', 'GPT generation': '1 to 3', 'note': 'More epochs = more overfitting risk' }, 'data size': { 'frozen backbone': 'Works with 100+ examples', 'full fine-tuning': 'Need 1000+ for reliable results', 'instruction FT': '1000 to 10000 good examples' }, 'stopping': { 'recommendation': 'Always use early stopping', 'metric': 'Monitor validation loss, not training loss' } } for category, details in best practices.items : print f"\n{category.upper }:" for k, v in details.items : print f" {k}: {v}" Quick Cheat Sheet | Decision | Guidance | |---|---| | How much data do I have? | < 500: freeze backbone. 500-5k: full fine-tune. 5k: great | | Which model to start with? | DistilBERT for speed, RoBERTa for accuracy | | Learning rate | 2e-5 for BERT, 1e-5 for GPT, never 5e-5 | | Epochs | 2-4, use early stopping | | Catastrophic forgetting | Lower LR, freeze early layers, fewer epochs | | Model not learning | Raise LR, check data quality, check label correctness | | Model overfitting | Lower LR, add dropout, add more data, use LoRA | | Task | Code | |---|---| | Load model | AutoModelForSequenceClassification.from pretrained name, num labels=N | | Tokenize | tokenizer texts, truncation=True, padding=False, max length=256 | | Train | Trainer model, args, train dataset, eval dataset | | Early stop | EarlyStoppingCallback early stopping patience=2 | | Save | trainer.save model './my model' | | Predict | trainer.predict test dataset | Practice Challenges Level 1: Download any small labeled text dataset from the HuggingFace hub. Fine-tune distilbert-base-uncased on it for 3 epochs. Print the classification report. Compare to a TF-IDF + LogisticRegression baseline. Level 2: Fine-tune with and without freezing the first 4 transformer layers. Compare final F1 scores and training time. Which approach is better for your dataset size? Level 3: Create your own instruction dataset of 50+ examples for a specific task code explanation, medical text classification, legal summarization . Fine-tune GPT-2 on it. Test the model with 10 new prompts it hasn't seen. Rate the responses 1-5 and report average quality. References HuggingFace: Fine-tuning tutorial https://huggingface.co/docs/transformers/training HuggingFace: TrainingArguments docs https://huggingface.co/docs/transformers/main classes/trainer Stanford Alpaca: instruction fine-tuning https://crfm.stanford.edu/2023/03/13/alpaca.html HuggingFace: PEFT library for LoRA https://huggingface.co/docs/peft Next up, Post 96:LoRA: Fine-Tune a Billion-Parameter Model on a Laptop. Parameter-efficient fine-tuning using rank decomposition. Train 1% of parameters and get 95% of the performance of full fine-tuning.