{"slug": "95-fine-tuning-llms-make-a-general-model-do-your-specific-job", "title": "95. Fine-Tuning LLMs: Make a General Model Do Your Specific Job", "summary": "Fine-tuning adapts a general pre-trained language model to perform a specific task by continuing its training on a smaller, task-specific dataset, allowing it to gain deep domain knowledge and understand required formats without losing its broad language capabilities. The article outlines three main approaches: full fine-tuning (updating all weights for best results but at high cost), feature extraction (freezing the model and training only a new head), and parameter-efficient methods like LoRA (adding small trainable modules). It emphasizes that high-quality, task-specific data is more critical than the model itself for successful fine-tuning.", "body_md": "A general language model knows a little about everything.\n\nIt knows some medicine. Some law. Some code. Some cooking. But it doesn't know your specific domain deeply. It doesn't know your company's tone, your product's terminology, or your task's format.\n\nFine-tuning fixes this. You take a pretrained model that already understands language and specialize it for your specific task with a fraction of the data and compute you'd need to train from scratch.\n\nThis post covers how to do it properly.\n\n### What You'll Learn Here\n\n- What fine-tuning actually does to a pretrained model\n- The three types of fine-tuning and when to use each\n- Preparing datasets for instruction fine-tuning\n- Full fine-tuning with the HuggingFace Trainer\n- Evaluating fine-tuned models properly\n- Catastrophic forgetting and how to avoid it\n- Tips that actually make a difference\n\n### What Fine-Tuning Does\n\nA pretrained LLM has learned a general representation of language from billions of tokens. Its weights encode grammar, facts, reasoning patterns, and world knowledge.\n\nFine-tuning continues training on a smaller, task-specific dataset. 
The model adapts its weights slightly to specialize. The key word is slightly. You don't want to destroy the general knowledge. You want to build on it.\n\n```\nPretrained model:\n  - Knows language deeply\n  - Broad but shallow domain knowledge\n  - No concept of your task format\n\nAfter fine-tuning:\n  - Still knows language\n  - Deep knowledge of your domain\n  - Understands your task format\n  - Responds in your required style\n```\n\nThe weights change. But not completely. A well-fine-tuned model retains its general capabilities while gaining task-specific expertise.\n\n### Three Types of Fine-Tuning\n\n**Type 1: Full Fine-Tuning**\n\nUpdate all weights. Best results. Expensive. Needs lots of data. Risk of catastrophic forgetting.\n\n**Type 2: Feature Extraction (Frozen backbone)**\n\nFreeze the pretrained model. Only train a new head (classification layer, etc.). Fast. Needs very little data. Limited adaptation.\n\n**Type 3: Parameter-Efficient Fine-Tuning (LoRA, adapters)**\n\nAdd small trainable modules. Freeze most of the model. Train only a tiny fraction of parameters. Best of both worlds. Covered deeply in Post 96.\n\n```\n# Type 1: Full fine-tuning\nfor param in model.parameters():\n    param.requires_grad = True   # all params update\n\n# Type 2: Frozen backbone\nfor param in model.base_model.parameters():\n    param.requires_grad = False  # freeze backbone\n# only classifier head trains\n\n# Type 3: LoRA (simplified)\n# Covered in Post 96\n```\n\n### Dataset Preparation\n\nGood data beats a good model almost every time. 
This is where most fine-tuning projects live or die.\n\n**For classification fine-tuning:**\n\n``` python\nfrom datasets import Dataset, DatasetDict\nimport pandas as pd\n\n# Your labeled data\ndata = {\n    'text': [\n        \"The patient presented with acute chest pain radiating to the left arm.\",\n        \"The quarterly earnings exceeded analyst expectations by 15%.\",\n        \"The defendant claims he was not present at the scene of the crime.\",\n        \"Treatment with metformin reduced HbA1c levels significantly.\",\n        \"Revenue growth was driven by strong performance in cloud services.\",\n        \"The prosecution presented DNA evidence linking the suspect to the crime.\",\n        \"MRI results showed no signs of cerebral hemorrhage.\",\n        \"Operating margins expanded by 200 basis points year over year.\",\n        \"The jury found the defendant not guilty on all counts.\",\n        \"The patient was discharged after a three-day hospitalization.\",\n    ],\n    'label': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # 0=medical, 1=finance, 2=legal\n}\n\ndf = pd.DataFrame(data)\n\n# Train/val split\nfrom sklearn.model_selection import train_test_split\n# A stratified split needs at least one validation example per class,\n# so hold out 3 of the 10 rows (test_size=0.2 would leave only 2 and raise a ValueError)\ntrain_df, val_df = train_test_split(df, test_size=3, random_state=42, stratify=df['label'])\n\ntrain_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))\nval_dataset   = Dataset.from_pandas(val_df.reset_index(drop=True))\n\ndataset = DatasetDict({'train': train_dataset, 'validation': val_dataset})\nprint(dataset)\n```\n\n**For instruction fine-tuning (making a model follow prompts):**\n\n``` python\n# Alpaca-style instruction format (chat models usually ship their own chat template instead)\ndef format_instruction(example):\n    return f\"\"\"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}\"\"\"\n\n# Example instruction dataset\ninstruction_data = [\n    {\n        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',\n        'input': 
'Patient reports persistent cough and shortness of breath for 3 weeks.',\n        'output': 'symptom'\n    },\n    {\n        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',\n        'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.',\n        'output': 'treatment'\n    },\n    {\n        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',\n        'input': 'Confirmed diagnosis of type 2 diabetes mellitus based on HbA1c of 7.8%.',\n        'output': 'diagnosis'\n    },\n]\n\nfor example in instruction_data:\n    print(format_instruction(example))\n    print(\"-\" * 50)\n```\n\n### Data Quality Checklist\n\nBefore fine-tuning, verify your data:\n\n``` python\nimport pandas as pd\nimport numpy as np\n\ndef audit_dataset(df, text_col='text', label_col='label'):\n    print(\"=\" * 50)\n    print(\"DATASET AUDIT REPORT\")\n    print(\"=\" * 50)\n\n    # Size\n    print(f\"\\nTotal examples: {len(df):,}\")\n\n    # Class distribution\n    print(f\"\\nClass distribution:\")\n    dist = df[label_col].value_counts(normalize=True)\n    for label, pct in dist.items():\n        count = df[label_col].value_counts()[label]\n        print(f\"  Class {label}: {count} ({pct:.1%})\")\n\n    # Imbalance check\n    max_class = dist.max()\n    min_class = dist.min()\n    ratio     = max_class / min_class\n    if ratio > 5:\n        print(f\"  WARNING: Imbalance ratio {ratio:.1f}x. Consider oversampling or class weights.\")\n\n    # Text length\n    lengths = df[text_col].str.len()\n    print(f\"\\nText length:\")\n    print(f\"  Min:    {lengths.min()}\")\n    print(f\"  Max:    {lengths.max()}\")\n    print(f\"  Median: {lengths.median():.0f}\")\n    print(f\"  Mean:   {lengths.mean():.0f}\")\n\n    # Long texts warning\n    if lengths.max() > 512 * 4:  # rough estimate of 512 tokens\n        print(f\"  WARNING: Some texts may exceed token limits. 
Check truncation strategy.\")\n\n    # Duplicates\n    n_dupes = df[text_col].duplicated().sum()\n    if n_dupes > 0:\n        print(f\"\\n  WARNING: {n_dupes} duplicate texts found. Remove before training.\")\n\n    # Missing values\n    missing = df.isnull().sum().sum()\n    if missing > 0:\n        print(f\"\\n  WARNING: {missing} missing values found.\")\n    else:\n        print(f\"\\nNo missing values.\")\n\n    print(\"=\" * 50)\n\naudit_dataset(pd.DataFrame(data))\n```\n\n### Full Fine-Tuning for Sequence Classification\n\n``` python\nfrom transformers import (\n    AutoTokenizer,\n    AutoModelForSequenceClassification,\n    TrainingArguments,\n    Trainer,\n    DataCollatorWithPadding,\n    EarlyStoppingCallback\n)\nimport evaluate\nimport numpy as np\nimport torch\n\nmodel_name  = 'distilbert-base-uncased'\nnum_labels  = 3\nlabel_names = ['medical', 'finance', 'legal']\n\nid2label = {i: l for i, l in enumerate(label_names)}\nlabel2id = {l: i for i, l in enumerate(label_names)}\n\n# Tokenizer\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\ndef tokenize_function(examples):\n    return tokenizer(\n        examples['text'],\n        truncation=True,\n        padding=False,       # DataCollator will pad dynamically\n        max_length=256\n    )\n\ntokenized_train = train_dataset.map(tokenize_function, batched=True)\ntokenized_val   = val_dataset.map(tokenize_function, batched=True)\n\n# Model\nmodel = AutoModelForSequenceClassification.from_pretrained(\n    model_name,\n    num_labels=num_labels,\n    id2label=id2label,\n    label2id=label2id\n)\n\n# Metrics\naccuracy = evaluate.load('accuracy')\nf1_metric = evaluate.load('f1')\n\ndef compute_metrics(eval_pred):\n    logits, labels = eval_pred\n    predictions    = np.argmax(logits, axis=-1)\n    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']\n    f1  = f1_metric.compute(\n        predictions=predictions, references=labels, average='weighted'\n    )['f1']\n    return 
{'accuracy': acc, 'f1': f1}\n\n# Training arguments\ntraining_args = TrainingArguments(\n    output_dir='./checkpoints/domain_classifier',\n\n    # Training schedule\n    num_train_epochs=5,\n    per_device_train_batch_size=8,\n    per_device_eval_batch_size=16,\n\n    # Optimization\n    learning_rate=2e-5,\n    weight_decay=0.01,\n    warmup_ratio=0.1,            # warmup for 10% of steps\n    lr_scheduler_type='cosine',  # cosine decay after warmup\n\n    # Evaluation\n    eval_strategy='epoch',       # named evaluation_strategy before transformers 4.41\n    save_strategy='epoch',\n    load_best_model_at_end=True,\n    metric_for_best_model='f1',\n    greater_is_better=True,\n\n    # Logging\n    logging_steps=10,\n    logging_dir='./logs',\n    report_to='none',\n\n    # Efficiency\n    fp16=torch.cuda.is_available(),  # mixed precision on GPU\n    dataloader_num_workers=0,\n)\n\n# Trainer\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=tokenized_train,\n    eval_dataset=tokenized_val,\n    tokenizer=tokenizer,         # recent transformers releases rename this argument to processing_class\n    data_collator=DataCollatorWithPadding(tokenizer),\n    compute_metrics=compute_metrics,\n    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]\n)\n\n# Train\nprint(\"Starting fine-tuning...\")\ntrainer.train()\n\n# Evaluate\nresults = trainer.evaluate()\nprint(f\"\\nFinal Results:\")\nprint(f\"  Accuracy: {results['eval_accuracy']:.3f}\")\nprint(f\"  F1:       {results['eval_f1']:.3f}\")\n```\n\n### Evaluating a Fine-Tuned Model Properly\n\nAccuracy alone isn't enough. 
Look at per-class performance, the confusion matrix, and individual error cases.\n\n``` python\nfrom sklearn.metrics import classification_report, confusion_matrix\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport torch\n\n# Get predictions on validation set\nmodel.eval()\nall_preds  = []\nall_labels = []\n\nval_dataloader = trainer.get_eval_dataloader()\n\nwith torch.no_grad():\n    for batch in val_dataloader:\n        batch   = {k: v.to(model.device) for k, v in batch.items()}\n        outputs = model(**batch)\n        preds   = torch.argmax(outputs.logits, dim=-1)\n\n        all_preds.extend(preds.cpu().numpy())\n        all_labels.extend(batch['labels'].cpu().numpy())\n\n# Classification report\nclass_ids = list(range(len(label_names)))  # pin label order so a class missing from this split cannot break the report\nprint(\"Classification Report:\")\nprint(classification_report(all_labels, all_preds, labels=class_ids, target_names=label_names))\n\n# Confusion matrix\ncm = confusion_matrix(all_labels, all_preds, labels=class_ids)\nplt.figure(figsize=(7, 5))\nsns.heatmap(cm, annot=True, fmt='d', cmap='Blues',\n            xticklabels=label_names, yticklabels=label_names)\nplt.ylabel('True Label')\nplt.xlabel('Predicted Label')\nplt.title('Confusion Matrix - Fine-tuned DistilBERT')\nplt.tight_layout()\nplt.savefig('fine_tune_confusion.png', dpi=100)\nplt.show()\n\n# Error analysis: look at what the model gets wrong\nerrors = []\ntexts  = val_df['text'].tolist()\n\nfor i, (pred, true) in enumerate(zip(all_preds, all_labels)):\n    if pred != true:\n        errors.append({\n            'text':      texts[i],\n            'true':      label_names[true],\n            'predicted': label_names[pred]\n        })\n\nprint(f\"\\nErrors ({len(errors)} out of {len(all_labels)}):\")\nfor e in errors:\n    print(f\"\\n  True: {e['true']}, Predicted: {e['predicted']}\")\n    print(f\"  Text: '{e['text'][:80]}...'\")\n```\n\nError analysis is often the most valuable step. 
Understanding why the model gets specific examples wrong tells you what data to add next.\n\n### Catastrophic Forgetting: The Real Risk\n\nWhen you fine-tune on a small dataset, the model can forget what it learned during pretraining. Weights move too far from their pretrained values. General capabilities degrade.\n\n```\n# Signs of catastrophic forgetting:\n# 1. Model performs well on your task but fails on general text\n# 2. Perplexity on general text spikes\n# 3. Model generates incoherent text outside your domain\n\n# Prevent it with:\n\n# 1. Low learning rate (2e-5 is usually safe for BERT-based models)\ntraining_args_safe = TrainingArguments(\n    learning_rate=2e-5,        # not 1e-3 or 1e-4\n    weight_decay=0.01,         # L2 regularization\n    warmup_ratio=0.1,\n    num_train_epochs=3,        # not 50\n    output_dir='./safe_ft'\n)\n\n# 2. Freeze early layers (they contain general language knowledge)\ndef freeze_early_layers(model, n_frozen_layers=4):\n    # Freeze embedding layers\n    for param in model.distilbert.embeddings.parameters():\n        param.requires_grad = False\n\n    # Freeze first n transformer layers\n    for layer in model.distilbert.transformer.layer[:n_frozen_layers]:\n        for param in layer.parameters():\n            param.requires_grad = False\n\n    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n    total     = sum(p.numel() for p in model.parameters())\n    print(f\"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})\")\n\nfreeze_early_layers(model, n_frozen_layers=4)\n\n# 3. Use a small dataset? 
Consider LoRA (Post 96) instead of full fine-tuning\n```\n\n### Instruction Fine-Tuning a Generative Model\n\nFor causal LLMs (GPT-style), you format the data as prompts and completions.\n\n``` python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer\nfrom datasets import Dataset\nimport torch\n\n# Load a small generative model\nmodel_name = 'gpt2'\ntokenizer  = AutoTokenizer.from_pretrained(model_name)\ntokenizer.pad_token = tokenizer.eos_token\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name)\nmodel.config.use_cache = False   # the KV cache only helps at generation time, so turn it off for training\n\n# Instruction dataset\ninstructions = [\n    {\n        'prompt': \"### Instruction:\\nSummarize this in one sentence.\\n\\n### Input:\\nMachine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to parse data, learn from it, and make informed decisions.\\n\\n### Response:\\n\",\n        'completion': \"Machine learning allows computers to learn from data and make decisions without explicit programming.\"\n    },\n    {\n        'prompt': \"### Instruction:\\nSummarize this in one sentence.\\n\\n### Input:\\nThe Eiffel Tower, located in Paris, France, was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair and stands 330 meters tall.\\n\\n### Response:\\n\",\n        'completion': \"The Eiffel Tower is a 330-meter structure in Paris built in 1889 as the entrance arch for the World's Fair.\"\n    },\n]\n\n# Tokenize: concatenate prompt + completion, mask prompt in loss\ndef tokenize_instruction(example, max_length=256):\n    full_text = example['prompt'] + example['completion'] + tokenizer.eos_token\n\n    tokenized = tokenizer(\n        full_text,\n        max_length=max_length,\n        truncation=True,\n        padding='max_length',\n        return_tensors='pt'\n    )\n\n    input_ids  = tokenized['input_ids'][0]\n    labels     = 
input_ids.clone()\n\n    # Mask the prompt tokens in loss (we only want to train on completions)\n    prompt_ids = tokenizer(example['prompt'], return_tensors='pt')['input_ids'][0]\n    prompt_len = len(prompt_ids)\n    labels[:prompt_len] = -100   # -100 is ignored in CrossEntropyLoss\n\n    # Also ignore padding. pad_token is the EOS token here, so mask by attention_mask\n    # rather than by token id, which keeps the real end-of-sequence token in the loss\n    labels[tokenized['attention_mask'][0] == 0] = -100\n\n    return {\n        'input_ids':      input_ids,\n        'attention_mask': tokenized['attention_mask'][0],\n        'labels':         labels\n    }\n\ntokenized_data = [tokenize_instruction(ex) for ex in instructions]\n\n# Wrap the tokenized examples in a torch Dataset\nclass InstructionDataset(torch.utils.data.Dataset):\n    def __init__(self, data):\n        self.data = data\n    def __len__(self):\n        return len(self.data)\n    def __getitem__(self, idx):\n        return self.data[idx]\n\ntrain_ds = InstructionDataset(tokenized_data)\n\n# Fine-tune\ntraining_args = TrainingArguments(\n    output_dir='./instruct_model',\n    num_train_epochs=3,\n    per_device_train_batch_size=1,\n    gradient_accumulation_steps=4,   # effective batch size = 4\n    learning_rate=2e-5,\n    warmup_steps=10,\n    logging_steps=5,\n    save_steps=50,\n    report_to='none',\n    fp16=torch.cuda.is_available()\n)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=train_ds,\n)\n\ntrainer.train()\nprint(\"Instruction fine-tuning complete\")\n```\n\n### Testing Your Fine-Tuned Model\n\n``` python\n# Test the fine-tuned generative model\nmodel.eval()\n\ndef generate_response(prompt, max_new_tokens=100, temperature=0.7):\n    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)\n    with torch.no_grad():\n        output = model.generate(\n            **inputs,\n            max_new_tokens=max_new_tokens,\n            temperature=temperature,\n            do_sample=True,\n            pad_token_id=tokenizer.eos_token_id\n        )\n    generated = output[0][inputs['input_ids'].shape[1]:]\n    return tokenizer.decode(generated, skip_special_tokens=True)\n\n# Test 
prompt\ntest_prompt = \"\"\"### Instruction:\nSummarize this in one sentence.\n\n### Input:\nNeural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information using connectionist approaches to computation.\n\n### Response:\n\"\"\"\n\nresponse = generate_response(test_prompt)\nprint(f\"Generated response:\\n{response}\")\n```\n\n### Fine-Tuning Best Practices\n\n```\n# Summary of what actually works\n\nbest_practices = {\n    'learning_rate': {\n        'BERT-based (classification)': '2e-5 to 5e-5',\n        'GPT-based (generation)':      '1e-5 to 3e-5',\n        'Frozen backbone':             '1e-3 to 1e-4 for head only'\n    },\n    'batch_size': {\n        'recommendation': '16 or 32 if memory allows',\n        'small GPU':      'batch=4 + gradient_accumulation=4'\n    },\n    'epochs': {\n        'BERT classification': '2 to 4',\n        'GPT generation':      '1 to 3',\n        'note':                'More epochs = more overfitting risk'\n    },\n    'data_size': {\n        'frozen backbone':  'Works with 100+ examples',\n        'full fine-tuning': 'Need 1000+ for reliable results',\n        'instruction FT':   '1000 to 10000 good examples'\n    },\n    'stopping': {\n        'recommendation': 'Always use early stopping',\n        'metric':         'Monitor validation loss, not training loss'\n    }\n}\n\nfor category, details in best_practices.items():\n    print(f\"\\n{category.upper()}:\")\n    for k, v in details.items():\n        print(f\"  {k}: {v}\")\n```\n\n### Quick Cheat Sheet\n\n| Decision | Guidance |\n|---|---|\n| How much data do I have? | < 500: freeze backbone. 500-5k: full fine-tune. > 5k: great |\n| Which model to start with? 
| DistilBERT for speed, RoBERTa for accuracy |\n| Learning rate | 2e-5 for BERT, 1e-5 for GPT, never > 5e-5 when updating the backbone |\n| Epochs | 2-4, use early stopping |\n| Catastrophic forgetting | Lower LR, freeze early layers, fewer epochs |\n| Model not learning | Raise LR, check data quality, check label correctness |\n| Model overfitting | Lower LR, add dropout, add more data, use LoRA |\n\n| Task | Code |\n|---|---|\n| Load model | `AutoModelForSequenceClassification.from_pretrained(name, num_labels=N)` |\n| Tokenize | `tokenizer(texts, truncation=True, padding=False, max_length=256)` |\n| Train | `Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)` |\n| Early stop | `EarlyStoppingCallback(early_stopping_patience=2)` |\n| Save | `trainer.save_model('./my_model')` |\n| Predict | `trainer.predict(test_dataset)` |\n\n### Practice Challenges\n\n**Level 1:**\n\nDownload any small labeled text dataset from the HuggingFace hub. Fine-tune `distilbert-base-uncased` on it for 3 epochs. Print the classification report. Compare to a TF-IDF + LogisticRegression baseline.\n\n**Level 2:**\n\nFine-tune with and without freezing the first 4 transformer layers. Compare final F1 scores and training time. Which approach is better for your dataset size?\n\n**Level 3:**\n\nCreate your own instruction dataset of 50+ examples for a specific task (code explanation, medical text classification, legal summarization). Fine-tune GPT-2 on it. Test the model with 10 new prompts it hasn't seen. Rate the responses 1-5 and report average quality.\n\n### References\n\n- [HuggingFace: Fine-tuning tutorial](https://huggingface.co/docs/transformers/training)\n- [HuggingFace: TrainingArguments docs](https://huggingface.co/docs/transformers/main_classes/trainer)\n- [Stanford Alpaca: instruction fine-tuning](https://crfm.stanford.edu/2023/03/13/alpaca.html)\n- [HuggingFace: PEFT library (for LoRA)](https://huggingface.co/docs/peft)\n\nNext up, Post 96: LoRA: Fine-Tune a Billion-Parameter Model on a Laptop. 
Parameter-efficient fine-tuning using rank decomposition. Train 1% of parameters and get 95% of the performance of full fine-tuning.", "url": "https://wpnews.pro/news/95-fine-tuning-llms-make-a-general-model-do-your-specific-job", "canonical_source": "https://dev.to/yakhilesh/95-fine-tuning-llms-make-a-general-model-do-your-specific-job-44n2", "published_at": "2026-05-23 13:30:18+00:00", "updated_at": "2026-05-23 14:04:10.749917+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/95-fine-tuning-llms-make-a-general-model-do-your-specific-job", "markdown": "https://wpnews.pro/news/95-fine-tuning-llms-make-a-general-model-do-your-specific-job.md", "text": "https://wpnews.pro/news/95-fine-tuning-llms-make-a-general-model-do-your-specific-job.txt", "jsonld": "https://wpnews.pro/news/95-fine-tuning-llms-make-a-general-model-do-your-specific-job.jsonld"}}