# Fine-Tune Your First LLM: A Guide with PyTorch and Hugging Face

> Source: <https://pub.towardsai.net/fine-tune-your-first-llm-a-guide-with-pytorch-and-hugging-face-bc4cdfb156c3?source=rss----98111c9905da---4>
> Published: 2026-06-27 20:01:00+00:00

This article draws from the excellent full fine-tuning walkthrough at learnhuggingface.com as a primary study reference. What you are reading is my own version, restructured into a cleaner, more modular format built for beginners who want a straight line from setup to a working fine-tuned model.

Pre-trained language models know a lot about language in general. What they do not know is your specific task, your specific output format, or your specific domain. Fine-tuning is the process of taking a pre-trained model and teaching it exactly that.

In this article, you will fine-tune **Gemma 3 270M**, a small but capable open-source language model from Google, to perform one focused task: extracting food and drink items from raw text and returning them in a clean, structured format.

The input will be plain text like this:

```
British breakfast with baked beans, fried eggs, sausages, bacon, mushrooms, a cup of tea, toast and fried tomatoes.
```

And the output will be structured like this:

```
food_or_drink: 1
tags: fi, di
foods: baked beans, fried eggs, sausages, bacon, mushrooms, toast, fried tomatoes
drinks: tea
```

This specific task is called structured data extraction. It is one of the most common real-world use cases for fine-tuning: you have unstructured text, you want structured output, and a general-purpose model does not reliably give it to you in the format you need.
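To make the target format concrete, here is a minimal sketch that parses the compact output into a Python dictionary. The function name `parse_extraction` is hypothetical, and the sketch assumes each field sits on its own `key: value` line.

```python
def parse_extraction(text: str) -> dict:
    """Parse the compact extraction format into a dict (illustrative helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()

    def split_items(value: str) -> list:
        # Comma-separated values become a list; an empty value becomes [].
        return [item.strip() for item in value.split(",") if item.strip()]

    return {
        "food_or_drink": fields.get("food_or_drink") == "1",
        "tags": split_items(fields.get("tags", "")),
        "foods": split_items(fields.get("foods", "")),
        "drinks": split_items(fields.get("drinks", "")),
    }


example = """food_or_drink: 1
tags: fi, di
foods: baked beans, fried eggs, sausages, bacon, mushrooms, toast, fried tomatoes
drinks: tea"""

print(parse_extraction(example))
```

A parser like this is also handy later, when you want to compare model outputs with labels field by field instead of as raw strings.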

By the end of this article you will have a fine-tuned model saved locally and uploaded to Hugging Face, ready to use in any project.

You need a GPU for this. Training even a 270M parameter model on CPU is too slow to be practical.

The options in order of convenience: Google Colab with a T4 GPU (free tier), a local NVIDIA GPU with at least 8GB VRAM, or a cloud instance. If you are using Colab, go to Runtime, then Change Runtime Type, then select GPU before running anything.

Install the required libraries:

```
pip install transformers trl datasets accelerate gradio
```

And set one environment variable before anything else runs. This prevents tokenizer parallelism warnings that clutter your output without affecting anything:

``` python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Every step in this article maps to one of six modules. We move through them in order. No jumping ahead.

```
Module 1: Setup and Hardware Check
Module 2: Load the Base Model
Module 3: Load and Prepare the Dataset
Module 4: Configure and Run Training
Module 5: Evaluate and Save the Model
Module 6: Upload to Hugging Face Hub
```

That is the entire workflow. Let’s build it.

Before touching a model or a dataset, confirm that your hardware is ready and your libraries are imported. A misconfigured runtime is the most common reason beginner fine-tuning runs fail before they start.

```
# ============================================================
# MODULE 1: SETUP AND HARDWARE CHECK
# ============================================================
import os

import torch
import transformers
import trl
import datasets

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Detect the best available compute backend.
# CUDA = NVIDIA GPU. MPS = Apple Silicon. CPU = fallback.
# You genuinely need CUDA or MPS for this to be practical.
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"

print(f"Device: {DEVICE}")

# If you are on CUDA, print memory details so you know what you are working with.
# 8GB minimum is recommended for this model at bfloat16 precision.
if DEVICE == "cuda":
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Total VRAM: {total_memory:.2f} GB")
```

You should see your device printed clearly. If you see cpu, stop and fix your runtime before continuing. Training on CPU for this model will take hours instead of minutes.

A base model is the starting point. It already understands language in general — grammar, word relationships, how sentences are structured. Your fine-tuning run will take that general understanding and redirect it toward a specific task.

The model we are using is google/gemma-3-270m-it. The "it" suffix stands for instruction-tuned, meaning the model has already been aligned to follow user instructions. This makes it a better starting point than a raw base model for a task that involves following a specific output format.

The model does not read text. It reads numbers. A tokenizer is the translator between human-readable text and the integer sequences the model actually processes.

The tokenizer for this model converts:

```
"Hello my name is Daniel"→ [2, 9259, 1041, 1463, 563, 13108]
```

Every model has its own tokenizer tied to its specific vocabulary. Always load the tokenizer that matches the model. Using a mismatched tokenizer will produce garbage output.
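A toy word-level tokenizer makes both the round trip and the mismatch problem concrete. This is only an illustration: Gemma's real tokenizer works on subword pieces with a far larger vocabulary, and the class, vocabularies, and IDs below are invented.

```python
class ToyTokenizer:
    """Word-level tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, vocab):
        self.token_to_id = {token: i for i, token in enumerate(vocab)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}

    def encode(self, text):
        # Words missing from the vocabulary map to the <unk> token.
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(word, unk) for word in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


# Same words, different ID assignments: two "models" with different vocabularies.
tokenizer_a = ToyTokenizer(["<unk>", "Hello", "my", "name", "is", "Daniel"])
tokenizer_b = ToyTokenizer(["<unk>", "is", "Daniel", "Hello", "name", "my"])

ids = tokenizer_a.encode("Hello my name is Daniel")
print(ids)                      # [1, 2, 3, 4, 5]
print(tokenizer_a.decode(ids))  # Hello my name is Daniel
print(tokenizer_b.decode(ids))  # is Daniel Hello name my
```

The last line is the mismatch in miniature: the IDs are valid in both vocabularies, but they mean different things.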

```
# ============================================================
# MODULE 2: LOAD THE BASE MODEL
# ============================================================
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "google/gemma-3-270m-it"

# Load the model. dtype="auto" lets the model choose the best precision
# for your hardware (usually bfloat16 on modern GPUs).
# device_map="auto" places the model on your GPU automatically.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype="auto",
    device_map="auto",
    attn_implementation="eager"
)

# Load the tokenizer that matches this model exactly.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

print(f"Model loaded on: {model.device}")
print(f"Model precision: {model.dtype}")

# Count the total number of trainable parameters.
# In full fine-tuning, ALL parameters are updated during training.
# This is different from LoRA, where only a small adapter is trained.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
```

In **full fine-tuning**, every parameter in the model is updated during training. This gives the model the most flexibility to learn your task, but it requires more VRAM and takes longer. At 270M parameters, Gemma 3 270M is small enough to do this comfortably on a consumer GPU.

**LoRA** (Low-Rank Adaptation) is an alternative where you freeze the original model weights and only train a small set of adapter weights. This uses far less memory and is the preferred approach for larger models. For a 270M model with a specific task and a GPU with at least 8GB VRAM, full fine-tuning is the right choice.
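A back-of-the-envelope calculation shows why full fine-tuning is feasible at this size. The sketch assumes an Adam-style optimizer that keeps two extra values per parameter, and it ignores activations, which add to the total depending on batch size and sequence length. The byte counts are assumptions about precision, not measurements.

```python
def training_memory_gb(num_params, weight_bytes, grad_bytes, optimizer_bytes):
    """Estimate memory for weights, gradients, and optimizer state in GB."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 270e6  # nominal size taken from the model name

# Everything kept in bfloat16 (2 bytes); Adam stores two values per parameter.
all_bf16 = training_memory_gb(NUM_PARAMS, 2, 2, 2 * 2)

# Everything kept in float32 (4 bytes) as a more conservative bound.
all_fp32 = training_memory_gb(NUM_PARAMS, 4, 4, 2 * 4)

print(f"bfloat16: {all_bf16:.2f} GB")  # 2.16 GB
print(f"float32:  {all_fp32:.2f} GB")  # 4.32 GB
```

Both estimates sit well under 8 GB, which leaves headroom for activations. The same arithmetic at several billion parameters quickly exceeds a consumer GPU, which is where LoRA becomes attractive.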

Training data is what separates a generic model from a task-specific one. The dataset needs three things to be ready for training: it needs to be loaded, it needs to be formatted into the input-output structure the model expects, and it needs to be split into training and test sets.

We are using mrdbourke/FoodExtract-1k, a dataset of 1,420 samples. Each sample contains a raw text string and a structured label showing what food and drink items should be extracted from it.

The labels were generated by a large teacher model (gpt-oss-120b) and condensed into a compact format that uses fewer tokens. This compact format is what we will train our model to produce. Fewer output tokens means faster inference at deployment time.

Supervised fine-tuning (SFT) means you give the model paired examples of input and ideal output. The model learns to produce outputs like the examples by minimizing the difference between what it generates and what the label says it should generate.

In our case: input is a raw text string. Output is the structured food extraction format. Given enough examples, the model learns the pattern.
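The "difference" being minimized is typically measured with cross-entropy: the negative log of the probability the model assigns to each correct next token, averaged over the target tokens. The sketch below computes it by hand for a made-up three-token vocabulary. The scores are invented for illustration, and a real trainer computes the same quantity on tensors.

```python
import math


def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_position, target_ids):
    """Average negative log-probability of the correct token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


targets = [0, 2]
confident = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]  # high score on the right token
uncertain = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # equal score on every token

print(f"confident: {cross_entropy(confident, targets):.4f}")
print(f"uncertain: {cross_entropy(uncertain, targets):.4f}")  # ln(3), about 1.0986
```

Training nudges the weights so the loss moves from the "uncertain" end toward the "confident" end on your examples.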

```
# ============================================================
# MODULE 3: LOAD AND PREPARE THE DATASET
# ============================================================
from datasets import load_dataset

# Load the dataset from Hugging Face Hub.
dataset = load_dataset("mrdbourke/FoodExtract-1k")

print(f"Total samples: {len(dataset['train'])}")
print(f"Columns: {dataset['train'].column_names}")

# Inspect one sample so you understand the raw structure.
# The two columns we care about are:
#   'sequence'                     → the raw text input
#   'gpt-oss-120b-label-condensed' → the structured output we want to produce
sample = dataset["train"][0]
print(f"\nInput:\n{sample['sequence']}")
print(f"\nTarget output:\n{sample['gpt-oss-120b-label-condensed']}")
```

Language models trained for instruction following expect inputs in a conversational format with roles: user (the input) and assistant (the expected output). The rule that renders these role-tagged messages into the exact text the model sees is called a chat template. The instruction-tuned model was trained on this structure, so matching it during fine-tuning produces better results than feeding raw text directly.
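As a rough illustration, the sketch below renders a message list into one string using Gemma-style turn markers. The marker strings reflect my understanding of Gemma's format, the beginning-of-sequence token is left out, and the function `render_chat` is hypothetical. In real code, `tokenizer.apply_chat_template` is the source of truth.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render role-tagged messages into a single prompt string (illustrative)."""
    # Gemma-style templates label the assistant turn as "model".
    role_names = {"user": "user", "assistant": "model"}
    parts = []
    for message in messages:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so generation continues from this point.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [{"role": "user", "content": "A cup of tea and toast."}]
print(render_chat(messages, add_generation_prompt=True))
```

The generation prompt matters at inference time: without the opened model turn, the model has no cue that it is its turn to answer.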

```
# Convert each raw sample into the prompt/completion format
# the model was pre-trained to expect.
#
# 'prompt'     → what the user sends (the raw text to extract from)
# 'completion' → what the assistant should respond with (the structured output)
#
# TRL's SFTTrainer knows how to handle this exact format automatically.
def format_sample(sample):
    return {
        "prompt": [
            {"role": "user", "content": sample["sequence"]}
        ],
        "completion": [
            {"role": "assistant", "content": sample["gpt-oss-120b-label-condensed"]}
        ]
    }

dataset = dataset.map(format_sample, batched=False)

# Verify the formatting worked.
formatted_sample = dataset["train"][0]
print(f"Prompt role:     {formatted_sample['prompt'][0]['role']}")
print(f"Completion role: {formatted_sample['completion'][0]['role']}")

# Split into 80% training, 20% test.
# The model trains on the training set and is evaluated on the test set.
# The test set contains samples the model has never seen during training.
# This is how we get an honest measure of how well fine-tuning worked.
dataset = dataset["train"].train_test_split(
    test_size=0.2,
    shuffle=False,   # Keep order consistent for reproducibility
    seed=42
)

print(f"Training samples: {len(dataset['train'])}")
print(f"Test samples:     {len(dataset['test'])}")
```

This is where the actual learning happens. You configure the training settings (called hyperparameters), hand everything to a trainer, and let it run.

The two objects you need are SFTConfig (the settings) and SFTTrainer (the engine that runs training using those settings). SFT stands for Supervised Fine-Tuning.

**num_train_epochs:** how many times the trainer passes through the full training dataset. Three epochs is a reasonable default for a focused task like this. Too few and the model hasn't learned enough. Too many and it starts memorizing the training data instead of generalizing.

**per_device_train_batch_size:** how many samples the model processes at once before updating its weights. Larger batches use more VRAM. If you hit an out-of-memory error, reduce this number first.

**learning_rate:** how large each weight update step is. Too high and training becomes unstable. Too low and the model barely changes. 5e-5 is a safe starting point for fine-tuning a pre-trained model.

**completion_only_loss:** this tells the trainer to only compute the training loss on the output tokens, not the input tokens. The model only needs to learn how to generate the structured output. Computing loss on the input would confuse this signal.

**max_length:** the maximum number of tokens per training sample. Any sample longer than this is truncated. Set based on the length of your data and your available VRAM.
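Completion-only loss is commonly implemented by masking: label positions that belong to the prompt are set to an ignore value so they contribute nothing to the loss. The sketch below shows the idea with plain lists. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; the token IDs are invented, and TRL's internal implementation may differ in detail.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, completion_ids):
    """Return (input_ids, labels) where only completion tokens carry a label."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


prompt_ids = [2, 105, 2364, 107]    # hypothetical prompt token IDs
completion_ids = [8731, 236779, 1]  # hypothetical completion token IDs

input_ids, labels = build_labels(prompt_ids, completion_ids)
print(input_ids)  # [2, 105, 2364, 107, 8731, 236779, 1]
print(labels)     # [-100, -100, -100, -100, 8731, 236779, 1]
```

The model still reads the full sequence as input. Only the positions with a real label produce a gradient.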

```
# ============================================================
# MODULE 4: CONFIGURE AND RUN TRAINING
# ============================================================
from trl import SFTConfig, SFTTrainer

CHECKPOINT_DIR = "./fine_tuned_model"
BATCH_SIZE     = 16   # Reduce to 8 or 4 if you run out of VRAM
LEARNING_RATE  = 5e-5
NUM_EPOCHS     = 3

sft_config = SFTConfig(
    # Where to save the model and training checkpoints
    output_dir = CHECKPOINT_DIR,

    # Training duration
    num_train_epochs = NUM_EPOCHS,

    # Batch size per GPU
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size  = BATCH_SIZE,

    # Maximum token length per sample (longer samples are truncated)
    max_length = 512,
    packing    = False,

    # Only compute loss on the output (completion) tokens, not the input tokens.
    # This focuses the model on learning the output format.
    completion_only_loss = True,

    # Optimizer - adamw_torch_fused is faster than standard adamw on GPU
    optim = "adamw_torch_fused",

    # Learning rate settings
    learning_rate     = LEARNING_RATE,
    lr_scheduler_type = "constant",

    # Precision: set based on model dtype
    # bfloat16 is more stable than float16 for most modern GPUs
    bf16 = (model.dtype == torch.bfloat16),
    fp16 = (model.dtype == torch.float16),

    # Evaluation and checkpointing
    eval_strategy  = "epoch",   # Evaluate after each full pass through the data
    save_strategy  = "epoch",   # Save a checkpoint after each epoch

    # Load the best checkpoint at the end of training
    load_best_model_at_end   = True,
    metric_for_best_model    = "mean_token_accuracy",
    greater_is_better        = True,

    # Logging
    logging_steps = 10,

    # Disable external reporting for simplicity
    report_to = "none",
    push_to_hub = False,
)

# Create the trainer object.
# This is the engine that runs the actual fine-tuning loop.
trainer = SFTTrainer(
    model            = model,
    args             = sft_config,
    train_dataset    = dataset["train"],
    eval_dataset     = dataset["test"],
    processing_class = tokenizer,
)

# Start training.
# This is the line that actually trains the model.
# Watch the training loss - it should trend downward across steps.
training_output = trainer.train()

print("Training complete.")
print(f"Final training loss: {training_output.training_loss:.4f}")
```

The trainer prints the training loss every logging_steps steps and, because eval_strategy is set to "epoch", the validation loss once at the end of each epoch. These are the two numbers to focus on. Training loss should decrease steadily. Validation loss should follow a similar trend.
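It helps to know roughly how many log lines to expect. The arithmetic below uses the numbers from this article (1,420 samples, a 20% test split, batch size 16, three epochs, logging every 10 steps) and assumes a single GPU with no gradient accumulation. The function name is hypothetical.

```python
import math


def count_training_steps(total_samples, test_percent, batch_size, num_epochs):
    """Return (train_samples, test_samples, steps_per_epoch, total_steps)."""
    test_samples = total_samples * test_percent // 100
    train_samples = total_samples - test_samples
    # One optimizer step per batch; the last batch of an epoch may be partial.
    steps_per_epoch = math.ceil(train_samples / batch_size)
    return train_samples, test_samples, steps_per_epoch, steps_per_epoch * num_epochs


train_n, test_n, per_epoch, total = count_training_steps(1420, 20, 16, 3)

print(train_n, test_n)   # 1136 284
print(per_epoch, total)  # 71 213
print(total // 10)       # 21 training-loss log lines at logging_steps = 10
```

If you halve the batch size to fit in memory, the step count doubles and so does the number of log lines.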

If validation loss starts increasing while training loss keeps decreasing, the model is beginning to overfit to the training data. At that point, load_best_model_at_end=True ensures you end up with the best checkpoint, not the most recently trained one.
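Conceptually, picking the best checkpoint is a lookup over the per-epoch evaluation results. The sketch below mimics that choice with invented numbers. The trainer does this for you, and the dictionaries here are not its real log format.

```python
def pick_best_epoch(history, metric, greater_is_better):
    """Return the history entry with the best value of the chosen metric."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda entry: entry[metric])


# Invented metrics: validation loss rises in epoch 3 while training loss keeps falling.
history = [
    {"epoch": 1, "train_loss": 0.90, "eval_loss": 0.80, "eval_accuracy": 0.85},
    {"epoch": 2, "train_loss": 0.50, "eval_loss": 0.60, "eval_accuracy": 0.90},
    {"epoch": 3, "train_loss": 0.30, "eval_loss": 0.70, "eval_accuracy": 0.88},
]

best = pick_best_epoch(history, metric="eval_accuracy", greater_is_better=True)
print(best["epoch"])  # 2
```

In this made-up run, epoch 3 has the lowest training loss but epoch 2 generalizes best, so epoch 2 is the checkpoint worth keeping.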

Training is not finished when the loop ends. You need to confirm the model actually learned the task by evaluating it on samples it has never seen, then save it to disk in a reusable format.

```
# ============================================================
# MODULE 5: EVALUATE AND SAVE THE MODEL
# ============================================================

# Run the model across the full test dataset and collect metrics.
eval_results = trainer.evaluate()

print("Evaluation results:")
print(f"  Loss:                {eval_results['eval_loss']:.4f}")
print(f"  Mean token accuracy: {eval_results['eval_mean_token_accuracy']*100:.2f}%")
print(f"  Best metric:         {trainer.state.best_metric*100:.2f}%")
```

Mean token accuracy measures what percentage of predicted tokens exactly match the target label tokens. Each prediction is made with the correct preceding tokens supplied as context, so the number is not the same as the accuracy of free-running generation. Because the output format is highly structured and repetitive (the same format keys appear in every sample), even a model that has partially learned the task will score relatively high here. The more diagnostic check is the manual inspection below.
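A hand-rolled version of the metric makes the definition concrete. The sketch compares predicted token IDs with label IDs position by position and skips masked positions. The IDs are invented, and this is not TRL's actual implementation.

```python
IGNORE_INDEX = -100  # label value for positions excluded from the metric


def token_accuracy(predicted_ids, label_ids):
    """Fraction of non-masked positions where the prediction equals the label."""
    pairs = [
        (pred, label)
        for pred, label in zip(predicted_ids, label_ids)
        if label != IGNORE_INDEX
    ]
    if not pairs:
        return 0.0
    correct = sum(pred == label for pred, label in pairs)
    return correct / len(pairs)


label_ids     = [-100, -100, 11, 12, 13, 14]
predicted_ids = [7,    8,    11, 12, 99, 14]

print(token_accuracy(predicted_ids, label_ids))  # 0.75
```

Three of the four scored positions match. If those three were format keys and the miss was the only food item, the score would look healthy while the extraction was wrong, which is why inspecting outputs matters.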

Numbers do not tell the full story. Always look at actual model outputs on test samples and compare them to the ground truth labels. This is where you catch failure modes that metrics miss.

``` python
from transformers import pipeline

# Load the model into a text-generation pipeline for easy inference.
inference_pipeline = pipeline(
    "text-generation",
    model     = model,
    tokenizer = tokenizer,
)

def predict(input_text: str, max_new_tokens: int = 256) -> str:
    """
    Run inference on a raw text string.
    Formats the input into chat template format, runs the model,
    and returns only the generated output - not the input prompt.
    """
    formatted_input = [{"role": "user", "content": input_text}]
    prompt = inference_pipeline.tokenizer.apply_chat_template(
        conversation           = formatted_input,
        tokenize               = False,
        add_generation_prompt  = True,
    )
    output = inference_pipeline(
        text_inputs     = prompt,
        max_new_tokens  = max_new_tokens,
        disable_compile = True,
    )
    # Return only the newly generated text, not the input prompt
    return output[0]["generated_text"][len(prompt):]

# Inspect 5 random test samples.
# For each one: print the input, the model's output, and the ground truth label.
import random

random.seed(42)
test_indices = random.sample(range(len(dataset["test"])), 5)

for i, idx in enumerate(test_indices):
    sample = dataset["test"][idx]
    raw_input    = sample["sequence"]
    ground_truth = sample["gpt-oss-120b-label-condensed"]
    model_output = predict(raw_input)

    print(f"\n{'='*55}")
    print(f"Sample {i+1}")
    print(f"{'='*55}")
    print(f"Input:\n{raw_input[:200]}...")
    print(f"\nModel output:\n{model_output}")
    print(f"\nGround truth:\n{ground_truth}")
```

A well fine-tuned model should produce outputs that closely match the ground truth in structure and content. The format should be consistent across every sample: the same keys in the same order, the same field names, no extra text before or after.

If the model is occasionally missing a food item or tagging things slightly differently, that is normal for a model trained on roughly 1,100 samples (80% of the 1,420 in the dataset). The format consistency is what matters most. Content accuracy improves with more training data.
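Format consistency can also be checked automatically across many samples. The sketch below tests that an output has exactly the four expected keys in order, one per line, and reports which ground-truth foods the model missed. Both helper names are hypothetical, and the one-field-per-line layout is an assumption based on the example format.

```python
EXPECTED_KEYS = ["food_or_drink", "tags", "foods", "drinks"]


def has_valid_format(output: str) -> bool:
    """True if the output has exactly the expected keys, in order, one per line."""
    lines = output.strip().splitlines()
    keys = [line.partition(":")[0].strip() for line in lines]
    return keys == EXPECTED_KEYS


def missing_items(output: str, truth: str, field: str = "foods") -> set:
    """Items listed under `field` in the ground truth but absent from the output."""
    def items(text):
        for line in text.strip().splitlines():
            key, _, value = line.partition(":")
            if key.strip() == field:
                return {v.strip() for v in value.split(",") if v.strip()}
        return set()

    return items(truth) - items(output)


truth = "food_or_drink: 1\ntags: fi, di\nfoods: eggs, bacon, toast\ndrinks: tea"
good = "food_or_drink: 1\ntags: fi, di\nfoods: eggs, toast\ndrinks: tea"
bad = "Sure! Here is the extraction:\nfoods: eggs"

print(has_valid_format(good))      # True
print(has_valid_format(bad))       # False
print(missing_items(good, truth))  # {'bacon'}
```

Running `has_valid_format` over every test prediction gives a single format pass rate, while `missing_items` separates content mistakes from format mistakes.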

```
# Save the best model to disk.
# trainer.save_model() saves to the output_dir specified in SFTConfig.
# This includes the model weights, tokenizer, and configuration files.
trainer.save_model()

print(f"Model saved to: {CHECKPOINT_DIR}")
```

After saving, your checkpoint directory will contain everything needed to reload and use the model: model.safetensors, config.json, tokenizer.json, and supporting files. You can reload it at any time with AutoModelForCausalLM.from_pretrained(CHECKPOINT_DIR).

Always reload the model from disk and run one inference call to confirm the saved files are valid before uploading or deploying.

``` python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Reload from disk
reloaded_model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT_DIR,
    dtype      = "auto",
    device_map = "auto",
    attn_implementation = "eager"
)
reloaded_tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)

# Quick sanity check
reloaded_pipeline = pipeline(
    "text-generation",
    model     = reloaded_model,
    tokenizer = reloaded_tokenizer,
)

test_text   = "A plate with scrambled eggs, avocado toast, and a glass of orange juice."
test_prompt = reloaded_pipeline.tokenizer.apply_chat_template(
    conversation          = [{"role": "user", "content": test_text}],
    tokenize              = False,
    add_generation_prompt = True,
)

test_output = reloaded_pipeline(test_prompt, max_new_tokens=128)
generated   = test_output[0]["generated_text"][len(test_prompt):]

print(f"Input:  {test_text}")
print(f"Output: {generated}")
```

You should see a correctly formatted extraction output. If you do, the model is saved correctly.

A model saved only to your local disk disappears when your Colab session ends or your machine resets. Uploading to Hugging Face Hub gives your model a permanent home, makes it reusable across machines, and optionally shareable with anyone.

You need a Hugging Face account and an access token. Create one at [huggingface.co](https://huggingface.co/), then generate a token under Settings → Access Tokens. Give it write permissions.

In Colab, store it under Secrets (the key icon in the left panel) with the name HF_TOKEN. In a local environment, set it as an environment variable.

``` python
from huggingface_hub import login

# This will prompt for your token if it isn't already set.
# In Colab, use userdata.get("HF_TOKEN") to read from Secrets.
login()
```
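For scripts that run outside a notebook, one option is to read the token from the environment and pass it to `login` explicitly. This is a minimal sketch, assuming the variable is named HF_TOKEN as described above. The helper names are hypothetical.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from an environment variable, failing loudly if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


def login_from_env() -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported here so get_hf_token() stays usable without huggingface_hub installed.
    from huggingface_hub import login

    login(token=get_hf_token())
```

Calling `login_from_env()` once at the top of a script avoids the interactive prompt, and the explicit error is easier to debug than a failed upload later on.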

A model card is the README for your model. It tells anyone who finds it what the model does, how it was trained, and how to use it. This is not optional — an undocumented model is nearly useless to anyone, including your future self.

```` python
# ============================================================
# MODULE 6: UPLOAD TO HUGGING FACE HUB
# ============================================================
YOUR_HF_USERNAME = "your_username"   # Replace with your actual Hugging Face username
MODEL_REPO_ID    = f"{YOUR_HF_USERNAME}/FoodExtract-gemma-3-270m-finetuned"

model_card = f"""---
base_model: google/gemma-3-270m-it
library_name: transformers
tags:
  - sft
  - trl
  - fine-tuned
  - food-extraction
license: gemma
---

# FoodExtract: Fine-Tuned Gemma 3 270M

A fine-tuned version of [google/gemma-3-270m-it](https://huggingface.co/google/gemma-3-270m-it)
trained to extract food and drink items from raw text.

## What It Does

Given any text input, this model outputs a structured extraction:

```
food_or_drink: 1
tags: fi, di
foods: eggs, bacon, toast
drinks: orange juice
```

If the input contains no food or drink items:

```
food_or_drink: 0
tags:
foods:
drinks:
```

## Tag Reference

| Tag | Meaning |
|-----|---------|
| np  | nutrition_panel |
| il  | ingredient_list |
| me  | menu |
| re  | recipe |
| fi  | food_items |
| di  | drink_items |
| fa  | food_advertisement |
| fp  | food_packaging |

## How to Use

Load the model with the `transformers` text-generation pipeline using the repository ID `{MODEL_REPO_ID}`.
"""

# Write the model card next to the saved model so it is uploaded with it.
with open(f"{CHECKPOINT_DIR}/README.md", "w") as f:
    f.write(model_card)

print("Model card written.")
````

### Creating the Repository and Uploading

Uploading takes two steps: create the repository on the Hub, then upload the contents of your checkpoint folder to it. Once the upload finishes, your model is live at https://huggingface.co/{YOUR_HF_USERNAME}/FoodExtract-gemma-3-270m-finetuned. Anyone with the link can download and use it. You can also load it from any machine in the future with just the model ID.
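The repository creation and upload can be sketched with two functions from `huggingface_hub`: `create_repo` and `upload_folder`. This is a minimal sketch under stated assumptions, not necessarily the exact code used for this article: the wrapper name `upload_model`, the helper `repo_url`, and the commit message are all illustrative.

```python
def repo_url(repo_id: str) -> str:
    """Public URL of a model repository on the Hugging Face Hub."""
    return f"https://huggingface.co/{repo_id}"


def upload_model(repo_id: str, folder_path: str) -> str:
    """Create the repository if needed, upload the folder, and return its URL."""
    # Imported inside the function so the sketch can be defined without
    # huggingface_hub installed; the library is required to run the upload.
    from huggingface_hub import create_repo, upload_folder

    # exist_ok=True makes re-running safe if the repository already exists.
    create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

    # Upload every file in the folder (weights, tokenizer, README.md).
    upload_folder(
        repo_id=repo_id,
        folder_path=folder_path,
        repo_type="model",
        commit_message="Upload fine-tuned FoodExtract model",
    )
    return repo_url(repo_id)


# Example call (requires a prior login with a write token):
# print(upload_model(MODEL_REPO_ID, CHECKPOINT_DIR))
```

Because the training output directory also holds per-epoch checkpoint subfolders, you may want to pass `ignore_patterns` to `upload_folder` so that only the final model files are uploaded.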

Here is the entire workflow condensed to its essentials — every module, every critical line, in sequence:

```
Module 1: Detect hardware → confirm GPU is available
Module 2: Load model + tokenizer → google/gemma-3-270m-it
Module 3: Load dataset → format into prompt/completion → split 80/20
Module 4: Configure SFTConfig → build SFTTrainer → call trainer.train()
Module 5: Evaluate metrics → inspect outputs manually → save model to disk
Module 6: Write model card → create HF repo → upload_folder
```

Each module has one clear input and one clear output. None of them depend on anything outside their own section except the objects passed from the previous one.

You have a fine-tuned language model. Not a wrapper around an API. Not a prompt-engineered workaround. A model whose weights have been updated to reliably perform a specific task, saved to disk, and living on Hugging Face where you can reload it in one line from anywhere.

The task in this article was food extraction. The pattern is universal. The same six modules apply to any structured data extraction problem: replace the dataset, adjust the output format in your model card, change the task description in your evaluation function. Everything else stays the same.

[Fine-Tune Your First LLM: A Guide with PyTorch and Hugging Face](https://pub.towardsai.net/fine-tune-your-first-llm-a-guide-with-pytorch-and-hugging-face-bc4cdfb156c3) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
