Fine-Tune Your First LLM: A Guide with PyTorch and Hugging Face


This article draws from the excellent full fine-tuning walkthrough at [learnhuggingface.com] as a primary study reference. What you are reading is my own version, restructured into a cleaner, more modular format built for beginners who want a straight line from setup to a working fine-tuned model.

Pre-trained language models know a lot about language in general. What they do not know is your specific task, your specific output format, or your specific domain. Fine-tuning is the process of taking a pre-trained model and teaching it exactly that.

In this article, you will fine-tune Gemma 3 270M, a small but capable open-source language model from Google, to perform one focused task: extracting food and drink items from raw text and returning them in a clean, structured format.

The input will be plain text like this:

British breakfast with baked beans, fried eggs, sausages, bacon, mushrooms, a cup of tea, toast and fried tomatoes.

And the output will be structured like this:

food_or_drink: 1
tags: fi, di
foods: baked beans, fried eggs, sausages, bacon, mushrooms, toast, fried tomatoes
drinks: tea

This specific task is called structured data extraction. It is one of the most common real-world use cases for fine-tuning: you have unstructured text, you want structured output, and a general-purpose model does not reliably give it to you in the format you need.

By the end of this article you will have a fine-tuned model saved locally and uploaded to Hugging Face, ready to use in any project.

You need a GPU for this. Training even a 270M parameter model on CPU is too slow to be practical.

The options in order of convenience: Google Colab with a T4 GPU (free tier), a local NVIDIA GPU with at least 8GB VRAM, or a cloud instance. If you are using Colab, go to Runtime, then Change Runtime Type, then select GPU before running anything.

Install the required libraries:

pip install transformers trl datasets accelerate gradio

And set one environment variable before anything else runs. This prevents tokenizer parallelism warnings that clutter your output without affecting anything:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Every step in this article maps to one of six modules. We move through them in order. No jumping ahead.

Module 1: Setup and Hardware Check
Module 2: Load the Base Model
Module 3: Load and Prepare the Dataset
Module 4: Configure and Run Training
Module 5: Evaluate and Save the Model
Module 6: Upload to Hugging Face Hub

That is the entire workflow. Let’s build it.

Before touching a model or a dataset, confirm that your hardware is ready and your libraries are imported. A misconfigured runtime is the most common reason beginner fine-tuning runs fail before they start.
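A minimal sketch of such a hardware check, assuming PyTorch is the backend. The helper name detect_device is hypothetical, and the fallback to "cpu" when PyTorch is missing is a convenience added here for illustration:

```python
import importlib.util


def detect_device() -> str:
    """Return "cuda" if PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    # Without PyTorch installed there is nothing to run on a GPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"

    import torch

    if torch.cuda.is_available():
        # Print the GPU name so you know which card the runtime gave you.
        print(f"GPU found: {torch.cuda.get_device_name(0)}")
        return "cuda"
    return "cpu"


device = detect_device()
print(f"Using device: {device}")
```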

You should see your device printed clearly. If you see cpu, stop and fix your runtime before continuing. Training on CPU for this model will take hours instead of minutes.

A base model is the starting point. It already understands language in general — grammar, word relationships, how sentences are structured. Your fine-tuning run will take that general understanding and redirect it toward a specific task.

The model we are using is google/gemma-3-270m-it. The it stands for instruction-tuned, meaning it has already been aligned to follow user instructions. This makes it a better starting point than a raw base model for a task that involves following a specific output format.
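Loading the model and its tokenizer from the same model ID might look like the sketch below. The function name load_base_model is hypothetical, and the keyword arguments mirror the reload call used later in this article. The dtype argument assumes a recent transformers release, and device_map="auto" assumes accelerate is installed:

```python
MODEL_ID = "google/gemma-3-270m-it"


def load_base_model(model_id: str = MODEL_ID):
    """Load the model and its matching tokenizer from the Hugging Face Hub."""
    # Imported inside the function so the heavy dependency loads only when needed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Tokenizer and model come from the same ID so their vocabularies match.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype="auto",                 # keep the dtype stored in the checkpoint
        device_map="auto",            # place weights on the GPU when one is available
        attn_implementation="eager",
    )
    return model, tokenizer
```

Calling model, tokenizer = load_base_model() downloads the weights on first use. Gemma models on the Hub may require accepting the license and logging in before the download succeeds.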

The model does not read text. It reads numbers. A tokenizer is the translator between human-readable text and the integer sequences the model actually processes.

The tokenizer for this model converts:

"Hello my name is Daniel"→ [2, 9259, 1041, 1463, 563, 13108]

Every model has its own tokenizer tied to its specific vocabulary. Always load the tokenizer that matches the model. Using a mismatched tokenizer will produce garbage output.
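To see why a mismatched tokenizer produces garbage, here is a toy word-level tokenizer. It is only an illustration: real tokenizers, including Gemma's, work on subword pieces, and the two vocabularies below are made up.

```python
def build_vocab(words: list[str]) -> dict[str, int]:
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Turn whitespace-separated words into their integer IDs."""
    return [vocab[word] for word in text.split()]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Turn integer IDs back into words using the given vocabulary."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[idx] for idx in ids)


# Two vocabularies containing the same words but with different ID assignments.
vocab_a = build_vocab("hello my name is daniel".split())
vocab_b = build_vocab("daniel is my name hello".split())

ids = encode("hello my name is daniel", vocab_a)
print(ids)                    # [0, 1, 2, 3, 4]
print(decode(ids, vocab_a))   # hello my name is daniel
print(decode(ids, vocab_b))   # daniel is my name hello
```

The IDs are only meaningful relative to the vocabulary that produced them, which is why the model and tokenizer must come from the same checkpoint.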

In full fine-tuning, every parameter in the model is updated during training. This gives the model the most flexibility to learn your task, but it requires more VRAM and takes longer. At 270M parameters, Gemma 3 270M is small enough to do this comfortably on a consumer GPU.

LoRA (Low-Rank Adaptation) is an alternative where you freeze the original model weights and only train a small set of adapter weights. This uses far less memory and is the preferred approach for larger models. For a 270M model with a specific task and a GPU with at least 8GB VRAM, full fine-tuning is the right choice.
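A rough back-of-the-envelope calculation makes the memory trade-off concrete. The sketch assumes 32-bit values everywhere and an Adam-style optimizer that keeps two extra buffers per parameter. It ignores activations, so real usage will be higher. The 1024 x 1024 layer in the LoRA example is hypothetical and not taken from Gemma's architecture:

```python
def full_finetune_state_gb(num_params: int, bytes_per_value: int = 4) -> float:
    """Memory for weights, gradients, and two Adam moment buffers, in GB (1e9 bytes)."""
    copies = 4  # weights + gradients + first moment + second moment
    return num_params * bytes_per_value * copies / 1e9


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two small factors, (d_out x rank) and (rank x d_in), per adapted matrix."""
    return rank * (d_out + d_in)


print(full_finetune_state_gb(270_000_000))        # 4.32
print(lora_trainable_params(1024, 1024, rank=8))  # 16384
print(1024 * 1024)                                # 1048576 weights in the full matrix
```

Under these assumptions a 270M model needs a little over 4 GB for parameters and optimizer state, which is consistent with full fine-tuning fitting on an 8GB card. The same arithmetic for a multi-billion-parameter model shows why LoRA becomes attractive there.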

Training data is what separates a generic model from a task-specific one. The dataset needs three things to be ready for training: it needs to be loaded, it needs to be formatted into the input-output structure the model expects, and it needs to be split into training and test sets.

We are using mrdbourke/FoodExtract-1k, a dataset of 1,420 samples. Each sample contains a raw text string and a structured label showing what food and drink items should be extracted from it.

The labels were generated by a large teacher model (gpt-oss-120b) and condensed into a compact format that uses fewer tokens. This compact format is what we will train our model to produce. Fewer output tokens means faster inference at deployment time.
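Loading and splitting the dataset could be sketched as follows. The helper names are hypothetical, and the code assumes the dataset exposes a single "train" split on the Hub, which is not confirmed here. The 80/20 ratio matches the split used in this article:

```python
DATASET_ID = "mrdbourke/FoodExtract-1k"


def split_sizes(num_samples: int, test_fraction: float = 0.2) -> tuple[int, int]:
    """Approximate (train, test) sizes for a given test fraction."""
    num_test = round(num_samples * test_fraction)
    return num_samples - num_test, num_test


def load_and_split(dataset_id: str = DATASET_ID, test_fraction: float = 0.2, seed: int = 42):
    """Download the dataset and return a DatasetDict with "train" and "test" splits."""
    # Imported inside the function so the dependency loads only when needed.
    from datasets import load_dataset

    raw = load_dataset(dataset_id, split="train")  # assumption: one "train" split
    # A fixed seed keeps the split reproducible between runs.
    return raw.train_test_split(test_size=test_fraction, seed=seed)


print(split_sizes(1420))  # (1136, 284)
```

With 1,420 samples, an 80/20 split leaves about 1,136 samples for training and 284 for testing.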

Supervised fine-tuning (SFT) means you give the model paired examples of input and ideal output. The model learns to produce outputs like the examples by minimizing the difference between what it generates and what the label says it should generate.

In our case: input is a raw text string. Output is the structured food extraction format. Given enough examples, the model learns the pattern.
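The "difference" being minimized is, in standard SFT, the cross-entropy between the model's predicted token distribution and the label tokens. A minimal sketch in plain Python, using made-up probabilities rather than real model outputs, with a mask that restricts the loss to output positions:

```python
import math


def sft_loss(token_probs: list[float], is_output: list[bool]) -> float:
    """Mean negative log-likelihood over the positions flagged as output tokens.

    token_probs[i] is the probability the model assigned to the correct token
    at position i; is_output[i] marks whether that position belongs to the label.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_output) if keep]
    return sum(losses) / len(losses)


# Hypothetical 5-token sequence: 2 input tokens followed by 3 output tokens.
mask = [False, False, True, True, True]
early = [0.10, 0.20, 0.30, 0.25, 0.40]   # early in training: unsure about the label
later = [0.10, 0.20, 0.90, 0.80, 0.95]   # later in training: confident and correct

print(f"early loss: {sft_loss(early, mask):.4f}")
print(f"later loss: {sft_loss(later, mask):.4f}")
```

The loss falls as the model puts more probability on the correct output tokens, and it reaches zero only when every correct token gets probability 1.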

Language models trained for instruction following expect inputs in a conversational format with roles: user (the input) and assistant (the expected output). This is called a chat template. The instruction-tuned model was already trained to recognize this structure, so matching it during fine-tuning produces better results than feeding raw text directly.
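One way to put each sample into that role-based structure is a small mapping function. The column names come from the dataset fields used in the evaluation code in this article. The prompt/completion layout is the conversational format TRL accepts, and the example sample is invented for illustration:

```python
def to_prompt_completion(sample: dict) -> dict:
    """Convert one raw dataset row into user/assistant messages."""
    return {
        "prompt": [
            {"role": "user", "content": sample["sequence"]},
        ],
        "completion": [
            {"role": "assistant", "content": sample["gpt-oss-120b-label-condensed"]},
        ],
    }


# Hypothetical row, shaped like the dataset's columns.
example = {
    "sequence": "A cup of tea and a slice of toast.",
    "gpt-oss-120b-label-condensed": "food_or_drink: 1\ntags: fi, di\nfoods: toast\ndrinks: tea",
}

formatted = to_prompt_completion(example)
print(formatted["prompt"][0]["role"])       # user
print(formatted["completion"][0]["role"])   # assistant
```

With the datasets library, the same function can be applied to every row with dataset.map(to_prompt_completion).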

This is where the actual learning happens. You configure the training settings (called hyperparameters), hand everything to a trainer, and let it run.

The two objects you need are SFTConfig (the settings) and SFTTrainer (the engine that runs training using those settings). SFT stands for Supervised Fine-Tuning.

num_train_epochs: how many times the trainer passes through the full training dataset. Three epochs is a reasonable default for a focused task like this. Too few and the model has not learned enough. Too many and it starts memorizing the training data instead of generalizing.

per_device_train_batch_size: how many samples the model processes at once before updating its weights. Larger batches use more VRAM. If you hit an out-of-memory error, reduce this number first.

learning_rate: how large each weight update step is. Too high and training becomes unstable. Too low and the model barely changes. 5e-5 is a safe starting point for fine-tuning a pre-trained model.

completion_only_loss: tells the trainer to compute the training loss only on the output tokens, not the input tokens. The model only needs to learn how to generate the structured output. Including the input tokens in the loss would spend training signal on reproducing the prompt.

max_length: the maximum number of tokens per training sample. Any sample longer than this is truncated. Set it based on the length of your data and your available VRAM.
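Collected in one place, the settings above might look like this. The epoch count and learning rate come from the text. The batch size, max_length, and output directory are placeholder assumptions to adjust for your own data and GPU:

```python
training_settings = {
    "output_dir": "./checkpoint_models",   # hypothetical path for saved checkpoints
    "num_train_epochs": 3,                 # three passes over the training set
    "per_device_train_batch_size": 8,      # assumption: lower this on out-of-memory errors
    "learning_rate": 5e-5,                 # safe starting point named in the text
    "completion_only_loss": True,          # compute loss on output tokens only
    "max_length": 512,                     # assumption: longer samples are truncated
}

for name, value in training_settings.items():
    print(f"{name:30s} {value}")
```

With a recent TRL release installed, these keys are intended to be passed as SFTConfig(**training_settings). Argument names have changed between TRL versions, so check them against the version you have.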

The trainer prints a log every logging_steps steps. The two numbers to focus on are the training loss and the validation loss. Training loss should decrease steadily. Validation loss should follow a similar trend.

If validation loss starts increasing while training loss keeps decreasing, the model is beginning to overfit to the training data. At that point, load_best_model_at_end=True ensures you end up with the best checkpoint, not the most recently trained one.

Training is not finished when the loop ends. You need to confirm the model actually learned the task by evaluating it on samples it has never seen, then save it to disk in a reusable format.

Mean token accuracy measures what percentage of generated tokens exactly match the target label tokens. Because the output format is highly structured and repetitive (the same format keys appear in every sample), even a model that has partially learned the task will score relatively high here. The more diagnostic check is the manual inspection below.
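As a simplified illustration of the metric (the trainer computes its own version internally), a position-by-position comparison of token IDs could look like this. The token IDs are made up:

```python
def mean_token_accuracy(predicted: list[int], target: list[int]) -> float:
    """Fraction of target positions where the predicted token ID matches exactly."""
    if not target:
        raise ValueError("target must contain at least one token")
    # zip stops at the shorter list, so missing predictions count as wrong.
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


target_ids = [5, 9, 2, 7]
predicted_ids = [5, 9, 3, 7]   # one wrong token out of four

print(mean_token_accuracy(predicted_ids, target_ids))  # 0.75
```

Because the format keys repeat in every label, many positions are easy to get right, which is why this number can look high before the model extracts the right items.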

Numbers do not tell the full story. Always look at actual model outputs on test samples and compare them to the ground truth labels. This is where you catch failure modes that metrics miss.

from transformers import pipeline

# Load the model into a text-generation pipeline for easy inference.
inference_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

def predict(input_text: str, max_new_tokens: int = 256) -> str:
    """
    Run inference on a raw text string.
    Formats the input into chat template format, runs the model,
    and returns only the generated output - not the input prompt.
    """
    formatted_input = [{"role": "user", "content": input_text}]
    prompt = inference_pipeline.tokenizer.apply_chat_template(
        conversation=formatted_input,
        tokenize=False,
        add_generation_prompt=True,
    )
    output = inference_pipeline(
        text_inputs=prompt,
        max_new_tokens=max_new_tokens,
        disable_compile=True,
    )
    # Return only the newly generated text, not the input prompt
    return output[0]["generated_text"][len(prompt):]

# Inspect 5 random test samples.
# For each one: print the input, the model's output, and the ground truth label.
import random

random.seed(42)
test_indices = random.sample(range(len(dataset["test"])), 5)

for i, idx in enumerate(test_indices):
    sample = dataset["test"][idx]
    raw_input = sample["sequence"]
    ground_truth = sample["gpt-oss-120b-label-condensed"]
    model_output = predict(raw_input)
    print(f"\n{'='*55}")
    print(f"Sample {i+1}")
    print(f"{'='*55}")
    print(f"Input:\n{raw_input[:200]}...")
    print(f"\nModel output:\n{model_output}")
    print(f"\nGround truth:\n{ground_truth}")

A well fine-tuned model should produce outputs that closely match the ground truth in structure and content. The format should be consistent across every sample: the same keys in the same order, the same field names, no extra text before or after.

If the model occasionally misses a food item or tags things slightly differently, that is normal for a model trained on roughly 1,100 samples (80% of the 1,420-sample dataset). Format consistency is what matters most. Content accuracy improves with more training data.
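Format consistency can also be checked mechanically. The sketch below assumes each field sits on its own line in the order food_or_drink, tags, foods, drinks. That layout is an assumption about the label format, and the parser is a hypothetical helper, not part of the original workflow:

```python
from typing import Optional

EXPECTED_KEYS = ["food_or_drink", "tags", "foods", "drinks"]


def parse_extraction(output: str) -> Optional[dict]:
    """Parse a model output; return None if it does not follow the expected format."""
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if len(lines) != len(EXPECTED_KEYS):
        return None  # extra text before/after, or a missing field
    parsed = {}
    for expected_key, line in zip(EXPECTED_KEYS, lines):
        key, separator, value = line.partition(":")
        if not separator or key.strip() != expected_key:
            return None  # wrong key name or wrong order
        parsed[expected_key] = [item.strip() for item in value.split(",") if item.strip()]
    return parsed


good = "food_or_drink: 1\ntags: fi, di\nfoods: eggs, bacon, toast\ndrinks: orange juice"
empty = "food_or_drink: 0\ntags:\nfoods:\ndrinks:"
chatty = "Sure! Here is the extraction:\n" + good

print(parse_extraction(good))
print(parse_extraction(empty))
print(parse_extraction(chatty))  # None
```

Running this over every test prediction gives a quick count of format violations to sit alongside the manual inspection.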

After saving, your checkpoint directory will contain everything needed to reload and use the model: model.safetensors, config.json, tokenizer.json, and supporting files. You can reload it at any time with AutoModelForCausalLM.from_pretrained(CHECKPOINT_DIR).
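A quick way to confirm the directory is complete is to check for the files named above. The helper below is a hypothetical convenience, and the temporary directory only stands in for a real checkpoint folder:

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = ("model.safetensors", "config.json", "tokenizer.json")


def missing_checkpoint_files(checkpoint_dir: str) -> list[str]:
    """Return the expected checkpoint files that are not present in the directory."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a throwaway directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "config.json").write_text("{}")
    missing = missing_checkpoint_files(tmp_dir)

print(missing)  # ['model.safetensors', 'tokenizer.json']
```

An empty list from missing_checkpoint_files(CHECKPOINT_DIR) means the three core files are in place.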

Always reload the model from disk and run one inference call to confirm the saved files are valid before uploading or deploying.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Reload from disk
reloaded_model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT_DIR,
    dtype="auto",
    device_map="auto",
    attn_implementation="eager",
)
reloaded_tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)

# Quick sanity check
reloaded_pipeline = pipeline(
    "text-generation",
    model=reloaded_model,
    tokenizer=reloaded_tokenizer,
)

test_text = "A plate with scrambled eggs, avocado toast, and a glass of orange juice."
test_prompt = reloaded_pipeline.tokenizer.apply_chat_template(
    conversation=[{"role": "user", "content": test_text}],
    tokenize=False,
    add_generation_prompt=True,
)

test_output = reloaded_pipeline(test_prompt, max_new_tokens=128)
# Keep only the newly generated text, not the prompt
generated = test_output[0]["generated_text"][len(test_prompt):]

print(f"Input:  {test_text}")
print(f"Output: {generated}")

You should see a correctly formatted extraction output. If you do, the model is saved correctly.

A model saved only to your local disk disappears when your Colab session ends or your machine resets. Uploading to Hugging Face Hub gives your model a permanent home, makes it reusable across machines, and optionally shareable with anyone.

You need a Hugging Face account and an access token. Create one at huggingface.co, then generate a token under Settings → Access Tokens. Give it write permissions.

In Colab, store it under Secrets (the key icon in the left panel) with the name HF_TOKEN. In a local environment, set it as an environment variable.

from huggingface_hub import login

# This will prompt for your token if it isn't already set.
# In Colab, use userdata.get("HF_TOKEN") to read from Secrets.
login()

A model card is the README for your model. It tells anyone who finds it what the model does, how it was trained, and how to use it. This is not optional — an undocumented model is nearly useless to anyone, including your future self.

food_or_drink: 1
tags: fi, di
foods: eggs, bacon, toast
drinks: orange juice

If the input contains no food or drink items:

food_or_drink: 0
tags:
foods:
drinks:

## Tag Reference

| Tag | Meaning |
|-----|---------|
| np  | nutrition_panel |
| il  | ingredient_list |
| me  | menu |
| re  | recipe |
| fi  | food_items |
| di  | drink_items |
| fa  | food_advertisement |
| fp  | food_packaging |

## How to Use

Once the card text is assembled in a model_card string, write it to README.md inside the checkpoint directory:

with open(f"{CHECKPOINT_DIR}/README.md", "w") as f:
    f.write(model_card)

print("Model card written.")

### Creating the Repository and Uploading

Your model is now live at https://huggingface.co/{YOUR_HF_USERNAME}/FoodExtract-gemma-3-270m-finetuned. Anyone with the link can download and use it. You can also load it from any machine in the future with just the model ID.
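Creating the repository and uploading the checkpoint folder can be sketched with the huggingface_hub client. The function names and the "your-username" placeholder are hypothetical. The sketch assumes you are already logged in with a write token:

```python
def build_repo_id(username: str, model_name: str) -> str:
    """Combine a Hub username and a model name into a repository ID."""
    return f"{username}/{model_name}"


def upload_checkpoint(checkpoint_dir: str, repo_id: str, private: bool = False) -> str:
    """Create the model repo if needed, upload the folder, and return the model URL."""
    # Imported inside the function so the dependency loads only when needed.
    from huggingface_hub import create_repo, upload_folder

    # exist_ok=True makes the call safe to repeat if the repo already exists.
    create_repo(repo_id, repo_type="model", private=private, exist_ok=True)
    upload_folder(folder_path=checkpoint_dir, repo_id=repo_id, repo_type="model")
    return f"https://huggingface.co/{repo_id}"


repo_id = build_repo_id("your-username", "FoodExtract-gemma-3-270m-finetuned")
print(repo_id)  # your-username/FoodExtract-gemma-3-270m-finetuned
```

Calling upload_checkpoint(CHECKPOINT_DIR, repo_id) pushes everything in the folder, including the README.md model card, in one step.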

Here is the entire workflow condensed to its essentials — every module, every critical line, in sequence:

Module 1: Detect hardware → confirm GPU is available
Module 2: Load model + tokenizer → google/gemma-3-270m-it
Module 3: Load dataset → format into prompt/completion → split 80/20
Module 4: Configure SFTConfig → build SFTTrainer → call trainer.train()
Module 5: Evaluate metrics → inspect outputs manually → save model to disk
Module 6: Write model card → create HF repo → upload_folder


Each module has one clear input and one clear output. None of them depend on anything outside their own section except the objects passed from the previous one.

You have a fine-tuned language model. Not a wrapper around an API. Not a prompt-engineered workaround. A model whose weights have been updated to reliably perform a specific task, saved to disk, and living on Hugging Face where you can reload it in one line from anywhere.

The task in this article was food extraction. The pattern is universal. The same six modules apply to any structured data extraction problem: replace the dataset, adjust the output format in your model card, change the task description in your evaluation function. Everything else stays the same.

[Fine-Tune Your First LLM: A Guide with PyTorch and Hugging Face](https://pub.towardsai.net/fine-tune-your-first-llm-a-guide-with-pytorch-and-hugging-face-bc4cdfb156c3) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.
