Fine-Tune Your First LLM: A Guide with PyTorch and Hugging Face

Google's Gemma 3 270M language model can be fine-tuned for structured data extraction using PyTorch and Hugging Face libraries, according to a tutorial that walks beginners through the process of teaching a pre-trained model to output food and drink items in a specific format. The guide requires a GPU and covers six modules from setup to uploading the model to Hugging Face Hub.

This article draws from the excellent full fine-tuning walkthrough at learnhuggingface.com as a primary study reference. What you are reading is my own version, restructured into a cleaner, more modular format built for beginners who want a straight line from setup to a working fine-tuned model.

Pre-trained language models know a lot about language in general. What they do not know is your specific task, your specific output format, or your specific domain. Fine-tuning is the process of taking a pre-trained model and teaching it exactly that.

In this article, you will fine-tune Gemma 3 270M, a small but capable open-weight language model from Google, to perform one focused task: extracting food and drink items from raw text and returning them in a clean, structured format.

The input will be plain text like this:

```
British breakfast with baked beans, fried eggs, sausages, bacon, mushrooms, a cup of tea, toast and fried tomatoes.
```

And the output will be structured like this:

```
food_or_drink: 1
tags: fi, di
foods: baked beans, fried eggs, sausages, bacon, mushrooms, toast, fried tomatoes
drinks: tea
```

This specific task is called structured data extraction. It is one of the most common real-world use cases for fine-tuning: you have unstructured text, you want structured output, and a general-purpose model does not reliably give it to you in the format you need.

By the end of this article you will have a fine-tuned model saved locally and uploaded to Hugging Face, ready to use in any project.

You need a GPU for this. Training even a 270M parameter model on CPU is too slow to be practical. The options in order of convenience: Google Colab with a T4 GPU (free tier), a local NVIDIA GPU with at least 8GB VRAM, or a cloud instance. If you are using Colab, go to Runtime, then Change Runtime Type, then select GPU before running anything.

Install the required libraries:

```
pip install transformers trl datasets accelerate gradio
```

And set one environment variable before anything else runs.
This prevents tokenizer parallelism warnings that clutter your output without affecting anything:

```python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Every step in this article maps to one of six modules. We move through them in order. No jumping ahead.

```
Module 1: Setup and Hardware Check
Module 2: Load the Base Model
Module 3: Load and Prepare the Dataset
Module 4: Configure and Run Training
Module 5: Evaluate and Save the Model
Module 6: Upload to Hugging Face Hub
```

That is the entire workflow. Let's build it.

Before touching a model or a dataset, confirm that your hardware is ready and your libraries are imported. A misconfigured runtime is the most common reason beginner fine-tuning runs fail before they start.

```python
# ============================================================
# MODULE 1: SETUP AND HARDWARE CHECK
# ============================================================
import os

import torch
import transformers
import trl
import datasets

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Detect the best available compute backend.
# CUDA = NVIDIA GPU. MPS = Apple Silicon. CPU = fallback.
# You genuinely need CUDA or MPS for this to be practical.
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"

print(f"Device: {DEVICE}")

# If you are on CUDA, print memory details so you know what you are working with.
# 8GB minimum is recommended for this model at bfloat16 precision.
if DEVICE == "cuda":
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Total VRAM: {total_memory:.2f} GB")
```

You should see your device printed clearly. If you see cpu, stop and fix your runtime before continuing. Training on CPU for this model will take hours instead of minutes.

A base model is the starting point. It already understands language in general: grammar, word relationships, how sentences are structured.
Your fine-tuning run will take that general understanding and redirect it toward a specific task.

The model we are using is google/gemma-3-270m-it. The "it" suffix stands for instruction-tuned, meaning it has already been aligned to follow user instructions. This makes it a better starting point than a raw base model for a task that involves following a specific output format.

The model does not read text. It reads numbers. A tokenizer is the translator between human-readable text and the integer sequences the model actually processes. The tokenizer for this model converts:

```
"Hello my name is Daniel" → [2, 9259, 1041, 1463, 563, 13108]
```

Every model has its own tokenizer tied to its specific vocabulary. Always load the tokenizer that matches the model. Using a mismatched tokenizer will produce garbage output.

```python
# ============================================================
# MODULE 2: LOAD THE BASE MODEL
# ============================================================
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "google/gemma-3-270m-it"

# Load the model.
# dtype="auto" loads the weights in the precision stored with the checkpoint
# (usually bfloat16 on modern GPUs). Older transformers releases call this
# argument torch_dtype.
# device_map="auto" places the model on your GPU automatically.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype="auto",
    device_map="auto",
    attn_implementation="eager",
)

# Load the tokenizer that matches this model exactly.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

print(f"Model loaded on: {model.device}")
print(f"Model precision: {model.dtype}")

# Count the total number of trainable parameters.
# In full fine-tuning, ALL parameters are updated during training.
# This is different from LoRA, where only a small adapter is trained.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
```

In full fine-tuning, every parameter in the model is updated during training. This gives the model the most flexibility to learn your task, but it requires more VRAM and takes longer. At 270M parameters, Gemma 3 270M is small enough to do this comfortably on a consumer GPU.

LoRA (Low-Rank Adaptation) is an alternative where you freeze the original model weights and only train a small set of adapter weights. This uses far less memory and is the preferred approach for larger models. For a 270M model with a specific task and a GPU with at least 8GB VRAM, full fine-tuning is the right choice.

Training data is what separates a generic model from a task-specific one. The dataset needs three things to be ready for training: it needs to be loaded, it needs to be formatted into the input-output structure the model expects, and it needs to be split into training and test sets.

We are using mrdbourke/FoodExtract-1k, a dataset of 1,420 samples. Each sample contains a raw text string and a structured label showing what food and drink items should be extracted from it. The labels were generated by a large teacher model (gpt-oss-120b) and condensed into a compact format that uses fewer tokens. This compact format is what we will train our model to produce. Fewer output tokens means faster inference at deployment time.

Supervised fine-tuning (SFT) means you give the model paired examples of input and ideal output. The model learns to produce outputs like the examples by minimizing the difference between what it generates and what the label says it should generate. In our case, the input is a raw text string and the output is the structured food extraction format.
Given enough examples, the model learns the pattern.

```python
# ============================================================
# MODULE 3: LOAD AND PREPARE THE DATASET
# ============================================================
from datasets import load_dataset

# Load the dataset from Hugging Face Hub.
dataset = load_dataset("mrdbourke/FoodExtract-1k")

print(f"Total samples: {len(dataset['train'])}")
print(f"Columns: {dataset['train'].column_names}")

# Inspect one sample so you understand the raw structure.
# The two columns we care about are:
#   'sequence' → the raw text input
#   'gpt-oss-120b-label-condensed' → the structured output we want to produce
sample = dataset["train"][0]
print(f"\nInput:\n{sample['sequence']}")
print(f"\nTarget output:\n{sample['gpt-oss-120b-label-condensed']}")
```

Language models trained for instruction following expect inputs in a conversational format with roles: user (the input) and assistant (the expected output). This is called a chat template. The instruction-tuned model has already been trained on this structure, so matching it during fine-tuning produces better results than feeding raw text directly.

```python
# Convert each raw sample into the prompt/completion format the model expects.
#   'prompt' → what the user sends (the raw text to extract from)
#   'completion' → what the assistant should respond with (the structured output)
# TRL's SFTTrainer knows how to handle this exact format automatically.
def format_sample(sample):
    return {
        "prompt": [{"role": "user", "content": sample["sequence"]}],
        "completion": [
            {"role": "assistant", "content": sample["gpt-oss-120b-label-condensed"]}
        ],
    }


dataset = dataset.map(format_sample, batched=False)

# Verify the formatting worked.
formatted_sample = dataset["train"][0]
print(f"Prompt role: {formatted_sample['prompt'][0]['role']}")
print(f"Completion role: {formatted_sample['completion'][0]['role']}")

# Split into 80% training, 20% test.
# The model trains on the training set and is evaluated on the test set.
# The test set contains samples the model has never seen during training.
# This is how we get an honest measure of how well fine-tuning worked.
dataset = dataset["train"].train_test_split(
    test_size=0.2,
    shuffle=False,  # Keep order consistent for reproducibility
    seed=42,  # Only takes effect when shuffle=True
)

print(f"Training samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
```

This is where the actual learning happens. You configure the training settings (called hyperparameters), hand everything to a trainer, and let it run.

The two objects you need are SFTConfig (the settings) and SFTTrainer (the engine that runs training using those settings). SFT stands for Supervised Fine-Tuning.

`num_train_epochs`: how many times the trainer passes through the full training dataset. Three epochs is a reasonable default for a focused task like this. Too few and the model hasn't learned enough. Too many and it starts memorizing the training data instead of generalizing.

`per_device_train_batch_size`: how many samples the model processes at once before updating its weights. Larger batches use more VRAM. If you hit an out-of-memory error, reduce this number first.

`learning_rate`: how large each weight update step is. Too high and training becomes unstable. Too low and the model barely changes. 5e-5 is a safe starting point for fine-tuning a pre-trained model.

`completion_only_loss`: this tells the trainer to compute the training loss only on the output tokens, not the input tokens. The model only needs to learn how to generate the structured output, so computing loss on the input would spend training signal on text it never has to produce.

`max_length`: the maximum number of tokens per training sample. Any sample longer than this is truncated. Set it based on the length of your data and your available VRAM.
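To make two of these settings concrete, here is a minimal sketch in plain Python. It is not TRL's internal code: `build_labels` and `optimizer_steps` are hypothetical helpers, and the token IDs are made up. The sketch assumes the common convention of marking ignored positions with the label -100 (the default `ignore_index` of PyTorch's cross-entropy loss), and it assumes a single device with no gradient accumulation when counting weight updates.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion, masking the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


def optimizer_steps(num_samples, batch_size, num_epochs):
    """Number of weight updates on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical token IDs, chosen only for illustration.
prompt_ids = [2, 105, 2364, 107]
completion_ids = [9001, 9002, 9003]

input_ids, labels = build_labels(prompt_ids, completion_ids)
print(input_ids)  # [2, 105, 2364, 107, 9001, 9002, 9003]
print(labels)     # [-100, -100, -100, -100, 9001, 9002, 9003]

# 80% of the 1,420 samples is 1,136 training samples.
print(optimizer_steps(num_samples=1136, batch_size=16, num_epochs=3))  # 213
```

Only the three completion positions contribute to the loss in this example. The step count also shows why halving the batch size to fit in memory doubles the number of updates per epoch.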
```python
# ============================================================
# MODULE 4: CONFIGURE AND RUN TRAINING
# ============================================================
from trl import SFTConfig, SFTTrainer

CHECKPOINT_DIR = "./fine_tuned_model"
BATCH_SIZE = 16  # Reduce to 8 or 4 if you run out of VRAM
LEARNING_RATE = 5e-5
NUM_EPOCHS = 3

sft_config = SFTConfig(
    # Where to save the model and training checkpoints
    output_dir=CHECKPOINT_DIR,
    # Training duration
    num_train_epochs=NUM_EPOCHS,
    # Batch size per GPU
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    # Maximum token length per sample (longer samples are truncated)
    max_length=512,
    packing=False,
    # Only compute loss on the output (completion) tokens, not the input tokens.
    # This focuses the model on learning the output format.
    completion_only_loss=True,
    # Optimizer - adamw_torch_fused is faster than standard adamw on GPU
    optim="adamw_torch_fused",
    # Learning rate settings
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="constant",
    # Precision: set based on model dtype.
    # bfloat16 is more stable than float16 for most modern GPUs.
    bf16=(model.dtype == torch.bfloat16),
    fp16=(model.dtype == torch.float16),
    # Evaluation and checkpointing
    eval_strategy="epoch",  # Evaluate after each full pass through the data
    save_strategy="epoch",  # Save a checkpoint after each epoch
    # Load the best checkpoint at the end of training
    load_best_model_at_end=True,
    metric_for_best_model="mean_token_accuracy",
    greater_is_better=True,
    # Logging
    logging_steps=10,
    # Disable external reporting for simplicity
    report_to="none",
    push_to_hub=False,
)

# Create the trainer object.
# This is the engine that runs the actual fine-tuning loop.
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)

# Start training.
# This is the line that actually trains the model.
# Watch the training loss - it should trend downward across steps.
training_output = trainer.train()

print("Training complete.")
print(f"Final training loss: {training_output.training_loss:.4f}")
```

The trainer logs the training loss every `logging_steps` steps and reports the validation loss at the end of each epoch. Those are the two numbers to focus on. Training loss should decrease steadily. Validation loss should follow a similar trend. If validation loss starts increasing while training loss keeps decreasing, the model is beginning to overfit to the training data. At that point, `load_best_model_at_end=True` ensures you end up with the best checkpoint, not the most recently trained one.

Training is not finished when the loop ends. You need to confirm the model actually learned the task by evaluating it on samples it has never seen, then save it to disk in a reusable format.

```python
# ============================================================
# MODULE 5: EVALUATE AND SAVE THE MODEL
# ============================================================
# Run the model across the full test dataset and collect metrics.
eval_results = trainer.evaluate()

print("Evaluation results:")
print(f"  Loss: {eval_results['eval_loss']:.4f}")
print(f"  Mean token accuracy: {eval_results['eval_mean_token_accuracy'] * 100:.2f}%")
print(f"  Best metric: {trainer.state.best_metric * 100:.2f}%")
```

Mean token accuracy measures what percentage of predicted tokens exactly match the target label tokens. Because the output format is highly structured and repetitive (the same format keys appear in every sample), even a model that has partially learned the task will score relatively high here. The more diagnostic check is the manual inspection below.

Numbers do not tell the full story. Always look at actual model outputs on test samples and compare them to the ground truth labels. This is where you catch failure modes that metrics miss.
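To see why a structured format inflates token accuracy, here is a small sketch of the metric in plain Python. It is an illustration rather than TRL's implementation: `mean_token_accuracy` is a hypothetical helper, the token IDs are invented, and the predictions are assumed to be already aligned position by position with the labels, with -100 marking prompt positions that are not scored.

```python
IGNORE_INDEX = -100  # label value marking positions excluded from the metric


def mean_token_accuracy(predicted_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of scored positions where the predicted token equals the label."""
    scored = [(p, t) for p, t in zip(predicted_ids, label_ids) if t != ignore_index]
    if not scored:
        return 0.0
    return sum(p == t for p, t in scored) / len(scored)


# Hypothetical IDs: 11, 12, 13 and 14 stand for the fixed format keys,
# the remaining IDs stand for the extracted content.
labels      = [-100, -100, 11, 1, 12, 51, 52, 13, 61, 62, 14, 71]
predictions = [   7,    8, 11, 1, 12, 51, 99, 13, 61, 98, 14, 71]

print(f"{mean_token_accuracy(predictions, labels):.2f}")  # 0.80
```

In this toy case the model gets two content tokens wrong, which could mean a wrong food and a wrong drink, yet still scores 80% because every format key is correct. That gap is the reason to read real outputs as well.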
```python
from transformers import pipeline

# Load the model into a text-generation pipeline for easy inference.
inference_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)


def predict(input_text: str, max_new_tokens: int = 256) -> str:
    """
    Run inference on a raw text string.

    Formats the input into chat template format, runs the model,
    and returns only the generated output - not the input prompt.
    """
    formatted_input = [{"role": "user", "content": input_text}]
    prompt = inference_pipeline.tokenizer.apply_chat_template(
        conversation=formatted_input,
        tokenize=False,
        add_generation_prompt=True,
    )
    output = inference_pipeline(
        text_inputs=prompt,
        max_new_tokens=max_new_tokens,
        disable_compile=True,
    )
    # Return only the newly generated text, not the input prompt
    return output[0]["generated_text"][len(prompt):]


# Inspect 5 random test samples.
# For each one: print the input, the model's output, and the ground truth label.
import random

random.seed(42)
test_indices = random.sample(range(len(dataset["test"])), 5)

for i, idx in enumerate(test_indices):
    sample = dataset["test"][idx]
    raw_input = sample["sequence"]
    ground_truth = sample["gpt-oss-120b-label-condensed"]
    model_output = predict(raw_input)

    print(f"\n{'=' * 55}")
    print(f"Sample {i + 1}")
    print(f"{'=' * 55}")
    print(f"Input:\n{raw_input[:200]}...")
    print(f"\nModel output:\n{model_output}")
    print(f"\nGround truth:\n{ground_truth}")
```

A well fine-tuned model should produce outputs that closely match the ground truth in structure and content. The format should be consistent across every sample: the same keys in the same order, the same field names, no extra text before or after.

If the model is occasionally missing a food item or tagging things slightly differently, that is normal for a model trained on roughly 1,100 samples. The format consistency is what matters most. Content accuracy improves with more training data.

```python
# Save the best model to disk.
# trainer.save_model saves to the output_dir specified in SFTConfig.
# This includes the model weights, tokenizer, and configuration files.
trainer.save_model()

print(f"Model saved to: {CHECKPOINT_DIR}")
```

After saving, your checkpoint directory will contain everything needed to reload and use the model: model.safetensors, config.json, tokenizer.json, and supporting files. You can reload it at any time with `AutoModelForCausalLM.from_pretrained(CHECKPOINT_DIR)`.

Always reload the model from disk and run one inference call to confirm the saved files are valid before uploading or deploying.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Reload from disk
reloaded_model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT_DIR,
    dtype="auto",
    device_map="auto",
    attn_implementation="eager",
)
reloaded_tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)

# Quick sanity check
reloaded_pipeline = pipeline(
    "text-generation",
    model=reloaded_model,
    tokenizer=reloaded_tokenizer,
)

test_text = "A plate with scrambled eggs, avocado toast, and a glass of orange juice."
test_prompt = reloaded_pipeline.tokenizer.apply_chat_template(
    conversation=[{"role": "user", "content": test_text}],
    tokenize=False,
    add_generation_prompt=True,
)
test_output = reloaded_pipeline(test_prompt, max_new_tokens=128)
generated = test_output[0]["generated_text"][len(test_prompt):]

print(f"Input: {test_text}")
print(f"Output: {generated}")
```

You should see a correctly formatted extraction output. If you do, the model is saved correctly.

A model saved only to your local disk disappears when your Colab session ends or your machine resets. Uploading to Hugging Face Hub gives your model a permanent home, makes it reusable across machines, and optionally shareable with anyone.

You need a Hugging Face account and an access token. Create an account at huggingface.co (https://huggingface.co/), then generate a token under Settings → Access Tokens. Give it write permissions. In Colab, store it under Secrets (the key icon in the left panel) with the name HF_TOKEN.
In a local environment, set it as an environment variable.

```python
from huggingface_hub import login

# This will prompt for your token if it isn't already set.
# In Colab, use userdata.get("HF_TOKEN") to read from Secrets.
login()
```

A model card is the README for your model. It tells anyone who finds it what the model does, how it was trained, and how to use it. This is not optional: an undocumented model is nearly useless to anyone, including your future self.

```python
# ============================================================
# MODULE 6: UPLOAD TO HUGGING FACE HUB
# ============================================================
YOUR_HF_USERNAME = "your_username"  # Replace with your actual Hugging Face username
MODEL_REPO_ID = f"{YOUR_HF_USERNAME}/FoodExtract-gemma-3-270m-finetuned"

model_card = """---
base_model: google/gemma-3-270m-it
library_name: transformers
tags:
- sft
- trl
- fine-tuned
- food-extraction
license: gemma
---

# FoodExtract: Fine-Tuned Gemma 3 270M

A fine-tuned version of [google/gemma-3-270m-it](https://huggingface.co/google/gemma-3-270m-it) trained to extract food and drink items from raw text.

## What It Does

Given any text input, this model outputs a structured extraction:

    food_or_drink: 1
    tags: fi, di
    foods: eggs, bacon, toast
    drinks: orange juice

If the input contains no food or drink items:

    food_or_drink: 0
    tags:
    foods:
    drinks:

## Tag Reference

| Tag | Meaning |
|-----|---------|
| np | nutrition panel |
| il | ingredient list |
| me | menu |
| re | recipe |
| fi | food items |
| di | drink items |
| fa | food advertisement |
| fp | food packaging |

## How to Use

Load the model and tokenizer with `AutoModelForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained`, format the input text as a user message with the chat template, and generate.
"""

with open(f"{CHECKPOINT_DIR}/README.md", "w") as f:
    f.write(model_card)

print("Model card written.")
```

With the model card written, create the repository on the Hub and upload the contents of the checkpoint directory. Once the upload finishes, your model is live at https://huggingface.co/{YOUR_HF_USERNAME}/FoodExtract-gemma-3-270m-finetuned. Anyone with the link can download and use it. You can also load it from any machine in the future with just the model ID.
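The repository creation and upload step can be sketched as follows. This is one possible way to do it, assuming the `huggingface_hub` client's `HfApi.create_repo` and `HfApi.upload_folder` methods and a token already stored by `login()`. The helper names `build_repo_id` and `upload_checkpoint`, the placeholder username, and the `./fine_tuned_model` path are assumptions for illustration. The upload call is left commented out because it needs network access and valid credentials.

```python
def build_repo_id(username: str, model_name: str) -> str:
    """Join a Hub username and a model name into a repository ID."""
    username, model_name = username.strip(), model_name.strip()
    if not username or not model_name:
        raise ValueError("username and model_name must be non-empty")
    return f"{username}/{model_name}"


def upload_checkpoint(folder_path: str, repo_id: str, private: bool = False) -> str:
    """Create the model repo if needed, upload the folder, and return the repo URL."""
    # Imported lazily so build_repo_id works even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token stored by login() or the HF_TOKEN variable
    api.create_repo(repo_id=repo_id, repo_type="model", private=private, exist_ok=True)
    api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="model")
    return f"https://huggingface.co/{repo_id}"


# Placeholder username: replace it with your own.
repo_id = build_repo_id("your_username", "FoodExtract-gemma-3-270m-finetuned")
print(repo_id)  # your_username/FoodExtract-gemma-3-270m-finetuned

# Uncomment once you are logged in and the checkpoint directory exists:
# url = upload_checkpoint("./fine_tuned_model", repo_id)
# print(f"Uploaded to: {url}")
```

Because training saved a checkpoint after every epoch, the output directory likely also holds `checkpoint-*` subfolders with optimizer state. If you do not want those on the Hub, `upload_folder` accepts an `ignore_patterns` argument that should let you skip them. After the upload, passing the repository ID as the `model` argument of `pipeline("text-generation", ...)` loads the model on any machine.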
Here is the entire workflow condensed to its essentials: every module, every critical line, in sequence.

```
Module 1: Detect hardware → confirm GPU is available
Module 2: Load model + tokenizer → google/gemma-3-270m-it
Module 3: Load dataset → format into prompt/completion → split 80/20
Module 4: Configure SFTConfig → build SFTTrainer → call trainer.train()
Module 5: Evaluate metrics → inspect outputs manually → save model to disk
Module 6: Write model card → create HF repo → upload folder
```

Each module has one clear input and one clear output. None of them depends on anything outside its own section except the objects passed from the previous one.

You have a fine-tuned language model. Not a wrapper around an API. Not a prompt-engineered workaround. A model whose weights have been updated to reliably perform a specific task, saved to disk, and living on Hugging Face where you can reload it in one line from anywhere.

The task in this article was food extraction. The pattern is universal. The same six modules apply to any structured data extraction problem: replace the dataset, adjust the output format in your model card, change the task description in your evaluation function. Everything else stays the same.

Fine-Tune Your First LLM: A Guide with PyTorch and Hugging Face (https://pub.towardsai.net/fine-tune-your-first-llm-a-guide-with-pytorch-and-hugging-face-bc4cdfb156c3) was originally published in Towards AI (https://pub.towardsai.net) on Medium.