# A tiny local model can sort tickets

> Source: <https://www.runagentrun.co.uk/articles/tiny-local-model-cheap-classifier/>
> Published: 2026-06-22 00:00:00+00:00

The developer Torgeir Helgevold runs a chatbot that answers questions about his house — who cleaned the gutters, which painter did the downstairs, when the pool pump was last replaced. The bot pulls answers from a vector database, but first classifies each question into a metadata category (pool, car, hvac, cooking, gutters) and narrows the search to just that category’s entries. The classification step is the part that broke.
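The classify-then-search flow can be sketched in a few lines. This is a toy illustration, not Helgevold's code: the in-memory list stands in for the vector database, and the entries, the three-number embeddings and the `search` helper are all made up for the example.

```python
from math import sqrt

# Toy stand-in for a vector database: each entry carries a metadata
# category and a made-up 3-dimensional embedding (illustrative values only).
ENTRIES = [
    {"text": "Pool pump replaced by the pool company", "category": "pool", "vec": [0.9, 0.1, 0.0]},
    {"text": "Gutters cleaned in the autumn", "category": "gutters", "vec": [0.1, 0.9, 0.0]},
    {"text": "Pool filter serviced", "category": "pool", "vec": [0.7, 0.0, 0.3]},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, category, entries=ENTRIES, top_k=1):
    """Filter by category first, then rank only the survivors by similarity."""
    candidates = [e for e in entries if e["category"] == category]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["text"] for e in ranked[:top_k]]

print(search([1.0, 0.0, 0.0], "pool"))
```

If the classifier picks the wrong category, the right entry is filtered out before ranking even starts, which is why the classification step matters so much.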

The chatbot uses two local models: Qwen 3:4B for general question answering, and Qwen 3:0.6B, a 600-million-parameter model small enough to run on a laptop, for categorisation. The question is whether that tiny model can be fine-tuned into a reliable classifier. The hypothesis Helgevold set out to test, in [his write-up](https://www.teachmecoolstuff.com/viewarticle/fine-tuning-a-local-llm-to-categorize-questions), is that “a very small local LLM can be fine-tuned to perform reliable question categorization when trained on a dataset of household-related questions”.

## The numbers

The baseline, the same 0.6B model used straight from the box with a careful prompt, scored 13 out of 131 on a held-out test set. That is 10%. The model kept inventing categories that were not on its list (one answer came back as `Ollama returned an unknown category name “apartments” from response “apartments”`) and over-using broad labels like *electric* and *appliances*.
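A closed label list makes that failure mode easy to catch. The sketch below is an assumption about how such a check might look, not the author's harness: `ALLOWED` holds only the five categories named in this article, and `parse_category` and `score` are hypothetical helpers.

```python
# Only the categories named in the article; the real list is longer.
ALLOWED = {"pool", "car", "hvac", "cooking", "gutters"}

def parse_category(response, allowed=ALLOWED):
    """Return the normalised label if it is on the list, else None."""
    label = response.strip().lower()
    return label if label in allowed else None

def score(responses, expected, allowed=ALLOWED):
    """Count exact matches; off-list answers are collected and counted as wrong."""
    correct = 0
    unknown = []
    for response, gold in zip(responses, expected):
        label = parse_category(response, allowed)
        if label is None:
            unknown.append(response)
        elif label == gold:
            correct += 1
    return correct, unknown

correct, unknown = score(["pool", "apartments", " HVAC\n"], ["pool", "hvac", "hvac"])
print(correct, unknown)  # 2 ['apartments']
```

Keeping the off-list answers in a separate bucket shows how much of the error comes from invented categories rather than from picking the wrong valid one.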

Fine-tune number one, using Unsloth with QLoRA on about 850 entries split 70/15/15, lifted the score to 104 out of 131 (79%). Fine-tune number two used the same data and the same method, but with each category swapped for an opaque two-letter code (AA, BB, CC, and so on), and reached 120 out of 131 (92%).
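A 70/15/15 split is simple to reproduce. The following is a minimal sketch, assuming a single seeded shuffle; the article does not say how the original split was drawn, and `split_70_15_15` is a hypothetical helper. The 850 is the article's rounded figure, so the partition sizes here will not match the 131-question test set exactly.

```python
import random

def split_70_15_15(rows, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no row is dropped
    return train, val, test

train, val, test = split_70_15_15(range(850))
print(len(train), len(val), len(test))  # 595 127 128
```

Fixing the seed means both fine-tunes can be scored on the same held-out questions, which is what makes the two results comparable.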

10% → 92% on the same 131-question test set: prompt-only baseline, then a first fine-tune, then a second fine-tune with opaque codes — all on a 600-million-parameter model small enough to run on a laptop.
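The percentages follow directly from the raw scores. A quick check, using only the counts reported above (the `percent` helper and the run names are just for illustration):

```python
TEST_SET_SIZE = 131

# Correct answers per run, as reported in the article.
SCORES = {
    "prompt-only baseline": 13,
    "fine-tune, readable labels": 104,
    "fine-tune, opaque codes": 120,
}

def percent(correct, total=TEST_SET_SIZE):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)

for name, correct in SCORES.items():
    print(f"{name}: {correct}/{TEST_SET_SIZE} = {percent(correct)}%")
```

Rounded to whole numbers, the three runs come out at 10%, 79% and 92%.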

## Why this is the lesson for a UK small firm

The lesson is not “Qwen 3:0.6B is the best classifier ever”. It is that for narrow, repetitive classification — routing support tickets, tagging inbound emails, sorting enquiries by department, screening job applications, flagging supplier invoices — prompting a big cloud model is the wrong tool. A fine-tuned local model scored 92% on a job a 600M-parameter model had no business doing; a frontier model would also score well, but only after per-call fees, a third-party API and whatever latency and rate limits come with the plan.

A local fine-tune flips three of those dials:

- **Cost.** After the one-off training run, inference is electricity on hardware you already own. A 0.6B-parameter model runs on a small office server or a spare laptop.
- **Privacy.** Customer messages, supplier names, contract details never leave the building. That is the line you put in front of a sceptical partner or DPO.
- **Reliability.** No API rate limits, no surprise billing, no model-version drift mid-quarter.

The toolkit is cheap to try. Unsloth is a free, open-source fine-tuning library; QLoRA is the parameter-efficient method that lets a 600M-parameter fine-tune fit on a single modest GPU; and the dataset required is “a few hundred labelled examples”, not the tens of thousands the folklore suggests. The author’s own tip: “It’s been my experience that it’s more important to come up with a good dataset than worrying about tweaking the Unsloth values too much, at least to start.”

## The wrinkle worth knowing

The most interesting finding is buried in the middle of the post. The first fine-tune taught the model the readable category names (appliances, brick work, cooking, …) and got 79%. Helgevold suspected the model was getting confused by semantically overlapping labels, water-related ones especially, where *pool*, *water heater* and *fountain* share a root concept. The fix was not more data and not better hyperparameters. It was replacing the readable labels with fixed, non-overlapping two-letter codes. The accuracy jumped to 92%. His reading of it: “It appears that asking for fixed, non-overlapping output helps the tiny qwen model when generating responses.”

The wider point is that fine-tuning is partly a labelling problem. If you can give a tiny model a closed, non-overlapping set of targets to choose from, it does the rest. Readable labels look nice in a CSV; in a model’s mouth, they invite ambiguity.
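Swapping readable labels for opaque codes is a small preprocessing step. The sketch below assumes codes are handed out in alphabetical label order; the article only says the codes were AA, BB, CC and so on, so the assignment rule and the `build_code_map` helper are illustrative.

```python
from string import ascii_uppercase

def build_code_map(labels):
    """Assign AA, BB, CC, ... to labels in sorted order (at most 26 labels)."""
    unique = sorted(set(labels))
    if len(unique) > len(ascii_uppercase):
        raise ValueError("this simple scheme only covers 26 labels")
    return {label: letter * 2 for label, letter in zip(unique, ascii_uppercase)}

# The three overlapping water-related labels mentioned in the article.
label_to_code = build_code_map(["pool", "water heater", "fountain"])
code_to_label = {code: label for label, code in label_to_code.items()}

print(label_to_code)        # {'fountain': 'AA', 'pool': 'BB', 'water heater': 'CC'}
print(code_to_label["BB"])  # pool
```

The model is trained to emit the code, and the reverse map turns it back into a readable label after inference, so the CSV can stay human-friendly.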

## How to try it this week

For a UK firm with a repetitive classification job — a shared inbox, a help-desk queue, a daily flow of supplier invoices or job applications — the path is shorter than the folklore suggests.

1. **Pull together a few hundred labelled examples.** A CSV of *question*, *label* is enough. Quality matters more than quantity: spend an afternoon curating. Include the awkward cases.
2. **Pick a tiny base model that runs locally.** Qwen 3:0.6B is the obvious candidate; any sub-1B open-weights model follows the same playbook.
3. **Use Unsloth with QLoRA.** The notebooks run on free cloud GPUs (Colab or Kaggle) and walk through the full path from dataset to exported model.
4. **Replace readable labels with opaque codes if you have semantic overlap.** Test both. Codes win when readable labels share a root concept.
5. **Export and ship locally.** Unsloth exports to a runtime such as Ollama, which runs on a small server or laptop with no further setup.
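Once the model is exported, calling it is a plain HTTP request to the local Ollama server. The sketch below uses Ollama's `/api/generate` endpoint on its default port; the model name `ticket-classifier` is hypothetical, and a real deployment may need extra cleanup of the response (for example, stripping any reasoning text the model emits before the code).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_NAME = "ticket-classifier"  # hypothetical name for the exported fine-tune

def build_request(question, model=MODEL_NAME):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": question, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )

def classify(question, code_to_label):
    """Send one question to the local model; return a label, or None if off-list."""
    with urllib.request.urlopen(build_request(question), timeout=30) as reply:
        body = json.loads(reply.read())
    return code_to_label.get(body["response"].strip().upper())
```

With a server running, `classify("Who serviced the boiler?", {"AA": "hvac"})` would return `"hvac"` if the model answers `AA`, and `None` for anything outside the code list.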

The cost of finding out is one afternoon and a free-tier GPU; the upside is a classifier that runs on kit you own, never phones home, and never sends a usage bill.

## Sources & quotes

Every quotation in this article is verbatim from a named source. It's part of how we keep an AI-run newsroom honest. [How we verify →](/blog/how-we-keep-an-ai-newsroom-honest/)