There are a lot of ways to do this, so I’d split it up a bit. My first suggestion would be: measure the capability you actually care about before anything else.
Short version:
Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.
Public benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.

A practical first workflow
1. Define the target behavior
   - What was the fine-tune supposed to improve?
2. Create a held-out eval set
   - Use examples that were not in the training data.
3. Run the base model
4. Run the fine-tuned model
   - Same examples, same prompt template, same decoding settings.
5. Compare with task-appropriate metrics
   - Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.
6. Inspect individual failures
   - Do not only look at the average score.
7. Add public benchmarks if they match the goal
   - Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.
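To make the workflow concrete, here is a minimal sketch of the base-vs-fine-tuned comparison on a held-out set. Everything in it is an assumption for illustration: `base_generate` and `tuned_generate` are hypothetical stand-ins for your real inference calls, and the tickets, labels, and template are made up.

```python
from typing import Callable, Dict, List

# Hypothetical held-out examples; none of these should appear in the training data.
HELD_OUT: List[Dict[str, str]] = [
    {"input": "I want my money back", "expected": "refund"},
    {"input": "Why was I charged twice?", "expected": "billing"},
    {"input": "Please give me my money back now", "expected": "refund"},
]

# One template shared by both models, so the prompt is not a hidden variable.
PROMPT_TEMPLATE = "Classify the support ticket.\nTicket: {ticket}\nLabel:"


def base_generate(prompt: str) -> str:
    """Stand-in for the base model: always answers 'billing'."""
    return "billing"


def tuned_generate(prompt: str) -> str:
    """Stand-in for the fine-tuned model."""
    return "refund" if "money back" in prompt.lower() else "billing"


def run_eval(generate: Callable[[str], str],
             examples: List[Dict[str, str]],
             template: str) -> List[Dict]:
    """Run one model over every example and keep the raw output per sample."""
    rows = []
    for ex in examples:
        output = generate(template.format(ticket=ex["input"]))
        correct = output.strip().lower() == ex["expected"]
        rows.append({**ex, "output": output, "correct": correct})
    return rows


def accuracy(rows: List[Dict]) -> float:
    return sum(r["correct"] for r in rows) / len(rows)


base_rows = run_eval(base_generate, HELD_OUT, PROMPT_TEMPLATE)
tuned_rows = run_eval(tuned_generate, HELD_OUT, PROMPT_TEMPLATE)

# Regressions: cases the base model got right and the fine-tune got wrong.
regressions = [t["input"] for b, t in zip(base_rows, tuned_rows)
               if b["correct"] and not t["correct"]]

print(f"base accuracy:  {accuracy(base_rows):.2f}")
print(f"tuned accuracy: {accuracy(tuned_rows):.2f}")
print(f"regressions:    {regressions}")
```

The per-sample rows are kept on purpose, so you can inspect individual failures instead of only the average.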
- Start with your actual task, not a leaderboard
Before choosing a benchmark, I would write the goal in plain language.
| If the fine-tune is meant to improve… | A better first eval is usually… |
|---|---|
| Support-ticket classification | Accuracy / F1 on held-out tickets |
| Domain QA | Held-out question/answer examples |
| JSON or structured output | JSON validity, schema validity, field accuracy |
| Chat helpfulness | Pairwise comparison, rubric, human spot-checks |
| Summarization | Coverage, factuality, maybe ROUGE/BERTScore as supporting metrics |
| Code generation | Unit tests, hidden tests, code-specific benchmark tasks |
| General reasoning | Public reasoning benchmarks may be relevant |

The key question is:
Does this benchmark measure the thing I actually fine-tuned for?
A general benchmark can be useful, but it might not detect the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. Conversely, a model can do well on a public benchmark and still be poor at your format, domain, or workflow.
- Minimal checklist
If you only do one thing, I would do this:

| Step | Check |
|---|---|
| Goal | What should improve? |
| Data | Are eval examples excluded from training? |
| Baseline | Did you run the base model too? |
| Fairness | Are prompt/settings the same? |
| Metric | Does the metric match the task? |
| Samples | Did you inspect actual outputs? |
| Regression | Did anything get worse? |
| Notes | Can someone else reproduce the setup? |
That is already a useful benchmark for many fine-tunes.
# More detail: building a held-out eval set
A held-out eval set means examples that were not used during fine-tuning.
For a small first pass, I would include several slices:

| Eval slice | Purpose |
|---|---|
| Normal cases | Representative examples from the real task |
| Hard cases | Inputs the base model often gets wrong |
| Regression cases | Inputs the base model already handles well |
| Format cases | Inputs that test exact response structure |
| Edge cases | Ambiguous, long, missing-context, or messy inputs |
This does not need to be huge at the beginning. A small, carefully chosen set is often better than a large but unclear one.
For example, if you fine-tuned a model to answer internal policy questions, I would not start with MMLU. I would first create a small set of policy questions with known expected answers, including:
- easy questions
- tricky questions
- questions where the answer is not in the source material
- questions where the model should refuse or say it does not know
- questions the base model currently gets wrong
- questions the base model currently gets right
If you evaluate on examples that were used in training, the result may mostly measure memorization or overfitting.
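A cheap first guard against that is to check for eval inputs that also appear in the training data. The sketch below is only an assumption about how you might do it: it hashes whitespace- and case-normalized text, so it catches exact and near-exact duplicates but not paraphrases. The example questions are made up.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaks(train_inputs: Iterable[str], eval_inputs: Iterable[str]) -> List[str]:
    """Return eval inputs whose normalized text also occurs in the training set."""
    train_hashes = {fingerprint(t) for t in train_inputs}
    return [e for e in eval_inputs if fingerprint(e) in train_hashes]


# Hypothetical data for illustration.
train = ["What is the refund window?", "How do I reset my password?"]
eval_set = ["what is the  refund window? ", "Can contractors expense travel?"]

leaks = find_leaks(train, eval_set)
print(f"{len(leaks)} leaked eval example(s): {leaks}")
```

If this finds anything, I would drop those examples from the eval set before comparing models.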
# More detail: keeping the comparison fair
A common mistake is to evaluate only the fine-tuned model. That gives a score, but it does not answer:
Did fine-tuning help compared with the original model?
I would compare at least:
| Model | Purpose |
|---|---|
| Base model | Baseline |
| Fine-tuned model | Measures the effect of fine-tuning |
| Optional similar model | Extra reference point |
Try to keep these fixed:
- prompt template
- chat template
- few-shot examples, if any
- system message
- decoding parameters
  - `temperature`
  - `top_p`
  - `max_new_tokens`
- retrieval context, if this is RAG
- quantization / backend, if possible
- model revision
- adapter revision, if using LoRA/PEFT
Otherwise, you may accidentally benchmark a prompt change, decoding change, or serving change instead of the fine-tune.
For LoRA/PEFT models, I would also be careful to record whether you evaluated:
- the base model alone
- the base model plus adapter
- a merged model
- a quantized version
- a served endpoint version
Those can behave differently enough that the exact setup matters.
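One way to keep yourself honest here is to put the shared settings in a single frozen object and refuse to compare two runs that used different ones. This is a sketch under my own assumptions: `EvalConfig`, `RunRecord`, and `assert_comparable` are hypothetical names, and the model names and revisions are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    """Settings that should be identical for every model being compared."""
    prompt_template: str
    system_message: str
    temperature: float
    top_p: float
    max_new_tokens: int

    def fingerprint(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """What was evaluated, and under which shared settings."""
    model_name: str
    model_revision: str
    variant: str  # e.g. "base", "base+adapter", "merged", "quantized", "endpoint"
    config: EvalConfig


def assert_comparable(a: RunRecord, b: RunRecord) -> None:
    """Raise if the two runs did not share the same prompt/decoding settings."""
    if a.config.fingerprint() != b.config.fingerprint():
        raise ValueError(f"{a.variant} and {b.variant} used different eval settings")


shared = EvalConfig(
    prompt_template="Question: {question}\nAnswer:",
    system_message="You are a helpful assistant.",
    temperature=0.0,
    top_p=1.0,
    max_new_tokens=256,
)
base_run = RunRecord("my-base-model", "rev-a", "base", shared)
tuned_run = RunRecord("my-base-model", "rev-a", "base+adapter", shared)
drifted_run = RunRecord("my-base-model", "rev-a", "merged",
                        replace(shared, temperature=0.7))

assert_comparable(base_run, tuned_run)  # same settings, so no error
try:
    assert_comparable(base_run, drifted_run)
except ValueError as err:
    print(f"refusing to compare: {err}")
```

Saving the `RunRecord` next to the scores also answers the "which variant did I actually evaluate?" question later.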
- Pick metrics by task
There is no single metric that works for all fine-tuned LLMs.
A useful reference is Hugging Face Evaluate’s metric guide.
Very roughly:
| Task type | Possible metrics / checks |
|---|---|
| Classification | accuracy, precision, recall, F1 |
| Extraction | exact match, field-level accuracy, schema validity |
| QA with known answers | exact match, F1, semantic correctness, human spot-check |
| Summarization | ROUGE/BERTScore can help, but inspect factuality and coverage |
| Translation | BLEU / chrF / COMET, depending on setup |
| Open-ended chat | rubric scoring, pairwise comparison, human or LLM judge |
| Instruction following | constraint pass rate, format adherence |
| Coding | unit tests, pass rate, hidden tests, code benchmark tasks |
| RAG/document QA | answer correctness, context recall, faithfulness, citation usefulness |
| Deployment | latency, throughput, VRAM, cost per request |
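For the classification row, the metrics are simple enough to compute by hand, which helps when you want to see exactly what is being counted. This is a plain-Python sketch with made-up labels; in practice you would probably use a library implementation instead.

```python
from typing import List, Tuple


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true: List[str], y_pred: List[str],
                        positive: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold labels and model predictions.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "spam"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="spam")
print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Run the same function on the base model's predictions and the fine-tuned model's predictions, and compare the numbers side by side.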
I would avoid relying on only one weak signal, such as:
- “the answer looks good to me”
- training loss
- validation loss
- one public leaderboard score
- one cherry-picked demo

Those can be useful, but they are not the whole evaluation.
# More detail: eval loss vs task success
If you trained with `Trainer`, `SFTTrainer`, or another training loop, validation/eval loss is useful as a training signal. It can help you notice overfitting or training instability.
But for final evaluation, especially generation tasks, I would not stop at eval loss.
A lower loss does not automatically mean:
- the answer is factually better
- the model follows your requested format better
- the model is safer
- the model is less hallucination-prone
- the model is better for your actual users
- the model did not regress on general behavior
For generation tasks, save actual model outputs and inspect them.
- Inspect failures, not only scores
Aggregate scores are useful, but a lot of the value comes from looking at individual examples.
For each output, I would save something like:

| Field | Why |
|---|---|
| Input prompt | Reproduces the case |
| Expected answer / rubric | Defines what “good” means |
| Base model output | Baseline |
| Fine-tuned output | Comparison |
| Score / pass-fail | Aggregate metric |
| Short error label | Helps find patterns |
Useful error labels might be:
- wrong answer
- incomplete answer
- hallucinated detail
- ignored instruction
- wrong format
- invalid JSON
- too verbose
- too short
- unsafe refusal
- should have refused but did not
- correct answer but bad explanation
- regression from base model
This is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection.
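If you are rolling your own logging, a JSONL file with one record per example is usually enough. The sketch below is plain Python, not Lighteval's API; the field names and example records are my own assumptions. It writes to an in-memory buffer so it runs as-is, but you would normally write to a file.

```python
import io
import json
from collections import Counter
from typing import Dict, List

# Hypothetical per-sample records, one per eval example.
records: List[Dict] = [
    {"prompt": "Return the order status as JSON.", "expected": '{"status": "shipped"}',
     "base_output": "It shipped.", "tuned_output": '{"status": "shipped"}',
     "passed": True, "error_label": None},
    {"prompt": "Return the refund amount as JSON.", "expected": '{"amount": 20}',
     "base_output": "20", "tuned_output": "The amount is 20.",
     "passed": False, "error_label": "wrong format"},
    {"prompt": "Return the customer tier as JSON.", "expected": '{"tier": "gold"}',
     "base_output": "gold", "tuned_output": "Tier: gold",
     "passed": False, "error_label": "wrong format"},
    {"prompt": "What is 2 + 2?", "expected": "4",
     "base_output": "4", "tuned_output": "5",
     "passed": False, "error_label": "regression from base model"},
]


def write_jsonl(rows: List[Dict], handle) -> None:
    """Write one JSON object per line so results are easy to diff and grep."""
    for row in rows:
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(handle) -> List[Dict]:
    return [json.loads(line) for line in handle if line.strip()]


buffer = io.StringIO()  # swap for open("results.jsonl", "w") in real use
write_jsonl(records, buffer)
buffer.seek(0)
reloaded = read_jsonl(buffer)

# Count error labels among failures to see which failure mode dominates.
label_counts = Counter(r["error_label"] for r in reloaded if not r["passed"])
for label, count in label_counts.most_common():
    print(f"{count:3d}  {label}")
```

The label counts are often more actionable than the aggregate score: "mostly wrong format" and "mostly hallucinated detail" call for different fixes.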
- Public benchmarks are a second layer
Once your own task-specific eval is in place, public benchmarks can be useful.
But I would choose them by target capability.
| If you care about… | Look at… | Main caution |
|---|---|---|
| General reasoning / knowledge | MMLU-Pro, GPQA, LiveBench, HELM-like suites | May not match your domain |
| Instruction following | IFEval-style tests | Measures verifiable constraints, not all helpfulness |
| Open-ended chat quality | MT-Bench / Arena-style pairwise evals / AlpacaEval-like setups | Judge and preference biases matter |
| Coding | LiveCodeBench, BigCodeBench, SWE-bench depending on task | Code completion and repo-level issue fixing are different |
| RAG / document QA | RAGAS-style component metrics | Retrieval and generation should be separated |
| Deployment | latency, throughput, VRAM, cost | Not the same as quality |
Common benchmark-running tools include Lighteval and lm-evaluation-harness.

I would not start with:
Which leaderboard should I optimize for?
I would start with:
Which capability did I fine-tune for, and does this benchmark actually measure it?
# More detail: public benchmark map
Public benchmarks are useful, but they answer different questions.

| Benchmark / eval style | Closer to measuring | Main caution |
|---|---|---|
| MMLU-Pro / GPQA | Broad academic knowledge and reasoning | Not necessarily your domain task |
| LiveBench | General capability with newer, contamination-aware questions | Still not a replacement for your own eval set |
| IFEval | Verifiable instruction-following constraints | Does not measure all helpfulness or factuality |
| Chatbot Arena / MT-Bench style evals | Open-ended assistant preference | Judge bias and preference bias matter |
| SWE-bench | Repo-level software engineering issue solving | Not the same as simple code completion |
| Serving benchmarks | Latency, throughput, cost, memory | Not a direct quality benchmark |
The trap is treating a leaderboard score as universal. It is not. A benchmark has:
- a task distribution
- a prompt format
- a scoring rule
- model/backend assumptions
- possible contamination issues
- sometimes hidden or changing evaluation details
So I would use leaderboard scores as orientation, not as the only decision rule.
- Special cases
The right evaluation changes a lot depending on what kind of fine-tune this is.
# If the task is open-ended chat
For open-ended chat, there may not be one exact correct answer. In that case, I would compare outputs with a small rubric.
Example rubric:
| Criterion | Question |
|---|---|
| Correctness | Is the answer factually right? |
| Completeness | Does it answer the whole question? |
| Instruction following | Did it follow user constraints? |
| Grounding | If sources/context are provided, does it stay grounded? |
| Format | Does it output the requested structure? |
| Safety | Does it refuse only when appropriate? |
| Usefulness | Would a user know what to do next? |
A practical method:
- Take the same prompt.
- Generate an answer from the base model.
- Generate an answer from the fine-tuned model.
- Hide which is which.
- Compare them side by side.
- Record the winner and the reason.
You can use a human judge, an LLM judge, or both. If using an LLM judge, I would treat it as a useful signal, not ground truth.
If you use LLM-as-a-judge, I would at least:
- use a clear rubric
- swap answer order A/B vs B/A
- avoid rewarding longer answers automatically
- manually inspect a sample of judgments
- keep judge prompts and judge model/version recorded
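The order-swap check is easy to automate. In this sketch the judge is any callable that returns "A", "B", or "tie"; the two toy judges are hypothetical stand-ins (a real one would call a human-labeling step or an LLM), and the answers are made up. A verdict only counts if it survives swapping the positions.

```python
from typing import Callable

# A judge takes (prompt, answer_a, answer_b) and returns "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str,
                      base_answer: str, tuned_answer: str) -> str:
    """Judge twice with positions swapped; keep the verdict only if both agree."""
    first = judge(prompt, base_answer, tuned_answer)    # base shown as A
    second = judge(prompt, tuned_answer, base_answer)   # tuned shown as A
    first_winner = {"A": "base", "B": "tuned"}.get(first, "tie")
    second_winner = {"A": "tuned", "B": "base"}.get(second, "tie")
    return first_winner if first_winner == second_winner else "inconsistent"


def position_biased_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "A"


def keyword_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy judge that prefers the answer stating the (made-up) 30-day policy."""
    a_ok, b_ok = "30 days" in answer_a, "30 days" in answer_b
    if a_ok == b_ok:
        return "tie"
    return "A" if a_ok else "B"


prompt = "What is the refund window?"
base_answer = "Refunds are possible."
tuned_answer = "Refunds are accepted within 30 days of purchase."

biased_verdict = compare_with_swap(position_biased_judge, prompt, base_answer, tuned_answer)
keyword_verdict = compare_with_swap(keyword_judge, prompt, base_answer, tuned_answer)
print(f"position-biased judge: {biased_verdict}")
print(f"keyword judge:         {keyword_verdict}")
```

A high share of "inconsistent" verdicts is itself a finding: it suggests the judge is reacting to position rather than content.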
# If the task is formatting or instruction following
If the fine-tune was meant to improve formatting, do not rely only on a general benchmark.
For example, if the model must output JSON, measure:
- valid JSON rate
- schema-valid rate
- required fields present
- forbidden fields absent
- correct field values
- refusal behavior when input is invalid
- robustness to longer or messy inputs
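The first few of those checks need nothing beyond the standard library. This is a minimal sketch, assuming the expected output is a flat JSON object; the field names and sample outputs are made up, and a full schema check (types, nested fields) would need a proper schema validator on top.

```python
import json
from typing import Dict, List, Set


def check_json_output(raw: str, required: Set[str], allowed: Set[str]) -> Dict[str, bool]:
    """Run layered checks on one raw model output."""
    result = {"valid_json": False, "is_object": False,
              "required_present": False, "no_extra_fields": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result
    result["valid_json"] = True
    if not isinstance(data, dict):
        return result
    result["is_object"] = True
    keys = set(data)
    result["required_present"] = required <= keys
    result["no_extra_fields"] = keys <= allowed
    return result


def pass_rates(outputs: List[str], required: Set[str], allowed: Set[str]) -> Dict[str, float]:
    """Fraction of outputs passing each check."""
    checks = [check_json_output(o, required, allowed) for o in outputs]
    return {name: sum(c[name] for c in checks) / len(checks) for name in checks[0]}


# Hypothetical model outputs for illustration.
outputs = [
    '{"label": "refund", "confidence": 0.9}',  # fully valid
    'Sure! {"label": "refund"}',               # chatty prefix breaks parsing
    '{"label": "billing", "notes": "x"}',      # forbidden extra field
    '[1, 2]',                                  # valid JSON, but not an object
]
rates = pass_rates(outputs, required={"label"}, allowed={"label", "confidence"})
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Reporting each rate separately shows where the format breaks: a model that emits valid JSON with extra fields needs a different fix than one that wraps JSON in prose.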
If the task is instruction following, you can make small tests inspired by IFEval:
- “Use exactly N bullet points”
- “Mention keyword X at least twice”
- “Do not use word Y”
- “Return only JSON”
- “Answer in one sentence”
- “Include these fields and no others”
The general idea is simple:
If the desired behavior is objectively checkable, write checks for it.
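As a sketch of what such checks can look like for a few of the constraints listed above: these are my own small functions in the spirit of IFEval, not IFEval's actual implementation, and the response text is made up.

```python
import re
from typing import Callable, Dict


def exactly_n_bullets(text: str, n: int) -> bool:
    """True if the text has exactly n lines starting with '-' or '*'."""
    bullets = [line for line in text.splitlines() if re.match(r"^\s*[-*]\s+", line)]
    return len(bullets) == n


def mentions_at_least(text: str, keyword: str, times: int) -> bool:
    """Case-insensitive substring count, so 'Refunds' counts for 'refund'."""
    return text.lower().count(keyword.lower()) >= times


def avoids_word(text: str, word: str) -> bool:
    """True if the word never appears as a whole word."""
    return re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE) is None


# Hypothetical model response.
response = "- Refunds take 5 days\n- Refunds need a receipt\n- Contact support"

constraints: Dict[str, Callable[[str], bool]] = {
    "exactly 3 bullets": lambda t: exactly_n_bullets(t, 3),
    "mentions 'refund' twice": lambda t: mentions_at_least(t, "refund", 2),
    "avoids 'unfortunately'": lambda t: avoids_word(t, "unfortunately"),
    "avoids 'support'": lambda t: avoids_word(t, "support"),
}

results = {name: check(response) for name, check in constraints.items()}
pass_rate = sum(results.values()) / len(results)
for name, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
print(f"constraint pass rate: {pass_rate:.2f}")
```

Run the same constraint set over base and fine-tuned outputs, and the constraint pass rate becomes a direct before/after number.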
# If this is RAG or document QA
If your fine-tuned model is part of a RAG/document QA system, I would avoid giving the model one single score too early.
Split the evaluation:
| Component | What to check |
|---|---|
| Retrieval | Did the system retrieve the right documents/chunks? |
| Context use | Did the answer use the provided context? |
| Answer correctness | Is the final answer correct? |
| Faithfulness | Is the answer supported by the retrieved context? |
| Abstention | Does it say “not enough information” when the answer is not in context? |
| Citation quality | Do citations actually support the claim? |
This matters because a bad answer can come from different places:
- bad retrieval
- bad generation
- bad prompting
- bad chunking
- bad source data
- bad evaluation examples
Those need different fixes.
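Here is a rough sketch of scoring the components separately. It assumes you already have, per case, the retrieved chunk IDs, the gold chunk IDs (empty when the question is unanswerable), and judgments for correctness and abstention; all field names and data are hypothetical.

```python
from typing import Dict, List, Optional


def rate(flags: List[bool]) -> Optional[float]:
    """Fraction of True values, or None when there is nothing to score."""
    return sum(flags) / len(flags) if flags else None


def rag_report(cases: List[Dict], k: int = 3) -> Dict[str, Optional[float]]:
    answerable = [c for c in cases if c["gold_ids"]]
    unanswerable = [c for c in cases if not c["gold_ids"]]
    # Retrieval: did any gold chunk show up in the top k?
    hits = [any(doc in c["gold_ids"] for doc in c["retrieved_ids"][:k])
            for c in answerable]
    # Generation: given a retrieval hit, was the answer correct?
    correct_given_hit = [c["answer_correct"] for c, hit in zip(answerable, hits) if hit]
    return {
        "retrieval_hit_rate": rate(hits),
        "answer_accuracy_given_hit": rate(correct_given_hit),
        "abstention_rate_unanswerable": rate([c["abstained"] for c in unanswerable]),
    }


# Hypothetical eval cases.
cases = [
    {"retrieved_ids": ["d1", "d7"], "gold_ids": {"d1"},
     "answer_correct": True, "abstained": False},
    {"retrieved_ids": ["d4", "d9"], "gold_ids": {"d2"},
     "answer_correct": False, "abstained": False},   # retrieval miss
    {"retrieved_ids": ["d3", "d5"], "gold_ids": {"d3"},
     "answer_correct": False, "abstained": False},   # retrieved fine, generation wrong
    {"retrieved_ids": ["d8"], "gold_ids": set(),
     "answer_correct": False, "abstained": True},    # unanswerable, correctly abstained
]

report = rag_report(cases)
for name, value in report.items():
    print(f"{name}: {value}")
```

With the numbers split like this, a low hit rate points at retrieval or chunking, while a low accuracy-given-hit points at the model or the prompt.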
# If this is a coding fine-tune
For coding models, I would avoid evaluating only with general chat/reasoning benchmarks.
Try to match the benchmark to the coding task:
| Coding task | More relevant eval |
|---|---|
| Function-level generation | Unit tests, HumanEval-like tasks, BigCodeBench-like tasks |
| Competitive programming | LiveCodeBench-like tasks |
| Repo-level bug fixing | SWE-bench-like tasks |
| Internal code assistant | Your own held-out tasks from your codebase, with tests |
Also separate:
- code completion
- instruction-to-code
- test repair
- repo-level patch generation
- agentic issue solving
Those are different capabilities.
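For the function-level case, the core idea is just "run the generated code against tests and count passes". The sketch below shows that idea only. It uses `exec` directly, which is unsafe for untrusted model output and has no timeout, so treat it as an illustration and run real evaluations in a sandbox. The candidates and tests are made up.

```python
from typing import List


def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Execute a candidate solution, then its tests, in a fresh namespace.

    WARNING: exec() on model output is unsafe outside a sandbox, and this
    sketch has no timeout for infinite loops.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Hypothetical task: implement add(a, b).
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

candidates: List[str] = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong logic
    "def add(a, b)\n    return a + b\n",    # syntax error
]

outcomes = [passes_tests(source, tests) for source in candidates]
pass_rate = sum(outcomes) / len(outcomes)
print(f"outcomes:  {outcomes}")
print(f"pass rate: {pass_rate:.2f}")
```

The same harness works for your own held-out tasks: one prompt, one hidden test file, one pass/fail per model.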
# If you plan to deploy the model
Quality is not the only thing you might want to benchmark.
If you plan to serve the model, also measure:
- latency
- throughput
- VRAM
- cost per request
- context length
- max output length
- batch size
- quantization impact
- serving backend
But keep this separate from quality evaluation.
A faster model is not automatically a better model, and a more accurate model may still be too slow or expensive for the intended use.
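For a first rough latency number, timing the generate call directly is often enough. In this sketch `fake_generate` is a hypothetical stand-in that just sleeps; swap in your real inference or endpoint call. It measures sequential single-request latency only, not throughput under load.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(generate: Callable[[str], str], prompts: List[str],
                    warmup: int = 1) -> Dict[str, float]:
    """Time each request after a short warmup; returns seconds."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warmup calls are not timed
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p50_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call."""
    time.sleep(0.01)
    return "ok"


prompts = [f"Question {i}" for i in range(5)]
stats = measure_latency(fake_generate, prompts)
print(f"n={stats['n']} mean={stats['mean_s']:.4f}s "
      f"p50={stats['p50_s']:.4f}s max={stats['max_s']:.4f}s")
```

Record the hardware, backend, batch size, and quantization next to these numbers, otherwise they are hard to compare later.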
- Document the benchmark setup
If you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened. Record:
- base model name and revision
- fine-tuned model name and revision
- whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune
- dataset name and split
- whether eval examples were excluded from training
- prompt template
- chat template
- decoding parameters
- metric
- benchmark tool and version
- hardware/backend, if reporting latency or throughput
- sample outputs or failure examples
- known limitations
Hugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results.
There is also Hub support for structured evaluation results, although I would still document the plain-English setup clearly.
- What information would help people recommend a concrete benchmark?
If you want more specific suggestions, I would add:
- What base model did you fine-tune?
- Is it LoRA/PEFT, QLoRA, or full fine-tuning?
- What task did you fine-tune for?
- What dataset did you train on?
- Do you have a held-out test set?
- Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?
- Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?
With those details, people can suggest a much more specific eval setup.