# What's your method for benchmarking?

> Source: <https://discuss.huggingface.co/t/whats-your-method-for-benchmarking/177127#post_3>
> Published: 2026-06-25 12:04:22+00:00

There are a lot of ways to do this, so I’d split it up a bit. My first suggestion: measure the capability you actually care about before anything else.

Short version:

**Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.**

Public benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.

## A practical first workflow

1. **Define the target behavior**
   - What was the fine-tune supposed to improve?
2. **Create a held-out eval set**
   - Use examples that were not in the training data.
3. **Run the base model**
4. **Run the fine-tuned model**
   - Same examples, same prompt template, same decoding settings.
5. **Compare with task-appropriate metrics**
   - Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.
6. **Inspect individual failures**
   - Do not only look at the average score.
7. **Add public benchmarks if they match the goal**
   - Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.
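The workflow above can be sketched as a small harness. This is only an illustration, not any library's API: `compare_models`, `exact_match`, and the two `generate` callables are hypothetical names, and the callables stand in for however you actually call each model.

```python
from typing import Callable, Dict, List


def exact_match(output: str, expected: str) -> bool:
    """Placeholder metric; swap in whatever matches your task."""
    return output.strip().lower() == expected.strip().lower()


def compare_models(
    examples: List[Dict[str, str]],
    base_generate: Callable[[str], str],
    tuned_generate: Callable[[str], str],
    metric: Callable[[str, str], bool] = exact_match,
) -> Dict[str, object]:
    """Run both models on the same held-out examples and compare them."""
    if not examples:
        raise ValueError("need at least one held-out example")
    rows = []
    for ex in examples:
        base_out = base_generate(ex["prompt"])
        tuned_out = tuned_generate(ex["prompt"])
        rows.append({
            "prompt": ex["prompt"],
            "expected": ex["expected"],
            "base_output": base_out,
            "tuned_output": tuned_out,
            "base_ok": metric(base_out, ex["expected"]),
            "tuned_ok": metric(tuned_out, ex["expected"]),
        })
    n = len(rows)
    return {
        "base_score": sum(r["base_ok"] for r in rows) / n,
        "tuned_score": sum(r["tuned_ok"] for r in rows) / n,
        # Cases the base model got right but the fine-tune now gets wrong.
        "regressions": [r for r in rows if r["base_ok"] and not r["tuned_ok"]],
        "rows": rows,
    }
```

Keeping `rows` around matters as much as the two scores: it is what lets you read individual failures and regressions afterwards.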

## 1. Start with your actual task, not a leaderboard

Before choosing a benchmark, I would write the goal in plain language.

| If the fine-tune is meant to improve… | A better first eval is usually… |
| --- | --- |
| Support-ticket classification | Accuracy / F1 on held-out tickets |
| Domain QA | Held-out question/answer examples |
| JSON or structured output | JSON validity, schema validity, field accuracy |
| Chat helpfulness | Pairwise comparison, rubric, human spot-checks |
| Summarization | Coverage, factuality, maybe ROUGE/BERTScore as supporting metrics |
| Code generation | Unit tests, hidden tests, code-specific benchmark tasks |
| General reasoning | Public reasoning benchmarks may be relevant |

The key question is:

> Does this benchmark measure the thing I actually fine-tuned for?

A general benchmark can be useful, but it might not see the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. Also, a model can do well on a public benchmark and still be poor at your format, domain, or workflow.

## 2. Minimal checklist

If you only have time for one pass, I would check these:

| Step | Check |
| --- | --- |
| Goal | What should improve? |
| Data | Are eval examples excluded from training? |
| Baseline | Did you run the base model too? |
| Fairness | Are prompt/settings the same? |
| Metric | Does the metric match the task? |
| Samples | Did you inspect actual outputs? |
| Regression | Did anything get worse? |
| Notes | Can someone else reproduce the setup? |

That is already a useful benchmark for many fine-tunes.

## More detail: building a held-out eval set

A held-out eval set means examples that were not used during fine-tuning.

For a small first pass, I would include several slices:

| Eval slice | Purpose |
| --- | --- |
| Normal cases | Representative examples from the real task |
| Hard cases | Inputs the base model often gets wrong |
| Regression cases | Inputs the base model already handles well |
| Format cases | Inputs that test exact response structure |
| Edge cases | Ambiguous, long, missing-context, or messy inputs |

This does not need to be huge at the beginning. A small, carefully chosen set is often better than a large but unclear one.
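If each eval example carries a slice tag, you can report a score per slice instead of one average. A minimal sketch, assuming each result is a dict with hypothetical `slice` and `passed` keys:

```python
from collections import defaultdict
from typing import Dict, Iterable


def score_by_slice(results: Iterable[Dict[str, object]]) -> Dict[str, float]:
    """Return the pass rate for each eval slice."""
    totals = defaultdict(lambda: [0, 0])  # slice name -> [passed, total]
    for r in results:
        totals[r["slice"]][0] += int(bool(r["passed"]))
        totals[r["slice"]][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}
```

A drop in the regression slice is easy to miss when it is averaged together with gains on the hard cases.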

For example, if you fine-tuned a model to answer internal policy questions, I would not start with MMLU. I would first create a small set of policy questions with known expected answers, including:

- easy questions
- tricky questions
- questions where the answer is not in the source material
- questions where the model should refuse or say it does not know
- questions the base model currently gets wrong
- questions the base model currently gets right

If you evaluate on examples that were used in training, the result may mostly measure memorization or overfitting.
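A cheap first guard is to check for eval inputs that also appear in the training data. The sketch below uses hypothetical helper names and only catches exact or near-exact duplicates after lowercasing and whitespace normalization; paraphrased leaks need a fuzzier method.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide duplicates."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked_examples(train_texts: Iterable[str], eval_texts: Iterable[str]) -> List[str]:
    """Return eval inputs whose normalized text also appears in the training data."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [e for e in eval_texts if fingerprint(e) in train_hashes]
```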

## More detail: keeping the comparison fair

A common mistake is to evaluate only the fine-tuned model. That gives a score, but it does not answer:

> Did fine-tuning help compared with the original model?

I would compare at least:

| Model | Purpose |
| --- | --- |
| Base model | Baseline |
| Fine-tuned model | Measures the effect of fine-tuning |
| Optional similar model | Extra reference point |

Try to keep these fixed:

- prompt template
- chat template
- few-shot examples, if any
- system message
- decoding parameters (`temperature`, `top_p`, `max_new_tokens`)
- retrieval context, if this is RAG
- quantization / backend, if possible
- model revision
- adapter revision, if using LoRA/PEFT

Otherwise, you may accidentally benchmark a prompt change, decoding change, or serving change instead of the fine-tune.
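One way to keep these fixed is to put them in a single immutable object and pass the same instance to every model run. This is a sketch under my own assumptions: `EvalSettings` and its field names are hypothetical, not part of any library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalSettings:
    """Everything that should stay identical across the models being compared."""
    prompt_template: str
    system_message: str
    temperature: float
    top_p: float
    max_new_tokens: int
    num_few_shot: int = 0
    backend: Optional[str] = None  # serving stack or quantization, if relevant

    def fingerprint(self) -> str:
        """Short stable hash to store next to every result file."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

If two result files carry different fingerprints, the scores in them were not produced under the same conditions and should not be compared directly.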

For LoRA/PEFT models, I would also be careful to record whether you evaluated:

- the base model alone
- the base model plus adapter
- a merged model
- a quantized version
- a served endpoint version

Those can behave differently enough that the exact setup matters.

## 3. Pick metrics by task

There is no single metric that works for all fine-tuned LLMs.

Hugging Face Evaluate’s guide to choosing a metric is a useful reference.

Very roughly:

| Task type | Possible metrics / checks |
| --- | --- |
| Classification | accuracy, precision, recall, F1 |
| Extraction | exact match, field-level accuracy, schema validity |
| QA with known answers | exact match, F1, semantic correctness, human spot-check |
| Summarization | ROUGE/BERTScore can help, but inspect factuality and coverage |
| Translation | BLEU / chrF / COMET, depending on setup |
| Open-ended chat | rubric scoring, pairwise comparison, human or LLM judge |
| Instruction following | constraint pass rate, format adherence |
| Coding | unit tests, pass rate, hidden tests, code benchmark tasks |
| RAG/document QA | answer correctness, context recall, faithfulness, citation usefulness |
| Deployment | latency, throughput, VRAM, cost per request |

I would avoid relying on only one weak signal, such as:

- “the answer looks good to me”
- training loss
- validation loss
- one public leaderboard score
- one cherry-picked demo

Those can be useful, but they are not the whole evaluation.

## More detail: eval loss vs task success

If you trained with `Trainer`, `SFTTrainer`, or another training loop, validation/eval loss is useful as a training signal. It can help you notice overfitting or training instability.

But for final evaluation, especially generation tasks, I would not stop at eval loss.

A lower loss does not automatically mean:

- the answer is factually better
- the model follows your requested format better
- the model is safer
- the model is less hallucination-prone
- the model is better for your actual users
- the model did not regress on general behavior

For generation tasks, save actual model outputs and inspect them.

## 4. Inspect failures, not only scores

Aggregate scores are useful, but a lot of the value comes from looking at individual examples.

For each output, I would save something like:

| Field | Why |
| --- | --- |
| Input prompt | Reproduces the case |
| Expected answer / rubric | Defines what “good” means |
| Base model output | Baseline |
| Fine-tuned output | Comparison |
| Score / pass-fail | Aggregate metric |
| Short error label | Helps find patterns |

Useful error labels might be:

- wrong answer
- incomplete answer
- hallucinated detail
- ignored instruction
- wrong format
- invalid JSON
- too verbose
- too short
- unnecessary refusal
- should have refused but did not
- correct answer but bad explanation
- regression from base model
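A plain JSONL file is usually enough for this kind of logging. The sketch below assumes each record is a dict with the fields listed above plus a hypothetical `error_label` key that is `None` for passing cases.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def write_samples(path: str, records: Iterable[Dict[str, object]]) -> None:
    """Write one JSON object per line so individual cases are easy to grep and diff."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_samples(path: str) -> List[Dict[str, object]]:
    """Load the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def error_label_counts(records: Iterable[Dict[str, object]]) -> Counter:
    """Count failure labels; records without an error label are skipped."""
    return Counter(r["error_label"] for r in records if r.get("error_label"))
```

Sorting the counts, for example with `error_label_counts(records).most_common()`, shows which failure pattern to look at first.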

This is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection.

## 5. Public benchmarks are a second layer

Once your own task-specific eval is in place, public benchmarks can be useful.

But I would choose them by target capability.

| If you care about… | Look at… | Main caution |
| --- | --- | --- |
| General reasoning / knowledge | MMLU-Pro, GPQA, LiveBench, HELM-like suites | May not match your domain |
| Instruction following | IFEval-style tests | Measures verifiable constraints, not all helpfulness |
| Open-ended chat quality | MT-Bench / Arena-style pairwise evals / AlpacaEval-like setups | Judge and preference biases matter |
| Coding | LiveCodeBench, BigCodeBench, SWE-bench depending on task | Code completion and repo-level issue fixing are different |
| RAG / document QA | RAGAS-style component metrics | Retrieval and generation should be separated |
| Deployment | latency, throughput, VRAM, cost | Not the same as quality |

Common benchmark-running tools include Lighteval and lm-evaluation-harness.

I would not start with:

> Which leaderboard should I optimize for?

I would start with:

> Which capability did I fine-tune for, and does this benchmark actually measure it?

## More detail: public benchmark map

Public benchmarks are useful, but they answer different questions.

| Benchmark / eval style | Closer to measuring | Main caution |
| --- | --- | --- |
| MMLU-Pro / GPQA | Broad academic knowledge and reasoning | Not necessarily your domain task |
| LiveBench | General capability with newer, contamination-aware questions | Still not a replacement for your own eval set |
| IFEval | Verifiable instruction-following constraints | Does not measure all helpfulness or factuality |
| Chatbot Arena / MT-Bench style evals | Open-ended assistant preference | Judge bias and preference bias matter |
| SWE-bench | Repo-level software engineering issue solving | Not the same as simple code completion |
| Serving benchmarks | Latency, throughput, cost, memory | Not a direct quality benchmark |

The trap is treating a leaderboard score as universal. It is not. A benchmark has:

- a task distribution
- a prompt format
- a scoring rule
- model/backend assumptions
- possible contamination issues
- sometimes hidden or changing evaluation details

So I would use leaderboard scores as orientation, not as the only decision rule.

## 6. Special cases

The right evaluation changes a lot depending on what kind of fine-tune this is.

## If the task is open-ended chat

For open-ended chat, there may not be one exact correct answer. In that case, I would compare outputs with a small rubric.

Example rubric:

| Criterion | Question |
| --- | --- |
| Correctness | Is the answer factually right? |
| Completeness | Does it answer the whole question? |
| Instruction following | Did it follow user constraints? |
| Grounding | If sources/context are provided, does it stay grounded? |
| Format | Does it output the requested structure? |
| Safety | Does it refuse only when appropriate? |
| Usefulness | Would a user know what to do next? |

A practical method:

- Take the same prompt.
- Generate an answer from the base model.
- Generate an answer from the fine-tuned model.
- Hide which is which.
- Compare them side by side.
- Record the winner and the reason.

You can use a human judge, an LLM judge, or both. If using an LLM judge, I would treat it as a useful signal, not ground truth.

If you use LLM-as-a-judge, I would at least:

- use a clear rubric
- swap answer order A/B vs B/A
- avoid rewarding longer answers automatically
- manually inspect a sample of judgments
- keep judge prompts and judge model/version recorded
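The blinding and order-swapping steps can be sketched like this. All names are hypothetical, and the judge itself (human or LLM) is left out: it only needs to return "A", "B", or "tie" for each pair it is shown.

```python
import random
from collections import Counter
from typing import Dict, List


def make_blind_pairs(
    prompts: List[str], base_outputs: List[str], tuned_outputs: List[str], seed: int = 0
) -> List[Dict[str, str]]:
    """Randomly assign each model's answer to slot A or B, keeping the mapping hidden."""
    rng = random.Random(seed)
    pairs = []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        tuned_is_a = rng.random() < 0.5
        pairs.append({
            "prompt": prompt,
            "answer_a": tuned if tuned_is_a else base,
            "answer_b": base if tuned_is_a else tuned,
            # Not shown to the judge; used only when tallying.
            "model_a": "tuned" if tuned_is_a else "base",
            "model_b": "base" if tuned_is_a else "tuned",
        })
    return pairs


def swap_order(pair: Dict[str, str]) -> Dict[str, str]:
    """Same pair with A and B exchanged, for judging in both orders."""
    return {
        "prompt": pair["prompt"],
        "answer_a": pair["answer_b"],
        "answer_b": pair["answer_a"],
        "model_a": pair["model_b"],
        "model_b": pair["model_a"],
    }


def tally(pairs: List[Dict[str, str]], verdicts: List[str]) -> Counter:
    """verdicts[i] is "A", "B", or "tie" for pairs[i]."""
    counts = Counter()
    for pair, verdict in zip(pairs, verdicts):
        if verdict == "A":
            counts[pair["model_a"]] += 1
        elif verdict == "B":
            counts[pair["model_b"]] += 1
        else:
            counts["tie"] += 1
    return counts
```

With an LLM judge, running both `pair` and `swap_order(pair)` and checking whether the same model wins each time is a simple way to spot position bias.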

## If the task is formatting or instruction following

If the fine-tune was meant to improve formatting, do not rely only on a general benchmark.

For example, if the model must output JSON, measure:

- valid JSON rate
- schema-valid rate
- required fields present
- forbidden fields absent
- correct field values
- refusal behavior when input is invalid
- robustness to longer or messy inputs
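The first few checks in this list need nothing beyond the standard library. A minimal sketch with hypothetical function names; it checks top-level keys only, so full schema validation and field-value accuracy would need to be layered on top.

```python
import json
from typing import Dict, List, Set


def check_json_output(raw: str, required: Set[str], forbidden: Set[str]) -> Dict[str, bool]:
    """Run basic structural checks on one model output."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "required_present": False, "forbidden_absent": False}
    if not isinstance(parsed, dict):
        # A bare list or string is valid JSON but fails the field checks here.
        return {"valid_json": True, "required_present": False, "forbidden_absent": False}
    keys = set(parsed)
    return {
        "valid_json": True,
        "required_present": required <= keys,
        "forbidden_absent": not (forbidden & keys),
    }


def json_pass_rates(outputs: List[str], required: Set[str], forbidden: Set[str]) -> Dict[str, float]:
    """Fraction of outputs passing each check."""
    if not outputs:
        raise ValueError("need at least one output")
    results = [check_json_output(o, required, forbidden) for o in outputs]
    return {name: sum(r[name] for r in results) / len(results) for name in results[0]}
```

Run it on the base model's outputs too, so you can see whether the fine-tune actually moved these rates.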

If the task is instruction following, you can make small tests inspired by IFEval:

- “Use exactly N bullet points”
- “Mention keyword X at least twice”
- “Do not use word Y”
- “Return only JSON”
- “Answer in one sentence”
- “Include these fields and no others”

The general idea is simple:

> If the desired behavior is objectively checkable, write checks for it.
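For instance, a few of the constraints listed earlier can be turned into small checker functions. These are my own simplified sketches inspired by that style of test, not IFEval's actual implementation, and the bullet detection is a rough heuristic.

```python
import re
from typing import Callable, List


def has_exactly_n_bullets(text: str, n: int) -> bool:
    """Count lines that start with '-', '*', or '•' as bullet points."""
    bullets = [line for line in text.splitlines() if re.match(r"^\s*[-*•]\s+", line)]
    return len(bullets) == n


def mentions_at_least(text: str, keyword: str, times: int) -> bool:
    """Case-insensitive whole-word count of the keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE)) >= times


def avoids_word(text: str, word: str) -> bool:
    """True if the word never appears as a whole word."""
    return not mentions_at_least(text, word, 1)


def constraint_pass_rate(outputs: List[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs that satisfy one constraint."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(check(o) for o in outputs) / len(outputs)
```

A call such as `constraint_pass_rate(outputs, lambda t: has_exactly_n_bullets(t, 3))` then gives one number per constraint that you can compare between the base and fine-tuned model.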

## If this is RAG or document QA

If your fine-tuned model is part of a RAG/document QA system, I would avoid collapsing the whole system into a single score too early.

Split the evaluation:

| Component | What to check |
| --- | --- |
| Retrieval | Did the system retrieve the right documents/chunks? |
| Context use | Did the answer use the provided context? |
| Answer correctness | Is the final answer correct? |
| Faithfulness | Is the answer supported by the retrieved context? |
| Abstention | Does it say “not enough information” when the answer is not in context? |
| Citation quality | Do citations actually support the claim? |

This matters because a bad answer can come from different places:

- bad retrieval
- bad generation
- bad prompting
- bad chunking
- bad source data
- bad evaluation examples

Those need different fixes.
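To make the split concrete, here is a sketch that reports retrieval and answer quality as separate numbers. It assumes you have labeled the relevant chunk IDs for each question and already judged answer correctness some other way; the field names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each eval question needs at least one labeled relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def summarize_rag_eval(cases: List[Dict[str, object]], k: int = 5) -> Dict[str, float]:
    """Keep retrieval and generation numbers apart instead of one blended score."""
    if not cases:
        raise ValueError("need at least one case")
    recalls = [recall_at_k(c["retrieved_ids"], set(c["relevant_ids"]), k) for c in cases]
    n = len(cases)
    return {
        "mean_recall_at_k": sum(recalls) / n,
        "answer_accuracy": sum(bool(c["answer_correct"]) for c in cases) / n,
        # Wrong answers even though every relevant chunk was retrieved:
        # these point at generation or prompting rather than retrieval.
        "wrong_despite_full_recall": sum(
            1 for c, r in zip(cases, recalls) if r == 1.0 and not c["answer_correct"]
        ) / n,
    }
```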

## If this is a coding fine-tune

For coding models, I would avoid evaluating only with general chat/reasoning benchmarks.

Try to match the benchmark to the coding task:

| Coding task | More relevant eval |
| --- | --- |
| Function-level generation | Unit tests, HumanEval-like tasks, BigCodeBench-like tasks |
| Competitive programming | LiveCodeBench-like tasks |
| Repo-level bug fixing | SWE-bench-like tasks |
| Internal code assistant | Your own held-out tasks from your codebase, with tests |

Also separate:

- code completion
- instruction-to-code
- test repair
- repo-level patch generation
- agentic issue solving

Those are different capabilities.
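For function-level generation, the simplest test-based check is to run each generated solution against assert-style tests in a separate process. This is a sketch with hypothetical names, and it is not a sandbox: model-generated code should only be executed inside an isolated environment such as a container or VM.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run candidate plus tests in a fresh Python process; True if it exits cleanly."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_check.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs and infinite loops as failures
        return result.returncode == 0


def unit_test_pass_rate(candidates: List[str], test_code: str) -> float:
    """Fraction of generated solutions that pass all tests."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return sum(passes_tests(c, test_code) for c in candidates) / len(candidates)
```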

## If you plan to deploy the model

Quality is not the only thing you might want to benchmark.

If you plan to serve the model, also measure:

- latency
- throughput
- VRAM
- cost per request
- context length
- max output length
- batch size
- quantization impact
- serving backend

But keep this separate from quality evaluation.

A faster model is not automatically a better model, and a more accurate model may still be too slow or expensive for the intended use.
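A rough latency measurement can be as simple as timing each call. In this sketch `generate` is a placeholder for your own call into the model or endpoint, and the 95th percentile is a simple index-based estimate, so use enough prompts for it to mean something.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    generate: Callable[[str], str], prompts: Sequence[str], warmup: int = 1
) -> Dict[str, float]:
    """Time each call and report median, p95, and mean latency in seconds."""
    if not prompts:
        raise ValueError("need at least one prompt")
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "n": float(len(timings)),
        "p50_s": statistics.median(timings),
        "p95_s": timings[p95_index],
        "mean_s": statistics.fmean(timings),
    }
```

Record the hardware, backend, batch size, and output length alongside these numbers, since latency figures are not comparable without them.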

## 7. Document the benchmark setup

If you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened.

Record:

- base model name and revision
- fine-tuned model name and revision
- whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune
- dataset name and split
- whether eval examples were excluded from training
- prompt template
- chat template
- decoding parameters
- metric
- benchmark tool and version
- hardware/backend, if reporting latency or throughput
- sample outputs or failure examples
- known limitations

Hugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results.

There is also Hub support for structured evaluation results, although I would still document the plain-English setup clearly.

## 8. What information would help people recommend a concrete benchmark?

If you want more specific suggestions, I would add:

- What base model did you fine-tune?
- Is it LoRA/PEFT, QLoRA, or full fine-tuning?
- What task did you fine-tune for?
- What dataset did you train on?
- Do you have a held-out test set?
- Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?
- Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?

With those details, people can suggest a much more specific eval setup.
