What's your method for benchmarking?

Uh… there are a lot of ways to do this, so I’d split it up a bit. My first suggestion would be: measure the capability you actually care about first.

Short version:

Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.

Public benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.

A practical first workflow

Define the target behavior

What was the fine-tune supposed to improve?

Create a held-out eval set

Use examples that were not in the training data.

Run the base model

Run the fine-tuned model

Same examples, same prompt template, same decoding settings.

Compare with task-appropriate metrics

Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.

Inspect individual failures

Do not only look at the average score.

Add public benchmarks if they match the goal

Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.
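The workflow above can be sketched as a small harness. This is only a minimal sketch under stated assumptions: `compare_models` and `exact_match` are hypothetical names, and the two `generate_*` callables stand in for whatever inference code you use (each one should apply the same prompt template and decoding settings). It keeps per-sample rows so individual failures can be inspected later.

```python
from typing import Callable, Dict, List


def exact_match(prediction: str, expected: str) -> bool:
    """Hypothetical metric: case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == expected.strip().lower()


def compare_models(
    examples: List[Dict[str, str]],
    generate_base: Callable[[str], str],
    generate_tuned: Callable[[str], str],
    metric: Callable[[str, str], bool] = exact_match,
) -> Dict[str, object]:
    """Run both models on the same held-out examples and keep per-sample rows."""
    rows = []
    for ex in examples:
        base_out = generate_base(ex["prompt"])
        tuned_out = generate_tuned(ex["prompt"])
        rows.append({
            "prompt": ex["prompt"],
            "expected": ex["expected"],
            "base_output": base_out,
            "tuned_output": tuned_out,
            "base_correct": metric(base_out, ex["expected"]),
            "tuned_correct": metric(tuned_out, ex["expected"]),
        })
    n = len(rows)
    return {
        "base_score": sum(r["base_correct"] for r in rows) / n if n else 0.0,
        "tuned_score": sum(r["tuned_correct"] for r in rows) / n if n else 0.0,
        "rows": rows,
    }


if __name__ == "__main__":
    # Stand-in "models" so the sketch runs without any ML dependency.
    held_out = [
        {"prompt": "2+2?", "expected": "4"},
        {"prompt": "Capital of France?", "expected": "Paris"},
    ]
    base = lambda p: "4" if "2+2" in p else "London"
    tuned = lambda p: "4" if "2+2" in p else "paris"
    report = compare_models(held_out, base, tuned)
    print(report["base_score"], report["tuned_score"])
```

Exact match is only a placeholder here; swap in whichever metric fits the task.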

Start with your actual task, not a leaderboard

Before choosing a benchmark, I would write the goal in plain language.

Public reasoning benchmarks may be relevant.

The key question is:

Does this benchmark measure the thing I actually fine-tuned for?

A general benchmark can be useful, but it might not see the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. Also, a model can do well on a public benchmark and still be poor at your format, domain, or workflow.

Useful starting reference:

Minimal checklist

That is already a useful benchmark for many fine-tunes.

More detail: building a held-out eval set

A held-out eval set means examples that were not used during fine-tuning.

This does not need to be huge at the beginning. A small, carefully chosen set is often better than a large but unclear one.

For example, if you fine-tuned a model to answer internal policy questions, I would not start with MMLU. I would first create a small set of policy questions with known expected answers, including:

easy questions
tricky questions
questions where the answer is not in the source material
questions where the model should refuse or say it does not know
questions the base model currently gets wrong
questions the base model currently gets right

If you evaluate on examples that were used in training, the result may mostly measure memorization or overfitting.
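One cheap sanity check for this is to look for eval examples that also appear in the training data. The sketch below is an assumption-laden illustration, not a full contamination detector: `find_leaked_examples` is a hypothetical helper, and it only catches duplicates that match after lowercasing and whitespace normalization. Paraphrased or partially overlapping examples would need fuzzier matching.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a duplicate."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked_examples(train_texts: Iterable[str], eval_texts: Iterable[str]) -> List[str]:
    """Return eval texts whose normalized form also appears in the training data."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if fingerprint(t) in train_hashes]


if __name__ == "__main__":
    train = ["What is the refund  policy?", "Who approves travel requests?"]
    held_out = ["what is the refund policy?", "How do I reset my password?"]
    print(find_leaked_examples(train, held_out))
```

Any example this returns should be dropped from the eval set or from the training data before you compare scores.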

More detail: keeping the comparison fair

A common mistake is to evaluate only the fine-tuned model. That gives a score, but it does not answer:

Did fine-tuning help compared with the original model?

I would compare at least:

Try to keep these fixed:

prompt template
chat template
few-shot examples, if any
system message
decoding parameters: temperature, top_p, max_new_tokens

retrieval context, if this is RAG
quantization / backend, if possible
model revision
adapter revision, if using LoRA/PEFT

Otherwise, you may accidentally benchmark a prompt change, decoding change, or serving change instead of the fine-tune.
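One way to make that concrete is to record the settings for each run and diff them before trusting a comparison. This is a sketch with hypothetical names (`EvalConfig`, `config_diff`) and a field list that only loosely mirrors the items above; adjust it to your own setup.

```python
from dataclasses import asdict, dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings that should stay fixed across runs."""
    prompt_template: str
    system_message: str
    temperature: float
    top_p: float
    max_new_tokens: int
    model_revision: str
    adapter_revision: Optional[str] = None  # only relevant for LoRA/PEFT


def config_diff(a: EvalConfig, b: EvalConfig, ignore: Tuple[str, ...] = ()) -> Dict[str, tuple]:
    """List fields that differ between two runs, skipping the ones expected to differ."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k not in ignore and da[k] != db[k]}


if __name__ == "__main__":
    base_cfg = EvalConfig("Q: {question}\nA:", "You are helpful.", 0.0, 1.0, 256, "base-rev")
    # Same settings except for an accidental temperature change.
    tuned_cfg = replace(base_cfg, temperature=0.7, adapter_revision="adapter-rev")
    print(config_diff(base_cfg, tuned_cfg, ignore=("model_revision", "adapter_revision")))
```

An empty diff (after ignoring the fields that are supposed to differ) is a reasonable precondition for comparing the two scores.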

For LoRA/PEFT models, I would also be careful to record whether you evaluated:

the base model alone
the base model plus adapter
a merged model
a quantized version
a served endpoint version

Those can behave differently enough that the exact setup matters.

Pick metrics by task

There is no single metric that works for all fine-tuned LLMs.

A useful reference is Hugging Face Evaluate’s metric guide:

Very roughly:

I would avoid relying on only one weak signal, such as:

“the answer looks good to me”
training loss
validation loss
one public leaderboard score
one cherry-picked demo

Those can be useful, but they are not the whole evaluation.

More detail: eval loss vs task success

If you trained with Trainer, SFTTrainer, or another training loop, validation/eval loss is useful as a training signal. It can help you notice overfitting or training instability.

But for final evaluation, especially generation tasks, I would not stop at eval loss.

A lower loss does not automatically mean:

the answer is factually better
the model follows your requested format better
the model is safer
the model is less hallucination-prone
the model is better for your actual users
the model did not regress on general behavior

For generation tasks, save actual model outputs and inspect them.

Relevant docs:

Inspect failures, not only scores

Aggregate scores are useful, but a lot of the value comes from looking at individual examples.

Useful error labels might be:

wrong answer
incomplete answer
hallucinated detail
ignored instruction
wrong format
invalid JSON
too verbose
too short
unsafe refusal
should have refused but did not
correct answer but bad explanation
regression from base model
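Labels like these are easy to tally once each sample has been reviewed. The sketch below assumes a hypothetical row format (`id`, `base_correct`, `tuned_correct`, and a `labels` list assigned by a reviewer); it counts labels and separately lists regressions and fixes relative to the base model.

```python
from collections import Counter
from typing import Dict, List


def summarize_failures(rows: List[Dict[str, object]]) -> Dict[str, object]:
    """Count manually assigned error labels and flag regressions from the base model.

    Each row is assumed to have: 'id', 'base_correct', 'tuned_correct',
    and 'labels' (a list of error labels applied to the fine-tuned output).
    """
    label_counts = Counter()
    regressions, fixes = [], []
    for row in rows:
        label_counts.update(row.get("labels", []))
        if row["base_correct"] and not row["tuned_correct"]:
            regressions.append(row["id"])
        elif row["tuned_correct"] and not row["base_correct"]:
            fixes.append(row["id"])
    return {
        "label_counts": dict(label_counts),
        "regressions": regressions,
        "fixes": fixes,
    }


if __name__ == "__main__":
    reviewed = [
        {"id": "q1", "base_correct": True, "tuned_correct": False, "labels": ["wrong format"]},
        {"id": "q2", "base_correct": False, "tuned_correct": True, "labels": []},
        {"id": "q3", "base_correct": False, "tuned_correct": False,
         "labels": ["hallucinated detail", "wrong format"]},
    ]
    print(summarize_failures(reviewed))
```

The regression list is often the most informative part: it shows what the fine-tune broke, which an average score can hide.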

This is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection:

Public benchmarks are a second layer

Once your own task-specific eval is in place, public benchmarks can be useful.

But I would choose them by target capability.

Common benchmark-running tools:

I would not start with:

Which leaderboard should I optimize for?

I would start with:

Which capability did I fine-tune for, and does this benchmark actually measure it?

More detail: public benchmark map

Useful links:

The trap is treating a leaderboard score as universal. It is not. A benchmark has:

a task distribution
a prompt format
a scoring rule
model/backend assumptions
possible contamination issues
sometimes hidden or changing evaluation details

So I would use leaderboard scores as orientation, not as the only decision rule.

Special cases

The right evaluation changes a lot depending on what kind of fine-tune this is.

If the task is open-ended chat

For open-ended chat, there may not be one exact correct answer. In that case, I would compare outputs with a small rubric.

Example rubric:

A practical method:

Take the same prompt.
Generate an answer from the base model.
Generate an answer from the fine-tuned model.
Hide which is which.
Compare them side by side.
Record the winner and the reason.

You can use a human judge, an LLM judge, or both. If using an LLM judge, I would treat it as a useful signal, not ground truth.

References:

If you use LLM-as-a-judge, I would at least:

use a clear rubric
swap answer order A/B vs B/A
avoid rewarding longer answers automatically
manually inspect a sample of judgments
keep judge prompts and judge model/version recorded
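The order-swap step can be sketched as follows. Everything here is illustrative: the `Judge` signature is an assumption (any human or LLM judging routine that returns "A", "B", or "tie" would fit), and treating inconsistent verdicts as ties is one possible policy, not the only one.

```python
from typing import Callable, Dict, List

# A judge takes (prompt, answer_a, answer_b) and returns "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(prompt: str, base_answer: str, tuned_answer: str, judge: Judge) -> str:
    """Ask the judge twice with the answer order swapped; keep only consistent wins."""
    first = judge(prompt, base_answer, tuned_answer)   # base is shown as A
    second = judge(prompt, tuned_answer, base_answer)  # base is shown as B
    if first == "A" and second == "B":
        return "base"
    if first == "B" and second == "A":
        return "tuned"
    return "tie"  # disagreement across orders is treated as no clear winner


def win_rates(items: List[Dict[str, str]], judge: Judge) -> Dict[str, float]:
    """Fraction of prompts won by each side, given 'prompt', 'base', 'tuned' per item."""
    counts = {"base": 0, "tuned": 0, "tie": 0}
    for item in items:
        counts[pairwise_verdict(item["prompt"], item["base"], item["tuned"], judge)] += 1
    total = len(items) or 1
    return {k: v / total for k, v in counts.items()}


if __name__ == "__main__":
    # A position-biased stand-in judge that always prefers whatever is shown first.
    biased_judge = lambda prompt, a, b: "A"
    items = [{"prompt": "Capital of France?", "base": "London", "tuned": "Paris"}]
    print(win_rates(items, biased_judge))  # the bias cancels out into a tie
```

A judge that always prefers the first position produces only ties under this scheme, which is the point of swapping the order.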

If the task is formatting or instruction following

If the fine-tune was meant to improve formatting, do not rely only on a general benchmark.

For example, if the model must output JSON, measure:

valid JSON rate
schema-valid rate
required fields present
forbidden fields absent
correct field values
refusal behavior when input is invalid
robustness to longer or messy inputs
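The first few of these checks need nothing beyond the standard library. The sketch below uses hypothetical helper names (`check_json_output`, `pass_rates`) and assumes the expected output is a single JSON object. It covers parse validity, required fields, and forbidden fields; full schema validation and field-value correctness would need a schema library or task-specific comparisons on top.

```python
import json
from typing import Dict, Iterable, List


def check_json_output(raw: str, required: Iterable[str], forbidden: Iterable[str] = ()) -> Dict[str, bool]:
    """Check one model output: parses as a JSON object, required present, forbidden absent."""
    result = {"valid_json": False, "required_present": False, "forbidden_absent": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):
        return result  # valid JSON, but not the object this task expects
    result["valid_json"] = True
    result["required_present"] = all(k in data for k in required)
    result["forbidden_absent"] = not any(k in data for k in forbidden)
    return result


def pass_rates(outputs: List[str], required: Iterable[str], forbidden: Iterable[str] = ()) -> Dict[str, float]:
    """Aggregate per-output checks into rates over the whole eval set."""
    required, forbidden = list(required), list(forbidden)
    checks = [check_json_output(o, required, forbidden) for o in outputs]
    n = len(checks) or 1
    keys = ("valid_json", "required_present", "forbidden_absent")
    return {key: sum(c[key] for c in checks) / n for key in keys}


if __name__ == "__main__":
    outputs = ['{"name": "a", "age": 3}', "not json", '{"name": "b", "debug": true}']
    print(pass_rates(outputs, required=["name", "age"], forbidden=["debug"]))
```

Run the same function over base-model and fine-tuned outputs to see whether the rates actually moved.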

If the task is instruction following, you can make small tests inspired by IFEval:

“Use exactly N bullet points”
“Mention keyword X at least twice”
“Do not use word Y”
“Return only JSON”
“Answer in one sentence”
“Include these fields and no others”

IFEval itself is here:

The general idea is simple:

If the desired behavior is objectively checkable, write checks for it.
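A few such checks, written as small predicate factories, might look like this. These are hand-rolled illustrations, not the official IFEval implementation; the function names are hypothetical, and the bullet check assumes bullets start with `-` or `*`.

```python
import re
from typing import Callable, Dict


def bullet_count_is(n: int) -> Callable[[str], bool]:
    """Exactly n lines starting with '-' or '*' (an assumed bullet convention)."""
    return lambda text: sum(
        1 for line in text.splitlines() if re.match(r"\s*[-*]\s+", line)
    ) == n


def mentions_at_least(keyword: str, times: int) -> Callable[[str], bool]:
    """Keyword appears at least `times` times, ignoring case."""
    return lambda text: text.lower().count(keyword.lower()) >= times


def avoids_word(word: str) -> Callable[[str], bool]:
    """Word never appears as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is None


def run_checks(text: str, checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Apply every named check to one model output."""
    return {name: check(text) for name, check in checks.items()}


if __name__ == "__main__":
    output = "- fast model\n- cheap model\n- small footprint"
    checks = {
        "three_bullets": bullet_count_is(3),
        "mentions_model_twice": mentions_at_least("model", 2),
        "avoids_slow": avoids_word("slow"),
    }
    print(run_checks(output, checks))
```

Each check returns a plain boolean, so pass rates for the base and fine-tuned models can be compared per instruction type.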

If this is RAG or document QA

If your fine-tuned model is part of a RAG/document QA system, I would avoid giving the model one single score too early.

Split the evaluation:

This matters because a bad answer can come from different places:

bad retrieval
bad generation
bad prompting
bad chunking
bad source data
bad evaluation examples

Those need different fixes.
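A rough way to separate retrieval problems from generation problems is to score them independently and cross-tabulate. The sketch below assumes a hypothetical row format where each example records the retrieved document IDs, the gold document IDs, and whether the final answer was judged correct; the bucket names are made up for illustration.

```python
from typing import Dict, List


def retrieval_hit(retrieved_ids: List[str], gold_ids: List[str], k: int = 5) -> bool:
    """True if any gold document appears in the top-k retrieved documents."""
    return bool(set(retrieved_ids[:k]) & set(gold_ids))


def split_rag_errors(rows: List[Dict[str, object]], k: int = 5) -> Dict[str, int]:
    """Bucket each example by whether retrieval and the final answer succeeded."""
    buckets = {
        "retrieval_ok_answer_ok": 0,
        "retrieval_ok_answer_bad": 0,   # points at generation or prompting
        "retrieval_bad_answer_ok": 0,   # answer may come from model memory, worth checking
        "retrieval_bad_answer_bad": 0,  # points at retrieval, chunking, or source data
    }
    for row in rows:
        r = "ok" if retrieval_hit(row["retrieved_ids"], row["gold_ids"], k) else "bad"
        a = "ok" if row["answer_correct"] else "bad"
        buckets[f"retrieval_{r}_answer_{a}"] += 1
    return buckets


if __name__ == "__main__":
    rows = [
        {"retrieved_ids": ["d1", "d7"], "gold_ids": ["d1"], "answer_correct": True},
        {"retrieved_ids": ["d2", "d3"], "gold_ids": ["d3"], "answer_correct": False},
        {"retrieved_ids": ["d4", "d5"], "gold_ids": ["d9"], "answer_correct": False},
    ]
    print(split_rag_errors(rows, k=2))
```

If most failures land in the "retrieval bad" buckets, more fine-tuning of the generator is unlikely to be the right fix.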

If this is a coding fine-tune

For coding models, I would avoid evaluating only with general chat/reasoning benchmarks.

Try to match the benchmark to the coding task:

Also separate:

code completion
instruction-to-code
test repair
repo-level patch generation
agentic issue solving

Those are different capabilities.

If you plan to deploy the model

Quality is not the only thing you might want to benchmark.

If you plan to serve the model, also measure:

latency
throughput
VRAM
cost per request
context length
max output length
batch size
quantization impact
serving backend

But keep this separate from quality evaluation.

A faster model is not automatically a better model, and a more accurate model may still be too slow or expensive for the intended use.
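For a first look at latency, a simple sequential timing loop is often enough. This is a rough sketch: `measure_latency` is a hypothetical helper, `generate` stands in for a call to your model or endpoint, and the p95 value uses a simple nearest-rank approximation. It measures one request at a time, so it says nothing about batched or concurrent throughput, and it expects a non-empty prompt list.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(generate: Callable[[str], str], prompts: List[str], warmup: int = 1) -> Dict[str, float]:
    """Time each request sequentially and report simple latency statistics in seconds."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = max(0, int(round(0.95 * len(timings))) - 1)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "p95_s": timings[p95_index],
        "requests_per_s": len(timings) / sum(timings),
    }


if __name__ == "__main__":
    def fake_generate(prompt: str) -> str:
        time.sleep(0.01)  # stand-in for real inference time
        return "ok"

    print(measure_latency(fake_generate, ["a", "b", "c", "d", "e"]))
```

Report these numbers alongside the hardware, backend, and quantization used, since they are only comparable under the same serving setup.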

Document the benchmark setup

If you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened. Record:

base model name and revision
fine-tuned model name and revision
whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune
dataset name and split
whether eval examples were excluded from training
prompt template
chat template
decoding parameters
metric
benchmark tool and version
hardware/backend, if reporting latency or throughput
sample outputs or failure examples
known limitations

Hugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results:

There is also Hub support for structured evaluation results, although I would still document the plain-English setup clearly:

What information would help people recommend a concrete benchmark?

If you want more specific suggestions, I would add:

What base model did you fine-tune?
Is it LoRA/PEFT, QLoRA, or full fine-tuning?
What task did you fine-tune for?
What dataset did you train on?
Do you have a held-out test set?
Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?
Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?

With those details, people can suggest a much more specific eval setup.

source & further reading

Original article: discuss.huggingface.co