{"slug": "what-s-your-method-for-benchmarking", "title": "What's your method for benchmarking?", "summary": "A practical guide for benchmarking fine-tuned models recommends starting with a held-out test set matching the actual task rather than relying solely on public benchmarks. The workflow includes defining target behavior, creating a diverse eval set, comparing base and fine-tuned models under identical conditions, and inspecting individual failures. Public benchmarks serve as a secondary layer for orientation.", "body_md": "Uh… there are a lot of ways to do this, so I’d split it up a bit. My first suggestion would be: measure the capability you actually care about first.\n\nShort version:\n\n**Keep a held-out test set that was not used for training, run both the base model and your fine-tuned model on it under the same prompts/settings, and compare them with metrics that match your actual task.**\n\nPublic benchmarks and leaderboards are useful, but I would treat them as a second layer. They are good for orientation, not always proof that the fine-tune worked for your use case.\n\nA practical first workflow\n\n- **Define the target behavior**\n  - What was the fine-tune supposed to improve?\n- **Create a held-out eval set**\n  - Use examples that were not in the training data.\n- **Run the base model**\n- **Run the fine-tuned model**\n  - Same examples, same prompt template, same decoding settings.\n- **Compare with task-appropriate metrics**\n  - Accuracy/F1 for classification, schema checks for structured output, rubric/pairwise eval for open-ended chat, unit tests for code, etc.\n- **Inspect individual failures**\n  - Do not only look at the average score.\n- **Add public benchmarks if they match the goal**\n  - Tools like Lighteval or lm-evaluation-harness can help, but only after you know what you want to measure.\n\n1. 
Start with your actual task, not a leaderboard\n\nBefore choosing a benchmark, I would write the goal in plain language.\n\n| If the fine-tune is meant to improve… | A better first eval is usually… |\n| --- | --- |\n| Support-ticket classification | Accuracy / F1 on held-out tickets |\n| Domain QA | Held-out question/answer examples |\n| JSON or structured output | JSON validity, schema validity, field accuracy |\n| Chat helpfulness | Pairwise comparison, rubric, human spot-checks |\n| Summarization | Coverage, factuality, maybe ROUGE/BERTScore as supporting metrics |\n| Code generation | Unit tests, hidden tests, code-specific benchmark tasks |\n| General reasoning | Public reasoning benchmarks may be relevant |\n\nThe key question is:\n\nDoes this benchmark measure the thing I actually fine-tuned for?\n\nA general benchmark can be useful, but it might not be sensitive to the thing you changed. A small domain fine-tune might help your actual use case without moving a public leaderboard score much. Also, a model can do well on a public benchmark and still be poor at your format, domain, or workflow.\n\nUseful starting reference:\n\n2. Minimal checklist\n\nIf you only do one thing, do this:\n\n| Step | Check |\n| --- | --- |\n| Goal | What should improve? |\n| Data | Are eval examples excluded from training? |\n| Baseline | Did you run the base model too? |\n| Fairness | Are prompt/settings the same? |\n| Metric | Does the metric match the task? |\n| Samples | Did you inspect actual outputs? |\n| Regression | Did anything get worse? |\n| Notes | Can someone else reproduce the setup? |\n\nThat is already a useful benchmark for many fine-tunes.\n\n## More detail: building a held-out eval set\n\nA held-out eval set means examples that were not used during fine-tuning.\n\nFor a small first pass, I would include several slices:\n\n| Eval slice | Purpose |\n| --- | --- |\n| Normal cases | Representative examples from the real task |\n| Hard cases | Inputs the base model often gets wrong |\n| Regression cases | Inputs the base model already handles well |\n| Format cases | Inputs that test exact response structure |\n| Edge cases | Ambiguous, long, missing-context, or messy inputs |\n\nThis does not need to be huge at the beginning. A small, carefully chosen set is often better than a large, unfocused one.\n\nFor example, if you fine-tuned a model to answer internal policy questions, I would not start with MMLU. I would first create a small set of policy questions with known expected answers, including:\n\n- easy questions\n- tricky questions\n- questions where the answer is not in the source material\n- questions where the model should refuse or say it does not know\n- questions the base model currently gets wrong\n- questions the base model currently gets right\n\nIf you evaluate on examples that were used in training, the result may mostly measure memorization or overfitting.\n\n## More detail: keeping the comparison fair\n\nA common mistake is to evaluate only the fine-tuned model. 
That gives a score, but it does not answer:\n\nDid fine-tuning help compared with the original model?\n\nI would compare at least:\n\n| Model | Purpose |\n| --- | --- |\n| Base model | Baseline |\n| Fine-tuned model | Measures the effect of fine-tuning |\n| Optional similar model | Extra reference point |\n\nTry to keep these fixed:\n\n- prompt template\n- chat template\n- few-shot examples, if any\n- system message\n- decoding parameters\n  - `temperature`\n  - `top_p`\n  - `max_new_tokens`\n- retrieval context, if this is RAG\n- quantization / backend, if possible\n- model revision\n- adapter revision, if using LoRA/PEFT\n\nOtherwise, you may accidentally benchmark a prompt change, decoding change, or serving change instead of the fine-tune.\n\nFor LoRA/PEFT models, I would also be careful to record whether you evaluated:\n\n- the base model alone\n- the base model plus adapter\n- a merged model\n- a quantized version\n- a served endpoint version\n\nThose can behave differently enough that the exact setup matters.\n\n3. 
Pick metrics by task\n\nThere is no single metric that works for all fine-tuned LLMs.\n\nA useful reference is Hugging Face Evaluate’s metric guide:\n\nVery roughly:\n\n| Task type | Possible metrics / checks |\n| --- | --- |\n| Classification | accuracy, precision, recall, F1 |\n| Extraction | exact match, field-level accuracy, schema validity |\n| QA with known answers | exact match, F1, semantic correctness, human spot-check |\n| Summarization | ROUGE/BERTScore can help, but inspect factuality and coverage |\n| Translation | BLEU / chrF / COMET, depending on setup |\n| Open-ended chat | rubric scoring, pairwise comparison, human or LLM judge |\n| Instruction following | constraint pass rate, format adherence |\n| Coding | unit tests, pass rate, hidden tests, code benchmark tasks |\n| RAG/document QA | answer correctness, context recall, faithfulness, citation usefulness |\n| Deployment | latency, throughput, VRAM, cost per request |\n\nI would avoid relying on only one weak signal, such as:\n\n- “the answer looks good to me”\n- training loss\n- validation loss\n- one public leaderboard score\n- one cherry-picked demo\n\nThose can be useful, but they are not the whole evaluation.\n\n## More detail: eval loss vs task success\n\nIf you trained with `Trainer`, `SFTTrainer`, or another training loop, validation/eval loss is useful as a training signal. It can help you notice overfitting or training instability.\n\nBut for final evaluation, especially generation tasks, I would not stop at eval loss.\n\nA lower loss does not automatically mean:\n\n- the answer is factually better\n- the model follows your requested format better\n- the model is safer\n- the model is less hallucination-prone\n- the model is better for your actual users\n- the model did not regress on general behavior\n\nFor generation tasks, save actual model outputs and inspect them.\n\nRelevant docs:\n\n4. 
Inspect failures, not only scores\n\nAggregate scores are useful, but a lot of the value comes from looking at individual examples.\n\nFor each output, I would save something like:\n\n| Field | Why |\n| --- | --- |\n| Input prompt | Reproduces the case |\n| Expected answer / rubric | Defines what “good” means |\n| Base model output | Baseline |\n| Fine-tuned output | Comparison |\n| Score / pass-fail | Aggregate metric |\n| Short error label | Helps find patterns |\n\nUseful error labels might be:\n\n- wrong answer\n- incomplete answer\n- hallucinated detail\n- ignored instruction\n- wrong format\n- invalid JSON\n- too verbose\n- too short\n- unsafe refusal\n- should have refused but did not\n- correct answer but bad explanation\n- regression from base model\n\nThis is one reason sample-level logging is useful. Lighteval, for example, emphasizes sample-by-sample results for deeper inspection:\n\n5. Public benchmarks are a second layer\n\nOnce your own task-specific eval is in place, public benchmarks can be useful.\n\nBut I would choose them by target capability.\n\n| If you care about… | Look at… | Main caution |\n| --- | --- | --- |\n| General reasoning / knowledge | MMLU-Pro, GPQA, LiveBench, HELM-like suites | May not match your domain |\n| Instruction following | IFEval-style tests | Measures verifiable constraints, not all helpfulness |\n| Open-ended chat quality | MT-Bench / Arena-style pairwise evals / AlpacaEval-like setups | Judge and preference biases matter |\n| Coding | LiveCodeBench, BigCodeBench, SWE-bench depending on task | Code completion and repo-level issue fixing are different |\n| RAG / document QA | RAGAS-style component metrics | Retrieval and generation should be separated |\n| Deployment | latency, throughput, VRAM, cost | Not the same as quality |\n\nCommon benchmark-running tools:\n\nI would not start with:\n\nWhich leaderboard should I optimize for?\n\nI would start with:\n\nWhich capability did I fine-tune for, and does this benchmark actually 
measure it?\n\n## More detail: public benchmark map\n\nPublic benchmarks are useful, but they answer different questions.\n\n| Benchmark / eval style | Closer to measuring | Main caution |\n| --- | --- | --- |\n| MMLU-Pro / GPQA | Broad academic knowledge and reasoning | Not necessarily your domain task |\n| LiveBench | General capability with newer, contamination-aware questions | Still not a replacement for your own eval set |\n| IFEval | Verifiable instruction-following constraints | Does not measure all helpfulness or factuality |\n| Chatbot Arena / MT-Bench style evals | Open-ended assistant preference | Judge bias and preference bias matter |\n| SWE-bench | Repo-level software engineering issue solving | Not the same as simple code completion |\n| Serving benchmarks | Latency, throughput, cost, memory | Not a direct quality benchmark |\n\nUseful links:\n\nThe trap is treating a leaderboard score as universal. It is not. A benchmark has:\n\n- a task distribution\n- a prompt format\n- a scoring rule\n- model/backend assumptions\n- possible contamination issues\n- sometimes hidden or changing evaluation details\n\nSo I would use leaderboard scores as orientation, not as the only decision rule.\n\n6. Special cases\n\nThe right evaluation changes a lot depending on what kind of fine-tune this is.\n\n## If the task is open-ended chat\n\nFor open-ended chat, there may not be one exact correct answer. In that case, I would compare outputs with a small rubric.\n\nExample rubric:\n\n| Criterion | Question |\n| --- | --- |\n| Correctness | Is the answer factually right? |\n| Completeness | Does it answer the whole question? |\n| Instruction following | Did it follow user constraints? |\n| Grounding | If sources/context are provided, does it stay grounded? |\n| Format | Does it output the requested structure? |\n| Safety | Does it refuse only when appropriate? |\n| Usefulness | Would a user know what to do next? |\n\nA practical method:\n\n- Take the same prompt.\n- Generate an answer from the base model.\n- Generate an answer from the fine-tuned model.\n- Hide which is which.\n- Compare them side by side.\n- Record the winner and the reason.\n\nYou can use a human judge, an LLM judge, or both. If using an LLM judge, I would treat it as a useful signal, not ground truth.\n\nReferences:\n\nIf you use LLM-as-a-judge, I would at least:\n\n- use a clear rubric\n- swap answer order A/B vs B/A\n- avoid rewarding longer answers automatically\n- manually inspect a sample of judgments\n- keep judge prompts and judge model/version recorded\n\n## If the task is formatting or instruction following\n\nIf the fine-tune was meant to improve formatting, do not rely only on a general benchmark.\n\nFor example, if the model must output JSON, measure:\n\n- valid JSON rate\n- schema-valid rate\n- required fields present\n- forbidden fields absent\n- correct field values\n- refusal behavior when input is invalid\n- robustness to longer or messy inputs\n\nIf the task is instruction following, you can make small tests inspired by IFEval:\n\n- “Use exactly N bullet points”\n- “Mention keyword X at least twice”\n- “Do not use word Y”\n- “Return only JSON”\n- “Answer in one sentence”\n- “Include these fields and no others”\n\nIFEval itself is here:\n\nThe general idea is simple:\n\nIf the desired behavior is objectively checkable, write checks for it.\n\n## If this is RAG or document QA\n\nIf your fine-tuned model is part of a RAG/document QA system, I would avoid giving the model one single score too early.\n\nSplit the evaluation:\n\n| Component | What to check |\n| --- | --- |\n| Retrieval | Did the system retrieve the right documents/chunks? |\n| Context use | Did the answer use the provided context? |\n| Answer correctness | Is the final answer correct? |\n| Faithfulness | Is the answer supported by the retrieved context? |\n| Abstention | Does it say “not enough information” when the answer is not in context? |\n| Citation quality | Do citations actually support the claim? |\n\nThis matters because a bad answer can come from different places:\n\n- bad retrieval\n- bad generation\n- bad prompting\n- bad chunking\n- bad source data\n- bad evaluation examples\n\nThose need different fixes.\n\n## If this is a coding fine-tune\n\nFor coding models, I would avoid evaluating only with general chat/reasoning benchmarks.\n\nTry to match the benchmark to the coding task:\n\n| Coding task | More relevant eval |\n| --- | --- |\n| Function-level generation | Unit tests, HumanEval-like tasks, BigCodeBench-like tasks |\n| Competitive programming | LiveCodeBench-like tasks |\n| Repo-level bug fixing | SWE-bench-like tasks |\n| Internal code assistant | Your own held-out tasks from your codebase, with tests |\n\nAlso separate:\n\n- code completion\n- instruction-to-code\n- test repair\n- repo-level patch generation\n- agentic issue solving\n\nThose are different capabilities.\n\n## If you plan to deploy the model\n\nQuality is not the only thing you might want to benchmark.\n\nIf you plan to serve the model, also measure:\n\n- latency\n- throughput\n- VRAM\n- cost per request\n- context length\n- max output length\n- batch size\n- quantization impact\n- serving backend\n\nBut keep this separate from quality evaluation.\n\nA faster model is not automatically a better model, and a more accurate model may still be too slow or expensive for the intended use.\n\n7. 
Document the benchmark setup\n\nIf you publish the model or share results, make the benchmark reproducible enough for someone else to understand what happened.\n\nRecord:\n\n- base model name and revision\n- fine-tuned model name and revision\n- whether it is LoRA/PEFT, QLoRA, or a merged/full fine-tune\n- dataset name and split\n- whether eval examples were excluded from training\n- prompt template\n- chat template\n- decoding parameters\n- metric\n- benchmark tool and version\n- hardware/backend, if reporting latency or throughput\n- sample outputs or failure examples\n- known limitations\n\nHugging Face model cards are a good place to document intended use, limitations, training details, and evaluation results:\n\nThere is also Hub support for structured evaluation results, although I would still document the plain-English setup clearly:\n\n8. What information would help people recommend a concrete benchmark?\n\nIf you want more specific suggestions, I would add:\n\n- What base model did you fine-tune?\n- Is it LoRA/PEFT, QLoRA, or full fine-tuning?\n- What task did you fine-tune for?\n- What dataset did you train on?\n- Do you have a held-out test set?\n- Is the goal accuracy, instruction following, chat quality, formatting, coding, RAG, latency, or something else?\n- Are you trying to publish a model card/leaderboard result, or just check whether the fine-tune helped?\n\nWith those details, people can suggest a much more specific eval setup.", "url": "https://wpnews.pro/news/what-s-your-method-for-benchmarking", "canonical_source": "https://discuss.huggingface.co/t/whats-your-method-for-benchmarking/177127#post_3", "published_at": "2026-06-25 12:04:22+00:00", "updated_at": "2026-06-25 12:52:20.380765+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-research", "ai-tools", "developer-tools"], "entities": ["Lighteval", "lm-evaluation-harness", "MMLU", "ROUGE", "BERTScore"], "alternates": {"html": 
"https://wpnews.pro/news/what-s-your-method-for-benchmarking", "markdown": "https://wpnews.pro/news/what-s-your-method-for-benchmarking.md", "text": "https://wpnews.pro/news/what-s-your-method-for-benchmarking.txt", "jsonld": "https://wpnews.pro/news/what-s-your-method-for-benchmarking.jsonld"}}