{"slug": "33-llm-metrics-to-watch-closely", "title": "33 LLM metrics to watch closely", "summary": "A comprehensive list of 33 metrics for evaluating large language models (LLMs) has been compiled, covering performance indicators such as time to first token, average tokens per second, throughput, error rate, token efficiency, tail latency, and cost per token. These metrics aim to help developers and users measure and manage LLM behavior in real-time applications, agentic systems, and high-throughput environments.", "body_md": "We’ve all heard the mantra from the quants in the business community: you can’t manage what you can’t measure. And if that’s true for human intelligence, it should be true for the [artificial kind](https://www.infoworld.com/article/4061121/a-brief-history-of-ai.html) too.\n\nHow do we measure [agents](https://www.infoworld.com/article/3812583/what-you-need-to-know-about-developing-ai-agents.html) and [large language models](https://www.infoworld.com/article/2335213/large-language-models-the-foundations-of-generative-ai.html) (LLMs)? We’re just beginning to come up with statistical metrics. Here are several of the most common metrics that designers and users toss about when they’re evaluating a model.\n\nHow long does it take to generate the first token? For real-time applications with time constraints, faster responses can be essential. It’s well-known that people hate waiting even a few milliseconds. The teams that develop user interfaces learned decades ago that it’s important for the software to respond quickly when a human is waiting for an answer. Even a few seconds of delay mean that the human will wander off to another window to check some email or place some bet on a prediction market. Time to first token is a good measure for models that will be working directly with the fickle human intelligences and their latent attention deficit disorder.\n\nTake the total time it takes to respond and divide by the total number of tokens. 
The time to first token measures how long it takes to start a response, while this measures the average speed as the model works through all of the tokens. In basic LLMs, this value is generally fairly constant. Once the prefill is done and the LLM enters the decode phase, the output tokens usually appear in a steady stream. When the output is long enough, the startup time to first token is amortized away. In some of the more complicated architectures with loops for planning or gathering data from various tools, the average speed can vary as the model shifts in and out of making agentic decisions.\n\nThis is just the reciprocal of the average time per token. Sometimes it is reported separately for different stages in the pipeline.\n\nIf a system supports more than a single user, tracking the number of different requests makes sense. These throughput numbers can be quite useful for measuring the power of some of the newer pipelines that are more efficient when they’re answering multiple prompts at the same time.\n\nNot every request gets an answer. The error rate tracks how often rate limits, timeouts, or model “refusals” get in the way. Better accounting tracks each independently because the number of failures in each category can be very different.\n\nNot all work tokens are visible and not all tokens are part of the final outcome. This measures how much work is done to produce the final result. As models become more complex or agentic and the pipelines become more sophisticated, the efficiency tends to drop. Agentic reasoning and strategic planning typically require more tokens that don’t appear in the final answer. This is generally a measure of how expensive a model might be to run.\n\nIt’s all well and good to measure the average time to answer, but in some cases a few very slow responses can really color people’s judgement. Some applications require good performance all of the time. 
Would you want to ride in an autonomous car that gets steering instructions very quickly “on average” instead of always? What if that’s only 99% of the time? Tail latency uses a mixture of queuing theory and detailed measurements to track the worst moments in the long tail of the latency graph. It’s useful when even occasional delays are problematic.\n\nProjects that use an API or buy output from providers just look at the cost per 1M tokens. They’re effectively renters. The groups that are buying GPUs and paying for electricity, though, will add up these costs and other indirect costs like depreciation and maintenance to come up with a number that estimates how much the tokens really cost to produce. This value will depend upon demand and utilization rates—that is, on how many users are sending in prompts and how efficiently the model fits in a particular GPU and its RAM.\n\nMany models have numbers in their name followed by a B. This is meant to roughly capture the number of parameters, or the number of variables the model uses to generate outputs from inputs. The number “70B” means that there are about 70 billion parameters in the model. This is a good estimate for the complexity of the model and the size of the training set that has been stuffed into it. Generally bigger numbers mean a larger amount of information is hiding inside the model. It often means that it will take a bigger GPU with more RAM to generate an answer with it. It’s not a very precise number, though, because there are many other areas of the architecture that can influence whether the model can generate the answer you want inside your budget. There continue to be advances and it’s not uncommon for someone to claim that a new model with X parameters is better than an old model with 2X or 3X parameters.\n\nWhile everyone wants LLMs to generate accurate output, measuring it can be difficult because deciding what’s accurate is sometimes complicated. 
One approach is to ask the LLM to summarize a document. Then another model evaluates how well the summary matches the original. While this may not catch all subtle slips, it will catch many of the worst departures from reality. Some researchers have built complex test sets with curated answers. The LLMs that deliver the expected answers get the highest scores. Some common benchmarks are [TruthfulQA](https://github.com/sylinrl/TruthfulQA), [HaluEval](https://arxiv.org/abs/2305.11747), [QAFactEval](https://github.com/salesforce/QAFactEval), and Vectara’s [Hallucination Evaluation Model](https://github.com/vectara/hallucination-leaderboard) (HHEM).\n\nIf measuring accuracy is difficult, building a metric to detect toxic or biased output is even more challenging because the definitions can be so protean. Still, some teams have built LLMs that key on particular concepts or word choices. They can detect some of the most obvious red flags that could generate political trouble. Some well-known versions include [Granica Screen](https://www.granica.ai/blog/granica-launches-ai-data-safety-solution-granica-screen-on-aws-marketplace) and [Perspective API](https://perspectiveapi.com/).\n\nOne of the biggest fears is that LLMs will somehow absorb information that may be considered personal and private. The simplest measures can be regular expressions that look for the sixteen-digit numbers used for credit card transactions. Many of the model builders work on eliminating personally identifiable information (PII) from the training set before training begins.\n\nAs models grow more complex and agentic, they often gain access to various tools or [Model Context Protocol](https://www.infoworld.com/article/4029634/what-is-model-context-protocol-how-mcp-bridges-ai-and-external-services.html) (MCP) gateways that can help them find the best answers. Not all models take advantage of this help. 
The tool-calling accuracy scores count how often the models choose the best tool for the job. One particular example of this measurement is [BFCL](https://gorilla.cs.berkeley.edu/leaderboard.html) (Berkeley Function Calling Leaderboard).\n\nThe value captures how small changes in the language of the prompt induce the model to produce different results. It’s like a derivative from calculus class, although it’s generally computed experimentally using some collection of test prompts. There are a number of different approaches that depend upon different types of changes. Some test sets are built with small rephrasings of the request that are semantically the same. Others mix together different ways of specifying the problem, some with examples, say, and some without. Some specific examples include [PromptSE](https://arxiv.org/html/2509.13680) and [ProSA](https://arxiv.org/abs/2410.12405).\n\nSome metrics evaluate the answers by comparing them to a set of gold standard answers. This often involves feeding them to a [vector embedding](https://www.infoworld.com/article/2335281/vector-databases-in-llms-and-search.html) model and searching a [retrieval-augmented generation](https://www.infoworld.com/article/2335814/what-is-retrieval-augmented-generation-more-accurate-and-reliable-llms.html) (RAG) database for similar answers. This can track how concise or fluffy the answers might be, as well as how much variability might be introduced by changing parameters like the temperature. One common example is the [BERTScore](https://bertscore.com/).\n\nMany systems that combine an LLM with a vector search tool for RAG measure the effectiveness of the combination with a benchmark like the grounding score. The LLM is presented with extra data from the vector search and the benchmark measures how closely it follows this extra information. 
That is, how much of the answer comes from the provided source documents and how much is synthesized using the data in its training set. Some examples include [RAGAS](https://aclanthology.org/2024.eacl-demo.16/), [TruLens](https://www.trulens.org/), [ARES](https://ares-ai.vercel.app/) (Automated RAG Evaluation System), [RGB](https://github.com/chen700564/RGB) (Retrieval-Augmented Generation Benchmark), [HaluEval](https://arxiv.org/abs/2305.11747), and [HalluHard](https://halluhard.com/). A similar concept is called “context adherence,” “context precision,” “context recall,” or “faithfulness.”\n\nMost LLMs fold in a certain amount of random entropy, and this amount is often controlled by a parameter called the “temperature.” The model variability is a measure of how much the answers will change between runs. Some applications like chatbots require a certain amount of variability because the randomness adds a bit of “life” to the answers. In other applications, such as those in mission-critical areas like law or medicine, answers that vary will undermine confidence.\n\nIn some roles, LLMs are asked to produce data in strict formats like JSON or CSV. This is often important if the data will be fed into some pipeline for further processing or storage. The format compliance rate tests a number of common formats and measures how often the LLM returns semantically correct data. Agentic systems that glue together multiple LLMs and other tools rely heavily on LLMs with good scores on this benchmark.\n\nSome prompts include very specific instructions and the adherence can be measured empirically. For example, some prompts will ask the LLM to produce exactly 300 words or a poem in rhyming couplets. These tests use a collection of sample prompts that ask for answers that can be easily measured. 
Some specific examples include [IFEval](https://arxiv.org/abs/2311.07911), [FollowBench](https://github.com/YJiangcm/FollowBench), and the [BFCL](https://gorilla.cs.berkeley.edu/leaderboard.html) (Berkeley Function Calling Leaderboard), a value that is mentioned above in the section on tool usage.\n\nAs agentic models become more common, it’s helpful to track how well the model performs on each of the various parts of the agent’s strategic plan. All of the metrics here can be broken down and tracked for each of the subgoals.\n\nAgentic models start with a plan. Some of them are smart enough to abandon the plan or at least adjust it as the work evolves. Plan stability measures how often the plans are adjusted. A high rate of adjustment could mean that the agent is a bad planner or just flexible or maybe both.\n\nSome agents are able to dive deeper and recognize their mistakes. The self-correction score measures how often the model will make a mistake and then recognize it, either on its own or after being prompted with the question, “Are you really sure?”\n\nSome users try to find clever ways to lure the LLM into tossing aside any restrictions on topics or answers. In the past, some LLMs could be fooled by being told the answer was part of a play or a work of fiction. So discussing forbidden subjects wasn’t a problem because it was all pretend. Newer models have more elaborate defenses. Measures of the ability to resist deception include [JailbreakBench](https://jailbreakbench.github.io/), [AgentHarm](https://arxiv.org/abs/2410.09024), and [Tele-AI-Safety](https://arxiv.org/pdf/2512.05485).\n\nSometimes untrusted data from extra sources or skills may include malicious instructions that can exploit the LLM. 
Benchmarks such as [Skill-Inject](https://arxiv.org/abs/2602.20156) and [SPIKEE](https://spikee.ai/) (Simple Prompt Injection Kit for Evaluation and Exploitation) work with known attack vectors and measure how susceptible a model is to targeted prompt injection attacks.\n\nSome LLMs can regurgitate the data in their training corpus in a way that seems like plagiarism or copyright infringement. This can be an issue when the training material wasn’t carefully licensed. The copyright infringement score measures how often the LLM may parrot the training material a bit too closely. Tools for defending against this include [CopyrightCatcher](https://www.patronus.ai/blog/introducing-copyright-catcher) and [DE-COP](https://arxiv.org/abs/2402.09910).\n\nHow well can a model extract information from the entire context? [NIAH](https://github.com/gkamradt/needle-in-a-haystack) (needle-in-a-haystack) [benchmarks](https://arxiv.org/pdf/2504.04713) measure how well a model can retrieve small, crucial bits of information from long contexts. [RULER](https://github.com/NVIDIA/RULER) takes NIAH tests further with the ability to vary the types and quantities of needles, the size of the haystack, and the complexity of the task.\n\nThe developers of [GSM8K](https://arxiv.org/abs/2110.14168) (Grade School Math 8K) set out to benchmark an LLM’s ability to tackle multistep mathematical problems, so they gathered [8,500 problems](https://huggingface.co/datasets/openai/gsm8k) that are common in grade school math classes. While the focus is explicitly on solving math homework problems, the benchmark also measures the ability to construct reasoning chains.\n\nThe [Graduate-Level Google-Proof Q&A](https://arxiv.org/pdf/2311.12022) is composed of hundreds of hard questions that might normally be answered by humans in graduate school, generally in science. To make the benchmark harder, the researchers focused on questions that non-experts often get wrong. 
The term “Google-proof” means that the benchmark includes questions that can’t be easily answered by asking a search engine.\n\nThe [MMLU-Pro](https://github.com/TIGER-AI-Lab/MMLU-Pro) benchmark builds on the Massive Multitask Language Understanding dataset to test a model’s understanding of a broad set of scientific knowledge. It includes more than 12,000 questions about general scientific fields like biology, chemistry, economics, and law.\n\nGoogle created [MBPP](https://github.com/google-research/google-research/tree/master/mbpp) (Mostly Basic Python Problems) to evaluate how well a model solves coding questions. Each problem comes with a statement, a gold standard solution, and several similar test cases. The number of accurate answers to these questions is a good measure of how well the model will solve many of the simpler Python coding problems presented by users.\n\nThis [collection](https://github.com/SWE-bench/SWE-bench) of several thousand software engineering challenges evaluates how well a model solves programming problems. The developers created it by selecting a number of issues and corresponding pull requests from a dozen or so Python projects. After some limitations appeared, the creators expanded the set by creating [SWE-Bench+](https://arxiv.org/abs/2410.06992), [SWE Bench Verified](https://openai.com/index/introducing-swe-bench-verified/), and [SWE-Bench Pro](https://arxiv.org/abs/2509.16941).\n\nInstead of creating a fixed set of test prompts, the Large Model Systems Organization’s [Chatbot Arena](https://www.lmsys.org/) is a dynamic system that feeds the same prompt to different models and then asks humans to pick the best results. 
These head-to-head contests produce an [Elo](https://en.wikipedia.org/wiki/Elo_rating_system)-like rating that is similar to the one used to score chess players.\n\nThe rest of these metrics are useful, but as the real estate agents say, the three most important numbers on a property listing are price, price, and price. The cost is a bit less important for measuring AIs, but only a bit. Price can make a huge difference between a project being profitable and a moneysink. When the cost for each inference is a tad too high, it’s impossible to make it up with volume.\n\nThe key caveat is that a cheaper model isn’t a good idea if it generates answers that are filled with hallucinations or worse. The quality of the answers can differ greatly, and saving a few pennies can be a mistake. To make matters more complicated, there’s an explosion in different styles and approaches. Sometimes it makes sense to pay a bit more for a model that delivers answers with the right vibe.", "url": "https://wpnews.pro/news/33-llm-metrics-to-watch-closely", "canonical_source": "https://www.infoworld.com/article/4183716/33-llm-metrics-to-watch-closely.html", "published_at": "2026-06-15 09:00:00+00:00", "updated_at": "2026-06-15 09:13:50.572159+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-tools", "ai-infrastructure", "ai-research"], "entities": ["LLM", "GPU"], "alternates": {"html": "https://wpnews.pro/news/33-llm-metrics-to-watch-closely", "markdown": "https://wpnews.pro/news/33-llm-metrics-to-watch-closely.md", "text": "https://wpnews.pro/news/33-llm-metrics-to-watch-closely.txt", "jsonld": "https://wpnews.pro/news/33-llm-metrics-to-watch-closely.jsonld"}}