Measuring LLM Inference: A Practical Look at token-sec-calc I published on GitHub

A developer published token-sec-calc, an open-source Python CLI tool that benchmarks LLM inference throughput, latency, time-to-first-token, and queue wait against any OpenAI-compatible endpoint. The tool supports both closed-loop and open-loop Poisson arrival modes to help size model deployments accurately.

When you self-host an LLM (vLLM, SGLang, TGI, llama.cpp server) or wire your app to a hosted gateway, one question dominates every capacity decision: how many tokens per second can this thing actually deliver?

That number is harder to pin down than it sounds. Output length varies because of EOS. Prompt length varies because real prompts vary. Streaming adds time-to-first-token. Concurrency changes everything. And the moment you put it under a sustained request rate, queueing shows up.

token-sec-calc (https://github.com/TechPreacher/token-sec-calc) is a small Python CLI that focuses on exactly this problem: producing the throughput, latency, TTFT, and queue-wait numbers you actually need to size a model deployment, against any OpenAI-compatible endpoint, with no SDK lock-in.

The repo is available at: https://github.com/TechPreacher/token-sec-calc

This post walks through what it does, how to use it, where it shines, and what is still missing.

What it is

A single console command, benchmark, that hits /v1/completions or /v1/chat/completions and reports:

- Aggregate throughput in tokens/sec across the entire run.
- Per-request latency percentiles: p50, p90, p95, p99.
- Time-to-first-token (TTFT) under SSE streaming.
- Dispatcher queue wait under a sustained Poisson arrival rate.
- Side-by-side comparisons across multiple endpoints or models in one invocation.

It is intentionally not an SDK.
The transport is plain requests over HTTPS, the streaming branch is a hand-rolled SSE consumer, and the only mandatory third-party dependency at runtime is python-dotenv. tiktoken is an optional extra for accurate token counts.

The project lives at https://github.com/TechPreacher/token-sec-calc and is MIT-licensed.

Installing

Python ≥ 3.11. The project uses uv (https://docs.astral.sh/uv/?ref=corti.com) for dependency and environment management.

    git clone https://github.com/TechPreacher/token-sec-calc.git
    cd token-sec-calc
    uv sync                    # runtime deps + editable install
    uv sync --group dev        # add pytest + pyyaml + ruff for the test suite
    uv sync --extra tiktoken   # optional: accurate tokenization

After sync, both forms work:

    uv run benchmark --help
    uv run python -m benchmark --help

A .env file is the easiest way to configure the common case:

    cp .env.example .env
    $EDITOR .env               # set ENDPOINT, API_KEY, MODEL
    uv run benchmark

CLI flags always win over .env, so you can keep a sane default and override per run.

The two run modes

Closed-loop (default)

N concurrent requests fire in parallel; the slowest completion ends the trial; the next trial starts. Throughput is total tokens / total wall time.

    uv run benchmark --concurrent 8 --trials 10 --max_tokens 256

Use this when you want to characterize steady-state batched serving at a known concurrency level. It is the right model for "we want to run 8 concurrent generations and we want to know what the server does."

Open-loop (Poisson QPS)

Set --qps and --duration and the runner switches to a pre-scheduled Poisson arrival pattern. Requests are submitted to a shared worker pool at their scheduled times; if the pool is saturated, requests queue inside the executor and that queueing is surfaced as queue_wait_s per request, plus a Dispatcher queue wait row in the summary.

    uv run benchmark --qps 25 --duration 60 --max_tokens 128

This is the mode that answers the question that closed-loop cannot: what happens to my serving stack at a target request rate, including head-of-line blocking and saturation? The achieved request rate is reported alongside the target so you immediately see if the server fell behind.

Streaming + TTFT

    uv run benchmark --stream true --concurrent 4 --trials 5

With streaming on, the SSE consumer captures the wall-clock time of the first content delta per request. A Time-to-first-token percentile block appears in the summary and a ttft_s column appears in the per-request log. Combine with --qps for serving-style benchmarks where TTFT is the SLO that actually matters.

Removing the most common sources of noise

Two features deserve special attention because they are the difference between a benchmark you trust and a benchmark you don't.

Pinned output length. Tokens per second is meaningless if every request stops at a different output length because of EOS. With --ignore_eos true (default), the runner sends both ignore_eos: true and min_tokens: max_tokens (vLLM/SGLang extensions). Every request emits exactly max_tokens. The aggregate throughput number becomes load-comparable across prompts. Strict hosted gateways may reject those fields: set --ignore_eos false for OpenAI, Together, Anthropic-shaped endpoints, etc.

Pinned input length. Prompt length variance biases output throughput because longer prompts take longer to prefill. --prompt_tokens 256 pads or truncates every prompt to exactly 256 tokens, as counted by the active tokenizer, before sending. The normalization uses a binary search on character length to land on the target token count.

    uv run benchmark --prompt_tokens 256 --max_tokens 256

With both pinned, output throughput is comparable across runs, models, and backends in a way that ad-hoc benchmarks rarely manage.

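The input-length normalization can be sketched as follows. This is an illustration rather than the tool's implementation: normalize_prompt, the filler string, and the padding strategy are all assumptions. The binary search relies on the token count never decreasing as the text gets longer, which holds for the simple len(text) // 4 estimate used here but only approximately for BPE tokenizers.

```python
from typing import Callable


def chars4(text: str) -> int:
    """Rough token estimate: one token per four characters."""
    return len(text) // 4


def normalize_prompt(
    prompt: str,
    target_tokens: int,
    count_tokens: Callable[[str], int] = chars4,
    filler: str = " lorem",  # hypothetical padding text, must be non-empty
) -> str:
    """Pad or truncate prompt so that count_tokens(result) reaches target_tokens.

    Assumes count_tokens is non-decreasing in prefix length.
    """
    if target_tokens <= 0:
        return ""
    # Pad: repeat the filler until the text is at least long enough.
    text = prompt
    while count_tokens(text) < target_tokens:
        text += filler
    # Truncate: binary search for the shortest prefix that reaches the target.
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi) // 2
        if count_tokens(text[:mid]) >= target_tokens:
            hi = mid
        else:
            lo = mid + 1
    return text[:lo]


short = normalize_prompt("Explain KV caching.", 256)
long = normalize_prompt("x" * 5000, 256)
print(chars4(short), chars4(long))
```

With a real tokenizer the shortest qualifying prefix may overshoot the target by a token, so a production version would likely verify the final count and adjust.
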
Multi-endpoint comparison

Pass comma-separated values to --endpoint, --model, and/or --api_key. Singleton lists are broadcast; non-singleton lists must share the same length.

    uv run benchmark \
      --endpoint http://a:8000/v1/completions,http://b:8000/v1/completions \
      --model llama-8b,mistral-7b \
      --api_key EMPTY \
      --concurrent 4 --trials 3

Each config runs sequentially with the same flags and a comparison table is printed at the end:

    =================================================================
    Comparison
    =================================================================
       model       endpoint                          req  fail  tok/s   p50_lat  p95_lat  p99_lat  mean_pT
    -  ----------  --------------------------------  ---  ----  ------  -------  -------  -------  -------
    1  llama-8b    http://a:8000/v1/completions      12   0     158.42  0.821    1.118    1.144    12
    2  mistral-7b  http://b:8000/v1/completions      12   0     142.07  0.913    1.241    1.288    12

When --output runs.jsonl is set in matrix mode, per-config logs are auto-suffixed: runs.0.jsonl, runs.1.jsonl, ... TTFT and achv_qps columns appear automatically when those values are populated.

Per-request logging

--output runs.jsonl writes one JSON object per non-warmup request:

    {
      "trial": 1,
      "request_index": 0,
      "prompt_chars": 47,
      "prompt_tokens": 12,
      "output_tokens": 128,
      "latency_s": 0.821,
      "ttft_s": null,
      "scheduled_offset_s": null,
      "queue_wait_s": null,
      "ok": true,
      "estimated": false,
      "error": ""
    }

.csv extension writes the same schema with a header row.

Field semantics:

- output_tokens: server-reported usage.completion_tokens when present; otherwise estimated from text length, with estimated: true flagged on that row.
- ttft_s: populated only under streaming successes.
- scheduled_offset_s / queue_wait_s: populated only under open-loop QPS.
- error: empty string on success.

Warmup requests are intentionally excluded so that startup latency does not leak into the dataset.

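A per-request log like the one described above is easy to post-process. The sketch below reads JSONL rows and computes latency percentiles with the nearest-rank method. It assumes the field names are latency_s, output_tokens, ok, and estimated (with underscores) and that each object sits on its own line. The percentile method is chosen for illustration, and the tool's own summary may interpolate differently.

```python
import json
import math
from typing import Iterable


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for 0 < pct <= 100."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(pct / 100 * len(ordered))))
    return ordered[rank - 1]


def summarize(lines: Iterable[str]) -> dict:
    """Summarize per-request JSONL rows (field names are assumptions)."""
    rows = [json.loads(line) for line in lines if line.strip()]
    ok_rows = [r for r in rows if r.get("ok")]
    latencies = [r["latency_s"] for r in ok_rows]
    return {
        "requests": len(rows),
        "failures": len(rows) - len(ok_rows),
        "output_tokens": sum(r["output_tokens"] for r in ok_rows),
        # Rows whose token count was estimated rather than server-reported.
        "estimated_rows": sum(1 for r in ok_rows if r.get("estimated")),
        "p50_latency_s": percentile(latencies, 50) if latencies else None,
        "p95_latency_s": percentile(latencies, 95) if latencies else None,
    }


# Example with two in-memory rows; in practice, pass an open file object.
sample = [
    json.dumps({"latency_s": 0.821, "output_tokens": 128, "ok": True, "estimated": False}),
    json.dumps({"latency_s": 0.0, "output_tokens": 0, "ok": False, "estimated": False}),
]
print(summarize(sample))
```

Failed rows are counted but left out of the latency percentiles, so a burst of fast errors does not make the server look quicker than it is.
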
Tokenizer choices

The active tokenizer is used both for per-request prompt-token counts and as the fallback when the server omits usage.completion_tokens. The server's reported count always wins when present.

- --tokenizer auto (default): tiktoken:cl100k_base if installed, otherwise chars/4 with a one-time stderr notice.
- --tokenizer chars4: len(text) // 4. Fast, ASCII-biased, dependency-free.
- --tokenizer tiktoken: