Monitoring LLM Inference with Prometheus and Grafana (vLLM, TGI, Llama.cpp)

A new guide details how to monitor LLM inference in production using Prometheus and Grafana, covering metrics like tokens/sec, queue duration, and KV cache pressure for servers such as vLLM, TGI, and llama.cpp. The guide emphasizes that traditional API metrics are insufficient and provides PromQL examples and deployment patterns for Docker Compose and Kubernetes.

Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp

LLM inference looks like “just another API” until latency spikes, queues back up, and your GPUs sit at 95% memory with no obvious explanation.

Monitoring becomes mission-critical the moment you move beyond a single-node setup or start optimizing for throughput. At that point, traditional API metrics aren’t enough. You need visibility into tokens, batching behavior, queue time, and KV cache pressure: the real bottlenecks of modern LLM systems.

This article is part of my broader observability and monitoring guide, where I cover monitoring vs observability fundamentals, Prometheus architecture, and production best practices. Here, we’ll focus specifically on monitoring LLM inference workloads (https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/).

If you’re deciding on infrastructure, see my guide to LLM hosting in 2026 (https://www.glukhov.org/llm-hosting/). If you want a deep dive into batching mechanics, VRAM limits, and throughput vs latency trade-offs, see the LLM performance engineering guide (https://www.glukhov.org/llm-performance/).

Unlike typical REST services, LLM serving is shaped by tokens, continuous batching, KV cache utilization, GPU/CPU saturation, and queue dynamics. Two requests with identical payload sizes can have radically different latency depending on max new tokens, concurrency, and cache reuse.
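To make the last point concrete, here is a rough sketch of how two requests with the same prompt size can end up with very different end-to-end latency. The throughput and per-token numbers are illustrative assumptions, not measurements from any particular server, and the `Request` class and `estimate_latency_s` helper are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int    # tokens processed during prefill
    max_new_tokens: int   # tokens generated during decode


def estimate_latency_s(
    req: Request,
    prefill_tokens_per_s: float = 5000.0,  # assumed prefill speed
    inter_token_latency_s: float = 0.02,   # assumed decode time per token
    queue_wait_s: float = 0.0,             # assumed time spent waiting in queue
) -> float:
    """Rough model: E2E latency = queue wait + prefill time + decode time."""
    prefill_s = req.prompt_tokens / prefill_tokens_per_s
    decode_s = req.max_new_tokens * inter_token_latency_s
    return queue_wait_s + prefill_s + decode_s


# Same prompt size, very different generation length.
short = Request(prompt_tokens=1000, max_new_tokens=5)
long = Request(prompt_tokens=1000, max_new_tokens=500)

print(f"short: {estimate_latency_s(short):.2f}s")
print(f"long:  {estimate_latency_s(long):.2f}s")
```

Under these assumed numbers, decode time dominates the longer request, which is why request-size-based dashboards alone tend to miss the real cause of slow responses.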
This guide is a practical, production-focused walkthrough for building LLM inference monitoring with Prometheus and Grafana:
- What to measure (p95/p99 latency, tokens/sec, queue duration, cache utilization, error rate)
- How to scrape /metrics from common servers (vLLM, Hugging Face TGI, llama.cpp)
- PromQL examples for percentiles, saturation, and throughput
- Deployment patterns with Docker Compose and Kubernetes
- Troubleshooting the issues that only appear under real load

The examples are intentionally vendor-neutral. Whether you later add OpenTelemetry tracing, autoscaling, or a service mesh, the same metric model applies.

Why you should monitor LLM inference differently

Traditional API monitoring (RPS, p95 latency, error rate) is necessary but not sufficient. LLM serving adds further axes:

1) Latency has two meanings
- E2E latency: time from request received → final token returned.
- Inter-token latency: time per token during decode (critical for streaming UX).

Some servers expose both. For example, TGI exposes request duration and mean time-per-token as histograms.

2) Throughput is in tokens, not requests
A “fast” service that returns 5 tokens is not comparable to one returning 500 tokens. Your “RPS” should often be “tokens/sec”.

3) The queue is the product
If you run continuous batching, queue depth is what you sell. Watching queue duration and queue size tells you whether you’re meeting user expectations.

4) Cache pressure is an outage precursor
KV cache exhaustion or fragmentation often shows up as sudden latency spikes and timeouts. vLLM exposes KV cache usage as a gauge.

Metrics checklist for LLM inference monitoring

Use this as your north star. You don’t need everything on day one, but you’ll want most of it eventually.
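Before the checklist itself, a small sketch of the "tokens, not requests" point may help. It derives requests/sec and tokens/sec from two samples of cumulative counters, roughly the way a rate is computed from counter deltas. The `CounterSample` fields and the sample values are hypothetical and do not correspond to metric names from any specific server; the reset handling is a simplification.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    timestamp_s: float     # scrape time in seconds
    requests_total: float  # hypothetical cumulative request counter
    tokens_total: float    # hypothetical cumulative generated-token counter


def per_second_rate(earlier: float, later: float, elapsed_s: float) -> float:
    """Average per-second increase of a monotonically increasing counter."""
    if elapsed_s <= 0:
        raise ValueError("samples must be in chronological order")
    if later < earlier:
        # Counter reset (e.g. a server restart): count the increase from zero.
        return later / elapsed_s
    return (later - earlier) / elapsed_s


# Two made-up scrapes, 60 seconds apart.
first = CounterSample(timestamp_s=0.0, requests_total=1000, tokens_total=50_000)
second = CounterSample(timestamp_s=60.0, requests_total=1120, tokens_total=98_000)

elapsed = second.timestamp_s - first.timestamp_s
rps = per_second_rate(first.requests_total, second.requests_total, elapsed)
tps = per_second_rate(first.tokens_total, second.tokens_total, elapsed)

print(f"requests/sec: {rps:.1f}")
print(f"tokens/sec:   {tps:.1f}")
print(f"avg tokens per request: {tps / rps:.0f}")
```

The same request rate can hide very different token throughput, so it is usually worth charting both side by side.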
Golden signals (LLM-flavored)
- Traffic: requests/sec, tokens/sec
- Errors: error rate, timeouts, OOMs, 429s (rate limiting)
- Latency: p50/p95/p99 request duration; prefill vs decode latency; inter-token latency
- Saturation: GPU utilization, memory usage, KV cache usage, queue size

If you need low-level visibility into GPU memory usage, temperature, and utilization outside of Prometheus (for debugging or single-node setups), see my guide to GPU monitoring applications in Linux / Ubuntu (https://www.glukhov.org/observability/gpu-monitoring-apps-linux/).

For a broader view of LLM observability beyond metrics (including tracing, structured logs, synthetic testing, GPU profiling, and SLO design), see my in-depth guide on observability for LLM systems (https://www.glukhov.org/observability/observability-for-llm-systems/).

Useful dimensions (labels)

Keep label cardinality low.

Good labels:
- model, endpoint, method (prefill/decode), status (success/error), instance

Avoid labels like:
- raw prompt, raw user id, request ids: these explode series count.

Exposing metrics: built-in /metrics endpoints (vLLM, TGI, llama.cpp)

The easiest path is: use the metrics the server already exposes.

vLLM: Prometheus-compatible /metrics

vLLM exposes a Prometheus-compatible /metrics endpoint via its Prometheus metrics logger and publishes server/request metrics with the vllm: prefix, including gauges like running requests and KV cache usage. For container setup, OpenAI-compatible serving, and throughput-oriented runtime tuning, see vLLM Quickstart: High-Performance LLM Serving (https://www.glukhov.org/llm-hosting/vllm/vllm-quickstart/).

Example metrics you’ll typically see:
- vllm:num_requests_running
- vllm:num_requests_waiting
- vllm:kv_cache_usage_perc

Hugging Face TGI: /metrics with queue + request histograms

TGI exposes many production-grade metrics on /metrics, including queue size, request duration, queue duration, and mean time per token.
Notable ones:
- tgi_queue_size (gauge)
- tgi_request_duration (histogram, e2e latency)
- tgi_request_queue_duration (histogram)
- tgi_request_mean_time_per_token_duration (histogram)

Operational setup (Docker, GPUs, launch flags, and the failures that show up as empty or misleading scrapes) is covered in TGI - Text Generation Inference - Install, Config, Troubleshoot (https://www.glukhov.org/llm-hosting/tgi/).

llama.cpp server: enable metrics endpoint

The llama.cpp server supports a Prometheus-compatible metrics endpoint that must be enabled with a flag (e.g., --metrics). For installation paths, key runtime flags, and OpenAI-compatible server usage, see llama.cpp Quickstart with CLI and Server (https://www.glukhov.org/llm-hosting/llama-cpp/).

If you’re running llama.cpp behind a proxy, scrape the server directly whenever possible (to avoid proxy-level latency hiding the actual inference behavior).

Prometheus configuration: scraping your inference servers

This example assumes:
- vLLM at http://vllm:8000/metrics
- TGI at http://tgi:8080/metrics
- llama.cpp at http://llama:8080/metrics
- scrape interval tuned for fast feedback

prometheus.yml

    global:
      scrape_interval: 5s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: "vllm"
        metrics_path: /metrics
        static_configs:
          - targets: ["vllm:8000"]

      - job_name: "tgi"
        metrics_path: /metrics
        static_configs:
          - targets: ["tgi:8080"]

      - job_name: "llama_cpp"
        metrics_path: /metrics
        static_configs:
          - targets: ["llama:8080"]

If you’re new to Prometheus or want a deeper explanation of scrape configs, exporters, relabeling, and alerting rules, see my full Prometheus monitoring setup guide (https://www.glukhov.org/observability/monitoring-with-prometheus/).

Pro tip: add a “service label”

If you run multiple models/replicas, add relabeling to include a stable service label for dashboards.
    relabel_configs:
      - target_label: service
        replacement: "llm-inference"

PromQL examples you can copy/paste

Request rate (RPS)

    sum(rate(tgi_request_count[5m]))

For vLLM, use its request counters (names vary by version), but the pattern is the same:

sum rate