{"slug": "monitoring-llm-inference-with-prometheus-and-grafana-vllm-tgi-llama-cpp", "title": "Monitoring LLM Inference with Prometheus and Grafana (vLLM, TGI, Llama.cpp)", "summary": "A new guide details how to monitor LLM inference in production using Prometheus and Grafana, covering metrics like tokens/sec, queue duration, and KV cache pressure for servers such as vLLM, TGI, and llama.cpp. The guide emphasizes that traditional API metrics are insufficient and provides PromQL examples and deployment patterns for Docker Compose and Kubernetes.", "body_md": "# Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp\n\nLLM inference looks like “just another API” until latency spikes, queues back up, and your GPUs sit at 95% memory with no obvious explanation.\n\nMonitoring becomes mission-critical the moment you move beyond a single-node setup or start optimizing for throughput. At that point, traditional API metrics aren’t enough. You need visibility into tokens, batching behavior, queue time, and KV cache pressure - the real bottlenecks of modern LLM systems.\n\nThis article is part of my broader **observability and monitoring guide**, where I cover monitoring vs observability fundamentals, Prometheus architecture, and production best practices. Here, we’ll focus specifically on [monitoring](https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/) **LLM inference workloads**.\n\n(If you’re deciding on infrastructure, see my guide to [LLM hosting in 2026](https://www.glukhov.org/llm-hosting/). If you want a deep dive into batching mechanics, VRAM limits, and throughput vs latency trade-offs, see the [LLM performance engineering guide](https://www.glukhov.org/llm-performance/).)\n\nUnlike typical REST services, LLM serving is shaped by **tokens**, **continuous batching**, **KV cache utilization**, **GPU/CPU saturation**, and **queue dynamics**. 
Two requests with identical payload sizes can have radically different latency depending on *max_new_tokens*, *concurrency*, and *cache reuse*.\n\nThis guide is a practical, production-focused walkthrough for building **LLM inference monitoring with Prometheus and Grafana**:\n\n- What to measure (p95/p99 latency, tokens/sec, queue duration, cache utilization, error rate)\n- How to scrape `/metrics` from common servers (**vLLM**, **Hugging Face TGI**, **llama.cpp**)\n- PromQL examples for percentiles, saturation, and throughput\n- Deployment patterns with **Docker Compose** and **Kubernetes**\n- Troubleshooting the issues that only appear under real load\n\nThe examples are intentionally vendor-neutral. Whether you later add OpenTelemetry tracing, autoscaling, or a service mesh, the same metric model applies.\n\n## Why you should monitor LLM inference differently\n\nTraditional API monitoring (RPS, p95 latency, error rate) is necessary but not sufficient. LLM serving adds additional axes:\n\n### 1) Latency has two meanings\n\n- **E2E latency**: time from request received → final token returned.\n- **Inter-token latency**: time per token during decode (critical for streaming UX).\n\nSome servers expose both. For example, TGI exposes request duration and mean time-per-token as histograms.\n\n### 2) Throughput is in tokens, not requests\n\nA “fast” service that returns 5 tokens is not comparable to one returning 500 tokens. Your “RPS” should often be “**tokens/sec**”.\n\n### 3) The queue is the product\n\nIf you run continuous batching, queue depth is what you sell. Watching **queue duration** and **queue size** tells you whether you’re meeting user expectations.\n\n### 4) Cache pressure is an outage precursor\n\nKV cache exhaustion (or fragmentation) often shows up as sudden latency spikes and timeouts. vLLM exposes KV cache usage as a gauge.\n\n## Metrics checklist for LLM inference monitoring\n\nUse this as your north star. 
You don’t need everything on day one, but you’ll want most of it eventually.\n\n### Golden signals (LLM-flavored)\n\n- **Traffic:** requests/sec, tokens/sec\n- **Errors:** error rate, timeouts, OOMs, 429s (rate limiting)\n- **Latency:** p50/p95/p99 request duration; prefill vs decode latency; inter-token latency\n- **Saturation:** GPU utilization, memory usage, KV cache usage, queue size\n\nIf you need low-level visibility into GPU memory usage, temperature, and utilization outside of Prometheus (for debugging or single-node setups), see my guide to [GPU monitoring applications in Linux / Ubuntu](https://www.glukhov.org/observability/gpu-monitoring-apps-linux/).\n\nFor a broader view of LLM observability beyond metrics (including tracing, structured logs, synthetic testing, GPU profiling, and SLO design), see my in-depth guide on [observability for LLM systems](https://www.glukhov.org/observability/observability-for-llm-systems/).\n\n### Useful dimensions (labels)\n\nKeep label cardinality low. Good labels:\n\n- `model`\n- `endpoint`\n- `method` (prefill/decode)\n- `status` (success/error)\n- `instance`\n\nAvoid labels like:\n\n- raw `prompt`, raw `user_id`, request ids: these explode series count.\n\n## Exposing metrics: built-in `/metrics` endpoints (vLLM, TGI, llama.cpp)\n\nThe easiest path is: **use the metrics the server already exposes**.\n\n### vLLM: Prometheus-compatible `/metrics`\n\nvLLM exposes a Prometheus-compatible `/metrics` endpoint (via its Prometheus metrics logger) and publishes server/request metrics with the `vllm:` prefix, including gauges like running requests and KV cache usage.\n\nFor container setup, OpenAI-compatible serving, and throughput-oriented runtime tuning, see [vLLM Quickstart: High-Performance LLM Serving](https://www.glukhov.org/llm-hosting/vllm/vllm-quickstart/).\n\nExample metrics you’ll typically see:\n\n- `vllm:num_requests_running`\n- `vllm:num_requests_waiting`\n- `vllm:kv_cache_usage_perc`\n\n### Hugging Face TGI: 
`/metrics` with queue + request histograms\n\nTGI exposes many production-grade metrics on `/metrics`, including queue size, request duration, queue duration, and mean time per token.\n\nNotable ones:\n\n- `tgi_queue_size` (gauge)\n- `tgi_request_duration` (histogram, e2e latency)\n- `tgi_request_queue_duration` (histogram)\n- `tgi_request_mean_time_per_token_duration` (histogram)\n\nOperational setup (Docker, GPUs, launch flags, and the failures that show up as empty or misleading scrapes) is covered in [TGI - Text Generation Inference - Install, Config, Troubleshoot](https://www.glukhov.org/llm-hosting/tgi/).\n\n### llama.cpp server: enable metrics endpoint\n\nThe llama.cpp server supports a Prometheus-compatible metrics endpoint that must be enabled with a flag (e.g., `--metrics`).\n\nFor installation paths, key runtime flags, and OpenAI-compatible server usage, see [llama.cpp Quickstart with CLI and Server](https://www.glukhov.org/llm-hosting/llama-cpp/).\n\nIf you’re running llama.cpp behind a proxy, scrape the server directly whenever possible (to avoid proxy-level latency hiding the actual inference behavior).\n\n## Prometheus configuration: scraping your inference servers\n\nThis example assumes:\n\n- vLLM at `http://vllm:8000/metrics`\n- TGI at `http://tgi:8080/metrics`\n- llama.cpp at `http://llama:8080/metrics`\n- scrape interval tuned for fast feedback\n\n`prometheus.yml`\n\n```\nglobal:\n  scrape_interval: 5s\n  evaluation_interval: 15s\n\nscrape_configs:\n  - job_name: \"vllm\"\n    metrics_path: /metrics\n    static_configs:\n      - targets: [\"vllm:8000\"]\n\n  - job_name: \"tgi\"\n    metrics_path: /metrics\n    static_configs:\n      - targets: [\"tgi:8080\"]\n\n  - job_name: \"llama_cpp\"\n    metrics_path: /metrics\n    static_configs:\n      - targets: [\"llama:8080\"]\n```\n\nIf you’re new to Prometheus or want a deeper explanation of scrape configs, exporters, relabeling, and alerting rules, see my full [Prometheus monitoring 
setup guide](https://www.glukhov.org/observability/monitoring-with-prometheus/).\n\n### Pro tip: add a “service label”\n\nIf you run multiple models/replicas, add relabeling to include a stable `service` label for dashboards.\n\n```\nrelabel_configs:\n  - target_label: service\n    replacement: \"llm-inference\"\n```\n\n## PromQL examples you can copy/paste\n\n### Request rate (RPS)\n\n```\nsum(rate(tgi_request_count[5m]))\n```\n\nFor vLLM, use its request counters (names vary by version), but the pattern is the same: `sum(rate(<counter>[5m]))`.\n\n### Error rate (%)\n\nIf you have `*_success` counters, compute the failure ratio (a value between 0 and 1; multiply by 100 for a percentage):\n\n```\n1 - (\n  sum(rate(tgi_request_success[5m]))\n  /\n  sum(rate(tgi_request_count[5m]))\n)\n```\n\n### p95 latency for histogram metrics (Prometheus)\n\nPrometheus histograms are bucketed counts; use `histogram_quantile()` over `rate()` of the buckets. Prometheus documents this model and the histogram vs summary tradeoffs.\n\n```\nhistogram_quantile(\n  0.95,\n  sum by (le) (rate(tgi_request_duration_bucket[5m]))\n)\n```\n\n### p99 queue time\n\n```\nhistogram_quantile(\n  0.99,\n  sum by (le) (rate(tgi_request_queue_duration_bucket[5m]))\n)\n```\n\n### Mean time per token (inter-token latency)\n\n```\nhistogram_quantile(\n  0.95,\n  sum by (le) (rate(tgi_request_mean_time_per_token_duration_bucket[5m]))\n)\n```\n\nInter-token latency is often constrained by decode bottlenecks and memory bandwidth - topics covered in detail in the [LLM performance optimization guide](https://www.glukhov.org/llm-performance/).\n\n### Queue depth (instant)\n\n```\nmax(tgi_queue_size)\n```\n\n### vLLM KV cache utilization (instant)\n\n```\nmax(vllm:kv_cache_usage_perc)\n```\n\n## Grafana dashboards: panels that actually help on-call\n\nGrafana can visualize histograms in multiple ways (percentiles, heatmaps, bucket distributions). 
Grafana Labs has a detailed guide to Prometheus histogram visualization.\n\nA minimal, high-signal dashboard layout:\n\n### Row 1: User experience\n\n- **p95 request latency** (time series)\n- **p95 inter-token latency** (time series)\n- **Error rate** (time series + stat)\n\n### Row 2: Capacity and saturation\n\n- **Queue size** (time series)\n- **Running vs waiting requests** (stacked)\n- **KV cache usage %** (gauge)\n\n### Row 3: Throughput\n\n- **Requests/sec**\n- **Generated tokens/request (p50/p95)**\n\nIf you have streaming, add a panel for “first token latency” (TTFT) when available.\n\n### Example Grafana queries\n\n- p95 latency panel: the `histogram_quantile(0.95, …)` query above\n- heatmap panel: graph the bucket rates (`*_bucket`) as a heatmap (Grafana supports this approach)\n\n## Deployment option 1: Docker Compose (fast local + single-node)\n\nIf you’re deciding between local, self-hosted, or cloud-based inference architectures, see the full breakdown in my [LLM hosting comparison guide](https://www.glukhov.org/llm-hosting/).\n\nCreate a folder like:\n\n```\nmonitoring/\n  docker-compose.yml\n  prometheus/\n    prometheus.yml\n  grafana/\n    provisioning/\n      datasources/datasource.yml\n      dashboards/dashboards.yml\n    dashboards/\n      llm-inference.json\n```\n\n`docker-compose.yml`\n\n```\nservices:\n  prometheus:\n    image: prom/prometheus:latest\n    volumes:\n      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro\n    ports:\n      - \"9090:9090\"\n\n  grafana:\n    image: grafana/grafana:latest\n    environment:\n      - GF_SECURITY_ADMIN_USER=admin\n      - GF_SECURITY_ADMIN_PASSWORD=admin\n    volumes:\n      - ./grafana/provisioning:/etc/grafana/provisioning\n      - ./grafana/dashboards:/var/lib/grafana/dashboards\n    ports:\n      - \"3000:3000\"\n    depends_on:\n      - prometheus\n```\n\nIf you prefer a manual Grafana installation instead of Docker, see my step-by-step guide on [installing and using Grafana on 
Ubuntu](https://www.glukhov.org/observability/grafana-installing-using-in-ubuntu/).\n\n### Grafana datasource provisioning (`grafana/provisioning/datasources/datasource.yml`)\n\n```\napiVersion: 1\ndatasources:\n  - name: Prometheus\n    type: prometheus\n    access: proxy\n    url: http://prometheus:9090\n    isDefault: true\n```\n\n### Dashboard provisioning (`grafana/provisioning/dashboards/dashboards.yml`)\n\n```\napiVersion: 1\nproviders:\n  - name: \"LLM\"\n    folder: \"LLM\"\n    type: file\n    disableDeletion: true\n    options:\n      path: /var/lib/grafana/dashboards\n```\n\n## Deployment option 2: Kubernetes (Prometheus Operator + ServiceMonitor)\n\nIf you use **kube-prometheus-stack** (Prometheus Operator), scrape targets via `ServiceMonitor`.\n\nFor infrastructure trade-offs between Kubernetes, single-node Docker, and managed inference providers, see my [LLM hosting in 2026 guide](https://www.glukhov.org/llm-hosting/).\n\n### 1) Expose your inference deployment with a Service\n\n```\napiVersion: v1\nkind: Service\nmetadata:\n  name: tgi\n  labels:\n    app: tgi\nspec:\n  selector:\n    app: tgi\n  ports:\n    - name: http\n      port: 8080\n      targetPort: 8080\n```\n\n### 2) Create a `ServiceMonitor`\n\n```\napiVersion: monitoring.coreos.com/v1\nkind: ServiceMonitor\nmetadata:\n  name: tgi\n  labels:\n    release: kube-prometheus-stack\nspec:\n  selector:\n    matchLabels:\n      app: tgi\n  endpoints:\n    - port: http\n      path: /metrics\n      interval: 5s\n```\n\nRepeat for vLLM and llama.cpp services. 
This scales cleanly as you add replicas.\n\n### 3) Alerting: SLO-style rules (example)\n\nHere are good starter alerts:\n\n- **High p95 latency** (burn rate)\n- **Queue time p99 too high** (users waiting)\n- **Error rate > 1%**\n- **KV cache usage > 90%** sustained (capacity cliff)\n\nExample rule (p95 request duration):\n\n```\n- alert: LLMHighP95Latency\n  expr: histogram_quantile(0.95, sum by (le) (rate(tgi_request_duration_bucket[5m]))) > 3\n  for: 10m\n  labels:\n    severity: page\n  annotations:\n    summary: \"TGI p95 latency > 3s (10m)\"\n```\n\n## Troubleshooting: common Prometheus + Grafana failures in LLM stacks\n\n### 1) Prometheus target is “DOWN”\n\n**Symptoms**\n\n- Prometheus UI → Targets shows `DOWN`\n- “context deadline exceeded” or connection refused\n\n**Checklist**\n\n- Is the server actually exposing `/metrics`?\n- Wrong port? Wrong scheme (http vs https)?\n- Kubernetes: is the Service selecting pods? Is the ServiceMonitor label `release` correct?\n\n**Quick test**\n\n```\ncurl -sS http://tgi:8080/metrics | head\n```\n\n### 2) You can scrape metrics, but panels are empty\n\n**Most common causes**\n\n- Wrong metric name (server version changed)\n- Dashboard expects `_bucket` but you only have a gauge/counter\n- Prometheus scrape interval too long for short windows (e.g., `[1m]` with 30s scrape can be noisy)\n\n**Fix**\n\n- Use Grafana Explore to search metric prefixes (e.g., `tgi_` / `vllm:`)\n- Increase range window from `[1m]` → `[5m]`\n\n### 3) Histogram percentiles look “flat” or wrong\n\nPrometheus histograms require correct aggregation:\n\n- use `rate(metric_bucket[5m])`\n- then `sum by (le)` (and *optionally* other stable labels)\n- then `histogram_quantile()`\n\nPrometheus documents the bucket model and server-side quantile calculation.\n\nGrafana’s histogram visualization guide includes practical panel patterns.\n\n### 4) Cardinality explosion (Prometheus memory spikes)\n\n**Symptoms**\n\n- Prometheus RAM usage 
climbs\n- “too many series” errors\n\n**Typical root cause**\n\n- You added `prompt`, `user_id`, or request ids as labels in a custom exporter.\n\n**Fix**\n\n- Remove high-cardinality labels\n- Pre-aggregate into low-cardinality labels (model, endpoint, status)\n- Consider using logs/traces for per-request debugging instead of labels\n\n### 5) “We have metrics, but no idea why it’s slow”\n\nMetrics are necessary, but sometimes you need correlation:\n\n- Add **structured logs** with request metadata (model, token counts, TTFT)\n- Add **tracing** (OpenTelemetry) around your gateway + inference server\n- Use exemplars (when supported) to jump from a latency spike to a trace\n\nA good workflow: Grafana dashboard spike -> click into Explore -> narrow by instance/model -> check logs/traces for that period.\n\nThis follows the classic metrics -> logs -> traces model described in the [observability and monitoring architecture guide](https://www.glukhov.org/observability/).\n\n### 6) vLLM / multi-process metric quirks\n\nIf your serving stack runs in multiple processes, you may need Prometheus multi-process configuration (depends on how the process exposes metrics). 
The vLLM docs emphasize exposing metrics via `/metrics` for Prometheus polling; check the server’s metrics mode when deploying.\n\n## A practical “day-1” dashboard and alert set\n\nIf you want a lean setup that still works in production, start with:\n\n**Dashboard panels**\n\n- p95 request latency\n- p95 mean time per token\n- queue size\n- p95 queue duration\n- error rate\n- KV cache usage %\n\n**Alerts**\n\n- p95 request latency > X for 10m\n- p99 queue duration > Y for 10m\n- error rate > 1% for 5m\n- KV cache usage > 90% for 15m\n- Prometheus target down (always)\n\n## Related Observability Guides\n\n- [Observability Guide: Prometheus, Grafana & Production Monitoring](https://www.glukhov.org/observability/)\n- [Prometheus Monitoring: Setup & Best Practices](https://www.glukhov.org/observability/monitoring-with-prometheus/)\n- [Install and Use Grafana on Ubuntu: Complete Guide](https://www.glukhov.org/observability/grafana-installing-using-in-ubuntu/)\n\n## Related LLM Infrastructure Guides\n\n- [LLM Hosting in 2026: Local, Self-Hosted & Cloud Compared](https://www.glukhov.org/llm-hosting/)\n- [LLM Performance in 2026: Benchmarks & Optimization](https://www.glukhov.org/llm-performance/)\n\n## Closing notes\n\nPrometheus + Grafana gives you the “always-on” view of inference health. 
Once you have the basics, the next big wins usually come from:\n\n- SLOs per model / tenant\n- request shaping (max tokens, concurrency limits)\n- autoscaling tied to **queue time** and **KV cache headroom**\n\nFor a broader explanation of monitoring vs observability, Prometheus fundamentals, and production patterns, see my complete [observability guide](https://www.glukhov.org/observability/).", "url": "https://wpnews.pro/news/monitoring-llm-inference-with-prometheus-and-grafana-vllm-tgi-llama-cpp", "canonical_source": "https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/", "published_at": "2026-06-15 02:34:15+00:00", "updated_at": "2026-06-15 02:41:50.772959+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "mlops", "ai-products"], "entities": ["Prometheus", "Grafana", "vLLM", "Hugging Face TGI", "llama.cpp", "Docker Compose", "Kubernetes"], "alternates": {"html": "https://wpnews.pro/news/monitoring-llm-inference-with-prometheus-and-grafana-vllm-tgi-llama-cpp", "markdown": "https://wpnews.pro/news/monitoring-llm-inference-with-prometheus-and-grafana-vllm-tgi-llama-cpp.md", "text": "https://wpnews.pro/news/monitoring-llm-inference-with-prometheus-and-grafana-vllm-tgi-llama-cpp.txt", "jsonld": "https://wpnews.pro/news/monitoring-llm-inference-with-prometheus-and-grafana-vllm-tgi-llama-cpp.jsonld"}}