# VoltanaLLM: Energy-Efficient LLM Serving

> Source: <https://supercomputing-system-ai-lab.github.io/projects/voltana/>
> Published: 2026-06-24 05:24:36+00:00

## Relevance and Early Observation

**LLMs are deployed at unprecedented scale**, making inference a major driver of energy consumption and total cost of ownership (TCO). Recent studies show inference can account for **90% of AI infrastructure utilization**, pushing datacenter power and thermal limits. Large datacenters today can consume electricity equivalent to **millions of households**.

At the same time, **latency-sensitive applications** like chat assistants and agent pipelines must meet strict **Service Level Objectives (SLOs)**, typically expressed as bounds on Time-To-First-Token (TTFT) and Inter-Token Latency (ITL). Violating these SLOs degrades user experience and downstream responsiveness.

*The central challenge: how can we serve LLMs under tight SLOs while reducing their energy footprint?*

Our empirical profiling of LLM inference reveals a non-monotonic energy–frequency relationship. As the profiling figure shows, reducing GPU frequency from 1410 MHz to 1005 MHz (a ~28.7% reduction) does increase execution time, but the increase is sub-linear: execution time grows by less than the frequency ratio would suggest. Consequently, total energy follows a U-shaped curve with respect to GPU frequency. At low frequencies, the longer execution time dominates energy, whereas at high frequencies, the higher power draw dominates; in between lies an energy sweet spot.
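To make the U-shape concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the profiling methodology behind the measurements: the cubic power law, the split of execution time into a frequency-independent part and a frequency-scaled part, and every constant (`p_static`, `p_dynamic`, `t_fixed`, `t_scaled`, the 15 MHz grid) are hypothetical values chosen only for illustration. Energy is modeled as power times execution time.

```python
# Toy energy-frequency model. All constants below are illustrative
# assumptions, not measured values.
F_MAX_MHZ = 1410.0


def exec_time_s(f_mhz, t_fixed=0.5, t_scaled=0.5):
    """Execution time: a part that ignores the clock plus a part that scales as 1/f.

    The fixed part is what makes the slowdown sub-linear when f drops.
    """
    return t_fixed + t_scaled * (F_MAX_MHZ / f_mhz)


def power_w(f_mhz, p_static=100.0, p_dynamic=300.0):
    """Power: static draw plus a dynamic term assumed to grow cubically with f."""
    return p_static + p_dynamic * (f_mhz / F_MAX_MHZ) ** 3


def energy_j(f_mhz):
    """Energy for one unit of work: power multiplied by execution time."""
    return power_w(f_mhz) * exec_time_s(f_mhz)


def sweet_spot(freqs):
    """Return the frequency in `freqs` with the lowest modeled energy."""
    return min(freqs, key=energy_j)


# Hypothetical frequency grid in MHz (15 MHz steps up to 1410 MHz).
freqs = list(range(210, 1411, 15))
best = sweet_spot(freqs)

# Relative frequency reduction from 1410 MHz to 1005 MHz, as quoted in the text.
reduction = (1410 - 1005) / 1410

if __name__ == "__main__":
    print(f"frequency reduction: {reduction:.1%}")
    print(f"modeled slowdown at 1005 MHz: {exec_time_s(1005) / exec_time_s(1410):.3f}x")
    print(f"modeled sweet spot: {best} MHz")
```

In this toy model the sweet spot falls strictly inside the frequency range because lowering the clock keeps paying the static power for longer, while raising it inflates the dynamic term. Where the minimum actually sits on real hardware depends on the workload and the GPU, so the modeled location should not be read as a measured result.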
