# The energy efficiency of agent networks

> Source: <https://vdf.ai/white-papers/energy-efficiency-benchmark/>
> Published: 2026-06-05 11:29:31+00:00

A controlled benchmark of how VDF AI reduces the energy footprint of enterprise AI — by decomposing
work into **DAG-based agent networks** and dispatching each step through
**SEEMR self-evolving model routing**. The result: up to a 94.9% reduction in predicted
energy, with output quality held non-inferior in aggregate.

Most of the energy an AI system consumes in production is spent at *inference* — the same
request answered again and again [6]. That energy is not a fixed
property of a model. It is the outcome of a decision: which model runs, broken into how many steps,
under what objective.

This paper reports a benchmark of that decision inside VDF AI. We compare a high-intensity baseline —
one large model answering the whole task — against two compounding strategies: routing each request
under an energy-aware objective, and decomposing a workload into a directed graph of smaller,
independently-routed stages. Across 71 configurations spanning four token budgets and five scenario
families, energy-led routing reduced predicted energy by **81–95%**, with a stable
**~94.8%** reduction for the frontier-versus-compact pairing.

Crucially, savings without quality are meaningless. In a separate execution benchmark with a quality
score recorded per task, the routed condition reduced predicted energy by **94.9%** while
remaining **non-inferior** in aggregate under a margin fixed in advance — with the
task-level exceptions disclosed in full. The contribution here is not a single number; it is an
auditable account of how routing and decomposition turn energy into something an enterprise can
measure, steer, and defend.

###### AT A GLANCE

## Six numbers from the benchmark

| Metric | Value | What it means |
|---|---|---|
| Peak energy avoided | 94.9% | Predicted energy removed by eco routing vs. a pinned frontier baseline |
| Efficiency multiple | ≈20× | Less predicted energy per workload at the same task, frontier vs. routed |
| Quality outcome | Non-inferior | Routed quality held within a pre-registered 0.10 margin in aggregate |
| Benchmark depth | 71 configurations | Across five scenario families and four token budgets |
| Savings range | 81–95% | Reduction band observed across different model pairings |
| Selective frontier | 54% | Energy still avoided when one DAG stage deliberately keeps the frontier model |

###### FIGURE 1

## The same work, a fraction of the energy

Aggregate of the quality-constrained execution benchmark: a pinned high-intensity baseline versus energy-aware routing, with the quality guardrail satisfied.

**Fig. 1.** Predicted energy in watt-hours for an identical task set. Figures are
coefficient-based predictions under benchmark conditions, not measured wall power.

## Why inference energy is a decision, not a constant

A model is trained once and served billions of times. The integral of that serving tail now dominates
the one-off training spike [6][8], which
means the most leveraged place to reduce AI's footprint is the dispatcher that decides, per request,
which model runs and how the work is split.

Enterprises increasingly have to *attribute* that energy — for sustainability reporting, for
internal chargeback, and for procurement decisions that no longer accept a single annual number.
So the question this paper answers is concrete: **if you hold the task fixed and change only the
routing and decomposition strategy, how much energy moves?** And does quality survive the
change?

We answer with a benchmark rather than an assertion. Two forms of evidence are reported: a coefficient-based comparison that isolates the effect of routing policy under fixed token assumptions, and a quality-constrained execution benchmark that pairs each energy figure with a measured quality score. The first tells us how big the lever is; the second tells us whether pulling it costs anything.

## The routing objective is a dial you control

The same candidate pool, three presets. Eco leans into energy; Max-Quality deliberately holds the heavy model. That Max-Quality lands at exactly 0% saving is the point — it proves the savings come from the policy, not from a benchmark quietly favouring the small model.

*Panels: frontier-class vs. compact local model; heavy tier vs. light tier.*

The reduction is not a single magic figure. It scales with the energy gap between the candidates available to the router: a wide gap (frontier vs. compact) yields ~95%, a narrower one (heavy tier vs. light tier) yields ~81%. We report the band honestly because that is what a buyer needs to size their own deployment.

| Token budget | Baseline (Wh) | Routed (Wh) | Energy avoided |
|---|---|---|---|
| 500 in · 500 out | 4.30 | 0.225 | 94.77% |
| 1 000 in · 1 000 out | 8.60 | 0.450 | 94.77% |
| 256 in · 512 out | 3.79 | 0.200 | 94.73% |
| 2 000 in · 500 out | 7.90 | 0.405 | 94.87% |

## Don't send one big model. Send a network.

A monolithic call routes the entire workload to a single heavy model. A VDF agent network breaks the same workload into a directed graph of smaller stages — each routed on its own — so the expensive model is used only where it earns its keep.

**Fig. 3.** Fixed total workload (2 400 input · 1 800 output tokens). The last row keeps one
stage on the frontier model on purpose and still avoids 54% of predicted energy — selective use, not
all-or-nothing.
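The comparison between a monolithic call and a staged network can be sketched as a sum over the nodes of a directed acyclic graph. This is an illustration, not the benchmark's implementation: the stage names and token splits are hypothetical (chosen only so that they sum to the fixed 2 400 input · 1 800 output workload), the per-token coefficients are back-solved from the token-budget table under a linear model, and the sketch assumes clean token partitioning with no repeated context. It is not intended to reproduce the Fig. 3 values.

```python
from graphlib import TopologicalSorter

# Predicted Wh per (input token, output token).
# ASSUMPTION: back-solved from the token-budget table; not published values.
MODEL_WH_PER_TOKEN = {
    "frontier": (0.0024, 0.0062),
    "compact": (0.00012, 0.00033),
}

# HYPOTHETICAL four-stage network. Token splits sum to 2400 in / 1800 out.
STAGES = {
    "extract":   {"model": "compact",  "tokens": (800, 300), "after": []},
    "classify":  {"model": "compact",  "tokens": (400, 200), "after": ["extract"]},
    "reason":    {"model": "frontier", "tokens": (600, 500), "after": ["extract"]},
    "summarise": {"model": "compact",  "tokens": (600, 800), "after": ["classify", "reason"]},
}


def stage_wh(model: str, tokens: tuple[int, int]) -> float:
    """Predicted energy of one stage on one model."""
    c_in, c_out = MODEL_WH_PER_TOKEN[model]
    return tokens[0] * c_in + tokens[1] * c_out


def network_wh(stages: dict) -> float:
    """Sum predicted energy over all stages, visiting them in dependency order."""
    graph = {name: spec["after"] for name, spec in stages.items()}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    order = TopologicalSorter(graph).static_order()
    return sum(stage_wh(stages[n]["model"], stages[n]["tokens"]) for n in order)


def monolith_wh(stages: dict, model: str = "frontier") -> float:
    """Predicted energy if the whole workload went to a single model."""
    total_in = sum(spec["tokens"][0] for spec in stages.values())
    total_out = sum(spec["tokens"][1] for spec in stages.values())
    return stage_wh(model, (total_in, total_out))


def avoided_pct(stages: dict) -> float:
    return 100.0 * (1.0 - network_wh(stages) / monolith_wh(stages))


if __name__ == "__main__":
    print(f"monolith: {monolith_wh(STAGES):.2f} Wh")
    print(f"network:  {network_wh(STAGES):.3f} Wh")
    print(f"avoided:  {avoided_pct(STAGES):.1f}%")
```

With this hypothetical split the network avoids roughly 69% of the monolith's predicted energy even though the `reason` stage stays on the frontier model. The figure depends almost entirely on how many tokens the frontier stage keeps, which is the lever the selective-frontier row illustrates.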

## Energy fell. Quality was watched the whole time.

A separate execution benchmark scored routed output against the pinned baseline on a curated task set. In aggregate the routed arm stayed non-inferior under a 0.10 margin set in advance — and we publish the one task that slipped rather than hide it.

Two tasks preserved quality exactly while shedding ~95% of energy. One — factual recall — degraded at
the task level, and one — exact arithmetic — was equally imperfect on both sides, so it neither helped
nor hurt the comparison. The defensible claim is therefore precise: **large energy reductions
with quality non-inferior on average across the evaluated set**, not a blanket promise that
every single task is untouched. That distinction is what separates a credible result from a marketing
number.
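The difference between an aggregate claim and a per-task claim can be made concrete with a small check. The sketch below is a simplified point-estimate version: the 0.10 margin comes from the benchmark, but the scores are hypothetical values shaped like the outcome described above (two tasks unchanged, factual recall degraded, exact arithmetic equally imperfect on both sides), and `task_a` and `task_b` are placeholder names. A formal non-inferiority test would compare a confidence bound on the gap against the margin rather than the mean alone.

```python
from statistics import mean

MARGIN = 0.10  # pre-registered non-inferiority margin reported by the benchmark

# HYPOTHETICAL quality scores in [0, 1] as (baseline, routed) pairs.
# Illustrative only; these are not the benchmark's recorded scores.
SCORES = {
    "task_a": (1.0, 1.0),
    "task_b": (1.0, 1.0),
    "factual_recall": (1.0, 0.7),
    "exact_arithmetic": (0.5, 0.5),
}


def aggregate_gap(scores: dict) -> float:
    """Mean baseline quality minus mean routed quality (positive = routed is worse)."""
    baseline = mean(b for b, _ in scores.values())
    routed = mean(r for _, r in scores.values())
    return baseline - routed


def is_non_inferior(scores: dict, margin: float = MARGIN) -> bool:
    """Aggregate check: the routed arm may trail by at most the margin."""
    return aggregate_gap(scores) <= margin


def degraded_tasks(scores: dict, margin: float = MARGIN) -> list[str]:
    """Tasks whose individual gap exceeds the margin, for disclosure."""
    return [name for name, (b, r) in scores.items() if b - r > margin]


if __name__ == "__main__":
    print(f"aggregate gap: {aggregate_gap(SCORES):.3f}")
    print(f"non-inferior in aggregate: {is_non_inferior(SCORES)}")
    print(f"tasks to disclose: {degraded_tasks(SCORES)}")
```

Reporting both functions side by side mirrors the framing above: the aggregate check can pass while an individual task is still listed for disclosure.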

## What produces the saving

Four mechanisms compound. None of them is exotic; the result comes from making each one explicit and letting them work together.

### Energy as a first-class routing objective

Every candidate model is scored on quality, latency, cost, and energy together. Named presets — Eco, Balanced, Max-Quality — shift the weight on energy explicitly, so sustainability is a setting an operator chooses, not an accident of which model happened to be wired in.
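One way to picture a preset is as a set of weights over the four criteria. The sketch below is an assumption-laden illustration rather than VDF AI's scoring function: the weighted-sum form, the weight values, the normalised candidate figures, and the names `Candidate`, `PRESETS`, and `route` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # normalised 0..1, higher is better
    latency: float  # normalised 0..1, lower is better
    cost: float     # normalised 0..1, lower is better
    energy: float   # normalised 0..1, lower is better


# HYPOTHETICAL weights; the source names the presets but not their values.
PRESETS = {
    "eco":         {"quality": 0.30, "latency": 0.10, "cost": 0.10, "energy": 0.50},
    "balanced":    {"quality": 0.50, "latency": 0.15, "cost": 0.15, "energy": 0.20},
    "max_quality": {"quality": 1.00, "latency": 0.00, "cost": 0.00, "energy": 0.00},
}


def score(candidate: Candidate, weights: dict) -> float:
    """Reward quality, penalise latency, cost, and energy by the preset weights."""
    return (weights["quality"] * candidate.quality
            - weights["latency"] * candidate.latency
            - weights["cost"] * candidate.cost
            - weights["energy"] * candidate.energy)


def route(candidates: list[Candidate], preset: str) -> Candidate:
    """Pick the highest-scoring candidate under the named preset."""
    weights = PRESETS[preset]
    return max(candidates, key=lambda c: score(c, weights))


# HYPOTHETICAL candidate pool with illustrative normalised figures.
POOL = [
    Candidate("frontier", quality=0.95, latency=0.80, cost=0.90, energy=1.00),
    Candidate("compact", quality=0.85, latency=0.20, cost=0.10, energy=0.05),
]

if __name__ == "__main__":
    for preset in PRESETS:
        print(f"{preset}: {route(POOL, preset).name}")
```

In this toy pool the Eco and Balanced presets select the compact model, while Max-Quality, which puts no weight on energy, keeps the frontier model and therefore saves nothing.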

### DAG-based agent networks

Instead of sending an entire workload to one large model, a network decomposes it into a directed graph of smaller stages. Each stage is routed independently, so the heavy model is reserved only for the steps that genuinely need it.

### Self-evolving model routing (SEEMR)

Routing is a continuously-learning decision rather than a fixed map. The dispatcher re-ranks candidates as evidence accumulates, converging on the lowest-energy model that still clears the quality bar for the task in front of it.
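The source does not describe SEEMR's internals, so the following is only a toy illustration of the behaviour stated above: keep a running quality estimate per model, and pick the lowest-energy model whose estimate clears the bar. The class names, the optimistic starting estimate, and the running-mean update are assumptions, and the sketch has no exploration schedule for revisiting a model once it drops below the bar. The two energy values are the 500 in · 500 out predictions from the token-budget table.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    energy_wh: float
    quality_estimate: float = 1.0  # optimistic start so untested models get tried
    observations: int = 0

    def update(self, observed_quality: float) -> None:
        """Incremental running mean; the first observation replaces the prior."""
        self.observations += 1
        self.quality_estimate += (
            (observed_quality - self.quality_estimate) / self.observations
        )


class Router:
    def __init__(self, arms: list[Arm], quality_bar: float) -> None:
        self.arms = {arm.name: arm for arm in arms}
        self.quality_bar = quality_bar

    def choose(self) -> Arm:
        """Lowest-energy arm that clears the bar; else the best-quality arm."""
        eligible = [a for a in self.arms.values()
                    if a.quality_estimate >= self.quality_bar]
        if eligible:
            return min(eligible, key=lambda a: a.energy_wh)
        return max(self.arms.values(), key=lambda a: a.quality_estimate)

    def record(self, name: str, observed_quality: float) -> None:
        """Feed back a measured quality score for the named model."""
        self.arms[name].update(observed_quality)


if __name__ == "__main__":
    router = Router([Arm("compact", 0.225), Arm("frontier", 4.30)], quality_bar=0.8)
    print(router.choose().name)    # compact: cheapest, not yet disproven
    router.record("compact", 0.5)  # a poor result pulls its estimate below the bar
    print(router.choose().name)    # frontier
```

A production router would also need per-task-type estimates and some way to re-test demoted models; both are left out here to keep the ranking logic visible.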

### Pre-registered quality guardrail

Energy savings are only meaningful if quality holds. A separate execution benchmark scores routed output against a pinned high-intensity baseline under a non-inferiority margin fixed in advance, so the quality claim is bounded and testable — not asserted.

## What this looks like at enterprise scale

The per-task numbers are small by design. Their significance is in the multiplier. Take the aggregate quality-constrained result, 3.61 Wh of predicted energy avoided per task set, and apply it to a workload running that comparison one million times: the total comes to about 3,610 kWh (3.61 MWh) of predicted energy avoided.

The kWh figure scales directly from the benchmark's predicted savings; the carbon figure is an illustrative conversion at a stated grid intensity. Both are extrapolations from coefficient-based predictions, offered to convey magnitude — not as a measured datacenter result.
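The extrapolation is plain unit arithmetic and can be sketched in a few lines. The 3.61 Wh per task set and the one million runs come from the text above; the grid intensity is a required input because its value is not given here, and the `PLACEHOLDER_GRID_G_PER_KWH` constant is a made-up number used only to show the call, not the intensity the paper used.

```python
WH_PER_KWH = 1_000
G_PER_KG = 1_000


def avoided_kwh(avoided_wh_per_run: float, runs: int) -> float:
    """Scale a per-run predicted saving in Wh to a total in kWh."""
    return avoided_wh_per_run * runs / WH_PER_KWH


def avoided_kg_co2e(kwh: float, grid_g_co2e_per_kwh: float) -> float:
    """Convert avoided kWh to kg CO2e at a caller-supplied grid intensity."""
    return kwh * grid_g_co2e_per_kwh / G_PER_KG


# PLACEHOLDER ONLY: not the grid intensity stated in the source.
PLACEHOLDER_GRID_G_PER_KWH = 400.0

if __name__ == "__main__":
    kwh = avoided_kwh(avoided_wh_per_run=3.61, runs=1_000_000)
    print(f"predicted energy avoided: {kwh:,.0f} kWh")
    kg = avoided_kg_co2e(kwh, PLACEHOLDER_GRID_G_PER_KWH)
    print(f"at the placeholder intensity: {kg:,.0f} kg CO2e")
```

Keeping the grid intensity as an explicit argument makes the carbon figure easy to recompute for a specific region or reporting standard.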

The strategic point is that this is a software lever. There is no capital expenditure and no migration: the same task runs through a network instead of a monolith, under an objective that an operator sets. For organisations running AI on their own infrastructure, that lever also compounds with the savings they already get from owning the silicon.

## Limitations & honest framing

A result is only as strong as the caveats it is willing to state. These bound the claims above.

- Headline energy figures are predictions from per-model energy coefficients under controlled conditions, not direct wall-power measurements of a specific datacenter.
- The achievable saving depends on the energy gap between available candidates; a narrower gap yields a smaller reduction, which is why we report a band (81–95%) rather than one universal number.
- The quality benchmark uses a curated task set. Aggregate non-inferiority held, but one individual task showed measurable degradation — disclosed in Figure 4 rather than smoothed over.
- Staged-network figures assume clean token partitioning between stages and may understate the overhead of repeated context in some real workflows.

Stated conservatively: *in a controlled benchmark using explicit per-model energy coefficients,
energy-aware routing and DAG decomposition substantially reduced predicted energy across multiple token
budgets and workflow shapes, and the routed condition remained non-inferior in aggregate under a
pre-registered margin.* That is a claim we can defend line by line — which is the only kind worth
publishing.

## Conclusion

Inference energy is a decision variable, and VDF AI exposes the decision. Choose an energy-aware objective and the router moves work to the most efficient model that still clears the bar. Express the work as a network rather than a monolith and the heavy model is reserved for the steps that need it. Done together, across 71 benchmark configurations, these moves removed 81–95% of predicted energy — and the quality guardrail held.

The headline is not a single percentage. It is that energy became *visible, steerable, and
accountable* without sacrificing the answer — and visibility is the precondition for every
improvement that follows.

## References

- [1] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). *Green AI.* Communications of the ACM 63(12), 54–63.
- [2] Strubell, E., Ganesh, A., & McCallum, A. (2019). *Energy and Policy Considerations for Deep Learning in NLP.* ACL.
- [3] Patterson, D. et al. (2021). *Carbon Emissions and Large Neural Network Training.* arXiv:2104.10350.
- [4] Henderson, P. et al. (2020). *Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning.* JMLR 21(248).
- [5] Samsi, S. et al. (2023). *From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference.* IEEE HPEC.
- [6] Desislavov, R., Martínez-Plumed, F., & Hernández-Orallo, J. (2023). *Trends in AI inference energy consumption.* Sustainable Computing.
- [7] Dodge, J. et al. (2022). *Measuring the Carbon Intensity of AI in Cloud Instances.* FAccT.
- [8] Wu, C.-J. et al. (2022). *Sustainable AI: Environmental Implications, Challenges and Opportunities.* MLSys.
- [9] MLCommons (2023). *MLPerf Power Benchmark — Methodology and Rules.*
- [10] Piaggesi, D. et al. (2017). *Non-inferiority testing: design and interpretation.* Statistical methods reference.
