
Will the future be millions of agents running tasks every day?

A new benchmark project for comparing agent runtime performance across C++, Python, TypeScript, and Rust reports that its C++ stack reached a peak memory footprint of approximately 93 MiB while running 100 concurrent coding agents on HumanEval tasks. The controlled test is designed to isolate runtime overhead from model performance by using identical tasks, models, hardware, and loop logic across all four language stacks. Only the C++ stack has been measured so far; the Python, TypeScript, and Rust runners are still pending, so the results do not yet rank the languages against each other.

5 min read, published May 30, 2026

A controlled, apples-to-apples benchmark of agent runtimes (the orchestration layer that drives an LLM through a write → execute → self-correct loop) across C++, Python, TypeScript, and Rust.

When people compare "coding agents" they almost always compare the model (pass@1 on HumanEval, SWE-bench, etc.). But in production the model runs behind a runtime: the code that fans out hundreds of agents, streams tokens, spawns test processes, retries on failure, and tracks state. That runtime, not the model, decides:

- Memory footprint when you run 100+ agents at once
- Concurrency ceiling and tail behavior under load
- Overhead added on top of model latency

These costs dominate the bill once agents move to scale, yet there is no controlled cross-language comparison of agent runtimes. Published numbers aren't comparable: different hardware, different model, different framework. This project fixes the variables (same tasks, same model, same hardware, same loop logic) and changes only the language runtime, so the runtime's cost is isolated and measurable.

HumanEval (first 100 problems). For each task the runtime runs a real agentic loop, not one-shot codegen:

build prompt (spec + pytest)
  → LLM completion (streamed)
  → extract the Python code block
  → write solution.py into an isolated workspace
  → run `python3 -B -m pytest`
  → pass?  → Done
     fail? → feed the pytest error back into the prompt, retry (max 3)
  → still failing → Failed

The agent must write code, run the tests, read the failure, and fix itself, which is what exercises the runtime (concurrency, process spawning, I/O, memory), not just the model.
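The loop above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the project's C++ AgentLoop: the names `run_agent`, `extract_code`, `Outcome`, `complete` and `run_tests` are hypothetical, the prompt wording is invented, and "retry (max 3)" is read as up to three retries after the first attempt. The model call and the pytest run are passed in as callables so the loop itself stays independent of any endpoint.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_RETRIES = 3  # "retry (max 3)", read here as 3 retries after the first attempt

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the sketch readable
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


@dataclass
class Outcome:
    passed: bool
    retries: int


def extract_code(reply: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if there is none."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else None


def run_agent(
    spec: str,
    test: str,
    complete: Callable[[str], str],                # prompt -> model reply
    run_tests: Callable[[str], Tuple[bool, str]],  # code -> (passed, error text)
) -> Outcome:
    base_prompt = f"{spec}\n\nTests:\n{test}"
    prompt = base_prompt
    for retries in range(MAX_RETRIES + 1):
        code = extract_code(complete(prompt))
        if code is None:
            error = "no Python code block found in the reply"
        else:
            passed, error = run_tests(code)
            if passed:
                return Outcome(passed=True, retries=retries)
        # Self-correct: show the model what went wrong and ask again.
        prompt = f"{base_prompt}\n\nPrevious attempt failed:\n{error}"
    return Outcome(passed=False, retries=MAX_RETRIES)
```

In a real runner, `run_tests` would write the code to solution.py in the agent's workspace and run pytest there; injecting it as a callable keeps the retry logic easy to exercise with stubs.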

| Component | What it does |
| --- | --- |
| ThreadPool | 100 std::jthread workers, per-worker work-stealing deques |
| LLMClient / AsyncLLMClient | libcurl + SSE streaming to any OpenAI-compatible endpoint (sync, and curl_multi async) |
| ToolDispatcher | atomic write_file; bash via fork/exec with separate stdout/stderr, timeout (process-group SIGKILL) and per-call workspace; plus read_file / list_dir / search |
| AgentLoop | the write → pytest → retry loop, one isolated workspace per agent |
| Telemetry | background RSS sampler (peak), per-task metrics, CSV + summary JSON with p50/p95/p99 |
| Dataset + runner | loads dataset/humaneval_100.json, fans the tasks across the pool, writes the report |

No heap-heavy framework: just the standard library, libcurl, nlohmann/json and spdlog. Every component is covered by tests built with -UNDEBUG so assertions stay live even in Release.
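To make the ToolDispatcher row concrete, here is a rough Python analogue of two of its behaviours: an atomic file write, and a command run with separate stdout/stderr that is killed as a whole process group on timeout. The project does this in C++ with fork/exec; this sketch swaps in Python's `subprocess` on a POSIX system, and the function names `write_file_atomic` and `run_command` are hypothetical.

```python
import os
import signal
import subprocess
import tempfile
from pathlib import Path


def write_file_atomic(path: Path, content: str) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        os.unlink(tmp)
        raise


def run_command(argv, cwd: Path, timeout_s: float):
    """Run argv in cwd; return (exit_code, stdout, stderr, timed_out)."""
    proc = subprocess.Popen(
        argv,
        cwd=cwd,
        stdout=subprocess.PIPE,   # stdout and stderr are captured separately
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,   # child leads its own process group (POSIX only)
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        # Kill the whole group so grandchildren spawned by the test die too.
        os.killpg(proc.pid, signal.SIGKILL)
        out, err = proc.communicate()
        return proc.returncode, out, err, True
```

Killing the process group rather than the single child matters for an agent harness, because generated code can spawn its own subprocesses that would otherwise outlive the timeout.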

100 HumanEval tasks, qwen2.5-coder:7b, 100-way concurrency, single GPU:

| Metric | Value |
| --- | --- |
| Peak RSS (100 concurrent agents) | ~93 MiB |
| pass@1 (with up to 3 self-review retries) | 96 % (96/100) |
| first-attempt pass | 87/100 |
| recovered via self-review | 6 |
| failed after 3 retries | 4 |
| avg retries | 0.27 |
| wall time (100 tasks) | 126 s |

How to read these honestly:

- Peak RSS is the runtime number. ~93 MiB for 100 concurrent agents is the headline for the C++ stack, and the metric that will actually differ between languages.
- pass@1 and retries are model properties, not runtime properties, so they are expected to match across stacks. They're here to prove the harness runs a real agentic loop (the self-review recovered 6 tasks), not to compare runtimes.
- Per-task latency is intentionally omitted from the headline. At 100-way concurrency against one GPU, per-task time is dominated by server-side queueing, not the runtime. Throughput (wall time) is likewise model-bound here. For comparable latency, run with bounded concurrency (BENCH_CONCURRENCY).
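As a rough illustration of what bounded concurrency means for the latency numbers, the sketch below fans tasks across a fixed-size pool and times each task from the moment a worker picks it up. It is not the project's runner: `concurrency_from_env` and `run_all` are hypothetical names, and treating BENCH_CONCURRENCY as a plain integer with a default of 100 is an assumption based on the 100-way run described above.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def concurrency_from_env(default: int = 100) -> int:
    """Read BENCH_CONCURRENCY (assumed to be an integer), never below 1."""
    return max(1, int(os.environ.get("BENCH_CONCURRENCY", default)))


def run_all(tasks, run_task, concurrency: int):
    """Run every task with at most `concurrency` in flight.

    Returns (results, seconds_per_task), both in the original task order.
    """
    def timed(task):
        start = time.perf_counter()  # clock starts when a worker picks the task up
        result = run_task(task)
        return result, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        pairs = list(pool.map(timed, tasks))
    return [r for r, _ in pairs], [s for _, s in pairs]
```

With a small pool, fewer requests sit in the model server's queue at once, so the per-task timings reflect the request and the runtime more than the backlog.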

The agent loop verifies generated code with python3 -B -m pytest, so the python3 first on your PATH must be able to import pytest.

python3 -m pip install -r runner/requirements.txt

runner/run_bench.sh --check

cmake -B build -G Ninja && cmake --build build

LLM_BASE_URL="https://your-host/v1" LLM_MODEL="qwen2.5-coder:7b" runner/run_bench.sh

run_bench.sh picks an interpreter that already has pytest and puts it first on PATH; it does not create a virtualenv. Results land in results/raw/ as a timestamped cpp_<ts>.csv (one row per task) and cpp_<ts>_summary.json. Override LLM_BASE_URL / LLM_MODEL / BENCH_DATASET / BENCH_CONCURRENCY via env.
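The summary JSON is described as carrying p50/p95/p99 values. The source does not say which percentile definition the Telemetry component uses, so the following is only a sketch under the assumption of the nearest-rank method; `percentile` and `summarize` are hypothetical names, and the input is any list of per-task measurements.

```python
def percentile(sorted_values, p: int):
    """Nearest-rank percentile of an ascending, non-empty list; p is an integer 1..100."""
    n = len(sorted_values)
    rank = (p * n + 99) // 100  # ceil(p * n / 100) using integer arithmetic
    return sorted_values[max(rank, 1) - 1]


def summarize(values):
    """Return the three tail statistics named in the Telemetry row."""
    ordered = sorted(values)
    return {f"p{p}": percentile(ordered, p) for p in (50, 95, 99)}
```

Nearest-rank always returns a value that was actually observed, which makes the tail numbers easy to trace back to a specific row in the per-task CSV.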

dataset/humaneval_100.json holds the 100 tasks the runner feeds to the AgentLoop. It is an array of { "id", "spec", "test" } objects matching the runner's struct Task:

- id: the HumanEval task id (e.g. HumanEval/0).
- spec: the HumanEval prompt (function signature + docstring) given to the model.
- test: a pytest module that imports the entry point from solution and runs HumanEval's check() via test_humaneval.
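A minimal sketch of reading that schema, for anyone writing one of the pending runners. The `Task` dataclass mirrors the three documented fields; `parse_tasks` is a hypothetical helper, and the sample record is a shortened illustration of the shape, not a copy of the real dataset entry.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_KEYS = {"id", "spec", "test"}


@dataclass(frozen=True)
class Task:
    id: str    # e.g. "HumanEval/0"
    spec: str  # function signature + docstring shown to the model
    test: str  # pytest module that imports from `solution`


def parse_tasks(text: str) -> List[Task]:
    """Parse the dataset JSON and fail loudly on a malformed record."""
    raw = json.loads(text)
    if not isinstance(raw, list):
        raise ValueError("dataset must be a JSON array")
    tasks = []
    for index, item in enumerate(raw):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"record {index} is missing {sorted(missing)}")
        tasks.append(Task(item["id"], item["spec"], item["test"]))
    return tasks


# Illustrative record only; the real spec and test text are longer.
SAMPLE = json.dumps([{
    "id": "HumanEval/0",
    "spec": "def has_close_elements(numbers, threshold):\n    ...\n",
    "test": (
        "from solution import has_close_elements\n\n"
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.0], 0.5) == False\n\n"
        "def test_humaneval():\n"
        "    check(has_close_elements)\n"
    ),
}])

tasks = parse_tasks(SAMPLE)
```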

Source. OpenAI HumanEval (164 problems), pinned to commit 463c980b of openai/human-eval and verified against a recorded SHA-256.
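Checking a download against a recorded SHA-256 takes only a few lines. The sketch below shows the general idea; which file the project hashes and where the expected digest is stored are not stated in the source, so `sha256_hex` and `verify` are hypothetical helpers that work on any bytes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> None:
    """Raise if the data does not match the recorded digest."""
    actual = sha256_hex(data)
    if actual != expected_hex.lower():
        raise ValueError(f"SHA-256 mismatch: expected {expected_hex}, got {actual}")
```

Pinning both the commit and the digest means a silently changed upstream file fails the build step instead of changing the benchmark inputs.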

Selection criterion. The first 100 problems by ascending numeric task index (HumanEval/0 … HumanEval/99). Fully deterministic: no sampling or randomness.
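The word "numeric" matters here, because a plain string sort would place HumanEval/10 before HumanEval/2. A small sketch of the selection rule, with hypothetical helper names `task_index` and `first_n`:

```python
def task_index(task_id: str) -> int:
    """Numeric part of an id such as 'HumanEval/42'."""
    return int(task_id.rsplit("/", 1)[1])


def first_n(task_ids, n: int = 100):
    """The n lowest-numbered task ids, in ascending numeric order."""
    return sorted(task_ids, key=task_index)[:n]


all_ids = [f"HumanEval/{i}" for i in range(164)]
selected = first_n(reversed(all_ids))

# A lexicographic sort picks a different set: it starts 0, 1, 10, 100, ...
lexicographic = sorted(all_ids)[:100]
```

Here `selected` runs from HumanEval/0 to HumanEval/99 regardless of input order, while `lexicographic` includes ids such as HumanEval/100 that the documented criterion excludes.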

Regenerate (and validate that every canonical solution passes its test under pytest):

python3 scripts/build_humaneval_dataset.py            # build the JSON
python3 scripts/build_humaneval_dataset.py --validate # build + verify 100/100

- C++ stack: done. Full runtime + telemetry + 100-task run.
- Python / TypeScript / Rust: pending. Each needs a minimal runner with the same contract (thread pool / async, OpenAI client, write+bash tools, the loop, telemetry).

A cross-language comparison is only valid when every stack runs the same tasks, model, hardware and concurrency, which is the whole point, and why external numbers can't simply be cited for the runtime metrics.

The deliverable is the controlled measurement (peak RSS + runtime overhead at fixed concurrency), not the model's pass-rate; that part is already well documented elsewhere.
