A controlled, apples-to-apples benchmark of agent runtimes (the orchestration layer that drives an LLM through a write → execute → self-correct loop) across C++, Python, TypeScript, and Rust.
When people compare "coding agents" they almost always compare the model (pass@1 on HumanEval, SWE-bench, etc.). But in production the model runs behind a runtime: the code that fans out hundreds of agents, streams tokens, spawns test processes, retries on failure, and tracks state. That runtime, not the model, decides:
- Memory footprint when you run 100+ agents at once
- Concurrency ceiling and tail behavior under load
- Overhead added on top of model latency
These costs dominate the bill once agents move to scale, yet there is no controlled cross-language comparison of agent runtimes. Published numbers aren't comparable: different hardware, different model, different framework. This project fixes the variables (same tasks, same model, same hardware, same loop logic) and changes only the language runtime, so the runtime's cost is isolated and measurable.
HumanEval (first 100 problems). For each task the runtime runs a real agentic loop, not one-shot codegen:
build prompt (spec + pytest)
→ LLM completion (streamed)
→ extract the Python code block
→ write solution.py into an isolated workspace
→ run `python3 -B -m pytest`
→ pass? → Done
fail? → feed the pytest error back into the prompt, retry (max 3)
→ still failing → Failed
The agent must write code, run the tests, read the failure, and fix itself, which is what exercises the runtime (concurrency, process spawning, I/O, memory), not just the model.
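
The loop above can be sketched in a few lines of Python. This is a minimal illustration of the control flow, not the C++ `AgentLoop` itself: the function names, the `test_solution.py` file name, the prompt wording, and the reading of "max 3" as three retries after the first attempt are all assumptions. The LLM call and the test runner are passed in as plain callables so the loop can be exercised without a model.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Tuple

FENCE = "`" * 3  # triple backtick, built this way so the sketch nests in Markdown
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the first fenced Python block, or the whole reply if there is none."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else reply


def run_pytest(workspace: Path, timeout_s: float = 60.0) -> Tuple[bool, str]:
    """Run pytest inside the workspace; return (passed, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-B", "-m", "pytest"],  # same flags as the loop above
            cwd=workspace, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "pytest timed out"
    return proc.returncode == 0, proc.stdout + proc.stderr


def agent_loop(
    spec: str,
    test_source: str,
    complete: Callable[[str], str],  # hypothetical LLM call: prompt -> reply
    run_tests: Callable[[Path], Tuple[bool, str]] = run_pytest,
    max_retries: int = 3,  # assumption: 1 first attempt + up to 3 retries
) -> Tuple[bool, int]:
    """Write, test, and self-correct. Returns (passed, retries_used)."""
    base_prompt = f"{spec}\n\nTests:\n{test_source}"
    prompt = base_prompt
    with tempfile.TemporaryDirectory() as tmp:  # one isolated workspace per agent
        workspace = Path(tmp)
        (workspace / "test_solution.py").write_text(test_source)
        for attempt in range(max_retries + 1):
            code = extract_code(complete(prompt))
            (workspace / "solution.py").write_text(code)
            passed, output = run_tests(workspace)
            if passed:
                return True, attempt
            # Feed the failure back so the next completion can correct it.
            prompt = f"{base_prompt}\n\nPrevious attempt failed:\n{output}"
    return False, max_retries
```

Injecting `complete` and `run_tests` keeps the loop logic identical no matter which client or process-spawning strategy a given stack uses, which is the property the benchmark depends on.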
| Component | What it does |
|---|---|
| `ThreadPool` | 100 `std::jthread` workers, per-worker work-stealing deques |
| `LLMClient` / `AsyncLLMClient` | libcurl + SSE streaming to any OpenAI-compatible endpoint (sync, and `curl_multi` async) |
| `ToolDispatcher` | atomic `write_file`; `bash` via fork/exec with separate stdout/stderr, timeout (process-group SIGKILL) and per-call workspace; plus `read_file` / `list_dir` / `search` |
| `AgentLoop` | the write → pytest → retry loop, one isolated workspace per agent |
| `Telemetry` | background RSS sampler (peak), per-task metrics, CSV + summary JSON with p50/p95/p99 |
| Dataset + runner | loads `dataset/humaneval_100.json`, fans the tasks across the pool, writes the report |
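
The two `ToolDispatcher` behaviors that matter most for correctness under load are the atomic write and the timeout that kills the whole process group. The Python sketch below shows one common way to get both on POSIX systems. It is an illustration under assumptions, not a port of the C++ code: `write_file_atomic`, `run_bash`, and `BashResult` are hypothetical names, and the temp-file-plus-rename approach is a standard technique that the source does not confirm as the actual implementation.

```python
import os
import signal
import subprocess
import tempfile
from pathlib import Path
from typing import NamedTuple


class BashResult(NamedTuple):
    returncode: int
    stdout: str
    stderr: str
    timed_out: bool


def write_file_atomic(path: Path, content: str) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    Readers see either the old file or the complete new one, never a partial write.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic rename within one filesystem
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def run_bash(command: str, workspace: Path, timeout_s: float) -> BashResult:
    """Run a command in its own session; on timeout, SIGKILL the whole group."""
    proc = subprocess.Popen(
        ["bash", "-c", command],
        cwd=workspace,
        stdout=subprocess.PIPE,  # stdout and stderr are captured separately
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child becomes a group leader: pgid == pid
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return BashResult(proc.returncode, out, err, False)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # also kills grandchildren
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        out, err = proc.communicate()
        return BashResult(proc.returncode, out, err, True)
```

Killing the group rather than the single child matters because a test run can spawn its own subprocesses, and a leaked grandchild would otherwise keep consuming memory and distort the peak-RSS measurement.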
No heap-heavy framework: just the standard library, libcurl, nlohmann/json and spdlog. Every component is covered by tests built with `-UNDEBUG` so assertions stay live even in Release.
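
The p50/p95/p99 figures in the summary JSON can be computed in several ways. The sketch below uses the nearest-rank definition, which is an assumption: the source does not say which percentile method the `Telemetry` component uses, and the `summarize` function and its output keys are hypothetical.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s: List[float]) -> Dict[str, float]:
    """Reduce per-task latencies to the three tail figures."""
    ordered = sorted(latencies_s)
    return {
        "count": len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }
```

With only 100 tasks per run, p99 under nearest-rank is simply the 99th slowest sample, so a single outlier moves it. That is worth keeping in mind when comparing tails across stacks.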
100 HumanEval tasks, `qwen2.5-coder:7b`, 100-way concurrency, single GPU:
| Metric | Value |
|---|---|
| Peak RSS (100 concurrent agents) | ~93 MiB |
| pass@1 (with up to 3 self-review retries) | 96 % (96/100) |
| first-attempt pass | 87/100 |
| recovered via self-review | 6 |
| failed after 3 retries | 4 |
| avg retries | 0.27 |
| wall time (100 tasks) | 126 s |
How to read these honestly:
- Peak RSS is the runtime number. ~93 MiB for 100 concurrent agents is the headline for the C++ stack, and the metric that will actually differ between languages.
- pass@1 and retries are model properties, not runtime properties; they will be identical across stacks. They're here to prove the harness runs a real agentic loop (the self-review recovered 6 tasks), not to compare runtimes.
- Per-task latency is intentionally omitted from the headline. At 100-way concurrency against one GPU, per-task time is dominated by server-side queueing, not the runtime. Throughput (wall time) is likewise model-bound here. For comparable latency, run with bounded concurrency (`BENCH_CONCURRENCY`).
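
Since peak RSS is the headline figure, it helps to see how a background sampler can capture it. The sketch below is a hedged Python illustration of the idea, not the C++ `Telemetry` code: `PeakSampler` and `read_rss_bytes_linux` are hypothetical names, the 50 ms interval is an arbitrary assumption, and the default reader works only on Linux because it parses `/proc/self/statm`.

```python
import os
import threading
from typing import Callable


def read_rss_bytes_linux() -> int:
    """Current resident set size of this process (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


class PeakSampler:
    """Poll an RSS reader on a background thread and remember the maximum."""

    def __init__(
        self,
        read_rss: Callable[[], int] = read_rss_bytes_linux,
        interval_s: float = 0.05,  # assumption: sampling period is a free choice
    ) -> None:
        self._read_rss = read_rss
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_bytes = 0

    def _sample(self) -> None:
        self.peak_bytes = max(self.peak_bytes, self._read_rss())

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop is requested.
        while not self._stop.wait(self._interval_s):
            self._sample()

    def __enter__(self) -> "PeakSampler":
        self._sample()  # baseline before the workload starts
        self._thread.start()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stop.set()
        self._thread.join()
        self._sample()  # final reading after the workload ends
```

A sampler can miss a spike that rises and falls between two readings, so the reported peak is a lower bound on the true peak. A shorter interval narrows that gap at the cost of more sampling overhead.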
The agent loop verifies generated code with `python3 -B -m pytest`, so the `python3` first on your `PATH` must be able to import `pytest`.
python3 -m pip install -r runner/requirements.txt
runner/run_bench.sh --check
cmake -B build -G Ninja && cmake --build build
LLM_BASE_URL="https://your-host/v1" LLM_MODEL="qwen2.5-coder:7b" runner/run_bench.sh
`run_bench.sh` picks an interpreter that already has `pytest` and puts it first on `PATH`; it does not create a virtualenv. Results land in `results/raw/` as a timestamped `cpp_<ts>.csv` (one row per task) and `cpp_<ts>_summary.json`.
Override `LLM_BASE_URL` / `LLM_MODEL` / `BENCH_DATASET` / `BENCH_CONCURRENCY` via env.
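
As an illustration of how a runner might honor `BENCH_CONCURRENCY`, here is a minimal Python sketch. The fallback of 100 mirrors the published run but is an assumption about the default, and `concurrency_from_env` and `run_bounded` are hypothetical helpers rather than part of the existing runner.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

DEFAULT_CONCURRENCY = 100  # assumption: matches the 100-way run reported above


def concurrency_from_env(default: int = DEFAULT_CONCURRENCY) -> int:
    """Read BENCH_CONCURRENCY; fall back to the default if unset or invalid."""
    raw = os.environ.get("BENCH_CONCURRENCY", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def run_bounded(
    tasks: Iterable[T], worker: Callable[[T], R], concurrency: int
) -> List[R]:
    """Run worker over tasks with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map preserves input order, so results line up with the task list.
        return list(pool.map(worker, tasks))
```

Setting `BENCH_CONCURRENCY` well below the server's capacity keeps requests from queueing on the GPU, which is what makes per-task latency attributable to the runtime.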
`dataset/humaneval_100.json` holds the 100 tasks the runner feeds to the `AgentLoop`. It is an array of `{ "id", "spec", "test" }` objects matching the runner's `struct Task`:
- `id`: the HumanEval task id (e.g. `HumanEval/0`).
- `spec`: the HumanEval prompt (function signature + docstring) given to the model.
- `test`: a pytest module that imports the entry point from `solution` and runs HumanEval's `check()` via `test_humaneval`.
Source. OpenAI HumanEval (164 problems), pinned to commit `463c980b` of openai/human-eval and verified against a recorded SHA-256.
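
Checking a download against a recorded SHA-256 takes only the standard library. The sketch below is a generic illustration, not the code in `scripts/build_humaneval_dataset.py`: `sha256_of_file` and `verify_sha256` are hypothetical names, and the expected digest is a parameter because the recorded value is not reproduced here.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large inputs are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: str, expected_hex: str) -> None:
    """Raise if the file's digest differs from the recorded one."""
    actual = sha256_of_file(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
```

Pinning both the commit and the digest guards against two different failure modes: the commit fixes which revision is fetched, and the digest catches a corrupted or substituted file.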
Selection criterion. The first 100 problems by ascending numeric task index (`HumanEval/0` … `HumanEval/99`). Fully deterministic: no sampling or randomness.
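
The word "numeric" matters here: sorting the ids as strings would place `HumanEval/10` before `HumanEval/2`. The sketch below shows one way to apply the criterion. It assumes each upstream record carries its id under a `task_id` key, and `task_index` and `select_first` are hypothetical helpers, not the build script's actual functions.

```python
from typing import Dict, List


def task_index(task_id: str) -> int:
    """Extract the numeric index from an id such as 'HumanEval/42'."""
    prefix, _, number = task_id.partition("/")
    if prefix != "HumanEval" or not number.isdigit():
        raise ValueError(f"unexpected task id: {task_id!r}")
    return int(number)


def select_first(problems: List[Dict[str, str]], count: int = 100) -> List[Dict[str, str]]:
    """Return the `count` problems with the lowest numeric index, in order."""
    ordered = sorted(problems, key=lambda p: task_index(p["task_id"]))
    return ordered[:count]
```

Because the selection depends only on the ids, rebuilding the dataset from the same pinned source always yields the same 100 tasks in the same order.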
Regenerate (and validate that every canonical solution passes its test under pytest):
python3 scripts/build_humaneval_dataset.py # build the JSON
python3 scripts/build_humaneval_dataset.py --validate # build + verify 100/100
- C++ stack: done. Full runtime + telemetry + 100-task run.
- Python / TypeScript / Rust: pending. Each needs a minimal runner with the same contract (thread pool / async, OpenAI client, write+bash tools, the loop, telemetry). A cross-language comparison is only valid when every stack runs the same tasks, model, hardware and concurrency, which is the whole point, and why external numbers can't simply be cited for the runtime metrics.
The deliverable is the controlled measurement (peak RSS + runtime overhead at fixed concurrency), not the model's pass-rate; that part is already well documented elsewhere.