{"slug": "the-future-will-be-millions-agents-running-task-everyday", "title": "The future will be millions agents running task everyday?", "summary": "A new benchmark comparing agent runtime performance across C++, Python, TypeScript, and Rust found that C++ achieved a peak memory footprint of approximately 93 MiB while running 100 concurrent coding agents on HumanEval tasks. The controlled test isolated runtime overhead from model performance by using identical tasks, models, hardware, and loop logic across all four language stacks. The results demonstrate that runtime selection significantly impacts memory consumption and concurrency costs at scale, with C++ showing the lowest resource usage for production agent deployments.", "body_md": "A controlled, apples-to-apples benchmark of **agent runtimes** — the orchestration\nlayer that drives an LLM through a write → execute → self-correct loop — across\nC++, Python, TypeScript, and Rust.\n\nWhen people compare \"coding agents\" they almost always compare the *model*\n(pass@1 on HumanEval, SWE-bench, etc.). But in production the model runs behind a\n**runtime**: the code that fans out hundreds of agents, streams tokens, spawns\ntest processes, retries on failure, and tracks state. That runtime — not the\nmodel — decides:\n\n**Memory footprint** when you run 100+ agents at once,**Concurrency ceiling** and tail behavior under load,**Overhead** added on top of model latency.\n\nThese costs dominate the bill once agents move to scale, yet there is **no\ncontrolled cross-language comparison** of agent runtimes. Published numbers aren't\ncomparable: different hardware, different model, different framework. This project\nfixes the variables — *same tasks, same model, same hardware, same loop logic* —\nand changes only the language runtime, so the runtime's cost is isolated and\nmeasurable.\n\n[HumanEval](https://github.com/openai/human-eval) (first 100 problems). For each\ntask the runtime runs a real agentic loop, not one-shot codegen:\n\n```\nbuild prompt (spec + pytest)\n  → LLM completion (streamed)\n  → extract the Python code block\n  → write solution.py into an isolated workspace\n  → run `python3 -B -m pytest`\n  → pass?  → Done\n     fail? → feed the pytest error back into the prompt, retry (max 3)\n  → still failing → Failed\n```\n\nThe agent must *write code, run the tests, read the failure, and fix itself* —\nwhich is what exercises the runtime (concurrency, process spawning, I/O, memory),\nnot just the model.\n\n| Component | What it does |\n|---|---|\n`ThreadPool` |\n100 `std::jthread` workers, per-worker work-stealing deques |\n`LLMClient` / `AsyncLLMClient` |\nlibcurl + SSE streaming to any OpenAI-compatible endpoint (sync, and curl_multi async) |\n`ToolDispatcher` |\natomic `write_file` ; `bash` via fork/exec with separate stdout/stderr, timeout (process-group SIGKILL) and per-call workspace; plus `read_file` / `list_dir` / `search` |\n`AgentLoop` |\nthe write → pytest → retry loop, one isolated workspace per agent |\n`Telemetry` |\nbackground RSS sampler (peak), per-task metrics, CSV + summary JSON with p50/p95/p99 |\n| Dataset loader + runner | loads `dataset/humaneval_100.json` , fans the tasks across the pool, writes the report |\n\nNo heap-heavy framework: just the standard library, libcurl, nlohmann/json and\nspdlog. Every component is covered by tests built with `-UNDEBUG`\n\nso assertions\nstay live even in Release.\n\n100 HumanEval tasks, `qwen2.5-coder:7b`\n\n, 100-way concurrency, single GPU:\n\n| Metric | Value |\n|---|---|\nPeak RSS (100 concurrent agents) |\n~93 MiB |\n| pass@1 (with up to 3 self-review retries) | 96 % (96/100) |\n| first-attempt pass | 87/100 |\n| recovered via self-review | 6 |\n| failed after 3 retries | 4 |\n| avg retries | 0.27 |\n| wall time (100 tasks) | 126 s |\n\n**How to read these honestly:**\n\n**Peak RSS is the runtime number.**~93 MiB for 100 concurrent agents is the headline for the C++ stack — and the metric that will actually differ between languages.**pass@1 and retries are model properties**, not runtime properties — they will be identical across stacks. They're here to prove the harness runs a*real*agentic loop (the self-review recovered 6 tasks), not to compare runtimes.**Per-task latency is intentionally omitted from the headline.** At 100-way concurrency against one GPU, per-task time is dominated by server-side queueing, not the runtime. Throughput (`wall time`\n\n) is likewise model-bound here. For comparable latency, run with bounded concurrency (`BENCH_CONCURRENCY`\n\n).\n\nThe agent loop verifies generated code with `python3 -B -m pytest`\n\n, so the\n`python3`\n\nfirst on your `PATH`\n\nmust be able to import `pytest`\n\n.\n\n```\n# 1. Install pytest into whichever python3 you use (no virtualenv required)\npython3 -m pip install -r runner/requirements.txt\n\n# 2. Verify the environment is ready\nrunner/run_bench.sh --check\n\n# 3. Build\ncmake -B build -G Ninja && cmake --build build\n\n# 4. Run against an OpenAI-compatible endpoint\nLLM_BASE_URL=\"https://your-host/v1\" LLM_MODEL=\"qwen2.5-coder:7b\" runner/run_bench.sh\n```\n\n`run_bench.sh`\n\npicks an interpreter that already has `pytest`\n\nand puts it first on\n`PATH`\n\n— it does not create a virtualenv. Results land in `results/raw/`\n\nas a\ntimestamped `cpp_<ts>.csv`\n\n(one row per task) and `cpp_<ts>_summary.json`\n\n.\nOverride `LLM_BASE_URL`\n\n/ `LLM_MODEL`\n\n/ `BENCH_DATASET`\n\n/ `BENCH_CONCURRENCY`\n\nvia\nenv.\n\n`dataset/humaneval_100.json`\n\nholds the 100 tasks the runner feeds to the\n`AgentLoop`\n\n. It is an array of `{ \"id\", \"spec\", \"test\" }`\n\nobjects matching the\nrunner's `struct Task`\n\n:\n\n`id`\n\n— the HumanEval task id (e.g.`HumanEval/0`\n\n).`spec`\n\n— the HumanEval prompt (function signature + docstring) given to the model.`test`\n\n— a pytest module that imports the entry point from`solution`\n\nand runs HumanEval's`check()`\n\nvia`test_humaneval`\n\n.\n\n**Source.** OpenAI HumanEval (164 problems), pinned to commit `463c980b`\n\nof\n[ openai/human-eval](https://github.com/openai/human-eval) and verified against a\nrecorded SHA-256.\n\n**Selection criterion.** The first 100 problems by ascending numeric task index\n(`HumanEval/0`\n\n… `HumanEval/99`\n\n). Fully deterministic — no sampling or randomness.\n\nRegenerate (and validate that every canonical solution passes its test under pytest):\n\n```\npython3 scripts/build_humaneval_dataset.py            # build the JSON\npython3 scripts/build_humaneval_dataset.py --validate # build + verify 100/100\n```\n\n**C++ stack — done.** Full runtime + telemetry + 100-task run.**Python / TypeScript / Rust — pending.** Each needs a minimal runner with the same contract (thread pool / async, OpenAI client, write+bash tools, the loop, telemetry). A cross-language comparison is only valid when every stack runs the**same tasks, model, hardware and concurrency**— which is the whole point, and why external numbers can't simply be cited for the runtime metrics.\n\nThe deliverable is the **controlled measurement** (peak RSS + runtime overhead at\nfixed concurrency), not the model's pass-rate — that part is already well\ndocumented elsewhere.", "url": "https://wpnews.pro/news/the-future-will-be-millions-agents-running-task-everyday", "canonical_source": "https://github.com/wilmanrojas/sinqua", "published_at": "2026-05-30 17:11:56+00:00", "updated_at": "2026-05-30 17:46:38.319171+00:00", "lang": "en", "topics": ["ai-agents", "ai-infrastructure", "ai-tools", "ai-research", "large-language-models"], "entities": ["HumanEval", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/the-future-will-be-millions-agents-running-task-everyday", "markdown": "https://wpnews.pro/news/the-future-will-be-millions-agents-running-task-everyday.md", "text": "https://wpnews.pro/news/the-future-will-be-millions-agents-running-task-everyday.txt", "jsonld": "https://wpnews.pro/news/the-future-will-be-millions-agents-running-task-everyday.jsonld"}}