
Show HN: Multi Agent Protocol for AI Scientist by Hexo Labs

Hexo Labs released Socrates, an open-source multi-agent protocol that pairs a tool-using research agent with a question-only advisor, improving MLE-bench Kaggle competition scores by an average of 55.9% over the agent running alone.

Published Jun 25, 2026 · 6 min read

Pair a tool-using research agent with a question-only advisor that can never give answers, never issue directives, and has no tools of its own. The advisor must approve every plan via [APPROVED] before the Scientist runs the next experiment. On five MLE-bench Kaggle competitions this lifts test scores by an average of +55.9% over the same agent running alone.

Left: Socrates asks questions only and is stateful across sessions; the Scientist is stateless, executes code, and reads/writes the shared environment. Right: Statoil example. Socrates asks whether incremental tuning is closing the gap, and the Scientist pivots to domain features (+10.2%); the Baseline PI offers generic encouragement and the Scientist stays on pixel statistics (+1.6%).
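The gate described above can be sketched as a small loop. This is an illustrative sketch, not the code in approval_loop.py: the function name `run_approval_loop` and the `advisor` and `revise` callables are hypothetical stand-ins for the two LLM calls, and the forced approval after `max_rounds` is modelled on the agent.K flag described in the configuration reference.

```python
from typing import Callable, List, Tuple

APPROVED = "[APPROVED]"  # approval token named in the README


def run_approval_loop(
    plan: str,
    advisor: Callable[[str, List[str]], str],  # hypothetical: returns a question or [APPROVED]
    revise: Callable[[str, str], str],         # hypothetical: Scientist revises the plan
    max_rounds: int = 3,                       # assumed to play the role of agent.K
) -> Tuple[str, int, bool]:
    """Discuss `plan` until the advisor approves or the round budget runs out.

    Returns (final_plan, rounds_used, forced), where `forced` is True when
    no explicit [APPROVED] was seen within `max_rounds`.
    """
    history: List[str] = []  # the advisor is stateful: it sees its earlier questions
    for round_no in range(1, max_rounds + 1):
        reply = advisor(plan, history)
        if APPROVED in reply:
            return plan, round_no, False
        history.append(reply)
        plan = revise(plan, reply)  # the Scientist responds by revising the plan
    return plan, max_rounds, True
```

With stub callables, an advisor that keeps asking questions causes the loop to return after `max_rounds` with `forced=True`, which is the only path where an experiment runs without an explicit approval.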

Note: The asciinema badge above is a placeholder. To record your own, run bash scripts/record_demo.sh, then asciinema upload, and paste the returned cast ID into this README in place of YOUR_CAST_ID (two occurrences).

Quick start · Repository layout · The two scaffolds · The three conditions · Reproducing the paper results · Configuration reference · Running tests · Citation · License

Tested on Python 3.10–3.12, Linux/macOS. GPU is optional: it is only required for tasks that train deep models (Statoil and NFL benefit; the others run fine on CPU).

# 1. Clone the repository
git clone https://github.com/hexo-ai/socrates.git
cd socrates

# 2. Create an environment (pick one: conda or venv)
conda create -n socrates python=3.11 -y
conda activate socrates
# or, with venv instead of conda:
python -m venv .venv && source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt
pip install --no-deps -r socratic-evolve/public-repo/requirements_base.txt
pip install --no-deps -r socratic-evolve/public-repo/requirements_ml.txt
pip install --no-deps -r socratic-evolve/public-repo/requirements_domain.txt

# 4. Set API keys
export ANTHROPIC_API_KEY="sk-ant-..."        # required
export OPENAI_API_KEY="sk-..."               # optional, only if you use OpenAI models

# 5. Copy the example configs
cp socratic-evolve/test_config.yaml.example socratic-evolve/test_config.yaml
cp discover/test_config.yaml.example          discover/test_config.yaml

# 6. Run the local smoke test
python discover/test_agent_locally.py

If step 6 prints a Socrates question and an [APPROVED] from a short discussion loop, the install is good.
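Before running the smoke test it can help to confirm the API keys are actually exported. The helper below is a minimal sketch and not part of the repository; `missing_keys` is a hypothetical name, and the only facts taken from the README are that ANTHROPIC_API_KEY is required and OPENAI_API_KEY is optional.

```python
import os
from typing import List, Mapping, Sequence

REQUIRED_KEYS = ("ANTHROPIC_API_KEY",)  # required per the quick start
OPTIONAL_KEYS = ("OPENAI_API_KEY",)     # only needed for OpenAI models


def missing_keys(env: Mapping[str, str], names: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the names in `names` that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


# Example use against the real environment:
# problems = missing_keys(os.environ)
# if problems:
#     print("Set these before running the smoke test:", ", ".join(problems))
```

Passing the environment in as a mapping keeps the check easy to test with a plain dict instead of the process environment.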

socrates/
├── discover/                 # Sequential scaffold (single agent, one experiment at a time)
│   ├── custom_agent.py       # Agent implementation
│   ├── base_agent.py         # Base class with webhook protocol
│   ├── models.py             # Message models
│   └── test_agent_locally.py # Local smoke test
│
├── socratic-evolve/          # Evolutionary scaffold (MLevolve + MCGS tree search)
│   ├── custom_agent.py       # Agent wrapper
│   ├── base_agent.py         # Base class
│   ├── models.py             # Message models
│   └── public-repo/          # MLevolve core
│       ├── run.py            # Main entry point for full experiments
│       ├── config/           # Default configuration
│       ├── engine/           # MCGS tree search, code execution
│       ├── agents/           # Multi-agent subsystem
│       │   ├── socrates/     # Socrates PI implementation
│       │   │   ├── prompts.py        # Question-only system prompt + [APPROVED] gate
│       │   │   ├── approval_loop.py  # Multi-round discussion loop
│       │   │   └── config.py         # Toggle flags
│       │   ├── evolution_agent.py    # Paradigm-shift mutations
│       │   └── fusion_agent.py       # Cross-branch solution merging
│       └── llm/              # LLM client wrappers
│
├── assets/
│   └── protocol.png          # Protocol diagram
├── scripts/
│   └── record_demo.sh        # Records the asciinema demo cast
├── conda.sh                  # Quick env activation helper
├── requirements.txt          # Top-level dependency manifest
├── LICENSE                   # MIT
└── README.md                 # This file

Sequential scaffold (discover/): A single agent writes and executes experiments one at a time, with no built-in exploration mechanism. The Scientist retains tool access during Socratic review, so when Socrates asks "how many features have zero importance?" the Scientist runs the analysis right then. Best when per-step quality matters more than raw experiment volume.

Evolutionary scaffold (socratic-evolve/): An evolutionary code-generation system (MLevolve) that maintains a tree of candidate solutions and runs multiple branches in parallel. It includes evolution stages (paradigm-shift mutations) and fusion stages (cross-branch solution merging). During review, the Scientist can only revise plan text (no tool access). Best when the search space rewards high iteration volume.

All three conditions are controlled via configuration flags (use_socrates_review and use_baseline_pi in config.yaml / config.py):

| Condition | Flags | Behavior |
|---|---|---|
| Scientist-only | use_socrates_review=False, use_baseline_pi=False | Single agent, no supervision. |
| Baseline PI | use_socrates_review=False, use_baseline_pi=True | Second agent giving generic encouragement (control condition). |
| Socrates | use_socrates_review=True | Full protocol: question-only PI, [APPROVED] gate. |
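The mapping from the two flags to a condition can be written down directly. This is a sketch under one stated assumption: the README does not say what happens when both flags are true, so the hypothetical `condition_name` helper below lets use_socrates_review take precedence, matching the Socrates row that lists only that flag.

```python
def condition_name(use_socrates_review: bool = False, use_baseline_pi: bool = False) -> str:
    """Map the two config flags to the experimental condition they select."""
    if use_socrates_review:
        # Assumption: Socrates wins if both flags are set; the README leaves this case open.
        return "Socrates"
    if use_baseline_pi:
        return "Baseline PI"
    return "Scientist-only"
```

Both flags default to False, so an unmodified config corresponds to the Scientist-only condition.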

We evaluate on five tasks from MLE-bench:

| Task | Metric | Notes |
|---|---|---|
| Statoil Iceberg | Log Loss ↓ | Radar imagery |
| Stanford COVID Vaccine | MCRMSE ↓ | RNA degradation |
| Ventilator Pressure | MAE ↓ | Tabular time-series |
| NFL Contact Detection | MCC ↑ | Player tracking + video |
| Smartphone Decimeter | Haversine ↓ | GPS positioning |

Follow the MLE-bench instructions to download the five competition datasets. Place each one under a local directory and remember its path; you'll plug it into the config in the next step.

cd discover/
python test_agent_locally.py

This writes per-experiment folders, a best_score.txt, and a submission.csv in dataset_dir. Submit submission.csv to Kaggle to get the test score.

cd socratic-evolve/public-repo/
python run.py \
  exp_id="statoil-iceberg-classifier-challenge" \
  agent.use_socrates_review=True \
  agent.steps=50

For each task, run it once per condition (toggling the flags above) so you can compare Scientist-only / Baseline PI / Socrates side by side.
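One way to keep the three runs consistent is to generate the command line for each condition from a single table of flags. The sketch below only builds argument lists and does not launch anything. `CONDITIONS` and `build_command` are hypothetical names, and passing use_baseline_pi=False explicitly for the Socrates run is an assumption (the example command above sets only agent.use_socrates_review).

```python
from typing import Dict, List

# Flag settings per condition, taken from the conditions table.
CONDITIONS: Dict[str, Dict[str, bool]] = {
    "Scientist-only": {"use_socrates_review": False, "use_baseline_pi": False},
    "Baseline PI": {"use_socrates_review": False, "use_baseline_pi": True},
    "Socrates": {"use_socrates_review": True, "use_baseline_pi": False},  # second flag assumed
}


def build_command(exp_id: str, flags: Dict[str, bool], steps: int = 50) -> List[str]:
    """Build the run.py argument list for one task and one condition."""
    overrides = [f"agent.{name}={value}" for name, value in flags.items()]
    return ["python", "run.py", f"exp_id={exp_id}", *overrides, f"agent.steps={steps}"]


if __name__ == "__main__":
    for name, flags in CONDITIONS.items():
        print(name, "->", " ".join(build_command("statoil-iceberg-classifier-challenge", flags)))
```

Each list can be handed to `subprocess.run(cmd, cwd="socratic-evolve/public-repo")` once the dataset paths are configured.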

cd socratic-evolve/public-repo/
python collect_and_plot.py   # aggregates per-experiment logs into the paper's tables/plots
python dashboard.py          # optional live dashboard

| Task | Scientist-only (test) | Baseline PI (test) | Socrates (test) | Δ vs Scientist |
|---|---|---|---|---|
| Statoil | 0.255 | 0.251 | 0.229 | +10.5% |
| COVID | 0.389 | 0.308 | 0.294 | +24.4% |
| Ventilator | 1.534 | 0.815 | 0.853 | +44.4% |
| NFL | 0.198 | 0.537 | 0.584 | +195.4% |
| Smartphone | 6.285 | 5.993 | 5.984 | +4.8% |
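The Δ column reads as a direction-aware relative improvement over the Scientist-only score, and the headline +55.9% is the mean of that column. The sketch below assumes that definition (the README does not spell out the formula), and `relative_improvement` is a hypothetical helper. Recomputing from the rounded scores in the table reproduces COVID, Ventilator, and Smartphone exactly, and gives about 10.2% for Statoil and 194.9% for NFL, slightly off the listed values, presumably because those were computed from unrounded scores.

```python
def relative_improvement(baseline: float, score: float, higher_is_better: bool) -> float:
    """Percent improvement of `score` over `baseline`, positive when `score` is better."""
    change = (score - baseline) if higher_is_better else (baseline - score)
    return 100.0 * change / abs(baseline)


# Δ vs Scientist as listed in the results table.
reported_deltas = {
    "Statoil": 10.5,
    "COVID": 24.4,
    "Ventilator": 44.4,
    "NFL": 195.4,
    "Smartphone": 4.8,
}

mean_delta = sum(reported_deltas.values()) / len(reported_deltas)
print(f"mean reported improvement: {mean_delta:.1f}%")

# COVID uses MCRMSE (lower is better): Scientist-only 0.389, Socrates 0.294.
print(f"COVID recomputed: {relative_improvement(0.389, 0.294, higher_is_better=False):.1f}%")
```

Because the NFL row contributes +195.4%, the mean is dominated by a single task, which is worth keeping in mind when comparing against single-seed reruns.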

Note: LLM agents are high-variance run-to-run. We saw a standard deviation of ~15% of the mean across 10 Scientist-only seeds on Smartphone. Expect single-seed numbers to vary; the direction of the effect (Socrates ≥ Baseline PI > Scientist-only) is the reproducible claim.
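"Standard deviation as a fraction of the mean" is the coefficient of variation, which is straightforward to compute for your own seeds. This is a sketch: the README does not say whether the sample or population standard deviation was used, so the sample version is an assumption, and the scores in the usage comment are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Sample standard deviation divided by the absolute mean (needs >= 2 scores)."""
    return stdev(scores) / abs(mean(scores))


# Usage with placeholder numbers (NOT measured results):
# seed_scores = [6.1, 6.9, 5.4]   # one test score per seed
# print(f"{100 * coefficient_of_variation(seed_scores):.0f}% of the mean")
```

A value near 0.15 for a set of Scientist-only seeds would match the spread reported in the note above.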

The key flags live in socratic-evolve/public-repo/config/config.yaml and discover/test_config.yaml:

| Flag | Default | Meaning |
|---|---|---|
| agent.use_socrates_review | false | Enable the full Socrates question-only protocol. |
| agent.use_baseline_pi | false | Enable the generic-encouragement control condition. |
| agent.steps | 50 (evolve) / 30 (seq) | Total experiment budget. |
| agent.K | 3 | Max discussion rounds before forced approval. |
| agent.model | claude-opus-4-6 | Scientist LLM. |
| agent.feedback_model | (same as model) | Socrates LLM (can differ from the Scientist). |
| agent.respect_finished | true | Whether the agent may stop early via [FINISHED]. |
| agent.enforce_gpu_usage | false | Inject the GPU-required block into the system prompt. |
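For scripting around the configs, the documented defaults can be mirrored in a small dataclass. This is an illustrative sketch rather than the repository's config.py: the `AgentConfig` class is hypothetical, the field names and defaults are copied from the table, and `steps` uses the evolutionary default of 50 (the sequential scaffold uses 30).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentConfig:
    """Hypothetical mirror of the agent.* flags and their documented defaults."""

    use_socrates_review: bool = False
    use_baseline_pi: bool = False
    steps: int = 50                        # 50 for evolve, 30 for the sequential scaffold
    K: int = 3                             # max discussion rounds before forced approval
    model: str = "claude-opus-4-6"         # Scientist LLM
    feedback_model: Optional[str] = None   # Socrates LLM; None means "same as model"
    respect_finished: bool = True
    enforce_gpu_usage: bool = False

    def __post_init__(self) -> None:
        # Resolve the documented default: feedback_model falls back to model.
        if self.feedback_model is None:
            self.feedback_model = self.model
```

For example, `AgentConfig(use_socrates_review=True, steps=30)` describes a Socrates run on the sequential scaffold with both agents on the same model.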

A more detailed flag-level reference for the prompts (which blocks get injected when) is in socratic-evolve/public-repo/agents/socrates/.

python discover/test_agent_locally.py --dry-run

cd socratic-evolve/public-repo/
pytest tests/test_socrates_live.py -k "test_socrates_basic"
@inproceedings{vrabac2026socrates,
  title     = {Socrates: Structured Questioning Unlocks Latent Knowledge in AI Research Agents},
  author    = {Vrabac, Damir and Hebbar, Prannay and Manawat, Yogendra and Palanimalai, Selvam and Verboomen, Samuel and Juneja, Gurusha and Bhatia, Kunal and Baskaran, Vignesh},
  booktitle = {Conference on Language Modeling (COLM)},
  year      = {2026}
}

MIT. See LICENSE.
