{"slug": "show-hn-multi-agent-protocol-for-ai-scientist-by-hexo-labs", "title": "Show HN: Multi Agent Protocol for AI Scientist by Hexo Labs", "summary": "Hexo Labs released Socrates, an open-source multi-agent protocol that pairs a tool-using research agent with a question-only advisor, improving MLE-bench Kaggle competition scores by an average of 55.9% over the agent running alone.", "body_md": "Pair a tool-using research agent with a question-only advisor that can never give answers, never issue directives, and has no tools of its own. The advisor must approve every plan via `[APPROVED]` before the Scientist runs the next experiment. On five MLE-bench Kaggle competitions this lifts test scores by an average of +55.9% over the same agent running alone.\n\n*Left: Socrates asks questions only and is stateful across sessions; the\nScientist is stateless, executes code, and reads/writes the shared\nenvironment. Right (Statoil example): Socrates asks whether incremental\ntuning is closing the gap, and the Scientist pivots to domain features\n(+10.2%); the Baseline PI offers generic encouragement and the Scientist\nstays on pixel statistics (+1.6%).*\n\nNote\n\nThe asciinema badge above is a placeholder. To record your own, run `bash scripts/record_demo.sh`, then `asciinema upload`, and paste the returned cast ID into this README in place of `YOUR_CAST_ID` (two occurrences).\n\n[Quick start](#quick-start)[Repository layout](#repository-layout)[The two scaffolds](#the-two-scaffolds)[The three conditions](#the-three-conditions)[Reproducing the paper results](#reproducing-the-paper-results)[Configuration reference](#configuration-reference)[Running tests](#running-tests)[Citation](#citation)[License](#license)\n\nTested on Python 3.10–3.12, Linux/macOS. A GPU is optional: only the tasks that train deep models (Statoil and NFL) benefit from one, and the others run fine on CPU.\n\n```\n# 1. 
Clone the repo\ngit clone https://github.com/hexo-ai/socrates.git\ncd socrates\n\n# 2. Create an isolated environment (conda or venv; pick one)\nconda create -n socrates python=3.11 -y\nconda activate socrates\n#   or, with venv instead of conda:\n#   python -m venv .venv && source .venv/bin/activate\n\n# 3. Install dependencies\npip install -r requirements.txt\npip install --no-deps -r socratic-evolve/public-repo/requirements_base.txt\npip install --no-deps -r socratic-evolve/public-repo/requirements_ml.txt\npip install --no-deps -r socratic-evolve/public-repo/requirements_domain.txt\n\n# 4. Set API keys\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"        # required\nexport OPENAI_API_KEY=\"sk-...\"               # optional, only if you use OpenAI models\n\n# 5. Create a local test config (gitignored)\ncp socratic-evolve/test_config.yaml.example socratic-evolve/test_config.yaml\ncp discover/test_config.yaml.example          discover/test_config.yaml\n# Edit each to set dataset_dir and model.\n\n# 6. Smoke-test the sequential scaffold\npython discover/test_agent_locally.py\n```\n\nIf step 6 prints a Socrates question and an `[APPROVED]` from a short discussion loop, the install is good.\n\n```\nsocrates/\n├── discover/                 # Sequential scaffold (single agent, one experiment at a time)\n│   ├── custom_agent.py       # Agent implementation\n│   ├── base_agent.py         # Base class with webhook protocol\n│   ├── models.py             # Message models\n│   └── test_agent_locally.py # Local smoke test\n│\n├── socratic-evolve/          # Evolutionary scaffold (MLevolve + MCGS tree search)\n│   ├── custom_agent.py       # Agent wrapper\n│   ├── base_agent.py         # Base class\n│   ├── models.py             # Message models\n│   └── public-repo/          # MLevolve core\n│       ├── run.py            # Main entry point for full experiments\n│       ├── config/           # Default configuration\n│       ├── engine/           # MCGS tree search, code execution\n│       ├── agents/          
 # Multi-agent subsystem\n│       │   ├── socrates/     # Socrates PI implementation\n│       │   │   ├── prompts.py        # Question-only system prompt + [APPROVED] gate\n│       │   │   ├── approval_loop.py  # Multi-round discussion loop\n│       │   │   └── config.py         # Toggle flags\n│       │   ├── evolution_agent.py    # Paradigm-shift mutations\n│       │   └── fusion_agent.py       # Cross-branch solution merging\n│       └── llm/              # LLM client wrappers\n│\n├── assets/\n│   └── protocol.png          # Protocol diagram\n├── scripts/\n│   └── record_demo.sh        # Records the asciinema demo cast\n├── conda.sh                  # Quick env activation helper\n├── requirements.txt          # Top-level dependency manifest\n├── LICENSE                   # MIT\n└── README.md                 # This file\n```\n\nIn the sequential scaffold (`discover/`), a single agent writes and executes experiments one at a time, with no built-in exploration mechanism. The Scientist retains tool access during Socratic review, so when Socrates asks *\"how many features have zero importance?\"* the Scientist runs the analysis right then. This scaffold works best when per-step quality matters more than raw experiment volume.\n\nThe evolutionary scaffold (`socratic-evolve/`) is an evolutionary code-generation system (MLevolve) that maintains a tree of candidate solutions and runs multiple branches in parallel. It includes evolution stages (paradigm-shift mutations) and fusion stages (cross-branch solution merging). During review, the Scientist can only revise plan text (no tool access). This scaffold works best when the search space rewards high iteration volume.\n\nAll three conditions are controlled via configuration flags (`use_socrates_review` and `use_baseline_pi` in `config.yaml` / `config.py`):\n\n| Condition | Flags | Behavior |\n|---|---|---|\n| Scientist-only | `use_socrates_review=False`, `use_baseline_pi=False` | Single agent, no supervision. 
|\n| Baseline PI | `use_socrates_review=False`, `use_baseline_pi=True` | Second agent giving generic encouragement (control condition). |\n| Socrates | `use_socrates_review=True` | Full protocol: question-only PI, `[APPROVED]` gate. |\n\nWe evaluate on five tasks from [MLE-bench](https://github.com/openai/mle-bench):\n\n| Task | Metric | Notes |\n|---|---|---|\n| Statoil Iceberg | Log Loss ↓ | Radar imagery |\n| Stanford COVID Vaccine | MCRMSE ↓ | RNA degradation |\n| Ventilator Pressure | MAE ↓ | Tabular time-series |\n| NFL Contact Detection | MCC ↑ | Player tracking + video |\n| Smartphone Decimeter | Haversine ↓ | GPS positioning |\n\nFollow the MLE-bench instructions to download the five competition datasets. Place each one in a local directory and note its path; you will set it in the config in the next step.\n\nTo run the sequential scaffold:\n\n```\ncd discover/\n# Edit test_config.yaml to set:\n#   AGENT_CONFIG.exp_id        -> the MLE-bench task id (e.g. \"statoil-iceberg-classifier-challenge\")\n#   AGENT_CONFIG.dataset_dir   -> the local path you put the data in\n#   AGENT_CONFIG.model         -> the LLM (default: claude-opus-4-6)\npython test_agent_locally.py\n```\n\nThis writes per-experiment folders, a `best_score.txt`, and a `submission.csv` in `dataset_dir`. 
Submit `submission.csv` to Kaggle to get the test score.\n\nTo run the evolutionary scaffold:\n\n```\ncd socratic-evolve/public-repo/\npython run.py \\\n  exp_id=\"statoil-iceberg-classifier-challenge\" \\\n  agent.use_socrates_review=True \\\n  agent.steps=50\n```\n\nFor each task, run it once per condition (toggling the flags above) so you can compare Scientist-only / Baseline PI / Socrates side by side.\n\n```\ncd socratic-evolve/public-repo/\npython collect_and_plot.py   # aggregates per-experiment logs into the paper's tables/plots\npython dashboard.py          # optional live dashboard\n```\n\n| Task | Scientist-only (test) | Baseline PI (test) | Socrates (test) | Δ vs Scientist |\n|---|---|---|---|---|\n| Statoil | 0.255 | 0.251 | 0.229 | +10.5% |\n| COVID | 0.389 | 0.308 | 0.294 | +24.4% |\n| Ventilator | 1.534 | 0.815 | 0.853 | +44.4% |\n| NFL | 0.198 | 0.537 | 0.584 | +195.4% |\n| Smartphone | 6.285 | 5.993 | 5.984 | +4.8% |\n\nNote: LLM agents show high run-to-run variance. We saw a standard deviation of ~15% of the mean across 10 Scientist-only seeds on Smartphone. Expect single-seed numbers to vary; the *direction* of the effect (Socrates ≥ Baseline PI > Scientist-only) is the reproducible claim.\n\nThe key flags live in `socratic-evolve/public-repo/config/config.yaml` and `discover/test_config.yaml`:\n\n| Flag | Default | Meaning |\n|---|---|---|\n| `agent.use_socrates_review` | `false` | Enable the full Socrates question-only protocol. |\n| `agent.use_baseline_pi` | `false` | Enable the generic-encouragement control condition. |\n| `agent.steps` | `50` (evolve) / `30` (seq) | Total experiment budget. |\n| `agent.K` | `3` | Max discussion rounds before forced approval. |\n| `agent.model` | `claude-opus-4-6` | Scientist LLM. |\n| `agent.feedback_model` | (same as `model`) | Socrates LLM (can differ from the Scientist). |\n| `agent.respect_finished` | `true` | Whether the agent may stop early via `[FINISHED]`. 
|\n| `agent.enforce_gpu_usage` | `false` | Inject the GPU-required block into the system prompt. |\n\nA more detailed flag-level reference for the prompts (which blocks get injected when) is in `socratic-evolve/public-repo/agents/socrates/`.\n\n```\n# Sequential scaffold smoke test (no real run; mocks the LLM):\npython discover/test_agent_locally.py --dry-run\n\n# Evolutionary scaffold live test (requires API key):\ncd socratic-evolve/public-repo/\npytest tests/test_socrates_live.py -k \"test_socrates_basic\"\n```\n\nTo cite the paper:\n\n```\n@inproceedings{vrabac2026socrates,\n  title     = {Socrates: Structured Questioning Unlocks Latent Knowledge in AI Research Agents},\n  author    = {Vrabac, Damir and Hebbar, Prannay and Manawat, Yogendra and Palanimalai, Selvam and Verboomen, Samuel and Juneja, Gurusha and Bhatia, Kunal and Baskaran, Vignesh},\n  booktitle = {Conference on Language Modeling (COLM)},\n  year      = {2026}\n}\n```\n\nMIT. See [LICENSE](https://github.com/hexo-ai/socrates/blob/master/LICENSE).", "url": "https://wpnews.pro/news/show-hn-multi-agent-protocol-for-ai-scientist-by-hexo-labs", "canonical_source": "https://github.com/hexo-ai/socrates", "published_at": "2026-06-25 11:35:51+00:00", "updated_at": "2026-06-25 12:14:29.234342+00:00", "lang": "en", "topics": ["ai-agents", "machine-learning", "ai-research"], "entities": ["Hexo Labs", "Socrates", "MLE-bench", "Kaggle", "Statoil"], "alternates": {"html": "https://wpnews.pro/news/show-hn-multi-agent-protocol-for-ai-scientist-by-hexo-labs", "markdown": "https://wpnews.pro/news/show-hn-multi-agent-protocol-for-ai-scientist-by-hexo-labs.md", "text": "https://wpnews.pro/news/show-hn-multi-agent-protocol-for-ai-scientist-by-hexo-labs.txt", "jsonld": "https://wpnews.pro/news/show-hn-multi-agent-protocol-for-ai-scientist-by-hexo-labs.jsonld"}}