Show HN: Gandalf the Grader

Handshake Research released Gandalf the Grader, an open-source reactive agent-as-judge that evaluates AI agents against binary rubric criteria by operating inside the same environment and using the same tools as the agent being graded. The system grades criteria based on artifacts and state (such as formulas in a workbook, files on disk, or whether an email was sent) rather than relying solely on final text responses. In evaluations, Gandalf outperformed text-only, snapshot-based, and workflow-based verifiers at a fraction of the cost. It is available on PyPI, with integrations for the BankerToolBench benchmark and the rle-pkg runtime.

Read the launch blog post (https://joinhandshake.com/research/ai/gandalf-the-grader/) for the motivation, benchmark results, and design rationale behind Gandalf.

Gandalf is a reactive agent-as-judge for rubric-graded agent environments. Given a rubric of binary criteria, it runs inside the rollout environment, uses the same tools as the rollout agent, and decides at inference time which files to open and which tool state to query. That lets Gandalf grade criteria that depend on artifacts or state (formulas in a workbook, charts in a deck, files on disk, MCP tool state, or whether an email was actually sent) rather than just the final text response.

Gandalf is built around three design choices:

- Environment alignment: Gandalf runs in the same filesystem, Python interpreter, installed packages, and tool environment as the rollout agent, using the OpenHands SDK (https://github.com/All-Hands-AI/OpenHands) as the agent harness.
- Reactive verification: Gandalf chooses what evidence to inspect while grading, instead of relying on a precomputed transcript or serialized snapshot.
- Swappable domain guidance: Domain knowledge enters as natural-language guidance at runtime, making the same verifier portable across domains.
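To make the idea of grading binary criteria against workspace state (rather than the final text response) concrete, here is a minimal Python sketch. It is not Gandalf's API or rubric schema: `Criterion`, `grade`, the file names, and the `outbox/` convention are all hypothetical, and the checks are hard-coded functions, whereas Gandalf uses an LLM agent that decides at grading time what to inspect.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Criterion:
    """One binary rubric item (hypothetical structure, not Gandalf's schema)."""
    description: str
    check: Callable[[Path], bool]  # inspects the workspace; True means "met"


def grade(workspace: Path, rubric: list[Criterion]) -> dict:
    """Evaluate each criterion against workspace state and report the fraction met."""
    results = {c.description: bool(c.check(workspace)) for c in rubric}
    score = sum(results.values()) / len(results) if results else 0.0
    return {"results": results, "score": score}


def header_has_column(path: Path, column: str) -> bool:
    """True if the CSV at `path` exists and its header row contains `column`."""
    if not path.is_file():
        return False
    lines = path.read_text().splitlines()
    return bool(lines) and column in lines[0].split(",")


def email_was_written(workspace: Path) -> bool:
    """True if an (assumed) outbox/ directory holds at least one .eml file."""
    outbox = workspace / "outbox"
    return outbox.is_dir() and any(outbox.glob("*.eml"))


# Illustrative rubric: every criterion is answered from files on disk.
RUBRIC = [
    Criterion("report.csv exists", lambda ws: (ws / "report.csv").is_file()),
    Criterion("report.csv has a revenue column",
              lambda ws: header_has_column(ws / "report.csv", "revenue")),
    Criterion("an email was written to outbox/", email_was_written),
]


def run_demo() -> dict:
    """Stage a toy workspace in which no email is ever sent, then grade it."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "report.csv").write_text("quarter,revenue\nQ1,100\nQ2,120\n")
        (ws / "summary.txt").write_text("Revenue grew in Q2; email sent.\n")
        return grade(ws, RUBRIC)


if __name__ == "__main__":
    report = run_demo()
    for description, met in report["results"].items():
        print(f"{'MET ' if met else 'FAIL'} {description}")
    print(f"score: {report['score']:.2f}")
```

In this toy workspace the summary text claims an email was sent, but no file exists under `outbox/`, so that criterion fails while the two file-based criteria are met. A verifier that read only the text could be misled by the claim, which is the gap that grading on state is meant to close.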
In our evaluation, this design beat text-only, snapshot-based, and workflow-based agentic verifiers at a fraction of the cost; see the blog post (https://joinhandshake.com/research/ai/gandalf-the-grader/) for the full meta-eval.

Examples and integrations:

- BankerToolBench (https://github.com/Handshake-AI-Research/bankertoolbench) is a public agentic RL benchmark environment that uses Gandalf as the verifier.
- rle-pkg (https://github.com/Handshake-AI-Research/rle-pkg) is a reference runtime that integrates Gandalf.

Both run under the Harbor framework (https://github.com/harbor-framework/harbor), but Gandalf's design and implementation are framework-agnostic.

Gandalf is published on PyPI (https://pypi.org/project/gandalf-the-grader/).

    uv tool install gandalf-the-grader

For production use, we recommend pinning a specific version of Gandalf and installing it with the pinned extra, which also pins all transitive dependencies (https://github.com/edgarrmondragon/hatch-pinned-extra).

    uv tool install 'gandalf-the-grader[pinned]==1.0.0'

The repo ships a runnable example under examples/quickstart/ (https://github.com/Handshake-AI-Research/gandalf-the-grader/blob/main/examples/quickstart) that grades a pre-staged workspace and ATIF trajectory against a 3-criterion rubric. Two criteria are designed to be met and one is designed to fail, so you can see Gandalf's partial-credit grading and per-criterion reasoning in one run.

From a fresh clone:

1. Install

       uv tool install gandalf-the-grader

2. Provide a Gemini API key (any litellm-compatible model works; see Configuration)

       export LLM_API_KEY="