{"slug": "show-hn-gandalf-the-grader", "title": "Show HN: Gandalf the Grader", "summary": "Handshake Research released Gandalf the Grader, an open-source reactive agent-as-judge that evaluates AI agents against binary rubric criteria by operating inside the same environment and using the same tools as the agent being graded. The system grades criteria based on artifacts and state (such as formulas in a workbook, files on disk, or whether an email was sent) rather than relying solely on final text responses. In evaluations, Gandalf outperformed text-only, snapshot-based, and workflow-based verifiers at a fraction of the cost, and is available on PyPI with integrations for the BankerToolBench benchmark and the rle-pkg runtime.", "body_md": "Read the [launch blog post](https://joinhandshake.com/research/ai/gandalf-the-grader/) for the motivation, benchmark results, and design rationale behind Gandalf.\n\nGandalf is a reactive agent-as-judge for rubric-graded agent environments. Given a rubric of binary criteria, it runs inside the rollout environment, uses the same tools as the rollout agent, and decides at inference time which files to open and which tool state to query.\n\nThat lets Gandalf grade criteria that depend on artifacts or state (formulas in a workbook, charts in a deck, files on disk, MCP tool state, or whether an email was actually sent) rather than just the final text response.\n\nGandalf is built around three design choices:\n\n- **Environment alignment:** Gandalf runs in the same filesystem, Python interpreter, installed packages, and tool environment as the rollout agent, using the [OpenHands](https://github.com/All-Hands-AI/OpenHands) SDK as the agent harness.\n- **Reactive verification:** Gandalf chooses what evidence to inspect while grading, instead of relying on a precomputed transcript or serialized snapshot.\n- **Swappable domain guidance:** Domain knowledge enters as natural-language guidance at runtime, making the same verifier portable across domains.\n\nIn our evaluation, this design beat text-only, snapshot-based, and workflow-based agentic verifiers at a fraction of the cost; see the [blog post](https://joinhandshake.com/research/ai/gandalf-the-grader/) for the full meta-eval.\n\n**Examples and integrations:** [BankerToolBench](https://github.com/Handshake-AI-Research/bankertoolbench) is a public agentic RL benchmark environment that uses Gandalf as the verifier. [rle-pkg](https://github.com/Handshake-AI-Research/rle-pkg) is a reference runtime that integrates Gandalf. Both run under the [Harbor](https://github.com/harbor-framework/harbor) framework, but Gandalf's design and implementation are framework-agnostic.\n\nGandalf is published [on PyPI](https://pypi.org/project/gandalf-the-grader/).\n\n```\nuv tool install gandalf-the-grader\n```\n\nFor production use, we recommend pinning a specific version of Gandalf and installing the `[pinned]` extra to [pin all transitive dependencies](https://github.com/edgarrmondragon/hatch-pinned-extra).\n\n```\nuv tool install 'gandalf-the-grader[pinned]==1.0.0'\n```\n\nThe repo ships a runnable example under [examples/quickstart/](/Handshake-AI-Research/gandalf-the-grader/blob/main/examples/quickstart) that grades a pre-staged workspace and ATIF trajectory against a 3-criterion rubric. Two criteria are designed to be met and one is designed to fail, so you can see Gandalf's partial-credit grading and per-criterion reasoning in one run. From a fresh clone:\n\n```\n# 1. Install\nuv tool install gandalf-the-grader\n\n# 2. Provide a Gemini API key (any litellm-compatible model works; see Configuration)\nexport LLM_API_KEY=\"<your-gemini-api-key>\"\n\n# 3. Run from the repo root\ngandalf-the-grader --config examples/quickstart/grader.toml\n\n# 4. 
Inspect the result\ncat examples/quickstart/output/reward.json   # -> {\"reward\": 0.75}\ncat examples/quickstart/output/info.json     # per-criterion verdicts + reasoning\n```\n\nExpected verdicts: the `welcome.txt` file exists (met), the message mentions Gandalf (met), and the message is *not* longer than 50 words, so the length criterion is unmet by design. The raw score is 3.0 of a possible 4.0, for a reward of 0.75.\n\nThe example uses [gemini/gemini-2.5-flash](/Handshake-AI-Research/gandalf-the-grader/blob/main/examples/quickstart/grader.toml) and runs the inner judge as the current user (no `sandbox_user`, no sudo). To adapt it to your own setup, edit [`examples/quickstart/grader.toml`](/Handshake-AI-Research/gandalf-the-grader/blob/main/examples/quickstart/grader.toml). See the [Configuration](#configuration) section below for the full field reference.\n\n| Field | Required | Default | Description |\n|---|---|---|---|\n| `instructions` | Yes* | | Inline task instructions given to the original agent (mutually exclusive with `instructions_path`) |\n| `instructions_path` | Yes* | | Path to a file with task instructions (mutually exclusive with `instructions`) |\n| `rubric` | Yes* | | Inline rubric as a TOML array of tables (mutually exclusive with `rubric_path`) |\n| `rubric_path` | Yes* | | Path to rubric JSON file (mutually exclusive with `rubric`) |\n| `judge_guidance` | No | | Inline judge guidance text (mutually exclusive with `judge_guidance_path`) |\n| `judge_guidance_path` | No | | Path to a file with extra judge instructions (mutually exclusive with `judge_guidance`) |\n| `workdir` | Yes | | Agent workspace directory |\n| `trajectory_path` | Yes | | Path to ATIF trajectory JSON |\n| `output_dir` | Yes | | Directory for grader output files |\n| `model` | No | `gemini/gemini-2.5-flash` | LLM model for the judge agent |\n| `mode` | No | `batch` | Evaluation mode: `batch` or `individual` |\n| `judge_timeout` | No | `300` | Max seconds per judge invocation |\n| `batch_timeout` | No | | Max total seconds for batch mode (caps `judge_timeout * N`) |\n| `judge_retries` | No | `1` | Number of retry attempts for criteria that error due to infrastructure failures |\n| `batch_splits` | No | | Split criteria into N chunks in batch mode (>= 2). Each chunk is evaluated as a separate batch session. Only valid with `mode = \"batch\"`. |\n| `max_concurrency` | No | | Max parallel judge sessions (>= 1). Defaults to 1 for individual mode, `batch_splits` for batch mode. |\n| `sandbox_user` | No | | Username for running the inner judge (via sudo). When omitted the judge runs as the current user. |\n| `judge_prompt` | No | | Inline Jinja2 template that completely overrides the built-in judge task prompt (mutually exclusive with `judge_prompt_path`) |\n| `judge_prompt_path` | No | | Path to a Jinja2 template file that completely overrides the built-in judge task prompt (mutually exclusive with `judge_prompt`) |\n\nMCP servers can be configured as a TOML array of tables:\n\n```\n[[mcp_servers]]\nname = \"magic-server\"\ntransport = \"stdio\"\ncommand = \"/usr/bin/mcp-server\"\nargs = [\"--verbose\"]\n```\n\nBy default, the grader uses a built-in prompt template to kick off each judge session. `judge_prompt` / `judge_prompt_path` let you replace it entirely with a custom [Jinja2](https://jinja.palletsprojects.com/) template.\n\n**Note:** This prompt is sent as the opening user message to the judge agent, not as the LLM system prompt. The underlying agent framework (OpenHands) has its own immutable system message with coding and tool-use instructions that we never modify. Our prompt sits on top of that as the first user turn, setting up the grading task.\n\nFor most use cases, `judge_guidance` / `judge_guidance_path` is all you need: it injects extra instructions into the built-in prompt without replacing it. 
Fully overriding the judge prompt is an uncommon escape hatch for situations where the built-in prompt structure itself is unsuitable.\n\nThe template receives these variables:\n\n| Variable | Type | Mode | Description |\n|---|---|---|---|\n| `instructions` | `str` | both | Task instructions given to the original agent |\n| `final_output` | `str` | both | Agent's final message from the trajectory |\n| `criterion` | `str` | individual | The single criterion string to evaluate |\n| `criteria` | `list[str]` | batch | List of all criterion strings to evaluate |\n| `verdict_path` | `str` | both | File path the judge must write its verdict to |\n| `judge_guidance` | `str` | both | Additional guidance text (may be empty) |\n\nIndividual and batch modes use separate built-in templates. In a custom template, use `{% if criterion is defined %}` vs `{% if criteria is defined %}` if you need to distinguish modes. In batch mode, use `loop.index0` for the criterion index (e.g., `{% for c in criteria %}[{{ loop.index0 }}] {{ c }}{% endfor %}`).\n\nThe rubric is a JSON array of objects with `criterion` (string) and `weight` (float) fields. Weights can be negative to penalise undesired outcomes:\n\n```\n[\n  {\"criterion\": \"The output file exists\", \"weight\": 2.0},\n  {\"criterion\": \"The output contains correct totals\", \"weight\": 3.0},\n  {\"criterion\": \"The agent used hardcoded values instead of computing\", \"weight\": -1.0}\n]\n```\n\n- **Positive weight**: adds to the raw score when the criterion's condition is met\n- **Negative weight**: deducts from the raw score when the criterion's condition is met (the bad thing happened)\n- The judge evaluates each criterion on its own merits; it never sees weights\n\nThe grader reads agent trajectories in [Agent Trajectory Interchange Format (ATIF)](https://www.harborframework.com/docs/agents/trajectory-format). 
An ATIF file is a JSON object with a `steps` array:\n\n```\n{\n  \"steps\": [\n    {\"source\": \"user\", \"message\": \"Build a hello world web app\"},\n    {\"source\": \"agent\", \"message\": \"I'll create the file now\", \"tool_calls\": [...]},\n    {\"source\": \"agent\", \"message\": \"Done! I created index.html with a Hello World page.\"}\n  ]\n}\n```\n\nThe grader extracts the final agent message (the last `\"source\": \"agent\"` step with a non-empty message and no `tool_calls`) and passes it to the judge as context.\n\n| Variable | Description |\n|---|---|\n| `LLM_API_KEY` | API key for the LLM provider |\n| `LLM_BASE_URL` | Base URL for the LLM API (optional) |\n| `GRADER_INSTRUCTIONS_PATH` | Fallback path to task instructions file (if not set in TOML) |\n| `GRADER_JUDGE_GUIDANCE_PATH` | Fallback path to judge guidance file (if not set in TOML) |\n| `GRADER_JUDGE_PROMPT_PATH` | Fallback path to custom judge prompt template (if not set in TOML) |\n| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL for trace export (optional) |\n| `OTEL_EXPORTER_OTLP_HEADERS` | OTLP auth headers, URL-encoded (optional) |\n| `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` | OTLP transport protocol, e.g. `http/protobuf` (optional) |\n\nGandalf builds on top of OpenHands, which has built-in OpenTelemetry tracing that automatically instruments LLM calls, tool executions, and agent steps. 
Set the `OTEL_EXPORTER_OTLP_*` variables above to export traces to any OTEL-compatible backend with no code changes required.\n\n**Example: Langfuse**\n\n```\n# Encode your Langfuse keys\necho -n \"pk-lf-...:sk-lf-...\" | base64\n\n# Export the variables\nexport OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel/v1/traces\nexport OTEL_EXPORTER_OTLP_HEADERS=\"Authorization=Basic%20<base64-encoded-keys>\"\nexport OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf\n```\n\nThe grader writes to `output_dir`:\n\n- `reward.json`: Reward file (e.g., `{\"reward\": 0.75}`), always in [0, 1]. **Only written when all criteria are successfully evaluated.** If any criteria still have errors after retries, the grader writes `info.json` but skips `reward.json` and exits with code 1.\n- `info.json`: Always written. Per-criterion results with `met`/not-met, reasoning, evidence, LLM usage, plus `reward`, `raw_score`, `minimum_score`, `maximum_score`, `errored_criterion_count`, and `evaluated_criteria_pct`.\n- `judge_trace_*.txt`: stdout/stderr capture for each judge invocation. Naming varies by mode: `judge_trace_{i}.txt` (individual), `judge_trace_batch.txt` (batch), `judge_trace_batch_split{i}.txt` (batch with splits). Retries append a `_retry{N}` suffix.\n\nThe `reward` in `reward.json` is `clip(0, 1, raw_score / sum_of_positive_weights)`, always in [0, 1]. `info.json` additionally includes `raw_score` (the raw sum of weights for met criteria, which can be negative) and `minimum_score`/`maximum_score` bounds for reference.\n\n- **Try the benchmark environment.** [BankerToolBench on Hugging Face](https://huggingface.co/datasets/handshake-ai-research/bankertoolbench) is the public RL environment that Gandalf was originally evaluated against. Clone it, run rollouts, and grade them with Gandalf.\n- **Adapt Gandalf to a new rollout environment.** Edit `examples/quickstart/grader.toml` to point at your workspace, trajectory, and rubric. See the [Configuration](#configuration) and [Custom Judge Prompt](#custom-judge-prompt) sections for the full reference, including domain-specific judge guidance.\n\nCopyright (c) Handshake. Released under the Apache-2.0 license. See [LICENSE.txt](/Handshake-AI-Research/gandalf-the-grader/blob/main/LICENSE.txt) for details.", "url": "https://wpnews.pro/news/show-hn-gandalf-the-grader", "canonical_source": "https://github.com/Handshake-AI-Research/gandalf-the-grader", "published_at": "2026-05-27 18:52:57+00:00", "updated_at": "2026-05-27 19:15:33.706001+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "ai-research", "large-language-models", "artificial-intelligence"], "entities": ["Gandalf", "OpenHands", "Handshake"], "alternates": {"html": "https://wpnews.pro/news/show-hn-gandalf-the-grader", "markdown": "https://wpnews.pro/news/show-hn-gandalf-the-grader.md", "text": "https://wpnews.pro/news/show-hn-gandalf-the-grader.txt", "jsonld": "https://wpnews.pro/news/show-hn-gandalf-the-grader.jsonld"}}