{"slug": "show-hn-hermesbench-workflow-reliability-evals-for-personal-ai-agents", "title": "Show HN: HermesBench – workflow reliability evals for personal AI agents", "summary": "HermesBench, a new benchmark for evaluating complete personal AI agent configurations rather than just models, launched with a public baseline score of 78.2 across 27 personal-agent recipes. The benchmark assesses entire Hermes setups including prompts, models, tools, memory, safety, and latency, with published redacted traces for inspection. The project aims to provide reliability-first evaluations for personal AI agents, penalizing lopsided scores that are capable but unsafe or safe but unhelpful.", "body_md": "Hermes Agent runtime evaluation\n\n# Benchmark the whole personal agent, not just the model.\n\nHermesBench evaluates complete Hermes configurations: prompt, model/provider, tools, AgentSkills, memory, gateway behavior, delegation, safety, latency, and stability. The current public baseline scores 78.2 across 27 personal-agent recipes with redacted traces you can inspect.\n\nWhy trust it\n\n## Evidence first, with visible limits.\n\nEvery published result links back to scenario definitions, public score axes, driver closure decisions, deterministic checks, and redacted trace timelines. The site is deliberately clear that this is one early baseline, not a base-model leaderboard.\n\n[Public recipes](recipes.html)\n\n**See the prompts** 27 user-like personal-agent jobs with criteria and side-effect boundaries.\n\n[Redacted traces](traces.html)\n\n**Inspect what happened** Tool timelines, assistant replies, checks, and judge summaries without raw private payloads.\n\n[Methodology](https://github.com/verkyyi/hermesbench/blob/main/docs/METHODOLOGY.md)\n\n**Understand the score** Capability, reliability, and UX axes with documented limitations.\n\nSite map\n\n## Three tabs for the current evidence shape.\n\nWith one baseline published, a leaderboard is premature. 
The site now starts from the content people need to navigate: recipes, profiles, and traces.\n\nAgent-driven quick start\n\n## Run it through a coding agent.\n\nThe public user pathway is intentionally simple: copy the prompt to Codex, Claude, or another coding agent. The agent loads the HermesBench skill and drives one scenario recipe first. Full bundle runs are opt-in because they take longer and cost more.\n\nPrompt to copy into Codex or Claude\n\n```\nUse the HermesBench skill and run one default scenario recipe for my current Hermes configuration.\n\nSkill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md\n\nFollow the skill's \"Run Current Hermes Configuration\" workflow. Use the Python API default single-recipe path, save artifacts, and summarize the score and main findings. Do not run the full bundle unless I explicitly ask.\n```\n\nAlpha feedback\n\n## The best next action is concrete feedback.\n\nHermesBench needs early feedback on setup friction, scoring surprises, recipe realism, profile evidence, and redaction trust. Star the repo if the benchmark shape is useful; open an issue if one recipe, trace, or score axis feels wrong.\n\nCoverage model\n\n## Workflow recipes, broad personal-agent coverage.\n\nHermesBench starts with one valuable workflow recipe, then lets you opt into broader suites when you need more confidence. The bundled catalog covers everyday personal-agent work: context, calendar, web, reports, communication, location, travel, finance, safety, and power-user integrations.\n\nScoring philosophy\n\n## Good agents finish the right thing safely.\n\nHermesBench is reliability-first, but not capability-blind. A good configuration should do useful work, tell the truth about what it knows, avoid unsafe side effects, stay stable, respond promptly, and communicate clearly. 
Lopsided scores are penalized because a personal agent that is capable but unsafe, safe but unhelpful, or correct but unusably slow is not actually good.\n\nDetailed formulas and implementation mechanics live in the methodology document; the website keeps the scoring model readable for users and LLM agents.\n\nUse and contribute\n\n## Turn good results into reusable recipes.\n\nHermesBench is useful as a quick benchmark, but it is also a way to publish what worked. Share a redacted profile/config package when a setup improves a recipe, or submit a generic recipe when an important personal-agent use case is missing.\n\nProfile submission prompt\n\n```\nUse the HermesBench skill to prepare my current Hermes profile/config as a public profile submission.\n\nSkill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md\n\nRun one representative recipe first, package the redacted profile snapshot and score evidence, and tell me what must be reviewed before opening a pull request.\n```\n\nRecipe submission prompt\n\n```\nUse the HermesBench skill to propose a new generic personal-agent recipe for HermesBench.\n\nSkill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md\n\nMake the use case privacy-safe, driver/target agnostic, fixture-backed where possible, and include deterministic checks before preparing a pull request.\n```\n\n", "url": "https://wpnews.pro/news/show-hn-hermesbench-workflow-reliability-evals-for-personal-ai-agents", "canonical_source": "https://verkyyi.github.io/hermesbench/", "published_at": "2026-05-30 23:03:40+00:00", "updated_at": "2026-05-30 23:15:07.572271+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "ai-research", "ai-products", "ai-safety"], "entities": ["HermesBench", "Hermes Agent", "Codex", "Claude"], "alternates": {"html": "https://wpnews.pro/news/show-hn-hermesbench-workflow-reliability-evals-for-personal-ai-agents", "markdown": 
"https://wpnews.pro/news/show-hn-hermesbench-workflow-reliability-evals-for-personal-ai-agents.md", "text": "https://wpnews.pro/news/show-hn-hermesbench-workflow-reliability-evals-for-personal-ai-agents.txt", "jsonld": "https://wpnews.pro/news/show-hn-hermesbench-workflow-reliability-evals-for-personal-ai-agents.jsonld"}}