# HermesBench HermesBench is a reliability-first benchmark for Hermes Agent runtime configurations. It evaluates the whole configured agent harness, not just the base model. Primary pages: - Home and single-recipe quick start: / - Agent-driven quick start: /#quickstart - Alpha feedback: https://github.com/verkyyi/hermesbench/blob/main/FEEDBACK.md - Scenario recipe catalog: /recipes.html - Profile readouts: /profiles.html - Trace evidence browser: /traces.html - Submission contract: /../data/submissions/README.md Machine-readable artifacts: - Scenario recipe JSON: /../data/tasks/tasks.json - Trace evidence JSON: /../data/traces/index.json Agent-facing Python API: Skill source: - https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md Package source: - https://github.com/verkyyi/hermesbench ```python from hermesbench.api import agent_skill_text, list_scenarios, validate, run_scenario, build_public_artifacts print(agent_skill_text()) print(validate()) print(list_scenarios()[:5]) report = run_scenario("calendar_daily_brief", trials=1, run_llm_evals=True, persist=True) build_public_artifacts() ``` Current bundled public benchmark shape: - 27 workflow recipes - 9 job-area categories - 5 audience packages - Agent-driven evaluator by default - Default user-facing run is one scenario recipe: calendar_daily_brief - Full bundled benchmark runs are opt-in Publishing policy: - Public recipe pages show scenario prompts, expected outcomes, budgets, checks, capability intent, side-effect scope, and per-scenario evidence linked to the best public trace. - Public trace pages show per-case score, axes, mechanical closure, evaluator decision, judge summary, side effects, and redacted public transcripts when available. - Unredacted raw transcripts are private debugging artifacts and must be redacted before publication. - Public trace submissions are prepared by coding agents using the HermesBench skill workflow "Publish A Benchmark Result" and submitted by GitHub pull request under data/submissions/.