Hermes Agent: Self-Improving Guide

# 🧬 Evolve Your Hermes Agent: A Practical Guide to Self-Improving AI

## TL;DR

You can automatically improve your Hermes Agent's skills, prompts, and tools using evolutionary algorithms: no GPU required, just API calls ($2–10 per run). This guide shows you how to run the self-evolution pipeline on your existing Ubuntu VPS alongside Claude Code and Gemini CLI, using real execution traces to drive targeted improvements.

---

## Context

You're running:

- A **Hermes Agent** instance on Ubuntu
- **Claude Code CLI** (Anthropic) and **Gemini CLI** (Google) on the same Ubuntu machine

These CLIs generate valuable **execution traces**: real-world usage logs showing what worked, what failed, and *why* failures happened. The Hermes Agent Self-Evolution project uses these traces to drive evolutionary optimization of:

- Skill files (`SKILL.md`)
- Tool descriptions
- System prompts
- (Later) tool implementation code

The key insight: **failure traces contain more signal than success traces**. The GEPA optimizer reads why things fail and proposes targeted mutations, then evaluates candidates against test suites, size limits, and semantic preservation constraints.

---

## The Approach

### How the Evolution Pipeline Works

```
Execution traces (from Claude Code, Gemini, Hermes)
        │
        ▼
GEPA Optimizer ──► Mutates prompts/skills based on failure reasons
        │
        ▼
Candidate variants (e.g., 10 mutated skill descriptions)
        │
        ▼
Constraint gates:
  • Full test suite (pytest)
  • Size limits (≤15KB for skills)
  • Semantic preservation
        │
        ▼
Best variant ──► Creates PR for hermes-agent repo
```

### On Your VPS: Step-by-Step

#### 1. Install Hermes Agent Self-Evolution

```bash
# Clone the repo
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution

# Install with dev dependencies
pip install -e ".[dev]"

# Point at your existing hermes-agent repo
export HERMES_AGENT_REPO=/path/to/your/hermes-agent
```

#### 2. Import real execution traces from Claude Code & Gemini

```bash
# The repo includes importers for:
# - Claude Code sessions
# - Gemini CLI history
# - Hermes' own history
python -m evolution.importers.import_claude_sessions \
  --claude-log-dir ~/.claude/logs \
  --output ./datasets/session_db

# Similarly for Gemini
python -m evolution.importers.import_gemini_sessions \
  --gemini-history ~/.gemini/history.json \
  --output ./datasets/session_db
```

#### 3. Run skill evolution using real session data

```bash
# Evolve a specific skill using imported session history
python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10 \
  --eval-source sessiondb \
  --session-db ./datasets/session_db
```

The optimizer will:

- Read execution traces where the `github-code-review` skill failed or succeeded
- Generate 10–20 mutated variants of the skill's `SKILL.md`
- Evaluate each variant against your tests
- Select the best performer that passes all guardrails
- Open a PR against your hermes-agent repo

#### 4. (Optional) Try synthetic evaluation first

```bash
# Quick test without real traces
python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 5 \
  --eval-source synthetic
```

---

## Why It Worked

### 1. **Failure-driven mutation is more efficient than random search**

GEPA doesn't mutate blindly; it parses execution traces to understand *root causes*:

- "Tool description too vague → model didn't call it"
- "Skill exceeds context limit → got truncated"
- "Example format inconsistent → parsing failed"

Then it mutates specifically to address those causes.

### 2. **API-driven means no GPU bottleneck**

Your Ubuntu VPS doesn't need expensive hardware. Each run:

- Calls Claude/Gemini APIs for mutations (pennies)
- Evaluates variants using the existing test suite (CPU only)
- Total cost: **$2–10 per optimization run**

### 3. **Multiple trace sources create richer evolution**

Claude Code, Gemini CLI, and Hermes Agent each behave differently and therefore fail in different ways:

- Claude: verbose, good at multi-step planning
- Gemini: fast, sometimes truncates
- Hermes: function-calling specific

Cross-pollinating traces from all three gives broader coverage.

### 4. **Guardrails prevent regressions**

Every candidate must pass:

- The full `pytest` suite (your existing tests)
- Size limits (prevents bloat)
- Semantic drift checks (no changing the skill's core purpose)

This means an evolved version has not regressed on the checks you already enforce, rather than merely being different.

---

## Apply It Yourself

### Prerequisites

- Ubuntu VPS (20.04+)
- Hermes Agent installed and tested
- Claude Code CLI + Gemini CLI (optional but valuable)
- API keys for at least one LLM (OpenAI, Anthropic, or Gemini)
- Python 3.10+

### Configuration Checklist

```bash
# 1. Set API keys for mutation/evaluation
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="..."
export GOOGLE_API_KEY="..."

# 2. Point at your hermes-agent repo
export HERMES_AGENT_REPO="$HOME/hermes-agent"

# 3. Run a single skill evolution as a test
python -m evolution.skills.evolve_skill --skill your-skill-name --iterations 3 --eval-source synthetic

# 4. Examine the candidate PR
ls evolution/reports/latest/candidates/
```

### Typical Workflow for Your Team

1. **Weekly evolution run** against the last 7 days of Claude Code + Gemini traces
2. **Human reviews** the generated PR (guardrails ensure it passes tests, but the semantics still need a human check)
3. **Merge** the evolved skill back to your main branch
4. **Measure** improvement via before/after benchmarks (the tool generates comparison reports)

### Expected Outcomes

| Metric | Typical Improvement |
|--------|---------------------|
| Skill success rate | +15–30% |
| Token efficiency | 10–20% fewer tokens (shorter, more precise prompts) |
| Edge case handling | +40–60% (from failure traces) |

---

## Source

- **GitHub Repository**: [NousResearch/hermes-agent-self-evolution](https://github.com/NousResearch/hermes-agent-self-evolution)
- **Research Paper**: ICLR 2026 Oral (see repo for citation)
- **Full Architecture**: [PLAN.md](https://github.com/NousResearch/hermes-agent-self-evolution/blob/main/PLAN.md) in the repo
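
To make the constraint gates from the pipeline overview more concrete, here is a minimal Python sketch of how gating and selection could fit together. It is an illustration under stated assumptions, not the project's actual implementation: the `Candidate` record, `passes_gates`, and `select_best` are hypothetical names, and the `tests_passed`, `semantic_ok`, and `score` fields stand in for the real pytest run, semantic preservation check, and evaluation score. Only the 15 KB skill size limit comes from this guide.

```python
from dataclasses import dataclass
from typing import Optional

# Size limit for skill files as stated in this guide (≤15KB).
MAX_SKILL_BYTES = 15 * 1024


@dataclass
class Candidate:
    """Hypothetical record for one mutated SKILL.md variant."""
    name: str
    text: str           # full text of the mutated skill file
    tests_passed: bool  # stand-in for the outcome of the pytest run
    semantic_ok: bool   # stand-in for the semantic preservation check
    score: float        # stand-in for the evaluation score (higher is better)


def passes_gates(c: Candidate, max_bytes: int = MAX_SKILL_BYTES) -> bool:
    """Return True only if the candidate clears every gate."""
    size_ok = len(c.text.encode("utf-8")) <= max_bytes
    return c.tests_passed and size_ok and c.semantic_ok


def select_best(candidates: list[Candidate]) -> Optional[Candidate]:
    """Pick the highest-scoring candidate among those that pass all gates."""
    survivors = [c for c in candidates if passes_gates(c)]
    return max(survivors, key=lambda c: c.score, default=None)


if __name__ == "__main__":
    pool = [
        Candidate("baseline", "short skill text", True, True, 0.60),
        Candidate("bloated", "x" * (MAX_SKILL_BYTES + 1), True, True, 0.90),
        Candidate("breaks-tests", "short skill text", False, True, 0.95),
        Candidate("tightened", "short skill text", True, True, 0.70),
    ]
    best = select_best(pool)
    print(best.name if best else "no candidate passed the gates")
```

The point of the sketch is the ordering: gates act as hard filters first, and the score only ranks candidates that already passed, so a high-scoring variant that fails a test or exceeds the size limit is never selected.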