# 🧬 Evolve Your Hermes Agent: A Practical Guide to Self-Improving AI
## TL;DR
You can automatically improve your Hermes Agent's skills, prompts, and tools with evolutionary algorithms. No GPU is required: a run uses only API calls and costs roughly $2–10. This guide shows how to run the self-evolution pipeline on your existing Ubuntu VPS alongside Claude Code and Gemini CLI, using real execution traces to drive targeted improvements.
---
## Context
You're running:
- A **Hermes Agent** instance on Ubuntu
- **Claude Code CLI** (Anthropic) and **Gemini CLI** (Google) on the same Ubuntu host
These CLIs generate valuable **execution traces**: real-world usage logs showing what worked, what failed, and *why* failures happened. The Hermes Agent Self-Evolution project uses these traces to drive evolutionary optimization of:
- Skill files (`SKILL.md`)
- Tool descriptions
- System prompts
- (Later) tool implementation code
The key insight: **failure traces contain more signal than success traces**. The GEPA optimizer reads why things fail and proposes targeted mutations, then evaluates each candidate against the test suite, the size limits, and the semantic preservation constraints.
---
## The Approach
### How the Evolution Pipeline Works
```
Execution traces (from Claude Code, Gemini, Hermes)
│
▼
GEPA Optimizer ──► Mutates prompts/skills based on failure reasons
│
▼
Candidate variants (e.g., 10 mutated skill descriptions)
│
▼
Constraint gates:
• Full test suite (pytest)
• Size limits (≤15KB for skills)
• Semantic preservation
│
▼
Best variant ──► Creates PR for hermes-agent repo
```
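The flow in the diagram can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the project's implementation: `evolve`, `mutate`, `score`, and `gates` are hypothetical names, with `mutate` standing in for the GEPA optimizer and `score` for the evaluation step.

```python
from typing import Callable, Iterable

Gate = Callable[[str], bool]


def evolve(
    seed: str,
    mutate: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    gates: Iterable[Gate],
    iterations: int = 3,
) -> str:
    """Return the best-scoring text seen that passes every gate.

    Hypothetical sketch: `mutate` proposes variants of the current best
    text, `score` rates a variant, and each gate is a pass/fail check.
    """
    gates = list(gates)
    best, best_score = seed, score(seed)
    for _ in range(iterations):
        improved = False
        for candidate in mutate(best):
            # A candidate that fails any gate is discarded outright.
            if not all(gate(candidate) for gate in gates):
                continue
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # no passing candidate beat the current best
    return best
```

Because the seed is only replaced by a candidate that both passes every gate and scores higher, the loop returns the original text unchanged when no such candidate turns up.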
### On Your VPS: Step-by-Step
#### 1. Install Hermes Agent Self-Evolution
```bash
# Clone the repo
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution
# Install with dev dependencies
pip install -e ".[dev]"
# Point at your existing hermes-agent repo
export HERMES_AGENT_REPO=/path/to/your/hermes-agent
```
#### 2. Import real execution traces from Claude Code & Gemini
```bash
# The repo includes importers for:
# - Claude Code sessions
# - Gemini CLI history
# - Hermes' own history
python -m evolution.importers.import_claude_sessions \
  --claude-log-dir ~/.claude/logs \
  --output ./datasets/session_db
# Similarly for Gemini
python -m evolution.importers.import_gemini_sessions \
  --gemini-history ~/.gemini/history.json \
  --output ./datasets/session_db
```
#### 3. Run skill evolution using real session data
```bash
# Evolve a specific skill using imported session history
python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10 \
  --eval-source sessiondb \
  --session-db ./datasets/session_db
```
The optimizer will:
- Read execution traces where the `github-code-review` skill failed or succeeded
- Generate 10–20 mutated variants of the skill's `SKILL.md`
- Evaluate each variant against your tests
- Select the best performer that passes all guardrails
- Open a PR against your hermes-agent repo
#### 4. (Optional) Try synthetic evaluation first
```bash
# Quick test without real traces
python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 5 \
  --eval-source synthetic
```
---
## Why It Worked
### 1. **Failure-driven mutation is more efficient than random search**
GEPA doesn't mutate blindly. It parses execution traces to identify *root causes*, for example:
- "Tool description too vague → model didn't call it"
- "Skill exceeds context limit → got truncated"
- "Example format inconsistent → parsing failed"
Then it mutates specifically to address those causes.
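One way to picture this step is to count failure reasons per skill and turn the most frequent ones into a short brief for the mutation prompt. The sketch below is illustrative only: the `Trace` record, its fields, and both helper functions are assumptions, not the project's data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """Hypothetical normalized trace record."""
    skill: str
    succeeded: bool
    failure_reason: str = ""  # assumed label, empty for successes


def top_failure_reasons(traces, skill, limit=3):
    """Count failure reasons for one skill, most frequent first."""
    counts = Counter(
        t.failure_reason
        for t in traces
        if t.skill == skill and not t.succeeded and t.failure_reason
    )
    return counts.most_common(limit)


def mutation_brief(traces, skill):
    """Build a plain-text brief that a mutation prompt could include."""
    lines = [f"Observed failures for skill '{skill}':"]
    for reason, count in top_failure_reasons(traces, skill):
        lines.append(f"- {reason} ({count}x)")
    return "\n".join(lines)
```

Ranking by frequency keeps the brief focused on the causes that account for most failures, which is the sense in which the mutation is targeted rather than random.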
### 2. **API-driven means no GPU bottleneck**
Your Ubuntu VPS doesn't need expensive hardware. Each run:
- Calls Claude/Gemini APIs for mutations (pennies)
- Evaluates variants using existing test suite (CPU only)
- Total cost: **$2–10 per optimization run**
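For budgeting, a back-of-envelope estimate can help. The function below is a rough sketch, and every input is an assumption you would replace with your own numbers: it supposes one mutation call per candidate plus one LLM-scored evaluation call per candidate and evaluation case, all priced at a single blended token rate.

```python
def estimate_run_cost(
    candidates: int,
    eval_cases: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
) -> float:
    """Rough API cost of one run, in USD.

    Assumes one mutation call per candidate plus one evaluation call per
    (candidate, eval case) pair, all priced at one blended rate.
    """
    calls = candidates + candidates * eval_cases
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Illustrative inputs only, not measured values or real provider prices.
cost = estimate_run_cost(
    candidates=20, eval_cases=30, tokens_per_call=3000, usd_per_million_tokens=5.0
)
print(f"${cost:.2f}")  # prints "$9.30"
```

The evaluation term dominates, so trimming the number of evaluation cases or candidates is the most direct way to lower the cost of a run.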
### 3. **Multiple trace sources create richer evolution**
Claude Code, Gemini CLI, and Hermes Agent each produce traces with different characteristics:
- Claude: verbose, good at multi-step planning
- Gemini: fast, sometimes truncates
- Hermes: function-calling specific
Combining traces from all three covers more failure modes than any single source.
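To check whether a second or third trace source actually adds coverage, you could map each failure category to the sources that reported it. This is a sketch under assumptions: the `(source, category)` pair shape and the category names are hypothetical.

```python
from collections import defaultdict


def failure_coverage(traces):
    """Map each failure category to the set of sources that reported it.

    `traces` is an iterable of (source, failure_category) pairs; the pair
    shape is an assumption made for this illustration.
    """
    coverage = defaultdict(set)
    for source, category in traces:
        coverage[category].add(source)
    return dict(coverage)


def single_source_categories(coverage):
    """Categories that only one source surfaced, sorted for stable output."""
    return sorted(c for c, sources in coverage.items() if len(sources) == 1)
```

Categories that appear under only one source are the ones you would lose by dropping that source from the import step.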
### 4. **Guardrails prevent regressions**
Every candidate must pass:
- Full `pytest` suite (your existing tests)
- Size limits (prevents bloat)
- Semantic drift checks (no changing skill's core purpose)
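This means an evolved version only reaches a PR if it scores higher on the evaluation while still passing the tests and staying within the size and purpose constraints, so it is an improvement on the metrics you care about and not just a different wording.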
---
## Apply It Yourself
### Prerequisites
- Ubuntu VPS (20.04+)
- Hermes Agent installed and tested
- Claude Code CLI + Gemini CLI (optional but valuable)
- An API key for at least one LLM provider (OpenAI, Anthropic, or Google Gemini)
- Python 3.10+
### Configuration Checklist
```bash
# 1. Set an API key for at least one provider (used for mutation and evaluation)
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="..."
# or
export GOOGLE_API_KEY="..."
# 2. Point at your hermes-agent repo
export HERMES_AGENT_REPO="$HOME/hermes-agent"
# 3. Run a single skill evolution as a test
python -m evolution.skills.evolve_skill \
  --skill your-skill-name \
  --iterations 3 \
  --eval-source synthetic
# 4. Inspect the generated candidates
ls evolution/reports/latest/candidates/
```
### Typical Workflow for Your Team
1. **Weekly evolution run** against last 7 days of Claude Code + Gemini traces
2. **Human reviews** the generated PR (the guardrails ensure it passes tests, but the semantics still need a human check)
3. **Merge** the evolved skill back to your main branch
4. **Measure** improvement via before/after benchmarks (the tool generates comparison reports)
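If you want to sanity-check the comparison yourself, the arithmetic is simple. The sketch below assumes you have a list of pass/fail outcomes for the same evaluation cases before and after the merge; the function names and the result keys are hypothetical and unrelated to the tool's own report format.

```python
def success_rate(outcomes):
    """Fraction of truthy outcomes; 0.0 for an empty list."""
    outcomes = list(outcomes)
    return sum(1 for o in outcomes if o) / len(outcomes) if outcomes else 0.0


def compare(before, after):
    """Return before/after success rates and the change in percentage points."""
    b, a = success_rate(before), success_rate(after)
    return {"before": b, "after": a, "delta_points": round((a - b) * 100, 1)}


# Illustrative outcomes only: 6/10 passes before, 8/10 after.
report = compare([True] * 6 + [False] * 4, [True] * 8 + [False] * 2)
print(report["delta_points"])  # prints 20.0
```

Reporting the change in percentage points on a fixed set of cases avoids overstating gains that come from an easier evaluation set.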
### Expected Outcomes
| Metric | Typical Improvement |
|--------|---------------------|
| Skill success rate | +15–30% |
| Token efficiency | 10–20% fewer tokens (shorter, more precise prompts) |
| Edge case handling | +40–60% (from failure traces) |
---
## Source
- **GitHub Repository**: [NousResearch/hermes-agent-self-evolution](https://github.com/NousResearch/hermes-agent-self-evolution)
- **Research Paper**: ICLR 2026 Oral (see repo for citation)
- **Full Architecture**: [PLAN.md](https://github.com/NousResearch/hermes-agent-self-evolution/blob/main/PLAN.md) in the repo