Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
✓Works with OpenClaudeRank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
Usage
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
What It Does
Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
python {skill_path}/scripts/result_ranker.py \
--session {session-id} \
--eval-cmd "{eval_cmd}" \
--metric {metric} --direction {direction}
Output:
RANK AGENT METRIC DELTA FILES
1 agent-2 142ms -38ms 2
2 agent-1 165ms -15ms 3
3 agent-3 190ms +10ms 1
Winner: agent-2 (142ms)
LLM Judge Mode (no eval command, or --judge flag)
For each agent:
- Get the diff:
git diff {base_branch}...{agent_branch} - Read the agent's result post from
.agenthub/board/results/agent-{i}-result.md - Compare all diffs and rank by:
- Correctness — Does it solve the task?
- Simplicity — Fewer lines changed is better (when equal correctness)
- Quality — Clean execution, good structure, no regressions
Present rankings with justification.
Example LLM judge output for a content task:
RANK AGENT VERDICT WORD COUNT
1 agent-1 Strong narrative, clear CTA 1480
2 agent-3 Good data points, weak intro 1520
3 agent-2 Generic tone, no differentiation 1350
Winner: agent-1 (strongest narrative arc and call-to-action)
Hybrid Mode
- Run metric evaluation first
- If top agents are within 10% of each other, use LLM judge to break ties
- Present both metric and qualitative rankings
After Eval
- Update session state:
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
- Tell the user:
- Ranked results with winner highlighted
- Next step:
/hub:mergeto merge the winner - Or
/hub:merge {session-id} --agent {winner}to be explicit
Related AI/ML Integration Skills
Other Claude Code skills in the same category — free to download.
OpenAI Integration
Integrate OpenAI API with best practices
Claude API Setup
Set up Claude/Anthropic API integration
Embedding Search
Implement vector embedding search
RAG Pipeline
Build Retrieval-Augmented Generation pipeline
Prompt Template
Create reusable prompt templates with variables
AI Streaming
Implement streaming AI responses
LangChain Setup
Set up LangChain for AI workflows
Model Comparison
Compare responses from multiple AI models
Want a AI/ML Integration skill personalized to YOUR project?
This is a generic skill that works for everyone. Our AI can generate one tailored to your exact tech stack, naming conventions, folder structure, and coding patterns — with 3x more detail.