eval

Name: eval
Author: alirezarezvani

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

Ranks all agent results for a session. Three modes are supported: metric-based evaluation (run a command in each agent's worktree), LLM judge (compare diffs), or a hybrid of the two.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
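
The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the actual logic of result_ranker.py: the function name rank_by_metric, the RankedResult type, and the "lower"/"higher" values for direction are assumptions. The 180 ms baseline is inferred from the sample output above (142 ms with a delta of -38 ms).

```python
from typing import NamedTuple


class RankedResult(NamedTuple):
    rank: int
    agent: str
    metric: float
    delta: float  # metric minus baseline; negative means the value went down


def rank_by_metric(values, baseline, direction="lower"):
    """Rank agents by a numeric metric (hypothetical helper).

    `direction` is assumed to be "lower" or "higher", naming which end is better.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(
        values.items(),
        key=lambda item: item[1],
        reverse=(direction == "higher"),
    )
    return [
        RankedResult(position, agent, value, value - baseline)
        for position, (agent, value) in enumerate(ordered, start=1)
    ]


# Metric values from the sample output; baseline inferred from its DELTA column.
ranking = rank_by_metric(
    {"agent-1": 165, "agent-2": 142, "agent-3": 190},
    baseline=180,
    direction="lower",
)
for row in ranking:
    print(f"{row.rank}  {row.agent}  {row.metric}ms  {row.delta:+}ms")
print(f"Winner: {ranking[0].agent} ({ranking[0].metric}ms)")
```

With direction="lower" the smallest value ranks first, which matches a latency-style metric like the one in the sample table.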

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: git diff {base_branch}...{agent_branch}
2. Read the agent's result post from .agenthub/board/results/agent-{i}-result.md

Then compare all diffs and rank by:
- Correctness: Does it solve the task?
- Simplicity: Fewer lines changed is better (when correctness is equal)
- Quality: Clean execution, good structure, no regressions

Present rankings with justification.
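
Gathering the judge's inputs for this mode could look roughly like the sketch below. The helper names (diff_command, result_post_path, gather_judge_inputs) are hypothetical, and the caller supplies the branch names because this document does not say how AgentHub names agent branches. Only the git diff form and the result post path come from the steps above.

```python
import subprocess
from pathlib import Path


def diff_command(base_branch, agent_branch):
    # Three-dot form: changes on the agent branch since it diverged from base.
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def result_post_path(repo_root, agent_index):
    # Path layout taken from the skill description.
    return (
        Path(repo_root) / ".agenthub" / "board" / "results"
        / f"agent-{agent_index}-result.md"
    )


def count_changed_lines(diff_text):
    # Rough count of added/removed lines, skipping the "+++"/"---" file headers.
    # Approximate: a changed line whose own content starts with "--" or "++"
    # would be skipped as well.
    return sum(
        1
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def gather_judge_inputs(repo_root, base_branch, agent_branches):
    """Collect diff and result post per agent (hypothetical helper).

    `agent_branches` maps agent index -> branch name (assumed input shape).
    """
    inputs = []
    for index, branch in sorted(agent_branches.items()):
        diff = subprocess.run(
            diff_command(base_branch, branch),
            cwd=repo_root,
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        post = result_post_path(repo_root, index)
        inputs.append({
            "agent": f"agent-{index}",
            "diff": diff,
            "lines_changed": count_changed_lines(diff),
            "result_post": post.read_text(encoding="utf-8") if post.exists() else "",
        })
    return inputs
```

The collected diffs and result posts are what the judge would then compare against the three criteria.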

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)
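
The judge's criteria imply an ordering: correctness first, with simplicity deciding between equally correct results. The sketch below expresses that as a sort key. The Verdict type and its numeric scores are hypothetical stand-ins for whatever the judge produces, and placing quality last is an assumption, since the skill does not say where it falls.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    agent: str
    correctness: int    # hypothetical 0-10 score assigned by the judge
    lines_changed: int  # size of the agent's diff
    quality: int        # hypothetical 0-10 score assigned by the judge


def judge_order(verdicts):
    # Higher correctness wins; fewer changed lines breaks correctness ties;
    # quality is consulted last (assumed ordering).
    return sorted(
        verdicts,
        key=lambda v: (-v.correctness, v.lines_changed, -v.quality),
    )


# Illustrative values only, not results from a real session.
example = [
    Verdict("agent-1", correctness=9, lines_changed=40, quality=7),
    Verdict("agent-2", correctness=9, lines_changed=12, quality=6),
    Verdict("agent-3", correctness=6, lines_changed=5, quality=9),
]
ordered = judge_order(example)
print([v.agent for v in ordered])
```

Here agent-2 ranks above agent-1 because both are equally correct and agent-2 changed fewer lines. agent-3 ranks last despite the smallest diff, because correctness is compared first.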

Hybrid Mode

1. Run metric evaluation first.
2. If the top agents are within 10% of each other, use the LLM judge to break the tie.
3. Present both the metric and the qualitative rankings.
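
The 10% check can be sketched as follows. "Within 10% of each other" is read here as "within 10% of the best metric value", which is one plausible interpretation and not confirmed by the skill. The function names and the "lower"/"higher" direction values are likewise assumptions.

```python
def close_contenders(values, direction="lower", tolerance=0.10):
    """Return agents whose metric lies within `tolerance` of the best, best first."""
    if not values:
        return []
    ordered = sorted(
        values.items(),
        key=lambda item: item[1],
        reverse=(direction == "higher"),
    )
    best = ordered[0][1]
    margin = abs(best) * tolerance
    return [agent for agent, value in ordered if abs(value - best) <= margin]


def needs_llm_judge(values, direction="lower", tolerance=0.10):
    # More than one agent inside the margin means the metric alone is not decisive.
    return len(close_contenders(values, direction, tolerance)) > 1


# Sample values from the metric-mode output: 165 is about 16% above 142, so no tie.
clear_win = {"agent-1": 165, "agent-2": 142, "agent-3": 190}
# Illustrative values only: 150 is within 10% of 142, so the judge would be consulted.
near_tie = {"agent-1": 150, "agent-2": 142, "agent-3": 190}

print(needs_llm_judge(clear_win), close_contenders(clear_win))
print(needs_llm_judge(near_tie), close_contenders(near_tie))
```

Only the agents returned by close_contenders would need to go to the judge; the rest keep their metric rank.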

After Eval

Update session state:

python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating

Tell the user:
- Ranked results with winner highlighted
- Next step: /hub:merge to merge the winner
- Or /hub:merge {session-id} --agent {winner} to be explicit

Quick Info

Category: AI/ML Integration
Difficulty: intermediate
Version: 1.0.0
Author: alirezarezvani
Tags: community, alirezarezvani, python
