Evaluate AI agent performance with benchmarks and metrics
You are an AI systems evaluator specializing in agent performance assessment. The user wants to set up comprehensive benchmarking and metric collection for AI agents using industry-standard frameworks.
What to check first
- Run `pip list | grep -E "langchain|anthropic|openai"` to verify agent framework dependencies are installed
- Check that you have access to a benchmark dataset (e.g., the `datasets` library or local JSON files with test cases)
- Verify your API keys are set: `echo $ANTHROPIC_API_KEY` or `echo $OPENAI_API_KEY`
Steps
- Install evaluation dependencies: `pip install langchain anthropic openai datasets ragas arize-utils`
- Define your evaluation metrics as a dictionary mapping metric names to scoring functions (accuracy, latency, token efficiency, hallucination rate)
- Create a test harness that runs your agent against a standardized benchmark dataset with identical inputs
- Implement metric collectors using context managers to capture execution time, token counts, and error states during agent runs
- Build a results aggregator that computes min/max/mean/std for each metric across all test cases
- Add a comparison function to track performance deltas between agent versions or configurations
- Export results to JSON or CSV with timestamps for trend analysis and regression detection
- Implement a scoring rubric that weights metrics (e.g., 40% accuracy + 30% speed + 20% cost + 10% safety) into a composite score
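The aggregation, comparison, and scoring-rubric steps above can be sketched roughly as follows, assuming per-test-case metrics have already been collected as plain numbers. The function names (`aggregate`, `compare_runs`, `composite_score`) and the metric keys are hypothetical, chosen for illustration; the weights are the example weights from the rubric, and the sketch assumes each metric has been normalized to the 0-1 range with higher meaning better.

```python
import statistics
from typing import Dict, List


def aggregate(values: List[float]) -> Dict[str, float]:
    """Return min/max/mean/std for one metric across all test cases."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # pstdev also works for a single sample; stdev raises for fewer than 2.
        "std": statistics.pstdev(values),
    }


def compare_runs(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-metric delta (candidate minus baseline) for metrics present in both runs."""
    return {name: candidate[name] - baseline[name] for name in baseline if name in candidate}


# Example weights from the rubric above (assumed to sum to 1.0).
WEIGHTS = {"accuracy": 0.4, "speed": 0.3, "cost": 0.2, "safety": 0.1}


def composite_score(normalized: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of metrics already normalized to [0, 1], higher is better."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * normalized[name] for name in weights)


# Usage with made-up numbers.
summary = aggregate([120.0, 80.0, 100.0])
delta = compare_runs(
    {"accuracy": 0.80, "mean_latency_ms": 100.0},
    {"accuracy": 0.85, "mean_latency_ms": 90.0},
)
score = composite_score({"accuracy": 0.9, "speed": 0.5, "cost": 0.5, "safety": 1.0})
```

Normalizing before weighting matters: raw latency in milliseconds and accuracy as a fraction are on different scales, so a weighted sum of raw values would be dominated by whichever metric has the largest numbers.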
Code
import json
import time
import statistics
from typing import Any, Callable, Dict, List, Optional
from dataclasses import dataclass, asdict
from datetime import datetime

import anthropic


@dataclass
class EvaluationMetrics:
    """Metrics captured for a single test case."""
    test_case_id: str
    agent_response: str
    execution_time_ms: float
    token_count: int
    is_correct: bool
    error_occurred: bool
    error_message: Optional[str] = None
    timestamp: Optional[str] = None


class AgentEvaluator:
    def __init__(self, scoring_functions: Dict[str, Callable]):
        # The client reads ANTHROPIC_API_KEY from the environment.
        self.client = anthropic.Anthropic()
        self.scoring_functions = scoring_functions
        self.results: List[EvaluationMetrics] = []

    def evaluate_agent(self, test_cases: List[Dict[str, str]]) -> Dict[str, Any]:
        """Run agent against benchmark test cases and collect metrics."""
        for idx, test_case in enumerate(test_cases):
            metrics = self._run_single_test(test_case, idx)
            self.results.append(metrics)
        return self._compute_aggregate_metrics()

    def _run_single_test(self, test_case: Dict, case_id: int) -> EvaluationMetrics:
        """Execute single test case and capture metrics."""
        start_time = time.time()
        error_msg = None
        response_text = ""
        token_count =
Note: this example was truncated in the source. See the GitHub repo for the latest full version.
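Because the listing above is truncated, the following is only a sketch of how a single test case could be timed and scored; it is not the author's missing code. It uses a context manager for timing, as the Steps suggest, and a hypothetical `fake_agent` stub in place of a real model call so that it runs without an API key. The names `timed` and `run_single_test`, the `input` and `expected` keys, and the whitespace-split token count are all assumptions.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict


@contextmanager
def timed(record: Dict[str, Any]):
    """Store elapsed wall-clock milliseconds in record['execution_time_ms']."""
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["execution_time_ms"] = (time.perf_counter() - start) * 1000.0


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent or model call."""
    if not prompt:
        raise ValueError("empty prompt")
    return "4" if prompt == "What is 2 + 2?" else "unknown"


def run_single_test(agent: Callable[[str], str], test_case: Dict[str, str], case_id: int) -> Dict[str, Any]:
    """Run one test case, capturing output, timing, and any error."""
    result: Dict[str, Any] = {
        "test_case_id": str(case_id),
        "agent_response": "",
        "token_count": 0,
        "is_correct": False,
        "error_occurred": False,
        "error_message": None,
    }
    with timed(result):
        try:
            result["agent_response"] = agent(test_case["input"])
            # Rough proxy only; a real harness would read token usage
            # from the model API response instead.
            result["token_count"] = len(result["agent_response"].split())
            result["is_correct"] = result["agent_response"].strip() == test_case["expected"]
        except Exception as exc:
            # Record the failure instead of aborting the whole benchmark run.
            result["error_occurred"] = True
            result["error_message"] = str(exc)
    return result


ok = run_single_test(fake_agent, {"input": "What is 2 + 2?", "expected": "4"}, 0)
failed = run_single_test(fake_agent, {"input": "", "expected": "4"}, 1)
```

Catching the exception inside the timed block means a failing test case still gets a duration and an error record, so one bad case does not hide the metrics for the rest of the dataset.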
Common Pitfalls
- Letting agents loop indefinitely without a hard step limit: set `max_iterations` to 10-20 for most workflows
- Passing the entire conversation history every iteration: costs explode. Use summarization or a sliding window
- Not validating tool outputs before passing them to the next step — one bad output corrupts the entire chain
- Trusting the agent's self-evaluation — agents are notoriously bad at knowing when they're wrong
- Forgetting that agents can hallucinate tool calls that don't exist — always validate tool names against your registry
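Three of the pitfalls above (no step limit, unvalidated tool outputs, hallucinated tool names) can be guarded against in the loop that executes tool calls. The sketch below assumes a hypothetical `TOOLS` registry and a pre-built `plan` list standing in for the tool calls a model would propose one at a time; none of these names come from a specific agent framework.

```python
from typing import Callable, Dict, List, Tuple

MAX_ITERATIONS = 10  # hard step limit, within the 10-20 range suggested above

# Hypothetical tool registry: tool name -> callable.
TOOLS: Dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}


def run_agent_loop(plan: List[Tuple[str, str]], max_iterations: int = MAX_ITERATIONS) -> List[str]:
    """Execute (tool_name, argument) steps with a step cap and registry validation."""
    outputs: List[str] = []
    for step, (tool_name, argument) in enumerate(plan):
        if step >= max_iterations:
            raise RuntimeError(f"step limit of {max_iterations} reached")
        if tool_name not in TOOLS:
            # Reject hallucinated tool names instead of failing later in the chain.
            raise KeyError(f"unknown tool: {tool_name}")
        output = TOOLS[tool_name](argument)
        # Validate each output before it can feed the next step.
        if not isinstance(output, str) or not output:
            raise ValueError(f"tool {tool_name} returned an invalid output")
        outputs.append(output)
    return outputs


outputs = run_agent_loop([("echo", "hi"), ("upper", "hi")])
```

Raising on the first violation is a deliberate choice here: for benchmarking, a hard failure is easier to count and debug than a run that silently continues with corrupted state.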
When NOT to Use This Skill
- When a single LLM call would suffice — agents add 5-10x latency and cost
- When the task has well-defined steps that don't need branching logic — use a workflow engine instead
- For high-stakes decisions without human review — agents make confident mistakes
How to Verify It Worked
- Run the agent on 10+ test cases including edge cases — track success rate, average steps, and total cost
- Compare agent output to human baseline — if a human can do it faster and cheaper, you don't need an agent
- Inspect the full reasoning trace, not just the final output — agents often arrive at correct answers via wrong reasoning
Production Considerations
- Set hard cost ceilings per agent run — a runaway agent can burn $50+ in minutes
- Log every tool call, every model call, every state transition — debugging agents without logs is impossible
- Have a kill switch — agents should be cancelable mid-run without corrupting state
- Monitor token usage trends — context bloat is the #1 cause of agent cost overruns
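A cost ceiling and a kill switch can share one small guard object that the agent loop consults around every model call. This is a sketch under stated assumptions: `RunGuard` and `BudgetExceeded` are hypothetical names, and the per-1k-token prices are placeholders rather than real provider pricing.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a run has gone past its cost ceiling."""


class RunGuard:
    """Tracks spend for one agent run and exposes a cancel flag (kill switch)."""

    def __init__(self, max_cost_usd: float):
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0
        # threading.Event lets another thread request cancellation safely.
        self.cancelled = threading.Event()

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_input: float, usd_per_1k_output: float) -> None:
        """Record the cost of one completed model call; raise once over the ceiling."""
        cost = (input_tokens / 1000 * usd_per_1k_input
                + output_tokens / 1000 * usd_per_1k_output)
        self.spent_usd += cost
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f}, ceiling ${self.max_cost_usd:.4f}"
            )

    def check_cancelled(self) -> None:
        """Call before each step so a run stops at a step boundary."""
        if self.cancelled.is_set():
            raise RuntimeError("run cancelled")


# Usage with placeholder prices.
guard = RunGuard(max_cost_usd=0.04)
guard.charge(1000, 500, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
try:
    guard.charge(1000, 500, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
    exceeded = False
except BudgetExceeded:
    exceeded = True
```

Checking the cancel flag only between steps, rather than interrupting a step midway, is one way to keep cancellation from corrupting state. Because token counts are only known after a call returns, this guard can overshoot the ceiling by at most one call.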