> agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics

fetch

$ curl "https://skillshub.wtf/affaan-m/everything-claude-code/agent-eval?format=md"

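Agent Eval Skill

A lightweight CLI tool for head-to-head comparison of coding agents on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes; this tool makes it systematic.

When to Use

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Making data-backed agent selection decisions for a team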

Installation

# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to modify, and how success is judged:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility

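Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. This provides reproducible isolation: agents cannot interfere with one another or corrupt the base repository.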
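
The isolation step can be sketched in Python with plain `git worktree` commands. This is a minimal sketch under stated assumptions, not agent-eval's actual implementation: the helper names `worktree_add_command`, `create_worktree`, and `remove_worktree` are hypothetical, and `git` is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, dest: Path, commit: str) -> list[str]:
    """Build the git command that checks out `commit` into a fresh worktree."""
    # --detach avoids creating or moving a branch in the base repository.
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), commit]


def create_worktree(repo: Path, dest: Path, commit: str) -> Path:
    """Create the worktree and return its path; raises if git fails."""
    subprocess.run(worktree_add_command(repo, dest, commit), check=True)
    return dest


def remove_worktree(repo: Path, dest: Path) -> None:
    """Delete the worktree once the run has been judged."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(dest)],
        check=True,
    )
```

Because every run starts from the same pinned commit in a separate directory, edits made by one agent never appear in another agent's checkout.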

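Metrics Collected

Metric	What it measures
Pass rate	Did the agent's code pass the judges?
Cost	API spend per task (when available)
Time	Wall-clock seconds to completion
Consistency	Pass rate across repeated runs (e.g., 3/3 = 100%)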
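
To make the four metrics concrete, here is a minimal sketch of how per-run records could be aggregated. The `RunResult` record and `summarize` function are hypothetical names, and averaging cost and time across runs is an assumption; the tool may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(r.passed for r in runs)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": mean(costs) if costs else None,  # assumption: averaged
        "mean_seconds": mean(r.seconds for r in runs),    # assumption: averaged
    }
```

With two passing runs out of three, this yields a pass rate of `2/3` and a consistency of 67%.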

Workflow

1. Define tasks

Create a tasks/ directory containing YAML files, one file per task:

mkdir tasks
# Write task definitions (see template above)

2. Run agents

Execute the agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judge criteria
Records pass/fail, cost, and time
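
The four steps of a run can be sketched as a single function with the moving parts injected as callables. This is an illustrative sketch only: `run_trial` and its parameters are hypothetical, and timing only the agent step (not the judges) is an assumption about what "time" covers.

```python
import time
from typing import Callable, Optional


def run_trial(
    make_workspace: Callable[[], str],
    run_agent: Callable[[str], Optional[float]],
    judges: list[Callable[[str], bool]],
) -> dict:
    """Run one agent once and return its pass/fail, cost, and wall-clock time."""
    workspace = make_workspace()           # 1. fresh worktree at the pinned commit
    start = time.perf_counter()
    cost = run_agent(workspace)            # 2. hand the prompt to the agent
    seconds = time.perf_counter() - start  #    wall-clock time of the agent step
    passed = all(judge(workspace) for judge in judges)  # 3. run every judge
    return {"passed": passed, "cost_usd": cost, "seconds": seconds}  # 4. record
```

Requiring every judge to pass is also an assumption; it matches the intuition that a run with green tests but a missing pattern has not completed the task.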

3. Compare results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-based (deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
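
A deterministic judge of this kind reduces to "run the command and check the exit code". The sketch below assumes that a zero exit code means pass, which is how `pytest` reports an all-green test run; the function name `command_judge` is hypothetical and not part of the tool's API.

```python
import shlex
import subprocess


def command_judge(command: str, cwd: str = ".") -> bool:
    """Run `command` inside `cwd`; the judge passes when the exit code is 0."""
    try:
        result = subprocess.run(
            shlex.split(command),  # split like a POSIX shell, without invoking one
            cwd=cwd,
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The executable is missing, so the judge cannot pass.
        return False
    return result.returncode == 0
```

Calling `command_judge("pytest tests/ -v", cwd=worktree_path)` would then evaluate the first judge above inside the agent's worktree, where `worktree_path` stands for whatever directory the run used.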

Pattern-based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
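
A pattern judge can be approximated with a regex search over the files matched by the glob. In this sketch the name `grep_judge` is hypothetical, and passing as soon as any matching file contains the pattern is an assumption about the tool's semantics.

```python
import re
from pathlib import Path


def grep_judge(pattern: str, files: str, root: str = ".") -> bool:
    """Pass when any file matching the `files` glob contains a regex match."""
    regex = re.compile(pattern)
    for path in Path(root).glob(files):  # "**" recurses into subdirectories
        if path.is_file() and regex.search(path.read_text(errors="ignore")):
            return True
    return False
```

Pattern judges are cheap, but they only confirm that certain text exists, so they work best alongside a test-based judge that checks behavior.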

Model-based (LLM as judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.

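Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run at least 3 trials per agent to capture variance; agents are non-deterministic
Pin the commit in your task YAML so results stay reproducible across days and weeks
Include at least one deterministic judge (tests, build) per task; LLM judges add noise
Track cost alongside pass rate; an agent with a 95% pass rate at 10x the cost may not be the right choice
Version your task definitions; they are test fixtures, so treat them as code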
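
One way to weigh cost against pass rate is the expected spend per passing run. The sketch below is an illustration rather than a metric the tool reports: `cost_per_success` is a hypothetical helper, and it assumes the reported cost is a mean per run.

```python
def cost_per_success(mean_cost_usd: float, passes: int, runs: int) -> float:
    """Expected spend to obtain one passing run, given the observed pass rate."""
    if passes == 0:
        return float("inf")  # no successes observed, so no finite estimate
    return mean_cost_usd * runs / passes


# Figures from the sample report: claude-code 3/3 at $0.12, aider 2/3 at $0.08.
print(round(cost_per_success(0.12, 3, 3), 2))
print(round(cost_per_success(0.08, 2, 3), 2))
```

On those figures both agents land at about $0.12 per passing run, so the cheaper per-run price of the second agent is offset by its failed attempt.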

Links

Repository: github.com/joaquinhuigomez/agent-eval

Stats

Installs per week: 2.3K
GitHub stars: 237.1K
First seen: Mar 20, 2026

Source repo: affaan-m/everything-claude-code (by affaan-m)