Why Measure Performance?

Without measurement, you can’t answer:
  • Which agents deliver value?
  • Where are bottlenecks?
  • Is quality improving over time?
  • What’s the ROI of agent investment?

Git-Based Metrics

Git provides a natural audit trail for agent work. Every commit, PR, and change is tracked.

Core Metrics

| Metric | Description | Calculation |
|---|---|---|
| Commits/day | Agent activity level | git log --author="agent" --since="1 day" |
| PR merge rate | Quality of output | Merged PRs / Total PRs |
| Time to merge | Review efficiency | PR created → PR merged |
| Lines changed | Scope of work | Additions + Deletions |
| Revert rate | Error frequency | Reverts / Total commits |
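
Most of these come straight from git log; time to merge needs PR data. A minimal sketch using the GitHub CLI and jq (the --limit of 50 is an arbitrary sample size, adjust it to your repo):
# Average time to merge (hours) for recently merged PRs — requires gh and jq
gh pr list --state merged --limit 50 --json createdAt,mergedAt --jq \
  '[.[] | ((.mergedAt | fromdate) - (.createdAt | fromdate)) / 3600] | add / length'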

Tracking Agent Commits

Tag agent commits with consistent metadata:
git commit -m "feat: implement user auth

🤖 Generated with [Agents Squads](https://agents-squads.com)

Co-Authored-By: agents-squads <[email protected]>
Agent: auth-implementer
Squad: engineering
Duration: 45m
Tokens: 125,000"
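
If your metrics depend on these trailers, a lightweight commit-msg hook can warn when they are missing. A sketch, assuming agent commits always contain the agents-squads marker from the template above (place it at .git/hooks/commit-msg and make it executable):
#!/bin/bash
# .git/hooks/commit-msg — warn when an agent commit lacks the metadata trailers
msg_file="$1"
if grep -q "agents-squads" "$msg_file"; then
  for field in "Agent:" "Squad:"; do
    grep -q "^$field" "$msg_file" || echo "warning: agent commit missing '$field' trailer" >&2
  done
fi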

Query Agent Performance

# Commits by agent in last week
git log --since="1 week" --grep="Agent:" --oneline | wc -l

# Agent-specific commits
git log --author="agents-squads" --since="1 month" --stat

# Commits per agent
git log --since="1 month" --grep="Agent:" --format="%s" | \
  grep -oP "Agent: \K[^\n]+" | sort | uniq -c | sort -rn

# Revert rate
echo "scale=2; $(git log --grep="Revert" --since="1 month" | wc -l) / $(git log --since="1 month" | wc -l)" | bc

Performance Dashboard

Using squads CLI

# Overall status
squads dashboard

# Agent-specific metrics
squads feedback stats

# View execution history
squads memory show engineering

Custom Metrics Script

#!/bin/bash
# scripts/agent-metrics.sh

SINCE="${1:-1 week}"
AGENT_EMAIL="[email protected]"

echo "=== Agent Performance Report ==="
echo "Period: $SINCE"
echo ""

# Total commits
commits=$(git log --author="$AGENT_EMAIL" --since="$SINCE" --oneline | wc -l)
echo "Total commits: $commits"

# PRs created (requires gh cli; raise --limit if the period has more PRs)
prs_created=$(gh pr list --author="@me" --state=all --limit 500 --json createdAt --jq "length")
prs_merged=$(gh pr list --author="@me" --state=merged --limit 500 --json mergedAt --jq "length")
echo "PRs created: $prs_created"
echo "PRs merged: $prs_merged"
if [ "$prs_created" -gt 0 ]; then
  echo "Merge rate: $(echo "scale=2; $prs_merged / $prs_created * 100" | bc)%"
fi

# Lines changed
git log --author="$AGENT_EMAIL" --since="$SINCE" --numstat --pretty="" | \
  awk '{add+=$1; del+=$2} END {print "Lines added:", add, "deleted:", del}'

# By squad
echo ""
echo "=== By Squad ==="
git log --since="$SINCE" --grep="Squad:" --format="%s" | \
  grep -oP "Squad: \K\w+" | sort | uniq -c | sort -rn

Quality Metrics

Code Review Scores

Track review feedback on agent PRs:
## PR Review Template

### Agent Output Quality
- [ ] Code is correct
- [ ] Code follows conventions
- [ ] Tests included
- [ ] Documentation updated
- [ ] No security issues

**Score**: 4/5
**Notes**: Minor style issues, otherwise good.
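
If reviewers fill in the Score line, you can aggregate it across recent PRs. A rough sketch, assuming the filled-in template is posted as a PR comment (requires gh; adjust --limit and the pattern to your workflow):
# Average review score parsed from "**Score**: N/5" lines in PR comments
gh pr list --state merged --limit 20 --json number --jq '.[].number' | while read -r pr; do
  gh pr view "$pr" --json comments --jq '.comments[].body' | grep -oP '\*\*Score\*\*: \K[0-9]'
done | awk '{sum+=$1; n++} END {if (n) printf "Avg review score: %.1f/5 across %d scored PRs\n", sum/n, n}'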

Automated Quality Checks

# .github/workflows/agent-quality.yml
name: Agent Quality Check

on:
  pull_request:
    branches: [main]

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # fetch both parents of the PR merge commit

      - name: Check if agent PR
        id: check
        run: |
          # Inspect the PR head commit (HEAD^2 of the merge commit) for the agent marker
          if git log -1 --format="%B" HEAD^2 | grep -q "agents-squads"; then
            echo "is_agent=true" >> $GITHUB_OUTPUT
          fi

      - name: Run quality metrics
        if: steps.check.outputs.is_agent == 'true'
        run: |
          # Lint check
          npm run lint

          # Test coverage
          npm test -- --coverage

          # Complexity check
          npx complexity-report src/

      - name: Post metrics to PR
        if: steps.check.outputs.is_agent == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '## Agent Quality Report\n...'
            })

Benchmarking

Task Completion Benchmarks

Track how long standard tasks take:
| Task Type | Target | Current Avg |
|---|---|---|
| Bug fix | < 30 min | 25 min |
| Feature (small) | < 2 hours | 1.5 hours |
| Feature (medium) | < 1 day | 6 hours |
| Refactor | < 4 hours | 3 hours |
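
Because the commit template records a Duration trailer, these averages can be pulled from history. A sketch, assuming durations are always logged in whole minutes (e.g. Duration: 45m):
# Average task duration (minutes) from "Duration: <N>m" commit trailers
git log --since="1 month" --grep="Duration:" --format="%b" | \
  grep -oP 'Duration: \K[0-9]+(?=m)' | \
  awk '{sum+=$1; n++} END {if (n) printf "Avg duration: %.0f min across %d tasks\n", sum/n, n}'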

A/B Testing Agents

Compare different agent configurations:
## Experiment: Prompt Optimization

**Hypothesis**: Adding domain context improves code quality

**Agent A**: Base prompt
**Agent B**: Base prompt + CLAUDE.md context

**Metrics**:
- PR merge rate
- Review iterations needed
- Time to completion

**Results**:
- Agent A: 75% merge rate, 2.3 iterations
- Agent B: 92% merge rate, 1.1 iterations

**Conclusion**: Domain context significantly improves quality
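
One way to pull the per-configuration numbers, assuming each experiment arm's PRs carry a distinguishing label (agent-a and agent-b here are hypothetical labels, not something the template defines):
# Merge rate per experiment arm — requires gh; labels are assumptions
for label in agent-a agent-b; do
  total=$(gh pr list --label "$label" --state all --limit 200 --json number --jq 'length')
  merged=$(gh pr list --label "$label" --state merged --limit 200 --json number --jq 'length')
  [ "$total" -gt 0 ] && echo "$label: $(echo "scale=2; $merged / $total * 100" | bc)% merge rate ($merged/$total)"
done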

Monitoring & Alerts

Performance Thresholds

# .agents/performance.yml
thresholds:
  merge_rate:
    warning: 0.80
    critical: 0.60

  revert_rate:
    warning: 0.05
    critical: 0.10

  avg_review_iterations:
    warning: 3
    critical: 5

  tokens_per_task:
    warning: 100000
    critical: 200000
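
A small script can compare these configured values against live numbers before full alerting is wired up. A sketch using yq and the GitHub CLI (the agents-squads author filter and --limit are assumptions; the config path matches the file above):
#!/bin/bash
# Compare current merge rate against .agents/performance.yml thresholds
critical=$(yq '.thresholds.merge_rate.critical' .agents/performance.yml)
created=$(gh pr list --author="agents-squads" --state=all --limit 200 --json number --jq 'length')
merged=$(gh pr list --author="agents-squads" --state=merged --limit 200 --json number --jq 'length')
[ "$created" -gt 0 ] || exit 0
rate=$(echo "scale=2; $merged / $created" | bc)
if (( $(echo "$rate < $critical" | bc -l) )); then
  echo "CRITICAL: merge rate $rate is below threshold $critical"
fi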

Alerting

// Alert on performance degradation.
// getAgentMetrics and alert are project-specific helpers; THRESHOLDS mirrors
// .agents/performance.yml above.
async function checkPerformance() {
  const metrics = await getAgentMetrics('last_week');

  if (metrics.mergeRate < THRESHOLDS.merge_rate.critical) {
    await alert({
      level: 'critical',
      message: `Agent merge rate dropped to ${metrics.mergeRate}`,
      squad: metrics.squad
    });
  }

  if (metrics.revertRate > THRESHOLDS.revert_rate.warning) {
    await alert({
      level: 'warning',
      message: `High revert rate: ${metrics.revertRate}`,
      squad: metrics.squad
    });
  }
}

Feedback Loop

Recording Feedback

# After agent task completion
squads feedback add engineering

# Prompts for:
# - Task success (yes/no)
# - Quality score (1-5)
# - Issues encountered
# - Improvement suggestions

Using Feedback

# View feedback trends
squads feedback stats

# Output:
# Squad: engineering
# Tasks: 45
# Success rate: 91%
# Avg quality: 4.2/5
# Common issues:
#   - Missing tests (8 occurrences)
#   - Style inconsistencies (5 occurrences)

Continuous Improvement

Measure → Analyze → Improve → Repeat
    │         │         │
    │         │         └── Update prompts, add context
    │         └── Identify patterns in failures
    └── Track metrics over time

Best Practices

  • Tag all agent commits with consistent metadata
  • Track metrics weekly, review monthly
  • Set quality thresholds and alert on breaches
  • A/B test prompt and configuration changes
  • Record human feedback after task completion
  • Use metrics to guide agent improvements

Measurement pitfalls:
  • Optimizing for vanity metrics (commits ≠ value)
  • Ignoring quality in favor of speed
  • Not accounting for task difficulty
  • Missing the feedback loop