Why Measure Performance?

Without measurement, you can’t answer:
  • Which agents deliver value?
  • Where are bottlenecks?
  • Is quality improving over time?
  • What’s the ROI of agent investment?

Git-Based Metrics

Git provides a natural audit trail for agent work. Every commit, PR, and change is tracked.

Core Metrics

| Metric | Description | Calculation |
|---|---|---|
| Commits/day | Agent activity level | git log --author="agent" --since="1 day" |
| PR merge rate | Quality of output | Merged PRs / Total PRs |
| Time to merge | Review efficiency | PR created → PR merged |
| Lines changed | Scope of work | Additions + Deletions |
| Revert rate | Error frequency | Reverts / Total commits |
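
Most of these come straight from git log; time to merge needs PR data. A minimal sketch using the GitHub CLI and jq (the --limit of 50 is an arbitrary sample size, adjust it to your repo):
# Average time to merge (hours) for recently merged PRs — requires gh and jq
gh pr list --state merged --limit 50 --json createdAt,mergedAt --jq \
  '[.[] | ((.mergedAt | fromdate) - (.createdAt | fromdate)) / 3600] | add / length'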

Tracking Agent Commits

Tag agent commits with consistent metadata:
git commit -m "feat: implement user auth

🤖 Generated with [Agents Squads](https://agents-squads.com)

Co-Authored-By: agents-squads <[email protected]>
Agent: auth-implementer
Squad: engineering
Duration: 45m
Tokens: 125,000"
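
If your metrics depend on these trailers, a lightweight commit-msg hook can warn when they are missing. A sketch, assuming agent commits always contain the agents-squads marker from the template above (place it at .git/hooks/commit-msg and make it executable):
#!/bin/bash
# .git/hooks/commit-msg — warn when an agent commit lacks the metadata trailers
msg_file="$1"
if grep -q "agents-squads" "$msg_file"; then
  for field in "Agent:" "Squad:"; do
    grep -q "^$field" "$msg_file" || echo "warning: agent commit missing '$field' trailer" >&2
  done
fi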

Query Agent Performance

# Commits by agent in last week
git log --since="1 week" --grep="Agent:" --oneline | wc -l

# Agent-specific commits
git log --author="agents-squads" --since="1 month" --stat

# Commits per agent
git log --since="1 month" --grep="Agent:" --format="%s" | \
  grep -oP "Agent: \K[^\n]+" | sort | uniq -c | sort -rn

# Revert rate
echo "scale=2; $(git log --grep="Revert" --since="1 month" | wc -l) / $(git log --since="1 month" | wc -l)" | bc

Performance Dashboard

Using squads CLI

# Overall status
squads dashboard

# Agent-specific metrics
squads feedback stats

# View execution history
squads memory show engineering

Custom Metrics Script

#!/bin/bash
# scripts/agent-metrics.sh

SINCE="${1:-1 week}"
AGENT_EMAIL="[email protected]"

echo "=== Agent Performance Report ==="
echo "Period: $SINCE"
echo ""

# Total commits
commits=$(git log --author="$AGENT_EMAIL" --since="$SINCE" --oneline | wc -l)
echo "Total commits: $commits"

# PRs created (requires gh cli; raise --limit if the period has more PRs)
prs_created=$(gh pr list --author="@me" --state=all --limit 500 --json createdAt --jq "length")
prs_merged=$(gh pr list --author="@me" --state=merged --limit 500 --json mergedAt --jq "length")
echo "PRs created: $prs_created"
echo "PRs merged: $prs_merged"
if [ "$prs_created" -gt 0 ]; then
  echo "Merge rate: $(echo "scale=2; $prs_merged / $prs_created * 100" | bc)%"
fi

# Lines changed
git log --author="$AGENT_EMAIL" --since="$SINCE" --numstat --pretty="" | \
  awk '{add+=$1; del+=$2} END {print "Lines added:", add, "deleted:", del}'

# By squad
echo ""
echo "=== By Squad ==="
git log --since="$SINCE" --grep="Squad:" --format="%s" | \
  grep -oP "Squad: \K\w+" | sort | uniq -c | sort -rn

Quality Metrics

Code Review Scores

Track review feedback on agent PRs:
## PR Review Template

### Agent Output Quality
- [ ] Code is correct
- [ ] Code follows conventions
- [ ] Tests included
- [ ] Documentation updated
- [ ] No security issues

**Score**: 4/5
**Notes**: Minor style issues, otherwise good.
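
If reviewers fill in the Score line, you can aggregate it across recent PRs. A rough sketch, assuming the filled-in template is posted as a PR comment (requires gh; adjust --limit and the pattern to your workflow):
# Average review score parsed from "**Score**: N/5" lines in PR comments
gh pr list --state merged --limit 20 --json number --jq '.[].number' | while read -r pr; do
  gh pr view "$pr" --json comments --jq '.comments[].body' | grep -oP '\*\*Score\*\*: \K[0-9]'
done | awk '{sum+=$1; n++} END {if (n) printf "Avg review score: %.1f/5 across %d scored PRs\n", sum/n, n}'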

Automated Quality Checks

# .github/workflows/agent-quality.yml
name: Agent Quality Check

on:
  pull_request:
    branches: [main]

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # fetch both parents of the PR merge commit

      - name: Check if agent PR
        id: check
        run: |
          # Inspect the PR head commit (HEAD^2 of the merge commit) for the agent marker
          if git log -1 --format="%B" HEAD^2 | grep -q "agents-squads"; then
            echo "is_agent=true" >> $GITHUB_OUTPUT
          fi

      - name: Run quality metrics
        if: steps.check.outputs.is_agent == 'true'
        run: |
          # Lint check
          npm run lint

          # Test coverage
          npm test -- --coverage

          # Complexity check
          npx complexity-report src/

      - name: Post metrics to PR
        if: steps.check.outputs.is_agent == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '## Agent Quality Report\n...'
            })

Benchmarking

Task Completion Benchmarks

Track how long standard tasks take:
| Task Type | Target | Current Avg |
|---|---|---|
| Bug fix | < 30 min | 25 min |
| Feature (small) | < 2 hours | 1.5 hours |
| Feature (medium) | < 1 day | 6 hours |
| Refactor | < 4 hours | 3 hours |
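
Because the commit template records a Duration trailer, these averages can be pulled from history. A sketch, assuming durations are always logged in whole minutes (e.g. Duration: 45m):
# Average task duration (minutes) from "Duration: <N>m" commit trailers
git log --since="1 month" --grep="Duration:" --format="%b" | \
  grep -oP 'Duration: \K[0-9]+(?=m)' | \
  awk '{sum+=$1; n++} END {if (n) printf "Avg duration: %.0f min across %d tasks\n", sum/n, n}'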

A/B Testing Agents

Compare different agent configurations:
## Experiment: Prompt Optimization

**Hypothesis**: Adding domain context improves code quality

**Agent A**: Base prompt
**Agent B**: Base prompt + CLAUDE.md context

**Metrics**:
- PR merge rate
- Review iterations needed
- Time to completion

**Results**:
- Agent A: 75% merge rate, 2.3 iterations
- Agent B: 92% merge rate, 1.1 iterations

**Conclusion**: Domain context significantly improves quality
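
One way to pull the per-configuration numbers, assuming each experiment arm's PRs carry a distinguishing label (agent-a and agent-b here are hypothetical labels, not something the template defines):
# Merge rate per experiment arm — requires gh; labels are assumptions
for label in agent-a agent-b; do
  total=$(gh pr list --label "$label" --state all --limit 200 --json number --jq 'length')
  merged=$(gh pr list --label "$label" --state merged --limit 200 --json number --jq 'length')
  [ "$total" -gt 0 ] && echo "$label: $(echo "scale=2; $merged / $total * 100" | bc)% merge rate ($merged/$total)"
done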

Monitoring & Alerts

Performance Thresholds

# .agents/performance.yml
thresholds:
  merge_rate:
    warning: 0.80
    critical: 0.60

  revert_rate:
    warning: 0.05
    critical: 0.10

  avg_review_iterations:
    warning: 3
    critical: 5

  tokens_per_task:
    warning: 100000
    critical: 200000
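
A small script can compare these configured values against live numbers before full alerting is wired up. A sketch using yq and the GitHub CLI (the agents-squads author filter and --limit are assumptions; the config path matches the file above):
#!/bin/bash
# Compare current merge rate against .agents/performance.yml thresholds
critical=$(yq '.thresholds.merge_rate.critical' .agents/performance.yml)
created=$(gh pr list --author="agents-squads" --state=all --limit 200 --json number --jq 'length')
merged=$(gh pr list --author="agents-squads" --state=merged --limit 200 --json number --jq 'length')
[ "$created" -gt 0 ] || exit 0
rate=$(echo "scale=2; $merged / $created" | bc)
if (( $(echo "$rate < $critical" | bc -l) )); then
  echo "CRITICAL: merge rate $rate is below threshold $critical"
fi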

Alerting

// Alert on performance degradation.
// getAgentMetrics and alert are project-specific helpers; THRESHOLDS mirrors
// .agents/performance.yml above.
async function checkPerformance() {
  const metrics = await getAgentMetrics('last_week');

  if (metrics.mergeRate < THRESHOLDS.merge_rate.critical) {
    await alert({
      level: 'critical',
      message: `Agent merge rate dropped to ${metrics.mergeRate}`,
      squad: metrics.squad
    });
  }

  if (metrics.revertRate > THRESHOLDS.revert_rate.warning) {
    await alert({
      level: 'warning',
      message: `High revert rate: ${metrics.revertRate}`,
      squad: metrics.squad
    });
  }
}

Feedback Loop

Recording Feedback

# After agent task completion
squads feedback add engineering

# Prompts for:
# - Task success (yes/no)
# - Quality score (1-5)
# - Issues encountered
# - Improvement suggestions

Using Feedback

# View feedback trends
squads feedback stats

# Output:
# Squad: engineering
# Tasks: 45
# Success rate: 91%
# Avg quality: 4.2/5
# Common issues:
#   - Missing tests (8 occurrences)
#   - Style inconsistencies (5 occurrences)

Continuous Improvement

Measure → Analyze → Improve → Repeat
    │         │         │
    │         │         └── Update prompts, add context
    │         └── Identify patterns in failures
    └── Track metrics over time

Best Practices

  • Tag all agent commits with consistent metadata
  • Track metrics weekly, review monthly
  • Set quality thresholds and alert on breaches
  • A/B test prompt and configuration changes
  • Record human feedback after task completion
  • Use metrics to guide agent improvements

Measurement pitfalls:
  • Optimizing for vanity metrics (commits ≠ value)
  • Ignoring quality in favor of speed
  • Not accounting for task difficulty
  • Missing the feedback loop