Design Principles
Start Simple, Scale Deliberately
- Begin with single-agent solutions
- Add complexity only when measurements prove it helps
- Avoid premature framework adoption
- Validate each addition delivers value
1. Single Agent - Start simple
2. Add Memory - Persistence when needed
3. Add Tools - Capabilities when needed
4. Multiple Agents - Scale when needed
5. Squads - Organize when needed
Move to the next stage only when the current stage is insufficient.
Optimize for Trust, Not Capability
The bottleneck is rarely what agents can do, but whether humans trust what they do.
- Transparency - Show reasoning, not just results
- Auditability - Log decisions and actions
- Predictability - Consistent behavior patterns
- Reversibility - Easy to undo mistakes
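As a rough illustration of how auditability and reversibility can fit together, the sketch below records each action with its reasoning and an undo callback. The names `AuditEntry`, `AuditLog`, and the `config` dictionary are hypothetical and chosen only for this example; they are not part of any specific tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditEntry:
    action: str               # what the agent did (auditability)
    reasoning: str            # why it did it (transparency)
    undo: Callable[[], None]  # how to reverse it (reversibility)


@dataclass
class AuditLog:
    entries: List[AuditEntry] = field(default_factory=list)

    def record(self, action: str, reasoning: str, undo: Callable[[], None]) -> None:
        """Append one decision to the log."""
        self.entries.append(AuditEntry(action, reasoning, undo))

    def rollback_last(self) -> str:
        """Undo the most recent action and return its description."""
        entry = self.entries.pop()
        entry.undo()
        return entry.action


# Hypothetical example: an agent changes a config value.
config = {"timeout": 30}
log = AuditLog()

previous = config["timeout"]
config["timeout"] = 60
log.record(
    action="set timeout=60",
    reasoning="requests were timing out under load",
    undo=lambda: config.update(timeout=previous),
)

# A human reviewer disagrees, so the change is reversed.
undone = log.rollback_last()
```

Storing the undo step at the moment the action is taken keeps the rollback path simple: nobody has to reconstruct the previous state after the fact.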
Agent Design
Prompt Engineering
Do:
Single Responsibility
Each agent should do one thing well.
Fail Gracefully
System Architecture
Memory Hierarchy
| Layer | Type | Purpose |
|---|---|---|
| Project config | Static | Project knowledge (CLAUDE.md) |
| Squad memory | Persistent | Cross-session state |
| Conversation | Session | Current task context |
| Working memory | Ephemeral | In-progress data |
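One possible way to model these layers is a layered lookup in which the most short-lived layer is consulted first. The sketch below uses Python's standard `collections.ChainMap`. The lookup order (working memory, then conversation, then squad memory, then project config) and all keys and values are assumptions made for illustration; the table above does not prescribe a precedence.

```python
from collections import ChainMap

# Each layer is a plain dict here; real layers would be backed by files or stores.
project_config = {"language": "python", "style": "pep8"}   # static
squad_memory = {"last_release": "v1.2"}                     # persistent
conversation = {"task": "fix login bug", "style": "black"}  # session
working_memory = {}                                         # ephemeral

# Lookups search left to right, so more specific layers shadow broader ones.
context = ChainMap(working_memory, conversation, squad_memory, project_config)

# Writes through a ChainMap only touch the first mapping (working memory).
context["draft_patch"] = "work in progress"

style = context["style"]        # found in conversation, shadows project config
language = context["language"]  # falls through to project config
```

With this arrangement, in-progress data never leaks into the persistent or static layers unless something writes to them explicitly.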
Communication Patterns
| Pattern | When to Use |
|---|---|
| Direct handoff | Agent A completes, passes to Agent B |
| Shared state file | Multiple agents read/write same doc |
| Message queue | Async, decoupled agents |
| Orchestrator | Central coordinator delegates tasks |
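To make two of these patterns concrete, here is a minimal sketch of a direct handoff and an orchestrator, with agents modeled as plain functions from string to string. The agent functions and the `direct_handoff` and `orchestrate` helpers are hypothetical stand-ins; real agents would call a model or a tool.

```python
from typing import Callable, Dict, List

# An agent is modeled as a function that takes text and returns text.
Agent = Callable[[str], str]


def research_agent(task: str) -> str:
    return f"notes for: {task}"


def writer_agent(notes: str) -> str:
    return f"draft based on {notes}"


def direct_handoff(task: str, pipeline: List[Agent]) -> str:
    """Direct handoff: each agent finishes, then passes its output to the next."""
    result = task
    for agent in pipeline:
        result = agent(result)
    return result


def orchestrate(tasks: Dict[str, str], agents: Dict[str, Agent]) -> Dict[str, str]:
    """Orchestrator: a central coordinator hands each task to a named agent."""
    return {name: agents[name](task) for name, task in tasks.items()}


handoff_result = direct_handoff("pricing page", [research_agent, writer_agent])
orchestrated = orchestrate(
    {"research": "competitor pricing"},
    {"research": research_agent, "writer": writer_agent},
)
```

The handoff keeps agents coupled in a fixed order, while the orchestrator keeps them independent and puts the routing decision in one place.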
Isolation Boundaries
Good Isolation
- Agent A → src/auth/ only
- Agent B → src/api/ only
- Clear boundaries
Poor Isolation
- All agents modify all files
- No boundaries
- Conflicts and overwrites
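A boundary like this could be enforced with a small path check before an agent writes a file. The sketch below is one way to do it with the standard `pathlib` module; the agent names `agent_a` and `agent_b`, the `ALLOWED` mapping, and the `may_modify` helper are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from agent name to the one directory it may modify.
ALLOWED = {
    "agent_a": PurePosixPath("src/auth"),
    "agent_b": PurePosixPath("src/api"),
}


def may_modify(agent: str, path: str) -> bool:
    """Return True only if `path` lies inside the agent's assigned directory."""
    root = ALLOWED.get(agent)
    if root is None:
        return False  # unknown agents get no write access
    target = PurePosixPath(path)
    if ".." in target.parts:
        return False  # reject attempts to climb out of the directory
    return target == root or root in target.parents


ok = may_modify("agent_a", "src/auth/login.py")
blocked = may_modify("agent_a", "src/api/routes.py")
```

Comparing path components rather than string prefixes avoids treating `src/authz/` as part of `src/auth/`.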
Quality Assurance
Review Before Merge
Even automated PRs need review.
Validation Gates
1. Syntax check - Valid? Continue. Invalid? Reject.
2. Tests - Pass? Continue. Fail? Reject.
3. Linter - Clean? Continue. Issues? Auto-fix and retry.
4. Review - Human approval → Merge
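The four gates above can be sketched as a single function that stops at the first failure. Only the syntax check uses a real library call (`ast.parse`); the test runner, linter, auto-fixer, and human approval step are passed in as hypothetical callables. The retry limit, and rejecting once it is exhausted, are assumptions, since the gates do not say how many retries are allowed.

```python
import ast
from typing import Callable


def syntax_ok(source: str) -> bool:
    """Gate 1: the source must parse as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def run_gates(
    source: str,
    tests_pass: Callable[[str], bool],
    lint_clean: Callable[[str], bool],
    auto_fix: Callable[[str], str],
    human_approves: Callable[[str], bool],
    max_fix_attempts: int = 1,  # assumed limit, not taken from the source
) -> str:
    if not syntax_ok(source):
        return "rejected: syntax"
    if not tests_pass(source):
        return "rejected: tests"
    # Gate 3: lint issues trigger an auto-fix and a retry of the linter.
    attempts = 0
    while not lint_clean(source):
        if attempts >= max_fix_attempts:
            return "rejected: lint"
        source = auto_fix(source)
        attempts += 1
    # Gate 4: nothing merges without human approval.
    if not human_approves(source):
        return "rejected: review"
    return "merged"
```

A stricter variant would rerun the syntax check and tests after every auto-fix, since a fixer can change behavior; this sketch only reruns the linter.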
Feedback Loops
Operational Excellence
Monitoring
Track these metrics:
| Metric | Target | Red Flag |
|---|---|---|
| Task completion rate | > 90% | < 70% |
| Token efficiency | > 80% | < 50% |
| Error rate | < 5% | > 15% |
| Avg task duration | < 10 min | > 30 min |
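These thresholds are easy to encode so that a dashboard or alert can classify each reading. The sketch below is one hedged way to do that: the metric keys, the `Threshold` type, and the `classify` function are hypothetical, and the "warning" label for values between the target and the red flag is an assumption, since the table only defines the two ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    target: float
    red_flag: float
    higher_is_better: bool


# Values copied from the table; rates are fractions, duration is in minutes.
THRESHOLDS = {
    "task_completion_rate": Threshold(0.90, 0.70, higher_is_better=True),
    "token_efficiency": Threshold(0.80, 0.50, higher_is_better=True),
    "error_rate": Threshold(0.05, 0.15, higher_is_better=False),
    "avg_task_duration_min": Threshold(10, 30, higher_is_better=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'red_flag' for one metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_better:
        if value > t.target:
            return "ok"
        if value < t.red_flag:
            return "red_flag"
    else:
        if value < t.target:
            return "ok"
        if value > t.red_flag:
            return "red_flag"
    return "warning"  # assumed label for the band between target and red flag
```

For example, `classify("avg_task_duration_min", 12)` lands in the middle band: slower than the 10-minute target but well short of the 30-minute red flag.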
Cost Control
Incident Response
When agents fail:
- Stop - Prevent further damage
- Assess - What happened, what’s affected
- Fix - Resolve immediate issue
- Learn - Update prompts/guardrails
- Document - Record in squad memory
Anti-Patterns
Avoid These
Checklists
New Agent Checklist
- Clear, single-purpose objective
- Specific constraints and boundaries
- Defined output format
- Error handling instructions
- Anti-slop rules included
- Tested on representative inputs
Production Readiness
- Monitoring in place
- Budget limits configured
- Error alerting enabled
- Rollback plan documented
- Feedback loop established
- Review process defined
Daily Operations
- Check `squads status` for issues
- Review any failed tasks
- Monitor cost trends
- Update memory with learnings
- Clear completed todos