Run your AI agent squads entirely on your machine using Ollama and local LLMs: zero API costs, complete privacy, and full offline operation.

**Best for:** privacy-sensitive projects, offline development, reducing API costs, and experimenting without usage limits.
## Why Local LLMs?

| Benefit | Description |
|---|---|
| Privacy | Code never leaves your machine |
| Cost | Zero API costs after the hardware investment |
| Offline | Works without an internet connection |
| No limits | No rate limits or quotas |
| Control | Choose and customize your models |
## Quick Start

### 1. Install Ollama

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
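To confirm the install succeeded, check that the CLI is on your PATH:

```bash
# Prints the installed Ollama version
ollama --version
```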
### 2. Pull a Model

```bash
# Recommended for coding tasks
ollama pull qwen2.5-coder:14b

# Or for general tasks
ollama pull llama3.2:latest

# List installed models
ollama list
```
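To inspect what you just pulled, `ollama show` prints the model's details:

```bash
# Show architecture, parameter count, quantization, and context length
ollama show qwen2.5-coder:14b
```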
### 3. Start Ollama Server

```bash
ollama serve
# Server runs at http://localhost:11434
```
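Before wiring up any agents, you can smoke-test the server directly against Ollama's REST API (the prompt here is just an example):

```bash
# One-off generation request; "stream": false returns a single JSON response
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Write a one-line Python hello world.",
  "stream": false
}'
```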
### 4. Install Squads CLI

```bash
npm install -g squads-cli
squads init
```
### 5. Configure Your Agent

Create or update your agent to use Ollama:

```markdown
# .agents/squads/local/code-reviewer.md
---
provider: ollama
model: qwen2.5-coder:14b
---

# Code Reviewer

## Purpose
Review code changes for bugs, security issues, and improvements.

## Instructions
1. Read the provided code diff
2. Identify potential issues
3. Suggest improvements
4. Format the findings as actionable feedback
```
### 6. Run Your Agent

```bash
squads run local/code-reviewer --execute
```
## Recommended Models

### For Coding

| Model | Size | VRAM | Best For |
|---|---|---|---|
| `qwen2.5-coder:14b` | 14B | 10GB | Complex code tasks |
| `qwen2.5-coder:7b` | 7B | 6GB | General coding |
| `codellama:13b` | 13B | 10GB | Code completion |
| `deepseek-coder:6.7b` | 6.7B | 5GB | Budget coding |
### For General Tasks

| Model | Size | VRAM | Best For |
|---|---|---|---|
| `llama3.1:8b` | 8B | 6GB | Balanced performance |
| `mistral:7b` | 7B | 6GB | Fast responses |
| `mixtral:8x7b` | 47B | 32GB | High quality |
| `phi3:14b` | 14B | 10GB | Reasoning tasks |

Start with `qwen2.5-coder:7b` if you have 8GB+ of VRAM; it offers the best balance of speed and capability for code-related agent tasks.
## Configuration Options

### Squad-Level Default

Set Ollama as the default for an entire squad:

```yaml
# .agents/squads/local/SQUAD.md
---
name: local
mission: Privacy-first local development
providers:
  default: ollama
  model: qwen2.5-coder:7b
---
```
### Agent-Level Override

Override the provider for specific agents:

```markdown
---
provider: ollama
model: mixtral:8x7b
temperature: 0.3
---

# Deep Analyzer

Uses a larger model for complex analysis tasks.
```
### Environment Variables

```bash
# .env
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b
```
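`OLLAMA_HOST` also lets you point at an Ollama server running on another machine. A minimal sketch, assuming the remote server is reachable at that address (the IP below is illustrative):

```bash
# Run an agent against a remote Ollama instance instead of localhost
OLLAMA_HOST=http://192.168.1.50:11434 squads run local/code-reviewer --execute
```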
## LM Studio Alternative

LM Studio provides a GUI for running local models behind an OpenAI-compatible API.

### Setup

- Download LM Studio from [lmstudio.ai](https://lmstudio.ai)
- Download a model (e.g., TheBloke/CodeLlama-13B-GGUF)
- Start the local server (it runs on port 1234)

Then point your squad at it:

```yaml
# .agents/squads/local/SQUAD.md
---
providers:
  default: openai  # LM Studio uses an OpenAI-compatible API
  base_url: http://localhost:1234/v1
  model: local-model
---
```
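You can verify the LM Studio server the same way you would any OpenAI-compatible endpoint (the model name depends on what you loaded in the GUI):

```bash
# Chat completion against LM Studio's OpenAI-compatible API
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```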
## Hardware Requirements

| Model Size | Minimum VRAM | Recommended |
|---|---|---|
| 7B | 6GB | 8GB |
| 13B | 10GB | 12GB |
| 30B | 20GB | 24GB |
| 70B | 40GB | 48GB |
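Not sure how much VRAM you have? On NVIDIA GPUs, `nvidia-smi` reports it; on Apple Silicon the GPU shares system memory, so total RAM is the relevant number:

```bash
# Total and currently used VRAM per NVIDIA GPU
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
```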
### Optimization Settings

`num_ctx` (context window size) and `num_gpu` (layers offloaded to the GPU) are model options rather than CLI flags, so pass them per request through the API:

```bash
# Larger context window (uses more VRAM) and full GPU offload
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Explain this function.",
  "options": { "num_ctx": 8192, "num_gpu": 999 }
}'
```
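To persist these settings instead of passing them on every request, you can bake them into a derived model with a Modelfile (the `qwen2.5-coder-8k` name is just an example):

```bash
# Create a model variant with a persistent 8K context window
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
EOF
ollama create qwen2.5-coder-8k -f Modelfile
```

Agents can then reference `qwen2.5-coder-8k` as their model.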
### Quantization

Lower precision means faster inference and less VRAM:

```bash
# Q4 quantization (smallest, fastest)
ollama pull qwen2.5-coder:7b-q4_0

# Q8 quantization (balanced)
ollama pull qwen2.5-coder:7b-q8_0
```
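Once both are pulled, you can compare their on-disk footprint; the Q4 variant should come in at roughly half the size of Q8:

```bash
# List the pulled variants with their download sizes
ollama list | grep qwen2.5-coder
```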
## Hybrid Setup

Use local LLMs for development and cloud models for production:

```yaml
# .agents/squads/engineering/SQUAD.md
---
providers:
  # Local for development
  development:
    provider: ollama
    model: qwen2.5-coder:7b
  # Cloud for production
  production:
    provider: anthropic
    model: claude-sonnet-4
---
```
Switch with an environment variable:

```bash
# Development (local)
SQUADS_ENV=development squads run engineering

# Production (cloud)
SQUADS_ENV=production squads run engineering
```
## Troubleshooting

### Model Not Loading

```bash
# Check that Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama
ollama serve
```
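If the server starts but models still fail to load, the server log usually says why. On Linux, the install script sets Ollama up as a systemd service:

```bash
# Tail recent server logs (Linux systemd install)
journalctl -e -u ollama
```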
### Out of Memory

```bash
# Use a smaller model
ollama pull qwen2.5-coder:3b

# Or fall back to CPU-only inference (slower but works) by hiding
# the GPU from Ollama (NVIDIA GPUs)
CUDA_VISIBLE_DEVICES=-1 ollama serve
```
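To see what is actually loaded and how it is split between GPU and CPU memory:

```bash
# Loaded models with size, processor placement, and expiry
ollama ps
```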
### Slow Responses

- Use quantized models (`-q4_0` suffix)
- Reduce the context window (e.g., `num_ctx 4096`)
- Ensure GPU acceleration is enabled
- Close other GPU-intensive applications
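To measure whether a change actually helped, `ollama run` has a `--verbose` flag that prints load time and token rates after each response:

```bash
# Prints prompt eval rate and eval rate (tokens/s) after the reply
ollama run qwen2.5-coder:7b --verbose "Write a haiku about compilers."
```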