Run your AI agent squads entirely on your machine using Ollama and local LLMs. Zero API costs, complete privacy, works offline.
**Best for:** privacy-sensitive projects, offline development, reducing API costs, and experimenting without usage limits.
## Why Local LLMs?

| Benefit | Description |
| --- | --- |
| Privacy | Code never leaves your machine |
| Cost | Zero API costs after the hardware investment |
| Offline | Works without an internet connection |
| No limits | No rate limits or quotas |
| Control | Choose and customize your models |
## Quick Start

### 1. Install Ollama

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### 2. Pull a Model

```bash
# Recommended for coding tasks
ollama pull qwen2.5-coder:14b

# Or for general tasks
ollama pull llama3.2:latest

# List installed models
ollama list
```
### 3. Start Ollama Server

```bash
ollama serve
# Server runs at http://localhost:11434
```
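Before wiring up agents, it is worth confirming the server responds. A minimal sketch against Ollama's HTTP API (the prompt is arbitrary; substitute whichever model you pulled in step 2):

```bash
# Should return a JSON list of installed models, not a connection error
curl http://localhost:11434/api/tags

# One-off, non-streaming completion to verify the model loads and responds
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Write a one-line Python hello world.",
  "stream": false
}'
```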
### 4. Install Squads CLI

```bash
npm install -g squads-cli
squads init
```
### 5. Configure an Agent

Create or update your agent to use Ollama:
```markdown
# .agents/squads/local/code-reviewer.md
---
provider: ollama
model: qwen2.5-coder:14b
---

# Code Reviewer

## Purpose
Review code changes for bugs, security issues, and improvements.

## Instructions
1. Read the provided code diff
2. Identify potential issues
3. Suggest improvements
4. Format as actionable feedback
```
### 6. Run Your Agent

```bash
squads run local/code-reviewer --execute
```
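How the diff reaches the agent depends on your squads-cli version; assuming the CLI reads context from stdin (an assumption worth checking against `squads run --help`), reviewing your working tree might look like:

```bash
# Assumption: squads-cli accepts piped context on stdin
git diff | squads run local/code-reviewer --execute
```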
## Recommended Models

### For Coding

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| qwen2.5-coder:14b | 14B | 10GB | Complex code tasks |
| qwen2.5-coder:7b | 7B | 6GB | General coding |
| codellama:13b | 13B | 10GB | Code completion |
| deepseek-coder:6.7b | 6.7B | 5GB | Budget coding |
### For General Tasks

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| llama3.2:8b | 8B | 6GB | Balanced performance |
| mistral:7b | 7B | 6GB | Fast responses |
| mixtral:8x7b | 47B | 32GB | High quality |
| phi3:14b | 14B | 10GB | Reasoning tasks |
Start with `qwen2.5-coder:7b` if you have 8GB+ VRAM. It offers the best balance of speed and capability for code-related agent tasks.
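Before committing a model to agent config, a one-shot prompt through `ollama run` gives a quick feel for latency and output quality on your hardware. The prompt below is just an example:

```bash
# One-shot prompt; the model is downloaded on first use if not already pulled
ollama run qwen2.5-coder:7b "Write a Python function that reverses a string."
```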
## Configuration Options

### Squad-Level Default

Set Ollama as the default for an entire squad:
```markdown
# .agents/squads/local/SQUAD.md
---
name: local
mission: Privacy-first local development
providers:
  default: ollama
  model: qwen2.5-coder:7b
---
```
### Agent-Level Override

Override the squad default for specific agents:
```markdown
---
provider: ollama
model: mixtral:8x7b
temperature: 0.3
---

# Deep Analyzer

Uses a larger model for complex analysis tasks.
```
### Environment Variables

```bash
# .env
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b
```
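squads-cli presumably reads `.env` itself, but if your shell session needs the same values, the standard auto-export idiom works:

```bash
# Export every variable assigned in .env into the current shell
set -a        # auto-export all subsequent assignments
source .env
set +a        # stop auto-exporting
```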
## LM Studio Alternative

LM Studio provides a GUI for running local models behind an OpenAI-compatible API.
### Setup

1. Download LM Studio from lmstudio.ai
2. Download a model (e.g., TheBloke/CodeLlama-13B-GGUF)
3. Start the local server (runs on port 1234)
```markdown
# .agents/squads/local/SQUAD.md
---
providers:
  default: openai  # LM Studio uses an OpenAI-compatible API
  base_url: http://localhost:1234/v1
  model: local-model
---
```
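Since LM Studio speaks the OpenAI chat-completions protocol, you can sanity-check the server with curl before pointing a squad at it. The `model` value is a placeholder; LM Studio generally serves whichever model is currently loaded:

```bash
# Minimal OpenAI-compatible chat request against the local LM Studio server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
```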
## Hardware Requirements

| Model Size | Minimum VRAM | Recommended |
| --- | --- | --- |
| 7B | 6GB | 8GB |
| 13B | 10GB | 12GB |
| 30B | 20GB | 24GB |
| 70B | 40GB | 48GB |
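To see where your GPU falls in this table, check its memory before choosing a model size. An NVIDIA example (AMD and Apple Silicon have their own tooling):

```bash
# Report name, total VRAM, and VRAM in use for each NVIDIA GPU
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```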
## Optimization Settings

```bash
# Increase context window (uses more VRAM)
ollama run qwen2.5-coder:7b --num-ctx 8192

# Use GPU layers (faster inference)
OLLAMA_NUM_GPU=999 ollama serve
```
### Quantization

Lower precision means faster inference and less VRAM:

```bash
# Q4 quantization (smallest, fastest)
ollama pull qwen2.5-coder:7b-q4_0

# Q8 quantization (balanced)
ollama pull qwen2.5-coder:7b-q8_0
```
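After pulling more than one variant, `ollama list` shows each tag's on-disk size, a rough first proxy for VRAM footprint (actual usage also depends on context length):

```bash
# Compare the on-disk sizes of the quantized variants you have pulled
ollama list | grep qwen2.5-coder
```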
## Hybrid Setup

Use local LLMs for development and cloud models for production:
```markdown
# .agents/squads/engineering/SQUAD.md
---
providers:
  # Local for development
  development:
    provider: ollama
    model: qwen2.5-coder:7b
  # Cloud for production
  production:
    provider: anthropic
    model: claude-sonnet-4
---
```
Switch with an environment variable:

```bash
# Development (local)
SQUADS_ENV=development squads run engineering

# Production (cloud)
SQUADS_ENV=production squads run engineering
```
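If you switch frequently, a small shell helper keeps invocations short. This is a sketch, not part of squads-cli; `run_squad` is a hypothetical name:

```bash
# Hypothetical helper: pick a provider profile per invocation
run_squad() {
  local env="${1:-development}"   # default to the local Ollama profile
  SQUADS_ENV="$env" squads run engineering
}

run_squad              # development: local Ollama
run_squad production   # production: cloud Anthropic
```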
## Troubleshooting

### Model Not Loading

```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama
ollama serve
```
### Out of Memory

```bash
# Use a smaller model
ollama pull qwen2.5-coder:3b

# Or use CPU-only (slower but works)
OLLAMA_NUM_GPU=0 ollama serve
```
### Slow Responses

- Use quantized models (`-q4_0` suffix)
- Reduce the context window (`--num-ctx 4096`); the first two tips are combined in the sketch below
- Ensure GPU acceleration is enabled
- Close other GPU-intensive applications
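Combining quantization with a smaller context window, and assuming the `--num-ctx` flag and `-q4_0` tag shown earlier are available in your Ollama version:

```bash
# Quantized model plus a smaller context window for the fastest local responses
ollama run qwen2.5-coder:7b-q4_0 --num-ctx 4096
```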
## Next Steps

- **Multi-LLM Usage**: Mix local and cloud providers
- **Token Economics**: Compare local vs. cloud costs