Run your AI agent squads entirely on your machine using Ollama and local LLMs. Zero API costs, complete privacy, works offline.
Best for: Privacy-sensitive projects, offline development, reducing API costs, experimenting without usage limits.

Why Local LLMs?

| Benefit | Description |
| --- | --- |
| Privacy | Code never leaves your machine |
| Cost | Zero API costs after hardware investment |
| Offline | Works without internet connection |
| No limits | No rate limits or quotas |
| Control | Choose and customize your models |

Quick Start

1. Install Ollama

```bash
# macOS (Homebrew)
brew install ollama
```
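
On Linux, the official install script from ollama.com sets up the server and CLI:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```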

2. Pull a Model

```bash
# Recommended for coding tasks
ollama pull qwen2.5-coder:14b

# Or for general tasks
ollama pull llama3.2:latest

# List installed models
ollama list
```
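
To inspect a pulled model's parameters, template, and license before wiring it into an agent:

```bash
ollama show qwen2.5-coder:14b
```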

3. Start Ollama Server

```bash
ollama serve
# Server runs at http://localhost:11434
```
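
Before wiring up any agents, it is worth confirming the server actually answers. Ollama's standard /api/generate endpoint returns a completion (the model name assumes the pull from step 2):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Write a hello world in Python",
  "stream": false
}'
```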

4. Install Squads CLI

```bash
npm install -g squads-cli
squads init
```

5. Configure for Local LLM

Create or update your agent to use Ollama:

```markdown
# .agents/squads/local/code-reviewer.md

---
provider: ollama
model: qwen2.5-coder:14b
---

# Code Reviewer

## Purpose
Review code changes for bugs, security issues, and improvements.

## Instructions
1. Read the provided code diff
2. Identify potential issues
3. Suggest improvements
4. Format as actionable feedback
```
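
For reference, an agent with provider: ollama ultimately resolves to calls against Ollama's local chat endpoint. Exactly how squads-cli maps the agent file onto the request is an assumption here, but the payload shape below is Ollama's standard /api/chat API, sketched with the agent's purpose as the system prompt:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:14b",
  "messages": [
    {"role": "system", "content": "Review code changes for bugs, security issues, and improvements."},
    {"role": "user", "content": "<code diff here>"}
  ],
  "stream": false
}'
```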

6. Run Your Agent

```bash
squads run local/code-reviewer --execute
```

Recommended Models

For Coding

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| qwen2.5-coder:14b | 14B | 10GB | Complex code tasks |
| qwen2.5-coder:7b | 7B | 6GB | General coding |
| codellama:13b | 13B | 10GB | Code completion |
| deepseek-coder:6.7b | 6.7B | 5GB | Budget coding |

For General Tasks

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| llama3.1:8b | 8B | 6GB | Balanced performance |
| mistral:7b | 7B | 6GB | Fast responses |
| mixtral:8x7b | 47B | 32GB | High quality |
| phi3:14b | 14B | 10GB | Reasoning tasks |

Tip: Start with qwen2.5-coder:7b if you have 8GB+ VRAM. It offers the best balance of speed and capability for code-related agent tasks.

Configuration Options

Squad-Level Default

Set Ollama as the default for an entire squad:

```yaml
# .agents/squads/local/SQUAD.md
---
name: local
mission: Privacy-first local development

providers:
  default: ollama
  model: qwen2.5-coder:7b
---
```

Agent-Level Override

Override for specific agents:

```markdown
---
provider: ollama
model: mixtral:8x7b
temperature: 0.3
---

# Deep Analyzer

Uses a larger model for complex analysis tasks.
```

Environment Variables

```bash
# .env
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b
```
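
OLLAMA_HOST works in both directions: ollama serve uses it as the bind address, and client commands use it to locate the server. A quick sketch (the LAN address is a placeholder):

```bash
# Expose the server on all interfaces (e.g. to share one GPU box)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Point the CLI at a non-default host
OLLAMA_HOST=http://192.168.1.50:11434 ollama list
```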

LM Studio Alternative

LM Studio provides a GUI for running local models with an OpenAI-compatible API.

Setup

  1. Download LM Studio from lmstudio.ai
  2. Download a model (e.g., TheBloke/CodeLlama-13B-GGUF)
  3. Start the local server (runs on port 1234; verify with the check below)
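
Once the server is up, the OpenAI-compatible endpoint can be verified directly:

```bash
curl http://localhost:1234/v1/models
```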

Configure Squads

```yaml
# .agents/squads/local/SQUAD.md
---
providers:
  default: openai  # LM Studio uses an OpenAI-compatible API
  base_url: http://localhost:1234/v1
  model: local-model
---
```
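
An end-to-end check against the same server (the model value is whatever LM Studio reports for your loaded model; local-model is a placeholder matching the config above):

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```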

Performance Tips

Hardware Requirements

| Model Size | Minimum VRAM | Recommended |
| --- | --- | --- |
| 7B | 6GB | 8GB |
| 13B | 10GB | 12GB |
| 30B | 20GB | 24GB |
| 70B | 40GB | 48GB |
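
To check available VRAM on an NVIDIA GPU (Apple Silicon Macs use unified memory, so total RAM is the figure to compare against):

```bash
nvidia-smi --query-gpu=name,memory.total --format=csv
```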

Optimization Settings

```bash
# Increase the context window (uses more VRAM).
# Inside an interactive `ollama run qwen2.5-coder:7b` session:
/set parameter num_ctx 8192

# Offload all layers to the GPU (faster inference) by passing
# the num_gpu option through the API:
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "hi", "options": {"num_gpu": 999}}'
```

Quantization

Lower precision means faster inference and a smaller VRAM footprint, at a small cost in quality:

```bash
# Q4 quantization (smallest, fastest)
ollama pull qwen2.5-coder:7b-instruct-q4_0

# Q8 quantization (balanced)
ollama pull qwen2.5-coder:7b-instruct-q8_0

# Exact quantization tags are listed on the model's Ollama library page
```
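
After pulling, the SIZE column in ollama list makes the trade-off concrete, since on-disk size roughly tracks VRAM use:

```bash
ollama list | grep qwen2.5-coder
```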

Hybrid Setup

Use local LLMs for development and cloud models for production:

```yaml
# .agents/squads/engineering/SQUAD.md
---
providers:
  # Local for development
  development:
    provider: ollama
    model: qwen2.5-coder:7b

  # Cloud for production
  production:
    provider: anthropic
    model: claude-sonnet-4
---
```
Switch with an environment variable:

```bash
# Development (local)
SQUADS_ENV=development squads run engineering

# Production (cloud)
SQUADS_ENV=production squads run engineering
```

Troubleshooting

Model Not Loading

```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama
ollama serve
```
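
If the server is up but a model still will not load, ollama ps shows what is currently in memory and how it is split between GPU and CPU:

```bash
ollama ps
```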

Out of Memory

```bash
# Use a smaller model
ollama pull qwen2.5-coder:3b

# Or force CPU-only inference (slower but works) by setting
# the num_gpu option to 0 on the request:
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:3b", "prompt": "test", "options": {"num_gpu": 0}}'
```

Slow Responses

  1. Use quantized models (q4_0 tag variants)
  2. Reduce the context window (/set parameter num_ctx 4096)
  3. Ensure GPU acceleration is enabled
  4. Close other GPU-intensive applications