Run your AI agent squads entirely on your machine using Ollama and local LLMs: zero API costs, complete privacy, and full offline operation.

**Best for:** privacy-sensitive projects, offline development, reducing API costs, and experimenting without usage limits.
## Why Local LLMs?

| Benefit | Description |
|---|---|
| Privacy | Code never leaves your machine |
| Cost | Zero API costs after the hardware investment |
| Offline | Works without an internet connection |
| No limits | No rate limits or quotas |
| Control | Choose and customize your models |
## Quick Start

### 1. Install Ollama

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
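To confirm the install succeeded, check that the CLI is on your PATH:

```bash
# Prints the installed Ollama version
ollama --version
```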
### 2. Pull a Model

```bash
# Recommended for coding tasks
ollama pull qwen2.5-coder:14b

# Or for general tasks
ollama pull llama3.2:latest

# List installed models
ollama list
```
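To inspect what you just pulled, `ollama show` prints the model's details:

```bash
# Show architecture, parameter count, quantization, and context length
ollama show qwen2.5-coder:14b
```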
### 3. Start Ollama Server

```bash
ollama serve
# Server runs at http://localhost:11434
```
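Before wiring up any agents, you can smoke-test the server directly against Ollama's REST API (the prompt here is just an example):

```bash
# One-off generation request; "stream": false returns a single JSON response
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Write a one-line Python hello world.",
  "stream": false
}'
```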
### 4. Install Squads CLI

```bash
npm install -g squads-cli
squads init
```
### 5. Configure Your Agent

Create or update your agent to use Ollama:

```markdown
# .agents/squads/local/code-reviewer.md
---
provider: ollama
model: qwen2.5-coder:14b
---

# Code Reviewer

## Purpose
Review code changes for bugs, security issues, and improvements.

## Instructions
1. Read the provided code diff
2. Identify potential issues
3. Suggest improvements
4. Format the findings as actionable feedback
```
### 6. Run Your Agent

```bash
squads run local/code-reviewer --execute
```
## Recommended Models

### For Coding

| Model | Size | VRAM | Best For |
|---|---|---|---|
| `qwen2.5-coder:14b` | 14B | 10GB | Complex code tasks |
| `qwen2.5-coder:7b` | 7B | 6GB | General coding |
| `codellama:13b` | 13B | 10GB | Code completion |
| `deepseek-coder:6.7b` | 6.7B | 5GB | Budget coding |
### For General Tasks

| Model | Size | VRAM | Best For |
|---|---|---|---|
| `llama3.1:8b` | 8B | 6GB | Balanced performance |
| `mistral:7b` | 7B | 6GB | Fast responses |
| `mixtral:8x7b` | 47B | 32GB | High quality |
| `phi3:14b` | 14B | 10GB | Reasoning tasks |

Start with `qwen2.5-coder:7b` if you have 8GB+ of VRAM; it offers the best balance of speed and capability for code-related agent tasks.
## Configuration Options

### Squad-Level Default

Set Ollama as the default for an entire squad:

```yaml
# .agents/squads/local/SQUAD.md
---
name: local
mission: Privacy-first local development
providers:
  default: ollama
  model: qwen2.5-coder:7b
---
```
### Agent-Level Override

Override the provider for specific agents:

```markdown
---
provider: ollama
model: mixtral:8x7b
temperature: 0.3
---

# Deep Analyzer

Uses a larger model for complex analysis tasks.
```
### Environment Variables

```bash
# .env
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b
```
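`OLLAMA_HOST` also lets you point at an Ollama server running on another machine. A minimal sketch, assuming the remote server is reachable at that address (the IP below is illustrative):

```bash
# Run an agent against a remote Ollama instance instead of localhost
OLLAMA_HOST=http://192.168.1.50:11434 squads run local/code-reviewer --execute
```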
## LM Studio Alternative

LM Studio provides a GUI for running local models behind an OpenAI-compatible API.

### Setup

- Download LM Studio from [lmstudio.ai](https://lmstudio.ai)
- Download a model (e.g., TheBloke/CodeLlama-13B-GGUF)
- Start the local server (it runs on port 1234)

Then point your squad at it:

```yaml
# .agents/squads/local/SQUAD.md
---
providers:
  default: openai  # LM Studio uses an OpenAI-compatible API
  base_url: http://localhost:1234/v1
  model: local-model
---
```
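You can verify the LM Studio server the same way you would any OpenAI-compatible endpoint (the model name depends on what you loaded in the GUI):

```bash
# Chat completion against LM Studio's OpenAI-compatible API
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```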
## Hardware Requirements

| Model Size | Minimum VRAM | Recommended |
|---|---|---|
| 7B | 6GB | 8GB |
| 13B | 10GB | 12GB |
| 30B | 20GB | 24GB |
| 70B | 40GB | 48GB |
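Not sure how much VRAM you have? On NVIDIA GPUs, `nvidia-smi` reports it; on Apple Silicon the GPU shares system memory, so total RAM is the relevant number:

```bash
# Total and currently used VRAM per NVIDIA GPU
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
```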
### Optimization Settings

`num_ctx` (context window size) and `num_gpu` (layers offloaded to the GPU) are model options rather than CLI flags, so pass them per request through the API:

```bash
# Larger context window (uses more VRAM) and full GPU offload
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Explain this function.",
  "options": { "num_ctx": 8192, "num_gpu": 999 }
}'
```
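To persist these settings instead of passing them on every request, you can bake them into a derived model with a Modelfile (the `qwen2.5-coder-8k` name is just an example):

```bash
# Create a model variant with a persistent 8K context window
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
EOF
ollama create qwen2.5-coder-8k -f Modelfile
```

Agents can then reference `qwen2.5-coder-8k` as their model.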
### Quantization

Lower precision means faster inference and less VRAM:

```bash
# Q4 quantization (smallest, fastest)
ollama pull qwen2.5-coder:7b-q4_0

# Q8 quantization (balanced)
ollama pull qwen2.5-coder:7b-q8_0
```
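Once both are pulled, you can compare their on-disk footprint; the Q4 variant should come in at roughly half the size of Q8:

```bash
# List the pulled variants with their download sizes
ollama list | grep qwen2.5-coder
```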
## Hybrid Setup

Use local LLMs for development and cloud models for production:

```yaml
# .agents/squads/engineering/SQUAD.md
---
providers:
  # Local for development
  development:
    provider: ollama
    model: qwen2.5-coder:7b
  # Cloud for production
  production:
    provider: anthropic
    model: claude-sonnet-4
---
```
Switch with an environment variable:

```bash
# Development (local)
SQUADS_ENV=development squads run engineering

# Production (cloud)
SQUADS_ENV=production squads run engineering
```
## Troubleshooting

### Model Not Loading

```bash
# Check that Ollama is running
curl http://localhost:11434/api/tags

# Restart Ollama
ollama serve
```
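If the server starts but models still fail to load, the server log usually says why. On Linux, the install script sets Ollama up as a systemd service:

```bash
# Tail recent server logs (Linux systemd install)
journalctl -e -u ollama
```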
### Out of Memory

```bash
# Use a smaller model
ollama pull qwen2.5-coder:3b

# Or fall back to CPU-only inference (slower but works) by hiding
# the GPU from Ollama (NVIDIA GPUs)
CUDA_VISIBLE_DEVICES=-1 ollama serve
```
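To see what is actually loaded and how it is split between GPU and CPU memory:

```bash
# Loaded models with size, processor placement, and expiry
ollama ps
```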
### Slow Responses

- Use quantized models (`-q4_0` suffix)
- Reduce the context window (e.g., `num_ctx 4096`)
- Ensure GPU acceleration is enabled
- Close other GPU-intensive applications
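To measure whether a change actually helped, `ollama run` has a `--verbose` flag that prints load time and token rates after each response:

```bash
# Prints prompt eval rate and eval rate (tokens/s) after the reply
ollama run qwen2.5-coder:7b --verbose "Write a haiku about compilers."
```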