Run your AI agent squads entirely on your machine using Ollama and local LLMs. Zero API costs, complete privacy, works offline.Documentation Index
Fetch the complete documentation index at: https://docs.agents-squads.com/llms.txt
Use this file to discover all available pages before exploring further.
Best for: Privacy-sensitive projects, offline development, reducing API costs, experimenting without usage limits.
Why Local LLMs?
| Benefit | Description |
|---|---|
| Privacy | Code never leaves your machine |
| Cost | Zero API costs after hardware investment |
| Offline | Works without internet connection |
| No limits | No rate limits or quotas |
| Control | Choose and customize your models |
Quick Start
1. Install Ollama
- macOS
- Linux
- Windows
2. Pull a Model
3. Start Ollama Server
4. Install Squads CLI
5. Configure for Local LLM
Create or update your agent to use Ollama:6. Run Your Agent
Recommended Models
For Coding
| Model | Size | VRAM | Best For |
|---|---|---|---|
qwen2.5-coder:14b | 14B | 10GB | Complex code tasks |
qwen2.5-coder:7b | 7B | 6GB | General coding |
codellama:13b | 13B | 10GB | Code completion |
deepseek-coder:6.7b | 6.7B | 5GB | Budget coding |
For General Tasks
| Model | Size | VRAM | Best For |
|---|---|---|---|
llama3.2:8b | 8B | 6GB | Balanced performance |
mistral:7b | 7B | 6GB | Fast responses |
mixtral:8x7b | 47B | 32GB | High quality |
phi3:14b | 14B | 10GB | Reasoning tasks |
Configuration Options
Squad-Level Default
Set Ollama as default for an entire squad:Agent-Level Override
Override for specific agents:Environment Variables
LM Studio Alternative
LM Studio provides a GUI for running local models with OpenAI-compatible API.Setup
- Download LM Studio from lmstudio.ai
- Download a model (e.g.,
TheBloke/CodeLlama-13B-GGUF) - Start the local server (runs on port 1234)
Configure Squads
Performance Tips
Hardware Requirements
| Model Size | Minimum VRAM | Recommended |
|---|---|---|
| 7B | 6GB | 8GB |
| 13B | 10GB | 12GB |
| 30B | 20GB | 24GB |
| 70B | 40GB | 48GB |
Optimization Settings
Quantization
Lower precision = faster + less VRAM:Hybrid Setup
Use local LLMs for development, cloud for production:Troubleshooting
Model Not Loading
Out of Memory
Slow Responses
- Use quantized models (
-q4_0suffix) - Reduce context window (
--num-ctx 4096) - Ensure GPU acceleration is enabled
- Close other GPU-intensive applications
Related
Multi-LLM Usage
Mix local and cloud providers
Token Economics
Compare local vs cloud costs