Free per-call
Open-weight models (Llama, Qwen, Mistral, DeepSeek) run locally on Ollama with zero API spend.
Stays on your machine
No third-party LLM provider sees your prompts or responses, the model runs entirely on your hardware, and nothing about the content is sent to OpenAI, Anthropic, Google, or anyone else.
No rate limits, no outages
Your model lives on your disk. No API quotas, no provider outages, no waiting in line, it answers as fast as your hardware will let it.
Your hardware, your rules
Pick the model size that fits your RAM. Mix local + cloud freely. Swap weights any time.
Three steps to local AI
From zero to chatting with a local model in under five minutes.
- 1
Install Ollama
Ollama is a free, open-source runner that downloads and serves open-weight models over a local HTTP API. Install it once, pull whatever models you want, and it runs in the background.
macOSLinuxWindowsmacOSbrew install ollamaLinuxcurl -fsSL https://ollama.com/install.sh | shWindowswinget install Ollama.OllamaDetailed download options for every OS at ollama.com/download.
- 2
Pull a model
Pick a model and tell Ollama to download it. The first pull takes a minute (the weights are big); after that the model lives on your disk and starts in seconds.
ollama pull llama3.2 ollama run llama3.2
- 3
Connect aiDex to your Ollama
aiDex connects to your Ollama instance over HTTPS. Run a small free tunnel on the same machine as Ollama to get a stable public URL, then paste it into aiDex.
Cloudflare TunnelngrokTailscaleCloudflare Tunnelcloudflared tunnel --url http://localhost:11434ngrokngrok http 11434Tailscaletailscale serve --bg --https=443 http://localhost:11434Settings path
Settings → Provider keys → Ollama URL
Cloudflare Tunnel is free for personal use and prints a stable URL. Tailscale gives you a private HTTPS endpoint only reachable from your own devices, ideal for solo use.
Recommended models by hardware
Quantized GGUF defaults (q4_K_M). RAM is the lower bound for CPU inference; VRAM is what fits comfortably on a dedicated GPU so generation stays on the card.
Bigger models read more nuance but generate slower. Start with an 8B; step up to 70B only if you have the hardware to keep generation snappy.
Tips for getting the most out of local
Match the model to your RAM
A model that won't fit in RAM falls back to disk swap and grinds. Look at the recommended RAM column and pick one tier below your laptop's free RAM, leave headroom for the OS and your browser.
Mix local and cloud in one team
Run the chatty agents locally (free, fast) and reserve cloud calls for the frontier judge or synthesis step. Best of both: low cost, high ceiling.
Run the moderator locally too
aiDex's moderator just needs to emit small JSON plans. Llama 3.1 8B or Qwen 2.5 7B handle that easily, set it as your team's moderator and your whole conversation runs on-device.
GPU acceleration is automatic
Ollama uses Metal on Apple Silicon, CUDA on NVIDIA, and Vulkan on AMD with zero config. If you have a GPU, you'll see it light up the moment generation starts.