Run local with Ollama

Run AI on your own laptop. Free.

Pair aiDex with Ollama and turn your own machine into the LLM backend. Zero cost per call, zero data leaving your computer, fully offline-capable.

Free per-call

Open-weight models (Llama, Qwen, Mistral, DeepSeek) run locally on Ollama with zero API spend.

Stays on your machine

No third-party LLM provider sees your prompts or responses. The model runs entirely on your hardware, and none of the content is sent to OpenAI, Anthropic, Google, or anyone else.

No rate limits, no outages

Your model lives on your disk. No API quotas, no provider outages, no waiting in line: it answers as fast as your hardware allows.

Your hardware, your rules

Pick the model size that fits your RAM. Mix local + cloud freely. Swap weights any time.

Three steps to local AI

From zero to chatting with a local model in under five minutes.

  1. Install Ollama

    Ollama is a free, open-source runner that downloads and serves open-weight models over a local HTTP API. Install it once, pull whatever models you want, and it runs in the background.

    macOS:    brew install ollama
    Linux:    curl -fsSL https://ollama.com/install.sh | sh
    Windows:  winget install Ollama.Ollama

    Detailed download options for every OS at ollama.com/download.

  2. Pull a model

    Pick a model and tell Ollama to download it. The first pull takes a minute or more, depending on your connection (the weights are big); after that the model lives on your disk and starts in seconds.

    ollama pull llama3.2
    ollama run llama3.2

  3. Connect aiDex to your Ollama

    aiDex connects to your Ollama instance over HTTPS. Run a small free tunnel on the same machine as Ollama to get an HTTPS URL for it, then paste that URL into aiDex.

    Cloudflare Tunnel:  cloudflared tunnel --url http://localhost:11434
    ngrok:              ngrok http 11434
    Tailscale:          tailscale serve --bg --https=443 http://localhost:11434

    Settings path

    Settings → Provider keys → Ollama URL

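    Cloudflare Tunnel is free for personal use. The quick-tunnel command above prints a public trycloudflare.com URL that changes whenever cloudflared restarts, so update the aiDex setting after a restart or set up a named tunnel for a stable hostname. Tailscale gives you a private HTTPS endpoint reachable only from your own devices, which is ideal for solo use.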
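Before pasting a URL into aiDex, it can help to confirm that it really reaches Ollama. The sketch below is one way to do that with the Python standard library. The helper names (`endpoint`, `build_generate_request`, `list_models`, `generate`) are illustrative and not part of aiDex or Ollama; the `/api/tags` and `/api/generate` routes are Ollama's own HTTP API, and `http://localhost:11434` is the default address used in the tunnel commands above.

```python
import json
import urllib.request

LOCAL_OLLAMA = "http://localhost:11434"  # Ollama's default local address


def endpoint(base_url: str, path: str) -> str:
    """Join a base URL (local or tunnel) with an Ollama API path."""
    return base_url.strip().rstrip("/") + "/" + path.lstrip("/")


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to /api/generate without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        endpoint(base_url, "/api/generate"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_models(base_url: str = LOCAL_OLLAMA, timeout: float = 10.0) -> list[str]:
    """Return the names of models already pulled (GET /api/tags)."""
    with urllib.request.urlopen(endpoint(base_url, "/api/tags"), timeout=timeout) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]


def generate(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt and return the model's full reply text."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

With Ollama running, `list_models()` should include the model pulled in step 2. Calling `list_models()` again with your tunnel URL as `base_url` should return the same list; if it does, that URL is the one to paste into aiDex.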

Recommended models by hardware

Quantized GGUF defaults (q4_K_M). RAM is the lower bound for CPU inference; VRAM is what fits comfortably on a dedicated GPU so generation stays on the card.

Ollama model      Params  RAM (CPU)  VRAM (GPU)  Best for
llama3.2:3b       3B      4 GB       -           Fast everyday chat. Runs on almost any laptop with 4 GB of free RAM.
llama3.1:8b       8B      8 GB       6 GB        Solid all-rounder for reasoning, writing, and code. Sweet spot for most laptops.
qwen2.5:7b        7B      8 GB       6 GB        Strong multilingual (incl. Portuguese & Spanish). Good for cross-locale teams.
mistral:7b        7B      8 GB       6 GB        Sharp for code review, refactors, and explaining unfamiliar APIs.
deepseek-r1:7b    7B      8 GB       6 GB        Reasoning-tuned distillation. Slower per token, deeper answers.
llama3.3:70b-q4   70B     48 GB      40 GB       Frontier-tier locally. Only run it if you have a beefy GPU or workstation.

Bigger models pick up more nuance but generate more slowly. Start with an 8B; step up to 70B only if you have the hardware to keep generation snappy.
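To make the table easier to apply, here is a minimal sketch that filters it by the RAM you have free. The rows are copied from the table above; the `ModelSpec` class, the `models_that_fit` function, and the 2 GB default headroom are assumptions made for illustration, not aiDex or Ollama features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    params_b: int            # parameters, in billions
    ram_gb: int              # RAM (CPU) column
    vram_gb: Optional[int]   # VRAM (GPU) column; None where the table gives no figure


# Rows copied from the hardware table.
MODELS = [
    ModelSpec("llama3.2:3b", 3, 4, None),
    ModelSpec("llama3.1:8b", 8, 8, 6),
    ModelSpec("qwen2.5:7b", 7, 8, 6),
    ModelSpec("mistral:7b", 7, 8, 6),
    ModelSpec("deepseek-r1:7b", 7, 8, 6),
    ModelSpec("llama3.3:70b-q4", 70, 48, 40),
]


def models_that_fit(free_ram_gb: float, headroom_gb: float = 2.0) -> list[str]:
    """Names of models whose RAM figure fits after reserving headroom.

    headroom_gb is an assumed allowance for the OS and other apps.
    """
    budget = free_ram_gb - headroom_gb
    return [m.name for m in MODELS if m.ram_gb <= budget]
```

For example, `models_that_fit(16)` returns the 3B model plus the four 7B/8B models, while `models_that_fit(6)` returns only `llama3.2:3b`.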

Tips for getting the most out of local

  • Match the model to your RAM

    A model that won't fit in RAM falls back to disk swap and grinds. Look at the recommended RAM column and pick one tier below your laptop's free RAM, leaving headroom for the OS and your browser.

  • Mix local and cloud in one team

    Run the chatty agents locally (free, fast) and reserve cloud calls for the frontier judge or synthesis step. Best of both: low cost, high ceiling.

  • Run the moderator locally too

    aiDex's moderator just needs to emit small JSON plans. Llama 3.1 8B or Qwen 2.5 7B handles that easily. Set one as your team's moderator and your whole conversation runs on-device.

  • GPU acceleration is automatic

    Ollama uses Metal on Apple Silicon, CUDA on NVIDIA GPUs, and ROCm on supported AMD GPUs, with no configuration needed. If you have a supported GPU, you'll see it light up the moment generation starts.
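If you want to confirm that a loaded model actually landed on the GPU, one option is to ask Ollama which models are in memory. The sketch below assumes Ollama's `/api/ps` route reports `size` and `size_vram` in bytes for each loaded model, as current releases do; the function names `gpu_fraction` and `loaded_models` are illustrative.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model held in GPU memory, from 0.0 to 1.0."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model_entry.get("size_vram", 0) / size, 1.0)


def loaded_models(base_url: str = "http://localhost:11434", timeout: float = 10.0) -> dict[str, float]:
    """Map each currently loaded model to its GPU share (GET /api/ps)."""
    url = base_url.rstrip("/") + "/api/ps"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = json.load(resp)
    return {m["name"]: gpu_fraction(m) for m in data.get("models", [])}
```

A share of 1.0 means the whole model sits in GPU memory; a share of 0.0 means generation is running on the CPU. Anything in between suggests the model is split across the two, which is usually a sign that a smaller model would run faster on that machine.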

No API bill. No data sharing. No lock-in.

Sign up for aiDex and you get the team-chat orchestration. Run the models on your own hardware and you get them free, private, and offline.