> Local-first AI for people who own their tools
— forge team
The default settled fast: prompt goes over the wire, somebody else's GPU answers, your data is in somebody else's log. This is fine, until it isn't. Then it's never fine again.
Local-first AI is the other default. Model lives on your box. Prompt never leaves. The reply costs you electricity, not credits.
There are three arguments. They are old arguments. They are still the right ones.
1. Ownership
In 2019, Ink & Switch published "Local-first software: You own your data, in spite of the cloud." Kleppmann, Wiggins, van Hardenberg, McGranaghan. The piece names seven ideals: no spinners, your work isn't trapped on one device, the network is optional, seamless collaboration, the long now, security and privacy by default, and ultimate ownership and control. The argument is not "the cloud is bad." The argument is: the primary copy of your work lives on a machine you control. Cloud is a sync target, not the landlord.
Apply that to AI. The "work" is your prompts, your context window, the documents you're feeding the model, the conversation log. The primary copy should live where you live. If the vendor folds, your tool keeps working. If the vendor changes the price, your tool keeps working. If the vendor decides your sector is now off-limits, your tool keeps working. That's the long now, applied to inference.
2. Privacy
Self-hosted inference is the only deployment shape where you can honestly say no data leaves the box. Industry write-ups on self-hosted vs. cloud LLMs keep returning to the same observation: when regulated data is in scope — GDPR, HIPAA, SOC 2 — you can't defensibly audit a trail that lives on someone else's servers. PII detection, output validation, prompt logging — all of it stays inside your trust boundary.
That's the compliance frame. The personal frame is simpler. If you wouldn't paste it into a stranger's terminal, don't paste it into a remote model. With a local model, the terminal is your own.
3. Latency and cost-per-token-of-electricity
Local stopped being slow.
Recent benchmarks (SitePoint, 2026) put Ollama serving Llama 3.1 8B (Q4_K_M) at roughly 62 tok/s on a single stream, against 71 tok/s for vLLM at FP16. That is a 13% gap, mostly from quantisation, not architecture. Ollama is riding llama.cpp's kernels, and its own overhead on top of raw llama.cpp has been measured at 2–8% at the same settings (Ganglani, on a 4090 with a 7B model). For a 70B model on Apple Silicon, the gap is even narrower in absolute terms. A human reading a response does not feel the difference.
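To put the quoted throughput figures in reading terms, here is a minimal sketch of the arithmetic. The 62 and 71 tok/s values come from the benchmark cited above; the 300-token reply length is an assumption chosen only for illustration.

```python
def relative_gap(baseline_tps: float, other_tps: float) -> float:
    """Fraction by which other_tps falls short of baseline_tps."""
    return (baseline_tps - other_tps) / baseline_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate `tokens` at a steady `tps` tokens per second."""
    return tokens / tps


vllm_tps = 71.0     # vLLM FP16, from the cited benchmark
ollama_tps = 62.0   # Ollama Q4_K_M, from the cited benchmark
reply_tokens = 300  # assumption: an illustrative reply length

gap = relative_gap(vllm_tps, ollama_tps)
extra = seconds_for(reply_tokens, ollama_tps) - seconds_for(reply_tokens, vllm_tps)

print(f"gap: {gap:.1%}")            # gap: 12.7%
print(f"extra wait: {extra:.2f}s")  # extra wait: 0.61s
```

Under that assumed reply length, the slower stack finishes about six tenths of a second later, and the text has been streaming the whole time.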
What you do feel: no network round-trip, no rate limit, no $0.000-something per call adding up while you iterate. The marginal cost of the next prompt is the electricity to compute it. That changes how you use the thing. You don't ration. You don't batch. You just try.
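The electricity cost is easy to estimate for your own box. The sketch below is a rough model, not a measurement: the 300 W draw and the 0.30 per kWh price are hypothetical placeholders, and only the 62 tok/s figure comes from the benchmark cited earlier. Swap in your own numbers.

```python
def electricity_cost_per_million_tokens(
    watts: float, tokens_per_second: float, price_per_kwh: float
) -> float:
    """Energy cost of generating one million tokens at steady throughput."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds to kilowatt-hours
    return kwh * price_per_kwh


cost = electricity_cost_per_million_tokens(
    watts=300.0,             # assumption: whole-machine draw under load
    tokens_per_second=62.0,  # single-stream figure from the cited benchmark
    price_per_kwh=0.30,      # assumption: your tariff, in your currency
)
print(f"{cost:.2f} per million tokens")  # 0.40 per million tokens
```

The model ignores idle draw and hardware purchase cost; it only prices the marginal prompt, which is the quantity the paragraph above is about.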
Where this lands
Local-first AI isn't a purity test. Cloud models are still bigger, still better at the long-tail task, still the right call for the one-shot research query. The hybrid is the realistic stance: local by default, cloud opt-in per prompt, your data never leaves the box unless you explicitly send it.
That's the shape forge ships with. Local model, unlimited prompts, no telemetry. Cloud is a flag, not a default.
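The rule "local by default, cloud opt-in per prompt" can be sketched as a small router. This is an illustration of the idea, not forge's actual code: `Router`, `allow_cloud`, and the two stand-in backends are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A backend is anything that turns a prompt into a reply.
Backend = Callable[[str], str]


@dataclass
class Router:
    local: Backend
    cloud: Optional[Backend] = None  # unset means cloud is unavailable

    def complete(self, prompt: str, *, allow_cloud: bool = False) -> str:
        # Default path: the prompt stays on this machine.
        if not allow_cloud:
            return self.local(prompt)
        if self.cloud is None:
            raise RuntimeError("cloud requested but no cloud backend is configured")
        return self.cloud(prompt)


# Stand-in backends for illustration; real ones would call a model.
def fake_local(prompt: str) -> str:
    return f"local:{prompt}"


def fake_cloud(prompt: str) -> str:
    return f"cloud:{prompt}"


router = Router(local=fake_local, cloud=fake_cloud)
print(router.complete("summarise my notes"))                       # local:summarise my notes
print(router.complete("survey the literature", allow_cloud=True))  # cloud:survey the literature
```

The flag is keyword-only and defaults to off, so a prompt can only reach the cloud backend when the caller says so at that call site.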
You own the tool. The tool runs in your house. The work stays in your house.
That used to be how software worked. It can be that again.
Sources
- Ink & Switch: Local-first software — 2019 essay by Kleppmann, Wiggins, van Hardenberg, McGranaghan; the seven ideals of local-first software.
- SitePoint: Ollama vs vLLM Performance Benchmark 2026 — single-stream and concurrent benchmarks for Llama 3.1 8B on consumer GPUs.
- Kunal Ganglani: Ollama vs llama.cpp 2026 — measurement of Ollama's overhead vs. raw llama.cpp (2–8% on a 4090 / 7B).
- RedactChat: Self-Hosted vs Cloud LLMs — compliance and trust-boundary argument for self-hosted inference.
- SitePoint: Definitive Guide to Local LLMs 2026 — Privacy, Tools & Hardware — overview of the current local-LLM toolchain and privacy posture.