Local LLMs: When to Ditch the Cloud
Running AI models on your own hardware - when it makes sense, what you need, and the real trade-offs between cloud APIs and local inference.
Cloud AI is convenient. It's also:
- Expensive at scale
- A privacy concern
- Dependent on someone else's uptime
- Rate-limited when you need it most
Sometimes running your own models is the right call.
When Local Makes Sense
Privacy-Critical Applications
Medical data. Financial records. Legal documents.
Every API call is data leaving your network.
Local: data never leaves your servers.
High-Volume, Low-Complexity
Processing millions of documents with a simple classifier.
API costs spiral. Local costs are fixed after hardware.
Latency-Sensitive
API calls have network overhead. 100-500ms minimum.
Local inference can be single-digit milliseconds.
Offline Requirements
Aircraft. Ships. Remote sites. Disaster scenarios.
Can't call an API without internet.
When Cloud Makes Sense
Frontier Capabilities
GPT-4, Claude 3 Opus - the best models aren't available locally.
If you need state-of-the-art, you rent it.
Variable Load
Traffic spikes 10x during launches. Crashes back down.
Owning hardware for peak load means waste at average load.
Moving Fast
New model drops? API update. Done.
Local? Download, test, deploy, hope nothing breaks.
The Hardware Reality
What You Need
For development/prototyping:
- Mac M2/M3 with 16GB+ RAM: runs 7B models well
- Consumer GPU (RTX 3080+): runs 13B models (quantized)
For production inference:
- A100 40GB: runs 13B-34B models at speed; 70B only with aggressive quantization
- H100: runs everything fast
- Multi-GPU clusters: runs multiple models simultaneously
Cost Comparison
| Setup | Upfront | Monthly | Break-even* |
|---|---|---|---|
| GPT-4 API | $0 | $10,000 | Never |
| Single A100 | $15,000 | $500 (power/hosting) | 2 months |
| Cloud A100 | $0 | $2,000 | Immediate |
*At 1M tokens/day. Your math will vary.
Model Selection
Small and Focused (1-8B params)
- Fast inference
- Runs on consumer hardware
- Good for classification, extraction, simple generation
Examples: Phi-2, Mistral 7B, Llama 3 8B
Medium and Capable (13-34B params)
- Reasonable speed with good hardware
- Better reasoning, more knowledge
- Production-viable for most tasks
Examples: Llama 2 13B, Yi 34B, Mixtral 8x7B
Large and Powerful (70B+ params)
- Needs serious hardware
- Approaching API-model quality
- Worth it for specific applications
Examples: Llama 3 70B, Qwen 72B
Optimization Techniques
Quantization
Reduce precision, reduce memory, increase speed.
- FP16 → standard, minimal quality loss
- INT8 → 50% memory reduction, slight quality loss
- INT4 → 75% memory reduction, noticeable quality loss
from transformers import AutoModelForCausalLM

# 4-bit loading requires the bitsandbytes package;
# a 70B model at 4 bits still needs roughly 35GB of VRAM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    load_in_4bit=True,
)
Use the lowest precision that meets your quality requirements.
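As a rule of thumb, weight memory scales linearly with bit width. A minimal sketch of the arithmetic (weights only; the KV cache and activations add more on top):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB for a model at a given precision.
    Ignores KV cache and activation overhead."""
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

# A 70B model at different precisions (weights only):
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```

This is why quantization is the difference between needing a multi-GPU node and fitting on a single card.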
Batching
Process multiple requests simultaneously.
Single request doesn't saturate GPU. Batching does.
Single request: 50 tokens/second
Batch of 8: 300 tokens/second (6x throughput)
Latency increases. Throughput increases more.
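A toy simulation of why batching helps. Here `fake_forward` is a stand-in for a GPU forward pass with a fixed per-call overhead (mimicking kernel-launch and memory-transfer cost); the sleep durations are illustrative, not benchmarks:

```python
import time

def fake_forward(batch):
    # Fixed per-call overhead plus a small per-item cost,
    # mimicking an underutilized GPU.
    time.sleep(0.01 + 0.001 * len(batch))
    return [f"out-{x}" for x in batch]

def run(requests, batch_size):
    start = time.perf_counter()
    outputs = []
    for i in range(0, len(requests), batch_size):
        outputs.extend(fake_forward(requests[i:i + batch_size]))
    return outputs, time.perf_counter() - start

reqs = list(range(32))
_, t1 = run(reqs, 1)   # one request per forward pass
_, t8 = run(reqs, 8)   # batch of 8: pays the overhead 4 times, not 32
print(f"batch=1: {t1:.2f}s  batch=8: {t8:.2f}s")
```

Production servers like vLLM do this dynamically (continuous batching), grouping whatever requests are in flight.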
KV Caching
Store computed attention for reuse.
Essential for chat applications where context persists.
# First message: 500ms
# Subsequent messages: 100ms (reusing cached context)
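The idea can be sketched with a toy prefix cache. Here `_encode` stands in for the attention computation; a real KV cache stores per-layer key/value tensors, not character codes:

```python
class PrefixCache:
    """Toy illustration of KV caching: expensive prefix work is done
    once and reused on every subsequent turn of a chat."""
    def __init__(self):
        self._cache = {}
        self.encode_calls = 0

    def _encode(self, text):
        self.encode_calls += 1          # stands in for attention compute
        return [ord(c) for c in text]   # fake "KV states"

    def generate(self, prefix, new_message):
        if prefix not in self._cache:
            self._cache[prefix] = self._encode(prefix)
        kv = self._cache[prefix]
        return kv + self._encode(new_message)

cache = PrefixCache()
cache.generate("system prompt + history", "first message")
cache.generate("system prompt + history", "second message")
print(cache.encode_calls)  # prefix encoded once, each message once: 3
```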
Speculative Decoding
Small model drafts, large model verifies.
Draft model (7B): generates 8 tokens fast
Main model (70B): accepts 6, rejects 2
Net: faster than pure 70B inference
Cutting edge. Increasingly available.
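The accept/reject loop can be sketched in a few lines. `target_next` here is a stand-in for the large model's greedy next-token choice, not a real model; real implementations verify all draft positions in one batched forward pass, which is where the speedup comes from:

```python
def speculative_step(draft_tokens, target_next):
    """One speculative-decoding step (toy): the draft model proposed
    draft_tokens; accept the longest prefix the target model agrees
    with, then take one corrected token from the target."""
    accepted = []
    for tok in draft_tokens:
        if target_next(accepted) == tok:
            accepted.append(tok)  # target agrees: keep the cheap token
        else:
            accepted.append(target_next(accepted))  # target overrides
            break
    return accepted

# Target "model" that always continues the sequence 1, 2, 3, ...
target = lambda prefix: len(prefix) + 1

print(speculative_step([1, 2, 3, 9], target))  # → [1, 2, 3, 4]
```

The output always matches what the large model alone would have produced; the draft model only changes how fast you get there.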
The Ops Reality
What You're Signing Up For
- Model updates (download, test, deploy)
- Hardware maintenance (GPUs fail)
- Scaling (need more capacity, buy more hardware)
- Monitoring (is inference actually working?)
Cloud APIs abstract this. Local ownership exposes it.
Hybrid Approaches
Best of both:
if privacy_critical(request):
    return local_model.generate(request)
elif complex_reasoning(request):
    return cloud_api.generate(request)
else:
    return local_model.generate(request)  # cheaper default
Use local by default. Cloud for what local can't do.
Getting Started
The Easy Path
- Install Ollama: one-line install, runs models locally
- Pull a model: ollama pull llama3
- Run inference: ollama run llama3
Start here. Move to serious infrastructure when you hit limits.
The Production Path
- vLLM: High-performance inference server
- Text Generation Inference (TGI): HuggingFace's solution
- NVIDIA Triton: Enterprise-grade serving
Benchmark before choosing. Needs vary.
The Math Exercise
Before going local:
Current API cost: $X/month
Local hardware cost: $Y (one-time) + $Z/month (ops)
Break-even point: Y / (X - Z) months
If break-even < 12 months and you can handle ops: go local.
If break-even > 24 months or ops is a stretch: stay cloud.
Cold economics. Run the numbers.
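The formula above, as a tiny calculator (the example numbers are the ones from the cost table):

```python
def break_even_months(api_monthly, hardware_upfront, ops_monthly):
    """Months until owned hardware pays for itself vs. the API bill.
    Returns None if local never breaks even (ops >= API cost)."""
    savings = api_monthly - ops_monthly
    if savings <= 0:
        return None
    return hardware_upfront / savings

# $10k/month API vs. a $15k A100 at $500/month ops
print(f"{break_even_months(10_000, 15_000, 500):.1f} months")  # → 1.6 months
```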
Further Reading
Technical Resources
- Ollama - Easy local inference
- vLLM - Production inference
- LMSys Chatbot Arena - Model comparisons
Related Posts
- The Cost of AI - Understanding AI economics
- RAG vs Fine-Tuning - Model customization options
Cloud AI is a service. Local AI is a capability. Services can be cut off. Capabilities endure. Choose based on what you're building for.