What actually exists now#
The local-LLM pitch has been "you can drop your closed-API subscription and self-host" for two years. The pitch keeps maturing. This post is the verified 2026 version — what exists, what it costs, and what still does not work.
I am covering three flagships (Llama 4, DeepSeek V3, Mistral Large 3), the hosting-provider economics, the hardware reality, and the tooling layer. All numbers are from cited sources, not estimates.
Llama 4 (Meta, April 2025)#
Meta released Llama 4 on 5 April 2025. The family has three members:
- Scout: 17B active parameters / 16 experts (109B total), 10M token context window
- Maverick: 17B active parameters / 128 experts (400B total), multimodal
- Behemoth: larger model still in training at announcement
The 10M context window on Scout is real and reported by Meta, making it the longest-context open-weight model available. Hugging Face release notes have the implementation details.
In practice: Scout is a strong mid-tier model at a sensible inference cost. Maverick is multimodal and closer to frontier closed models on benchmarks. Neither has displaced Claude or GPT-5 on the hardest tasks, but both are deployable.
DeepSeek V3 (DeepSeek, late 2024 through 2026)#
DeepSeek V3 (paper on arXiv) is a 671B total / 37B active parameter MoE model with Multi-Head Latent Attention. The model architecture is documented in detail and has been studied extensively.
Pricing on the official DeepSeek API:
- V3: $0.27 per million input tokens, $1.10 per million output
- V3.1: $0.15 / $0.75
- V3.2: $0.26 / $0.38
These undercut every closed frontier model by a wide margin. The trade-off is that DeepSeek lags Claude and GPT-5 on tool-use reliability, long-context coherence, and frontier-level instruction following. For high-volume tasks where the price gap matters, that trade is easily worth it. For complex production workflows where reliability matters more than cost, it is not.
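At per-token pricing the arithmetic is simple, but worth making explicit. A minimal sketch using the prices listed above (the dictionary keys are my own labels, not official API model identifiers):

```python
# Per-million-token prices in USD, from the official DeepSeek API list above.
PRICES = {
    "deepseek-v3":   {"in": 0.27, "out": 1.10},
    "deepseek-v3.1": {"in": 0.15, "out": 0.75},
    "deepseek-v3.2": {"in": 0.26, "out": 0.38},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job at the listed per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Example: a month of batch work, 10M input + 2M output tokens on V3.2.
monthly = job_cost("deepseek-v3.2", 10_000_000, 2_000_000)  # $3.36
```

At these rates, even a heavy batch workload costs single-digit dollars per month, which is the entire argument for routing high-volume tasks here.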
Mistral Large 3 (Mistral, December 2025)#
Mistral released Large 3 on 2 December 2025. 675B total / 41B active parameters, also MoE. 256K context window, multimodal and multilingual.
This is the current Mistral flagship, replacing Large 2 (the 123B dense model from mid-2024). For European-language work especially, Mistral remains the best open-weight option because the training data mix was calibrated for it.
What hosted providers charge#
For most people, running these models means calling them through a hosting provider, not buying the hardware. Inference.net's comparison and Together's pricing page give verified numbers:
- Llama 4 Scout: roughly $0.08/M input, $0.30/M output at most providers.
- Llama 4 Maverick: roughly $0.20/M input, $0.60/M output.
- Together AI, Fireworks, DeepInfra: competitive on price.
- Groq: competitive on speed rather than price, with significantly faster inference on supported models.
Compare to Claude Sonnet 4.6 at $3/$15 per million: open-weight Llama 4 Scout via hosting is roughly 37x cheaper on input tokens and 50x cheaper on output. This is where the "local" economics actually comes from in 2026: not from running hardware yourself, but from using cheaper-per-token open models via API.
The hardware reality#
If you actually want to run on your own hardware:
H100 PCIe on the secondary market: roughly $25,000 to $30,000 in 2026, down from the $80,000-$120,000 peak in 2023 (ThunderCompute pricing analysis). DGX H100 complete systems run $250,000 to $400,000.
RunPod H100: $2.69/hour on-demand in March 2026. Lambda H100 SXM: $3.78/hour on-demand (Intuition Labs rental comparison). Spheron: as low as $2.01/hour.
For most workloads, renting beats buying until you are running continuously. At $2.69/hour for 8 hours/day, 20 days/month, you pay roughly $430/month for dedicated H100 access. An owned H100 at $25,000 pays back at that usage in just under 5 years — probably longer than the hardware's useful life at this pace of progression.
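The payback arithmetic behind that claim, as a quick check:

```python
# Rent-vs-buy break-even for an H100, using the figures cited above.
HOURLY_RATE = 2.69        # RunPod on-demand, $/hour
HOURS_PER_MONTH = 8 * 20  # 8 hours/day, 20 days/month
PURCHASE_PRICE = 25_000   # low end of the secondary-market range

monthly_rental = HOURLY_RATE * HOURS_PER_MONTH          # ~$430/month
breakeven_years = PURCHASE_PRICE / monthly_rental / 12  # ~4.8 years
```

Note this ignores power, cooling, and the host machine around the card, all of which push the break-even point for ownership even further out.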
Apple Silicon: an M4 Max with 128GB unified memory runs a 70B-class model (e.g. Llama 3.3 70B) in 4-bit quantisation at maybe 25 tokens per second, usable for single-user workloads. Ollama v0.19 added MLX support in March 2026, making this more practical.
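Why 128GB is comfortably enough: a back-of-envelope sizing formula (the flat 20% overhead factor for KV cache and runtime is my assumption, not a measurement):

```python
# Rough memory footprint of a quantised model:
# parameters x bytes-per-weight, plus a flat 20% overhead (an assumption).
def quantised_size_gb(params_billion: float, bits_per_weight: int) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit = 1 GB
    return weights_gb * 1.2

# 70B at 4-bit: 35 GB of weights, ~42 GB with overhead --
# comfortably inside 128 GB of unified memory.
size = quantised_size_gb(70, 4)
```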
The tooling layer#
All three major local-LLM tools remain actively maintained in 2026:
- Ollama: v0.19 with MLX on Apple Silicon. One-command local model hosting.
- LM Studio: still the only full-featured GUI option. Good for non-technical users.
- vLLM: v0.11 with Blackwell FP4/FP8 support. vLLM-Omni arrived November 2025 for multimodal. This is the production-grade option.
Any of these will take you from "I downloaded a model" to "I have an OpenAI-compatible API endpoint" in under an hour. That used to be the hardest part. It is now solved.
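For illustration, here is what talking to such an endpoint looks like, assuming Ollama's default OpenAI-compatible route on localhost (the port, path, and model name are placeholders for whatever you actually serve; vLLM and LM Studio expose the same shape of API):

```python
import json
import urllib.request

# Ollama's default OpenAI-compatible base URL (adjust for your setup).
BASE_URL = "http://localhost:11434/v1"

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for a local endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("llama4-scout", "Summarise this in one sentence: ...")
# urllib.request.urlopen(req) would send it to the running local server.
```

Because the request shape is OpenAI-compatible, any existing client code can be pointed at the local endpoint by changing only the base URL.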
Where the gap still hurts#
Real limitations that matter in 2026:
Long context. Llama 4 Scout claims 10M tokens. In practice it starts degrading noticeably well before the advertised maximum. For long-document work that actually needs the full window, Claude's 1M context with proven retrieval behaviour is still more reliable.
Tool use and structured output. Open-source models are less reliable at JSON output, function calling with complex schemas, and multi-turn tool conversations. For a production pipeline running thousands of times, the reliability gap compounds. For one-off queries, it is fine.
Frontier-level instruction following. "Respond in this specific format but only if the input meets these conditions, otherwise say nothing" style instructions are handled more reliably by Claude 4.7 and GPT-5 than by any current open model. Not by a huge margin, but by enough to matter in production.
Where local actually wins#
Privacy and compliance. If you process medical data, legal documents, or DSGVO (GDPR)-restricted material, self-hosting is not a cost play; it is a legal requirement. The economics matter less than the fact that the closed APIs simply do not pass some risk reviews.
High-volume simple tasks. Classification, translation, tagging at millions of calls per month. At these volumes, per-token cost beats everything else, and the reliability gap matters less with downstream validation.
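That downstream validation can be as small as a schema check that rejects malformed outputs instead of letting them into your data. A sketch (the label set and expected JSON shape are illustrative):

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # illustrative label set

def validate_classification(raw: str):
    """Return the label if the model output is valid JSON with a known
    label, else None so the caller can retry or route to a fallback model."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    label = parsed.get("label") if isinstance(parsed, dict) else None
    return label if label in ALLOWED_LABELS else None

# A bare-string or chatty response is caught, not ingested.
ok = validate_classification('{"label": "neutral"}')           # "neutral"
bad = validate_classification("Sure! The label is: positive")  # None
```

At millions of calls per month, a check like this plus a retry-or-fallback path is usually enough to close most of the reliability gap.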
Custom fine-tuning. If you have domain-specific data, fine-tuning Llama 4 or Mistral on your data produces something the closed APIs cannot. This remains the most legitimate reason to self-host in 2026.
Edge and offline. You cannot call Claude from a device without internet. You can ship a 3B or 7B model to run on-device. This is a growing category for mobile apps, automotive, and industrial use cases.
The hybrid pattern#
After a year of testing, my own working stack:
- Claude Sonnet 4.6 via API for coding, writing, long-context, and client-facing work.
- Claude Haiku 4.5 for simple summarisation at low volume.
- Llama 4 Scout via Together AI for batch data processing where the 40-50x per-token price advantage matters.
- DeepSeek R1 via API for high-volume reasoning tasks where the price gap closes most of the quality gap.
- Local Mistral 7B on Mac for privacy-sensitive one-offs.
Monthly cost across all of these: around $130 at my usage, compared to roughly $250 if I ran everything on frontier APIs. The quality of output has gone up because I am using better-fit models per task, not down.
Decision rule#
Do you have privacy or compliance requirements? Self-host or use a DSGVO-conformant European API. Economics are secondary.
Do you run high-volume repeatable tasks? Use an open-source model through Together, Fireworks, or DeepInfra. You get cost savings without ops burden.
Are you a solo dev or small team doing general work? Stay on Claude or GPT-5 as your default. The cost of a subscription is dwarfed by the productivity, and managing a local stack is not free engineering time.
Are you specifically excited about tinkering? Go for it. Just do not convince yourself you are saving money once you factor in your own time.
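The whole rule fits in a few lines of Python, checked in order of precedence (the return strings are shorthand for the recommendations above, not product names):

```python
def choose_stack(privacy_required: bool, high_volume: bool, tinkerer: bool) -> str:
    """The decision rule above, questions checked in order of precedence."""
    if privacy_required:
        return "self-host or use a DSGVO-conformant European API"
    if high_volume:
        return "open-weight model via a hosting provider"
    if tinkerer:
        return "local stack, with eyes open about the time cost"
    return "closed frontier API as the default"
```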
Further reading#
- The $500 AI Stack That Replaced My $3,000 SaaS Bill for the broader question of stack design.
- Hidden Costs of Credit-Based AI Pricing for the closed-API cost traps.
- Reasoning Models: When Are They Worth It? for when to deploy compute at all.
Sources#
- Llama 4 release: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
- Llama 4 Hugging Face notes: https://huggingface.co/blog/llama4-release
- DeepSeek V3 paper: https://arxiv.org/abs/2412.19437
- DeepSeek API pricing: https://api-docs.deepseek.com/news/news1226
- Mistral Large 3 launch: https://techcrunch.com/2025/12/02/mistral-closes-in-on-big-ai-rivals-with-mistral-3-open-weight-frontier-and-small-models/
- Mistral Large 2 history: https://mistral.ai/news/mistral-large-2407
- Hosting provider pricing: https://inference.net/content/llm-api-pricing-comparison/
- Together pricing: https://www.together.ai/pricing
- H100 secondary market pricing: https://www.thundercompute.com/blog/nvidia-h100-pricing
- H100 rental comparison: https://intuitionlabs.ai/articles/h100-rental-prices-cloud-comparison
- Local tooling roundup: https://medium.com/@rosgluk/local-llm-hosting-complete-2025-guide-ollama-vllm-localai-jan-lm-studio-more-f98136ce7e4a
