AI Tool Radar

Open-Source AI in 2026: Does Local Actually Make Sense? Gemma 4, DeepSeek V4, Llama 4 & Qwen 3.5

The gap between open-weight and closed models has shrunk to a few benchmark points. But the open flagships are mixture-of-experts giants that need server hardware, not your laptop. Here is what actually runs locally, what does not, and when self-hosting is worth it, especially for data protection in the DACH region.

6 min read · 2026-05-25 · By Roland Hentschel
Tags: open source, local llm, gemma 4, deepseek v4, llama 4, qwen, data protection

The gap closed, but read the fine print

The big story in open-weight models is real: the distance between the best open models and the best closed ones has narrowed to a handful of benchmark points. DeepSeek, Alibaba and Google are shipping models that trade blows with frontier systems, and they publish the weights.

The fine print is where most articles mislead you. "Open weights" does not mean "runs on your laptop". The 2026 open flagships are sparse mixture-of-experts (MoE) models with hundreds of billions to over a trillion parameters. They are open in the licensing sense, but they still need serious server hardware to run. The models that actually run on a workstation are the small ones. Confusing those two categories is the single most common mistake in this space, so let us separate them clearly.

The 2026 open flagships (server-class)

These are the headline releases. All are MoE, which means a large total parameter count but only a fraction active per token. That makes them cheaper to run per token than their size suggests, but every parameter still has to sit in memory, so they are not laptop material.

DeepSeek V4 shipped on 24 April 2026 as two models, both open weights under the MIT license. Per DeepSeek's own release notes, V4-Pro is 1.6T total / 49B active parameters and V4-Flash is 284B total / 13B active, both with a 1 million token context window and output up to 384K tokens. This is arguably the strongest fully-open release of the year on coding and reasoning.

Qwen 3.5 from Alibaba arrived 16 February 2026. The flagship Qwen3.5-397B-A17B carries 397B total parameters with 17B active, on a hybrid architecture, with native context up to 262K tokens (extendable toward 1M). It is available on Hugging Face and is one of the strongest open models for reasoning and multilingual work.

Llama 4 is the older player here, and that is worth saying plainly. Meta released the Llama 4 herd in April 2025: Scout (17B active, 16 experts, 109B total, a headline 10M token context) and Maverick (17B active, 128 experts, 400B total). A year on it is still widely deployed and well supported, but it is no longer the freshest option, and that shows on the newer benchmarks.

The practical point: none of these run on consumer hardware. To self-host them you are renting or buying GPU servers. For most small businesses that means using them through an API (DeepSeek's own, or a hosting provider), not running them yourself.

What actually runs locally (workstation-class)

This is the category that matters if your reason for going open is privacy or offline use.

Gemma 4 is Google's open-model family, released in April 2026. Google calls it "byte for byte the most capable open model" and notes Gemma has now been downloaded over 500 million times. It comes in several sizes, from a roughly 2B-class model up to around 31B, and the smaller variants are the point: they are built to run on a single GPU or a capable laptop. A 30B-class model at 4-bit quantization needs about 15GB for the weights alone (30 billion parameters at half a byte each), plus headroom for the context cache, so plan for roughly 16-24GB of VRAM. A workstation GPU in that range or an Apple Silicon machine with enough unified memory can handle it.
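The sizing arithmetic is easy to check yourself. Here is a minimal sketch, assuming memory is dominated by the weights plus a flat overhead for the context cache and runtime buffers. The 20% overhead is an illustrative assumption, not a measured figure, and the function names are made up for this example. Real usage depends on the quantization format and on how long a context you run.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def estimated_vram_gb(params_billion: float, bits_per_weight: float,
                      overhead_fraction: float = 0.2) -> float:
    """Weights plus an ASSUMED flat overhead for context cache and buffers."""
    weights = weight_memory_gb(params_billion, bits_per_weight)
    return weights * (1 + overhead_fraction)


if __name__ == "__main__":
    # Model sizes are illustrative points in the 2B to 31B range mentioned above.
    for size in (2, 12, 31):
        for bits in (4, 8, 16):
            print(f"{size:>3}B at {bits:>2}-bit: "
                  f"weights {weight_memory_gb(size, bits):5.1f} GB, "
                  f"with overhead {estimated_vram_gb(size, bits):5.1f} GB")
```

The useful rule of thumb that falls out of this: at 4-bit, budget a bit more than half a gigabyte per billion parameters, and treat anything that lands right at your card's limit as too tight.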

Alongside Gemma, the smaller Qwen variants and Mistral's open models fill out the locally-runnable tier. Tools like Ollama (ollama run gemma4) and LM Studio have made the setup genuinely simple, which was the missing piece two years ago.
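Once a model is running under Ollama, other programs on the same machine can talk to it over a local HTTP endpoint, so nothing leaves the box. The sketch below uses only the Python standard library and Ollama's `/api/generate` endpoint on its default port 11434. The model tag `gemma4` is taken from the command above and is an assumption here; check `ollama list` for the tags actually installed on your machine. The helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4") -> urllib.request.Request:
    """Build the POST request; 'stream': False asks for one JSON reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the Ollama server running, `generate("Summarise this paragraph in one sentence: ...")` returns the model's answer as a string. If no server is listening, `urlopen` raises a `URLError`, which is a quick way to confirm that the request never went anywhere else.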

The honest trade-off: a 30B local model is good, not frontier. It will handle summarisation, drafting, classification, structured extraction and most everyday tasks well. It will not match GPT-5.5 or Claude Opus 4.7 on the hardest reasoning or coding. For a lot of business work, that is a perfectly acceptable deal, especially given what you get in return.

The DACH angle: when local is worth it

For businesses in Germany, Austria and Switzerland, the appeal of local models is rarely raw capability. It is data protection. Running a model on your own hardware means customer data, case files, patient notes or contracts never leave your premises. No US cloud, no transfer question, no third-party processor to put in a contract.

That is a strong argument for regulated and sensitive work, and we have written before about local LLMs for regulated professions in DACH. But be honest about the cost side:

  • Hardware is a real investment. A capable local setup (a workstation with 16-24GB+ of VRAM, or an Apple Silicon machine with large unified memory) is a four-figure purchase, plus the time to maintain it.
  • Quality is "good enough", not best-in-class. You are trading the top 10% of capability for control over your data.
  • Compliance is not automatic. Local processing helps with data protection, but it does not by itself make you GDPR-compliant. You still need the documentation, the legal basis and the rest.

The decision is not "open vs closed" in the abstract. It is "does this specific workflow involve data I cannot send to a cloud, and is a good-enough model acceptable for it?" When the answer to both is yes, a local Gemma or Qwen on your own machine is an excellent fit. When the work is general and the data is not sensitive, a cheap hosted model will almost always be faster, better and less hassle.

MoE is now the default, and that is good news

One structural shift worth understanding: nearly every flagship open model in 2026 is a sparse mixture-of-experts. Instead of activating all parameters for every token, the model routes each token to a small subset of "expert" sub-networks. A 397B model might only use 17B parameters per token.
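To make the routing idea concrete, here is a toy sketch of generic top-k gating in plain Python. It is not the router of any model named in this article: production routers add load balancing, shared experts and other refinements, and the "experts" here are simple scalar functions standing in for full sub-networks. All names are made up for this example.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(token, experts, router_scores, top_k):
    """Run only the selected experts and blend their outputs."""
    gates = route_token(router_scores, top_k)
    return sum(weight * experts[i](token) for i, weight in gates.items())


def active_fraction(total_billion, active_billion):
    """Share of the parameters that take part in one token's computation."""
    return active_billion / total_billion


if __name__ == "__main__":
    random.seed(0)
    # Eight toy experts; each just scales its input by a different factor.
    experts = [lambda x, scale=k + 1: x * scale for k in range(8)]
    scores = [random.gauss(0, 1) for _ in experts]
    gates = route_token(scores, top_k=2)
    print("experts used:", sorted(gates), "of", len(experts))
    print("layer output:", moe_layer(1.0, experts, scores, top_k=2))
    print(f"397B total, 17B active: {active_fraction(397, 17):.1%} per token")
```

The other six experts are never called for this token, which is where the compute saving comes from. All eight still have to be loaded, which is why the memory bill follows the total parameter count rather than the active one.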

For you, this means open models cost less and respond faster than their headline size suggests, while keeping the capability that comes with a large parameter count. It is the main reason the open-vs-closed gap closed so fast, and the main reason even the giant models are cheaper to serve than their headline parameter counts suggest.

The bottom line

Open-weight AI in 2026 is in great shape, but the useful question is narrow. If you want to self-host for privacy, look at the small, locally-runnable models (Gemma 4, smaller Qwen, Mistral) and accept "good enough". If you just want frontier capability cheaply, use the big open models (DeepSeek V4, Qwen 3.5) through an API, or frankly compare them against a hosted closed model on price and quality for your task. Do not buy a server because a 1.6-trillion-parameter model is "free to download". It is free to download and expensive to run.

Roland Hentschel

AI & Web Technology Expert

Web developer and AI enthusiast helping businesses navigate the rapidly evolving landscape of AI tools. Testing and comparing tools so you don't have to.