The thing everyone notices
If you use ChatGPT in both English and German, you have felt this: the English responses are sharper, more nuanced, more useful. The German ones are flatter, more generic, sometimes noticeably wrong in ways the English version would not be.
This is not imagined. It is the direct consequence of how large language models learn, how their training data is distributed, and how tokenisation works. None of the fixes are perfect, but there are real things you can do that close most of the gap.
What is actually happening
Three things stack against German output in GPT-5 and similar models.
First, the training data is dramatically English-heavy. Best estimates put English at 50-70 percent of the pretraining corpus across major models. German is usually the second or third most represented European language, but still only a few percent. That means for the same concept, the model has seen maybe 10-20 times more English examples than German ones.
Second, German tokenisation is inefficient. The tokeniser for most modern models was optimised for English. German words get split into more tokens, which means:
- You hit context limits faster
- Output generation is slower
- The model has weaker statistical grip on German word patterns
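The fragmentation effect behind these bullets can be sketched with a toy greedy tokeniser. The vocabulary below is invented for illustration (real models learn subword vocabularies with BPE, and a library like tiktoken will give you real counts), but it mimics an English-heavy vocabulary: whole English words survive intact, while a German compound shatters into fragments.

```python
# Toy greedy longest-match tokeniser. Illustration only: real models use
# BPE, but the fragmentation effect on under-represented languages is the same.

# An "English-heavy" vocabulary: full English words, but only short
# fragments for German, as if German appeared rarely in training.
VOCAB = {
    "speed", "limit", "the", "on", "high", "way",
    "ge", "schwind", "ig", "keits", "be", "grenz", "ung",
}

def tokenise(text: str, vocab: set[str]) -> list[str]:
    """Split each word greedily into the longest vocabulary matches."""
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])  # unknown character: one token each
                i += 1
    return tokens

english = tokenise("speed limit", VOCAB)                 # 2 tokens
german = tokenise("Geschwindigkeitsbegrenzung", VOCAB)   # 7 tokens
```

The English phrase costs 2 tokens; the German word for the same concept costs 7. Scale that over a whole document and the context-window and generation-speed penalties above follow directly.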
Third, high-quality German reference text in the training data is narrower. Academic papers, documentation, and technical content are overwhelmingly English. Even when the model knows a concept in German, its "idea" of that concept was formed mostly from English sources and translated.
What people usually try first (and why it mostly fails)
The most common reflex is to prompt in German and expect German-native quality. When that fails, the next move is usually prompting in English and asking for a German translation at the end.
That second approach is better than the first, but it produces a characteristic problem: the German output reads like translated English. Sentence structures are wrong. Idioms land flat. Compound words that should exist do not, and compound words that should not exist appear. A German reader can smell it instantly.
What actually helps
Over the last year I have landed on five techniques that meaningfully improve German output quality. None are magic, but stacked together they close most of the English-German gap.
1. Prompt in German, but frame in English
The system prompt or persistent context should be in English. The user request should be in German. This is counter-intuitive but works because it lets the model use its stronger English reasoning for the instructions while still generating in German.
Example system prompt:
You are a German-language copywriter with 15 years of experience writing for B2B SaaS in the DACH region. You write in "Sie" form by default, use short sentences, avoid Anglicisms where clean German alternatives exist, and never use em-dashes. All output must be in German unless explicitly asked otherwise.
The user prompt in actual German then gets the benefit of the English-framed guardrails.
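In chat-API terms the split looks like this. A minimal sketch, assuming an OpenAI-style messages list; `build_messages` and the sample request are placeholder names of my own, not part of any SDK.

```python
# English-framed guardrails in the system message, German task in the
# user message. Adapt the structure to whatever client library you use.

SYSTEM_PROMPT = (
    "You are a German-language copywriter with 15 years of experience "
    "writing for B2B SaaS in the DACH region. You write in 'Sie' form by "
    "default, use short sentences, avoid Anglicisms where clean German "
    "alternatives exist, and never use em-dashes. All output must be in "
    "German unless explicitly asked otherwise."
)

def build_messages(german_request: str) -> list[dict]:
    """English instructions, German task, in chat-message form."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": german_request},
    ]

messages = build_messages(
    "Schreibe eine Produktbeschreibung für unsere "
    "Projektmanagement-Software, maximal 120 Wörter."
)
```

The model reads the stronger-language instructions first, then generates against the German request, which is exactly the division of labour described above.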
2. Provide German examples explicitly
Tokens that do not appear in training data often enough get shaky output. You can partially fix this by including 1-3 high-quality German example passages in the prompt.
These do not have to be on the same topic. They should show the tone, sentence structure, and word choice you want. The model will pattern-match against them and produce output that respects those patterns.
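One way to wire this in. A hypothetical helper that prepends style examples to the task; the two sample passages are invented stand-ins, and you would substitute real excerpts of your own best German copy.

```python
# Placeholder passages: swap in 1-3 real excerpts that show the tone,
# sentence structure, and word choice you want the model to match.
EXAMPLES = [
    "Kurze Sätze. Klare Aussagen. So schreiben wir.",
    "Unsere Software spart Ihnen Zeit, nicht nur Klicks.",
]

def with_style_examples(task: str, examples: list[str]) -> str:
    """Build a prompt that anchors tone on the given German passages."""
    blocks = "\n\n".join(
        f"Beispiel {i}:\n{text}" for i, text in enumerate(examples, start=1)
    )
    return (
        "Die folgenden Beispiele zeigen Ton und Satzbau, nicht das Thema.\n\n"
        f"{blocks}\n\nAufgabe:\n{task}"
    )

prompt = with_style_examples(
    "Schreibe einen Teaser für unseren Newsletter.", EXAMPLES
)
```

Note the framing line telling the model the examples demonstrate style, not topic; without it, models tend to drift toward the examples' subject matter.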
3. Forbid specific failure modes explicitly
German-language LLM output has predictable failure patterns. Ban them upfront, in German (the list below forbids clichéd openers, filler phrases, Anglicisms, dashes, overlong sentences, and generic closers):
Vermeide die folgenden Muster:
- Einleitungen wie "In der heutigen schnelllebigen Welt"
- Füllfloskeln wie "Es ist wichtig zu beachten"
- Anglizismen, wenn klare deutsche Alternativen existieren
- Em-Dashes oder Gedankenstriche (--)
- Sätze länger als 20 Wörter
- Generische Abschlüsse wie "Zusammenfassend lässt sich sagen"
The forbidden list is more effective than any positive instruction, because you are banning the path of least resistance that the model would otherwise fall into.
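You can also enforce the ban mechanically after generation. A minimal sketch: `lint_german_output` is a hypothetical checker of my own that flags the patterns from the list above (banned phrases, dashes, overlong sentences), so slips are caught before a human ever reads the draft.

```python
import re

# Banned phrases from the prompt list above, lower-cased for matching.
BANNED_PHRASES = [
    "in der heutigen schnelllebigen welt",
    "es ist wichtig zu beachten",
    "zusammenfassend lässt sich sagen",
]
MAX_WORDS = 20

def lint_german_output(text: str) -> list[str]:
    """Return a list of rule violations found in generated German text."""
    problems = []
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase!r}")
    if "\u2014" in text or "--" in text:  # em-dash or double hyphen
        problems.append("dash found")
    for sentence in re.split(r"[.!?]+", text):
        if len(sentence.split()) > MAX_WORDS:
            problems.append(f"sentence over {MAX_WORDS} words: {sentence.strip()[:40]}")
    return problems
```

An empty return means the draft passed; anything else goes back into the prompt or to the human editor. Phrase matching is deliberately crude here; for production use you would want lemmatised matching so inflected variants are caught too.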
4. Use Claude for German, not GPT
This one is uncomfortable to say because I like ChatGPT. But based on side-by-side testing over several months, Claude Opus 4.6 produces noticeably better German than GPT-5.4. The sentence rhythms are more natural, the Anglicism rate is lower, and the tone shifts (Sie vs du, formal vs casual) are cleaner.
I still use ChatGPT for everything else. But for client-facing German copy, I default to Claude. Our ChatGPT vs Claude comparison goes deeper on when each wins.
5. Always read the output aloud
This is a workflow fix, not a prompt fix, but it is the most important. LLM German often passes a silent read, because your eyes autocomplete the unnatural phrasing. It fails a spoken read almost immediately.
Reading aloud catches: sentence structures that require breathing in the wrong places, words that land awkwardly, rhythms that sound like translated English. Anything that makes you stumble is a signal that a human reader will feel the same thing, even if they cannot articulate why.
What does not help
In the interest of saving you time on things I have tried that did not work:
Translating the prompt into German more carefully. The quality bottleneck is in the model, not in the prompt language. Elaborate German prompts do not outperform simpler ones.
Asking the model to "write like a native German speaker". The model already thinks it is doing this. Saying it explicitly does nothing.
Using GPT-4 instead of GPT-5.4 for German. GPT-5.4 is better despite the English-bias issue. Older models are worse across the board.
Post-editing with a second LLM call. Sometimes it helps, but often it introduces new errors while fixing old ones. Human editing is more reliable.
The meta-lesson
The English-German quality gap in LLMs is real, measurable, and unlikely to close completely in the next model generation. The fix is not to force the model to be something it is not. The fix is to use the model inside a workflow that accounts for its weaknesses: English-framed instructions, forbidden patterns, native-language examples, and always a human editing pass.
This also has implications for using LLMs in any non-English market. The same techniques work for French, Italian, Polish, and other languages with mid-tier representation in the training data. The gap widens as you move further down the training data distribution, but the tactics are the same.
For AI tools that handle German content specifically well, see our writing tools category and the Jasper guide, which has the strongest German brand-voice controls of any marketing-focused tool I have tested.
