
AI Hallucinations in Legal Research: A Self-Test

Everyone knows AI hallucinates. But how bad is it when you use it for something with real consequences? I ran a controlled test on German GmbH law questions across four models. The results should worry you.

7 min read · 2026-03-26 · By Roland Hentschel
ai hallucinations · llm accuracy · legal research · ai risks

The question nobody wants to test rigorously#

Everyone who uses AI knows it hallucinates. Most people have their own stories: a made-up book title, a fake Supreme Court case, a confident wrong answer about their own technology stack.

But "hallucinates sometimes" is a vague claim. How often? In what domains? Which models fail worse? And crucially: does it fail more when the stakes are higher?

I decided to test one corner of this empirically. Not a proper academic study, but a structured self-test I could reproduce. The topic: German GmbH law, specifically questions a founder of a small company might actually ask. The stakes: getting the wrong answer could lead to personal liability, tax penalties, or an invalid contract.

I have no legal qualification. But I have a law school classmate who does, who was willing to check every AI answer against the actual sources. This is what we found.

The test setup#

Ten questions, all real things I or founder friends had asked about GmbH law in the past year. All answerable from German commercial and tax law. None of them obscure edge cases, all of them things a competent German lawyer would answer in under five minutes.

Four models tested, all with their default settings in March 2026:

  • GPT-5.4 via ChatGPT Plus
  • Claude Opus 4.6 via Claude Pro
  • Gemini 2.0 Pro via Google One AI Premium
  • Perplexity Pro (as a research-focused baseline)

Each model got each question fresh in a new conversation, asked in German, with the same framing: "I am a GmbH founder, please explain..." No system prompts, no follow-up questions, one response per question.

My law school classmate graded each response against the actual German Commercial Code (HGB), GmbH Act (GmbHG), and relevant tax law. Grading scale:

  • Correct: the answer is substantively right, no fabrications
  • Partially correct: the answer is right in direction but missing or misrepresenting specifics
  • Wrong: the answer is factually incorrect, misleading, or contains fabricated citations
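The protocol above is simple enough to sketch as a loop. The `ask_model` client here is hypothetical, standing in for any wrapper that opens a fresh conversation per call:

```python
# Models under test (March 2026, default settings)
MODELS = ["GPT-5.4", "Claude Opus 4.6", "Gemini 2.0 Pro", "Perplexity Pro"]

# Same German framing for every question; no system prompt, no follow-ups
FRAMING = "Ich bin GmbH-Gruender, bitte erklaeren Sie: "

def run_test(questions, ask_model):
    """Collect exactly one response per (model, question) pair.

    ask_model(model, prompt) is a hypothetical client that starts a
    fresh conversation for each call -- one prompt, one response.
    """
    responses = {}
    for model in MODELS:
        for question in questions:
            responses[(model, question)] = ask_model(model, FRAMING + question)
    return responses
```

The point of the fresh-conversation constraint is that no model gets to benefit from context built up in earlier questions.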

The results#

Across 40 total responses (10 questions, 4 models):

  • Correct: 18 (45%)
  • Partially correct: 14 (35%)
  • Wrong: 8 (20%)

One in five answers to a real legal question was flat wrong. Not "slightly off" wrong, but the kind of wrong that would produce actual harm if acted on.

By model:

  • Perplexity Pro: 7/10 correct, 3/10 partially correct, 0/10 wrong
  • Claude Opus 4.6: 6/10 correct, 3/10 partially correct, 1/10 wrong
  • GPT-5.4: 4/10 correct, 4/10 partially correct, 2/10 wrong
  • Gemini 2.0 Pro: 1/10 correct, 4/10 partially correct, 5/10 wrong
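The per-model tallies reproduce the overall numbers; a quick sanity check in Python:

```python
# (correct, partially correct, wrong) per model, out of 10 questions each
results = {
    "Perplexity Pro":  (7, 3, 0),
    "Claude Opus 4.6": (6, 3, 1),
    "GPT-5.4":         (4, 4, 2),
    "Gemini 2.0 Pro":  (1, 4, 5),
}

# Sum each grade column across models
totals = [sum(col) for col in zip(*results.values())]
n = sum(totals)
for label, count in zip(("correct", "partially correct", "wrong"), totals):
    print(f"{label}: {count}/{n} ({count / n:.0%})")
```

This prints 18/40 (45%), 14/40 (35%) and 8/40 (20%), matching the headline figures.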

Perplexity's lead is not surprising. It grounds answers in web search results and cites sources, which reduces the fabrication surface area. Claude was the best pure LLM. Gemini was much worse than I expected, with frequent citations of German tax paragraphs that either do not exist or say something different from what the model claimed.

The worst failures#

Three of the eight wrong answers are worth describing because they illustrate the failure mode.

Failure 1 (GPT-5.4): Asked about the minimum capital requirement for a UG (haftungsbeschraenkt). The model confidently said 25,000 EUR, which is the full GmbH requirement. A UG can be founded with a share capital as low as 1 EUR. This is the kind of fact a founder might act on when deciding which company form to use.

Failure 2 (Gemini 2.0 Pro): Asked about the procedure for changing the managing director (Geschaeftsfuehrerwechsel). The model described a process that mixed German and Austrian GmbH law, confidently cited paragraphs of the GmbHG that were either renumbered or did not exist, and gave a timeline that was off by weeks. Acting on this advice would have produced an invalid registration.

Failure 3 (Claude Opus 4.6): Asked about whether a GmbH can hold its own shares (eigene Anteile). The model said yes, with caveats about capital reserves. The actual rule is more restrictive than the model presented. Not fabricated, but incomplete in a way that would matter for an actual transaction.

Note that the "best" LLM still had a significant wrong answer. The failure rate is not zero for any of the pure models.

What the partially correct category tells you#

The "partially correct" responses are actually the more insidious problem. They read well. They sound authoritative. They are correct in tone and general direction. A lay reader, including most founders, would not catch what is wrong.

One of the partially correct answers from Claude, for example, correctly described the tax treatment of GmbH distributions but confused the 60 percent partial exemption rule (Teileinkuenfteverfahren) with the older half-income rule (Halbeinkuenfteverfahren). For most purposes this would be a harmless error. For someone calculating expected after-tax income on a large distribution, it would mean underestimating the tax owed by roughly 4-6 percent of the gross amount.
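The size of that error is easy to reconstruct: the current Teileinkuenfteverfahren makes 60 percent of a distribution taxable, while the repealed Halbeinkuenfteverfahren taxed only 50 percent. A sketch, assuming a 42 percent marginal rate (an illustrative figure, not from the test):

```python
distribution = 100_000.0   # gross distribution (illustrative)
marginal_rate = 0.42       # assumed personal marginal income tax rate

tax_correct = distribution * 0.60 * marginal_rate  # Teileinkuenfteverfahren
tax_assumed = distribution * 0.50 * marginal_rate  # Halbeinkuenfteverfahren (repealed)

shortfall = tax_correct - tax_assumed
print(f"tax underestimated by {shortfall:,.0f} EUR "
      f"({shortfall / distribution:.1%} of the gross distribution)")
```

At higher marginal rates, or with the solidarity surcharge added, the gap widens toward the upper end of the 4-6 percent range.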

The model was not lying. It was confidently merging two similar-looking frameworks. That is worse than a hallucination, because it does not trigger your skeptic instinct the way a fabricated citation does.

Why this happens#

Legal text is unusually dangerous for LLMs for three specific reasons.

First, the source material is dense, technical, and self-referential. Paragraphs reference other paragraphs. Amendments change specific clauses without changing the surrounding structure. The model's statistical pattern of "what legal text looks like" is strong, but its ability to track which specific rule applies in which specific context is much weaker.

Second, answers in legal domains have a characteristic register of confidence. Lawyers do not hedge. They state rules. Models imitate that tone well, which makes a wrong answer sound as authoritative as a right one.

Third, minor updates to laws happen constantly, and the model's training data cutoff lags reality by many months. A rule that was correct in 2023 may not be correct now. The model does not know which of its facts are stale.

What actually works#

A few things I now do before trusting any AI output on legal or tax questions:

  1. I use Perplexity first, because the citation discipline forces verifiability. If Perplexity cannot find a source for a claim, that is a strong signal.

  2. I ask the same question to two models and compare. If they agree, I verify one source. If they disagree, I verify both.

  3. I check specific numbers, dates, and paragraph references against the primary source (gesetze-im-internet.de for German law, the Bundessteuerblatt for tax rulings). I do not trust any AI-provided citation without clicking through.

  4. For anything where I would actually act, I talk to an actual lawyer or tax advisor. AI is good for getting oriented. It is bad for deciding.
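Step 3 is mechanical enough to script. A tiny helper that builds the click-through link for a cited paragraph; the URL pattern is an assumption based on the site's current layout (e.g. GmbHG Paragraph 5 at /gmbhg/__5.html):

```python
def primary_source_url(law: str, paragraph: str) -> str:
    """Build the gesetze-im-internet.de link for a German statute paragraph.

    Assumed URL pattern: https://www.gesetze-im-internet.de/<law>/__<nr>.html
    """
    return f"https://www.gesetze-im-internet.de/{law.lower()}/__{paragraph}.html"

# Every AI-cited paragraph gets clicked through, never trusted on its own
print(primary_source_url("GmbHG", "5"))
```

A citation whose URL 404s is an immediate red flag; one that resolves still needs reading, since the worst failures above cited real paragraphs that said something else.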

The harder question#

The interesting thing is not that AI hallucinates. It is that hallucinations do the most damage in domains where the user cannot detect them. Founders know their own product. Developers know their code. Lawyers know law. In each case, the AI is most useful, and most dangerous, in the other two domains.

The failure mode is asymmetric. When AI helps you outside your expertise, you cannot tell whether it is helping or misleading. The only defense is treating AI as a source to verify, never as an authority to trust.

That is not an exciting conclusion, and it limits AI's value for research-heavy knowledge work. But eight flat-wrong answers out of forty, in a high-stakes domain, is a strong enough signal that I am not going to pretend otherwise.

For more on how the models compare on different kinds of accuracy, our ChatGPT vs Claude comparison covers the practical quality differences. But none of those differences matter if you do not verify output that matters.


Roland Hentschel


AI & Web Technology Expert

Web developer and AI enthusiast helping businesses navigate the rapidly evolving landscape of AI tools. Testing and comparing tools so you don't have to.
