AI Detection Tools Are Broken: The Evidence and Alternatives

The most damning single study

In July 2023, Weixin Liang, James Zou, and colleagues at Stanford published "GPT detectors are biased against non-native English writers" in the Cell Press journal Patterns (preprint: arXiv:2304.02819). The paper tested seven widely used AI detectors on 91 TOEFL essays written by non-native English speakers and 88 essays written by US 8th-grade students.

The results:

Average false-positive rate on the TOEFL essays: 61.3%.
19.8% of the TOEFL essays were unanimously flagged as AI-generated by all seven detectors.
97.8% of the TOEFL essays were flagged by at least one detector.
On the US 8th-grade essays, the false-positive rate was close to zero.

This is not a small effect. It is a devastating result for anyone using AI detectors in contexts where non-native English writing is common — which is most of higher education globally, most international hiring processes, and most editorial pipelines that accept submissions from international writers.

The study has held up. It has been cited heavily, and follow-up evaluations on other detector tools have produced similar findings.

What the detector companies claim vs what independent tests show

Turnitin claims around 98% accuracy with a false-positive rate below 1% for its AI detector. Turnitin's own Chief Product Officer has acknowledged that the detector deliberately lets roughly 15% of AI content through in order to keep the false-positive rate that low (Leap analysis; San Diego Law Library guide).

Independent tests find real-world accuracy around 85-95% on straightforward cases, and false-positive rates of 3-4% on native English writing and 5-12% on non-native or technical writing. The 1% figure does not survive contact with real-world data distributions.
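To see why the gap between a 1% and a 12% false-positive rate matters so much, it helps to ask what share of flagged documents are actually AI-written. The sketch below applies Bayes' rule to the figures quoted above. The 10% prevalence of AI-written submissions is purely an assumption for illustration, and the function name is made up for this example.

```python
def flagged_precision(prevalence: float, recall: float, false_positive_rate: float) -> float:
    """Share of flagged documents that are actually AI-written (Bayes' rule)."""
    true_flags = recall * prevalence
    false_flags = false_positive_rate * (1.0 - prevalence)
    return true_flags / (true_flags + false_flags)


# Assumption: 10% of submissions are AI-written. Illustrative only, not a measured value.
ASSUMED_PREVALENCE = 0.10
# 85% recall follows from the roughly 15% of AI content Turnitin says it lets through.
RECALL = 0.85

# 1% is the vendor claim; 4% and 12% are the upper ends of the independent ranges above.
for fpr in (0.01, 0.04, 0.12):
    precision = flagged_precision(ASSUMED_PREVALENCE, RECALL, fpr)
    print(f"false-positive rate {fpr:.0%}: {precision:.0%} of flags are correct")
```

Under that assumed prevalence, about 90% of flags are correct at a 1% false-positive rate, but only about 44% at 12%. A different prevalence shifts the numbers, which is part of why headline accuracy claims say little about any individual accusation.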

Independent testing of GPTZero has shown a wide range of results. A 2023 study titled "Perception, performance, and detectability of conversational AI across 32 university courses" found an 18% false-positive rate on actual student submissions (PMC reference). Other studies have reported rates from 1% on curated benchmark data up to 10% on real-world content. Medical text categories showed false-positive rates around 10%.

The pattern across the major detector products is consistent: marketing claims of very low false-positive rates, independent testing that reveals meaningfully higher rates on realistic data, and results that vary widely depending on exactly what is tested.

The universities that have responded

The institutional response has been more decisive than the debate. Vanderbilt University disabled its Turnitin AI detector on 16 August 2023. The university's reasoning, paraphrased: with 75,000 papers processed annually and Turnitin's claimed 1% false-positive rate, roughly 750 student papers would be falsely flagged every year. That is an unacceptable number, even on Turnitin's own optimistic estimate.
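The 750 figure is simple multiplication, and it scales badly once the vendor's 1% is swapped for the independent estimates quoted earlier. A rough sketch, treating every paper as human-written (a simplifying assumption that makes this an upper bound); the function name is made up for this example:

```python
def expected_false_flags(papers_per_year: int, false_positive_rate: float) -> int:
    """Expected number of human-written papers wrongly flagged in a year."""
    return round(papers_per_year * false_positive_rate)


PAPERS_PER_YEAR = 75_000  # Vanderbilt's annual volume, as cited above

# 1% is Turnitin's claim; 4% and 12% are the upper ends of the independent ranges.
for fpr in (0.01, 0.04, 0.12):
    flagged = expected_false_flags(PAPERS_PER_YEAR, fpr)
    print(f"{fpr:.0%} false-positive rate: about {flagged:,} papers wrongly flagged")
```

At 4% that is about 3,000 papers a year, and at 12% about 9,000.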

Since Vanderbilt's announcement, Michigan State, Johns Hopkins, Curtin University (January 2026), Waterloo, Edinburgh, Manchester, and more have taken the same step. The education-advocacy site Please.edu maintains a running list of institutions that have disabled AI detectors, now well over 50.

This is not a fringe position. It is the response of universities that ran the numbers on false positives and concluded the reputational and ethical cost of false accusations outweighed the detection value. Several of these institutions cite the Liang study directly.

Why detectors do not work well on modern AI output

The underlying technical problem has gotten worse since 2023, not better. Detection tools rely on signals like perplexity (how surprising each next word is to a language model) and burstiness (how much sentence length and structure vary across a text). Early GPT-3.5 output was distinguishable because it had very low perplexity and flat burstiness. Current frontier models (GPT-5, Claude 4.7, Gemini 3) produce output whose statistics are much closer to human writing, which weakens the signal.
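For readers who want to see what those two signals look like in practice, here is a minimal sketch. It is not how any particular commercial detector works; their exact features are not public. Perplexity is computed from per-token log-probabilities that a language model would supply (the model call is not shown, and the example values are invented), and burstiness is approximated as the variation in sentence length, which is only one of several definitions in use.

```python
import math
import re
import statistics


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the average negative log-probability a model assigned to each token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, counted in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) / statistics.mean(lengths)


# Invented log-probabilities: every token was given probability 0.5 by the model.
predictable = [math.log(0.5)] * 4
print(perplexity(predictable))  # close to 2.0

flat = "One two three. Four five six."
varied = "Stop. This sentence is a good deal longer than the first one."
print(burstiness(flat), burstiness(varied))
```

Low perplexity and low burstiness were the classic signature of early model output. Both are simple summary statistics, so once model output and human prose drift toward the same values there is little left for a classifier to separate.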

Simultaneously, human writing has shifted. A generation of writers has grown up working alongside AI tools, and their native style has converged toward cleaner, more structured prose. The distributions that detectors were trying to distinguish have narrowed. This is a fundamental problem, not a tuning one.

The honest read on current detection technology: useful as a rough first-pass signal, completely unreliable as a sole basis for any consequential decision.

What serious editors and institutions do instead

The approach that has emerged in organisations that took the detection-reliability problem seriously is a shift from output-analysis to process-analysis.

Process signals: version history in Google Docs or Word tracked changes, revision timelines, intermediate drafts, commit history if relevant, recorded research conversations. A piece that appears fully formed in a single paste is suspicious in a way that a piece written over multiple sessions is not. None of these is conclusive alone, but together they are more reliable than any classifier.
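The "single paste" signal can be roughed out from revision history. The sketch below assumes you have already exported a list of revisions with timestamps and document lengths. The `Revision` type and the sample data are hypothetical, and no real Google Docs or Word API is shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Revision:  # hypothetical record, not a real Docs or Word type
    saved_at: datetime
    char_count: int  # total document length after this revision


def largest_single_jump(revisions: list[Revision]) -> float:
    """Fraction of the final text that arrived in the single largest revision."""
    ordered = sorted(revisions, key=lambda r: r.saved_at)
    final_length = ordered[-1].char_count
    if final_length == 0:
        return 0.0
    previous = 0
    biggest = 0
    for rev in ordered:
        biggest = max(biggest, rev.char_count - previous)
        previous = rev.char_count
    return biggest / final_length


# Invented sample histories for illustration.
drafted = [
    Revision(datetime(2025, 1, 1, 9, 0), 1200),
    Revision(datetime(2025, 1, 2, 10, 0), 2600),
    Revision(datetime(2025, 1, 3, 15, 0), 4000),
]
pasted = [Revision(datetime(2025, 1, 3, 15, 0), 4000)]

print(largest_single_jump(drafted))  # roughly 0.35
print(largest_single_jump(pasted))   # 1.0
```

A value near 1.0 only means the text arrived in one go, which also happens when someone drafts offline and pastes the result in. Treat it as a prompt for a conversation, not as proof.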

Content signals: original reporting, verifiable citations, specific details that a writer would only include if they actually researched the topic, weird opinions backed by unusual reasoning, contestable claims that a model would typically smooth over. Models produce plausible-sounding generalities by default. Human writing, when the writer knows the topic, has specificity models do not insert without explicit prompting.

Editorial processes: outlines required before payment, calls with freelancers on anything that feels off, reference checks, paid trial periods for contract work, paying writers enough that the incentive to cut corners is smaller. These are not new practices. They are the practices that always distinguished good editorial from weak editorial, and the AI-detection crisis has pushed them back to the centre.

Educational alternatives

For universities specifically, the alternatives that have emerged:

Oral defences or in-person written work for high-stakes assessments. Expensive but definitive.
AI-integrated assessment where students are allowed to use AI and are graded on their reflection about what the tool did well, what it got wrong, and what they contributed. This matches the world students will actually work in.
Assignment redesign toward tasks that require observed process (lab work, presentations, sustained research with intermediate deliverables).

None of these are free, but none of them require deploying a classifier that the evidence says is unreliable and biased.

Practical guidance

If you run a publication or content operation: stop paying for AI detection tools as a primary authenticity check. Put that budget into real editorial review. Build process signals into your workflow. Require outlines. Look at version history. Have calls with new contributors before paying them.

If you teach or administer at an institution: the documented false-positive rates, particularly on non-native English writing, make detection-tool use a legal and reputational risk. The institutions that have disabled these tools have not regretted it. Consider redesigning assessment instead.

If you are a writer dealing with an AI-detector accusation: the Liang study and the long list of universities that disabled their detectors are a good starting defence. The burden of proof should not be on you to demonstrate humanity.

What does not change

Models will keep improving. The output-statistics signals detectors rely on will keep weakening. The distinction between human and AI output will keep blurring, especially in short-form, routine, or clean-prose writing. This is a permanent structural shift, not a transient problem.

The good news is that the genuinely important questions were never "did a model write this" — they were "did a human with judgement shape this, is the piece true and specific, is the writer accountable". Those questions are answerable through process and content signals without a classifier. They also happen to be the questions that matter for long-term quality.

AI detection tools were always a shortcut for those harder questions. The shortcut has collapsed. The harder questions remain, and they can be answered; they just cost more.

Sources

Liang et al., "GPT detectors are biased against non-native English writers", Patterns: https://www.cell.com/patterns/fulltext/S2666-3899(23)00130-7
Preprint on arXiv: https://arxiv.org/abs/2304.02819
Vanderbilt University announcement, 16 August 2023: https://www.vanderbilt.edu/brightspace/2023/08/16/guidance-on-ai-detection-and-why-were-disabling-turnitins-ai-detector/
Turnitin accuracy analysis (Leap): https://www.tryleap.ai/turnitin/accuracy
San Diego Law Library guide on AI detectors: https://lawlibguides.sandiego.edu/c.php?g=1443311&p=10721367
GPTZero research reference: https://gptzero.me/resources/researchers
"Perception, performance, and detectability of conversational AI" study (PMC): https://pmc.ncbi.nlm.nih.gov/articles/PMC10519776/
Running list of universities disabling AI detectors: https://www.pleasedu.org/resources/schools-that-banned-ai-detectors

Roland Hentschel

AI & Web Technology Expert

Web developer and AI enthusiast helping businesses navigate the rapidly evolving landscape of AI tools. Testing and comparing tools so you don't have to.
