A short story that is not mine#
A professor at a US state university I will not name sent me a transcript last year. A student had submitted an essay. The professor ran it through Turnitin's AI detector, which flagged 68 percent of it as AI-generated. The student insisted he wrote it himself. He had a Google Docs version history showing the writing happening in real time, pauses and all. He also used speech-to-text because of a wrist injury, which meant his sentences came out more polished than his class average: he was literally dictating them.
The AI detector did not know any of that. It saw the pattern of polished, relatively low-perplexity prose and flagged it as machine-written. The professor, to his credit, did not escalate. A colleague in the same department, in a similar situation, did.
I have been collecting stories like this for about a year. It is how I got to the conclusion in the title of this post. AI detection tools are not just imperfect, which would be fine. They are systematically broken in ways that matter, and the serious publishers and educators I know have quietly stopped using them for anything load-bearing.
This is a post about why they are broken and what the replacement looks like in 2026.
What detection tools claim, and what they actually do#
A detection tool is a classifier. It looks at a piece of text and outputs a probability that the text was produced by a language model. The features it uses, sketched in code below the list, are some combination of:
- Perplexity, meaning how surprising the next word is given the previous ones. Model output tends to be lower perplexity than human writing because models optimise for likely next tokens.
- Burstiness, meaning how much sentence length and complexity varies. Human writing is bursty. Early model output was not.
- Stylometric features, meaning patterns in vocabulary, punctuation, sentence structure.
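To make that concrete, here is a minimal sketch of the 2022-era approach in Python, using GPT-2 as the scoring model. The feature definitions are the standard ones; the thresholds are invented for illustration and are not what any commercial detector actually uses.

```python
# A minimal sketch of a perplexity-and-burstiness detector, in the style of
# the 2022-era tools. GPT-2 stands in for the scoring model; the thresholds
# below are invented for illustration, not taken from any real detector.
import math
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average surprise per token under the scoring model. Model-written
    text tends to score lower, because models pick likely next tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy of predicting each token from its predecessors.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Variation in sentence length, in words. Human writing is bursty;
    early model output was not. Sentence splitting here is deliberately crude."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

def looks_ai_generated(text: str) -> bool:
    # Illustrative decision rule: low surprise plus uniform sentence
    # lengths was the signature the detectors were built around.
    return perplexity(text) < 30 and burstiness(text) < 6
```

The whole design rests on one assumption: that model output and human writing have separably different score distributions. Keep that in mind for what follows.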
This approach worked reasonably well in 2022 and 2023. It stopped working in stages across 2024 and 2025, for two reasons.
Models got better at sounding human. GPT-4 was more bursty than GPT-3. GPT-5 and Claude 4 are so much more bursty that the feature has almost no discriminative power anymore. The detectors tried to adapt, but they were always chasing.
Humans started writing more like models. This one is subtler but more important. A generation of writers has now been reading and producing text with models in the loop. Polished, confident, structurally tight prose used to be a marker of a model. It is increasingly a marker of an educated human writer who has learned what clean prose looks like from years of working alongside AI. The distribution has converged.
The empirical consequence is that published benchmarks for detection tools look dire.
A 2024 Stanford study found that GPT-4 output passed Turnitin's AI detector as human about 50 percent of the time. A 2025 follow-up from a team at the University of Maryland tested all the major detectors (GPTZero, Originality.AI, Copyleaks, Turnitin, ZeroGPT) on Claude 3.5 and Gemini 1.5 output and found the best detector caught 40 percent. On simple rewrites ("paraphrase this in your own style"), every detector dropped below 20 percent.
Meanwhile, the false positive rate on human writing was consistently in the 3 to 10 percent range, and much higher on non-native English writing. A separate Stanford paper showed detectors flagging non-native TOEFL essays as AI-generated at rates above 60 percent, which is a devastating result for anyone using these tools in an educational or hiring context.
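It is worth doing the arithmetic on what those numbers mean together. A worked example, assuming for illustration that 20 percent of submissions really are model-written; the catch rate and false positive rate come from the studies above.

```python
# What a 40% catch rate and a 5% false positive rate mean in practice.
# The 20% prevalence of model-written submissions is an ASSUMPTION for
# illustration; the other two numbers are from the studies cited above.
catch_rate = 0.40       # best detector's true positive rate
false_positive = 0.05   # mid-range of the 3-10 percent measured on humans
prevalence = 0.20       # assumed share of submissions that are model-written

true_flags = catch_rate * prevalence             # 0.08 of all submissions
false_flags = false_positive * (1 - prevalence)  # 0.04 of all submissions
precision = true_flags / (true_flags + false_flags)

print(f"{precision:.0%} of flags are real")  # -> 67% of flags are real
```

One flag in three is a false accusation even under those generous assumptions. Substitute the 60-percent-plus false positive rate measured on non-native writers and the flags become mostly noise.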
The honest summary is that these tools are close to useless for their advertised purpose, and actively harmful when applied to non-native speakers.
What serious publishers do instead#
Over the last year I have talked to editorial leads at four content companies, two educational publishers, and a handful of academics who used to rely on detection tools. They have all landed in roughly the same place: a combination of process signals and content signals, none of which tries to answer the question "did a model write this?" directly.
Process signals. These are signals about how the writing was done, not what it looks like.
- Version history. Google Docs revision timeline, Microsoft Word track changes, any environment that lets you watch the writing happen. A piece that appeared in full in one paste is suspicious in a way that a piece written over three sessions is not.
- Deadline behaviour. Work submitted exactly at deadline, from a person who has historically been early or late, is a weak signal but a real one when combined with others.
- Draft conversations. A writer who can talk about the choices in their draft, the things they considered and rejected, the research they did, is doing something a model cannot fake on demand.
None of these are conclusive alone. Together they paint a picture that no classifier can match.
Content signals. These are signals about the piece itself that are harder to fake than stylometry.
- Original research or reporting. A piece that quotes a specific interview, cites a specific study it clearly engaged with, or contains original data is not something a model produces without being walked through each step.
- Specificity of detail. Models tend to produce plausible-sounding generalities. Human writing, especially on a topic the writer actually knows, has specificity that a model reproduces only when someone hands it the details. A sentence like "I sat with the product team for two hours and they told me the churn cohort was mostly users in the $29 tier" is expensive for a model to produce in context.
- Weird opinions. Model output tends toward defensible, balanced positions. Humans make strange calls. A piece with an unusual opinion backed up by an unusual line of reasoning is human-coded in a way that detectors do not measure.
Editorial process. This is the part that has really changed.
- Phone calls with freelancers. Not every piece, but any piece that feels off.
- Pre-commissioned outlines where the writer has to demonstrate thinking before they get paid.
- Requiring links to sources the writer personally opened, not just generically cited. You can tell.
- Paying freelancers enough that they are not incentivised to paste a model draft.
None of this is new, exactly. Good editorial practice has always included these steps. What has changed is that editors now treat them as the primary authenticity layer, not the fallback.
The educational side#
Universities are in a harder spot than publishers. The process signals are weaker in a classroom setting, and the stakes of false positives (accusing a real student of cheating) are higher than in editorial.
The schools I have talked to that have adapted sensibly are doing some combination of:
- Oral exams or oral defences for written work. Expensive, but impossible to fake with a model.
- In-class writing for high-stakes work. A well-lit, supervised room with a clean laptop solves the problem, at a cost to pedagogy.
- Embracing AI use and assessing the thinking around it. "Use any tools you want, but your grade depends on your reflection on the choices you made and what the tool got wrong." This is my favourite approach because it matches the world the students will actually work in.
The schools that are still fighting the detection-tool battle are the ones where the administration has not caught up with the research. Several of them are facing lawsuits from students who were falsely accused, and I expect those suits to succeed, which will accelerate the shift.
What this means for anyone producing content at scale#
If you run a blog, a publication, or a content operation, the useful things to do in 2026 are:
Stop paying for AI detection tools. They do not work, they create false positives you have to deal with, and they give you a false sense of having a process. Budget the money into real editorial review instead.
Build process signals into your workflow. Require outlines. Look at version history for freelancers. Have a call with any new contributor before paying them for the first time. The cost is real but contained.
Make the content signals explicit. Pay for original reporting. Encourage weird opinions. Give writers time to actually know the topic before they write. This is expensive. It is also what survives AI-era search, so you would be doing it anyway if you were paying attention.
Be honest about AI use in your own process. If your site uses models to draft, say so. If your writers use models as part of their process, that is fine, but the final piece has to be a product of human judgment. Readers can tell the difference, and increasingly so can Google.
Our own policy on aitoolradar is that every piece is written by a human (me) and the factual claims have verified sources with dates. We do not run AI detectors on anything. We rely on the writing itself being good, which is a higher bar, but it is the only bar that still means anything.
The deeper shift#
The thing AI detection tools are trying to measure, "was this text produced by a machine", is the wrong question for the world we live in. Models are going to be in the writing loop for most people doing most kinds of writing, just as spell check and grammar tools were before them. Treating model involvement as cheating is going to look quaint in five years, the same way treating Google search as cheating looked quaint by 2010.
The better questions are: did a human with judgment shape this? Does the piece say something true and specific? Is the writer accountable for what it says? Those are the questions editors have always been asking, and the answer to them is what "authenticity" means in 2026. Detection tools were always a shortcut, and the shortcut no longer runs.
Where to read more#
- End of AI Directory Sites on Google's shift toward rewarding content that passes these same authenticity checks.
- Hidden Costs of Credit-Based AI Pricing for the business side of the model.
- AI Hallucinations in Legal Research for a related failure mode in a high-stakes domain.
The short version of this post is: stop running text through classifiers, start running your process like an editor. That has always been the answer. The tools just made it easy to pretend otherwise for a few years.
