
How AI Detection Discriminates Against Non-Native Speakers

12 min read
Alex Rivera

Content Lead at HumanizeThisAI


AI detection tools systematically misidentify non-native English writing as machine-generated. This isn’t a minor bug. A Stanford study found that over 61% of TOEFL essays by non-native speakers were falsely flagged as AI-written — while the same detectors were near-perfect on essays by native speakers. Here’s how deep the problem goes.

The Stanford Study That Changed the Conversation

In 2023, researchers at Stanford University led by James Zou published a paper that should have stopped every school from relying on AI detection tools. The study, “GPT Detectors Are Biased Against Non-Native English Writers,” tested seven popular AI detectors against 91 TOEFL essays written by non-native English speakers and compared the results to essays written by U.S.-born eighth graders.

The results were devastating.

| Metric | TOEFL Essays (Non-Native) | U.S. Student Essays (Native) |
|---|---|---|
| Average false positive rate | 61.22% | Near 0% |
| Unanimously flagged by all 7 detectors | 19.8% (18 of 91 essays) | 0% |
| Flagged by at least one detector | 97.8% (89 of 91 essays) | Minimal |

Read that again: 97.8% of TOEFL essays — all written entirely by human beings — were flagged as AI-generated by at least one detector. Nearly one in five were flagged by every single detector tested. Meanwhile, the detectors correctly identified the native English essays as human-written with near-perfect accuracy.

This isn’t a rounding error. This is systematic discrimination built into the core architecture of how these tools work.

Why Do AI Detectors Fail on Non-Native Writing?

The root cause is deceptively simple, and it’s not something that can be easily patched.

Most AI content detectors rely on a metric called perplexity — a measure of how predictable or surprising the next word in a text is. Low perplexity means the text is very predictable. High perplexity means it contains unexpected word choices, unusual phrasing, and surprising turns. AI-generated text tends to have low perplexity because language models are, by definition, optimized to predict the most likely next word.
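To make "perplexity" concrete, here's a minimal sketch of how you could score a passage yourself using the open GPT-2 model from Hugging Face. This illustrates the general technique only; commercial detectors use their own proprietary models and thresholds, which this does not reproduce.

```python
# Minimal perplexity scoring with GPT-2 -- an illustration of the technique,
# not any commercial detector's actual model or threshold.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated average cross-entropy: how 'surprised' the model is per token."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean next-token loss
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Lower score = more predictable text = more likely to be flagged
print(perplexity("I think this book is very good and I like it a lot."))
print(perplexity("The novel's idiosyncratic cadence rewards a patient reader."))
```

A detector built on this signal essentially draws a line: score below the threshold, get flagged.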

Here’s the problem: non-native English speakers also produce low-perplexity text, but for completely different reasons. Research in applied linguistics has consistently shown that L2 (second language) writers exhibit:

  • Reduced lexical richness — smaller active vocabulary, more repetition of common words
  • Lower lexical diversity — fewer unique words per total words
  • Simpler syntactic complexity — shorter clauses, more basic sentence structures
  • Reduced grammatical complexity — fewer subordinate clauses, less embedding
  • Higher reliance on formulaic language — more stock phrases and conventional expressions

These are all natural features of writing in a second language. They reflect the real cognitive constraints of communicating in a language you’re still mastering. But to an AI detector, they look identical to the statistical signature of machine-generated text.
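These features are also easy to measure, which is part of the problem. Here's a rough sketch of two of the simplest metrics from the list above, type-token ratio and repetition share, in plain Python. Published studies use more robust measures (such as MTLD) and careful tokenization; this is only to show that "lexical diversity" is a concrete, countable property of text.

```python
# Rough sketch of two simple lexical-diversity measures. Real research uses
# more robust metrics (e.g., MTLD) and proper tokenizers; this is illustrative.
import re
from collections import Counter

def lexical_profile(text: str) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {
        "tokens": total,
        "types": len(counts),
        # Type-token ratio: lower = more repetition of the same words
        "type_token_ratio": len(counts) / total if total else 0.0,
        # Share of the text taken up by its ten most frequent words
        "top10_share": sum(n for _, n in counts.most_common(10)) / total if total else 0.0,
    }

print(lexical_profile("The test was hard. The test had many hard questions for us."))
```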

The Core Paradox

AI detectors assume that predictable, simple writing is AI-generated. But predictable, simple writing is also what you get when a brilliant person is expressing complex ideas in their second or third language. The detector can’t tell the difference — and that’s not a bug that can be easily fixed. It’s a fundamental limitation of the perplexity-based approach.

The Vocabulary Experiment That Proved the Point

The Stanford researchers ran a follow-up experiment that makes the bias even more obvious. They took the same TOEFL essays and used AI to enhance the vocabulary — replacing simpler words with more sophisticated, native-sounding alternatives while keeping the meaning intact.

The result: the average false positive rate fell by 49.7 percentage points, from 61.3% to 11.6%.

Think about what that means. The essays didn’t change in content, structure, or argumentation. The ideas remained the same. Only the vocabulary became more “native-sounding.” And suddenly, the detectors stopped flagging them.

This proves that the detectors aren’t measuring whether text was AI-generated. They’re measuring whether text sounds like it was written by a native English speaker. When you strip away the technical language, that’s what “perplexity-based detection” really means in practice.
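You can reproduce the direction of this effect with the perplexity() helper from the earlier sketch. The exact scores depend on the model, but swapping common words for rarer synonyms, without changing the meaning, typically pushes the score up and out of the "AI-like" range:

```python
# Reuses perplexity() from the earlier sketch. Exact scores vary by model;
# the point is the direction of the change when only the vocabulary shifts.
plain = "The test was very hard and I was very tired after it."
fancy = "The examination proved grueling, and I finished thoroughly drained."

print(perplexity(plain))  # typically lower: common, predictable wording
print(perplexity(fancy))  # typically higher: rarer, more varied word choices
```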

Real-World Consequences

The numbers translate into real harm. International students, immigrant professionals, and ESL writers are disproportionately accused of AI cheating when they’ve done nothing wrong.

In Education

Studies from 2024 and 2025 indicate that ESL students' submissions are up to 30% more likely to be falsely flagged than those of native speakers. The Center for Democracy and Technology has warned that education agencies' generative AI discipline policies are being disproportionately enforced against English Learner (EL) students due to the higher false-flag rates.

The CDT makes a legal argument worth noting: where an education agency is aware of these high error rates but nonetheless chooses to deploy the technology, it arguably meets the requirements for a disparate impact claim — or potentially even a disparate treatment claim under civil rights law.

A Yale School of Management student sued the university in 2025 after GPTZero flagged their exam, alleging wrongful suspension, discrimination against non-native English speakers, and denial of due process. Similar lawsuits have followed at other institutions.

In the Workplace

The same bias follows non-native speakers into professional settings. As employers begin using AI detection tools to monitor employee output, the same perplexity-based algorithms produce the same biased results. A report written in clear but simple English by a non-native speaker is statistically more likely to be flagged than a native speaker’s work — regardless of quality or accuracy.

In College Admissions

Some colleges have started screening application essays with AI detection tools. For international applicants writing personal statements in their second language, this introduces a systematic disadvantage. Their writing is more likely to be flagged, potentially leading to rejection or additional scrutiny that native applicants don’t face.

The Racial Dimension

The bias isn’t limited to non-native speakers. Data shows that AI detection false positives fall disproportionately along racial lines.

False Accusation Rates by Race

  • 20% of Black teens were falsely accused of using AI to complete an assignment
  • 10% of Latino teens were falsely accused
  • 7% of white teens were falsely accused

Source: Brandeis University AI Steering Council; Northern Illinois University Center for Innovative Teaching and Learning

Black students are nearly three times more likely to be falsely accused than white students. This isn’t because they write differently from native English speakers in the way ESL students do — it’s because AI detection algorithms are trained predominantly on certain styles of writing, and any deviation from that narrow baseline increases the false positive risk.

Researchers at the University of Nebraska-Lincoln also found higher false positive rates among neurodivergent students, including those with ADHD and autism. The pattern is consistent: anyone whose writing style falls outside the “expected” norm is more likely to be falsely accused.

How Has the Industry Responded?

Detection companies are aware of the Stanford study. Their responses have ranged from dismissive to defensive.

Turnitin published a blog post claiming their AI detector shows “no statistically significant bias against English Language Learners,” based on their own internal research. However, this contradicts independent findings. The company’s study focused on documents meeting a 300-word minimum threshold and used their own test methodology, which critics have noted doesn’t replicate real-world conditions.

Originality.ai published a response calling the Stanford study “flawed,” arguing that the TOEFL essays used were too short and not representative of typical academic submissions. But the fundamental finding — that perplexity-based detection is biased against writers with limited vocabulary — has been replicated in multiple independent studies.

Meanwhile, universities are drawing their own conclusions. Vanderbilt University disabled Turnitin’s AI detection tool entirely. The University of Waterloo and Curtin University followed suit. At least 12 major universities have deactivated AI detection features, citing accuracy concerns and the risk of discriminatory outcomes.

What Needs to Change

The Stanford researchers were direct in their conclusion: “The detectors are just too unreliable at this time, and the stakes are too high for the students, to put our faith in these technologies without rigorous evaluation and significant refinements.”

Meaningful change requires action at multiple levels:

For Institutions

  • Stop using AI detection scores as the sole or primary basis for academic integrity charges
  • Require independent audits of detection tools for bias before deployment
  • Implement robust appeals processes that don’t place the burden of proof on the accused student
  • Provide alternative assessment methods for ESL students, such as oral exams or in-class writing
  • Publish data on false positive rates by student demographic

For Detection Companies

  • Test for and publish bias metrics across ESL populations before releasing products
  • Move beyond perplexity as the primary signal, or develop language-proficiency-aware models
  • Stop marketing accuracy rates that don’t hold up across diverse populations
  • Allow independent researchers access to their tools for bias testing

For Policymakers

  • Treat AI detection tools with the same scrutiny applied to other algorithmic decision-making systems
  • Require bias audits under emerging state AI laws like the Colorado AI Act and Illinois AI regulations
  • Clarify that discriminatory use of AI detection tools violates civil rights protections

What Can You Do Right Now?

If you’re a non-native English speaker — or anyone who writes in a style that AI detectors tend to flag — you need a practical strategy, not just awareness.

Document your writing process. Use Google Docs or another tool that tracks revision history. Save your research notes, outlines, and drafts. If you’re falsely flagged, this is your strongest evidence. See our complete action plan for false flags.

Know your rights. Most accredited institutions have formal appeals processes. AI detection scores alone should never be the sole basis for academic integrity charges, and increasingly, universities agree with that position.

Test your writing before submitting. Run your work through an AI detector yourself so you know how it scores. If it gets falsely flagged, you can address it proactively rather than being blindsided.
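If the tool you're checking against offers an API, you can even script this pre-check. The endpoint, payload shape, and response field below are hypothetical placeholders, not any real detector's API; substitute the documented interface of whichever tool your institution actually uses.

```python
# Hypothetical pre-submission check. The URL, payload, and response field
# are placeholders, NOT a real detector's API -- consult your tool's docs.
import requests

DETECTOR_URL = "https://detector.example.com/v1/score"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                # placeholder credential

def flag_risk(text: str) -> float:
    resp = requests.post(
        DETECTOR_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["ai_probability"]  # hypothetical field name

score = flag_risk("Paste your draft here before you submit it anywhere.")
if score > 0.5:
    print(f"High flag risk ({score:.0%}): save your drafts and revision history.")
else:
    print(f"Lower flag risk ({score:.0%}), but keep your process documented anyway.")
```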

Consider humanization tools. If your authentic human writing is being flagged because of vocabulary patterns, running it through a humanization tool can increase vocabulary diversity without changing your meaning. This isn't cheating — it's protecting your own work from a biased system. The Stanford study showed that vocabulary enhancement alone cut false positives by nearly 50 percentage points.

TL;DR

  • A Stanford study found that AI detectors falsely flagged 61% of TOEFL essays by non-native speakers as AI-generated, while correctly identifying native English essays with near-perfect accuracy.
  • The bias stems from perplexity-based detection: non-native writers naturally produce simpler, more predictable text that looks statistically identical to AI output.
  • When researchers enhanced only the vocabulary in those same essays, false positives dropped by nearly 50 percentage points — proving the detectors measure native-sounding writing, not AI origin.
  • Black students are nearly 3x more likely than white students to be falsely accused of AI cheating, and neurodivergent students face elevated false positive rates too.
  • At least 12 major universities have disabled AI detection features. Document your writing process and test your work before submitting to protect yourself.

Falsely flagged because of how you write, not what you wrote? You shouldn't have to prove your own humanity. HumanizeThisAI increases vocabulary diversity and natural variation in your text — protecting your original work from biased detectors. Try it free: 1,000 words per month with a free account.



Alex Rivera

Content Lead at HumanizeThisAI

Alex Rivera is the Content Lead at HumanizeThisAI, specializing in AI detection systems, computational linguistics, and academic writing integrity. With a background in natural language processing and digital publishing, Alex has tested and analyzed over 50 AI detection tools and published comprehensive comparison research used by students and professionals worldwide.
