
How AI Detectors Handle Different GPT Versions

10 min read
Alex Rivera

Content Lead at HumanizeThisAI


Every new GPT release resets the detection game. GPT-3.5 was easy to catch. GPT-4 was harder. GPT-5 briefly dropped detection rates before tools scrambled to catch up. Here's how AI detectors actually perform across each model version — and why the newest model isn't always the hardest to detect. (For background on the core metrics detectors use, see our guides on perplexity and burstiness.)

Why Does the AI Model Version Matter?

AI detectors work by identifying statistical patterns in text — specifically perplexity (word predictability) and burstiness (sentence length variation). Every AI model has its own statistical fingerprint based on its training data, architecture, and fine-tuning. When a new model launches, it produces text with patterns that existing detectors haven't been trained to recognize.
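To make those two signals concrete, here's a minimal sketch of how they can be measured — assuming GPT-2 as the scoring model (chosen only because it's small and openly available) and a crude regex sentence splitter. Commercial detectors use proprietary models and far more features than this; treat it as an illustration of the idea, not how any real tool works.

```python
# Simplified illustration of perplexity and burstiness. GPT-2 is an assumption
# here, used only as a small open scoring model; real detectors differ.
import math, re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How predictable the text is to the scoring model (lower = more predictable)."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Sentence-length variation: std / mean of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mean = sum(lengths) / len(lengths)
    var = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return (var ** 0.5) / mean

sample = "The cat sat on the mat. It was a sunny day. Everyone felt happy."
print(f"perplexity ≈ {perplexity(sample):.1f}, burstiness ≈ {burstiness(sample):.2f}")
```

Low perplexity plus low burstiness is the classic "machine-written" profile; human prose tends to score higher on both.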

This creates a predictable cycle. A new model drops. Detection accuracy falls. Detector companies rush to collect samples, retrain their classifiers, and publish updated benchmarks. Within weeks or months, detection rates climb back up. Then the next model launches and the cycle restarts.

Understanding where each model sits in this cycle tells you how likely your content is to get flagged — and whether your detector results are actually reliable.

GPT-3 and GPT-3.5: The Easy Targets

GPT-3.5 (the model behind the original ChatGPT launch in late 2022) is the most detectable AI model in existence. Every major detector was built and trained on its output first. Turnitin, GPTZero, Originality.ai, and Copyleaks all cut their teeth on GPT-3.5 text.

Detection rates on raw GPT-3.5 output sit at 95–99% across nearly every tool. The text has unmistakable tells: extremely low perplexity, robotically uniform sentence lengths, heavy reliance on transitions like “Furthermore” and “Additionally,” and a tendency to generate clean five-paragraph structures with topic sentences that practically announce themselves.
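Those tells are simple enough that you can flag them yourself in a few lines. The sketch below is purely illustrative — the phrase list, the sample text, and any thresholds you'd apply are assumptions of ours, not what any detector actually uses.

```python
# Rough heuristic for two GPT-3.5-style tells: formulaic transitions and
# robotically uniform sentence lengths. Phrase list and sample are illustrative.
import re
import statistics

TRANSITIONS = ("furthermore", "additionally", "moreover",
               "in conclusion", "it is important to note", "overall")

def scan_tells(text: str) -> dict:
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    lowered = text.lower()
    return {
        "transition_hits": sum(lowered.count(p) for p in TRANSITIONS),
        "avg_sentence_len": statistics.mean(lengths),
        # A low standard deviation means suspiciously uniform sentence lengths.
        "sentence_len_stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }

sample = ("Furthermore, renewable energy is important. Additionally, it reduces "
          "emissions. Moreover, it creates jobs. In conclusion, adoption matters.")
print(scan_tells(sample))
```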

GPT-3 (the earlier base model, before instruction tuning) is actually slightly harder to detect in some cases. It produced rougher, less polished output — more grammatical quirks, stranger word choices, occasional incoherence. Ironically, these imperfections made it resemble bad human writing more than GPT-3.5's suspiciously clean output did.

Still Using GPT-3.5?

If you're still generating content with GPT-3.5-turbo (the free tier model on many platforms), your text is almost certainly getting flagged. The detectors have had three years to study this model's output. Without humanization, GPT-3.5 text is as close to a guaranteed flag as it gets.

GPT-4 and GPT-4o: Harder, But Not Hard Enough

GPT-4 represented a meaningful jump in text quality — and a corresponding drop in detectability. When it launched, detection accuracy fell by 10–15 percentage points across most tools. The text was smoother, more varied, and closer to how educated humans actually write.

Benchmark studies now report detection accuracy for raw GPT-4 output at around 81% — a significant drop from GPT-3.5's 95%+ but still high enough to be problematic. GPT-4o (the optimized multimodal variant) performs similarly, with some tests showing marginally lower detection due to its more conversational tuning.

The improvement came from several changes. GPT-4 uses more diverse vocabulary, varies its sentence structure more naturally, and doesn't lean as heavily on the formulaic transitions that made GPT-3.5 so easy to spot. But it still has detectable patterns — a preference for balanced paragraph lengths, consistent use of hedging language (“it's important to note”), and a tendency to provide exhaustive lists rather than selective examples.

Model         | Raw Detection Rate | After Basic Editing | After Humanization
GPT-3.5       | 95–99%             | 60–80%              | 5–15%
GPT-4         | 78–85%             | 45–65%              | 3–12%
GPT-4o        | 75–83%             | 40–60%              | 3–10%
GPT-5         | 90–97%*            | 35–55%              | 3–10%
Claude 3.5/4  | 53–72%             | 30–50%              | 2–8%
Gemini 2.x    | 80–90%             | 40–60%              | 3–10%

*GPT-5 rates reflect performance after detectors retrained. Initial detection was lower (~76%) before they adapted.

GPT-5: The Detection Dip and Recovery

GPT-5's release followed the exact pattern everyone expected. At launch, detection accuracy dropped noticeably. Originality.ai reported initial detection at 96.5% but acknowledged GPT-5 content was “less detectable than GPT-4o.” Independent tests initially found lower numbers — some tools dipping to 76% accuracy before retraining.

GPTZero responded aggressively. Within weeks of GPT-5's release, they published an update claiming over 97% accuracy on GPT-5, GPT-5-mini, and GPT-5-nano after retraining on samples from the new models. Turnitin similarly updated its classifier to cover GPT-5 output.

The speed of the recovery is notable. Where GPT-4 took months to be reliably detected, the detector ecosystem adapted to GPT-5 in weeks. The infrastructure for rapid retraining is now well-established — detector companies maintain pipelines specifically designed to ingest text from new models and update their classifiers quickly.

But there's a caveat. Those 97% accuracy claims come from the detector companies themselves, tested on their own benchmarks. Independent testing of GPT-5 detection is still catching up, and early third-party results suggest real-world accuracy is closer to 85–92% depending on the content type and length.

Claude, Gemini, and the Non-GPT Models

Here's something most people miss: AI detectors were primarily trained on GPT output. Models from other companies produce text with different statistical signatures, and detectors don't always catch them.

Claude (Anthropic) is the most interesting case. Independent testing has found that some detectors catch only 53–60% of Claude Haiku output. Claude's writing style is distinctly different from GPT models — it tends to be more cautious, uses more qualifiers, and structures arguments differently. Tools trained primarily on GPT patterns don't always recognize these signatures.

Gemini (Google) sits somewhere between GPT and Claude in detectability. Turnitin explicitly lists Gemini as a supported model for detection, and rates generally fall in the 80–90% range for raw output. An important wrinkle: Gemini output may carry Google DeepMind's SynthID watermarks, which provide a separate detection vector beyond statistical analysis.

Llama, Mistral, and open-source models are the hardest for detectors to handle. These models can be fine-tuned on custom datasets, fundamentally altering their statistical fingerprints. A Llama model fine-tuned on a specific person's writing produces output that is effectively indistinguishable from that person's natural writing — because statistically, it is their writing patterns reproduced by a different engine.
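For a sense of what that fine-tuning looks like in practice, here's a minimal LoRA sketch using the Hugging Face stack. The checkpoint name, the my_writing.txt file, and the hyperparameters are all placeholder assumptions for illustration — not a recipe any specific tool uses.

```python
# Minimal LoRA style fine-tune sketch. Checkpoint, data file, and hyperparameters
# are placeholders; the point is only that the statistical fingerprint shifts
# toward whatever text the adapter is trained on.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.2-1B"  # placeholder: any open causal LM you have access to
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains only small adapter matrices, so a style fine-tune stays cheap.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

ds = load_dataset("text", data_files="my_writing.txt")["train"]  # placeholder corpus
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="style-adapter",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```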

How Fast Do Detectors Retrain for New Models?

The detection game has shifted from “can we detect AI?” to “how fast can we retrain?” Here's the timeline for how quickly detectors adapt to new models.

  • GPT-3.5 (Nov 2022): Detectors took 3–6 months to reach 90%+ accuracy. This was the first generation, and the infrastructure didn't exist yet.
  • GPT-4 (Mar 2023): 6–10 weeks to recover accuracy. Detectors had established training pipelines by this point.
  • GPT-4o (May 2024): 2–4 weeks. Incremental update, not a fundamentally new architecture.
  • GPT-5 (2025): 1–3 weeks for major tools. GPTZero published their updated accuracy within days.

The window of lower detection after a new model release is shrinking with each generation. But the broader arms race continues because retraining solves detection on raw output — it doesn't solve detection on edited, humanized, or mixed-authorship content, which is how most people actually use these models.

How Does Turnitin Handle Different Models?

Since Turnitin is the tool most students care about, it's worth looking at separately.

Turnitin's AI detection system is designed to identify text from GPT-3, GPT-4, GPT-4o, GPT-5, Gemini, LLaMA, and other leading models. They claim 98% accuracy with a less-than-1% false positive rate.

The fine print matters. Turnitin's Chief Product Officer admitted the tool intentionally detects about 85% of AI content and deliberately lets 15% go undetected to keep false positives below their 1% target. That deliberate gap exists across all models — Turnitin has decided that missing some AI content is an acceptable tradeoff to avoid wrongly accusing human writers.
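To see why that tradeoff is unavoidable, here's a toy sketch with synthetic detector scores — the distributions are invented for illustration and have nothing to do with Turnitin's actual classifier. Pushing the decision threshold up to suppress false positives necessarily lets more AI text through.

```python
# Toy illustration of the threshold tradeoff: synthetic detector scores for
# human and AI text, showing false positives fall as missed AI text rises.
import random
random.seed(0)

human_scores = [random.gauss(0.30, 0.15) for _ in range(10_000)]  # scores on human text
ai_scores    = [random.gauss(0.75, 0.15) for _ in range(10_000)]  # scores on AI text

for threshold in (0.50, 0.60, 0.70):
    false_positive_rate = sum(s >= threshold for s in human_scores) / len(human_scores)
    detection_rate      = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    print(f"threshold {threshold:.2f}: detects {detection_rate:.0%} of AI text, "
          f"flags {false_positive_rate:.1%} of human text")
```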

What Turnitin doesn't publicize is how their accuracy varies by model. Independent testing suggests Claude output is harder for Turnitin than GPT output, and that newer models generally score lower than older ones even after retraining. The 98% headline number almost certainly reflects best-case performance on GPT-3.5 and GPT-4, not worst-case performance on Claude or fine-tuned open-source models.

What This Means for You

The model you use changes your detection risk, but not as much as you'd think. Here's the practical takeaway.

Switching models is not a reliable strategy. Yes, Claude output is harder to detect than GPT-3.5. But detectors are catching up to every model. Counting on a detection gap that shrinks monthly is not a plan — it's a gamble.

Editing matters more than the model. Across every model version, the single biggest factor in detection is whether the text was edited after generation. GPT-3.5 text that's been properly humanized is harder to detect than raw GPT-5 output. The model gives you a starting point; what you do with it determines the outcome.

New model windows are real but brief. If you happen to be using a model that just launched, you may have a few weeks of lower detection. But building a workflow around this is impractical — you'd need to switch models every time a new one launches.

The best approach is model-agnostic. Use whichever AI model produces the best output for your needs. Then run it through a semantic humanization tool like HumanizeThisAI that strips detectable patterns regardless of which model created them. Verify with a detector check before submitting. That workflow works on GPT-3.5, GPT-5, Claude, Gemini — it doesn't matter, because semantic reconstruction operates on the output, not the source.

TL;DR

  • GPT-3.5 is detected at 95–99% accuracy — it is the easiest AI model to catch by far.
  • GPT-4/4o dropped detection to ~80%, but detectors have largely caught up. GPT-5 followed the same dip-and-recovery pattern within weeks.
  • Non-GPT models (Claude, Llama, Mistral) are harder to detect because most tools were trained primarily on GPT output.
  • The window of lower detection after a new model release is shrinking — detector companies now retrain in days, not months.
  • Switching models is not a reliable strategy. Humanizing your output works across every model and every detector, regardless of version.

Every model gets detected eventually. The only reliable protection is humanization that works across all models and all detectors. HumanizeThisAI gives you 1,000 words free — no account required — so you can test it on output from whatever model you use.

Try HumanizeThisAI Free


Alex Rivera

Content Lead at HumanizeThisAI

Alex Rivera is the Content Lead at HumanizeThisAI, specializing in AI detection systems, computational linguistics, and academic writing integrity. With a background in natural language processing and digital publishing, Alex has tested and analyzed over 50 AI detection tools and published comprehensive comparison research used by students and professionals worldwide.

Ready to humanize your AI content?

Transform your AI-generated text into undetectable human writing with our advanced humanization technology.

Try HumanizeThisAI Now