Last updated: March 2026 | Definitions sourced from academic papers, detection tool documentation, and industry research
AI content detection has its own language. Perplexity, burstiness, classifier confidence, zero-shot detection — if you've ever read an AI detection report and felt lost, this glossary is for you. Every term below is explained in plain English with context for how it actually affects whether your text gets flagged.
Terms are organized alphabetically. Use your browser's search (Ctrl+F or Cmd+F) to jump to a specific term. Where relevant, we've linked to deeper guides on our blog.
A
Accuracy
The percentage of correct predictions a detection tool makes out of all predictions. If a detector analyzes 100 documents and correctly identifies 90 of them (whether as human or AI), its accuracy is 90%. Sounds straightforward, but accuracy alone can be misleading. A detector that labels everything "human" would be 50% accurate in a balanced dataset — and useless. That's why precision and recall matter more in practice. Turnitin claims 98% accuracy, but that figure applies specifically to raw, unmodified ChatGPT output. For more on how well these tools actually perform, see our deep dive on AI detector accuracy.
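The "label everything human" example can be made concrete with a few lines of Python. This is an illustrative sketch with invented labels, not any detector's actual code:

```python
def accuracy(predictions, truths):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

# Balanced toy dataset: 3 human documents, 3 AI documents.
truths = ["human", "human", "human", "ai", "ai", "ai"]

# A "detector" that labels everything human still scores 50% accuracy --
# which is exactly why precision and recall matter more.
lazy_detector = ["human"] * 6
print(accuracy(lazy_detector, truths))  # 0.5
```

The lazy detector never flags anything, yet its accuracy looks respectable on paper. Precision and recall (defined below) expose the problem immediately.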
AI Content Detector
A software tool designed to determine whether text was written by a human or generated by an AI model. Examples include Turnitin's AI detector, GPTZero, Originality.ai, Copyleaks, and ZeroGPT. These tools analyze statistical properties of text — not its meaning or quality — to make predictions. No current detector is 100% accurate, and all produce both false positives (flagging human text as AI) and false negatives (missing AI text). For tool-specific breakdowns, see our free AI detector.
AI Humanizer
A tool that takes AI-generated text and reconstructs it to match human writing patterns at the statistical level. Unlike paraphrasers (which swap words), humanizers rebuild text from the meaning up — changing perplexity, burstiness, sentence structure, and vocabulary distribution to produce output that reads and measures as human-written. HumanizeThisAI is an example. For more on how humanizers differ from paraphrasers, see AI Humanizer vs. Paraphraser: What's the Difference?
AI Watermarking
A technique where an AI model embeds a hidden statistical signature in its output during text generation. The watermark is invisible to readers but can be detected by a corresponding verification tool. Google's SynthID is the most prominent example, embedded in Gemini outputs since 2024. Current watermarks degrade quickly when text is edited or paraphrased, and major third-party detectors (Turnitin, GPTZero) do not yet use watermark detection — they rely on their own statistical models instead. See also: SynthID.
B
Burstiness
A measure of how much sentence length and complexity vary across a document. Human writing is naturally "bursty" — we write a four-word sentence, then a 35-word monster, then something in between. AI models default to uniform sentence lengths, typically averaging 12-18 words with little variation. Low burstiness is one of the strongest signals AI detectors use. GPTZero was one of the first detectors to use burstiness as a core metric, and it remains foundational to most detection systems in 2026. For a deeper explanation, see What Is Burstiness in AI Detection?
Technically, burstiness measures the change in perplexity across sentences in a document. If some sentences have high perplexity and others have low perplexity, the text is "bursty." If perplexity is uniform throughout, the text looks machine-generated.
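A simplified way to see burstiness in action is to measure the spread of sentence lengths. This sketch uses sentence-length standard deviation as a rough stand-in for the perplexity-variance measure real detectors compute; the example texts are invented:

```python
import statistics

def sentence_lengths(text):
    """Split on periods (simplified) and count words per sentence."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness_proxy(text):
    """Standard deviation of sentence length -- a rough proxy for
    the perplexity variance that detectors actually measure."""
    return statistics.pstdev(sentence_lengths(text))

human_like = ("No. The committee's decision surprised everyone who had "
              "followed the case closely. It changed things.")
uniform = ("The report covers four topics. The data supports three "
           "findings. The team reviewed every section.")

print(burstiness_proxy(human_like) > burstiness_proxy(uniform))  # True
```

The human-like sample mixes a 1-word sentence with an 11-word one, so its spread is high; the uniform sample's sentences are all 5 words, so its spread is zero — the "flat" signature detectors associate with AI.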
Bias in AI Detection
The tendency of AI detectors to disproportionately flag certain types of human writing as AI-generated. The most documented bias is against non-native English speakers. A Stanford HAI study found that detectors misclassified over 61% of TOEFL essays by non-native speakers as AI-generated, compared to near-perfect accuracy on essays by native speakers. Formal academic writers, neurodivergent students, and anyone who writes with consistent structure face elevated false positive rates. Vanderbilt University cited this bias as a primary reason for disabling Turnitin's AI detector. For more on this problem, see AI Detection Discrimination Against Non-Native Speakers.
C
Classifier
A machine learning model that sorts input into categories. In AI detection, a classifier sorts text into two categories: "human-written" or "AI-generated." Some classifiers add a third category: "mixed." The classifier is trained on large datasets of known human and AI text, learning the statistical patterns that distinguish them. OpenAI built (and later shut down) their own AI text classifier in 2023 due to low accuracy — it only correctly identified 26% of AI-written text. GPTZero, Turnitin, and Originality.ai all use proprietary classifiers.
Confidence Score
A number (typically 0-100% or 0-1) that indicates how certain a detector is about its prediction. A confidence score of 95% means the model is quite sure the text is AI-generated. A score of 55% means it's barely more than a coin flip. Most detectors set a threshold (often around 50-65%) below which they won't make a definitive claim. Turnitin displays this as the percentage of text "qualifying as AI-generated" with color-coded segments. It's important to understand that confidence scores are probability estimates, not proof.
Cross-Humanizer Generalization
A detection approach where a model is trained on outputs from multiple different humanizer tools simultaneously, rather than being specialized for one tool. Turnitin introduced this concept in their August 2025 anti-humanizer update. The idea is that different humanizers may leave different specific traces, but there are common patterns across all of them that a generalized model can learn. Testing shows this approach improved detection of basic humanizers but remains less effective against semantic reconstruction tools. For context on this arms race, see The AI Detection Arms Race in 2026.
D
Detection Threshold
The minimum confidence score at which a detector will label text as AI-generated. If a tool's threshold is 60%, any text scoring below 60% will be classified as "likely human" even if it has some AI characteristics. Different tools use different thresholds. Turnitin uses a 20% minimum to flag a document and recommends professors investigate scores between 20-80% rather than treating them as definitive. Setting the threshold higher reduces false positives but increases false negatives (more AI text gets through). Setting it lower catches more AI text but flags more innocent humans.
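The threshold tradeoff is easy to demonstrate with hypothetical scores. All scores and labels below are invented for illustration:

```python
# Six hypothetical documents: (detector score, true origin).
docs = [
    (0.15, "human"), (0.35, "human"), (0.55, "human"),  # human-written
    (0.45, "ai"),    (0.70, "ai"),    (0.95, "ai"),     # AI-generated
]

def flag_counts(threshold):
    """Count false positives and false negatives at a given threshold."""
    false_pos = sum(1 for score, label in docs
                    if score >= threshold and label == "human")
    false_neg = sum(1 for score, label in docs
                    if score < threshold and label == "ai")
    return false_pos, false_neg

print(flag_counts(0.40))  # (1, 0) -- catches all AI, but flags one human
print(flag_counts(0.60))  # (0, 1) -- no false accusations, misses one AI doc
```

Moving the threshold from 0.40 to 0.60 trades a false positive for a false negative. There is no setting that eliminates both; every detector picks a point on this curve.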
Deep Learning
A subset of machine learning that uses neural networks with multiple layers to learn complex patterns from data. All modern AI text generators (ChatGPT, Claude, Gemini) use deep learning, and so do the detection tools trying to catch them. In detection, deep learning models learn subtle statistical patterns across millions of text samples that would be impossible to define through manual rules. The "deep" refers to the many layers of the neural network, not the depth of understanding.
E
E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness)
Google's framework for evaluating content quality in search rankings. E-E-A-T isn't an AI detection metric, but it's deeply relevant. Google has stated they don't penalize AI content per se — they penalize low-quality content. AI-generated text often lacks the "Experience" component (first-person insights, real-world anecdotes) that human authors naturally bring. Content that demonstrates genuine experience scores better regardless of how it was produced. This is why humanizing AI text for SEO means adding authentic human elements, not just bypassing detection.
Entropy
A measure of randomness or uncertainty in a system. In text analysis, entropy measures how unpredictable the distribution of words is. High entropy means more varied, less predictable word usage — more human-like. Low entropy means more uniform, more predictable patterns — more AI-like. Entropy is closely related to perplexity; some detectors use them interchangeably, though they measure slightly different aspects of the same underlying property.
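Shannon entropy over a word-frequency distribution captures this directly. A minimal sketch with made-up example sentences:

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy (in bits) of the word-frequency distribution.
    Higher values mean more varied, less predictable vocabulary."""
    words = text.lower().split()
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

varied = "storms battered the coastline while farmers salvaged whatever crops remained"
repetitive = "the report shows the data shows the report shows the data"

print(word_entropy(varied) > word_entropy(repetitive))  # True
```

The varied sentence uses ten distinct words, so its entropy hits the maximum for its length; the repetitive one recycles four words, collapsing the distribution toward predictability.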
F
False Negative
When a detector fails to flag AI-generated text and incorrectly labels it as human-written. Also called a "miss." False negatives are what humanizer tools aim to create — AI text that the detector genuinely believes is human. Turnitin has acknowledged they intentionally accept some false negatives (they estimate catching about 85% of AI content) in order to keep their false positive rate low. Every detector makes this tradeoff.
False Positive
When a detector incorrectly flags human-written text as AI-generated. This is the scenario everyone fears — getting accused of using AI when you didn't. Turnitin claims a document-level false positive rate below 1%, but their sentence-level rate is approximately 4%. In a 20-sentence essay, that means there's a meaningful chance at least one sentence gets wrongly flagged. False positive rates are higher for non-native English speakers, formal writers, and neurodivergent students. Several major universities have disabled AI detection specifically because of false positive concerns. See our action plan for false positives.
Fine-Tuning
The process of taking a pre-trained AI model and training it further on a specific dataset for a specialized task. In AI detection, fine-tuning is how detectors are updated to recognize new AI models. When GPT-4o or Claude 3.5 launches, detection companies fine-tune their classifiers on the new model's output. This is also why there's usually a gap between a new AI model launching and detectors reliably catching it — the fine-tuning takes time and data.
G
Generative AI
AI systems that create new content — text, images, audio, code — rather than analyzing or classifying existing content. ChatGPT, Claude, Gemini, and Llama are all generative AI models for text. The content they produce is what AI detectors are designed to identify. Generative AI doesn't copy from existing sources; it produces statistically probable sequences based on patterns learned during training. This is why plagiarism checkers can't catch AI writing and purpose-built AI detectors exist.
GPT (Generative Pre-trained Transformer)
OpenAI's family of large language models, including GPT-3.5, GPT-4, and GPT-4o (the models behind ChatGPT). "Generative" means it produces new text. "Pre-trained" means it learned from massive text datasets before being fine-tuned. "Transformer" refers to the neural network architecture it uses. GPT models are the most commonly detected AI writing source because ChatGPT popularized AI writing and detectors had the most training data from it. Other model families (Claude, Gemini, Llama) have different statistical fingerprints.
H
Hallucination
When an AI model generates information that sounds authoritative but is factually wrong or entirely made up. AI models don't "know" facts; they predict likely word sequences. Sometimes those sequences describe things that don't exist: fake research papers, nonexistent court cases, invented statistics. Hallucinations aren't directly related to AI detection, but they're a practical risk when using AI-generated content. Always fact-check AI output, especially citations and statistics.
Humanization
The process of modifying AI-generated text so it exhibits the statistical properties of human writing. Effective humanization goes beyond word swapping to address perplexity, burstiness, sentence structure variation, vocabulary distribution, and topic flow. The term is sometimes used broadly (any modification to AI text) and sometimes specifically (semantic reconstruction that fundamentally changes the text's statistical profile). In this glossary, we use it in the specific sense. For a complete guide, see How to Humanize AI Text in 2026.
I
Inference
The process of using a trained model to make predictions on new data. When you submit text to an AI detector, the detector runs inference — feeding your text through its model to produce a prediction. Inference is distinct from training; it's the model applying what it learned rather than learning something new. Detection speed depends largely on inference efficiency.
L
Large Language Model (LLM)
An AI model trained on massive amounts of text data that can generate, analyze, and manipulate human language. "Large" refers to both the training data (hundreds of billions of words) and the model size (billions of parameters). ChatGPT, Claude, Gemini, and Llama are all LLMs. Each LLM has a somewhat distinct statistical fingerprint based on its training data and architecture, which is why detectors sometimes perform differently against different models. GPTZero detects ChatGPT-4o at 90.4% but Claude 3.5 at only 86.7%.
Log-Likelihood
The logarithm of the probability that a given model would produce a specific sequence of text. Some detectors use log-likelihood as a direct feature: they run your text through a reference language model and calculate how probable it is that the model generated those exact word sequences. Very high log-likelihood (meaning the model "likes" your text) suggests AI origin. Low log-likelihood suggests human writing with its unpredictable choices. This is closely related to perplexity, which is derived from log-likelihood.
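The link between log-likelihood and perplexity is a one-line formula: perplexity is the exponential of the negative average log-likelihood. This sketch uses invented per-token probabilities to show the relationship:

```python
import math

# Hypothetical per-token probabilities from a reference model --
# how likely the model found each successive word.
predictable = [0.9, 0.8, 0.85, 0.9]   # model "likes" every choice -> AI-like
surprising  = [0.4, 0.1, 0.3, 0.05]   # unexpected choices -> human-like

def avg_log_likelihood(probs):
    return sum(math.log(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity = exp(-average log-likelihood)."""
    return math.exp(-avg_log_likelihood(probs))

print(round(perplexity(predictable), 2))  # 1.16 -- low perplexity
print(round(perplexity(surprising), 2))   # 6.39 -- much higher perplexity
```

High probabilities (high log-likelihood) compress perplexity toward 1; improbable choices inflate it. This is the arithmetic behind "low perplexity suggests AI origin."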
M
Machine Learning
A branch of AI where systems learn patterns from data rather than following explicitly programmed rules. Both AI text generators and AI detectors rely on machine learning. The generator learned to produce human-like text by studying billions of examples. The detector learned to distinguish human from AI text by studying labeled examples of both. This is fundamentally why the detection problem is so hard — both sides are using similar technology, and improvements on one side drive improvements on the other.
Mixed Content
A document that contains both human-written and AI-generated text. This is extremely common in practice — a student might use AI for certain paragraphs and write others themselves, or a content creator might use AI as a starting point and then heavily edit. Mixed content is significantly harder for detectors to handle. Testing shows Turnitin's accuracy drops to 20-63% on mixed documents compared to 77-98% on fully AI-generated ones. Most detectors analyze text in segments and may flag individual paragraphs differently.
N
Natural Language Processing (NLP)
The field of AI concerned with enabling computers to understand, interpret, and generate human language. Both text generation (ChatGPT producing an essay) and text detection (GPTZero analyzing that essay) are NLP tasks. NLP covers everything from simple spell-checkers to complex transformer models. The rapid advances in NLP since 2017 (when the transformer architecture was introduced) are what made both convincing AI text and sophisticated AI detection possible.
Neural Network
A computing system inspired by the human brain, consisting of interconnected nodes ("neurons") organized in layers. Neural networks learn by adjusting the connections between nodes based on training data. In AI detection, neural networks process text through multiple layers, each extracting increasingly abstract patterns. Early layers might recognize word patterns. Later layers identify sentence-level structures. The final layers make the human-or-AI classification. All modern LLMs and most modern detectors are built on neural networks.
O
Originality Score
A metric specific to Originality.ai that indicates the probability of text being original (human-written) versus AI-generated. Displayed as a percentage, where 100% means the tool believes the text is entirely original and 0% means it believes the text is entirely AI-generated. Other detectors use similar scoring but with different names: GPTZero uses "completely generated probability," Turnitin uses "AI writing percentage," and Copyleaks uses a binary "AI Content Detected" flag with a confidence level.
Overfitting
When a model performs well on its training data but poorly on new, unseen data. In AI detection, overfitting is a significant problem. A detector might achieve 99% accuracy on its test set but perform much worse in the real world because it learned to recognize specific training examples rather than general AI writing patterns. This is one reason why detection companies' self-reported accuracy numbers often exceed what independent testers find.
P
Paraphrasing
Restating text using different words while preserving the original meaning. In the context of AI detection, paraphrasing refers to running AI text through tools like QuillBot that swap synonyms and rearrange sentences. While effective against plagiarism checkers, paraphrasing fails against AI detectors because it doesn't change the underlying statistical properties (perplexity, burstiness, structural patterns) that detectors actually measure. Turnitin now specifically detects and flags paraphrased AI content. For a detailed comparison, see AI Humanizer vs. Paraphraser.
Perplexity
A measure of how predictable text is to a language model. Technically, perplexity quantifies how "surprised" a model is by a given sequence of words. Low perplexity means the model could easily have predicted those words — suggesting AI origin. High perplexity means the word choices were unexpected — suggesting human writing.
Example: In the phrase "I went for a walk in the..." the word "park" has low perplexity (predictable). The phrase "municipal council's abandoned greenhouse" has high perplexity (unexpected). Human writing naturally mixes both. AI writing clusters toward the low-perplexity, predictable end.
GPTZero pioneered perplexity-based detection, but most modern detectors now use it as one input among many rather than relying on it alone. Research from Pangram Labs has shown that perplexity alone is insufficient for reliable detection.
Precision
Of all the documents a detector labels as "AI-generated," precision measures what percentage actually were AI-generated. High precision means the detector rarely falsely accuses humans. Low precision means it frequently flags human text as AI. Precision is distinct from recall (see below). A detector can have high precision (rarely wrong when it does flag something) but low recall (misses a lot of AI text). Turnitin prioritizes precision over recall, which is why they miss some AI content — they'd rather not flag it than falsely accuse a student.
Prompt Engineering
The practice of crafting specific instructions to an AI model to control the style, tone, and structure of its output. In the context of AI detection, some users try to use prompt engineering to make AI output less detectable — for example, instructing the model to "write like a human" or "vary your sentence lengths." Testing shows this reduces detection by 10-20 percentage points but doesn't eliminate it, because the underlying statistical patterns of the model's generation process remain even when the surface style changes.
R
Recall
Of all the AI-generated documents in a dataset, recall measures what percentage the detector successfully identified. High recall means the detector catches most AI text. Low recall means a lot of AI text slips through. Recall and precision exist in tension: boosting one typically reduces the other. Turnitin has publicly stated they accept lower recall (missing about 15% of AI text) to maintain higher precision (fewer false positives).
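Precision and recall fall out of four confusion-matrix counts. The numbers below are invented to mirror the Turnitin-style tradeoff described above (roughly 85% of AI caught, very few humans accused):

```python
# Hypothetical confusion counts for a detector run on 200 documents
# (100 human, 100 AI) -- numbers invented for illustration.
true_pos  = 85   # AI docs correctly flagged
false_neg = 15   # AI docs missed
false_pos = 2    # human docs wrongly flagged
true_neg  = 98   # human docs correctly cleared

# Of everything flagged, how much was really AI?
precision = true_pos / (true_pos + false_pos)
# Of all AI docs, how many were caught?
recall = true_pos / (true_pos + false_neg)

print(round(precision, 3))  # 0.977 -- rarely accuses humans
print(round(recall, 3))     # 0.85  -- misses 15% of AI text
```

Note how the two metrics answer different questions about the same run: precision protects humans from false accusation; recall measures how much AI text slips through.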
Robustness
A detector's ability to maintain accuracy when text has been modified, paraphrased, or humanized. A robust detector still catches AI text even after it's been altered. Most detectors are reasonably robust against basic paraphrasing (catching it 60-70% of the time) but much less robust against semantic reconstruction (catching it only 3-15% of the time). Improving robustness is the central technical challenge for detection companies.
S
Semantic Reconstruction
The process of extracting meaning from AI-generated text and rebuilding it from scratch using human-like writing patterns. Unlike paraphrasing (which modifies text at the word level), semantic reconstruction works at the meaning level — understanding what the text says, discarding how it says it, and expressing the same ideas with genuinely different structure, rhythm, and vocabulary distribution. This is the core technique used by advanced AI humanizers. It's effective against detection because the output isn't a modification of AI text; it's new text that happens to convey the same meaning.
Supervised Learning
A machine learning approach where the model is trained on labeled data — examples where the correct answer is known. Most AI detectors use supervised learning: they're trained on thousands of documents labeled as either "human" or "AI," learning to distinguish between them. The quality and diversity of the training data directly affects detection accuracy. This is why detectors tend to perform better on text from models that were popular when they were trained (like GPT-3.5) and worse on newer models they haven't seen as much of.
SynthID
Google's AI watermarking system, embedded in outputs from Gemini models. SynthID works by subtly biasing the model's word choices toward a pattern that a verification tool can later detect. The watermark is statistically invisible to human readers but can be identified algorithmically. However, SynthID degrades significantly when text is edited, paraphrased, or reconstructed. As of 2026, major third-party detectors (Turnitin, GPTZero, Originality.ai) do not use SynthID verification — they use their own statistical models. SynthID is currently more relevant as a research direction than a practical detection tool.
T
Temperature
A parameter that controls the randomness of an AI model's output. Low temperature (e.g., 0.2) makes the model more deterministic and predictable, producing text that's easier to detect. High temperature (e.g., 1.0 or above) introduces more randomness, which increases perplexity and can reduce detectability slightly. However, high temperature alone doesn't reliably bypass detection because other statistical signatures (burstiness, structural patterns) remain. Very high temperature often produces incoherent text, making it impractical as a detection avoidance strategy.
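Temperature works by rescaling the model's raw scores (logits) before they're turned into probabilities via softmax. A minimal sketch with made-up logits for three candidate next words:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores into sampling probabilities.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next words.
logits = [2.0, 1.0, 0.5]

low  = softmax_with_temperature(logits, 0.2)  # nearly deterministic
high = softmax_with_temperature(logits, 1.5)  # much flatter

print(round(low[0], 3))   # top word dominates at low temperature
print(round(high[0], 3))  # probability spreads out at high temperature
```

At temperature 0.2 the top candidate takes over 99% of the probability mass; at 1.5 it drops to roughly half, leaving real room for the alternatives. That extra randomness is what raises perplexity — but, as noted above, not enough on its own to defeat detection.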
Token
The basic unit of text that AI models process. A token isn't exactly a word — it's a chunk of text determined by the model's tokenizer. Common words like "the" are single tokens. Less common words might be split into multiple tokens. Roughly, one token equals about 0.75 words in English. Tokens matter for detection because detectors often analyze token-level probabilities. When a detector says your text has "low perplexity," it means the model's probability distribution at each token position was concentrated (high confidence) rather than spread out (low confidence).
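The 0.75 words-per-token rule of thumb makes back-of-envelope estimates easy. This helper is a rough approximation only — actual counts depend on each model's tokenizer:

```python
def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate using the ~0.75 words-per-token rule of
    thumb for English. Real counts vary by tokenizer and vocabulary."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```

So a 1,500-word essay works out to roughly 2,000 tokens — useful for sanity-checking model context limits and detector input caps.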
Transformer
The neural network architecture underlying virtually all modern language models and AI detectors. Introduced in Google's 2017 paper "Attention Is All You Need", transformers process text using an "attention mechanism" that allows the model to consider relationships between all words in a passage simultaneously, rather than reading sequentially. This is what makes both AI writing and AI detection as powerful as they are. GPT, Claude, Gemini, and the models inside Turnitin and GPTZero are all transformer-based.
True Negative
When a detector correctly identifies human-written text as human. This is the ideal outcome for human authors. The true negative rate is related to the false positive rate: if the false positive rate is 4%, the true negative rate is 96%. For students submitting original work, the true negative rate is the most personally relevant metric.
True Positive
When a detector correctly identifies AI-generated text as AI-generated. This is what detection companies optimize for in their marketing ("98% detection rate" means 98% true positive rate on their test data). But true positive rates vary enormously depending on the type of AI text: raw ChatGPT output has a high true positive rate, while humanized text has a very low one.
Turing Test
A test proposed by Alan Turing in 1950: if a human evaluator can't distinguish between a machine's responses and a human's, the machine is said to exhibit intelligent behavior. In 2026, the Turing Test has been effectively surpassed for text — most people cannot reliably distinguish well-prompted AI writing from human writing. This is precisely why we need algorithmic detection: the human eye can't do it anymore, so we rely on statistical analysis. But as this glossary makes clear, algorithmic detection has its own significant limitations.
Z
Zero-Shot Detection
A detection approach where the model can identify AI text from a generator it has never been specifically trained on. For example, if a detector trained only on GPT-4 output can also catch Claude or Gemini output, it has zero-shot generalization capability. This is the holy grail for detection companies because new AI models launch constantly. In practice, zero-shot detection rates are significantly lower than trained detection — typically 15-30 percentage points worse. This is why there's always a vulnerability window when new models launch.
Which 10 Terms Matter Most?
If you only remember ten terms from this glossary, make it these. They cover 90% of what you'll encounter in any AI detection context:
| Term | One-Line Definition | Why It Matters |
|---|---|---|
| Perplexity | How predictable your word choices are | Core detection metric — low = AI signal |
| Burstiness | Variation in sentence length/complexity | Core detection metric — flat = AI signal |
| False Positive | Human text wrongly flagged as AI | The biggest risk for honest writers |
| Confidence Score | How certain the detector is | A probability estimate, not proof |
| Semantic Reconstruction | Rebuilding text from its meaning | Most effective humanization method |
| Transformer | Architecture behind all modern AI | Powers both generation and detection |
| Detection Threshold | Score cutoff for flagging text | Varies by tool — affects false positive rate |
| Paraphrasing | Swapping words while keeping structure | Detected 60-70% of the time by modern tools |
| LLM | The models that generate AI text | Each model has a unique detection fingerprint |
| Zero-Shot Detection | Catching text from untrained-on models | Always weaker — new models exploit this gap |
TL;DR
- Perplexity (word predictability) and burstiness (sentence-length variation) are the two core metrics AI detectors rely on — low perplexity + flat burstiness = AI signal.
- False positives are a real risk: Turnitin's sentence-level rate is ~4%, and non-native English speakers are flagged at dramatically higher rates due to detector bias.
- Paraphrasing swaps words but preserves the statistical fingerprint detectors measure; semantic reconstruction rebuilds text from meaning, which is why humanizers outperform paraphrasers.
- No detector is 100% accurate — all make trade-offs between precision (not accusing innocents) and recall (catching AI text), and self-reported accuracy rarely matches independent testing.
- New AI models always have a vulnerability window before detectors fine-tune against them, and zero-shot detection performs 15-30 percentage points worse than trained detection.
Now that you know the vocabulary, put it to use. Run any text through HumanizeThisAI to see how it scores on the metrics above — perplexity, burstiness, and overall detection probability. Free for up to 1,000 words, no account required. Or check your text with our free AI detector first.
Try HumanizeThisAI Free