How we test, what the numbers actually mean.
Every detector benchmark we publish on this site comes from this process. This page covers sample composition, sourcing, scoring, freshness rules, and the things we will not claim. Last reviewed May 7, 2026.
What 99.6% means
The headline figure is the rolling four-week average of HumanGPT's bypass rate across the seven major detectors (GPTZero, Turnitin AI, Originality.ai, Copyleaks, ZeroGPT, Sapling, Winston AI). 'Bypass' means the detector returned a 'human' verdict on the humanized output. The number is the percentage of detector verdicts that returned human across all samples in a given test cycle.
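To make the arithmetic concrete, here is a minimal sketch of how a rolling figure like this can be computed, assuming it is a simple mean of the last four weekly cycle rates. The cycle values in the example are placeholders for illustration, not real report data.

```python
# Minimal sketch of the rolling headline figure, assuming a simple mean of the
# most recent weekly cycle rates. Values below are placeholders, not real data.

def rolling_bypass_rate(cycle_rates: list[float], window: int = 4) -> float:
    """Average the aggregate bypass rates of the most recent weekly cycles."""
    recent = cycle_rates[-window:]
    return sum(recent) / len(recent)

weekly_rates = [99.4, 99.7, 99.5, 99.8]  # placeholder per-cycle percentages
print(f"{rolling_bypass_rate(weekly_rates):.1f}%")  # -> 99.6%
```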
What 99.6% does not mean: that every single passage from every single user clears every detector every time. It is an aggregate across samples and detectors. Individual passages can still trigger flags, especially on Originality.ai's strict mode or on highly formal text (legal contracts, dense academic prose). For those, the Pro and Founder tiers run additional rewrite passes.
Sample sourcing
Each weekly test uses 100 fresh text samples. Composition (an illustrative sketch follows the list):
- 40 AI-only samples: Generated from a rotating prompt pool covering essay topics, marketing copy, business memos, story openings, and academic-style paragraphs. Half from ChatGPT-4o, half from Claude 3.5 Sonnet, with rare additions from Gemini 2.5 and Llama 3.3 70B.
- 30 AI + manual edits: AI output where we made roughly 20-40% manual edits before testing. Simulates the realistic case of a student or writer touching up AI output before submission.
- 30 AI + HumanGPT processed: Fresh AI output run through the HumanGPT engine on the appropriate plan tier. This is the cohort the bypass rate is measured against.
- Topical rotation: Topics rotate weekly among academic, marketing, cover letter, business, legal, story, article, essay, and report. No topic repeats two weeks in a row.
- Length distribution: Samples range from 250 to 1,500 words, the range where detector verdicts are most meaningful: under 200 words most detectors are unreliable, and over 2,000 words all detectors do better.
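As a rough illustration of the weekly composition, one test cycle could be described with a small configuration object like the sketch below. The field names are hypothetical, not HumanGPT's internal schema; only the counts and word range come from the list above.

```python
# Illustrative description of one weekly test cycle. Field names are
# hypothetical, not HumanGPT's internal schema.
from dataclasses import dataclass

@dataclass
class WeeklyCycle:
    ai_only: int = 40          # raw generations from the rotating prompt pool
    ai_manual_edits: int = 30  # AI output with roughly 20-40% manual edits
    ai_humanized: int = 30     # AI output processed through the HumanGPT engine
    min_words: int = 250       # shortest sample length in the cycle
    max_words: int = 1500      # longest sample length in the cycle

    def total_samples(self) -> int:
        return self.ai_only + self.ai_manual_edits + self.ai_humanized

cycle = WeeklyCycle()
assert cycle.total_samples() == 100  # every weekly cycle uses exactly 100 samples
```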
Scoring
Every sample is run through every detector in the test set. Scores are screenshot-logged for audit. A 'pass' is recorded when the detector returns a human verdict using the detector's own default threshold (we do not lower thresholds to inflate our numbers). The aggregate bypass rate for the cycle is the percentage of pass verdicts across all (sample, detector) pairs.
We log per-detector breakdowns separately because the seven detectors have very different strictness profiles. Copyleaks and Originality are the strictest. ZeroGPT is the loosest. The aggregate hides this variance, so individual detector numbers are published alongside the headline.
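For readers who want the arithmetic spelled out, the following sketch shows how the aggregate and per-detector numbers fall out of the logged verdicts. It assumes each logged result is a (sample_id, detector, passed) record; that record shape is an assumption for illustration, not the actual logging format.

```python
# Sketch of cycle-level scoring, assuming each logged result is a
# (sample_id, detector, passed) record; the record shape is an assumption.
from collections import defaultdict

def cycle_breakdown(verdicts):
    """Return (aggregate bypass rate, per-detector rates), both in percent.

    A 'pass' is a human verdict at the detector's own default threshold, so the
    aggregate is simply passes / total over all (sample, detector) pairs.
    """
    per_detector = defaultdict(list)
    for _sample_id, detector, passed in verdicts:
        per_detector[detector].append(passed)

    all_results = [p for results in per_detector.values() for p in results]
    aggregate = 100.0 * sum(all_results) / len(all_results)
    by_detector = {
        name: 100.0 * sum(results) / len(results)
        for name, results in per_detector.items()
    }
    return aggregate, by_detector
```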
When detectors update
Detector vendors ship model updates roughly every 4-12 weeks. Each update produces a temporary dip in our bypass rate for the affected detector. When that happens we publish the dip in the same week's report, then patch the engine. Patches typically take 1-7 days. We do not retroactively edit old reports to hide dips; the historical changelog stays public.
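One way to make "temporary dip" operational is to compare each detector's latest weekly rate against its own recent history, as in the sketch below. The 2-percentage-point threshold is an assumption for illustration, not a documented rule.

```python
# Illustrative dip check: flag any detector whose latest weekly pass rate falls
# noticeably below its recent average. The 2-point threshold is an assumption.
def flag_dips(history: dict[str, list[float]], threshold: float = 2.0) -> list[str]:
    """Return detectors whose newest weekly rate dropped more than `threshold`
    percentage points below the mean of the preceding weeks."""
    flagged = []
    for detector, rates in history.items():
        if len(rates) < 2:
            continue
        baseline = sum(rates[:-1]) / len(rates[:-1])
        if baseline - rates[-1] > threshold:
            flagged.append(detector)
    return flagged
```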
What we will not claim
- We will not claim 100% bypass. Anyone claiming this is either lying or has not tested at scale. The detector landscape moves, and 100% on every passage on every detector forever is not achievable.
- We will not claim user counts we cannot verify. Marketing pages used to show fabricated user counts. We do not. Real signups are visible to the team only and we do not publish them.
- We will not invent press placements or testimonials. If a publication has not actually written about HumanGPT, it does not appear in the press marquee. If a quote is not from a real, verified user, it is not on the site.
- We will not lower detector thresholds for testing. Vendors set their own default flag thresholds. We use those defaults. Lowering thresholds during our own tests to claim higher bypass rates is fraud.
Independent verification
We encourage independent testing. Anyone can run a HumanGPT humanization through any detector and check the verdict. The free tier (200 words/day, no signup) is more than enough to verify the engine. If your sample produces a different result than our reported numbers, please let us know with the input and output text and we will investigate.
Where original research lives
Larger studies (the May 2026 500-sample detector accuracy test, the language-by-language false positive study, the GPT-4o vs Claude 3.5 vs Gemini 2.5 detection comparison) are published as separate articles in the HumanGPT Journal with their own dated methodology sections. The methodology principles on this page apply to all of them.