A Number Is Not Evidence: How AI Detectors Punish Honest Students

There is a particular kind of dread that arrives in an email. Not the dramatic, cinematic kind, but the bureaucratic sort, the message with a case number in the subject line and a portal link in the body, the message that informs you that an allegation has been raised against you and that you are required to respond. For a growing number of university students around the world, that email is the opening move in a process they did not start, cannot easily understand, and may struggle to survive with their academic record intact. The allegation is that they cheated. The accuser is a piece of software. And the evidence, frequently, is nothing more than a number generated by a model that its own manufacturer admits can be wrong.

We have built an apparatus of suspicion and pointed it at the people we are supposed to be educating. The apparatus does not deliberate. It does not weigh context, intent, or the particular cadence of a sentence written by someone who learned English as a third language. It produces a percentage. And in institutions stretched thin and frightened of being left behind by generative artificial intelligence, that percentage has quietly hardened into something resembling a verdict. The question that follows is not technical. It is moral, procedural, and increasingly legal. If a machine is the first-instance judge of a student's honesty, and that machine is demonstrably biased against particular kinds of writers, what does an institution owe the people it accuses? And is a detector's output ever, on its own, enough to justify punishing anyone at all?

The number that became a verdict

Begin with the scale of the thing, because scale is what turns an individual injustice into a systemic one. In October 2025, the Australian Broadcasting Corporation reported on internal documents from Australian Catholic University showing that the institution had logged close to six thousand academic misconduct referrals in a single year. Roughly nine in ten of those referrals concerned suspected misuse of artificial intelligence. The university operated across nine campuses, and across all of them the same pattern repeated itself: a student submitted work, a detection tool returned a score, and a process began.

What the documents revealed was not merely the volume of accusations but the thinness of the evidence underpinning many of them. According to the reporting, around a quarter of all referrals were dismissed once they were actually investigated. The university acknowledged that any case resting solely on a Turnitin AI report was dismissed immediately, an admission that, read carefully, is extraordinary. It means the institution understood that the tool's output, by itself, proved nothing. And yet thousands of students had been put through the machinery of accusation on precisely that basis.

The human texture of those cases is where the abstraction collapses into something harder to look at. One student, referred to in the reporting as Madeleine, was accused of using AI to write an assignment in the final year of her nursing degree, while she was on placement caring for patients. She was cleared, but only after roughly six months during which her results were withheld. She believes the delay cost her a graduate role. A paramedic student described being told that eighty-four per cent of his essay had been flagged as machine-written, despite having laboured over every line of it. Students told the ABC that the process was opaque, rushed, and weighted against them, that they were given little time to respond, and that they were asked to produce search histories and handwritten notes to prove a negative. The burden, in other words, had been inverted. It was not for the university to prove that a student had cheated. It was for the student to prove that they had not.

This inversion is the single most important fact in the entire story, and we will return to it, because it is the precise point at which a tool becomes an injustice. Australian Catholic University, to its credit, eventually stopped using the Turnitin AI indicator in March 2025. But internal documents indicated it had been aware of the tool's unreliability for more than a year before pulling it. The damage to the students processed in the interim was already done.

A model that is wrong in a predictable direction

It would be one thing if AI detectors were simply unreliable in a random, scattershot way, flagging the innocent and the guilty in roughly equal and indiscriminate measure. Random error is bad, but it is at least democratic. The deeper problem is that these tools are wrong in a direction. They fail more often, and more confidently, against specific groups of people. And the people they fail are, with a bleak predictability, those least equipped to fight back.

The foundational evidence here comes from a study led by researchers at Stanford University, published in 2023 and titled, with admirable bluntness, “GPT detectors are biased against non-native English writers.” The team, whose co-authors included the computer scientist Weixin Liang, ran a set of essays written by non-native English speakers through seven widely used GPT detectors. The detectors flagged those essays as AI-generated 61.3 per cent of the time. Nearly ninety-eight per cent of the essays were flagged by at least one detector. On around a fifth of the papers, every single detector agreed, unanimously, that a human-written essay had been produced by a machine. Every one of those essays had, in fact, been written by a person.

The mechanism behind this bias is not mysterious, and understanding it matters because it tells us the problem cannot be engineered away with a software update. Many detectors lean on a statistical property called perplexity, a measure of how surprising or predictable a sequence of words is. Text that uses common words in common arrangements scores as low-perplexity, and low-perplexity text looks, to the model, like machine output. The trouble is that writing in a second or third language tends naturally toward exactly this profile. A non-native writer often reaches for the reliable word rather than the idiomatic one, favours regular syntax over flourish, and produces prose that is more formal, more concise, and more evenly structured. These are the marks of someone writing carefully in a language not their own. They are also, by unhappy coincidence, the statistical fingerprints the detector has learned to associate with a chatbot.
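To make the perplexity idea concrete, here is a minimal sketch in Python. It is a toy, not how any commercial detector works: the tiny reference corpus, the add-one smoothed word-frequency model, the example sentences, and the `THRESHOLD` cut-off are all assumptions chosen for illustration, whereas real detectors score text with large neural language models. What the sketch does show is the definition itself: perplexity is the exponential of the average negative log-probability per word, so a sentence built from common words scores low and a sentence built from unusual ones scores high.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for "what the model has seen".
# Real detectors rely on a large neural language model instead.
CORPUS = (
    "the study shows that the results are important . "
    "the results show that the method is good . "
    "it is important that the study is clear ."
).split()

COUNTS = Counter(CORPUS)
TOTAL = len(CORPUS)
VOCAB_SIZE = len(COUNTS) + 1  # one extra slot shared by all unseen words

# Assumed cut-off for illustration only; not taken from any real tool.
THRESHOLD = 20.0


def word_prob(word: str) -> float:
    """Add-one (Laplace) smoothed probability of a single word."""
    return (COUNTS[word] + 1) / (TOTAL + VOCAB_SIZE)


def perplexity(text: str) -> float:
    """Exponential of the average negative log-probability per word."""
    words = text.lower().split()
    avg_neg_log = -sum(math.log(word_prob(w)) for w in words) / len(words)
    return math.exp(avg_neg_log)


def looks_machine_written(text: str) -> bool:
    """Naive rule: predictable (low-perplexity) text gets flagged."""
    return perplexity(text) < THRESHOLD


plain = "the results are important"                # common words, regular syntax
idiomatic = "these findings loom remarkably large"  # rarer, more idiomatic words

for sample in (plain, idiomatic):
    print(f"{sample!r}: perplexity={perplexity(sample):.1f}, "
          f"flagged={looks_machine_written(sample)}")
```

Under these assumptions the plain sentence is the one that gets flagged, even though both were typed by a person. The careful, reliable word choice is exactly what pushes the score down.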

The Markup, the investigative technology newsroom, examined the consequences of this in reporting by the journalist Tara García Mathewson, also published in 2023. The piece documented how Turnitin's tool was markedly more likely to flag the writing of international students, and it laid out the stakes with clarity. There are close to a million international students in the United States alone. For many of them, an academic misconduct finding is not merely an academic matter. It can imperil a visa, and with it an entire future built around the right to remain and study. The newsroom quoted Liang's assessment directly: “The design of many GPT detectors inherently discriminates against non-native authors.” The reporting also captured the helplessness of those caught in the net. One New York University senior, Heewon Yang, put it plainly: “If it's the AI picking up on our language patterns and automatically deciding, I don't know how I can prevent that.”

That sentence deserves to be sat with: “I don't know how I can prevent that.” It is the voice of someone who has understood that the accusation against them is not really about anything they did. It is about how they write, which is to say about who they are, and there is no behaviour they can adopt to make themselves safe.

The student who paid to prove a negative

In law, there is a reason the burden of proof rests with the accuser. Proving that something did not happen is, in the general case, close to impossible. You cannot produce evidence of an absence. This is why the inversion documented at Australian Catholic University is so corrosive, and why a case from Garden City, New York, has come to feel like a landmark.

In early 2026, Newsday reported on a student named Orion Newby, a history student at Adelphi University. The episode that upended his first year began when a professor ran one of his papers, written for a World Civilizations course, through Turnitin's AI detection tool. The tool returned a result indicating the paper was one hundred per cent AI-generated. On the strength of that single number, the professor awarded a zero, and a misconduct finding followed.

Newby said he had not used AI to generate the essay. The grammatical help he had received came from a human tutor through Bridges to Adelphi, a university programme that supports students with neurodevelopmental differences. Two other AI detectors, run on the same paper, indicated it had been written by a human. In other words, the evidence was not even internally consistent. One tool said machine, two said human, and the institution proceeded on the word of the one that condemned him.

What Newby's family did next is what most families cannot do. They went to court, and they reportedly spent more than one hundred thousand dollars doing it. The matter came before Justice Randy Sue Marber in the State Supreme Court in Nassau County, and on 28 January 2026 she ruled against the university, ordering Adelphi to reverse its sanctions and expunge the finding from Newby's record. The student's attorney, Mark Lesko, described the decision as groundbreaking and urged higher education to look hard at what it had been doing. Newby's own account of the moment he was first accused is the kind of thing that lingers: “I felt shocked. I felt like that was it. I felt like my life was over.”

The Newby case is a victory, but it is a victory that should frighten us rather than reassure us, because of what it took to win. It took six figures. It took a family with the resources, the resolve, and the time to mount a sustained legal campaign against an institution. It took a sympathetic judge. The overwhelming majority of students accused on the basis of a detector score have none of these things. They have a portal link, a deadline, and a number they cannot argue with. For every Orion Newby whose name is cleared in a courtroom, there are, by the logic of the Australian numbers, a great many others who simply absorbed the finding, accepted the lesser penalty to make the nightmare stop, or quietly carried a mark on their record that they did not earn. Justice that is only available to those who can spend a hundred thousand dollars is not a system of justice. It is a lottery with an entry fee.

A climate of fear, measured

If you want to understand what the constant possibility of false accusation does to a population, you do not have to rely on anecdote. It has been measured. In February 2026, Times Higher Education reported on survey work conducted by YouGov and commissioned by the student support company Studiosity, drawing on responses from 2,373 students in the United Kingdom. The numbers describe a generation studying under a low, persistent hum of anxiety. Seventy-one per cent of respondents said they used AI tools for assignments or study, a figure up from sixty-four per cent the previous year. But the striking finding was not adoption. It was dread. Three-quarters of the students who used AI reported stress about being wrongly flagged for plagiarism by detection tools. Sixty per cent experienced stress while using the tools at all. Fifty-two per cent of everyone surveyed cited being accused of cheating when they had done nothing wrong as a source of stress. And the burden was not evenly distributed: international students were twice as likely to report experiencing a lot of stress, the very students the Stanford research had already identified as most likely to be falsely flagged. The fear, in other words, is not irrational. It is a rational response to a measurable risk that falls hardest on those already most exposed.

There is something quietly devastating in that distribution. The detection regime was sold, implicitly, as a way of protecting the integrity of honest work. What the survey describes is the opposite: honest students, the ones with nothing to hide, living under a suspicion they cannot dispel, and the heaviest share of that suspicion landing on the students who are furthest from home and have the most to lose. Universities UK chief executive Vivienne Stern was among those whose organisation engaged with the findings, and the report's own recommendations were unambiguous, urging university leaders to reconsider detection tools that produce false positives and to establish pathways to protect students from wrongful accusations. The recommendation is sensible. That it needed to be made at all is the indictment.

The algorithm starts writing the student

Here is where the story curves into something genuinely strange, a development that would be funny if it were not a quiet catastrophe for the whole purpose of education. Students have begun to write badly on purpose.

In April 2026, The Washington Post published an opinion piece arguing, in its headline, that AI detectors are hurting honest students and that schools should ban them. The phenomenon it described had already been surfacing elsewhere. The technology writer Mike Masnick, drawing on an account from the writing instructor Dadland Maye published in the Chronicle of Higher Education, documented students deliberately degrading their own prose to avoid the detector's gaze. They introduced typos. They roughened their grammar. They stripped out the sophisticated vocabulary and the clean structure that a detector might read as too polished to be human. They spent hours rewriting work that was already good, not to make it better, but to make it look worse, because looking worse was the only way to look safe.

The logic, once you trace it, folds back on itself in a way that should be deeply alarming to anyone who cares about teaching. Maye described a student who began using generative AI for the first time not to cheat but to defend herself, running her own honest writing through AI tools to see how it would register, because she had heard that certain stylistic features, the em dash among them, were rumoured to trigger detectors. A tool deployed to stop students using AI had become the specific reason a student started using it. The detector did not catch a cheat. It manufactured one, out of someone who had been entirely honest, by making honesty feel unsafe.

Strip away the technology for a moment and look at what is actually happening here. The entire project of education is to help a person write more clearly, think more precisely, and express themselves with greater command of their language. We assess this growth because we want to encourage it. And now we have built an assessment apparatus whose statistical preferences run in exactly the opposite direction, that reads clarity as suspicious and precision as evidence of guilt, and that is therefore teaching students, through the crude instrument of fear, to write less clearly and think less precisely on the page. The algorithm's limitations have begun to shape the work. The tail is wagging the dog, and the dog is the intellectual development of a generation.

Who is judging, exactly?

To call an AI detector a judge is, strictly, a category error, and the error is illuminating. A judge, in any system worthy of the name, does several things a detector cannot. A judge gives reasons. A judge can be questioned. A judge weighs the specific circumstances of the case before them, the context, the history, the plausibility of competing accounts. A judge operates within a framework of appeal, so that error at one level can be corrected at another. And a judge, crucially, is accountable: a bad decision can be traced to a decision-maker who can be asked to justify it.

A detector does none of this. It returns a number. It cannot tell you why a particular sentence raised its score, in any sense a human being could interrogate or rebut. It has no concept of the student in front of it, of the fact that this is a non-native speaker writing carefully, or a neurodivergent student who works with a tutor, or simply a person whose natural prose happens to be tidy. It cannot hear an explanation. It produces the same output whether the truth is innocence or guilt, because it is not, in any meaningful sense, assessing truth at all. It is assessing resemblance, the statistical likeness between one body of text and another, and resemblance is not evidence of authorship any more than sharing a surname is evidence of a crime.

Turnitin itself, the company whose tool sits at the centre of so many of these cases, cautions that its AI indicator may not always be accurate and should not be used as the sole basis for action. This is the manufacturer telling the buyer not to do the exact thing the buyer keeps doing. The detection score is meant to be a flag, a prompt for a human being to look more closely, the beginning of an inquiry rather than the end of one. What the cases at Australian Catholic University and Adelphi reveal is an institutional failure to honour that distinction. The flag became the finding. The prompt became the verdict. The thing that was supposed to start a conversation ended it.

This collapse is not, at root, a failure of the technology. It is a failure of process, and the difference matters enormously, because it locates the responsibility squarely with the institutions rather than with the tools. A detector cannot expel anyone. A detector cannot withhold a degree, derail a graduate job offer, or place a permanent mark on a record. Only an institution can do those things, and only an institution can decide to do them on evidence it has been explicitly warned is insufficient. When a university treats a detector's output as dispositive, it is not being overruled by a machine. It is choosing to outsource its judgement to one, and then disowning the consequences. The algorithm makes a convenient defendant precisely because it cannot be held to account. The institution behind it can.

What is actually owed

So what, concretely, do institutions owe the students they accuse? The answer is not exotic. It is mostly a matter of importing into the disciplinary process the elementary protections that any fair system of adjudication already takes for granted, and that these processes have somehow been allowed to discard.

The first obligation is the most fundamental, and it is the one inverted at Australian Catholic University. The burden of proof must rest with the accuser. It is for the institution to demonstrate misconduct, not for the student to perform the near-impossible feat of proving they did not cheat. A demand that a student produce browser histories, draft documents, and handwritten notes to establish their innocence is not a fair process. It is a presumption of guilt dressed in the language of due diligence, and it should be recognised as such.

The second obligation follows directly from everything the research has established. A detector score cannot, on its own, constitute sufficient evidence for any disciplinary action whatsoever. This is not a cautious or hedged position. It is the position the tools' own makers articulate, the position the Stanford findings compel, and the position a New York court effectively enforced when it overturned a finding built on a single score while two other detectors disagreed. A number that is wrong sixty-one per cent of the time against a known group of writers, and that the manufacturer warns against using alone, is not evidence on which a person's academic future can justly be staked. If a detector flag is to play any role at all, it can only be as the trigger for a genuine human inquiry, one that gathers independent evidence, hears the student's account in full, and reaches a conclusion a person is willing to put their name to and defend.

The third obligation concerns transparency and appeal. A student accused must be told what they are accused of, on what basis, and by what process the matter will be decided. They must have a real and timely opportunity to respond, before sanctions bite rather than months afterward while their results sit frozen and their job offers evaporate. And there must be a route of appeal that does not require a six-figure legal budget to walk. The lesson of the Newby case is not that the courts work. It is that they should never have had to, that the protections he eventually secured in a courtroom should have been available to him, and to every student, inside the institution from the start.

The fourth obligation is to confront the bias directly rather than treating it as an unfortunate externality. An institution that knows, as the evidence now makes it impossible not to know, that its detection regime systematically misfires against international students, non-native English speakers, and neurodivergent students, and that deploys that regime anyway without correction, is not a neutral party caught out by a flawed vendor. It has chosen to run a process it understands to discriminate. The students most likely to be falsely accused are frequently those least able to absorb the consequences, the ones for whom a misconduct finding can mean a lost visa, a severed future, a return home in disgrace for something they did not do. To know this and to proceed regardless is a decision with a moral weight that no procurement contract can offload onto a software company.

The cost, and who pays it

Every system of accusation produces error. The relevant question is never whether mistakes will happen, because they always will, but who is made to bear them. The deepest failure of the current arrangement is that it has loaded its errors onto the people least able to carry them and least responsible for the system's design.

Consider the distribution one more time, because it is the whole argument in miniature. The detector is most likely to be wrong about non-native speakers and international students. Those same students are, by the survey evidence, the most stressed by the threat of false accusation. They are also, frequently, the ones facing the gravest consequences if a finding sticks, because their right to study and to stay can hang on a clean record. So the technology's errors flow, with a kind of grim efficiency, precisely toward the people who can least afford them. This is not a system that happens to be unfair at the margins. It is a system whose unfairness is concentrated and predictable, aimed with statistical precision at the already vulnerable.

And the cost is not only borne by the individuals falsely accused, real and severe as that cost is. It is borne by the educational relationship itself. A university works on trust, on the basic assumption that the institution and the student are engaged in a shared project rather than locked in an adversarial contest. Every false accusation withdraws a little from that account. The student who learns to write worse to stay safe, the honest student running her own essays through a chatbot to see if she will be flagged, the international student who has concluded there is nothing she can do to make her own prose look human enough: each of them has been taught a lesson the curriculum never intended, which is that the institution does not trust them and cannot be trusted in return. That is a debt that compounds, and it is not clear how it is ever repaid.

The defenders of detection will say that the alternative is a free-for-all, that without these tools academic integrity collapses under a flood of machine-written work. The concern is real, and nothing here suggests that AI misuse is imaginary or that universities should simply surrender. But the choice was never between flawed detectors and anarchy. The choice is between treating a detector's output as the beginning of a fair inquiry and treating it as the end of one. The former is difficult, demanding human time, judgement, and the willingness to be wrong and correct it. The latter is cheap, fast, and unjust. We have, in too many places, chosen the cheap and unjust thing and called it integrity.

The presumption of innocence is not a technicality that fairness can afford to suspend when the caseload climbs and the tools get convenient. It is the load-bearing wall of any process that claims the authority to judge. When an institution lets a machine that cannot explain itself, cannot be questioned, and is known to be biased stand in for that judgement, it has not modernised its disciplinary process. It has abandoned the principle that gave the process its legitimacy in the first place. The students caught in the gap between the tool's confidence and its competence are not collateral damage in an otherwise sound system. They are the evidence that the system was unsound from the moment a number was permitted to mean guilt.

A finding by an AI detector is not, and cannot be, sufficient basis for disciplinary action. Not because the technology will one day improve, though it may, but because the question of whether a human being cheated is not the kind of question a resemblance score can answer, and the cost of pretending otherwise is paid, every term, by the students least able to pay it. Until institutions accept that, the email with the case number in the subject line will keep arriving, and the people opening it will keep discovering that in the eyes of the machine they have already been found guilty, and that proving a machine wrong is a thing most of them will never be given the means to do.

References and Sources

Australian Broadcasting Corporation reporting on Australian Catholic University misconduct referrals, as summarised in “University wrongly accuses thousands of students of AI cheating: major scandal at ACU,” The Australian Business Journal, 10 October 2025.

https://theabj.com.au/2025/10/10/acu-ai-cheating-scandal-false-accusations/

Liang, W., et al., “GPT detectors are biased against non-native English writers,” Stanford University, arXiv preprint, 2023.

https://arxiv.org/abs/2304.02819

Stanford Institute for Human-Centered Artificial Intelligence, “AI-Detectors Biased Against Non-Native English Writers.”

https://hai.stanford.edu/news/ai-detectors-biased-against-non-native-english-writers

Mathewson, Tara García, “AI Detection Tools Falsely Accuse International Students of Cheating,” The Markup, 14 August 2023.

https://themarkup.org/machine-learning/2023/08/14/ai-detection-tools-falsely-accuse-international-students-of-cheating

CBS New York, “Adelphi student Orion Newby sues over AI plagiarism accusation and wins. Why it's being called a 'groundbreaking' case,” 9 February 2026.

https://www.cbsnews.com/newyork/news/orion-newby-adelphi-university-ai-plagiarism-accusations/

Inside Higher Ed, “Adelphi Student Wins AI Plagiarism Lawsuit,” 11 February 2026.

https://www.insidehighered.com/news/quick-takes/2026/02/11/adelphi-student-wins-ai-plagiarism-lawsuit

Plagiarism Today, “Adelphi Student Wins AI Plagiarism Lawsuit,” 11 February 2026.

https://www.plagiarismtoday.com/2026/02/11/adelphi-student-wins-ai-plagiarism-lawsuit/

EdScoop, “Student accused of AI plagiarism wins lawsuit against Adelphi University,” 10 February 2026.

https://edscoop.com/adelphi-university-student-ai-lawsuit/

Williams, Tom, “Fear of being flagged by AI detectors drives stress among students,” Times Higher Education, 23 February 2026.

https://www.timeshighereducation.com/news/fear-being-flagged-ai-detectors-drives-stress-among-students

Inside Higher Ed, “Fear of Being Flagged by AI Detectors Drives Student Stress,” 25 February 2026.

https://www.insidehighered.com/news/faculty-issues/learning-assessment/2026/02/25/fear-being-flagged-ai-detectors-drives-student

The Washington Post, “Opinion | AI detectors are hurting honest students. Schools should ban them,” 13 April 2026.

https://www.washingtonpost.com/opinions/2026/04/13/ai-detectors-students/

Masnick, Mike, “We're Training Students To Write Worse To Prove They're Not Robots, And It's Pushing Them To Use More AI,” Techdirt, 6 March 2026.

https://www.techdirt.com/2026/03/06/were-training-students-to-write-worse-to-prove-theyre-not-robots-and-its-pushing-them-to-use-more-ai/

Inside Higher Ed, “Students Embrace AI but Fear False Accusations,” 31 March 2026.

https://www.insidehighered.com/news/student-success/academic-life/2026/03/31/students-embrace-ai-fear-false-accusations

Tim Green, UK-based Systems Theorist & Independent Technology Writer

Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.

His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.

**ORCID:** [0009-0002-0156-9795](https://orcid.org/0009-0002-0156-9795)
**Email:** [tim@smarterarticles.co.uk](mailto:tim@smarterarticles.co.uk)

Listen to the free weekly [SmarterArticles Podcast](https://www.smarterarticles.fm)

Source: smarterarticles.co.uk (original article)