Teachers say they distrust AI but still accept its harsh grading mistakes, study finds

As artificial intelligence becomes more common in professional settings, human oversight is often promoted as a safeguard against automated mistakes. A new study published in *PNAS Nexus* suggests that human experts are significantly more likely to accept incorrectly harsh decisions when they believe those decisions come from an artificial intelligence system rather than a human colleague. This pattern provides evidence that relying on human review to catch algorithmic errors might not be as effective as many expect.

Study author Rigissa Megalokonomou, an associate professor of economics at Monash University, was drawn to the issue because of the rapid integration of new technology in classrooms. “My research is in the economics of education, so pressing topics in education naturally fall within my interests and right now, AI is perhaps the most pressing of all,” she said. Grading decisions offer a realistic way to observe how experts respond to flawed advice in a high-stakes environment.

“Grading is one of the most consequential moments in a student’s school life, shaping their futures and self-perception as learners,” Megalokonomou explained. “When AI enters that process and introduces errors that professionals aren’t catching, that’s a serious problem worth studying.” Automated grading tools promise to save time and offer consistent scoring, but they can still make mistakes or introduce bias.

When organizations use automated tools to help make decisions, humans are generally expected to review the output and catch any mistakes. This expectation assumes that people can objectively evaluate a computer program’s suggestion and correct it when it strays from the truth. Ensuring that these oversight mechanisms work well is necessary to realize the benefits of automated technology without compromising accuracy or public trust.

Teachers are expected to serve as a safety net, spotting and fixing errors as the final decision-makers in the classroom. Research in this area explores how users interact with automated advice, looking at the psychological factors that drive acceptance or rejection. Sometimes people distrust computer programs entirely, a concept known as algorithm aversion. In other situations, individuals might blindly trust automated outputs over human judgment, often called automation bias.

The authors wanted to understand what exactly makes an expert correct a machine’s mistake or let it pass without intervention. “The standard reassurance around AI is ‘don’t worry, a human will check its work.’ Our study tests whether that actually holds up,” Megalokonomou noted. To answer this question, the researchers conducted a preregistered randomized experiment involving active teachers in Greece.

“We ran an experiment with over 1,300 teachers in Greece, randomly assigning them to grade student work paired with a deliberately wrong score labeled as coming from either an AI system or a human colleague,” Megalokonomou said. “We then measured how far their final grade strayed from the objectively correct answer.” The participants taught various subjects, including mathematics, science, and the humanities.

During the study, each teacher reviewed a sample of student work that matched their specific area of expertise. The student work was accompanied by a bulleted checklist showing exactly which parts of the answer were correct or incorrect. Along with the student’s answers, the teachers saw a preassigned score of five out of ten, which was intentionally incorrect.

The researchers manipulated the direction of the grading error for different groups of participants. In one scenario, the score of five out of ten was too harsh because the student’s work actually deserved an eight based on the objective checklist. In the second scenario, the same score was too lenient because the student only provided enough correct answers to earn a two.

After reviewing the work and the suggested score, the teachers assigned their own final grade. The main measurement in the study was the grading fairness gap: the absolute difference between the teacher’s final grade and the objectively correct benchmark grade. A larger gap indicates that the teacher left more of the flawed recommendation uncorrected.
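The metric can be illustrated with a minimal Python sketch. The function name and the teachers' final grades below are hypothetical, chosen only for illustration; the suggested score of five and the benchmark grades of eight and two come from the study design described above.

```python
def grading_fairness_gap(final_grade: float, correct_grade: float) -> float:
    """Absolute difference between a teacher's final grade and the benchmark."""
    return abs(final_grade - correct_grade)


# Values from the study design: every teacher saw a suggested score of 5.
SUGGESTED_SCORE = 5
CORRECT_HARSH = 8    # harsh scenario: the work deserved an 8
CORRECT_LENIENT = 2  # lenient scenario: the work deserved a 2

# Hypothetical final grades (illustrative only, not data from the study).
print(grading_fairness_gap(SUGGESTED_SCORE, CORRECT_HARSH))  # keeps the 5: gap of 3
print(grading_fairness_gap(7, CORRECT_HARSH))                # partial fix: gap of 1
print(grading_fairness_gap(8, CORRECT_HARSH))                # full fix: gap of 0
print(grading_fairness_gap(3, CORRECT_LENIENT))              # partial fix: gap of 1
```

Because the suggested score sits three points from the benchmark in both scenarios, a teacher who simply accepts it produces the same gap of three whether the error was harsh or lenient, which makes the two conditions directly comparable.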

“We found that when AI gave a harsh grade, one that was too low, teachers were significantly less likely to correct it than when the same wrong grade came from a human colleague,” Megalokonomou told PsyPost. “The grading fairness gap was 22% larger for harsh AI errors.” Teachers tended to accept the stricter automated grade and leave it largely uncorrected. When the identical harsh grade came from a human colleague, the teachers were more willing to fix the mistake and boost the student’s score.

In the lenient scenario, the source of the recommendation did not make a statistically significant difference. Teachers corrected the overly generous grades equally well, regardless of whether they thought a machine or a human made the error. They did not show the same deference to the computer program when it gave a student too much credit. This provides evidence that the credibility of algorithmic grading depends heavily on the direction of the recommendation.

The scientists also asked the participants to rate the original grader on five psychological dimensions to understand their thought process. These dimensions included perceived ability, comprehension of the subject, fairness, good intent, and responsibility. The answers helped explain why teachers responded differently to the harsh and lenient computer errors.

Megalokonomou highlighted a major contradiction in the survey responses. The most surprising finding was “the gap between what teachers said about AI and how they actually behaved,” she noted. “They rated AI as less fair, less competent, and less accountable than a human colleague, and most said they didn’t want to use it.”

Despite those negative views, behavior shifted when grading real work. “Yet when the AI gave a harsh grade, they deferred to it more than they did to a human making the identical error,” Megalokonomou explained. “Distrust didn’t make them more vigilant; if anything, it went the other way.”

In the harsh scenario, teachers perceived the algorithm as having high technical ability and responsibility. This perception of competence motivated the educators to accept the strict grade. The harshness itself appeared to function as a signal that the computer program was rigorous and capable. Higher perceived ability and responsibility explained over half of the effect in the harsh scenario.

In the lenient scenario, teachers viewed the artificial intelligence much more negatively across all five psychological dimensions. Because they felt the lenient algorithm lacked competence, fairness, and good intent, they actively rejected its advice. They stepped in to correct the inflated score and return the grade to its fair level. Unless the algorithm scored well on all these traits, the teachers overrode its lenient advice.

The researchers also looked at how different demographic groups reacted. “Strikingly, this pattern was most pronounced among younger, more educated, and more tech-confident teachers, exactly the people we would expect to be the most critical users of AI,” Megalokonomou said. Because these groups are often viewed as early adopters of new technology, this finding challenges the common belief that tech-savvy professionals automatically provide stronger oversight.

Humanities teachers also showed a slightly higher tendency to defer to the machine than science and math teachers did. The researchers suggest that algorithmic advice might become more influential when evaluation criteria are highly subjective. At the end of the survey, the researchers also asked the teachers about their general attitudes toward artificial intelligence. Nearly half of the respondents reported using generative artificial intelligence tools at least weekly for lesson preparation.

Despite using these tools for planning, the teachers remained skeptical about delegating actual evaluative authority to machines. In open text responses, many educators voiced concerns about a computer’s inability to account for individual student circumstances. They pointed out that human grading often requires empathy and context, such as understanding a student’s learning difficulties or family issues. This suggests that practices relying solely on improving an algorithm’s technical accuracy are unlikely to overcome teachers’ ethical objections.

The study has a few limitations. “The experiment was conducted with teachers in Greece, so readers should be cautious about generalizing directly to other national contexts or professional settings,” Megalokonomou noted. The way these specific teachers interact with technology might not perfectly reflect the behavior of professionals in other countries or cultural environments.

“The study was also designed around a specific, controlled scenario, a single grading task with a deliberately wrong score, which allowed us to isolate the effect cleanly, but real-world grading involves more complexity and repeated interactions with AI tools over time,” she added. The experimental design made the correct grade relatively easy to figure out using a straightforward checklist. In real classroom environments, grading is often more ambiguous and takes place under severe time pressure.

Real-world ambiguity could either increase a person’s reliance on algorithmic advice or prompt stronger independent judgment. Future research could explore whether this deference to harsh automated judgments extends to other evaluative tasks, such as formative assessments or hiring decisions. Scientists might also vary the amount of explanation the computer program provides to see if detailed rationales prompt humans to look closer at the results.

The researchers hope to apply these insights to help improve professional practices. “I am already working on a teacher training program focused specifically on AI oversight: not just how to use AI tools, but how to recognize when your own judgment is likely to go astray,” Megalokonomou shared. “I hope this research reaches the policymakers and school leaders making decisions about AI in education right now.”

“The question of whether human oversight actually works tends to get assumed rather than tested,” she said. The findings offer a strong warning that treating humans as an automatic safeguard is insufficient.

“One thing I want readers to sit with is the broader implication. This study is about teachers and grading, but the dynamic we uncovered, where human oversight breaks down selectively depending on what the AI is doing, applies well beyond education,” Megalokonomou emphasized. As automated tools become common worldwide, these insights offer a useful starting point for understanding how experts interact with machines.

“Any setting where AI recommendations are paired with human review, whether that is healthcare, hiring, or criminal justice, faces the same underlying challenge,” she warned. “Putting a human in the loop is not enough on its own. If we want meaningful oversight, we need to design it deliberately, with structured checks and clear accountability mechanisms, rather than assuming good intentions will be sufficient.”

The study, “Why do experts miss AI’s errors? Evidence from a randomized labeling experiment,” was authored by Sofoklis Goulas, Rigissa Megalokonomou, and Panagiotis Sotirakopoulos.

source & further reading

psypost.org (original article)
