{"slug": "why-do-ai-models-struggle-with-online-hate-speech-detection", "title": "Why do AI models struggle with online hate speech detection?", "summary": "AI models used for online hate speech detection vary widely in how they identify and classify hate speech, according to a 2025 University of Pennsylvania study. The study found significant inconsistencies across systems from OpenAI, Anthropic, DeepSeek, Mistral, and Google, raising concerns about bias and unequal protection. As the UN marks the International Day for Countering Hate Speech, Meta has removed fewer hateful posts since 2023, while TikTok reports removing 96.3% of hate speech before it was reported.", "body_md": "# Why do AI models struggle with online hate speech detection?\n\n*As the United Nations marks the International Day for Countering Hate Speech, Al Jazeera examines how AI handles it – and where it falls short.*\n\nHate speech that once circulated in person now travels farther and faster via anonymous online accounts behind a screen.\n\nAs the United Nations marks the [International Day](https://www.un.org/en/observances/countering-hate-speech) for Countering Hate Speech on June 18, UN Secretary-General Antonio Guterres has warned that social platforms are amplifying the threat.\n\nWith artificial intelligence (AI) increasingly tasked with detecting and removing hate speech online, Al Jazeera looks at where these systems fall short compared with human judgement.\n\n## How is hate speech defined?\n\nAccording to the UN, hate speech covers any communication – spoken, written or behavioural – that discriminates against or incites violence towards a person or group.\n\nThe UN states that hate speech targets a person’s actual or perceived identity, race, ethnicity, religion, gender, sexual orientation or disability. 
And it isn’t limited to words, with the UN noting it can also take the form of images, cartoons, gestures and even objects.\n\n## How many people encounter hate speech online?\n\nAccording to a [2023](https://www.ipsos.com/sites/default/files/ct/news/documents/2023-11/unesco-ipsos-online-disinformation-hate-speech.pdf) joint survey of 8,000 people in 16 countries conducted by polling company Ipsos and the UN Educational, Scientific and Cultural Organization (UNESCO), more than two-thirds of internet users encountered hate speech online.\n\nThe survey also found that 33 percent of respondents thought LGBTQI people experienced the most cases of hate speech, followed by ethnic and racial minorities (28 percent) and women (18 percent).\n\nMeta, which owns Facebook and Instagram, has removed fewer hateful posts since 2023. In the last quarter of 2025, the company removed 1.3 million posts from Instagram and 1.3 million from Facebook, compared to 7.4 million removed from Instagram and 5.8 million from Facebook in the fourth quarter of 2024.\n\nThis came as the company shifted away from proactive detection of hate speech and relied more on user reports.\n\nTikTok, by contrast, [said](https://newsroom.tiktok.com/our-first-transparency-report-under-the-eu-hate-speech-code?lang=en-150) it removed 96.3 percent of all hate speech content in the fourth quarter of 2025 before it was reported.\n\n## AI models detect hate speech differently\n\nTo detect and combat the spread of hate speech online, social media companies have increasingly turned to AI, using content moderation systems powered by large language models (LLMs) that promise to automate content filtering across huge volumes of messages.\n\nIn general, these systems use labelled datasets and pretrained language models to detect abusive language. 
They then apply rules or score thresholds to decide whether content is hateful or violates company policies.\n\nA 2025 [study](https://penntoday.upenn.edu/news/annenberg-artificial-intelligence-models-vary-widely-identifying-hate-speech) by researchers at the University of Pennsylvania found that these models vary widely in how they identify and classify hate speech, with significant inconsistencies across systems and demographic groups, raising concerns about bias and unequal protection online.\n\nThe study evaluated seven AI moderation systems – including models from OpenAI, Anthropic, DeepSeek, Mistral, and Google – and found major differences in how they identified and scored hate speech across categories.\n\nThis chart shows how different AI moderation systems scored the severity of hate speech targeting the same groups on a 0–1 scale. Higher values indicate the model judged the content as more hateful.\n\nScores from the Mistral Moderation Endpoint often cluster very close to 1, meaning it labels many examples as highly hateful regardless of the target group.\n\nThe OpenAI Moderation Endpoint tends to produce much lower scores for many categories, sometimes less than half the score assigned by other models.\n\nAs the study authors put it, “If two systems produce different outcomes for the same piece of content – flagging it as hate speech in one case but not in another – it undermines the legitimacy of the moderation process.”\n\n## The limitations of AI hate speech detection\n\nWhile AI systems can detect explicit hate speech – for example, when profanities and slurs are used against a particular group – LLMs miss more nuanced examples.\n\n“One challenging example is the case of implicit hate speech, which is often not detected as such because it contains no mention of slurs,” Arkaitz Zubiaga, an associate professor at Queen Mary University of London, and co-lead of the university’s Social Data Science lab, told Al Jazeera. 
“This could be the case of a positive-sounding message such as ‘I would love to see how great the world would be if…’ followed by a derogatory message disparaging a demographic group. AI systems can struggle to see the hate in those messages if they focus instead on the positive side of the message.”\n\nZubiaga adds that the opposite also happens: seemingly offensive words that have since been adopted as terms of endearment are flagged as hate speech.\n\n“This is the case of reclaimed language, where keywords that are historically deemed slurs are embraced and repurposed by the communities they were initially used to disparage, and the slurs are then used between members of the marginalised community,” he said. “While these cases should not be flagged as hateful, AI systems have a tendency to do it.”", "url": "https://wpnews.pro/news/why-do-ai-models-struggle-with-online-hate-speech-detection", "canonical_source": "https://www.aljazeera.com/news/2026/6/18/why-do-ai-models-struggle-with-online-hate-speech-detection?traffic_source=rss", "published_at": "2026-06-18 08:41:21+00:00", "updated_at": "2026-06-18 09:01:18.122320+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-ethics", "natural-language-processing", "ai-safety"], "entities": ["Meta", "TikTok", "OpenAI", "Anthropic", "DeepSeek", "Mistral", "Google", "University of Pennsylvania"], "alternates": {"html": "https://wpnews.pro/news/why-do-ai-models-struggle-with-online-hate-speech-detection", "markdown": "https://wpnews.pro/news/why-do-ai-models-struggle-with-online-hate-speech-detection.md", "text": "https://wpnews.pro/news/why-do-ai-models-struggle-with-online-hate-speech-detection.txt", "jsonld": "https://wpnews.pro/news/why-do-ai-models-struggle-with-online-hate-speech-detection.jsonld"}}