
Content moderation systems from Google, OpenAI, DeepSeek, and Anthropic vary widely in how they identify hate speech, according to new research.
The study, from researchers at the University of Pennsylvania's Annenberg School for Communication and published in Findings of the Association for Computational Linguistics, is the first large-scale comparative analysis of how consistently the AI content moderation systems used by tech companies and social media platforms evaluate hate speech.
Research shows online hate speech both increases political polarization and damages mental health.
The University of Pennsylvania study found that different systems produce different outcomes for the same content, undermining consistency and predictability and leading to moderation decisions that can appear arbitrary or unfair.
“Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard,” said study co-author Yphtach Lelkes, an associate professor at the Annenberg School for Communication.
Lelkes and doctoral student Neil Fasching analyzed seven leading models, some designed specifically for content classification and others more general-purpose. They include two from OpenAI and two from Mistral, along with Claude 3.5 Sonnet, DeepSeek V3, and Google Perspective API.
Their analysis covered 1.3 million synthetic sentences making statements about 125 distinct groups, referred to with both neutral terms and slurs and spanning categories from religion to disability to age. Each sentence combined “all” or “some,” a group, and a hate speech phrase. The analysis revealed systematic differences in how the models draw decision boundaries around harmful content, with significant implications for automated content moderation.
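To make the design concrete, the sketch below shows the general shape of such a template-based probe: a quantifier (“all” or “some”), a target group, and a hostile phrase are combined into a sentence and sent to one moderation system. The group names, the phrase, and the choice of OpenAI's moderation endpoint are illustrative assumptions, not the authors' actual data or pipeline; the other six systems would each need their own client code.

```python
from itertools import product

from openai import OpenAI  # pip install openai

# Illustrative template components (placeholders, not the study's materials):
# each synthetic sentence pairs a quantifier, a target group, and a hostile phrase.
QUANTIFIERS = ["All", "Some"]
GROUPS = ["teachers", "immigrants", "gamers"]          # stand-ins for the 125 groups
HATE_PHRASES = ["should be banned from public life"]   # placeholder phrase

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

for quantifier, group, phrase in product(QUANTIFIERS, GROUPS, HATE_PHRASES):
    sentence = f"{quantifier} {group} {phrase}."
    # Send the sentence to one moderation system; here, OpenAI's moderation endpoint.
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=sentence,
    ).results[0]
    print(f"{sentence!r:55} flagged={result.flagged} hate={result.categories.hate}")
```

Repeating a loop like this across every model and comparing which sentences each one flags is what surfaces the kind of inconsistency the study reports.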
Overall, the study made three major findings.
Key study takeaways
First, among the models, one demonstrated high predictability in how it classified similar content; a second produced different results for similar content; and the others neither over-flagged nor under-detected content as hate speech.
“These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation,” researchers said.
Second, the models were more similar when evaluating statements about sexual orientation, race, and gender, and more inconsistent when it came to education level, personal interests, and economic class. The researchers concluded that the systems “generally recognize hate speech targeting traditional protected classes more readily than content targeting other groups.”
Finally, the study found that Claude 3.5 Sonnet and Mistral’s specialized content classification system treated slurs as harmful across the board, while other models prioritized context and intent—with little middle ground between the two.
A recent survey from Vanderbilt University’s non-partisan think tank, The Future of Free Speech, concluded there was “low public support for allowing AI tools to generate content that might offend or insult.”