● AI tools can be used to moderate large quantities of social media posts, but adapting them to effectively vet material from a variety of social and cultural contexts is no easy task.
● Researchers from the US and the UK have recently presented a new hate speech detection model that strikes a better balance between accuracy and fairness.
● A multidisciplinary team at Orange is also investigating ways to boost the efficiency and fairness of these technologies by combining AI-generated hate speech with social science data.
Social media companies have long made use of moderation systems to impose, with varying degrees of stringency, rules on the content of online posts, and to limit the proliferation of hate speech, which is particularly harmful to Internet users. Given the volumes of data involved, they are increasingly keen to automate this time-consuming work with natural language processing (NLP) tools that can sift through vast quantities of material to identify content that needs to be removed. However, this effort is complicated by the many different languages, worldviews and nuanced social contexts that feature in interactions between social media users. The imposition of generic rules inevitably increases the probability that innocuous posts will be flagged as harmful, and the performance of NLP moderation tools can vary widely when they are used to evaluate content created by different demographic groups.
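By way of illustration, the sketch below shows what such an automated filter looks like in its simplest form: a classifier trained on labelled posts that scores new messages and flags those it judges abusive. The toy corpus, labels and threshold are invented for the example and bear no relation to any production system.

```python
# Minimal sketch of an NLP moderation filter: a bag-of-words classifier trained
# on a tiny, invented labelled corpus. Real moderation pipelines rely on far
# larger corpora and transformer models, but the workflow is the same:
# vectorise the text, classify it, flag what scores above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_posts = [
    "have a lovely day",          # harmless
    "thanks for sharing this",    # harmless
    "nobody wants you here",      # abusive
    "you are worthless, leave",   # abusive
]
train_labels = [0, 0, 1, 1]       # 1 = abusive, 0 = harmless

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_posts, train_labels)

new_posts = ["welcome to the group", "you are worthless"]
for post, proba in zip(new_posts, model.predict_proba(new_posts)[:, 1]):
    # The 0.5 cut-off is arbitrary; real systems tune it per language and context.
    print(f"{'FLAG' if proba >= 0.5 else 'ok  '} | p(abusive)={proba:.2f} | {post}")
```

It is precisely this reliance on the training corpus that makes such filters sensitive to language, slang and context: a model trained mostly on one community's data will misjudge posts from another.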
Moderation needs to adapt to different social contexts
Some algorithms that generally do a good job of detecting harmful content perform poorly in certain social and geographical contexts, where they tend to flag large amounts of material that may be considered abusive in some milieus but not in others. Replacing them with less stringent detection systems, however, increases the risk of exposing social media users to hate speech. “Studies that investigated the processing of verlan in France found that systems labelled most of this slang as harmful, which it isn’t,” explains Orange sociology researcher and cyberbullying specialist Lara Alouan. In addition, the abusive nature of certain words can vary over time, and terms that are flagged as insulting today may not be viewed as harmful in the future.
Algorithms that strike a balance between accuracy and fairness
Most fairness measures, which make it possible to ensure that systems do not privilege or disadvantage particular social groups, are not integrated into AI models because they cannot be deployed using conventional optimization techniques. To address this issue, a team of researchers from the University of Texas (USA) and the University of Birmingham (UK) has developed an algorithm that helps stakeholders strike a better balance between accuracy and fairness, facilitating the creation of more equitable NLP architectures. The researchers successfully tested a new Group Accuracy Parity (GAP) measure and accompanying formulae, which boosted the performance of machine learning models and enabled them to treat accuracy and fairness jointly in their analysis of textual corpora.
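The exact GAP formulation and training procedure are set out in the researchers’ publication; as a rough illustration of the underlying idea, the sketch below simply compares a classifier’s accuracy across demographic groups rather than reporting a single overall figure. The data, group labels and gap summary here are simplified assumptions, not the published method.

```python
# Rough illustration of a group-accuracy-parity style check: compare a
# classifier's accuracy across demographic groups rather than only overall.
# The toy arrays and the "gap" summary are simplified assumptions, not the
# GAP objective published by the researchers.
import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # gold labels (1 = hateful)
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])   # model predictions
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # demographic group of each post

overall = (y_true == y_pred).mean()
per_group = {g: (y_true[groups == g] == y_pred[groups == g]).mean()
             for g in np.unique(groups)}
gap = max(per_group.values()) - min(per_group.values())

print(f"overall accuracy  : {overall:.2f}")
print(f"per-group accuracy: {per_group}")
# Fairness-aware training (e.g. with a GAP-style objective) aims to shrink this
# gap without sacrificing overall accuracy.
print(f"accuracy gap      : {gap:.2f}")
```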
Multidisciplinary collaboration to enhance AI models at Orange
At Orange, a multidisciplinary team of sociologists, data scientists, and language model specialists is developing a project aimed at detecting and preventing hate speech and toxic content. The project will focus on a dataset created by Marlène Dulaurans of Bordeaux Montaigne University in collaboration with the French Gendarmerie. The researchers are also exploring the possibility of augmenting this corpus with synthetic data: “We need to see what uncensored LLMs can bring to the table, i.e., those that can produce synthetic cyberbullying data, because in France, the corpora are not very extensive,” explains Orange data science researcher Franck Meyer. Uncensored LLMs may well prove to be a solution for better detection of harmful text if they can generate material that is realistic enough to be indistinguishable from real data collected by law enforcement agencies.
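One way such a realism check could be framed (an assumption on our part, not a description of the Orange team’s protocol) is to train a simple discriminator to tell real posts from synthetic ones: if it cannot do much better than chance, the synthetic corpus is, by that measure, difficult to distinguish from the real one.

```python
# Hypothetical realism check for synthetic training data: if a discriminator
# trained to separate real posts from LLM-generated ones performs near chance
# level, the synthetic corpus is hard to distinguish from the real one by that
# measure. The corpora below are placeholders, not actual data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

real_posts      = [f"real post {i}" for i in range(20)]       # stand-ins for an ethically sourced corpus
synthetic_posts = [f"synthetic post {i}" for i in range(20)]  # stand-ins for LLM-generated text

texts = real_posts + synthetic_posts
labels = [0] * len(real_posts) + [1] * len(synthetic_posts)   # 1 = synthetic

discriminator = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(discriminator, texts, labels, cv=5)

# Accuracy close to 0.5 suggests the two corpora are statistically hard to tell
# apart; accuracy close to 1.0 suggests the synthetic data is easily spotted.
print(f"mean discriminator accuracy: {scores.mean():.2f}")
```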
Sociological studies to evaluate perceptions of hate speech
Along with the development of semantic dictionaries, the team will also evaluate users’ ability to distinguish between real and synthetic data with a view to assessing the latter’s suitability for AI training. “These data will also have to be tested in workshops to evaluate individuals’ perceptions of how they receive information, bearing in mind that those attending these workshops will be exposed to different types of material,” adds Lara Alouan. The project will initially tackle hate speech with an in-depth sociological study that will analyse a range of use cases among different populations. “Some instances of cyberbullying are easier to detect than others, and some have a more serious impact than others,” points out Franck Meyer. Tactics that aim to camouflage abusive material are another obstacle: “Users who attempt to circumvent automatic moderation by adopting a so-called ‘gossip’ vocabulary of coded or distorted language present a particular challenge to detection algorithms,” concludes Lara Alouan.