Independent Researcher, USA
* Corresponding author

Article Main Content

This paper examines the weaknesses of machine learning models to adversarial attacks in online abuse detection. With the growth of user- generated content online, platforms rely on automated systems to detect and filter harmful content at scale. However, these systems remain vulnerable to manipulations by bad actors designed to circumvent detection. We investigate two prominent attack strategies TextFooler and HotFlip against transformer-based models trained on the Jigsaw Toxic Comment Classification dataset. Our experiments reveal considerable degradation in model performance under attack conditions, with accuracy drops of approximately 20%. This paper provides a detailed analysis of these attack strategies, implementation methods, and their impact on model reliability. The findings highlight critical vulnerabilities in current abuse detection systems and demonstrate the need for more robust approaches to maintain platform safety and integrity.

Introduction

There has been an influx of user-generated content on the online platforms and this has led to more instances of cybercrimes, hate speech, cyber bullying and misleading content [1], [2]. To prevent this at the scale and speed of the generation of user content, platforms use machine learning and artificial intelligence to detect abusive patterns at scale and action against them [3]. These models though are not robust and are prone to adversarial attacks where bad actors deliberately manipulate data to deceive the model to make incorrect predictions helping them gain entry into the online ecosystems spreading hate, misinformation and harmful content [4].

Adversarial attacks in Natural Language Processing involve very minor text manipulations where bad actors change alphabets to characters like l to an exclamation mark which on reading can feel similar to human readers but ML models fail to catch them [5]. This is a vulnerability that is a risk for abuse detection systems and hackers and bad actors use these tactics to exploit the system and use these weaknesses to spread harmful content without coming in the limelight [6].

This paper tries to investigate the impact of these adversarial attacks on abuse detection machine learning models and strategies to make the models more robust [7]. This paper focuses on the Jigsaw Toxic Comment Classification dataset which uses superior transformer based models and attempts to mimic both attack and defense mechanisms to evaluate the effectiveness [8].

Related Work

Adversarial Attacks in NLP

Adversarial attacks in NLP have gathered significant attention due to their potential to challenge the reliability of language models [9]. Jin et al. [10] introduced TextFooler which is a black box attack that replaces words with their synonyms to game model predictions without changing the sentence’s meaning. A similar approach by Ebrahimi et al. [11] who proposed HotFlip, which is a white box attack that only uses gradient information to identify character-level changes that often mislead models.

These attacks show a crucial demonstration where even small modifications to simple text can significantly change model performance [12], Ebrahimi et al. [11] highlights the need for building strong and robust mechanisms which can tackle these modifications and still identify hate speech or abusive text in the system.

Robustness in Abuse Detection

Hosseini et al. [13] has examined the weaknesses of Google’s Perspective API which shows that simple misspellings or obfuscations could easily bypass the toxic and profanity filters in the ML model. These findings further showcase the limitations of current abuse detection systems and the immediate need for models that can sustain the adversarial manipulations [14].

Defense Strategies

To prevent and reduce these adversarial attacks, researchers have found different defense strategies [15]:

Improve Training Dataset with adversarial examples: Adding more adversarial examples into the training dataset can improve model performance [16]. Some researchers have shown the positive effects of this method to improve model performance against harmful attacks which include text manipulations and attacks [17].

Preprocessing Input Text: There are techniques like adding a spelling check, character recognition and character normalization can reduce the impact of the adversarial modifications [18]. Pruthi et al. [19] showed that robust word recognition methods could help models handle misspelled or words with characters that don’t make sense as inputs.

Removing noise from Autoencoders: Another way is to use automated encoders to reconstruct cleaner inputs from manipulated text thus reducing the influence of adversarial noise [20].

These strategies aim to strengthen models against adversarial attacks, ensuring more reliable abuse detection [21].

Materials and Methods

Dataset and Preprocessing

We used the Jigsaw Toxic Comment Classification dataset that has 150 K user comments from Wikipedia categorized into various forms of toxic comments online [8]. Each comment in the data is classified across six different levels of toxicity ranging from toxic, severe toxic, obscene, threat, insult and identity hate. We use this dataset to train and test models which help to detect and identify toxic behaviors in content generated by users online [22].

The dataset is split into two parts as 80% training and 20% testing. This split maintains a balanced distribution of labels across the two splits. The data is cleaned before conducting the analysis to ensure there is consistency across all models. Data cleaning steps include:

Cleaning: All text characters (uppercase, camelcase) were converted to lowercase to reduce text variations [23].

HTML Tags Removed: Any HTML tags were removed using regular expressions to ensure consistency.

Remove special digits and characters: All non-alphabetic tokens that do not carry meaningful context were deleted and we just kept it to plain text.

Use Tokenization: We chose to tokenize with the WordPiece tokenizer because it aligns well with the RoBERTa architecture and helped to reduce token fragmentation we noticed during early experiments [24].

Appending or Truncating text to consistent length: To keep things consistent, we padded or trimmed every input to 256 tokens.

These steps ensure the model receives clean, consistent input, minimizing noise while keeping semantic integrity. All adversarial examples were subjected to the same preprocessing pipeline unless otherwise modified as part of a defense strategy.

Model Architectures

Our base model is RoBERTa-base which is a robust transformer-based language model trained on a large English corpus [25]. It contains 12 transformer layers and has approximately 125 million parameters which is excellent to be used in text classification tasks.

For toxicity classification, we fine-tune RoBERTa by appending a dense feed-forward layer with a sigmoid activation function, applied to the [CLS] token output [26]. The design allows for multi-label predictions and provides an output in the form of six probability scores that correspond to each of the toxicity classes.

To evaluate model and its robustness, we implemented the different variations of the RoBERTa model:

• RoBERTa + Adversarial Training

• RoBERTa + Preprocessing Defense

• RoBERTa + Autoencoder Defense

• Ensemble of All Defenses

Each configuration is trained using the same hyperparameters (learning rate of 2e-5, batch size of 16, and 4 epochs), ensuring consistent training conditions [27]. The only variations come from the actual application of robustness-enhancing techniques. We also explore adversarial training [28], semisupervised training [29], and ensemble-based defenses [30] as robustness strategies.

Adversarial Attacks

We implement two prominent adversarial attack strategies to evaluate the model’s robustness:

TextFooler (Black-box Attack)

TextFooler operates by identifying important words in a sentence and replacing them with synonyms that maintain the original meaning but can alter the model’s prediction [10]. The attack ensures semantic similarity by leveraging word embeddings and language models.

Implementation Steps:

• We first loaded the trained RoBERTa model and defined the classification pipeline using the Hugging Face and text attack.

• We then applied the TextFoolerJin2019 attack recipe from the TextAttack library.

• For each correctly classified test example, we generated an adversarial version and recorded:

– whether the model is fooled (i.e., the label changes)

– total number of word substitutions

– confidence drop in the original class

Code Snippet (simplified):

from textattack.attack_recipes import TextFoolerJin2019

from textattack.models.wrappers import HuggingFaceModelWrapper

from textattack.datasets import Dataset

model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(model_wrapper)

dataset = Dataset(test_samples) # Format: (text, label)

attack_results = attack.attack_dataset(dataset)

The HuggingFaceModelWrapper enables compatibility between the TextAttack framework and our pre-trained RoBERTa model [31]. We created a Dataset object from a subset of our test data and applied the attack_dataset() method to generate adversarial samples along with their corresponding attack success metrics.

Techniques like TextBugger [32], discrete attacks [33], and black-box generation [34] have shown the feasibility of text-based adversarial strategies.

We evaluated each model on 1000 adversarial samples. Attack Success Rate (ASR) is computed as the percentage of adversarial examples that caused a misclassification.

This implementation allowed us to systematically measure the drop in model performance and calculate the Attack Success Rate (ASR) under adversarial settings.

The detailed performance metrics for RoBERTa under attack are summarized in Table I, showing how adversarial methods significantly reduce model accuracy and robustness.

Model Clean TextFooler AUC HotFlip AUC Precision Recall F1 Score
RoBERTa 0.952 0.762 0.781 0.710 0.645 0.675
Table I. Performance Metrics for a RoBERTa Model under Attack

HotFlip (White-box Attack)

HotFlip is a white-box attack that flips characters based on gradient information [11]. It identifies character substitutions that would most decrease the model’s confidence in the correct class. Implementation highlights:

• The gradient of the loss w.r.t. each input token is computed.

• Changes at the character level–-like flipping, swapping, or inserting letters–-are made to confuse the model and increase its prediction error.

• We used the OpenAttack framework for efficient execution.

This attack is most effective on models sensitive to tokenization, which includes BERT-like models. Zang et al. [35] proposed a word-level attack strategy using combinatorial optimization, which could further reduce model robustness under constrained perturbation.

Evaluation Metrics

To assess the performance of each model under attack, the following metrics were used:

AUC (Area Under Curve): It measures the model’s ability to rank true labels higher than false ones for both clean and adversarial inputs.

Precision, Recall, and F1 Score: These metrics are evaluated for multi-label classification using micro-averaging.

ASR (Attack Success Rate): Calculated as a percentage of originally correct predictions that are misclassified after an attack.

The results of these experiments are presented in Section 6, demonstrating how each attack impacts model performance.

Results

Evaluation Metrics

We assess model performance using the following metrics:

• AUC (Area Under ROC Curve)

• Precision

• Recall

• F1 Score

• Attack Success Rate (ASR)

Results

Baseline Model Performance under attack that shows performance metrics for a RoBERTa model (Table I).

Adversarial attacks significantly degrade model performance, with AUC dropping by approximately 20% and F1 score similarly affected.

Discussion

Error Analysis

When we took a small sample of misclassified adversarial it reveals that:

Word Substitutions: When abusive terms are replaced with innocuous synonyms (e.g., “moron” → “fool”), the model still missed it as they were trained only on the original toxic phrasing [36].

Obfuscation Tricks: Replacing “hate” by using characters like “h@te” or “idiot” with numeric digits for some letter like “id10t” confuses the tokenizers [37].

Contextual Weakness: Attacks that contain toxic words in a more neutral context (e.g., “He’s not a hater, just misunderstood”) are particularly effective [38].

These cases highlight gaps in understanding of context, generalization of vocabulary, and limitations in preprocessing.

These contextual attacks align with previous work emphasizing limitations of neural networks to adversarial inputs [39].

Generalizability to Real-World Scenarios

While our evaluation is based on synthetically created adversarial attacks, similar things are frequently seen on social media in the actual world [40]. Real-world adversarial behavior include a variety of misspellings and substitutions to bypass content moderation done intentionally by bad actors, coded language and use of sarcasm which cannot be easily interpreted by the system and coordinated behavior with consistent subtle evasion. The defense mechanisms evaluated here can have a direct impact in the live moderation pipelines and these can increase precision without sacrificing recall.

Conclusion

The detailed analysis of adversarial attacks on abuse detection systems reveals major vulnerabilities in current NLP models. The experiments with TextFooler and HotFlip attack strategies demonstrated that even cutting-edge transformer models like RoBERTa could not identify some of the adversarial examples, with performance degradation of around 20% in the AUC scores. The highly sophisticated techniques bad actors use which include word substitutions, character manipulation and contextual changes even bypass the high quality moderation systems.

The findings in this paper further emphasize the urgent need for developing more robust abuse detection models that can withstand different types of adversarial manipulations. As online platforms continue to serve billions of users daily, ensuring their safety requires high-performing people that maintain their integrity when deliberately challenged by bad actors. Future research must focus on understanding model vulnerabilities in more depth and developing targeted solutions to address the specific weaknesses identified in this study.

References

  1. Hosseini E, Kannan S, Zhang B, Poovendran R. Deceiving google’s perspective API built for detecting toxic comments. Proceedings of the Workshop on Attacking and Defending AI, 2017.
     Google Scholar
  2. Kumar S, Hamilton WL, Leskovec J, Jurafsky D. Community interaction and conflict on the web. Proceedings of the 2018 World Wide Web Conference, pp. 933–43, 2018.
     Google Scholar
  3. Davidson T, Warmsley D, Macy M, Weber I. Automated hate speech detection and the problem of offensive language. Proc International AAAI Conf Web Soc Med. 2017;11(1):512–5.
     Google Scholar
  4. Yuan X, He P, Zhu Q, Li X. Adversarial examples: attacks and defenses for deep learning. IEEE Trans Neural Netw Learn Syst. 2019;30(9):2805–24.
     Google Scholar
  5. Nisioti G, Sechidis K. Deception detection in text: a comprehensive survey. Inf Fusion. 2022;84:176–209.
     Google Scholar
  6. Schuster T, Schuster R, Shah DJ, Barzilay R. The limitations of stylometry for detecting machine-generated fake news. Comput Linguist. 2020;46(2):499–510.
     Google Scholar
  7. Eger M, Peng N, Choi E. On the calibration and uncertainty of neu- ral learning to rank models for conversational search. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pp. 160–75, 2021.
     Google Scholar
  8. Jigsaw Toxic Comment Classification Challenge. Kaggle. [Online]. Available from: https://www.kaggle.com/c/jigsaw-toxic-comment- classification-challenge.
     Google Scholar
  9. Zhang WE, Sheng QZ, Alhazmi A, Li C. Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Trans Intell Syst Technol. 2020;11(3):1–41.
     Google Scholar
  10. Jin D, Jin Z, Zhou JT, Szolovits P. A strong baseline for natural language attack on text classification and entailment. Proc AAAI Conf Artif Intell. 2020;34(05):8018–25.
     Google Scholar
  11. Ebrahimi J, Rao A, Lowd D, Dou D. HotFlip: white-box adversar- ial examples for text classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 31–6, 2018.
     Google Scholar
  12. Michel P, Li X, Neubig G, Pino JM. On evaluation of adversarial perturbations for sequence-to-sequence models. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 3103–14, 2019.
     Google Scholar
  13. Hosseini E, Kannan S, Zhang B, Poovendran R. Deceiving google’s perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138. 2017.
     Google Scholar
  14. Waseem Z, Hovy D. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. Proceedings of the NAACL Student Research Workshop, pp. 88–93, 2016.
     Google Scholar
  15. Morris JX, Lifland E, Yoo JY, Grigsby J, Jin D, Qi Y. TextAt- tack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 119–26, 2020.
     Google Scholar
  16. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations (ICLR), 2018.
     Google Scholar
  17. Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. 2013.
     Google Scholar
  18. Karimi S, Yin X, Baral C, Zhou J, Harer J. Adversarial text normalization. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7767–84, 2021.
     Google Scholar
  19. Pruthi D, Dhingra B, Lipton ZC. Combating adversarial mis- spellings with robust word recognition. Proceedings of ACL, pp. 5582–91, 2019.
     Google Scholar
  20. Zhang H, Goodfellow I, Metaxas D, Odena A. Self-attention gen- erative adversarial networks. International Conference on Machine Learning, pp. 7354–63, 2019.
     Google Scholar
  21. Shen S, Jia R, Cheng Y, Wu J, Su H. Gradient-based adversar- ial attack on transformer-based natural language understanding models. arXiv preprint arXiv:2009.06297. 2020.
     Google Scholar
  22. Mathew B, Saha P, Tharad H, Rajgaria S, Singhania P, Maity SK, et al. Thou shalt not hate: countering online hate speech. Proc Int AAAI Conf Web and Soc Med. 2019;13:369–80.
     Google Scholar
  23. Schmidt A, Wiegand M. A survey on hate speech detection using natural language processing. Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pp. 1– 10, 2017.
     Google Scholar
  24. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. 2016.
     Google Scholar
  25. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. 2019.
     Google Scholar
  26. Devlin J, Chang MW, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–86, 2019.
     Google Scholar
  27. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. mixup: beyond empirical risk minimization. International Conference on Learning Representations, 2018.
     Google Scholar
  28. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. 2014.
     Google Scholar
  29. Miyato T, Dai AM, Goodfellow I. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725. 2016.
     Google Scholar
  30. Kittler J, Hatef M, Duin RPW, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 1998;20(3):226–39.
     Google Scholar
  31. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. 2019.
     Google Scholar
  32. Li J, Ji S, Du T, Li B, Wang T. TextBugger: generating adversar- ial text against real-world applications. 26th Annual Network and Distributed System Security Symposium, 2019.
     Google Scholar
  33. Lei Q, Wu L, Chen P-Y, Dimakis AG, Dhillon IS, Witbrock M. Discrete adversarial attacks and submodular optimization with applications to text classification. Systems and Machine Learning Conference, 2019
     Google Scholar
  34. Gao J, Lanchantin J, Soffa ML, Qi Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. 2018 IEEE Security and Privacy Workshops, pp. 50–6, 2018.
     Google Scholar
  35. Zang Y, Qi F, Yang C, Liu Z, Zhang M, Liu Q, et al. Word- level textual adversarial attacking as combinatorial optimization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6066–80, 2020.
     Google Scholar
  36. Yin D, Xue Z, Lu L, Cheng P, Zheng R, Liu Q, et al. On the robust- ness of language encoders against grammatical errors. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3386–403, 2020.
     Google Scholar
  37. Belinkov Y, Bisk Y. Synthetic and natural noise both break neural machine translation. International Conference on Learning Repre- sentations (ICLR), 2018.
     Google Scholar
  38. Rizvi SA, Keith SL, Castillo K, Bouhenni RA. Context-aware classification of toxic language: capturing contextual shifts in online misogyny. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9927–34, 2021.
     Google Scholar
  39. Vincent P, Larochelle H, Bengio Y, Manzagol P-A. Extracting and composing robust features with denoising autoencoders. Proceed- ings of the 25th International Conference on Machine Learning, pp. 1096–103, 2008.
     Google Scholar
  40. Stenetorp P, Pyysalo S, Topi´ c G, Ohta T, Ananiadou S, Tsujii. J. BRAT: a web-based tool for NLP-assisted text annotation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 102–7, 2012.
     Google Scholar