A large-scale English multi-label Twitter dataset for online abuse detection

Semiu Salawu, Jo Lumsden, Yulan He

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Citations (Scopus)
81 Downloads (Pure)

Abstract

In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with a high probability of containing various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models, which returned impressive results.

Original language: English
Title of host publication: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Editors: Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, Zeerak Waseem
Publisher: Association for Computational Linguistics (ACL)
Pages: 146-156
Number of pages: 11
ISBN (Electronic): 9781954085596
Publication status: Published - Aug 2021
Externally published: Yes
Event: 5th Workshop on Online Abuse and Harms - Bangkok and Online, Bangkok, Thailand
Duration: 6 Aug 2021 - 6 Aug 2021

Conference

Conference: 5th Workshop on Online Abuse and Harms
Abbreviated title: WOAH 2021
Country/Territory: Thailand
City: Bangkok
Period: 6/08/21 - 6/08/21

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Software
  • Linguistics and Language
