TY - GEN
T1 - A large-scale English multi-label Twitter dataset for online abuse detection
AU - Salawu, Semiu
AU - Lumsden, Jo
AU - He, Yulan
PY - 2021/8
Y1 - 2021/8
N2 - In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.
AB - In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.
U2 - 10.18653/v1/2021.woah-1.16
DO - 10.18653/v1/2021.woah-1.16
M3 - Conference contribution
AN - SCOPUS:85113698755
SP - 146
EP - 156
BT - Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
A2 - Davani, Aida Mostafazadeh
A2 - Kiela, Douwe
A2 - Lambert, Mathias
A2 - Vidgen, Bertie
A2 - Prabhakaran, Vinodkumar
A2 - Waseem, Zeerak
PB - Association for Computational Linguistics (ACL)
T2 - 5th Workshop on Online Abuse and Harms
Y2 - 6 August 2021
ER -