As part of my recent MSc thesis, which investigated using cloud services to aid in the detection of cyberbullying, I wanted to train some machine learning models to classify text as cyberbullying. As I was using a supervised machine learning approach, I needed existing labelled datasets to train the models. I was surprised to find that few labelled datasets exist for the cyberbullying domain, at least publicly available ones. In fact, Salawu et al., in their 2017 paper [1], found the lack of labelled datasets to be one of the main challenges facing research on the automated detection of cyberbullying today. Their research revealed only five distinct publicly available cyberbullying datasets, and these relate only to traditional text-based social media platforms and don’t represent newer media platforms such as Snapchat.
The datasets I came across while looking for training data for my ML models were:
- MySpace Bullying Data [2]
- University of Wisconsin-Madison Data [3]
- Formspring.me Data [4]
- Data from “Anti Bully” project [5]
- Max Planck Institute Data [6]
These vary in size, origin, and quality of the data labelling, but they were a good starting point for my research. Some of the datasets are also quite old (some date back to 2010), but they are still useful. All except the Max Planck Institute data are specific to cyberbullying; the Max Planck data is labelled for positive / negative sentiment instead, but I still found it useful for my use case.
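Because the datasets come from different sources and use different labelling schemes, they need to be mapped onto a common schema before they can be used together as training input. The following is only a minimal sketch of that idea, not the code from my thesis: the source names, raw label values, and example rows are all made up for illustration, and treating negative sentiment as a stand-in for the positive class is an assumption of the sketch.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Example(NamedTuple):
    text: str
    label: int   # 1 = bullying (or negative sentiment), 0 = otherwise
    source: str  # which dataset the row came from


# Hypothetical raw label vocabularies. The real datasets each have their own
# file formats and label names, which are not reproduced here.
LABEL_MAPS = {
    "formspring": {"yes": 1, "no": 0},
    "myspace": {"y": 1, "n": 0},
    # Assumption: negative sentiment is used as a crude proxy label.
    "sentiment": {"negative": 1, "positive": 0},
}


def normalise(source: str, rows: Iterable[Tuple[str, str]]) -> List[Example]:
    """Map (text, raw_label) rows from one source onto the shared schema."""
    mapping = LABEL_MAPS[source]
    examples = []
    for text, raw_label in rows:
        key = raw_label.strip().lower()
        if key not in mapping:
            continue  # skip rows whose label is unknown or missing
        examples.append(Example(text.strip(), mapping[key], source))
    return examples


def class_balance(examples: Iterable[Example]) -> Counter:
    """Count how many examples fall into each class."""
    return Counter(e.label for e in examples)


# Made-up rows, purely for illustration.
combined = normalise(
    "formspring", [("you are so dumb", "Yes"), ("nice photo!", "No")]
) + normalise(
    "sentiment",
    [("worst day ever", "negative"), ("love this", "positive"), ("meh", "neutral")],
)
print(class_balance(combined))
```

Keeping the `source` field on every example makes it possible to check later whether a model has learned something about bullying or merely about the quirks of one dataset.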
I was surprised that larger cyberbullying datasets don’t exist in the public domain, considering the amount of research that seems to have happened in this area over the past 10 years, and the prevalence of the issue itself. If anyone can point me to any publicly available datasets that I’ve missed, I would love to hear from you.
[1] Approaches to Automated Detection of Cyberbullying: A Survey, Salawu, S.; He, Y.; Lumsden, J., IEEE Transactions on Affective Computing 2017, vol. PP, no. 99, pp. 1-1.
[2] Detecting the Presence of Cyberbullying Using Computer Software, Poster presentation at WebSci11, June 14th 2011.
[3] Understanding and Fighting Bullying with Machine Learning, Sui, Junming, PhD thesis, Department of Computer Sciences, University of Wisconsin-Madison, 2015.
[4] Using Machine Learning to Detect Cyberbullying, Reynolds, K.; Kontostathis, A.; Edwards, L., In Proceedings of the 2011 10th International Conference on Machine Learning and Applications Workshops (ICMLA 2011), December 2011.
[5] Anti Bully, Li, Michelle, DevPost Submission, 2017.
[6] Sentiment Analysis in Twitter with Lightweight Discourse Analysis, Mukherjee, Subhabrata; Bhattacharyya, Pushpak, 2012.
Hi Jimmy
Really interesting thesis, I’d love to chat further with you on this and put you in touch with some people in academia whom I work with in Ireland and Florida in this area to hopefully help you and provide some data for you to leverage. Plus I’d love to leverage your work in our McAfee Online Safety Program as a point of reference and perhaps to use in our content.
Let me know your thoughts.
Kind regards,
Irene