Building a Dataset from Twitter Using Tweepy

I am always on the lookout for interesting datasets to mess about with for machine learning and data visualization. Mostly I use existing public sources, which have lots of interesting datasets specific to Ireland. Sometimes there isn’t a dataset readily available for the topic I am interested in, and I want to create one; mostly I use Twitter for this. One drawback here is that the data will be unlabeled, so if you want to use it for supervised machine learning you will need to label it yourself, which can be both laborious and time consuming. Tweepy is a great Python library for accessing the Twitter API, and it is very easy to use. In this post I will demonstrate how to use it to grab tweets from Twitter, and how to add some other features to the dataset that might be useful for machine learning models later.

I will demonstrate how to do this using a Jupyter notebook here; in reality you would probably want to write the dataset to a CSV file or some other format for later consumption in model training.

The first thing you will need to do is create a new application on the Twitter developer portal. This will give you the access keys and tokens which you will need to access the Twitter API. Standard access is free, but there are a number of limits which can be seen in the documentation that you should be aware of. Once you have done this, create a new Jupyter notebook, and import Tweepy and create some variables to hold your access keys and tokens.
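Something like the following, with placeholder variable names of my own choosing (substitute the real values from your developer app):

```python
import tweepy

# Assumption: placeholder values standing in for the keys/tokens
# from your Twitter developer app -- never commit the real ones.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Set up OAuth 1.0a authentication for the Twitter API.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
```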

Now we can initialize Tweepy and grab some tweets. In this example, we will get 100 tweets relating to the term ‘trump’. Print out the raw tweets as well, so you can verify that your access keys work and that you are actually receiving tweets.
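A minimal sketch of this step — `fetch_tweets` is my own helper name, and it simply wraps Tweepy’s search call (named `search_tweets` in Tweepy v4; older versions called it `api.search`):

```python
def fetch_tweets(api, query, count=100):
    # Thin wrapper around Tweepy's search; `search_tweets` is the
    # Tweepy v4 method name (older versions used `api.search`).
    return api.search_tweets(q=query, count=count, tweet_mode="extended")

# Using the auth object from the previous step:
# import tweepy
# api = tweepy.API(auth, wait_on_rate_limit=True)
# tweets = fetch_tweets(api, "trump")
# for tweet in tweets:
#     print(tweet.full_text)  # print raw tweets to verify your keys work
```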

Now that you have gotten this far, we can parse the tweet data and create a pandas dataframe to store the relevant attributes we want. The data comes back from Twitter in JSON format, and depending on what you are looking for, you won’t necessarily want all of it. Below I am doing a number of things:

  • Creating a new pandas dataframe with columns for the items I am interested in from the tweet data.
  • Removing duplicate tweets.
  • Removing any URLs in the tweet text – in my case I was planning on using this data in some text classification experiments, so I don’t want these included.
  • Creating a sentiment measure for the tweet text using the TextBlob library.
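The steps above might look something like the following — the column choice and helper names here are my own, and the TextBlob import is deferred so the rest of the pipeline works even if you skip the sentiment step:

```python
import re
import pandas as pd

URL_RE = re.compile(r"https?://\S+")

def remove_urls(text):
    # Strip URLs so they don't pollute later text-classification features.
    return URL_RE.sub("", text).strip()

def tweets_to_dataframe(tweets):
    """Build a dataframe from raw tweet JSON dicts.
    The column choice is illustrative -- keep whatever attributes you need."""
    df = pd.DataFrame({
        "created_at": [t["created_at"] for t in tweets],
        "user": [t["user"]["screen_name"] for t in tweets],
        "text": [t.get("full_text", t.get("text", "")) for t in tweets],
        "retweet_count": [t["retweet_count"] for t in tweets],
    })
    # Remove duplicate tweets, then strip URLs from the text.
    df = df.drop_duplicates(subset="text").reset_index(drop=True)
    df["text"] = df["text"].apply(remove_urls)
    return df

def add_sentiment(df):
    # TextBlob polarity is a float in [-1.0, 1.0];
    # imported here so it is only required for this step.
    from textblob import TextBlob
    df["sentiment"] = df["text"].apply(lambda s: TextBlob(s).sentiment.polarity)
    return df
```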


At this point, you have the beginnings of a dataset. You can also add more features to the dataset easily. In my case I wanted to add the tweet text length and the count of punctuation in the tweet text. This is easy to do. The below calculates these and adds two new columns to the dataframe.
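For example, a sketch along these lines (the column names are my own choice), using Python’s built-in punctuation set:

```python
import string
import pandas as pd

def count_punctuation(text):
    # Count characters from Python's built-in punctuation set.
    return sum(1 for ch in text if ch in string.punctuation)

def add_text_features(df):
    # Two simple extra features: tweet length and punctuation count.
    df["text_length"] = df["text"].apply(len)
    df["punct_count"] = df["text"].apply(count_punctuation)
    return df

# Example usage on a tiny dataframe:
df = add_text_features(pd.DataFrame({"text": ["Hello, world!", "no punctuation here"]}))
```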

Hopefully this post has illustrated how easy it is to create datasets from Twitter. The full Jupyter notebook is available on my GitHub here, which also has an example of generating a wordcloud from the data.

AWS SageMaker

I have played around with AWS SageMaker a bit more recently. This is Amazon’s managed machine learning service that allows you to build and run machine learning models in the AWS public cloud. The nice thing about this is that you can productionize a machine learning solution very quickly, because the operational aspects – namely hosting the model and scaling an endpoint to allow inferences against the model – are handled for you. So-called ‘MLOps’ has almost become a field of its own, so abstracting all this complexity away and focusing on the core of the problem you are trying to solve is very beneficial. Of course, like everything else in public cloud, this comes at a monetary cost, but it is well worth it if you don’t have specialists in this area, or just want to do a fast proof-of-concept.

I will discuss here the basic flow of creating a model in SageMaker – of course, some of these steps are general things that would be done as part of any machine learning project. The first piece of setup is to head over to AWS and create a new Jupyter Notebook instance in AWS SageMaker; this is where the logic for training the model and deploying the ML endpoint will reside.

Assuming you have identified the problem you are trying to solve, you will need to identify the dataset which you will use for training and evaluation of the model. You will want to read the AWS documentation for the algorithm you choose, as this will likely require the data to be in a specific format for the training process. I have found that many of the built-in algorithms in SageMaker require data in different formats, which has been a bit frustrating. I recommend looking at the AWS SageMaker examples repository, as it has detailed examples of all the available algorithms, and examples you can walk through that solve real world problems.

Once you have the dataset gathered and in the correct format, and you have identified the algorithm you want to use, the next step is to kick off a training job. It is likely your data will be stored on AWS S3, and as usual you would split it into training data and data you will use later for model evaluation. Make sure that the S3 bucket where you store your data is located in the same AWS region as your Jupyter Notebook instance, or you may see issues. SageMaker makes it very easy to kick off a training job. Let’s take a look at an example.

Here, I’m setting up a new training job for some experiments I was doing around anomaly detection using the Random Cut Forest (RCF) algorithm provided by AWS SageMaker. This is an unsupervised algorithm for detecting anomalous data points within a dataset.
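A sketch of the training setup using the SageMaker Python SDK (v2 method names) — the bucket name, prefix, and placeholder data here are my own assumptions:

```python
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
bucket = "my-sagemaker-bucket"   # assumption: your own S3 bucket, in the
prefix = "rcf-demo"              # same region as the notebook instance

rcf = RandomCutForest(
    role=sagemaker.get_execution_role(),
    instance_count=1,                        # number of EC2 instances
    instance_type="ml.m4.xlarge",            # EC2 instance type for training
    data_location=f"s3://{bucket}/{prefix}/input",
    output_path=f"s3://{bucket}/{prefix}/output",
    num_samples_per_tree=512,                # RCF hyperparameter
    num_trees=50,                            # RCF hyperparameter
)

# training_data would be your own numpy array of shape (n_samples, n_features);
# random data is used here purely as a placeholder.
training_data = np.random.randn(1000, 1)
rcf.fit(rcf.record_set(training_data))
```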


Above we are specifying things like the EC2 instance type we want the training to execute on, the number of EC2 instances, and the input and output locations of our data. The final parameters, where we specify the number of samples per tree and the number of trees, are specific to the RCF algorithm. These are known as hyperparameters. Each algorithm has its own hyperparameters that can be tuned; for example, see here for the list available when using RCF. When the above is executed, the training process starts and you will see some output in the console. Note that you are charged for the model training time; once the job completes, you will see the number of seconds you have been billed for.

At this point you have a model, but now you want to productionize it and run inferences against it. Of course, it is not as easy as train and deploy – I am completely ignoring the testing/validation of the model and tuning based on that, as here I just want to show how SageMaker is effective at abstracting away the operational aspects of deploying a model. With SageMaker, you can deploy an endpoint, which is essentially your model hosted on a server with an API that allows queries to be run against it, with a prediction returned to the requester. The endpoint can be spun up in a few lines of code:
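A sketch, assuming `rcf` is a trained estimator as in the training step above:

```python
# Deploy the trained estimator behind a real-time HTTPS endpoint.
# SageMaker provisions the instance and hosts the model for you.
rcf_predictor = rcf.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)
```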


Once you get confirmation that the endpoint is deployed – this will generally take a few minutes – you can use the predict function to run some inference, for example:
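A sketch of scoring some points — `test_data` is a placeholder for your own array, and the exact shape of the response depends on the serializer in use; this assumes the SDK’s default record format for the built-in algorithms:

```python
# Score some new points against the live endpoint. With the default
# record serialization, each result carries a "score" label holding
# the anomaly score (higher = more anomalous).
results = rcf_predictor.predict(test_data)  # test_data: numpy array of points
scores = [r.label["score"].float32_tensor.values[0] for r in results]
print(scores[:5])
```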


Once you are done playing around with your model and endpoint, don’t forget to turn off your Jupyter instance (you don’t need to delete it), and to destroy any endpoints that you have created or you will continue to be charged.
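Tearing down the endpoint is a one-liner on the predictor object:

```python
# Delete the endpoint (and its configuration) so you stop being billed.
rcf_predictor.delete_endpoint()
```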


AWS SageMaker is powerful in terms of putting the ability to create machine learning models and set up endpoints to serve requests to them in anybody’s hands. It is still a complex beast that requires knowledge of the machine learning process in order for you to be successful. However, in terms of being able to train a model quickly and put it into production, it is a very cool offering from AWS. You also get benefits like autoscaling of your endpoints, should you need to scale up to meet demand. There is a lot to learn about SageMaker, and I’m barely scratching the surface here, but if you are interested in ML I highly recommend you take a look.

Cyberbullying Datasets

As part of my recent MSc thesis, the subject of which was investigating the use of cloud services to aid in the detection of cyberbullying, I wanted to train some machine learning models to classify text as cyberbullying. As I was using a supervised machine learning approach, I required existing labelled datasets in order to train the models. I was surprised to find that not many labelled datasets exist for the cyberbullying domain, at least ones which are publicly available. In fact, Salawu et al., in their 2017 paper [1], found the lack of labelled datasets to be one of the main challenges facing research into the automated detection of cyberbullying. Their research revealed only five distinct publicly available cyberbullying datasets, and these relate only to traditional text-based social media platforms, and don’t represent newer platforms such as Snapchat.

The datasets I came across while attempting to look for training input to my ML models were:

  • MySpace Bullying Data [2]
  • University of Wisconsin-Madison Data [3]
  • Data [4]
  • Data from “Anti Bully” project [5]
  • Max Planck Institute Data [6]

Each of these varies in terms of size, origin, and quality of the data labeling, but they were a good starting point for my research. Some of the datasets are also quite old (some date back to 2010), but still useful nonetheless. All except the Max Planck Institute data are specific to cyberbullying – that one is labelled for positive/negative sentiment, but I still found it useful for my use case.

I was surprised that larger cyberbullying datasets don’t exist in the public domain, considering the amount of research that seems to have happened in this area over the past 10 years, and the prevalence of the issue itself. If anyone can point me to any publicly available datasets that I’ve missed, I would love to hear from you.

[1] Approaches to Automated Detection of Cyberbullying: A Survey, Salawu, S.; He, Y.; Lumsden, J., IEEE Transactions on Affective Computing 2017, vol. PP, no. 99, pp. 1-1.

[2] Detecting the Presence of Cyberbullying Using Computer Software, Poster presentation at WebSci11, June 14th 2011.

[3] Understanding and Fighting Bullying with Machine Learning, Sui, Junming, PhD thesis, Department of Computer Sciences, University of Wisconsin-Madison, 2015.

[4] Using Machine Learning to Detect Cyberbullying, In Proceedings of the 2011 10th International Conference on Machine Learning and Applications Workshops (ICMLA 2011), Reynolds, K; Kontostathis, A.; Edwards, L., December 2011.

[5] Anti Bully, Li, Michelle, DevPost Submission, 2017.

[6] Sentiment Analysis in Twitter with Lightweight Discourse Analysis, Mukherjee, Subhabrata; Bhattacharyya, Pushpak, 2012.