I am always on the lookout for interesting datasets to experiment with for machine learning and data visualization. Mostly I use sources like data.gov.ie, which has lots of interesting datasets specific to Ireland. Sometimes there isn’t a dataset readily available for the topic I am interested in, and I want to create one; for this, I usually turn to Twitter. One obvious drawback is that the data will be unlabeled, so if you want to use it for supervised machine learning you will need to label it yourself, which can be laborious and time consuming. Tweepy is a great Python library for accessing the Twitter API, and it is very easy to use. In this post I will demonstrate how to use it to grab tweets from Twitter, and how to add some extra features to the dataset that might be useful for machine learning models later.
I will demonstrate how to do this using a Jupyter notebook here; in practice you would probably want to write the dataset to a CSV file or some other format for later consumption in model training.
The first thing you will need to do is create a new application on the Twitter developer portal. This will give you the access keys and tokens you need to use the Twitter API. Standard access is free, but there are a number of limits, described in the documentation, that you should be aware of. Once you have done this, create a new Jupyter notebook, import Tweepy, and create some variables to hold your access keys and tokens.
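A minimal sketch of that setup; the placeholder strings here are just stand-ins for the keys and tokens from your own developer application:

```python
import tweepy

# Placeholders - substitute the keys and tokens generated
# for your own application on the Twitter developer portal.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```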
Now we can initialize Tweepy and grab some tweets. In this example, we will get 100 tweets relating to the term ‘trump’. Print out the raw tweets as well, so you can verify that your access keys work and that you are actually receiving tweets.
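Something like the following should do it, assuming Tweepy v3, where the standard search endpoint is exposed as `api.search` (later versions renamed it `search_tweets`):

```python
# Authenticate with the keys and tokens defined above.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch 100 tweets matching the search term.
tweets = [status for status in tweepy.Cursor(api.search, q="trump", lang="en").items(100)]

# Print the raw tweet text to confirm the keys work
# and tweets are actually coming back.
for tweet in tweets:
    print(tweet.text)
```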
Now that you have gotten this far, we can parse the tweet data and create a pandas dataframe to store the attributes we want. The data comes back from Twitter in JSON format, and depending on what you are looking for, you won’t necessarily want all of it. Below I am doing a bunch of things (sketched in code after this list):
- Creating a new pandas dataframe with columns for the items I am interested in from the tweet data.
- Removing duplicate tweets.
- Removing any URLs in the tweet text – in my case I was planning to use this data in some text classification experiments, so I don’t want these included.
- Creating a sentiment measure for the tweet text using the TextBlob library.
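A minimal sketch of those steps; the column names and the regex for stripping URLs are my own choices for illustration, not necessarily what the original notebook used:

```python
import re
import pandas as pd
from textblob import TextBlob

# Build a dataframe from the tweet attributes of interest.
df = pd.DataFrame({
    "id": [tweet.id for tweet in tweets],
    "created_at": [tweet.created_at for tweet in tweets],
    "user": [tweet.user.screen_name for tweet in tweets],
    "text": [tweet.text for tweet in tweets],
    "retweet_count": [tweet.retweet_count for tweet in tweets],
})

# Remove duplicate tweets (e.g. retweets with identical text).
df = df.drop_duplicates(subset="text").reset_index(drop=True)

# Strip any URLs from the tweet text.
df["text"] = df["text"].apply(lambda t: re.sub(r"http\S+", "", t).strip())

# Add a sentiment polarity score (-1.0 to 1.0) using TextBlob.
df["sentiment"] = df["text"].apply(lambda t: TextBlob(t).sentiment.polarity)

df.head()
```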
At this point, you have the beginnings of a dataset, and you can easily add more features to it. In my case I wanted to add the tweet text length and the count of punctuation characters in the tweet text. The code below calculates these and adds two new columns to the dataframe.
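One way to do that, assuming the dataframe built above (the column names `text_length` and `punct_count` are just illustrative):

```python
import string

# Tweet text length in characters.
df["text_length"] = df["text"].apply(len)

# Count of punctuation characters in the tweet text.
df["punct_count"] = df["text"].apply(
    lambda t: sum(1 for c in t if c in string.punctuation)
)

df.head()
```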
Hopefully this post has illustrated how easy it is to create datasets from Twitter. The full Jupyter notebook is available on my Github here, and it also includes an example of generating a wordcloud from the data.