Friday, June 13, 2014

World Cup and Word Clouds in R

Yesterday two cool things happened. Wait, three cool things happened. It was the start of the 2014 World Cup, Brazil was the first team to score in the World Cup (in their own goal, oops!) and I learned how to create word clouds in R! Here is a tutorial on how to create word clouds in R with a World Cup theme.


Whether you have a set of words already or you are interested in scraping data from social media sites such as Twitter, beautiful word clouds in R are only a few steps away with the help of some fantastic R-packages.


Set up Twitter authentication with R
If you are interested in obtaining a set of tweets with a hashtag (e.g. #worldcup), there are a few steps you must complete first. As with all new relationships, it begins with a "handshake". This handshake is essentially a communication between you and the Twitter server that tells the two systems what type of information will be exchanged. This entails the following five steps:

1. Go to https://dev.twitter.com and sign in using your Twitter account. Your picture will appear in the top right corner. Hover over the picture and click on My Applications. If this is your first application, click on Create a New App. Fill out the required application details, which include a Name, Description and Website. None of this really matters, so just fill it in with whatever you wish.

2. Once you have created your app, you will see some details related to the OAuth Settings. These include a Consumer key, Consumer secret, Request token URL, Authorize URL, Access token URL and Callback URL. Keep all this information handy as you will need it in just a minute.

3. Open up R or RStudio, load the RCurl package and set the SSL certificates globally

library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

4. Install and load the twitteR R-package

install.packages("twitteR")
library(twitteR)

5.  Once the twitteR package is loaded, type the following in R

reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
apiKey <- "yourAPIKey"
apiSecret <- "yourAPISecret"
twitCred <- OAuthFactory$new(consumerKey = apiKey,
                             consumerSecret = apiSecret,
                             requestURL = reqURL,
                             accessURL = accessURL,
                             authURL = authURL)
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
registerTwitterOAuth(twitCred)

Replace yourAPIKey and yourAPISecret with the key and secret found on your Twitter application page. The second-to-last line, the call to twitCred$handshake(), will ask you to go to a specific URL and enter the code found on that webpage in R. If all went well, you have successfully completed a handshake with Twitter!
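If you plan to share your script, you may not want the key and secret typed into it. Here is a minimal sketch of one way around that, reading them from environment variables instead. The variable names TWITTER_API_KEY and TWITTER_API_SECRET and the helper get_twitter_keys() are my own invention for illustration, not something twitteR requires.

```r
# Hypothetical helper: read the key and secret from environment variables
# (the variable names are assumptions; pick whatever names you like).
get_twitter_keys <- function() {
  key    <- Sys.getenv("TWITTER_API_KEY")
  secret <- Sys.getenv("TWITTER_API_SECRET")
  # Sys.getenv() returns "" for unset variables, so fail with a clear message
  if (!nzchar(key) || !nzchar(secret)) {
    stop("Set TWITTER_API_KEY and TWITTER_API_SECRET before the handshake.")
  }
  list(apiKey = key, apiSecret = secret)
}
```

You could set the two variables in your ~/.Renviron file (or with Sys.setenv() for a single session) and then use get_twitter_keys()$apiKey and get_twitter_keys()$apiSecret in place of the quoted strings in step 5.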


Get tweets using searchTwitter()
First, we can obtain a set of tweets containing a searchString. This searchString can be a hashtag or just a string of characters in quotations.  Here I will get a set of tweets containing the hashtag #worldcup.

mytweets <- searchTwitter("#worldcup", n = 100, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
length(mytweets)


Clean tweets using clean.tweets()
Load the following R-packages and extract the text portion from the tweet using getText():

library(tm)
library(wordcloud)
library(RColorBrewer)
library(plyr)
mytweets.text <- laply(mytweets, function(x) x$getText() )

Using a function clean.tweets() from this gist (originally obtained from here), we can remove invalid characters that cannot be analyzed.

clean_text = clean.tweets(mytweets.text)
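If you cannot get hold of the gist, here is a rough sketch of the kind of cleaning such a function might do, written in base R only. This is not the gist's clean.tweets(); the name clean_tweets_sketch() and the exact set of rules are my own assumptions about what is useful for tweets.

```r
# Hypothetical stand-in for clean.tweets(), NOT the function from the gist.
clean_tweets_sketch <- function(x) {
  # Drop characters that cannot be represented in ASCII (emoji, odd symbols)
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  # Remove URLs
  x <- gsub("http[^[:space:]]+", " ", x, perl = TRUE)
  # Remove the retweet marker "RT"
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)
  # Remove @mentions, including a trailing colon as in "@user:"
  x <- gsub("@\\w+:?", " ", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  x <- gsub("[[:space:]]+", " ", x, perl = TRUE)
  trimws(x)
}
```

It takes a character vector and returns one of the same length, so clean_tweets_sketch(mytweets.text) could be used where clean.tweets(mytweets.text) appears above. Hashtags and punctuation are left alone here because the term-document step later removes punctuation anyway.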


Create a word cloud using tweets
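# Build a corpus with one document per cleaned tweet
tweet_corpus = Corpus(VectorSource(clean_text))
# Count terms per tweet, dropping punctuation, numbers and stop words
tdm = TermDocumentMatrix(tweet_corpus, control = list(
  removePunctuation = TRUE,
  stopwords = c("machine", "learning", stopwords("english")),
  removeNumbers = TRUE, tolower = TRUE))
# Sum each term's counts across all tweets, most frequent first
m = as.matrix(tdm)
word_freqs = sort(rowSums(m), decreasing=TRUE)
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# Draw the cloud with the most frequent words in the center
wordcloud(dm[,1], dm[,2], random.order=FALSE, colors=brewer.pal(8, "Dark2"))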
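
If you already have a set of words and do not need Twitter at all, the same word/frequency table can be built by hand. Below is a small sketch using base R's table() in place of the term-document matrix. The three toy sentences are invented for illustration, and the cloud is only drawn if the wordcloud package happens to be installed.

```r
# Toy input standing in for cleaned tweets (invented for illustration)
toy_text <- c("brazil scores first goal",
              "own goal for brazil",
              "world cup goal")

# Split on whitespace and count how often each word appears
toy_words <- unlist(strsplit(tolower(toy_text), "[[:space:]]+"))
toy_freqs <- sort(table(toy_words), decreasing = TRUE)

# Same two-column layout as dm above: one row per word with its count
toy_dm <- data.frame(word = names(toy_freqs),
                     freq = as.integer(toy_freqs),
                     stringsAsFactors = FALSE)
print(toy_dm)

# min.freq = 1 because wordcloud() skips words seen fewer than 3 times by default
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(toy_dm$word, toy_dm$freq, min.freq = 1,
                       random.order = FALSE)
}
```

For real text you would still want the tm route above, since it takes care of stop words, punctuation and numbers for you.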


Additional Resources