Tidytext Analysis of the 2017 OpenVisConf Talk Transcripts

We recently published a blog post on key takeaways from OpenVisConf 2017. The conference just released all of the speaker videos as well as the talk transcripts. Since Julia Silge gave such an awesome talk on using tidytext to mine text data, specifically from Jane Austen novels, we thought this would be a perfect opportunity to analyze the transcripts from the conference.

Relying heavily on the tidytext tutorial and following many of its examples, we analyzed the transcripts by speaker gender.

First, we downloaded the data. Because the transcript links on the page are generated by JavaScript, we used PhantomJS to render the page and write its contents to a local file.

R can call this script via the system function, e.g. system("phantomjs scrape.js"). The full R script used for this analysis is located here.
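As a rough sketch of this step, the snippet below runs the PhantomJS script from R and then pulls the transcript links out of the saved HTML. The output file name openvisconf.html, the helper extract_txt_links, and the assumption that transcript links end in .txt are ours for illustration and may not match the original script.

```r
# Sketch: run the PhantomJS script from R, then extract transcript links
# from the HTML it saved. File names and the ".txt" pattern are assumptions.

# Return every href value ending in ".txt" from a character vector of HTML.
extract_txt_links <- function(html) {
  html <- paste(html, collapse = "\n")
  hits <- regmatches(html, gregexpr('href="[^"]+\\.txt"', html))[[1]]
  unique(sub('^href="', "", sub('"$', "", hits)))
}

# Only call PhantomJS when it is installed and the script is present.
if (nzchar(Sys.which("phantomjs")) && file.exists("scrape.js")) {
  system("phantomjs scrape.js")
  if (file.exists("openvisconf.html")) {
    links <- extract_txt_links(readLines("openvisconf.html", warn = FALSE))
  }
}
```

Keeping the link extraction in its own function makes it easy to check against a small HTML string before pointing it at the real page.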

The next steps involve downloading and storing the transcript text files into lists in R.
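A minimal sketch of reading the transcripts into a list might look like the following. The name transcript_urls is hypothetical, and two temporary files stand in for the real transcripts so the example runs offline; readLines() accepts URLs as well as local paths, so the same lapply call should work on the scraped links.

```r
# Sketch: read each transcript into a character vector, one list element
# per talk. Temp files stand in for the real transcript URLs (assumption).
demo_dir <- tempfile("transcripts")
dir.create(demo_dir)
writeLines(c("Hello everyone.", "Today I will talk about maps."),
           file.path(demo_dir, "talk1.txt"))
writeLines("Text mining is fun.", file.path(demo_dir, "talk2.txt"))

# In the real analysis this would be the vector of scraped transcript links.
transcript_urls <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One character vector of lines per transcript, named by file.
transcript_txts <- lapply(transcript_urls, readLines, warn = FALSE)
names(transcript_txts) <- basename(transcript_urls)
```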

From here, we can count the occurrence of words used by each gender in their talks. unnest_tokens puts our data into a tidy format and anti_join(stop_words) removes common stop words such as ‘the’ and ‘a’.
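Here is a small sketch of that counting step on toy data. The talks data frame and its contents are made up for illustration; the real input is the transcript text labelled by speaker gender.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the transcripts: one row per line of text,
# labelled with the speaker's gender (illustrative data only).
talks <- tibble(
  gender = c("F", "F", "M"),
  txt = c("The sentiment of the novels",
          "We tidy the text data",
          "The map of the prison data")
)

word_counts <- talks %>%
  unnest_tokens(word, txt) %>%             # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop stop words such as 'the'
  count(gender, word, sort = TRUE)         # word counts per gender
```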

Now that we’ve combined the word counts by gender into one tidy data frame, we can calculate the tf-idf, or term frequency-inverse document frequency. This metric measures how important a word is to one document within a collection of documents. In this case, the documents are the male and female speakers’ transcripts, so tf-idf highlights words that are characteristic of one group rather than common to both.
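The calculation can be sketched with tidytext's bind_tf_idf(), treating each gender as one document. The counts below are invented for illustration. Note that with only two documents, a word used by both groups gets an idf of ln(2/2) = 0, so only words unique to one group receive a non-zero score.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-gender word counts; the real ones come from the transcripts.
word_counts <- tibble(
  gender = c("F", "F", "F", "M", "M", "M"),
  word   = c("data", "sentiment", "austen", "data", "vega", "prison"),
  n      = c(10, 6, 4, 12, 5, 3)
)

gender_tf_idf <- word_counts %>%
  bind_tf_idf(word, gender, n) %>%   # term = word, document = gender
  arrange(desc(tf_idf))              # most distinctive words first
```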

We can clearly see the male speakers’ presentation topics bubbling to the top, such as vega, gdal and Matt Daniels’ talk about visualizing incarceration rates (prison) in the United States. Of course, words from Julia Silge’s talk about sentiment and analyzing Jane Austen’s novels also appear throughout the most important words mentioned by women speakers.

Analyzing sentiment using tidytext is easy to do as well. We have the option of using various lexicons, but in this case we will use the simple Bing lexicon (from Bing Liu and collaborators), which only classifies a word as having either a positive or negative sentiment. Which negative and positive words do male vs. female speakers tend to use?
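A sketch of the sentiment join, again on made-up counts: get_sentiments("bing") returns a word and sentiment column, and an inner join keeps only the words that appear in the lexicon.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-gender word counts (illustrative data only).
word_counts <- tibble(
  gender = c("F", "F", "M", "M", "M"),
  word   = c("happy", "wrong", "excited", "prison", "data"),
  n      = c(5, 2, 4, 3, 12)
)

gender_sentiment <- word_counts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep lexicon words only
  arrange(gender, sentiment, desc(n))
```

Words with no entry in the lexicon, such as 'data' here, simply drop out of the result.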

[Figures: Female Speaker Sentiment; Male Speaker Sentiment]

Based on the above plots, the Bing lexicon clearly classifies ‘prison’ as a negative word, and therefore Matt Daniels’ talk is contributing to that negative score. Both genders tend to use the negative words ‘hard’, ‘weird’, ‘wrong’ and ‘negative’. Women mentioned ‘empathy’ and ‘happy’ on the positive side, while the men said ‘excited’ and ‘nice’.

Lastly, let’s visualize the relationships between the words conditioned on gender. This will allow us to quickly see words used most commonly by women, men and both genders.

The opacity of the arrow indicates the frequency at which the word was mentioned. The upper right cluster indicates words used by women (note the ‘F’ node), the middle cluster indicates words used by both genders, and the lower left cluster indicates words used by the men.
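One way to draw a graph like this is with igraph and ggraph, similar to the network plots in the tidytext tutorial. This is a sketch on invented counts, not necessarily the exact code behind the figure: each edge runs from a gender node to a word, and edge opacity is mapped to the count.

```r
library(dplyr)
library(igraph)
library(ggraph)

# Hypothetical per-gender word counts (illustrative data only).
word_counts <- tibble(
  gender = c("F", "F", "M", "M"),
  word   = c("data", "sentiment", "data", "vega"),
  n      = c(10, 6, 12, 5)
)

# The first two columns become directed edges (gender -> word);
# remaining columns, here n, are stored as edge attributes.
word_graph <- graph_from_data_frame(word_counts)

set.seed(2017)  # the force-directed layout is random, so fix the seed
p <- ggraph(word_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),
                 arrow = grid::arrow(length = grid::unit(2, "mm")),
                 show.legend = FALSE) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1.5)
```

Printing p renders the plot. Words used by both genders, like 'data' in this toy example, are connected to both the ‘F’ and ‘M’ nodes, which is why they tend to sit between the two clusters.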

Everyone mentioned ‘data’ and ‘visualization’, which is no big surprise. Do you notice any interesting differences in the talk transcripts? We’ve made the data available as an .RData object that you can download. If you conduct your own analysis by following the above examples, leave a comment here.

2 Responses
  1. Thank you! Very nice work and PhantomJs was new to me. One minor point. Your commands:

    ## separate transcripts out by gender
    female_dfs <- lapply(transcript_txts[genders$gender == 'F'], function(x) data_frame(txt=x))
    male_dfs <- lapply(transcript_txts[genders$gender == 'M'], function(x) data_frame(txt=x))

    rely on dplyr (data_frame) and it's not listed anywhere…

    Thanks again.
