Data Visualization

Tidytext Analysis of the 2017 OpenVisConf Talk Transcripts

We recently published a blog post on key takeaways from OpenVisConf 2017. The organizers just released all of the speaker videos as well as the talk transcripts. Since Julia Silge gave such an awesome talk on using tidytext to mine text data, specifically from Jane Austen novels, we thought this would be a perfect opportunity to analyze the transcripts from the conference.

Relying heavily on the tidytext tutorial and following many of its examples, we analyzed the transcripts by the gender of the speakers.

First, we downloaded the data. Since the links are being generated by JavaScript, we used PhantomJS to write the contents of the webpage to a local file.

var url = 'https://openvisconf.com/#home';
var fs = require('fs');
var page = require('webpage').create();
page.open(url, function(status) {
    if (status === 'success') {
        var html = page.evaluate(function() {
            return document.documentElement.outerHTML;
        });
        try {
            fs.write("openviz.txt", html, 'w');
        } catch(e) {
            console.log(e);
        }
    }
    phantom.exit();
});

R can call this script via the system function, e.g. system("phantomjs scrape.js"). The full R script used for this analysis is located here.

The next steps involve downloading and storing the transcript text files into lists in R.

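library(dplyr)
library(purrr)     ## map(), map_df()
library(rvest)     ## read_html(), html_nodes()
library(xml2)      ## xml_attrs()
library(stringr)   ## str_extract()
library(tidytext)  ## unnest_tokens(), stop_words, get_sentiments()
library(ggplot2)
 
## read_html() replaces the deprecated html()
h <- read_html("openviz.txt")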
 
## get all <a> tags
links <- h %>% html_nodes('a')
 
## map them to a data.frame using purrr
links_df <- 
  links %>% 
  map(xml_attrs) %>% 
  map_df(~as.list(.))
 
## grep out just the transcript URLs
transcript_urls <- links_df[grep("transcripts", links_df$href),]$href
 
## store them into a list
base_url <- 'https://openvisconf.com'
transcripts <- paste0(base_url, transcript_urls)
## lapply always returns a list (sapply may simplify to a matrix)
transcript_txts <- lapply(transcripts, function(x) readLines(x, encoding = "UTF-8"))
 
## Assign genders to the speakers
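## stringsAsFactors = FALSE so 'F' can be assigned below (a factor with only level "M" would give NA)
genders <- data.frame(transcript = transcripts, gender = "M", stringsAsFactors = FALSE)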
genders[grep("Shirley|Amanda|Lisa|Julia|Amy|Amelia", genders$transcript),]$gender <- 'F'
 
## Remove this talk because it has male and female speakers and no way to differentiate
i <- grep("Ignazio", genders$transcript)
genders <- genders[-i,]
transcript_txts <- transcript_txts[-i]
 
## separate transcripts out by gender
## tibble() replaces the deprecated data_frame()
female_dfs <- lapply(transcript_txts[genders$gender == 'F'], function(x) tibble(txt = x))
male_dfs <- lapply(transcript_txts[genders$gender == 'M'], function(x) tibble(txt = x))
female_dfs_all <- do.call('rbind', female_dfs)
male_dfs_all <- do.call('rbind', male_dfs)

From here, we can count the occurrence of words used by each gender in their talks. unnest_tokens puts our data into a tidy format and anti_join(stop_words) removes common stop words such as ‘the’ and ‘a’.
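
To make those two steps concrete, here is a minimal sketch on a made-up two-line transcript. The `toy` data frame and its contents are invented for illustration and are not taken from the conference data.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line "transcript"; not from the real data
toy <- tibble(txt = c("The data is the story", "A chart of the data"))

data(stop_words)

toy_words <- toy %>%
  unnest_tokens(word, txt) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word")    # drop rows whose word is a stop word

toy_words
```

Passing `by = "word"` explicitly just silences the message that anti_join otherwise prints about the join column.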

data(stop_words)
 
## 1grams for women
f_1grams <- 
  female_dfs_all %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words)
 
## 1grams for men
m_1grams <- 
  male_dfs_all %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words)
 
## Genders UNITE!
genders_united <- bind_rows(mutate(f_1grams, gender = "F"),
                            mutate(m_1grams, gender = "M"))

Now that we’ve combined the words from both genders into one tidy data frame, we can calculate the tf-idf, or term frequency-inverse document frequency. This metric measures how important a word is to one document within a collection of documents. In this case, we are identifying words that are distinctive to either the male or the female speakers.
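
The code for this step is not shown in the post. As a rough sketch of how it could look, the example below uses tidytext's bind_tf_idf and treats each gender as one "document". It runs on invented word counts (the `toy_counts` frame is hypothetical); on the real data one would presumably start from something like `genders_united %>% count(gender, word)`, which is our assumption rather than the authors' confirmed code.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per gender; not the real conference data
toy_counts <- tibble(
  gender = c("F", "F", "M", "M"),
  word   = c("data", "sentiment", "data", "vega"),
  n      = c(10, 5, 12, 4)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, gender, n) %>%   # term = word, document = gender
  arrange(desc(tf_idf))

toy_tf_idf
```

With only two documents, idf is ln(2) for a word used by one gender and 0 for a word used by both, so every word with a non-zero tf-idf is exclusive to one gender. That is why talk-specific vocabulary dominates the top of each list.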

We can clearly see the male speakers’ presentation topics bubbling to the top, such as vega, gdal and Matt Daniels’ talk about visualizing incarceration rates (prison) in the United States. Of course, Julia Silge’s talk about sentiment and analyzing Jane Austen’s novels also appears throughout the most important words mentioned by women speakers.

Analyzing sentiment using tidytext is easy to do as well. We have the option of using various dictionaries, but in this case we will use the simple Bing lexicon (from Bing Liu and collaborators), which classifies each word as either positive or negative. Which negative and positive words do male vs. female speakers tend to use?

## Sentiment
both_gender_word_counts <- genders_united %>%
  group_by(gender) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
 
## Plot Female Sentiment
both_gender_word_counts %>%
  filter(gender == 'F') %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

[Figures: Female Speaker Sentiment, Male Speaker Sentiment]

Based on the above plots, the Bing dictionary clearly classifies ‘prison’ as a negative word and therefore Matt Daniels’ talk is contributing to that negative score. Both genders tend to use the negative words ‘hard’, ‘weird’, ‘wrong’ and ‘negative’. Women mentioned ‘empathy’ and ‘happy’ on the positive side, while the men said ‘excited’ and ‘nice’.
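
To check how the lexicon labels individual words, a small lookup like the sketch below can help. The `check` frame is a hypothetical list of words chosen for illustration; ‘data’ is included as a word we expect to be absent from the lexicon.

```r
library(dplyr)
library(tidytext)

# A few words quoted in the text, plus one we expect to be unlabeled
check <- tibble(word = c("prison", "wrong", "happy", "nice", "data"))

checked <- check %>%
  left_join(get_sentiments("bing"), by = "word")  # keeps unmatched words

checked
```

Words that are missing from the lexicon come back with NA in the sentiment column, which is a reminder that the inner_join used above silently drops them from the counts.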

Lastly, let’s visualize the relationships between the words conditioned on gender. This will allow us to quickly see words used most commonly by women, men and both genders.

library(igraph)
 
## Calculate frequency of words used by gender
frequency <- genders_united %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(gender, word) %>%
  group_by(gender) %>%                 ## proportions within each gender
  mutate(proportion = n / sum(n)) %>%
  ungroup()
 
## filter by proportion (since men had more words total)
freq_for_graph <- frequency %>%
  filter(proportion >= .002) %>%
  graph_from_data_frame()
 
## Visualize a network graph of words by gender
library(ggraph)
set.seed(2121)
 
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(freq_for_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

The opacity of the arrow indicates the frequency at which the word was mentioned. The upper right cluster indicates words used by women (note the ‘F’ node), the middle cluster indicates words used by both genders, and the lower left cluster indicates words used by the men.
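
The ‘F’ and ‘M’ nodes appear because graph_from_data_frame reads the first two columns of the data frame as the two endpoints of each edge, so every row becomes an arrow from a gender to a word. The sketch below shows this on an invented frequency table (`toy_freq` is hypothetical, not the real data).

```r
library(igraph)

# Hypothetical frequency table; the first two columns become edge endpoints
toy_freq <- data.frame(
  gender = c("F", "F", "M", "M"),
  word   = c("data", "sentiment", "data", "vega"),
  n      = c(10, 5, 12, 4),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(toy_freq)  # directed by default

V(g)$name               # gender nodes plus word nodes
E(g)$n                  # remaining columns are kept as edge attributes
degree(g, mode = "in")  # words used by both genders have in-degree 2
```

A word shared by both genders is pulled toward both gender nodes by the layout, which is consistent with the shared words ending up in the middle cluster.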

Everyone mentioned ‘data’ and ‘visualization’, which is no big surprise. Do you notice any interesting differences in the talk transcripts? We’ve made the data available as an .RData object that you can download. If you conduct your own analysis by following the above examples, leave a comment here.