This post was originally published by Feng Lim at Towards Data Science
Thanks to the internet, the whole world now knows about the first 2020 Presidential Debate, which went out of control. All of the major news stations reported on how the participants interrupted and sniped at one another.
I decided to put together an article that analyzes the words used in the event to see whether there are any hidden insights.
This article focuses on the most used words of each spokesperson and on sentiment analysis of the speeches.
The first 2020 Presidential Debate overview
Participants:
- Incumbent President Donald Trump
- Former Vice President Joe Biden (Democratic nominee)
Moderator:
- Chris Wallace
Topics:
- The candidates’ political records
- The Supreme Court
- The coronavirus
- The economy
- Race and violence in cities
- The integrity of the election
Cleaning the dataset
In total, close to 20,000 words were used in the event. After removing names and common stop words, around 6,000 words were left for analysis.
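    library(dplyr)
    library(tibble)
    library(tidytext)

    # Tokenize: one row per word (`text` is the transcript with a `Text` column)
    text_df <- text %>%
      unnest_tokens(word, Text)

    # Custom stop words: names and titles of the participants
    my_stop_words <- tibble(
      word = c("chris", "wallace", "trump", "donald", "joe", "biden", "vice", "president")
    )

    # Combine with tidytext's built-in stop_words, then remove them all
    all_stop_words <- stop_words %>%
      bind_rows(my_stop_words)

    textClean_df <- text_df %>%
      anti_join(all_stop_words, by = "word")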
The first Presidential Debate in short
The word correlation network graph illustrates how words are used either in the same sentence or next to each other in the debate. I grouped some of the words into networks that may be relevant to the topics covered in the debate:
- The U.S. economy topic includes words such as ‘affordable’, ‘job’, ‘act’, etc.
- The Supreme Court topic includes words such as ‘justice’, ‘reason’, ‘judge’, etc.
- The race and violence in cities topic includes words such as ‘peaceful’ and ‘protest’
- The election topic includes words such as ‘ballots’, ‘management’, ‘mail’, etc.
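To make "used in the same sentence" concrete, here is a minimal base-R sketch of sentence-level co-occurrence counting. The toy sentences and the helper `sentence_pairs` are hypothetical and only illustrate the idea; this is not the udpipe-based pipeline used for the graph.

```r
# Toy sentences, already reduced to content words (hypothetical data).
sentences <- list(
  c("peaceful", "protest", "city"),
  c("mail", "ballots", "election"),
  c("peaceful", "protest"),
  c("ballots", "mail")
)

# For one sentence, build every unordered pair of distinct words as "a b".
# Sorting first makes "mail ballots" and "ballots mail" the same pair.
sentence_pairs <- function(words) {
  words <- sort(unique(words))
  if (length(words) < 2) return(character(0))
  apply(combn(words, 2), 2, paste, collapse = " ")
}

# Count in how many sentences each pair appears, most frequent first.
pair_counts <- sort(table(unlist(lapply(sentences, sentence_pairs))),
                    decreasing = TRUE)
print(pair_counts)
```

Pairs that appear together in many sentences become the strongest edges of a word network; keeping only the top pairs is what keeps such a graph readable.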
Classifying the debate by part of speech
We can also tag each participant’s words with a part-of-speech category (noun, proper noun, adjective, adverb, etc.). This section looks specifically at the most frequently occurring proper nouns, nouns, and adjectives used by Trump and Biden.
- ‘China’ is the most used proper noun by Trump
- ‘People’ is used by Trump and Biden more than 60 times each in the debate
- Interestingly, Trump’s most-used adjective is ‘wrong’, while Biden’s is ‘true’
    library(udpipe)
    library(igraph)
    library(ggraph)
    library(ggplot2)

    # Download and load the English udpipe model
    udmodel <- udpipe_download_model(language = "english")
    udmodel <- udpipe_load_model(file = udmodel$file_model)

    # Annotate the transcript, one document per spokesperson.
    # `ï..Spokeperson` is the column name as it appears in the author's data
    # (the prefix is most likely a byte-order-mark artifact from reading the CSV).
    tidy_text <- udpipe_annotate(udmodel, x = text$Text, doc_id = text$ï..Spokeperson)
    tidy_text <- as.data.frame(tidy_text)

    # How many times nouns and adjectives are used in the same sentence.
    # Note: textClean_df here must be the annotated data (with upos and lemma
    # columns) after stop-word removal, not the tokenized frame from the cleaning step.
    cooc <- cooccurrence(
      x = subset(textClean_df, upos %in% c("NOUN", "ADJ")),
      term = "lemma",
      group = c("doc_id", "paragraph_id", "sentence_id")
    )

    # Keep the 50 most frequent pairs and turn them into a graph
    wordnetwork <- head(cooc, 50)
    head(wordnetwork)
    wordnetwork <- graph_from_data_frame(wordnetwork)

    # color1 and color2 are colour values defined elsewhere in the author's script
    ggraph(wordnetwork, layout = "fr") +
      # geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "#FF62BC") +
      geom_edge_link(aes(edge_alpha = cooc), edge_colour = color2) +
      geom_node_point(color = color1, size = 1) +
      geom_node_text(aes(label = name), col = color1, size = 5, repel = TRUE) +
      theme_graph(base_family = "Arial") +
      theme(legend.position = "none") +
      labs(title = "First Presidential Debate 2020: Cooccurrences within sentence",
           subtitle = "Nouns & Adjective")

    ggsave("C:/Users/fengyueh/Documents/data analysis/R/17 presidential debate/cooccurence N&A 50.png",
           width = 6, height = 6, dpi = 1500)
    library(dplyr)
    library(udpipe)
    library(lattice)

    # Pick one spokesperson (swap the active filter to analyze another)
    tidy_textToUse <- textClean_df %>%
      # filter(doc_id == "Donald Trump")
      # filter(doc_id == "Chris Wallace")
      filter(doc_id == "Joe Biden")

    # Pick one part-of-speech tag (swap the active line to analyze another)
    # stats <- subset(tidy_textToUse, upos %in% c("PROPN"))
    stats <- subset(tidy_textToUse, upos %in% c("NOUN"))
    # stats <- subset(tidy_textToUse, upos %in% c("VERB"))
    # stats <- subset(tidy_textToUse, upos %in% c("ADJ"))
    # stats <- subset(tidy_text, upos %in% c("ADV"))

    # Token frequencies, with factor levels reversed so the most frequent is on top
    stats <- txt_freq(stats$token)
    stats$key <- factor(stats$key, levels = rev(stats$key))

    # color1 is a colour value defined elsewhere in the author's script
    barchart(key ~ freq, data = head(stats, 20), col = color1, labels = stats$freq,
             main = "First Presidential Debate 2020: Most occurring nouns - Joe Biden",
             xlab = "Freq")
The polarity of the debate
One of the basic tasks of sentiment analysis is understanding the polarity of a given text, whether the opinions expressed in the text are positive, negative, or neutral.
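As a rough illustration of lexicon-based polarity, the base-R sketch below labels each word as positive, negative, or neutral. The five-word `lexicon` and the helper `word_polarity` are hypothetical stand-ins; real lexicons such as Bing or NRC contain thousands of entries.

```r
# Toy polarity lexicon (hypothetical; real lexicons are far larger).
lexicon <- c(won = "positive", affordable = "positive", support = "positive",
             wrong = "negative", violence = "negative")

# Label each word; anything not found in the lexicon counts as neutral.
word_polarity <- function(words) {
  labels <- unname(lexicon[tolower(words)])
  labels[is.na(labels)] <- "neutral"
  factor(labels, levels = c("negative", "neutral", "positive"))
}

words <- c("We", "won", "support", "for", "affordable", "care",
           "but", "violence", "is", "wrong")
polarity_counts <- table(word_polarity(words))
print(polarity_counts)

# Net polarity: number of positive words minus number of negative words.
net_score <- as.integer(polarity_counts["positive"]) -
  as.integer(polarity_counts["negative"])
print(net_score)
```

A word-level approach like this ignores negation and context ("not wrong" still counts as negative), which is worth keeping in mind when reading the plots.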
The first plot below shows the polarity of each spokesperson’s speech across the debate. Blue ticks mark negative opinions, and red ticks mark positive opinions. It is interesting that there is an empty chunk between sentences 1500–2500 for Trump.
The second plot below shows the number of negative, neutral, and positive words used by each spokesperson in the debate.
Overall, it seems that Biden and Trump expressed almost an equal amount of positive and negative opinions.
Positivity of each participant in the Presidential Debate
The stacked bar chart below shows each spokesperson’s overall polarity by categorizing their speeches as either positive or negative.
Based on the graph, Biden seems to use more positive words compared to Trump in the debate. Next, we will look at the spokesperson’s exact words, colored with polarity/positivity.
    # Join each word with its NRC sentiment categories
    word_sentiment <- word_counts %>%
      inner_join(get_sentiments("nrc"))

    word_sentiment %>%
      count(ï..Spokeperson, sentiment, sort = TRUE)

    # Count by person & sentiment, keeping only positive and negative
    words_person_count <- word_counts %>%
      inner_join(get_sentiments("nrc")) %>%
      filter(grepl("positive|negative", sentiment)) %>%
      count(ï..Spokeperson, sentiment)

    # Share of each sentiment within a spokesperson's words, in percent
    data_pos <- words_person_count %>%
      group_by(ï..Spokeperson) %>%
      mutate(percent_positive = 100 * n / sum(n))

    # My_Theme is a ggplot2 theme object defined elsewhere in the author's script
    ggplot(data_pos, aes(x = reorder(ï..Spokeperson, n), y = n, fill = sentiment)) +
      # Add a col layer; position = "fill" scales each bar to proportions
      geom_col(position = "fill") +
      coord_flip() +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.1)) +
      ggtitle("Polarity of participants, sorted by sentiment word frequency") +
      xlab("Person") +
      ylab("Polarity") +
      My_Theme
Positivity word clouds by each spokesperson
In the word clouds, words with negative sentiment are colored blue and words with positive sentiment are colored red. The size of each word depends on its frequency.
Trump’s most used positive word is ‘won’, and Biden’s is ‘affordable’. Both participants used the word ‘wrong’ often. The word ‘support’ also shows up in all of the word clouds.
Next, we will look at some sentiment analysis in the debate.
Sentiment analysis of the First Presidential Debate 2020
The bar chart below shows the count of words used in the debate associated with each emotion. Words associated with the positive emotion of “trust” occurred the most, whereas words associated with the negative emotion of “disgust” occurred the least in the debate.
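    library(reshape2)
    library(wordcloud)

    # Word cloud comparing negative and positive words (Bing lexicon).
    # Uncomment one filter to restrict the cloud to a single spokesperson.
    textClean_df %>%
      # filter(ï..Spokeperson == "Chris Wallace") %>%
      # filter(ï..Spokeperson == "Donald Trump") %>%
      # filter(ï..Spokeperson == "Joe Biden") %>%
      inner_join(get_sentiments("bing")) %>%
      count(word, sentiment, sort = TRUE) %>%
      # Reshape to a word x sentiment matrix, as comparison.cloud expects
      acast(word ~ sentiment, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c(color1, color2), title.size = 1.0, max.words = 50)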
    # Count words per NRC emotion for each spokesperson and plot one panel per person
    text_nrc_sub %>%
      count(ï..Spokeperson, sentiment) %>%
      mutate(sentiment = reorder(sentiment, n),
             Spokeperson = reorder(ï..Spokeperson, n)) %>%
      ggplot(aes(sentiment, n, fill = sentiment)) +
      geom_col() +
      facet_wrap(~ ï..Spokeperson, scales = "free_x", labeller = label_both) +
      theme(panel.grid.major.x = element_blank(),
            axis.text.x = element_blank()) +
      labs(x = NULL, y = NULL) +
      ggtitle("NRC Sentiment Analysis - First Presidential Debate 2020") +
      coord_flip()
Next, we will look at the exact words related to each emotion, categorized by spokesperson:
Wallace’s most frequent word is ‘sir,’ which occurred more than 40 times in the debate.
Based on the chart below, Biden frequently mentions words such as ‘vote,’ ‘deal,’ and ‘tax.’
Trump frequently mentions words such as ‘military,’ ‘law,’ and ‘job.’
    nrc_words %>%
      # Uncomment one filter to pick the spokesperson named in the title
      # filter(ï..Spokeperson == "Donald Trump") %>%
      # filter(ï..Spokeperson == "Joe Biden") %>%
      # filter(ï..Spokeperson == "Chris Wallace") %>%
      # Count by word and sentiment
      count(word, sentiment) %>%
      # Group by sentiment
      group_by(sentiment) %>%
      # Take the top 10 words (by count n) for each sentiment
      top_n(10) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ sentiment, scales = "free") +
      coord_flip() +
      ggtitle("First Presidential Debate: Sentiment word frequency - Joe Biden") +
      My_Theme
The debate caused such a stir on the internet that I thought it would be fun to analyze the event’s speeches. One particular word stood out to me from the analysis: ‘people.’ The word was mentioned frequently by both participants and seemed to be the only topic they were aligned on. ‘People’ also connects to other words such as ‘job,’ ‘affordable,’ ‘court,’ etc., that are related to the major topics discussed.