<!-- background-color: #006DAE -->
<!-- class: middle center hide-slide-number -->

<div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome, and occasionally need to be refreshed if elements do not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>.
</div>

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70 title-slide
count: false

.column.shade_black[.content[
<br>
# .monash-blue.outline-text[ETC5510: Introduction to Data Analysis]

<h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 8, part B</h2>

<br>

<h2 style="font-weight:900!important;">Text analysis Part 2</h2>

.bottom_abs.width100[

Lecturer: *Nicholas Tierney & Stuart Lee*

Department of Econometrics and Business Statistics
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu

May 2020

<br>

]
]]

<div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);">
<div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;">
<svg class="clip-svg absolute">
<defs>
<clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox">
<polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" />
</clipPath>
</defs>
</svg>
</div>
</div>

---

class: refresher

# Recap

- tidying up text
- stop_words
- (I, am, be, the, this, what, we, myself)

---

# Overview

- tidy text continued

---

class: transition

# Sentiment analysis

Sentiment analysis tags words or phrases with an emotion, and summarises these, often as a positive or negative state, over a body of text.

---

# Sentiment analysis: examples

- Examining the effect of emotional state in Twitter posts
- Determining public reactions to government policy, or new product releases
- Trying to make money in the stock market by modeling social media posts on listed companies
- Evaluating product reviews on Amazon, restaurants on Zomato, or travel options on TripAdvisor

---

# Lexicons

The `tidytext` package provides a lexicon of sentiments, based on four major sources: [AFINN](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), [bing](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), [Loughran](https://sraf.nd.edu/textual-analysis/resources/#LM%20Sentiment%20Word%20Lists), [nrc](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)

---

# Emotion

What emotion do these words elicit in you?

- summer
- hot chips
- hug
- lose
- stolen
- smile

---

# Different sources of sentiment

- The `nrc` lexicon categorizes words in a binary fashion ("yes"/"no") into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
- The `bing` lexicon categorizes words in a binary fashion into positive and negative categories.
- The `AFINN` lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

---

# Different sources of sentiment

```r
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
```

---

# Sentiment analysis

- Once you have a bag of words, you need to join the sentiment dictionary to the words data.
- In particular, the `nrc` lexicon has multiple tags per word, so you may need to use an `inner_join()`.
- `inner_join()` returns all rows from x where there are matching values in y, and all columns from x and y.
- If there are multiple matches between x and y, all combinations of the matches are returned.

---

# Exploring sentiment in Jane Austen

The `janeaustenr` package contains the full texts, ready for analysis, of Jane Austen's 6 completed novels:

1. "Sense and Sensibility"
2. "Pride and Prejudice"
3. "Mansfield Park"
4. "Emma"
5. "Northanger Abbey"
6. 
"Persuasion"

---

# Exploring sentiment in Jane Austen

```r
library(tidytext)
library(janeaustenr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
```

---

# Exploring sentiment in Jane Austen

```r
tidy_books
## # A tibble: 725,055 x 4
##    book                linenumber chapter word
##    <fct>                    <int>   <int> <chr>
##  1 Sense & Sensibility          1       0 sense
##  2 Sense & Sensibility          1       0 and
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by
##  5 Sense & Sensibility          3       0 jane
##  6 Sense & Sensibility          3       0 austen
##  7 Sense & Sensibility          5       0 1811
##  8 Sense & Sensibility         10       1 chapter
##  9 Sense & Sensibility         10       1 1
## 10 Sense & Sensibility         13       1 the
## # … with 725,045 more rows
```

---

# Count joyful words in "Emma"

```r
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # … with 293 more rows
```

---

# Count joyful words in "Emma"

"Good" is the most common joyful word, followed by "young", "friend", "hope". All make sense until you see "found". Is "found" a joyful word?

---

# Comparing lexicons

.pull-left[
- All of the lexicons have a measure of positive or negative.
- We can tag the words in Emma by each lexicon, and see if they agree.
]

.pull-right[
```r
nrc_pn <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative"))

emma_nrc <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_pn)

emma_bing <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("bing"))

emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("afinn"))
```
]

---

# Comparing lexicons

```r
emma_nrc
## # A tibble: 13,944 x 5
##    book  linenumber chapter word       sentiment
##    <fct>      <int>   <int> <chr>      <chr>
##  1 Emma          15       1 clever     positive
##  2 Emma          16       1 happy      positive
##  3 Emma          16       1 blessings  positive
##  4 Emma          17       1 existence  positive
##  5 Emma          18       1 distress   negative
##  6 Emma          21       1 marriage   positive
##  7 Emma          22       1 mistress   negative
##  8 Emma          22       1 mother     negative
##  9 Emma          22       1 mother     positive
## 10 Emma          23       1 indistinct negative
## # … with 13,934 more rows
```

---

# Comparing lexicons

```r
emma_afinn
## # A tibble: 10,901 x 5
##    book  linenumber chapter word         value
##    <fct>      <int>   <int> <chr>        <dbl>
##  1 Emma          15       1 clever           2
##  2 Emma          15       1 rich             2
##  3 Emma          15       1 comfortable      2
##  4 Emma          16       1 happy            3
##  5 Emma          16       1 best             3
##  6 Emma          18       1 distress        -2
##  7 Emma          20       1 affectionate     3
##  8 Emma          22       1 died            -3
##  9 Emma          24       1 excellent        3
## 10 Emma          25       1 fallen          -2
## # … with 10,891 more rows
```

---

# Comparing lexicons

```r
emma_nrc %>%
  count(sentiment) %>%
  mutate(n / sum(n))
## # A tibble: 2 x 3
##   sentiment     n `n/sum(n)`
##   <chr>     <int>      <dbl>
## 1 negative   4473      0.321
## 2 positive   9471      0.679

emma_bing %>%
  count(sentiment) %>%
  mutate(n / sum(n))
## # A tibble: 2 x 3
##   sentiment     n `n/sum(n)`
##   <chr>     <int>      <dbl>
## 1 negative   4809      0.402
## 2 positive   7157      0.598
```

---

# Comparing lexicons

```r
emma_afinn %>%
  mutate(sentiment = ifelse(value > 0, "positive", "negative")) %>%
  count(sentiment) %>%
  mutate(n / sum(n))
## # A tibble: 2 x 3
##   sentiment     n `n/sum(n)`
##   <chr>     <int>      <dbl>
## 1 negative   4429      0.406
## 2 positive   6472      0.594
```

---

class: transition

# Your Turn: Sentiment of Austen

- 
What are the most common "anger" words used in Emma?
- What are the most common "surprise" words used in Emma?
- Which book is the most positive? The most negative?
- Using your choice of lexicon (nrc, bing, or afinn), compute the proportion of positive words in each of Austen's books.

---

# Lab exercise: The Simpsons

Data from the popular animated TV series, The Simpsons, has been made available on [Kaggle](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data/data).

- `simpsons_script_lines.csv`: Contains the text spoken during each episode (including details about which character said it and where)
- `simpsons_characters.csv`: Contains character names and a character id

---

# Lab exercise (bonus)

Origin of Species

- Downloading books from Project Gutenberg
- Using tf-idf to look at how editions of Darwin's book have changed

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70
count: false

.column.shade_black[.content[
<br><br>
# That's it!

.bottom_abs.width100[

Lecturer: Nicholas Tierney & Stuart Lee

Department of Econometrics and Business Statistics<br>
<i class="fas fa-envelope faa-float animated "></i>
ETC5510.Clayton-x@monash.edu May 2020 ] <br /> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> ]] <div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);"> <div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;"> <svg class="clip-svg absolute"> <defs> <clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox"> <polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" /> </clipPath> </defs> </svg> </div> </div>
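
---

# Appendix: a sketch for the "Your Turn" proportion question

One way to sketch the per-book proportion of positive words is the same `inner_join()` pattern used for Emma, applied across all books. This assumes `tidy_books` from the earlier slides and uses the `bing` lexicon; it is one possible approach, not the only one.

```r
library(dplyr)
library(tidytext)

# Tag every word in every book with a bing sentiment,
# then compute within-book proportions of positive words.
tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, sentiment) %>%
  group_by(book) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  filter(sentiment == "positive") %>%
  arrange(desc(prop))
```

Sorting by `prop` puts the most positive book first; swapping in `afinn` would require recoding `value` into positive/negative first, as on the comparison slides.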