library("dplyr")
library("schrute")
library("sentiment.ai")
library("SentimentAnalysis")
library("sentimentr")
library("SnowballC")
library("syuzhet")
This summer, I am mentoring some high school students through their data science projects. For a data set, I am suggesting the repository of scripts from the TV show The Office, available through the schrute package, with sentiment scores attached.
Loading the Data
Following the documentation for the schrute package, we can get a copy of the episode scripts.
script_data <- schrute::theoffice
colnames(script_data)
[1] "index" "season" "episode" "episode_name"
[5] "director" "writer" "character" "text"
[9] "text_w_direction" "imdb_rating" "total_votes" "air_date"
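As a quick orientation (a sketch, assuming the schrute package is installed): the data set has one row per spoken line, so counting rows per season gives a feel for its size.

```r
library(dplyr)
library(schrute)

# One row per spoken line; count lines per season.
# The Office ran for nine seasons, so we expect nine groups.
schrute::theoffice |>
  dplyr::count(season)
```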
Computing Sentiment Scores
Following the documentation for the sentiment.ai package, we can compute sentiment scores for each line of text (please refer to the three sentiment packages for more information).
start_time <- Sys.time()
sentimentr_score <- sentimentr::sentiment_by(
  get_sentences(script_data$text), 1:length(script_data$text))$ave_sentiment
Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
# computation time
end_time <- Sys.time()
end_time - start_time
Time difference of 31.93237 secs
# for some reason, this computation created one extra number
sentimentr_score <- sentimentr_score[-length(sentimentr_score)]
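The warning above is about the cost of sentence boundary disambiguation. A minimal sketch (using toy dialogue lines of my own, not the schrute data) of pre-computing the boundaries once and reusing them:

```r
library(sentimentr)

# Toy dialogue lines (not from the schrute data set).
lines <- c("I love this place.", "This is the worst meeting ever.")

# Do the costly sentence boundary disambiguation once...
sentences <- get_sentences(lines)

# ...then reuse the result for scoring; by default,
# sentiment_by() averages back to one score per original line.
scores <- sentiment_by(sentences)$ave_sentiment
scores
```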
start_time <- Sys.time()
sentimentAnalysis_score <- SentimentAnalysis::analyzeSentiment(script_data$text)$SentimentQDAP
# computation time
end_time <- Sys.time()
end_time - start_time
Time difference of 2.729575 mins
This code for the sentiment.ai package isn’t working for me at the moment; I might return to it later.
# Initiate the model
# This will create the sentiment.ai.embed model
# Do this so it can be reused without recompiling - especially on GPU!
# run once overall
# sentiment.ai::install_sentiment.ai()
# run once per session
sentiment.ai::init_sentiment.ai()

start_time <- Sys.time()
sentiment_ai_score <- sentiment.ai::sentiment_score(script_data$text)

# computation time
end_time <- Sys.time()
end_time - start_time
start_time <- Sys.time()
syuzhet_score <- syuzhet::get_sentiment(script_data$text)
# computation time
end_time <- Sys.time()
end_time - start_time
Time difference of 16.43901 secs
Combining the Data
office_sentiment <- cbind(script_data,
                          sentimentAnalysis_score,
                          sentimentr_score,
                          syuzhet_score)
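With the scores attached, summaries such as average sentiment per character become short pipelines. A sketch with a toy tibble standing in for office_sentiment (the column names match the combined data set):

```r
library(dplyr)

# Toy stand-in for office_sentiment, with matching column names.
toy <- dplyr::tibble(
  character     = c("Michael", "Michael", "Jim", "Jim", "Dwight"),
  syuzhet_score = c(2.25, 0, -0.5, 0.8, -0.144)
)

# Average syuzhet score and line count per character,
# most positive characters first.
summary_by_character <- toy |>
  dplyr::group_by(character) |>
  dplyr::summarise(mean_syuzhet = mean(syuzhet_score),
                   n_lines = dplyr::n()) |>
  dplyr::arrange(dplyr::desc(mean_syuzhet))
summary_by_character
```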
Here is a glimpse of the data set.
set.seed(20240703)
office_sentiment |>
  dplyr::select(episode, character,
                sentimentAnalysis_score,
                sentimentr_score,
                syuzhet_score,
                text) |>
  dplyr::slice_sample(n = 10) |>
  dplyr::as_tibble()
# A tibble: 10 × 6
episode character sentimentAnalysis_sc…¹ sentimentr_score syuzhet_score text
<int> <chr> <dbl> <dbl> <dbl> <chr>
1 5 Darryl 0 0.447 0 You'…
2 8 Andy -0.167 0.409 -0.4 Our …
3 5 Andy -0.1 0.0559 0.25 Knoc…
4 11 Michael 0.286 0.924 2.25 Just…
5 15 Dwight 0.25 -0.144 0 Impo…
6 13 Michael 0 0 0 Heee…
7 24 Andy 0.25 0 0 Erin…
8 5 Pam 0.333 0.262 0 Come…
9 5 Phyllis 0.5 0.424 0.5 That…
10 14 Jim -0.2 0.8 -0.5 That…
# ℹ abbreviated name: ¹sentimentAnalysis_score
Saving the Data
readr::write_csv(office_sentiment, "office_sentiment.csv")