library("dplyr")
library("schrute")
library("sentiment.ai")
library("SentimentAnalysis")
library("sentimentr")
library("SnowballC")
library("syuzhet")

This summer, I am mentoring some high school students through their data science projects. As a data set, I am suggesting the repository of scripts from the TV show The Office, available through the schrute package, with sentiment scores attached to the data.
Loading the Data
Following the documentation for the schrute package, we can get a copy of the episode scripts.
script_data <- schrute::theoffice
colnames(script_data)

 [1] "index"            "season"           "episode"          "episode_name"
 [5] "director"         "writer"           "character"        "text"
 [9] "text_w_direction" "imdb_rating"      "total_votes"      "air_date"
Computing Sentiment Scores
Following the documentation for the sentiment.ai package, we can compute sentiment scores for each line of text (please refer to the documentation of the three sentiment packages for more information).
start_time <- Sys.time()
sentimentr_score <- sentimentr::sentiment_by(
  get_sentences(script_data$text), 1:length(script_data$text))$ave_sentiment

Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a raw `character` vector is passed to `text.var`. This may be costly of time and memory. It is highly recommended that the user first runs the raw `character` vector through the `get_sentences` function.
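That warning goes away if the sentence boundary disambiguation is done once up front with `get_sentences` and the result (rather than the raw character vector) is passed along. A minimal sketch, using a small toy vector as a stand-in for `script_data$text`:

```r
library(sentimentr)

# Toy stand-in for script_data$text (assumption: any character vector behaves the same way)
lines <- c("I love this place.",
           "This is the worst idea ever.",
           "Bears. Beets. Battlestar Galactica.")

# Split into sentences once, then reuse the pre-split object
sentences <- get_sentences(lines)

# By default, sentiment_by averages sentence scores back to one row per input element
scores <- sentiment_by(sentences)$ave_sentiment

length(scores)  # one average sentiment score per line of dialogue
```

Precomputing the sentences also helps if you plan to call `sentiment_by` more than once on the same text, since the costly splitting step is not repeated.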
# computation time
end_time <- Sys.time()
end_time - start_time

Time difference of 31.93237 secs
# for some reason, this computation created one extra number
sentimentr_score <- sentimentr_score[-length(sentimentr_score)]

start_time <- Sys.time()
sentimentAnalysis_score <- SentimentAnalysis::analyzeSentiment(script_data$text)$SentimentQDAP
# computation time
end_time <- Sys.time()
end_time - start_time

Time difference of 2.729575 mins
This code for the sentiment.ai package isn’t working for me at the moment; I might return to it later.
# Initiate the model
# This will create the sentiment.ai.embed model
# Do this so it can be reused without recompiling - especially on GPU!
# run once overall
# sentiment.ai::install_sentiment.ai()
# run once per session
sentiment.ai::init_sentiment.ai()
start_time <- Sys.time()
sentiment_ai_score <- sentiment.ai::sentiment_score(script_data$text)
# computation time
end_time <- Sys.time()
end_time - start_time

start_time <- Sys.time()
syuzhet_score <- syuzhet::get_sentiment(script_data$text)
# computation time
end_time <- Sys.time()
end_time - start_time

Time difference of 16.43901 secs
Combining the Data
office_sentiment <- cbind(script_data,
sentimentAnalysis_score,
sentimentr_score,
syuzhet_score)

Here is a glimpse of the data set.
set.seed(20240703)
office_sentiment |>
dplyr::select(episode, character,
sentimentAnalysis_score,
sentimentr_score,
syuzhet_score,
text) |>
dplyr::slice_sample(n = 10) |>
dplyr::as_tibble()

# A tibble: 10 × 6
episode character sentimentAnalysis_sc…¹ sentimentr_score syuzhet_score text
<int> <chr> <dbl> <dbl> <dbl> <chr>
1 5 Darryl 0 0.447 0 You'…
2 8 Andy -0.167 0.409 -0.4 Our …
3 5 Andy -0.1 0.0559 0.25 Knoc…
4 11 Michael 0.286 0.924 2.25 Just…
5 15 Dwight 0.25 -0.144 0 Impo…
6 13 Michael 0 0 0 Heee…
7 24 Andy 0.25 0 0 Erin…
8 5 Pam 0.333 0.262 0 Come…
9 5 Phyllis 0.5 0.424 0.5 That…
10 14 Jim -0.2 0.8 -0.5 That…
# ℹ abbreviated name: ¹sentimentAnalysis_score
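With the scores attached, one natural first exercise for the students is to aggregate sentiment by character. A minimal sketch with dplyr, using a small made-up stand-in for a few columns of `office_sentiment` (the real data frame built above would drop in directly):

```r
library(dplyr)

# Made-up stand-in for two columns of office_sentiment, for illustration only
office_sentiment <- tibble::tibble(
  character     = c("Michael", "Michael", "Jim", "Pam"),
  syuzhet_score = c(2.25, 0, -0.5, 0)
)

# Average syuzhet sentiment per character, with a line count for context
character_summary <- office_sentiment |>
  group_by(character) |>
  summarise(
    n_lines      = n(),
    mean_syuzhet = mean(syuzhet_score),
    .groups      = "drop"
  ) |>
  arrange(desc(mean_syuzhet))

character_summary
```

The same pattern works for grouping by `season` or `episode`, or for any of the three score columns.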
Saving the Data
readr::write_csv(office_sentiment, "office_sentiment.csv")
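As a quick sanity check before handing the file to the students, the CSV can be read back and compared against the original. A self-contained sketch of that round trip, using a toy data frame and a temporary file as stand-ins for `office_sentiment` and `office_sentiment.csv`:

```r
library(readr)

# Toy stand-in for office_sentiment, written to a temporary path
toy  <- data.frame(character = c("Michael", "Jim"), syuzhet_score = c(1.0, -0.5))
path <- tempfile(fileext = ".csv")

write_csv(toy, path)
check <- read_csv(path, show_col_types = FALSE)

# The round trip should preserve the row count and column names
stopifnot(nrow(check) == nrow(toy))
stopifnot(identical(names(check), names(toy)))
```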