library("dplyr")
library("schrute")
library("sentiment.ai")
library("SentimentAnalysis")
library("sentimentr")
library("SnowballC")
library("syuzhet")

This summer, I am mentoring some high school students through their data science projects. As a data set, I am suggesting the repository of scripts from the TV show The Office, available through the schrute package, with sentiment scores attached to the data.
Loading the Data
Following the documentation for the schrute package, we can get a copy of the episode scripts.
script_data <- schrute::theoffice
colnames(script_data)

 [1] "index"            "season"           "episode"          "episode_name"
 [5] "director"         "writer"           "character"        "text"
 [9] "text_w_direction" "imdb_rating"      "total_votes"      "air_date"
Computing Sentiment Scores
Following the documentation for the sentiment.ai package, we can compute sentiment scores for each line of text (please refer to the documentation of the three sentiment packages for more information).
start_time <- Sys.time()
sentimentr_score <- sentimentr::sentiment_by(
  get_sentences(script_data$text), 1:length(script_data$text))$ave_sentiment

Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a raw `character` vector is passed to `text.var`. This may be costly of time and memory. It is highly recommended that the user first runs the raw `character` vector through the `get_sentences` function.
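That warning goes away if the sentence boundary disambiguation is done once up front with `get_sentences` and the result (rather than the raw character vector) is passed along. A minimal sketch, using a small toy vector as a stand-in for `script_data$text`:

```r
library(sentimentr)

# Toy stand-in for script_data$text (assumption: any character vector behaves the same way)
lines <- c("I love this place.",
           "This is the worst idea ever.",
           "Bears. Beets. Battlestar Galactica.")

# Split into sentences once, then reuse the pre-split object
sentences <- get_sentences(lines)

# By default, sentiment_by averages sentence scores back to one row per input element
scores <- sentiment_by(sentences)$ave_sentiment

length(scores)  # one average sentiment score per line of dialogue
```

Precomputing the sentences also helps if you plan to call `sentiment_by` more than once on the same text, since the costly splitting step is not repeated.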
# computation time
end_time <- Sys.time()
end_time - start_time

Time difference of 31.93237 secs
# for some reason, this computation created one extra number
sentimentr_score <- sentimentr_score[-length(sentimentr_score)]

start_time <- Sys.time()
sentimentAnalysis_score <- SentimentAnalysis::analyzeSentiment(script_data$text)$SentimentQDAP
# computation time
end_time <- Sys.time()
end_time - start_time

Time difference of 2.729575 mins
This code for the sentiment.ai package isn’t working for me at the moment; I might return to it later.
# Initiate the model
# This will create the sentiment.ai.embed model
# Do this so it can be reused without recompiling - especially on GPU!
# run once overall
# sentiment.ai::install_sentiment.ai()
# run once per session
sentiment.ai::init_sentiment.ai()
start_time <- Sys.time()
sentiment_ai_score <- sentiment.ai::sentiment_score(script_data$text)
# computation time
end_time <- Sys.time()
end_time - start_time

start_time <- Sys.time()
syuzhet_score <- syuzhet::get_sentiment(script_data$text)
# computation time
end_time <- Sys.time()
end_time - start_time

Time difference of 16.43901 secs
Combining the Data
office_sentiment <- cbind(script_data,
sentimentAnalysis_score,
sentimentr_score,
syuzhet_score)

Here is a glimpse of the data set.
set.seed(20240703)
office_sentiment |>
dplyr::select(episode, character,
sentimentAnalysis_score,
sentimentr_score,
syuzhet_score,
text) |>
dplyr::slice_sample(n = 10) |>
dplyr::as_tibble()

# A tibble: 10 × 6
episode character sentimentAnalysis_sc…¹ sentimentr_score syuzhet_score text
<int> <chr> <dbl> <dbl> <dbl> <chr>
1 5 Darryl 0 0.447 0 You'…
2 8 Andy -0.167 0.409 -0.4 Our …
3 5 Andy -0.1 0.0559 0.25 Knoc…
4 11 Michael 0.286 0.924 2.25 Just…
5 15 Dwight 0.25 -0.144 0 Impo…
6 13 Michael 0 0 0 Heee…
7 24 Andy 0.25 0 0 Erin…
8 5 Pam 0.333 0.262 0 Come…
9 5 Phyllis 0.5 0.424 0.5 That…
10 14 Jim -0.2 0.8 -0.5 That…
# ℹ abbreviated name: ¹sentimentAnalysis_score
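With the scores attached, one natural first exercise for the students is to aggregate sentiment by character. A minimal sketch with dplyr, using a small made-up stand-in for a few columns of `office_sentiment` (the real data frame built above would drop in directly):

```r
library(dplyr)

# Made-up stand-in for two columns of office_sentiment, for illustration only
office_sentiment <- tibble::tibble(
  character     = c("Michael", "Michael", "Jim", "Pam"),
  syuzhet_score = c(2.25, 0, -0.5, 0)
)

# Average syuzhet sentiment per character, with a line count for context
character_summary <- office_sentiment |>
  group_by(character) |>
  summarise(
    n_lines      = n(),
    mean_syuzhet = mean(syuzhet_score),
    .groups      = "drop"
  ) |>
  arrange(desc(mean_syuzhet))

character_summary
```

The same pattern works for grouping by `season` or `episode`, or for any of the three score columns.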
Saving the Data
readr::write_csv(office_sentiment, "office_sentiment.csv")
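As a quick sanity check before handing the file to the students, the CSV can be read back and compared against the original. A self-contained sketch of that round trip, using a toy data frame and a temporary file as stand-ins for `office_sentiment` and `office_sentiment.csv`:

```r
library(readr)

# Toy stand-in for office_sentiment, written to a temporary path
toy  <- data.frame(character = c("Michael", "Jim"), syuzhet_score = c(1.0, -0.5))
path <- tempfile(fileext = ".csv")

write_csv(toy, path)
check <- read_csv(path, show_col_types = FALSE)

# The round trip should preserve the row count and column names
stopifnot(nrow(check) == nrow(toy))
stopifnot(identical(names(check), names(toy)))
```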