Scooby Doo analytics: Quantifying what’s groovy about a pop culture icon.

Author

Naveen Kannan

Published

2023-05-06

R Packages and Setup

Code
knitr::opts_chunk$set(comment = NA) 
library(knitr)
library(broom)
library(janitor) 
library(naniar)
library(glue)
library(mice)
library(MASS)
library(pscl)
library(GGally)
library(VGAM)
library(car)
library(nnet)
library(rsample)
library(yardstick)
library(patchwork)
library(rms)
library(tidymodels)
library(boot)
library(conflicted)
library(kableExtra)
library(countreg)
library(topmodels)
library(equatiomatic)
library(tidyverse)

conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")

theme_set(theme_bw()) 

1 Background

The data comes from Kaggle, and was part of Tidy Tuesday’s dataset for 2021-07-13, and can be found here.

“Scooby-Doo” is a television series that has been airing for over 50 years. Centered around Mystery Inc.,a group of iconic mystery solving detectives, including Fred, Daphne, Velma, Shaggy, and the titular Scooby-Doo, a talking dog with a penchant for consuming ridiculously tall sandwiches and Scooby snacks.

The Mystery Inc. Gang. From left to right, Velma, Shaggy, Scooby Doo, Fred, and Daphne.

The data was manually aggregated by user plummye on the Kaggle website.

2 Research Questions

Question 1:

Predicting an iconic catchphrase: Are the logistics of an episode of Scooby-Doo good predictors of the number of times an iconic catchphrase is spoken?

Question 2:

Monster Motivation: Is it possible to predict the motive of the major antagonist of an episode of Scooby-Doo, based on the nature of the monster the antagonist appears as?

3 Data Ingest and Management

3.1 Loading the Raw Data

The data comes from Kaggle, and was part of Tidy Tuesday’s dataset for 2021-07-13, and can be found here.

Code
scooby_raw <- read_csv('scooby_doo.csv', show_col_types = F)
scooby_raw
# A tibble: 603 × 75
   index series_name   network season title imdb  engagement date_aired run_time
   <dbl> <chr>         <chr>   <chr>  <chr> <chr> <chr>      <date>        <dbl>
 1     1 Scooby Doo, … CBS     1      What… 8.1   556        1969-09-13       21
 2     2 Scooby Doo, … CBS     1      A Cl… 8.1   479        1969-09-20       22
 3     3 Scooby Doo, … CBS     1      Hass… 8     455        1969-09-27       21
 4     4 Scooby Doo, … CBS     1      Mine… 7.8   426        1969-10-04       21
 5     5 Scooby Doo, … CBS     1      Deco… 7.5   391        1969-10-11       21
 6     6 Scooby Doo, … CBS     1      What… 8.4   384        1969-10-18       21
 7     7 Scooby Doo, … CBS     1      Neve… 7.6   358        1969-10-25       21
 8     8 Scooby Doo, … CBS     1      Foul… 8.2   358        1969-11-01       21
 9     9 Scooby Doo, … CBS     1      The … 8.1   371        1969-11-08       21
10    10 Scooby Doo, … CBS     1      Bedl… 8     346        1969-11-15       21
# ℹ 593 more rows
# ℹ 66 more variables: format <chr>, monster_name <chr>, monster_gender <chr>,
#   monster_type <chr>, monster_subtype <chr>, monster_species <chr>,
#   monster_real <chr>, monster_amount <dbl>, caught_fred <chr>,
#   caught_daphnie <chr>, caught_velma <chr>, caught_shaggy <chr>,
#   caught_scooby <chr>, captured_fred <chr>, captured_daphnie <chr>,
#   captured_velma <chr>, captured_shaggy <chr>, captured_scooby <chr>, …

3.2 Cleaning the Data

3.2.1 Selecting Variables We’ll Use

The variables that are intended to be used for the two models were selected and placed in a tibble titled scooby_filtered.

Code
scooby_filtered <- scooby_raw |>
  select(index, imdb, engagement, network, season, rooby_rooby_roo,
         motive, monster_amount,monster_gender, monster_real) 
scooby_filtered  <- scooby_filtered %>% replace(.=="NULL", NA) # replace with NA
scooby_filtered
# A tibble: 603 × 10
   index imdb  engagement network season rooby_rooby_roo motive   monster_amount
   <dbl> <chr> <chr>      <chr>   <chr>  <chr>           <chr>             <dbl>
 1     1 8.1   556        CBS     1      1               Theft                 1
 2     2 8.1   479        CBS     1      0               Theft                 1
 3     3 8     455        CBS     1      0               Treasure              1
 4     4 7.8   426        CBS     1      0               Natural…              1
 5     5 7.5   391        CBS     1      0               Competi…              1
 6     6 8.4   384        CBS     1      1               Extorti…              1
 7     7 7.6   358        CBS     1      1               Competi…              1
 8     8 8.2   358        CBS     1      1               Safety                1
 9     9 8.1   371        CBS     1      1               Counter…              1
10    10 8     346        CBS     1      0               Competi…              1
# ℹ 593 more rows
# ℹ 2 more variables: monster_gender <chr>, monster_real <chr>

3.2.2 Converting Variable Types

Each variable was then converted into the appropriate variable type. These variables were then placed in a tibble titled scooby_converted.

Code
scooby_converted <- scooby_filtered |>
  mutate(index = as.character(index),
         imdb = as.numeric(imdb), 
         engagement = as.numeric(engagement),
         network = as_factor(network),
         season = as_factor(season),
         rooby_rooby_roo = as.double(rooby_rooby_roo),
         motive = as_factor(motive),
         monster_gender = as_factor(monster_gender),
         monster_real = as_factor(monster_real))
scooby_converted
# A tibble: 603 × 10
   index  imdb engagement network season rooby_rooby_roo motive   monster_amount
   <chr> <dbl>      <dbl> <fct>   <fct>            <dbl> <fct>             <dbl>
 1 1       8.1        556 CBS     1                    1 Theft                 1
 2 2       8.1        479 CBS     1                    0 Theft                 1
 3 3       8          455 CBS     1                    0 Treasure              1
 4 4       7.8        426 CBS     1                    0 Natural…              1
 5 5       7.5        391 CBS     1                    0 Competi…              1
 6 6       8.4        384 CBS     1                    1 Extorti…              1
 7 7       7.6        358 CBS     1                    1 Competi…              1
 8 8       8.2        358 CBS     1                    1 Safety                1
 9 9       8.1        371 CBS     1                    1 Counter…              1
10 10      8          346 CBS     1                    0 Competi…              1
# ℹ 593 more rows
# ℹ 2 more variables: monster_gender <fct>, monster_real <fct>

3.2.3 Sampling the Data

scooby_converted was then filtered down to a tibble called scooby_sample that had complete cases on both the outcome variables. (rooby-rooby_roo and motive.)

Code
scooby_sample <- scooby_converted |>
  filter(complete.cases(rooby_rooby_roo, motive))
scooby_sample
# A tibble: 535 × 10
   index  imdb engagement network season rooby_rooby_roo motive   monster_amount
   <chr> <dbl>      <dbl> <fct>   <fct>            <dbl> <fct>             <dbl>
 1 1       8.1        556 CBS     1                    1 Theft                 1
 2 2       8.1        479 CBS     1                    0 Theft                 1
 3 3       8          455 CBS     1                    0 Treasure              1
 4 4       7.8        426 CBS     1                    0 Natural…              1
 5 5       7.5        391 CBS     1                    0 Competi…              1
 6 6       8.4        384 CBS     1                    1 Extorti…              1
 7 7       7.6        358 CBS     1                    1 Competi…              1
 8 8       8.2        358 CBS     1                    1 Safety                1
 9 9       8.1        371 CBS     1                    1 Counter…              1
10 10      8          346 CBS     1                    0 Competi…              1
# ℹ 525 more rows
# ℹ 2 more variables: monster_gender <fct>, monster_real <fct>

3.2.4 Working with Categorical Predictors

The categorical variables were then individually examined to assess the levels and the quantity of each level.

Code
scooby_sample |> 
  tabyl(motive)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
motive n percent
Theft 124 0.2317757
Treasure 54 0.1009346
Natural Resource 26 0.0485981
Competition 168 0.3140187
Extortion 4 0.0074766
Safety 2 0.0037383
Counterfeit 6 0.0112150
Inheritance 8 0.0149533
Smuggling 22 0.0411215
Preservation 11 0.0205607
Experimentation 4 0.0074766
Food 11 0.0205607
Trespassing 15 0.0280374
Assistance 5 0.0093458
Abduction 12 0.0224299
Haunt 1 0.0018692
Anger 3 0.0056075
Imagination 6 0.0112150
Bully 1 0.0018692
Loneliness 1 0.0018692
Training 1 0.0018692
Conquer 42 0.0785047
Mistake 1 0.0018692
Automated 1 0.0018692
Production 1 0.0018692
Entertainment 3 0.0056075
Simulation 2 0.0037383

The levels of motive were significantly collapsed into appropriate levels, and were then releveled according to size.

Code
scooby_sample <- scooby_sample |>
  mutate(motive = fct_collapse(motive, 
                               theft = "Theft",
                               treasure = "Treasure",
                               natural_resource = "Natural Resource",
                               competition = "Competition",
                               conquer = "Conquer",
                               indirect_monetary_gain = c("Extortion","Counterfeit","Inheritance","Smuggling"),
                               property_related = c("Preservation","Trespassing"),
                               harm_others = c("Abduction","Bully"),
                               indirect_self_interest = c("Safety","Experimentation","Food",
                                                          "Assistance","Loneliness","Entertainment","Training"),
                               misc_motive =c("Haunt","Anger","Imagination","Training",
                                              "Mistake","Automated","Production","Simulation")))
scooby_sample <- scooby_sample |>
  mutate(motive = fct_collapse(motive,
                               misc_motive = c("misc_motive","harm_others")))

scooby_sample <- scooby_sample |>
  mutate(motive = fct_collapse(motive,
                               indirect_self_interest = c("indirect_self_interest", "natural_resource","indirect_monetary_gain","property_related"),
                               misc_motive = c("misc_motive"))) |>
  mutate(motive = fct_relevel(motive, c("competition","theft","indirect_self_interest","treasure","conquer","misc_motive")))


scooby_sample |> 
  tabyl(motive)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
motive n percent
competition 168 0.3140187
theft 124 0.2317757
indirect_self_interest 118 0.2205607
treasure 54 0.1009346
conquer 42 0.0785047
misc_motive 29 0.0542056

The variable monster_gender was then examined. The variable originally describes the gender (as referred to by the Mystery Inc. gang) of the monster/monsters present in each episode.

Code
scooby_sample|>
  tabyl(monster_gender)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
monster_gender n percent valid_percent
Male 312 0.5831776 0.6178218
Female 28 0.0523364 0.0554455
Male,Male,Male 23 0.0429907 0.0455446
Male,Female 10 0.0186916 0.0198020
Male,Male 47 0.0878505 0.0930693
Female,Male 5 0.0093458 0.0099010
Male,Male,Male,Male,None 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male 4 0.0074766 0.0079208
Male,Female,Male,Male,Male,Male,Male 1 0.0018692 0.0019802
Male,Male,Male,Male 14 0.0261682 0.0277228
Male,Male,Male,Male,Male,Male,Male,Male 4 0.0074766 0.0079208
Male,Female,Male 4 0.0074766 0.0079208
Female,Male, 1 0.0018692 0.0019802
Male,Male,Male,Female 1 0.0018692 0.0019802
Male,Female,Male,Male,Male 2 0.0037383 0.0039604
Male,Male,Male,Female,Male,Male 1 0.0018692 0.0019802
Female,Male,Male,Female 1 0.0018692 0.0019802
Male,Male,Female,Male 3 0.0056075 0.0059406
Male,Male,Male,Male,Female 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male 4 0.0074766 0.0079208
Male,Male,Female 4 0.0074766 0.0079208
Male,Male,Male,Male,Male 2 0.0037383 0.0039604
Female,Female,Female,Female 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male 5 0.0093458 0.0099010
Female,Male,Male,Male 1 0.0018692 0.0019802
Male,Male,Male,Female,Female,Female,Male,Male,Male,Female,Male,Male,Male,Male,Male,Male,Male,Male,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Female,Female,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Female,Female,Female,Male 1 0.0018692 0.0019802
Male,Male,Male,Female,Male,Male,Female,Female 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male 4 0.0074766 0.0079208
Female,Female 1 0.0018692 0.0019802
Male,Female,Female,Male,Male 2 0.0037383 0.0039604
Female,Male,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Female 1 0.0018692 0.0019802
Male,Male,Male,Female,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male 2 0.0037383 0.0039604
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male 2 0.0037383 0.0039604
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male 2 0.0037383 0.0039604
Female,Male,Female,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Female,Male,Male,Male 1 0.0018692 0.0019802
Male,Female,Male,Male 1 0.0018692 0.0019802
Male,Male,Male,Male,Male,Male,Male,Female,Male,Male,Male,Male 1 0.0018692 0.0019802
NA 30 0.0560748 NA

If a particular value of the variable contained the term “Female”, it was then converted into the term “female” alone, and “male” alone otherwise. Missing values were left untouched. The variable was then releveled to have “male” as the baseline level.

Code
# Collapsing values into a binary variable
scooby_sample$monster_gender <- 
  ifelse(grepl("Female", scooby_sample$monster_gender), "female", ifelse(grepl("Male", scooby_sample$monster_gender), "male", scooby_sample$monster_gender))
scooby_sample <- scooby_sample |>
  mutate(monster_gender = fct_relevel(monster_gender, c("male","female")))
scooby_sample|>
  tabyl(monster_gender)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
monster_gender n percent valid_percent
male 427 0.7981308 0.8455446
female 78 0.1457944 0.1544554
NA 30 0.0560748 NA

The network variable describes the television networks that ran the relevant episode of Scooby-Doo.

Code
scooby_sample|>
  tabyl(network)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
network n percent
CBS 49 0.0915888
ABC 218 0.4074766
Syndication 3 0.0056075
TBC 0 0.0000000
Cartoon Network 82 0.1532710
Warner Home Video 38 0.0710280
Warner Bros. Picture 3 0.0056075
Adult Swim 1 0.0018692
The WB 41 0.0766355
The CW 27 0.0504673
Boomerang 73 0.1364486

The levels of the variables were then collapsed to appropriate levels. The variable was then releveled according to level size.

Code
scooby_sample <- scooby_sample |>
  mutate(network = fct_collapse(network, 
                               other_networks = c("Syndication","TBC","Adult Swim"),
                               warner_brothers = c("Warner Home Video", "Warner Bros. Picture", "The WB"),
                               cw = "The CW",
                               cbs = "CBS",
                               abc = "ABC",
                               cartoon_network = "Cartoon Network",
                               boomerang = "Boomerang")) |>
  mutate(network = fct_relevel(network, c("abc","cartoon_network","warner_brothers","boomerang","cbs","cw","other_networks")))
scooby_sample <- scooby_sample |>
  mutate(network = fct_collapse(network, 
                               other_networks = c("cbs","cw","other_networks")))
                               
scooby_sample|>
  tabyl(network)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
network n percent
abc 218 0.4074766
cartoon_network 82 0.1532710
warner_brothers 82 0.1532710
boomerang 73 0.1364486
other_networks 80 0.1495327

The season variable describes the season of the episode of Scooby-Doo.

Code
scooby_sample|>
  tabyl(season)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
season n percent
1 275 0.5140187
2 143 0.2672897
Crossover 7 0.0130841
3 54 0.1009346
Movie 40 0.0747664
Special 12 0.0224299
4 4 0.0074766

The levels were renamed, and then the variable itself was releveled according to level size.

Code
scooby_sample <- scooby_sample |>
  mutate(season = fct_collapse(season, 
                               season_1 = "1",
                               season_2 = "2",
                               season_3 = "3",
                               season_4 = "4",
                               crossover = "Crossover",
                               movie = "Movie",
                               special = "Special")) |>
  mutate(season = fct_relevel(season, c("season_1","season_2","season_3","movie","special","crossover","season_4")))
scooby_sample <- scooby_sample |>
  mutate(season = fct_collapse(season, 
                               season_3_4 = c("season_3","season_4"),
                               special_crossover = c("special","crossover")))
scooby_sample|>
  tabyl(season)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
season n percent
season_1 275 0.5140187
season_2 143 0.2672897
season_3_4 58 0.1084112
movie 40 0.0747664
special_crossover 19 0.0355140

The monster_real variable describes if the monster was actually real or if was just a man in a mask, mechanically controlled or trained/hypnotized by a human.

The levels were renamed.

Code
scooby_sample|>
  tabyl(monster_real)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
monster_real n percent valid_percent
FALSE 402 0.7514019 0.7960396
TRUE 103 0.1925234 0.2039604
NA 30 0.0560748 NA
Code
scooby_sample <- scooby_sample |>
  mutate(monster_real = fct_collapse(monster_real, 
                               not_real = "FALSE",
                               real = "TRUE"))
scooby_sample|>
  tabyl(monster_real)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
monster_real n percent valid_percent
not_real 402 0.7514019 0.7960396
real 103 0.1925234 0.2039604
NA 30 0.0560748 NA

3.2.5 Working with Quantitative Variables

Each individual quantitative variable was then described.

Code
scooby_sample |>
  select(imdb) |>
  describe()
select(scooby_sample, imdb) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
imdb 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15       48    0.996    7.347   0.7684      5.9      6.5 
     .25      .50      .75      .90      .95 
     7.1      7.4      7.7      8.2      8.4 

lowest : 4.6 4.8 4.9 5   5.1, highest: 9   9.1 9.2 9.3 9.6
--------------------------------------------------------------------------------
Code
scooby_sample |>
  select(engagement) |>
  describe()
select(scooby_sample, engagement) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
engagement 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15      229        1    645.2     1153     17.0     22.9 
     .25      .50      .75      .90      .95 
    31.0     75.0    166.2    384.6   1891.4 

lowest :      7      9     11     12     13, highest:   6929  12620  20226  50847 100951
--------------------------------------------------------------------------------
Code
scooby_sample |>
  select(monster_amount) |>
  describe()
select(scooby_sample, monster_amount) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
monster_amount 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     535        0       16    0.739    1.944    1.781        0        1 
     .25      .50      .75      .90      .95 
       1        1        2        4        7 
                                                                            
Value          0     1     2     3     4     5     6     7     8     9    10
Frequency     30   341    65    32    23     7     5     7     6     4     3
Proportion 0.056 0.637 0.121 0.060 0.043 0.013 0.009 0.013 0.011 0.007 0.006
                                        
Value         11    12    13    15    19
Frequency      4     3     3     1     1
Proportion 0.007 0.006 0.006 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
Code
scooby_sample |>
  select(rooby_rooby_roo) |>
  describe()
select(scooby_sample, rooby_rooby_roo) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
rooby_rooby_roo 
       n  missing distinct     Info     Mean      Gmd 
     535        0        8    0.782    0.729   0.7098 
                                                          
Value          0     1     2     3     4     5     6     7
Frequency    210   289    19    11     3     1     1     1
Proportion 0.393 0.540 0.036 0.021 0.006 0.002 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

3.2.6 Identifying Missingness

Code
scooby_sample |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
variable n_miss pct_miss
monster_gender 30 5.607477
monster_real 30 5.607477
imdb 15 2.803738
engagement 15 2.803738
index 0 0.000000
network 0 0.000000
season 0 0.000000
rooby_rooby_roo 0 0.000000
motive 0 0.000000
monster_amount 0 0.000000

The variables monster_gender, monster_real, imdb and engagement have missing values.

4 Code Book and Description

Variable Description Type Analytic Role Missing Values
index Unique Episode/Movie Identifier Character Variable Unique Identifier 0
imdb The IMDB (Internet Movie Database) rating for the episode/movie. Quantitative Variable Predictor 15
engagement The number of people who voted on IMDB to rate the episode/movie. Quantitative Variable Predictor 15
network The network/producer that produced the epsiode/movie of Scooby-Doo. Multi-categorical Variable with 5 levels Predictor 0
season The season the episode belonged to, or if it was a movie or a special. Multi-categorical Variable with 5 levels Predictor 0
rooby-rooby-roo The number of times the phrase, “Rooby Rooby Roo!” was said in an episode/movie Quantitative count variable Count Outcome 0
motive The motivation of the primary antagonist of the episode/movie of Scooby-Doo. Multi-categorical variable with 6 levels Multi-categorical Outcome 0
monster_amount The number of monsters that appeared in an episode/movie of Scooby-Doo Quantitative variable Predictor 0
monster_gender Describes if the monsters seen were all male, or if they included a female monster in their count. Binary Variable Predictor 30
monster_real Describes if the monster was actually real, or if it was a human in a suit, a robot, or a brainwashed person. Binary Variable Predictor 30

The levels of the categorical variables and their description are as follows:

network

The network/producer that produced the epsiode/movie of Scooby-Doo.

Level Description Distinct Observations
abc Episode/Movie was produced by ABC. 218
cartoon_network Episode/Movie was produced by Cartoon Network. 82
warner_brothers Episode/Movie was produced by Warner Brothers. 82
boomerang Episode/Movie was produced by Boomerang. 73
other_networks Episode/Movie was produced by networks other than the four described above. 80

season

The season the episode belonged to, or if it was a movie or a special.

Level Description Distinct Observations
season_1 Episode was from Season 1. 275
season_2 Episode was from Season 2. 143
season_3_4 Episode was from Season 3 or 4. 58
movie Scooby-Doo media was a movie. 40
special_crossover Scooby-Doo media was a special or a cross-over. 19

motive

The motivation of the primary antagonist of the episode/movie of Scooby-Doo.

Level Description Distinct Observations
competition The antagonist intended to win in a competition of sorts. 168
theft The antagonist intended to steal an item of value. 124
indirect_self_interest The antagonist had a motive that indirectly furthered their self interest. 118
treasure The antagonist intended to obtain treasure of some kind. 54
conquer The antagonist intended to conquer something or someone. 42
misc_motive The antagonist had a more nebulous, rare motivation. 29

monster_gender

Were the monsters seen all male, or did they include a female monster in their count?

The monster’s gender is defined by the pronouns used by the Mystery Inc. gang to refer to them. Monsters referred to as he/him were categorized as male, and monsters referred to as she/her were categorized as female.

Level Description Distinct Observations
male The monsters seen were all male. 427
female There was at least one female monster in the episode. 78

monster_real

Was the monster seen actually real, or was it a human in a suit, a robot, or a brainwashed person?

Level Description Distinct Observations
not_real The monster was not actually real. 402
real The monster turned out to be real. 103

5 Analyses

5.1 Analysis 1

5.1.1 The Research Question

Predicting an iconic catchphrase: Are the logistics of an episode of Scooby-Doo good predictors of the number of times an iconic catchphrase is spoken?

5.1.2 The Outcome

A numerical summary of the outcome variable was obtained.

Code
scooby_sample |>
  select(rooby_rooby_roo) |>
  describe()
select(scooby_sample, rooby_rooby_roo) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
rooby_rooby_roo 
       n  missing distinct     Info     Mean      Gmd 
     535        0        8    0.782    0.729   0.7098 
                                                          
Value          0     1     2     3     4     5     6     7
Frequency    210   289    19    11     3     1     1     1
Proportion 0.393 0.540 0.036 0.021 0.006 0.002 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

Most of the values are 0s and 1s, with the other values being fewer in comparison.

5.1.3 Visualizing the Outcome Variable

The outcome variable’s distribution was visualized using ggplot.

Code
p1 <- 
  ggplot(scooby_sample, aes(x = rooby_rooby_roo)) +
  geom_histogram(bins = 25, 
  fill = "#2b8cbe", col = "white") +
  labs( x = "Number of times the catchphrase was said", 
        y = "Count",
        title = str_wrap("Distribution of the number of times the catchphrase 'Rooby-Rooby-Roo!' was spoken in each episode of Scooby-Doo", width = 90),
         subtitle = glue("Across ", nrow(scooby_sample), 
                           " episodes/movies/specials of the entirety of the Scooby-Doo show."),
         caption = "")


p1

5.1.4 The Predictors

The predictors for the outcome are imdb, engagement, network and season. Each predictor variable was individually described.

  • imdb is a continuous, quantitative variable.
Code
scooby_sample |>
  select(imdb) |>
  describe()
select(scooby_sample, imdb) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
imdb 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15       48    0.996    7.347   0.7684      5.9      6.5 
     .25      .50      .75      .90      .95 
     7.1      7.4      7.7      8.2      8.4 

lowest : 4.6 4.8 4.9 5   5.1, highest: 9   9.1 9.2 9.3 9.6
--------------------------------------------------------------------------------
  • engagement is a continuous, quantitative variable.
Code
scooby_sample |>
  select(engagement) |>
  describe()
select(scooby_sample, engagement) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
engagement 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15      229        1    645.2     1153     17.0     22.9 
     .25      .50      .75      .90      .95 
    31.0     75.0    166.2    384.6   1891.4 

lowest :      7      9     11     12     13, highest:   6929  12620  20226  50847 100951
--------------------------------------------------------------------------------
  • network is a multi-categorical variable.
Code
scooby_sample |>
  tabyl(network) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
network n percent
abc 218 0.4074766
cartoon_network 82 0.1532710
warner_brothers 82 0.1532710
boomerang 73 0.1364486
other_networks 80 0.1495327

season is a multi-categorical variable.

Code
scooby_sample |>
  tabyl(season) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
season n percent
season_1 275 0.5140187
season_2 143 0.2672897
season_3_4 58 0.1084112
movie 40 0.0747664
special_crossover 19 0.0355140

The outcome was then visualized against each predictor.

5.2 rooby_rooby_roo vs network

Code
tmp_data <- scooby_sample

tmp_data$network <-  fct_recode(tmp_data$network,
                               ABC = "abc",
                               "Cartoon Network" = "cartoon_network",
                               "Warner Brothers" = "warner_brothers",
                               "Boomerang" = "boomerang",
                               "Other Networks" = "other_networks")
ggplot(tmp_data, aes(x = network, y = rooby_rooby_roo)) +
    geom_violin(fill = "#2b8cbe", alpha = 0.3, scale = "width")+ 
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  labs(y = "Number of times 'Rooby-Rooby-Roo!' was said",
       x = "Network/Producer",
       title = str_wrap("Mapping the distribution of the outcome variable across the networks/producers who produced Scooby-Doo.",width = 90),
       subtitle = "Across 535 episodes of Scooby-Doo.") 

Warner Brothers appears to have a higher number of Rooby-Rooby-Roo’s compared to the other networks.

5.3 rooby_rooby_roo vs season

Code
tmp_data <- scooby_sample

tmp_data$season <-  fct_recode(tmp_data$season,
                               "Season 1" = "season_1",
                               "Season 2" = "season_2",
                               "Season 3 and 4" = "season_3_4",
                               "Movie" = "movie",
                               "Special / Crossover" = "special_crossover")
ggplot(tmp_data, aes(x = season, y = rooby_rooby_roo)) +
    geom_violin(fill = "#2b8cbe", alpha = 0.3, scale = "width")+ 
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  labs(y = "Number of times 'Rooby-Rooby-Roo!' was said",
       x = "Network/Producer",
       title = str_wrap("Mapping the distribution of the outcome variable across the various seasons/movies of Scooby-Doo.",width = 90),
       subtitle = "Across 535 episodes of Scooby-Doo.") 

Movies appear to have a higher number of Rooby-Rooby-Roos compared to the other seasons/specials/crossovers.

5.4 rooby_rooby_roo vs imdb

Code
ggplot(tmp_data, aes(y = as.factor(rooby_rooby_roo), x = imdb)) +
    geom_violin(fill = "#2b8cbe", alpha = 0.4, scale = "width") +
   
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  labs(y = "Number of times 'Rooby-Rooby-Roo!' was said",
       x = "IMDB Rating",
       title = str_wrap("Mapping the distribution of the outcome variable across the IMDB rating",width = 90),
       subtitle = "Across 535 episodes of Scooby-Doo.") 

Scooby-Doo media with moreRooby-Rooby-Roo’s appear to have lower IMDB ratings.

5.5 rooby_rooby_roo vs engagement

Code
ggplot(tmp_data, aes(x = as.factor(rooby_rooby_roo), y = engagement)) +
  geom_violin(fill = "#2b8cbe", alpha = 0.4, scale = "width") +
   
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  scale_x_discrete(breaks = c(0,1,2,3,4,5,6,7,8,9,10)) +
  labs(x = "Number of times 'Rooby-Rooby-Roo!' was said",
       y = "IMDB Engagement Score",
       title = str_wrap("Mapping the distribution of the outcome variable across the IMDB engagement metric.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.") 

A significant outlier is seen for the engagement score. This outlier is the live action Scooby-Doo movie, titled “Scooby-Doo”, released in 2002.

Code
scooby_sample |>
  filter(engagement > 100000) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
index imdb engagement network season rooby_rooby_roo motive monster_amount monster_gender monster_real
342 5.1 100951 warner_brothers movie 3 conquer 10 female real

This live action movie was considered a cult classic soon after it’s release due to the plot being more intense that the usual fare for Scooby-Doo movies, with a fair share of more adult jokes appearing as well. The movie was rated on IMDB by over 100,000 people, indicating that it stands several degrees above the other Scooby-Doo media in terms of engagement on IMDB.

5.6 Visualizing the distribution of the zero counts across the network predictor

The distribution of zero counts across the network predictor was visualized.

Code
tmp_data <- scooby_sample

tmp_data$network <-  fct_recode(tmp_data$network,
                               ABC = "abc",
                               "Cartoon Network" = "cartoon_network",
                               "Warner Brothers" = "warner_brothers",
                               "Boomerang" = "boomerang",
                               "Other Networks" = "other_networks")
tmp_data |> group_by(network) |>
    dplyr::summarize(n = n(), 
              percent_0s = round(100*sum(rooby_rooby_roo == 0)/n,1)) |>
    ggplot(aes(y = network, x = percent_0s)) +
    geom_label(aes(label = percent_0s ), col = "#2b8cbe") +
    labs(x = "% of episodes/movies of Scooby-Doo with zero occurences of 'Rooby-Rooby-Roo!'.",
         y = "Network/Producer",
         title = str_wrap("Visualizing the percentage of episodes/movies with a zero count of `rooby_rooby_roo` across the different networks.", width = 80))

5.7 Visualizing the distribution of the zero counts across the season predictor

The distribution of zero counts across the season predictor was visualized.

Code
tmp_data <- scooby_sample

tmp_data$season <-  fct_recode(tmp_data$season,
                               "Season 1" = "season_1",
                               "Season 2" = "season_2",
                               "Season 3 and 4" = "season_3_4",
                               "Movie" = "movie",
                               "Special / Crossover" = "special_crossover")
tmp_data |> group_by(season) |>
    dplyr::summarize(n = n(), 
              percent_0s = round(100*sum(rooby_rooby_roo == 0)/n,1)) |>
    ggplot(aes(y = season, x = percent_0s)) +
    geom_label(aes(label = percent_0s), col = "#2b8cbe") +
    labs(x = "% of episodes/movies of Scooby-Doo with zero occurences of 'Rooby-Rooby-Roo!'.",
         y = "Season / Movie / Special",
         title = str_wrap("Visualizing the percentage of episodes/movies with a zero count of `rooby_rooby_roo` across different seasons/Movies/Specials of Scooby-Doo.", width = 60))

5.8 Preliminary Thoughts

I believe that higher IMDB scores will be associated with higher counts of Rooby-Rooby-Roo, and lower IMDB scores will be associated with lower counts of Rooby-Rooby-Roo.

I believe that movies will have higher counts of Rooby-Rooby-Roo’s.

I believe that media produced by Cartoon Network and Boomerang will have lower counts of Rooby-Rooby-Roo’s.

5.8.1 Missingness

A seperate dataset, containing the identifier variable, along with the outcome and the predictor variables was created and stored in the tibble scooby_q1. This tibble was then assessed for missingness using miss_var_summary.

Code
scooby_q1 <- scooby_sample |>
  select(index,rooby_rooby_roo, imdb, engagement, network, season)
scooby_q1 |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
variable n_miss pct_miss
imdb 15 2.803738
engagement 15 2.803738
index 0 0.000000
rooby_rooby_roo 0 0.000000
network 0 0.000000
season 0 0.000000

The variables imdb and engagement have 15 missing values each.

The data can be assumed to be MAR (Missing At Random), since the missing data are not randomly distributed. It cannot be assumed to be MNAR (Missing Not At Random) since there does not appear to be a relationship between the magnitude of a value and it’s missingness (or inclusion in the non-missing data), since there is no preponderance of values with lower or higher magnitude being missing.

Single imputation was then performed on the dataset, to account for missingness. The complete dataset, containing the data for both the analyses had it’s missing values imputed.

5.8.2 Imputing Missing Variables

Using the mice function, missing values in the scooby_q1 tibble were singly imputed and stored in the q1_simp tibble.

Code
set.seed(4322023)
q1_mice1 <- mice(scooby_q1, m = 10, print = FALSE) 
Warning: Number of logged events: 1
Code
q1_simp <- mice::complete(q1_mice1)

miss_var_summary(q1_simp)
# A tibble: 6 × 3
  variable        n_miss pct_miss
  <chr>            <int>    <dbl>
1 index                0        0
2 rooby_rooby_roo      0        0
3 imdb                 0        0
4 engagement           0        0
5 network              0        0
6 season               0        0

5.8.3 Splitting the data

The data was then split into a training (q1_train) and testing (q1_test) dataset using the initial_split function.

Code
set.seed(4322023)
scoob_split <- initial_split(q1_simp, prop = 0.75)
q1_train = training(scoob_split)
q1_test = testing(scoob_split)
dim(q1_train)
[1] 401   6
Code
dim(q1_test)
[1] 134   6

5.8.4 Assessing Collinearity

The ggpairs function was used to plot a scatterplot matrix to assess for potential collinearity.

Code
ggpairs(q1_simp, columns = c("rooby_rooby_roo", "imdb",
                             "engagement", "network",
                             "season"))

The variables imdb and engagement exhibit collinearity with each other and the outcome rooby-rooby-roo.

Code
mod_poisson_col <- glm(rooby_rooby_roo ~ imdb +  engagement + network + season,data = q1_train, family = "poisson")

car::vif(mod_poisson_col)
               GVIF Df GVIF^(1/(2*Df))
imdb       1.611302  1        1.269371
engagement 1.272811  1        1.128189
network    3.401606  4        1.165361
season     3.689599  4        1.177260

5.8.5 Non-Linear Terms

A Spearman plot was then created to assess if non-linear terms would be needed in the models.

Code
sp2 <- spearman2(rooby_rooby_roo ~ imdb + engagement + network +season, data = q1_simp)

plot(sp2)

The plot suggests the following:

  • An interaction term between the main effects of network and imdb, which would add 4 degrees of freedom to the model.

  • A restricted cubic spline applied to imdb.

  • An interaction term between imdb and season, which would add 4 degrees of freedom to the model.

The Poisson model created for the purpose of testing collinearity was then re-fit using Glm() from the rms package to obtain the number of degrees of freedom in the model without adding a non-linear term.

Code
d <- datadist(q1_simp)
options(datadist = "d")

mod_poisson_rms <- Glm(rooby_rooby_roo ~ imdb + engagement + network +season,
                     family = poisson(), 
                     data = q1_simp, 
                     x = T, y = T)

mod_poisson_rms
General Linear Model

Glm(formula = rooby_rooby_roo ~ imdb + engagement + network + 
    season, family = poisson(), data = q1_simp, x = T, y = T)

                      Model Likelihood    
                            Ratio Test    
         Obs 535    LR chi2      90.48    
Residual d.f.524    d.f.            10    
         g 0.496    Pr(> chi2) <0.0001    

                         Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                 1.8019 0.6048  2.98  0.0029  
imdb                     -0.2779 0.0819 -3.39  0.0007  
engagement                0.0000 0.0000  0.32  0.7470  
network=cartoon_network  -0.4735 0.1931 -2.45  0.0142  
network=warner_brothers  -0.0361 0.1845 -0.20  0.8447  
network=boomerang        -0.5819 0.2147 -2.71  0.0067  
network=other_networks   -0.2212 0.1684 -1.31  0.1890  
season=season_2          -0.1466 0.1435 -1.02  0.3068  
season=season_3_4         0.0265 0.1691  0.16  0.8754  
season=movie              0.6389 0.2305  2.77  0.0056  
season=special_crossover -0.1181 0.3072 -0.38  0.7008  

A restricted cubic spline with 3 knots applied to imdb would add a single degree of freedom to the model, spending a total of 11 degrees of freedom.

5.8.6 Fitting a Poisson Model

A Poisson model, mod_poisson, was fit using glm().

Code
mod_poisson <- glm(rooby_rooby_roo ~ rcs(imdb,3) +  engagement + network + season ,data = q1_train, family = "poisson")

5.8.7 Model Coefficients

The model’s coefficients were obtained using the tidy function.

Code
tidy(mod_poisson) |>
  select(term, estimate, std.error, p.value) |>
  kbl(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
term estimate std.error p.value
(Intercept) 1.719 1.075 0.110
rcs(imdb, 3)imdb -0.274 0.152 0.072
rcs(imdb, 3)imdb' 0.067 0.253 0.792
engagement 0.000 0.000 0.811
networkcartoon_network -0.454 0.246 0.065
networkwarner_brothers -0.053 0.225 0.814
networkboomerang -0.515 0.242 0.034
networkother_networks -0.188 0.226 0.404
seasonseason_2 -0.208 0.173 0.229
seasonseason_3_4 -0.048 0.205 0.816
seasonmovie 0.837 0.277 0.003
seasonspecial_crossover -0.217 0.353 0.540

5.8.8 Poisson Rootogram

A rootogram was created using the rootogram function from the countreg package.

Code
p_root <- topmodels::rootogram(mod_poisson, max = 10,
                              main = "Rootogram for the Poisson model")

The Poisson model (mod_poisson) under-predicts 0 counts, over-predicts 1 counts, and under-predicts 3, 4 and 5 counts, and over-predicts 6 and 7 counts.

5.8.9 Storing Training Sample Predictions

Using the augment function, the training sample predictions for the mod_poisson model were stored.

Code
mod_poisson_aug <- augment(mod_poisson, q1_train,
                     type.predict = "response")
mod_poisson_aug |> 
  select(index, rooby_rooby_roo, .fitted) |>
  head(3)|> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
index rooby_rooby_roo .fitted
144 1 0.744
435 0 0.413
345 1 0.706

5.8.10 Summarizing Training Sample mod_poisson Fit

The \(R^2\), Root mean Squared Error (RMSE) and Mean Absolute Error were then obtained from model mod_poisson.

Code
mets <- metric_set(rsq, rmse, mae)
mod_poisson_summary <-
mets(mod_poisson_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_possion") |> relocate(model)
mod_poisson_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
model .metric .estimator .estimate
mod_possion rsq standard 0.228
mod_possion rmse standard 0.727
mod_possion mae standard 0.525

5.8.11 Fitting a Negative Binomial Model

A negative binomial model was fit using the glm.nb function.

Code
mod_nb <- MASS::glm.nb(rooby_rooby_roo ~ rcs(imdb, 3) + engagement + network +season ,data = q1_train)
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached

Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached

5.8.12 Model Coefficients

The model coefficients were placed in a table.

Code
tidy(mod_nb) |> kable(digits = c(0, 3, 3, 1, 3)) |>
  kable_classic_2(font_size = 24,full_width = F)
term estimate std.error statistic p.value
(Intercept) 1.720 1.075 1.6 0.110
rcs(imdb, 3)imdb -0.274 0.152 -1.8 0.072
rcs(imdb, 3)imdb' 0.067 0.253 0.3 0.792
engagement 0.000 0.000 0.2 0.811
networkcartoon_network -0.454 0.246 -1.8 0.065
networkwarner_brothers -0.053 0.225 -0.2 0.814
networkboomerang -0.515 0.242 -2.1 0.034
networkother_networks -0.188 0.226 -0.8 0.404
seasonseason_2 -0.208 0.173 -1.2 0.229
seasonseason_3_4 -0.048 0.205 -0.2 0.816
seasonmovie 0.837 0.277 3.0 0.003
seasonspecial_crossover -0.217 0.353 -0.6 0.540

5.8.13 Negative Binomial Model Rootogram

Code
nb_root <- topmodels::rootogram(mod_nb, max = 10,
main = "Rootogram for the Negative Binomial Model")

5.8.14 Storing Training Sample Predictions

Code
mod_nb_aug <- augment(mod_nb, q1_train,
                     type.predict = "response")
Warning: The `augment()` method for objects of class `negbin` is not maintained by the broom team, and is only supported through the `glm` tidier method. Please be cautious in interpreting and reporting broom output.

This warning is displayed once per session.
Code
mod_nb_aug |> 
  select(index, rooby_rooby_roo, .fitted) |>
  head(3)|> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
index rooby_rooby_roo .fitted
144 1 0.744
435 0 0.413
345 1 0.706

5.8.15 Summarizing Training Sample mod_nb Fit

Code
mets <- metric_set(rsq, rmse, mae)
mod_nb_summary <-
mets(mod_nb_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_nb") |> relocate(model)
mod_nb_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
model .metric .estimator .estimate
mod_nb rsq standard 0.228
mod_nb rmse standard 0.727
mod_nb mae standard 0.525

5.8.16 Fitting a Zero-Inflated Poisson Model

A zero-inflated Poisson model was fit using the zeroinfl function.

Code
mod_zip <- pscl::zeroinfl(rooby_rooby_roo ~ rcs(imdb, 3)  + engagement + network +season,data = q1_train)
Warning in value[[3L]](cond): system is computationally singular: reciprocal
condition number = 6.56872e-62FALSE
Code
summary(mod_zip)

Call:
pscl::zeroinfl(formula = rooby_rooby_roo ~ rcs(imdb, 3) + engagement + 
    network + season, data = q1_train)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.4387 -0.6795  0.1373  0.4081  4.1033 

Count model coefficients (poisson with log link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              1.719e+00         NA      NA       NA
rcs(imdb, 3)imdb        -2.741e-01         NA      NA       NA
rcs(imdb, 3)imdb'        6.660e-02         NA      NA       NA
engagement               2.802e-06         NA      NA       NA
networkcartoon_network  -4.538e-01         NA      NA       NA
networkwarner_brothers  -5.290e-02         NA      NA       NA
networkboomerang        -5.150e-01         NA      NA       NA
networkother_networks   -1.883e-01         NA      NA       NA
seasonseason_2          -2.081e-01         NA      NA       NA
seasonseason_3_4        -4.779e-02         NA      NA       NA
seasonmovie              8.372e-01         NA      NA       NA
seasonspecial_crossover -2.166e-01         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)             -1.511e+01         NA      NA       NA
rcs(imdb, 3)imdb        -3.225e+00         NA      NA       NA
rcs(imdb, 3)imdb'       -1.858e+00         NA      NA       NA
engagement               3.922e-04         NA      NA       NA
networkcartoon_network   1.566e+00         NA      NA       NA
networkwarner_brothers  -1.227e+00         NA      NA       NA
networkboomerang         1.011e+00         NA      NA       NA
networkother_networks    1.266e+00         NA      NA       NA
seasonseason_2           4.908e-01         NA      NA       NA
seasonseason_3_4         4.850e-01         NA      NA       NA
seasonmovie              3.671e-02         NA      NA       NA
seasonspecial_crossover  7.360e-01         NA      NA       NA

Number of iterations in BFGS optimization: 25 
Log-likelihood:  -392 on 24 Df

5.8.17 Zero-Inflated Poisson Model Rootogram

Code
zip_root <- topmodels::rootogram(mod_zip, max = 10,
main = "Rootogram for the Zero-Inflated Poisson Model")

5.8.18 Storing Training Sample Predictions

Code
mod_zip_aug <- q1_train |>
  mutate(".fitted" = predict(mod_zip, type = "response"),
         ".resid" = resid(mod_zip, type = "response"))
mod_zip_aug |> 
  select(index, rooby_rooby_roo, .fitted, .resid) |>
  head(3)
  index rooby_rooby_roo   .fitted     .resid
1   144               1 0.7443436  0.2556564
2   435               0 0.4132138 -0.4132138
3   345               1 0.7061787  0.2938213

5.8.19 Summarizing Training Sample mod_nb Fit

Code
mets <- metric_set(rsq, rmse, mae)
mod_zip_summary <-
mets(mod_zip_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_zip") |> relocate(model)
mod_zip_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
model .metric .estimator .estimate
mod_zip rsq standard 0.228
mod_zip rmse standard 0.727
mod_zip mae standard 0.525

5.8.20 Fitting a Zero-Inflated Negative Binomial Model

A zero-inflated Negative Binomial Model was then fit.

Code
mod_zinb <- pscl::zeroinfl(rooby_rooby_roo ~ rcs(imdb, 3)  + engagement + network + season, dist = "negbin", data = q1_train)
Warning in value[[3L]](cond): system is computationally singular: reciprocal
condition number = 2.1577e-46FALSE
Code
summary(mod_zinb)

Call:
pscl::zeroinfl(formula = rooby_rooby_roo ~ rcs(imdb, 3) + engagement + 
    network + season, data = q1_train, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.3855 -0.6603  0.1403  0.3362  4.1881 

Count model coefficients (negbin with log link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              1.257e+00         NA      NA       NA
rcs(imdb, 3)imdb        -2.070e-01         NA      NA       NA
rcs(imdb, 3)imdb'       -4.830e-02         NA      NA       NA
engagement               5.965e-06         NA      NA       NA
networkcartoon_network   3.438e-01         NA      NA       NA
networkwarner_brothers   9.493e-02         NA      NA       NA
networkboomerang        -3.765e-01         NA      NA       NA
networkother_networks   -1.481e-01         NA      NA       NA
seasonseason_2          -1.322e-01         NA      NA       NA
seasonseason_3_4        -8.629e-02         NA      NA       NA
seasonmovie              6.345e-01         NA      NA       NA
seasonspecial_crossover -4.164e-01         NA      NA       NA
Log(theta)               1.663e+01         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)             -72.139488         NA      NA       NA
rcs(imdb, 3)imdb          5.234140         NA      NA       NA
rcs(imdb, 3)imdb'        -4.966800         NA      NA       NA
engagement               -0.006803         NA      NA       NA
networkcartoon_network   36.255230         NA      NA       NA
networkwarner_brothers   -7.081453         NA      NA       NA
networkboomerang         32.389288         NA      NA       NA
networkother_networks   -31.492774         NA      NA       NA
seasonseason_2            1.735211         NA      NA       NA
seasonseason_3_4         -4.228295         NA      NA       NA
seasonmovie               2.998970         NA      NA       NA
seasonspecial_crossover -96.148898         NA      NA       NA

Theta = 16665909.2848 
Number of iterations in BFGS optimization: 103 
Log-likelihood: -384.7 on 25 Df

5.8.21 Rootogram

Code
zinb_root <- topmodels::rootogram(mod_zinb, max = 10,
main = "Rootogram for Zero-Inflated Negative Binomial Model")

5.8.22 Storing Training Sample Predictions

Code
mod_zinb_aug <- q1_train |>
  mutate(".fitted" = predict(mod_zinb, type = "response"),
         ".resid" = resid(mod_zinb, type = "response"))
mod_zinb_aug |> 
  select(index, rooby_rooby_roo, .fitted, .resid) |>
  head(3)
  index rooby_rooby_roo   .fitted     .resid
1   144               1 0.7525684  0.2474316
2   435               0 0.4927507 -0.4927507
3   345               1 0.8279808  0.1720192

5.8.23 Summarizing Training Sample mod_nb Fit

Code
mets <- metric_set(rsq, rmse, mae)
mod_zinb_summary <-
mets(mod_zinb_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_zinb") |> relocate(model)
mod_zinb_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
model .metric .estimator .estimate
mod_zinb rsq standard 0.289
mod_zinb rmse standard 0.698
mod_zinb mae standard 0.499

5.8.24 Comparing Rootograms

Code
plot(p_root)

Code
plot(nb_root)

Code
plot(zip_root)

Code
plot(zinb_root)

5.8.25 Vuong’s Procedure Comparing the Negative Binomial Model and the Poisson Model

Code
vuong(mod_nb, mod_poisson)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A  p-value
Raw                    -2.29614 model2 > model1 0.010834
AIC-corrected          -2.29614 model2 > model1 0.010834
BIC-corrected          -2.29614 model2 > model1 0.010834

5.8.26 Vuong’s Procedure Comparing the Zero-Inflated Poisson Model and the Poisson Model

Code
vuong(mod_zip, mod_poisson)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A p-value
Raw               -1.996332e-01 model2 > model1 0.42088
AIC-corrected     -8.766005e+05 model2 > model1 < 2e-16
BIC-corrected     -2.627154e+06 model2 > model1 < 2e-16

5.8.27 Vuong’s Procedure Comparing the ZINB Model and the Poisson Model

Code
vuong(mod_zinb, mod_poisson)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A   p-value
Raw                    1.609132 model1 > model2  0.053794
AIC-corrected         -1.030014 model2 > model1  0.151502
BIC-corrected         -6.300337 model2 > model1 1.485e-10

5.8.28 Vuong’s Procedure Comparing the ZINB Model and the Negative Binomial Model

Code
vuong(mod_zinb, mod_nb)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                    1.609863 model1 > model2   0.053714
AIC-corrected         -1.029515 model2 > model1   0.151619
BIC-corrected         -6.300301 model2 > model1 1.4853e-10

5.8.29 Testing Sample Validations

Testing sample validations were obtained by using the predict function.

Code
test_poisson <- predict(mod_poisson, newdata = q1_test,
type.predict = "response")
test_nb <- predict(mod_nb, newdata = q1_test,
type.predict = "response")
test_zip <- predict(mod_zip, newdata = q1_test,
type.predict = "response")
test_zinb <- predict(mod_zinb, newdata = q1_test,
type.predict = "response")
Code
test_res <- bind_cols(q1_test,
                      pre_poisson = test_poisson, 
                      pre_nb = test_nb,
                      pre_zip = test_zip, 
                      pre_zinb = test_zinb)
names(test_res)
 [1] "index"           "rooby_rooby_roo" "imdb"            "engagement"     
 [5] "network"         "season"          "pre_poisson"     "pre_nb"         
 [9] "pre_zip"         "pre_zinb"       

The metrics of the testing sample’s fit were obtained.

Code
poisson_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_poisson) |>
  mutate(model = "mod_poisson")

nb_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_nb) |>
mutate(model = "mod_nb")

zip_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_zip) |>
mutate(model = "mod_zip")

zinb_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_zinb) |>
mutate(model = "mod_zinb")

test_sum <- bind_rows(poisson_sum, nb_sum, zip_sum, zinb_sum)|>
  pivot_wider(names_from = model,
              values_from = .estimate)
test_sum |>
select(-.estimator) |> kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
.metric mod_poisson mod_nb mod_zip mod_zinb
rsq 0.210 0.210 0.077 0.168
rmse 1.345 1.345 0.705 0.686
mae 1.210 1.210 0.535 0.521

5.8.30 Selecting a Final Model

The final winning model was the Poisson model, because:

  • It has the highest testing sample \(R^2\).

  • From the Vuong test, the other models do not appear to be a significant improvement over the Poisson model.

  • The rootograms for the Poisson model are not very different from the other three models.

A poisson model called mod_poisson_imp was fit on the entire imputed dataset.

Code
mod_poisson_imp <- glm(rooby_rooby_roo ~ rcs(imdb,3) +  engagement + network + season ,data = q1_simp, family = "poisson")
summary(mod_poisson_imp)

Call:
glm(formula = rooby_rooby_roo ~ rcs(imdb, 3) + engagement + network + 
    season, family = "poisson", data = q1_simp)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)   
(Intercept)              1.782e+00  9.109e-01   1.956  0.05047 . 
rcs(imdb, 3)imdb        -2.750e-01  1.297e-01  -2.121  0.03393 * 
rcs(imdb, 3)imdb'       -6.195e-03  2.095e-01  -0.030  0.97641   
engagement               1.803e-06  5.675e-06   0.318  0.75069   
networkcartoon_network  -4.710e-01  2.116e-01  -2.225  0.02606 * 
networkwarner_brothers  -3.598e-02  1.845e-01  -0.195  0.84539   
networkboomerang        -5.810e-01  2.166e-01  -2.683  0.00731 **
networkother_networks   -2.185e-01  1.921e-01  -1.137  0.25555   
seasonseason_2          -1.463e-01  1.438e-01  -1.017  0.30907   
seasonseason_3_4         2.636e-02  1.692e-01   0.156  0.87621   
seasonmovie              6.397e-01  2.319e-01   2.758  0.00581 **
seasonspecial_crossover -1.186e-01  3.077e-01  -0.385  0.69998   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 469.87  on 534  degrees of freedom
Residual deviance: 379.39  on 523  degrees of freedom
AIC: 1084.7

Number of Fisher Scoring iterations: 5

5.8.31 Rootogram

The rootogram for the final model was obtained.

Code
poisson_imp_root <- topmodels::rootogram(mod_poisson_imp, max = 10,
main = "Rootogram for Final Poisson Model")

5.8.32 Storing Fitted Values and Residuals

The fitted values were stored and the \(R^2\) of the final was obtained.

Code
sm_poiss1 <- augment(mod_poisson_imp, q1_simp,
                     type.predict = "response")

sm_poiss1 |> 
    select(rooby_rooby_roo, .fitted) |>
    head()
# A tibble: 6 × 2
  rooby_rooby_roo .fitted
            <dbl>   <dbl>
1               1   0.512
2               0   0.512
3               0   0.526
4               0   0.557
5               0   0.606
6               1   0.470
Code
# The correlation of observed and fitted values
(poiss_r <- with(sm_poiss1, cor(rooby_rooby_roo, .fitted)))
[1] 0.4658293
Code
poiss_r^2
[1] 0.2169969

The glance function was used to obtain the log-likelihood value, and the AIC and the BIC.

Code
glance(mod_poisson_imp) |> round(3)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <dbl>  <dbl> <dbl> <dbl>    <dbl>       <dbl> <dbl>
1          470.     534  -530. 1085. 1136.     379.         523   535
Log-Likelihood AIC BIC \(R^2\)
530.352 1084.70 1136.10 0.22

The \(R^2\) is 0.22, indicating that the model does not do a very good job of explaining the variation seen in the data.

5.8.33 Diagnostic Plots

A plot of the residuals versus the fitted values of rooby-rooby-roo for the final model were obtained.

Code
ggplot(sm_poiss1, aes(x = .fitted, y = .resid)) +
    geom_point() +
    labs(title = "Residuals vs. Fitted `rooby_rooby_roo`",
         subtitle = "Original Poisson Regression model")

The pattern seen here corroborates the pattern seen in the rootogram, with an underfitting of 0 values and an overfitting of 1 values.

glm.diag.plots was used to obtain the other diagnostic plots.

Code
glm.diag.plots(mod_poisson_imp)

The plots show that the model residuals follow a somewhat normal distribution, with some significant outliers. There are no points that exert an undue influence on the model.

Code
par(mfrow = c(2,2))
plot(mod_poisson_imp, which = 1)
plot(mod_poisson_imp, which = 2)
plot(mod_poisson_imp, which = 3)
plot(mod_poisson_imp, which = 4)

326 shows up as a outlier in all of these plots, along with 342.

5.9 Outlier Diagnostics

Code
scooby_raw |>
  filter(index == 326) |>
  select(title,rooby_rooby_roo, engagement, imdb, network, season) |>
  kable() |>
  kable_classic_2(font_size = 24,full_width = F)
title rooby_rooby_roo engagement imdb network season
Dawn of the Spooky Shuttle Scare 1 37 7.1 ABC 3

The first outlier is an episode from Season 3 produced by ABC, with a single rooby-rooby-roo.

Code
scooby_raw |>
  filter(index == 342) |>
  select(title,rooby_rooby_roo, engagement, imdb, network, season) |>
  kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
title rooby_rooby_roo engagement imdb network season
Scooby-Doo 3 100951 5.1 Warner Bros. Picture Movie

The second outlier is the live action Scooby-Doo movie, titled “Scooby-Doo” which was very popular at the time it released, and was considered a cult classic. Over 10,000 people voted to rate the movie on IMDB, which is higher than the engagement value for the other Scooby-Doo media.

5.10 Testing for Overdispersion

The Poisson distribution assumes that the variance of the distribution should be equal to it’s mean. Overdispersion occurs when the observed variance of the data is more than what would be expected under a Poisson distributio, and can indicate variability in the data that is not being explained by the model.

Code
cat("overdispersion ratio is ", sum(z^2)/ (n - k), "\n")
overdispersion ratio is  0.6754214 
Code
cat("p value of overdispersion test is ", 
    pchisq(sum(z^2), df = n-k, lower.tail = FALSE), "\n")
p value of overdispersion test is  1 

It does not look like the data has the issue of overdispersion from the large p value, and the data can be assumed to have a Poisson distribution.

5.11 Model Coefficients

The final model’s coefficients were obtained.

Code
tidy(mod_poisson_imp) |>
  select(term, estimate, std.error, p.value) |>
  kbl(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)
term estimate std.error p.value
(Intercept) 1.782 0.911 0.050
rcs(imdb, 3)imdb -0.275 0.130 0.034
rcs(imdb, 3)imdb' -0.006 0.210 0.976
engagement 0.000 0.000 0.751
networkcartoon_network -0.471 0.212 0.026
networkwarner_brothers -0.036 0.185 0.845
networkboomerang -0.581 0.217 0.007
networkother_networks -0.218 0.192 0.256
seasonseason_2 -0.146 0.144 0.309
seasonseason_3_4 0.026 0.169 0.876
seasonmovie 0.640 0.232 0.006
seasonspecial_crossover -0.119 0.308 0.700

5.11.1 Odds Ratios

This was then converted into Odds ratios.

Code
tidy(mod_poisson_imp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low95 = conf.low, high95 = conf.high, p = p.value) |>
  kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)
term estimate std.error low95 high95 p
(Intercept) 5.940 0.911 0.968 34.596 0.050
rcs(imdb, 3)imdb 0.760 0.130 0.591 0.983 0.034
rcs(imdb, 3)imdb' 0.994 0.210 0.653 1.485 0.976
engagement 1.000 0.000 1.000 1.000 0.751
networkcartoon_network 0.624 0.212 0.407 0.935 0.026
networkwarner_brothers 0.965 0.185 0.665 1.372 0.845
networkboomerang 0.559 0.217 0.359 0.842 0.007
networkother_networks 0.804 0.192 0.547 1.162 0.256
seasonseason_2 0.864 0.144 0.648 1.140 0.309
seasonseason_3_4 1.027 0.169 0.730 1.418 0.876
seasonmovie 1.896 0.232 1.205 2.993 0.006
seasonspecial_crossover 0.888 0.308 0.461 1.556 0.700

6 Analysis 2

Monster Motivation: Is it possible to predict the motive of the major antagonist of an episode of Scooby-Doo, based on the nature of the monster the antagonist appears as?

6.1 The Outcome

The outcome variable is motive, describing the motive of the antagonist of each episode/movie of Scooby-Doo.

Code
scooby_sample |>
  tabyl(motive) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
motive n percent
competition 168 0.3140187
theft 124 0.2317757
indirect_self_interest 118 0.2205607
treasure 54 0.1009346
conquer 42 0.0785047
misc_motive 29 0.0542056

6.2 The Predictors

The predictors are monster_amount, monster_gender, monster_real.

  • monster_amount is a continuous quantitative variable describing the number of monsters that appear in each episode/movie.
Code
describe(scooby_sample$monster_amount)
scooby_sample$monster_amount 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     535        0       16    0.739    1.944    1.781        0        1 
     .25      .50      .75      .90      .95 
       1        1        2        4        7 
                                                                            
Value          0     1     2     3     4     5     6     7     8     9    10
Frequency     30   341    65    32    23     7     5     7     6     4     3
Proportion 0.056 0.637 0.121 0.060 0.043 0.013 0.009 0.013 0.011 0.007 0.006
                                        
Value         11    12    13    15    19
Frequency      4     3     3     1     1
Proportion 0.007 0.006 0.006 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
  • monster_gender describes if the monster/monsters present in the episode were either male or female(in terms of the pronouns the Mystery Inc. gang used to refer to them). This variable had a large number of levels, and was collapsed to male if all of the monsters were male, and female if any of the monsters present in the episode/movie were female.
Code
scooby_sample |>
  tabyl(monster_gender) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
monster_gender n percent valid_percent
male 427 0.7981308 0.8455446
female 78 0.1457944 0.1544554
NA 30 0.0560748 NA
  • monster_real describes if the monster was actually real, or someone in a costume, or a mind-controlled person, etc.
Code
scooby_sample |>
  tabyl(monster_real) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
monster_real n percent valid_percent
not_real 402 0.7514019 0.7960396
real 103 0.1925234 0.2039604
NA 30 0.0560748 NA

6.3 Missingness

A seperate dataset, containing the identifier variable, along with the outcome and the predictor variables was created and stored in the tibble scooby_q2. This tibble was then assessed for missingness using miss_var_summary.

Code
scooby_q2 <- scooby_sample |>
  select(index,motive, monster_amount, monster_gender, monster_real)
scooby_q2 |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
variable n_miss pct_miss
monster_gender 30 5.607477
monster_real 30 5.607477
index 0 0.000000
motive 0 0.000000
monster_amount 0 0.000000

The variables monster_gender and monster_real have 15 missing values each.

The data can be assumed to be MAR (Missing At Random), since the missing data are not randomly distributed. It cannot be assumed to be MNAR (Missing Not At Random) since there does not appear to be a relationship between the magnitude of a value and it’s missingness (or inclusion in the non-missing data), since there is no preponderance of values with lower or higher magnitude being missing.

Single imputation was then performed on the dataset, to account for missingness. The complete dataset, containing the data for both the analyses had it’s missing values imputed.

6.4 Imputing Missing Variables

Code
set.seed(4322023)
q2_mice1 <- mice(scooby_q2, m = 10, print = FALSE) 
Warning: Number of logged events: 1
Code
q2_simp <- mice::complete(q2_mice1)

miss_var_summary(q2_simp)
# A tibble: 5 × 3
  variable       n_miss pct_miss
  <chr>           <int>    <dbl>
1 index               0        0
2 motive              0        0
3 monster_amount      0        0
4 monster_gender      0        0
5 monster_real        0        0

6.5 Visualizing Data Exploration

Code
mosaic::favstats(monster_amount ~ motive, data = q2_simp) |>
  kable(digits = 2) |>
  kable_classic_2(font_size = 24,full_width = F)
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2
motive min Q1 median Q3 max mean sd n missing
competition 0 1 1.0 1.00 19 1.58 2.09 168 0
theft 0 1 1.0 1.00 15 1.56 1.82 124 0
indirect_self_interest 0 1 1.0 2.00 9 1.60 1.34 118 0
treasure 1 1 1.0 2.00 7 1.70 1.28 54 0
conquer 1 3 5.5 9.75 13 6.17 3.96 42 0
misc_motive 0 0 1.0 2.00 5 1.38 1.42 29 0

6.5.1 motive vs monster_amount

Code
tmp_data_2 <- q2_simp

tmp_data_2$motive <-  fct_recode(tmp_data_2$motive,
                               "Competition" = "competition",
                               "Theft" = "theft",
                               "Indirect Self Interest" = "indirect_self_interest",
                               "Treasure" = "treasure",
                               "Conquer" = "conquer",
                               "Miscellaneous Motives" = "misc_motive")
ggplot(tmp_data_2, aes(x = motive, y = monster_amount, fill = motive))  +
  geom_violin(fill = "#2b8cbe", alpha = 0.3, scale = "width")+ 
  geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red")+
  labs(y = "Number of monsters per episode/movie",
       x = "Motive of the antagonist",
       title = str_wrap("Mapping the motives of the major antagonist of every episode/movie against the number of monsters across the various seasons/movies of Scooby-Doo.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.")  

Antagonists with a motivation to conquer appear to star in episodes with more monsters, compared to antagonists with other motivations.

6.5.2 motive vs monster_real

Code
tmp_data_2 <- tmp_data_2 |>
  mutate(monster_real = fct_recode(monster_real,
                               "Monster was not real" = "not_real",
                               "Monster was real" = "real"))

ggplot(tmp_data_2, aes(motive))+
  geom_bar(fill = "#2b8cbe") + 
  facet_wrap(~ monster_real)+
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  labs(y = "Count",
       x = "Motive of the antagonist",
       title = str_wrap("Mapping the motives of the major antagonist of every episode/movie against the nature of reality of the monsters across the various seasons/movies of Scooby-Doo.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.") 

It appears that for episodes with monsters that are not real, competition is the most numerous motivation. In episodes with monsters that are real, conquering appears to be the most numerous motivation.

6.5.3 motive vs monster_gender

Code
tmp_data_2 <- tmp_data_2 |>
  mutate(monster_gender = fct_recode(monster_gender,
                               "Male" = "male",
                               "Female" = "female"))
ggplot(tmp_data_2, aes(motive))+
  geom_bar(fill = "#2b8cbe") + 
  facet_wrap(~ monster_gender)+
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + 
  labs(y = "Count",
       x = "Motive of the antagonist",
       title = str_wrap("Mapping the motives of the major antagonist of every episode/movie against the gender of the monsters across the various seasons/movies of Scooby-Doo.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.") 

Episodes with only male monsters have competition as the most numerous outcome, and episodes that have female monsters also have competition as the most numerous outcome.

6.6 Assessing collinearity

The ggpairs function was used to plot a scatterplot matrix to assess for potential collinearity.

Code
ggpairs(q2_simp, columns = c("motive", "monster_real",
                             "monster_gender", "monster_amount"))

No collinearity was seen from the scatterplot.

6.7 Non-Linear Terms

A Spearman plot was then created to assess if non-linear terms would be needed in the models.

Code
sp2 <- spearman2(motive ~ monster_amount + monster_real + monster_gender, data = q2_simp)

plot(sp2)

The Spearman plot shows that if adding non-linear terms were to improve the fit of the model, the term most likely to add the most predictive power with a non-linear term applied to it would be monster_real.

However, no non-linear term was selected, given that adding non-linear terms would result in spending of too many degrees of freedom.

6.8 Fitting the Multinomial Model

A multinomial model was fit using the multinom function and was designated as multinom_model.

Code
multinom_model <-multinom(motive ~ monster_amount + monster_real + monster_gender ,data=q2_simp) 
# weights:  30 (20 variable)
initial  value 958.591316 
iter  10 value 806.559330
iter  20 value 761.294580
iter  30 value 760.258265
final  value 760.257932 
converged
Code
summary(multinom_model)
Call:
multinom(formula = motive ~ monster_amount + monster_real + monster_gender, 
    data = q2_simp)

Coefficients:
                       (Intercept) monster_amount monster_realreal
theft                   -0.2984806    -0.03866893        0.3795771
indirect_self_interest  -0.4247981    -0.13144830        1.9282423
treasure                -1.2076838     0.08470106       -0.6903529
conquer                 -4.2541198     0.18107415        4.7542887
misc_motive             -2.1112197    -0.46152297        3.2876692
                       monster_genderfemale
theft                            0.20634400
indirect_self_interest           0.06788956
treasure                        -0.27912724
conquer                         -0.08231355
misc_motive                      1.11501804

Std. Errors:
                       (Intercept) monster_amount monster_realreal
theft                    0.1697542     0.08281784        0.4887687
indirect_self_interest   0.1767427     0.08597467        0.4165262
treasure                 0.2190608     0.09981712        0.8402710
conquer                  0.6053196     0.08056674        0.7138281
misc_motive              0.3661798     0.19340172        0.5299548
                       monster_genderfemale
theft                             0.3499198
indirect_self_interest            0.3704725
treasure                          0.5111410
conquer                           0.5731102
misc_motive                       0.5255567

Residual Deviance: 1520.516 
AIC: 1560.516 

6.9 Model Summary

Code
tidy(multinom_model, exponentiate = FALSE) |> kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)
y.level term estimate std.error statistic p.value
theft (Intercept) -0.298 0.170 -1.758 0.079
theft monster_amount -0.039 0.083 -0.467 0.641
theft monster_realreal 0.380 0.489 0.777 0.437
theft monster_genderfemale 0.206 0.350 0.590 0.555
indirect_self_interest (Intercept) -0.425 0.177 -2.403 0.016
indirect_self_interest monster_amount -0.131 0.086 -1.529 0.126
indirect_self_interest monster_realreal 1.928 0.417 4.629 0.000
indirect_self_interest monster_genderfemale 0.068 0.370 0.183 0.855
treasure (Intercept) -1.208 0.219 -5.513 0.000
treasure monster_amount 0.085 0.100 0.849 0.396
treasure monster_realreal -0.690 0.840 -0.822 0.411
treasure monster_genderfemale -0.279 0.511 -0.546 0.585
conquer (Intercept) -4.254 0.605 -7.028 0.000
conquer monster_amount 0.181 0.081 2.248 0.025
conquer monster_realreal 4.754 0.714 6.660 0.000
conquer monster_genderfemale -0.082 0.573 -0.144 0.886
misc_motive (Intercept) -2.111 0.366 -5.766 0.000
misc_motive monster_amount -0.462 0.193 -2.386 0.017
misc_motive monster_realreal 3.288 0.530 6.204 0.000
misc_motive monster_genderfemale 1.115 0.526 2.122 0.034

6.10 Fitting a Proportional Odds Model

A proportional odds model was fit using the polr function.

Code
polr_model <- polr(motive~ monster_real + monster_amount + monster_gender ,data=q2_simp, Hess = TRUE)

summary(polr_model)
Call:
polr(formula = motive ~ monster_real + monster_amount + monster_gender, 
    data = q2_simp, Hess = TRUE)

Coefficients:
                       Value Std. Error t value
monster_realreal     1.95262    0.23359  8.3593
monster_amount       0.06793    0.03869  1.7558
monster_genderfemale 0.07528    0.21936  0.3432

Intercepts:
                                Value   Std. Error t value
competition|theft               -0.4092  0.1140    -3.5890
theft|indirect_self_interest     0.6615  0.1148     5.7641
indirect_self_interest|treasure  1.8288  0.1406    13.0087
treasure|conquer                 2.6939  0.1795    15.0048
conquer|misc_motive              3.8677  0.2454    15.7593

Residual Deviance: 1631.342 
AIC: 1647.342 

6.11 Testing the Proportional Odds Assumption

A likelihood ratio test was performed. The multinomial model fits 5 intercepts and 20 slopes, for a total of 25 parameters. The proportional odds model fits 5 intercepts and 4 slopes, for a total of 9 parameters. The difference between them is 16.

G is a chi-square statistic, calculated as the difference in deviance. Deviance = -2 log likelihood.

Code
LL_1 <- logLik(polr_model)
LL_1m <- logLik(multinom_model)
(G <- -2 * (LL_1[1] - LL_1m[1]))
[1] 110.8262

The p value associated with the chi-square statistic, with a difference of 16 degrees of freedom, is calculated.

Code
pchisq(G, 16, lower.tail = FALSE)
[1] 3.12363e-16

The p value suggests that the proportional odds model does not fit the data as well as the multinomial model does.

6.12 Comparing Metrics

Model Effective Degrees of Freedom Deviance AIC
Multinomial Model 20 1515.427 1555.427
Polynomial Model 8 1624.988 1640.988

Based on the results of testing the proportional odds assumption, and the model metrics, the multinomial model appears to be a better fit for the data than the proportional odds model.

6.13 The Final Model

Code
summary(multinom_model)
Call:
multinom(formula = motive ~ monster_amount + monster_real + monster_gender, 
    data = q2_simp)

Coefficients:
                       (Intercept) monster_amount monster_realreal
theft                   -0.2984806    -0.03866893        0.3795771
indirect_self_interest  -0.4247981    -0.13144830        1.9282423
treasure                -1.2076838     0.08470106       -0.6903529
conquer                 -4.2541198     0.18107415        4.7542887
misc_motive             -2.1112197    -0.46152297        3.2876692
                       monster_genderfemale
theft                            0.20634400
indirect_self_interest           0.06788956
treasure                        -0.27912724
conquer                         -0.08231355
misc_motive                      1.11501804

Std. Errors:
                       (Intercept) monster_amount monster_realreal
theft                    0.1697542     0.08281784        0.4887687
indirect_self_interest   0.1767427     0.08597467        0.4165262
treasure                 0.2190608     0.09981712        0.8402710
conquer                  0.6053196     0.08056674        0.7138281
misc_motive              0.3661798     0.19340172        0.5299548
                       monster_genderfemale
theft                             0.3499198
indirect_self_interest            0.3704725
treasure                          0.5111410
conquer                           0.5731102
misc_motive                       0.5255567

Residual Deviance: 1520.516 
AIC: 1560.516 

6.14 p Values

Code
z <- summary(multinom_model)$coefficients/summary(multinom_model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p |> kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F) 
(Intercept) monster_amount monster_realreal monster_genderfemale
theft 0.079 0.641 0.437 0.555
indirect_self_interest 0.016 0.126 0.000 0.855
treasure 0.000 0.396 0.411 0.585
conquer 0.000 0.025 0.000 0.886
misc_motive 0.000 0.017 0.000 0.034

6.15 Exponentiated Coefficients

Code
exp(coef(multinom_model, conf_int = TRUE)) |> kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F) 
(Intercept) monster_amount monster_realreal monster_genderfemale
theft 0.742 0.962 1.462 1.229
indirect_self_interest 0.654 0.877 6.877 1.070
treasure 0.299 1.088 0.501 0.756
conquer 0.014 1.199 116.081 0.921
misc_motive 0.121 0.630 26.780 3.050

6.16 Exponentiated Coefficients with p Values

Code
df <- data.frame(
  motive = c("theft", "indirect_self_interest", "treasure", "conquer", "misc_motive"),
  intercept = c(0.741, 0.637, 0.299, 0.014, 0.102),
  intercept_pvalue = c(0.078, 0.011, 0.000, 0.000, 0.000),
  monster_amount = c(0.958, 0.882, 1.087, 1.209, 0.660),
  monster_amount_pvalue = c(0.602, 0.141, 0.404, 0.019, 0.021),
  monster_realreal = c(1.627, 6.949, 0.509, 115.167, 33.966),
  monster_realreal_pvalue = c(0.306, 0.000, 0.419, 0.000, 0.000),
  monster_genderfemale = c(1.229, 1.158, 0.757, 0.961, 2.556),
  monster_genderfemale_pvalue = c(0.555, 0.689, 0.586, 0.945, 0.085)
)

df |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F) 
motive intercept intercept_pvalue monster_amount monster_amount_pvalue monster_realreal monster_realreal_pvalue monster_genderfemale monster_genderfemale_pvalue
theft 0.741 0.078 0.958 0.602 1.627 0.306 1.229 0.555
indirect_self_interest 0.637 0.011 0.882 0.141 6.949 0.000 1.158 0.689
treasure 0.299 0.000 1.087 0.404 0.509 0.419 0.757 0.586
conquer 0.014 0.000 1.209 0.019 115.167 0.000 0.961 0.945
misc_motive 0.102 0.000 0.660 0.021 33.966 0.000 2.556 0.085

6.17 Exponentiated Coefficients with Confidence Intervals

Code
tidy(multinom_model,conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE) |>
  kbl(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)  
y.level term estimate std.error statistic p.value conf.low conf.high
theft (Intercept) 0.742 0.170 -1.758 0.079 0.532 1.035
theft monster_amount 0.962 0.083 -0.467 0.641 0.818 1.132
theft monster_realreal 1.462 0.489 0.777 0.437 0.561 3.810
theft monster_genderfemale 1.229 0.350 0.590 0.555 0.619 2.440
indirect_self_interest (Intercept) 0.654 0.177 -2.403 0.016 0.462 0.925
indirect_self_interest monster_amount 0.877 0.086 -1.529 0.126 0.741 1.038
indirect_self_interest monster_realreal 6.877 0.417 4.629 0.000 3.040 15.559
indirect_self_interest monster_genderfemale 1.070 0.370 0.183 0.855 0.518 2.212
treasure (Intercept) 0.299 0.219 -5.513 0.000 0.195 0.459
treasure monster_amount 1.088 0.100 0.849 0.396 0.895 1.324
treasure monster_realreal 0.501 0.840 -0.822 0.411 0.097 2.603
treasure monster_genderfemale 0.756 0.511 -0.546 0.585 0.278 2.060
conquer (Intercept) 0.014 0.605 -7.028 0.000 0.004 0.047
conquer monster_amount 1.199 0.081 2.248 0.025 1.023 1.404
conquer monster_realreal 116.081 0.714 6.660 0.000 28.652 470.298
conquer monster_genderfemale 0.921 0.573 -0.144 0.886 0.300 2.832
misc_motive (Intercept) 0.121 0.366 -5.766 0.000 0.059 0.248
misc_motive monster_amount 0.630 0.193 -2.386 0.017 0.431 0.921
misc_motive monster_realreal 26.780 0.530 6.204 0.000 9.478 75.668
misc_motive monster_genderfemale 3.050 0.526 2.122 0.034 1.089 8.543

6.18 Storing and graphing Predicted Probabilities

The predicted probabilities were stored, tabulated, and then pivoted. This resulting table was used to create graphs of predicted probabilities of the outcome across the different predictors.

6.19 Predicted Probabilities and monster_amount

The predicted probabilities for each motive were graphed against the monster_amount variable.

Code
ggplot(q2_simp_fits_long, aes(x = monster_amount, y = prob, shape = response, col = response)) +
  geom_point(size = 3) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = str_wrap("Comparing predicted probabilities of the different antagonist motive types against the number of monsters of each installment of Scooby-Doo Media.", width = 80),
       subtitle = "Using a multinomial logistic regression model.",
       x = "Number of monsters per installment",
       y = "Predicted Probability")

Higher monster counts are association with higher predicted probability of conquer being the model’s prediction for the antagonist’s motivation, and lower monster counts are associated with higher predicted probability of misc_motive being the model’s prediction for the antagonist’s motivation.

6.20 Predicted Probabilities and monster_real

Code
ggplot(q2_simp_fits_long, aes(x = monster_real, y = prob,
                              col = response,shape = response)) +
  geom_point(size = 3) +
  geom_jitter(width = 0.25) +
  scale_fill_brewer(palette = "Set1")+
  labs(title = str_wrap("Comparing predicted probabilities of the different antagonist motive types against whether the monster(s) of each installment of Scooby-Doo Media was real or not.", width = 80),
       subtitle = "Using a multinomial logistic regression model.",
       x = "Whether the monster was real or not",
       y = "Predicted Probability")

Monsters who are real have a higher predicted probability of having conquering as a motivation, along with having a miscellaneous motivation. Monsters who are not real have a higher predicted probability of having a competition be their motivation, with a very low probability of conquering being their motivation.

6.21 Predicted Probabilities and monster_gender

Code
ggplot(q2_simp_fits_long, aes(x = monster_gender, y = prob,
                              col = response,shape = response)) +
  geom_point(size = 3) +
  geom_jitter(width = 0.25) + 
  scale_fill_brewer(palette = "Set1")  +
  labs(title = str_wrap("Comparing predicted probabilities of the different antagonist motive types against whether the monster(s) of each installment of Scooby-Doo Media included a female monster.", width = 80),
       subtitle = str_wrap("Using a multinomial logistic regression model.", width = 80),
       x = "Whether the monster(s) included a female monster.",
       y = "Predicted Probability") 

Monster gender does not appear to cause a significant effect on the predicted probability of the motivation types.

7 Conclusions and Discussion

7.1 Analysis 1

Code
tidy(mod_poisson_imp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low95 = conf.low, high95 = conf.high, p = p.value) |>
  kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)
term estimate std.error low95 high95 p
(Intercept) 5.940 0.911 0.968 34.596 0.050
rcs(imdb, 3)imdb 0.760 0.130 0.591 0.983 0.034
rcs(imdb, 3)imdb' 0.994 0.210 0.653 1.485 0.976
engagement 1.000 0.000 1.000 1.000 0.751
networkcartoon_network 0.624 0.212 0.407 0.935 0.026
networkwarner_brothers 0.965 0.185 0.665 1.372 0.845
networkboomerang 0.559 0.217 0.359 0.842 0.007
networkother_networks 0.804 0.192 0.547 1.162 0.256
seasonseason_2 0.864 0.144 0.648 1.140 0.309
seasonseason_3_4 1.027 0.169 0.730 1.418 0.876
seasonmovie 1.896 0.232 1.205 2.993 0.006
seasonspecial_crossover 0.888 0.308 0.461 1.556 0.700

7.1.1 The relationship between the Rooby-Rooby-Roo count and the producing network

Cartoon Network - Scooby-Doo media being produced by Cartoon Network results in a 37.6%(95%CI: 6.5% to 59.3%) decrease in count of Rooby-Rooby-Roo’s per episode compared to being produced by ABC , holding everything else constant.

Scooby-Doo media produced by Cartoon Network is more likely to have a lower count of Rooby-Rooby-Roos per episode when compared to Scooby-Doo media produced by ABC, provided they have the same IMDB rating, the same IMDB engagement score, and are from the same season.

Boomerang Episodes - Scooby-Doo media being produced by Boomerang results in a 44.1% (95%CI: 15.8% to 64.1%) decrease in count of Rooby-Rooby-Roo’s per episode compared to being produced by ABC, holding everything else constant.

Scooby-Doo media produced by Boomerang is more likely to have a lower count of Rooby-Rooby-Roos per episode when compared to Scooby-Doo media produced by ABC, provided they have the same IMDB rating, the same IMDB engagement score, and are from the same season.

7.1.2 The relationship between the Rooby-Rooby-Roo count and the type of Scooby-Doo media

Movies: The form of Scooby-Doo media being a movie results in a 89.6%(95%CI: 20.5% to 199.3%) increase in count of Rooby-Rooby-Roo’s compared to season 1 episodes, holding everything else constant.

Movies are more likely to have a higher count of Rooby-Rooby-Roo’s when compared to Season 1 episodes, provided they are both produced by the same network/producer, have the same IMDB score, and have the same IMDB engagement rating.

7.2 Analysis 2

Code
df |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)
motive intercept intercept_pvalue monster_amount monster_amount_pvalue monster_realreal monster_realreal_pvalue monster_genderfemale monster_genderfemale_pvalue
theft 0.741 0.078 0.958 0.602 1.627 0.306 1.229 0.555
indirect_self_interest 0.637 0.011 0.882 0.141 6.949 0.000 1.158 0.689
treasure 0.299 0.000 1.087 0.404 0.509 0.419 0.757 0.586
conquer 0.014 0.000 1.209 0.019 115.167 0.000 0.961 0.945
misc_motive 0.102 0.000 0.660 0.021 33.966 0.000 2.556 0.085
Code
tidy(multinom_model,conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE) |>
  kbl(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)  
y.level term estimate std.error statistic p.value conf.low conf.high
theft (Intercept) 0.742 0.170 -1.758 0.079 0.532 1.035
theft monster_amount 0.962 0.083 -0.467 0.641 0.818 1.132
theft monster_realreal 1.462 0.489 0.777 0.437 0.561 3.810
theft monster_genderfemale 1.229 0.350 0.590 0.555 0.619 2.440
indirect_self_interest (Intercept) 0.654 0.177 -2.403 0.016 0.462 0.925
indirect_self_interest monster_amount 0.877 0.086 -1.529 0.126 0.741 1.038
indirect_self_interest monster_realreal 6.877 0.417 4.629 0.000 3.040 15.559
indirect_self_interest monster_genderfemale 1.070 0.370 0.183 0.855 0.518 2.212
treasure (Intercept) 0.299 0.219 -5.513 0.000 0.195 0.459
treasure monster_amount 1.088 0.100 0.849 0.396 0.895 1.324
treasure monster_realreal 0.501 0.840 -0.822 0.411 0.097 2.603
treasure monster_genderfemale 0.756 0.511 -0.546 0.585 0.278 2.060
conquer (Intercept) 0.014 0.605 -7.028 0.000 0.004 0.047
conquer monster_amount 1.199 0.081 2.248 0.025 1.023 1.404
conquer monster_realreal 116.081 0.714 6.660 0.000 28.652 470.298
conquer monster_genderfemale 0.921 0.573 -0.144 0.886 0.300 2.832
misc_motive (Intercept) 0.121 0.366 -5.766 0.000 0.059 0.248
misc_motive monster_amount 0.630 0.193 -2.386 0.017 0.431 0.921
misc_motive monster_realreal 26.780 0.530 6.204 0.000 9.478 75.668
misc_motive monster_genderfemale 3.050 0.526 2.122 0.034 1.089 8.543

7.2.1 The relationship between monster motivation and the monster count

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same nature of the monster, and the same gender of the monster, if “EP-1” has one more monster than “EP-2”, EP-1’s predicted odds of the antagonist’s motive being “Conquer” rather than “Competition” is 1.21 times(95% confidence interval - 1.03 to 1.42) EP-2’s odds of the antagonist’s motive being “Conquer” rather than “Competition”.

Episodes with more monsters have a meaningfully higher odds of the antagonist having a motive of “Conquer” rather than “Competition”, provided the monster gender and the reality of the monster remain same.

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same nature of the monster, and the same gender of the monster, if “EP-1” has one more monster than “EP-2”, EP-1’s predicted odds of the antagonist’s motive being “Miscellaneous” rather than “Competition” is only 0.66 times(95% confidence interval - 0.46 to 0.94) EP-2’s odds of the antagonist’s motive being “Miscellaneous” rather than “Competition”.

Episodes with more monsters have a meaningfully lower odds of the antagonist having a motive of “Miscellaneous” rather than “Competition”, provided the monster gender and the reality of the monster remain the same.

7.2.2 The relationship between monster motivation and the nature of the monster’s reality

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same number of monsters, and the same gender of the monster, and the monsters in EP-2 are not real and EP-1 has a real monster, EP-1’s predicted odds of the antagonist’s motive being “Indirect Self Interest” rather than “Competition” is 6.95 times (95% confidence interval - 3.11 to 15.54) EP-2’s odds of the antagonist’s motive being “Indirect Self Interest” rather than “Competition”.

Episodes with real monsters have a meaningfully higher odds of the antagonist having a motive of “Indirect Self Interest” rather than “Competition”, provided the monster gender and the number of monsters remain the same.

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same number of monsters, and the the same gender of the monster, and the monsters in EP-2 are not real and EP-1 has a real monster, EP-1’s predicted odds of the antagonist’s motive being “Conquer” rather than “Competition” is 115.167 times(95% confidence interval - 28.60 to 463.70) EP-2’s odds of the antagonist’s motive being “Conquer” rather than “Competition”.

Episodes with real monsters have a meaningfully higher odds of the antagonist having a motive of “Conquer” rather than “Competition”, provided the monster gender and the number of monsters remain the same.

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same number of monsters, and the the same gender of the monster, and the monsters in EP-2 are not real and EP-1 has a real monster, EP-1’s predicted odds of the antagonist’s motive being “Miscellaneous” rather than “Competition” is 33.966 times(95% confidence interval - 12.10 to 95.74) EP-2’s odds of the antagonist’s motive being “Miscellaneous” rather than “Competition”.

Episodes with real monsters have a meaningfully higher odds of the antagonist having a motive of “Miscellaneous” rather than “Competition”, provided the monster gender and the number of monsters remain the same.

7.2.3 The relationship between monster motivation and the gender of the monsters

Having a female monster in an installment of Scooby-Doo does not have a significant impact on the motive of the antagonist of that installment when compared to an installment with no female monsters, provided the number of monsters and the nature of the monster remains same across both installments.

7.3 Answering My Research Questions

7.4 To Answer My First Question….

Predicting an iconic catchphrase: Are the logistics of an episode of Scooby-Doo good predictors of the number of times an iconic catchphrase is spoken?

The network/producer that produces a particular installment of Scooby-Doo has a meaningful impact on the number of times Rooby-Rooby-Roo is spoken. Cartoon Network and Boomerang produce media that is less likely to have a higher count of Rooby-Rooby-Roo’s compared to ABC, which is the network that has produced the most Scooby-Doo media.(Provided the IMDB logistics and the type of installment remain constant.)

Given that there is a meaningful correlation, the producing network of Scooby-Doo media can be considered a decent predictor of the number of times Rooby-Rooby-Roo’s are spoken per installment.

The nature of the type of installment of Scooby-Doo media has a meaningful impact on the number of times Rooby-Rooby-Roo is spoken.

Movies are more likely to have a higher count of Rooby-Rooby-Roo’s compared to Season 1 episodes, provided the two installments have the same IMDB logistics and the same producing network.

Given that there is a meaningful correlation, the nature of the type of installment of Scooby-Doo media can be considered a decent predictor of the number of times Rooby-Rooby-Roo’s are spoken per installment.

7.5 To Answer My Second Question….

Monster Motivation: Is it possible to predict the motive of the major antagonist of an episode of Scooby-Doo, based on the nature of the monster the antagonist appears as?

The number of monsters per installment of Scooby-Doo are a good predictor of the motive of the major antagonist. Episodes with higher monster counts are more likely to have antagonists that want to conquer, and episodes with lower monster counts are more likely to have antagonists that have more nebulous motivations, provided the monster’s gender and nature of reality remain constant.

The nature of reality of the monster(s) in an installment of Scooby-Doo are a good predictor of the motive of the major antagonist. Episodes with real monsters (as opposed to humans in a costume, or brainwashed/mind-controlled humans) are more likely to have antagonists that want to conquer, have more nebulous motivations, or act in their own self interest in an indirect manner, provided the monster’s gender and monster count remain constant.

Monster gender does not have a meaningful impact on the motivation of the antagonist of an installment of Scooby-Doo, provided the monster’s nature of reality and monster count remain constant. (No gender bias here!)

8 References and Acknowledgments

8.1 References

The data comes from Kaggle, and was part of Tidy Tuesday’s dataset for 2021-07-13, and can be found here.

The data was manually aggregated by user plummye on the Kaggle website.

8.2 Acknowledgments

I would like to thank Stephanie Merlino for being a fantastic TA, and for answering any and every question I had for her without hesitation. She was a big part of why Project B felt like something I could accomplish.

9 Session Information

Code
xfun::session_info()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-5          arm_1.13-1           askpass_1.2.0       
  backports_1.4.1      base64enc_0.1-3      bit_4.0.5           
  bit64_4.0.5          blob_1.2.4           boot_1.3-28.1       
  brio_1.1.4           broom_1.0.5          broom.helpers_1.14.0
  broom.mixed_0.2.9.4  bslib_0.6.1          cachem_1.0.8        
  callr_3.7.3          car_3.1-2            carData_3.0-5       
  cellranger_1.1.0     checkmate_2.3.1      class_7.3-22        
  cli_3.6.1            clipr_0.8.0          clock_0.7.0         
  cluster_2.1.6        coda_0.19-4          codetools_0.2-19    
  colorspace_2.1-0     commonmark_1.9.0     compiler_4.3.2      
  conflicted_1.2.0     countreg_0.2-1       cpp11_0.4.7         
  crayon_1.5.2         curl_5.2.0           data.table_1.14.10  
  datasets_4.3.2       DBI_1.2.0            dbplyr_2.4.0        
  desc_1.4.3           diagram_1.6.5        dials_1.2.0         
  DiceDesign_1.10      diffobj_0.3.5        digest_0.6.33       
  distributions3_0.2.1 dplyr_1.1.4          dtplyr_1.3.1        
  ellipsis_0.3.2       equatiomatic_0.3.1   evaluate_0.23       
  fansi_1.0.6          farver_2.1.1         fastmap_1.1.1       
  fontawesome_0.5.2    forcats_1.0.0        foreach_1.5.2       
  foreign_0.8-86       Formula_1.2-5        fs_1.6.3            
  furrr_0.3.1          future_1.33.1        future.apply_1.11.1 
  gargle_1.5.2         generics_0.1.3       GGally_2.2.0        
  ggformula_0.12.0     ggplot2_3.4.4        ggridges_0.5.5      
  ggstats_0.5.1        glmnet_4.1-8         globals_0.16.2      
  glue_1.6.2           googledrive_2.1.1    googlesheets4_1.1.1 
  gower_1.0.1          GPfit_1.0-8          graphics_4.3.2      
  grDevices_4.3.2      grid_4.3.2           gridExtra_2.3       
  gtable_0.3.4         hardhat_1.3.0        haven_2.5.4         
  highr_0.10           Hmisc_5.1-1          hms_1.1.3           
  htmlTable_2.4.2      htmltools_0.5.7      htmlwidgets_1.6.4   
  httpuv_1.6.13        httr_1.4.7           ids_1.0.1           
  infer_1.0.5          ipred_0.9-14         isoband_0.2.7       
  iterators_1.0.14     janitor_2.2.0        jomo_2.7-6          
  jquerylib_0.1.4      jsonlite_1.8.8       kableExtra_1.3.4    
  KernSmooth_2.23.22   knitr_1.45           labeling_0.4.3      
  labelled_2.12.0      later_1.3.2          lattice_0.22-5      
  lava_1.7.3           lhs_1.1.6            lifecycle_1.0.4     
  listenv_0.9.0        lme4_1.1-35.1        lubridate_1.9.3     
  magrittr_2.0.3       MASS_7.3-60          Matrix_1.6-4        
  MatrixModels_0.5-3   memoise_2.0.1        methods_4.3.2       
  mgcv_1.9.1           mice_3.16.0          mime_0.12           
  minqa_1.2.6          mitml_0.4-5          modeldata_1.2.0     
  modelenv_0.1.1       modelr_0.1.11        mosaic_1.9.0        
  mosaicCore_0.9.4.0   mosaicData_0.20.4    multcomp_1.4-25     
  munsell_0.5.0        mvtnorm_1.2-4        naniar_1.0.0        
  nlme_3.1-164         nloptr_2.0.3         nnet_7.3-19         
  norm_1.0.11.1        numDeriv_2016.8.1.1  openssl_2.1.1       
  ordinal_2023.12.4    pan_1.9              parallel_4.3.2      
  parallelly_1.36.0    parsnip_1.1.1        patchwork_1.1.3     
  pbkrtest_0.5.2       pillar_1.9.0         pkgbuild_1.4.3      
  pkgconfig_2.0.3      pkgload_1.3.3        plyr_1.8.9          
  polspline_1.1.24     praise_1.0.0         prettyunits_1.2.0   
  processx_3.8.3       prodlim_2023.08.28   progress_1.2.3      
  progressr_0.14.0     promises_1.2.1       ps_1.7.5            
  pscl_1.5.5.1         purrr_1.0.2          quantreg_5.97       
  R6_2.5.1             ragg_1.2.7           rappdirs_0.3.3      
  RColorBrewer_1.1-3   Rcpp_1.0.11          RcppEigen_0.3.3.9.4 
  readr_2.1.4          readxl_1.4.3         recipes_1.0.9       
  rematch_2.0.0        rematch2_2.1.2       reprex_2.0.2        
  rlang_1.1.2          rmarkdown_2.25       rms_6.7-1           
  rpart_4.1.23         rprojroot_2.0.4      rsample_1.2.0       
  rstudioapi_0.15.0    rvest_1.0.3          sandwich_3.1-0      
  sass_0.4.8           scales_1.3.0         selectr_0.4.2       
  shape_1.4.6          shiny_1.8.0          slider_0.3.1        
  snakecase_0.11.1     sourcetools_0.1.7.1  SparseM_1.81        
  splines_4.3.2        SQUAREM_2021.1       stats_4.3.2         
  stats4_4.3.2         stringi_1.8.3        stringr_1.5.1       
  survival_3.5-7       svglite_2.1.3        sys_3.4.2           
  systemfonts_1.0.5    testthat_3.2.1       textshaping_0.3.7   
  TH.data_1.1-2        tibble_3.2.1         tidymodels_1.1.1    
  tidyr_1.3.0          tidyselect_1.2.0     tidyverse_2.0.0     
  timechange_0.2.0     timeDate_4032.109    tinytex_0.49        
  tools_4.3.2          topmodels_0.3-0      tune_1.1.2          
  tzdb_0.4.0           ucminf_1.2.1         UpSetR_1.4.0        
  utf8_1.2.4           utils_4.3.2          uuid_1.1.1          
  vctrs_0.6.5          VGAM_1.1-9           viridis_0.6.4       
  viridisLite_0.4.2    visdat_0.6.0         vroom_1.6.5         
  waldo_0.5.2          warp_0.2.1           webshot_0.5.5       
  withr_2.5.2          workflows_1.1.3      workflowsets_1.0.1  
  xfun_0.41            xml2_1.3.6           xtable_1.8-4        
  yaml_2.3.7           yardstick_1.2.0      zoo_1.8-12