Scooby Doo analytics: Quantifying what’s groovy about a pop culture icon.

Author

Naveen Kannan

Published

2023-05-06

R Packages and Setup

Code

knitr::opts_chunk$set(comment = NA) 
library(knitr)
library(broom)
library(janitor) 
library(naniar)
library(glue)
library(mice)
library(MASS)
library(pscl)
library(GGally)
library(VGAM)
library(car)
library(nnet)
library(rsample)
library(yardstick)
library(patchwork)
library(rms)
library(tidymodels)
library(boot)
library(conflicted)
library(kableExtra)
library(countreg)
library(topmodels)
library(equatiomatic)
library(tidyverse)

conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")

theme_set(theme_bw())

1 Background

The data comes from Kaggle, and was part of Tidy Tuesday’s dataset for 2021-07-13, and can be found here.

“Scooby-Doo” is a television series that has been airing for over 50 years. Centered around Mystery Inc.,a group of iconic mystery solving detectives, including Fred, Daphne, Velma, Shaggy, and the titular Scooby-Doo, a talking dog with a penchant for consuming ridiculously tall sandwiches and Scooby snacks.

The Mystery Inc. Gang. From left to right, Velma, Shaggy, Scooby Doo, Fred, and Daphne.

The data was manually aggregated by user plummye on the Kaggle website.

2 Research Questions

Question 1:

Predicting an iconic catchphrase: Are the logistics of an episode of Scooby-Doo good predictors of the number of times an iconic catchphrase is spoken?

Question 2:

Monster Motivation: Is it possible to predict the motive of the major antagonist of an episode of Scooby-Doo, based on the nature of the monster the antagonist appears as?

3 Data Ingest and Management

3.1 Loading the Raw Data

The data comes from Kaggle, and was part of Tidy Tuesday’s dataset for 2021-07-13, and can be found here.

Code

scooby_raw <- read_csv('scooby_doo.csv', show_col_types = F)
scooby_raw

# A tibble: 603 × 75
   index series_name   network season title imdb  engagement date_aired run_time
   <dbl> <chr>         <chr>   <chr>  <chr> <chr> <chr>      <date>        <dbl>
 1     1 Scooby Doo, … CBS     1      What… 8.1   556        1969-09-13       21
 2     2 Scooby Doo, … CBS     1      A Cl… 8.1   479        1969-09-20       22
 3     3 Scooby Doo, … CBS     1      Hass… 8     455        1969-09-27       21
 4     4 Scooby Doo, … CBS     1      Mine… 7.8   426        1969-10-04       21
 5     5 Scooby Doo, … CBS     1      Deco… 7.5   391        1969-10-11       21
 6     6 Scooby Doo, … CBS     1      What… 8.4   384        1969-10-18       21
 7     7 Scooby Doo, … CBS     1      Neve… 7.6   358        1969-10-25       21
 8     8 Scooby Doo, … CBS     1      Foul… 8.2   358        1969-11-01       21
 9     9 Scooby Doo, … CBS     1      The … 8.1   371        1969-11-08       21
10    10 Scooby Doo, … CBS     1      Bedl… 8     346        1969-11-15       21
# ℹ 593 more rows
# ℹ 66 more variables: format <chr>, monster_name <chr>, monster_gender <chr>,
#   monster_type <chr>, monster_subtype <chr>, monster_species <chr>,
#   monster_real <chr>, monster_amount <dbl>, caught_fred <chr>,
#   caught_daphnie <chr>, caught_velma <chr>, caught_shaggy <chr>,
#   caught_scooby <chr>, captured_fred <chr>, captured_daphnie <chr>,
#   captured_velma <chr>, captured_shaggy <chr>, captured_scooby <chr>, …

3.2 Cleaning the Data

3.2.1 Selecting Variables We’ll Use

The variables that are intended to be used for the two models were selected and placed in a tibble titled scooby_filtered.

Code

scooby_filtered <- scooby_raw |>
  select(index, imdb, engagement, network, season, rooby_rooby_roo,
         motive, monster_amount,monster_gender, monster_real) 
scooby_filtered  <- scooby_filtered %>% replace(.=="NULL", NA) # replace with NA
scooby_filtered

# A tibble: 603 × 10
   index imdb  engagement network season rooby_rooby_roo motive   monster_amount
   <dbl> <chr> <chr>      <chr>   <chr>  <chr>           <chr>             <dbl>
 1     1 8.1   556        CBS     1      1               Theft                 1
 2     2 8.1   479        CBS     1      0               Theft                 1
 3     3 8     455        CBS     1      0               Treasure              1
 4     4 7.8   426        CBS     1      0               Natural…              1
 5     5 7.5   391        CBS     1      0               Competi…              1
 6     6 8.4   384        CBS     1      1               Extorti…              1
 7     7 7.6   358        CBS     1      1               Competi…              1
 8     8 8.2   358        CBS     1      1               Safety                1
 9     9 8.1   371        CBS     1      1               Counter…              1
10    10 8     346        CBS     1      0               Competi…              1
# ℹ 593 more rows
# ℹ 2 more variables: monster_gender <chr>, monster_real <chr>

3.2.2 Converting Variable Types

Each variable was then converted into the appropriate variable type. These variables were then placed in a tibble titled scooby_converted.

Code

scooby_converted <- scooby_filtered |>
  mutate(index = as.character(index),
         imdb = as.numeric(imdb), 
         engagement = as.numeric(engagement),
         network = as_factor(network),
         season = as_factor(season),
         rooby_rooby_roo = as.double(rooby_rooby_roo),
         motive = as_factor(motive),
         monster_gender = as_factor(monster_gender),
         monster_real = as_factor(monster_real))
scooby_converted

# A tibble: 603 × 10
   index  imdb engagement network season rooby_rooby_roo motive   monster_amount
   <chr> <dbl>      <dbl> <fct>   <fct>            <dbl> <fct>             <dbl>
 1 1       8.1        556 CBS     1                    1 Theft                 1
 2 2       8.1        479 CBS     1                    0 Theft                 1
 3 3       8          455 CBS     1                    0 Treasure              1
 4 4       7.8        426 CBS     1                    0 Natural…              1
 5 5       7.5        391 CBS     1                    0 Competi…              1
 6 6       8.4        384 CBS     1                    1 Extorti…              1
 7 7       7.6        358 CBS     1                    1 Competi…              1
 8 8       8.2        358 CBS     1                    1 Safety                1
 9 9       8.1        371 CBS     1                    1 Counter…              1
10 10      8          346 CBS     1                    0 Competi…              1
# ℹ 593 more rows
# ℹ 2 more variables: monster_gender <fct>, monster_real <fct>

3.2.3 Sampling the Data

scooby_converted was then filtered down to a tibble called scooby_sample that had complete cases on both the outcome variables. (rooby-rooby_roo and motive.)

Code

scooby_sample <- scooby_converted |>
  filter(complete.cases(rooby_rooby_roo, motive))
scooby_sample

# A tibble: 535 × 10
   index  imdb engagement network season rooby_rooby_roo motive   monster_amount
   <chr> <dbl>      <dbl> <fct>   <fct>            <dbl> <fct>             <dbl>
 1 1       8.1        556 CBS     1                    1 Theft                 1
 2 2       8.1        479 CBS     1                    0 Theft                 1
 3 3       8          455 CBS     1                    0 Treasure              1
 4 4       7.8        426 CBS     1                    0 Natural…              1
 5 5       7.5        391 CBS     1                    0 Competi…              1
 6 6       8.4        384 CBS     1                    1 Extorti…              1
 7 7       7.6        358 CBS     1                    1 Competi…              1
 8 8       8.2        358 CBS     1                    1 Safety                1
 9 9       8.1        371 CBS     1                    1 Counter…              1
10 10      8          346 CBS     1                    0 Competi…              1
# ℹ 525 more rows
# ℹ 2 more variables: monster_gender <fct>, monster_real <fct>

3.2.4 Working with Categorical Predictors

The categorical variables were then individually examined to assess the levels and the quantity of each level.

Code

scooby_sample |> 
  tabyl(motive)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

motive	n	percent
Theft	124	0.2317757
Treasure	54	0.1009346
Natural Resource	26	0.0485981
Competition	168	0.3140187
Extortion	4	0.0074766
Safety	2	0.0037383
Counterfeit	6	0.0112150
Inheritance	8	0.0149533
Smuggling	22	0.0411215
Preservation	11	0.0205607
Experimentation	4	0.0074766
Food	11	0.0205607
Trespassing	15	0.0280374
Assistance	5	0.0093458
Abduction	12	0.0224299
Haunt	1	0.0018692
Anger	3	0.0056075
Imagination	6	0.0112150
Bully	1	0.0018692
Loneliness	1	0.0018692
Training	1	0.0018692
Conquer	42	0.0785047
Mistake	1	0.0018692
Automated	1	0.0018692
Production	1	0.0018692
Entertainment	3	0.0056075
Simulation	2	0.0037383

The levels of motive were significantly collapsed into appropriate levels, and were then releveled according to size.

Code

scooby_sample <- scooby_sample |>
  mutate(motive = fct_collapse(motive, 
                               theft = "Theft",
                               treasure = "Treasure",
                               natural_resource = "Natural Resource",
                               competition = "Competition",
                               conquer = "Conquer",
                               indirect_monetary_gain = c("Extortion","Counterfeit","Inheritance","Smuggling"),
                               property_related = c("Preservation","Trespassing"),
                               harm_others = c("Abduction","Bully"),
                               indirect_self_interest = c("Safety","Experimentation","Food",
                                                          "Assistance","Loneliness","Entertainment","Training"),
                               misc_motive =c("Haunt","Anger","Imagination","Training",
                                              "Mistake","Automated","Production","Simulation")))
scooby_sample <- scooby_sample |>
  mutate(motive = fct_collapse(motive,
                               misc_motive = c("misc_motive","harm_others")))

scooby_sample <- scooby_sample |>
  mutate(motive = fct_collapse(motive,
                               indirect_self_interest = c("indirect_self_interest", "natural_resource","indirect_monetary_gain","property_related"),
                               misc_motive = c("misc_motive"))) |>
  mutate(motive = fct_relevel(motive, c("competition","theft","indirect_self_interest","treasure","conquer","misc_motive")))


scooby_sample |> 
  tabyl(motive)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

motive	n	percent
competition	168	0.3140187
theft	124	0.2317757
indirect_self_interest	118	0.2205607
treasure	54	0.1009346
conquer	42	0.0785047
misc_motive	29	0.0542056

The variable monster_gender was then examined. The variable originally describes the gender (as referred to by the Mystery Inc. gang) of the monster/monsters present in each episode.

Code

scooby_sample|>
  tabyl(monster_gender)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

monster_gender	n	percent	valid_percent
Male	312	0.5831776	0.6178218
Female	28	0.0523364	0.0554455
Male,Male,Male	23	0.0429907	0.0455446
Male,Female	10	0.0186916	0.0198020
Male,Male	47	0.0878505	0.0930693
Female,Male	5	0.0093458	0.0099010
Male,Male,Male,Male,None	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male	4	0.0074766	0.0079208
Male,Female,Male,Male,Male,Male,Male	1	0.0018692	0.0019802
Male,Male,Male,Male	14	0.0261682	0.0277228
Male,Male,Male,Male,Male,Male,Male,Male	4	0.0074766	0.0079208
Male,Female,Male	4	0.0074766	0.0079208
Female,Male,	1	0.0018692	0.0019802
Male,Male,Male,Female	1	0.0018692	0.0019802
Male,Female,Male,Male,Male	2	0.0037383	0.0039604
Male,Male,Male,Female,Male,Male	1	0.0018692	0.0019802
Female,Male,Male,Female	1	0.0018692	0.0019802
Male,Male,Female,Male	3	0.0056075	0.0059406
Male,Male,Male,Male,Female	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male	4	0.0074766	0.0079208
Male,Male,Female	4	0.0074766	0.0079208
Male,Male,Male,Male,Male	2	0.0037383	0.0039604
Female,Female,Female,Female	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male	5	0.0093458	0.0099010
Female,Male,Male,Male	1	0.0018692	0.0019802
Male,Male,Male,Female,Female,Female,Male,Male,Male,Female,Male,Male,Male,Male,Male,Male,Male,Male,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Female,Female,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Female,Female,Female,Male	1	0.0018692	0.0019802
Male,Male,Male,Female,Male,Male,Female,Female	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male	4	0.0074766	0.0079208
Female,Female	1	0.0018692	0.0019802
Male,Female,Female,Male,Male	2	0.0037383	0.0039604
Female,Male,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Female	1	0.0018692	0.0019802
Male,Male,Male,Female,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male	2	0.0037383	0.0039604
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male	2	0.0037383	0.0039604
Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male,Male	2	0.0037383	0.0039604
Female,Male,Female,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Female,Male,Male,Male	1	0.0018692	0.0019802
Male,Female,Male,Male	1	0.0018692	0.0019802
Male,Male,Male,Male,Male,Male,Male,Female,Male,Male,Male,Male	1	0.0018692	0.0019802
NA	30	0.0560748	NA

If a particular value of the variable contained the term “Female”, it was then converted into the term “female” alone, and “male” alone otherwise. Missing values were left untouched. The variable was then releveled to have “male” as the baseline level.

Code

# Collapsing values into a binary variable
scooby_sample$monster_gender <- 
  ifelse(grepl("Female", scooby_sample$monster_gender), "female", ifelse(grepl("Male", scooby_sample$monster_gender), "male", scooby_sample$monster_gender))
scooby_sample <- scooby_sample |>
  mutate(monster_gender = fct_relevel(monster_gender, c("male","female")))
scooby_sample|>
  tabyl(monster_gender)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

monster_gender	n	percent	valid_percent
male	427	0.7981308	0.8455446
female	78	0.1457944	0.1544554
NA	30	0.0560748	NA

The network variable describes the television networks that ran the relevant episode of Scooby-Doo.

Code

scooby_sample|>
  tabyl(network)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

network	n	percent
CBS	49	0.0915888
ABC	218	0.4074766
Syndication	3	0.0056075
TBC	0	0.0000000
Cartoon Network	82	0.1532710
Warner Home Video	38	0.0710280
Warner Bros. Picture	3	0.0056075
Adult Swim	1	0.0018692
The WB	41	0.0766355
The CW	27	0.0504673
Boomerang	73	0.1364486

The levels of the variables were then collapsed to appropriate levels. The variable was then releveled according to level size.

Code

scooby_sample <- scooby_sample |>
  mutate(network = fct_collapse(network, 
                               other_networks = c("Syndication","TBC","Adult Swim"),
                               warner_brothers = c("Warner Home Video", "Warner Bros. Picture", "The WB"),
                               cw = "The CW",
                               cbs = "CBS",
                               abc = "ABC",
                               cartoon_network = "Cartoon Network",
                               boomerang = "Boomerang")) |>
  mutate(network = fct_relevel(network, c("abc","cartoon_network","warner_brothers","boomerang","cbs","cw","other_networks")))
scooby_sample <- scooby_sample |>
  mutate(network = fct_collapse(network, 
                               other_networks = c("cbs","cw","other_networks")))
                               
scooby_sample|>
  tabyl(network)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

network	n	percent
abc	218	0.4074766
cartoon_network	82	0.1532710
warner_brothers	82	0.1532710
boomerang	73	0.1364486
other_networks	80	0.1495327

The season variable describes the season of the episode of Scooby-Doo.

Code

scooby_sample|>
  tabyl(season)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

season	n	percent
1	275	0.5140187
2	143	0.2672897
Crossover	7	0.0130841
3	54	0.1009346
Movie	40	0.0747664
Special	12	0.0224299
4	4	0.0074766

The levels were renamed, and then the variable itself was releveled according to level size.

Code

scooby_sample <- scooby_sample |>
  mutate(season = fct_collapse(season, 
                               season_1 = "1",
                               season_2 = "2",
                               season_3 = "3",
                               season_4 = "4",
                               crossover = "Crossover",
                               movie = "Movie",
                               special = "Special")) |>
  mutate(season = fct_relevel(season, c("season_1","season_2","season_3","movie","special","crossover","season_4")))
scooby_sample <- scooby_sample |>
  mutate(season = fct_collapse(season, 
                               season_3_4 = c("season_3","season_4"),
                               special_crossover = c("special","crossover")))
scooby_sample|>
  tabyl(season)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

season	n	percent
season_1	275	0.5140187
season_2	143	0.2672897
season_3_4	58	0.1084112
movie	40	0.0747664
special_crossover	19	0.0355140

The monster_real variable describes if the monster was actually real or if was just a man in a mask, mechanically controlled or trained/hypnotized by a human.

The levels were renamed.

Code

scooby_sample|>
  tabyl(monster_real)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

monster_real	n	percent	valid_percent
FALSE	402	0.7514019	0.7960396
TRUE	103	0.1925234	0.2039604
NA	30	0.0560748	NA

Code

scooby_sample <- scooby_sample |>
  mutate(monster_real = fct_collapse(monster_real, 
                               not_real = "FALSE",
                               real = "TRUE"))
scooby_sample|>
  tabyl(monster_real)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

monster_real	n	percent	valid_percent
not_real	402	0.7514019	0.7960396
real	103	0.1925234	0.2039604
NA	30	0.0560748	NA

3.2.5 Working with Quantitative Variables

Each individual quantitative variable was then described.

Code

scooby_sample |>
  select(imdb) |>
  describe()

select(scooby_sample, imdb) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
imdb 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15       48    0.996    7.347   0.7684      5.9      6.5 
     .25      .50      .75      .90      .95 
     7.1      7.4      7.7      8.2      8.4 

lowest : 4.6 4.8 4.9 5   5.1, highest: 9   9.1 9.2 9.3 9.6
--------------------------------------------------------------------------------

Code

scooby_sample |>
  select(engagement) |>
  describe()

select(scooby_sample, engagement) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
engagement 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15      229        1    645.2     1153     17.0     22.9 
     .25      .50      .75      .90      .95 
    31.0     75.0    166.2    384.6   1891.4 

lowest :      7      9     11     12     13, highest:   6929  12620  20226  50847 100951
--------------------------------------------------------------------------------

Code

scooby_sample |>
  select(monster_amount) |>
  describe()

select(scooby_sample, monster_amount) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
monster_amount 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     535        0       16    0.739    1.944    1.781        0        1 
     .25      .50      .75      .90      .95 
       1        1        2        4        7 
                                                                            
Value          0     1     2     3     4     5     6     7     8     9    10
Frequency     30   341    65    32    23     7     5     7     6     4     3
Proportion 0.056 0.637 0.121 0.060 0.043 0.013 0.009 0.013 0.011 0.007 0.006
                                        
Value         11    12    13    15    19
Frequency      4     3     3     1     1
Proportion 0.007 0.006 0.006 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

Code

scooby_sample |>
  select(rooby_rooby_roo) |>
  describe()

select(scooby_sample, rooby_rooby_roo) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
rooby_rooby_roo 
       n  missing distinct     Info     Mean      Gmd 
     535        0        8    0.782    0.729   0.7098 
                                                          
Value          0     1     2     3     4     5     6     7
Frequency    210   289    19    11     3     1     1     1
Proportion 0.393 0.540 0.036 0.021 0.006 0.002 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

3.2.6 Identifying Missingness

Code

scooby_sample |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
monster_gender	30	5.607477
monster_real	30	5.607477
imdb	15	2.803738
engagement	15	2.803738
index	0	0.000000
network	0	0.000000
season	0	0.000000
rooby_rooby_roo	0	0.000000
motive	0	0.000000
monster_amount	0	0.000000

The variables monster_gender, monster_real, imdb and engagement have missing values.

3.3 Print and Save The Final Tibble

Code

print(scooby_sample)

# A tibble: 535 × 10
   index  imdb engagement network   season rooby_rooby_roo motive monster_amount
   <chr> <dbl>      <dbl> <fct>     <fct>            <dbl> <fct>           <dbl>
 1 1       8.1        556 other_ne… seaso…               1 theft               1
 2 2       8.1        479 other_ne… seaso…               0 theft               1
 3 3       8          455 other_ne… seaso…               0 treas…              1
 4 4       7.8        426 other_ne… seaso…               0 indir…              1
 5 5       7.5        391 other_ne… seaso…               0 compe…              1
 6 6       8.4        384 other_ne… seaso…               1 indir…              1
 7 7       7.6        358 other_ne… seaso…               1 compe…              1
 8 8       8.2        358 other_ne… seaso…               1 indir…              1
 9 9       8.1        371 other_ne… seaso…               1 indir…              1
10 10      8          346 other_ne… seaso…               0 compe…              1
# ℹ 525 more rows
# ℹ 2 more variables: monster_gender <fct>, monster_real <fct>

Code

write_rds(scooby_sample, file = "scooby_sample.Rds")

4 Code Book and Description

Variable	Description	Type	Analytic Role	Missing Values
`index`	Unique Episode/Movie Identifier	Character Variable	Unique Identifier	0
`imdb`	The IMDB (Internet Movie Database) rating for the episode/movie.	Quantitative Variable	Predictor	15
`engagement`	The number of people who voted on IMDB to rate the episode/movie.	Quantitative Variable	Predictor	15
`network`	The network/producer that produced the epsiode/movie of Scooby-Doo.	Multi-categorical Variable with 5 levels	Predictor	0
`season`	The season the episode belonged to, or if it was a movie or a special.	Multi-categorical Variable with 5 levels	Predictor	0
`rooby-rooby-roo`	The number of times the phrase, “Rooby Rooby Roo!” was said in an episode/movie	Quantitative count variable	Count Outcome	0
`motive`	The motivation of the primary antagonist of the episode/movie of Scooby-Doo.	Multi-categorical variable with 6 levels	Multi-categorical Outcome	0
`monster_amount`	The number of monsters that appeared in an episode/movie of Scooby-Doo	Quantitative variable	Predictor	0
`monster_gender`	Describes if the monsters seen were all male, or if they included a female monster in their count.	Binary Variable	Predictor	30
`monster_real`	Describes if the monster was actually real, or if it was a human in a suit, a robot, or a brainwashed person.	Binary Variable	Predictor	30

The levels of the categorical variables and their description are as follows:

network

The network/producer that produced the epsiode/movie of Scooby-Doo.

Level	Description	Distinct Observations
`abc`	Episode/Movie was produced by ABC.	218
`cartoon_network`	Episode/Movie was produced by Cartoon Network.	82
`warner_brothers`	Episode/Movie was produced by Warner Brothers.	82
`boomerang`	Episode/Movie was produced by Boomerang.	73
`other_networks`	Episode/Movie was produced by networks other than the four described above.	80

season

The season the episode belonged to, or if it was a movie or a special.

Level	Description	Distinct Observations
`season_1`	Episode was from Season 1.	275
`season_2`	Episode was from Season 2.	143
`season_3_4`	Episode was from Season 3 or 4.	58
`movie`	Scooby-Doo media was a movie.	40
`special_crossover`	Scooby-Doo media was a special or a cross-over.	19

motive

The motivation of the primary antagonist of the episode/movie of Scooby-Doo.

Level	Description	Distinct Observations
`competition`	The antagonist intended to win in a competition of sorts.	168
`theft`	The antagonist intended to steal an item of value.	124
`indirect_self_interest`	The antagonist had a motive that indirectly furthered their self interest.	118
`treasure`	The antagonist intended to obtain treasure of some kind.	54
`conquer`	The antagonist intended to conquer something or someone.	42
`misc_motive`	The antagonist had a more nebulous, rare motivation.	29

monster_gender

Were the monsters seen all male, or did they include a female monster in their count?

The monster’s gender is defined by the pronouns used by the Mystery Inc. gang to refer to them. Monsters referred to as he/him were categorized as male, and monsters referred to as she/her were categorized as female.

Level	Description	Distinct Observations
`male`	The monsters seen were all male.	427
`female`	There was at least one female monster in the episode.	78

monster_real

Was the monster seen actually real, or was it a human in a suit, a robot, or a brainwashed person?

Level	Description	Distinct Observations
`not_real`	The monster was not actually real.	402
`real`	The monster turned out to be real.	103

5 Analyses

5.1 Analysis 1

5.1.1 The Research Question

Predicting an iconic catchphrase: Are the logistics of an episode of Scooby-Doo good predictors of the number of times an iconic catchphrase is spoken?

5.1.2 The Outcome

A numerical summary of the outcome variable was obtained.

Code

scooby_sample |>
  select(rooby_rooby_roo) |>
  describe()

select(scooby_sample, rooby_rooby_roo) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
rooby_rooby_roo 
       n  missing distinct     Info     Mean      Gmd 
     535        0        8    0.782    0.729   0.7098 
                                                          
Value          0     1     2     3     4     5     6     7
Frequency    210   289    19    11     3     1     1     1
Proportion 0.393 0.540 0.036 0.021 0.006 0.002 0.002 0.002

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

Most of the values are 0s and 1s, with the other values being fewer in comparison.

5.1.3 Visualizing the Outcome Variable

The outcome variable’s distribution was visualized using ggplot.

Code

p1 <- 
  ggplot(scooby_sample, aes(x = rooby_rooby_roo)) +
  geom_histogram(bins = 25, 
  fill = "#2b8cbe", col = "white") +
  labs( x = "Number of times the catchphrase was said", 
        y = "Count",
        title = str_wrap("Distribution of the number of times the catchphrase 'Rooby-Rooby-Roo!' was spoken in each episode of Scooby-Doo", width = 90),
         subtitle = glue("Across ", nrow(scooby_sample), 
                           " episodes/movies/specials of the entirety of the Scooby-Doo show."),
         caption = "")


p1

5.1.4 The Predictors

The predictors for the outcome are imdb, engagement, network and season. Each predictor variable was individually described.

imdb is a continuous, quantitative variable.

Code

scooby_sample |>
  select(imdb) |>
  describe()

select(scooby_sample, imdb) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
imdb 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15       48    0.996    7.347   0.7684      5.9      6.5 
     .25      .50      .75      .90      .95 
     7.1      7.4      7.7      8.2      8.4 

lowest : 4.6 4.8 4.9 5   5.1, highest: 9   9.1 9.2 9.3 9.6
--------------------------------------------------------------------------------

engagement is a continuous, quantitative variable.

Code

scooby_sample |>
  select(engagement) |>
  describe()

select(scooby_sample, engagement) 

 1  Variables      535  Observations
--------------------------------------------------------------------------------
engagement 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     520       15      229        1    645.2     1153     17.0     22.9 
     .25      .50      .75      .90      .95 
    31.0     75.0    166.2    384.6   1891.4 

lowest :      7      9     11     12     13, highest:   6929  12620  20226  50847 100951
--------------------------------------------------------------------------------

network is a multi-categorical variable.

Code

scooby_sample |>
  tabyl(network) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

network	n	percent
abc	218	0.4074766
cartoon_network	82	0.1532710
warner_brothers	82	0.1532710
boomerang	73	0.1364486
other_networks	80	0.1495327

season is a multi-categorical variable.

Code

scooby_sample |>
  tabyl(season) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

season	n	percent
season_1	275	0.5140187
season_2	143	0.2672897
season_3_4	58	0.1084112
movie	40	0.0747664
special_crossover	19	0.0355140

The outcome was then visualized against each predictor.

5.2 `rooby_rooby_roo` vs `network`

Code

tmp_data <- scooby_sample

tmp_data$network <-  fct_recode(tmp_data$network,
                               ABC = "abc",
                               "Cartoon Network" = "cartoon_network",
                               "Warner Brothers" = "warner_brothers",
                               "Boomerang" = "boomerang",
                               "Other Networks" = "other_networks")
ggplot(tmp_data, aes(x = network, y = rooby_rooby_roo)) +
    geom_violin(fill = "#2b8cbe", alpha = 0.3, scale = "width")+ 
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  labs(y = "Number of times 'Rooby-Rooby-Roo!' was said",
       x = "Network/Producer",
       title = str_wrap("Mapping the distribution of the outcome variable across the networks/producers who produced Scooby-Doo.",width = 90),
       subtitle = "Across 535 episodes of Scooby-Doo.")

Warner Brothers appears to have a higher number of Rooby-Rooby-Roo’s compared to the other networks.

5.3 `rooby_rooby_roo` vs `season`

Code

tmp_data <- scooby_sample

tmp_data$season <-  fct_recode(tmp_data$season,
                               "Season 1" = "season_1",
                               "Season 2" = "season_2",
                               "Season 3 and 4" = "season_3_4",
                               "Movie" = "movie",
                               "Special / Crossover" = "special_crossover")
ggplot(tmp_data, aes(x = season, y = rooby_rooby_roo)) +
    geom_violin(fill = "#2b8cbe", alpha = 0.3, scale = "width")+ 
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  labs(y = "Number of times 'Rooby-Rooby-Roo!' was said",
       x = "Network/Producer",
       title = str_wrap("Mapping the distribution of the outcome variable across the various seasons/movies of Scooby-Doo.",width = 90),
       subtitle = "Across 535 episodes of Scooby-Doo.")

Movies appear to have a higher number of Rooby-Rooby-Roos compared to the other seasons/specials/crossovers.

5.4 `rooby_rooby_roo` vs `imdb`

Code

ggplot(tmp_data, aes(y = as.factor(rooby_rooby_roo), x = imdb)) +
    geom_violin(fill = "#2b8cbe", alpha = 0.4, scale = "width") +
   
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  labs(y = "Number of times 'Rooby-Rooby-Roo!' was said",
       x = "IMDB Rating",
       title = str_wrap("Mapping the distribution of the outcome variable across the IMDB rating",width = 90),
       subtitle = "Across 535 episodes of Scooby-Doo.")

Scooby-Doo media with moreRooby-Rooby-Roo’s appear to have lower IMDB ratings.

5.5 `rooby_rooby_roo` vs `engagement`

Code

ggplot(tmp_data, aes(x = as.factor(rooby_rooby_roo), y = engagement)) +
  geom_violin(fill = "#2b8cbe", alpha = 0.4, scale = "width") +
   
    geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red") +
  scale_x_discrete(breaks = c(0,1,2,3,4,5,6,7,8,9,10)) +
  labs(x = "Number of times 'Rooby-Rooby-Roo!' was said",
       y = "IMDB Engagement Score",
       title = str_wrap("Mapping the distribution of the outcome variable across the IMDB engagement metric.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.")

A significant outlier is seen for the engagement score. This outlier is the live action Scooby-Doo movie, titled “Scooby-Doo”, released in 2002.

Code

scooby_sample |>
  filter(engagement > 100000) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

index	imdb	engagement	network	season	rooby_rooby_roo	motive	monster_amount	monster_gender	monster_real
342	5.1	100951	warner_brothers	movie	3	conquer	10	female	real

This live action movie was considered a cult classic soon after it’s release due to the plot being more intense that the usual fare for Scooby-Doo movies, with a fair share of more adult jokes appearing as well. The movie was rated on IMDB by over 100,000 people, indicating that it stands several degrees above the other Scooby-Doo media in terms of engagement on IMDB.

5.6 Visualizing the distribution of the zero counts across the `network` predictor

The distribution of zero counts across the network predictor was visualized.

Code

tmp_data <- scooby_sample

tmp_data$network <-  fct_recode(tmp_data$network,
                               ABC = "abc",
                               "Cartoon Network" = "cartoon_network",
                               "Warner Brothers" = "warner_brothers",
                               "Boomerang" = "boomerang",
                               "Other Networks" = "other_networks")
tmp_data |> group_by(network) |>
    dplyr::summarize(n = n(), 
              percent_0s = round(100*sum(rooby_rooby_roo == 0)/n,1)) |>
    ggplot(aes(y = network, x = percent_0s)) +
    geom_label(aes(label = percent_0s ), col = "#2b8cbe") +
    labs(x = "% of episodes/movies of Scooby-Doo with zero occurences of 'Rooby-Rooby-Roo!'.",
         y = "Network/Producer",
         title = str_wrap("Visualizing the percentage of episodes/movies with a zero count of `rooby_rooby_roo` across the different networks.", width = 80))

5.7 Visualizing the distribution of the zero counts across the `season` predictor

The distribution of zero counts across the season predictor was visualized.

Code

tmp_data <- scooby_sample

tmp_data$season <-  fct_recode(tmp_data$season,
                               "Season 1" = "season_1",
                               "Season 2" = "season_2",
                               "Season 3 and 4" = "season_3_4",
                               "Movie" = "movie",
                               "Special / Crossover" = "special_crossover")
tmp_data |> group_by(season) |>
    dplyr::summarize(n = n(), 
              percent_0s = round(100*sum(rooby_rooby_roo == 0)/n,1)) |>
    ggplot(aes(y = season, x = percent_0s)) +
    geom_label(aes(label = percent_0s), col = "#2b8cbe") +
    labs(x = "% of episodes/movies of Scooby-Doo with zero occurences of 'Rooby-Rooby-Roo!'.",
         y = "Season / Movie / Special",
         title = str_wrap("Visualizing the percentage of episodes/movies with a zero count of `rooby_rooby_roo` across different seasons/Movies/Specials of Scooby-Doo.", width = 60))

5.8 Preliminary Thoughts

I believe that higher IMDB scores will be associated with higher counts of Rooby-Rooby-Roo, and lower IMDB scores will be associated with lower counts of Rooby-Rooby-Roo.

I believe that movies will have higher counts of Rooby-Rooby-Roo’s.

I believe that media produced by Cartoon Network and Boomerang will have lower counts of Rooby-Rooby-Roo’s.

5.8.1 Missingness

A seperate dataset, containing the identifier variable, along with the outcome and the predictor variables was created and stored in the tibble scooby_q1. This tibble was then assessed for missingness using miss_var_summary.

Code

scooby_q1 <- scooby_sample |>
  select(index,rooby_rooby_roo, imdb, engagement, network, season)
scooby_q1 |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
imdb	15	2.803738
engagement	15	2.803738
index	0	0.000000
rooby_rooby_roo	0	0.000000
network	0	0.000000
season	0	0.000000

The variables imdb and engagement have 15 missing values each.

The data can be assumed to be MAR (Missing At Random), since the missing data are not randomly distributed. It cannot be assumed to be MNAR (Missing Not At Random) since there does not appear to be a relationship between the magnitude of a value and it’s missingness (or inclusion in the non-missing data), since there is no preponderance of values with lower or higher magnitude being missing.

Single imputation was then performed on the dataset, to account for missingness. The complete dataset, containing the data for both the analyses had it’s missing values imputed.

5.8.2 Imputing Missing Variables

Using the mice function, missing values in the scooby_q1 tibble were singly imputed and stored in the q1_simp tibble.

Code

set.seed(4322023)
q1_mice1 <- mice(scooby_q1, m = 10, print = FALSE)

Warning: Number of logged events: 1

Code

q1_simp <- mice::complete(q1_mice1)

miss_var_summary(q1_simp)

# A tibble: 6 × 3
  variable        n_miss pct_miss
  <chr>            <int>    <dbl>
1 index                0        0
2 rooby_rooby_roo      0        0
3 imdb                 0        0
4 engagement           0        0
5 network              0        0
6 season               0        0

5.8.3 Splitting the data

The data was then split into a training (q1_train) and testing (q1_test) dataset using the initial_split function.

Code

set.seed(4322023)
scoob_split <- initial_split(q1_simp, prop = 0.75)
q1_train = training(scoob_split)
q1_test = testing(scoob_split)
dim(q1_train)

[1] 401   6

Code

dim(q1_test)

[1] 134   6

5.8.4 Assessing Collinearity

The ggpairs function was used to plot a scatterplot matrix to assess for potential collinearity.

Code

ggpairs(q1_simp, columns = c("rooby_rooby_roo", "imdb",
                             "engagement", "network",
                             "season"))

The variables imdb and engagement exhibit collinearity with each other and the outcome rooby-rooby-roo.

Code

mod_poisson_col <- glm(rooby_rooby_roo ~ imdb +  engagement + network + season,data = q1_train, family = "poisson")

car::vif(mod_poisson_col)

               GVIF Df GVIF^(1/(2*Df))
imdb       1.611302  1        1.269371
engagement 1.272811  1        1.128189
network    3.401606  4        1.165361
season     3.689599  4        1.177260

5.8.5 Non-Linear Terms

A Spearman plot was then created to assess if non-linear terms would be needed in the models.

Code

sp2 <- spearman2(rooby_rooby_roo ~ imdb + engagement + network +season, data = q1_simp)

plot(sp2)

The plot suggests the following:

An interaction term between the main effects of network and imdb, which would add 4 degrees of freedom to the model.
A restricted cubic spline applied to imdb.
An interaction term between imdb and season, which would add 4 degrees of freedom to the model.

The Poisson model created for the purpose of testing collinearity was then re-fit using Glm() from the rms package to obtain the number of degrees of freedom in the model without adding a non-linear term.

Code

d <- datadist(q1_simp)
options(datadist = "d")

mod_poisson_rms <- Glm(rooby_rooby_roo ~ imdb + engagement + network +season,
                     family = poisson(), 
                     data = q1_simp, 
                     x = T, y = T)

mod_poisson_rms

General Linear Model

Glm(formula = rooby_rooby_roo ~ imdb + engagement + network + 
    season, family = poisson(), data = q1_simp, x = T, y = T)

                      Model Likelihood    
                            Ratio Test    
         Obs 535    LR chi2      90.48    
Residual d.f.524    d.f.            10    
         g 0.496    Pr(> chi2) <0.0001    

                         Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                 1.8019 0.6048  2.98  0.0029  
imdb                     -0.2779 0.0819 -3.39  0.0007  
engagement                0.0000 0.0000  0.32  0.7470  
network=cartoon_network  -0.4735 0.1931 -2.45  0.0142  
network=warner_brothers  -0.0361 0.1845 -0.20  0.8447  
network=boomerang        -0.5819 0.2147 -2.71  0.0067  
network=other_networks   -0.2212 0.1684 -1.31  0.1890  
season=season_2          -0.1466 0.1435 -1.02  0.3068  
season=season_3_4         0.0265 0.1691  0.16  0.8754  
season=movie              0.6389 0.2305  2.77  0.0056  
season=special_crossover -0.1181 0.3072 -0.38  0.7008

A restricted cubic spline with 3 knots applied to imdb would add a single degree of freedom to the model, spending a total of 11 degrees of freedom.

5.8.6 Fitting a Poisson Model

A Poisson model, mod_poisson, was fit using glm().

Code

mod_poisson <- glm(rooby_rooby_roo ~ rcs(imdb,3) +  engagement + network + season ,data = q1_train, family = "poisson")

5.8.7 Model Coefficients

The model’s coefficients were obtained using the tidy function.

Code

tidy(mod_poisson) |>
  select(term, estimate, std.error, p.value) |>
  kbl(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	std.error	p.value
(Intercept)	1.719	1.075	0.110
rcs(imdb, 3)imdb	-0.274	0.152	0.072
rcs(imdb, 3)imdb'	0.067	0.253	0.792
engagement	0.000	0.000	0.811
networkcartoon_network	-0.454	0.246	0.065
networkwarner_brothers	-0.053	0.225	0.814
networkboomerang	-0.515	0.242	0.034
networkother_networks	-0.188	0.226	0.404
seasonseason_2	-0.208	0.173	0.229
seasonseason_3_4	-0.048	0.205	0.816
seasonmovie	0.837	0.277	0.003
seasonspecial_crossover	-0.217	0.353	0.540

5.8.8 Poisson Rootogram

A rootogram was created using the rootogram function from the countreg package.

Code

p_root <- topmodels::rootogram(mod_poisson, max = 10,
                              main = "Rootogram for the Poisson model")

The Poisson model (mod_poisson) under-predicts 0 counts, over-predicts 1 counts, and under-predicts 3, 4 and 5 counts, and over-predicts 6 and 7 counts.

5.8.9 Storing Training Sample Predictions

Using the augment function, the training sample predictions for the mod_poisson model were stored.

Code

mod_poisson_aug <- augment(mod_poisson, q1_train,
                     type.predict = "response")
mod_poisson_aug |> 
  select(index, rooby_rooby_roo, .fitted) |>
  head(3)|> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

index	rooby_rooby_roo	.fitted
144	1	0.744
435	0	0.413
345	1	0.706

5.8.10 Summarizing Training Sample `mod_poisson` Fit

The \(R^2\), Root mean Squared Error (RMSE) and Mean Absolute Error were then obtained from model mod_poisson.

Code

mets <- metric_set(rsq, rmse, mae)
mod_poisson_summary <-
mets(mod_poisson_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_possion") |> relocate(model)
mod_poisson_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

model	.metric	.estimator	.estimate
mod_possion	rsq	standard	0.228
mod_possion	rmse	standard	0.727
mod_possion	mae	standard	0.525

5.8.11 Fitting a Negative Binomial Model

A negative binomial model was fit using the glm.nb function.

Code

mod_nb <- MASS::glm.nb(rooby_rooby_roo ~ rcs(imdb, 3) + engagement + network +season ,data = q1_train)

Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached

Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached

5.8.12 Model Coefficients

The model coefficients were placed in a table.

Code

tidy(mod_nb) |> kable(digits = c(0, 3, 3, 1, 3)) |>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	std.error	statistic	p.value
(Intercept)	1.720	1.075	1.6	0.110
rcs(imdb, 3)imdb	-0.274	0.152	-1.8	0.072
rcs(imdb, 3)imdb'	0.067	0.253	0.3	0.792
engagement	0.000	0.000	0.2	0.811
networkcartoon_network	-0.454	0.246	-1.8	0.065
networkwarner_brothers	-0.053	0.225	-0.2	0.814
networkboomerang	-0.515	0.242	-2.1	0.034
networkother_networks	-0.188	0.226	-0.8	0.404
seasonseason_2	-0.208	0.173	-1.2	0.229
seasonseason_3_4	-0.048	0.205	-0.2	0.816
seasonmovie	0.837	0.277	3.0	0.003
seasonspecial_crossover	-0.217	0.353	-0.6	0.540

5.8.13 Negative Binomial Model Rootogram

Code

nb_root <- topmodels::rootogram(mod_nb, max = 10,
main = "Rootogram for the Negative Binomial Model")

5.8.14 Storing Training Sample Predictions

Code

mod_nb_aug <- augment(mod_nb, q1_train,
                     type.predict = "response")

Warning: The `augment()` method for objects of class `negbin` is not maintained by the broom team, and is only supported through the `glm` tidier method. Please be cautious in interpreting and reporting broom output.

This warning is displayed once per session.

Code

mod_nb_aug |> 
  select(index, rooby_rooby_roo, .fitted) |>
  head(3)|> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

index	rooby_rooby_roo	.fitted
144	1	0.744
435	0	0.413
345	1	0.706

5.8.15 Summarizing Training Sample `mod_nb` Fit

Code

mets <- metric_set(rsq, rmse, mae)
mod_nb_summary <-
mets(mod_nb_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_nb") |> relocate(model)
mod_nb_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

model	.metric	.estimator	.estimate
mod_nb	rsq	standard	0.228
mod_nb	rmse	standard	0.727
mod_nb	mae	standard	0.525

5.8.16 Fitting a Zero-Inflated Poisson Model

A zero-inflated Poisson model was fit using the zeroinfl function.

Code

mod_zip <- pscl::zeroinfl(rooby_rooby_roo ~ rcs(imdb, 3)  + engagement + network +season,data = q1_train)

Warning in value[[3L]](cond): system is computationally singular: reciprocal
condition number = 6.56872e-62FALSE

Code

summary(mod_zip)


Call:
pscl::zeroinfl(formula = rooby_rooby_roo ~ rcs(imdb, 3) + engagement + 
    network + season, data = q1_train)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.4387 -0.6795  0.1373  0.4081  4.1033 

Count model coefficients (poisson with log link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              1.719e+00         NA      NA       NA
rcs(imdb, 3)imdb        -2.741e-01         NA      NA       NA
rcs(imdb, 3)imdb'        6.660e-02         NA      NA       NA
engagement               2.802e-06         NA      NA       NA
networkcartoon_network  -4.538e-01         NA      NA       NA
networkwarner_brothers  -5.290e-02         NA      NA       NA
networkboomerang        -5.150e-01         NA      NA       NA
networkother_networks   -1.883e-01         NA      NA       NA
seasonseason_2          -2.081e-01         NA      NA       NA
seasonseason_3_4        -4.779e-02         NA      NA       NA
seasonmovie              8.372e-01         NA      NA       NA
seasonspecial_crossover -2.166e-01         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)             -1.511e+01         NA      NA       NA
rcs(imdb, 3)imdb        -3.225e+00         NA      NA       NA
rcs(imdb, 3)imdb'       -1.858e+00         NA      NA       NA
engagement               3.922e-04         NA      NA       NA
networkcartoon_network   1.566e+00         NA      NA       NA
networkwarner_brothers  -1.227e+00         NA      NA       NA
networkboomerang         1.011e+00         NA      NA       NA
networkother_networks    1.266e+00         NA      NA       NA
seasonseason_2           4.908e-01         NA      NA       NA
seasonseason_3_4         4.850e-01         NA      NA       NA
seasonmovie              3.671e-02         NA      NA       NA
seasonspecial_crossover  7.360e-01         NA      NA       NA

Number of iterations in BFGS optimization: 25 
Log-likelihood:  -392 on 24 Df

5.8.17 Zero-Inflated Poisson Model Rootogram

Code

zip_root <- topmodels::rootogram(mod_zip, max = 10,
main = "Rootogram for the Zero-Inflated Poisson Model")

5.8.18 Storing Training Sample Predictions

Code

mod_zip_aug <- q1_train |>
  mutate(".fitted" = predict(mod_zip, type = "response"),
         ".resid" = resid(mod_zip, type = "response"))
mod_zip_aug |> 
  select(index, rooby_rooby_roo, .fitted, .resid) |>
  head(3)

  index rooby_rooby_roo   .fitted     .resid
1   144               1 0.7443436  0.2556564
2   435               0 0.4132138 -0.4132138
3   345               1 0.7061787  0.2938213

5.8.19 Summarizing Training Sample `mod_nb` Fit

Code

mets <- metric_set(rsq, rmse, mae)
mod_zip_summary <-
mets(mod_zip_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_zip") |> relocate(model)
mod_zip_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

model	.metric	.estimator	.estimate
mod_zip	rsq	standard	0.228
mod_zip	rmse	standard	0.727
mod_zip	mae	standard	0.525

5.8.20 Fitting a Zero-Inflated Negative Binomial Model

A zero-inflated Negative Binomial Model was then fit.

Code

mod_zinb <- pscl::zeroinfl(rooby_rooby_roo ~ rcs(imdb, 3)  + engagement + network + season, dist = "negbin", data = q1_train)

Warning in value[[3L]](cond): system is computationally singular: reciprocal
condition number = 2.1577e-46FALSE

Code

summary(mod_zinb)


Call:
pscl::zeroinfl(formula = rooby_rooby_roo ~ rcs(imdb, 3) + engagement + 
    network + season, data = q1_train, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.3855 -0.6603  0.1403  0.3362  4.1881 

Count model coefficients (negbin with log link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              1.257e+00         NA      NA       NA
rcs(imdb, 3)imdb        -2.070e-01         NA      NA       NA
rcs(imdb, 3)imdb'       -4.830e-02         NA      NA       NA
engagement               5.965e-06         NA      NA       NA
networkcartoon_network   3.438e-01         NA      NA       NA
networkwarner_brothers   9.493e-02         NA      NA       NA
networkboomerang        -3.765e-01         NA      NA       NA
networkother_networks   -1.481e-01         NA      NA       NA
seasonseason_2          -1.322e-01         NA      NA       NA
seasonseason_3_4        -8.629e-02         NA      NA       NA
seasonmovie              6.345e-01         NA      NA       NA
seasonspecial_crossover -4.164e-01         NA      NA       NA
Log(theta)               1.663e+01         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)             -72.139488         NA      NA       NA
rcs(imdb, 3)imdb          5.234140         NA      NA       NA
rcs(imdb, 3)imdb'        -4.966800         NA      NA       NA
engagement               -0.006803         NA      NA       NA
networkcartoon_network   36.255230         NA      NA       NA
networkwarner_brothers   -7.081453         NA      NA       NA
networkboomerang         32.389288         NA      NA       NA
networkother_networks   -31.492774         NA      NA       NA
seasonseason_2            1.735211         NA      NA       NA
seasonseason_3_4         -4.228295         NA      NA       NA
seasonmovie               2.998970         NA      NA       NA
seasonspecial_crossover -96.148898         NA      NA       NA

Theta = 16665909.2848 
Number of iterations in BFGS optimization: 103 
Log-likelihood: -384.7 on 25 Df

5.8.21 Rootogram

Code

zinb_root <- topmodels::rootogram(mod_zinb, max = 10,
main = "Rootogram for Zero-Inflated Negative Binomial Model")

5.8.22 Storing Training Sample Predictions

Code

mod_zinb_aug <- q1_train |>
  mutate(".fitted" = predict(mod_zinb, type = "response"),
         ".resid" = resid(mod_zinb, type = "response"))
mod_zinb_aug |> 
  select(index, rooby_rooby_roo, .fitted, .resid) |>
  head(3)

  index rooby_rooby_roo   .fitted     .resid
1   144               1 0.7525684  0.2474316
2   435               0 0.4927507 -0.4927507
3   345               1 0.8279808  0.1720192

5.8.23 Summarizing Training Sample `mod_nb` Fit

Code

mets <- metric_set(rsq, rmse, mae)
mod_zinb_summary <-
mets(mod_zinb_aug, truth = rooby_rooby_roo, estimate = .fitted) |>
mutate(model = "mod_zinb") |> relocate(model)
mod_zinb_summary |> kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

model	.metric	.estimator	.estimate
mod_zinb	rsq	standard	0.289
mod_zinb	rmse	standard	0.698
mod_zinb	mae	standard	0.499

5.8.24 Comparing Rootograms

Code

plot(p_root)

Code

plot(nb_root)

Code

plot(zip_root)

Code

plot(zinb_root)

5.8.25 Vuong’s Procedure Comparing the Negative Binomial Model and the Poisson Model

Code

vuong(mod_nb, mod_poisson)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A  p-value
Raw                    -2.29614 model2 > model1 0.010834
AIC-corrected          -2.29614 model2 > model1 0.010834
BIC-corrected          -2.29614 model2 > model1 0.010834

5.8.26 Vuong’s Procedure Comparing the Zero-Inflated Poisson Model and the Poisson Model

Code

vuong(mod_zip, mod_poisson)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A p-value
Raw               -1.996332e-01 model2 > model1 0.42088
AIC-corrected     -8.766005e+05 model2 > model1 < 2e-16
BIC-corrected     -2.627154e+06 model2 > model1 < 2e-16

5.8.27 Vuong’s Procedure Comparing the ZINB Model and the Poisson Model

Code

vuong(mod_zinb, mod_poisson)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A   p-value
Raw                    1.609132 model1 > model2  0.053794
AIC-corrected         -1.030014 model2 > model1  0.151502
BIC-corrected         -6.300337 model2 > model1 1.485e-10

5.8.28 Vuong’s Procedure Comparing the ZINB Model and the Negative Binomial Model

Code

vuong(mod_zinb, mod_nb)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                    1.609863 model1 > model2   0.053714
AIC-corrected         -1.029515 model2 > model1   0.151619
BIC-corrected         -6.300301 model2 > model1 1.4853e-10

5.8.29 Testing Sample Validations

Testing sample validations were obtained by using the predict function.

Code

test_poisson <- predict(mod_poisson, newdata = q1_test,
type.predict = "response")
test_nb <- predict(mod_nb, newdata = q1_test,
type.predict = "response")
test_zip <- predict(mod_zip, newdata = q1_test,
type.predict = "response")
test_zinb <- predict(mod_zinb, newdata = q1_test,
type.predict = "response")

Code

test_res <- bind_cols(q1_test,
                      pre_poisson = test_poisson, 
                      pre_nb = test_nb,
                      pre_zip = test_zip, 
                      pre_zinb = test_zinb)
names(test_res)

 [1] "index"           "rooby_rooby_roo" "imdb"            "engagement"     
 [5] "network"         "season"          "pre_poisson"     "pre_nb"         
 [9] "pre_zip"         "pre_zinb"

The metrics of the testing sample’s fit were obtained.

Code

poisson_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_poisson) |>
  mutate(model = "mod_poisson")

nb_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_nb) |>
mutate(model = "mod_nb")

zip_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_zip) |>
mutate(model = "mod_zip")

zinb_sum <- mets(test_res, truth = rooby_rooby_roo, estimate = pre_zinb) |>
mutate(model = "mod_zinb")

test_sum <- bind_rows(poisson_sum, nb_sum, zip_sum, zinb_sum)|>
  pivot_wider(names_from = model,
              values_from = .estimate)
test_sum |>
select(-.estimator) |> kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

.metric	mod_poisson	mod_nb	mod_zip	mod_zinb
rsq	0.210	0.210	0.077	0.168
rmse	1.345	1.345	0.705	0.686
mae	1.210	1.210	0.535	0.521

5.8.30 Selecting a Final Model

The final winning model was the Poisson model, because:

It has the highest testing sample \(R^2\).
From the Vuong test, the other models do not appear to be a significant improvement over the Poisson model.
The rootograms for the Poisson model are not very different from the other three models.

A poisson model called mod_poisson_imp was fit on the entire imputed dataset.

Code

mod_poisson_imp <- glm(rooby_rooby_roo ~ rcs(imdb,3) +  engagement + network + season ,data = q1_simp, family = "poisson")
summary(mod_poisson_imp)


Call:
glm(formula = rooby_rooby_roo ~ rcs(imdb, 3) + engagement + network + 
    season, family = "poisson", data = q1_simp)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)   
(Intercept)              1.782e+00  9.109e-01   1.956  0.05047 . 
rcs(imdb, 3)imdb        -2.750e-01  1.297e-01  -2.121  0.03393 * 
rcs(imdb, 3)imdb'       -6.195e-03  2.095e-01  -0.030  0.97641   
engagement               1.803e-06  5.675e-06   0.318  0.75069   
networkcartoon_network  -4.710e-01  2.116e-01  -2.225  0.02606 * 
networkwarner_brothers  -3.598e-02  1.845e-01  -0.195  0.84539   
networkboomerang        -5.810e-01  2.166e-01  -2.683  0.00731 **
networkother_networks   -2.185e-01  1.921e-01  -1.137  0.25555   
seasonseason_2          -1.463e-01  1.438e-01  -1.017  0.30907   
seasonseason_3_4         2.636e-02  1.692e-01   0.156  0.87621   
seasonmovie              6.397e-01  2.319e-01   2.758  0.00581 **
seasonspecial_crossover -1.186e-01  3.077e-01  -0.385  0.69998   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 469.87  on 534  degrees of freedom
Residual deviance: 379.39  on 523  degrees of freedom
AIC: 1084.7

Number of Fisher Scoring iterations: 5

5.8.31 Rootogram

The rootogram for the final model was obtained.

Code

poisson_imp_root <- topmodels::rootogram(mod_poisson_imp, max = 10,
main = "Rootogram for Final Poisson Model")

5.8.32 Storing Fitted Values and Residuals

The fitted values were stored and the \(R^2\) of the final was obtained.

Code

sm_poiss1 <- augment(mod_poisson_imp, q1_simp,
                     type.predict = "response")

sm_poiss1 |> 
    select(rooby_rooby_roo, .fitted) |>
    head()

# A tibble: 6 × 2
  rooby_rooby_roo .fitted
            <dbl>   <dbl>
1               1   0.512
2               0   0.512
3               0   0.526
4               0   0.557
5               0   0.606
6               1   0.470

Code

# The correlation of observed and fitted values
(poiss_r <- with(sm_poiss1, cor(rooby_rooby_roo, .fitted)))

[1] 0.4658293

Code

poiss_r^2

[1] 0.2169969

The glance function was used to obtain the log-likelihood value, and the AIC and the BIC.

Code

glance(mod_poisson_imp) |> round(3)

# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <dbl>  <dbl> <dbl> <dbl>    <dbl>       <dbl> <dbl>
1          470.     534  -530. 1085. 1136.     379.         523   535

Log-Likelihood	AIC	BIC	\(R^2\)
530.352	1084.70	1136.10	0.22

The \(R^2\) is 0.22, indicating that the model does not do a very good job of explaining the variation seen in the data.

5.8.33 Diagnostic Plots

A plot of the residuals versus the fitted values of rooby-rooby-roo for the final model were obtained.

Code

ggplot(sm_poiss1, aes(x = .fitted, y = .resid)) +
    geom_point() +
    labs(title = "Residuals vs. Fitted `rooby_rooby_roo`",
         subtitle = "Original Poisson Regression model")

The pattern seen here corroborates the pattern seen in the rootogram, with an underfitting of 0 values and an overfitting of 1 values.

glm.diag.plots was used to obtain the other diagnostic plots.

Code

glm.diag.plots(mod_poisson_imp)

The plots show that the model residuals follow a somewhat normal distribution, with some significant outliers. There are no points that exert an undue influence on the model.

Code

par(mfrow = c(2,2))
plot(mod_poisson_imp, which = 1)
plot(mod_poisson_imp, which = 2)
plot(mod_poisson_imp, which = 3)
plot(mod_poisson_imp, which = 4)

326 shows up as a outlier in all of these plots, along with 342.

5.9 Outlier Diagnostics

Code

scooby_raw |>
  filter(index == 326) |>
  select(title,rooby_rooby_roo, engagement, imdb, network, season) |>
  kable() |>
  kable_classic_2(font_size = 24,full_width = F)

title	rooby_rooby_roo	engagement	imdb	network	season
Dawn of the Spooky Shuttle Scare	1	37	7.1	ABC	3

The first outlier is an episode from Season 3 produced by ABC, with a single rooby-rooby-roo.

Code

scooby_raw |>
  filter(index == 342) |>
  select(title,rooby_rooby_roo, engagement, imdb, network, season) |>
  kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

title	rooby_rooby_roo	engagement	imdb	network	season
Scooby-Doo	3	100951	5.1	Warner Bros. Picture	Movie

The second outlier is the live action Scooby-Doo movie, titled “Scooby-Doo” which was very popular at the time it released, and was considered a cult classic. Over 10,000 people voted to rate the movie on IMDB, which is higher than the engagement value for the other Scooby-Doo media.

5.10 Testing for Overdispersion

The Poisson distribution assumes that the variance of the distribution should be equal to it’s mean. Overdispersion occurs when the observed variance of the data is more than what would be expected under a Poisson distributio, and can indicate variability in the data that is not being explained by the model.

Code

cat("overdispersion ratio is ", sum(z^2)/ (n - k), "\n")

overdispersion ratio is  0.6754214

Code

cat("p value of overdispersion test is ", 
    pchisq(sum(z^2), df = n-k, lower.tail = FALSE), "\n")

p value of overdispersion test is  1

It does not look like the data has the issue of overdispersion from the large p value, and the data can be assumed to have a Poisson distribution.

5.11 Model Coefficients

The final model’s coefficients were obtained.

Code

tidy(mod_poisson_imp) |>
  select(term, estimate, std.error, p.value) |>
  kbl(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	std.error	p.value
(Intercept)	1.782	0.911	0.050
rcs(imdb, 3)imdb	-0.275	0.130	0.034
rcs(imdb, 3)imdb'	-0.006	0.210	0.976
engagement	0.000	0.000	0.751
networkcartoon_network	-0.471	0.212	0.026
networkwarner_brothers	-0.036	0.185	0.845
networkboomerang	-0.581	0.217	0.007
networkother_networks	-0.218	0.192	0.256
seasonseason_2	-0.146	0.144	0.309
seasonseason_3_4	0.026	0.169	0.876
seasonmovie	0.640	0.232	0.006
seasonspecial_crossover	-0.119	0.308	0.700

5.11.1 Odds Ratios

This was then converted into Odds ratios.

Code

tidy(mod_poisson_imp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low95 = conf.low, high95 = conf.high, p = p.value) |>
  kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	std.error	low95	high95	p
(Intercept)	5.940	0.911	0.968	34.596	0.050
rcs(imdb, 3)imdb	0.760	0.130	0.591	0.983	0.034
rcs(imdb, 3)imdb'	0.994	0.210	0.653	1.485	0.976
engagement	1.000	0.000	1.000	1.000	0.751
networkcartoon_network	0.624	0.212	0.407	0.935	0.026
networkwarner_brothers	0.965	0.185	0.665	1.372	0.845
networkboomerang	0.559	0.217	0.359	0.842	0.007
networkother_networks	0.804	0.192	0.547	1.162	0.256
seasonseason_2	0.864	0.144	0.648	1.140	0.309
seasonseason_3_4	1.027	0.169	0.730	1.418	0.876
seasonmovie	1.896	0.232	1.205	2.993	0.006
seasonspecial_crossover	0.888	0.308	0.461	1.556	0.700

6 Analysis 2

Monster Motivation: Is it possible to predict the motive of the major antagonist of an episode of Scooby-Doo, based on the nature of the monster the antagonist appears as?

6.1 The Outcome

The outcome variable is motive, describing the motive of the antagonist of each episode/movie of Scooby-Doo.

Code

scooby_sample |>
  tabyl(motive) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

motive	n	percent
competition	168	0.3140187
theft	124	0.2317757
indirect_self_interest	118	0.2205607
treasure	54	0.1009346
conquer	42	0.0785047
misc_motive	29	0.0542056

6.2 The Predictors

The predictors are monster_amount, monster_gender, monster_real.

monster_amount is a continuous quantitative variable describing the number of monsters that appear in each episode/movie.

Code

describe(scooby_sample$monster_amount)

scooby_sample$monster_amount 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     535        0       16    0.739    1.944    1.781        0        1 
     .25      .50      .75      .90      .95 
       1        1        2        4        7 
                                                                            
Value          0     1     2     3     4     5     6     7     8     9    10
Frequency     30   341    65    32    23     7     5     7     6     4     3
Proportion 0.056 0.637 0.121 0.060 0.043 0.013 0.009 0.013 0.011 0.007 0.006
                                        
Value         11    12    13    15    19
Frequency      4     3     3     1     1
Proportion 0.007 0.006 0.006 0.002 0.002

For the frequency table, variable is rounded to the nearest 0

monster_gender describes if the monster/monsters present in the episode were either male or female(in terms of the pronouns the Mystery Inc. gang used to refer to them). This variable had a large number of levels, and was collapsed to male if all of the monsters were male, and female if any of the monsters present in the episode/movie were female.

Code

scooby_sample |>
  tabyl(monster_gender) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

monster_gender	n	percent	valid_percent
male	427	0.7981308	0.8455446
female	78	0.1457944	0.1544554
NA	30	0.0560748	NA

monster_real describes if the monster was actually real, or someone in a costume, or a mind-controlled person, etc.

Code

scooby_sample |>
  tabyl(monster_real) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

monster_real	n	percent	valid_percent
not_real	402	0.7514019	0.7960396
real	103	0.1925234	0.2039604
NA	30	0.0560748	NA

6.3 Missingness

A seperate dataset, containing the identifier variable, along with the outcome and the predictor variables was created and stored in the tibble scooby_q2. This tibble was then assessed for missingness using miss_var_summary.

Code

scooby_q2 <- scooby_sample |>
  select(index,motive, monster_amount, monster_gender, monster_real)
scooby_q2 |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
monster_gender	30	5.607477
monster_real	30	5.607477
index	0	0.000000
motive	0	0.000000
monster_amount	0	0.000000

The variables monster_gender and monster_real have 15 missing values each.

Single imputation was then performed on the dataset, to account for missingness. The complete dataset, containing the data for both the analyses had it’s missing values imputed.

6.4 Imputing Missing Variables

Code

set.seed(4322023)
q2_mice1 <- mice(scooby_q2, m = 10, print = FALSE)

Warning: Number of logged events: 1

Code

q2_simp <- mice::complete(q2_mice1)

miss_var_summary(q2_simp)

# A tibble: 5 × 3
  variable       n_miss pct_miss
  <chr>           <int>    <dbl>
1 index               0        0
2 motive              0        0
3 monster_amount      0        0
4 monster_gender      0        0
5 monster_real        0        0

6.5 Visualizing Data Exploration

Code

mosaic::favstats(monster_amount ~ motive, data = q2_simp) |>
  kable(digits = 2) |>
  kable_classic_2(font_size = 24,full_width = F)

Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

motive	min	Q1	median	Q3	max	mean	sd	n
competition	0	1	1.0	1.00	19	1.58	2.09	168
theft	0	1	1.0	1.00	15	1.56	1.82	124
indirect_self_interest	0	1	1.0	2.00	9	1.60	1.34	118
treasure	1	1	1.0	2.00	7	1.70	1.28	54
conquer	1	3	5.5	9.75	13	6.17	3.96	42
misc_motive	0	0	1.0	2.00	5	1.38	1.42	29

6.5.1 `motive` vs `monster_amount`

Code

tmp_data_2 <- q2_simp

tmp_data_2$motive <-  fct_recode(tmp_data_2$motive,
                               "Competition" = "competition",
                               "Theft" = "theft",
                               "Indirect Self Interest" = "indirect_self_interest",
                               "Treasure" = "treasure",
                               "Conquer" = "conquer",
                               "Miscellaneous Motives" = "misc_motive")
ggplot(tmp_data_2, aes(x = motive, y = monster_amount, fill = motive))  +
  geom_violin(fill = "#2b8cbe", alpha = 0.3, scale = "width")+ 
  geom_boxplot(fill = "#2b8cbe", width = 0.1,
               outlier.color = "red")+
  labs(y = "Number of monsters per episode/movie",
       x = "Motive of the antagonist",
       title = str_wrap("Mapping the motives of the major antagonist of every episode/movie against the number of monsters across the various seasons/movies of Scooby-Doo.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.")

Antagonists with a motivation to conquer appear to star in episodes with more monsters, compared to antagonists with other motivations.

6.5.2 `motive` vs `monster_real`

Code

tmp_data_2 <- tmp_data_2 |>
  mutate(monster_real = fct_recode(monster_real,
                               "Monster was not real" = "not_real",
                               "Monster was real" = "real"))

ggplot(tmp_data_2, aes(motive))+
  geom_bar(fill = "#2b8cbe") + 
  facet_wrap(~ monster_real)+
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  labs(y = "Count",
       x = "Motive of the antagonist",
       title = str_wrap("Mapping the motives of the major antagonist of every episode/movie against the nature of reality of the monsters across the various seasons/movies of Scooby-Doo.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.")

It appears that for episodes with monsters that are not real, competition is the most numerous motivation. In episodes with monsters that are real, conquering appears to be the most numerous motivation.

6.5.3 `motive` vs `monster_gender`

Code

tmp_data_2 <- tmp_data_2 |>
  mutate(monster_gender = fct_recode(monster_gender,
                               "Male" = "male",
                               "Female" = "female"))
ggplot(tmp_data_2, aes(motive))+
  geom_bar(fill = "#2b8cbe") + 
  facet_wrap(~ monster_gender)+
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + 
  labs(y = "Count",
       x = "Motive of the antagonist",
       title = str_wrap("Mapping the motives of the major antagonist of every episode/movie against the gender of the monsters across the various seasons/movies of Scooby-Doo.",width = 80),
       subtitle = "Across 535 episodes of Scooby-Doo.")

Episodes with only male monsters have competition as the most numerous outcome, and episodes that have female monsters also have competition as the most numerous outcome.

6.6 Assessing collinearity

The ggpairs function was used to plot a scatterplot matrix to assess for potential collinearity.

Code

ggpairs(q2_simp, columns = c("motive", "monster_real",
                             "monster_gender", "monster_amount"))

No collinearity was seen from the scatterplot.

6.7 Non-Linear Terms

A Spearman plot was then created to assess if non-linear terms would be needed in the models.

Code

sp2 <- spearman2(motive ~ monster_amount + monster_real + monster_gender, data = q2_simp)

plot(sp2)

The Spearman plot shows that if adding non-linear terms were to improve the fit of the model, the term most likely to add the most predictive power with a non-linear term applied to it would be monster_real.

However, no non-linear term was selected, given that adding non-linear terms would result in spending of too many degrees of freedom.

6.8 Fitting the Multinomial Model

A multinomial model was fit using the multinom function and was designated as multinom_model.

Code

multinom_model <-multinom(motive ~ monster_amount + monster_real + monster_gender ,data=q2_simp)

# weights:  30 (20 variable)
initial  value 958.591316 
iter  10 value 806.559330
iter  20 value 761.294580
iter  30 value 760.258265
final  value 760.257932 
converged

Code

summary(multinom_model)

Call:
multinom(formula = motive ~ monster_amount + monster_real + monster_gender, 
    data = q2_simp)

Coefficients:
                       (Intercept) monster_amount monster_realreal
theft                   -0.2984806    -0.03866893        0.3795771
indirect_self_interest  -0.4247981    -0.13144830        1.9282423
treasure                -1.2076838     0.08470106       -0.6903529
conquer                 -4.2541198     0.18107415        4.7542887
misc_motive             -2.1112197    -0.46152297        3.2876692
                       monster_genderfemale
theft                            0.20634400
indirect_self_interest           0.06788956
treasure                        -0.27912724
conquer                         -0.08231355
misc_motive                      1.11501804

Std. Errors:
                       (Intercept) monster_amount monster_realreal
theft                    0.1697542     0.08281784        0.4887687
indirect_self_interest   0.1767427     0.08597467        0.4165262
treasure                 0.2190608     0.09981712        0.8402710
conquer                  0.6053196     0.08056674        0.7138281
misc_motive              0.3661798     0.19340172        0.5299548
                       monster_genderfemale
theft                             0.3499198
indirect_self_interest            0.3704725
treasure                          0.5111410
conquer                           0.5731102
misc_motive                       0.5255567

Residual Deviance: 1520.516 
AIC: 1560.516

6.9 Model Summary

Code

tidy(multinom_model, exponentiate = FALSE) |> kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)

y.level	term	estimate	std.error	statistic	p.value
theft	(Intercept)	-0.298	0.170	-1.758	0.079
theft	monster_amount	-0.039	0.083	-0.467	0.641
theft	monster_realreal	0.380	0.489	0.777	0.437
theft	monster_genderfemale	0.206	0.350	0.590	0.555
indirect_self_interest	(Intercept)	-0.425	0.177	-2.403	0.016
indirect_self_interest	monster_amount	-0.131	0.086	-1.529	0.126
indirect_self_interest	monster_realreal	1.928	0.417	4.629	0.000
indirect_self_interest	monster_genderfemale	0.068	0.370	0.183	0.855
treasure	(Intercept)	-1.208	0.219	-5.513	0.000
treasure	monster_amount	0.085	0.100	0.849	0.396
treasure	monster_realreal	-0.690	0.840	-0.822	0.411
treasure	monster_genderfemale	-0.279	0.511	-0.546	0.585
conquer	(Intercept)	-4.254	0.605	-7.028	0.000
conquer	monster_amount	0.181	0.081	2.248	0.025
conquer	monster_realreal	4.754	0.714	6.660	0.000
conquer	monster_genderfemale	-0.082	0.573	-0.144	0.886
misc_motive	(Intercept)	-2.111	0.366	-5.766	0.000
misc_motive	monster_amount	-0.462	0.193	-2.386	0.017
misc_motive	monster_realreal	3.288	0.530	6.204	0.000
misc_motive	monster_genderfemale	1.115	0.526	2.122	0.034

6.10 Fitting a Proportional Odds Model

A proportional odds model was fit using the polr function.

Code

polr_model <- polr(motive~ monster_real + monster_amount + monster_gender ,data=q2_simp, Hess = TRUE)

summary(polr_model)

Call:
polr(formula = motive ~ monster_real + monster_amount + monster_gender, 
    data = q2_simp, Hess = TRUE)

Coefficients:
                       Value Std. Error t value
monster_realreal     1.95262    0.23359  8.3593
monster_amount       0.06793    0.03869  1.7558
monster_genderfemale 0.07528    0.21936  0.3432

Intercepts:
                                Value   Std. Error t value
competition|theft               -0.4092  0.1140    -3.5890
theft|indirect_self_interest     0.6615  0.1148     5.7641
indirect_self_interest|treasure  1.8288  0.1406    13.0087
treasure|conquer                 2.6939  0.1795    15.0048
conquer|misc_motive              3.8677  0.2454    15.7593

Residual Deviance: 1631.342 
AIC: 1647.342

6.11 Testing the Proportional Odds Assumption

A likelihood ratio test was performed. The multinomial model fits 5 intercepts and 20 slopes, for a total of 25 parameters. The proportional odds model fits 5 intercepts and 4 slopes, for a total of 9 parameters. The difference between them is 16.

G is a chi-square statistic, calculated as the difference in deviance. Deviance = -2 log likelihood.

Code

LL_1 <- logLik(polr_model)
LL_1m <- logLik(multinom_model)
(G <- -2 * (LL_1[1] - LL_1m[1]))

[1] 110.8262

The p value associated with the chi-square statistic, with a difference of 16 degrees of freedom, is calculated.

Code

pchisq(G, 16, lower.tail = FALSE)

[1] 3.12363e-16

The p value suggests that the proportional odds model does not fit the data as well as the multinomial model does.

6.12 Comparing Metrics

Model	Effective Degrees of Freedom	Deviance	AIC
Multinomial Model	20	1515.427	1555.427
Polynomial Model	8	1624.988	1640.988

Based on the results of testing the proportional odds assumption, and the model metrics, the multinomial model appears to be a better fit for the data than the proportional odds model.

6.13 The Final Model

Code

summary(multinom_model)

Call:
multinom(formula = motive ~ monster_amount + monster_real + monster_gender, 
    data = q2_simp)

Coefficients:
                       (Intercept) monster_amount monster_realreal
theft                   -0.2984806    -0.03866893        0.3795771
indirect_self_interest  -0.4247981    -0.13144830        1.9282423
treasure                -1.2076838     0.08470106       -0.6903529
conquer                 -4.2541198     0.18107415        4.7542887
misc_motive             -2.1112197    -0.46152297        3.2876692
                       monster_genderfemale
theft                            0.20634400
indirect_self_interest           0.06788956
treasure                        -0.27912724
conquer                         -0.08231355
misc_motive                      1.11501804

Std. Errors:
                       (Intercept) monster_amount monster_realreal
theft                    0.1697542     0.08281784        0.4887687
indirect_self_interest   0.1767427     0.08597467        0.4165262
treasure                 0.2190608     0.09981712        0.8402710
conquer                  0.6053196     0.08056674        0.7138281
misc_motive              0.3661798     0.19340172        0.5299548
                       monster_genderfemale
theft                             0.3499198
indirect_self_interest            0.3704725
treasure                          0.5111410
conquer                           0.5731102
misc_motive                       0.5255567

Residual Deviance: 1520.516 
AIC: 1560.516

6.14 p Values

Code

z <- summary(multinom_model)$coefficients/summary(multinom_model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p |> kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

	(Intercept)	monster_amount	monster_realreal	monster_genderfemale
theft	0.079	0.641	0.437	0.555
indirect_self_interest	0.016	0.126	0.000	0.855
treasure	0.000	0.396	0.411	0.585
conquer	0.000	0.025	0.000	0.886
misc_motive	0.000	0.017	0.000	0.034

6.15 Exponentiated Coefficients

Code

exp(coef(multinom_model, conf_int = TRUE)) |> kable(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

	(Intercept)	monster_amount	monster_realreal	monster_genderfemale
theft	0.742	0.962	1.462	1.229
indirect_self_interest	0.654	0.877	6.877	1.070
treasure	0.299	1.088	0.501	0.756
conquer	0.014	1.199	116.081	0.921
misc_motive	0.121	0.630	26.780	3.050

6.16 Exponentiated Coefficients with p Values

Code

df <- data.frame(
  motive = c("theft", "indirect_self_interest", "treasure", "conquer", "misc_motive"),
  intercept = c(0.741, 0.637, 0.299, 0.014, 0.102),
  intercept_pvalue = c(0.078, 0.011, 0.000, 0.000, 0.000),
  monster_amount = c(0.958, 0.882, 1.087, 1.209, 0.660),
  monster_amount_pvalue = c(0.602, 0.141, 0.404, 0.019, 0.021),
  monster_realreal = c(1.627, 6.949, 0.509, 115.167, 33.966),
  monster_realreal_pvalue = c(0.306, 0.000, 0.419, 0.000, 0.000),
  monster_genderfemale = c(1.229, 1.158, 0.757, 0.961, 2.556),
  monster_genderfemale_pvalue = c(0.555, 0.689, 0.586, 0.945, 0.085)
)

df |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

motive	intercept	intercept_pvalue	monster_amount	monster_amount_pvalue	monster_realreal	monster_realreal_pvalue	monster_genderfemale	monster_genderfemale_pvalue
theft	0.741	0.078	0.958	0.602	1.627	0.306	1.229	0.555
indirect_self_interest	0.637	0.011	0.882	0.141	6.949	0.000	1.158	0.689
treasure	0.299	0.000	1.087	0.404	0.509	0.419	0.757	0.586
conquer	0.014	0.000	1.209	0.019	115.167	0.000	0.961	0.945
misc_motive	0.102	0.000	0.660	0.021	33.966	0.000	2.556	0.085

6.17 Exponentiated Coefficients with Confidence Intervals

Code

tidy(multinom_model,conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE) |>
  kbl(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

y.level	term	estimate	std.error	statistic	p.value	conf.low	conf.high
theft	(Intercept)	0.742	0.170	-1.758	0.079	0.532	1.035
theft	monster_amount	0.962	0.083	-0.467	0.641	0.818	1.132
theft	monster_realreal	1.462	0.489	0.777	0.437	0.561	3.810
theft	monster_genderfemale	1.229	0.350	0.590	0.555	0.619	2.440
indirect_self_interest	(Intercept)	0.654	0.177	-2.403	0.016	0.462	0.925
indirect_self_interest	monster_amount	0.877	0.086	-1.529	0.126	0.741	1.038
indirect_self_interest	monster_realreal	6.877	0.417	4.629	0.000	3.040	15.559
indirect_self_interest	monster_genderfemale	1.070	0.370	0.183	0.855	0.518	2.212
treasure	(Intercept)	0.299	0.219	-5.513	0.000	0.195	0.459
treasure	monster_amount	1.088	0.100	0.849	0.396	0.895	1.324
treasure	monster_realreal	0.501	0.840	-0.822	0.411	0.097	2.603
treasure	monster_genderfemale	0.756	0.511	-0.546	0.585	0.278	2.060
conquer	(Intercept)	0.014	0.605	-7.028	0.000	0.004	0.047
conquer	monster_amount	1.199	0.081	2.248	0.025	1.023	1.404
conquer	monster_realreal	116.081	0.714	6.660	0.000	28.652	470.298
conquer	monster_genderfemale	0.921	0.573	-0.144	0.886	0.300	2.832
misc_motive	(Intercept)	0.121	0.366	-5.766	0.000	0.059	0.248
misc_motive	monster_amount	0.630	0.193	-2.386	0.017	0.431	0.921
misc_motive	monster_realreal	26.780	0.530	6.204	0.000	9.478	75.668
misc_motive	monster_genderfemale	3.050	0.526	2.122	0.034	1.089	8.543

6.18 Storing and graphing Predicted Probabilities

The predicted probabilities were stored, tabulated, and then pivoted. This resulting table was used to create graphs of predicted probabilities of the outcome across the different predictors.

6.19 Predicted Probabilities and `monster_amount`

The predicted probabilities for each motive were graphed against the monster_amount variable.

Code

ggplot(q2_simp_fits_long, aes(x = monster_amount, y = prob, shape = response, col = response)) +
  geom_point(size = 3) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = str_wrap("Comparing predicted probabilities of the different antagonist motive types against the number of monsters of each installment of Scooby-Doo Media.", width = 80),
       subtitle = "Using a multinomial logistic regression model.",
       x = "Number of monsters per installment",
       y = "Predicted Probability")

Higher monster counts are association with higher predicted probability of conquer being the model’s prediction for the antagonist’s motivation, and lower monster counts are associated with higher predicted probability of misc_motive being the model’s prediction for the antagonist’s motivation.

6.20 Predicted Probabilities and `monster_real`

Code

ggplot(q2_simp_fits_long, aes(x = monster_real, y = prob,
                              col = response,shape = response)) +
  geom_point(size = 3) +
  geom_jitter(width = 0.25) +
  scale_fill_brewer(palette = "Set1")+
  labs(title = str_wrap("Comparing predicted probabilities of the different antagonist motive types against whether the monster(s) of each installment of Scooby-Doo Media was real or not.", width = 80),
       subtitle = "Using a multinomial logistic regression model.",
       x = "Whether the monster was real or not",
       y = "Predicted Probability")

Monsters who are real have a higher predicted probability of having conquering as a motivation, along with having a miscellaneous motivation. Monsters who are not real have a higher predicted probability of having a competition be their motivation, with a very low probability of conquering being their motivation.

6.21 Predicted Probabilities and `monster_gender`

Code

ggplot(q2_simp_fits_long, aes(x = monster_gender, y = prob,
                              col = response,shape = response)) +
  geom_point(size = 3) +
  geom_jitter(width = 0.25) + 
  scale_fill_brewer(palette = "Set1")  +
  labs(title = str_wrap("Comparing predicted probabilities of the different antagonist motive types against whether the monster(s) of each installment of Scooby-Doo Media included a female monster.", width = 80),
       subtitle = str_wrap("Using a multinomial logistic regression model.", width = 80),
       x = "Whether the monster(s) included a female monster.",
       y = "Predicted Probability")

Monster gender does not appear to cause a significant effect on the predicted probability of the motivation types.

7 Conclusions and Discussion

7.1 Analysis 1

Code

tidy(mod_poisson_imp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, 
         low95 = conf.low, high95 = conf.high, p = p.value) |>
  kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	std.error	low95	high95	p
(Intercept)	5.940	0.911	0.968	34.596	0.050
rcs(imdb, 3)imdb	0.760	0.130	0.591	0.983	0.034
rcs(imdb, 3)imdb'	0.994	0.210	0.653	1.485	0.976
engagement	1.000	0.000	1.000	1.000	0.751
networkcartoon_network	0.624	0.212	0.407	0.935	0.026
networkwarner_brothers	0.965	0.185	0.665	1.372	0.845
networkboomerang	0.559	0.217	0.359	0.842	0.007
networkother_networks	0.804	0.192	0.547	1.162	0.256
seasonseason_2	0.864	0.144	0.648	1.140	0.309
seasonseason_3_4	1.027	0.169	0.730	1.418	0.876
seasonmovie	1.896	0.232	1.205	2.993	0.006
seasonspecial_crossover	0.888	0.308	0.461	1.556	0.700

7.1.1 The relationship between the Rooby-Rooby-Roo count and the producing network

Cartoon Network - Scooby-Doo media being produced by Cartoon Network results in a 37.6%(95%CI: 6.5% to 59.3%) decrease in count of Rooby-Rooby-Roo’s per episode compared to being produced by ABC , holding everything else constant.

Scooby-Doo media produced by Cartoon Network is more likely to have a lower count of Rooby-Rooby-Roos per episode when compared to Scooby-Doo media produced by ABC, provided they have the same IMDB rating, the same IMDB engagement score, and are from the same season.

Boomerang Episodes - Scooby-Doo media being produced by Boomerang results in a 44.1% (95%CI: 15.8% to 64.1%) decrease in count of Rooby-Rooby-Roo’s per episode compared to being produced by ABC, holding everything else constant.

Scooby-Doo media produced by Boomerang is more likely to have a lower count of Rooby-Rooby-Roos per episode when compared to Scooby-Doo media produced by ABC, provided they have the same IMDB rating, the same IMDB engagement score, and are from the same season.

7.1.2 The relationship between the Rooby-Rooby-Roo count and the type of Scooby-Doo media

Movies: The form of Scooby-Doo media being a movie results in a 89.6%(95%CI: 20.5% to 199.3%) increase in count of Rooby-Rooby-Roo’s compared to season 1 episodes, holding everything else constant.

Movies are more likely to have a higher count of Rooby-Rooby-Roo’s when compared to Season 1 episodes, provided they are both produced by the same network/producer, have the same IMDB score, and have the same IMDB engagement rating.

7.2 Analysis 2

Code

df |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

motive	intercept	intercept_pvalue	monster_amount	monster_amount_pvalue	monster_realreal	monster_realreal_pvalue	monster_genderfemale	monster_genderfemale_pvalue
theft	0.741	0.078	0.958	0.602	1.627	0.306	1.229	0.555
indirect_self_interest	0.637	0.011	0.882	0.141	6.949	0.000	1.158	0.689
treasure	0.299	0.000	1.087	0.404	0.509	0.419	0.757	0.586
conquer	0.014	0.000	1.209	0.019	115.167	0.000	0.961	0.945
misc_motive	0.102	0.000	0.660	0.021	33.966	0.000	2.556	0.085

Code

tidy(multinom_model,conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE) |>
  kbl(dig = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

y.level	term	estimate	std.error	statistic	p.value	conf.low	conf.high
theft	(Intercept)	0.742	0.170	-1.758	0.079	0.532	1.035
theft	monster_amount	0.962	0.083	-0.467	0.641	0.818	1.132
theft	monster_realreal	1.462	0.489	0.777	0.437	0.561	3.810
theft	monster_genderfemale	1.229	0.350	0.590	0.555	0.619	2.440
indirect_self_interest	(Intercept)	0.654	0.177	-2.403	0.016	0.462	0.925
indirect_self_interest	monster_amount	0.877	0.086	-1.529	0.126	0.741	1.038
indirect_self_interest	monster_realreal	6.877	0.417	4.629	0.000	3.040	15.559
indirect_self_interest	monster_genderfemale	1.070	0.370	0.183	0.855	0.518	2.212
treasure	(Intercept)	0.299	0.219	-5.513	0.000	0.195	0.459
treasure	monster_amount	1.088	0.100	0.849	0.396	0.895	1.324
treasure	monster_realreal	0.501	0.840	-0.822	0.411	0.097	2.603
treasure	monster_genderfemale	0.756	0.511	-0.546	0.585	0.278	2.060
conquer	(Intercept)	0.014	0.605	-7.028	0.000	0.004	0.047
conquer	monster_amount	1.199	0.081	2.248	0.025	1.023	1.404
conquer	monster_realreal	116.081	0.714	6.660	0.000	28.652	470.298
conquer	monster_genderfemale	0.921	0.573	-0.144	0.886	0.300	2.832
misc_motive	(Intercept)	0.121	0.366	-5.766	0.000	0.059	0.248
misc_motive	monster_amount	0.630	0.193	-2.386	0.017	0.431	0.921
misc_motive	monster_realreal	26.780	0.530	6.204	0.000	9.478	75.668
misc_motive	monster_genderfemale	3.050	0.526	2.122	0.034	1.089	8.543

7.2.1 The relationship between monster motivation and the monster count

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same nature of the monster, and the same gender of the monster, if “EP-1” has one more monster than “EP-2”, EP-1’s predicted odds of the antagonist’s motive being “Conquer” rather than “Competition” is 1.21 times(95% confidence interval - 1.03 to 1.42) EP-2’s odds of the antagonist’s motive being “Conquer” rather than “Competition”.

Episodes with more monsters have a meaningfully higher odds of the antagonist having a motive of “Conquer” rather than “Competition”, provided the monster gender and the reality of the monster remain same.

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same nature of the monster, and the same gender of the monster, if “EP-1” has one more monster than “EP-2”, EP-1’s predicted odds of the antagonist’s motive being “Miscellaneous” rather than “Competition” is only 0.66 times(95% confidence interval - 0.46 to 0.94) EP-2’s odds of the antagonist’s motive being “Miscellaneous” rather than “Competition”.

Episodes with more monsters have a meaningfully lower odds of the antagonist having a motive of “Miscellaneous” rather than “Competition”, provided the monster gender and the reality of the monster remain the same.

7.2.2 The relationship between monster motivation and the nature of the monster’s reality

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same number of monsters, and the same gender of the monster, and the monsters in EP-2 are not real and EP-1 has a real monster, EP-1’s predicted odds of the antagonist’s motive being “Indirect Self Interest” rather than “Competition” is 6.95 times (95% confidence interval - 3.11 to 15.54) EP-2’s odds of the antagonist’s motive being “Indirect Self Interest” rather than “Competition”.

Episodes with real monsters have a meaningfully higher odds of the antagonist having a motive of “Indirect Self Interest” rather than “Competition”, provided the monster gender and the number of monsters remain the same.

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same number of monsters, and the the same gender of the monster, and the monsters in EP-2 are not real and EP-1 has a real monster, EP-1’s predicted odds of the antagonist’s motive being “Conquer” rather than “Competition” is 115.167 times(95% confidence interval - 28.60 to 463.70) EP-2’s odds of the antagonist’s motive being “Conquer” rather than “Competition”.

Episodes with real monsters have a meaningfully higher odds of the antagonist having a motive of “Conquer” rather than “Competition”, provided the monster gender and the number of monsters remain the same.

For two installments of Scooby-Doo media,“EP-1” and “EP-2”, with the same number of monsters, and the the same gender of the monster, and the monsters in EP-2 are not real and EP-1 has a real monster, EP-1’s predicted odds of the antagonist’s motive being “Miscellaneous” rather than “Competition” is 33.966 times(95% confidence interval - 12.10 to 95.74) EP-2’s odds of the antagonist’s motive being “Miscellaneous” rather than “Competition”.

Episodes with real monsters have a meaningfully higher odds of the antagonist having a motive of “Miscellaneous” rather than “Competition”, provided the monster gender and the number of monsters remain the same.

7.2.3 The relationship between monster motivation and the gender of the monsters

Having a female monster in an installment of Scooby-Doo does not have a significant impact on the motive of the antagonist of that installment when compared to an installment with no female monsters, provided the number of monsters and the nature of the monster remains same across both installments.

7.3 Answering My Research Questions

7.4 To Answer My First Question….

Predicting an iconic catchphrase: Are the logistics of an episode of Scooby-Doo good predictors of the number of times an iconic catchphrase is spoken?

The network/producer that produces a particular installment of Scooby-Doo has a meaningful impact on the number of times Rooby-Rooby-Roo is spoken. Cartoon Network and Boomerang produce media that is less likely to have a higher count of Rooby-Rooby-Roo’s compared to ABC, which is the network that has produced the most Scooby-Doo media.(Provided the IMDB logistics and the type of installment remain constant.)

Given that there is a meaningful correlation, the producing network of Scooby-Doo media can be considered a decent predictor of the number of times Rooby-Rooby-Roo’s are spoken per installment.

The nature of the type of installment of Scooby-Doo media has a meaningful impact on the number of times Rooby-Rooby-Roo is spoken.

Movies are more likely to have a higher count of Rooby-Rooby-Roo’s compared to Season 1 episodes, provided the two installments have the same IMDB logistics and the same producing network.

Given that there is a meaningful correlation, the nature of the type of installment of Scooby-Doo media can be considered a decent predictor of the number of times Rooby-Rooby-Roo’s are spoken per installment.

7.5 To Answer My Second Question….

Monster Motivation: Is it possible to predict the motive of the major antagonist of an episode of Scooby-Doo, based on the nature of the monster the antagonist appears as?

The number of monsters per installment of Scooby-Doo are a good predictor of the motive of the major antagonist. Episodes with higher monster counts are more likely to have antagonists that want to conquer, and episodes with lower monster counts are more likely to have antagonists that have more nebulous motivations, provided the monster’s gender and nature of reality remain constant.

The nature of reality of the monster(s) in an installment of Scooby-Doo are a good predictor of the motive of the major antagonist. Episodes with real monsters (as opposed to humans in a costume, or brainwashed/mind-controlled humans) are more likely to have antagonists that want to conquer, have more nebulous motivations, or act in their own self interest in an indirect manner, provided the monster’s gender and monster count remain constant.

Monster gender does not have a meaningful impact on the motivation of the antagonist of an installment of Scooby-Doo, provided the monster’s nature of reality and monster count remain constant. (No gender bias here!)

8 References and Acknowledgments

8.1 References

The data comes from Kaggle, and was part of Tidy Tuesday’s dataset for 2021-07-13, and can be found here.

The data was manually aggregated by user plummye on the Kaggle website.

8.2 Acknowledgments

I would like to thank Stephanie Merlino for being a fantastic TA, and for answering any and every question I had for her without hesitation. She was a big part of why Project B felt like something I could accomplish.

9 Session Information

Code

xfun::session_info()

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-5          arm_1.13-1           askpass_1.2.0       
  backports_1.4.1      base64enc_0.1-3      bit_4.0.5           
  bit64_4.0.5          blob_1.2.4           boot_1.3-28.1       
  brio_1.1.4           broom_1.0.5          broom.helpers_1.14.0
  broom.mixed_0.2.9.4  bslib_0.6.1          cachem_1.0.8        
  callr_3.7.3          car_3.1-2            carData_3.0-5       
  cellranger_1.1.0     checkmate_2.3.1      class_7.3-22        
  cli_3.6.1            clipr_0.8.0          clock_0.7.0         
  cluster_2.1.6        coda_0.19-4          codetools_0.2-19    
  colorspace_2.1-0     commonmark_1.9.0     compiler_4.3.2      
  conflicted_1.2.0     countreg_0.2-1       cpp11_0.4.7         
  crayon_1.5.2         curl_5.2.0           data.table_1.14.10  
  datasets_4.3.2       DBI_1.2.0            dbplyr_2.4.0        
  desc_1.4.3           diagram_1.6.5        dials_1.2.0         
  DiceDesign_1.10      diffobj_0.3.5        digest_0.6.33       
  distributions3_0.2.1 dplyr_1.1.4          dtplyr_1.3.1        
  ellipsis_0.3.2       equatiomatic_0.3.1   evaluate_0.23       
  fansi_1.0.6          farver_2.1.1         fastmap_1.1.1       
  fontawesome_0.5.2    forcats_1.0.0        foreach_1.5.2       
  foreign_0.8-86       Formula_1.2-5        fs_1.6.3            
  furrr_0.3.1          future_1.33.1        future.apply_1.11.1 
  gargle_1.5.2         generics_0.1.3       GGally_2.2.0        
  ggformula_0.12.0     ggplot2_3.4.4        ggridges_0.5.5      
  ggstats_0.5.1        glmnet_4.1-8         globals_0.16.2      
  glue_1.6.2           googledrive_2.1.1    googlesheets4_1.1.1 
  gower_1.0.1          GPfit_1.0-8          graphics_4.3.2      
  grDevices_4.3.2      grid_4.3.2           gridExtra_2.3       
  gtable_0.3.4         hardhat_1.3.0        haven_2.5.4         
  highr_0.10           Hmisc_5.1-1          hms_1.1.3           
  htmlTable_2.4.2      htmltools_0.5.7      htmlwidgets_1.6.4   
  httpuv_1.6.13        httr_1.4.7           ids_1.0.1           
  infer_1.0.5          ipred_0.9-14         isoband_0.2.7       
  iterators_1.0.14     janitor_2.2.0        jomo_2.7-6          
  jquerylib_0.1.4      jsonlite_1.8.8       kableExtra_1.3.4    
  KernSmooth_2.23.22   knitr_1.45           labeling_0.4.3      
  labelled_2.12.0      later_1.3.2          lattice_0.22-5      
  lava_1.7.3           lhs_1.1.6            lifecycle_1.0.4     
  listenv_0.9.0        lme4_1.1-35.1        lubridate_1.9.3     
  magrittr_2.0.3       MASS_7.3-60          Matrix_1.6-4        
  MatrixModels_0.5-3   memoise_2.0.1        methods_4.3.2       
  mgcv_1.9.1           mice_3.16.0          mime_0.12           
  minqa_1.2.6          mitml_0.4-5          modeldata_1.2.0     
  modelenv_0.1.1       modelr_0.1.11        mosaic_1.9.0        
  mosaicCore_0.9.4.0   mosaicData_0.20.4    multcomp_1.4-25     
  munsell_0.5.0        mvtnorm_1.2-4        naniar_1.0.0        
  nlme_3.1-164         nloptr_2.0.3         nnet_7.3-19         
  norm_1.0.11.1        numDeriv_2016.8.1.1  openssl_2.1.1       
  ordinal_2023.12.4    pan_1.9              parallel_4.3.2      
  parallelly_1.36.0    parsnip_1.1.1        patchwork_1.1.3     
  pbkrtest_0.5.2       pillar_1.9.0         pkgbuild_1.4.3      
  pkgconfig_2.0.3      pkgload_1.3.3        plyr_1.8.9          
  polspline_1.1.24     praise_1.0.0         prettyunits_1.2.0   
  processx_3.8.3       prodlim_2023.08.28   progress_1.2.3      
  progressr_0.14.0     promises_1.2.1       ps_1.7.5            
  pscl_1.5.5.1         purrr_1.0.2          quantreg_5.97       
  R6_2.5.1             ragg_1.2.7           rappdirs_0.3.3      
  RColorBrewer_1.1-3   Rcpp_1.0.11          RcppEigen_0.3.3.9.4 
  readr_2.1.4          readxl_1.4.3         recipes_1.0.9       
  rematch_2.0.0        rematch2_2.1.2       reprex_2.0.2        
  rlang_1.1.2          rmarkdown_2.25       rms_6.7-1           
  rpart_4.1.23         rprojroot_2.0.4      rsample_1.2.0       
  rstudioapi_0.15.0    rvest_1.0.3          sandwich_3.1-0      
  sass_0.4.8           scales_1.3.0         selectr_0.4.2       
  shape_1.4.6          shiny_1.8.0          slider_0.3.1        
  snakecase_0.11.1     sourcetools_0.1.7.1  SparseM_1.81        
  splines_4.3.2        SQUAREM_2021.1       stats_4.3.2         
  stats4_4.3.2         stringi_1.8.3        stringr_1.5.1       
  survival_3.5-7       svglite_2.1.3        sys_3.4.2           
  systemfonts_1.0.5    testthat_3.2.1       textshaping_0.3.7   
  TH.data_1.1-2        tibble_3.2.1         tidymodels_1.1.1    
  tidyr_1.3.0          tidyselect_1.2.0     tidyverse_2.0.0     
  timechange_0.2.0     timeDate_4032.109    tinytex_0.49        
  tools_4.3.2          topmodels_0.3-0      tune_1.1.2          
  tzdb_0.4.0           ucminf_1.2.1         UpSetR_1.4.0        
  utf8_1.2.4           utils_4.3.2          uuid_1.1.1          
  vctrs_0.6.5          VGAM_1.1-9           viridis_0.6.4       
  viridisLite_0.4.2    visdat_0.6.0         vroom_1.6.5         
  waldo_0.5.2          warp_0.2.1           webshot_0.5.5       
  withr_2.5.2          workflows_1.1.3      workflowsets_1.0.1  
  xfun_0.41            xml2_1.3.6           xtable_1.8-4        
  yaml_2.3.7           yardstick_1.2.0      zoo_1.8-12

R Packages and Setup

1 Background

2 Research Questions

3 Data Ingest and Management

3.1 Loading the Raw Data

3.2 Cleaning the Data

3.2.1 Selecting Variables We’ll Use

3.2.2 Converting Variable Types

3.2.3 Sampling the Data

3.2.4 Working with Categorical Predictors

3.2.5 Working with Quantitative Variables

3.2.6 Identifying Missingness

3.3 Print and Save The Final Tibble

4 Code Book and Description

5 Analyses

5.1 Analysis 1

5.1.1 The Research Question

5.1.2 The Outcome

5.1.3 Visualizing the Outcome Variable

5.1.4 The Predictors

5.2 rooby_rooby_roo vs network

5.3 rooby_rooby_roo vs season

5.4 rooby_rooby_roo vs imdb

5.5 rooby_rooby_roo vs engagement

5.6 Visualizing the distribution of the zero counts across the network predictor

5.7 Visualizing the distribution of the zero counts across the season predictor

5.8 Preliminary Thoughts

5.8.1 Missingness

5.8.2 Imputing Missing Variables

5.8.3 Splitting the data

5.8.4 Assessing Collinearity

5.8.5 Non-Linear Terms

5.8.6 Fitting a Poisson Model

5.8.7 Model Coefficients

5.8.8 Poisson Rootogram

5.8.9 Storing Training Sample Predictions

5.8.10 Summarizing Training Sample mod_poisson Fit

5.8.11 Fitting a Negative Binomial Model

5.8.12 Model Coefficients

5.8.13 Negative Binomial Model Rootogram

5.8.14 Storing Training Sample Predictions

5.8.15 Summarizing Training Sample mod_nb Fit

5.8.16 Fitting a Zero-Inflated Poisson Model

5.8.17 Zero-Inflated Poisson Model Rootogram

5.8.18 Storing Training Sample Predictions

5.8.19 Summarizing Training Sample mod_nb Fit

5.8.20 Fitting a Zero-Inflated Negative Binomial Model

5.8.21 Rootogram

5.8.22 Storing Training Sample Predictions

5.8.23 Summarizing Training Sample mod_nb Fit

5.8.24 Comparing Rootograms

5.8.25 Vuong’s Procedure Comparing the Negative Binomial Model and the Poisson Model

5.8.26 Vuong’s Procedure Comparing the Zero-Inflated Poisson Model and the Poisson Model

5.8.27 Vuong’s Procedure Comparing the ZINB Model and the Poisson Model

5.8.28 Vuong’s Procedure Comparing the ZINB Model and the Negative Binomial Model

5.8.29 Testing Sample Validations

5.8.30 Selecting a Final Model

5.8.31 Rootogram

5.8.32 Storing Fitted Values and Residuals

5.8.33 Diagnostic Plots

5.9 Outlier Diagnostics

5.10 Testing for Overdispersion

5.11 Model Coefficients

5.11.1 Odds Ratios

6 Analysis 2

6.1 The Outcome

6.2 The Predictors

6.3 Missingness

6.4 Imputing Missing Variables

6.5 Visualizing Data Exploration

6.5.1 motive vs monster_amount

6.5.2 motive vs monster_real

6.5.3 motive vs monster_gender

6.6 Assessing collinearity

6.7 Non-Linear Terms

6.8 Fitting the Multinomial Model

6.9 Model Summary

6.10 Fitting a Proportional Odds Model

6.11 Testing the Proportional Odds Assumption

6.12 Comparing Metrics

5.2 `rooby_rooby_roo` vs `network`

5.3 `rooby_rooby_roo` vs `season`

5.4 `rooby_rooby_roo` vs `imdb`

5.5 `rooby_rooby_roo` vs `engagement`

5.6 Visualizing the distribution of the zero counts across the `network` predictor

5.7 Visualizing the distribution of the zero counts across the `season` predictor

5.8.10 Summarizing Training Sample `mod_poisson` Fit

5.8.15 Summarizing Training Sample `mod_nb` Fit

5.8.19 Summarizing Training Sample `mod_nb` Fit

5.8.23 Summarizing Training Sample `mod_nb` Fit

6.5.1 `motive` vs `monster_amount`

6.5.2 `motive` vs `monster_real`

6.5.3 `motive` vs `monster_gender`

6.19 Predicted Probabilities and `monster_amount`

6.20 Predicted Probabilities and `monster_real`

6.21 Predicted Probabilities and `monster_gender`