Predictors of Heavy E-Cigarette Use among Adults in the United States

An analysis of the Population Assessment of Tobacco and Health (PATH) Study.

Author

Naveen Kannan

Published

2023-03-19

R Packages and Setup

Code

knitr::opts_chunk$set(comment = NA) 
library(broom)
library(janitor) 
library(naniar)
library(glue)
library(rsample)
library(yardstick)
library(GGally)
library(mice)
library(car)
library(pROC)
library(patchwork)
library(rms)
library(simputation)
library(kableExtra)
library(tidyverse)
theme_set(theme_bw())

1 Data Source

The data under study comes from the Population Assessment of Tobacco and Health (PATH) Study. It was obtained from ICPSR, the Inter-university Consortium for Political and Social Research.

The Population Assessment of Tobacco and Health (PATH) Study began originally surveying 45,971 adult and youth respondents. The study sampled over 150,000 mailing addresses across the United States to create a national sample of tobacco users and non-users, and is a collaboration between the National Institute on Drug Abuse (NIDA), National Institutes of Health (NIH), and the Center for Tobacco Products (CTP), Food and Drug Administration (FDA).

The data can be found at (https://www.icpsr.umich.edu/web/ICPSR/studies/36498/datadocumentation).

For adults, data was collected via interviews. Each case constitutes a single completed interview. Each case in a Youth data file represents one youth and his or her parent’s responses about that youth. Parents who provided permission for their child to participate in a Youth Interview were asked to complete a brief interview about their child. Youth who turn 18 by the current wave of data collection are considered “aged-up adults” and are invited to complete the Adult Interview.

2 The Subjects

The dataset from which data is being taken is Dataset 1001 (DS1001), which contains the data from the Wave 1 adult questionnaire. 45,971 adults and youth constitute the first (baseline) wave of data collected by this longitudinal cohort study.

The DS1001 data can be considered a random sample that constitutes a longitudinal cross section of adults(of age 18 and above) across the United States, since the PATH study sampled over 150,000 mailing addresses across the United States to create a national sample of tobacco users and non-users. The DS1001 data contains data on adults (individuals of age 18 and above) alone.

From the DS1001 dataset, individuals who used e-cigarettes at the time of the Wave 1 questionnaire were selected, excluding individuals who did not use e-cigarettes at that time. Dual users of e-cigarettes and other forms of tobacco were not excluded. The DS1001 dataset only contains data on adults, and thus youth below the age of 18 are excluded from the final dataset.

Individuals who had complete data on both outcomes under study were then selected. The resulting dataset had 3000 individuals. A random sample of 1200 individuals was taken from this subset, ensuring that the same respondents were taken for both analyses.

This sample of 1200 PATH study participants (contained in the tibble data_final) can thus be considered as a longitudinal cross-sectional sample of individuals who used e-cigarettes at the time of administering the Wave 1 questionnaire, who may or may not have used other tobacco products, across the United States, of ages 18 and above. Wave 1 began at September 2013 and ended at December 2014.

3 Loading and Tidying the Data

3.1 Loading the Raw Data

Access to the data requires users to log-in to the ICPSR website. Thus, a direct download is not possible. This data can be freely downloaded by the public, and therefore a local copy of the data in the form of a tsv file was read using read_tsv from the tidyverse.

Code

#data_raw <- read_tsv("36498-1001-Data.tsv", show_col_types = FALSE)
#saveRDS(data_raw, "36498-1001-Data.Rds")
data_raw = readRDS("36498-1001-Data.Rds")

3.2 Cleaning the Data

3.2.1 Selecting Variables We’ll Use

Code

data_var <- data_raw |>
  select(PERSONID,R01_AE1022,R01_AE0103,R01R_A_AE1006,R01_AE1062,R01_AE1100,R01_AE1099,R01_AE9004,R01_AE9010,R01_AE1050,R01R_A_MINFIRST_ECIG, R01_AE1003)

3.2.2 Dealing with Missingness

The DS1001 dataset provides negative values for variables that are missing. There are several different subtypes of missing data:

Data removed due to respondent request
Data not ascertained
Respondent answer is “Don’t Know”
Respondent refused to answer
Question inapplicable to respondent

All of these values are negative in the original DS1001 dataset, and none of the variables under study have negative values. It is also not possible for the variables under study to have negative values. Therefore, all negative values were replaced with NA to ensure missing data are correctly identified.

Code

data_var <- data_var|> 
  mutate(across(everything(), function(x){replace(x, which(x<0), NA)}))

3.2.3 Changing Variable Names

The variables were renamed to have better, more readable names.

Code

data_var <- data_var|>
  mutate(ecig_30day = as.integer(R01_AE1022),
         puff_num = as.integer(R01_AE0103),
         age_range = factor(R01R_A_AE1006),
         less_harm = factor(R01_AE1062),
         ecig_reg = factor(R01_AE1100),
         harm_perc = factor(R01_AE1099),
         price_paid = factor(R01_AE9004),
         personid = as.character(PERSONID),
         ecig_nico = as.factor(R01_AE9010),
         ecig_flavored = as.factor(R01_AE1050),
         min_num = as.integer(R01R_A_MINFIRST_ECIG),
         ecig_smoker = factor(R01_AE1003)) |>
  select(personid, ecig_30day,puff_num,age_range,less_harm,ecig_reg,harm_perc,price_paid,ecig_nico, ecig_flavored, min_num, ecig_smoker)

data_var <- data_var |>
  clean_names()

3.2.4 Sampling the Data

The dataset was filtered to select only individuals who were users of e-cigarettes at the time of study. The dataset was then filtered to have complete cases on both the outcomes under study, namely ecig_30day and ecig_reg.

Code

data_outcome <- data_var |>
  filter(ecig_smoker == 1 | ecig_smoker == 2)|> # Excluding individuals who did not use e-cigarettes at the time of Wave 1 
  filter(complete.cases(ecig_30day, ecig_reg)) |>
  select(personid, ecig_30day,puff_num,age_range,less_harm,ecig_reg,harm_perc,price_paid,ecig_nico, ecig_flavored, min_num, ecig_smoker)
dim(data_outcome)

[1] 3000   12

3000 respondents have information on both the outcome variables under study. After setting a seed for reproducibility, a random sample of 1200 respondents was taken from the data_outcome tibble. This was also done to ensure that the same respondents are analyzed for both regression models.

Code

set.seed(2023) 
data_complete <- slice_sample(data_outcome, n = 1200)

Once a random sample was taken, the tabyl function was used to ensure that all individuals in the tibble data_complete were individuals who had used e-cigarettes.

Code

data_complete |>
  tabyl(ecig_smoker) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_smoker	n	percent
1	41	0.0341667
2	1159	0.9658333
3	0	0.0000000

From the documentation for the original variable name for ecig_smoker(R01_AE1003), the question asked was, ‘Do you now use e-cigarettes?’, with three possible levels.

Value	Label	Frequency
1	1 = Every day	41
2	2 = Some days	1159
3	3 = Not at all	0

The above table shows the distribution of the responses to the question, demonstrating that all the individuals under study are individuals who were using e-cigarettes at the time of the Wave 1 questionnaire. Individuals who did not use e-cigarettes at the time are not included in the random sample of 1200 respondents.

3.2.5 Working with Categorical Predictors

The tabyl function was used to check if all factor variables have atleast 30 observations at each level.

Code

data_complete |> 
  tabyl(age_range)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

age_range	n	percent	valid_percent
1	54	0.0450000	0.0453020
2	365	0.3041667	0.3062081
3	276	0.2300000	0.2315436
4	181	0.1508333	0.1518456
5	169	0.1408333	0.1417785
6	147	0.1225000	0.1233221
NA	8	0.0066667	NA

Code

data_complete |> 
  tabyl(less_harm) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

less_harm	n	percent	valid_percent
1	955	0.7958333	0.7998325
2	239	0.1991667	0.2001675
NA	6	0.0050000	NA

Code

data_complete |> 
  tabyl(ecig_reg) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_reg	n	percent
1	369	0.3075
2	831	0.6925

Code

data_complete |> 
  tabyl(harm_perc) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

harm_perc	n	percent	valid_percent
1	822	0.6850000	0.6919192
2	343	0.2858333	0.2887205
3	23	0.0191667	0.0193603
NA	12	0.0100000	NA

The harm_perc variable was found to have a level 3 which only had 23 observations. Factor levels 2 and 3 for harm_perc were collapsed into a single category.

Code

data_complete$harm_perc <- fct_collapse(data_complete$harm_perc,
  '2' = c("2", "3"),
  '1' = "1")

Now, harm_perc has only 2 levels, each with more than 30 distinct values.

Code

data_complete |> 
  tabyl(harm_perc)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

harm_perc	n	percent	valid_percent
1	822	0.685	0.6919192
2	366	0.305	0.3080808
NA	12	0.010	NA

Code

data_complete |> 
  tabyl(price_paid)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

price_paid	n	percent	valid_percent
1	189	0.1575000	0.2342007
2	260	0.2166667	0.3221809
3	328	0.2733333	0.4064436
4	30	0.0250000	0.0371747
NA	393	0.3275000	NA

Code

data_complete |> 
  tabyl(ecig_nico)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_nico	n	percent	valid_percent
1	320	0.2666667	0.8695652
2	48	0.0400000	0.1304348
NA	832	0.6933333	NA

Code

data_complete |> 
  tabyl(ecig_flavored)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_flavored	n	percent	valid_percent
1	246	0.2050	0.6666667
2	123	0.1025	0.3333333
NA	831	0.6925	NA

The other factor variables were found to have at least 30 observations at each level. Furthermore, none of the categorical variables have more than 6 levels.

3.2.6 Working with Quantitative Variables

Code

data_complete |> 
  select(personid, ecig_30day, puff_num, min_num) |>
  summarize(across(personid:min_num, ~ n_distinct(.)))

# A tibble: 1 × 4
  personid ecig_30day puff_num min_num
     <int>      <int>    <int>   <int>
1     1200         28       29      31

The quantitative variables ecig_30day,puff_num and min_num have at least 10 different, ordered, observed values.

Code

data_complete |> 
  select(personid, ecig_30day, puff_num, min_num) |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
min_num	853	71.08333
puff_num	849	70.75000
personid	0	0.00000
ecig_30day	0	0.00000

The miss_var_summary function from the naniar package was used to identify the number of missing values for the continuous variables under study.

3.2.7 Renaming the factors

The categorical variables had their factors renamed to make the data more readable. The final tidy tibble was designated as data_final.

Code

data_final <- data_complete|>
  mutate(age_range = fct_recode(factor(age_range),
                                "<18" = "1",
                                "18to24" = "2",
                                "25to34" = "3",
                                "35to44" = "4",
                                "45to54" = "5",
                                ">55" = "6"),
         less_harm = fct_recode(factor(less_harm),
                                "Yes" = "1",
                                "No" = "2"),
         ecig_reg = fct_recode(factor(ecig_reg),
                               "Yes" = "1",
                               "No" = "2"),
         harm_perc = fct_recode(factor(harm_perc),
                                "less_harm" = "1",
                                "same_or_more" = "2"),
         price_paid = fct_recode(factor(price_paid),
                                 "Below$10" = "1",
                                 "$10to$20" = "2",
                                 "$21to$100" = "3",
                                 ">$100" = "4"),
         ecig_nico = fct_recode(factor(ecig_nico),
                                "Yes" = "1",
                                "No" = "2"),
         ecig_flavored = fct_recode(factor(ecig_flavored),
                                "Yes" = "1",
                                "No" = "2"))

3.2.8 Splitting the Data

Once a random sample of 1200 respondents was taken, the data was separated into two tibbles for the purpose of describing the data under study for the two models. data_log denotes the dataset for the logistic regression, and data_lin denotes the dataset for the linear regression.

Code

data_lin <- data_final|>
  select(personid,ecig_30day,puff_num,age_range,less_harm,ecig_nico)
data_log <- data_final|>
  select(personid,ecig_reg,min_num,harm_perc,price_paid,ecig_nico,ecig_flavored)

4 The Tidy Tibble

4.1 Listing the Tibble

The tidy tibble, data_final, was listed.

Code

data_final

# A tibble: 1,200 × 12
   personid   ecig_30day puff_num age_range less_harm ecig_reg harm_perc   
   <chr>           <int>    <int> <fct>     <fct>     <fct>    <fct>       
 1 P000319879          1       NA <18       Yes       No       less_harm   
 2 P000255041          0       NA 25to34    Yes       Yes      less_harm   
 3 P000413552          2       NA 35to44    Yes       No       same_or_more
 4 P000239368          1       NA 45to54    Yes       No       less_harm   
 5 P000334022          1       NA 35to44    Yes       No       less_harm   
 6 P000151148         25       NA 45to54    Yes       Yes      less_harm   
 7 P000423889          2       NA 45to54    No        No       same_or_more
 8 P000264820         24        4 35to44    No        No       less_harm   
 9 P000321341          2       NA 45to54    No        No       same_or_more
10 P000200073          2       NA 25to34    No        No       same_or_more
# ℹ 1,190 more rows
# ℹ 5 more variables: price_paid <fct>, ecig_nico <fct>, ecig_flavored <fct>,
#   min_num <int>, ecig_smoker <fct>

4.2 Size and Identifiers

There are 1200 rows, with 12 columns in the final tibble data_final. One of these columns

Code

dim(data_final)

[1] 1200   12

The personid variable provides the identifier for each row in the tidy tibble. The n_distinct function was used to demonstrate 1200 individual values for personid in data_final, indicating that each row has a unique personid value.

Code

n_distinct(data_final$personid)

[1] 1200

4.3 Save The Tibble

The saveRDS function was used to save the tidy tibble in a .rds file.

Code

write_rds(data_final, file = "data_final.Rds")

5 The Code Book

5.1 Defining the Variables

Variable	Original DS1001 code	Description \| Type \| Analytic Role \| Missing values
`personid`	`PERSONID`	Unique Participant Identifier	Character variable	Unique Identifier	0
`ecig_30day`	`R01_AE1022`	On how many of the past 30 days did the participant use an e-cigarette?	Quantitative Variable	Linear Regression Outcome	0
`puff_num`	`R01_AE0103`	How many puffs from an e-cigarette did the participant take the last time they smoked an e-cigarette?	Quantitative Variable	Predictor	849
`age_range`	`R01R_A_AE1006`	How old was the participant when they first used an e-cigarette, even one or two times? 6 levels:`<18`,`18to24`, `25to34`,`35to44`, `45to54`,`>55`	Multi-Categorical Variable with 6 factors	Predictor	8
`less_harm`	`R01_AE1062`	Did the participant use / used to use e-cigarettes because they might be / have been less harmful to them? 2 levels: `yes`, `no`	Binary Categorical Variable	Predictor	6
`ecig_reg`	`R01_AE1100`	Has the participant ever used e-cigarettes fairly regularly? 2 levels:`yes`,`no`	Binary Categorical Variable	Logistic regression Outcome	0
`harm_perc`	`R01_AE1099`	To the participant, is using e-cigarettes less harmful, about the same, or more harmful than smoking cigarettes? 2 levels:`1`,`2`	Binary Categorical Variable	Predictor	12
`price_paid`	`R01_AE9004`	About how much did the participant pay for their e-cigarette? 4 levels: `Below$10`,`$10to$20`, `$21t$100`, `>$100`	Multi-Categorical Variable with 4 factors	Predictor	393
`ecig_nico`	`R01_AE9010`	Does the e-cigarette the participant usually uses contain nicotine? 2 levels:`yes`,`no`	Binary Categorical Variable	Predictor	832
`ecig_flavored`	`R01_AE9010`	Is the regular / last brand of e-cigarettes used by the participant flavored to taste like menthol, mint,clove, spice, candy, fruit, chocolate, alcohol or other sweets? 2 levels:`yes`,`no`	Binary Categorical Variable	Predictor	831
`min_num`	`R01R_A_MINFIRST_ECIG`	Amount of time from waking to smoking first e-cigarette of any given day (in minutes).	Quantitative Variable	Predictor	853

The levels of the categorical variables and their description are as follows:

age_range

How old was the participant when they first used an e-cigarette, even one or two times?

Level	Description	Distinct Observations in Level
`<18`	Less than 18 years old	54
`18to24`	18 to 24 years old	365
`25to34`	25 to 34 years old	276
`35to44`	35 to 44 years old	181
`45to54`	45 to 54 years old	169
`>55`	55 years old or older	147

less_harm

Did the participant use / used to use e-cigarettes because they might be / have been less harmful to them?

Level	Description	Distinct Observations in Level
`yes`	Yes	955
`no`	No	239

ecig_reg

Has the participant ever used e-cigarettes fairly regularly?

Level	Description	Distinct Observations in Level
`yes`	Yes	369
`no`	No	831

harm_perc

To the participant, is using e-cigarettes less harmful, about the same, or more harmful than smoking cigarettes?

Level	Description	Distinct Observations in Level
`1`	Less harmful	822
`2`	About the Same / More harmful	366

price_paid

About how much did the participant pay for their e-cigarette?

Level	Description	Distinct Observations in Level
`Below$10`	Less than $10	189
`$10to$20`	$10 to $20	260
`$21to$100`	$21 to $100	328
`>$100`	More than $100	30

ecig_nico

Does the e-cigarette the participant usually uses contain nicotine?

Level	Description	Distinct Observations in Level
`yes`	Yes	320
`no`	No	48

ecig_flavored

Is the regular/last brand of e-cigarettes used by the participant flavored to taste like menthol, mint, clove, spice, candy, fruit, chocolate, alcohol or other sweets?

Level	Description	Distinct Observations in Level
`yes`	Yes	246
`no`	No	123

5.2 Numerical Description

The describe function from the Hmisc package was used on the entire tibble.

Code

describe(data_final)

data_final 

 12  Variables      1200  Observations
--------------------------------------------------------------------------------
personid 
       n  missing distinct 
    1200        0     1200 

lowest : P000000176 P000000318 P000000457 P000001710 P000001806
highest: P000497409 P000498677 P000499021 P000499448 P000499846
--------------------------------------------------------------------------------
ecig_30day 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    1200        0       28    0.972     5.11    6.663        0        0 
     .25      .50      .75      .90      .95 
       0        2        6       15       20 

lowest :  0  1  2  3  4, highest: 25 26 28 29 30
--------------------------------------------------------------------------------
puff_num 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     351      849       28     0.99    10.77    10.15        2        3 
     .25      .50      .75      .90      .95 
       4        6       10       25       30 

lowest :   1   2   3   4   5, highest:  50  51  60  75 100
--------------------------------------------------------------------------------
age_range 
       n  missing distinct 
    1192        8        6 
                                                    
Value         <18 18to24 25to34 35to44 45to54    >55
Frequency      54    365    276    181    169    147
Proportion  0.045  0.306  0.232  0.152  0.142  0.123
--------------------------------------------------------------------------------
less_harm 
       n  missing distinct 
    1194        6        2 
                  
Value      Yes  No
Frequency  955 239
Proportion 0.8 0.2
--------------------------------------------------------------------------------
ecig_reg 
       n  missing distinct 
    1200        0        2 
                      
Value        Yes    No
Frequency    369   831
Proportion 0.308 0.693
--------------------------------------------------------------------------------
harm_perc 
       n  missing distinct 
    1188       12        2 
                                    
Value         less_harm same_or_more
Frequency           822          366
Proportion        0.692        0.308
--------------------------------------------------------------------------------
price_paid 
       n  missing distinct 
     807      393        4 
                                                  
Value       Below$10  $10to$20 $21to$100     >$100
Frequency        189       260       328        30
Proportion     0.234     0.322     0.406     0.037
--------------------------------------------------------------------------------
ecig_nico 
       n  missing distinct 
     368      832        2 
                    
Value       Yes   No
Frequency   320   48
Proportion 0.87 0.13
--------------------------------------------------------------------------------
ecig_flavored 
       n  missing distinct 
     369      831        2 
                      
Value        Yes    No
Frequency    246   123
Proportion 0.667 0.333
--------------------------------------------------------------------------------
min_num 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     347      853       30    0.993    149.1      182        3        5 
     .25      .50      .75      .90      .95 
      15       60      240      360      522 

lowest :    1    2    3    4    5, highest:  720  780  840  900 1020
--------------------------------------------------------------------------------
ecig_smoker 
       n  missing distinct 
    1200        0        2 
                      
Value          1     2
Frequency     41  1159
Proportion 0.034 0.966
--------------------------------------------------------------------------------

6 Linear Regression Plans

6.1 My First Research Question

E-Cigarette Perception, Smoking Habits, and their association with Heavy E-Cigarette Use:

Are smoking e-cigarettes that contain nicotine, the perception of the purported healthiness of e-cigarettes when compared to smoking regular cigarettes, and smoking habits strong predictors of heavy e-cigarette use in adulthood?

6.2 My Quantitative Outcome

The quantitative variable I will be predicting is ecig_30day.

In the United States, e-cigarettes are not subject to the same marketing and promotion restrictions that apply to cigarettes. E-cigarette companies are permitted to advertise their products through mass media, and via the internet. Previous studies have demonstrated that e-cigarettes companies exacerbate and expand the tobacco epidemic by bringing lower risk youth into the market.^[1] E-Cigarette use among young adults has been rapidly expanding, and has surpassed conventional cigarette use among young adults.^[2]

From a study conducted in 2017, while the demographic and behavioral risk profiles of most youth who reported smoking cigarettes (both cigarettes and e-cigarettes) are consistent with smoking cigarettes, the risk profiles of the remaining e-cigarette-only users (about 25% of e-cigarette users) suggested that these individuals would have been unlikely to have initiated tobacco product use with cigarettes.^[3]

E-cigarettes are reducing smoking cessation rates and expanding the nicotine market by attracting low-risk youth who would be unlikely to initiate nicotine use with conventional cigarettes.^[4]

By attempting to predict higher use of e-cigarettes in the past 30 days(at the time of data collection) with the variables under study, the association between e-cigarette perception and smoking habits with heavy e-cigarette use can be studied.

Code

data_lin |>
  filter(complete.cases(ecig_30day)) |>
  dim()

[1] 1200    6

Out of the 1200 rows in data_lin, all 1200 have complete data on the outcome variable ecig_30day.

Code

p1 <- 
  ggplot(data_lin, aes(x = ecig_30day)) +
  geom_histogram(bins = 15, 
  fill = "#2b8cbe", col = "white") +
  labs( x = "Number of days used an e-cigarette", y ="")

p2 <- 
  ggplot(data_lin, aes(sample = ecig_30day)) + 
  geom_qq(col = "#2b8cbe") + geom_qq_line(col = "red") +
  labs(x = "", y = "Number of days used an e-cigarette")

p3 <- 
  ggplot(data_lin, aes(x = "", y = ecig_30day)) +
  geom_violin(fill = "#2b8cbe", alpha = 0.3) + 
  geom_boxplot(fill = "#2b8cbe", width = 0.3,
               outlier.color = "red") +
  labs(y = "Number of days used an e-cigarette", x = "") + coord_flip()

p2 + p1 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) + 
  plot_annotation(title = "Distribution of the number of days participants used an e-cigarette in the past 30 days",
         subtitle = glue("Across ", nrow(data_lin), 
                           " participants, taken as a random sample from the DS1001 dataset."),
         caption = "Data taken from the Population Assessment of Tobacco and Health (PATH) Study.")

The distribution is very notably right skewed. Taking the logarithm of the outcome variable might help in making it’s distribution more normalized.

Code

describe(data_lin$ecig_30day)

data_lin$ecig_30day 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    1200        0       28    0.972     5.11    6.663        0        0 
     .25      .50      .75      .90      .95 
       0        2        6       15       20 

lowest :  0  1  2  3  4, highest: 25 26 28 29 30

The describe function shows that there are 28 distinct values for the ecig_30day variable, and the variable has at least 10 different, ordered, observed values.

6.3 My Planned Predictors (Linear Model)

The inputs that will be used in the linear regression model are:

puff_num: This is a quantitative variable that has atleast 10 distinct, ordered, observed values.

Code

describe(data_lin$puff_num)

data_lin$puff_num 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     351      849       28     0.99    10.77    10.15        2        3 
     .25      .50      .75      .90      .95 
       4        6       10       25       30 

lowest :   1   2   3   4   5, highest:  50  51  60  75 100

age_range: This is a categorical variable which has 6 categories, that has at least 30 observations in each level of the factor.

Code

data_lin|>
  tabyl(age_range) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

age_range	n	percent	valid_percent
<18	54	0.0450000	0.0453020
18to24	365	0.3041667	0.3062081
25to34	276	0.2300000	0.2315436
35to44	181	0.1508333	0.1518456
45to54	169	0.1408333	0.1417785
>55	147	0.1225000	0.1233221
NA	8	0.0066667	NA

less_harm, which is a binary categorical variable.

Code

data_lin|>
  tabyl(less_harm) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

less_harm	n	percent	valid_percent
Yes	955	0.7958333	0.7998325
No	239	0.1991667	0.2001675
NA	6	0.0050000	NA

ecig_nico, which is a binary categorical variable.

Code

data_lin|>
  tabyl(ecig_nico)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_nico	n	percent	valid_percent
Yes	320	0.2666667	0.8695652
No	48	0.0400000	0.1304348
NA	832	0.6933333	NA

There are a total of 4 predictors.

Code

datatest1 <- data_lin |>
  filter(complete.cases(personid,ecig_30day,puff_num,age_range,less_harm,ecig_nico)) |>
  select(personid,ecig_30day,puff_num,age_range,less_harm,ecig_nico)
dim(datatest1)

[1] 137   6

With 137 rows having complete observations on all the variables under study, [4 + (N1 - 100)/100] is 4.3, and the number of predictors does not exceed this value.

From the cited literature,

I expect higher values of puff_num to be associated with higher values of ecig_30day. For less_harm and ecig_nico, I expect the yes factor to be associated with higher values of ecig_30day, ie, as the two variables change from no to yes, the value of ecig_30day is expected to increase.

I expect the lower levels of age_range to be associated with higher values of ecig_30day, ie, as age_range factors go from the lowest factors to the higher factors, I expect ecig_30day to decrease.

7 Logistic Regression Plans

7.1 My Second Research Question

Regular e-cigarette use and it’s associated factors:

Are factors such as using flavored e-cigarettes, or needing to use e-cigarettes immediately after waking up, or using e-cigarettes with nicotine, or use of e-cigarettes as a healthier alternative to regular cigarettes associated with regular e-cigarette use?

7.2 My Binary Outcome

The binary outcome I will be predicting is ecig_reg.

In the United States, e-cigarettes and their market value have rapidly expanded over the past years.^[2] E-cigarette use and it’s prevalent adoption have resulted in youth and young adults who would otherwise not have been exposed to nicotine/tobacco products to begin the use of the drug via e-cigarettes.^[3]

Reasons for this increased use of e-cigarettes are many, including the perception of e-cigarettes as being healthier than regular cigarettes, the presence of flavored e-cigarettes available for use, and the availability of cheap, refillable cartridges that only require e-cigarette fluid.^[1] By attempting to predict the odds of being a regular user of e-cigarettes using the predictors under study, the association between these factors and the odds of adoption of regular e-cigarette use can be more thoroughly studied.

Code

data_log|>
  tabyl(ecig_reg)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_reg	n	percent
Yes	369	0.3075
No	831	0.6925

There are no missing values for the outcome variable.

7.3 My Planned Predictors (Logistic Model)

For the logistic regression model, the following are the predictors I intend to use:

min_num: This is a quantitative variable that has atleast 10 distinct, ordered, observed values.

Code

describe(data_log$min_num)

data_log$min_num 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     347      853       30    0.993    149.1      182        3        5 
     .25      .50      .75      .90      .95 
      15       60      240      360      522 

lowest :    1    2    3    4    5, highest:  720  780  840  900 1020

price_paid: This is a categorical variable which has 4 categories, that has at least 30 observations in each level of the factor.

Code

data_log |>
  tabyl(price_paid)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

price_paid	n	percent	valid_percent
Below$10	189	0.1575000	0.2342007
$10to$20	260	0.2166667	0.3221809
$21to$100	328	0.2733333	0.4064436
>$100	30	0.0250000	0.0371747
NA	393	0.3275000	NA

harm_perc, which is a binary categorical variable.

Code

data_log |>
  tabyl(harm_perc)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

harm_perc	n	percent	valid_percent
less_harm	822	0.685	0.6919192
same_or_more	366	0.305	0.3080808
NA	12	0.010	NA

-ecig_nico, which is a binary categorical variable.

Code

data_log |>
  tabyl(ecig_nico)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_nico	n	percent	valid_percent
Yes	320	0.2666667	0.8695652
No	48	0.0400000	0.1304348
NA	832	0.6933333	NA

-ecig_flavored, which is a binary categorical variable.

Code

data_log |>
  tabyl(ecig_flavored)|>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

ecig_flavored	n	percent	valid_percent
Yes	246	0.2050	0.6666667
No	123	0.1025	0.3333333
NA	831	0.6925	NA

There are 5 predictors.

With 369 subjects in the smaller of my two outcome groups, [4 + (N2 - 100)/100] is 6.69, and the number of predictors does not exceed this value.

From the cited literature,

I expect lower values of min_num to be associated with higher odds of being in the yes category of ecig_reg.I also expect lower values of price_paid to be associated with higher odds of being in the yes category of ecig_reg. I also expect yes values for ecig_nico and ecig_flavored to be associated with higher odds of being in the yes category of ecig_reg. I expect less_harm values for harm_perc to be associated with higher odds of being in the yes category of ecig_reg.

8 Linear Regression Analyses

8.1 Missingness

Code

data_lin |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
puff_num	849	70.7500000
ecig_nico	832	69.3333333
age_range	8	0.6666667
less_harm	6	0.5000000
personid	0	0.0000000
ecig_30day	0	0.0000000

There is missing data in all of the variables under study, and the outcome variable ecig_30day does not have any missing values. There are missing predictor values for more than 10% of the subjects under study, and more than 50 subjects have missing values.

The data can be assumed to be MAR (Missing At Random), since the missing data are not randomly distributed. It cannot be assumed to be MNAR (Missing Not At Random) since there does not appear to be a relationship between the magnitude of a value and it’s missingness (or inclusion in the non-missing data), since there is no preponderance of values with lower or higher magnitude being missing.

Single imputation was then performed on the dataset, to account for missingness. The complete dataset, containing the data for both the analyses had it’s missing values imputed.

Code

set.seed(4322023)
data_final_i <- data_final |> data.frame() |>
  impute_rhd(less_harm ~ ecig_30day + ecig_reg) |>
  impute_rhd(age_range ~ ecig_30day + ecig_reg + less_harm) |>
  impute_rhd(ecig_nico ~ age_range ) |>
  impute_rlm(puff_num ~ ecig_30day ) |>
  impute_cart(harm_perc ~ ecig_reg) |>
  impute_cart(price_paid ~ ecig_reg + harm_perc) |>
  impute_cart(ecig_flavored ~ ecig_reg + price_paid) |>
  impute_cart(ecig_nico ~ ecig_reg + ecig_flavored) |>
  impute_pmm(min_num ~ ecig_30day + age_range + less_harm) |>
  as_tibble()

Code

data_lin_i <- data_final_i |>
  select(personid,ecig_30day,puff_num,ecig_nico,age_range,less_harm)

The imputed dataset was then checked to see if it was still missing any values.

Code

miss_var_summary(data_lin_i) |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
personid	0	0
ecig_30day	0	0
puff_num	0	0
ecig_nico	0	0
age_range	0	0
less_harm	0	0

8.2 Outcome Transformation

The distribution of the outcome variable was plotted.

Code

p1 <- 
  ggplot(data_lin_i, aes(x = ecig_30day)) +
  geom_histogram(bins = 15, 
  fill = "#2b8cbe", col = "white") +
  labs( x = "Number of days used an e-cigarette", y ="")

p2 <- 
  ggplot(data_lin_i, aes(sample = ecig_30day)) + 
  geom_qq(col = "#2b8cbe") + geom_qq_line(col = "red") +
  labs(x = "", y = "Number of days used an e-cigarette")

p3 <- 
  ggplot(data_lin_i, aes(x = "", y = ecig_30day)) +
  geom_violin(fill = "#2b8cbe", alpha = 0.3) + 
  geom_boxplot(fill = "#2b8cbe", width = 0.3,
               outlier.color = "red") +
  labs(y = "Number of days used an e-cigarette", x = "") + coord_flip()

p2 + p1 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) + 
  plot_annotation(title = "Distribution of the number of days participants used an e-cigarette in the past 30 days",
         subtitle = glue("Across ", nrow(data_lin_i), 
                           " participants, taken as a random sample from the DS1001 dataset."),
         caption = "Data taken from the Population Assessment of Tobacco and Health (PATH) Study.")

The outcome variable was modified so that the data would be strictly positive, making it possible to assess the need for transformation. 1 was added to each value.

Code

data_lin_i <- data_lin_i |>
  mutate(int_ecig_30day = ecig_30day + 1) 
boxCox(data_lin_i$int_ecig_30day ~ 1)

Code

powerTransform(data_lin_i$int_ecig_30day ~ 1)

Estimated transformation parameter 
        Y1 
-0.1938334

The Box-Cox plot peaks at a Y value of -0.19, which is approximately close enough to -1 to justify using -1 on Tukey’s ladder of power transformations, which suggests taking the inverse of the outcome.

Code

p1 <- 
  ggplot(data_lin_i, aes(x = log(int_ecig_30day))) +
  geom_histogram(bins = 15, 
  fill = "#2b8cbe", col = "white") +
  labs( x = "Natural Logarithm of the Number of days participant used an e-cigarette", y ="")

p2 <- 
  ggplot(data_lin_i, aes(sample = log(int_ecig_30day))) + 
  geom_qq(col = "#2b8cbe") + geom_qq_line(col = "red") +
  labs(x = "", y = str_wrap("Natural Logarithm of the Number of days participant used an e-cigarette", width = 50))

p3 <- 
  ggplot(data_lin_i, aes(x = "", y = log(int_ecig_30day))) +
  geom_violin(fill = "#2b8cbe", alpha = 0.3) + 
  geom_boxplot(fill = "#2b8cbe", width = 0.3,
               outlier.color = "red") +
  labs(y = "Natural Logarithm of the Number of days participant used an e-cigarette", x = "") + coord_flip()

p2 + p1 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) + 
  plot_annotation(title = "Distribution of the natural logarithm of the number of days participants used an e-cigarette in the past 30 days",
         subtitle = glue("Across ", nrow(data_lin_i), 
                           " participants, taken as a random sample from the DS1001 dataset."),
         caption = "Data taken from the Population Assessment of Tobacco and Health (PATH) Study.")

8.3 Scatterplot Matrix and Collinearity

A new variable log_ecig_30day was created by taking the natural logarithm of the outcome variable, which itself was modified by adding 1 to each of it’s values.

Code

data_lin_i <- data_lin_i |>
  mutate(log_ecig_30day = log(int_ecig_30day))

The ggpairs function was used to plot a scatterplot matrix.

Code

ggpairs(data_lin_i, columns = c("puff_num", "age_range", "less_harm", 
                               "ecig_nico", "log_ecig_30day"))

mod_A was fit using the lm function. The vif function from the car package was used to estimate if the variables had significant collinearity with each other.

Code

mod_A_car <- lm(ecig_30day ~ puff_num + age_range + less_harm + ecig_nico, data = data_lin_i )
car::vif(mod_A_car)

              GVIF Df GVIF^(1/(2*Df))
puff_num  1.011415  1        1.005691
age_range 1.045301  5        1.004440
less_harm 1.006246  1        1.003118
ecig_nico 1.044649  1        1.022081

From the output, the variables do not seem to have significant collinearity with each other.

8.4 Model A

8.4.1 Fitting Model A

Model A was fit using both lm and ols. The outcome variable is log_ecig_30day, and the predictors are puff_num (A Quantitative variable), age_range (A multi-categorical variable), less_harm and ecig_nico(both binary variables).

Code

mod_A <- lm(log_ecig_30day ~ puff_num + age_range + less_harm + ecig_nico, data = data_lin_i )

Code

dd <- datadist(data_lin_i)
options(datadist = "dd")
mod_A_ols <- ols(log_ecig_30day ~ puff_num + age_range + less_harm + ecig_nico, data = data_lin_i, 
                 x = TRUE, y = TRUE )

8.4.2 Tidied Coefficient Estimates (Model A)

With confidence levels of 90%, the lm model (mod_A) had it’s coefficients tidied and placed in a table that was presented using kable.

Code

tidy(mod_A, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, 
         low90 = conf.low, high90 = conf.high, 
         p = p.value) |>
  kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	se	low90	high90	p
(Intercept)	0.980	0.138	0.753	1.207	0.000
puff_num	0.055	0.004	0.048	0.062	0.000
age_range18to24	-0.204	0.142	-0.437	0.030	0.151
age_range25to34	-0.082	0.145	-0.320	0.157	0.574
age_range35to44	-0.169	0.151	-0.417	0.080	0.263
age_range45to54	-0.040	0.152	-0.290	0.210	0.794
age_range>55	0.068	0.155	-0.187	0.324	0.660
less_harmNo	-0.183	0.070	-0.298	-0.067	0.009
ecig_nicoNo	-0.117	0.086	-0.258	0.024	0.172

8.4.3 Summarizing Fit (Model A)

Using the glance function. the numerical summaries of mod_A’s fit were placed in a table.

Code

glance(mod_A) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, 
         AIC, BIC, nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))|>
  kable_classic_2(font_size = 24,full_width = F)

r2	adjr2	sigma	AIC	BIC	nobs	df	df.residual
0.142	0.137	0.97	3350.3	3401.2	1200	8	1191

8.4.4 Regression Diagnostics (Model A)

The four main diagnostic residual plots were visualized.

Code

par(mfrow = c(2,2)); plot(mod_A); par(mfrow = c(1,1))

From the normal Q-Q plot, there appears to be serious problems with the assumption of normality, despite the transformation of the outcome variable. There is also a definite pattern in the residuals vs fitted values plot, which has implications regarding the assumption of linearity. The scale-location plot also has a definite non-linear pattern, indicating problems with the assumption of constant variance.

8.5 Non-Linearity

Adding a non-linear term to model A might help with the problems visualized with the various assumptions made in a linear regression model. A Spearman $p^2$ plot was made with the predictors in model A.

Code

plot(spearman2(log_ecig_30day ~ puff_num + age_range + less_harm + ecig_nico, data = data_lin_i ))

From the plot, if a non-linear term were to improve the fit of the model, the predictor most likely to do so would be puff_num, followed much less closely by less_harm.

A restricted cubic spline with 4 knots in puff_num was then added to the model, spending 2 additional degrees of freedom as compared to model A (with the main effect of puff_num).

An interaction term between the main effects of puff_num and less_harm was added to the model, spending a single additional degree of freedom.

Overall, this would result in 3 additional degrees of freedom being spent.

8.6 Model B

8.6.1 Fitting Model B

Using lm and ols, model B was fit, using a restricted cubic spline on puff_num and an interaction term between the main effects of puff_num and less_harm.

Code

mod_B <- lm(log_ecig_30day ~ rcs(puff_num,4) + age_range + less_harm + ecig_nico + puff_num%ia%less_harm, data = data_lin_i )

Code

dd <- datadist(data_lin_i)
options(datadist = "dd")

mod_B_ols <- ols(log_ecig_30day ~ rcs(puff_num,4) + age_range + less_harm + ecig_nico + puff_num%ia%less_harm, data = data_lin_i,
                 x = TRUE, y = TRUE)

8.6.2 Tidied Coefficient Estimates (Model B)

mod_B had it’s coefficients tidied with a 90% confidence interval and placed in a tidy table.

Code

tidy(mod_B, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, 
         low90 = conf.low, high90 = conf.high, 
         p = p.value) |>
  kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	se	low90	high90	p
(Intercept)	3.476	0.214	3.124	3.828	0.000
rcs(puff_num, 4)puff_num	-0.620	0.040	-0.686	-0.554	0.000
rcs(puff_num, 4)puff_num'	11.472	0.493	10.660	12.283	0.000
rcs(puff_num, 4)puff_num''	-57.329	2.382	-61.251	-53.408	0.000
age_range18to24	-0.103	0.112	-0.287	0.081	0.357
age_range25to34	0.020	0.114	-0.169	0.208	0.863
age_range35to44	-0.060	0.119	-0.256	0.135	0.612
age_range45to54	0.025	0.120	-0.173	0.222	0.838
age_range>55	0.155	0.122	-0.047	0.356	0.206
less_harmNo	0.144	0.114	-0.043	0.331	0.204
ecig_nicoNo	-0.121	0.068	-0.232	-0.009	0.076
puff_num %ia% less_harm	-0.031	0.014	-0.054	-0.008	0.029

8.6.3 Effects Plot for Model B

The effects plot for model mod_B_ols was created.

Code

plot(summary(mod_B_ols))

8.6.4 Summarizing Fit (Model B)

Using the glance function, the numerical summaries of model B’s fit were placed in a table.

Code

glance(mod_B) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, 
         AIC, BIC, nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0)) |>
  kable_classic_2(font_size = 24,full_width = F)

r2	adjr2	sigma	AIC	BIC	nobs	df	df.residual
0.468	0.463	0.77	2782.8	2849	1200	11	1188

8.6.5 Regression Diagnostics (Model B)

The four residual diagnostic plots for model B were obtained.

Code

par(mfrow = c(2,2)); plot(mod_B); par(mfrow = c(1,1))

The residual plots are improved from the plots in model A. The q-q plot follows a more normal distribution. The residuals vs fitted values plot shows a more linear line, and the scale-location plot shows a more linear model as well. The assumptions of normality, linearity and constant variance are adhered to more rigorously in the residuals of model B, as compared to model A.

While there are significant outliers in the q-q plot, the residuals vs leverage plot does not show points that have a Cook’s distance of more than 0.5, indicating that these outliers do not exert an undue influence on the model. There are points that have an unusual combination of predictor variables, but their influence on the model is not significant.

8.7 Validating Models A and B

After setting a seed, the data_lin_i dataset was divided into a training and a testing sample. Two models were created, one model with linear terms alone, and the other model with the non-linear terms recommended by the Spearman $p^2$ plot. The coefficients obtained from these models were then used to obtain the predicted log_ecig_30day values in the holdout test sample.

Code

set.seed(4322023)

data_lin_split <- initial_split(data_lin_i, prop = 0.7)
data_lin_train <- training(data_lin_split)
data_lin_test <- testing(data_lin_split)

Functions from the yardstick package were used to obtain key summary of fit statistics for both the models’ predictions of the holdout testing sample. The metrics were then placed in a tibble and displayed using kable.

Code

mod_A_train <- ols(log_ecig_30day ~ puff_num + age_range + less_harm + ecig_nico, data = data_lin_train)
mod_B_train <- ols(log_ecig_30day ~ rcs(puff_num,4) + age_range + less_harm + ecig_nico + puff_num%ia%less_harm,
                  data = data_lin_train )
mod_A_test_aug <- augment(mod_A, newdata = data_lin_test)
mod_B_test_aug <- augment(mod_B, newdata = data_lin_test)
rep1 <- rmse(data = mod_A_test_aug, truth = log_ecig_30day, estimate = .fitted)
rep2 <- rmse(data = mod_B_test_aug, truth = log_ecig_30day, estimate = .fitted)
rep3 <- rsq(data = mod_A_test_aug, truth = log_ecig_30day, estimate = .fitted)
rep4 <- rsq(data = mod_B_test_aug, truth = log_ecig_30day, estimate = .fitted)
rep5 <- mae(data = mod_A_test_aug, truth = log_ecig_30day, estimate = .fitted)
rep6 <- mae(data = mod_B_test_aug, truth = log_ecig_30day, estimate = .fitted)
rep_full <- bind_rows(rep1, rep2, rep3, rep4, rep5, rep6)
rep_full <- rep_full |> 
  mutate(Models = c("Model A Holdout RMSE","Model B Holdout RMSE","Model A Holdout $R^2$","Model B Holdout $R^2$","Model A Holdout MAE","Model B Holdout MAE"))
rep_full <- rep_full |>
  select(Models, Metric = .metric, Estimate = .estimate)
rep_full |> kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

Models	Metric	Estimate
Model A Holdout RMSE	rmse	0.9562983
Model B Holdout RMSE	rmse	0.7614010
Model A Holdout $R^2$	rsq	0.1708639
Model B Holdout $R^2$	rsq	0.4727390
Model A Holdout MAE	mae	0.8158778
Model B Holdout MAE	mae	0.6037231

8.7.1 Holdout RMSE, $R^2$ and MAE

Model	Holdout RMSE	Holdout $R^2$	Holdout MAE
A	0.9563	0.1709	0.8159
B	0.7614	0.4727	0.5992

8.7.2 Validated $R^2$, MSE and IC statistics

After setting a seed, the validate function was used to obtain the validated R-squared and MSE values for models A and B.

Code

set.seed(4322023); (valA <- validate(mod_A_ols))

          index.orig training   test optimism index.corrected  n
R-square      0.1423   0.1495 0.1330   0.0165          0.1258 40
MSE           0.9393   0.9284 0.9495  -0.0210          0.9603 40
g             0.3083   0.3251 0.3141   0.0110          0.2972 40
Intercept     0.0000   0.0000 0.0266  -0.0266          0.0266 40
Slope         1.0000   1.0000 0.9811   0.0189          0.9811 40

Code

set.seed(4322023); (valB <- validate(mod_B_ols))

          index.orig training   test optimism index.corrected  n
R-square      0.4682   0.4825 0.4596   0.0229          0.4452 40
MSE           0.5824   0.5647 0.5918  -0.0272          0.6096 40
g             0.7450   0.7545 0.7403   0.0142          0.7308 40
Intercept     0.0000   0.0000 0.0273  -0.0273          0.0273 40
Slope         1.0000   1.0000 0.9757   0.0243          0.9757 40

Model	Validated $R^2$	Validated MSE	AIC	BIC	DF
A	0.1258	0.9603	3349.8	3400.7	8
B	0.4452	0.6096	2782.8	2849	11

8.7.3 ANOVA to compare Models A and B

Code

anova(mod_A, mod_B)

Analysis of Variance Table

Model 1: log_ecig_30day ~ puff_num + age_range + less_harm + ecig_nico
Model 2: log_ecig_30day ~ rcs(puff_num, 4) + age_range + less_harm + ecig_nico + 
    puff_num %ia% less_harm
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1   1191 1127.17                                  
2   1188  698.93  3    428.24 242.63 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA table shows that model B has 3 additional degrees of freedom as compared to model A, and that adding non linear terms has resulted in a significant improvement in the predictive power of the model.

8.8 Final Linear Regression Model

I prefer Model B, because:

Model B has a higher validated $R^2$, and a higher holdout sample $R^2$.
Model B has a lower validated MSE (Mean Squared Error) and a lower holdout RMSE.
Model B has a lower holdout MAE, and a lower AIC and BIC.
Model B’s residual plots show that the assumptions of linearity, normality and non-heteroscedasticity are not violated.

8.8.1 Winning Model’s Parameter Estimates

The model B’s raw and adjusted $R^2$ were obtained.

Code

mod_B_ols

Linear Regression Model

ols(formula = log_ecig_30day ~ rcs(puff_num, 4) + age_range + 
    less_harm + ecig_nico + puff_num %ia% less_harm, data = data_lin_i, 
    x = TRUE, y = TRUE)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs    1200    LR chi2    757.68    R2       0.468    
sigma0.7670    d.f.           11    R2 adj   0.463    
d.f.   1188    Pr(> chi2) 0.0000    g        0.745    

Residuals

     Min       1Q   Median       3Q      Max 
-3.17538 -0.63628  0.04184  0.48501  2.62802 

                        Coef     S.E.   t      Pr(>|t|)
Intercept                 3.4764 0.2139  16.26 <0.0001 
puff_num                 -0.6200 0.0398 -15.57 <0.0001 
puff_num'                11.4717 0.4931  23.26 <0.0001 
puff_num''              -57.3292 2.3822 -24.07 <0.0001 
age_range=18to24         -0.1032 0.1119  -0.92 0.3568  
age_range=25to34          0.0197 0.1143   0.17 0.8635  
age_range=35to44         -0.0605 0.1190  -0.51 0.6115  
age_range=45to54          0.0246 0.1199   0.20 0.8377  
age_range=>55             0.1549 0.1224   1.27 0.2059  
less_harm=No              0.1445 0.1136   1.27 0.2037  
ecig_nico=No             -0.1206 0.0678  -1.78 0.0756  
puff_num * less_harm=No  -0.0309 0.0141  -2.19 0.0290

8.8.2 Model Coefficients

Model B’s coefficients were obtained with 90% confidence intervals and presented using kable.

Code

tidy_mod_B <- tidy(mod_B, conf.int = TRUE, conf.level = 0.90)

tidy_mod_B |>
  kbl(digits = 4) |> 
  kable_classic_2(font_size = 24, full_width = F)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	3.4764	0.2139	16.2554	0.0000	3.1243	3.8284
rcs(puff_num, 4)puff_num	-0.6200	0.0398	-15.5677	0.0000	-0.6855	-0.5544
rcs(puff_num, 4)puff_num'	11.4717	0.4931	23.2637	0.0000	10.6599	12.2834
rcs(puff_num, 4)puff_num''	-57.3292	2.3822	-24.0658	0.0000	-61.2506	-53.4078
age_range18to24	-0.1032	0.1119	-0.9219	0.3568	-0.2874	0.0811
age_range25to34	0.0197	0.1143	0.1720	0.8635	-0.1685	0.2079
age_range35to44	-0.0605	0.1190	-0.5080	0.6115	-0.2564	0.1355
age_range45to54	0.0246	0.1199	0.2048	0.8377	-0.1728	0.2219
age_range>55	0.1549	0.1224	1.2655	0.2059	-0.0466	0.3564
less_harmNo	0.1445	0.1136	1.2718	0.2037	-0.0425	0.3314
ecig_nicoNo	-0.1206	0.0678	-1.7783	0.0756	-0.2322	-0.0090
puff_num %ia% less_harm	-0.0309	0.0141	-2.1867	0.0290	-0.0541	-0.0076

Code

ggplot(Predict(mod_B_ols))

The coefficients for the non-linear term are significantly larger than expected. Given that the outcome of interest is a count outcome, and the predictor (puff_num) has a very intuitive relationship with the outcome variable, in that individuals who have a zero for their outcome (ecig_30day) could be expected to have significantly lower (or even zero) values of puff_num, and individuals with any particular value of puff_num would have a collinear relationship with the outcome variable. It can be reasonably expected that the model is being driven significantly towards a particular value of log_ecig_30day, which could result in the exploding coefficients seen in the table of coefficients.

This indicates that model B is not the ideal sort of model to be used for a count outcome like ecig_30day.

Following this, the validated $R^2$ was also obtained, along with the validated MSE.

Code

set.seed(4322023); (valB_imp <- validate(mod_B_ols))

          index.orig training   test optimism index.corrected  n
R-square      0.4682   0.4825 0.4596   0.0229          0.4452 40
MSE           0.5824   0.5647 0.5918  -0.0272          0.6096 40
g             0.7450   0.7545 0.7403   0.0142          0.7308 40
Intercept     0.0000   0.0000 0.0273  -0.0273          0.0273 40
Slope         1.0000   1.0000 0.9757   0.0243          0.9757 40

An ANOVA test performed on mod_B_ols shows that a non linear term (a restricted cubic spline with 4 knots) on puff_num had significant impact on the predictive ability of model B, along with the interaction term between the main effect of puff_num and less_harm.

Code

anova(mod_B_ols)

                Analysis of Variance          Response: log_ecig_30day 

 Factor                                              d.f. Partial SS
 puff_num  (Factor+Higher Order Factors)                4 589.511975
  All Interactions                                      1   2.813085
  Nonlinear                                             2 425.614545
 age_range                                              5   8.073412
 less_harm  (Factor+Higher Order Factors)               2   3.811854
  All Interactions                                      1   2.813085
 ecig_nico                                              1   1.860506
 puff_num * less_harm  (Factor+Higher Order Factors)    1   2.813085
 TOTAL NONLINEAR + INTERACTION                          3 428.235704
 REGRESSION                                            11 615.229768
 ERROR                                               1188 698.931962
 MS          F      P     
 147.3779937 250.50 <.0001
   2.8130849   4.78 0.0290
 212.8072723 361.72 <.0001
   1.6146824   2.74 0.0179
   1.9059270   3.24 0.0395
   2.8130849   4.78 0.0290
   1.8605056   3.16 0.0756
   2.8130849   4.78 0.0290
 142.7452346 242.63 <.0001
  55.9299789  95.07 <.0001
   0.5883266

The ANOVA table shows that model B has 12 total degrees of freedom, with 4 non-linear terms of freedom.

Metric	Value
Raw $R^2$	0.468
Adjusted $R^2$	0.463
Validated $R^2$	0.4452
AIC	2782.8
BIC	2849
Validated MSE (Mean Squared Error)	0.6096

8.8.3 Effects Plot for Winning Model

Code

summary(mod_B_ols, conf.int = 0.90)

             Effects              Response : log_ecig_30day 

 Factor                    Low    High   Diff.  Effect    S.E.     Lower 0.9
 puff_num                  5.7586 7.0476 1.2889  0.596830 0.021791  0.560960
 age_range - <18:18to24    2.0000 1.0000     NA  0.103180 0.111920 -0.081062
 age_range - 25to34:18to24 2.0000 3.0000     NA  0.122840 0.061711  0.021257
 age_range - 35to44:18to24 2.0000 4.0000     NA  0.042718 0.069549 -0.071770
 age_range - 45to54:18to24 2.0000 5.0000     NA  0.127730 0.071094  0.010703
 age_range - >55:18to24    2.0000 6.0000     NA  0.258070 0.075061  0.134510
 less_harm - No:Yes        1.0000 2.0000     NA -0.041346 0.057248 -0.135580
 ecig_nico - No:Yes        1.0000 2.0000     NA -0.120600 0.067818 -0.232240
 Upper 0.9 
  0.6327000
  0.2874200
  0.2244300
  0.1572000
  0.2447600
  0.3816300
  0.0528930
 -0.0089634

Adjusted to: puff_num=6.016428 less_harm=Yes

Code

plot(summary(mod_B_ols, conf.int = 0.90))

From the summary above, for the puff_num coefficient, in model mod_b_ols,

We can conclude that the estimated effect of moving puff_num from 5.76 to 7.05 results in an increase of 0.597 in log_ecig_30day, with 90% confidence intervals of 0.595 to 0.733. IE, for model B,

As an individual’s number of e-cigarette puffs taken the last time they smoked an e-cigarette increases from 5.76 to 7.05, the predicted logarithm of the number of days they smoked an e-cigarette in the past 30 days (plus one day) increases by 0.597, with 90% CI(0.561 to 0.633), provided the individual’s age when they began smoking, their perception of the harmfulness of e-cigarettes, and their use or lack thereof of nicotine containing e-cigarettes are all held constant.

Given that this confidence interval does not include 0, this is a scientifically meaningful effect.

8.8.4 Model Calibration

Code

set.seed(4322023); plot(calibrate(mod_B_ols))


n=1200   Mean absolute error=0.222   Mean squared error=0.06564
0.9 Quantile of absolute error=0.364

The model does not appear to be particularly well calibrated. It under-predicts at the high and low ends of the actual log_ecig_30day values, and over-predicts elsewhere.

8.8.5 Numerical Description of Effect Sizes

Code

summary(mod_B_ols, conf.int = 0.90) |> kable(digits = 3) |> 
  kable_classic_2(font_size = 24, full_width = F)

	Low	High	Diff.	Effect	S.E.	Lower 0.9	Upper 0.9	Type
puff_num	5.759	7.048	1.289	0.597	0.022	0.561	0.633	1
age_range - <18:18to24	2.000	1.000	NA	0.103	0.112	-0.081	0.287	1
age_range - 25to34:18to24	2.000	3.000	NA	0.123	0.062	0.021	0.224	1
age_range - 35to44:18to24	2.000	4.000	NA	0.043	0.070	-0.072	0.157	1
age_range - 45to54:18to24	2.000	5.000	NA	0.128	0.071	0.011	0.245	1
age_range - >55:18to24	2.000	6.000	NA	0.258	0.075	0.135	0.382	1
less_harm - No:Yes	1.000	2.000	NA	-0.041	0.057	-0.136	0.053	1
ecig_nico - No:Yes	1.000	2.000	NA	-0.121	0.068	-0.232	-0.009	1

puff_num description :

For two subjects with the same age range at which they started smoking e-cigarettes, the same perception of the harmfulness of e-cigarettes, the same kind of usage or lack thereof of e-cigarettes containing nicotine, and if one subject took 5.76 e-cigarette puffs the last time they smoked an e-cigarette, and the other took 7.05 e-cigarettes puffs 7.05 the last time they smoked an e-cigarette, the model estimates that subject 1 will have a log(number of days smoked in the past 30 days + 1) value 0.597 units higher than subject 2’s log(number of days smoked in the past 30 days + 1). The 90% confidence interval for that estimate ranges from 0.561 to 0.633.

8.8.6 Nomogram of Winning Model

Code

plot(nomogram(mod_B_ols))

8.8.7 Prediction for a New Subject

A new theoretical subject, who took 5 puffs from an e-cigarette the last time they smoked an e-cigarette, who started smoking e-cigarettes when they were under 18, who uses e-cigarettes because they thought e-cigarettes might be less harmful to them than regular cigarettes, who usually uses an e-cigarette containing nicotine, was created. Their predicted log_ecig_30day value was then obtained and exponentiated. 1 was added to the exponentiated value to complete the transformation back into the original value.

Code

preds <- predict(mod_B_ols, expand.grid(puff_num = 7, age_range = "<18",
                           less_harm = "Yes", ecig_nico = "Yes"),
conf.int = 0.90, conf.type = "individual") |> as_tibble() |>
  mutate(point_estimate = (exp(linear.predictors) - 1),
         lower_orig = (exp(lower) - 1),
         upper_orig = (exp(upper) - 1)) |>
  select(point_estimate,lower_orig,upper_orig)
preds |> kable() |>
  kable_classic_2(font_size = 24,full_width = F)

point_estimate	lower_orig	upper_orig
2.720541	0.040103	12.3087

Predictor Values	Predicted days participant used an e-cigarette in the past 30 days	90% Confidence Interval
puff_num = 7, age_range = “<18”,less_harm = “Yes”, ecig_nico = “Yes”	2.72	0.04 to 12.31

9 Logistic Regression Analyses

9.1 Missingness

Code

data_log |>
  miss_var_summary() |>
  kbl() |>
  kable_classic_2(font_size = 24,full_width = F)

variable	n_miss	pct_miss
min_num	853	71.08333
ecig_nico	832	69.33333
ecig_flavored	831	69.25000
price_paid	393	32.75000
harm_perc	12	1.00000
personid	0	0.00000
ecig_reg	0	0.00000

There is missing data in all of the variables under study, and the outcome variable ecig_reg does not have any missing values. There are missing predictor values for more than 10% of the subjects under study, and more than 50 subjects have missing values.

Single imputation was already performed on the complete dataset, to account for missingness. The logistic regression dataset was selected.

Code

data_log_i <- data_final_i |>
  select(personid,ecig_reg,harm_perc,price_paid,ecig_flavored,ecig_nico,min_num)
miss_var_summary(data_log_i)

# A tibble: 7 × 3
  variable      n_miss pct_miss
  <chr>          <int>    <dbl>
1 personid           0        0
2 ecig_reg           0        0
3 harm_perc          0        0
4 price_paid         0        0
5 ecig_flavored      0        0
6 ecig_nico          0        0
7 min_num            0        0

There are no missing predictor variables in the logistic regression dataset after performing single imputation.

9.2 Model Y

9.2.1 Fitting Model Y

Model Y was fit with both lrm and glm. ecig_reg was the outcome variable, and the predictors were min_num, ecig_nico, harm_perc and price_paid.

Code

mod_Y <- glm(ecig_reg ~ min_num + ecig_nico + ecig_flavored + harm_perc + price_paid,
            data = data_log_i, family = binomial())

ddd <- datadist(data_log_i)
options(datadist = "ddd")

mod_Y_lrm <- lrm(ecig_reg ~ min_num + ecig_nico + ecig_flavored + harm_perc + price_paid,
            data = data_log_i, x = TRUE, y = TRUE)

9.2.2 Tidied Odds Ratio Estimates (Model Y)

Code

tidy(mod_Y, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, se = std.error, 
         low95 = conf.low, high95 = conf.high, p = p.value) |>
  kable(digits = 3) |>
  kable_classic_2(font_size = 24,full_width = F)

term	estimate	se	low95	high95	p
(Intercept)	18.682	0.314	10.250	35.172	0.000
min_num	1.000	0.001	0.999	1.001	0.733
ecig_nicoNo	0.927	0.198	0.632	1.375	0.700
ecig_flavoredNo	0.116	0.248	0.070	0.186	0.000
harm_percsame_or_more	1.196	0.164	0.869	1.653	0.274
price_paid$10to$20	0.230	0.299	0.126	0.409	0.000
price_paid$21to$100	0.117	0.293	0.065	0.204	0.000
price_paid>$100	0.055	0.479	0.021	0.140	0.000

9.2.3 Effects Plot (Model Y)

Code

plot(summary(mod_Y_lrm))

9.2.4 Summarizing Fit (Model Y)

Code

mod_Y_lrm

Logistic Regression Model

lrm(formula = ecig_reg ~ min_num + ecig_nico + ecig_flavored + 
    harm_perc + price_paid, data = data_log_i, x = TRUE, y = TRUE)

                       Model Likelihood      Discrimination    Rank Discrim.    
                             Ratio Test             Indexes          Indexes    
Obs          1200    LR chi2     129.72      R2       0.145    C       0.630    
 Yes          369    d.f.             7     R2(7,1200)0.097    Dxy     0.261    
 No           831    Pr(> chi2) <0.0001    R2(7,766.6)0.148    gamma   0.264    
max |deriv| 1e-07                            Brier    0.181    tau-a   0.111    

                       Coef    S.E.   Wald Z Pr(>|Z|)
Intercept               2.9276 0.3140  9.32  <0.0001 
min_num                -0.0002 0.0006 -0.34  0.7335  
ecig_nico=No           -0.0762 0.1979 -0.39  0.7001  
ecig_flavored=No       -2.1513 0.2479 -8.68  <0.0001 
harm_perc=same_or_more  0.1790 0.1638  1.09  0.2743  
price_paid=$10to$20    -1.4684 0.2994 -4.91  <0.0001 
price_paid=$21to$100   -2.1453 0.2925 -7.33  <0.0001 
price_paid=>$100       -2.8969 0.4788 -6.05  <0.0001

Code

glance(mod_Y) |>
  mutate(df = nobs - df.residual - 1) |>
  select(AIC, BIC, df, df.residual, nobs) |>
  kable(digits = 1)|>
  kable_classic_2(font_size = 24,full_width = F)

AIC	BIC	df	df.residual	nobs
1367.3	1408	7	1192	1200

9.2.5 Confusion Matrix (Model Y)

A temporary dataset, log_i_temp, for the putpose of generating a confusion matrix, was created. The outcome variable ecig_reg had it’s levels changed from “Yes” and “No” to 1 and 0.

Code

log_i_temp <- data_log_i |>
  mutate(ecig_reg = as.factor(ifelse(ecig_reg == "Yes",1,0)))

mod_Y_cm <- glm(ecig_reg ~ min_num + ecig_nico + ecig_flavored + harm_perc + price_paid,
            data = log_i_temp, family = binomial())

resY_aug <- augment(mod_Y_cm, type.predict = "response")

My prediction rule is that the fitted value of pr(ecig_reg = 1) needs to be greater than or equal to 0.5 for me to predict that ecig_reg is 1.

Code

cm_Y <- caret::confusionMatrix(
  data = factor(resY_aug$.fitted >= 0.5),
  reference = factor(resY_aug$ecig_reg == 1),
  positive = "TRUE")

cm_Y

Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   826  274
     TRUE      5   95
                                          
               Accuracy : 0.7675          
                 95% CI : (0.7425, 0.7911)
    No Information Rate : 0.6925          
    P-Value [Acc > NIR] : 4.676e-09       
                                          
                  Kappa : 0.3153          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.25745         
            Specificity : 0.99398         
         Pos Pred Value : 0.95000         
         Neg Pred Value : 0.75091         
             Prevalence : 0.30750         
         Detection Rate : 0.07917         
   Detection Prevalence : 0.08333         
      Balanced Accuracy : 0.62572         
                                          
       'Positive' Class : TRUE

9.3 Non-Linearity

A Spearman $p^2$ plot was created for assessing the terms that are most likely to add predictive power to the model.

Code

plot(spearman2(ecig_reg ~ min_num + ecig_nico + ecig_flavored + harm_perc + price_paid,
            data = data_log_i))

The Spearman plot suggests the use of a non-linear term in min_num, so a restricted cubic spline with 4 knots in min_num will be added to the model, spending two additional degrees of freedom than the main effects model.

An interaction term between the main effect of ecig_flavored and min_num will be added, spending an additional degree of freedom.

9.4 Model Z

Model Z will add a total of 3 additional degrees of freedom to Model Y, using two non-linear terms.

9.4.1 Fitting Model Z

Model Z was fit using both glm and lrm.

Code

mod_Z <- glm(ecig_reg ~ ecig_nico + rcs(min_num,4) + ecig_flavored + harm_perc + price_paid + ecig_flavored %ia% min_num,
            data = data_log_i, family = binomial())

ddd <- datadist(data_log_i)
options(datadist = "ddd")

mod_Z_lrm <- lrm(ecig_reg ~ rcs(min_num,4) + ecig_nico + ecig_flavored + harm_perc + price_paid + ecig_flavored %ia% min_num,
            data = data_log_i, x = TRUE, y = TRUE)

9.4.2 Tidied Odds Ratio Estimates (Model Z)

Code

tidy(mod_Z, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, 
         low90 = conf.low, high90 = conf.high, p = p.value) |>
  kable(digits = 3)|>
  kable_classic_2(font_size = 24,full_width = F)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

term	estimate	se	low90	high90	p
(Intercept)	0.021	0.726	0.006	0.065	0.000
ecig_nicoNo	0.796	0.277	0.510	1.271	0.412
rcs(min_num, 4)min_num	1.104	0.009	1.088	1.121	0.000
rcs(min_num, 4)min_num'	0.843	0.021	0.814	0.871	0.000
rcs(min_num, 4)min_num''	1.538	0.081	1.349	1.760	0.000
ecig_flavoredNo	0.048	0.715	0.014	0.151	0.000
harm_percsame_or_more	1.589	0.254	1.052	2.429	0.068
price_paid$10to$20	0.111	0.432	0.054	0.224	0.000
price_paid$21to$100	0.057	0.424	0.028	0.113	0.000
price_paid>$100	0.035	0.709	0.011	0.112	0.000
ecig_flavored %ia% min_num	1.003	0.004	0.997	1.009	0.446

9.4.3 Effects Plot (Model Z)

Code

plot(summary(mod_Z_lrm, conf.int = 0.90))

9.4.4 Summarizing Fit (Model Z)

Code

mod_Z_lrm

Logistic Regression Model

lrm(formula = ecig_reg ~ rcs(min_num, 4) + ecig_nico + ecig_flavored + 
    harm_perc + price_paid + ecig_flavored %ia% min_num, data = data_log_i, 
    x = TRUE, y = TRUE)

                       Model Likelihood       Discrimination    Rank Discrim.    
                             Ratio Test              Indexes          Indexes    
Obs          1200    LR chi2     773.30       R2       0.670    C       0.891    
 Yes          369    d.f.            10     R2(10,1200)0.471    Dxy     0.783    
 No           831    Pr(> chi2) <0.0001    R2(10,766.6)0.631    gamma   0.792    
max |deriv| 2e-08                             Brier    0.084    tau-a   0.334    

                           Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                  -3.8676 0.7261 -5.33  <0.0001 
min_num                     0.0987 0.0090 10.99  <0.0001 
min_num'                   -0.1707 0.0206 -8.29  <0.0001 
min_num''                   0.4303 0.0807  5.33  <0.0001 
ecig_nico=No               -0.2277 0.2773 -0.82  0.4116  
ecig_flavored=No           -3.0412 0.7147 -4.26  <0.0001 
harm_perc=same_or_more      0.4633 0.2540  1.82  0.0682  
price_paid=$10to$20        -2.1952 0.4318 -5.08  <0.0001 
price_paid=$21to$100       -2.8602 0.4241 -6.74  <0.0001 
price_paid=>$100           -3.3635 0.7092 -4.74  <0.0001 
ecig_flavored=No * min_num  0.0028 0.0037  0.76  0.4457

The Nagelkerke $R^2$ for model Z is 0.670. The C statistic is 0.891.

Code

glance(mod_Z) |>
  mutate(df = nobs - df.residual - 1) |>
  select(AIC, BIC, df, df.residual, nobs) |>
  kable(digits = 1)|>
  kable_classic_2(font_size = 24,full_width = F)

AIC	BIC	df	df.residual	nobs
729.7	785.7	10	1189	1200

9.4.5 Confusion Matrix (Model Z)

As in Model Y, my prediction rule is that the fitted value of pr(ecig_reg = 1) needs to be greater than or equal to 0.5 for me to predict that ecig_reg is 1.

Code

mod_Z_cm <- glm(ecig_reg ~ ecig_nico + rcs(min_num,4) + ecig_flavored + harm_perc + price_paid + ecig_flavored %ia% min_num,
            data = log_i_temp, family = binomial())

resZ_aug <- augment(mod_Z_cm, type.predict = "response")

Code

cm_Z <- caret::confusionMatrix(
  data = factor(resZ_aug$.fitted >= 0.5),
  reference = factor(resZ_aug$ecig_reg == 1),
  positive = "TRUE")

cm_Z

Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   817   99
     TRUE     14  270
                                          
               Accuracy : 0.9058          
                 95% CI : (0.8879, 0.9218)
    No Information Rate : 0.6925          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7638          
                                          
 Mcnemar's Test P-Value : 2.743e-15       
                                          
            Sensitivity : 0.7317          
            Specificity : 0.9832          
         Pos Pred Value : 0.9507          
         Neg Pred Value : 0.8919          
             Prevalence : 0.3075          
         Detection Rate : 0.2250          
   Detection Prevalence : 0.2367          
      Balanced Accuracy : 0.8574          
                                          
       'Positive' Class : TRUE

MODEL	NAGELKERKE $R^2$	C STATISTIC	AIC	BIC	DF
Y	0.145	0.630	1367.3	1408	7
Z	0.670	0.891	729.7	785.7	10

MODEL	SENSITIVITY	SPECIFICITY	POSITIVE PREDICTIVE VALUE
Y	0.258	0.994	0.950
Z	0.732	0.983	0.951

9.5 Validating Models Y and Z

Code

set.seed(4322023); (valY <- validate(mod_Y_lrm))

          index.orig training   test optimism index.corrected  n
Dxy           0.2606   0.3282 0.2917   0.0366          0.2241 40
R2            0.1445   0.1617 0.1356   0.0260          0.1185 40
Intercept     0.0000   0.0000 0.0657  -0.0657          0.0657 40
Slope         1.0000   1.0000 0.9093   0.0907          0.9093 40
Emax          0.0000   0.0000 0.0325   0.0325          0.0325 40
D             0.1073   0.1211 0.1003   0.0208          0.0864 40
U            -0.0017  -0.0017 0.0011  -0.0028          0.0011 40
Q             0.1089   0.1228 0.0992   0.0236          0.0853 40
B             0.1814   0.1786 0.1828  -0.0042          0.1856 40
g             0.7310   0.7958 0.7134   0.0824          0.6486 40
gp            0.1421   0.1532 0.1406   0.0126          0.1295 40

Code

set.seed(4322023); (valZ <- validate(mod_Z_lrm))

          index.orig training   test optimism index.corrected  n
Dxy           0.7827   0.7994 0.7809   0.0186          0.7641 40
R2            0.6701   0.6814 0.6641   0.0173          0.6528 40
Intercept     0.0000   0.0000 0.0495  -0.0495          0.0495 40
Slope         1.0000   1.0000 0.9496   0.0504          0.9496 40
Emax          0.0000   0.0000 0.0202   0.0202          0.0202 40
D             0.6436   0.6591 0.6356   0.0235          0.6201 40
U            -0.0017  -0.0017 0.0003  -0.0020          0.0003 40
Q             0.6452   0.6607 0.6353   0.0254          0.6198 40
B             0.0839   0.0815 0.0852  -0.0037          0.0876 40
g             3.3667   3.4866 3.3112   0.1754          3.1914 40
gp            0.3511   0.3546 0.3497   0.0049          0.3462 40

9.5.1 Validated $R^2$ and $C$ statistics for each model

MODEL	VALIDATED $R^2$	VALIDATED $C$ STATISTIC
Y	0.1185	0.612
Z	0.6528	0.882

9.6 Final Logistic Regression Model

I prefer Model Z, for the following reasons:

Model Z has a higher validated $R^2$ and $C$ statistic
Model Z has a higher Naegelkerke $R^2$
Model Z has a lower AIC and BIC
Model Z has a higher sensitivity, and a higher positive predictive value. While it has the lower specificity, the difference is not wide.

9.6.1 Winning Model’s Parameter Estimates

Code

mod_Z_lrm

Logistic Regression Model

lrm(formula = ecig_reg ~ rcs(min_num, 4) + ecig_nico + ecig_flavored + 
    harm_perc + price_paid + ecig_flavored %ia% min_num, data = data_log_i, 
    x = TRUE, y = TRUE)

                       Model Likelihood       Discrimination    Rank Discrim.    
                             Ratio Test              Indexes          Indexes    
Obs          1200    LR chi2     773.30       R2       0.670    C       0.891    
 Yes          369    d.f.            10     R2(10,1200)0.471    Dxy     0.783    
 No           831    Pr(> chi2) <0.0001    R2(10,766.6)0.631    gamma   0.792    
max |deriv| 2e-08                             Brier    0.084    tau-a   0.334    

                           Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                  -3.8676 0.7261 -5.33  <0.0001 
min_num                     0.0987 0.0090 10.99  <0.0001 
min_num'                   -0.1707 0.0206 -8.29  <0.0001 
min_num''                   0.4303 0.0807  5.33  <0.0001 
ecig_nico=No               -0.2277 0.2773 -0.82  0.4116  
ecig_flavored=No           -3.0412 0.7147 -4.26  <0.0001 
harm_perc=same_or_more      0.4633 0.2540  1.82  0.0682  
price_paid=$10to$20        -2.1952 0.4318 -5.08  <0.0001 
price_paid=$21to$100       -2.8602 0.4241 -6.74  <0.0001 
price_paid=>$100           -3.3635 0.7092 -4.74  <0.0001 
ecig_flavored=No * min_num  0.0028 0.0037  0.76  0.4457

Code

# Obtaining model coefficients and exponentiating them
tidy(mod_Z, exponentiate = TRUE,
conf.int = TRUE, conf.level = 0.90) |>
kable(digits = 3) |> 
  kable_classic_2(font_size = 24, full_width = F)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.021	0.726	-5.327	0.000	0.006	0.065
ecig_nicoNo	0.796	0.277	-0.821	0.412	0.510	1.271
rcs(min_num, 4)min_num	1.104	0.009	10.992	0.000	1.088	1.121
rcs(min_num, 4)min_num'	0.843	0.021	-8.289	0.000	0.814	0.871
rcs(min_num, 4)min_num''	1.538	0.081	5.330	0.000	1.349	1.760
ecig_flavoredNo	0.048	0.715	-4.255	0.000	0.014	0.151
harm_percsame_or_more	1.589	0.254	1.823	0.068	1.052	2.429
price_paid$10to$20	0.111	0.432	-5.084	0.000	0.054	0.224
price_paid$21to$100	0.057	0.424	-6.744	0.000	0.028	0.113
price_paid>$100	0.035	0.709	-4.743	0.000	0.011	0.112
ecig_flavored %ia% min_num	1.003	0.004	0.763	0.446	0.997	1.009

9.6.2 Plot of Effect Sizes for Winning Model

Code

summary(mod_Z_lrm)

             Effects              Response : ecig_reg 

 Factor                             Low High Diff. Effect    S.E.    Lower 0.95
 min_num                            90  180  90     1.056400 0.24375  0.578620 
  Odds Ratio                        90  180  90     2.875900      NA  1.783600 
 ecig_nico - No:Yes                  1    2  NA    -0.227660 0.27727 -0.771110 
  Odds Ratio                         1    2  NA     0.796390      NA  0.462500 
 ecig_flavored - No:Yes              1    2  NA    -2.705000 0.41467 -3.517800 
  Odds Ratio                         1    2  NA     0.066868      NA  0.029666 
 harm_perc - same_or_more:less_harm  1    2  NA     0.463250 0.25405 -0.034679 
  Odds Ratio                         1    2  NA     1.589200      NA  0.965920 
 price_paid - Below$10:$21to$100     3    1  NA     2.860200 0.42410  2.029000 
  Odds Ratio                         3    1  NA    17.465000      NA  7.606200 
 price_paid - $10to$20:$21to$100     3    2  NA     0.664980 0.25607  0.163100 
  Odds Ratio                         3    2  NA     1.944400      NA  1.177100 
 price_paid - >$100:$21to$100        3    4  NA    -0.503340 0.60451 -1.688200 
  Odds Ratio                         3    4  NA     0.604510      NA  0.184860 
 Upper 0.95
  1.53410  
  4.63710  
  0.31579  
  1.37130  
 -1.89230  
  0.15072  
  0.96118  
  2.61480  
  3.69140  
 40.10100  
  1.16690  
  3.21190  
  0.68147  
  1.97680  

Adjusted to: min_num=120 ecig_flavored=Yes

Code

plot(summary(mod_Z_lrm))

9.6.3 Numerical Description of Effect Sizes

Code

summary(mod_Z_lrm, conf.int = 0.90) |> kable(digits = 3) |> 
  kable_classic_2(font_size = 24, full_width = F)

	Low	High	Diff.	Effect	S.E.	Lower 0.9	Upper 0.9	Type
min_num	90	180	90	1.056	0.244	0.655	1.457	1
Odds Ratio	90	180	90	2.876	NA	1.926	4.294	2
ecig_nico - No:Yes	1	2	NA	-0.228	0.277	-0.684	0.228	1
Odds Ratio	1	2	NA	0.796	NA	0.505	1.257	2
ecig_flavored - No:Yes	1	2	NA	-2.705	0.415	-3.387	-2.023	1
Odds Ratio	1	2	NA	0.067	NA	0.034	0.132	2
harm_perc - same_or_more:less_harm	1	2	NA	0.463	0.254	0.045	0.881	1
Odds Ratio	1	2	NA	1.589	NA	1.046	2.414	2
price_paid - Below$10:$21to$100	3	1	NA	2.860	0.424	2.163	3.558	1
Odds Ratio	3	1	NA	17.465	NA	8.694	35.085	2
price_paid - $10to$20:$21to$100	3	2	NA	0.665	0.256	0.244	1.086	1
Odds Ratio	3	2	NA	1.944	NA	1.276	2.963	2
price_paid - >$100:$21to$100	3	4	NA	-0.503	0.605	-1.498	0.491	1
Odds Ratio	3	4	NA	0.605	NA	0.224	1.634	2

price_paid description:

For two individuals, subject 1 and 2, who have their first e-cigarette puff of the day the same number of minutes after waking up, who both use nicotine containing e-cigarettes, who have the same perception of the harmfulness of e-cigarettes in comparison to regular cigarettes, and use flavored e-cigarettes, and subject 1 pays less than 10 dollars for their e-cigarettes, and subject 2 pays between 21 to a 100 dollars for their e-cigarettes, according to model Z, the ratio of the odds of subject 1 being a regular e-cigarette smoker to the odds of subject 2 being a regular e-cigarette smoker is 17.465 The 90% confidence intervals around this estimate are (8.69 to 35.09).

Given that this interval does not include 1, this is a meaningful effect.

9.6.4 Plot of ROC Curve for Winning Model

The roc function from the pROC package was used to obtain the ROC curve for model Z(the lrm model). The dataset with the outcome variable recoded as “Yes” = 1, “No” = 0 was used.

Code

roc.mod <- 
    roc(log_i_temp$ecig_reg ~ predict(mod_Z_lrm, type="fitted"),
        ci = TRUE)

Setting levels: control = 0, case = 1

Setting direction: controls > cases

Code

plot(roc.mod, auc.polygon = TRUE, print.auc = TRUE,max.auc.polygon = TRUE,
     grid=c(0.1, 0.2),
     grid.col = c("green", "red"),
     auc.polygon.col = "skyblue",title = "ROC Curve for Model Z (lrm) with the AUC score")

9.6.5 Validated $R^2$ and $C$ statistic for Winning Model

MODEL	VALIDATED $R^2$	VALIDATED $C$ STATISTIC
Z	0.6528	0.882

9.6.6 Nomogram of Winning Model

Code

plot(nomogram(mod_Z_lrm, fun = plogis))

9.6.7 Predictions for Two New Subjects

Code

new_subj <-
  data.frame(min_num = c(60,60), ecig_flavored = c("Yes","Yes"),ecig_nico = c("Yes","Yes"),harm_perc = c("less_harm","less_harm"), price_paid = c("Below$10","$21to$100")) 
preds4 <- predict(mod_Z_lrm, newdata = new_subj, type = "fitted")

preds4

        1         2 
0.8584043 0.2576755

Subject Name	Subject Characteristics	Predicted Probability of being a regular E-cigarette smoker
Subject 1	min_num = 60 minutes, ecig_flavored = Yes, ecig_nico = Yes, harm_perc = “less_harm”, price_paid = “Below$10”	0.858
Subject 2	min_num = 60 minutes, ecig_flavored = Yes, ecig_nico = Yes, harm_perc = “less_harm”, price_paid = “$21to$100”	0.258

10 Discusssion

10.1 Answering My Research Questions

10.1.1 My First Research Question

E-Cigarette Perception, Smoking Habits, and their association with Heavy E-Cigarette Use:

Code

# Transforming the coefficients of Model B back to the original variable's scale
tidy_mod_B |> 
  mutate(estimate = sprintf("%.2f",(exp(estimate) - 1)),
         conf.low = sprintf("%.2f",(exp(conf.low) - 1)),
         conf.high = sprintf("%.2f",(exp(conf.high) - 1))) |>
  select(term,estimate,conf.low,conf.high, p.value) |>
  kbl(digits = 3) |> 
  kable_classic_2(font_size = 24, full_width = F)

term	estimate	conf.low	conf.high	p.value
(Intercept)	31.34	21.74	44.99	0.000
rcs(puff_num, 4)puff_num	-0.46	-0.50	-0.43	0.000
rcs(puff_num, 4)puff_num'	95957.21	42612.78	216078.82	0.000
rcs(puff_num, 4)puff_num''	-1.00	-1.00	-1.00	0.000
age_range18to24	-0.10	-0.25	0.08	0.357
age_range25to34	0.02	-0.16	0.23	0.863
age_range35to44	-0.06	-0.23	0.15	0.612
age_range45to54	0.02	-0.16	0.25	0.838
age_range>55	0.17	-0.05	0.43	0.206
less_harmNo	0.16	-0.04	0.39	0.204
ecig_nicoNo	-0.11	-0.21	-0.01	0.076
puff_num %ia% less_harm	-0.03	-0.05	-0.01	0.029

E-CIGARETTE PUFFS PER DAY:

Higher number of puffs taken per day are correlated with a higher number of days an e-cigarette was smoked in the past 30 days, provided all other variables are held constant. This is evident from the confidence intervals of puff_num, which do not include zero.

USAGE OF E-CIGARETTES WITH NICOTINE:

The usage of e-cigarettes that do not contain nicotine are associated with a significantly lower number of days an e-cigarette was smoked by a study participant in the past 30 days, provided all other variables are held constant.

The point estimate for this effect is -0.11, with 90%CI(-0.21 to -0.01). Given that the confidence interval does not include zero, this is a meaningful effect. (p = 0.076)

E-CIGARETTE SMOKING INITIATION AGE:

Initiation of e-cigarette smoking at a younger age does not have a meaningful correlation with heavy e-cigarette smoking by an individual in the past 30 days, provided all other variables are held constant. This is evident from the confidence intervals for the different levels of age_range, all of which contain zero, when compared to the number of days an e-cigarette was smoked in the past 30 days by individuals who initiated e-cigarette smoking when they were less than 18 years of age.

PERCEPTION OF THE HARMFULNESS OF E-CIGARETTES COMPARED TO REGULAR CIGARETTES:

The perception of the supposed healthiness of e-cigarettes when compared to regular cigarettes do not have a meaningful correlation with an individual’s e-cigarette smoking in the past 30 days, provided all other variables are held constant. This is evident from the confidence intervals for the term less_harmNo from the table of model B’s coefficients, which include zero. (p = 0.204)

In summation, to answer my first research question,

Factors such as a higher number of puffs taken from an e-cigarette daily, and usage of nicotine-containing e-cigarettes are correlated with heavier smoking habits over the past 30 days.

Factors such as an individual’s perception of the supposed healthiness of e-cigarettes when compared to regular cigarettes, and their age when they began smoking e-cigarettes, do not have a meaningful impact on an individual’s smoking habits over the past 30 days.

10.1.2 My Second Research Question

Regular e-cigarette use and it’s associated factors:

Code

summary(mod_Z_lrm, conf.int = 0.90) |> kable(digits = 3) |> 
  kable_classic_2(font_size = 24, full_width = F)

	Low	High	Diff.	Effect	S.E.	Lower 0.9	Upper 0.9	Type
min_num	90	180	90	1.056	0.244	0.655	1.457	1
Odds Ratio	90	180	90	2.876	NA	1.926	4.294	2
ecig_nico - No:Yes	1	2	NA	-0.228	0.277	-0.684	0.228	1
Odds Ratio	1	2	NA	0.796	NA	0.505	1.257	2
ecig_flavored - No:Yes	1	2	NA	-2.705	0.415	-3.387	-2.023	1
Odds Ratio	1	2	NA	0.067	NA	0.034	0.132	2
harm_perc - same_or_more:less_harm	1	2	NA	0.463	0.254	0.045	0.881	1
Odds Ratio	1	2	NA	1.589	NA	1.046	2.414	2
price_paid - Below$10:$21to$100	3	1	NA	2.860	0.424	2.163	3.558	1
Odds Ratio	3	1	NA	17.465	NA	8.694	35.085	2
price_paid - $10to$20:$21to$100	3	2	NA	0.665	0.256	0.244	1.086	1
Odds Ratio	3	2	NA	1.944	NA	1.276	2.963	2
price_paid - >$100:$21to$100	3	4	NA	-0.503	0.605	-1.498	0.491	1
Odds Ratio	3	4	NA	0.605	NA	0.224	1.634	2

TIME TAKEN FOR THE FIRST E-CIGARETTE PUFF OF THE DAY

Individuals who need to use e-cigarettes within an hour or an hour and a half of waking up have detectable higher odds of identifying as regular users of e-cigarettes, provided the other variables are held constant. This is evident in the 90% confidence intervals for the odds ratio in the coefficient table, which does not include 1.

USAGE OF FLAVORED E-CIGARETTES:

Individuals who do not use flavored e-cigarettes have detectable lower odds of identifying as regular users of e-cigarettes, provided the other variables are held constant. This is evident in the 90% confidence intervals for the odds ratio in the coefficient table, which does not include 1. The point estimate for this odds ratio is 0.07, with 90% confidence intervals (0.03 to 0.13). Individuals who do not use flavored e-cigarettes have only 0.07 times the odds(90%CI 0.03 to 0.13) of identifying as regular e-cigarette smokers, compared to individuals who do use flavored e-cigarettes.

COST OF E-CIGARETTE BRAND:

Individuals who pay less than 10 dollars for their e-cigarettes have detectably higher odds of identifying as regular e-cigarettes, as compared to individuals who paid anywhere between 21 to 100 dollars for their e-cigarette, provided the other variables are held constant. This is evident in the 90% confidence intervals for the odds ratio in the coefficient table, which do not include zero. Similarly, individuals who pay anywhere between 10 to 20 dollars for their e-cigarette also have detectably higher odds of identifying as regular e-cigarette smokers

However, individuals who pay more than 100 dollars for their e-cigarettes do not have meaningfully different odds of identifying as regular e-cigarette smokers compared to individuals who pay anywhere between 21 to 100 dollars for their e-cigarettes, provided the other variables remain constant. This is evident in the 90% confidence intervals for the odds ratio of this effect in the coefficient table, which includes 1.

USAGE OF E-CIGARETTES CONTAINING NICOTINE:

Individuals who do not use e-cigarettes with nicotine do not have meaningfully different odds of identifying as regular e-cigarette smokers compared to individuals who use e-cigarettes containing nicotine, provided the other variables are held constant. The 90% confidence intervals for the odds ratio of this effect includes 1.

PERCEPTION OF THE HARMFULNESS OF E-CIGARETTES AS COMPARED TO REGULAR CIGARETTES:

Individuals who perceive e-cigarettes as being the same or more harmful than regular e-cigarettes do not have meaningfully different odds of identifying as regular e-cigarette smokers compared to individuals who perceive e-cigarettes as being less harmful than regular e-cigarettes, provided the other variables are held constant. The odds ratio for this effect includes 1.

In summation, to answer my second research question,

Factors such as needing to use e-cigarettes immediately after waking up, using flavored e-cigarettes, and using cheaper e-cigarette brands are all associated with detectably higher odds of identifying as a regular e-cigarette user, provided the other variables are held constant.

Factors such as the usage of nicotine containing e-cigarettes, and the perceptiom of the harmfulness of e-cigarettes as compared to regular cigarettes do not have a meaningful impact on the odds of identifying as a regular e-cigarette user, provided all other variables remain constant.

10.2 Limitations and Next Steps

10.2.1 Model B and it’s limitations

Given that model B was predicting a count outcome, and given then the outcome was transformed using a logarithm transformation after adding a non-zero integer to the outcome, the resulting model is very highly driven by the puff_num variable, resulting in exploding coefficients for the restricted cubic spline applied on puff_num.

Model B outperformed Model A on all metrics that were studied. Visualization of the transformed outcome variable showed why that was the case, given the non-linear, skewed nature of the data under study. The data lent itself very well towards a non-linear term, even if the restricted cubic spline had exploding coefficients.

This outcome could be better modeled with a Poisson regression, and by using a larger dataset with lesser missing values. The next ideal step would be to expand the dataset to more than 1200, and model the outcome using a Poisson regression.

10.2.2 Model Z and it’s limitations

The glm fit of the model produced a warning stating that predicted probabilities of 0 or 1 had occurred. Given that the data under analysis had a large amount of missingness, imputation possibly skewed the model towards one direction or the other. It is important to note that this warning was only produced in the glm model, while the lrm model did not produce such a warning.

This model could be improved by vastly expanding the sample size, minimizing the effect of the missing data and smoothing out the predicted probabilites so that the preceding skew does not repeat itself.

10.3 Thoughts on Project A

Project A was more difficult than I expected it to be. I went in knowing it would take a decent amount of work and effort, and I was reasonably confident that I could tackle it, since I knew my experience in 431 had already helped prepare me for the work that would be required. The hardest part of Project A was dealing with the large number of missing values. The impact of imputation on both models was significant, and discussing the limitations and understanding how the models could be improved was a difficult task, although I did learn a lot from the difficulty I had with the data.

At the end of Project A, I wished that I had chosen a non count outcome for the linear regression.We hadn’t covered count outcomes at the time of submission of the Plan, and if I’d known the difficulties inherent in using linear regression for a count outcome when I started, I would have picked a different outcome variable.

11 Affirmation

I am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.

12 References

[1] Singh, T., Arrazola, R. A., Corey, C. G., Husten, C. G., Neff, L. J., Homa, D. M., & King, B. A. (2016). Tobacco use among middle and high school students—United States, 2011–2015. Morbidity and mortality weekly report, 65(14), 361-367.

[2] Dutra, L. M., & Glantz, S. A. (2014). High international electronic cigarette use among never smoker adolescents. Journal of Adolescent Health, 55(5), 595-597.

[3] Dutra, L. M., & Glantz, S. A. (2017). E-cigarettes and national adolescent cigarette use: 2004–2014. Pediatrics, 139(2).

[4] Glantz, S. A., & Bareham, D. W. (2018). E-cigarettes: use, effects on smoking, risks, and policy implications. Annual review of public health, 39, 215-235.

[5] United States Department of Health and Human Services. National Institutes of Health. National Institute on Drug Abuse, and United States Department of Health and Human Services. Food and Drug Administration. Center for Tobacco Products. Population Assessment of Tobacco and Health (PATH) Study [United States] Public-Use Files. Inter-university Consortium for Political and Social Research [distributor], 2022-10-07. https://doi.org/10.3886/ICPSR36498.v17

13 Session Information

Code

xfun::session_info()

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-5          askpass_1.2.0        backports_1.4.1     
  base64enc_0.1-3      bit_4.0.5            bit64_4.0.5         
  blob_1.2.4           boot_1.3-28.1        brio_1.1.4          
  broom_1.0.5          broom.helpers_1.14.0 bslib_0.6.1         
  cachem_1.0.8         callr_3.7.3          car_3.1-2           
  carData_3.0-5        caret_6.0-94         cellranger_1.1.0    
  checkmate_2.3.1      class_7.3-22         cli_3.6.1           
  clipr_0.8.0          clock_0.7.0          cluster_2.1.6       
  codetools_0.2-19     colorspace_2.1-0     compiler_4.3.2      
  conflicted_1.2.0     cpp11_0.4.7          crayon_1.5.2        
  curl_5.2.0           data.table_1.14.10   DBI_1.2.0           
  dbplyr_2.4.0         DEoptimR_1.1.3       desc_1.4.3          
  diagram_1.6.5        diffobj_0.3.5        digest_0.6.33       
  doRNG_1.8.6          dplyr_1.1.4          dtplyr_1.3.1        
  e1071_1.7-14         ellipsis_0.3.2       evaluate_0.23       
  fansi_1.0.6          farver_2.1.1         fastmap_1.1.1       
  fontawesome_0.5.2    forcats_1.0.0        foreach_1.5.2       
  foreign_0.8-86       Formula_1.2-5        fs_1.6.3            
  furrr_0.3.1          future_1.33.1        future.apply_1.11.1 
  gargle_1.5.2         generics_0.1.3       GGally_2.2.0        
  ggplot2_3.4.4        ggstats_0.5.1        glmnet_4.1-8        
  globals_0.16.2       glue_1.6.2           googledrive_2.1.1   
  googlesheets4_1.1.1  gower_1.0.1          graphics_4.3.2      
  grDevices_4.3.2      grid_4.3.2           gridExtra_2.3       
  gtable_0.3.4         hardhat_1.3.0        haven_2.5.4         
  highr_0.10           Hmisc_5.1-1          hms_1.1.3           
  htmlTable_2.4.2      htmltools_0.5.7      htmlwidgets_1.6.4   
  httr_1.4.7           ids_1.0.1            ipred_0.9-14        
  isoband_0.2.7        iterators_1.0.14     itertools_0.1.3     
  janitor_2.2.0        jomo_2.7-6           jquerylib_0.1.4     
  jsonlite_1.8.8       kableExtra_1.3.4     KernSmooth_2.23.22  
  knitr_1.45           labeling_0.4.3       labelled_2.12.0     
  laeken_0.5.2         lattice_0.22-5       lava_1.7.3          
  lifecycle_1.0.4      listenv_0.9.0        lme4_1.1-35.1       
  lmtest_0.9.40        lubridate_1.9.3      magrittr_2.0.3      
  MASS_7.3-60          Matrix_1.6-4         MatrixModels_0.5-3  
  memoise_2.0.1        methods_4.3.2        mgcv_1.9.1          
  mice_3.16.0          mime_0.12            minqa_1.2.6         
  missForest_1.5       mitml_0.4-5          ModelMetrics_1.2.2.2
  modelr_0.1.11        multcomp_1.4-25      munsell_0.5.0       
  mvtnorm_1.2-4        naniar_1.0.0         nlme_3.1-164        
  nloptr_2.0.3         nnet_7.3-19          norm_1.0.11.1       
  numDeriv_2016.8.1.1  openssl_2.1.1        ordinal_2023.12.4   
  pan_1.9              parallel_4.3.2       parallelly_1.36.0   
  patchwork_1.1.3      pbkrtest_0.5.2       pillar_1.9.0        
  pkgbuild_1.4.3       pkgconfig_2.0.3      pkgload_1.3.3       
  plyr_1.8.9           polspline_1.1.24     praise_1.0.0        
  prettyunits_1.2.0    pROC_1.18.5          processx_3.8.3      
  prodlim_2023.08.28   progress_1.2.3       progressr_0.14.0    
  proxy_0.4-27         ps_1.7.5             purrr_1.0.2         
  quantreg_5.97        R6_2.5.1             ragg_1.2.7          
  randomForest_4.7.1.1 ranger_0.16.0        rappdirs_0.3.3      
  RColorBrewer_1.1-3   Rcpp_1.0.11          RcppEigen_0.3.3.9.4 
  readr_2.1.4          readxl_1.4.3         recipes_1.0.9       
  rematch_2.0.0        rematch2_2.1.2       reprex_2.0.2        
  reshape2_1.4.4       rlang_1.1.2          rmarkdown_2.25      
  rms_6.7-1            rngtools_1.5.2       robustbase_0.99.1   
  rpart_4.1.23         rprojroot_2.0.4      rsample_1.2.0       
  rstudioapi_0.15.0    rvest_1.0.3          sandwich_3.1-0      
  sass_0.4.8           scales_1.3.0         selectr_0.4.2       
  shape_1.4.6          simputation_0.2.8    slider_0.3.1        
  snakecase_0.11.1     sp_2.1.2             SparseM_1.81        
  splines_4.3.2        SQUAREM_2021.1       stats_4.3.2         
  stats4_4.3.2         stringi_1.8.3        stringr_1.5.1       
  survival_3.5-7       svglite_2.1.3        sys_3.4.2           
  systemfonts_1.0.5    testthat_3.2.1       textshaping_0.3.7   
  TH.data_1.1-2        tibble_3.2.1         tidyr_1.3.0         
  tidyselect_1.2.0     tidyverse_2.0.0      timechange_0.2.0    
  timeDate_4032.109    tinytex_0.49         tools_4.3.2         
  tzdb_0.4.0           ucminf_1.2.1         UpSetR_1.4.0        
  utf8_1.2.4           utils_4.3.2          uuid_1.1.1          
  vcd_1.4.11           vctrs_0.6.5          VIM_6.2.2           
  viridis_0.6.4        viridisLite_0.4.2    visdat_0.6.0        
  vroom_1.6.5          waldo_0.5.2          warp_0.2.1          
  webshot_0.5.5        withr_2.5.2          xfun_0.41           
  xml2_1.3.6           yaml_2.3.7           yardstick_1.2.0     
  zoo_1.8-12

R Packages and Setup

1 Data Source

2 The Subjects

3 Loading and Tidying the Data

3.1 Loading the Raw Data

3.2 Cleaning the Data

3.2.1 Selecting Variables We’ll Use

3.2.2 Dealing with Missingness

3.2.3 Changing Variable Names

3.2.4 Sampling the Data

3.2.5 Working with Categorical Predictors

3.2.6 Working with Quantitative Variables

3.2.7 Renaming the factors

3.2.8 Splitting the Data

4 The Tidy Tibble

4.1 Listing the Tibble

4.2 Size and Identifiers

4.3 Save The Tibble

5 The Code Book

5.1 Defining the Variables

5.2 Numerical Description

6 Linear Regression Plans

6.1 My First Research Question

6.2 My Quantitative Outcome

6.3 My Planned Predictors (Linear Model)

7 Logistic Regression Plans

7.1 My Second Research Question

7.2 My Binary Outcome

7.3 My Planned Predictors (Logistic Model)

8 Linear Regression Analyses

8.1 Missingness

8.2 Outcome Transformation

8.3 Scatterplot Matrix and Collinearity

8.4 Model A

8.4.1 Fitting Model A

8.4.2 Tidied Coefficient Estimates (Model A)

8.4.3 Summarizing Fit (Model A)

8.4.4 Regression Diagnostics (Model A)

8.5 Non-Linearity

8.6 Model B

8.6.1 Fitting Model B

8.6.2 Tidied Coefficient Estimates (Model B)

8.6.3 Effects Plot for Model B

8.6.4 Summarizing Fit (Model B)

8.6.5 Regression Diagnostics (Model B)

8.7 Validating Models A and B

8.7.1 Holdout RMSE, \(R^2\) and MAE

8.7.2 Validated \(R^2\), MSE and IC statistics

8.7.3 ANOVA to compare Models A and B

8.8 Final Linear Regression Model

8.8.1 Winning Model’s Parameter Estimates

8.8.2 Model Coefficients

8.8.3 Effects Plot for Winning Model

8.8.4 Model Calibration

8.8.5 Numerical Description of Effect Sizes

8.8.6 Nomogram of Winning Model

8.8.7 Prediction for a New Subject

9 Logistic Regression Analyses

9.1 Missingness

9.2 Model Y

9.2.1 Fitting Model Y

9.2.2 Tidied Odds Ratio Estimates (Model Y)

9.2.3 Effects Plot (Model Y)

9.2.4 Summarizing Fit (Model Y)

9.2.5 Confusion Matrix (Model Y)

9.3 Non-Linearity

9.4 Model Z

9.4.1 Fitting Model Z

9.4.2 Tidied Odds Ratio Estimates (Model Z)

9.4.3 Effects Plot (Model Z)

9.4.4 Summarizing Fit (Model Z)

9.4.5 Confusion Matrix (Model Z)

9.5 Validating Models Y and Z

9.5.1 Validated \(R^2\) and \(C\) statistics for each model

9.6 Final Logistic Regression Model

9.6.1 Winning Model’s Parameter Estimates

9.6.2 Plot of Effect Sizes for Winning Model

9.6.3 Numerical Description of Effect Sizes

9.6.4 Plot of ROC Curve for Winning Model

9.6.5 Validated \(R^2\) and \(C\) statistic for Winning Model