Analysing the relationship between Blood Cholesterol levels, Physical Activity, Excessive Drinking and Depression.

Author

Naveen Kannan

Published

2022-12-14

1 Setup and Data Ingest

1.1 Loading Packages

Code
source("Love-boost.R")

library(knitr)
library(rmdformats)
library(nhanesA)
library(kableExtra)
library(naniar)
library(janitor)
library(glue)
library(rcompanion)
library(gt)
library(ggridges)
library(patchwork)
library(broom)
library(tidyverse)

## Global options
options(max.print="100")
opts_chunk$set(comment=NA)
opts_knit$set(width=75)
knitr::opts_chunk$set(comment = NA)

1.2 Ingesting Data

The nhanes package was used to read the data and store in in the form of an .Rds file for each of the data sets being used.

Code
## Demographic Data
#demo_raw <- nhanes('P_DEMO') |> tibble()
#saveRDS(demo_raw, "C:\\Users\\navee\\Desktop\\PQHS 431\\Project B\\data\\P_DEMO.Rds")
demo_raw <- readRDS("data/P_DEMO.Rds")

## Alcohol Data
#alc_raw <- nhanes('P_ALQ') |> tibble()
#saveRDS(alc_raw, "C:\\Users\\navee\\Desktop\\PQHS 431\\Project B\\data\\P_ALQ.Rds")
alc_raw <- readRDS("data/P_ALQ.Rds")

## Depression data
#depr_raw <- nhanes('P_DPQ') |> tibble()
#saveRDS(depr_raw, "C:\\Users\\navee\\Desktop\\PQHS 431\\Project B\\data\\P_DPQ.Rds")
depr_raw <- readRDS("data/P_DPQ.Rds")

## Physical activity data
#phys_raw <- nhanes('P_PAQ') |> tibble()
#saveRDS(phys_raw, "C:\\Users\\navee\\Desktop\\PQHS 431\\Project B\\data\\P_PAQ.Rds")
phys_raw <- readRDS("data/P_PAQ.Rds")

## Cholesterol data
#chol_raw <- nhanes('P_HDL') |> tibble()
#saveRDS(chol_raw, "C:\\Users\\navee\\Desktop\\PQHS 431\\Project B\\data\\P_HDL.Rds")
chol_raw <- readRDS("data/P_HDL.Rds")

The tibbles were then merged on the SEQN variable.

Code
## Merging the tibbles
NEW <- left_join(demo_raw, chol_raw, by = "SEQN")
NEW2 <- left_join(NEW, alc_raw, by = "SEQN")
NEW3 <- left_join(NEW2, depr_raw, by = "SEQN")
combined_tib <- left_join(NEW3, phys_raw, by = "SEQN")

2 Cleaning the Data

Missing values, values such as “Refused”, or “Don’t Know”, were filtered out, and participants from 21 to 79 years of age were selected.

Code
final_tib <- combined_tib|>
  filter(RIDSTATR == 2) |>
  filter(RIDAGEYR > 20 &  RIDAGEYR < 80) |>
  filter(PAD680 !=7777 & PAD680 != 9999 &
         PAD675 != 7777 & PAD675 != 9999 &
         PAQ650 != 7 & PAQ650 != 9 &
         ALQ151 != 7 & ALQ151 !=9 &
         DPQ020 != 7 & DPQ020 != 9 &
         DPQ050 != 7 & DPQ050 != 9 &
         LBDHDD != ".") |>
  filter(complete.cases(LBDHDD,PAD680,PAD675,PAQ650,ALQ151,DPQ020,DPQ050))|>
  mutate(PAQ650 = 2- PAQ650) |> 
  mutate(ALQ151 = 2- ALQ151) |>
  mutate(PAQ650 = as.factor(PAQ650)) |>
  mutate(ALQ151 = as.factor(ALQ151)) |>
  mutate(DPQ020 = as.factor(DPQ020)) |>
  mutate(DPQ050 = as.factor(DPQ050)) |>
  select(SEQN,LBDHDD,PAD680,PAD675,PAQ650,ALQ151,DPQ020,DPQ050) 
final_tib
# A tibble: 2,657 × 8
     SEQN LBDHDD PAD680 PAD675 PAQ650 ALQ151 DPQ020 DPQ050
    <dbl>  <dbl>  <dbl>  <dbl> <fct>  <fct>  <fct>  <fct> 
 1 109266     56    480     30 1      0      0      0     
 2 109273     42    180    120 0      0      2      2     
 3 109274     29    300     60 0      0      0      0     
 4 109290     40    180     90 0      0      1      0     
 5 109312     77    180    120 0      0      0      0     
 6 109313     54    300     60 1      0      0      0     
 7 109317     71    420     30 0      0      1      2     
 8 109319    110    600     60 1      0      0      0     
 9 109324     60    360     20 0      0      1      0     
10 109327     58    480     30 1      0      0      0     
# ℹ 2,647 more rows

3 Codebook and Data Description

2,657 adults ages 21-79 participating in NHANES 2017-18 with complete data on the variables listed in the table below.

Variable name Variable Type Description Analytic Role
SEQN - Respondent sequence number (Unique Identifier) Indexing
LBDHDD Continuous variable Direct HDL-Cholesterol (mg/dL) Outcome variable
PAD680 Continuous variable Time spent sitting on a typical day (Minutes) Quantitative predictor
PAD675 Continuous variable Time spent performing moderate-intensity sports, fitness or recreational activities on a typical day (Minutes) Quantitative Predictor
PAQ650 Binary categorical variable Whether or not the participant performs any vigorous-intensity sports, fitness, or recreational activities that cause large increases in breathing or heart rate like running or basketball for at least 10 minutes continuously, in a typical week Binary categorical predictor
ALQ151 Binary categorical variable Whether or not the participant had a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day Binary categorical predictor
DPQ020 Multi-categorical variable Over the last 2 weeks, how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless Multi-categorical predictor
DPQ050 Multi-categorical variable Over the last 2 weeks, how often has the participant been bothered by the following problems: poor appetite or overeating Multi-categorical predictor

The levels of the categorical variables under study and their descriptions are as follows:

Variable name Original Levels New levels Variable description
PAQ650 4 levels: 1 = Yes, 2 = No, 7 = Refused, 9 = Don’t Know 2 levels: 1 = Yes, 0 = No Whether or not the participant performs any vigorous-intensity sports, fitness, or recreational activities that cause large increases in breathing or heart rate like running or basketball for at least 10 minutes continuously, in a typical week
ALQ151 4 levels: 1 = Yes, 2 = No, 7 = Refused, 9 = Don’t Know 2 levels: 1 = Yes, 0 = No Whether or not the participant had a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day
DPQ020 6 levels: 0 = Not at all, 1 = Several days, 2 = More than half the days, 3 = Nearly every day, 7 = refused, 9 = Don’t know 4 levels : 0 = Not at all, 1 = Several days, 2 = More than half the days, 3 = Nearly every day Over the last 2 weeks, how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless
DPQ050 6 levels: 0 = Not at all, 1 = Several days, 2 = More than half the days, 3 = Nearly every day, 7 = refused, 9 = Don’t know 4 levels : 0 = Not at all, 1 = Several days, 2 = More than half the days, 3 = Nearly every day Over the last 2 weeks, how often has the participant been bothered by the following problems: poor appetite or overeating

4 Analysis B : Comparing 2 Means with Independent Samples

4.1 The Question

Among adults of ages 21-79 participating in NHANES 2017-18, is there a significant difference in the values of HDL cholesterol values of participants across their self-reported participation or lack thereof in vigorous weekly physical activity?

4.2 Describing the Data

The Outcome Variable

The outcome variable is LBDHDD, which is a continuous quantitative variable. It describes the Direct HDL-Cholesterol levels of the participants in mg/dL(milligrams of HDL-Cholesterol per deciliter of blood). The values range from 5 to 189 mg/dL, from the documentation of the variable available at https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_HDL.htm#LBDHDD.

The Binary Categorical Exposure Variable

The categorical variable is PAQ650, which is a Binary Categorical Variable. It describes whether or not the participant performs any vigorous-intensity sports, fitness, or recreational activities that cause large increases in breathing or heart rate like running or basketball for at least 10 minutes continuously, in a typical week. It has the following factors :

“1” - Yes “0” - No

The samples here are independent, as the levels of the exposure variable PAQ650 are mutually exclusive and exhaustive, with only 1 of two possible answers that can apply to each participant.

4.2.1 Numerical Summary of Direct HDL- Cholesterol across weekly vigorous activity groups

Code
mosaic::favstats(LBDHDD ~ PAQ650, data = final_tib) |>
  kable(digits = 2) |> 
  kable_classic_2(font_size = 28, full_width = F)
PAQ650 min Q1 median Q3 max mean sd n missing
0 23 42 51 62 138 53.42 15.57 1492 0
1 10 44 53 65 166 55.96 16.90 1165 0

The numerical summary indicates that there are differences in the distribution of the outcome variable across the vigorous activity groups. The range of HDL cholesterol goes to a higher level in the group that answered yes, and the mean and median HDL levels in the group are higher than that of the group that answered no, although it is yet to be seen if this difference is statistically meaningful. Furthermore, 1492 participants answered no, while 1165 participants answered yes, indicating that the study design is imbalanced.

Code
temp1 <- final_tib |> 
  mutate(PAQ650_f = fct_recode(factor(PAQ650),
                              "No vigorous weekly physical activity" = "0",
                              "Atleast once per week" = "1")) |> 
    select(PAQ650_f,LBDHDD)
ggplot(temp1, aes(x = LBDHDD, y = PAQ650_f, fill = PAQ650_f, height = after_stat(density))) +
    ggridges::geom_density_ridges(scale = 2) +
    scale_fill_brewer(palette = "Blues") +
    guides(fill = "none") +
    labs(title = str_wrap("Direct HDL-Cholesterol levels of participants split according to their self-reported participation in vigorous weekly physical activity", width = 70),
         x = "Direct HDL-Cholesterol (mg/dL)",
         y = "Self reported participation in weekly vigorous physical activity",
       subtitle = str_wrap(glue("Across ", nrow(temp1), 
                           " participants from NHANES 2017-18 data with complete data. "), width = 70),
       caption = "Source: NHANES 2017-18.") 
Picking joint bandwidth of 3.28

Visualizing the levels of direct HDL-cholesterol among study participants across weekly exercise activity shows that there is a difference in HDL-cholesterol levels, albeit slight. Individuals who participate in vigorous physical activity atleast once a week appear to have higher HDL-cholesterol levels than individuals who do not participate in weekly vigorous physical activity.

Code
p1 <- 
  ggplot(temp1, aes(x = LBDHDD)) +
  geom_histogram(bins = 20, 
  fill = "#2b8cbe", col = "white") +
  facet_grid(PAQ650_f ~ ., labeller = 'label_value') +
  labs( x = "Direct HDL-Cholesterol (mg/dL)", y = "")

p2 <- 
  ggplot(temp1, aes(sample = LBDHDD)) + 
  geom_qq(col = "#2b8cbe") + geom_qq_line(col = "red") +
  facet_grid(PAQ650_f ~ ., labeller = 'label_value') +
  labs(x = "", y = "Direct HDL-Cholesterol (mg/dL)")

p3 <- 
  ggplot(temp1, aes(x = "", y = LBDHDD)) +
  geom_violin(fill = "#2b8cbe", alpha = 0.3) + 
  geom_boxplot(fill = "#2b8cbe", width = 0.3,
               outlier.color = "red") +
  facet_grid(PAQ650_f ~ ., labeller = 'label_value') +
  labs(y = "Direct HDL-Cholesterol (mg/dL)", x = "") + coord_flip()

p2 + p1 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) + 
  plot_annotation(title = str_wrap("Distribution of Direct HDL-Cholesterol (mg/dL), split according to the self reported weekly vigorous physical activity of the participant", width = 100),
         subtitle = glue("Across ", nrow(temp1), 
                           " participants. Data taken from NHANES 2017-18"))

The distribution of Direct HDL-Cholesterol values across the two different groups of weekly vigorous activity demonstrates that the distribution has a significant right skew, from the shape of the Q-Q plots, the shape of the histogram, and the outliers detected in the boxplot and violinplot.

Code
final_tib |>
  summarise(mean(LBDHDD), median(LBDHDD), sd(LBDHDD),
              skew1 = (mean(LBDHDD) - median(LBDHDD))/sd(LBDHDD)) |>
    kbl(digits = 2) |>kable_classic_2(font_size = 28, full_width = F)
mean(LBDHDD) median(LBDHDD) sd(LBDHDD) skew1
54.53 52 16.21 0.16

The numerical value of the skew also indicates that the data is right-skewed. From the data visualization and the numerical summaries, the right-skewed, non-normal distribution of the outcome variable indicates that the bootstrap method is ideal in order to compare the difference in means across the two groups.

4.3 Main Analysis

Given the non-normality of the outcome variable’s distribution and the imbalance in the study design, the Bootstrap method would be the ideal choice to estimate the difference in means across the two groups. After setting a seed, the bootdif function from the love-boost script was used to perform a bootstrapped analysis of the difference in mean HDL levels between the weekly vigorous exercise groups.

Code
set.seed(4312022)

bootdif(final_tib$LBDHDD, final_tib$PAQ650, conf.level = 0.90)
Mean Difference            0.05            0.95 
       2.545718        1.491790        3.637311 

4.4 Conclusions

The 90% confidence interval for the difference in the means of Direct HDL-Cholesterol levels goes from 1.49 mg/dL to 3.64 mg/dL, which are a reasonable range of estimates for the true difference of the mean of Direct HDL-Cholesterol levels across the two weekly vigorous activity groups. The point estimate of the difference in means is 2.55 mg/dL. This confidence interval does not include zero.

From the fact that we are using a 90% confidence interval, we can be 90% sure that this method of creating a confidence interval will produce a result containing the true difference in means. (This means that if bootstrapping was done 100 times to create 100 different intervals, 90 of those intervals will have the true difference in mean Direct HDL-Cholesterol levels.)

There is a statistically detectable difference in Direct HDL-Cholesterol levels across the two groups of weekly vigorous physical activity.

The mean blood HDL cholesterol for individuals who reported weekly vigorous physical activity was 2.55 mg/dL higher than that of individuals who did not report weekly vigorous physical activity, with a 90% CI of (1.49, 3.64 mg/dL).

However, while there is a statistical difference, the mean difference varies from 1.49 to 3.64 mg/dL. This difference may not be medically significant.

5 Analysis C - Comparing 3+ Population Means via ANOVA

5.1 The Question

Among adults of ages 21-79 participating in NHANES 2017-18, are differences in severity of feeling down, depressed, or hopeless associated with differences in the number of minutes of daily sedentary activity among NHANES 2017-18 participants?

5.2 Describing the Data

The Outcome Variable

The outcome variable is PAD680, which is a continuous quantitative variable. It describes the participants time spent sitting on a typical day in Minutes.

The Multi-Categorical Exposure Variable

The categorical variable is DPQ020, which is a Multi-Categorical Variable. It describes how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless over the past 2 weeks. It has the following factors :

4 levels :

0 = Not at all

1 = Several days

2 = More than half the days

3 = Nearly every day

Code
mosaic::favstats(PAD680 ~ DPQ020, data = final_tib) |> kbl() |> 
  kable_classic_2(font_size = 28)
DPQ020 min Q1 median Q3 max mean sd n missing
0 10 180 300 480 1080 336.9990 197.0243 2042 0
1 2 180 300 480 960 331.5556 196.0190 450 0
2 20 180 240 360 900 294.8454 186.7324 97 0
3 15 120 300 420 960 301.4706 208.0203 68 0

The above table depicts the summary statistics of PAD680 values across the levels of DPQ020. From the summary statistics alone, there appears to be a differences in the mean PAD680 value across the different levels of DPQ020. However, numbers alone are not enough to make any conclusions, and the data needs to be visualized.

Using ggplot and ggridges, the distribution of PAD680 values across the levels of DPQ020 was visualized, and given a color scale to help the different categories stand out visually.

Code
temp2 <- final_tib |> 
  mutate(DPQ020_f = fct_recode(factor(DPQ020),
                              "Not at all" = "0",
                              "Several days" = "1",
                              "More than half the days" = "2",
                              "Nearly every day" = "3")) |> 
    select(DPQ020_f,PAD680)
ggplot(temp2, aes(x = PAD680, y = DPQ020_f, fill = DPQ020_f, height = after_stat(density))) +
    ggridges::geom_density_ridges(scale = 2) +
    scale_fill_brewer(palette = "RdYlGn", direction = -1) +
    guides(fill = "none") +
    labs(title = str_wrap("Minutes of sedentary activity of NHANES participants per day split according to how often they felt down, depressed or hopeless over the past 2 weeks.", width = 70),
         x = "Minutes of sedentary activity during the day (Minutes)",
         y = str_wrap("How often did the participant feel down, depressed or hopeless over the past 2 weeks?",width = 70),
       subtitle = str_wrap(glue("Across ", nrow(temp2), 
                           " participants from NHANES 2017-18 data with complete data. "), width = 70),
       caption = "Source: NHANES 2017-18.") 
Picking joint bandwidth of 54.9

The above graphic visualizes the distribution of DPQ020 (Minutes of sedentary activity of NHANES participants per day) according to the PAD680 (How often did the participant feel down, depressed or hopeless over the past 2 weeks?) category of the participant. From looking at the plot, it does appear that categories 2 (More than half the days) and 3 (Nearly every day) have a larger proportion of minutes of sedentary activity.

Code
ggplot(temp2, aes(x = PAD680, y = DPQ020_f, fill = DPQ020_f)) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.3) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  coord_flip() +
  guides(fill = "none") +
  theme_bw() +
  labs(title = str_wrap("Minutes of sedentary activity of NHANES participants per day split according to how often they felt down, depressed or hopeless over the past 2 weeks.", width = 70),
       y = str_wrap("How often did the participant feel down, depressed or hopeless over the past 2 weeks?",width = 70),
       x = "Minutes of sedentary activity per day",
       caption = "Data taken from NHANES 2017-18")

The above plot demonstrates the variance of PAD680 values across the 4 categories of DPQ020. PAD680 appears to have more or less equal variance across the 4 categories.

Code
p1 <- 
  ggplot(temp2, aes(x = PAD680)) +
  geom_histogram(bins = 20, 
  fill = "#2b8cbe", col = "white") +
  facet_grid(DPQ020_f ~ ., labeller = 'label_value') +
  labs( x = "Minutes of sedentary activity during the day (Minutes)", y = "")

p2 <- 
  ggplot(temp2, aes(sample = PAD680)) + 
  geom_qq(col = "#2b8cbe") + geom_qq_line(col = "red") +
  facet_grid(DPQ020_f ~ ., labeller = 'label_value') +
  labs(x = "", y = "Minutes of sedentary activity during the day (Minutes)")

p3 <- 
  ggplot(temp2, aes(x = "", y = PAD680)) +
  geom_violin(fill = "#2b8cbe", alpha = 0.3) + 
  geom_boxplot(fill = "#2b8cbe", width = 0.3,
               outlier.color = "red") +
  facet_grid(DPQ020_f ~ ., labeller = 'label_value') +
  labs(y = "Minutes of sedentary activity during the day (Minutes)", x = "") + coord_flip()

p2 + p1 - p3 +
  plot_layout(ncol = 1, height = c(3, 2)) + 
  plot_annotation(title = str_wrap("Distribution of Minutes of Sedentary Activity of NHANES participants per day", width = 90),
         subtitle = str_wrap(glue("Across ", nrow(temp2), 
                           " participants, split according to how often they felt down, depressed or hopeless over the past 2 weeks. Data taken from NHANES 2017-18."), width = 90))

These plots demonstrate the distribution of PAD680. The distribution is skewed to the right, and does not appear to have a normal distribution.

Code
final_tib |>
  summarise(mean(PAD680), median(PAD680), sd(PAD680),
              skew1 = (mean(PAD680) - median(PAD680))/sd(PAD680)) |>
    kbl(digits = 2) |>kable_classic_2(font_size = 28, full_width = F)
mean(PAD680) median(PAD680) sd(PAD680) skew1
333.63 300 196.9 0.17

The numerical summaries also demonstrate the PAD680 does not have a normal distribution; it is right-skewed. There is a marked difference between the mean and the median, and the numerical value of the skew also demonstrates a right skew.

5.3 Main Analysis

Given that PAD680 does not have a normal distribution, even though the distribution of PAD680 across DPQ020 has equal variance, the assumptions of an ANOVA test would be violated if it were to be performed on these two variables.

The three main assumptions of an ANOVA test are as follows.

  1. Normality

Each sample was drawn from a normally distributed population.

  1. Equal variance

The variances of the populations that the samples come from are equal.

  1. Independence

The observations in each group are independent of each other and the observations within groups were obtained by a random sample.

Thus, the Kruskal-Wallis rank sum test would be preferable to the ANOVA test to estimate if there is a significant difference in the mean value of PAD680 across the 4 categories of DPQ020, especially because of the significant right-skew of the data. The Kruskal-Wallis test is non-parametric, which means that it does not need any distributional assumption.

Code
kruskal.test(PAD680 ~ DPQ020, data = final_tib)

    Kruskal-Wallis rank sum test

data:  PAD680 by DPQ020
Kruskal-Wallis chi-squared = 6.7478, df = 3, p-value = 0.08038

The results of the Kruskal-Wallis test do not indicate a meaningful difference in the values of PAD680 across the levels of DPQ020, given that the p value is 0.080, with a significance value of 0.05.

5.4 Conclusions

From the results of the Kruskal-Wallis test, among the study population taken from NHANES 2017-18, there is no statistically detectable difference in the number of minutes of daily sedentary activity across the different levels of how often they felt down, depressed or hopeless over the past 2 weeks.

6 Analysis D - Analyzing a 2x2 table

6.1 The Question

Among adults of ages 21-79 participating in NHANES 2017-18, is there an association between a period of heavy drinking (described as having 4 or more alcoholic beverages daily) and a lack of weekly vigorous physical activity?

6.2 Describing the Data

The two binary categorical variables under study are:

The Outcome Variable

PAQ650 - It describes whether or not the participant performs any vigorous-intensity sports, fitness, or recreational activities that cause large increases in breathing or heart rate like running or basketball for at least 10 minutes continuously, in a typical week. It has the following levels:

1 = Yes

0 = No

The Exposure Variable

ALQ151 - It describes whether or not the participant had a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day. It has the following levels:

1 = Yes

0 = No

Using the mutate function, the variables ALQ151 and PAQ650 were given different factor names to aid in understanding their meaning.

Code
temp3<- final_tib |>
  mutate(ALQ151 = fct_recode(factor(ALQ151),
                                "Yes" = "1",
                                "No" = "0"),
         PAQ650 = fct_recode(factor(PAQ650),
                                "Yes" = "1",
                                "No" = "0")) |>
  select(PAQ650, ALQ151)

temp3 <- temp3 |>
  mutate(ALQ151 = fct_relevel(ALQ151, "Yes"),
         PAQ650 = fct_relevel(PAQ650, "No", "Yes"))

The tabyl function was used to create a table of the two variables.

Code
table3 <- temp3 |> tabyl(ALQ151, PAQ650)|> 
    adorn_totals(where = c("row")) |>
    adorn_percentages(denominator = "row") |>
    adorn_pct_formatting() |>
    adorn_ns(position = "front")

Using the gt package in conjunction with the glue package, a meaningful, visually appealing 2x2 table was created.

Code
gt_table3 <- gt(table3)
gt_table3 <- gt_table3 |>
  tab_header(
    title = md("**Exploring the relationship between heavy drinking and a lack of weekly vigorous physical activity.**"),
    subtitle = glue("Across ", nrow(temp3), 
                           " participants with complete data.")
  ) |>
  tab_spanner(
    label = "Does the participant perform any vigorous-intensity activities that cause large increases in breathing or heart rate for at least 10 minutes continuously, in a typical week?",
    columns = c(Yes, No)
  ) |>
  cols_label("ALQ151" = "Did the participant have a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day?") |>
  cols_label("No" = md("**No**")) |>
  cols_label("Yes" = md("**Yes**")) |>
  tab_source_note(
    source_note = md("**Source:** _National Health and Nutrition Examination Survey, 2017-2018_")
  )
gt_table3
Exploring the relationship between heavy drinking and a lack of weekly vigorous physical activity.
Across 2657 participants with complete data.
Did the participant have a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day? Does the participant perform any vigorous-intensity activities that cause large increases in breathing or heart rate for at least 10 minutes continuously, in a typical week?
Yes No
Yes 131 (37.9%) 215 (62.1%)
No 1,034 (44.7%) 1,277 (55.3%)
Total 1,165 (43.8%) 1,492 (56.2%)
Source: National Health and Nutrition Examination Survey, 2017-2018

6.3 Main Analysis

Using the twobytwo function from the love-boost script, an analysis of the two-by-two table was conducted.

Code
twobytwo(215+2, 131+2, 1277+2, 1034+2, "Drank 4 or more alcoholic drinks daily","Did not have such a period","No weekly physical activity","Had weekly physical activity", conf.level = 0.90)
2 by 2 table analysis: 
------------------------------------------------------ 
Outcome   : No weekly physical activity 
Comparing : Drank 4 or more alcoholic drinks daily vs. Did not have such a period 

                                       No weekly physical activity
Drank 4 or more alcoholic drinks daily                         217
Did not have such a period                                    1279
                                       Had weekly physical activity
Drank 4 or more alcoholic drinks daily                          133
Did not have such a period                                     1036
                                          P(No weekly physical activity)
Drank 4 or more alcoholic drinks daily                            0.6200
Did not have such a period                                        0.5525
                                       90% conf. interval
Drank 4 or more alcoholic drinks daily    0.5765   0.6617
Did not have such a period                0.5354   0.5694

                                   90% conf. interval
             Relative Risk: 1.1222    1.0407   1.2101
         Sample Odds Ratio: 1.3216    1.0888   1.6041
Conditional MLE Odds Ratio: 1.3215    1.0822   1.6163
    Probability difference: 0.0675    0.0209   0.1125

             Exact P-value: 0.0179 
        Asymptotic P-value: 0.0179 
------------------------------------------------------

6.4 Conclusions

6.4.1 The Relative Risk

Individuals who have had a period of their life where they drank 4 or more alcoholic drinks a day have a 1.12 times higher risk(90% CI-1.04 to 1.21) of having no weekly vigorous physical activity when compared to individuals who did not have such a period.

6.4.2 The Odds Ratio

Individuals who do not report weekly vigorous physical activity have 1.32 times higher odds(90% CI-1.09 to 1.60) of having had a period in their lives where they drank 4 or more alcoholic drinks a day when compared to individuals who did report weekly vigorous physical activity.

6.4.3 The Risk Difference

The difference between in the probability of having no weekly vigorous physical activity for individuals who have had a period of their life where they drank 4 or more alcoholic drinks versus individuals who did not have such a period is 0.068 (90% CI - 0.021 to 0.113).

7 Analysis E - Two-Way (4 x 2) Contingency Table

7.1 The Question

Among adults of ages 21-79 participating in NHANES 2017-18, is there an association between having had a period of heavy drinking (described as having 4 or more alcoholic beverages daily) and symptoms of depression (how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless)?

7.2 Describing the Data

Using the mutate function, the variables under study and their factors were re-leveled and renamed in order to create context and meaning for the data.

The tabyl function was then used to create a dataset containing the different factors of DPQ020 as the rows, and the two factors of ALQ151 as the columns.

Code
temp4 <- final_tib |>
  select(DPQ020, ALQ151) |>
  mutate(DPQ020 = fct_recode(factor(DPQ020),
                               "Not at all" = "0",
                               "Several Days" = "1",
                               "More than half the days" = "2",
                               "Nearly every day" = "3"),
         ALQ151 = fct_recode(factor(ALQ151),
                               "No" = "0",
                               "Yes" = "1"))
table4 <- temp4 |>
  tabyl(DPQ020, ALQ151)|>
  adorn_totals(where = c("row")) |>
    adorn_percentages(denominator = "row") |>
    adorn_pct_formatting() |>
    adorn_ns(position = "front")

The gt package was used to create a meaningful, visually appealing table for the data present in table4, in conjunction with the glue package.

Code
gt_table4 <- gt(table4)
gt_table4 <- gt_table4 |>
  tab_header(
    title = md("**Exploring the relationship between depression and excessive drinking.**"),
    subtitle = glue("Across ", nrow(temp4), 
                           " participants with complete data.")
  )|>
  tab_spanner(
    label = "Did the participant have a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day?",
    columns = c(Yes, No)
  ) |>
  cols_label("DPQ020" = "Over the last 2 weeks, how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless") |>
  cols_label("Yes" = md("**Yes**")) |>
  cols_label("No" = md("**No**")) |>
  tab_source_note(
    source_note = md("**Source:** _National Health and Nutrition Examination Survey, 2017-2018_")
  )
gt_table4
Exploring the relationship between depression and excessive drinking.
Across 2657 participants with complete data.
Over the last 2 weeks, how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless Did the participant have a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day?
Yes No
Not at all 238 (11.7%) 1,804 (88.3%)
Several Days 67 (14.9%) 383 (85.1%)
More than half the days 16 (16.5%) 81 (83.5%)
Nearly every day 25 (36.8%) 43 (63.2%)
Total 346 (13.0%) 2,311 (87.0%)
Source: National Health and Nutrition Examination Survey, 2017-2018

7.3 Main Analysis

The Pearson Chi-Square test of Independence was conducted to assess if there is an association between the rows and the columns of the contingency table.

Code
table4a <- table(temp4$DPQ020, temp4$ALQ151)
chisq.test(table4a)

    Pearson's Chi-squared test

data:  table4a
X-squared = 39.629, df = 3, p-value = 1.277e-08

From the p-value of the above Chi-squared test, we can assume that there is a definite association between the rows and columns of the contingency table.

Cramer’s V

From the Chi-squared test, we know that there is an association between the rows and the columns of the contingency table. However, to identify the strength of the association, the Cramer’s V test, which is a test of statistical strength, can be conducted.

Code
cramerV(table4a, ci = TRUE) 
  Cramer.V lower.ci upper.ci
1   0.1221  0.07406   0.1801

The point estimate of the Cramer’s V value is 0.122, with 95% confidence intervals ranging from 0.074 to 0.180.

From the Cramer’s V test, given that there are 3 degrees of freedom, the association between depression and excessive drinking is not particularly strong, although it is definitely present.

7.3.1 Chi-Square Test Assumptions

  • The levels of the variables are mutually exclusive.

  • Each subject may contribute data to one and only one cell in the χ2.

  • The study groups must be independent.

In addition to these, the Cochran’s assumptions must also be met, which are:

  • No cells with 0 counts

  • At least 80% of the cells in our table have counts of 5 or higher

  • The Expected counts in each cell of the table should be 5 or more

7.4 Conclusions

11.7% of participants who reported never feeling down, depressed, or hopeless over the past 2 weeks had a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day.

The proportion of people who have a history of excessive drinking increases with the severity of the self reported symptoms of depression.

36.8% of participants who felt down, depressed or hopeless nearly every day had a period in their life where they drank 4 or more drinks of any alcoholic beverage almost every day.

Among adults of ages 21-79 participating in NHANES 2017-18, individuals who have had a period of heavy drinking (described as having 4 or more alcoholic beverages daily) have a greater degree of association with symptoms of depression (how often has the participant been bothered by the following problems: feeling down, depressed, or hopeless) compared to individuals who have not had a period of heavy drinking. Stronger feelings of depression are associated more strongly with a history of having had a period of excessive drinking.

8 Session Information

Code
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2024-01-08
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 backports      1.4.1      2021-12-13 [1] CRAN (R 4.3.1)
 base64enc      0.1-3      2015-07-28 [1] CRAN (R 4.3.1)
 bookdown       0.37       2023-12-01 [1] CRAN (R 4.3.2)
 boot           1.3-28.1   2022-11-22 [1] CRAN (R 4.3.2)
 broom        * 1.0.5      2023-06-09 [1] CRAN (R 4.3.2)
 cellranger     1.1.0      2016-07-27 [1] CRAN (R 4.3.2)
 checkmate      2.3.1      2023-12-04 [1] CRAN (R 4.3.2)
 class          7.3-22     2023-05-03 [2] CRAN (R 4.3.2)
 cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.2)
 cluster        2.1.6      2023-12-01 [1] CRAN (R 4.3.2)
 cmprsk         2.2-11     2022-01-06 [1] CRAN (R 4.3.2)
 codetools      0.2-19     2023-02-01 [2] CRAN (R 4.3.2)
 coin           1.4-3      2023-09-27 [1] CRAN (R 4.3.2)
 colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.2)
 commonmark     1.9.0      2023-03-17 [1] CRAN (R 4.3.2)
 data.table     1.14.10    2023-12-08 [1] CRAN (R 4.3.2)
 DescTools      0.99.52    2023-12-01 [1] CRAN (R 4.3.2)
 digest         0.6.33     2023-07-07 [1] CRAN (R 4.3.2)
 dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
 e1071          1.7-14     2023-12-06 [1] CRAN (R 4.3.2)
 Epi            2.47.1     2023-04-25 [1] CRAN (R 4.3.2)
 etm            1.1.1      2020-09-08 [1] CRAN (R 4.3.2)
 evaluate       0.23       2023-11-01 [1] CRAN (R 4.3.2)
 Exact          3.2        2022-09-25 [1] CRAN (R 4.3.1)
 expm           0.999-8    2023-11-29 [1] CRAN (R 4.3.2)
 fansi          1.0.6      2023-12-08 [1] CRAN (R 4.3.2)
 farver         2.1.1      2022-07-06 [1] CRAN (R 4.3.2)
 fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.2)
 forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.3.2)
 foreign        0.8-86     2023-11-28 [1] CRAN (R 4.3.2)
 Formula        1.2-5      2023-02-24 [1] CRAN (R 4.3.1)
 generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.2)
 ggformula    * 0.12.0     2023-11-09 [1] CRAN (R 4.3.2)
 ggplot2      * 3.4.4      2023-10-12 [1] CRAN (R 4.3.2)
 ggridges     * 0.5.5      2023-12-15 [1] CRAN (R 4.3.2)
 gld            2.6.6      2022-10-23 [1] CRAN (R 4.3.2)
 glue         * 1.6.2      2022-02-24 [1] CRAN (R 4.3.2)
 gridExtra      2.3        2017-09-09 [1] CRAN (R 4.3.2)
 gt           * 0.10.0     2023-10-07 [1] CRAN (R 4.3.2)
 gtable         0.3.4      2023-08-21 [1] CRAN (R 4.3.2)
 haven          2.5.4      2023-11-30 [1] CRAN (R 4.3.2)
 highr          0.10       2022-12-22 [1] CRAN (R 4.3.2)
 Hmisc          5.1-1      2023-09-12 [1] CRAN (R 4.3.2)
 hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.2)
 htmlTable      2.4.2      2023-10-29 [1] CRAN (R 4.3.2)
 htmltools      0.5.7      2023-11-03 [1] CRAN (R 4.3.2)
 htmlwidgets    1.6.4      2023-12-06 [1] CRAN (R 4.3.2)
 httr           1.4.7      2023-08-15 [1] CRAN (R 4.3.2)
 janitor      * 2.2.0      2023-02-02 [1] CRAN (R 4.3.2)
 jsonlite       1.8.8      2023-12-04 [1] CRAN (R 4.3.2)
 kableExtra   * 1.3.4      2021-02-20 [1] CRAN (R 4.3.2)
 knitr        * 1.45       2023-10-30 [1] CRAN (R 4.3.2)
 labeling       0.4.3      2023-08-29 [1] CRAN (R 4.3.1)
 labelled       2.12.0     2023-06-21 [1] CRAN (R 4.3.2)
 lattice      * 0.22-5     2023-10-24 [1] CRAN (R 4.3.2)
 libcoin        1.0-10     2023-09-27 [1] CRAN (R 4.3.2)
 lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.2)
 lmom           3.0        2023-08-29 [1] CRAN (R 4.3.1)
 lmtest         0.9-40     2022-03-21 [1] CRAN (R 4.3.2)
 lubridate    * 1.9.3      2023-09-27 [1] CRAN (R 4.3.2)
 magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.2)
 markdown       1.12       2023-12-06 [1] CRAN (R 4.3.2)
 MASS           7.3-60     2023-05-04 [1] CRAN (R 4.3.2)
 Matrix       * 1.6-4      2023-11-30 [1] CRAN (R 4.3.2)
 matrixStats    1.2.0      2023-12-11 [1] CRAN (R 4.3.2)
 mgcv           1.9-1      2023-12-21 [1] CRAN (R 4.3.2)
 modeltools     0.2-23     2020-03-05 [1] CRAN (R 4.3.1)
 mosaic         1.9.0      2023-11-10 [1] CRAN (R 4.3.2)
 mosaicCore     0.9.4.0    2023-11-05 [1] CRAN (R 4.3.2)
 mosaicData   * 0.20.4     2023-11-05 [1] CRAN (R 4.3.2)
 multcomp       1.4-25     2023-06-20 [1] CRAN (R 4.3.2)
 multcompView   0.1-9      2023-04-09 [1] CRAN (R 4.3.2)
 munsell        0.5.0      2018-06-12 [1] CRAN (R 4.3.2)
 mvtnorm        1.2-4      2023-11-27 [1] CRAN (R 4.3.2)
 naniar       * 1.0.0      2023-02-02 [1] CRAN (R 4.3.2)
 nhanesA      * 0.7.4      2023-07-16 [1] CRAN (R 4.3.2)
 nlme           3.1-164    2023-11-27 [1] CRAN (R 4.3.2)
 nnet           7.3-19     2023-05-03 [1] CRAN (R 4.3.2)
 nortest        1.0-4      2015-07-30 [1] CRAN (R 4.3.1)
 numDeriv       2016.8-1.1 2019-06-06 [1] CRAN (R 4.3.1)
 patchwork    * 1.1.3      2023-08-14 [1] CRAN (R 4.3.2)
 pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.2)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.2)
 plyr           1.8.9      2023-10-02 [1] CRAN (R 4.3.2)
 proxy          0.4-27     2022-06-09 [1] CRAN (R 4.3.2)
 purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.2)
 R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.2)
 RColorBrewer   1.1-3      2022-04-03 [1] CRAN (R 4.3.1)
 rcompanion   * 2.4.34     2023-09-17 [1] CRAN (R 4.3.2)
 Rcpp           1.0.11     2023-07-06 [1] CRAN (R 4.3.2)
 readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.3.2)
 readxl         1.4.3      2023-07-06 [1] CRAN (R 4.3.2)
 rlang          1.1.2      2023-11-04 [1] CRAN (R 4.3.2)
 rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.2)
 rmdformats   * 1.0.4      2022-05-17 [1] CRAN (R 4.3.2)
 rootSolve      1.8.2.4    2023-09-21 [1] CRAN (R 4.3.1)
 rpart          4.1.23     2023-12-05 [1] CRAN (R 4.3.2)
 rstudioapi     0.15.0     2023-07-07 [1] CRAN (R 4.3.2)
 rvest          1.0.3      2022-08-19 [1] CRAN (R 4.3.2)
 sandwich       3.1-0      2023-12-11 [1] CRAN (R 4.3.2)
 sass           0.4.8      2023-12-06 [1] CRAN (R 4.3.2)
 scales         1.3.0      2023-11-28 [1] CRAN (R 4.3.2)
 sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.2)
 snakecase      0.11.1     2023-08-27 [1] CRAN (R 4.3.2)
 stringi        1.8.3      2023-12-11 [1] CRAN (R 4.3.2)
 stringr      * 1.5.1      2023-11-14 [1] CRAN (R 4.3.2)
 survival       3.5-7      2023-08-14 [1] CRAN (R 4.3.2)
 svglite        2.1.3      2023-12-08 [1] CRAN (R 4.3.2)
 systemfonts    1.0.5      2023-10-09 [1] CRAN (R 4.3.2)
 TH.data        1.1-2      2023-04-17 [1] CRAN (R 4.3.2)
 tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.2)
 tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.3.2)
 tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.2)
 tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.3.2)
 timechange     0.2.0      2023-01-11 [1] CRAN (R 4.3.2)
 tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.3.2)
 utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.2)
 vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
 viridisLite    0.4.2      2023-05-02 [1] CRAN (R 4.3.2)
 visdat         0.6.0      2023-02-02 [1] CRAN (R 4.3.2)
 webshot        0.5.5      2023-06-26 [1] CRAN (R 4.3.2)
 withr          2.5.2      2023-10-30 [1] CRAN (R 4.3.2)
 xfun           0.41       2023-11-01 [1] CRAN (R 4.3.2)
 xml2           1.3.6      2023-12-04 [1] CRAN (R 4.3.2)
 yaml           2.3.7      2023-01-23 [1] CRAN (R 4.3.2)
 zoo            1.8-12     2023-04-13 [1] CRAN (R 4.3.2)

 [1] C:/Users/navee/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.2/library

──────────────────────────────────────────────────────────────────────────────