In this exercise, we will attempt to provide an answer to the question, “How do light users of alcohol differ from those who are heavy users of alcohol in terms of their self-reported overall health?”
To address this question, we will use data from the 2014 National Survey on Drug Use and Health, which is conducted annually by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS). More information about this survey can be found at ICPSR’s data archive page. The data has been prepared as an R data (.rda) file, so we can simply load it into the workspace with the load() command. At this time, we’ll also load our packages:
library(tidyverse)
library(forcats)
load("36361-0001-Data.rda")
Because the data object has an unpleasant name, da36361.0001, we make a copy with the name full_data and remove the original data object.
full_data <- da36361.0001
remove(da36361.0001)
#current_use_criteria <- !0
#if(full_data$ALCREC <= 30){
# current_use
#}
# never_use_criteria <- 0
We have defined light users as those who drank alcohol on 10 or fewer days within the past month.
light_use_criteria <- 10
We have defined heavy users as those who drank alcohol on more than 20 days within the past month.
heavy_use_criteria <- 20
We are looking for the self-reported overall health of people who are categorized as “heavy,” “light,” “moderate,” or “never” users. Below we have manipulated the data to classify people under these definitions using the variable alc_user_status.
# Step 1: label never users and former users; everyone else is a current user
full_data <- mutate(full_data, alc_user_status =
  ifelse(ALCFLAG == "(0) Never used (IRALCRC = 9)", "Never User",
    ifelse(ALCMON == "(0) Did not use in the past month (IRALCRC = 2-3,0)",
      "Former User", "Current User")))
# Step 2: bin days of use in the past month into (0,10], (10,20], (20,30]
full_data <- mutate(full_data, alc_30day_use = cut(ALCDAYS, c(0, 10, 20, 30),
  c("Light User", "Moderate User", "Heavy User")))
# Step 3: replace "Current User" with the light/moderate/heavy label
full_data <- mutate(full_data, alc_user_status = ifelse(alc_user_status == "Current User",
  as.character(alc_30day_use), alc_user_status))
# Step 4: convert to a factor and put the levels in a meaningful order
full_data <- mutate(full_data, alc_user_status = as.factor(alc_user_status))
full_data <- mutate(full_data, alc_user_status = fct_relevel(alc_user_status,
  "Never User",
  "Light User",
  "Moderate User",
  "Heavy User"))
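The binning step above relies on how cut() treats interval endpoints. The following is a minimal sketch on made-up values (the vector days is hypothetical, not survey data) that uses the same breaks and labels. By default cut() builds right-closed intervals, so 10 falls in the light group and 20 in the moderate group, while 0 and missing values are left as NA.

```r
# Toy values, NOT survey data: days of alcohol use in the past 30 days
days <- c(1, 10, 11, 20, 21, 30, 0, NA)

# Same breaks and labels as in the classification code;
# the resulting intervals are (0,10], (10,20], (20,30]
bins <- cut(days, breaks = c(0, 10, 20, 30),
            labels = c("Light User", "Moderate User", "Heavy User"))

# Show each value next to the group it was assigned to
print(data.frame(days, bins))
```

This is why anyone without a usable ALCDAYS value ends up with a missing alc_30day_use rather than one of the three labels.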
Using these definitions, we can now create two data subgroups: heavy users and light users. Now, we wish to compare the relative health of people who we have defined as heavy users versus those we have defined as light users.
heavy_users <- filter(full_data, ALCDAYS > heavy_use_criteria)
light_users <- filter(full_data, ALCDAYS <= light_use_criteria) # "10 or fewer days" includes 10
Next, we drop the unneeded variables from our light_users and heavy_users datasets. This step is not necessary, but clearing out unneeded objects and variables can improve the speed with which your code is executed. We keep the full_data object because it is used again below, so the remove() call is commented out.
# remove(full_data)
heavy_users <- select(heavy_users, HEALTH)
light_users <- select(light_users, HEALTH)
It may be useful to have the data about both groups (heavy and light users) in a single data object. To accomplish this, we add a new variable, alc_user_status, to both our data objects. For all observations in heavy_users we set the value of this variable to “heavy.” For all observations in light_users we set the value of this variable to “light.” Once we have added this variable to both datasets, we combine them using the rbind() function. rbind() stands for row bind: it binds or glues two datasets together by putting one dataset on top of the other.
heavy_users$alc_user_status <- "heavy"
light_users$alc_user_status <- "light"
combined_data <- rbind(heavy_users, light_users)
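As a small illustration of what rbind() does, here is a sketch on two made-up data frames (heavy_toy and light_toy are hypothetical stand-ins, not the survey subsets). Both have the same columns, so the rows of the second are simply stacked under the rows of the first.

```r
# Toy stand-ins for the two subsets; the values are invented for illustration
heavy_toy <- data.frame(HEALTH = c("(1) Excellent", "(3) Good"),
                        alc_user_status = "heavy")
light_toy <- data.frame(HEALTH = "(2) Very good",
                        alc_user_status = "light")

# Stack the two data frames: heavy rows first, then light rows
combined_toy <- rbind(heavy_toy, light_toy)

print(combined_toy)
print(nrow(combined_toy)) # 2 heavy rows + 1 light row = 3
```

Note that rbind() on data frames expects both inputs to have the same column names, which is why the new variable is added to both datasets before combining.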
We start our graphical analysis by making some basic plots showing the self-reported overall health of regular users.
user_counts <- count(full_data, alc_user_status)
health_counts <- count(full_data, alc_user_status, HEALTH)
print(health_counts, n = nrow(health_counts)) # n = nrow(...) prints every row, not just the first 10
## # A tibble: 28 x 3
## alc_user_status HEALTH n
## <fct> <fct> <int>
## 1 Never User (1) Excellent 4655
## 2 Never User (2) Very good 5658
## 3 Never User (3) Good 3136
## 4 Never User (4) Fair 942
## 5 Never User (5) Poor 176
## 6 Never User <NA> 10
## 7 Light User (1) Excellent 4919
## 8 Light User (2) Very good 7970
## 9 Light User (3) Good 4933
## 10 Light User (4) Fair 1379
## 11 Light User (5) Poor 215
## 12 Light User <NA> 6
## 13 Moderate User (1) Excellent 1054
## 14 Moderate User (2) Very good 1803
## 15 Moderate User (3) Good 988
## 16 Moderate User (4) Fair 229
## 17 Moderate User (5) Poor 37
## 18 Heavy User (1) Excellent 488
## 19 Heavy User (2) Very good 850
## 20 Heavy User (3) Good 557
## 21 Heavy User (4) Fair 140
## 22 Heavy User (5) Poor 37
## 23 <NA> (1) Excellent 3197
## 24 <NA> (2) Very good 5348
## 25 <NA> (3) Good 4388
## 26 <NA> (4) Fair 1736
## 27 <NA> (5) Poor 416
## 28 <NA> <NA> 4
print(user_counts)
## # A tibble: 5 x 2
## alc_user_status n
## <fct> <int>
## 1 Never User 14577
## 2 Light User 19422
## 3 Moderate User 4111
## 4 Heavy User 2072
## 5 <NA> 15089
Of 14577 “never users,” 4655 (or about 32%) reported “excellent” health, which agrees with what we see in our graph. [a few more lines go here]
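Percentages like the one above can be computed directly from the counts rather than by hand. The sketch below is one possible approach, not necessarily the one used for the graph: it copies the “Never User” and “Heavy User” rows from the health_counts output into a small data frame (the names counts, group_total and percent are illustrative) and uses base R's ave() to divide each count by its group total.

```r
# Counts copied from the health_counts output for two of the groups
counts <- data.frame(
  alc_user_status = rep(c("Never User", "Heavy User"), times = c(6, 5)),
  HEALTH = c("(1) Excellent", "(2) Very good", "(3) Good",
             "(4) Fair", "(5) Poor", NA,
             "(1) Excellent", "(2) Very good", "(3) Good",
             "(4) Fair", "(5) Poor"),
  n = c(4655, 5658, 3136, 942, 176, 10,
        488, 850, 557, 140, 37)
)

# Total number of respondents in each user group, repeated on every row
counts$group_total <- ave(counts$n, counts$alc_user_status, FUN = sum)

# Share of each health rating within its user group, in percent
counts$percent <- round(100 * counts$n / counts$group_total, 1)

# Compare the share reporting "excellent" health across the two groups
print(counts[which(counts$HEALTH == "(1) Excellent"), ])
```

The group totals match the user_counts output (14577 never users and 2072 heavy users), and the never-user share works out to 4655 / 14577, or about 32%.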
What we see in the graph