EDA for Diabetes Prediction

Author

Koji Takagi

1. Introduction

This project uses data from the 2015 Behavioral Risk Factor Surveillance System (BRFSS) to explore potential predictors of diabetes. The binary outcome variable is diabetes_binary, which indicates whether a respondent reports diabetes (in this version of the dataset, the positive class also covers prediabetes). Our goal is to identify promising predictors that can be used to model the likelihood of diabetes.

We focus our exploratory analysis on the following three predictors:

  • High Blood Pressure (high_bp): Prior studies have demonstrated a strong correlation between hypertension and diabetes.
  • Smoking (smoker): Smoking has been associated with increased insulin resistance and risk of type 2 diabetes.
  • Physical Activity (phys_activity): A lack of physical activity is a known risk factor for obesity and diabetes.

These variables were selected on the basis of domain knowledge and their relevance to lifestyle-related interventions. Other potential predictors, such as BMI and cholesterol, were considered but left out to keep the scope of this analysis limited.

2. Data Preparation

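library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(janitor)    # clean_names()
library(skimr)      # skim()

df <- read_csv("diabetes_binary_health_indicators_BRFSS2015.csv") %>%
  clean_names()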

# Convert variables to appropriate types
df <- df %>%
  mutate(
    diabetes_binary = factor(diabetes_binary, levels = c(0, 1), labels = c("No", "Yes")),
    high_bp = factor(high_bp, levels = c(0, 1), labels = c("No", "Yes")),
    smoker = factor(smoker, levels = c(0, 1), labels = c("No", "Yes")),
    phys_activity = factor(phys_activity, levels = c(0, 1), labels = c("No", "Yes"))
  )

# Check for missingness and basic summary
skim(df)
Data summary

Name                 df
Number of rows       253680
Number of columns    22

Column type frequency:
  factor             4
  numeric            18

Group variables      None

Variable type: factor

skim_variable    n_missing  complete_rate  ordered  n_unique  top_counts
diabetes_binary          0              1  FALSE           2  No: 218334, Yes: 35346
high_bp                  0              1  FALSE           2  No: 144851, Yes: 108829
smoker                   0              1  FALSE           2  No: 141257, Yes: 112423
phys_activity            0              1  FALSE           2  Yes: 191920, No: 61760

Variable type: numeric

skim_variable           n_missing  complete_rate   mean    sd  p0  p25  p50  p75  p100  hist
high_chol                       0              1   0.42  0.49   0    0    0    1     1  ▇▁▁▁▆
chol_check                      0              1   0.96  0.19   0    1    1    1     1  ▁▁▁▁▇
bmi                             0              1  28.38  6.61  12   24   27   31    98  ▇▅▁▁▁
stroke                          0              1   0.04  0.20   0    0    0    0     1  ▇▁▁▁▁
heart_diseaseor_attack          0              1   0.09  0.29   0    0    0    0     1  ▇▁▁▁▁
fruits                          0              1   0.63  0.48   0    0    1    1     1  ▅▁▁▁▇
veggies                         0              1   0.81  0.39   0    1    1    1     1  ▂▁▁▁▇
hvy_alcohol_consump             0              1   0.06  0.23   0    0    0    0     1  ▇▁▁▁▁
any_healthcare                  0              1   0.95  0.22   0    1    1    1     1  ▁▁▁▁▇
no_docbc_cost                   0              1   0.08  0.28   0    0    0    0     1  ▇▁▁▁▁
gen_hlth                        0              1   2.51  1.07   1    2    2    3     5  ▅▇▇▃▁
ment_hlth                       0              1   3.18  7.41   0    0    0    2    30  ▇▁▁▁▁
phys_hlth                       0              1   4.24  8.72   0    0    0    3    30  ▇▁▁▁▁
diff_walk                       0              1   0.17  0.37   0    0    0    0     1  ▇▁▁▁▂
sex                             0              1   0.44  0.50   0    0    0    1     1  ▇▁▁▁▆
age                             0              1   8.03  3.05   1    6    8   10    13  ▂▃▇▇▆
education                       0              1   5.05  0.99   1    4    5    6     6  ▁▁▅▅▇
income                          0              1   6.05  2.07   1    5    7    8     8  ▁▁▃▂▇

anyNA(df)
[1] FALSE

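The skim() output above shows summary statistics for each variable. Every variable has n_missing = 0 and a complete_rate of 1, and anyNA(df) returns FALSE, so the dataset contains no missing values. The outcome is imbalanced: 35,346 of 253,680 respondents (about 13.9%) fall in the Yes class of diabetes_binary.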
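
Because factor() with explicit levels turns any value outside those levels into NA, it can be worth counting missing values per column after recoding rather than relying on a single anyNA() call. The following is a minimal sketch on a small made-up data frame; the name toy and its values are illustrative assumptions, not BRFSS data.

```r
# Toy data for illustration only; the values are not taken from the BRFSS file.
toy <- data.frame(
  high_bp = c(0, 1, 2, 1),    # 2 is a deliberately invalid code
  bmi     = c(27, NA, 31, 24)
)

# Recode with explicit levels: the unexpected code 2 becomes NA
toy$high_bp <- factor(toy$high_bp, levels = c(0, 1), labels = c("No", "Yes"))

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Overall check, in the same style as anyNA(df)
print(anyNA(toy))
```

Applied to the real data, colSums(is.na(df)) would be expected to return zero for every column, in line with the n_missing column reported by skim().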

3. Univariate Exploration

df %>%
  count(diabetes_binary) %>%
  ggplot(aes(x = diabetes_binary, y = n, fill = diabetes_binary)) +
  geom_col() +
  labs(title = "Distribution of Diabetes Cases", y = "Count") +
  theme_minimal()

df %>%
  count(high_bp) %>%
  ggplot(aes(x = high_bp, y = n, fill = high_bp)) +
  geom_col() +
  labs(title = "Distribution of High Blood Pressure", y = "Count") +
  theme_minimal()

df %>%
  count(smoker) %>%
  ggplot(aes(x = smoker, y = n, fill = smoker)) +
  geom_col() +
  labs(title = "Distribution of Smoking Status", y = "Count") +
  theme_minimal()

df %>%
  count(phys_activity) %>%
  ggplot(aes(x = phys_activity, y = n, fill = phys_activity)) +
  geom_col() +
  labs(title = "Distribution of Physical Activity", y = "Count") +
  theme_minimal()
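
The bar charts above show raw counts. As a rough complement, the same information can be expressed as proportions. The sketch below does not read the CSV; it reuses the counts printed in the top_counts column of the skim() output, and the data frame name counts is an assumption made for this example.

```r
# Counts copied from the top_counts column of the skim() output
counts <- data.frame(
  variable = rep(c("diabetes_binary", "high_bp", "smoker", "phys_activity"), each = 2),
  level    = rep(c("No", "Yes"), times = 4),
  n        = c(218334, 35346, 144851, 108829, 141257, 112423, 61760, 191920)
)

# Share of each level within its variable, rounded to three decimals
counts$prop <- round(ave(counts$n, counts$variable, FUN = function(v) v / sum(v)), 3)

print(counts)
```

With these counts, about 13.9% of respondents are in the Yes class of diabetes_binary, 42.9% report high blood pressure, 44.3% are classified as smokers, and 75.7% report physical activity.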

4. Bivariate Exploration

High Blood Pressure

df %>%
  count(high_bp, diabetes_binary) %>%
  ggplot(aes(high_bp, n, fill = diabetes_binary)) +
  geom_col(position = "fill") +
  labs(y = "Proportion", title = "Diabetes by High Blood Pressure") +
  theme_minimal()

We observe that individuals with high blood pressure are more likely to have diabetes.
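
The plot shows proportions only visually. One possible way to obtain the numbers behind the bars is a row-wise proportion table. The helper name diabetes_rate_by and the toy data below are hypothetical and only illustrate the calculation; they are not part of the original analysis.

```r
# Row-wise proportions of the outcome within each level of a predictor.
# `diabetes_rate_by` is a hypothetical helper name introduced for this sketch.
diabetes_rate_by <- function(data, predictor, outcome = "diabetes_binary") {
  tab <- table(data[[predictor]], data[[outcome]], dnn = c(predictor, outcome))
  prop.table(tab, margin = 1)  # each row sums to 1
}

# Toy data for illustration only (not BRFSS values)
toy <- data.frame(
  high_bp         = factor(c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")),
  diabetes_binary = factor(c("No", "No", "No", "Yes", "No", "No", "Yes", "Yes"))
)

rates <- diabetes_rate_by(toy, "high_bp")
print(rates)
```

On the real data, diabetes_rate_by(df, "high_bp") should return the proportions that the stacked bars display.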

Smoking

df %>%
  count(smoker, diabetes_binary) %>%
  ggplot(aes(smoker, n, fill = diabetes_binary)) +
  geom_col(position = "fill") +
  labs(y = "Proportion", title = "Diabetes by Smoking Status") +
  theme_minimal()

The relationship is weaker than for high blood pressure, but smokers still appear slightly more likely to have diabetes than non-smokers.
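
To put a number on how weak or strong such a relationship is, one option is an odds ratio from the 2x2 table, optionally with a chi-squared test of independence. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the BRFSS results.

```r
# Hypothetical 2x2 counts for illustration only (rows: smoker, columns: diabetes)
tab <- matrix(
  c(80, 20,
    70, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("No", "Yes"), diabetes_binary = c("No", "Yes"))
)

# Odds of diabetes within each smoking group
odds_no  <- tab["No", "Yes"]  / tab["No", "No"]
odds_yes <- tab["Yes", "Yes"] / tab["Yes", "No"]

# Odds ratio: values close to 1 indicate a weak association
odds_ratio <- odds_yes / odds_no
print(odds_ratio)

# Chi-squared test of independence on the same table
test <- chisq.test(tab)
print(test$p.value)
```

On the real data, the table could be built with table(df$smoker, df$diabetes_binary). With more than 250,000 rows, even a small difference is likely to be statistically significant, so the odds ratio is probably the more informative quantity here.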

Physical Activity

df %>%
  count(phys_activity, diabetes_binary) %>%
  ggplot(aes(phys_activity, n, fill = diabetes_binary)) +
  geom_col(position = "fill") +
  labs(y = "Proportion", title = "Diabetes by Physical Activity") +
  theme_minimal()

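Respondents who report physical activity show a lower proportion of diabetes than those who do not. Because this is cross-sectional survey data, the pattern is an association and does not by itself show that physical activity reduces the likelihood of diabetes.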
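
The three bivariate plots in this section share the same structure, so they could be produced by one small function. The sketch below is one possible refactoring, not the code used for the figures above; the name plot_diabetes_by and the toy data are hypothetical. It lets geom_bar() count the rows itself instead of calling count() first.

```r
library(ggplot2)

# `plot_diabetes_by` is a hypothetical helper name introduced for this sketch.
plot_diabetes_by <- function(data, predictor, title = NULL) {
  ggplot(data, aes(x = .data[[predictor]], fill = .data[["diabetes_binary"]])) +
    geom_bar(position = "fill") +  # counts rows, then rescales each bar to 1
    labs(x = predictor, y = "Proportion", fill = "diabetes_binary", title = title) +
    theme_minimal()
}

# Toy data for illustration only (not BRFSS values)
toy <- data.frame(
  phys_activity   = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes")),
  diabetes_binary = factor(c("Yes", "No", "No", "No", "No", "Yes"))
)

p <- plot_diabetes_by(toy, "phys_activity", "Diabetes by Physical Activity")
print(p)
```

On the real data, the same function could be called in a loop over c("high_bp", "smoker", "phys_activity") with df as the first argument.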

