Predicting Diabetes Using Logistic Regression in R

Final Project for Spatial Data Science

Author

Faithwin Gbadamosi

Published

December 1, 2024

Introduction

Diabetes mellitus is one of the most common chronic diseases in the world. According to the World Health Organization (WHO), approximately 422 million people worldwide currently live with diabetes, with the majority residing in low- and middle-income countries (World Health Organization, 2023). The disease is characterized by elevated levels of blood glucose (or blood sugar), which over time lead to serious damage to the heart, blood vessels, eyes, kidneys and nerves.

For people at risk of diabetes, healthcare professionals have stressed the value of routine tests, emphasizing the necessity of early detection and intervention (Pranto et al., 2020). Beyond diabetes care, prevention is essential: predicting diabetes before or at its onset can help healthcare providers take early preventive measures (Talukder et al., 2024).

This project aims to build a predictive model for diabetes using readily available patient data and key variables, such as pregnancies, glucose levels, BMI, and genetic factors. In addition, I will create visual summaries to communicate the insights and patterns identified in the data. The major objectives include:

  1. Collecting and cleaning the diabetes dataset with relevant health variables.

  2. Applying a machine learning algorithm (logistic regression) to predict diabetes cases.

  3. Creating visual summaries to show the relationships between key variables.

Materials and methods

The Data

The dataset used in this project comes from a study by Chou et al. (2023). The patient population was drawn from the outpatient examination data of a Taipei municipal medical center, and 15,000 women aged between 20 and 80 were selected as the sample. The women had visited the hospital between 2018 and 2020 and between 2021 and 2022, and may or may not have been diagnosed with diabetes.

The dataset contains the following variables:

1. Pregnancies: Number of times pregnant

2. PlasmaGlucose: Plasma glucose concentration two hours after an oral glucose tolerance test

3. DiastolicBloodPressure: Diastolic blood pressure (mm Hg)

4. TricepsThickness: Triceps skin fold thickness (mm)

5. SerumInsulin: 2-Hour serum insulin (mu U/ml)

6. BMI: Body mass index (weight in kg/(height in m)^2)

7. DiabetesPedigree: a numerical estimate of an individual’s genetic risk for developing diabetes based on family history. A higher score indicates a greater likelihood of developing the condition.

8. Age: Age (years)

9. Diabetic: Outcome class variable (0 or 1), where 1 represents patients who tested positive for diabetes.

The dataset can be found here: https://drive.google.com/file/d/1eAplOYO-k7ZYHj4uHAY1tEr8VTeaxS6u/view?usp=sharing

Required Steps to Build Model

  1. Load necessary packages

  2. Load and explore the dataset

  3. Data visualization and exploratory analysis

  4. Preprocess data and Train model

  5. Evaluate model

  6. Make prediction with case study

  7. ROC Curve

Load Required Packages

Load the R packages needed for the analysis.

#install.packages("corrplot")
#install.packages("caret")
#install.packages("kableExtra")
#install.packages("tinytex")
#install.packages("DT")
#install.packages("heatmaply")
library(tinytex)
library(tidyverse)
library(leaflet)
library(kableExtra)
library(htmlwidgets)
library(widgetframe)
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(class)
library(corrplot)
library(caret)
library(reshape2) #for melt function
library(rmarkdown) 
library(knitr)
library(pROC) #for ROC Curve
library(DT)
library(heatmaply)
library(plotly)
knitr::opts_chunk$set(widgetframe_widgets_dir = 'widgets' ) 
knitr::opts_chunk$set(cache=TRUE)  # cache the results for quick compiling

Load and Explore Data

diabetes_data <- read.csv("https://drive.google.com/uc?export=download&id=1eAplOYO-k7ZYHj4uHAY1tEr8VTeaxS6u")

Exploring the dataset

diabetes_data%>%
  slice(1:10) %>% #show only 1:n rows
  kable(digits=2,align="c")%>% #make table and round to two digits
  kable_styling(bootstrap_options = 
                  c("striped", "hover", "condensed", "responsive")) 
PatientID  Pregnancies  PlasmaGlucose  DiastolicBloodPressure  TricepsThickness  SerumInsulin  BMI    DiabetesPedigree  Age  Diabetic
1354778    0            171            80                      34                23            43.51  1.21              21   0
1147438    8            92             93                      47                36            21.24  0.16              23   0
1640031    7            115            47                      52                35            41.51  0.08              23   0
1883350    9            103            78                      25                304           29.58  1.28              43   1
1424119    1            85             59                      27                35            42.60  0.55              22   0
1619297    0            82             92                      9                 253           19.72  0.10              26   0
1660149    0            133            47                      19                227           21.94  0.17              21   0
1458769    0            67             87                      43                36            18.28  0.24              26   0
1201647    8            80             95                      33                24            26.62  0.44              53   1
1403912    1            72             31                      40                42            36.89  0.10              26   0

#Exploring the structure of the data
str(diabetes_data)
'data.frame':   15000 obs. of  10 variables:
 $ PatientID             : int  1354778 1147438 1640031 1883350 1424119 1619297 1660149 1458769 1201647 1403912 ...
 $ Pregnancies           : int  0 8 7 9 1 0 0 0 8 1 ...
 $ PlasmaGlucose         : int  171 92 115 103 85 82 133 67 80 72 ...
 $ DiastolicBloodPressure: int  80 93 47 78 59 92 47 87 95 31 ...
 $ TricepsThickness      : int  34 47 52 25 27 9 19 43 33 40 ...
 $ SerumInsulin          : int  23 36 35 304 35 253 227 36 24 42 ...
 $ BMI                   : num  43.5 21.2 41.5 29.6 42.6 ...
 $ DiabetesPedigree      : num  1.213 0.158 0.079 1.283 0.55 ...
 $ Age                   : int  21 23 23 43 22 26 21 26 53 26 ...
 $ Diabetic              : int  0 0 0 1 0 0 0 0 1 0 ...

The dataset contains 15,000 patient records in 10 columns, all of which are numeric (integer or double).

Clean Dataset

sum(duplicated(diabetes_data)) # count fully duplicated rows (0 means none)
sum(is.na(diabetes_data))      # count missing values across all columns

The dataset contains no duplicates or missing values. The next step is a visual summary.
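The two checks above can be tried on a small example first. The sketch below uses a made-up data frame named `toy` (an assumption for illustration, not the project's data) that deliberately contains one duplicated row and one missing value, and shows how those would be counted and removed.

```r
# Toy data frame standing in for diabetes_data (values are made up).
toy <- data.frame(
  PatientID = c(1, 2, 2, 4),
  BMI       = c(21.5, 30.2, 30.2, NA)
)

# Number of rows that are exact copies of an earlier row.
n_duplicates <- sum(duplicated(toy))

# Number of missing values in each column.
n_missing <- colSums(is.na(toy))

print(n_duplicates)
print(n_missing)

# Drop duplicated rows first, then rows with any missing value.
toy_clean <- na.omit(toy[!duplicated(toy), ])
print(nrow(toy_clean))
```

On the real dataset both counts were zero, so no rows needed to be dropped.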

Data Visualization and Exploratory Analysis

Correlation and visual summary

#remove patientid and outcome for better analysis 
filtered_diabetes <- subset(diabetes_data, select = -c(PatientID,Diabetic))

correlation_matrix <- cor(filtered_diabetes)

# Convert correlation matrix from wide to long format for visualization
correlation_melted <- melt(correlation_matrix)

#see the outcome
correlation_melted%>%
  kable(digits=8,align="c")%>% #make table and round to eight digits
  kable_styling(bootstrap_options = 
                  c("striped", "hover", "condensed", "responsive"),
                fixed_thead = TRUE)%>% #fixed_thead is an argument of kable_styling, not a bootstrap option
  scroll_box(width = "100%", height = "400px") #scroll option long table
Var1 Var2 value
Pregnancies Pregnancies 1.00000000
PlasmaGlucose Pregnancies 0.05450238
DiastolicBloodPressure Pregnancies 0.04352845
TricepsThickness Pregnancies 0.06360454
SerumInsulin Pregnancies 0.10448699
BMI Pregnancies 0.08638610
DiabetesPedigree Pregnancies 0.05424006
Age Pregnancies 0.13697248
Pregnancies PlasmaGlucose 0.05450238
PlasmaGlucose PlasmaGlucose 1.00000000
DiastolicBloodPressure PlasmaGlucose 0.00721196
TricepsThickness PlasmaGlucose 0.02709960
SerumInsulin PlasmaGlucose 0.03354493
BMI PlasmaGlucose 0.02065333
DiabetesPedigree PlasmaGlucose 0.00905733
Age PlasmaGlucose 0.03886361
Pregnancies DiastolicBloodPressure 0.04352845
PlasmaGlucose DiastolicBloodPressure 0.00721196
DiastolicBloodPressure DiastolicBloodPressure 1.00000000
TricepsThickness DiastolicBloodPressure 0.01110606
SerumInsulin DiastolicBloodPressure 0.02264855
BMI DiastolicBloodPressure 0.01587319
DiabetesPedigree DiastolicBloodPressure 0.01409873
Age DiastolicBloodPressure 0.04133254
Pregnancies TricepsThickness 0.06360454
PlasmaGlucose TricepsThickness 0.02709960
DiastolicBloodPressure TricepsThickness 0.01110606
TricepsThickness TricepsThickness 1.00000000
SerumInsulin TricepsThickness 0.02968762
BMI TricepsThickness 0.02474548
DiabetesPedigree TricepsThickness -0.00095109
Age TricepsThickness 0.06138287
Pregnancies SerumInsulin 0.10448699
PlasmaGlucose SerumInsulin 0.03354493
DiastolicBloodPressure SerumInsulin 0.02264855
TricepsThickness SerumInsulin 0.02968762
SerumInsulin SerumInsulin 1.00000000
BMI SerumInsulin 0.05122315
DiabetesPedigree SerumInsulin 0.04632376
Age SerumInsulin 0.08800683
Pregnancies BMI 0.08638610
PlasmaGlucose BMI 0.02065333
DiastolicBloodPressure BMI 0.01587319
TricepsThickness BMI 0.02474548
SerumInsulin BMI 0.05122315
BMI BMI 1.00000000
DiabetesPedigree BMI 0.02886835
Age BMI 0.06290975
Pregnancies DiabetesPedigree 0.05424006
PlasmaGlucose DiabetesPedigree 0.00905733
DiastolicBloodPressure DiabetesPedigree 0.01409873
TricepsThickness DiabetesPedigree -0.00095109
SerumInsulin DiabetesPedigree 0.04632376
BMI DiabetesPedigree 0.02886835
DiabetesPedigree DiabetesPedigree 1.00000000
Age DiabetesPedigree 0.05563319
Pregnancies Age 0.13697248
PlasmaGlucose Age 0.03886361
DiastolicBloodPressure Age 0.04133254
TricepsThickness Age 0.06138287
SerumInsulin Age 0.08800683
BMI Age 0.06290975
DiabetesPedigree Age 0.05563319
Age Age 1.00000000

Plot correlation heatmap

# Plot heatmap
heatmaply(
  correlation_matrix,
  dendrogram = "none",
  xlab = "Features",
  ylab = "Features",
  main = "Correlation Heatmap",
  colors = colorRampPalette(c("red", "white", "brown"))(100),
  limits = c(-1, 1),
  branches_lwd = 0.1,
  titleX = FALSE,
  titleY = FALSE,
  label_names = c("Variable", "Factor", "Value"),
  fontsize_row = 10, fontsize_col = 10,
  labCol = colnames(correlation_matrix),
  labRow = rownames(correlation_matrix),
  heatmap_layers = theme(axis.line = element_blank())
)

Interactive Correlation HeatMap

The correlation matrix shows only weak positive correlations between the features. The largest are between Age and Pregnancies (r ≈ 0.14) and between SerumInsulin and Pregnancies (r ≈ 0.10). Older patients tend to have had slightly more pregnancies, and patients with more pregnancies tend to have slightly higher 2-hour serum insulin levels.

Most other pairs are essentially uncorrelated; for example, the correlation between DiabetesPedigree and TricepsThickness is about -0.001.
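Reading the strongest pairs off a long melted table is error-prone, so it can help to rank them programmatically. The sketch below is one possible way to do that. It rebuilds a 3 x 3 slice of the correlation matrix from the values reported above (rounded to four decimals) and sorts the unique pairs by absolute correlation. The names `cm` and `ranked_pairs` are illustrative and not part of the original analysis.

```r
# A 3 x 3 slice of the reported correlation matrix (rounded values).
vars <- c("Pregnancies", "SerumInsulin", "Age")
cm <- matrix(c(1.0000, 0.1045, 0.1370,
               0.1045, 1.0000, 0.0880,
               0.1370, 0.0880, 1.0000),
             nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by looking only at the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)

ranked_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort from strongest to weakest absolute correlation.
ranked_pairs <- ranked_pairs[order(-abs(ranked_pairs$r)), ]
print(ranked_pairs)
```

The same steps should work on the full `correlation_matrix` object, since it is also a square matrix with row and column names.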

Comparing Outcomes and Variables

Age vs Outcome

ggplot(data = diabetes_data, aes(x = Age)) +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue") + # bins = 30 is the default, set explicitly
  facet_wrap(~Diabetic) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "Age(s) of Patients")

Histogram of Patient Age by Diabetes Outcome

0 = Non-diabetic

1 = Diabetic

The age distribution is skewed to the right, with most patients between 20 and 40 years old.

BMI vs Outcome

ggplot(data = diabetes_data, aes(x = BMI)) +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue") + # bins = 30 is the default, set explicitly
  facet_wrap(~Diabetic) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "BMI of Patients")

Histogram of Patient BMI by Diabetes Outcome

Blood Pressure vs Outcome

ggplot(diabetes_data, aes(x = factor(Diabetic), y = DiastolicBloodPressure, fill = factor(Diabetic))) +
  geom_violin() +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "Patients' Blood Pressure", x = "Diabetes Status", fill = "Diabetes Status" ) +
  scale_fill_discrete(labels = c("Non-Diabetic", "Diabetic")) +
  theme_minimal()

Violin Plot of Diastolic Blood Pressure by Diabetes Outcome

The violin plot shows the distribution of diastolic blood pressure for each outcome.

Preprocess and Train Data

Preprocess
# Convert the outcome variable to a factor 
diabetes_data$Diabetic <- factor(diabetes_data$Diabetic, 
                                levels = c(0, 1), 
                                labels = c("Non-Diabetic", "Diabetic"))
# Split the data into training and testing sets
set.seed(123)  # for reproducibility
split <- createDataPartition(diabetes_data$Diabetic, p = 0.7, list = FALSE)
train_data <- diabetes_data[split, ]
test_data <- diabetes_data[-split, ]

# Fit the logistic regression model on the training set
diabetes_model <- glm(Diabetic ~ Pregnancies + PlasmaGlucose + DiastolicBloodPressure + TricepsThickness + 
                       SerumInsulin + BMI + DiabetesPedigree + Age,
                      data = train_data, family = binomial)

#  Summarize the model to see coefficients and other details
summary(diabetes_model)

Call:
glm(formula = Diabetic ~ Pregnancies + PlasmaGlucose + DiastolicBloodPressure + 
    TricepsThickness + SerumInsulin + BMI + DiabetesPedigree + 
    Age, family = binomial, data = train_data)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -8.7112122  0.2194791 -39.690  < 2e-16 ***
Pregnancies             0.2698705  0.0080060  33.709  < 2e-16 ***
PlasmaGlucose           0.0096941  0.0008234  11.773  < 2e-16 ***
DiastolicBloodPressure  0.0121392  0.0015853   7.657  1.9e-14 ***
TricepsThickness        0.0226795  0.0018192  12.467  < 2e-16 ***
SerumInsulin            0.0039586  0.0001944  20.367  < 2e-16 ***
BMI                     0.0486785  0.0027794  17.514  < 2e-16 ***
DiabetesPedigree        1.0392443  0.0670211  15.506  < 2e-16 ***
Age                     0.0590712  0.0020860  28.318  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13366.8  on 10499  degrees of freedom
Residual deviance:  9151.1  on 10491  degrees of freedom
AIC: 9169.1

Number of Fisher Scoring iterations: 5
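
Logistic regression coefficients are on the log-odds scale, so they are easier to read after exponentiating them into odds ratios. The sketch below does this with the estimates copied from the summary output above. With the fitted model object available, `exp(coef(diabetes_model))` should give the same values without retyping them. The choice of a 10-unit change in PlasmaGlucose is only an illustration.

```r
# Coefficient estimates copied from the model summary (log-odds scale).
coefs <- c(
  Pregnancies            = 0.2698705,
  PlasmaGlucose          = 0.0096941,
  DiastolicBloodPressure = 0.0121392,
  TricepsThickness       = 0.0226795,
  SerumInsulin           = 0.0039586,
  BMI                    = 0.0486785,
  DiabetesPedigree       = 1.0392443,
  Age                    = 0.0590712
)

# Odds ratio for a one-unit increase in each predictor,
# holding the other predictors fixed.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Odds ratio for a larger change, here 10 units of PlasmaGlucose.
or_glucose_10 <- exp(10 * coefs[["PlasmaGlucose"]])
print(round(or_glucose_10, 3))
```

For example, each additional pregnancy multiplies the odds of being classified as diabetic by roughly 1.31, with the other variables held constant.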

Evaluate Model
# Generate predictions on the test set
test_data$predicted_prob <- predict(diabetes_model, newdata = test_data, type = "response")
test_data$predicted_class <- factor(ifelse(test_data$predicted_prob > 0.5, 
                                         "Diabetic", "Non-Diabetic"),
                                  levels = c("Non-Diabetic", "Diabetic"))


# Create confusion matrix
confusion_matrix <- confusionMatrix(data = test_data$predicted_class,
                                  reference = test_data$Diabetic,
                                  positive = "Diabetic")


# Print confusion matrix and statistics
print(confusion_matrix)
Confusion Matrix and Statistics

              Reference
Prediction     Non-Diabetic Diabetic
  Non-Diabetic         2671      627
  Diabetic              329      873
                                          
               Accuracy : 0.7876          
                 95% CI : (0.7753, 0.7994)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.497           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5820          
            Specificity : 0.8903          
         Pos Pred Value : 0.7263          
         Neg Pred Value : 0.8099          
             Prevalence : 0.3333          
         Detection Rate : 0.1940          
   Detection Prevalence : 0.2671          
      Balanced Accuracy : 0.7362          
                                          
       'Positive' Class : Diabetic        
                                          

This confusion matrix shows:

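  • The model correctly predicted 873 diabetic patients (true positives) and missed 627 (false negatives).

  • The model correctly predicted 2671 non-diabetic patients (true negatives) and wrongly flagged 329 as diabetic (false positives).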

  • Accuracy: Overall, the model is correct 78.76% of the time.

  • Sensitivity (Recall): 58.20%. This means the model correctly identifies 58.20% of actual diabetic cases.

  • Specificity: 89.03%. This means the model correctly identifies 89.03% of actual non-diabetic cases.

  • Precision: 72.63%. Of those predicted as diabetic, 72.63% are actually diabetic.

  • Negative Predictive Value: 80.99%. Of those predicted as non-diabetic, 80.99% are actually non-diabetic.

The line “'Positive' Class : Diabetic” at the end of the output means that “Diabetic” is treated as the positive class in this analysis.
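
Each statistic listed above follows directly from the four cell counts of the confusion matrix. As a check, the sketch below recomputes them by hand from the counts printed in the output. The variable names (`tp`, `fn`, `fp`, `tn`) are chosen here for illustration.

```r
# Cell counts taken from the confusion matrix above.
tp <- 873   # predicted Diabetic, actually Diabetic
fn <- 627   # predicted Non-Diabetic, actually Diabetic
fp <- 329   # predicted Diabetic, actually Non-Diabetic
tn <- 2671  # predicted Non-Diabetic, actually Non-Diabetic

accuracy    <- (tp + tn) / (tp + fn + fp + tn)
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              npv = npv, balanced_accuracy = balanced_accuracy), 4))
```

The recomputed values agree with the caret output to four decimal places.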

Case Study

Predicting diabetes for a new patient named Molly_Jane.

# Define the new patient's data for prediction
Molly_Jane <- data.frame(
  Pregnancies = 2,
  PlasmaGlucose = 120,
  DiastolicBloodPressure = 70,
  TricepsThickness = 30,
  SerumInsulin = 85,
  BMI = 28.5,
  DiabetesPedigree = 0.627,
  Age = 45
)

# Use the model to predict the probability for the new patient
prediction_prob <- predict(diabetes_model, newdata = Molly_Jane, type = "response")

# Convert probability to class prediction with explicit labeling
prediction_class <- ifelse(prediction_prob > 0.5, "Diabetic", "Non-Diabetic")

# Print the results with formatted probability and text label
cat("Predicted probability of diabetes:", round(prediction_prob, 3), "\n") # rounded to 3 decimal places
Predicted probability of diabetes: 0.391 
cat("Predicted class for the new patient:", prediction_class)
Predicted class for the new patient: Non-Diabetic

With a predicted probability of 39.1%, Molly_Jane is classified as Non-Diabetic. A predicted probability above the 0.5 threshold would instead classify the patient as Diabetic.
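
To make clear where the 0.391 comes from, the prediction can be reproduced by hand: multiply each coefficient by the patient's value, add the intercept, and pass the resulting log-odds through the logistic function. The sketch below uses the estimates copied from the model summary. The vector names `coefs` and `patient` are illustrative.

```r
# Coefficient estimates copied from the model summary.
coefs <- c(
  "(Intercept)"          = -8.7112122,
  Pregnancies            = 0.2698705,
  PlasmaGlucose          = 0.0096941,
  DiastolicBloodPressure = 0.0121392,
  TricepsThickness       = 0.0226795,
  SerumInsulin           = 0.0039586,
  BMI                    = 0.0486785,
  DiabetesPedigree       = 1.0392443,
  Age                    = 0.0590712
)

# Molly_Jane's values, with a 1 for the intercept term.
patient <- c(
  "(Intercept)"          = 1,
  Pregnancies            = 2,
  PlasmaGlucose          = 120,
  DiastolicBloodPressure = 70,
  TricepsThickness       = 30,
  SerumInsulin           = 85,
  BMI                    = 28.5,
  DiabetesPedigree       = 0.627,
  Age                    = 45
)

# Linear predictor (log-odds), matched by name to avoid ordering mistakes.
log_odds <- sum(coefs * patient[names(coefs)])

# The logistic function converts log-odds into a probability.
prob <- plogis(log_odds)

cat("Log-odds:", round(log_odds, 3), "\n")
cat("Probability:", round(prob, 3), "\n")
```

The log-odds are negative, which is why the probability falls below 0.5 and the patient is classified as Non-Diabetic.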

ROC Curve

Model’s ROC Curve

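The Receiver Operating Characteristic (ROC) curve shows that the model's overall performance is good. The AUC (Area Under the Curve) is about 0.8, which means that a randomly chosen diabetic patient receives a higher predicted probability than a randomly chosen non-diabetic patient about 80% of the time. AUC measures how well the model ranks patients and is not the same as accuracy, which was 78.76% on the test set.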
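
The code that produced the ROC curve is not shown in this report. The sketch below is not that code. It illustrates, in base R and on synthetic scores (an assumption, not the project's data), how an AUC can be computed and why it can be read as a ranking probability. The helper names `auc_rank` and `auc_pairs` are made up for this example.

```r
set.seed(1)

# Synthetic example: 50 positives score higher on average than 100 negatives.
labels <- c(rep(1, 50), rep(0, 100))
scores <- c(rnorm(50, mean = 1), rnorm(100, mean = 0))

# AUC from the rank-sum (Mann-Whitney) formula.
auc_rank <- function(scores, labels) {
  r     <- rank(scores)            # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# AUC as the share of (positive, negative) pairs in which the positive
# case has the higher score; ties count as half.
auc_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n)))
}

print(auc_rank(scores, labels))
print(auc_pairs(scores, labels))
```

Both functions return the same number, which is the point of the exercise. With the pROC package loaded earlier, a call along the lines of `roc(test_data$Diabetic, test_data$predicted_prob)` followed by `auc()` would be the usual route on the real test set.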

Conclusions

Diabetes is a serious chronic disease, and early diagnosis is crucial for effective management. This project used logistic regression to predict diabetes from eight medical parameters: age, diastolic blood pressure, serum insulin, BMI, triceps skin fold thickness, number of pregnancies, diabetes pedigree, and plasma glucose level.

After training and evaluation, the model achieved good results, with an AUC of about 0.8 and a test accuracy of 78.76%, although its sensitivity of 58.20% shows that some diabetic cases are still missed. This shows the potential of machine learning to improve diabetes prediction. Using models like this to predict diabetes for new and existing patients can support earlier diagnosis and, in turn, more effective management.

References

Chou C-Y, Hsu D-Y, Chou C-H. Predicting the Onset of Diabetes with Machine Learning Methods. Journal of Personalized Medicine. 2023; 13(3):406. https://doi.org/10.3390/jpm13030406

Geeks for Geeks Prediction Using R Course: https://www.geeksforgeeks.org/diabetes-prediction-using-r/

Pranto B, Mehnaz S, Mahid EB, et al. Evaluating machine learning methods for predicting diabetes among female patients in Bangladesh. Information. 2020;11:374.

Talukder MdA, Islam MdM, Uddin MA, et al. Toward reliable diabetes prediction: Innovations in data engineering and machine learning applications. Digital Health. 2024;10. doi:10.1177/20552076241271867

Tamunoye Darego (2022) Diabetes Prediction using kNN in R

World Health Organization (2023): https://www.who.int/news-room/fact-sheets/detail/diabetes

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. doi:10.1186/1471-2105-12-77