#install.packages("corrplot")
#install.packages("caret")
#install.packages("kableExtra")
#install.packages("tinytex")
#install.packages("DT")
#install.packages("heatmaply")
library(tinytex)
library(tidyverse)
library(leaflet)
library(kableExtra)
library(htmlwidgets)
library(widgetframe)
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(class)
library(corrplot)
library(caret)
library(reshape2) #for melt function
library(rmarkdown)
library(knitr)
library(pROC) #for ROC Curve
library(DT)
library(heatmaply)
library(plotly)
knitr::opts_chunk$set(widgetframe_widgets_dir = 'widgets')
knitr::opts_chunk$set(cache = TRUE) # cache the results for quick compiling
Predicting Diabetes Using Logistic Regression in R
Final Project for Spatial Data Science
Introduction
Diabetes mellitus is one of the most common critical diseases in the world. According to the World Health Organization (WHO), approximately 422 million people worldwide currently live with diabetes, the majority of them in low- and middle-income countries (World Health Organization, 2023). The disease is characterized by elevated blood glucose (blood sugar), which over time leads to serious damage to the heart, blood vessels, eyes, kidneys, and nerves.
For people at risk of diabetes, healthcare professionals have stressed the value of routine tests, emphasizing the necessity of early detection and intervention (Pranto et al., 2020). In addition to diabetes care, prevention is essential. Predicting diabetes at its onset can help healthcare providers take early preventive measures (Talukder et al., 2024).
This project aims to build a predictive model for diabetes using readily available patient data and key variables, such as pregnancies, glucose levels, BMI, and genetic factors. In addition, I will create visual summaries to communicate the insights and patterns identified in the data. The major objectives include:
Collecting and cleaning the diabetes dataset with relevant health variables.
Applying a machine learning algorithm (logistic regression) to predict diabetes cases.
Creating visual summaries to show the relationships between key variables.
Materials and methods
The Data
The dataset used in this project comes from a study by Chou et al. (2023). The patient population was drawn from the outpatient examination records of a Taipei Municipal medical center, from which 15,000 women aged between 20 and 80 were selected as the sample. The women had visited the hospital between 2018 and 2020 and between 2021 and 2022, and may or may not have been diagnosed with diabetes.
The dataset contains the following variables:
1. Pregnancies: Number of times pregnant
2. PlasmaGlucose: Plasma glucose concentration two hours after an oral glucose tolerance test
3. DiastolicBloodPressure: Diastolic blood pressure (mm Hg)
4. TricepsThickness: Triceps skin fold thickness (mm)
5. SerumInsulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigree: A numerical estimate of an individual’s genetic risk for developing diabetes based on family history. A higher score indicates a greater likelihood of developing the condition.
8. Age: Age (years)
9. Diabetic (Outcome): Class variable (0 or 1), where 1 represents those who tested positive for diabetes.
Dataset can be found here: (https://drive.google.com/file/d/1eAplOYO-k7ZYHj4uHAY1tEr8VTeaxS6u/view?usp=sharing).
Required Steps to Build Model
Load necessary packages
Load and explore the dataset
Data visualization and exploratory analysis
Preprocess data and Train model
Evaluate model
Make a prediction for a case study
ROC Curve
Load Required Packages
Load the necessary R packages to aid the analysis.
Load and Explore Data
<- read.csv("https://drive.google.com/uc?export=download&id=1eAplOYO-k7ZYHj4uHAY1tEr8VTeaxS6u") diabetes_data
Exploring the dataset
diabetes_data %>%
  slice(1:10) %>% # show only rows 1:10
  kable(digits = 2, align = "c") %>% # make table and round to two digits
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive"))
PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
---|---|---|---|---|---|---|---|---|---|
1354778 | 0 | 171 | 80 | 34 | 23 | 43.51 | 1.21 | 21 | 0 |
1147438 | 8 | 92 | 93 | 47 | 36 | 21.24 | 0.16 | 23 | 0 |
1640031 | 7 | 115 | 47 | 52 | 35 | 41.51 | 0.08 | 23 | 0 |
1883350 | 9 | 103 | 78 | 25 | 304 | 29.58 | 1.28 | 43 | 1 |
1424119 | 1 | 85 | 59 | 27 | 35 | 42.60 | 0.55 | 22 | 0 |
1619297 | 0 | 82 | 92 | 9 | 253 | 19.72 | 0.10 | 26 | 0 |
1660149 | 0 | 133 | 47 | 19 | 227 | 21.94 | 0.17 | 21 | 0 |
1458769 | 0 | 67 | 87 | 43 | 36 | 18.28 | 0.24 | 26 | 0 |
1201647 | 8 | 80 | 95 | 33 | 24 | 26.62 | 0.44 | 53 | 1 |
1403912 | 1 | 72 | 31 | 40 | 42 | 36.89 | 0.10 | 26 | 0 |
#Exploring the structure of the data
str(diabetes_data)
'data.frame': 15000 obs. of 10 variables:
$ PatientID : int 1354778 1147438 1640031 1883350 1424119 1619297 1660149 1458769 1201647 1403912 ...
$ Pregnancies : int 0 8 7 9 1 0 0 0 8 1 ...
$ PlasmaGlucose : int 171 92 115 103 85 82 133 67 80 72 ...
$ DiastolicBloodPressure: int 80 93 47 78 59 92 47 87 95 31 ...
$ TricepsThickness : int 34 47 52 25 27 9 19 43 33 40 ...
$ SerumInsulin : int 23 36 35 304 35 253 227 36 24 42 ...
$ BMI : num 43.5 21.2 41.5 29.6 42.6 ...
$ DiabetesPedigree : num 1.213 0.158 0.079 1.283 0.55 ...
$ Age : int 21 23 23 43 22 26 21 26 53 26 ...
$ Diabetic : int 0 0 0 1 0 0 0 0 1 0 ...
The dataset contains 15,000 patient records, and all of the features are numeric.
Clean Dataset
sum(duplicated(diabetes_data)) # count duplicate rows
sum(is.na(diabetes_data)) #to check missing values
The dataset contains no duplicates or missing values. The next step is a visual summary.
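Although it was not part of the original checks, it also helps to see how balanced the outcome classes are before modelling; a minimal sketch using the Diabetic column as loaded above:
# Quick check of the outcome class balance (sketch)
table(diabetes_data$Diabetic) # counts of 0 (non-diabetic) and 1 (diabetic)
prop.table(table(diabetes_data$Diabetic)) # proportion of each class
The non-diabetic share found here should match the No Information Rate reported with the confusion matrix later on.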
Data Visualization and Exploratory Analysis
Correlation and visual summary
# Remove PatientID and the outcome variable for the correlation analysis
filtered_diabetes <- subset(diabetes_data, select = -c(PatientID, Diabetic))

correlation_matrix <- cor(filtered_diabetes)

# Convert correlation matrix from wide to long format for visualization
correlation_melted <- melt(correlation_matrix)

# See the outcome
correlation_melted %>%
  kable(digits = 8, align = "c") %>% # make table and round digits
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed", "responsive"),
                fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "400px") # scroll option for the long table
Var1 | Var2 | value |
---|---|---|
Pregnancies | Pregnancies | 1.00000000 |
PlasmaGlucose | Pregnancies | 0.05450238 |
DiastolicBloodPressure | Pregnancies | 0.04352845 |
TricepsThickness | Pregnancies | 0.06360454 |
SerumInsulin | Pregnancies | 0.10448699 |
BMI | Pregnancies | 0.08638610 |
DiabetesPedigree | Pregnancies | 0.05424006 |
Age | Pregnancies | 0.13697248 |
Pregnancies | PlasmaGlucose | 0.05450238 |
PlasmaGlucose | PlasmaGlucose | 1.00000000 |
DiastolicBloodPressure | PlasmaGlucose | 0.00721196 |
TricepsThickness | PlasmaGlucose | 0.02709960 |
SerumInsulin | PlasmaGlucose | 0.03354493 |
BMI | PlasmaGlucose | 0.02065333 |
DiabetesPedigree | PlasmaGlucose | 0.00905733 |
Age | PlasmaGlucose | 0.03886361 |
Pregnancies | DiastolicBloodPressure | 0.04352845 |
PlasmaGlucose | DiastolicBloodPressure | 0.00721196 |
DiastolicBloodPressure | DiastolicBloodPressure | 1.00000000 |
TricepsThickness | DiastolicBloodPressure | 0.01110606 |
SerumInsulin | DiastolicBloodPressure | 0.02264855 |
BMI | DiastolicBloodPressure | 0.01587319 |
DiabetesPedigree | DiastolicBloodPressure | 0.01409873 |
Age | DiastolicBloodPressure | 0.04133254 |
Pregnancies | TricepsThickness | 0.06360454 |
PlasmaGlucose | TricepsThickness | 0.02709960 |
DiastolicBloodPressure | TricepsThickness | 0.01110606 |
TricepsThickness | TricepsThickness | 1.00000000 |
SerumInsulin | TricepsThickness | 0.02968762 |
BMI | TricepsThickness | 0.02474548 |
DiabetesPedigree | TricepsThickness | -0.00095109 |
Age | TricepsThickness | 0.06138287 |
Pregnancies | SerumInsulin | 0.10448699 |
PlasmaGlucose | SerumInsulin | 0.03354493 |
DiastolicBloodPressure | SerumInsulin | 0.02264855 |
TricepsThickness | SerumInsulin | 0.02968762 |
SerumInsulin | SerumInsulin | 1.00000000 |
BMI | SerumInsulin | 0.05122315 |
DiabetesPedigree | SerumInsulin | 0.04632376 |
Age | SerumInsulin | 0.08800683 |
Pregnancies | BMI | 0.08638610 |
PlasmaGlucose | BMI | 0.02065333 |
DiastolicBloodPressure | BMI | 0.01587319 |
TricepsThickness | BMI | 0.02474548 |
SerumInsulin | BMI | 0.05122315 |
BMI | BMI | 1.00000000 |
DiabetesPedigree | BMI | 0.02886835 |
Age | BMI | 0.06290975 |
Pregnancies | DiabetesPedigree | 0.05424006 |
PlasmaGlucose | DiabetesPedigree | 0.00905733 |
DiastolicBloodPressure | DiabetesPedigree | 0.01409873 |
TricepsThickness | DiabetesPedigree | -0.00095109 |
SerumInsulin | DiabetesPedigree | 0.04632376 |
BMI | DiabetesPedigree | 0.02886835 |
DiabetesPedigree | DiabetesPedigree | 1.00000000 |
Age | DiabetesPedigree | 0.05563319 |
Pregnancies | Age | 0.13697248 |
PlasmaGlucose | Age | 0.03886361 |
DiastolicBloodPressure | Age | 0.04133254 |
TricepsThickness | Age | 0.06138287 |
SerumInsulin | Age | 0.08800683 |
BMI | Age | 0.06290975 |
DiabetesPedigree | Age | 0.05563319 |
Age | Age | 1.00000000 |
Plot correlation heatmap
# Plot heatmap
heatmaply(
correlation_matrix, dendrogram = "none",
xlab = "Features",
ylab = "Features",
main = "Correlation Heatmap",
colors = colorRampPalette(c("red", "white", "brown"))(100),
limits = c(-1, 1),
branches_lwd = 0.1,
titleX = FALSE,
titleY = FALSE,
label_names = c("Variable", "Factor", "Value"),
fontsize_row = 10, fontsize_col = 10,
labCol = colnames(correlation_matrix),
labRow = rownames(correlation_matrix),
heatmap_layers = theme(axis.line = element_blank())
)
Interactive Correlation HeatMap
The strongest, though still modest, positive correlations are between Age and Pregnancies (about 0.14) and between SerumInsulin and Pregnancies (about 0.10). This indicates that older patients tended to have had more pregnancies, and that patients with more pregnancies also tended to have higher serum insulin measurements.
The remaining attribute pairs show weak or negligible correlations, for example DiabetesPedigree and TricepsThickness.
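To confirm which pairs stand out, the long-format correlation table can be sorted by absolute value; a minimal sketch reusing correlation_melted from above:
# Sort the off-diagonal correlations to find the strongest pairs (sketch)
correlation_melted %>%
  filter(Var1 != Var2) %>% # drop the diagonal entries (correlation of 1)
  arrange(desc(abs(value))) %>% # strongest relationships first
  slice(1:6) # top pairs; each pair appears twice because the matrix is symmetric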
Comparing Outcomes and Variables
Age vs Outcome
ggplot(data = diabetes_data, aes(x = Age)) +
  geom_histogram(color = "blue", fill = "lightblue") +
  facet_wrap(~Diabetic) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "Age(s) of Patients")
0 = Non-diabetic
1 = Diabetic
The ages of the patients are skewed to the right, with most patients between the ages of 20 and 40.
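A quick numeric summary (a sketch, not part of the original analysis) backs up the visual impression:
# Numeric summary of patient age (sketch)
summary(diabetes_data$Age) # five-number summary and mean of Age
mean(diabetes_data$Age >= 20 & diabetes_data$Age <= 40) # share of patients aged 20 to 40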
BMI vs Outcome
ggplot(data = diabetes_data, aes(x = BMI)) +
  geom_histogram(color = "blue", fill = "lightblue") +
  facet_wrap(~Diabetic) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "BMI of Patients")
Blood Pressure vs Outcome
ggplot(diabetes_data, aes(x = factor(Diabetic), y = DiastolicBloodPressure, fill = factor(Diabetic))) +
geom_violin() +
geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
labs(title = "Patients' Blood Pressure", x = "Diabetes Status", fill = "Diabetes Status" ) +
scale_fill_discrete(labels = c("Non-Diabetic", "Diabetic")) +
theme_minimal()
Visualizing the distribution of blood pressure for each outcome.
Preprocess and Train Data
Preprocess
# Convert the outcome variable to a factor
diabetes_data$Diabetic <- factor(diabetes_data$Diabetic,
                                 levels = c(0, 1),
                                 labels = c("Non-Diabetic", "Diabetic"))

# Split the data into training and testing sets
set.seed(123) # for reproducibility
split      <- createDataPartition(diabetes_data$Diabetic, p = 0.7, list = FALSE)
train_data <- diabetes_data[split, ]
test_data  <- diabetes_data[-split, ]

# Fit the logistic regression model
diabetes_model <- glm(Diabetic ~ Pregnancies + PlasmaGlucose + DiastolicBloodPressure +
                        TricepsThickness + SerumInsulin + BMI + DiabetesPedigree + Age,
                      data = train_data, family = binomial)
# Summarize the model to see coefficients and other details
summary(diabetes_model)
Call:
glm(formula = Diabetic ~ Pregnancies + PlasmaGlucose + DiastolicBloodPressure +
TricepsThickness + SerumInsulin + BMI + DiabetesPedigree +
Age, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.7112122 0.2194791 -39.690 < 2e-16 ***
Pregnancies 0.2698705 0.0080060 33.709 < 2e-16 ***
PlasmaGlucose 0.0096941 0.0008234 11.773 < 2e-16 ***
DiastolicBloodPressure 0.0121392 0.0015853 7.657 1.9e-14 ***
TricepsThickness 0.0226795 0.0018192 12.467 < 2e-16 ***
SerumInsulin 0.0039586 0.0001944 20.367 < 2e-16 ***
BMI 0.0486785 0.0027794 17.514 < 2e-16 ***
DiabetesPedigree 1.0392443 0.0670211 15.506 < 2e-16 ***
Age 0.0590712 0.0020860 28.318 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13366.8 on 10499 degrees of freedom
Residual deviance: 9151.1 on 10491 degrees of freedom
AIC: 9169.1
Number of Fisher Scoring iterations: 5
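The coefficients above are on the log-odds scale. Though not shown in the original output, a convenient way to read them is to exponentiate them into odds ratios; a minimal sketch:
# Convert log-odds coefficients to odds ratios (sketch)
exp(coef(diabetes_model)) # e.g. each additional pregnancy multiplies the odds of diabetes by exp(0.27), roughly 1.31
exp(confint.default(diabetes_model)) # Wald confidence intervals on the odds-ratio scale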
Evaluate Model
# Generate predictions on the test set
test_data$predicted_prob <- predict(diabetes_model, newdata = test_data, type = "response")
test_data$predicted_class <- factor(ifelse(test_data$predicted_prob > 0.5,
                                           "Diabetic", "Non-Diabetic"),
                                    levels = c("Non-Diabetic", "Diabetic"))

# Create confusion matrix
confusion_matrix <- confusionMatrix(data = test_data$predicted_class,
                                    reference = test_data$Diabetic,
                                    positive = "Diabetic")
# Print confusion matrix and statistics
print(confusion_matrix)
Confusion Matrix and Statistics
Reference
Prediction Non-Diabetic Diabetic
Non-Diabetic 2671 627
Diabetic 329 873
Accuracy : 0.7876
95% CI : (0.7753, 0.7994)
No Information Rate : 0.6667
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.497
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.5820
Specificity : 0.8903
Pos Pred Value : 0.7263
Neg Pred Value : 0.8099
Prevalence : 0.3333
Detection Rate : 0.1940
Detection Prevalence : 0.2671
Balanced Accuracy : 0.7362
'Positive' Class : Diabetic
This confusion matrix shows:
The model correctly predicted 873 diabetic patients.
The model correctly predicted 2671 non-diabetic patients.
Accuracy: Overall, the model is correct 78.76% of the time.
Sensitivity (Recall): 58.20%. This means the model correctly identifies 58.20% of actual diabetic cases.
Specificity: 89.03%. This means the model correctly identifies 89.03% of actual non-diabetic cases.
Precision: 72.63%. Of those predicted as diabetic, 72.63% are actually diabetic.
Negative Predictive Value: 80.99%. Of those predicted as non-diabetic, 80.99% are actually non-diabetic.
The “'Positive' Class : Diabetic” line at the end indicates that the “Diabetic” class is treated as the positive class in this analysis. These statistics can also be reproduced directly from the four counts in the matrix, as the short check below shows.
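As a sanity check (a sketch, not part of the original analysis), the headline metrics can be recomputed by hand from the confusion-matrix counts:
# Recompute the headline metrics from the confusion-matrix counts (sketch)
TP <- 873;  FN <- 627  # diabetic patients classified correctly / incorrectly
TN <- 2671; FP <- 329  # non-diabetic patients classified correctly / incorrectly

accuracy    <- (TP + TN) / (TP + TN + FP + FN) # 0.7876
sensitivity <- TP / (TP + FN)                  # 0.5820 (recall for the Diabetic class)
specificity <- TN / (TN + FP)                  # 0.8903
precision   <- TP / (TP + FP)                  # 0.7263 (positive predictive value)
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision)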
Case Study
Predicting diabetes for a new patient named Molly_Jane.
# Define the new patient's data for prediction
Molly_Jane <- data.frame(
  Pregnancies = 2,
  PlasmaGlucose = 120,
  DiastolicBloodPressure = 70,
  TricepsThickness = 30,
  SerumInsulin = 85,
  BMI = 28.5,
  DiabetesPedigree = 0.627,
  Age = 45
)

# Use the model to predict the probability for the new patient
prediction_prob <- predict(diabetes_model, newdata = Molly_Jane, type = "response")

# Convert probability to class prediction with explicit labeling
prediction_class <- ifelse(prediction_prob > 0.5, "Diabetic", "Non-Diabetic")

# Print the results with formatted probability and text label
cat("Predicted probability of diabetes:", round(prediction_prob, 3), "\n") # rounded to 3 decimal points
Predicted probability of diabetes: 0.391
cat("Predicted class for the new patient:", prediction_class)
Predicted class for the new patient: Non-Diabetic
With a predicted probability of 39.1%, Molly_Jane is classified as Non-Diabetic; only a probability above the 0.5 threshold would classify a patient as Diabetic.
ROC Curve
Model’s ROC Curve
The receiver operating characteristic (ROC) curve shows that the overall performance of the model is good. With an area under the curve (AUC) of about 0.8, the model assigns a higher predicted probability to a randomly chosen diabetic patient than to a randomly chosen non-diabetic patient roughly 80% of the time.
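A minimal sketch of how such a curve can be produced with the pROC package, reusing the test-set predictions from the evaluation step:
# ROC curve and AUC on the test set (sketch using pROC)
roc_obj <- roc(response = test_data$Diabetic,
               predictor = test_data$predicted_prob,
               levels = c("Non-Diabetic", "Diabetic"),
               direction = "<") # controls (Non-Diabetic) are expected to have lower predicted probabilities
plot(roc_obj, main = "ROC Curve for the Logistic Regression Model")
auc(roc_obj) # area under the curve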
Conclusions
Diabetes is a serious chronic disease, and early diagnosis is crucial for effective management. This project used logistic regression to predict the onset of diabetes from eight key medical parameters: age, diastolic blood pressure, serum insulin, BMI, triceps skinfold thickness, number of pregnancies, diabetes pedigree, and plasma glucose level.
After training and evaluation, the model achieved solid results, with an overall accuracy of about 79% and an AUC of about 0.8. This shows the potential of machine learning to improve diabetes prediction. Using models like this for new and existing patients can support earlier diagnosis and, in turn, more effective management.
References
Chou C-Y, Hsu D-Y, Chou C-H. Predicting the Onset of Diabetes with Machine Learning Methods. Journal of Personalized Medicine. 2023; 13(3):406. https://doi.org/10.3390/jpm13030406
GeeksforGeeks. Diabetes Prediction using R. https://www.geeksforgeeks.org/diabetes-prediction-using-r/
Pranto B, Mehnaz S, Mahid EB, et al. Evaluating machine learning methods for predicting diabetes among female patients in Bangladesh. Information. 2020; 11: 374.
Talukder MdA, Islam MdM, Uddin MA, et al. Toward reliable diabetes prediction: Innovations in data engineering and machine learning applications. Digital Health. 2024; 10. https://doi.org/10.1177/20552076241271867
Tamunoye Darego (2022). Diabetes Prediction using kNN in R.
World Health Organization (2023). Diabetes fact sheet. https://www.who.int/news-room/fact-sheets/detail/diabetes
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011; 12: 77. https://doi.org/10.1186/1471-2105-12-77