Diabetes Risk Prediction using Shapley Additive Explanations for Feature Engineering

Chinwe Miracle Chituru
Sin-Ban Ho
Ian Chai

Abstract

Diabetes is prevalent globally, and its prevalence is expected to increase in the next few years. This includes people with different types of diabetes, such as type 1 and type 2 diabetes. Several factors drive the increase, chiefly dietary choices and lack of exercise. This global health challenge calls for effective prediction and early management of the disease. This research uses the decision tree algorithm to predict the risk of diabetes and integrates SHapley Additive exPlanations (SHAP) for feature engineering and model interpretability. Random forest and gradient boosting models were also developed to identify the risk factors and to compare their predictions with those of the decision tree model. The performance of these classifiers was evaluated using accuracy, F1-score, precision, and recall. Understanding the features that drive predictions can enhance clinical decision-making as much as predictive accuracy. On a dataset of 520 instances with 17 features, including the target output, the proposed decision tree model achieved an accuracy of 97%. The decision tree model’s categorical variables enable straightforward data visualization. After the model was developed, SHAP was applied to interpret its predictions. This is crucial for healthcare practitioners because it provides specific health metrics for identifying patients at high risk of diabetes. Preliminary results indicate that polyuria, polydipsia, and age, in combination, are predictors of diabetes risk. This study highlights that integrating SHAP with the decision tree algorithm provides both predictive capability and transparent model interpretability. It also contributes to the growing body of literature on machine learning in the healthcare industry. The results support applying this methodology to prediction in clinical settings, fostering trust in the approach among practitioners and patients alike.
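
To illustrate what a Shapley-based explanation of a tree prediction looks like, the following minimal sketch computes exact Shapley values by brute-force enumeration for a toy decision tree. It is not the authors' model or code: the tree `toy_tree`, its risk scores, the binary feature `age_over_50`, and the four background records are all hypothetical and chosen only for illustration. The SHAP library computes such attributions far more efficiently for tree models; this sketch uses plain Python instead of that library so that it stays self-contained.

```python
from itertools import combinations
from math import factorial

# Hypothetical binary features, loosely named after predictors in the abstract.
FEATURES = ["polyuria", "polydipsia", "age_over_50"]


def toy_tree(x):
    """Hand-written toy decision tree (an assumption, not the paper's model).

    Returns a made-up risk score between 0 and 1.
    """
    if x["polyuria"]:
        return 0.95 if x["polydipsia"] else 0.70
    if x["polydipsia"]:
        return 0.60
    return 0.30 if x["age_over_50"] else 0.10


# Made-up background records used to average out "unknown" features.
BACKGROUND = [
    {"polyuria": 0, "polydipsia": 0, "age_over_50": 0},
    {"polyuria": 0, "polydipsia": 0, "age_over_50": 1},
    {"polyuria": 1, "polydipsia": 0, "age_over_50": 1},
    {"polyuria": 1, "polydipsia": 1, "age_over_50": 0},
]


def coalition_value(model, x, known, background):
    """Mean model output when features in `known` take the values in `x`
    and all other features take the values of each background record."""
    total = 0.0
    for row in background:
        mixed = {f: (x[f] if f in known else row[f]) for f in FEATURES}
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = coalition_value(model, x, set(subset) | {f}, background)
                without_f = coalition_value(model, x, set(subset), background)
                contribution += weight * (with_f - without_f)
        phi[f] = contribution
    return phi


# Hypothetical patient with all three risk indicators present.
patient = {"polyuria": 1, "polydipsia": 1, "age_over_50": 1}
phi = shapley_values(toy_tree, patient, BACKGROUND)
baseline = coalition_value(toy_tree, patient, set(), BACKGROUND)

for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
print(f"baseline: {baseline:.3f}, prediction: {toy_tree(patient):.3f}")
```

The attributions sum to the difference between the patient's predicted score and the average score over the background records, which is the additivity property that makes such explanations easy to read feature by feature.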

Article Details

How to Cite
Chituru, C. M., Ho, S.-B., & Chai, I. (2025). Diabetes Risk Prediction using Shapley Additive Explanations for Feature Engineering. Journal of Informatics and Web Engineering, 4(2), 18–35. https://doi.org/10.33093/jiwe.2025.4.2.2
Section
Regular issue
