International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 11 | Nov 2025

p-ISSN: 2395-0072

www.irjet.net

Retrospective Study on Predictive Indicators of Heart Disease Using Publicly Available Datasets

Mohamed Ameen S, Lecturer, Department of Computer Science, Sri Venkateshwaraa Group of Institutions, Puducherry

Abstract - Heart disease remains one of the leading causes of mortality worldwide. This study uses a public dataset to identify the key indicators that contribute to heart disease using statistical modeling methods. A detailed analysis involving descriptive statistics, histograms, and logistic regression modeling was conducted on a dataset comprising 1025 subjects. Results revealed that factors such as age, sex, chest pain type, fasting blood sugar, and exercise-induced angina significantly influence the likelihood of heart disease. The logistic regression model achieved an AUC of 0.684, indicating moderate predictive capability. These findings highlight the importance of early risk identification and support the use of open datasets in medical research.

Introduction

Cardiovascular diseases (CVD) account for nearly one-third of worldwide fatalities. Early detection and diagnostic assessment significantly improve patient outcomes and reduce healthcare burdens. Predictive analytics and statistical modelling play a critical role in identifying risk factors that influence disease presentation. Public datasets enable large-scale analysis without the ethical concerns associated with direct patient data collection. This manuscript explores predictive variables using logistic regression supported by visual analytics. The content is expanded to a thesis-like depth for use in academic publication, internal assessment, or journal submission.

Literature Review

Cardiovascular disease (CVD) prediction has been a central research area in clinical epidemiology for several decades. Early foundational work emerged from the Framingham Heart Study, which produced widely adopted multivariable risk equations using logistic and Cox regression models. These models incorporated classical risk factors such as age, sex, cholesterol, systolic blood pressure, smoking status, and diabetes. They consistently achieved moderate discrimination (AUC typically between 0.70 and 0.75), serving as the benchmark for subsequent prediction tools. Classical models, while interpretable and clinically intuitive, are limited by linearity assumptions, relatively small feature sets, and reduced adaptability to diverse population groups.

With the evolution of open data repositories such as the UCI Heart Disease Dataset and several Kaggle-based collections, researchers have increasingly used statistical learning techniques to explore risk factor patterns and optimize prediction accuracy. The UCI dataset in particular has been the basis of numerous studies applying logistic regression. It includes variables such as chest pain type, resting blood pressure, fasting blood sugar, serum cholesterol, resting ECG, maximum heart rate (thalach), ST depression, and number of major coronary vessels. These studies consistently identify age, male sex, chest pain type, cholesterol level, and exercise-induced angina as statistically significant predictors. Logistic regression remains widely used because it provides interpretable coefficients, odds ratios, and p-values, making it suitable for evidence-based clinical decision-making.

Recent literature, however, highlights the increasing application of machine learning (ML) models to improve performance beyond traditional logistic regression.
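
To make the interpretability point about logistic regression concrete, the following is a minimal sketch of how coefficients translate into odds ratios and how an AUC is computed. It is not the code used in this study: the data are synthetic, and the feature names (age_z, exang), the "true" coefficient values, the learning rate, and the helper functions fit_logistic and auc are all assumptions introduced for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, epochs=800):
    """Fit an unregularized logistic regression by batch gradient descent.

    Returns (intercept, list of coefficients).
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            # Gradient of the mean negative log-likelihood: (p_hat - y) * x
            err = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += err
            for j in range(p):
                g[j] += err * xi[j]
        b0 -= lr * g0 / n
        b = [bj - lr * gj / n for bj, gj in zip(b, g)]
    return b0, b


def auc(scores, labels):
    """AUC as the probability that a random positive case is scored
    above a random negative case (ties count as 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(
        (1.0 if sp > sn else 0.5 if sp == sn else 0.0)
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))


# Synthetic data (an assumption, NOT the study's dataset):
# age_z is a standardized age, exang a binary exercise-induced angina flag.
rng = random.Random(0)
TRUE_B0, TRUE_B = -0.5, [0.8, 1.2]  # hypothetical generating coefficients
X, y = [], []
for _ in range(400):
    age_z = rng.gauss(0.0, 1.0)
    exang = 1 if rng.random() < 0.3 else 0
    prob = sigmoid(TRUE_B0 + TRUE_B[0] * age_z + TRUE_B[1] * exang)
    X.append([age_z, exang])
    y.append(1 if rng.random() < prob else 0)

b0, coefs = fit_logistic(X, y)

# exp(coefficient) is the odds ratio for a one-unit increase in that feature.
odds_ratios = [math.exp(c) for c in coefs]

scores = [sigmoid(b0 + sum(c * x for c, x in zip(coefs, xi))) for xi in X]
model_auc = auc(scores, y)

for name, oratio in zip(["age_z", "exang"], odds_ratios):
    print(f"{name}: odds ratio = {oratio:.2f}")
print(f"in-sample AUC = {model_auc:.3f}")
```

An odds ratio above 1 indicates that higher values of the feature are associated with higher odds of the outcome, holding the other feature fixed. The AUC printed here is computed on the training data, so it is optimistic; a held-out split or cross-validation would be needed for a fair estimate.
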
Methods such as Support Vector Machines (SVM), Random Forest, K-Nearest Neighbors (KNN), Gradient Boosting, and Artificial Neural Networks (ANN) have demonstrated higher classification accuracy in several comparative studies. For example, Random Forest and Gradient Boosting often outperform linear models due to their ability to capture non-linear relationships and feature interactions. Some studies report accuracies exceeding 90%, although such results often depend on aggressive preprocessing, smaller datasets, or oversampling techniques, raising concerns regarding model generalizability and overfitting.

Explainable AI (XAI) approaches, particularly SHapley Additive exPlanations (SHAP), have recently gained prominence to address the interpretability challenges associated with ML models. Research integrating SHAP with Random Forest, XGBoost,

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal
