
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 10 | Oct 2025 www.irjet.net p-ISSN: 2395-0072
S. Deepika1, S. Amsa2
1PG Student, Department of Computer Application, Jaya College of Arts and Science, Tiruninravur.
2Assistant Professor, Department of Computer Application, Jaya College of Arts and Science, Tiruninravur.
Abstract - Predicting diabetes with high accuracy remains a significant challenge in medical informatics. Existing models, whether traditional machine learning or deep learning, often hit a performance ceiling due to issues like overfitting or an inability to capture the full complexity of patient data. Our work confronts this limitation head-on by proposing a novel hybrid architecture that strategically integrates a Deep Neural Network (DNN) with a Random Forest (RF) classifier. The DNN acts as a powerful feature extractor, uncovering subtle, non-linear patterns within the data, which are then passed to the robust RF for final classification. We trained and tested this model on the Pima Indians Diabetes Dataset from the UCI Machine Learning Repository [1], applying SMOTE to ensure class balance. The results are compelling: our hybrid model achieved 96.4% accuracy, outperforming both standalone Random Forest (91.2%) and Deep Learning (93.0%) models. This demonstrates that the synergy between these two algorithms creates a more reliable and accurate tool for diabetes risk assessment, holding real promise for clinical decision-support systems.
Key Words: Diabetes Prediction, Machine Learning, Random Forest, Deep Learning, Hybrid Model, Feature Importance, Clinical Analytics.
Diabetes mellitus represents a significant worldwide health burden, a persistent metabolic disorder marked by abnormally high concentrations of sugar in the blood. Left unmanaged, it can lead to severe complications, including heart disease, kidney failure, and neuropathy. The cornerstone of mitigating these outcomes is early and accurate detection. Traditional diagnostic methods, while effective, often rely on multiple tests and clinical visits, creating a barrier to rapid screening.
The emergence of machine learning (AI/ML) offers a paradigm shift, enabling the analysis of complex patient datasets to identify at-risk individuals efficiently. However, single-model approaches frequently possess inherent weaknesses. For instance, Random Forest models might overlook complex feature interactions, while Deep Learning models can be data-hungry and prone to overfitting on smaller clinical datasets.
This research is driven by a simple yet powerful question: can we combine the strengths of different algorithms to create a more potent predictor? We answer this by developing a novel hybrid model that fuses the feature extraction prowess of Deep Learning with the classification stability of Random Forest. The key novelty of our approach lies in its architecture: using the DNN not as a classifier, but as an intelligent feature engineering layer that transforms raw input into a higher-order representation optimized for the Random Forest.
To ensure rigorous, reproducible, and comparable results, this study is benchmarked on the Pima Indians Diabetes Dataset from the UCI Machine Learning Repository [1]. The results demonstrate that our hybrid model achieves superior performance, offering a promising path toward reliable, AI-powered clinical decision-support systems.
The application of machine learning to diabetes prediction has evolved significantly, beginning with traditional statistical models. Algorithms such as **Logistic Regression (LR)** and **Support Vector Machines (SVM)** were valued for their interpretability and efficiency on smaller clinical datasets [5]. While these methods provided a foundational baseline, their performance is often limited by an inherent inability to model the complex, non-linear interactions between key risk factors such as glucose, BMI, and insulin levels, which are critical for accurate diagnosis.
To overcome these limitations, ensemble methods like **Random Forest (RF)** became the benchmark for this task [3]. By aggregating predictions from multiple decision trees, RF effectively reduces overfitting and captures non-linear relationships more reliably than single models, typically achieving accuracies of 85-90% on standard benchmarks. However, its performance can plateau, as it may struggle to infer highly complex, abstract patterns directly from the raw feature space without advanced feature engineering.
The recent shift towards **deep learning (DL)** promised a solution through automatic feature representation. Deep Neural Networks (DNNs) can learn intricate hierarchies of features, pushing accuracy boundaries further. However, their "black-box" nature and propensity to overfit on small, tabular clinical datasets remain significant hurdles. This has spurred interest in **hybrid models** that combine the strengths of different algorithms. While recent ensemble techniques like stacking have shown promise, our work introduces a novel sequential pipeline that uniquely leverages a DNN as a dedicated feature extractor for a Random Forest classifier. This synergy is designed to harness the representational power of deep learning while retaining the robustness and stability of an ensemble method, addressing the core limitations of each standalone approach.
While hybrid models have been explored in medical informatics, this study introduces a specific and novel architecture that strategically repurposes a Deep Neural Network (DNN) solely as an advanced, non-linear feature extractor for a tree-based ensemble model. Unlike traditional hybrid approaches that might use complex weighting or parallel models, our sequential DNN-to-RF pipeline is designed to directly address the weaknesses of each component: it leverages the DNN's ability to create high-level, abstract feature representations from raw clinical data, which then provide a more discriminative input space for the inherently robust but sometimes limited Random Forest classifier. This synergistic fusion, to the best of our knowledge, has not been extensively applied and evaluated for diabetes prediction on a clinical dataset. Furthermore, our work provides a granular feature importance analysis derived from this hybrid framework, offering unique insights into the risk factors as perceived by the combined model, thereby enhancing both predictive performance and clinical interpretability.
Data Source and Preprocessing
This study utilized the Pima Indians Diabetes Dataset from the UCI Machine Learning Repository, containing 768 patient records with 8 clinical features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. The dataset was preprocessed through median imputation for missing values and Min-Max normalization to scale features to the range 0-1. To address class imbalance, SMOTE was applied exclusively to the training data after an 80:20 train-test split, ensuring no data leakage into the test set.
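The preprocessing order described above (split first, then impute, scale, and oversample using training data only) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the data are synthetic, missing values are assumed to be marked as NaN, imputation and scaling statistics are assumed to be fitted on the training split, the helper names (`median_impute`, `min_max_scale`, `smote`) are hypothetical, and `smote` is a simplified NumPy version of the technique in [3] rather than a library implementation.

```python
import numpy as np

def median_impute(X_train, X_test):
    """Fill NaNs with per-feature medians computed on the training split only."""
    med = np.nanmedian(X_train, axis=0)
    return (np.where(np.isnan(X_train), med, X_train),
            np.where(np.isnan(X_test), med, X_test))

def min_max_scale(X_train, X_test):
    """Scale features using the training minimum and maximum."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X_train - lo) / span, (X_test - lo) / span

def smote(X_min, n_new, k=5, seed=None):
    """Simplified SMOTE: interpolate between a minority sample and one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic stand-in for an 8-feature table with some missing entries.
rng = np.random.default_rng(0)
X = rng.random((100, 8))
X[rng.random(X.shape) < 0.05] = np.nan
y = (rng.random(100) < 0.35).astype(int)

# 80:20 split BEFORE any fitting, so test rows never influence preprocessing.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train, test = idx[:cut], idx[cut:]
X_tr, X_te = median_impute(X[train], X[test])
X_tr, X_te = min_max_scale(X_tr, X_te)
y_tr, y_te = y[train], y[test]

# Oversample the minority class of the training split only.
counts = np.bincount(y_tr, minlength=2)
minority = int(np.argmin(counts))
n_new = int(counts.max() - counts.min())
X_syn = smote(X_tr[y_tr == minority], n_new, seed=0)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.full(n_new, minority)])
```

Because the synthetic rows are built only from training samples, the held-out test split keeps its original class distribution.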
Model Architecture and Training
Our proposed hybrid model employs a sequential two-stage architecture:
Deep Feature Extraction:
A Deep Neural Network with an [8-64-32] node architecture using ReLU activation and dropout (0.3) was trained for 150 epochs using the Adam optimizer (learning rate = 0.001) with early stopping.
Random Forest Classification:
The 32-dimensional features extracted from the DNN's final hidden layer were used as input to a Random Forest classifier with 200 trees and a maximum depth of 10.
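The two-stage pipeline described above can be sketched with scikit-learn as follows. This is an illustrative approximation, not the implementation used in the study: the data are synthetic (`make_classification`), scikit-learn's `MLPClassifier` stands in for the DNN and offers no dropout layer, so the 0.3 dropout is omitted, and `hidden_features` is a hypothetical helper that replays the hidden layers from the fitted weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 8-feature stand-in (an assumption, not the Pima data).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stage 1: 8-64-32 network, ReLU, Adam (lr=0.001), early stopping,
# at most 150 epochs. Dropout is not available here and is left out.
dnn = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    solver="adam", learning_rate_init=0.001,
                    max_iter=150, early_stopping=True, random_state=0)
dnn.fit(X_tr, y_tr)

def hidden_features(model, X):
    """Forward X through the hidden layers only and return the
    activations of the last hidden layer (32 values per row)."""
    h = X
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)  # ReLU
    return h

F_tr = hidden_features(dnn, X_tr)
F_te = hidden_features(dnn, X_te)

# Stage 2: Random Forest with 200 trees and maximum depth 10,
# trained on the 32-dimensional learned features.
rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
rf.fit(F_tr, y_tr)
acc = rf.score(F_te, y_te)
```

The output layer of the network is discarded at inference time; only the hidden representation is passed on, which is what distinguishes this sequential design from stacking the two classifiers' predictions.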
Experimental Setup:
Performance was compared against two baseline models: a standalone DNN with a sigmoid output layer and a standalone Random Forest trained on the original features. All models were evaluated on the same held-out test set using accuracy, precision, recall, and F1-score metrics.
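For reference, the four metrics named above can be computed from the confusion-matrix counts as in the short sketch below. The labels and predictions are made-up toy values, not results from this study, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,   # share of predicted positives that are correct
        "recall": recall,         # share of actual positives that are found
        "f1": f1,                 # harmonic mean of precision and recall
    }

# Toy example: 4 true positives, 4 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```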
We implemented all models in Python using an 80:20 train-test split. Performance was evaluated using standard classification metrics.
Table 1: Performance comparison of models
Our hybrid model achieved the best performance across all metrics, showing statistically significant improvement over both baseline models (p < 0.05, McNemar's test). The 96.4% accuracy represents a substantial improvement of 5.2 percentage points over RF and 3.4 percentage points over the DNN alone.
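McNemar's test compares two classifiers on the same test cases by looking only at the cases where they disagree. A minimal sketch of the continuity-corrected version is given below; the prediction vectors are invented for illustration and are not the predictions from this study, and `mcnemar_test` is a hypothetical helper. With few discordant cases, an exact binomial variant is usually preferred.

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test for two classifiers on paired data.

    b = cases model A gets right and model B gets wrong
    c = cases model A gets wrong and model B gets right
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, o in rows if a == t and o != t)
    c = sum(1 for t, a, o in rows if a != t and o == t)
    if b + c == 0:
        return b, c, 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return b, c, stat, p_value

# Invented example: 20 positive cases, two models that disagree on 12 of them.
y_true = [1] * 20
hybrid = [1] * 16 + [0] * 4
baseline = [1] * 6 + [0] * 10 + [1] * 2 + [0] * 2
b, c, stat, p_value = mcnemar_test(y_true, hybrid, baseline)
```

Cases that both models classify correctly, or both misclassify, do not enter the statistic, which is why the test suits a shared held-out set.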
The performance gain demonstrates the effectiveness of our sequential architecture. The DNN successfully creates higher-level feature representations that enable the Random Forest to make more accurate classifications. The balanced high scores in both Precision (95.8%) and Recall (96.1%) are particularly valuable for medical applications, where both false positives and false negatives carry significant consequences.
This synergy addresses key limitations of both individual models: the DNN's feature extraction compensates for RF's limited pattern recognition, while RF's ensemble approach provides robustness against overfitting.
To ensure model interpretability and validate clinical relevance, we conducted a comprehensive feature importance analysis using SHAP (SHapley Additive exPlanations) on our hybrid model. This method consistently explains the output of complex models by calculating the contribution of each feature to individual predictions.
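The idea behind SHAP can be illustrated by computing exact Shapley values for a tiny model by brute force. The sketch below is only a conceptual aid under stated assumptions: the scoring function `toy_risk`, its coefficients, the three feature names, and the all-zero baseline are hypothetical, and practical SHAP tooling relies on faster approximations rather than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):                      # subset sizes 0 .. n-1
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_risk(z):
    """Hypothetical risk score with one interaction term."""
    glucose, bmi, age = z
    return 0.6 * glucose + 0.3 * bmi + 0.4 * glucose * bmi + 0.1 * age

x = [1.0, 0.5, 0.2]          # scaled glucose, bmi, age (made-up values)
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk, x, baseline)
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is shared equally between the two features involved.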

The SHAP analysis revealed a clear hierarchy of risk factors that aligns with clinical knowledge, building confidence in the model's decision-making process:
The dominance of Glucose and BMI as the primary predictive factors strongly corroborates established medical understanding of diabetes risk. This alignment demonstrates that our hybrid model, while complex, learns biologically plausible relationships. The DNN's role in creating higher-level feature representations particularly enhanced the model's ability to capture non-linear interactions between these key clinical markers, which the final Random Forest classifier then leveraged effectively.
While our hybrid model demonstrates strong performance, several avenues exist for further development. Our immediate priority is to validate the model's generalizability on larger, multi-institutional datasets to ensure robustness across diverse patient demographics.
To enhance clinical adoption, we plan to integrate Explainable AI (XAI) techniques such as SHAP plots, providing physicians with case-specific insights into the model's predictions. We are also developing a prototype web-based interface to allow real-time diabetes risk assessment in clinical settings.
Longer-term, we aim to adapt this hybrid architecture for predicting other metabolic conditions, such as hypertension and cardiovascular disease, potentially enabling a comprehensive AI-powered diagnostic system for metabolic syndrome management.
This research has conclusively demonstrated the superiority of a hybrid DNN-RF model for diabetes prediction. By leveraging a Deep Neural Network as an intelligent feature extractor for a Random Forest classifier, the proposed architecture achieves a notable accuracy of 96.4%, significantly outperforming both individual models.
The performance gain underscores the synergistic effect of combining deep learning's capacity for identifying complex, non-linear patterns with the robust classification capabilities of an ensemble tree method. Furthermore, the feature importance analysis derived from the model aligns strongly with established clinical knowledge, validating its reasoning process and reinforcing its potential for real-world clinical application. This work provides a reliable, accurate, and interpretable tool for early diabetes risk assessment, paving the way for more effective AI-powered clinical decision-support systems.
[1]. Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
[2]. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
[3]. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
[4]. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765–4774).
[5]. Abdollahi, J., & Nouri-Moghaddam, B. (2022). A stacked ensemble model for breast cancer diagnosis. Computers in Biology and Medicine, 151, 106299. https://doi.org/10.1016/j.compbiomed.2022.106299
[6]. Swapna, G., Vinayakumar, R., & Soman, K. P. (2021). Diabetes detection using deep learning algorithms. ICT Express, 7(4), 457–461. https://doi.org/10.1016/j.icte.2021.02.004