Heart Disease Prediction Using Machine Learning Techniques

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072

Heart Disease Prediction Using Machine Learning Techniques

Sai Bhavan Gubbala

***

Abstract - cardiovascular diseases are responsible for the majorityofdeathsglobally.Inmostsituations,thecertaintyof heart disease is known, but it may be anticipated early utilizing various machine learning methods. A proper diagnosis of heart patients can be made based on variables that are readily apparent. Age, diabetes, hypertension, cholesterol, obesity, and physical activity are a few of the contributing variables. In this study we are going to find an algorithmwhich predicts accurately. The dataset contains 14 attributes/features and 270 instances, out of which 150 have heart disease and 120 are considered to be normal. The data seems to be complete with no null values so we are going to use supervised machine learning models. To find out the best model we are going to compare five machine learning algorithms such as Random Forest classification, Support Vector Machine, AdaBoost Classifier, Logistic Regression and Decision Tree Classifier. The goal of this paper is to do comparative research on machine learning algorithms accuracy. This research is going to be developed further for best accuracy model prediction.

Key Words: Random Forest Classification, AdaBoost Classifier, Logistic Regression, Decision Tree Classifier, Support Vector Machine, Algorithm, Instances.

1.INTRODUCTION

AccordingtotheWorldHealthOrganization,everyyear15 million deaths occur worldwide due to heart disease. The loadofcardiovasculardiseaseisrapidlyincreasingallover the world from the past few years. Many researches have beenconductedinattempttopinpointthemostinfluential factors of heart disease as well as accurately predict the overall risk. Heart Disease is even highlighted as a silent killerwhichleadstothedeathofthepersonwithoutobvious symptoms.Theearlydiagnosisofheartdiseaseplaysavital role in making decisions on lifestyle changes in high-risk patientsandinturnreducethecomplications.Inthisproject, wehavedevelopedandresearchedaboutmodelsforheart disease prediction through the various heart attributes of patientanddetectimpendingheartdiseaseusingMachine learning techniques on the dataset available in the UCI repository, further evaluating the results using confusion matrix and cross validation. The author had used various machinelearningtechniquestopredicttheaccuracyoftest model such as Random Forest, Support Vector Machine, AdaBoostClassifierandwehavecleandatasetwhichcanbe readilyusedwithoutcleaningormodifyingthedataset.

2. LITERATURE OVERVIEW

There any many research works done on heart disease diagnosisbyvariousfactors.Inthisstudy,wearegoingtodo a comparative analysis of different classification and regressionalgorithmsandtheresultsrevealedthatRFhad obtainedanhigheraccuracywhichisthehighestamongstthe other algorithms. Authors had proposed a Random Forest ClassificationforHeartdiseasePredictionbyintegratingPCA and Cluster techniques. This study provided a significant contributionincomputingstrengthscoreswithcompelling predictors in heart disease prediction. Compared to five machine learning algorithms-Support Vector Machine, Decision Trees, Random Forest, AdaBoost Classifier and Logistic Regression-to predict heart disease. Among these algorithms, Random Forest gives the best accuracy, at 85.22%comparedtootheralgorithms.Thesystemevaluates alltheparametersusingthetrainingandtestingtechnique. The dataset is evaluated in python coding language. The coding language is then further processed in a Jupyter notebook which evaluates the process by a step-by-step manner.Differenttypesoftrainingandtestingphaseswere implemented.Intheend,thebestpairoftestingandtraining were selected and used in the process for achieving the highestaccuracy.

3. DATASET INFORMATION

Thedatasetiscollectedfromanonlineresourcealsoknown astheUCIrepository.Thesedays,thedataisavailableevery dayanditisbesttoimplementthemodelonadatasetwhich is available from a reliable resource. The dataset contains variousattributes/featuressuchasage,gender,fastingblood sugar, serum cholesterol, chest pain type, resting electrocardiographic results, exercise induced angina, ST depression,slopeofpeakexercise,restingbloodpressure, numberofmajorvessels,thalassemiaandtarget.Thedataset contains270instanceswith14attributes.TheTable1gives usideaabouthowthedatasetisimplementedbynumerical values Figure 1 presents us a general overview of the datasetbyindicatingthat44%ofpatientsarenotdiagnosed withheartdiseaseand56%ofpatientssuffercardiovascular disease.

7.529 | ISO 9001:2008 Certified Journal | Page77

value:

1Student, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072

3.1 DATASET PREPROCESSING

Attributes Description

Age Ageinyears

Sex Male-1,female-0

ChestPain

Typicalangina-0, Atypicalangina-1, non-anginalpain-2, Asymptomatic-3

Resting Blood Pressure 94-200mminHg

Cholesterol 126-524mg/dl

FastingBloodSugar True-1,False-0

Resting Electrocardiographic Normal-0,ST-Twaveabnoramility-1, Leftventricularhypertrophy-2

MaximumHeart 71-202heartrate

Exercise Induced Angina Yes-1,No-0

STDepression 0-6.2values

Slopeofpeakexercise STsegment Upsloping-0,flat-1,downslopping-2

Number of major vessels 0-3values

Thalassemia 0-normal,1-fixeddefect,2-reversable defect

Heartdisease(Target) No-1,Yes-2

Table - 1:DatasetAttributes

Data correlation, null value verification, loading python libraries,andpartitioningthedatasetintotrainingandtest portions are all steps in the preprocessing of the dataset. Preprocessingprovidesinsightintohowthecharacteristics ofthedataareinfluencingthedata. Individualelementsin thebelowcorrelation,testing,andtrainingdataaredirectly predictiveofheartdiseaseillness.Peoplewithheartdisease areshowninFigure2alongwiththeirlinktootherrelevant variables. The primary factor impacting heart disease has beenfoundasthehighestheartrateattained.Finally,Figure 3 provides us with a comprehensive connection of all features,whichhelpsusidentifythecausesofheartdisease. Thisunderstandingisexceptionalbecauseitdistinguishes betweenbigandsmallelementsinregardtothegoalvalue andisextremelyhelpfulevenwhenthereisalinkbetween them

Fig - 1:PercentageofHeartDisease

Fig -2:Attributesinrespecttoheartdiseasepatients

Certified Journal | Page78

IRJET | Impact Factor value: 7.529 | ISO 9001:2008

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072

ensemble learning approach. It can used to solve for regression or classification problems. The Random Forest Classifieralgorithmisusedanditstestaccuracywas85.22 percent.

5.2 DECISION TREE

DecisionTreeisastructuredclassifierwhereinternalnodes standinforadataset’sfeatures,branchesforthedecisionmaking process, and each leaf node for the classification result.IthastwonodessuchasDecisionnodeandLeafnode. Thetestaccuracywas74.34percent.

5.3 SUPPORT VECTOR MACHINE

Support Vector Machines implement nonlinear class boundaries using a linear model. The target classes are separatedusingsupportvectors.BeforetrainingalinearSVM model to classify the data in a higher-dimensional feature space to deal with a nonlinear condition, the model transforms the input using a mapping function. The test accuracywas67.43percent.

5.4 ADABOOST CLASSIFIER

Fig – 3:Correlationofdata

4. SPLITTING DATA INTO TESTING AND TRAINING

Thetrainingphaseretrievesthecharacteristicsfromthedata whilethetestingproceduredeterminesthemodelbehaviour for prediction. While the testing phase uses dependent factors,thetrainingphaseusesindependentvariables.There aremanyportionsofthedata.Thesearethestepsoftesting andtraining.90%ofthedatasetisusedfortraining,while 10%isusedfortesting.Weconsidertherandomstatetobe1. Todecidehowthedatawillbesplitintotrainandtestindices, weusetherandomstateparameter.Wecanbecertainthat the method will always generate the same set of random integersifwesettherandomstatetoaconstantvalue.The algorithmsarealreadysetupinaJupyternotebook.

FIG - 4: Splittingdata

5. METHODOLOGY

Thereareseveralsortsofalgorithms,someofwhichdiffer fromoneanotherwhileothersaresimilar;nonetheless,we nowuseallofthesealgorithmstocomparetheiraccuracyso that the one with the best accuracy may be chosen for improvedpredictability.

5.1 RANDOM FOREST CLASSIFIER

Alargenumberofdecisiontreesarebuiltduringthetraining phase of the random forests or random decision forests

An AdaBoost classifier is a meta-estimator that first fits a classifierontheoriginaldataset,thenfitsadditionalcopiesof the classifier on the same dataset with the weights of instancesthatwereincorrectlyclassifiedbeingchangedso thatsubsequentclassifiersconcentratemoreonchallenging cases.Thetestaccuracywas78.59percent.

5.5 LOGISTIC REGRESSION

Logistic Regression is also known as Analytical modelling technique.Itisutilizedtoanalyzedatasetswithoneormore independentfactorsthataffecttheoutcome.Arandomstate of0wasimportedagainforlogisticregression.Thetraining modelwasthenfitted.Thetest’saccuracywas81.23percent.

7.529 | ISO 9001:2008 Certified Journal | Page79

Factor value:

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072

6. FLOWCHART OF DIAGRAM

Fig

-4:FlowchartofResearch

The flowchart explains how the dataset is utilized while developing a prediction model. Understanding this study reportrequiresunderstandingthisflowchart.

7. RESULTS

Accordingtothetable,RandomForesthasthebestaccuracy of all the classifiers, at 85.22 percent. Python was used to trainthemodels,splitthedatasetintotrainingandtestdata, andmeasurehowaccuratetheywere.Belowisacomparison of the algorithms' results, along with a table listing their accuracypercentages Sl. No. Algorithms

Fig –5:RandomForestAccuracy

8. CONCLUSIONS

Therefore,wehavefinalizedthatRandomForestClassifier hasattainedthehighestlevelofaccuracy,asindicatedbythe numericalfigures.AlthoughLogisticRegression,AdaBoost Classifier,DecisionTree,andSupportVectorMachineallhad accuracy levels greater than 50%, they underperformed Random Forest. When compared to the other methods discussed, it can be reasoned that the random forest approachmayalsoattainsuperioraccuracy.

9.FUTURE SCOPE

Additionally, researchers may contrast this research’s hierarchy levels to those from other datasets and draw interestingconclusions.Theauthors'intendedstudywillaid in the creation of more accurate, reliable, and productive methodsforpredictingsickness,whichwillbenefitnotjust themedicalcommunitybutalsonumerousothergroupsand individualsthroughouttheglobe.

REFERENCES

[1] PouriyehS,VahidS,SanninoG,DePietroG,ArabniaH, Gutierrez J. A comprehensive investigation and comparison of machine learning techniques in the domainofheartdisease.In:2017IEEEsymposiumon computers and communications (ISCC). IEEE. p. 204–207.

[2] https://archive.ics.uci.edu/ml/datasets/statlog+(heart).

[3] "Prediction of Heart Disease using Machine Learning Algorithms" Krishnan J Santhana and S Geetha ICIICT |Year:2019|ConferencePaper|Publisher:IEEE

[4] Bouali H, Akaichi J. Comparative study of different classification techniques: heart disease use case. In: 201413thinternationalconferenceonmachinelearning andapplications.IEEE.p.482–86.

[5] World Health Organization. http://www.who.int/cardiovascular diseases/en. 2019.I.S.JacobsandC.P.Bean,“Fineparticles,thinfilms and exchange anisotropy,” in Magnetism, vol. III, G.T.

©

2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page80

1. RandomForestClassifier 85.22% 2. LogisticRegression 81.23% 3. AdaBoostClassifier 78.59% 4. DecisionTree 74.34% 5. SupportVectorMachine 67.43%

Accuracy

Table -2:Performanceevaluationofalgorithms

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 10 | Oct 2022 www.irjet.net p-ISSN: 2395-0072

RadoandH.Suhl,Eds.NewYork:Academic,1963,pp. 271-350.

[6] ChauhanR,BajajP,ChoudharyK,GigrasY.Framework to predict health diseases using attribute selection mechanism. In: 2015 2nd international conference on computing for sustainable global development (INDIACom).IEEE.p.1880–84.

[7] "Effective Heart Disease Prediction Using Hybrid MachineLearning Techniques" Senthil Kumar, Mohan Chandrasegar Thirumalai and Gautam Srivastva |Year :2019|ConferencePaper|Publisher:IEEE.

Turn static files into dynamic content formats.

Create a flipbook