Heart disease classification using Random Forest


Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN:2395-0072


Arpit Gupta, Ankush Shahu, Masud Ansari, Nilalohit Khadke, Prof. Ashwini Urade

Students, Department of Computer Science, J D College of Engineering and Management, Nagpur, India
Professor, Department of Computer Science, J D College of Engineering and Management, Nagpur, India

Abstract: Cardiovascular disease is still the leading cause of death worldwide, and the early prediction of heart disease is of great importance. In this paper, we propose a supervised learning algorithm for early prediction of heart disease using historical patient medical records and compare the results with a well-known supervised classifier, Random Forest. Patient record information is classified using a CNN (Cascade Neural Network) classifier. At the classification stage, 13 features are provided as input to the CNN classifier to determine heart disease risk. The proposed system will help doctors diagnose diseases more efficiently. The effectiveness of the classifier was tested on 303 patient records. The raw data comes from a combination of 4 databases: Cleveland, Hungary, Switzerland and VA Long Beach, from the UCI Machine Learning Repository. The results suggest that CNN classifiers can more effectively predict the likelihood of heart disease. The proposed method allowed the model to achieve an accuracy of 95.17% in predicting heart disease. Experimental results show that our algorithm improves the accuracy of heart disease diagnosis.

Keywords: Random forests, heart disease prediction, machine learning.

I. INTRODUCTION

Many medical data records created by medical professionals are available for analysis. Data mining techniques are methods of extracting valuable and hidden information from large amounts of available data. Medical databases consist mostly of fragmentary information. Therefore, making decisions using discrete data sets becomes a complex and difficult task. Machine learning (ML), as a subfield of data mining, can efficiently handle well-structured large-scale datasets. In medicine, machine learning can be used to diagnose, detect and predict various diseases.

The main purpose of this article is to provide a tool to help doctors detect heart disease at an early stage. This will help to effectively treat patients and avoid serious consequences. ML plays a very important role in detecting hidden discrete patterns and thus supports data analysis. After analyzing the data, machine learning technology helps predict heart disease and make an early diagnosis. This article presents an analysis of the performance of random forest techniques in the early prediction of heart disease.

II. RELATED WORK

The human heart is the main organ of the human body. Any type of disturbance in the normal functioning of the heart can be classified as heart disease. In today's modern world, heart disease is one of the leading causes of death. Heart disease can be caused by an unhealthy lifestyle, smoking, drinking alcohol, and eating too much fat. According to the World Health Organization, more than 10 million people worldwide die of heart disease every year.

A healthy lifestyle and early detection are the only ways to prevent heart disease. The greatest challenge in healthcare today is to provide the highest quality of service and accurate and efficient diagnosis. Although heart disease has proven to be the leading cause of death worldwide in recent years, it is also a disease that can be effectively controlled and managed. Precise management of a disease depends on detecting it at the right time. The proposed work attempts to detect these heart diseases early enough to avoid catastrophic outcomes.

Background Research:

| Title | Publication, Year (Authors) | Research Gap |
|---|---|---|
| Applying ML classifiers on ECG dataset for predicting heart disease | IEEE, 2021 (Adiba Hossain, Sabitri Sikder) | Current SVM model has 85.49% accuracy. In future, more analysis can be performed with different combinations of algorithms to obtain a better heart disease prediction model. |
| Using machine learning to predict heart disease | CCLAUSA, 2022 (Nikhil Bora) | The lowest accuracy was from Naïve Bayes (79.83%) against the highest accuracy of 94.12% from Random Forest. |
| Latest trends of heart disease prediction using ML and image fusion | Elsevier, 2020 (Manoj Diwakar, P. Singh) | Quality of dataset is an important factor, and thus hospitals should be encouraged to publish high quality datasets. |

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal
| Title | Publication, Year (Authors) | Research Gap |
|---|---|---|
| Heart Disease Prediction | IRJETS, 2022 (Nakkina Rajjani) | The scope is to check the availability of heart disease with fewer tasks and attributes that gives high accuracy and efficiency. |
| Random forest swarm optimization for heart diseases diagnosis | Elsevier, 2021 (Shahrokh Asadi, Michael Kattan) | Many other multi-objective optimization methods appear in the literature, such as non-dominated genetic algorithm II, which can be employed instead of MOPSO. |
| Machine Learning Models for prediction of co-occurrence of diabetes and cardiovascular diseases | Springer, 2022 (Ahmad Abdalrada, Jemal Abawajy) | The model has high accuracy and in the future it can be employed as a tool for web-based and mobile phone applications, thus increasing its reach among people and healthcare providers. |
| Prediction of Heart Disease utilising SVM and ANN | IJEECS, 2021 (Alaa Khaleel Faieq) | SVM is employed currently. In the future, other techniques can be applied to predict other heart diseases using the same data. |
| Improving the prediction of Heart Failure Patients' Survival using SMOTE and Data Mining Techniques | IEEE, 2021 (Abid Ishaq, Muhammad Umer) | To improve the performance of ML models, better feature selection techniques can be devised. In this case, metaheuristics can be used due to the NP-hard nature of feature selection problems. |

III. METHODOLOGY

A. Data Collection and Preprocessing

The dataset used is the Cardiology dataset, which is a combination of 4 different databases, but only the UCI Cleveland dataset was used. The database contains a total of 76 attributes, but all published experiments reference a subset of only 14 of them. Therefore, we use the processed UCI Cleveland dataset available on the Kaggle website for analysis. Table I below gives a full description of the 14 attributes used in the proposed work.

TABLE I. FEATURES SELECTED FROM DATASET

| Sl. No. | Attribute | Description | Distinct Values of Attribute |
|---|---|---|---|
| 1 | Age | Represents the age of a person | Multiple values between 29 & 71 |
| 2 | Sex | Describes the gender of the person (0 - Female, 1 - Male) | 0, 1 |
| 3 | CP | Represents the severity of chest pain the patient is suffering | 0, 1, 2, 3 |
| 4 | RestBP | Represents the patient's BP | Multiple values between 94 & 200 |
| 5 | Chol | Shows the cholesterol level of the patient | Multiple values between 126 & 564 |
| 6 | FBS | Represents the fasting blood sugar of the patient | 0, 1 |
| 7 | Resting ECG | Shows the result of ECG | 0, 1, 2 |
| 8 | Heartbeat | Shows the max heart beat of the patient | Multiple values from 71 to 202 |
| 9 | Exang | Used to identify if there is an exercise induced angina (yes = 1, no = 0) | 0, 1 |
| 10 | OldPeak | Describes the patient's ST depression level | Multiple values between 0 and 6.2 |
| 11 | Slope | Describes the patient's condition during peak exercise. It is divided into three segments (Upsloping, Flat, Downsloping) | 1, 2, 3 |
| 12 | CA | Result of fluoroscopy | 0, 1, 2, 3 |
| 13 | Thal | Test required for a patient suffering from pain in the chest or difficulty in breathing. There are 4 kinds of values which represent the Thallium test | 0, 1, 2, 3 |
| 14 | Target | The final column of the dataset. It is the class or label column and represents the number of classes in the dataset. This dataset has binary classification, i.e. two classes (0, 1). Class "0" represents a lower possibility of heart disease, whereas "1" represents a high chance of heart disease. The value "0" or "1" depends on the other 13 attributes | 0, 1 |

Proposed Model: Fig. 1 shows the entire process involved.
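As a minimal sketch of how the categorical value sets in Table I could be checked during preprocessing, the snippet below flags fields whose value is missing or not listed in the table. The short lowercase field names and the sample record are assumptions made for illustration; they are not taken from the paper.

```python
# Allowed values for the categorical attributes, as listed in Table I.
# The dictionary keys are hypothetical column names chosen for this sketch.
ALLOWED = {
    "sex": {0, 1},
    "cp": {0, 1, 2, 3},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {1, 2, 3},
    "ca": {0, 1, 2, 3},
    "thal": {0, 1, 2, 3},
    "target": {0, 1},
}


def invalid_fields(record):
    """Return the names of categorical fields that are missing or hold a value not in Table I."""
    return sorted(
        name for name, allowed in ALLOWED.items() if record.get(name) not in allowed
    )


# Hypothetical patient record, used only to exercise the check.
good_record = {"sex": 1, "cp": 2, "fbs": 0, "restecg": 1, "exang": 0,
               "slope": 2, "ca": 0, "thal": 3, "target": 1}
bad_record = dict(good_record, cp=7)

print(invalid_fields(good_record))  # no problems expected
print(invalid_fields(bad_record))   # the out-of-range chest pain code is flagged
```

A check of this kind would typically run before the exploratory analysis, so that malformed rows are noticed early.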

Data exploration, also known as exploratory data analysis (EDA), is an essential step in the machine learning process. It involves analyzing datasets to better understand the data and identify patterns, relationships, and anomalies. While exploring data, we use a variety of statistical and visualization techniques to summarize and describe data. These techniques include:

1) Descriptive statistics: the mean, median, mode, standard deviation, correlation and other statistics are used to summarize the data.

2) Data Visualization: Histograms, scatter plots, box plots, heat maps, and other visualizations are used to visually explore data and identify patterns and trends.

3) Dimensionality reduction: Principal component analysis (PCA), t-SNE, and other techniques are used to reduce the dimensionality of the dataset and visualize it in a low-dimensional space.

4) Outlier detection: Outliers are identified and analyzed to determine whether they are true data points or false data points.
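The descriptive statistics and outlier detection steps listed above can be sketched with the Python standard library alone. This is an illustrative sketch, not the analysis code used in this work: the 1.5 x IQR rule (Tukey's rule) is one common outlier criterion chosen here as an assumption, and the cholesterol readings are made-up values.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical cholesterol readings; not taken from the dataset.
chol = [210, 233, 250, 204, 236, 199, 564, 245, 226, 212]
summary = describe(chol)
outliers = iqr_outliers(chol)
print(summary)
print(outliers)
```

A flagged value still has to be judged by hand: a reading of 564 may be a genuine measurement rather than a recording error.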

Model performance is calculated and analyzed based on different metrics, such as accuracy, precision, recall, and F-score.

C. Training data using random forest

Random Forest: Random Forest is a popular machine learning algorithm used for classification, regression, and other tasks. It is an ensemble learning method that combines multiple decision trees to make more accurate predictions. In a random forest, each decision tree is created on a random subset of the original dataset. Each decision tree in the forest is built using a different subset of features and training examples. The process of building a tree is repeated until the specified number of trees has been created.

To make a prediction with a random forest, we pass the input through each tree in the forest and obtain a classification or regression decision from it. The predictions from each tree are then combined to form the final prediction. The combination is done by majority vote (in classification) or by averaging (in regression).
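The bagging and majority-vote idea described above can be illustrated with a deliberately small sketch. To stay short it uses one-split trees (decision stumps) on a single randomly chosen feature, which is a simplification of a real random forest; the function names, the toy (age, resting BP) rows and the parameter defaults are all assumptions for illustration and not the implementation used in this work.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, choosing the threshold with the fewest errors."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample of the rows and a randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        feature = rng.randrange(n_features)          # random feature for this tree
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Classify a row by majority vote over all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy, clearly separable data: (age, resting BP) -> 0 / 1. Values are made up.
rows = [(35, 110), (40, 115), (42, 120), (45, 118),
        (58, 150), (60, 145), (63, 160), (66, 155)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict(forest, (30, 105)), predict(forest, (70, 170)))
```

For regression, the vote in `predict` would be replaced by the mean of the tree outputs.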

The advantages of the random forest algorithm are:

1) High accuracy: Random forests are known for the high accuracy of their predictions.

2) Robustness: Random forests are less prone to overfitting than individual decision trees.

3) Ease of use: Random Forest does not require extensive data preparation and can handle missing data.

4) Feature Importance: Random forests provide a measure of feature importance that can help identify the most important features for prediction.
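One model-agnostic way to estimate feature importance is permutation importance: shuffle one feature column and measure how much the accuracy drops. The sketch below illustrates that idea only; it is not necessarily the importance measure behind the results reported here (tree libraries often report an impurity-based score instead), and the stand-in model and data are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for f in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[f] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:f] + (v,) + r[f + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in model: uses feature 0 (say, age) and ignores feature 1.
def model(row):
    return 1 if row[0] > 50 else 0


rows = [(35, 1), (40, 0), (42, 1), (45, 0), (58, 1), (60, 0), (63, 1), (66, 0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(model, rows, labels)
print(importances)
```

In this toy case the ignored feature gets an importance of exactly zero, while shuffling the feature the model relies on lowers the accuracy.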

Optimizing the accuracy of the model: The initial predictions of the model are not always correct. To further improve accuracy and precision, we performed the following steps:

Figure 1: Feature Importance

B. Classification:

The input dataset is split into a training dataset (80%) and a test dataset (the remaining 20%). The training dataset is the dataset used to train the model. The test dataset is used to verify the performance of the trained model.
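A minimal sketch of an 80/20 split is shown below. The helper name, the fixed seed and the placeholder records are assumptions for illustration; only the 80/20 ratio and the count of 303 records come from the text.

```python
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=42):
    """Shuffle the row indices once, then hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# 303 placeholder records standing in for the patient records.
rows = [[i] for i in range(303)]
labels = [i % 2 for i in range(303)]
X_train, X_test, y_train, y_test = train_test_split(rows, labels)
print(len(X_train), len(X_test))
```

Shuffling before splitting avoids a test set that reflects only the order in which records were stored, and fixing the seed keeps the split reproducible.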

1) Hyperparameter tuning: Hyperparameter tuning is the process of selecting the best set of hyperparameters for a machine learning model. Hyperparameters are parameters that are not learned during training, but are set before training begins, such as the learning rate, the strength of regularization, or the number of hidden units in a neural network.

Hyperparameter settings are important because model performance depends on the hyperparameters chosen. We do this through trial and error by training models with different combinations of hyperparameters and evaluating their performance on the validation set.

We automate this using techniques such as grid search, random search, or Bayesian optimization.


Proper tuning of hyperparameters can significantly improve the performance of a machine learning model, while improper tuning can cause the model to perform poorly or even fail completely.
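Grid search, the simplest of the automated techniques named above, can be sketched as an exhaustive loop over all hyperparameter combinations. In the sketch below, `evaluate` is a hypothetical stand-in for "train a model with these settings and return its validation score"; its formula, the parameter names `n_trees` and `max_depth`, and the grid values are invented for illustration and do not represent measured results.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Score every combination in `grid` and return the best (params, score) pair."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def evaluate(params):
    """Hypothetical scoring function; a real one would train and validate a model."""
    return 0.80 + 0.0005 * params["n_trees"] - 0.01 * abs(params["max_depth"] - 5)


grid = {"n_trees": [50, 100, 200], "max_depth": [3, 5, 8]}
best_params, best_score = grid_search(evaluate, grid)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (nine evaluations here), which is why random search or Bayesian optimization is often preferred for larger search spaces.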

2) Confusion matrix: A confusion matrix is a table that summarizes the performance of a classification model on a set of test data whose true values are known. It is a way to visualize the performance of machine learning algorithms by comparing predicted and actual values. The confusion matrix is typically a 2x2 matrix for binary classification, with four possible outcomes:

True Positive (TP): The model predicts a positive outcome and the actual value is positive.

False Positive (FP): The model predicts a positive outcome, but the actual value is negative.

True Negative (TN): The model predicts a negative outcome and the actual value is negative.

False Negative (FN): The model predicts a negative outcome, but the actual value is positive.

The confusion matrix can be used to calculate various evaluation metrics for classification models, such as accuracy, precision, recall, F1 score, etc. These metrics provide insight into model performance and help identify areas for improvement.
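The four outcomes and the metrics derived from them can be computed as in the sketch below, which uses the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of precision and recall. The label vectors are made-up examples, not results from this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up true labels and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
scores = classification_metrics(*counts)
print(counts, scores)
```

In a screening setting, recall deserves particular attention, because a false negative corresponds to a patient with heart disease who is not flagged.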

D. Result and Analysis:

With the increasing number of deaths from heart disease, it is imperative to develop an efficient and accurate heart disease prediction system. The motivation for the study was to find the most efficient ML algorithm for detecting heart disease. The random forest algorithm achieved 86% accuracy in predicting heart disease. In the future, the work could be improved by developing a web application based on the Random Forest algorithm and by using a larger data set than the one used in this analysis, which would lead to better results and help health professionals predict heart disease effectively and efficiently.

IV. FUTURE WORK

The quality of the data used to train a model has a significant impact on the final predictions of the model. Future work may involve improving data quality by cleaning and pre-processing the data more thoroughly or by collecting more data. Additionally, collecting data from patients of all age groups can lead to significant improvements. More fine-tuning of hyperparameters can be performed to help improve model accuracy. Heart disease can also manifest in different ways, so classifiers constructed for different outcomes (e.g. heart attack, stroke, cardiac arrhythmia) may be more helpful. This can help develop more targeted interventions and improve overall health.

Figure 2: Confusion matrix

Figure 3: Cross-validated classification metrics

Figure 4: User interface

V. CONCLUSION

In this paper, the Random Forest data mining algorithm was implemented to predict heart disease. In the proposed work, we achieved a classification accuracy of 86.9% for predicting heart disease, with a diagnosis rate of 93.3%, using the random forest algorithm. As an extension of this work, different types of classifiers can be included in the analysis and further sensitivity analysis can be performed. The approach can also be extended by applying the same analysis to data sets for other diseases in bioinformatics and evaluating how well these classifiers classify and predict those diseases. Cloud computing technology can also be used for the proposed system to manage large volumes of patient data.

