Prediction of Used Car Prices using Machine Learning Techniques by IRJET Journal

Prediction of Used Car Prices using Machine Learning Techniques

1, 3, 4 Student, Dept. of IT Engineering, Thakur College of Engineering and Technology, Mumbai, Maharashtra, India. 2 Student, Dept. of Information Technology, St. Francis Institute of Technology, Mumbai, Maharashtra, India

Abstract - The manufacturer sets the price of a new car, and the government adds further costs in the form of taxes. Customers purchasing a new car can therefore be reasonably sure that their investment is worthwhile. However, because new car prices are rising and many buyers cannot afford them, used car sales are increasing globally. As a result, a used car price prediction system that efficiently assesses the worth of a car from a range of factors is required. In the current system, a dealer often decides on a price arbitrarily and the buyer has no knowledge of the car or its current worth. In practice, the seller may also have no clear idea of what the car is worth or what price to charge for it. To address this issue, we have devised a regression-based prediction model. Regression algorithms are employed because they produce a continuous value rather than a categorical value as output. As a result, rather than predicting a car's price range, it is feasible to estimate its actual price. A user interface has also been created that takes input from any user and shows the price of a car based on those inputs.

Key Words: Used Car Price Prediction, Regression Algorithms, Machine Learning, Linear Regression, Ridge and Lasso Regression, Bayesian Ridge Regression, Decision Tree, Random Forest, XGBoost, Gradient Boosting.

1. INTRODUCTION

Determining whether the quoted price of a used car is fair is difficult owing to the numerous factors that influence a used vehicle's market price. The goal of this research is to create machine learning models that can accurately predict the price of a used car based on its features so that buyers can make informed choices. We build and analyze several learning algorithms using a dataset that includes the selling prices of various brands and models. We compare the following machine learning algorithms and choose the best performer: Linear Regression, Lasso Regression, Ridge Regression, Bayesian Ridge Regression, Decision Tree Regression, Random Forest Regression, XGBoost Regression, and Gradient Boosting Regression. The price of the car is determined by a number of factors. Regression algorithms are used because they produce a continuous value rather than a categorized value, allowing us to predict the actual price of a car rather than its price range. A user interface has also been created that takes input from any user and shows the price of a car based on the inputs.


2. LITERATURE REVIEW

[1] Various studies have been conducted in order to predict the price of used cars. Researchers regularly anticipate product prices using past data. Pudaruth predicted the prices of used cars in Mauritius, employing multiple linear regression, k-nearest neighbours, Naive Bayes, and decision tree techniques. When the prediction results from the various strategies were compared, it was found that the prices from these methods were quite similar. However, the decision tree technique and the Naive Bayes approach were found to be incapable of classifying and predicting numeric values. According to Pudaruth's research, a small sample size does not give good prediction accuracy.

[2] Kuiper, S. (2008) demonstrated a multivariate regression model that helps in classifying and predicting values in numeric format. It demonstrates how to apply this multivariate regression model to forecast the price of 2005 General Motors (GM) vehicles. Price prediction for cars does not require any special knowledge, so the data available online is enough to predict prices. The author also introduced variable selection techniques that helped in finding which variables were more relevant for inclusion in the model.

[3] In 2019, Pal et al. presented a methodology for predicting used car prices using Random Forest. The paper evaluated used car price prediction on a Kaggle data set and reported an accuracy of 83.62% on test data and 95% on training data. The most relevant features used for this prediction were price, kilometer, brand, and vehicle type, identified after filtering out outliers and irrelevant features of the data set. Being a sophisticated model, Random Forest provided good accuracy in comparison to prior work using these data sets.

[4] Gegic, E. et al. (2019) demonstrate the need to create a model to forecast the cost of secondhand cars in Bosnia and Herzegovina. They used machine learning techniques such as artificial neural networks, support vector machines, and random forests, and these methods were used in combination. A web scraper, written in the PHP programming language, was used to gather the data for the forecast from the website autopijaca.ba. Then, to determine which method best suited the data, the respective performances of the various algorithms were compared. A Java application contained the final prediction model. Additionally, the model's accuracy was found to be 87.38% when it was verified using test data. Dholiya et al. demonstrated a machine learning-based method for auto resales in 2019.

7.529 | ISO 9001:2008 Certified Journal | Page355

Eesha Pandit1 , Hitanshu Parekh2 , Pritam Pashte3 , Aakash Natani4

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 12 | Dec 2022 www.irjet.net p-ISSN: 2395-0072

[5] The goal of the system that Dholiya, M., et al. developed is to give the user a realistic estimate of how much the vehicle might cost them. Based on the specifics of the automobile the user is looking for, the system, which is a web application, may also offer the user a list of options for various kinds of car. It assists in providing the buyer or seller with useful information on which to base their decision. This system makes predictions using the multiple linear regression algorithm, and the model was trained using historical data that was obtained over an extended period of time. The raw data was initially gathered using the KDD (Knowledge Discovery in Databases) process. Afterward, it underwent preprocessing and cleaning in order to identify valuable patterns and then derive some meaning from those patterns.

[6] Richardson conducted his analysis under the presumption that automakers are more inclined to produce cars that don't lose value quickly. Using multiple regression analysis, he demonstrated, in particular, that hybrid cars are better able to maintain their value than conventional vehicles. This is perhaps because of increasing concerns about the environment and the climate, as well as because hybrids use less gasoline. In this study, the significance of additional variables including age, mileage, make, and MPG (miles per gallon) was also taken into account. All of his information was gathered from several websites.

[7] Listiani published a comparable study that uses Support Vector Machines (SVM) to forecast lease car pricing. This study demonstrated that when a very large data set is available, SVM is significantly more accurate at price prediction than multiple linear regression. SVM is also superior at handling high-dimensional data and steers clear of both under- and over-fitting problems. A genetic algorithm is used to find the crucial features for SVM. However, the method does not demonstrate why SVM is superior to basic multiple regression in terms of variance and mean standard deviation.

3. TECHNOLOGY USED

Python is mainly used in this project to implement machine learning algorithms since it offers many built-in methods in the form of packaged libraries and modules. During project implementation, the Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn, Plotly, and Pickle libraries were used.

The following technologies were used to build the web application: HTML, CSS, Flask, Jsonify, and Requests.

4. METHODOLOGY

Figure 1: Workflow of Study

4.1 Data Gathering

The source of the data is the web portal Kaggle.com, where vehicle datasets are provided by Cardekho for the sale and purchase of cars. The dataset contained the following features: car name, year, selling price, present or current price, kilometres driven, fuel type (diesel, petrol, or CNG, i.e. compressed natural gas), seller type (dealer or individual), transmission (automatic or manual), and owner (number of previous owners).

4.2 Create Environment

An environment is created using the Anaconda prompt. This environment separates our project area from the default environment (base) and from other previously created environments. All the packages, libraries, and modules that we need can be manually installed in the environment created in this way, which makes it an advantageous step. In such an environment, we can make changes according to our needs.

4.3 Data Reading

The first step is to import and read the CSV file for the research. The dataset is examined extensively in terms of null values, shape, columns, numerical and categorical features, unique values of each feature, data information, and so on.
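The checks described above can be sketched with Python's standard library. This is an illustrative sketch, not the authors' code: the column names are assumed to follow the underscore style of `Fuel_Type_Diesel` used later in the paper, the three rows are made-up sample data, and `summarize` is a hypothetical helper. With Pandas, which the study uses, `pandas.read_csv` together with `DataFrame.isnull` and `DataFrame.nunique` would provide the same information.

```python
import csv
import io

# Hypothetical sample rows; the study itself reads the Cardekho CSV from Kaggle.
SAMPLE_CSV = """Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
car_a,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
car_b,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
car_c,2017,7.25,,6900,Petrol,Individual,Automatic,1
"""


def summarize(csv_text):
    """Return the shape, per-column missing counts, and per-column unique counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    # An empty string marks a missing value in the raw CSV.
    missing = {c: sum(1 for r in rows if r[c] == "") for c in columns}
    unique = {c: len({r[c] for r in rows if r[c] != ""}) for c in columns}
    return {"shape": (len(rows), len(columns)), "missing": missing, "unique": unique}


summary = summarize(SAMPLE_CSV)
print("shape:", summary["shape"])
print("missing Present_Price:", summary["missing"]["Present_Price"])
print("unique Fuel_Type:", summary["unique"]["Fuel_Type"])
```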

4.4 Data Pre-processing

Some of the data features were renamed for clarity (Present Price to Initial Price, Owner to Previous Owners), and some features that were not important for analysis were removed. In exploratory data analysis (EDA), we use statistical graphics and other visualisation techniques to describe the important aspects of the data. Top Selling Vehicles, Year vs. Number of Available Vehicles, Selling Price vs. Initial Price, Vehicle Fuel Type, Transmission Type, Seller Type, Age, Selling Price v/s Age, Selling Price v/s Seller Type, Selling Price v/s Transmission, Selling Price v/s Fuel Type, Selling Price v/s Previous Owners, Selling Price v/s Kilometers Driven, pairplot, heatmaps, and other visualisations are used to gain a better understanding of the data. Following EDA, the One Hot Encoding approach is used to deal with the dataset's categorical features. After that, the correlations between the dataset's features are computed and thoroughly analysed by visualising several plots. Finally, the dependent feature (Selling Price) and the independent features (Initial Price, Kilometers Driven, Previous Owners, Age, and so on) are allocated for further processing.

4.5 Train-Test Split

Once the dependent and independent features have been assigned, we proceed with the splitting of the dataset into training and testing data. We use 80% of the data to train our model and 20% to test it.
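An 80/20 split of this kind can be sketched as follows. The sketch uses only the standard library; the function name `train_test_split_80_20`, the seed, and the ten sample rows are hypothetical. In scikit-learn, which the study uses, the same step is typically done with `sklearn.model_selection.train_test_split` and `test_size=0.2`.

```python
import random


def train_test_split_80_20(features, targets, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Ten hypothetical rows: [initial price, kilometres driven, age] -> selling price.
X = [[5.0 + i, 10000 * (i + 1), i % 5] for i in range(10)]
y = [3.0 + 0.5 * i for i in range(10)]

x_train, x_test, y_train, y_test = train_test_split_80_20(X, y)
print(len(x_train), "training rows,", len(x_test), "test rows")
```

Fixing the seed keeps the split reproducible, so that every algorithm is compared on the same test rows.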

4.6 Model Building

Following the train-test split, data preparation is complete and the process of building the model begins. The model is defined, along with a few parameters, for later implementation. The following algorithms are then used for predictive analysis.

Linear Regression: It is a linear approach in statistics for modeling the relationship between a scalar response (the dependent variable) and one or more explanatory (independent) variables. In linear regression, relationships are modelled using linear predictor functions, and the unknown model parameters are estimated from the data.

Lasso Regression: It is a type of linear regression that uses shrinkage, in which the coefficient estimates are shrunk towards zero by an L1 penalty. The Lasso procedure favours simple, sparse models with fewer parameters. It is well suited to models with a high degree of multicollinearity. This approach can also be used if some aspects of model selection, such as variable selection or parameter elimination, need to be automated. The abbreviation 'LASSO' stands for Least Absolute Shrinkage and Selection Operator.

Ridge Regression: It is a regression approach used for tuning a model and analyzing multicollinear data. It implements L2 regularization. When the data is multicollinear, the least-squares estimates are unbiased but have a large variance, and hence the predicted values can be far from the actual values. Ridge regression reduces this variance at the cost of a small bias.

Bayesian Ridge Regression: This regression formulates linear regression using probability distributions rather than point estimates, which gives a natural mechanism for coping with insufficient or poorly distributed data.

Random Forest Regression: Random Forest is a supervised learning algorithm that employs the ensemble learning approach for classification and regression. Random forests are made up of trees that are built in parallel and do not interact while they are built. Random Forest is a meta-estimator that aggregates the outcomes of several predictions, combining numerous decision trees with certain modifications.

XGBoost Regression: XGBoost is a very powerful technique for creating supervised regression models. XGBoost is an ensemble learning strategy that trains individual models (base learners) and then merges them to get a single prediction.

Gradient Boosting Regression: This is a machine learning approach used to construct a prediction model for regression and classification problems. The prediction model is an ensemble of weak prediction models, which are often decision trees. This method often outperforms the random forest method.
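To make the difference between ordinary least squares, Ridge (L2), and Lasso (L1) concrete, the sketch below fits a single-feature model in closed form. It is an illustration under simplifying assumptions, not the models trained in this study: there is one feature, the intercept is not penalised, the objective is the sum of squared errors plus `alpha * b1**2` (Ridge) or `alpha * abs(b1)` (Lasso), and the data points and `alpha` values are hypothetical. Library implementations may scale the penalty differently.

```python
def fit_one_feature(x, y, alpha=0.0, penalty="none"):
    """Fit y ~ b0 + b1 * x and return (b0, b1).

    penalty="none":  minimise sum((y - b0 - b1*x)^2)
    penalty="ridge": add alpha * b1^2   (L2 penalty)
    penalty="lasso": add alpha * |b1|   (L1 penalty)
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

    if penalty == "ridge":
        # The L2 penalty inflates the denominator, shrinking the slope.
        slope = sxy / (sxx + alpha)
    elif penalty == "lasso":
        # Soft-thresholding: a weak enough slope is set exactly to zero.
        shrunk = max(abs(sxy) - alpha / 2.0, 0.0)
        slope = (shrunk if sxy >= 0 else -shrunk) / sxx
    else:
        slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: initial price (x) against selling price (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.9, 1.6, 2.4, 3.1, 4.0]

_, ols_slope = fit_one_feature(x, y)
_, ridge_slope = fit_one_feature(x, y, alpha=5.0, penalty="ridge")
_, lasso_slope = fit_one_feature(x, y, alpha=5.0, penalty="lasso")
_, lasso_zero = fit_one_feature(x, y, alpha=100.0, penalty="lasso")
print(ols_slope, ridge_slope, lasso_slope, lasso_zero)
```

Both penalties pull the slope below the least-squares value, and only the Lasso penalty drives it exactly to zero when `alpha` is large, which is why Lasso can act as a variable selection method.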

5. IMPLEMENTATION

A new feature, Age, which gives the number of years the vehicle has been in use, is added to the final dataset, and the year attribute is dropped.
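A minimal sketch of this step is shown below. The reference year of 2022 is an assumption (the paper does not state which year Age is measured from), and `add_age`, the field names, and the two sample records are hypothetical.

```python
REFERENCE_YEAR = 2022  # Assumption: the paper does not state the reference year.


def add_age(records, reference_year=REFERENCE_YEAR):
    """Return new records that carry an Age field in place of Year."""
    result = []
    for record in records:
        # Copy every field except Year, then derive Age from it.
        updated = {key: value for key, value in record.items() if key != "Year"}
        updated["Age"] = reference_year - record["Year"]
        result.append(updated)
    return result


cars = [
    {"Car_Name": "car_a", "Year": 2014, "Selling_Price": 3.35},
    {"Car_Name": "car_b", "Year": 2017, "Selling_Price": 7.25},
]
cars_with_age = add_age(cars)
print(cars_with_age)
```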

5.1 Exploratory Data Analysis

In this stage, we summarize the major characteristics of the data using statistical graphics and other visualization tools. Various graphs and charts are plotted to gain a better understanding of the dataset and the relationships between its features.

Figure 2: Count w.r.t. Age

Vehicle count in relation to vehicle age: Figure 2 is a bar graph of the number of vehicles of each age.

Figure 3: Selling price v/s Age

Comparison of each vehicle's selling price vs. age: Figure 3 plots the selling price against the age of each car. It shows that the selling price is generally higher for younger cars.


Figure 4: Initial price v/s Selling price

Comparison of Initial Price and Selling Price: Figure 4 shows a positive relationship between Initial Price and Selling Price, which suggests that a car with a higher initial price tends to have a higher selling price.

Figure 5: Kilometers Driven v/s Selling Price

Comparison of Kilometers Driven vs. Selling Price: Figure 5 shows that a vehicle with a high number of kilometers driven tends to have a lower selling price than one with a low number of kilometers driven.

Figure 6: Correlation Heatmap

Heatmap of Correlation Features for the Final Dataset: The correlation between two features measures how close they are to having a linear relationship with each other. Features with a high correlation are more linearly dependent and have a similar effect on the dependent variable, so when two features are highly correlated, one of them can usually be eliminated. Figure 6 shows the correlation heatmap, with darker colors representing high correlation and lighter colors representing low correlation.
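The values behind a correlation heatmap are pairwise Pearson correlation coefficients. The sketch below computes them with the standard library; `pearson`, `correlation_matrix`, and the sample values are hypothetical. With the libraries used in the study, `DataFrame.corr` in Pandas and `seaborn.heatmap` would typically produce and plot the same matrix.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Return {name: {other_name: r}} for every pair of numeric columns."""
    return {p: {q: pearson(columns[p], columns[q]) for q in columns} for p in columns}


# Hypothetical values for three numeric features.
data = {
    "Selling_Price": [3.0, 4.5, 7.0, 2.5, 6.0],
    "Initial_Price": [5.5, 9.0, 10.0, 4.0, 9.5],
    "Age": [8, 9, 5, 11, 6],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In this made-up sample, Selling_Price correlates positively with Initial_Price and negatively with Age, which is the same direction as the relationships described for Figures 3 and 4.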

5.2 One Hot Encoding

The one hot encoding approach is used to deal with the categorical variables in the dataset. It generates a sparse matrix or a dense array, depending on the parameters, while creating a binary column for each category. Fuel Type, Seller Type, and Transmission were the three categorical variables in our dataset. Following one hot encoding, these variables are given a binary representation, so that for a car with a Fuel Type of Diesel, the value of Fuel_Type_Diesel is a binary 1 and the value of Fuel_Type_Petrol is a binary 0. The same procedure is applied to the remaining categorical variables.
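The encoding can be sketched as follows. Because the paper mentions only `Fuel_Type_Diesel` and `Fuel_Type_Petrol`, the sketch assumes that the first category (CNG, in alphabetical order) is dropped and represented by zeros in both columns; that choice, the helper name `one_hot_encode`, and the sample records are assumptions. In Pandas, `pandas.get_dummies` with `drop_first=True` behaves in a comparable way.

```python
def one_hot_encode(records, column, drop_first=True):
    """Replace a categorical column with binary indicator columns."""
    categories = sorted({record[column] for record in records})
    if drop_first:
        # The dropped category is implied when all indicators are 0.
        categories = categories[1:]
    encoded = []
    for record in records:
        row = {key: value for key, value in record.items() if key != column}
        for category in categories:
            row[f"{column}_{category}"] = 1 if record[column] == category else 0
        encoded.append(row)
    return encoded


cars = [
    {"Fuel_Type": "Diesel", "Transmission": "Manual"},
    {"Fuel_Type": "Petrol", "Transmission": "Automatic"},
    {"Fuel_Type": "CNG", "Transmission": "Manual"},
]
encoded = one_hot_encode(cars, "Fuel_Type")
for row in encoded:
    print(row)
```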

Figure 7: Feature Importance

Feature Importance of dataset: The feature importance technique assigns a score to each feature in a feature set based on its usefulness in predicting the target variable. Initial Price is the most relevant feature in the dataset, while Previous Owners is the least important.

5.3 Model Building

After the train-test split of the dataset, data preparation is complete, and the process of building the model begins. For the final implementation, the model is created with a few parameters, such as the algorithm, xtrain, ytrain, xtest, and ytest. After the completion of the model, various algorithms are used to generate the final results.
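A helper with the parameters listed above might look like the sketch below. It is not the authors' implementation: `evaluate_model` and `SimpleLinearRegression` are hypothetical names, the one-feature model stands in for the scikit-learn estimators used in the study, and the data is made up. The four metrics follow their standard definitions; scikit-learn exposes equivalents such as `r2_score`, `mean_absolute_error`, and `mean_squared_error` in `sklearn.metrics`.

```python
import math


class SimpleLinearRegression:
    """One-feature least-squares model with fit/predict methods."""

    def fit(self, x, y):
        n = len(x)
        x_mean, y_mean = sum(x) / n, sum(y) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        self.slope = sxy / sxx
        self.intercept = y_mean - self.slope * x_mean
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]


def evaluate_model(algorithm, x_train, y_train, x_test, y_test):
    """Fit on the training split and score the predictions on the test split."""
    predictions = algorithm.fit(x_train, y_train).predict(x_test)
    n = len(y_test)
    errors = [actual - predicted for actual, predicted in zip(y_test, predictions)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    y_mean = sum(y_test) / n
    total = sum((actual - y_mean) ** 2 for actual in y_test)
    # R2 compares the residual sum of squares with the variation around the mean.
    r2 = 1.0 - sum(e ** 2 for e in errors) / total
    return {"r2": r2, "mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Hypothetical data: initial price (x) and selling price (y).
x_train, y_train = [2.0, 4.0, 6.0, 8.0], [1.1, 2.0, 2.9, 4.0]
x_test, y_test = [3.0, 7.0], [1.6, 3.4]
scores = evaluate_model(SimpleLinearRegression(), x_train, y_train, x_test, y_test)
print(scores)
```

Passing the algorithm as a parameter lets the same function score every regressor on the same split, which keeps the comparison between methods consistent.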


5.4 Developing a Web Application

A web application is then built using HTML, CSS, and JavaScript in the front end and the Flask framework of Python in the back end. This web application allows any user to enter parameters and calculate the estimated selling price of a used car. To view the results, the user must enter values for variables such as year, initial price (in lakhs), kilometers driven, and previous owners, as well as select options for parameters such as fuel type, transmission type, and seller type.
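Before a prediction can be made, the values entered in the form have to be converted into the numeric row that the trained model expects. The sketch below shows one way this conversion might look. Everything specific in it is an assumption: the form field names, the feature order, the reference year of 2022, and the helper `build_features` are hypothetical, and in a Flask application a function like this would be called from the route that handles the form submission.

```python
REFERENCE_YEAR = 2022  # Assumption: the paper does not state the reference year.
FUEL_TYPES = {"Petrol", "Diesel", "CNG"}


def build_features(form):
    """Convert raw form strings into a numeric feature row (hypothetical order)."""
    fuel = form["fuel_type"]
    if fuel not in FUEL_TYPES:
        raise ValueError(f"unknown fuel type: {fuel!r}")
    year = int(form["year"])
    if year > REFERENCE_YEAR:
        raise ValueError("year cannot be later than the reference year")
    return [
        float(form["initial_price"]),  # in lakhs
        int(form["kilometers_driven"]),
        int(form["previous_owners"]),
        REFERENCE_YEAR - year,  # Age replaces Year
        1 if fuel == "Diesel" else 0,  # Fuel_Type_Diesel
        1 if fuel == "Petrol" else 0,  # Fuel_Type_Petrol
        1 if form["seller_type"] == "Individual" else 0,
        1 if form["transmission"] == "Manual" else 0,
    ]


form = {
    "year": "2014",
    "initial_price": "5.59",
    "kilometers_driven": "27000",
    "previous_owners": "0",
    "fuel_type": "Petrol",
    "seller_type": "Dealer",
    "transmission": "Manual",
}
features = build_features(form)
print(features)
```

The row must use the same columns, in the same order, as the data the model was trained on; otherwise the prediction is meaningless even though no error is raised.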

Figure 8: Web Application

6. RESULTS

After applying the regression algorithms to the model, the R2 scores and other evaluation metrics, namely mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), were obtained to compare the performance of each method.

Table 1. Evaluation Metrics of Algorithms

Algorithm                    | R2 Score | MAE    | MSE    | RMSE
Random Forest Regression     | 0.8576   | 0.7583 | 2.6763 | 1.6359
Linear Regression            | 0.8625   | 1.0998 | 2.9823 | 1.7269
Ridge Regression             | 0.8634   | 1.1080 | 2.9632 | 1.7214
Lasso Regression             | 0.8659   | 1.0934 | 2.9071 | 1.7050
Bayesian Ridge Regression    | 0.8695   | 1.0750 | 2.8302 | 1.6823
XG Boost Regression          | 0.8958   | 0.6822 | 2.2584 | 1.5027
Gradient Boosting Regression | 0.9355   | 0.6378 | 1.4111 | 1.1878
Decision Tree Regression     | 0.9544   | 0.6711 | 1.3139 | 1.1462

When the R2 scores of all the regression methods are compared, the Decision Tree algorithm has the best score, 0.9544, which implies that it delivered the most accurate predictions of all the algorithms.

Figure 9: Original v/s Prediction, Decision Tree Regression

In Figure 9, the red line represents the original values of the dataset and the blue line shows the values predicted using Decision Tree Regression. The two lines are close to each other, indicating that the predictions are highly accurate.

7. CONCLUSION

Predicting used car prices is a difficult task due to the large number of features and parameters that must be examined in order to get reliable findings. The first and most important phase is data collection and preprocessing. The model was then defined and built in order to implement algorithms and generate results. After executing various regression algorithms on the model, it was concluded that the Decision Tree algorithm was the top performer, with the greatest R2 score of 0.95, implying that it provided the most accurate predictions, as shown by the Original v/s Prediction line graph. Aside from having the highest R2 score, the Decision Tree also had the lowest Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) scores, indicating that its prediction errors were the lowest of all and that the results obtained were very accurate.

8. FUTURE SCOPE

The developed machine learning model can be exported as a "Python class" and deployed as an open source, ready-to-use price predictor model, which can then be easily integrated with third-party websites. The model could be further optimised with neural networks by designing deep learning network topologies, employing adaptive learning rates, and training on data clusters rather than the entire dataset.

9. REFERENCES

[1] Pudaruth, S. (2014) ‘Predicting the Price of Used Cars using Machine Learning Techniques’, International Journal of Information & Computation Technology, 4(7), pp. 753–764. Available at: http://www.irphouse.com.

[2] Kuiper, S. (2008) ‘Introduction to Multiple Regression: How Much Is Your Car Worth?’, Journal of Statistics Education, 16(3). doi: 10.1080/10691898.2008.11889579.

[3] Pal, N. et al. (2019) ‘How Much is my car worth? A methodology for predicting used cars’ prices using random forest’, Advances in Intelligent Systems and Computing, 886, pp. 413–422. doi: 10.1007/978-3-030-03402-3_28.

[4] Gegic, E. et al. (2019) ‘Car price prediction using machine learning techniques’, TEM Journal, 8(1), pp. 113–118. doi: 10.18421/TEM81-16.

[5] Dholiya, M. et al. (2019) ‘Automobile Resale System Using Machine Learning’, International Research Journal of Engineering and Technology (IRJET), 6(4), pp. 3122–3125.

[6] Richardson, M. (2009) Determinants of Used Car Resale Value. The Colorado College.

[7] Listiani, M. (2009) Support Vector Regression Analysis for Price Prediction in a Car Leasing Application, Technology. Hamburg University of Technology.

[8] https://www.jigsawacademy.com/popularregression-algorithms-ml/

[9] https://www.simplilearn.com/10-algorithmsmachine-learning-engineers-need-to-know-article

[10] https://www.javatpoint.com/machine-learning-lifecycle

[11] https://www.simplilearn.com/tutorials/machinelearning-tutorial/machine-learning-steps