CLUSTERING MODELS FOR MUTUAL FUND RECOMMENDATION
Aayush Shah1, Aayushi Joshi2, Dhanvi Sheth3, Miti Shah4, Prof. Pramila M Chawan5
1,2,3,4 B.Tech Student, Dept. of Information Technology, VJTI College, Mumbai, Maharashtra, India
5 Associate Professor, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
Abstract - The mutual fund industry has expanded significantly, providing investors with several investment options. Mutual fund information is necessary for investors to make prudent investments. Yet, novice investors may find the financial environment complex owing to the abundance of information. A mutual fund recommendation system based on machine learning and data analytics can address this issue. We propose clustering models for recommending mutual funds, informed by theories regarding mutual fund investments and returns.
Key Words: Mutual funds, Clustering models, K-means, DBSCAN, Hierarchical, Agglomerative
1. INTRODUCTION
Mutual fund investing has become an integral component of portfolio management for investors and financial institutions. Yet, choosing the best mutual funds to invest in may be difficult owing to the vast number of possibilities and the complexity of the elements that affect their performance. It is essential to accurately forecast the performance of mutual funds in order to make educated investing selections. In this paper, we describe clustering models for recommending mutual fund investments. The suggested models take into consideration a number of implicit and explicit parameters, such as expense ratios, fund manager experience, past performance, and net asset values, in order to create investment recommendations that correspond to an investor's preferences and risk profile. Models such as K-means, hierarchical clustering, and DBSCAN group mutual funds based on their comparable traits and performance. This allows the models to offer suggestions based not just on the characteristics of individual funds, but also on the performance and behavior of funds with comparable characteristics. Using these clustering techniques, our models provide a solution for investors seeking to improve their mutual fund investments.
2. PROBLEM
2.1 Problem statement
The goal is to propose clustering models for recommending mutual funds. Today, investors lack personalized and accurate recommendations because of the vast amount of data and the complex nature of mutual funds. The existing approaches are limited and may not provide a satisfactory solution for novice investors. Hence, there is a need for an efficient and reliable recommendation system that considers the individual preferences and risk tolerance of investors to provide tailored recommendations for mutual fund investments.
2.2 Problem elaboration
With the rise of online trading platforms, retail investors now have access to a wider variety of investment options, but the sheer number of options can be overwhelming. Additionally, many investors may lack the financial expertise to evaluate the risks and returns of different mutual funds effectively.
A mutual fund recommendation system could provide personalized investment advice based on a user's investment goals, risk tolerance, and other relevant factors. However, designing an effective system would require addressing several challenges. One of the primary challenges is building a model that can accurately predict the performance of different mutual funds based on historical data. This requires identifying relevant features that are predictive of mutual fund returns and developing algorithms that can effectively learn from this data.
Another challenge is ensuring that the system can provide personalized recommendations that reflect each user's unique investment goals and preferences. This requires developing effective methods for capturing user preferences and incorporating them into the recommendation process.
Finally, it is important to ensure that the system is transparent and easy to use for novice investors. This means designing an intuitive user interface that explains the rationale behind each recommendation and provides users with the information they need to make informed decisions.
Overall, a mutual fund recommendation system has the potential to empower novice investors and help them navigate the complex world of mutual fund investments. However, designing an effective system requires addressing several technical and user-facing challenges.
3. DATA
3.1 Data collection
We acquired our data from the Value Research Online website. It is a well-established website that provides financial information and analysis to help investors make informed decisions about their investments. The website offers a wide range of services, including mutual fund research. Additionally, the website follows strict editorial policies to ensure the accuracy and reliability of its content.
The data provided includes critical attributes for equity, debt, and hybrid mutual fund types, which are financial vehicles that investors can use to invest in the financial markets. Equity funds invest in stocks and have higher risk and return potential, while debt funds invest in fixed income securities with a fixed rate of return and lower risk compared to equity funds. Debt funds are further classified based on the duration of the bonds they invest in. Hybrid funds invest in a mix of equity and debt securities, offering a balanced mix of risk and return potential.
3.2 Data preprocessing
The data contained many shortcomings that needed to be dealt with before passing it to the machine learning model. Data was scattered across separate databases with different schemas, and several records had null values. Hence, data integration and data cleaning steps had to be performed. To make the data more usable, the following preprocessing steps were applied.
1) Data integration
For each group of features, data was extracted into a separate CSV file. The columns differed across the three kinds of mutual funds, i.e., Equity, Hybrid and Debt. Hence, a common schema was necessary to unify these records under a common dataset.
2) Feature selection
Based on their relevance, only those features that may help in predicting mutual fund performance were selected.
Data cleaning
a) Dealing with null values:
• Records missing critical features:
There are several records in the dataset where critical features such as Sharpe Ratio, Standard Deviation, and Sortino Ratio are missing. It is difficult to evaluate the risk involved without these features. Hence, records without these features were discarded completely.
• Records missing a non-critical value:
Such missing values were filled with the average value (mean) of the whole column.
b) Dealing with duplicates: Duplicates were deleted.
c) Handling outliers: We used graphical methods such as box plots (box-and-whisker plots) to determine the outliers.
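The cleaning rules above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the column names in `CRITICAL`, the sample rows, and the 1.5 × IQR fence (the numeric rule that box plots draw) are not taken from the original pipeline, which the paper does not list.

```python
from statistics import mean, quantiles

# Hypothetical column names for the critical risk features named in the text.
CRITICAL = ("sharpe_ratio", "standard_deviation", "sortino_ratio")


def clean(records, critical=CRITICAL):
    """Drop rows missing critical features, mean-fill other gaps, drop duplicates."""
    # a) Discard records that lack any critical feature (copy rows to avoid side effects).
    kept = [dict(r) for r in records if all(r.get(c) is not None for c in critical)]

    # a) Fill remaining numeric gaps with the mean of the whole column.
    for col in {k for r in kept for k in r}:
        present = [r[col] for r in kept if isinstance(r.get(col), (int, float))]
        if present:
            fill = mean(present)
            for r in kept:
                if r.get(col) is None:
                    r[col] = fill

    # b) Remove exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for r in kept:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def iqr_outliers(values, k=1.5):
    """c) Values outside the box-plot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


rows = [
    {"name": "A", "sharpe_ratio": 1.0, "standard_deviation": 10.0, "sortino_ratio": 1.2, "expense_ratio": 0.5},
    {"name": "B", "sharpe_ratio": None, "standard_deviation": 12.0, "sortino_ratio": 0.9, "expense_ratio": 1.0},
    {"name": "C", "sharpe_ratio": 0.7, "standard_deviation": 15.0, "sortino_ratio": 0.8, "expense_ratio": None},
    {"name": "A", "sharpe_ratio": 1.0, "standard_deviation": 10.0, "sortino_ratio": 1.2, "expense_ratio": 0.5},
]
cleaned = clean(rows)
flagged = iqr_outliers([1, 2, 3, 4, 5, 100])
```

Here fund B is dropped for its missing Sharpe ratio, fund C receives the column mean for its expense ratio, and the repeated fund A row is removed.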
3) Feature extraction
a) To make the data more expressive, we converted a few categorical columns with only a few distinct values into one-hot encoded vectors. Clustering algorithms usually operate on numerical data, and feeding them categorical data in raw form might produce erroneous results. Hence, in order for the clustering algorithms to work more efficiently and to remove any bias, we converted columns such as fund category and fund style.
b) A few new features were added to extract valuable information from the existing columns. For example, the column called "date" was converted to "age_in_months" by applying appropriate mathematical functions. To work with manager_tenure, only the primary manager's tenure was extracted from an array of managers.
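A minimal sketch of these three transformations follows. The helper names, the `tenure_years` key, and the assumption that the primary manager is the first entry in the array are illustrative only; the paper does not specify them.

```python
from datetime import date


def one_hot(records, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(row)
    return encoded


def age_in_months(launch, today):
    """Whole months elapsed between a fund's launch date and a reference date."""
    months = (today.year - launch.year) * 12 + (today.month - launch.month)
    if today.day < launch.day:
        months -= 1  # the current month is not yet complete
    return months


def primary_manager_tenure(managers):
    """Tenure of the first-listed manager (assumed here to be the primary one)."""
    return managers[0]["tenure_years"] if managers else None


encoded = one_hot([{"Category": "Equity"}, {"Category": "Debt"}], "Category")
fund_age = age_in_months(date(2020, 1, 15), date(2023, 4, 1))
tenure = primary_manager_tenure(
    [{"name": "X", "tenure_years": 5.5}, {"name": "Y", "tenure_years": 2.0}]
)
```

One-hot encoding produces column names of the form `Category_Equity`, which matches the style of the feature names reported later in the paper.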
4) Exploratory data analysis
This step involved analyzing the data and comparing different features with each other. This resulted in a correlation matrix between all the features. Using this matrix, features whose values were extremely correlated with each other had to be removed in order to remove bias. For example, the columns NAV_latest and NAV_previous had a correlation of 1. These columns were combined to form a single column called NAV_latest.
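The correlation-based pruning can be illustrated as follows. The sketch computes Pearson correlation directly and keeps the first column of every highly correlated pair; the 0.95 threshold and the sample values are assumptions, since the paper only says the features were "extremely correlated".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.95):
    """Return column names to keep, skipping any column that is highly
    correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "NAV_latest": [10.0, 20.0, 30.0, 40.0],
    "NAV_previous": [9.0, 19.0, 29.0, 39.0],  # perfectly correlated with NAV_latest
    "expense_ratio": [1.2, 0.4, 1.9, 0.7],
}
selected = drop_correlated(columns)
```

With these values `NAV_previous` is dropped and `NAV_latest` survives, mirroring the example in the text.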
5) Scaling data
Before passing the data to the next step, the data needs to be normalized or scaled so that bigger values don't skew the clustering output. All the numerical values were scaled.
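The paper does not state which scaling method was used. As one common option, the sketch below applies z-score standardization, which gives every column zero mean and unit standard deviation so that a large-valued feature such as net asset value cannot dominate distance computations. The sample values are made up.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score scaling: subtract the column mean, divide by its standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - mu) / sigma for v in column]


nav = [12.0, 250.0, 1800.0]      # spans three orders of magnitude
expense_ratio = [0.4, 1.1, 2.3]  # small values
scaled_nav = standardize(nav)
scaled_expense = standardize(expense_ratio)
```

After scaling, both columns have the same spread, so each contributes comparably to the Euclidean distances used by the clustering algorithms.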
By implementing these steps, we can ensure that the dataset is cleaned, filtered, and transformed into a more useful format for recommendation modeling.
Finally, after performing all these pre-processing steps, the data contained attributes denoting fund type (equity, debt, or hybrid), fund performance metrics such as expense ratio, returns, and fund manager tenure, fund style (growth, value, or blend), and several other numerical attributes such as risk factor, net asset value, standard deviation, and Sharpe ratio.
Volume: 10 Issue: 04 | Apr 2023 www.irjet.net
4. CLUSTERING MODELS
We propose four clustering models:
1) K-means: It is used to cluster and partition data into groups based on similarities by minimizing the sum of squared distances between centroids and data points.
2) Hierarchical: It is used to group data into clusters in a hierarchical manner, based on the distance between data points, without needing to specify the number of clusters beforehand.
3) Agglomerative: It is a hierarchical clustering algorithm that starts with each point as a single cluster and gradually merges them into larger clusters with more points based on their similarity, until all points belong to a single cluster.
4) DBSCAN: It is a density-based clustering algorithm that groups data points together that are closely packed and separated from other clusters, based on a user-defined minimum number of points and a maximum distance between them.
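To make the K-means description concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the implementation used in the study (a library such as scikit-learn would normally be used); the two-dimensional sample points, the random initialisation, and the `max_iter` and `seed` parameters are assumptions for illustration.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


funds = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(funds, k=2)
```

Each update step can only lower the sum of squared distances to the centroids, which is why the procedure converges, though possibly to a local rather than global optimum.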
The process of analyzing and clustering data involves various techniques that can assist in identifying patterns and structures within the data. One such technique is scaling the data to normalize and standardize it, ensuring that different features or variables are comparable and easier to interpret.
The scaled dataset was used to implement these four types of clustering algorithms, i.e. Agglomerative, DBSCAN, Hierarchical, and K-means. The effectiveness of the clusters formed using these algorithms was evaluated against two metrics, inertia and silhouette score.
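Of the four algorithms, DBSCAN differs most from the centroid-based view, so a small sketch may help. The version below is a plain-Python illustration of the standard algorithm, not the study's code; the parameter names `eps` and `min_samples` follow common usage, and the sample points are made up.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE if it is in no dense region."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # core point: keep growing the cluster
        cluster += 1
    return labels


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
density_labels = dbscan(sample, eps=1.5, min_samples=3)
```

Unlike K-means, the number of clusters is not specified in advance, and isolated points such as `(50, 50)` are reported as noise rather than forced into a cluster.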
Inertia measures the sum of squared distances between each point and its assigned centroid in the cluster. A lower inertia value indicates that the clusters are more tightly packed, which is a desirable outcome. Silhouette score measures how well each data point fits into its assigned cluster, by comparing the average distance between the point and other points in its own cluster (cohesion) to the average distance between the point and points in the nearest neighboring cluster (separation).
A high silhouette score (closer to 1) indicates well-separated clusters, while a low score (closer to -1) indicates poorly separated clusters in which points may have been assigned to the wrong cluster.
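Both metrics are short enough to write out. In the sketch below, the silhouette value of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster; the overall score is the average over all points. The sample points and labels are illustrative only.

```python
from math import dist


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def silhouette_score(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # cohesion
        b = min(  # separation: nearest other cluster
            sum(dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(pts, [0, 1, 0, 1])   # mismatched grouping
tightness = inertia(pts, [0, 0, 1, 1], [(0, 0.5), (10, 0.5)])
```

The natural grouping scores close to 1, while the mismatched labelling scores below zero. Inertia is only defined when clusters have centroids, and it always shrinks as the number of clusters grows, so it is most meaningful when comparing runs with the same cluster count.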
By comparing the results of the clustering algorithms against these metrics, it was determined which algorithm produced the most optimal and accurate clusters.
We defined hyperparameter search dictionaries for these clustering algorithms. The parameters for each algorithm were specified with ranges of possible values. Additionally, a dictionary containing a list of features was created to use in the grid search.
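A grid search of this kind can be organised as below. The parameter names, value ranges, and feature lists are hypothetical placeholders, since the paper does not list the ones actually searched; the sketch only shows how search dictionaries expand into individual runs.

```python
from itertools import product

# Hypothetical search space: one dictionary of candidate values per algorithm.
SEARCH_SPACE = {
    "kmeans": {"n_clusters": [2, 3, 4, 5]},
    "agglomerative": {"n_clusters": [2, 3, 4], "linkage": ["ward", "average"]},
    "dbscan": {"eps": [0.3, 0.5, 1.0], "min_samples": [3, 5]},
}

# Hypothetical feature subsets to try with every configuration.
FEATURE_SETS = {
    "risk": ["Standard_Deviation", "sharpe_ratio"],
    "all": ["Standard_Deviation", "sharpe_ratio", "expense_ratio", "Category_Equity"],
}


def expand_grid(grid):
    """Turn {param: [values]} into a list of {param: value} combinations."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


runs = [
    (model, params, feature_set)
    for model, grid in SEARCH_SPACE.items()
    for params in expand_grid(grid)
    for feature_set in FEATURE_SETS
]
```

Each entry of `runs` identifies one clustering job; fitting the model on the named feature subset and recording its silhouette score gives the table from which the best configuration is chosen.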
5. OUTPUT
For each combination of model and hyperparameters, clustering was performed and the results were recorded. We compare these models on the basis of the silhouette score. Fig. 1 shows the top K-means silhouette scores, with a maximum score of 0.256 obtained with 2 clusters having counts of 598 and 328. Similarly, Fig. 2, Fig. 3 and Fig. 4 show the top scores for the Hierarchical, Agglomerative and DBSCAN clustering models respectively, along with their cluster counts.
5.1 Critical clustering features
In order to determine which features were the most important to each clustering approach, we constructed a Random Forest Classifier model.
We set the classifier's split criterion to the Gini index or entropy to identify the key features that drive the formation of distinct clusters in a clustering algorithm.
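The two split criteria can be written out directly. The sketch below assumes the classifier is trained with cluster labels as the target, which the text implies but does not state, and measures how much a single binary feature reduces impurity; a random forest aggregates such reductions over many trees to rank features. The labels and feature values are made up.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_decrease(labels, feature, criterion=gini):
    """Impurity drop obtained by splitting on a binary (0/1) feature."""
    left = [l for l, f in zip(labels, feature) if f == 0]
    right = [l for l, f in zip(labels, feature) if f == 1]
    n = len(labels)
    weighted = sum(len(side) / n * criterion(side) for side in (left, right) if side)
    return criterion(labels) - weighted


cluster_labels = [0, 0, 1, 1]
category_equity = [1, 1, 0, 0]  # separates the two clusters perfectly
uninformative = [1, 0, 1, 0]    # tells us nothing about the cluster

gain_gini = impurity_decrease(cluster_labels, category_equity, gini)
gain_entropy = impurity_decrease(cluster_labels, category_equity, entropy)
gain_none = impurity_decrease(cluster_labels, uninformative, gini)
```

A feature that splits the clusters cleanly removes all impurity, while an uninformative one removes none, which is the intuition behind the feature rankings shown in the figures.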
Fig. 5 shows that the most effective feature while using agglomerative clustering is "Equity_fund_style_Growth", followed by "Standard_Deviation" and "Category_Equity". Likewise, Fig. 6, Fig. 7 and Fig. 8 show the most effective features in the Hierarchical, K-means and DBSCAN methods respectively. It is clear from the observations that the "Category" columns play a major role in almost all the clustering algorithms in dividing the mutual funds into clusters.
6. CONCLUSION
In this study, we implemented various clustering algorithms, including K-means, DBSCAN, Hierarchical, and Agglomerative, to cluster mutual funds. We evaluated the performance of these algorithms using metrics such as silhouette score and inertia, allowing for a comparative analysis to identify the optimal method for clustering mutual funds. Additionally, we employed the Random Forest algorithm to determine the most influential features that contributed to the clustering results. This analysis revealed the order of importance of the features in the mutual fund clustering process, providing valuable insights for future research and investment decision-making.
7. FUTURE SCOPE
Developing an efficient clustering model to analyze and categorize users into distinct clusters based on the similarities found in their data points will be the next step. The suggested methods must be further analyzed to determine which among them gives the best result on the given dataset. The best clustering algorithm can effectively group users together based on shared features or characteristics within a given feature space. Once users are assigned to their respective clusters, a personalized and effective recommendation can be generated based on the cluster to which the user belongs. Importantly, this recommendation is tailored to the unique constraints and limitations that apply to each user, ensuring that it aligns with their preferences, requirements, and other relevant factors. This approach ensures that the recommendations provided are relevant and valuable while accommodating each user's specific needs and constraints.
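As a rough illustration of this proposed flow, the sketch below assigns a user to the nearest cluster centroid and then filters that cluster's candidate funds by a user constraint. Everything here is hypothetical: the user feature vector, the centroids, the mapping from clusters to funds, and the expense-ratio constraint are placeholders for a system that has not yet been built.

```python
from math import dist


def assign_cluster(user, centroids):
    """Index of the centroid closest to the user's (scaled) feature vector."""
    return min(range(len(centroids)), key=lambda c: dist(user, centroids[c]))


def recommend(user, centroids, funds_by_cluster, max_expense_ratio):
    """Funds linked to the user's cluster that also satisfy the user's constraint."""
    cluster = assign_cluster(user, centroids)
    return [
        f["name"]
        for f in funds_by_cluster[cluster]
        if f["expense_ratio"] <= max_expense_ratio
    ]


user_centroids = [(0.0, 0.0), (1.0, 1.0)]  # e.g. (risk appetite, horizon), scaled
funds_by_cluster = {
    0: [{"name": "Fund A", "expense_ratio": 0.5}, {"name": "Fund B", "expense_ratio": 1.8}],
    1: [{"name": "Fund C", "expense_ratio": 0.9}],
}
picks = recommend((0.2, 0.1), user_centroids, funds_by_cluster, max_expense_ratio=1.0)
```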
Furthermore, a sophisticated recommendation system can be built that takes into account individual investor characteristics such as investment horizon, risk profile, investment type, and minimum investment to recommend the best possible mutual fund schemes to that particular investor. Such a system can aid novice as well as experienced investors in choosing the best scheme to invest in out of the thousands available today.
BIOGRAPHIES
Aayush N Shah, B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India.
Aayushi Joshi, B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India.
Dhanvi Sheth, B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India.
Miti Shah, B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India.
Prof. Pramila M. Chawan is working as an Associate Professor in the Computer Engineering Department of VJTI, Mumbai. She has done her B.E. (Computer Engineering) and M.E. (Computer Engineering) from VJTI College of Engineering, Mumbai University. She has 30 years of teaching experience and has guided 85+ M.Tech. projects and 130+ B.Tech. projects. She has published 148 papers in International Journals and 20 papers in National/International Conferences/Symposiums. She has worked as an Organizing Committee member for 25 International Conferences and 5 AICTE/MHRD sponsored Workshops/STTPs/FDPs. She has participated in 17 National/International Conferences. Worked as Consulting Editor on JEECER, JETR, JETMS, Technology Today, JAM&AER Engg. Today, The Tech. World; Editor - Journals of ADR; Reviewer - IJEF, Inderscience. She has worked as NBA Coordinator of the Computer Engineering Department of VJTI for 5 years. She had written a proposal under TEQIP-I in June 2004 for "Creating Central Computing Facility at VJTI". Rs. Eight Crore were sanctioned by the World Bank under TEQIP-I on this proposal. The Central Computing Facility was set up at VJTI through this fund, which has played a key role in improving the teaching learning process at VJTI. Awarded by SIESRP with Innovative & Dedicated Educationalist Award, Specialization: Computer Engineering & I.T., in 2020. AD Scientific Index Ranking (World Scientist and University Ranking 2022) - 2nd Rank - Best Scientist, VJTI Computer Science domain; 1138th Rank - Best Scientist, Computer Science, India.