
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 05 | May 2023

p-ISSN: 2395-0072

www.irjet.net

Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms

S. Jayabharathi¹, Dr. M. Logambal²

¹Research Scholar, Department of Computer Science with Data Analytics, Vellalar College for Women, Thindal, Erode, Tamil Nadu, India.

²Assistant Professor, Department of Computer Science with Data Analytics, Vellalar College for Women, Thindal, Erode, Tamil Nadu, India.

------------------------------------------------------------------***--------------------------------------------------------------------

Abstract:

scanning over 400 years of video every day [2][3]. Dealing with this avalanche of data requires powerful knowledge discovery tools, and data mining techniques are well-established tools for this purpose [4]. One of them, clustering, divides data into groups so that objects within a group are more similar to one another than to objects in other groups. Clustering is a well-known technique in many areas of computer science and related fields. Data mining can be seen as its main origin, but clustering is also widely used in other research fields such as bioinformatics, energy research, machine learning, networks, and pattern recognition, so it has been studied extensively [5]. From the beginning, researchers have explored clustering algorithms that manage complexity and computational load in order to improve scalability and speed.
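As a minimal sketch of the definition of clustering given above, the following Python snippet compares the average distance between points that share a group with the average distance between points in different groups. The toy coordinates, the two candidate groupings, the helper names, and the choice of Euclidean distance are all illustrative assumptions and are not taken from the surveyed work.

```python
import math
from itertools import combinations, product

def mean_intra_distance(groups):
    """Average distance between pairs of points that share a group."""
    d = [math.dist(a, b) for g in groups for a, b in combinations(g, 2)]
    return sum(d) / len(d)

def mean_inter_distance(groups):
    """Average distance between pairs of points drawn from different groups."""
    d = [math.dist(a, b)
         for g, h in combinations(groups, 2)
         for a, b in product(g, h)]
    return sum(d) / len(d)

# Hypothetical toy data: two well-separated blobs in the plane.
low = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
high = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

good = [low, high]                              # one group per blob
bad = [low[:2] + high[:1], low[2:] + high[1:]]  # deliberately mixes the blobs

for name, grouping in [("good", good), ("bad", bad)]:
    cohesive = mean_intra_distance(grouping) < mean_inter_distance(grouping)
    print(name, "grouping satisfies the definition:", cohesive)
```

A grouping that respects the definition has a smaller intra-group distance than inter-group distance, whereas the deliberately mixed grouping does not.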

High-dimensional data is characterized by a very large number of dimensions, which makes it considerably harder to analyze, yet it now has to be understood routinely. As the dimensionality of a dataset increases, the data representation becomes sparse and the density of points in the domain decreases, which creates an additional challenge: it becomes difficult to achieve good clustering results on high-dimensional data, and reducing the data to a lower-dimensional subspace is itself a very difficult problem. This paper offers a limited overview of effective clustering. Big data analysis and processing require a great deal of effort, tools, and equipment. Frameworks such as Apache Hadoop and Apache Spark use MapReduce-style models to perform large-scale data analysis through parallel processing and to return results as quickly as possible. In the era of big data, traditional data analysis methods may not be able to manage and process such large amounts of data. To support efficient processing of big data, this paper examines the use of MapReduce techniques for processing big data with machine learning algorithms.
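To make the MapReduce model mentioned above more concrete, the sketch below expresses one k-means iteration as map, shuffle, and reduce steps. It is a single-process simulation in plain Python under stated assumptions: the function names, the toy input splits, and the starting centroids are hypothetical, and no Hadoop or Spark API is used.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (cluster_id, (point, 1)) for each point in one input split."""
    for p in points:
        cid = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        yield cid, (p, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(cid, values):
    """Reduce: sum coordinates and counts for one cluster, then average them."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return cid, tuple(t / count for t in total)

def kmeans_iteration(splits, centroids):
    # Each split could be mapped on a different worker; this sketch
    # processes the splits sequentially in one process.
    pairs = [kv for split in splits for kv in map_phase(split, centroids)]
    updated = dict(reduce_phase(cid, vals) for cid, vals in shuffle(pairs).items())
    # A centroid that received no points keeps its previous position.
    return [updated.get(i, c) for i, c in enumerate(centroids)]

# Hypothetical input: four 2-D points stored in two splits.
splits = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (8.0, 8.0)]]
new_centroids = kmeans_iteration(splits, [(0.0, 0.0), (10.0, 10.0)])
print(new_centroids)
```

Because the map step only needs the current centroids and its own split, the splits can be processed independently, which is the property that lets such frameworks parallelize the work; repeating the iteration until the centroids stop moving would give a complete k-means run.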

The concept of machine learning (ML) is not new in the field of computing, but it has re-emerged in an entirely new "avatar" to meet the ever-changing demands of today's world, and ML-based solution strategies are now widely discussed for all kinds of problems. ML is a subset of artificial intelligence that uses computer algorithms to learn autonomously from data and information. With the advent of the Internet, a great deal of digital information has been created, which means more data for machines to analyze and "learn" from [6]. As a result, we are seeing a resurgence in machine learning. Today, ML algorithms enable computers to communicate with humans, drive autonomous cars, write and publish sports match reports, and identify terrorist suspects. ML is the fastest-growing field in computer science [7]. Common ML techniques include classification [8], regression [9], topic modeling [10], time series analysis, cluster analysis, association rules, collaborative filtering, and dimensionality reduction [11][12][13][14][15]. Big data refers to large collections of data sets that are complex to process, and organizations struggle to create, manipulate, and manage such large datasets [16]. This data can be analyzed using software tools as part of advanced analytics capabilities such as predictive analytics, data mining, text analytics, and statistical analytics. Examples of such large amounts of data are

Keywords: Big Data, MapReduce, Clustering, Machine Learning, KNN, SVM, K-Means, Naïve Bayes, FCM

I. INTRODUCTION

Big data is an important concept for industry, academics, and researchers in the fields of computing, economics, and software development. Data is the raw material used for processing and analysis, and information is the result obtained from that processing. The context of big data concerns the 7 Vs (Volume, Value, Velocity, Variety, Validity, Veracity, and Visualization) of digital data generated and collected from various sources [1]. The question then becomes what to do with these huge amounts of data. Scientists and researchers consider big data to be one of the most important topics in computer science today. Social networking sites such as Facebook and Twitter have billions of users and generate hundreds of gigabytes of content every minute. Retailers continuously collect data from their customers. YouTube has 1 billion unique users, who generate 100 hours of video every hour. The Content ID service is

Impact Factor value: 8.226

ISO 9001:2008 Certified Journal

Page 858