International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 12 Issue: 11 | Nov 2025
p-ISSN: 2395-0072
www.irjet.net
DATA SCIENCE AND BIG DATA IN DATA MINING ALGORITHMS
Vanitha S1, Miss Sangeetha A2
1PG Student, Department of Computer Applications, Jaya College of Arts and Science, Thiruninravur, Tamilnadu, India
2Assistant Professor, Department of Computer Applications, Jaya College of Arts and Science, Thiruninravur, Tamilnadu, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The exponential growth of data volume, velocity, and variety, collectively known as Big Data, has presented both unprecedented opportunities and significant challenges for traditional data mining algorithms. These algorithms, designed for smaller, structured datasets, often struggle with scalability, computational efficiency, and extracting meaningful patterns from massive, unstructured data streams. This paper explores the critical intersection of Data Science, Big Data, and data mining. We begin by reviewing the limitations of existing data mining methodologies (e.g., Apriori, k-Means, C4.5) in a Big Data context. We then propose a novel, integrated framework that leverages distributed computing paradigms, specifically Apache Spark's MLlib, to enhance the scalability and performance of classical algorithms. The proposed methodology is implemented through distinct modules for data ingestion, pre-processing, distributed model training, and result visualization. Experimental results on a large-scale dataset demonstrate a significant reduction in processing time and improved model accuracy compared to traditional single-node implementations. The paper concludes that the synergy between Data Science principles and Big Data technologies is essential for the next generation of efficient and powerful data mining solutions, and suggests future research directions in real-time streaming analytics and AutoML integration.

Key Words: Data Science, Big Data, Data Mining, Distributed Computing, Apache Spark, Scalability, Machine Learning, Apriori Algorithm.

1. INTRODUCTION
The 21st century is characterized by a data deluge. From social media interactions and IoT sensor readings to genomic sequences and financial transactions, we are generating data at an unprecedented scale. This phenomenon, termed "Big Data," is defined by the 3Vs: Volume, Velocity, and Variety. While this data holds the potential to unlock valuable insights for business, science, and society, its sheer scale and complexity render traditional data analysis tools inadequate. Data Mining, the core process of discovering patterns and knowledge from large amounts of data, is at the heart of this challenge. Classical data mining algorithms like association rule mining (Apriori), clustering (k-Means), and classification (Decision Trees) were not designed for distributed environments and often fail to process terabytes or petabytes of data efficiently.
Data Science emerges as the interdisciplinary field that combines statistical analysis, machine learning, domain expertise, and computer science to extract knowledge and insights from data. It provides the necessary toolkit to bridge the gap between Big Data challenges and effective data mining. By leveraging distributed computing frameworks like Hadoop and Spark, Data Science enables the scaling of data mining tasks across clusters of computers.

2. LITERATURE REVIEW
2.1 Traditional Data Mining Algorithms: Traditional algorithms have been the backbone of knowledge discovery for decades.
Apriori Algorithm: Used for frequent itemset mining and association rule learning. Its main drawback is the need for multiple database scans and the generation of a huge number of candidate sets, which becomes computationally prohibitive with large datasets [1].
k-Means Clustering: A popular centroid-based clustering algorithm. It is sensitive to the initial centroid selection and has a computational cost of O(n * k * I * d), where n is the number of points, k the number of clusters, I the number of iterations, and d the number of dimensions, which becomes expensive for large n [2].
C4.5 (Decision Trees): An algorithm used to generate a decision tree. While the resulting trees are interpretable, building them on massive datasets can lead to memory overflow and long training times.
2.2 Big Data Technologies: To handle Big Data, new distributed frameworks have been developed.
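As a rough illustration of the Apriori drawback noted in Section 2.1, the sketch below implements the classic level-wise formulation in plain Python and counts how many full passes over the transaction database it makes. It is a minimal single-node sketch, not the implementation evaluated in this paper; the function name `apriori`, the `TRANSACTIONS` toy data, and the absolute `min_support` threshold are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of purchased items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def apriori(transactions, min_support):
    """Return (frequent itemsets with support counts, number of database scans).

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]
    frequent = {}
    scans = 0

    # Level 1 candidates: every distinct item in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        # One full pass over the database for every itemset size k.
        scans += 1
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate is contained in the transaction
                    counts[c] += 1

        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)

        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans


frequent_itemsets, database_scans = apriori(TRANSACTIONS, min_support=3)
```

Even on this five-transaction example the loop scans the database once per itemset size, and each scan checks every candidate against every transaction. This is the cost that grows prohibitive on large datasets.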
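To make the MapReduce programming model covered in this section more concrete, the following single-process sketch simulates its map, shuffle, and reduce phases in plain Python, applied to counting item support (the first pass of frequent itemset mining). It does not use the Hadoop or Spark APIs; `run_mapreduce`, `item_mapper`, `sum_reducer`, and the `PARTITIONS` data are hypothetical names introduced here, and each inner list merely stands in for the data held by one cluster node.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical data: two partitions, each standing in for one node's share.
PARTITIONS = [
    [["bread", "milk"], ["bread", "diapers", "beer"]],
    [["milk", "diapers", "beer"], ["bread", "milk", "diapers"]],
]


def item_mapper(transaction):
    """Map: emit an (item, 1) pair for every item in one transaction."""
    return [(item, 1) for item in transaction]


def sum_reducer(key, values):
    """Reduce: add up all the counts emitted for one key."""
    return sum(values)


def run_mapreduce(partitions, mapper, reducer):
    # Map phase: each partition is processed independently of the others,
    # which is what allows a real cluster to run this step in parallel.
    mapped = [
        [pair for record in partition for pair in mapper(record)]
        for partition in partitions
    ]

    # Shuffle phase: group the intermediate values by key across partitions.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)

    # Reduce phase: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


item_counts = run_mapreduce(PARTITIONS, item_mapper, sum_reducer)
```

In this sketch the intermediate `mapped` pairs stay in memory. Hadoop MapReduce writes such intermediate results to disk, which is the source of the I/O latency that in-memory engines such as Spark avoid for iterative algorithms.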
© 2025, IRJET | Impact Factor value: 8.315
Hadoop MapReduce: A programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It is highly scalable but suffers from high disk I/O latency because it writes intermediate results to disk [3].
Apache Spark: An open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Its in-memory processing capability makes it significantly faster than Hadoop MapReduce for iterative algorithms, which are common in data mining [4].
2.3 Integration of Data Mining and Big Data Platforms: Recent research has focused on adapting data mining algorithms for Big Data environments. For instance, MLlib
ISO 9001:2008 Certified Journal | Page 57