International Research Journal of Engineering and Technology (IRJET) Volume: 10 Issue: 05 | May 2023

www.irjet.net

e-ISSN: 2395-0056 p-ISSN: 2395-0072

Big Data Testing Using Hadoop Platform

Tushar Kumar Sharma1, Chirag Jindal2, Akhil Saini3, Satyam Gupta4

Computer Science and Engineering, Chandigarh University, Mohali, India

Abstract - Big data analysis has emerged as a crucial technology in recent years due to the exponential growth of data generated from various sources. This data can be structured, semi-structured, or unstructured, and it comes from diverse channels such as social media platforms, smart city sensors, e-commerce websites, and numerous applications. It spans a wide range of formats, including text, images, audio, and video. Hadoop provides a comprehensive ecosystem of tools and frameworks that enable efficient storage and processing of big data. One of the key components of Hadoop is the Hadoop Distributed File System (HDFS), which is designed to store and manage data across a cluster of commodity hardware. HDFS breaks large files into smaller blocks and replicates them across multiple nodes, ensuring data reliability and availability. Another key component is MapReduce, which allows parallel computation across a cluster of machines, making it suitable for processing large-scale datasets. By breaking complex tasks into smaller subtasks and distributing them across multiple nodes, MapReduce enables faster and more efficient data processing.

As big data analysis continues to evolve, Hadoop has expanded its ecosystem with various technologies to enhance its capabilities. These include YARN (Yet Another Resource Negotiator), which serves as the cluster resource management framework and allocates computing resources to different applications. Pig, an abstraction layer on top of Hadoop, provides a high-level language called Pig Latin for expressing data analysis tasks. MRjob is a Python framework that simplifies the development of MapReduce jobs. Zookeeper is a centralized service for maintaining configuration information, synchronization, and naming services. Hive offers a data warehouse infrastructure and a query language called HiveQL for querying and analyzing data stored in Hadoop. Apache Spark, a fast and general-purpose data processing engine, is integrated with Hadoop to provide faster in-memory computation.

In this paper, our focus is on exploring big data analysis and demonstrating how Hadoop, along with its associated technologies, can be used for analyzing, storing, and processing large volumes of data. By leveraging Hadoop's distributed architecture and the complementary tools within its ecosystem, organizations can effectively harness the potential of big data and derive valuable insights for various applications and industries.

Key Words: YARN, Pig, MRjob, Zookeeper, Hive, Apache Spark, HBase
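As an illustration of the MapReduce model summarized in the abstract, the following minimal Python sketch counts words using separate map, shuffle, and reduce steps. It is not Hadoop code and does not use the Hadoop or MRjob APIs: it runs on a single machine with only the standard library. The function names (split_into_blocks, map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions introduced here for illustration. The fixed-size groups of lines stand in only loosely for HDFS blocks; replication and distribution across nodes are not modeled.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Group input lines into fixed-size chunks (a loose stand-in for HDFS blocks)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Mapper: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, block_size=2):
    """Run the three phases in sequence over the given lines."""
    blocks = split_into_blocks(lines, block_size)
    # On a real cluster each block would be mapped on a different node in
    # parallel; here the mappers simply run one after another.
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    sample = [
        "big data needs big storage",
        "hadoop stores big data",
        "mapreduce processes data",
    ]
    print(word_count(sample))
```

Each mapper sees only its own block, and the reducer sees only the values grouped under one key. This independence between subtasks is what allows a framework such as Hadoop to distribute them across many nodes.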

INTRODUCTION

Recently, information technology systems have been playing a major role in handling organizations' data and providing insights into their business. This data can come from the education, traffic, healthcare, or commerce sectors. It can be structured, semi-structured, or unstructured, and can take the form of text, images, audio, video, and log files. The amount of data to be stored and processed is often in terabytes.[15] It becomes extremely hard for a single system to manage such big data. Big data is defined on the basis of three V's: volume, velocity, and variety.

Volume: The size of the data coming from organizations in the social media, healthcare, education, and business sectors is ever increasing. Big companies like Google and Facebook process information in petabytes. IoT (sensor) data is also increasing day by day. As a result, it becomes difficult to manage such big data using traditional systems.

Variety: Data in today's world is divided into structured (schema, columns), semi-structured (JSON, emails, XML), and unstructured (images, videos, audio) categories. This data can also be raw, and converting it into useful information with traditional analytical systems requires heavy processing.

Velocity: This concept defines the speed at which data arrives from its source and the speed at which it is processed. The incoming data is huge, and it needs to be processed at a similar speed. Companies like Google and Facebook process petabytes of data on a daily basis.[23]

Certain software is used to operate on and manage this big data. This paper focuses on one such software platform, Hadoop, and how it is used in big data testing. The paper first describes what big data is and then moves to Hadoop and its components, HDFS and MapReduce. Several emerging technologies also come into play, such as YARN, Pig, Zookeeper, Hive, Apache Spark, and HBase.

YARN enables batch, stream, interactive, and graph processing of data stored in HDFS.[27] Hive and Pig are two integral

Impact Factor value: 8.226

ISO 9001:2008 Certified Journal

Page 1520