International Research Journal of Engineering and Technology (IRJET)   e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 09 Issue: 08 | Aug 2022   www.irjet.net
A simulation-based approach for straggler task detection in Hadoop MapReduce

Vivek M1, Annappa Swamy D R2
1Master of Technology Student, Dept. of Computer Science and Engineering, Mangalore Institute of Technology and Engineering, Moodbidri
2Associate Professor, Dept. of Computer Science and Engineering, Mangalore Institute of Technology and Engineering, Moodbidri

Abstract - Hadoop MapReduce splits an incoming job into smaller individual tasks that execute in parallel on several servers. These servers can be combined into massive clusters with heterogeneous system configurations. The resource manager is responsible for distributing these tasks among the cluster's nodes, and the job is considered complete only when every one of its tasks has finished. Straggler tasks are those that take an unusually long time to complete; they delay the overall MapReduce job completion time and can arise from a variety of causes, including hardware, network, or configuration problems.

Hadoop offers speculative execution to shorten job response time by launching duplicate copies of straggler tasks on other nodes in the cluster. However, launching more speculative tasks does not guarantee faster completion: the node already running the straggler may finish it before the node on which the speculative copy was just started. In that case the freshly launched copy is discarded, wasting the time and resources of that node. We therefore propose a simulation-based approach that uses machine learning to identify straggler tasks from their expected completion time. The suggested approach has the potential to considerably improve the overall execution time of Hadoop MapReduce jobs.
Key Words: Resource Manager, Straggler task, Speculative execution, MapReduce, Machine learning
1. INTRODUCTION

With the web now used in nearly every activity, enormous amounts of information are created and analysed. Each user's behaviour on social media and in web searches is tracked and analysed in order to optimise website design, fight spam and fraud, and discover new marketing opportunities. Facebook's petabyte-scale data warehouse, for example, stores the roughly 4 petabytes of data the company generates every day. This rapid growth of information has made it more challenging to organise, analyse, interpret, and act upon, so Big Data management techniques need to be continually improved in light of these difficulties. Processing such enormous amounts of data relies on distributed and parallel processing techniques, and the MapReduce programming model is one of the most widely used. MapReduce breaks a job down into smaller tasks that run concurrently on many machines.
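To make the model concrete, the following is a minimal, single-process Python sketch of the MapReduce idea: input splits are handled by independent map tasks whose partial results are merged by a reduce step. The function names and sample data are illustrative only and are not Hadoop's API.

```python
# Minimal illustration of the MapReduce model: map tasks run independently
# over input splits, and a reduce step merges their partial results.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def map_task(chunk: str) -> Counter:
    """Map phase: count the words in one input split."""
    return Counter(chunk.split())

def reduce_task(partials) -> Counter:
    """Reduce phase: merge the partial counts produced by the map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    splits = ["big data big clusters",
              "map reduce map tasks",
              "straggler tasks delay jobs"]
    with ProcessPoolExecutor() as pool:   # map tasks execute concurrently
        partial_counts = list(pool.map(map_task, splits))
    print(reduce_task(partial_counts))    # merged word counts for the whole job
```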
Hadoop's job execution is built on one fundamental idea: moving data to the computation is more expensive than moving the computation to the data. In a network-constrained scenario, Hadoop therefore tries to schedule map tasks close to the input data blocks they process. Straggler tasks degrade the overall performance of Hadoop MapReduce jobs: they take significantly longer to complete than the other tasks on the cluster. Hadoop employs speculative execution to reduce job response time by concurrently launching copies of straggler tasks on other nodes, but a poorly chosen speculative task wastes cluster resources and can extend job completion time. Therefore, to improve MapReduce job completion time, we suggest using a machine learning approach, the k-means clustering algorithm, to efficiently classify fast- and slow-running tasks.
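A minimal sketch of this idea is shown below, assuming that each task is represented only by its estimated completion time and that k-means with k = 2 separates the fast and slow groups; the feature choice, sample values, and library call are illustrative assumptions rather than the exact implementation used in this work.

```python
# Illustrative k-means classification of tasks into fast and slow clusters.
# The estimated completion times below are made-up sample values (seconds).
import numpy as np
from sklearn.cluster import KMeans

estimated_times = np.array([42.0, 40.5, 45.2, 41.8, 118.6, 43.3, 39.9, 131.4])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(estimated_times.reshape(-1, 1))

# The cluster with the larger centroid holds the slow-running (straggler) candidates.
slow_cluster = int(np.argmax(kmeans.cluster_centers_.ravel()))
stragglers = np.where(labels == slow_cluster)[0]
print("Candidate straggler task indices:", stragglers.tolist())
```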
2. EXISTING SYSTEM

In the Hadoop framework, straggler task speculation is based on a simple profiling method that compares the progress score of each task to the average of all tasks currently executing on a node. Hadoop uses the progress score to estimate how much of its work each task has completed; the score ranges from 0 to 1. A map task's progress score is the fraction of its input data that has been read. A reduce task's progress score is composed of three phases, each of which contributes one-third to the overall value. If a task has been running for at least one minute and its progress score falls below the category average minus 0.2, it is deemed a straggler.
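A minimal Python sketch of this default rule is given below; the data structures and sample values are illustrative and do not correspond to Hadoop's internal API.

```python
# Sketch of the default speculation rule described above: flag a task as a
# straggler if it has run for at least one minute and its progress score is
# below the average score of its category minus 0.2.
from dataclasses import dataclass

@dataclass
class TaskStatus:
    task_id: str
    progress: float      # progress score in [0, 1]
    runtime_sec: float   # how long this task attempt has been running

def find_stragglers(tasks, min_runtime=60.0, margin=0.2):
    avg = sum(t.progress for t in tasks) / len(tasks)
    return [t.task_id for t in tasks
            if t.runtime_sec >= min_runtime and t.progress < avg - margin]

tasks = [
    TaskStatus("m_000", 0.90, 75.0),
    TaskStatus("m_001", 0.85, 72.0),
    TaskStatus("m_002", 0.35, 80.0),   # far below the average progress score
    TaskStatus("m_003", 0.88, 20.0),   # too young to be considered
]
print(find_stragglers(tasks))          # -> ['m_002']
```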
The existing technique assumes that all data centres have the same hardware configuration. However, data centres are heterogeneous in nature, with different CPU types, memory capacities, and other characteristics. Furthermore, it assumes that all tasks progress at almost the same rate, even though for running