An Improved De-Duplication Technique for Small Files in Hadoop

International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072 | Volume: 04 Issue: 07 | July 2017 | www.irjet.net

Ishita Vaidya¹, Prof. Rajender Nath²

¹M.Tech Student, Department of Computer Science and Applications, Kurukshetra University, Kurukshetra
²Professor, Department of Computer Science and Applications, Kurukshetra University, Kurukshetra

Abstract: HDFS works as one of the core components of the Hadoop ecosystem, as it stores large data sets in a master-slave architecture using commodity hardware. Many de-duplication techniques have been proposed for storing files in HDFS, but the existing techniques do not merge the unique small files together to improve the storage efficiency of the name node. To address the small file problem, an improved de-duplication technique is proposed in this paper that eliminates file redundancy by using a hash index and merges only the unique small files. The proposed technique is experimentally found to be better than the original HDFS in writing time and overall storage efficiency.

Keywords: De-Duplication, Hadoop, Hadoop Distributed File System, Small File Problem, Small File Storage.

I. INTRODUCTION

Hadoop, an open source software framework, is an important tool for managing and storing big data on commodity hardware. Large companies such as Facebook, Netflix, Yahoo and Amazon use Hadoop to manage their large unstructured data sets [1]. Hadoop has two main components for big data analytics: the MapReduce framework and the Hadoop Distributed File System (HDFS). The MapReduce framework uses mappers and reducers to organize and process data across multiple computing nodes, and MapReduce and HDFS run on the same data nodes. HDFS acts as the storage layer for large files: the file data is kept in the data nodes while the metadata is kept in the name node. Because HDFS runs on commodity hardware, each block is replicated (three copies by default) across different data nodes to improve fault tolerance. This makes HDFS a fault-tolerant, scalable and low-cost system for storing large data sets.

Small File Problem in HDFS: HDFS is designed to work on large data sets; small files, on the other hand, impose a heavy burden on it. Files smaller than the default block size are considered small files in HDFS [1]. Small files do not reduce the storage efficiency of the data nodes: if a small file is 4 MB and the data block size is 64 MB, the 4 MB file occupies only the space it actually needs on disk, and the remaining capacity stays available for other data. The small file problem in HDFS can instead be explained as follows: (i) the metadata of every file is stored in the name node, so when a large number of small files is stored in the data nodes, the metadata of each small file consumes a large amount of space in the name node, which decreases the storage efficiency of the name node; (ii) mapping a large number of small files instead of a small number of large files requires many seeks and hops, which leads to heavy traffic in the Hadoop Distributed File System.
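The pressure on the name node can be made concrete with a back-of-the-envelope sketch. The calculation below assumes the commonly cited rule of thumb of roughly 150 bytes of name node memory per file object and per block object; this figure and the 1 TB data set size are illustrative assumptions and are not taken from this paper.

# Rough estimate of name node memory needed to hold the metadata of a 1 TB
# data set stored as many small files versus full-block files.
# Assumes ~150 bytes of name node heap per file/block object (rule of thumb,
# not a figure from this paper).

BYTES_PER_NAMENODE_OBJECT = 150          # assumed rule of thumb
BLOCK_SIZE_MB = 64                        # default HDFS block size used in the paper
DATA_SET_MB = 1024 * 1024                 # 1 TB of data, chosen for illustration

def namenode_overhead_mb(file_size_mb):
    """Estimated name node memory (MB) for the metadata of the whole data set."""
    num_files = DATA_SET_MB // file_size_mb
    blocks_per_file = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division
    num_objects = num_files * (1 + blocks_per_file)       # one file object plus its blocks
    return num_objects * BYTES_PER_NAMENODE_OBJECT / (1024 * 1024)

print(f"4 MB files : ~{namenode_overhead_mb(4):.0f} MB of name node memory")
print(f"64 MB files: ~{namenode_overhead_mb(64):.0f} MB of name node memory")

Under these assumptions the same 1 TB of data costs roughly sixteen times more name node memory when it arrives as 4 MB files than when it arrives as 64 MB files, which is the essence of the small file problem.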

De-Duplication: Data de-duplication is a productive approach for avoiding redundant data in big data technologies, and it also reduces network traffic because duplicate data is never transmitted. The de-duplication strategy is implemented using different hash functions such as MD5, SHA-1, SHA-256, SHA-512, RIPEMD-160, Tiger-128, Tiger-160 and Whirlpool.
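As an illustration of this hash-based approach, the minimal sketch below fingerprints whole files with SHA-256 and keeps an in-memory hash index so that only the first copy of each file is retained. The function and index names are hypothetical; the technique described in this paper may hash chunks instead of whole files or use a different digest.

import hashlib

def file_fingerprint(path, algorithm="sha256"):
    """Return the hex digest of a file's contents, read in 1 MB blocks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

hash_index = {}        # fingerprint -> path of the first (unique) copy
duplicate_index = {}   # duplicate path -> path of the copy already stored

def deduplicate(paths):
    """Keep only unique files; map each duplicate to the already stored copy."""
    unique = []
    for path in paths:
        fp = file_fingerprint(path)
        if fp in hash_index:
            duplicate_index[path] = hash_index[fp]   # duplicate: record mapping only
        else:
            hash_index[fp] = path                    # first occurrence: keep the file
            unique.append(path)
    return unique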

The workflow of the de-duplication process depends on the type of data: either the data is divided into chunks, or whole files are used directly to create the hash values that are checked for duplicates. De-duplication has been used with the Hadoop Distributed File System before to reduce duplicate entries in the data nodes, as the literature survey in the following section describes. In this paper, a combination of file merging and de-duplication is used to increase the efficiency of the Hadoop Distributed File System. The basic approach is to remove duplicate entries of identical files and to map the duplicate files to the duplicate index of the proposed system. A file merging technique is then used to merge the incoming small files together, and a merged file index records the start position and end position of each small file so that it can be accessed from the merged file.
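The merged file index can be illustrated with the following sketch, which appends unique small files to a single merged file and records each file's start and end byte positions so it can be read back later. The file names and helper functions are hypothetical and do not reproduce the exact merge component proposed in this paper.

merged_file_index = {}   # original file name -> (start, end) byte offsets

def merge_small_files(paths, merged_path="merged.bin"):
    """Append each small file to the merged file and record its offsets."""
    with open(merged_path, "ab") as out:
        for path in paths:
            start = out.tell()
            with open(path, "rb") as src:
                out.write(src.read())
            merged_file_index[path] = (start, out.tell())

def read_small_file(name, merged_path="merged.bin"):
    """Recover one small file from the merged file using its recorded offsets."""
    start, end = merged_file_index[name]
    with open(merged_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)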

The rest of the paper is organized as follows: Section 2 discusses the work that has been carried out on small files and de-duplication, Section 3 describes the proposed technique and its architecture, Section 4 presents the experiments and results, and Section 5 concludes the paper with the future scope of this method.


