International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 04 Issue: 04 | Apr-2017
p-ISSN: 2395-0072
www.irjet.net
Read Write Ser-De for JSON data in MapReduce Abstraction
B. Nandan1, K. Sai Kiran2, O. Sai Krishna3, K. Pradeep4, K. Vishnu Vardhan5
1Associate Professor, Department of CSE, Gurunanak Institutions, Ibrahimpatnam, Hyderabad, India
2,3,4,5B.Tech Student, Department of CSE, Gurunanak Institutions, Ibrahimpatnam, Hyderabad, India
Abstract - In this advanced generation of technology, data is generated in huge amounts, in both structured and unstructured formats. Big data can be better understood through its characteristics: volume, velocity, variety, variability and veracity. Preserving and analysing such data is essential, but the analysis of unstructured data remains one of the major problems faced by many technology companies: unstructured data is not organised, is text-heavy and contains many irregularities, and about 70-80% of the data generated by companies is unstructured. Hence this project focuses on the analysis of unstructured data. The unstructured data is first converted into semi-structured data using Hive and a SerDe, after which analysis is performed using HiveQL. The analysis is carried out in a cluster: serialisation and deserialisation of the data allow it to be transferred easily from the master to all the slaves in the cluster, and the converted semi-structured data is finally analysed by the slaves.
Key Words: Unstructured data, Semi-structured data, HiveQL, Cluster, Serialisation, Deserialisation

1. INTRODUCTION
Data, on analysis, yields results of great value that are essential to running any organisation. Storing and analysing such huge amounts of data is a tedious task, but technology has evolved so much in recent times that processing speeds have exceeded expectations. Unstructured data currently accounts for about 70-80% of all data generated, and it is difficult both to store and to analyse. This project analyses unstructured JSON data to produce results that would otherwise remain out of reach.

1.1 Big data
Big data refers to data sets that are too large or complex to be analysed using traditional methods of data processing. Predictive, behavioural and many other advanced data-analytics methods are used to extract information from these large data sets and reduce it to a size suitable for use. As the name suggests, big data is defined partly by its size, which generally ranges from petabytes to yottabytes. The data is too complex to be understood directly and too diverse for patterns to be found easily. For big data to be actually useful, the right queries must be posed, and the data must be cleaned and made easy to analyse.

1.2 Hadoop
Hadoop has been a revolution in the field of big data analytics. Hadoop can analyse, clean and present large data sets in a proper manner with simple queries. It uses the MapReduce model, which splits a job into smaller units that are then carried out by the slaves in a cluster following a master-slave architecture. The data is stored in a dedicated file system called HDFS (Hadoop Distributed File System), which helps aggregate bandwidth across the cluster.

1.3 Hive
Apache Hive is a data warehouse built on Hadoop, used for data analysis, query writing and summarisation. Hive provides an interface similar to SQL for analysing large data sets: its own query language, HiveQL, is very close to SQL and makes queries easy to write and run. Hive provides the abstraction required to translate HiveQL queries into the low-level Java API, and it also ships with built-in functions for manipulating strings, dates and similar data.

1.4 SerDe
SerDe is short for Serialization and Deserialization. Serialization is the conversion of objects into byte streams, and deserialization is the exact opposite. Data is transferred fastest when it is in byte-stream format, and this is where a SerDe comes into play. Data objects are often complex and difficult to convert to a byte stream directly; serialization converts these complex data objects into a byte stream for transfer to all the slaves in the cluster, after which deserialization converts the byte stream back into data objects to work on.
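The serialise-transfer-deserialise round trip described above can be sketched in plain Python, using the standard json module as a stand-in for a Hive SerDe (the record fields here are invented purely for illustration):

```python
import json

# A complex data object on the master node: a dict standing in for
# one record of semi-structured data (fields are illustrative).
record = {"user": "alice", "clicks": 42, "tags": ["ad", "promo"]}

# Serialization: convert the object into a byte stream so it can be
# shipped over the network to a slave node.
byte_stream = json.dumps(record).encode("utf-8")
assert isinstance(byte_stream, bytes)

# Deserialization on the slave: convert the byte stream back into an
# object that the analysis code can work on.
restored = json.loads(byte_stream.decode("utf-8"))
assert restored == record
```

The same principle applies in Hadoop itself, except that the serialised form there is chosen for compactness and speed rather than human readability.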
1.5 JSON data
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is the type of data that is easiest for humans to read and understand, and generating JSON data is also quite simple. JSON is based on a subset of the JavaScript programming language, yet it does not depend on any particular language: it uses conventions that are familiar to programmers of the C family of languages.
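As a small illustration of how easily JSON text maps onto native data objects, the snippet below parses one JSON record in Python (the field names are invented for the example):

```python
import json

# A JSON record as it might appear in an unstructured log file
# (field names are illustrative only).
raw = '{"name": "sensor-1", "readings": [3, 5, 8], "active": true}'

# json.loads turns the text into native objects: JSON objects become
# dicts, arrays become lists, and true/false become booleans.
data = json.loads(raw)

print(data["name"])           # sensor-1
print(sum(data["readings"]))  # 16
```

Note how the JSON literal true becomes the Python boolean True; this language-level mapping is exactly what makes JSON convenient as an interchange format.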
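The overall workflow described so far, deserialising JSON records and then running a HiveQL-style aggregation over the resulting rows, can be imitated end to end in plain Python (the records and the grouping column are invented for illustration; a real deployment would use Hive's JSON SerDe and HiveQL instead):

```python
import json
from collections import defaultdict

# Unstructured input: one JSON document per line, as a JSON SerDe
# would receive it (records are illustrative).
lines = [
    '{"city": "Hyderabad", "sales": 120}',
    '{"city": "Delhi", "sales": 80}',
    '{"city": "Hyderabad", "sales": 50}',
]

# "Deserialise": turn each line into a row (a dict), the way a SerDe
# maps raw input into table columns.
rows = [json.loads(line) for line in lines]

# Rough equivalent of: SELECT city, SUM(sales) FROM t GROUP BY city;
totals = defaultdict(int)
for row in rows:
    totals[row["city"]] += row["sales"]

print(dict(totals))  # {'Hyderabad': 170, 'Delhi': 80}
```

In the actual system, the deserialisation step runs in parallel on the slaves, and the aggregation is expressed declaratively in HiveQL rather than written by hand.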