Log Analysis Engine with Integration of Hadoop and Spark


International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 04 Issue: 03 | Mar-2017

p-ISSN: 2395-0072

www.irjet.net

Abhiruchi Shinde1, Neha Vautre2, Prajakta Yadav3, Sapna Kumari4, Prof. Geeta Navale

1Abhiruchi Shinde, Dept of Computer Engineering, SITS, Maharashtra, India
2Neha Vautre, Dept of Computer Engineering, SITS, Maharashtra, India
3Prajakta Yadav, Dept of Computer Engineering, SITS, Maharashtra, India
4Sapna Kumari, Dept of Computer Engineering, SITS, Maharashtra, India
Prof. Geeta Navale, Dept of Computer Engineering, SITS, Maharashtra, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Log files, or logs, in computing are files that record the events occurring in an operating system or the communication between users or operating systems. Log files contain a large amount of valuable information about system operation status, usage, user behavior, etc. Due to the extensive use of digital appliances today, log file analysis has become a necessary task for tracking system operation or user behavior and acquiring important knowledge from it. These files are generated at a stupendous rate, and analyzing them is a tedious task and a burden to corporations and other organizations. To analyze a large dataset and store it efficiently and economically, we need a solution that offers not only massive and stable data processing ability but also adaptability to a variety of scenarios under efficiency requirements. Such capabilities cannot be achieved with standalone analysis tools or even a single cloud computing framework. The main objective of the proposed system is to design an application for log analysis and to apply a data mining algorithm whose results help the system administrator take proper decisions. The combination of Hadoop, Spark, and the data warehouse and analysis tools Hive and Shark makes it possible to provide a unified platform with batch analysis and in-memory computing capacity in order to process logs in a highly available, stable and efficient way. Statistics based on customer feedback data from the system will help a company that has such data at its disposal, ready to use, to expand its business.

Key Words: Log, Weblog, Hadoop, Spark, Log analysis.

1. INTRODUCTION

There are various types of logs, such as database logs, binary logs, etc. Based on the type of log and the requirements of the client, various log analysis tools or platforms can be chosen and brought into use according to their compatibility with the system. There are some free, powerful log analysis tools, such as Webalizer, Awstats, and Google Analytics, but they are either standalone or limited in the scale of data they can handle. With the rapid development of Internet technology, the scale of log data is sharply increasing, and dealing with large-scale data has become a new challenge. The emergence of cloud computing with batch processing capacity provides a solution to this kind of problem.

Hadoop is a popular open source distributed computing framework that provides a distributed file system named HDFS and a distributed computing framework called Map/Reduce. Hive is a data warehouse analysis tool based on Hadoop, which converts SQL statements into Map/Reduce jobs for execution. Hadoop and Hive mainly deal with the processing and storage of large data. Systems designed on Hadoop Map/Reduce alone, or even in combination with Hive, have solved the large-scale data processing and storage problems but are not suitable for a class of applications such as interactive queries and iterative algorithms, which are common in analysis systems. Spark is designed to speed up data analysis operations. It is suitable for iterative algorithms (such as machine learning and graph mining algorithms) and interactive data mining. It also provides a data analysis tool, Shark, which is compatible with Hive, giving us another choice for large-scale data processing.

In this paper, we propose a system for log analysis that overcomes the standalone and data-scale limitations of previous log analysis tools. We give a generalized view of web server logs; the role of Hadoop and Hive, together with Spark and Shark, in the proposed system; and the flow of the log analysis strategy for a single node as well as for a multi-cluster setup in a distributed environment.

© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1671
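To make the idea of turning an SQL-style aggregation over web server logs into a Map/Reduce job more concrete, the following is a minimal single-process sketch in Python. It is not the proposed system and does not use Hadoop, Hive, or Spark. It only mimics the map, shuffle, and reduce phases for a query roughly equivalent to `SELECT status, COUNT(*) FROM logs GROUP BY status`. The Common Log Format pattern, the function names, and the sample lines are all assumptions made for illustration, since the paper does not specify a log layout.

```python
import re
from collections import defaultdict

# Assumption: logs follow the Common Log Format; the paper does not say so.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)$'
)


def map_phase(line):
    """Emit one (status, 1) pair for a parseable line, nothing otherwise."""
    match = LOG_PATTERN.match(line.strip())
    if match:
        yield match.group("status"), 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for a single key."""
    return key, sum(values)


def count_by_status(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample data, invented for this sketch.
SAMPLE_LINES = [
    '10.0.0.1 - - [10/Mar/2017:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [10/Mar/2017:10:00:01 +0000] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [10/Mar/2017:10:00:02 +0000] "POST /login HTTP/1.1" 200 87',
    'not a log line',
]

if __name__ == "__main__":
    print(count_by_status(SAMPLE_LINES))
```

In a real cluster, the map and reduce functions would run in parallel on different nodes and the shuffle would move data across the network. Here everything runs in one process, so the sketch shows only the shape of the computation.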

