Research Paper
Computer Science
E-ISSN No : 2454-9916 | Volume : 3 | Issue : 10 | Oct 2017
CONCEPTUAL STUDIES OF BIG DATA ANALYSIS
Nagendra Kumar Sahu, Research Scholar, Mats University, Raipur (C.G.)

ABSTRACT
Big Data analytics may be the key to fighting cyber crime. Using big data to combat cyber crime is becoming a decisive strategy for businesses that want to stay secure. With security risks growing, from structured and unstructured data inside network servers to smartphones, businesses need to be extremely alert because of the tremendous increase in cyber threats. Several organizations are leveraging big data analytics to support their business processes. However, only a few organizations have realized the potential benefits of analytics for ensuring information security.

INTRODUCTION:
Big data is a term used to describe data that is high volume, high velocity and/or high variety; that requires new technologies and techniques to capture, store and analyze; and that is used to enhance decision making, provide insight and discovery, and support and optimize processes. Big data analytics is the process of examining large and varied data sets -- i.e., big data -- to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful information that can help organizations make more informed business decisions. Big data also denotes data sets so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy. While big data is the darling of some Science, Technology, Engineering and Mathematics programs, most mainstream high school students have no idea what it is, how it is transforming the business world, or how it shapes their opportunities for employment. Although the industry has only existed for a decade, big data is everywhere.
Fig. 1: Big Data Characteristics

TECHNOLOGIES:
2.1 Introduction to MapReduce
The best-known example of a Big Data execution environment is probably Google MapReduce (Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113), Google's implementation of the MapReduce programming model, together with Hadoop, its open-source counterpart (Lam, C. (2011). Hadoop in Action. Manning, 1st edition). This environment aims to provide elasticity by allowing resources to be adjusted to the application, handling errors transparently and ensuring the scalability of the system. The programming model is built upon two "simple" abstract functions named Map and Reduce, inherited from the classical functional programming paradigm. Users specify the computation in terms of a map function (which specifies the per-record computation) and a reduce function (which specifies how results are aggregated), each of which must meet a few simple requirements. For example, MapReduce requires the operations performed in the reduce task to be both associative and commutative, so that partial results can be combined in any order (a minimal illustrative sketch of this model is given at the end of this section).

The MapReduce framework rests on the observation that most information processing tasks share a similar structure: the same computation is applied over a large number of records, and the intermediate results are then aggregated in some way. As described above, the programmer must specify the Map and Reduce functions within a job. The job divides the input dataset into independent subsets that are processed in parallel by the Map tasks. MapReduce then sorts the outputs of the Map tasks, and these become the inputs processed by the Reduce tasks. The Map tasks, the intermediate sort, and the Reduce tasks are the main components of this programming model.

2.2 Hadoop Ecosystem
Apache Hadoop is a Big Data framework that is part of the Apache Software Foundation. It is an open-source project that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models, and it is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is used extensively by some of the biggest organizations in the world for distributed storage and processing of data at enormous volume, which is why it runs its processing on large computer clusters built from commodity hardware. The platform can be used efficiently for data storage, processing, access, analysis, governance, security, operations and deployment. Hadoop is a top-level Apache project built and used by a diverse group of developers, users and contributors cutting across nationalities under the auspices of the Apache Foundation, and it is currently governed under the Apache License 2.0. Because Hadoop operates on thousands of nodes holding huge amounts of data, the failure of individual nodes is highly probable. The platform is therefore designed to be resilient: as soon as the Hadoop Distributed File System (HDFS) detects a node failure, it redirects that node's data to other nodes, allowing the whole platform to keep operating without interruption.
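To make the Map/Reduce division of labour in Section 2.1 concrete, the following is a minimal pure-Python sketch of a word-count job. It is not Hadoop code: real jobs would be written against the Hadoop MapReduce API or Hadoop Streaming, and all function names here (map_fn, reduce_fn, run_job) are illustrative only.

from itertools import groupby
from operator import itemgetter

# Illustrative sketch only: a pure-Python simulation of the MapReduce model
# described in Section 2.1 (word count). The function names are ours, not Hadoop's.

def map_fn(record):
    """Per-record computation: emit (word, 1) pairs for one line of text."""
    for word in record.lower().split():
        yield (word, 1)

def reduce_fn(key, values):
    """Aggregation: sum the counts for one key. Summation is associative and
    commutative, so partial results can be combined in any order -- the
    property the reduce step relies on."""
    return (key, sum(values))

def run_job(records):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle/sort phase: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: aggregate the values of each group.
    return [reduce_fn(key, (v for _, v in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

if __name__ == "__main__":
    lines = ["big data needs new tools", "big data is everywhere"]
    print(run_job(lines))
    # [('big', 2), ('data', 2), ('everywhere', 1), ('is', 1), ('needs', 1), ('new', 1), ('tools', 1)]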
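The fault-tolerance behaviour described for HDFS in Section 2.2 can also be illustrated with a toy model. The sketch below is not HDFS's actual block-placement or heartbeat protocol, which is far more involved; the class and function names are invented for illustration, and only the default replication factor of three is taken from Hadoop.

import random

# Toy model only: illustrates the re-replication idea from Section 2.2,
# not HDFS's real block-placement logic. All names here are ours.

REPLICATION_FACTOR = 3  # HDFS keeps this many copies of each block by default

class MiniCluster:
    def __init__(self, node_names):
        # Each node holds a set of block ids.
        self.nodes = {name: set() for name in node_names}

    def write_block(self, block_id):
        """Place REPLICATION_FACTOR copies of a block on distinct nodes."""
        for node in random.sample(sorted(self.nodes), REPLICATION_FACTOR):
            self.nodes[node].add(block_id)

    def fail_node(self, name):
        """Simulate a node failure: lost replicas are re-created elsewhere,
        so every block keeps REPLICATION_FACTOR live copies."""
        lost_blocks = self.nodes.pop(name)
        for block_id in lost_blocks:
            holders = {n for n, blocks in self.nodes.items() if block_id in blocks}
            candidates = sorted(set(self.nodes) - holders)
            if candidates:
                self.nodes[random.choice(candidates)].add(block_id)

    def replica_count(self, block_id):
        return sum(block_id in blocks for blocks in self.nodes.values())

if __name__ == "__main__":
    cluster = MiniCluster([f"node{i}" for i in range(5)])
    cluster.write_block("block-42")
    cluster.fail_node("node0")                 # even if node0 held a replica...
    print(cluster.replica_count("block-42"))   # ...three live copies remain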
METHODOLOGIES:
Confidentiality-preserving data mining techniques

Text Data Mining: Textual data represents rich information, but it lacks structure and requires specialist techniques to be mined and linked properly, as well as to reason with and draw useful correlations from. A set of techniques will be developed for extracting entities, the relations between them, opinions and other elements, for use in semantic indexing, visualization and anonymisation (a small illustrative sketch of entity extraction appears after this list). [Dr Udo Kruschwitz, Professor Massimo Poesio, Professor Maria Fasli, Dr Beatriz de la Iglesia]
Machine learning and transactional data: Investigate machine learning and other methods for identifying stylised facts; seasonal, spatial or other relations; and patterns of behavior at the level of the individual, group or region in transactional data from businesses, local government or other organizations. Such methods can provide essential decision-support information to organizations in planning services based on predicted trends, spikes or troughs in demand. [Professor Maria Fasli, Dr Beatriz de la Iglesia]
Developing methods to evaluate, target and monitor the provision of care: Models and statistical methods for the analysis of local government health and social care data will be developed, alongside new data mining and machine learning algorithms to identify intervention subgroups and new joint modelling methods to improve existing predictive models, with a view to evaluating, targeting and monitoring the provision of care. [Professor Abdel Salhi, Professor Berthold Lausen, Professor Elena Kulinskaya]
Data quality grading and assurance: This research will develop new, and adapt existing, methodologies for merging data from multiple sources. It will also develop robust techniques for data quality grading and assurance, providing automated data quality and cleaning procedures for use by researchers. [Dr Beatriz de la Iglesia]
Identifying "unusual" data segments: Methods will be developed to automatically identify "unusual" data segments through an ICMetrics-based technique. Such methods will be able to alert researchers of specific data seg-
Copyright© 2017, IERJ. This open-access article is published under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License which permits Share (copy and redistribute the material in any medium or format) and Adapt (remix, transform, and build upon the material) under the Attribution-NonCommercial terms.