An Overview of Data Lake


International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 09 Issue: 07 | July 2022 | www.irjet.net

Pragati Kumai1, Smitha G R2
1,2 R.V College of Engineering, Bengaluru, India

Abstract - Data Lake is one of the contentious concepts that emerged during the Big Data era. The idea for the Data Lake came from the business world rather than the academic world. Because the Data Lake is a newly conceived concept built on revolutionary ideas, its adoption presents numerous challenges; its potential to change the data landscape, on the other hand, makes Data Lake research worthwhile. A data lake is a highly flexible repository that can store both structured and unstructured data and employs a schema-on-read strategy. It is an effective approach to today's Big Data storage problem. However, it has some defects, such as inadequate security and authentication mechanisms. Apache Hadoop is widely recognised as a data lake industry standard; its parallel processing mechanisms enable rapid processing of huge amounts of data. Many businesses have built wrappers around Hadoop to address concerns about its raw state and poor data security, including platforms such as Amazon Web Services (AWS) Data Lake and Azure Data Lake. AWS Data Lakes offer a simple solution with safeguards to prevent data loss, whereas Azure Data Lakes offer greater adaptability and organisation-level security.

Key Words: Big Data, Data Warehouse, OLAP, OLTP, Data Lake, Apache Hadoop

1. INTRODUCTION

A data lake offers businesses a scalable and secure platform that enables them to: ingest any data from any system at any speed, whether it originates from on-premises, cloud, or edge computing systems; store any type or volume of data in full fidelity; process data in real time or in batch mode; and analyse data using SQL, Python, R, or any other language, third-party data, or analytics applications. Businesses that successfully derive business value from their data will perform better than their competitors. According to an Aberdeen study, businesses that used data lakes outperformed comparable businesses in organic revenue growth by 9%. These leaders were able to perform new forms of analytics, such as machine learning, over fresh data from sources including log files, click-stream data, social media, and internet-connected devices housed in the data lake. This made it easier for them to recognise and seize business growth opportunities by attracting and retaining customers, increasing productivity, maintaining equipment proactively, and making informed decisions.

A data warehouse, by contrast, is a database designed for the analysis of relational data from corporate applications and transactional systems. The data structure and schema are set in advance to optimise for quick SQL queries, and the results are often used for operational reporting and analysis. Data is cleansed, enriched, and transformed so that it can serve as the "single source of truth" that users can rely on. A typical organisation will need both a data warehouse and a data lake, depending on its requirements, as they each fulfil different needs and use cases.

2. DATA LAKE CONCEPTS

The fundamental goal of a data lake is to ingest unprocessed data and process it later. Data lakes therefore retain all information and offer great flexibility. However, data lakes, which hold large numbers of datasets lacking precise models or descriptions, are prone to quickly becoming invisible, impenetrable, and inaccessible; establishing a metadata control system for a DL is therefore required. Indeed, many articles have underlined the value of metadata [10]. Big data can also refer to technological advancements in data processing and storage that enable handling of exponential growth in data volume in any format [3]. The 3-V model [7], which identifies three dimensions of issues in data growth (volume, velocity, and variety), is the foundation for another widely accepted definition of big data. Volume refers to the increasing amount of data. Velocity refers to the rates at which new data are generated and made available for further analysis. Variety characterises the range of data types and sources. Ref [9] proposed further Data Lake specifications, particularly from the perspective of the business domain rather than the scientific community.
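The schema-on-write approach of a warehouse and the schema-on-read approach of a lake can be contrasted with a minimal sketch in plain Python. The file names and field names here are hypothetical illustrations, not part of any product's API; a real deployment would use a platform such as Hadoop, AWS, or Azure rather than local JSON files.

```python
import json
import os
import tempfile

# Hypothetical lake location: a plain directory standing in for object storage.
lake_dir = tempfile.mkdtemp()

def ingest_raw(event: dict, name: str) -> str:
    """Store the event untransformed; no validation is applied, nothing is rejected."""
    path = os.path.join(lake_dir, name)
    with open(path, "w") as f:
        json.dump(event, f)
    return path

# Two events with inconsistent shapes are both accepted as-is.
ingest_raw({"user": "a", "clicks": 3}, "e1.json")
ingest_raw({"user": "b", "ts": "2022-07-01"}, "e2.json")

def read_with_schema(name: str, fields: list) -> dict:
    """Apply a schema only at read time (schema-on-read)."""
    with open(os.path.join(lake_dir, name)) as f:
        raw = json.load(f)
    # Missing fields surface when the data is read, not when it is ingested.
    return {k: raw.get(k) for k in fields}

print(read_with_schema("e2.json", ["user", "clicks"]))
# {'user': 'b', 'clicks': None}
```

In a warehouse, the second event would have been rejected or transformed at load time; in the lake, the schema mismatch only becomes visible when a reader imposes a schema.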


A data lake has the following properties:

- The data is loaded from the source systems.
- No data is rejected.
- At the leaf level, data is stored in an untransformed or nearly untransformed state.

A DL can be accessed by multiple users and ingests and stores a variety of data types. If proper management techniques are not in place, numerous problems with accessing, searching, and analysing data may arise [2]. Governance for data lakes is necessary in this situation, and partial solutions can be found in the state of the art. [6] suggests a "just-enough governance" for DLs, with policies for data quality, data on-boarding, metadata management, and compliance and audit. According to [4], data governance must guarantee data availability and quality throughout the
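As a sketch of the kind of metadata control system these governance proposals call for, the following minimal catalog records the source, a schema hint, and quality tags for each dataset landed in the lake. The class names and fields are illustrative assumptions, not an API from [6], [4], or any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    """Catalog record for one dataset landed in the lake."""
    name: str
    source: str                          # originating system
    schema_hint: dict                    # best-effort field -> type mapping
    quality_tags: list = field(default_factory=list)
    registered_at: str = ""

class MetadataCatalog:
    """Minimal registry keeping lake contents discoverable and searchable."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        # Record when the dataset was on-boarded into the lake.
        entry.registered_at = datetime.now(timezone.utc).isoformat()
        self._entries[entry.name] = entry

    def search(self, tag: str) -> list:
        """Find datasets carrying a given quality tag."""
        return [e.name for e in self._entries.values()
                if tag in e.quality_tags]

catalog = MetadataCatalog()
catalog.register(DatasetEntry("clickstream", "web-frontend",
                              {"user": "str", "clicks": "int"},
                              quality_tags=["validated"]))
catalog.register(DatasetEntry("sensor-dump", "iot-gateway",
                              {"ts": "str"}, quality_tags=["raw"]))

print(catalog.search("raw"))
# ['sensor-dump']
```

Without such a registry, the datasets described above quickly become the invisible, impenetrable collections the governance literature warns about; with it, consumers can at least discover what exists, where it came from, and how trustworthy it is.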



Turn static files into dynamic content formats.

Create a flipbook