
International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072 | Volume: 12 Issue: 05 | May 2025 | www.irjet.net

Unified Log Management: Kafka Connect and Data Lakes for Advanced System Analysis and Machine Learning

Kuriens Shaji Maliekal(1)
(1) Staff Software Engineer, Walmart Global Tech, USA

Abstract - The exponential growth of data in modern systems has significantly transformed how organizations monitor, optimize, and understand their operations. Logs, which record critical facts concerning system performance, user interaction, and infrastructure behavior, play a critical role in diagnosing problems, auditing activities, and facilitating advanced analyses such as predictive modeling for machine learning. Nevertheless, with the rising velocity and volume of logs produced by distributed systems and microservices, conventional log management techniques struggle to aggregate, store, and analyze them. Previously, centralized logging systems based on relational databases or file storage were employed to collect logs. Although these systems sufficed for small-scale applications, they became bottlenecks under the demands of modern distributed systems: relational databases face scalability challenges, whereas file-based systems do not support complex analytics or structured queries. Such restrictions hinder real-time insight and, in turn, delay decision-making. Integrating Apache Kafka with data lakes has proven to resolve these constraints by providing real-time log streaming together with affordable, scalable storage. Yet problems of data quality, governance, and consistency persist.

This study proposes a unified log management framework that leverages Kafka Connect to stream logs into data lakes while ensuring data quality and governance. By implementing automated deduplication and schema evolution within ETL pipelines, the approach addresses key challenges and enables both real-time processing and long-term analytics. It unifies real-time monitoring with predictive maintenance and anomaly detection, optimizing resource utilization in cloud-native environments. Future advances in schema management, data validation, and machine learning (ML) integration will further streamline real-time insights and predictive capabilities for improved decision-making and operational efficiency.

Key Words: Log Management, Kafka, Kafka Connect, Data Lake, ML, Machine Learning, Distributed Systems

1. INTRODUCTION

The exponential growth of data in modern systems has transformed the way organizations understand, monitor, and optimize their operations. Logs, as a critical subset of this data, serve as the pulse of applications and infrastructure, capturing important details about system behavior, performance, and user interaction. They are invaluable for diagnosing issues, auditing activities, and powering advanced analytics such as predictive modeling in Machine Learning (ML). However, the sheer volume and velocity of logs generated by distributed systems and microservices present significant challenges in aggregation, storage, and analysis. Traditionally, log management depended on centralized logging systems that gathered logs from multiple sources and saved them in relational databases or file systems. Although effective for smaller-scale applications, these methods struggled with the requirements of contemporary distributed systems. Relational databases frequently became bottlenecks because of their restricted scalability and limited capacity for large-scale real-time data ingestion, while file-based systems were not advanced enough for structured queries and complex analytics. Additionally, conventional systems were poorly suited to processing the unstructured or semi-structured log data produced by contemporary applications. These restrictions hindered immediate insight into system performance and postponed essential decision-making [1].

The recent trend has been an increasing focus on smarter, more scalable, and more efficient approaches to managing logs. One emerging fundamental component is Apache Kafka, which has become the de facto standard for real-time data streaming and log aggregation. With its distributed architecture, Kafka can handle the massive amounts of logs generated by today's applications: it processes logs at high throughput while providing reliability, data integrity, and scalability through additional processing workers [2]. Kafka allows logs to be streamed to downstream applications for processing or storage, with real-time ingestion from various sources such as application pods in Kubernetes clusters. However, while Kafka is optimized for real-time ingestion and streaming, it was not designed for long-term storage or complex analytics of the ingested logs. These constraints can be addressed by combining Kafka with data lakes [3].

Data lakes provide a cost-efficient and scalable store where raw logs can be kept in their original form. Amazon S3, Google Cloud Storage (GCS), and Hadoop-based environments offer virtually unbounded capacity at comparatively affordable prices. Using Kafka Connect, a Kafka framework for external system integration, logs can be natively streamed from Kafka topics to data lakes. This not only centralizes storage but also separates ingestion from
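To make the ingestion path concrete, the log-streaming step described above can be sketched as an application serializing structured log events and publishing them to a Kafka topic. This is an illustrative sketch, not the paper's implementation: the topic name `app-logs`, the broker address, and the event fields are assumptions, and the producer wiring (using the `kafka-python` client) is shown commented out because it requires a reachable broker.

```python
import json
import time

def make_log_record(service: str, level: str, message: str) -> bytes:
    """Serialize a structured log event as JSON bytes, ready for a Kafka topic.

    The field layout (service/level/message/ts) is a hypothetical schema
    chosen for illustration; real deployments would standardize their own.
    """
    event = {
        "service": service,
        "level": level,
        "message": message,
        "ts": time.time(),  # epoch timestamp for downstream ordering/windowing
    }
    return json.dumps(event).encode("utf-8")

# Hypothetical producer wiring (requires a running broker and `kafka-python`):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("app-logs", make_log_record("checkout", "ERROR", "payment timeout"))
# producer.flush()
```

Serializing events as self-describing JSON keeps the topic consumable by any downstream sink, at the cost of payload size compared with a binary format such as Avro.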
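As one illustration of the Kafka Connect integration described above, a sink connector can be registered to copy a topic into object storage. The fragment below uses the Confluent S3 sink connector's standard configuration keys; the topic, bucket, region, and flush size are hypothetical values, not taken from the paper:

```json
{
  "name": "logs-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "2",
    "topics": "app-logs",
    "s3.bucket.name": "example-log-lake",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

Posting such a configuration to the Kafka Connect REST API starts a connector that batches records from the topic and writes them as objects to the bucket, decoupling log producers from the storage layer.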

© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 723

