
International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 11 Issue: 11 | Nov 2024
www.irjet.net

LLM-based Data Operators for Data Processing

Deepak Raj Iti¹, Vandana Jada², Praveen Kandukuri³, Divyanjali Peraka⁴

¹Student, Dept. of Electronics Engineering, Indian Institute of Technology (ISM), Dhanbad
²Student, Dept. of Data Science, Texas A&M University, College Station
³Student, Dept. of Metallurgical Engineering, National Institute of Technology, Tiruchirappalli
⁴Operations Research, UC Berkeley, Berkeley

Abstract - Data processing is essential in machine learning pipelines to ensure data quality. Many applications use user-defined functions (UDFs) for this purpose, offering flexibility and scalability. However, the rising demands on these pipelines present three challenges: low-code limitations, dependency issues, and a lack of knowledge awareness. To tackle these, we propose a new design pattern in which large language models (LLMs) serve as a generic data operator (LLM-GDO) for effective data cleansing, transformation, and modeling. In this pattern, user-defined prompts (UDPs) replace implementations in specific programming languages, allowing LLMs to be managed easily without runtime dependency concerns. Fine-tuning LLMs with domain-specific data enhances their effectiveness, making data processing more knowledge-aware. We provide examples to illustrate these benefits and discuss the challenges and opportunities LLMs bring to this design pattern.

Key Words: Large Language Models, Data Modeling, Data Cleansing, Data Transformations, Design Pattern

1. INTRODUCTION

Machine learning (ML) drives a variety of data-driven applications across different use cases. A typical machine learning pipeline comprises several steps: data processing, feature engineering, model selection, model training, hyperparameter tuning, evaluation, testing, and deployment [1]. Many of these steps require high-quality data to ensure that machine learning applications perform as expected, and they often benefit from large volumes of data to support effective training [2].

However, most reliable datasets are generated through human annotation, a process that is both expensive and time-consuming and therefore difficult to scale. As machine learning models grow more complex and involve ever more parameters, the demand for vast amounts of high-quality training data continues to grow. Data processing tasks must adapt to this increasing need for effective data cleansing, transformation, and modeling. This work therefore focuses primarily on the transformation step as defined in a typical ETL (Extract, Transform, Load) framework in data warehousing.

To support the growing demand for data transformation, user-defined functions (UDFs) are commonly used. These functions clean, transform, and model data within a data warehouse or data lake [3]. A typical UDF template is written in a Pythonic style, allowing users to import runtime dependencies, implement processing logic, and handle input data. When a UDF is applied to a database, following a narrow transformation model in Spark [4], it processes each row of data individually and stores the transformed rows.

The UDF design pattern offers three significant advantages in large-scale data processing:

1. Flexibility: Users can implement their own data processing logic, even when it is not supported by built-in functions.
2. Modularity: UDFs provide abstraction, facilitating better understanding, debugging, and reusability of code.
3. Scalability: UDFs scale easily on big data processing engines such as Spark.

Despite these advantages, UDFs also face several challenges:

1. Not low-code or zero-code: Users must possess substantial programming skills and experience to create UDFs.
2. Not dependency-free: UDFs can require complex runtime environments, making dependency management difficult during development and deployment. If two UDFs have completely non-overlapping runtime dependencies, a separate pipeline is needed for each.
3. Not knowledge-aware: It is challenging to incorporate prior knowledge into UDFs for data processing tasks. Such knowledge is often task-specific; for instance, classifying e-commerce item categories requires extensive domain expertise to identify item attributes. Encoding this knowledge deterministically in UDFs is complicated by the vast number of item-attribute combinations.
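The Pythonic UDF template described above can be sketched in plain Python. This is an illustrative example, not code from the paper: the cleaning task and function name are hypothetical, and in Spark the function would be registered with `pyspark.sql.functions.udf` and applied as a narrow transformation (one input row produces one output row, with no shuffle across rows).

```python
# Hypothetical row-wise UDF following the template described above:
# import runtime dependencies, implement processing logic, handle one row.
import re

def clean_title_udf(raw_title):
    """Normalize a product title: trim, collapse whitespace, title-case."""
    if raw_title is None:           # handle missing input defensively
        return None
    collapsed = re.sub(r"\s+", " ", raw_title.strip())
    return collapsed.title()

# A narrow transformation processes each row independently, so a plain
# per-row map faithfully mimics how Spark would apply the UDF:
rows = ["  apple   iphone 13 ", None, "SAMSUNG galaxy\ts21"]
cleaned = [clean_title_udf(r) for r in rows]
```

Because each call depends only on its own input row, the same function parallelizes trivially across partitions, which is the source of the scalability advantage noted above.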

© 2024, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal

Recently, the development of artificial intelligence (AI) has made significant progress with the emergence of Large Language Models (LLMs). Models such as Llama2 and GPT-4 have demonstrated their effectiveness on a wide range of downstream tasks, such as question answering and multi-step reasoning, thanks to their emergent abilities. This
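The LLM-GDO pattern proposed in the abstract, where a user-defined prompt (UDP) replaces the hand-written UDF body, could be sketched as follows. All names here are hypothetical illustrations, not an API from the paper, and `call_llm` is a stand-in for a real model endpoint (e.g. a hosted or fine-tuned model); it is stubbed with a deterministic fake answer so the sketch is runnable.

```python
# Sketch of an LLM as a generic data operator (LLM-GDO): a user-defined
# prompt replaces the UDF implementation, and one model call handles one row.

UDP = ("Classify the following e-commerce item title into one category "
       "from {categories}. Title: {title}. Answer with the category only.")

def call_llm(prompt):
    # Placeholder for a real LLM call; faked deterministically here so
    # the example runs without any model or runtime dependency.
    return "Electronics" if "phone" in prompt.lower() else "Clothing"

def llm_gdo(row, prompt_template, **kwargs):
    """Apply a user-defined prompt to one row. Like a narrow
    transformation, each row is processed independently."""
    prompt = prompt_template.format(title=row["title"], **kwargs)
    return {**row, "category": call_llm(prompt)}

rows = [{"title": "Apple iPhone 13"}, {"title": "Cotton T-Shirt"}]
labeled = [llm_gdo(r, UDP, categories="Electronics, Clothing") for r in rows]
```

Note how this addresses the three UDF challenges above: the UDP is plain natural language rather than code, the operator needs no runtime dependencies beyond model access, and domain knowledge (e.g. the e-commerce category taxonomy) can live in the prompt or the fine-tuned model rather than in deterministic branching logic.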
