International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 12 Issue: 12 | Dec 2025
p-ISSN: 2395-0072
www.irjet.net
Cross-Cloud ML Pipeline Optimization for Big Data and LLM Workloads

Aswathnarayan Muthukrishnan Kirubakaran1, Akshay Deshpande2, Siva Kumar Chintham3, Adithya Parthasarathy4, Ram Sekhar Bodala5, Nitin Saksena6

1IEEE Senior Member, USA; 2,3,4Independent Researcher, USA; 5Amtrak, USA; 6Albertsons, USA
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Cloud providers now offer specialized accelerator hardware and scalable analytics frameworks, yet the behavior of end-to-end machine learning pipelines across heterogeneous environments remains poorly understood. This paper presents a quantitative multi-cloud study evaluating Spark ETL processing, distributed transformer training, and large language model inference across Amazon Web Services, Microsoft Azure, and Google Cloud Platform. Under controlled dataset scale, model configuration, and GPU/TPU resource equivalence, our cross-provider routing strategy reduced full pipeline operating cost by 38.7% and improved average accelerator utilization by 34.2% without accuracy degradation. We introduce a reproducible orchestration framework, report detailed behavioral differences between A100, H100, and TPU execution paths, and provide evidence-driven guidance for large-scale deployment decisions. Results show that AWS storage locality delivers superior ETL throughput, Azure H100 nodes offer leading model training and inference efficiency, and TPU workloads require additional adaptation overhead. This work establishes the first end-to-end cross-cloud empirical benchmark for large model workloads and demonstrates conditions where multi-provider pipeline optimization offers measurable value.

Key Words: Multi-cloud computing, Distributed ML pipelines, LLM inference, Big Data, Spark ETL optimization, Cross-cloud orchestration.

1. INTRODUCTION

Contemporary machine learning and large language model systems operate at a scale that exceeds the capabilities of monolithic infrastructure. Modern pipelines incorporate sequential stages such as raw data extraction, feature engineering, training, validation, deployment and inference. Each of these stages exhibits different hardware and processing patterns: GPU-based training requires high memory bandwidth, distributed ETL operations rely on storage locality, and inference depends on accelerator speed and latency [1]. Consequently, single-vendor environments cannot always provide optimal conditions across all stages.

Emerging industrial practice indicates that organizations increasingly distribute workloads across cloud vendors to benefit from specialized hardware availability, regional coverage, data-management flexibility and pricing variation. AWS offers mature big data services such as S3 and EMR [2]. Azure provides high-performance H100 GPU networks with strong inference characteristics [3]. GCP offers TPU infrastructure tailored for deep learning training workloads. A cross-cloud design can therefore reduce bottlenecks, mitigate regional capacity constraints and improve economic efficiency.

Despite these developments, scholarly publications continue to focus primarily on intra-cloud optimization techniques [4]. Recent studies on federated and multimodal learning across distributed devices further highlight the growing need for coordinated cross-platform execution, yet these efforts focus primarily on model-level collaboration rather than end-to-end pipeline orchestration across heterogeneous cloud environments [5]. Existing literature emphasizes algorithmic speed-ups, single-vendor hardware comparisons or infrastructure cost modeling without assessing the end-to-end performance behavior of multi-provider ML pipelines under realistic deployment conditions [6]. As a result, there is limited scientific evidence to support architectural decision-making for multi-cloud model training and inference.

This paper addresses this gap by conducting a controlled empirical study across AWS, Azure and GCP using identical datasets, models, cluster configurations and performance metrics. The research quantifies the benefits and limitations of routing ETL workloads to AWS and both training and inference workloads to Azure, guided by observed cost and efficiency differences.
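The stage-to-provider routing idea described above can be sketched as a small cost-aware assignment policy. This is a minimal illustrative sketch, not the paper's actual orchestration framework: the provider names, per-hour prices, and normalized throughput figures below are placeholder assumptions, chosen only so that the example reproduces the qualitative routing the study reports (ETL to AWS; training and inference to Azure).

```python
from dataclasses import dataclass

@dataclass
class Offer:
    """One provider's terms for running a single pipeline stage.

    Prices and throughputs here are hypothetical placeholders,
    not measured values from this study.
    """
    provider: str
    usd_per_hour: float         # cluster price for this stage
    relative_throughput: float  # normalized stage throughput (higher = faster)

def cost_per_unit(offer: Offer) -> float:
    """Effective cost of one unit of work: price divided by throughput."""
    return offer.usd_per_hour / offer.relative_throughput

def route(stages: dict) -> dict:
    """Assign each pipeline stage to the provider with the lowest effective cost."""
    return {stage: min(offers, key=cost_per_unit).provider
            for stage, offers in stages.items()}

plan = route({
    "etl":       [Offer("aws", 4.0, 1.3),  Offer("azure", 4.2, 1.0),  Offer("gcp", 3.9, 1.0)],
    "training":  [Offer("aws", 32.8, 1.0), Offer("azure", 27.2, 1.2), Offer("gcp", 29.0, 1.1)],
    "inference": [Offer("aws", 5.5, 1.0),  Offer("azure", 5.1, 1.25), Offer("gcp", 5.3, 1.1)],
})
print(plan)  # → {'etl': 'aws', 'training': 'azure', 'inference': 'azure'}
```

The design choice here is the price-over-throughput ratio: a provider that is nominally more expensive per hour (e.g., AWS for ETL) can still win a stage if its storage locality or accelerator efficiency yields enough extra throughput, which mirrors the paper's observation that per-stage efficiency, not list price alone, should drive placement.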
© 2025, IRJET | Impact Factor value: 8.315
This research makes the following key contributions:
1. First empirical end-to-end cross-cloud ML pipeline benchmark: We evaluate Spark ETL, distributed model training, and LLM inference across AWS, Azure, and GCP within a unified experimental framework.

2. Quantitative multi-provider cost and runtime analysis: We demonstrate that workload stage
ISO 9001:2008 Certified Journal | Page 1256