
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 11 | Nov 2025

p-ISSN: 2395-0072

www.irjet.net

Fault-Tolerant Distributed Training System for Multi-Modal Medical Disease Prediction

Jatin Shihora¹, Yukti Bandi²

¹Undergraduate Student, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
²Professor, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

--------------------------------------------------------------------***---------------------------------------------------------------------

Abstract: Training machine learning models on large-scale multimodal medical datasets introduces significant distributed system challenges, including heterogeneous data formats, missing modalities, and failures during multi-day training runs. This paper presents a fault-tolerant distributed training pipeline for disease prediction on 53,420 patient records combining DICOM images, laboratory tests, and physiological time-series data. A modality-aware preprocessing module maintains data integrity without introducing bias in incomplete patient records. The main contribution is a lineage-based checkpointing mechanism that coordinates failure recovery across distributed training tasks by tracking dependencies and incrementally persisting model states. This allows training to resume from consistent checkpoints instead of full restarts. Experiments on a 16-node GPU cluster show a 66% reduction in training time compared to epoch-level checkpointing, completing in 18.2 hours with automatic recovery from three worker failures while wasting less than 5% of computation. The system achieves 87.3% overall accuracy and 82.1% macro-averaged F1 score across 37 disease categories, demonstrating robustness even when data modalities are partially missing.

Keywords: Distributed Machine Learning, Fault Tolerance, Checkpointing, Multi-Modal Learning, Medical Diagnosis, Healthcare Systems
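To make the idea of lineage-based recovery concrete, the following is a minimal sketch under stated assumptions, not the system described in this paper. It assumes each training task records its parent tasks and that some task outputs have been persisted; the `LineageStore` class, its methods, and the task names are hypothetical and introduced only for illustration. After a failure, only the tasks between the last persisted state and the lost output are re-run.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStore:
    """Toy lineage tracker (hypothetical): task dependencies plus persisted outputs."""
    parents: dict = field(default_factory=dict)   # task id -> list of parent task ids
    persisted: set = field(default_factory=set)   # task ids whose output is checkpointed

    def record(self, task, parents=()):
        # Register a task together with the tasks whose outputs it consumes.
        self.parents[task] = list(parents)

    def persist(self, task):
        # Mark a task's output as durably saved (an incremental checkpoint).
        self.persisted.add(task)

    def recovery_plan(self, lost):
        """Return the tasks to re-run, parents first, to rebuild the output of `lost`."""
        plan, seen = [], set()

        def visit(task):
            if task in seen or task in self.persisted:
                return  # persisted outputs are reloaded, not recomputed
            seen.add(task)
            for parent in self.parents[task]:
                visit(parent)
            plan.append(task)  # post-order: a task is added after its parents

        visit(lost)
        return plan


store = LineageStore()
store.record("shard_0")
store.record("step_1", ["shard_0"])
store.record("step_2", ["step_1"])
store.record("step_3", ["step_2"])
store.persist("shard_0")
store.persist("step_1")  # incremental checkpoint taken after step 1

# A worker fails while holding the un-persisted output of step_3.
plan = store.recovery_plan("step_3")
print(plan)  # ['step_2', 'step_3']
```

In this toy run, recovery replays only `step_2` and `step_3` rather than restarting from the data shard, which is the behavior the abstract contrasts with full restarts.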

Introduction

Modern healthcare institutions generate massive volumes of heterogeneous patient data, including radiological images in DICOM format, structured laboratory results, and continuous physiological monitoring streams. Leveraging this data for automated disease prediction requires training complex machine learning models capable of integrating information across disparate modalities [1]. However, real-world medical datasets present fundamental distributed systems challenges that impede practical deployment. Training on large medical datasets requires distributed computing due to both data volume (2.4 TB in our study) and model complexity (31.2M parameters). Unlike idealized research datasets, clinical data exhibits several characteristics that complicate distributed training:

Impact Factor value: 8.315

1. Heterogeneous data formats: DICOM images require different preprocessing than tabular laboratory values, and time-series vital signs need temporal feature extraction.
2. Missing modality scenarios: Not every patient has all data types. Some have X-rays but incomplete lab work; others have vital signs but no imaging (38% missing imaging, 22% partial labs).
3. Class imbalance: Rare diseases appear in fewer than 5% of records, making balanced training difficult.
4. Distributed training failures: Multi-day training runs on compute clusters fail due to transient hardware issues, network partitions, or out-of-memory errors on individual workers.

Traditional approaches to distributed training either assume complete data availability or rely on naive checkpointing strategies that waste significant computation on failure recovery [2]. The economic impact is substantial: in our preliminary experiments, three complete training restarts due to failures wasted over $27,000 in cloud GPU costs before we implemented fault-tolerant mechanisms.

This paper makes the following contributions:

1. A modality-aware data processing pipeline that handles missing patient data through learned embeddings rather than imputation or record deletion, preserving 40% more training data than standard approaches.
2. A lineage-based checkpointing system that tracks computational dependencies between distributed tasks and enables fine-grained recovery from failures with less than 5% wasted computation.
3. A coordinated recovery protocol that allows training to continue automatically after worker failures without human intervention.
4. Experimental validation on 53,420 patient records across a 16-node GPU cluster, demonstrating a 66% training time reduction compared to epoch-level checkpointing and 98.7% cost savings compared to naive training without fault tolerance.

The remainder of this paper is organized as follows: Section II reviews related work in distributed ML systems and medical diagnosis.
Section III presents the research gap, and Section IV presents the methodology, including data partitioning, model design, and fault-tolerant training protocols. Section V provides experimental results

ISO 9001:2008 Certified Journal
