International Research Journal of Engineering and Technology (IRJET) Volume: 12 Issue: 11 | Nov 2025 www.irjet.net
e-ISSN: 2395-0056 p-ISSN: 2395-0072
LLM-Assisted Swin Transformer Framework for Enhanced Adenocarcinoma Lung Cancer Classification and Interpretability in Histopathology
1Soomro Sarwan, School of Software, Northwestern Polytechnical University, Xi’an, China
2Gul Sheeraz, School of Computer Science, Northwestern Polytechnical University, Xi’an, China
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - We developed a new framework that combines a Swin Transformer for image analysis with the LLMa2 large language model to achieve high classification accuracy and provide textual explanations for its predictions. Our model classifies lung adenocarcinoma subtypes with 98.69% accuracy and a near-perfect AUC of 0.9997. It performs consistently well across all five cancer subtypes and is robust to class imbalance. We also found that 20x magnification provides an optimal balance of diagnostic power and computational efficiency. Furthermore, the integrated LLM acts as an intelligent assistant: it generates textual explanations of the model's decisions by listing salient morphological features, and it flags low-confidence predictions for pathologist review. This combined approach gives clinicians a highly accurate and interpretable tool for histopathology.
Key Words: Large Language Model, Swin Transformer, Lung Cancer Diagnosis, Histopathology, Resolution Selection, Multi-Resolution Analysis
1. INTRODUCTION
In digital pathology, Transformers are particularly well-suited for analyzing the entire context of a histopathological image at once, modeling dependencies between distant [...] Transformers directly solve, allowing them to model the morphological patterns characteristic of different cancer subtypes. According to a systematic review of 28 studies, large language models (LLMs) have the potential to automatically extract clinical information, aid in diagnosis and treatment, and support full-cycle lung cancer care; however, limitations in bias control and data security remain [1]. Moreover, when combined, LLMs and vision-language models provide strong multimodal AI capabilities for diagnosis, prognosis, and image analysis in the treatment of lung cancer, although ethical, legal, and validation issues restrict their clinical application [2]. LLMs are also not currently approved for this sensitive task because government regulations and patient privacy laws differ across countries, hospitals, and demographics. Transformer-based analysis can use LLMs' predictive ability to generate descriptive text based on the image analysis. These models offer a promising avenue for research in a controlled and secure manner without violating ethical, legal, or regulatory constraints, even though LLMs
© 2025, IRJET |
Impact Factor value: 8.315
are currently not permitted as medical devices and cannot directly affect clinical care [3].
Medical image analysis relies heavily on resolution; deep learning models use patch-based processing, multi-resolution inputs, and super-resolution techniques to improve feature extraction, classification accuracy, and diagnostic reliability in ultrasound imaging and histopathology. Likewise, multi-resolution multiple-instance learning techniques in whole-slide histopathology use slide-level supervision to pinpoint diagnostically relevant areas, eliminating the need for pixel-level annotations and thereby improving grading accuracy and clinical reliability [4, 5, 6]. Therefore, we propose a framework that uses a Swin Transformer for histopathological classification and integrates an LLM in a post-hoc manner to enhance the interpretability and clinical utility of the predictions. While pathologists naturally choose the best magnifications, modern systems acquire multi-resolution whole slide images that require significant resources.
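To make the post-hoc integration concrete, the following is a minimal sketch of how a classifier output could be turned into a text prompt for an LLM, with low-confidence cases flagged for review. It is an illustration under stated assumptions, not the implementation used in this work: the class labels are placeholders (the five subtype names are not listed in this section), and the 0.80 review threshold, the function names, and the prompt wording are all hypothetical.

```python
import math

# Hypothetical placeholder labels; the five subtype names are not given here.
CLASS_NAMES = ["subtype_1", "subtype_2", "subtype_3", "subtype_4", "subtype_5"]
REVIEW_THRESHOLD = 0.80  # assumed cut-off, not a value reported in the paper


def softmax(logits):
    """Convert raw classifier scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_prediction(logits, threshold=REVIEW_THRESHOLD):
    """Pick the top class and flag the case when its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": CLASS_NAMES[best],
        "confidence": probs[best],
        "needs_review": probs[best] < threshold,
    }


def build_prompt(summary):
    """Assemble the text that would be passed to the LLM after classification."""
    lines = [
        f"Predicted class: {summary['label']}",
        f"Confidence: {summary['confidence']:.2f}",
        "List the morphological features that typically support this class.",
    ]
    if summary["needs_review"]:
        lines.append("Confidence is low: recommend review by a pathologist.")
    return "\n".join(lines)


# Example logits standing in for the output of the image classifier.
confident = summarize_prediction([9.0, 1.0, 0.5, 0.2, 0.1])
uncertain = summarize_prediction([1.2, 1.0, 0.9, 0.8, 0.7])
print(build_prompt(confident))
print(build_prompt(uncertain))
```

Because the LLM only receives the classifier's output after the prediction is made, the language model cannot alter the classification itself; in this sketch it only affects how the result is explained and whether it is routed to a pathologist.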
1.1 Related Work
Transformer architectures now provide a method for diagnosing lung cancer by using self-attention mechanisms to model global histological patterns across entire images [7]. These models can also detect subtle long-range dependencies that are potentially useful for cancer detection without the need for explicit segmentation. Talib et al. [8] proposed a framework that integrates transformers and CNNs, combining a CNN for tissue type classification with TransSegNet for lesion segmentation via a vision transformer. Similarly, Srinivas et al. [9] introduced BoTNet, a hybrid architecture that replaces the spatial convolutions in the final three bottleneck blocks of ResNet with multi-head self-attention (MHSA), creating Bottleneck Transformer (BoT) blocks that preserve the residual structure. However, such a technique lacks LLM-aware assistance. Chen et al. [10] proposed Visformer, a hybrid architecture that systematically transitions from a Transformer (DeiT) to a CNN (ResNet). It integrates convolutional operations such as stage-wise downsampling, Batch Normalization, and 3×3 local convolutions in early layers, while retaining self-attention in later stages. Wang et al. [11] also proposed a hybrid CNN–Transformer (HCT) model for NSCLC N-staging and survival prediction from CT scans. The model integrates a 3D ResNet for local feature extraction and a Transformer
| ISO 9001:2008 Certified Journal |
Page 83