International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025
p-ISSN: 2395-0072
www.irjet.net
COMPARATIVE STUDY OF MASK R-CNN AND VISION TRANSFORMER FOR DAMAGE DETECTION IN RETROFITTED CONCRETE STRUCTURES
Priyansh Verma¹, Prof. Abhishek Mishra², Prof. Sachin Kumar Singh³
¹M.Tech Scholar, Department of Civil Engineering, I.E.T Lucknow, India
²Assistant Professor, Department of Civil Engineering, I.E.T Lucknow, India
³Assistant Professor, Department of Civil Engineering, I.E.T Lucknow, India
Abstract - In the context of modern civil infrastructure, structural health monitoring (SHM) has become a key element in ensuring safety and durability. Retrofitted concrete structures, while rehabilitated for enhanced performance, are still prone to long-term deterioration due to factors like environmental exposure, workmanship quality, and hidden internal stress. This paper presents a detailed comparative analysis of two state-of-the-art deep learning models, Mask R-CNN and Vision Transformer (ViT), for automated detection and classification of surface-level damage in retrofitted concrete elements. A dataset of 6400 images was prepared containing five practically encountered classes of structural defects: cracks, spalling, rust stains, water leakage marks, and efflorescence. Each image was preprocessed and augmented to reflect realistic site conditions such as poor lighting, noise, and irregular angles. The models were trained and evaluated using standard performance metrics including precision, recall, F1-score, mAP (for Mask R-CNN), and confusion matrix analysis. While the Mask R-CNN model exhibited superior performance in pixel-level segmentation and damage area quantification, the ViT model demonstrated faster and more accurate classification due to its global attention mechanism. The results suggest that ViT is well-suited for quick field inspections, while Mask R-CNN is preferred for detailed damage reporting and spatial analysis. This study not only reinforces the relevance of deep learning in structural engineering but also provides practical insights for deploying AI-based inspection systems in real-world retrofitting projects.
KEYWORDS
Retrofitting, Structural Damage Detection, Mask R-CNN, Vision Transformer, Concrete Defects, Deep Learning, Crack Detection, Efflorescence, Image Segmentation, Smart Infrastructure.
1. INTRODUCTION
The rehabilitation and retrofitting of aging concrete structures have become a global necessity due to increased urbanization, aging infrastructure, and the growing demand for sustainability. Despite extensive retrofitting interventions (including jacketing, grouting, FRP wrapping, and strengthening of joints), concrete structures remain vulnerable to environmental exposure and operational stresses. Over time, signs of distress such as cracks, delamination, and corrosion emerge, warranting consistent inspection and evaluation.
Conventional damage detection techniques rely heavily on visual inspection and non-destructive testing (NDT) methods like ultrasonic pulse velocity, rebound hammer, and ground-penetrating radar. While effective, these techniques are often costly, time-consuming, and subjective, especially over large structures or difficult-to-access areas. To address these limitations, the integration of computer vision and machine learning has received increasing attention in civil engineering. Specifically, deep learning approaches have demonstrated their capability in automatically detecting structural anomalies from images with high accuracy and speed. Among the most popular models are Convolutional Neural Networks (CNNs), which have been successfully applied to tasks like crack detection and corrosion classification.
However, traditional CNNs suffer from local receptive field limitations, often failing to capture global spatial context, a critical requirement when damages are diffuse, overlapping, or visually similar (e.g., water leakage vs. efflorescence). To overcome this, two advanced models are considered:
Mask R-CNN, capable of object-level instance segmentation and bounding box localization.
Vision Transformer (ViT), which uses global self-attention mechanisms to classify complex damage patterns.
This paper aims to assess these models on a practical, field-driven dataset of 6400 annotated images, capturing common post-retrofitting defects. The goal is to identify the strengths and limitations of both architectures in order to recommend a viable model or hybrid solution for real-time deployment.
© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 163