International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 12 Issue: 10 | Oct 2025
p-ISSN: 2395-0072
www.irjet.net
A Survey on Sign to Text Translation using Deep Learning Techniques
Veeransh Shah1, Moksh Shah2, Purab Tamboli3, Prof. Pramila M Chawan4
1B. Tech Student, Dept of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
2B. Tech Student, Dept of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
3B. Tech Student, Dept of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
4Associate Professor, Dept of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – The project presents an automated system for translating American Sign Language (ASL) into English in real time. The system is designed to bridge the communication gap for the deaf community in various social and professional settings. Leveraging advancements in deep learning, this work utilizes the MS-ASL dataset and a pre-trained I3D model to achieve accurate sign language recognition. The methodology involves a comprehensive data pre-processing pipeline including video trimming, cropping, and frame conversion. The I3D model is fine-tuned using transfer learning to adapt to the specifics of ASL recognition. This project lays the groundwork for a practical and scalable solution to facilitate seamless communication for the deaf and hard of hearing.
Key Words: Deep Learning, I3D Model, MS-ASL Dataset, Transfer Learning
1. INTRODUCTION
Communication is a fundamental aspect of human interaction. However, for the deaf community, communication with the hearing world can be a significant challenge. This project aims to address this issue by developing an automated system that translates American Sign Language (ASL) into English text in real time. By leveraging recent advancements in deep learning and computer vision, we aim to create a practical solution that can be used in various scenarios, such as doctor’s appointments, educational settings, and conferences. Our work builds upon existing research and utilizes the large-scale MS-ASL dataset to train a robust and accurate translation model.
The complexity of sign language extends beyond simple hand gestures; it is a complete visual language incorporating facial expressions, body posture, and the speed and rhythm of movements to convey meaning and grammatical structure. Traditional interpretation methods, while invaluable, are not always accessible or practical for spontaneous daily interactions. An automated system offers the promise of immediate, on-demand translation, reducing reliance on human interpreters and giving individuals greater autonomy. This work seeks to use deep neural networks to learn these intricate visual patterns directly from video data, creating a system that not only recognizes gestures but also captures the temporal dynamics that define them, paving the way for more natural and fluid communication.
This research specifically focuses on word-level sign recognition, establishing a foundational vocabulary that can be accurately translated. The core challenge addressed is the model's ability to discern subtle spatio-temporal variations between different signs, a task for which the I3D architecture is well suited. By successfully implementing this system, we demonstrate the viability of using pre-trained action recognition models as a starting point for the more nuanced domain of sign language. The ultimate vision is a low-latency, highly accurate translation tool that can be integrated into everyday communication platforms, giving users greater independence and fostering more inclusive interactions.
2. LITERATURE REVIEW
2.1 Deep Learning Algorithms
2.1.1 3D Convolutional Neural Networks (3D CNNs): These models extend the concept of 2D CNNs to the time domain, making them well suited to video analysis. 3D CNNs use 3D filters to convolve over both the spatial and temporal dimensions of video data, allowing them to learn features that represent both the appearance of handshapes and the motion of gestures.
Unlike hybrid approaches that use a 2D CNN to extract features from each frame and then feed them into a separate temporal model such as an RNN, 3D CNNs perform spatio-temporal feature learning in a unified, hierarchical manner. The 3D convolutional kernels slide across a cube of video frames, enabling the network to directly learn motion primitives, such as the arc of a hand movement or the change in a handshape over time, within its early layers. This integrated approach often leads to more robust representations of dynamic gestures compared to methods that decouple spatial and temporal feature extraction.
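As a rough illustration of the 3D convolution described in Section 2.1.1, the following dependency-free Python sketch slides a small 3D kernel across a toy video clip. It is not the I3D model or any library's implementation; the helper names (`conv3d_valid`, `make_clip`, `TEMPORAL_DIFF`) and the toy clip are hypothetical and chosen only to show how a kernel spanning the time axis responds to motion.

```python
"""Naive 3D convolution over a tiny grayscale clip (illustrative sketch only)."""


def conv3d_valid(clip, kernel):
    """Slide a 3D kernel over a clip indexed as clip[t][y][x].

    'Valid' mode: no padding, stride 1, so each output axis shrinks by
    (kernel size - 1). As in most deep learning libraries, this is
    cross-correlation (the kernel is not flipped).
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the time, height, and width extent of the kernel.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip(num_frames=4, size=4):
    """Hypothetical toy clip: one bright pixel moving right by one column per frame."""
    clip = []
    for t in range(num_frames):
        frame = [[0.0] * size for _ in range(size)]
        frame[1][t % size] = 1.0
        clip.append(frame)
    return clip


# A 2x1x1 temporal-difference kernel: next frame minus current frame.
# It responds only where intensity changes over time, i.e. to motion.
TEMPORAL_DIFF = [[[-1.0]], [[1.0]]]

clip = make_clip()
response = conv3d_valid(clip, TEMPORAL_DIFF)
# Output shape is (T - 1, H, W) for this kernel.
print(len(response), len(response[0]), len(response[0][0]))  # prints: 3 4 4
```

In this toy case the response is -1 where the pixel leaves a location and +1 where it arrives, while a static clip yields all zeros. A trained 3D CNN learns many such kernels, typically with larger spatial and temporal extent, instead of using a hand-written one.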
© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 139
2.1.2 Recurrent Neural Networks (RNNs) with LSTMs: These models are designed to process sequential data. In sign language recognition, LSTMs can be combined with