
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 11 Issue: 04 | Apr 2024

p-ISSN: 2395-0072

www.irjet.net

Text Extraction from Video using Deep Learning

ASSOC. PROF. L. RASIKANNAN1, K. GUNAL2, S. SABARINATHAN3, S. VIGNESWARAN4

1,2,3,4 Dept. of Computer Science and Engineering, Government College of Engineering Srirangam, Tamilnadu, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - In the modern digital landscape, video content has grown rapidly in quantity across diverse platforms, offering a wealth of information. However, extracting textual information from these videos accurately and efficiently remains a significant challenge. This paper proposes an approach to extract text from video content by combining Optical Character Recognition (OCR) algorithms with a Convolutional Recurrent Neural Network (CRNN). By leveraging the strengths of both OCR and CRNN, our approach aims to enhance the accuracy and efficiency of text extraction from video lectures, tutorials, and educational content. The extracted text serves as a valuable reference for students, enriching their learning experience. This study contributes to unlocking the untapped potential of textual information embedded within video content, thereby facilitating better access to knowledge in the digital age.

Key Words: Optical Character Recognition (OCR), Convolutional Recurrent Neural Network (CRNN), Text Extraction, Video Content, Information Retrieval

1. INTRODUCTION

In today's digital age, video content has become ubiquitous across platforms and offers a vast store of information to users around the world. However, despite the wealth of knowledge contained in these videos, accessing and using the textual information they contain remains challenging. Accurate and efficient text extraction from video content is key to increasing the accessibility and usability of this information, especially in educational contexts such as lectures, tutorials, and instructional videos.

Traditional methods of extracting text from videos often rely on optical character recognition (OCR) techniques, which can struggle with low-quality images, distorted text, and complex backgrounds. OCR algorithms often have difficulty recognizing text that deviates from standard fonts or styles, such as handwritten text, artistic fonts, or distorted characters. This limitation may lead to errors or incomplete text extraction. Additionally, OCR alone may not capture context or structure well, especially in scenarios where text is embedded in dynamic visuals. On the other hand, convolutional neural networks (CNNs) have shown remarkable success in image recognition tasks, including text detection and extraction. By utilizing both OCR and CNN algorithms, we can potentially improve the accuracy and efficiency of text extraction from video content. OCR is used for text detection from frames and image segmentation, while the CNN is used for text recognition. CNNs trained on large datasets can generalize well across different fonts and writing styles, making the combined approach versatile and applicable to a wide variety of video content.

1.1 RELATED WORK

K. S. Raghunandan and Palaiahnakote Shivakumara [1] address robust text detection and recognition in multi-script-oriented images. Previous research has employed techniques including bit plane slicing, Iterative Nearest Neighbor Symmetry (INNS), Mutual Nearest Neighbor Pair (MNNP) components, character detection using fixed windows, contourlet wavelet features with an SVM classifier, and Hidden Markov Models (HMM) for recognition.

Xu-Cheng Yin and Xuwang Yin [2] present a method for accurate text detection in natural scene images. Their approach employs a fast pruning algorithm to extract Maximally Stable Extremal Regions (MSERs) as character candidates, which are then grouped into text candidates using single-link clustering. Distance metrics and thresholds are learned automatically, and the posterior probabilities of text candidates are estimated using a character classifier to eliminate non-text regions. Evaluation on the ICDAR 2011 Robust Reading Competition database demonstrates an f-measure of over 76%, surpassing state-of-the-art methods, with further validation on various databases confirming its effectiveness.

Pinaki Nath Chowdhury and Palaiahnakote Shivakumara [3] present a new method for detecting text on human bodies in sports images, addressing challenges such as poor image quality and diverse camera viewpoints. Unlike conventional methods, it employs an end-to-end episodic learning approach that detects clothing regions using a Residual Network (ResNet) and a Pyramidal Pooling Module (PPM) for spatial attention mapping. Text detection is performed using the Progressive Scalable Expansion Algorithm (PSE). Evaluation on various datasets demonstrates superior precision and F1-score compared to existing methods, confirming effectiveness across different inputs.
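The f-measure (F1-score) reported for the text detection methods in the related work is the harmonic mean of precision and recall. As a minimal sketch, the Python function below computes it. The function name `f_measure`, the optional `beta` weighting, and the sample precision and recall values are illustrative assumptions, not figures or code from the cited papers.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the F1-score (harmonic mean)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta**2 * precision + recall
    if denom == 0:
        # Both values are zero: define the score as 0 instead of dividing by zero.
        return 0.0
    return (1 + beta**2) * precision * recall / denom


# Hypothetical detector scores, chosen only for illustration.
example = f_measure(precision=0.80, recall=0.70)
print(f"F1 = {example:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot obtain a high f-measure by maximizing precision while neglecting recall, or the reverse.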

Impact Factor value: 8.226

2. METHODOLOGIES

Our proposed approach for extracting text from video content leverages the strengths of optical character

ISO 9001:2008 Certified Journal

Page 947