
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 05 | May 2023

p-ISSN: 2395-0072

www.irjet.net

MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE

Dr. N.V. Sailaja, Billakanti Sushma, Aredla Likitha Reddy, Charitha Parachuri, Chandra Akash

1 Assistant Professor, Dept. of Computer Science Engineering, VNRVJIET College, Telangana, India
2,3,4,5 Student, Dept. of Computer Science Engineering, VNRVJIET College, Telangana, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - The Internet has taken the place of traditional communication channels. The use of online communication has resulted in a rapid rise in audio and visual information. Although this has been beneficial for the majority of people, those with special needs, such as the deaf, have few resources at their disposal. A speech-to-text conversion programme is therefore developed. The audio input is converted to text using speech recognition technology. Natural Language Processing algorithms are used to extract root words, segment words, and translate the text into different languages.

Key Words: transformers, Whisper, Machine Learning, Automatic Speech Recognition, web page, tokenizer, pipeline.

1. INTRODUCTION

With the development of machine learning and deep learning algorithms, automated voice recognition has become a major study area. Fine-tuning pre-trained models, such as those available through Hugging Face, is one such method.

Hugging Face, a well-known NLP library, offers pre-trained models for a variety of NLP tasks, including text classification, question answering, and language generation [11]. The library also provides access to pre-trained speech recognition models based on the Transformer architecture.

Fine-tuning Hugging Face models [11] for voice recognition entails training the pre-trained models further on a particular dataset in order to increase the model's accuracy and performance on a specific speech recognition task. When trained on domain-specific datasets, this method has been shown to greatly increase the accuracy of voice recognition systems.

In this study, we discuss our work on fine-tuning Hugging Face models for voice recognition. We outline the training dataset, the fine-tuning procedure, and the assessment measures employed to gauge the model's effectiveness. We also include a front end that records voice from a microphone in order to test the model's efficiency and accuracy. The front end makes the model easy to use and provides a good user experience.

Our goal is to create a multilingual speech-to-text conversion system employing Hugging Face for hearing-impaired people. The technology will make it possible for deaf individuals to instantly convert spoken language into text, assisting them in a variety of tasks including listening to lectures, participating in meetings, or even having conversations with others. Hugging Face's [11] cutting-edge neural network models and natural language processing algorithms will improve the suggested system's accuracy and effectiveness. The system will also support many languages, enabling a wide range of people throughout the world to utilise it. Overall, the suggested technology would enable seamless communication and engagement in many contexts, greatly enhancing the quality of life for those with hearing impairments.

Our findings show the value of fine-tuning pre-trained models such as those from Hugging Face [11] for speech recognition, and we believe this strategy has a great deal of promise for raising the accuracy and efficiency of speech recognition systems across a variety of domains.

2. LITERATURE SURVEY

Among the comparable systems reported in the research literature, applications that target rescue were the most prevalent. They are:

In [1], the proposed model uses a standard speech-to-text conversion engine to take the input speech. With the use of cutting-edge artificial intelligence methods like Deepnets and Q learning, the search space is reduced using the HMM (Hidden Markov Model), and the optimised results are then used in real time. The accuracy of the suggested phonetic model is 90%.

In [2], the proposed approach deals with recognition of two different languages, Kannada and English. A deep learning voice recognition model is used in conjunction with a word prediction model. When evaluated on a multilingual dataset, the accuracy is 85%, and when the user tests it in real time, the accuracy is 71%. A cosine similarity model was also employed. When predictions were made using the average similarity of each class, the cosine similarity model attained an average accuracy of 59%.

In [3], a complete speech-to-text conversion system for the Bangla language using Deep Recurrent Neural Networks is presented. Possible optimization such as Broken Language Format has been proposed which is based on
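The assessment measures referred to above are not named in this part of the paper. A common choice for speech recognition is the word error rate (WER), so the sketch below assumes WER purely for illustration; the function name `word_error_rate` and the lower-casing and whitespace tokenisation are assumptions, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are lower-cased and split on whitespace; a real
    evaluation would usually also normalise punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A lower value is better: 0.0 means the transcript matches the reference word for word, and values above 1.0 are possible when the hypothesis contains many extra words.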
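The class-averaged cosine similarity prediction described for [2] can be sketched roughly as follows. The feature vectors, the class labels, and the function names are hypothetical; the survey text does not say what features [2] compared, so this only illustrates the general idea of scoring each class by the mean cosine similarity between a query and that class's examples.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_by_average_similarity(
    query: Sequence[float],
    examples_by_class: Dict[str, List[Sequence[float]]],
) -> Tuple[str, Dict[str, float]]:
    """Return the class whose examples are, on average, most similar to the query."""
    scores = {
        label: sum(cosine_similarity(query, v) for v in vectors) / len(vectors)
        for label, vectors in examples_by_class.items()
        if vectors  # skip classes with no examples
    }
    if not scores:
        raise ValueError("at least one class must have examples")
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

With toy two-dimensional vectors, a query pointing mostly along the first axis is assigned to the class whose examples also point along that axis.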
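As a rough illustration of how the pre-trained speech recognition models mentioned above can be accessed, the sketch below uses the `pipeline` helper from the Hugging Face `transformers` library. The checkpoint name `openai/whisper-small`, the helper names, and the audio file path are assumptions made for this example; the paper lists Whisper only as a keyword and does not state here which checkpoint was used.

```python
from typing import Any, Dict, Optional

# Assumption: an illustrative multilingual Whisper checkpoint, not necessarily
# the one used by the authors.
DEFAULT_MODEL = "openai/whisper-small"


def build_generate_kwargs(language: Optional[str] = None) -> Dict[str, Any]:
    """Decoding options for Whisper: transcribe in the spoken language.

    If `language` is None, Whisper attempts to detect the language itself.
    """
    kwargs: Dict[str, Any] = {"task": "transcribe"}
    if language:
        kwargs["language"] = language
    return kwargs


def transcribe(audio_path: str,
               language: Optional[str] = None,
               model_name: str = DEFAULT_MODEL) -> str:
    """Transcribe an audio file with a pre-trained speech recognition model."""
    # Imported lazily so the helper above can be used without the library.
    # Requires `pip install transformers torch`; decoding audio from a file
    # path also requires ffmpeg to be installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_name)
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(language))
    return result["text"]
```

A call such as `transcribe("sample.wav", language="english")` would download the checkpoint on first use and return the recognised text; `sample.wav` is a placeholder file name.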

Impact Factor value: 8.226

ISO 9001:2008 Certified Journal

Page 1782