Translation Ally: Document and Audio Translator


e-ISSN:2395-0056

Volume: 10 Issue: 04 | Apr 2023 www.irjet.net p-ISSN:2395-0072


1,2,3,4 Undergraduate, IT Engineering, Atharva College of Engineering, Maharashtra, India

5 Professor, IT Department, Atharva College of Engineering, Maharashtra, India

Abstract - For years, language has been a barrier for companies and their employees: many companies cannot extend their businesses, and many employees cannot work in particular countries or companies, simply because of a difference in language. Obstacles that prevent information from flowing between a sender and a receiver cause the communication process to fail and are referred to as barriers to effective communication. If someone's words don't make sense to us, every conversation, email, report, and memo will be unproductive. Even simple daily tasks can be made difficult by language limitations. As more businesses move overseas, linguistic hurdles may become a worldwide problem and hinder the smooth conduct of business. Our project aims to remove language barriers in business communication, which will be advantageous for multinational companies and businesses all over the world and will help them extend their business overseas. It is a platform where one can convert text documents, audio, and video transcripts into other languages. We make use of Python libraries and Django for translating documents and audio files. This web application is intended to prevent miscommunications, misunderstandings, and conflicts, and to convey thoughts, ideas, and instructions more effectively.

Key Words: Translation, document, audio, python, django.

1. INTRODUCTION

Human beings experience a language barrier when they are unable to communicate using a particular language. There are several causes of language barriers. When people who speak different languages interact, they do not understand each other, so no real communication takes place. According to some statistics, 121 different languages are each spoken by more than 10,000 people throughout the nation, so no one can be expected to understand every language they encounter. English is not the first language for most people in the world. Therefore, it is essential to have a translator.

The conversation is meaningless if the speaker and the recipient do not use the same words and language. Communication can become ineffective, and messages may not get across, if words are used that the other person cannot comprehend.

Language barriers can be challenging to surmount, whether you're traveling and attempting to understand a restaurant menu or working in a multilingual company. Work is no different: being unable to follow a conversation accurately in a business situation is more closely scrutinized and potentially embarrassing. The revolution in remote work has enabled companies to look beyond their physical borders and enter untapped markets, which has made the problem of language barriers suddenly apparent across the business world.

Translation Ally will enable you to translate any kind of document and audio from one language to another and, in turn, remove the language barriers that arise in businesses. Suppose an employee is transferred from one state or country to another. In this case they can make use of this website to translate documents and audio that are in a language unknown to them into a language they can understand.

Considering this problem of the language barrier in business communication, the main aim of our project is to create a web application that can overcome language barrier problems around the globe. To help bridge the language barrier, the document translator translates business reports, Excel sheets, letters, and similar documents from one language to another without losing the formatting of the document. Employees who have to visit other countries for work and don't know the language of that country can use our website to translate the work-related documents and instructions provided to them. They don't have to learn a foreign language specifically to communicate with their colleagues. Our project will help them adjust to the new environment and feel at ease.

The solution translates documents in formats such as .txt, .docx, .xlsx, and .csv, as well as audio files such as .mp3, according to the user's needs. First, the user uploads the document or audio file and selects the source language and the desired language.

International Research Journal of Engineering and Technology (IRJET) © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal

1.1 Software Overview

Our project will help in translating the different formats of documents used in businesses. Users get an interface/home page where they can upload a document and select languages. Users first select the source language in which the original document is written, and then the destination language in which they want the output of the uploaded document. Once a user uploads a file and selects the languages, the web application generates a translated file. Files can be of various forms such as docx, excel, csv, and mp3. For docx files, the translated file preserves the contents and styling of the original file, such as tables, images, and fonts. After a few seconds or minutes, the user gets a popup message when the file is ready, and can then click the download button to download it.

2. LITERATURE SURVEY

1) Document Segmentation and Language Translation Using Tesseract-OCR by S. Thakare, A. Kamble, V. Thengne and U. R. Kamble. This paper describes a web application that accepts image documents as input: the user supplies an image file containing text in any language available in the Python-tesseract library, and the application translates it into any supported language using Google Translator.

2) An overview of the impact of language barriers on businesses by Helene Tenzer, Markus Pudelko & Anne-Wil Harzing. This paper discusses how, as corporations extend their businesses overseas, language becomes a hindrance. They need to overcome this barrier in order to make their business successful.

3) Language translation of web-based content by Bart Kahler, Brian Bacher, K. C. Jones. This paper summarizes a project that can translate websites and help people surf the web without any boundaries. It provides adequate conversion of foreign languages to one's native tongue; however, dialects, slang, and character conversion errors result in partially successful translations.

4) An efficient English to Hindi machine translation system using hybrid mechanism by J. Nair, K. A. Krishnan and R. Deetha. This paper argues that, as the majority of Indians, especially those living in distant villages, cannot read, write, or understand English, an effective language translator must be used.

2.1 Comparative Analysis

In this section, a comparison between the existing system and the proposed system is conducted.

Table-1: Comparative Analysis

Current System | Proposed System
It supports up to 67 languages. | It supports up to 107 languages.
System is not able to preserve formatting. | System preserves formatting.
Audio translation is not included in the system. | Audio translation is implemented.

3. SURVEY OF TYPES OF DOCUMENTS USED

According to a survey conducted by us, there are three common document formats used in business:

Fig -1: Most used formats in businesses

The survey was conducted in August 2022 by sharing Google Forms with people currently working in corporates across different industries.

The survey form contained a few questions, such as which documents their company uses for different documentation purposes and for legal documents. As per the responses, the top three document formats are .pdf, .docx and .xlsx.

PDF (Portable Document Format): This format is widely used in business for documents such as contracts, proposals, and reports. PDFs are widely supported and maintain formatting across different devices and operating systems.

Microsoft Word: Word is the standard word processing program used by many businesses. It allows for easy collaboration and formatting, and can be used for a variety of documents such as letters, memos, and reports.

Excel (or CSV): Excel is used for spreadsheets and data analysis. It can be used for financial analysis, budgeting, and inventory management. CSV (Comma-Separated Values) is a simpler format used for exporting and importing data between different programs.

We have tried to include translation for these documents in our project. This will in turn help users translate important documents like contracts, proposals, and reports.

4. TECHNOLOGIES USED

This project was developed using a combination of technologies, languages, and algorithms, which together allowed the intended functionality to be achieved.

4.1 Python-Django

Python is an open source, extensible, and scalable programming language capable of heavy processing, which we used to develop the backend of our application. As it supports a wide range of libraries and features, it was the best choice available to us. Accordingly, Django, a very popular Python framework, was used. Django makes the process of developing web applications easier and reduces the need for hard-coding.

4.2 HTML, CSS and JavaScript

HTML (HyperText Markup Language) is used for designing web pages. On the World Wide Web, it is used to structure and present content. It is combined with scripting languages like JavaScript and styling tools like Cascading Style Sheets (CSS).

5. PYTHON LIBRARIES USED

Python supports a variety of libraries that ease programmers' work and provide a wide range of functionality.

5.1 BeautifulSoup

The BeautifulSoup Python library is used to analyze HTML and XML texts. For parsed pages, it generates a parse tree that can be used to extract HTML data for web scraping. In this project, it is used for reading an HTML document that was converted from a .docx file, in order to preserve the styling and format during translation.

5.2 NLTK

The Natural Language Toolkit, or more simply NLTK, is a collection of Python-coded tools and applications for natural language processing of English.

5.3 Mammoth

Mammoth is one of the libraries used to convert .docx files into HTML files. It is used while translating a .docx file: the file is converted to HTML so that the fonts, styles, colors, tables, and formatting of the document are preserved and can be given back to the user exactly.

5.4 Pandas

Pandas is one of the powerful Python packages for handling data. It is a simple, fast, and expressive library. It is used in this project for reading and writing .csv and .xlsx files.

6. SYSTEM FLOW

Fig -2: Flowchart

This web application will help in translating the different formats of documents and audio, such as docx, excel, csv, and mp3, used in businesses and in other fields.

1. The user selects the document that they want to translate.

2. The user selects the source language in which the original document is written. If the user does not select the source language, or selects an incorrect language, the source language is detected automatically. The user then selects the destination language into which they want the uploaded document to be translated.

3. The uploaded file can be an audio file or a document. If the uploaded file is an audio file, the text is extracted from the audio and stored in a text file.

4. The text is translated with the help of the googletrans library, and the output is generated.

5. According to the needs of the user, the output is provided in audio or text format. For docx and pdf files, the translated file preserves the contents and styling of the original file, such as bold, underline, tables, images, and fonts.
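The branching in the steps above, where the file type decides how the text is extracted, can be sketched as a lookup on the file extension. This is only an illustration: the `HANDLERS` table, the route labels, and the `choose_pipeline` function are hypothetical names and are not taken from the project's code.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a processing route.
# The extensions are the formats named in the paper; the labels are
# illustrative only.
HANDLERS = {
    ".txt": "text",
    ".docx": "docx",
    ".csv": "csv",
    ".xlsx": "excel",
    ".mp3": "audio",
}


def choose_pipeline(filename: str) -> str:
    """Return the name of the processing route for an uploaded file."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    try:
        return HANDLERS[suffix]
    except KeyError:
        # Reject anything outside the supported formats.
        raise ValueError(f"unsupported file type: {filename!r}") from None
```

A table lookup keeps the supported formats in one place, so adding a format means adding one entry rather than another conditional branch.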

6. METHODOLOGY

When a user uploads a document, the document is stored in the backend. Users can upload different types of files such as docx, excel, csv, and text. Depending on the type of file, there are different methods for processing and extracting text from the uploaded file.

6.1. Text File

For text files, the file is opened and read using the normal file handling functions provided by Python. The sentences are separated using the nltk library and stored in a list. The list is iterated over and each sentence is translated one by one. The translated sentences are then written back to the same file, overwriting the original text.
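A minimal sketch of this sentence-by-sentence loop follows. It makes two assumptions that are not from the project: a naive regular-expression splitter stands in for the nltk sentence tokenizer, and the translation call is passed in as a `translate` function so the example does not depend on any particular translation service. The function names are hypothetical.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    A stand-in for a real sentence tokenizer; it will mis-split
    abbreviations such as "Dr. Smith".
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_text(text: str, translate: Callable[[str], str]) -> str:
    """Translate a document one sentence at a time and rejoin the result."""
    return " ".join(translate(s) for s in split_sentences(text))
```

For example, `translate_text(open(path).read(), my_translator)` would return the translated text, which can then be written back to the file; `my_translator` is whatever function wraps the translation library in use.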

6.2. Docx File

The docx file is converted to an HTML file. This conversion is done to preserve the formatting of the docx file, so that the images, tables, and other styling are not lost while the file is translated. The images in the docx file are encoded in base64 format so that they can be recreated when the file is converted back to the original format. The conversion of the docx file to HTML is done by a library called mammoth. The contents of the docx file are wrapped in various tags, and the text in the <p>, <li>, <td>, etc. tags is extracted and translated one tag at a time. The parsing of the HTML file is done using the BeautifulSoup library, which creates a parse tree for the parsed page from which the data in the HTML can be extracted. The original text is replaced with the translated text, and in this way all the text in the original file is replaced with its corresponding translation. From the translated HTML file, a docx file is created with the same format as the uploaded file. The base64 images are decoded and the original images are obtained.
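The idea of translating only the text inside the tags while leaving the markup and the base64 images untouched can be sketched as follows. Note the substitution: this sketch uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the translation call is again an injected `translate` function. The class and function names are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class TagTextTranslator(HTMLParser):
    """Re-emit HTML unchanged, except that text nodes are translated.

    Comments and doctype declarations are dropped in this sketch.
    """

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the raw tag text so attributes (styles, base64 image
        # data) are passed through exactly as written.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            # Translate the text, then re-escape &, < and >.
            self.out.append(escape(self.translate(data), quote=False))
        else:
            self.out.append(data)  # keep whitespace-only runs as-is


def translate_html(html: str, translate) -> str:
    parser = TagTextTranslator(translate)
    parser.feed(html)
    parser.close()  # flush any trailing text
    return "".join(parser.out)
```

Because only text nodes pass through `translate`, tables, inline formatting, and embedded images survive the round trip, which is the property the docx-to-HTML approach relies on.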

6.3. Csv File

CSV files are comma-separated files that are generally used to store large amounts of data. The CSV files are read as plain text files, since plain text is the quickest for Python to process. So, to decrease the response time and increase the efficiency of the website, the CSV files are read using Python's file handling functions. Each row is taken from the CSV file and given as input to the translator function. The translated text is then written back to the same file, overwriting the original.
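A row-by-row pass of this kind can be sketched as below. This differs from the description above in one labeled respect: it uses Python's standard `csv` module rather than plain text reads, as an assumption, so that quoted cells containing commas stay intact. Skipping numeric cells is also an assumption of this sketch, and `translate_csv` and `_is_number` are hypothetical names.

```python
import csv
import io
from typing import Callable


def _is_number(cell: str) -> bool:
    """True if the cell parses as a number (such cells are left alone)."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def translate_csv(text: str, translate: Callable[[str], str]) -> str:
    """Translate every non-empty, non-numeric cell, one row at a time."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [c if not c.strip() or _is_number(c) else translate(c) for c in row]
        )
    return out.getvalue()
```

The returned string can be written back over the uploaded file, which matches the overwrite step described above.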

6.4. Excel file

Excel sheets are most commonly used in corporations for storing business data. Excel files are generally larger than CSV files and hence take more processing time. To decrease this processing time, Excel files are converted to CSV files, which can be regarded as a simpler, plain-text version of the Excel files. The conversion of the Excel file to CSV is done using the pandas library. The CSV file is then translated one row at a time, and the original rows are overwritten by the translated rows. After the CSV file is translated, it is converted back to an Excel file.

7. RESULT

This is the document in the English language that we uploaded to the project for translation.


8. CONCLUSIONS

Language can be one of the biggest barriers in the business world and can create miscommunication and misinterpretation among workers. Every business wants to grow and expand across the globe, and different languages are used in different states and countries. To understand a particular file or document, the user does not need to learn its language and can instead use this project to translate the file.

The application is simple and easy for anyone to use: the user gets the translated file with a few clicks and inputs, within a few seconds, and for free. Files can be of different formats such as .txt, .docx, .csv, and .xlsx, and the translated file will be of the same format. In simple words, document translation is the process of converting text from one language to another. Depending on the industry in which the user operates, any number of documents and content can require translation. Beyond business documents, the project can be used to translate files in fields such as healthcare, government, law, and many others, so the benefits of this document translation are wide-ranging.

Fig -3: Uploaded Document (Before Translation)

Below is the translated document, which was downloaded after being translated into Hindi with all the styling, images, and tables preserved along with the formatting.

Fig -4: Downloaded Document (After Translation)

ACKNOWLEDGEMENT

We would like to express our profound gratitude to Prof Deepali Maste, HOD of the Information Technology Department, and Dr Shrikant Kallulkar, Principal of Atharva College of Engineering, for their contributions to the completion of our project titled Document Translator.

We would like to express our special thanks to our project guide and Major project co-ordinator, Prof Renuka Nagpure, for her time and the efforts she provided throughout the year. Her useful advice and suggestions were really helpful to us during the project's completion, and we are truly grateful to her.

REFERENCES

[1] S. Thakare, A. Kamble, V. Thengne and U. R. Kamble, "Document Segmentation and Language Translation Using Tesseract-OCR," 2018 IEEE 13th International Conference on Industrial and Information Systems (ICIIS), 2018, pp. 148-151, doi: 10.1109/ICIINFS.2018.8721372.

[2] B. Kahler, B. Bacher and K. C. Jones, "Language translation of web-based content," 2012 IEEE National Aerospace and Electronics Conference (NAECON), 2012, pp. 40-45, doi: 10.1109/NAECON.2012.6531026.

[3] J. Nair, K. A. Krishnan and R. Deetha, "An efficient English to Hindi machine translation system using hybrid mechanism," 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2016, pp. 2109-2113, doi: 10.1109/ICACCI.2016.7732363.

[4] M. Pudelko and H. Tenzer, "Boundaryless careers or career boundaries? The impact of language barriers on academic careers in international business schools," Academy of Management Learning & Education, 18.2 (2019), pp. 213-240.

[5] U. Kheradia and A. Kondwilkar, "Speech To Speech Language Translator," International Journal of Scientific and Research Publications, Volume 2, Issue 12, December 2012, ISSN 2250-3153.

[6] A. Waibel, A. Badran, A. W. Black, R. Frederking, D. Gates, A. Lavie, K. Lenzo, L. Tomokiyo, J. Reichert, T. Schultz, D. Wallace, M. Woszczyna and J. Zhang, "Speechalator: two-way speech-to-speech translation on a consumer PDA," EUROSPEECH 2003, Geneva.

[7] B. Turovsky, "Found in translation: More accurate, fluent sentences in Google Translate," published Nov 15, 2016.

[8] R. Sennrich, B. Haddow and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," submitted on 31 Aug 2015 (v1), last revised 10 Jun 2016 (v5). The research presented in this publication was conducted in cooperation with Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland.

[9] N. R. Ingale, A. Suman, A. Patil and S. Raina, "Text Fetching App by Image Processing," International Research Journal of Innovations in Engineering and Technology, Volume 4, Issue 5, pp. 51-54, May 2020.

[10] G. Lample, L. Denoyer and M. Ranzato, "Unsupervised Machine Translation Using Monolingual Corpora Only," under review as a conference paper at ICLR 2018.

BIOGRAPHIES

Miss Zalak Gandhi, resident of Mumbai, Maharashtra, India, Student of IT Engineering from Mumbai University.

Miss Saloni Joshi, resident of Mumbai, Maharashtra, India, Student of IT Engineering from Mumbai University.

Miss Mansi Kargutkar, resident of Mumbai, Maharashtra, India, Student of IT Engineering from Mumbai University.

Miss Khushi Pal, resident of Mumbai, Maharashtra, India, Student of IT Engineering from Mumbai University.
