Audio computing Image to Text Synthesizer - A Cutting-Edge Content Generator Application
Abhishek Venkata Shiva Siripalli1, Nikhil Shinde2, Prof. Lovenish Sharma3
1Student, School of Engineering, Ajeenkya DY Patil University, Pune, Maharashtra, India
2Student, School of Engineering, Ajeenkya DY Patil University, Pune, Maharashtra, India
3Professor, Ajeenkya DY Patil University, Pune, Maharashtra, India
Abstract - In the modern world, there is a great increase in the use of digital technology, and a variety of methods are available for a person to capture images. Such images may contain important textual information that the user may wish to edit or store digitally. This can be done using Optical Character Recognition with the help of the Tesseract OCR Engine. OCR is a branch of artificial intelligence that is used in applications to recognize text from scanned documents or images. The recognized text can also be converted to audio format to help visually impaired people, and also the illiterate, hear the information that they wish to know. Thus, the present application converts images to text and to handwritten notes, and then provides the content as audio, using an Optical Character Recognition (OCR) tool.
We have also introduced features such as image to text and text to speech, and the text can be translated into any language according to the user's requirement, which makes the application more accessible and familiar to use. In addition, we include a translator, using the Google Translate package (googletrans) in our project. We explore different packages, and the project comprises a web page where the user can upload an image; the backend processes the input and sends the result back through an API. This application can be used for character recognition from scanned documents so that data can be digitized. Also, the data can be converted to audio form to help visually impaired people obtain the information easily. In future, the application can be extended to recognize more languages and different fonts. Various accents can also be added for the audio files.
Key Words: OCR (Optical Character Recognition), translator, handwritten notes, Tesseract OCR Engine, Text-to-Speech (TTS).
1. INTRODUCTION
Audio computing Text and Image Synthesizer makes it possible to extract text from pictures in order to automate the processing of text in images, videos, and scanned documents. In this paper, we look at how to convert an image to text with React and Tesseract.js (OCR), pre-process images, deal with the limitations of Tesseract (OCR), and later provide an output in audio format which can be downloaded and saved for future reference. Text is readily available in many sources in the form of documents, newspapers, faxes, printed information, handwritten notes, etc. Many people simply scan the document to store the data on their computers. When a document is scanned with a scanner, it is saved in the form of images. But these images are not editable, and it is very hard to find what the user requires, as they have to go through the entire image, reading each line and word to determine whether it is relevant to their need. Images also take up more space than word-processor files on the computer. It is essential to be able to store this data in such a way that it becomes easier to search and edit. There is a growing demand for applications that can recognize characters from scanned documents or captured images and make them editable and easily accessible [1].
As reading is of great importance in the day-to-day activities of mankind (text being present everywhere, from newspapers, commercial products, and sign-boards to digital displays), visually impaired people face a lot of difficulties. Our application assists the visually impaired, and also the illiterate, by reading the text out to them [2].
This application can be useful in many ways, as follows:
1.1 Digitalizing Documents
An OCR application can convert printed or handwritten documents into digital text format, making it easier to store, edit, and share the information.
1.2 Saves Time
Rather than manually typing out text from a document, an OCR application can quickly and accurately extract the text, saving time and reducing the risk of errors. It also offers an audio file as output, which can be downloaded and listened to at leisure.
1.3 Accessibility
For visually impaired individuals, an OCR application can convert text from a photo into a format that can be read aloud by a text-to-speech converter application.
1.4 Language translation
For users who cannot read the source language, an OCR application can translate text from one language to another, making it easier for people to understand and communicate with others who speak different languages.
1.5 Data Extraction
An OCR application can be used to extract specific information from documents, such as names, dates, and addresses, making it easier to analyze and compile the data.
2. DETAIL DESCRIPTION OF TECHNOLOGY USED
2.1 Tesseract
Tesseract is an open-source optical character recognition (OCR) engine, originally developed at Hewlett-Packard and later sponsored by Google. Its primary purpose is to recognize text embedded in images and convert it into a machine-readable text format. Tesseract is well known for its proficiency at quickly and accurately identifying written text in a variety of languages, such as English, French, Spanish, German, and many more. Applications for Tesseract include data extraction, document management, and machine translation. It is simple to use and implement because it can be integrated, through wrappers and bindings, into several programming languages, including Python, Java, and C++. Tesseract can recognise text from scanned documents and supports a number of image formats, including JPG, PNG, and TIFF [7].
2.2 Next.js
The well-known open-source web framework Next.js, which is built on React, aids programmers in creating server-side rendered (SSR) and statically generated web apps. Additionally, Next.js has a number of built-in capabilities that improve performance and shorten development times, such as automatic code splitting, hot module replacement, and optimised image loading. It also has a sizable and vibrant community and a variety of plugins and libraries that may be used to increase its capability.
2.3 Flask
Flask is a well-liked open-source Python web framework that enables programmers to create web apps quickly and effortlessly. Since Flask is a micro-framework, it is small and doesn't need any specialised libraries or tools to operate. Developers can select the tools and libraries they want to utilise, making it flexible and simple to use. Flask offers a wide range of plugins and extensions, making it simple to add functionality to the application [5].
2.4 Firebase (Cloud Storage)
Google offers developers the Firebase Cloud Storage service, a cloud-based storage solution that enables them to store and serve user-generated material including photographs, videos, and audio files. Firebase Cloud Storage is built on Google Cloud Storage, a dependable and scalable object storage service. It is simple to use and to integrate into online and mobile applications. It offers a straightforward API that enables developers to handle metadata and access control as well as upload and download files. In order to protect user data, Firebase Cloud Storage additionally offers built-in security measures such as encryption at rest and in transit.
2.5 googletrans
The googletrans Python package makes it simple to translate text using Google Translate. Text is translated between languages through the Google Translate API. Python application developers can rapidly and efficiently translate text across languages using googletrans. More than a hundred languages are supported, including widely used ones like English, Spanish, French, German, and Chinese, as well as others such as Afrikaans, Bengali, and Icelandic. Simplicity and ease of use are two of its main advantages. It offers a straightforward API that enables programmers to translate text using very little code. The library takes care of the rest after developers specify the source and target languages.
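A minimal sketch of how the translation step could look with googletrans is given below. The `LANGUAGE_CODES` mapping and the `resolve_language` and `translate_text` helpers are hypothetical names introduced for illustration; they are not taken from the project. The sketch assumes a googletrans release in which `Translator.translate` is a synchronous call; in some newer releases it is a coroutine and would have to be awaited.

```python
# Hypothetical mapping from drop-down labels to ISO 639-1 codes (illustrative only).
LANGUAGE_CODES = {
    "english": "en",
    "hindi": "hi",
    "french": "fr",
    "german": "de",
    "spanish": "es",
}


def resolve_language(label):
    """Return the language code for a drop-down label, or raise ValueError."""
    key = label.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language: {label!r}")
    return LANGUAGE_CODES[key]


def translate_text(text, target_label, source_code="auto"):
    """Translate text with googletrans (requires the package and network access)."""
    # Imported lazily so that resolve_language works without googletrans installed.
    from googletrans import Translator

    translator = Translator()
    # Assumption: translate() is synchronous in the installed release.
    result = translator.translate(
        text, src=source_code, dest=resolve_language(target_label)
    )
    return result.text
```

With the package installed, `translate_text("Good morning", "Hindi")` would return the translated string; the exact output depends on the Google Translate service.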
2.6 gTTS
The gTTS (Google Text-to-Speech) Python package lets developers convert text into spoken language using Google Translate's text-to-speech API. It offers a simple interface that makes it possible to create audio files from text in a range of languages. Python programmers can rapidly and simply generate spoken-language files from written text with gTTS. It is capable of speaking a broad variety of dialects and languages, including widely used ones like English, Spanish, French, German, and Chinese, as well as others such as Bengali, Gujarati, and Swahili. Simplicity and usability are two of its main advantages. It offers a straightforward API that enables programmers to create text-to-speech conversions using very little code [4].
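As an illustration, text could be turned into a downloadable MP3 file roughly as follows. The function name `text_to_audio` and the default file name are assumptions made for this sketch, not names from the project.

```python
def text_to_audio(text, lang_code="en", out_path="output.mp3"):
    """Synthesize text to an MP3 file with gTTS (requires network access)."""
    # Reject empty input before contacting the service.
    if not text or not text.strip():
        raise ValueError("no text to synthesize")

    # Imported lazily so the validation above works without gTTS installed.
    from gtts import gTTS

    speech = gTTS(text=text, lang=lang_code)
    speech.save(out_path)
    return out_path
```

Calling `text_to_audio("Hello", "en", "hello.mp3")` would write `hello.mp3` to the working directory; the resulting file could then be uploaded to storage or offered for download.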
2.7 PyWhatKit
Python's pywhatkit package provides a simple interface for turning text into handwritten notes. It creates images of notes that resemble handwriting by using the Pillow library. It is easy for developers to use because it is built on top of some of the most widespread Python modules, including PyAutoGUI, Pillow, and Paperclip. Simplicity and ease of use are two of its primary advantages. It provides a simple API that enables programmers to complete difficult tasks with a small amount of code. Developers can use pywhatkit to, for instance, send emails with attachments, convert text to handwriting, or even take screenshots.
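A hedged sketch of the handwriting step is shown below. It assumes that `pywhatkit.text_to_handwriting` accepts the text, a `save_to` path, and an `rgb` ink colour, and that the call needs network access; the wrapper name `text_to_handwritten_note` and its defaults are hypothetical.

```python
def text_to_handwritten_note(text, out_path="note.png", ink_rgb=(0, 0, 138)):
    """Render text as a handwriting-style image using pywhatkit."""
    # Validate the ink colour locally before calling the library.
    if len(ink_rgb) != 3 or not all(0 <= c <= 255 for c in ink_rgb):
        raise ValueError("ink_rgb must hold three values between 0 and 255")

    # Imported lazily: importing pywhatkit may require a display and a connection.
    import pywhatkit

    # Assumption: this signature matches the installed pywhatkit release.
    pywhatkit.text_to_handwriting(text, save_to=out_path, rgb=ink_rgb)
    return out_path
```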
3. METHODOLOGY
OCR (optical character recognition) is a technology that converts text embedded in images into machine-readable text. Using the Tesseract-OCR library, the application extracts text from images, and that text can be saved in the cloud, in Firebase Cloud Storage. The application can take the text as input and convert it into another language, selected from a drop-down, using the googletrans package, which relies on Google Translate. The translated text can also be stored in Firebase storage. In the next part, the application converts the text into audio using gTTS (Google Text-to-Speech), saves it in Firebase, and displays it on the UI (user interface). Later, it can convert the text into handwritten notes using the PyWhatKit library and display them in the application.
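The flow described above can be sketched as a small orchestration function. This is an illustrative outline under stated assumptions, not the project's backend code: `run_pipeline` and `PipelineResult` are hypothetical names, the three steps are passed in as callables so that any OCR, translation, or speech library can be plugged in, and the Firebase upload and handwriting steps are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    extracted_text: str
    translated_text: str
    audio_path: Optional[str]


def run_pipeline(
    image_path: str,
    target_lang: str,
    ocr: Callable[[str], str],
    translate: Callable[[str, str], str],
    synthesize: Callable[[str, str], str],
) -> PipelineResult:
    """Run OCR, then translation, then speech synthesis on one image."""
    text = ocr(image_path).strip()
    if not text:
        # Nothing was recognised, so skip translation and audio generation.
        return PipelineResult("", "", None)
    translated = translate(text, target_lang)
    audio_path = synthesize(translated, target_lang)
    return PipelineResult(text, translated, audio_path)


# Usage with stand-in steps; real callables would wrap Tesseract, googletrans, and gTTS.
demo = run_pipeline(
    "scan.png",
    "hi",
    ocr=lambda path: "  hello world ",
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text, lang: "speech.mp3",
)
print(demo)
```

Passing the steps as parameters keeps the flow independent of any one library, which also makes it straightforward to exercise without network access.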
Fig -2: Working Flowchart
3.1 Image Input
To convert an image to text, we require input data containing images in various formats. The very first step is to upload the image for pre-processing so that the text content can be extracted.
3.2 Select Language
To obtain the text in different languages, the user selects the target language into which the text extracted from the image is to be converted.
3.3 Image Pre-Processing
This step consists of colour to grey scale conversion, edge detection, noise removal, warping and cropping, and thresholding. The image is converted to grey scale because many OpenCV functions require a grey scale image as the input parameter. Edge detection, warping, and cropping allow us to detect and extract only the region that contains text and to eliminate the unwanted background. In the end, thresholding is performed so that the image looks like a scanned document. This is done to allow the OCR to convert the image to text efficiently.
3.4 Image to Text Convertor
Fig. 3 shows the flow of Text-To-Speech. The first block comprises the image pre-processing modules and the OCR. It converts the pre-processed image, which is in .png/.jpg/.jpeg form, to a .txt file. We use the Tesseract OCR engine.
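The pre-processing and recognition steps (Sections 3.3 and 3.4) can be sketched as follows, assuming OpenCV (`cv2`) and the `pytesseract` wrapper are installed along with the Tesseract binary. The function names are hypothetical, and the sketch covers only grey scale conversion, noise removal, and thresholding; edge detection, warping, and cropping are left out. Otsu thresholding and a Gaussian blur are choices made for this example, not settings reported in the paper.

```python
def preprocess_image(path):
    """Load an image and return a binarised grey scale version for OCR."""
    # Imported lazily so the text helper below works without OpenCV installed.
    import cv2

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)      # colour to grey scale
    denoised = cv2.GaussianBlur(grey, (5, 5), 0)        # simple noise removal
    # Otsu's method picks the threshold automatically (assumed choice).
    _, binary = cv2.threshold(
        denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return binary


def normalise_text(raw):
    """Collapse repeated whitespace and drop empty lines from OCR output."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def image_to_text(path, lang="eng"):
    """Run Tesseract on the pre-processed image and return cleaned text."""
    import pytesseract

    raw = pytesseract.image_to_string(preprocess_image(path), lang=lang)
    return normalise_text(raw)
```

The returned string could then be written to a .txt file, as described in Section 3.4.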
3.5 Text to Audio-Convertor
As shown in Fig. 4, this block converts the .txt file to an audio output. Here, the text is converted to speech using a speech synthesizer. This audio file can be generated in various other languages.
4. RESULTS
Text extraction from images is very useful in many real-world applications. The amount of data stored as text is huge, and there is a need to store this data in such a way that it can be searched easily whenever required. Eliminating the use of paper is one of the steps towards progress in an electronic world. Also, converting data to audio form is a way to ease the lives of visually impaired people. Likewise, the recognized text can be translated into a variety of languages and can be processed in the chosen language into speech or a document. The trained data is created for all available fonts and handwritten texts in English so that the OCR is able to convert any text present in the image into text. The system also recognizes text in different styles or fonts and processes it so that it is available for the aforementioned features, such as conversion to speech or document, and it also supports translation.
The website also recognizes handwritten text and processes it so that it is available for the aforementioned features, such as conversion to speech or document, and also supports translation. The extracted text can be converted into audio. The extraction is more accurate than with other techniques, it is fast to carry out, and it can also be used in an Android application. The sound quality obtained using TTS is good.
This application allows its users to recognize text from images and convert it into a document and speech. The text can be in a number of languages, and it can also be translated into a range of languages. The essential feature of the system is its ability to convert written text into handwritten notes, which can later be converted to any other language or into an audio file. Converting large volumes of images into text makes translation easier, and the text can be converted to an audio file as well as to handwritten notes, as shown in Fig. 5.
4.1 Performance
The precision-recall curve and the F1 score are used to determine the model's performance. A dataset of 100 photos containing ground truth text and the associated OCR output from Tesseract is used to evaluate the model.
To evaluate the performance of the OCR and calculate the F1 score, the following outcomes are defined:
True positive (TP): The OCR result agrees with the source text.
False positive (FP): The OCR output differs from the text used as the basis for comparison.
False negative (FN): The OCR failed to recognise the ground truth text.
The model recognised the text correctly in 85 of the images (TP), incorrectly in 5 of the images (FP), and not at all in 10 images (FN), as shown in Fig. 6.
Precision = TP / (TP + FP) = 85 / (85 + 5) = 0.9444
Recall = TP / (TP + FN) = 85 / (85 + 10) = 0.8947
F1 score = 2 * (precision * recall) / (precision + recall) = 2 * (0.9444 * 0.8947) / (0.9444 + 0.8947) = 0.9189
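The calculation can be reproduced with a few lines of Python, using the counts reported above (TP = 85, FP = 5, FN = 10). The function name `precision_recall_f1` is introduced here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw outcome counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=85, fp=5, fn=10)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# prints: precision=0.9444 recall=0.8947 f1=0.9189
```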
A confusion matrix is a method for assessing how well a classification model is working. It displays the number of true positives, false positives, false negatives, and true negatives that a model produced for a given collection of data, as shown in Fig. 7.
5. FUTURE SCOPE
In the future, we can extend the application by including more languages and different fonts and by improving the handwritten notes. Various accents can also be added for the audio data. Currently, we take only one image at a time as input; in the future we can accept multiple images for pre-processing. Beyond images, we can take any type of video, break it down into frames, and process the images obtained from the video. This will help in making subtitles.
6. CONCLUSIONS
Our application helps users to recognize text in a wide number of languages from images and convert it into text and later to speech. It also includes the feature of translating text into a variety of languages. Many famous written works can be translated into a number of languages so that they reach different people. This approach can read text from a range of sources and generate synthesized speech as audio. It also converts text into the form of handwritten notes for easier understanding, as shown in Fig. 5. It is convenient to use, highly secure, accurate, and can be used anywhere.
REFERENCES
[1] Nisha Pawar, Zainab Shaikh, Poonam Shinde, Prof. Y. P. Warke, "Image to Text Conversion Using Tesseract," International Research Journal of Engineering and Technology (IRJET), Feb 02, 2019
[2] Asha G. Hagargund, Sharsha Vanria Thota, Mitadru Bera, Eram Fatima Shaik, "Image to Speech Conversion for Visually Impaired," International Journal of Latest Research in Engineering and Technology (IJLRET), ISSN: 2454-5031, Volume 03, June 06 2017
[3] Nivetha. S, Kameshwari. S, "Image to Text and Speech Converter," International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056; p-ISSN: 2395-0072, Volume 07, Issue Nov 11 2020
[4] Arjun Pratap, Kunal Wavhule, Viraj Patil, Vaibhav Narawade, "OCR-WRITTEN TEXT TO AUDIO CONVERTER," IJARIIE, ISSN(O)-2395-4396, 2022
[5] Umatia, S., Varma, A., Syed, A., Tiwari, K., & Shah, F. (2022, November 30). Text Recognition from Images. International Journal for Research in Applied Science and Engineering Technology, 10(11), 1003–1009. https://doi.org/10.22214/ijraset.2022.47498
[6] Kumar Garai, Sayan, Ojaswita Paul, Upayan Dey, Sayan Ghoshal, Neepa Biswas, and Sandip Mondal. "A Novel Method for Image to Text Extraction Using Tesseract OCR." American Journal of Electronics & Communication 3, no. 2 (2022): 8-11
[7] Lestari, Ikha Novie Tri, and Dadang Iskandar Mulyana. "Implementation of OCR (Optical Character Recognition) Using Tesseract in Detecting Character in Quotes Text Images." Journal of Applied Engineering and Technological Science (JAETS) 4, no. 1 (2022): 58-63
[8] Patil, Shruti, Vijayakumar Varadarajan, Supriya Mahadevkar, Rohan Athawade, Lakhan Maheshwari, Shrushti Kumbhare, Yash Garg, Deepak Dharrao, Pooja Kamat, and Ketan Kotecha. 2022. "Enhancing Optical Character Recognition on Images with Mixed Text Using Semantic Segmentation." Journal of Sensor and Actuator Networks 11, no. 4: 63. https://doi.org/10.3390/jsan11040063
[9] Karthikeyan G, Bharanidharan G, Jeevanandham D, and Balaji B G. 2022. "Text Recognition Images Using OCR." International Journal of Progressive Research in Science and Engineering 3 (05): 57-60.