

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 p-ISSN: 2395-0072

Volume: 12 Issue: 02 | Feb 2025 www.irjet.net

Artificial Intelligence Powered PDF Summarization with Audio and Multilingual Text Translation

Dr. Pankaj Kumar1, Purnesh S Gowda2, Dushyanth L3, Pavan B4, Tilak R5

1 Assistant Professor, ISE, Acharya Institute of Technology, Karnataka, India

2 B.E. Student, ISE, Acharya Institute of Technology, Karnataka, India

3 B.E. Student, ISE, Acharya Institute of Technology, Karnataka, India

4 B.E. Student, ISE, Acharya Institute of Technology, Karnataka, India

5 B.E. Student, ISE, Acharya Institute of Technology, Karnataka, India

Abstract - In this project, we present an automated pipeline for extracting text from PDF documents, translating it into a target language, and exporting the translated output in a structured JSON format. The system leverages the capabilities of the PyPDF2 library for extracting textual content from PDF files, and Google's Translator API for performing accurate and efficient translations into diverse languages, such as Kannada. The translated text is then saved in JSON format, ensuring easy integration with other applications or workflows. This approach streamlines the process of handling multilingual textual data from PDFs, making it particularly valuable for researchers, educators, and organizations working with diverse linguistic datasets. The system's modular design allows for adaptability across domains, enabling seamless customization for additional features such as summarization or audio conversion. The proposed workflow significantly reduces the manual effort involved in translating and managing multilingual content while maintaining high accuracy and scalability.

Key Words: Natural Language Processing (NLP), PDF Text Extraction, Multilingual Translation, Automated Content Processing

1. INTRODUCTION

The rapid growth of digital content has transformed how information is distributed, consumed, and processed across various sectors. Among the most widely used formats for sharing and storing textual information, Portable Document Format (PDF) documents remain a prevalent choice due to their cross-platform compatibility and ability to preserve formatting. However, despite their widespread adoption, extracting useful information from PDFs can be an arduous task, particularly when the content is large, complex, or requires multilingual handling. The inherent challenges in dealing with PDF documents arise from the non-linear structure of the text, embedded graphics, and formatting elements that make automated extraction processes difficult. Natural Language Processing (NLP), a field at the intersection of computer science and linguistics, deals with the interactions between computers and natural languages and underpins many of the techniques used in this work. Text extraction from PDFs typically relies on Optical Character Recognition (OCR) for scanned documents or on simpler extraction methods for text-based PDFs. While OCR technologies have advanced significantly, they are often computationally expensive and prone to errors, especially when dealing with complex layouts or low-quality images. On the other hand, traditional text extraction tools, such as PyPDF2, are limited to text-based PDFs and may struggle to handle cases where the text is embedded in non-standard formats or non-selectable fonts. Additionally, once the text is extracted, translating the content into multiple languages introduces further complexity, especially when considering the diverse linguistic nuances and context-specific translation requirements.

Manual methods of handling such data, which require separate steps for extraction, translation, and storage, are both time-consuming and prone to human error. For organizations or researchers working with large datasets that require multilingual processing, these manual workflows can become a significant bottleneck, limiting the ability to scale and quickly respond to data needs. Furthermore, the lack of integration between various tools, such as PDF extraction and translation services, increases the complexity and inefficiency of the overall process.

This project presents an integrated, automated pipeline that addresses these challenges by combining text extraction, multilingual translation, and structured data storage into a seamless workflow. The proposed system leverages the PyPDF2 library for efficient extraction of textual data from PDF files and employs Google's Translator API to translate the extracted content into a target language, such as Kannada. By using a structured JSON format for storing the translated data, the system ensures easy integration with other applications or services for further processing, such as summarization or text-to-speech conversion.

The modular design of the system allows for flexibility and scalability, enabling users to extend its functionality according to specific needs.


The goal of this project is to reduce the manual effort and errors traditionally associated with multilingual content processing while maintaining the integrity and accessibility of the extracted data. By automating the workflow, we aim to empower organizations to efficiently manage large-scale textual data, allowing for faster decision-making and enhanced information retrieval. In the subsequent sections of the paper, we detail the system architecture, its core components, and the results of preliminary experiments that demonstrate its effectiveness in real-world scenarios. The findings indicate that the proposed pipeline not only improves the efficiency of handling multilingual PDF documents but also ensures that the processed content remains accurate, scalable, and easy to integrate with other systems.

2. BACKGROUND

The field of Natural Language Processing (NLP) has witnessed significant advancements in recent years, especially with the introduction of transformer-based models. The foundational work on transformers was introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017), which proposed a model that relies solely on self-attention mechanisms, enabling more efficient parallelization and a deeper understanding of contextual relationships in sequences of text. This breakthrough laid the groundwork for numerous models, including BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers), which achieved state-of-the-art results in various NLP tasks such as language understanding, generation, and translation.

Following the success of BERT and GPT, the field saw the development of specialized models like BART (Bidirectional and Auto-Regressive Transformers), which combined elements from both BERT and GPT to handle tasks in text generation and comprehension. BART, introduced by Lewis et al. in 2020, applies a denoising autoencoder framework that is effective in both generative and discriminative tasks, such as summarization and translation. The model performs exceptionally well in abstractive summarization, where it is able to generate human-like summaries, as well as in translation and text comprehension tasks. BART's architecture, which uses a combination of bidirectional and autoregressive components, enables it to be fine-tuned for a wide range of NLP applications. This makes it an ideal choice for tasks such as document summarization in the context of large-scale text extraction projects.

As transformer models gained popularity, several libraries emerged to make their usage more accessible.

The Hugging Face transformers library, launched in 2018, has become a cornerstone of modern NLP applications. This open-source library provides pre-trained models and easy-to-use APIs for tasks like text classification, translation, summarization, and question answering. The Hugging Face library democratized the use of transformer models, allowing researchers and developers to fine-tune pre-trained models on their own datasets with minimal effort. Other libraries, such as PyPDF2 for PDF extraction and gTTS for text-to-speech synthesis, emerged to solve specific challenges in working with documents and processing text. These tools, along with advances in deep learning and cloud-based APIs for text translation (such as Google Translate), have enabled the development of end-to-end document-processing pipelines such as the one presented in this paper.

3. ALGORITHM

Text Extraction Algorithm: The project uses PyPDF2 and PDFMiner to extract text from machine-readable PDFs by parsing their structure while maintaining sentence order. For scanned or image-based PDFs, Tesseract OCR is employed to perform Optical Character Recognition (OCR), converting images of text into digital format. This ensures that both text-based and scanned PDFs can be processed effectively. The extracted text is then cleaned and prepared for summarization. By integrating these methods, the system provides a versatile and accurate text extraction solution.
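As a concrete illustration, the text-based branch of this step can be sketched in a few lines of Python. The function names and the cleaning heuristics below are our own simplifications, not the project's exact rules:

```python
import re

def clean_text(raw: str) -> str:
    """Re-join words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\n(\w)", r"\1", raw)   # "sum-\nmary" -> "summary"
    return re.sub(r"\s+", " ", text).strip()

def extract_text(pdf_path: str) -> str:
    """Concatenate and clean the text layer of a machine-readable PDF."""
    from PyPDF2 import PdfReader            # deferred import: external dependency
    reader = PdfReader(pdf_path)
    return clean_text("\n".join((page.extract_text() or "") for page in reader.pages))
```

Pages with no text layer contribute empty strings, which is the cue to fall back to the OCR path described above.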

Text Summarization Algorithms: The project employs both extractive and abstractive summarization techniques to generate concise summaries of the extracted text. Extractive summarization is implemented using the TextRank algorithm, which represents sentences as nodes in a graph and calculates their importance using PageRank based on sentence similarity. The most relevant sentences are then selected to form the summary while maintaining their original structure. In contrast, abstractive summarization is performed using BART (Bidirectional and Auto-Regressive Transformers), a transformer-based encoder-decoder model that generates new sentences instead of simply extracting them. The encoder captures deep contextual meaning from the input text, while the decoder generates a human-like, coherent summary by rephrasing and restructuring content. This combination of extractive and abstractive summarization ensures both accuracy and readability, making the system suitable for various text summarization tasks.
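The extractive side can be reproduced in plain Python. The sketch below builds the sentence-similarity graph and runs a PageRank power iteration; it is a simplified stand-in for a full TextRank implementation, and the bag-of-words cosine similarity and all names are our assumptions:

```python
import math
import re
from collections import Counter

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def textrank_summary(text: str, k: int = 2, d: float = 0.85, iters: int = 30) -> str:
    sents = split_sentences(text)
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sents]
    n = len(sents)
    # Sentence-similarity graph: nodes are sentences, edge weights are cosine similarity.
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    rowsum = [sum(row) for row in sim]
    score = [1.0] * n
    for _ in range(iters):  # PageRank power iteration with damping factor d
        score = [(1 - d) + d * sum(sim[j][i] * score[j] / rowsum[j]
                                   for j in range(n) if rowsum[j] > 0)
                 for i in range(n)]
    top = sorted(sorted(range(n), key=lambda i: -score[i])[:k])  # keep original order
    return " ".join(sents[i] for i in top)
```

Because the selected sentences are emitted in document order, the summary preserves the original structure, as described above.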

Machine Translation Algorithm: The project incorporates the Google Translate API, which utilizes Neural Machine Translation (NMT) to convert summarized text into multiple languages. NMT, based on transformer architectures, processes entire sentences rather than translating word by word, ensuring better fluency and contextual accuracy. This allows the system to generate summaries in different languages while preserving their original meaning. By enabling multilingual support, the project enhances accessibility, making it useful for users across diverse linguistic backgrounds.
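A minimal sketch of this step using the unofficial googletrans client follows. The chunking helper and its 4,500-character limit are our assumptions (per-request size limits vary), and the translation call itself requires network access:

```python
import re

def chunk_text(text: str, limit: int = 4500) -> list[str]:
    """Split on sentence boundaries so each request stays under the size limit."""
    chunks, buf, size = [], [], 0
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if size + len(sent) > limit and buf:
            chunks.append(" ".join(buf))
            buf, size = [], 0
        buf.append(sent)
        size += len(sent) + 1
    if buf:
        chunks.append(" ".join(buf))
    return chunks

def translate_text(text: str, dest: str = "kn") -> str:
    """Translate chunk by chunk, e.g. into Kannada ("kn")."""
    from googletrans import Translator      # deferred import: needs network at call time
    translator = Translator()
    return " ".join(translator.translate(c, dest=dest).text for c in chunk_text(text))
```

Chunking on sentence boundaries keeps each request self-contained, which helps the NMT model use full-sentence context as described above.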

Text-to-Speech Conversion Algorithms: The project implements text-to-speech (TTS) conversion using both online and offline methods to enhance accessibility. gTTS (Google Text-to-Speech) is an online TTS engine that utilizes WaveNet deep learning models trained on large speech corpora to generate natural-sounding voices with accurate phonetics. This enables users to listen to summaries in multiple languages. For offline speech synthesis, the project employs pyttsx3, which uses the system's native TTS engine (such as SAPI for Windows or NSSpeechSynthesizer for macOS) to generate speech. Unlike gTTS, pyttsx3 does not require an internet connection and allows users to adjust voice parameters like speed and pitch. By integrating both online and offline TTS solutions, the project ensures flexibility across both connected and offline environments.
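The two engines can sit behind a single helper, as sketched below. The audio_path naming convention and the rate value are our own choices, not the project's:

```python
from pathlib import Path

def audio_path(pdf_path: str, lang: str) -> str:
    """Derive the MP3 name from the source PDF, e.g. report.pdf -> report_kn.mp3."""
    p = Path(pdf_path)
    return str(p.with_name(f"{p.stem}_{lang}.mp3"))

def speak(text: str, lang: str, out_mp3: str, online: bool = True) -> None:
    if online:
        from gtts import gTTS               # online engine: needs internet
        gTTS(text=text, lang=lang).save(out_mp3)
    else:
        import pyttsx3                      # offline engine: uses the system voice
        engine = pyttsx3.init()
        engine.setProperty("rate", 150)     # speaking rate, user-adjustable
        engine.save_to_file(text, out_mp3)
        engine.runAndWait()
```

The `online` flag mirrors the design choice in the text: quality when connected, availability when not.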

Preprocessing Algorithms: The project incorporates preprocessing techniques to refine the extracted text before summarization, ensuring efficient and accurate processing. Tokenization is used to split the text into words or sentences, making it manageable for NLP models. This is achieved using pretrained tokenizers from models like BART and T5, or standard NLP libraries such as NLTK and spaCy. Additionally, stopword removal is applied to eliminate commonly used words like "and" or "the," which do not contribute significant meaning to the summary. This is implemented using NLTK's predefined stopword list or custom filters. By applying tokenization and stopword removal, the system ensures that only meaningful content is processed, leading to more concise and relevant summaries.
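The "custom filter" route mentioned above can be as small as this; note that the stopword list here is a deliberately tiny illustrative subset, not NLTK's full English list:

```python
import re

# Illustrative subset; NLTK's English stopword list is far larger.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are"}

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenization via a simple regex."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOPWORDS]
```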

4. METHODOLOGY

The methodology outlines a multi-step process that integrates various technologies and tools to extract, process, summarize, translate, and convert text into speech from PDF documents. The pipeline is designed to work efficiently and automatically, allowing users to input PDF documents and receive summarized, translated, and spoken versions of the content in a seamless workflow. Below are the steps involved:

Text Extraction from PDF: The first step involves extracting text from a PDF document. Since PDFs can contain both text-based content and images, the PyPDF2 or pdfminer.six libraries will be used to extract text from text-based PDFs. For scanned or image-based PDFs, OCR (Optical Character Recognition) tools like Tesseract can be applied to convert images into text. Once the text is extracted, it is cleaned and pre-processed to remove any irrelevant or non-textual data, ensuring that only the relevant content is retained for further processing.
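The choice between the text layer and OCR can be driven by a simple heuristic on the extracted text. The threshold and helper names below are our assumptions, and the OCR path needs the tesseract binary and poppler installed on the system:

```python
def needs_ocr(page_texts: list[str], min_chars: int = 20) -> bool:
    """If the text layer averages fewer than min_chars per page, treat the PDF as scanned."""
    return sum(len(t.strip()) for t in page_texts) < min_chars * max(len(page_texts), 1)

def ocr_pdf(pdf_path: str) -> str:
    """Rasterise each page and run Tesseract over the images."""
    import pytesseract                       # deferred imports: external dependencies
    from pdf2image import convert_from_path
    images = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```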

Text Summarization: The next step is text summarization, where the extracted text is condensed into a shorter, more concise version while retaining the key ideas and important information. Hugging Face's transformers library, specifically the BART model, will be utilized for this task. BART is a state-of-the-art transformer-based model that is fine-tuned for abstractive summarization. The extracted text will be passed through the BART model, which will generate a summarized version of the content. This step reduces the length of the document, making it more accessible and easier to digest for the end user.
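This step can be sketched with the transformers summarization pipeline. The facebook/bart-large-cnn checkpoint and the ~700-word chunking margin are common choices we assume here rather than values from the project; the model is downloaded on first use:

```python
def chunk_by_words(text: str, max_words: int = 700) -> list[str]:
    """BART accepts roughly 1024 tokens; ~700 words leaves headroom for subword splits."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize(text: str, max_len: int = 130, min_len: int = 30) -> str:
    from transformers import pipeline        # deferred import: heavy dependency
    bart = pipeline("summarization", model="facebook/bart-large-cnn")
    parts = [bart(c, max_length=max_len, min_length=min_len, truncation=True)[0]["summary_text"]
             for c in chunk_by_words(text)]
    return " ".join(parts)
```

Chunking before summarizing is one way to handle documents longer than the model's context window; the per-chunk summaries are simply concatenated.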

Language Translation: Once the text has been summarized, it can be translated into different languages for multilingual accessibility. The Google Translate API or the googletrans library will be employed to translate the summarized text into the desired language(s). The user will have the option to specify the target language, and the system will automatically translate the text into that language. This step ensures that the summarized content can be understood by a wider audience, regardless of their native language.

Text-to-Speech Conversion: The final step involves converting the translated text (or the original summarized text) into speech for users who prefer audio over reading. gTTS (Google Text-to-Speech) or pyttsx3 (an offline TTS engine) will be used for this task. gTTS requires an internet connection but provides high-quality speech synthesis in various languages, while pyttsx3 operates offline and allows greater control over the voice and speech rate. The summarized and translated text will be converted into an audio file (MP3 format), which can be played by the user to listen to the content. The audio file will also be saved for later use, providing convenience.

Output Generation and Delivery: The system delivers processed content in both audio (MP3) and text-based (PDF) formats, allowing users to access information conveniently. It starts by extracting text from uploaded PDFs, followed by summarization using an advanced NLP model. If needed, the text is translated into the user's selected language via the Google Translate API. The processed text is then converted into speech using gTTS (online) or pyttsx3 (offline), generating an MP3 file. Additionally, the summarized and translated text is saved as a PDF using FPDF for easy reading and sharing. Users can interact with the system through a web interface, where they can upload PDFs and select desired actions. For developers, an API is available to integrate the functionality into existing applications. The entire process is automated, ensuring efficiency and ease of use. This makes the system highly versatile, catering to diverse user needs.
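The two text-based outputs can be sketched as follows. The JSON field names and the FPDF layout values are our assumptions, not a specification from the project:

```python
import json

def export_json(summary: str, translation: str, lang: str, out_path=None) -> str:
    """Bundle the results in the structured JSON format described above."""
    payload = json.dumps(
        {"language": lang, "summary": summary, "translation": translation},
        ensure_ascii=False, indent=2)
    if out_path:
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(payload)
    return payload

def save_pdf(text: str, out_path: str) -> None:
    """Write the processed text back out as a simple one-column PDF."""
    from fpdf import FPDF                    # deferred import: fpdf2 package
    pdf = FPDF()
    pdf.add_page()
    # Core fonts cover Latin text only; Indic scripts such as Kannada
    # require registering a Unicode font via pdf.add_font() first.
    pdf.set_font("Helvetica", size=12)
    pdf.multi_cell(0, 8, text)               # width 0 = full page width
    pdf.output(out_path)
```

`ensure_ascii=False` keeps non-Latin characters (e.g. Kannada) readable in the JSON file rather than escaping them.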


Fig-1: System Architecture

5. CONCLUSIONS

The results of the system's performance were evaluated across several components, including text extraction, summarization, translation, text-to-speech conversion, and the final output generation. Each component showed promising results in terms of functionality, but also highlighted areas for potential improvement. Below, we present a detailed discussion of the results for each process and offer insights into their effectiveness in real-world scenarios. The text extraction process was handled by the PyPDF2 and pdfminer libraries, which successfully extracted text from machine-readable PDFs. For standard PDFs with well-defined text layouts, such as reports, articles, and most documents with minimal formatting, the system worked efficiently. In these cases, the extracted text was accurate, and the system processed the content quickly, providing a reliable foundation for further tasks like summarization and translation. However, the extraction process encountered challenges with scanned, image-based PDFs, where the text was embedded as an image rather than text. For such documents, the system struggled to provide accurate results, often leading to incomplete or garbled text extraction. Additionally, complex multi-column layouts or non-standard fonts sometimes caused misalignment or missing content.

The text summarization process was carried out using the BART model from Hugging Face's transformers library. The BART model, a transformer-based architecture fine-tuned for summarization tasks, was able to distill large documents into concise summaries, typically reducing lengthy content to a few paragraphs highlighting the main points. For general content such as news articles or straightforward reports, the summaries were coherent and well-organized, capturing the essence of the documents; however, when applied to technical or highly specialized documents, the summaries were less consistent. The system offered two text-to-speech options: pyttsx3 for offline use and gTTS for online conversion. pyttsx3 provided an offline solution, where the text was converted into speech and saved as an MP3 file without requiring an internet connection. This feature proved useful in situations where internet access was unavailable, making the system versatile for both online and offline scenarios. However, the quality of the speech produced by pyttsx3 was relatively mechanical and robotic, lacking the natural intonations and smooth delivery found in more advanced systems. On the other hand, gTTS provided an online text-to-speech service that generated much more natural-sounding speech, with a smoother and more fluid delivery.

REFERENCES

[1] Neto, Joel, Freitas, Alex, & Kaestner, Celso. (2002). Automatic Text Summarization Using a Machine Learning Approach. Lecture Notes in Computer Science, 2507, 205-215. doi: 10.1007/3-540-36127-8_20.

[2] M. A. K. Raiaan et al., "A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges," IEEE Access, vol. 12, pp. 26839-26874, 2024, doi: 10.1109/ACCESS.2024.3365742.

[3] Venkateswarlu, S., Duvvuri, D. B. K. Kamesh, Jammalamadaka, Sastry, & Rani, Chintala. (2016). Text to Speech Conversion. Indian Journal of Science and Technology, 9(38). doi: 10.17485/ijst/2016/v9i38/102967.

[4] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. 2019, arXiv:1910.13461.

[5] K. Yamamoto, H. Banno, H. Sakurai, T. Adachi and S. Nakagawa, "A Study of Speech Recognition, Speech Translation, and Speech Summarization of TED English Lectures," 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE), Nara, Japan, 2023, pp. 451-452, doi: 10.1109/GCCE59613.2023.10315471.


[6] N. Andhale and L. A. Bewoor, "An overview of Text Summarization techniques," 2016 International Conference on Computing Communication Control and automation (ICCUBEA), Pune, India, 2016, pp. 1-7, doi: 10.1109/ICCUBEA.2016.7860024.

[7] A. Raj, M. Raj, N. Umasankari and D. Geethanjali, "Document Based Text Summarization using T5-small and gTTS," 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 2024, pp. 1-6, doi: 10.1109/ADICS58448.2024.10533605.

[8] Vinnarasu, A., & Jose, Deepa. (2019). Speech to text conversion and summarization for effective understanding and documentation. International Journal of Electrical and Computer Engineering (IJECE), 9(5), 3642-3648. doi: 10.11591/ijece.v9i5.pp3642-3648.

[9] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Oord, A. v. d., Li, Y., & Zen, H. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.

[11] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. Advances in Neural Information Processing Systems (NeurIPS).

[12] See, A., Liu, P. J., & Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. arXiv preprint arXiv:1704.04368.

[13] Zhang, J., Zhao, Y., Saleh, M., & Liu, P. J. (2020). PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. arXiv preprint arXiv:1912.08777.

[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).

[15] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., ... & Dean, J. (2017). Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5, 339-351.

[16] Soni, Binjal, Bharti, Santosh, & Choudhury, Amitava. (2024). Text Summarization and Multilingual Text to Audio Translation using Deep Learning Models. pp. 1-6. doi: 10.1109/ICEC59683.2024.10837105.

[17] Talib, Rana. (2021). Multilingual Text Summarization using Deep Learning. 7, 29-39. doi: 10.31695/IJERAT.2021.3712.

[18] Torres-Moreno, J.-M. (2012). Artex is another text summarizer. arXiv preprint arXiv:1210.3312.

[19] Hasyim, M., Saleh, F., & Yusuf, R. (2023). Machine Translation Accuracy in Translating French-Indonesian Culinary Texts.

[20] Johnson, M., et al. (2017). Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5, 339-351.

[21] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.

[22] W. Ogden, J. Cowie, M. Davis, E. Ludovik, H. Molina-Salgado and H. Shin, "Getting information from documents you cannot read: An interactive cross-language text retrieval and summarization system," in Joint ACM DL/SIGIR workshop on multilingual information discovery and access, 1999.

[23] H. Saggion, D. R. Radev, S. Teufel, W. Lam and S. M. Strassel, "Developing Infrastructure for the Evaluation of Single and Multidocument Summarization Systems in a Cross-lingual Environment," in LREC, 2002.

[24] L. Yu and F. Ren, "A study on cross-language text summarization using supervised methods," in 2009 International Conference on Natural Language Processing and Knowledge Engineering, 2009.

[25] X. Wan, "Using bilingual information for cross-language document summarization," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 2011.
