International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022
p-ISSN: 2395-0072
www.irjet.net
Development of Information Extraction for Data Analysis using NLP

Geetha K S¹, Yashwanth G², Tanisha Jain³

¹Professor and Vice Principal, RV College of Engineering
²Student, Dept. of Electronics and Communication Engineering, RV College of Engineering
³Student, Dept. of Electronics and Electrical Engineering, RV College of Engineering

Abstract - Extracting information from PDFs for analysis is common in the corporate world. Manual extraction by analysts takes time that grows with the size of the annual reports they consult, and it limits the scalability of the process. Automating the analysis of PDFs is therefore a necessity today. This paper presents an algorithm by which information can be extracted from PDFs and mapped to categories of interest, which can be varied according to the user's requirements. The text itself can be extracted with simple modules such as PDFMiner; however, a dictionary must be created so that sentences can be mapped to particular topics. Rule-based filters extract the required sentences with little memory consumption and are easier to understand than more complex procedures. The proposed algorithm simplifies information extraction by providing a broad framework that can be modified further based on the user's interests.

Key Words: Data Analysis, NLP, Data Embedding, Text extraction, Table extraction

1. INTRODUCTION

Information Extraction (IE) is the process of parsing unstructured data and converting the required information into editable, readable formats. When the content is in digital format, we usually search for the required data or check for it manually. IE tools make it possible to pull the required information from text documents, databases, websites or multiple sources. Using IE in Natural Language Processing (NLP) algorithms, we can automate the extraction of data such as tables, company growth metrics and other financial details from various kinds of documents, such as PDFs, Docs and images. Convolutional Neural Networks (CNNs) are already common in computer vision models, where they process multidimensional data and derive the relations within it. NLP models have therefore been combined with computer vision models in the past to benefit from positional information and to improve the performance of key information extraction models.

A document contains information in various forms, and the useful information can be present in any of them. Hence, the tool is built to extract information from all of these forms. The information present in the document as text is successfully represented in a presentable format using NLP as well as word embedding. The steps involved in the project therefore include keyword analysis, information extraction from text and tables, and UI development with a feedback mechanism. The main objectives of the project are to enable intelligent keyword search over data present in text format using POS tagging and word embedding, to extract data from text and tables by building NLP algorithms, and finally to combine all of the extracted data and present it in the form of a table.

2. LITERATURE REVIEW

T. Hassan and R. Baumgartner [1] present an approach to text extraction that combines a top-down approach with the existing bottom-up approach: a page in a PDF is segmented, the text is converted into HyperText Markup Language (HTML), and the extracted data is presented to the user. This also means that structured data inside the PDF is converted into semi-structured formats. Reza M. Parizi et al. [2] propose an automatic PDF extractor that extracts health parameters from reports in PDF form. It treats language compatibility, batch processing, ease of use and open-source availability as parameters for efficient text extraction in the required format. Ying Liu et al. [3] describe an algorithm that extracts metadata from a table to help in the extraction of tabular data from a file. The metadata extracted includes page number, position, column number and number of rows. The algorithm is capable of extracting text, numbers, symbols and images.

Xiaonan Lu et al. [4] propose an algorithm to extract data from two-dimensional line plots. It uses line segmentation, denoising and PCC coding at the pixel level. Curves must be identified to establish connectivity between two segments, and the intersection between two segments is classified as M-type, L-type or R-type. The squared mean error is the mathematical parameter used in the extraction process. The method cannot identify graphs other than line graphs. Another limitation is that the squared mean error may not be a suitable parameter for accurately predicting the presence of a line. Karina Weichork and Andrea Charao [5] use the methods of PDFMiner and CyberPDF for the extraction of texts and
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 2839