Extract and Analyze Data from PDF File and Web : A Review


International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395 -0056

Volume: 04 Issue: 02 | Feb -2017

p-ISSN: 2395-0072

www.irjet.net

Darshana Jadhav¹, Dhanashree Jadhav², Pooja More³, Harshali Nikam⁴

¹Darshana Jadhav, Dept. of Computer Engineering, MET, Nashik

²Dhanashree Jadhav, Dept. of Computer Engineering, MET, Nashik

³Pooja More, Dept. of Computer Engineering, MET, Nashik

⁴Harshali Nikam, Dept. of Computer Engineering, MET, Nashik

Assistant Professor: Ms. Tusharsaheb Patil

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - A survey of current practice shows that the result gazette declared by universities (e.g., Pune University) for engineering is in PDF format. The PDF contains details such as seat number, centre, permanent registration number (PRN), name, subjects, and marks. At present the PDF file is converted to Excel format in order to produce the various reports required by the department, college, or university at different levels, which involves a partly manual process. These operations have several limitations: the process is only semi-automated, there is no GUI, SMS and e-mail gateways are not supported, and, most importantly, graphical analysis of the data is not available. In our survey we found existing applications that are semi-automated, or automated with restrictions that do not allow full automation of result analysis in a proper format; none of them supported full automation. To overcome these drawbacks, we propose a new system for result analysis that is automated, with features such as automatic output generation in different formats (Excel, PDF, MySQL) selected by the user for compatibility with other ERP systems, an active SMS gateway, an active e-mail gateway, an interactive and user-friendly GUI, and graphical result analysis with text. The proposed system targets these limitations to provide an effective solution for result analysis. It will also work with the current grade system, and will maintain a database of students that shows each student's complete status. The automated solutions provided by the system will make examination department activities more efficient by addressing the main drawbacks of the manual system, namely speed, precision, and simplicity. It is also intended to work as a generalized system that supports any type and format of PDF file. A centralized system will ensure that the activities around an examination can be managed effectively, while also making them more accessible and convenient for both staff and students.

Key Words: Information Extraction, Pattern Matching, Data Mining, Web Mining.

© 2017, IRJET | Impact Factor value: 5.181

1. INTRODUCTION

Result evaluation and analysis require a great deal of manual work, so we need a system that supports automation. Our system works on university results. At present, in most engineering colleges, the traditional method is to fill an Excel sheet manually for each student from the PDF file provided by the university. Many formulas are then used to categorize students as toppers, pass, fail, droppers, etc. This is an entirely manual process in which the chance of mistakes is high. Similarly, diploma college results are declared online, so the data is taken from the web and entered into an Excel sheet manually, and then evaluated and analyzed as required for the result reports. This process is very time-consuming. To ease the work of the people doing this analysis, we propose a system that automates result evaluation and analysis. The system takes the PDF file provided by the university as input and saves its data into a database; once the data is stored, the required information can be retrieved using various queries.
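The flow described above (extracted text, then a database, then queries for categories such as toppers, pass, and fail) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the text has already been extracted from the PDF, and the line layout in SAMPLE_TEXT, the pass mark of 40, the table schema, and every function name are hypothetical choices made only for this illustration. An in-memory SQLite database stands in for the MySQL database mentioned in the abstract.

```python
import re
import sqlite3

# Hypothetical text, standing in for lines already extracted from a result PDF.
# Assumed layout: seat number, PRN, name in capitals, then one mark per subject.
SAMPLE_TEXT = """\
S1001 71234567A STUDENT ONE 78 65 81
S1002 71234568B STUDENT TWO 35 52 60
S1003 71234569C STUDENT THREE 90 88 94
"""

PASS_MARK = 40  # assumed per-subject pass mark, not taken from the paper

LINE_PATTERN = re.compile(
    r"^(?P<seat_no>S\d+)\s+(?P<prn>\w+)\s+(?P<name>[A-Z ]+?)\s+"
    r"(?P<marks>\d+(?:\s+\d+)*)$"
)


def parse_results(text):
    """Return one (seat_no, prn, name, marks) tuple per line that matches."""
    records = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip headers, blank lines, and anything unrecognized
        marks = [int(m) for m in match.group("marks").split()]
        records.append((match.group("seat_no"), match.group("prn"),
                        match.group("name"), marks))
    return records


def load_results(conn, records):
    """Create the results table and store a total and status per student."""
    conn.execute(
        "CREATE TABLE results (seat_no TEXT PRIMARY KEY, prn TEXT, "
        "name TEXT, total INTEGER, status TEXT)"
    )
    for seat_no, prn, name, marks in records:
        # A student passes only if every subject mark reaches PASS_MARK.
        status = "PASS" if all(m >= PASS_MARK for m in marks) else "FAIL"
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
            (seat_no, prn, name, sum(marks), status),
        )
    conn.commit()


def toppers(conn, limit):
    """Return (name, total) for the highest-scoring passing students."""
    cursor = conn.execute(
        "SELECT name, total FROM results WHERE status = 'PASS' "
        "ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


def count_by_status(conn):
    """Return the number of students per status as a dict."""
    cursor = conn.execute(
        "SELECT status, COUNT(*) FROM results GROUP BY status"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_results(connection, parse_results(SAMPLE_TEXT))
    print(toppers(connection, 1))
    print(count_by_status(connection))
```

Keeping the pass/fail rule in one place (PASS_MARK and the status computation) means the categorization that is currently done with spreadsheet formulas becomes a single query per report. A real result PDF would need its own pattern for each line layout.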

2. LITERATURE SURVEY

In the existing system, data is sorted and analyzed by manual processes. The user has to copy and paste the contents of the PDF file into Excel sheets and sort them manually to rank students. The proposed system will automate these processes. Several researchers have worked on extracting required data from unstructured sources such as PDF. In this section we describe the tools most closely related to the proposed system. In reference [1] the authors used the PDFBox technique to extract references from PDF; it converts the PDF data into text and obtains the required information from that text. In reference [2] the author used the LAPDFText technique, a command-line utility that extracts text from a PDF given only the path of the PDF file. In [3] the author uses a technique for extracting data from structured web pages. In reference [4] the author uses a technique called tag injection, which inserts format information into text

ISO 9001:2008 Certified Journal | Page 1152

