Machine Learning Approach for Word Extraction

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 04 Issue: 04 | Apr-2017

p-ISSN: 2395-0072

www.irjet.net

Ajith K J¹, Ajay Kumar C M¹, Adithya S Prakash¹, Ramesh G²

¹Dept. of Computer Science and Engineering, The National Institute of Engineering, Mysuru, Karnataka, India

²Assistant Professor, Dept. of Computer Science and Engineering, The National Institute of Engineering, Mysuru, Karnataka, India

Abstract - Keywords and acute words in a document are used to describe structure within an information retrieval system, as they are easy to remember, store, define and share. Compared with mathematical signatures, acute words are independent of any single corpus and can be applied across multiple corpora. The goal of word extraction is to train a model on the data using a machine learning approach and to extract the words that contribute most to understanding the document and to producing a summarized text. During word extraction we rank the words that are free of special characters and numbers, and retrieve those that carry the most information and contribute the most weight to understanding the document.

Key Words: Extraction, co-occurrence, word frequency, classification, training, pre-processing.

1. INTRODUCTION

Paramount words, or keywords, play a vital role in retrieving the right information for a user’s requirement. In today’s world of information technology we read many journals, newspapers, books, articles and messages, which makes it difficult for the user to go through every word and sometimes leads them to avoid studying those materials altogether. Instead, a text summarization tool is required, one that extracts the summarized text, keywords and acute words that convey the actual content of the document. Effective words are therefore a necessity. Here, word extraction is the smallest unit that captures the meaning of the entire document, and many applications can take advantage of it, such as automatic indexing, classification, clustering, filtering and web search. In this paper, we place text summarization in a machine learning setting and introduce a supervised learning approach for word extraction. We break the document into text blocks, which are structure-independent, and extract features from those blocks. We also perform semantic analysis based on the id and class attributes associated with the text blocks, using the Naïve Bayes algorithm. The result of the semantic analysis can be incorporated to improve the accuracy with which the document is understood from the extracted words.

2. RELATED RESEARCH

Word extraction has been applied to improve the functionality of information retrieval systems. Jones and Paynter describe Phrasier, a system that lists documents related to the primary document’s keywords. Hidayet Takci and Tunga Gungor classified language-independent text documents using a centroid-based classification approach; on a multilingual corpus, this experiment obtained better accuracy than other methods. Xiaogang Peng and Beng Choi proposed automatic classification of documents based on a semantic hierarchy, which classifies documents into groups of texts based on the extracted words while preserving their semantic elements. In "A Machine Learning Approach to Webpage Content Extraction", Jiawei Yao and Xinhui Zuo explain that a webpage contains boilerplate elements such as comments and advertisements. These are treated as noise in the document and are removed to improve the user’s experience in understanding it. Using an SVM classifier, they select relevant features to predict whether a text block is content or non-content. In sentiment analysis, which has focused on classification models for text, the hand-coded rule approach of Neviarouskaya (2010) classifies text into only a few categories; our work attempts to extend this by inferring specific reactions rather than broad categories.

© 2017, IRJET | Impact Factor value: 5.181

3. IMPLEMENTATION

There are various implementation methods for achieving word extraction by locating and defining the acute words used in the document. Despite their differences, in most cases we rank the text blocks in the given document, separate them by frequency, predict their nature, and define a set of words that accurately describes the information contained in the text and provides maximum satisfaction for the user.
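The frequency-based part of this procedure can be illustrated with a minimal sketch. It assumes that words "free from special characters and numbers" means purely alphabetic tokens and that raw frequency is the ranking score. The function name rank_words, the short stop-word list and the top_k parameter are hypothetical and do not come from the paper; the supervised prediction step is not covered here.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; the paper does not specify one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "are", "for", "that", "it"}


def rank_words(text, top_k=5):
    """Rank candidate words by raw frequency (an assumed scoring rule)."""
    # Split on whitespace, strip surrounding punctuation, lowercase each token.
    tokens = [t.strip(".,;:!?\"'()[]").lower() for t in text.split()]
    # Keep purely alphabetic tokens, dropping any that contain digits or
    # special characters, and skip the stop words.
    words = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically so the
    # ordering is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = ("Word extraction ranks words. Extraction of words uses "
              "word frequency, not 2017 or e-mail!")
    for word, freq in rank_words(sample, top_k=3):
        print(word, freq)
```

Calling rank_words(document_text, top_k=10) returns (word, count) pairs in descending order of frequency. A supervised variant along the lines the paper proposes would presumably replace the raw count with a score produced by a trained model.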

ISO 9001:2008 Certified Journal | Page 1024
