International Research Journal of Engineering and Technology (IRJET) Volume: 04 Issue: 02 | Feb-2017
www.irjet.net
e-ISSN: 2395-0056 p-ISSN: 2395-0072
Text Document Categorization Using Support Vector Machine
Shugufta Fatima, Dr. B. Srinivasu
Shugufta Fatima, M.Tech, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad, Telangana, India.
Dr. B. Srinivasu, Associate Professor, Dept. of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad, Telangana, India.
Abstract - The Web is a tremendous source of information, so tremendous that it becomes difficult for human beings to select meaningful information without support. Categorization of documents refers to the problem of automatically classifying a set of documents into classes (or categories or topics). Automatic text categorization is an important issue in text mining. The task is to automatically classify text documents into predefined classes based on their content. Automatic categorization of text documents has become an important research issue nowadays. Proper categorization of text documents requires information retrieval, machine learning and natural language processing (NLP) techniques. Our aim is to focus on important approaches to automatic text categorization based on machine learning techniques. Several methods have been proposed for text document categorization. We will adapt and create machine learning algorithms for use with the Web's distinctive structures: large-scale, noisy, varied data with potentially rich, human-oriented features, using a support vector machine (SVM).

Key Words: Text Documents Categorization, Machine Learning, Support Vector Machine.

1. INTRODUCTION
1.1 Document categorization
The Web is a tremendous source of information, so tremendous that it becomes difficult for human beings to select meaningful information without support. Automatic text categorization is an important issue in text mining. The task is to automatically classify text documents into predefined classes based on their content [1]. Feature selection and learning are two important steps in automatic text categorization. Classification algorithms take a "training" set of labeled documents and try to find a relationship between the labels and a set of features, such as the words in the documents.

1.2 Text Categorization
Text categorization is the process of assigning a given text to one or more categories. This process is considered a supervised classification technique, since a set of pre-classified documents is provided as a training set. The goal of text categorization is to assign a category to a new document.

1.3 Problem Description
Information is mostly in the form of unstructured data. As the data on the Web has been growing, it has led to several problems, such as increased difficulty in finding relevant information and extracting potentially useful knowledge. As a consequence of this exponential growth, great importance has been placed on the classification of documents into groups that describe the content of the documents. The function of a classifier is to assign text documents to one or more predefined categories based on their content. Unlabeled texts provide co-occurrence information for words, which can be used to improve categorization performance. Although unlabeled texts are available from the Internet, collecting unlabeled texts that are useful for a given text categorization problem is not an easy task because of the wide diversity of texts on the Internet.
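The supervised setup described in this introduction (a labeled training set, features taken from the words in the documents, and a learned rule that assigns a category to a new document) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy two-category training set, the category names, the function names, and the choice of a linear SVM without a bias term, trained by Pegasos-style stochastic sub-gradient descent on the regularised hinge loss.

```python
import random
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(texts):
    """Assign a feature index to every distinct word in the training texts."""
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Sparse bag-of-words vector {feature index: term count}; unknown words are ignored."""
    vec = {}
    for word in tokenize(text):
        if word in vocab:
            idx = vocab[word]
            vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

def score(weights, vec):
    """Linear decision value w . x for a sparse vector."""
    return sum(weights[j] * v for j, v in vec.items())

def train_linear_svm(vectors, labels, n_features, lam=0.01, epochs=200, seed=0):
    """Pegasos-style training: stochastic sub-gradient steps on the hinge loss
    with L2 regularisation strength lam. Labels must be +1 or -1."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    order = list(range(len(vectors)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)          # decaying learning rate
            violated = labels[i] * score(weights, vectors[i]) < 1.0
            shrink = 1.0 - 1.0 / step         # equals 1 - eta * lam
            weights = [w * shrink for w in weights]
            if violated:                      # example is inside the margin
                for j, v in vectors[i].items():
                    weights[j] += eta * labels[i] * v
    return weights

# Hypothetical pre-classified training set (illustrative data only).
training_set = [
    ("football match goal scored", "sports"),
    ("team won cricket match", "sports"),
    ("player scored goal team", "sports"),
    ("stock market shares fell", "finance"),
    ("bank raised interest rates", "finance"),
    ("market investors bank shares", "finance"),
]

texts = [text for text, _ in training_set]
labels = [1 if category == "sports" else -1 for _, category in training_set]
vocab = build_vocabulary(texts)
vectors = [vectorize(text, vocab) for text in texts]
weights = train_linear_svm(vectors, labels, len(vocab))

def categorize(text):
    """Assign one of the two predefined categories to a new document."""
    return "sports" if score(weights, vectorize(text, vocab)) > 0 else "finance"

print(categorize("The football match ended with a late goal scored"))
print(categorize("The bank raised interest rates again"))
```

The sketch keeps raw term counts as features; a practical system would typically add feature selection and term weighting before training, and would extend the binary classifier to more than two categories.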
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 141

1.4 Basics and background knowledge
1.4.1 Approaches for Classification
Classification is a problem in which a learner is given several labeled training examples and is then asked to label several previously unseen test examples. This can be done in three ways: