Comparison of Text Classifiers on News Articles


International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

p-ISSN: 2395-0072

Volume: 04 Issue: 03 | Mar-2017

www.irjet.net

Lilima Pradhan1, Neha Ayushi Taneja2, Charu Dixit3, Monika Suhag4

1Lilima Pradhan, Department of Computer Engineering, Army Institute of Technology, Pune, Maharashtra
2Neha Ayushi Taneja, Department of Computer Engineering, Army Institute of Technology, Pune, Maharashtra
3Charu Dixit, Department of Computer Engineering, Army Institute of Technology, Pune, Maharashtra
4Monika Suhag, Department of Computer Engineering, Army Institute of Technology, Pune, Maharashtra

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Text classification is used to classify documents on the basis of their content. The documents are assigned to one or more categories either manually or with the help of classification algorithms. Many such algorithms are available, and they differ in efficiency and in the speed with which they classify documents. In this work, news articles were classified using the Support Vector Machine (SVM), Naive Bayes, Decision Tree, k-nearest neighbor (kNN) and Rocchio classifiers. SVM was found to give higher accuracy than the other classifiers tested.

Key Words: Text Classification, Support Vector Machine, Naive Bayes, k-Nearest Neighbor, Decision Tree, Rocchio, Preprocessing

1. INTRODUCTION

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance [1]. With the rapid growth of technology, the amount of digital information generated is enormous, so organizing it is very important. Classifying news articles according to their content is highly desirable, as it can enable automatic tagging of articles for online news repositories and the aggregation of news sources by topic (e.g. Google News), as well as provide the basis for news recommendation systems [2]. Text classification of news articles helps in uncovering patterns in the articles and also provides better insight into their content. The aim of the paper is to present a comparison of various popular classifiers on various datasets in order to study the characteristics of the classifiers [3][4].

2. CLASSIFIERS

2.1 Naive Bayes

Naive Bayes is a classification algorithm based on Bayes' theorem. Bayes' theorem is used for finding conditional probabilities. A conditional probability denotes the likelihood of an event occurring given that another event has already occurred, i.e. it uses knowledge of prior events to predict future events. Using Bayes' theorem, the conditional probability can be decomposed as [5]:

… (1)

In the context of text classification, the probability that a document dj belongs to a class c is calculated by Bayes' theorem as follows [6]:

… (2)

Naive Bayes assumes that all the features are independent. In spite of this assumption, it works quite well for real-world problems such as spam filtering and text classification. Its main advantage is that it requires only a small amount of training data to estimate the necessary parameters.

2.2 Decision Tree

A decision tree is a very popular tool for classification and prediction. It represents rules that are used to classify data. A decision tree has a tree-like structure in which each leaf node indicates a class and each intermediate node represents a decision. It starts at a root node and then branches into multiple outcomes. The attribute to branch on is selected using measures such as the Gini index, information gain and gain ratio. The aim is to predict the value of a target variable from the input attributes by learning simple decision rules. A decision tree is built recursively, dividing the source set into subsets based on an attribute value test. Many algorithms can be used for building a decision tree, such as ID3, C4.5, CART (Classification and Regression Trees) and CHAID.
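To make the Naive Bayes idea concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier. It is not the implementation used in this paper and does not reproduce the paper's numbered equations. It assumes the commonly used formulation in which a class is scored by its log prior plus the sum of the log likelihoods of the document's terms, with add-one (Laplace) smoothing. The function names `train_naive_bayes` and `predict`, and the toy documents, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Fit counts from a list of (tokens, label) pairs (hypothetical helper)."""
    class_counts = Counter()            # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "term_counts": term_counts,
        "vocab": vocab,
        "n_docs": len(docs),
    }


def predict(model, tokens):
    """Return the class with the highest log prior + summed log likelihoods."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_c in model["class_counts"].items():
        counts = model["term_counts"][label]
        total_terms = sum(counts.values())
        score = math.log(n_c / model["n_docs"])  # log prior of the class
        for term in tokens:
            if term not in model["vocab"]:
                continue  # assumption: terms unseen in training are ignored
            # Add-one smoothing keeps unseen (term, class) pairs non-zero.
            score += math.log((counts[term] + 1) / (total_terms + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
toy_docs = [
    ("team wins cricket match".split(), "sports"),
    ("player scores in final match".split(), "sports"),
    ("parliament passes budget bill".split(), "politics"),
    ("minister announces election date".split(), "politics"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, "cricket player wins".split()))
```

Because the independence assumption turns the joint likelihood into a product over terms, working in log space reduces it to a sum and avoids numerical underflow on long documents.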
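The attribute selection measures named for decision trees (Gini index, information gain and gain ratio) can be sketched as follows. This is an illustrative sketch using the standard textbook definitions, not code from the paper. The attribute names `mentions_match` and `long_article`, the helper names and the four-row data set are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_by(rows, labels, attr):
    """Group the labels by the value each row takes for attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def information_gain(rows, labels, attr):
    """Entropy of the parent minus the weighted entropy of the subsets."""
    n = len(labels)
    groups = split_by(rows, labels, attr)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """Information gain normalised by the split information."""
    n = len(labels)
    groups = split_by(rows, labels, attr)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return information_gain(rows, labels, attr) / split_info if split_info > 0 else 0.0


def gini_index(rows, labels, attr):
    """Weighted Gini impurity of the subsets produced by splitting on `attr`."""
    n = len(labels)
    groups = split_by(rows, labels, attr)
    return sum(len(g) / n * gini(g) for g in groups.values())


# Toy data, invented for illustration only.
rows = [
    {"mentions_match": True, "long_article": True},
    {"mentions_match": True, "long_article": False},
    {"mentions_match": False, "long_article": True},
    {"mentions_match": False, "long_article": False},
]
labels = ["sports", "sports", "politics", "politics"]

# Pick the attribute with the highest information gain as the split.
best = max(["mentions_match", "long_article"],
           key=lambda a: information_gain(rows, labels, a))
print(best)
```

A tree-building algorithm would apply this selection recursively to each resulting subset until the subsets are pure or another stopping rule is met. In the usual textbook descriptions, ID3 uses information gain, C4.5 uses gain ratio and CART uses the Gini index.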

© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 2513

