Feature Subset Selection for High Dimensional Data Using Clustering Techniques

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 04 Issue: 03 | March 2017

p-ISSN: 2395-0072

www.irjet.net

Nilam Prakash Sonawale

Student of M.E. Computer Engineering, Bharati Vidyapeeth College of Engineering, Mumbai University, Navi Mumbai, Maharashtra, India

Prof. B. W. Balkhande

Assistant Professor, Dept. of Computer Engineering, Bharati Vidyapeeth College of Engineering, Mumbai University, Navi Mumbai, Maharashtra, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Data mining is the process of analyzing data from different perspectives and summarizing it into useful information (information that can be used to increase revenue, cut costs, or both). Databases contain a large number of attributes or dimensions, and the data is accordingly classified as low-dimensional or high-dimensional. When dimensionality increases, data in the irrelevant dimensions may produce noise. To deal with this problem, it is crucial to have a feature selection mechanism that can find a subset of features that meets the requirements and achieves high relevance. The proposed algorithm, FAST, is evaluated in this project. The FAST algorithm has three steps: irrelevant features are removed, the remaining features are divided into clusters, and the most representative feature is selected from each cluster [8]. The clustering can be performed by the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, which can run in a distributed environment using MapReduce and Hadoop. The final result is a small number of discriminative features.

Key Words: Data Mining, Feature subset selection, FAST, DBSCAN, SU, Eps, MinPts

1. INTRODUCTION

Data mining is an interdisciplinary subfield of computer science; it is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems [4]. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Clustering is an example of unsupervised learning because there are no predefined classes; the quality of a clustering can be measured by high intra-cluster similarity and low inter-cluster similarity [5]. The major clustering methods can be categorized as follows.

1.1 Partitioning methods: These methods classify the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Examples are the k-means and k-medoids algorithms. Their drawback is that most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and have difficulty discovering clusters of arbitrary shape.

1.2 Hierarchical methods: Hierarchical methods create a hierarchical decomposition of the given set of data objects. They suffer from the fact that once a step (merge or split) is done, it can never be undone.

1.3 Density-based methods: These methods continue growing a given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points [3]. Such methods can be used to filter out noise (outliers) and to discover clusters of arbitrary shape.
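The density-based growth rule (a neighborhood of a given radius must contain at least a minimum number of points) can be illustrated with a minimal, single-machine sketch of DBSCAN in Python. This is only an illustration under stated assumptions, not the distributed implementation evaluated in this work: the function names `region_query` and `dbscan`, the parameter names `eps` and `min_pts` (standing in for the Eps and MinPts key words), and the sample points are all hypothetical, and Euclidean distance is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
        cluster_id += 1
    return labels


# Hypothetical sample: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this sample the isolated point never has enough neighbors within `eps`, so it keeps the noise label, while each dense group grows into its own cluster.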


Impact Factor value: 5.181

1.4 Grid-based methods: These methods quantize the object space into a finite number of cells that form a grid structure. The primary advantage of this approach is its fast processing time, which is dependent only on the number of cells in each dimension in the quantized space and independent of the number of data objects.
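As a rough illustration of the quantization step, the following Python sketch maps points to grid cells and then works only with per-cell counts. It assumes equal-width cells; the names `quantize`, `dense_cells`, `cell_size`, and `min_count`, as well as the sample data, are hypothetical and chosen only for this example.

```python
from collections import Counter


def quantize(points, cell_size):
    """Count how many points fall into each grid cell of width cell_size."""
    counts = Counter()
    for p in points:
        # Integer cell index along every dimension
        cell = tuple(int(c // cell_size) for c in p)
        counts[cell] += 1
    return counts


def dense_cells(counts, min_count):
    """Return the cells holding at least min_count points."""
    return {cell for cell, n in counts.items() if n >= min_count}


# Hypothetical sample data
sample = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3),
          (5.5, 5.1), (5.9, 5.7),
          (9.0, 2.0)]
cell_counts = quantize(sample, cell_size=1.0)
dense = dense_cells(cell_counts, min_count=2)
# dense == {(0, 0), (5, 5)}
```

Once the points have been assigned to cells in a single pass, later steps such as `dense_cells` iterate over the occupied cells rather than over the individual data objects, which is where the speed advantage comes from.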


ISO 9001:2008 Certified Journal


Page 867
