Outlier Detection Approaches in Data Mining

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 04 Issue: 3 | Mar-2017

p-ISSN: 2395-0072

www.irjet.net

Bharati Kamble, Kanchan Doke

Computer Engineering, Mumbai University, Navi Mumbai-400614, Maharashtra, India.

Abstract: An outlier is an event that deviates markedly from other events. Identifying outliers can lead to the discovery of useful and meaningful knowledge. An outlier occurs only occasionally and is not part of regular activity. Outlier detection has been studied extensively in the past decade. However, most existing research focuses on algorithms built on domain-specific knowledge, while comparisons of outlier detection approaches are still rare. This paper focuses on the different kinds of outlier detection approaches and compares their pros and cons. We divide outlier detection approaches into two groups: classic outlier approaches and spatial outlier approaches. The classic outlier approach identifies outliers in real transaction datasets and can be grouped into the statistical approach, distance approach, deviation approach, and density approach. The spatial outlier approach detects outliers in spatial datasets, which differ from transaction data, and can be categorized into the space-based approach and the graph-based approach. Finally, the outlier detection approaches are compared.

Keywords: outlier detection; spatial data; transaction data.

I. INTRODUCTION

Data mining is a process of extracting valid, previously unknown, and ultimately comprehensible information from large datasets and using it for organizational decision making [10].
However, mining large datasets raises many problems, such as data redundancy, unspecific attribute values, incomplete data, and outliers [13]. An outlier is defined as an observation that deviates so much from other observations that it arouses suspicion that it was generated by a different mechanism [21]. Identifying outliers can provide useful and meaningful knowledge and has a number of applications in areas such as climatology, ecology, public health, transportation, and location-based services. Recently, a few studies have been conducted on outlier detection for large datasets [4]. However, most existing studies concentrate on algorithms built for a specific background, while comparisons of outlier identification approaches are comparatively scarce. This paper discusses outlier detection approaches from a data mining perspective. The underlying idea is to examine and compare the mechanisms of those approaches to determine which approach is better suited to a particular dataset and background.

The rest of this paper is organized as follows. Section 2 reviews related work in outlier detection. Section 3 discusses the different methods of outlier detection, which are divided into classic outlier techniques based on real-time datasets and spatial outlier techniques based on spatial datasets. The classic outlier approach can be grouped into the statistical-based approach, distance-based approach, deviation-based approach, and density-based approach. The spatial outlier approach can be grouped into the space-based approach and the graph-based approach. A comparison of outlier detection approaches is provided in Section 4. Finally, Section 5 concludes with a summary of those outlier detection algorithms.

II. PREVIOUS WORK

The classic definition of an outlier is due to Hawkins [21], who defines an outlier as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Most early approaches to outlier mining are based on statistics and fit a standard distribution to the dataset. Outliers are described on the basis of the probability distribution. For example, Yamanishi et al. used a Gaussian mixture model to describe normal behavior, and each object is given a score on the basis of changes in the model [22]. Knorr et al. proposed a new definition based on the concept of distance, which regards a point p in a dataset as an outlier with respect to the parameters k and λ if no more than k points in the dataset are at a distance of λ or less from p [6]. Arning et al. proposed a deviation-based method, which identifies outliers by inspecting the main characteristics of objects in a dataset; objects that “deviate” from these features are considered outliers [1]. Breunig et al. introduced the concept of the local outlier, a kind of density-based outlier, which assigns each data object a local outlier factor (LOF) indicating its degree of being an outlier depending on its neighborhood [13]. The outlier factors can be computed very efficiently only if multi-dimensional index structures such as the R-tree and X-tree [17] are employed. A top-n local outlier mining algorithm that uses distance-bound micro-clusters to estimate the density was presented in [9].
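The distance-based definition attributed to Knorr et al. above can be illustrated with a minimal sketch. It is not the algorithm from [6]: it is a brute-force check that assumes Euclidean distance and assumes the point p itself is not counted among its own neighbours. The function name distance_based_outliers and the sample points are hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def distance_based_outliers(points, k, lam):
    """Return indices of points that have at most k other points within distance lam.

    Assumptions (not stated in the source): Euclidean distance is used,
    and a point is not counted as its own neighbour.
    """
    outliers = []
    for i, p in enumerate(points):
        # Count the other points lying at distance lam or less from p.
        neighbours = sum(
            1
            for j, q in enumerate(points)
            if j != i and dist(p, q) <= lam
        )
        # p is an outlier if no more than k points are that close to it.
        if neighbours <= k:
            outliers.append(i)
    return outliers


# Hypothetical data: four points in a tight cluster and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(distance_based_outliers(points, k=1, lam=0.5))  # prints [4]
```

This brute-force version compares every pair of points, so its cost grows quadratically with the dataset size. That cost is one reason index structures and pruning strategies matter for large datasets.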

Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 634
