International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 04 Issue: 07 | July -2017
p-ISSN: 2395-0072
www.irjet.net
Survey Paper on Big Data Imputation and Privacy Algorithms
G. Swetha1, G. Ramya2
1,2 Professor, CSE, CVRCE, India
Abstract - Big data is a collection of data sets so large that traditional processing methods are inadequate to deal with them. However, the fast growth of such data generates both opportunities and problems. This paper presents a literature review of the issues involved, data creation and data protection, together with different algorithms that address these issues.

Key Words: Big Data, Imputation, Nearest Neighbour, Data Protection, Data Distortion, Data Blocking.

1. INTRODUCTION

The Goods and Services Tax was introduced in India on 1 July 2017. People from all over the nation have given their feedback on it, some of it positive and some of it negative. If we could collect and summarize all of these opinions, including later updates, the result would be a good example of big data. Most of the world's data were produced within the last few years [2]. Data come from various sources and in various formats. Social networking sites in particular produce a large amount of data every hour, and handling this volume is very difficult. Big data challenges [7] include capturing, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.

The paper is organized as follows. Section 2 gives an introduction to data imputation and to algorithms for replacing missing data. Section 3 gives an introduction to privacy protection and to algorithms for privacy protection. Section 4 concludes the paper.

2. Data Imputation

When data are preprocessed for data mining, some attribute values are often found to be missing. Knowledge can be extracted reliably only from good-quality data, that is, data without missing values, so missing data lower the quality of the data set. Missing data may occur for many reasons, for example a student who was detained in a class or a respondent who did not answer some of the questions in a survey. If missing data are handled carefully, the quality of the extracted knowledge can be increased.
We therefore need to replace the missing data with other reasonable values. This is known as data imputation. If we have knowledge of the data, we can predict the missing values, but doing so is very complicated. Data may be missing
in columns, in rows, or in both. Missing data can be replaced either before data mining starts or after it has started. This paper surveys two methods for handling missing data: the first is Refined Mean Substitution and the second is K-Nearest Neighbor imputation.

2.1 Data Imputation Algorithms:

The paper [1] proposes an algorithm for imputing missing data. Missing values are estimated using the Euclidean distance between the records that contain missing instances or attributes and the remaining records. In this method, the distance d is calculated between the approximately imputed data set and the rows of the data set. Next, the records whose distance is greater than the mean of d are identified. Let I denote the indices of these records, that is, the index elements whose distance is higher than mean(d). The mean μj of the elements Dnew(I, n) is then computed, and μj is substituted for the missing values in the rows with missing data. Repeating this calculation for every row and substituting at every missing position yields the final imputed data set.

This algorithm was evaluated with five metrics: Rand index, accuracy, specificity, sensitivity, and mean square error. According to [1], in almost all cases this algorithm performed better than the MC/mean value substitution method.

The second imputation algorithm [8] is K-Nearest Neighbors. The features of k-Nearest Neighbor are:
1) All attribute values correspond to points in an n-dimensional Euclidean space.
2) Classification begins only when a new attribute value is entered.
3) Classification is done by comparing the feature vectors of different points.
4) No particular target function is assumed; it may be discrete or real-valued.
5) The Euclidean distance between any two values is calculated, and the mean value of the k nearest neighbors is taken.
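The k-nearest-neighbour idea in feature 5 can be illustrated with a minimal sketch. This is not the implementation from [8]. The function name `knn_impute`, the use of `None` to mark a missing value, the restriction of neighbours to complete rows, and the choice to measure distance only over the attributes that are present are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that attribute over the
    k nearest complete rows (hypothetical helper, for illustration)."""
    # Assumption: only rows without missing values can act as neighbours.
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row")

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue

        # Indices of the attributes that are present in this row.
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            # Euclidean distance over the observed attributes only.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]

        # Replace each missing value by the neighbours' mean for that attribute.
        filled = [
            v if v is not None
            else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed


data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [8.0, 9.0, 10.0],
    [1.05, None, 3.1],  # second attribute is missing
]
result = knn_impute(data, k=2)
print(result[3])
```

In this toy data set the incomplete row lies close to the first two rows, so its missing attribute is filled with the mean of their second attribute and the distant third row is ignored. The choice of k is left to the user; a larger k smooths the estimate towards the overall mean.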
According to [4], the classes of missing-data randomness are: (1) Missing completely at random: the probability that a value is missing depends neither on the existing values nor on the missing value itself, so imputation can be done with any data.