
Missing values problem in Data Mining



  1. Missing values problem in Data Mining Jelena Stojanovic 03/20/2014

  2. Outline • Missing data problem • Missing values in attributes • Missing values in target variable • Missingness mechanisms • Approaches to missing values • Eliminate data objects • Estimate missing values • Handling the missing value during analysis • Experimental analysis • Conclusion

  3. Missing Data problem • Real datasets suffer from many serious data quality problems: incomplete, redundant, inconsistent and noisy data • These problems reduce the performance of data mining algorithms • Missing data is a common issue in almost every real dataset • Caused by varied factors: • high cost involved in measuring variables, • failure of sensors, • reluctance of respondents to answer certain questions, or • an ill-designed questionnaire

  4. Missing values in datasets • The missing data problem arises when values for one or more variables are missing from recorded observations.

  5. Missing values in attributes (independent variables)

  6. Missing labels

  7. Missingness mechanisms • Missing Completely At Random • Missing At Random • Missing Not At Random

  8. Missing Completely at Random - MCAR • Missing Completely at Random: the missingness mechanism does not depend on the variable of interest or on any other variable observed in the dataset. • The data are collected and observed arbitrarily, and whether a value is missing does not depend on any other variable in the dataset. • Example: respondents decide whether to reveal their income levels based on coin flips. • This type of missing data is rarely found in practice; when it holds, the simplest valid approach is to ignore the incomplete cases.

  9. MCAR (continued) • Estimate E(X) from partially observed data: X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …] E(X) = ? • True data: X = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, …] E(X) = 0.5 • Missingness indicator: Rx = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …] • If MCAR: X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …] and E(X) = 3/6 = 0.5
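The slide's point, that under MCAR the mean of the observed entries is an unbiased estimate of E(X), can be sketched as follows; the values and the missingness indicator follow the slide's example:

```python
# Under MCAR, averaging only the observed entries still estimates E(X).
x_true = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]   # fully observed data, E(X) = 0.5
r = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]        # missingness indicator (1 = missing)

# Partially observed data: None marks a missing entry (the slide's "m").
x_obs = [v if m == 0 else None for v, m in zip(x_true, r)]

observed = [v for v in x_obs if v is not None]
estimate = sum(observed) / len(observed)   # 3/6 = 0.5, matching E(X)
print(estimate)   # 0.5
```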

  10. Missing At Random - MAR • Missing at random: the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. • Missingness can only be explained by variables that are fully observed, whereas partially observed variables cannot be responsible for missingness in others; an unrealistic assumption in many cases. • Example: women are more likely not to reveal their age, so the percentage of missing age values is higher among female individuals; missingness in age depends on the observed gender attribute.

  11. Missing Not At Random - MNAR • When the data are neither MCAR nor MAR • The missingness mechanism depends on another partially observed variable, or • The situation in which the missingness mechanism depends on the actual value of the missing data: the probability of an instance having a missing value for an attribute could depend on the value of that attribute itself • A difficult case: the missingness itself must be modeled

  12. Missing data consequences • Missing values can significantly bias the outcome of research studies. • Response profiles of non-respondents and respondents can differ significantly from each other. • Performing the analysis using only complete cases and ignoring the cases with missing values reduces the sample size, substantially reducing estimation efficiency. • Many algorithms and statistical techniques are tailored to draw inferences from complete datasets. • It may be difficult or even inappropriate to apply these algorithms and statistical techniques to incomplete datasets.

  13. Handling missing values • In general, methods to handle missing values belong either to sequential methods (preprocessing methods) or to parallel methods (methods in which missing attribute values are taken into account during the main process of acquiring knowledge). • Existing approaches: • Eliminate data objects or attributes • Estimate missing values • Handling the missing values during analysis

  14. Eliminate data objects

  15. Eliminating data attributes

  16. Estimate Missing Values: most common / mean value
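Most-common-value and mean imputation can be sketched in a few lines; the column data below are illustrative, not from the slides:

```python
# Minimal sketch: mean imputation for a numerical attribute,
# mode (most common value) imputation for a categorical one.
from statistics import mean, mode

ages = [25, 30, None, 40, None, 35]          # numerical attribute
colors = ["red", None, "blue", "red", None]  # categorical attribute

age_mean = mean(v for v in ages if v is not None)        # 32.5
color_mode = mode(v for v in colors if v is not None)    # "red"

ages_filled = [v if v is not None else age_mean for v in ages]
colors_filled = [v if v is not None else color_mode for v in colors]
print(ages_filled)    # [25, 30, 32.5, 40, 32.5, 35]
print(colors_filled)  # ['red', 'red', 'blue', 'red', 'red']
```

This is fast and simple, but it shrinks the attribute's variance and can distort relationships between attributes, which is why the later slides compare it against K-NN imputation.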

  17. Imputation

  18. Imputation: nearest neighbor (K-NN)
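A minimal sketch of K-NN imputation: fill the missing value with the mean of that attribute over the k nearest complete cases, measuring distance only on the attributes that are observed in the incomplete case. The data are illustrative:

```python
# K-NN imputation sketch on two-attribute rows; attribute 1 is missing
# in the query, so distance is computed on attribute 0 only.
complete = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.2), (7.5, 9.0)]
query = (1.2, None)   # second attribute is missing
k = 2

# Rank complete cases by distance on the observed attribute.
neighbors = sorted(complete, key=lambda row: abs(row[0] - query[0]))[:k]

# Impute with the neighbors' mean for the missing attribute.
imputed = sum(row[1] for row in neighbors) / k
print(imputed)   # mean of 2.0 and 1.8 -> 1.9
```

A real implementation would normalize attributes and use a full distance over all observed dimensions; for a categorical attribute one would take the neighbors' majority value instead of their mean.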

  19. Handling the Missing Value During Analysis • Missing values are taken into account during the main process of acquiring knowledge • Some examples: • Clustering: similarity between the objects is calculated using only the attributes that do not have missing values. • C4.5: splitting cases with missing attribute values into fractions and adding these fractions to new case subsets. • CART: a method of surrogate splits to handle missing attribute values. • Rule-based induction algorithms: missing values treated as "do not care" conditions. • Pairwise deletion is used to evaluate statistical parameters from the available information. • CRF: marginalizing out the effect of missing-label instances on labeled data.

  20. Internal missing data strategy used by C4.5 • C4.5 uses a probabilistic approach to handle missing data • C4.5: • Multiple split (each node T can be partitioned into subsets T1, T2, …, Tn) • Evaluation measure: information gain ratio • If there exist missing values in an attribute X, C4.5 uses the subset with all known values of X to calculate the information gain. • Once a test based on an attribute X is chosen, C4.5 uses a probabilistic approach to partition the instances with missing values in X.

  21. Internal missing data strategy used by C4.5 • When an instance in T with a known value is assigned to a subset Ti: • the probability of that instance belonging to subset Ti is 1 • the probability of that instance belonging to all other subsets is 0 • C4.5 associates to each instance in Ti a weight representing the probability of that instance belonging to Ti. • If the instance has a known value and satisfies the test with outcome Oi, then this instance is assigned to Ti with weight 1. • If the instance has an unknown value, this instance is assigned to all partitions with a different weight for each one: • The weight for the partition Ti is the probability that the instance belongs to Ti. • This probability is estimated as the sum of the weights of instances in T known to satisfy the test with outcome Oi, divided by the sum of the weights of the cases in T with known values of the attribute X.
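The fractional-weight rule on this slide can be sketched numerically; the per-outcome weights below are illustrative, not from the slides:

```python
# C4.5-style fractional splitting: an instance with a missing value for
# the test attribute X is sent down every branch, weighted by the
# fraction of known-X weight that followed that branch.
known_weight_per_outcome = {"O1": 6.0, "O2": 4.0}   # weights of cases with known X
total_known = sum(known_weight_per_outcome.values())

instance_weight = 1.0   # current weight of the instance with X missing
fractions = {outcome: instance_weight * w / total_known
             for outcome, w in known_weight_per_outcome.items()}
print(fractions)   # {'O1': 0.6, 'O2': 0.4}
```

The fractions sum to the instance's original weight, so the total weight in the tree is preserved; at deeper nodes the same rule is applied to these already-fractional weights.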

  22. Experimental Analysis* • Using cross-validation estimated error rates, compare the performance of: • the k-nearest neighbor algorithm as an imputation method • the mean or mode imputation method • the internal algorithms used by C4.5 and CN2 to learn with missing data • Missing values were artificially implanted, at different rates (up to more than 50%) and in different attributes • Datasets from UCI [10]: Bupa, Cmc, Pima and Breast *G. Batista and M.C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning," Applied Artificial Intelligence, vol. 17, pp. 519-533, 2003

  23. Comparative results for the Breast data set

  24. Comparative results for the Bupa data set

  25. Comparative results for the Cmc data set

  26. Comparative results for the Pima data set

  27. Conclusion • Missing data is a major data quality problem • There is a vast variety of causes of missingness • In general, there is no best, universal method of handling missing values • Different types of missingness mechanisms (MCAR, MAR, MNAR) and different datasets require different approaches to dealing with missing values

  28. Thank you for your attention! Questions?

  29. Homework problem: • 1. List the types of missingness mechanisms. For each of them, state one approach you think is appropriate and briefly explain why.

  30. Eliminate data objects or attributes • Eliminate objects with missing values (listwise deletion) • A simple and effective strategy • But even partially specified objects contain some information • If many objects are eliminated, reliable analysis can be difficult or impossible • Unless data are missing completely at random, listwise deletion can bias the outcome • Eliminate attributes that have missing values • Use carefully: these attributes may be critical for the analysis • Listwise deletion and pairwise deletion are used in approximately 96% of studies in the social and behavioral sciences
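Both elimination options can be sketched on a small table; the rows and column names are illustrative, not from the slides:

```python
# Elimination strategies on a toy table with missing entries (None).
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 42000},
    {"age": 31, "income": None},
    {"age": 40, "income": 61000},
]

# Listwise deletion: drop any object (row) containing a missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete_rows))   # 2 of 4 rows survive

# Attribute elimination: drop any attribute (column) with a missing value.
# Here both columns contain a missing value, so nothing survives --
# a reminder of why this option must be used carefully.
kept_cols = [c for c in rows[0]
             if all(r[c] is not None for r in rows)]
print(kept_cols)   # []
```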

  31. Estimate Missing Values • Missing data can sometimes be estimated reliably using the values of the remaining cases or attributes: • replacing a missing attribute value by the most common value of that attribute, • replacing a missing attribute value by the mean, for numerical attributes, • assigning all possible values to the missing attribute value, • assigning to a missing attribute value the corresponding value taken from the closest case, • replacing a missing attribute value by a new value, computed from a new data set, considering the original attribute as a decision (imputation) • For this strategy, commonly used machine learning algorithms include: • Unstructured (decision trees, naive Bayes, k-nearest neighbors, …) • Structured (hidden Markov models, conditional random fields, structured SVM, …) • Some of these methods are more accurate but more computationally expensive, so different situations require different solutions

  32. Handling the Missing Value During Analysis • Missing attribute values are taken into account during the main process of acquiring knowledge • In clustering, similarity between the objects is calculated using only the attributes that do not have missing values. Similarity in this case is only an approximation, but unless the total number of attributes is small or the number of missing values is high, the degree of inaccuracy may not matter much. • C4.5 induces a decision tree during tree generation, splitting cases with missing attribute values into fractions and adding these fractions to new case subsets. • A method of surrogate splits to handle missing attribute values was introduced in CART. • In a modification of the LEM2 (Learning from Examples Module, version 2) rule induction algorithm, rules are induced from the original dataset, with missing attribute values considered to be "do not care" conditions or lost values. • In statistics, pairwise deletion is used to evaluate statistical parameters from the available information: • to compute the covariance of variables X and Y, all those cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values. • In CRFs, marginalizing out the effect of missing-label instances on labeled data utilizes the information of all observations and preserves the observed graph structure.
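The pairwise-deletion rule for covariance can be sketched directly; the data are illustrative:

```python
# Pairwise deletion: estimate cov(X, Y) from every case where both X and
# Y are observed, regardless of missingness in other variables.
xs = [1.0, 2.0, None, 4.0, 5.0]
ys = [2.0, None, 3.0, 8.0, 10.0]

pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
n = len(pairs)                              # 3 jointly observed cases
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n

# Sample covariance over the jointly observed pairs.
cov = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
print(n, cov)
```

Note that different parameters (e.g. cov(X, Y) versus cov(X, Z)) may then be estimated from different subsets of cases, which is the usual caveat about pairwise deletion.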
