
Missing values problem in Data Mining



  1. Missing values problem in Data Mining Jelena Stojanovic 03/20/2014

  2. Outline • Missing data problem • Missing values in attributes • Missing values in target variable • Missingness mechanisms • Approaches to missing values • Eliminate data objects • Estimate missing values • Handling the missing value during analysis • Experimental analysis • Conclusion

  3. Missing Data problem • Real datasets suffer from many serious data quality problems: incomplete, redundant, inconsistent and noisy data • These problems reduce the performance of data mining algorithms • Missing data is a common issue in almost every real dataset • Caused by varied factors: • high cost involved in measuring variables, • failure of sensors, • reluctance of respondents to answer certain questions, or • an ill-designed questionnaire

  4. Missing values in datasets • The missing data problem arises when values for one or more variables are missing from recorded observations.

  5. Missing values in attributes (independent variables)

  6. Missing labels

  7. Missingness mechanisms • Missing Completely At Random • Missing At Random • Missing Not At Random

  8. Missing Completely at Random - MCAR • Missing Completely at Random: the missingness mechanism does not depend on the variable of interest or on any other variable observed in the dataset. • The data are collected and observed arbitrarily, and whether a value is missing does not depend on any other variable in the dataset. • Example: respondents decide whether to reveal their income levels based on coin flips. • This type of missing data is rarely found in practice; when it holds, the simplest valid approach is to ignore the incomplete cases.

  9. MCAR (continued) • Estimate E(X) from partially observed data: X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …] E(X) = ? • True data: X = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, …] E(X) = 0.5 • Missingness indicator: Rx = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …] • If MCAR: X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …] and E(X) = 3/6 = 0.5
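The slide's point, that under MCAR the mean of the observed entries is an unbiased estimate of E(X), can be sketched as follows; the values and the missingness indicator follow the slide's example:

```python
# Under MCAR, averaging only the observed entries still estimates E(X).
x_true = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]   # fully observed data, E(X) = 0.5
r = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]        # missingness indicator (1 = missing)

# Partially observed data: None marks a missing entry (the slide's "m").
x_obs = [v if m == 0 else None for v, m in zip(x_true, r)]

observed = [v for v in x_obs if v is not None]
estimate = sum(observed) / len(observed)   # 3/6 = 0.5, matching E(X)
print(estimate)   # 0.5
```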

  10. Missing At Random - MAR • Missing at random: the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. • Missingness can only be explained by variables that are fully observed, whereas partially observed variables cannot be responsible for missingness in others; an unrealistic assumption in many cases. • Example: women are more likely not to reveal their age, so the percentage of missing age values is higher among female individuals; missingness in age depends on the observed gender attribute.

  11. Missing Not At Random - MNAR • When the data are neither MCAR nor MAR • The missingness mechanism depends on another partially observed variable, or • The situation in which the missingness mechanism depends on the actual value of the missing data: the probability of an instance having a missing value for an attribute could depend on the value of that attribute itself • A difficult case: the missingness itself must be modeled

  12. Missing data consequences • Missing values can significantly bias the outcome of research studies. • Response profiles of non-respondents and respondents can differ significantly from each other. • Performing the analysis using only complete cases and ignoring the cases with missing values reduces the sample size, substantially reducing estimation efficiency. • Many algorithms and statistical techniques are tailored to draw inferences from complete datasets. • It may be difficult or even inappropriate to apply these algorithms and statistical techniques to incomplete datasets.

  13. Handling missing values • In general, methods to handle missing values belong either to sequential methods (preprocessing methods) or to parallel methods (methods in which missing attribute values are taken into account during the main process of acquiring knowledge). • Existing approaches: • Eliminate data objects or attributes • Estimate missing values • Handling the missing values during analysis

  14. Eliminate data objects

  15. Eliminating data attributes

  16. Estimate Missing Values: most common / mean value
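Most-common-value and mean imputation can be sketched in a few lines; the column data below are illustrative, not from the slides:

```python
# Minimal sketch: mean imputation for a numerical attribute,
# mode (most common value) imputation for a categorical one.
from statistics import mean, mode

ages = [25, 30, None, 40, None, 35]          # numerical attribute
colors = ["red", None, "blue", "red", None]  # categorical attribute

age_mean = mean(v for v in ages if v is not None)        # 32.5
color_mode = mode(v for v in colors if v is not None)    # "red"

ages_filled = [v if v is not None else age_mean for v in ages]
colors_filled = [v if v is not None else color_mode for v in colors]
print(ages_filled)    # [25, 30, 32.5, 40, 32.5, 35]
print(colors_filled)  # ['red', 'red', 'blue', 'red', 'red']
```

This is fast and simple, but it shrinks the attribute's variance and can distort relationships between attributes, which is why the later slides compare it against K-NN imputation.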

  17. Imputation

  18. Imputation: nearest neighbor (K-NN)
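A minimal sketch of K-NN imputation: fill the missing value with the mean of that attribute over the k nearest complete cases, measuring distance only on the attributes that are observed in the incomplete case. The data are illustrative:

```python
# K-NN imputation sketch on two-attribute rows; attribute 1 is missing
# in the query, so distance is computed on attribute 0 only.
complete = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.2), (7.5, 9.0)]
query = (1.2, None)   # second attribute is missing
k = 2

# Rank complete cases by distance on the observed attribute.
neighbors = sorted(complete, key=lambda row: abs(row[0] - query[0]))[:k]

# Impute with the neighbors' mean for the missing attribute.
imputed = sum(row[1] for row in neighbors) / k
print(imputed)   # mean of 2.0 and 1.8 -> 1.9
```

A real implementation would normalize attributes and use a full distance over all observed dimensions; for a categorical attribute one would take the neighbors' majority value instead of their mean.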

  19. Handling the Missing Value During Analysis • Missing values are taken into account during the main process of acquiring knowledge • Some examples: • Clustering: similarity between the objects is calculated using only the attributes that do not have missing values. • C4.5: splitting cases with missing attribute values into fractions and adding these fractions to new case subsets. • CART: a method of surrogate splits to handle missing attribute values. • Rule-based induction algorithms: missing values treated as "do not care" conditions. • Pairwise deletion is used to evaluate statistical parameters from the available information. • CRF: marginalizing out the effect of missing-label instances on labeled data.

  20. Internal missing data strategy used by C4.5 • C4.5 uses a probabilistic approach to handle missing data • C4.5: • Multiple split (each node T can be partitioned into subsets T1, T2, …, Tn) • Evaluation measure: information gain ratio • If there exist missing values in an attribute X, C4.5 uses the subset with all known values of X to calculate the information gain. • Once a test based on an attribute X is chosen, C4.5 uses a probabilistic approach to partition the instances with missing values in X.

  21. Internal missing data strategy used by C4.5 • When an instance in T with a known value is assigned to a subset Ti: • the probability of that instance belonging to subset Ti is 1 • the probability of that instance belonging to all other subsets is 0 • C4.5 associates to each instance in Ti a weight representing the probability of that instance belonging to Ti. • If the instance has a known value and satisfies the test with outcome Oi, then this instance is assigned to Ti with weight 1. • If the instance has an unknown value, this instance is assigned to all partitions with a different weight for each one: • The weight for the partition Ti is the probability that the instance belongs to Ti. • This probability is estimated as the sum of the weights of instances in T known to satisfy the test with outcome Oi, divided by the sum of the weights of the cases in T with known values of the attribute X.
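The fractional-weight rule on this slide can be sketched numerically; the per-outcome weights below are illustrative, not from the slides:

```python
# C4.5-style fractional splitting: an instance with a missing value for
# the test attribute X is sent down every branch, weighted by the
# fraction of known-X weight that followed that branch.
known_weight_per_outcome = {"O1": 6.0, "O2": 4.0}   # weights of cases with known X
total_known = sum(known_weight_per_outcome.values())

instance_weight = 1.0   # current weight of the instance with X missing
fractions = {outcome: instance_weight * w / total_known
             for outcome, w in known_weight_per_outcome.items()}
print(fractions)   # {'O1': 0.6, 'O2': 0.4}
```

The fractions sum to the instance's original weight, so the total weight in the tree is preserved; at deeper nodes the same rule is applied to these already-fractional weights.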

  22. Experimental Analysis* • Using cross-validation estimated error rates, compare the performance of: • the k-nearest neighbor algorithm as an imputation method • the mean or mode imputation method • the internal algorithms used by C4.5 and CN2 to learn with missing data • Missing values were artificially implanted, at different rates (up to more than 50%) and in different attributes • Datasets from UCI [10]: Bupa, Cmc, Pima and Breast *G. Batista and M.C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning," Applied Artificial Intelligence, vol. 17, pp. 519-533, 2003

  23. Comparative results for the Breast data set

  24. Comparative results for the Bupa data set

  25. Comparative results for the Cmc data set

  26. Comparative results for the Pima data set

  27. Conclusion • Missing data is a major data quality problem • There is a vast variety of causes of missingness • In general, there is no best, universal method of handling missing values • Different types of missingness mechanisms (MCAR, MAR, MNAR) and different datasets require different approaches to dealing with missing values

  28. Thank you for your attention! Questions?

  29. Homework problem: • 1. List the types of missingness mechanisms. For each of them, state one approach you think is appropriate and briefly explain why.

  30. Eliminate data objects or attributes • Eliminate objects with missing values (listwise deletion) • A simple and effective strategy • But even partially specified objects contain some information • If many objects are eliminated, reliable analysis can be difficult or impossible • Unless data are missing completely at random, listwise deletion can bias the outcome • Eliminate attributes that have missing values • Use carefully: these attributes may be critical for the analysis • Listwise deletion and pairwise deletion are used in approximately 96% of studies in the social and behavioral sciences
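Both elimination options can be sketched on a small table; the rows and column names are illustrative, not from the slides:

```python
# Elimination strategies on a toy table with missing entries (None).
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 42000},
    {"age": 31, "income": None},
    {"age": 40, "income": 61000},
]

# Listwise deletion: drop any object (row) containing a missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete_rows))   # 2 of 4 rows survive

# Attribute elimination: drop any attribute (column) with a missing value.
# Here both columns contain a missing value, so nothing survives --
# a reminder of why this option must be used carefully.
kept_cols = [c for c in rows[0]
             if all(r[c] is not None for r in rows)]
print(kept_cols)   # []
```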

  31. Estimate Missing Values • Missing data can sometimes be estimated reliably using the values of the remaining cases or attributes: • replacing a missing attribute value by the most common value of that attribute, • replacing a missing attribute value by the mean, for numerical attributes, • assigning all possible values to the missing attribute value, • assigning to a missing attribute value the corresponding value taken from the closest case, • replacing a missing attribute value by a new value, computed from a new data set, considering the original attribute as a decision (imputation) • For this strategy, commonly used machine learning algorithms include: • Unstructured (decision trees, naive Bayes, k-nearest neighbors, …) • Structured (hidden Markov models, conditional random fields, structured SVM, …) • Some of these methods are more accurate but more computationally expensive, so different situations require different solutions

  32. Handling the Missing Value During Analysis • Missing attribute values are taken into account during the main process of acquiring knowledge • In clustering, similarity between the objects is calculated using only the attributes that do not have missing values. Similarity in this case is only an approximation, but unless the total number of attributes is small or the number of missing values is high, the degree of inaccuracy may not matter much. • C4.5 induces a decision tree during tree generation, splitting cases with missing attribute values into fractions and adding these fractions to new case subsets. • A method of surrogate splits to handle missing attribute values was introduced in CART. • In a modification of the LEM2 (Learning from Examples Module, version 2) rule induction algorithm, rules are induced from the original dataset, with missing attribute values considered to be "do not care" conditions or lost values. • In statistics, pairwise deletion is used to evaluate statistical parameters from the available information: • to compute the covariance of variables X and Y, all those cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values. • In CRFs, marginalizing out the effect of missing-label instances on labeled data utilizes the information of all observations and preserves the observed graph structure.
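The pairwise-deletion rule for covariance can be sketched directly; the data are illustrative:

```python
# Pairwise deletion: estimate cov(X, Y) from every case where both X and
# Y are observed, regardless of missingness in other variables.
xs = [1.0, 2.0, None, 4.0, 5.0]
ys = [2.0, None, 3.0, 8.0, 10.0]

pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
n = len(pairs)                              # 3 jointly observed cases
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n

# Sample covariance over the jointly observed pairs.
cov = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
print(n, cov)
```

Note that different parameters (e.g. cov(X, Y) versus cov(X, Z)) may then be estimated from different subsets of cases, which is the usual caveat about pairwise deletion.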
