
# Data Cleansing: Filling Missing Values in Data


Presentation Transcript

### Data Cleansing: Filling Missing Values in Data

Class Presentation

CIS 764

Instructor: Dr. William Hankley

Presented by: Gaurav Chauhan

Overview
• Problems Caused
• Methods for retrieving missing values
• Predicting values
• The average way
• The probabilistic way
• By leveraging the relational network structure
• Conclusions

CIS 764-Gaurav Chauhan

Problems Caused

Missing values in the data cause problems in the following data analysis tasks:

• Summarizing variables
• Computing new variables
• Comparing variables
• Combining variables
• Time series analysis
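
As a rough illustration of the first and third items, the sketch below shows how a single missing value can spoil a summary and a comparison. The readings are hypothetical, and representing a missing value as NaN is an assumption made only for this example.

```python
import math

# Hypothetical yearly readings; NaN marks a missing value (an assumption).
readings = [10.0, float("nan"), 14.0]

# Summarizing: one NaN propagates through the sum, so the mean is NaN too.
naive_mean = sum(readings) / len(readings)

# Comparing: NaN compares unequal to everything, including itself.
nan_equals_itself = readings[1] == readings[1]

# Dropping the missing entry gives a usable summary, but over fewer values.
observed = [x for x in readings if not math.isnan(x)]
observed_mean = sum(observed) / len(observed)

print(naive_mean, nan_equals_itself, observed_mean)
```

Dropping the missing entry is only one possible workaround; the prediction methods in the rest of the presentation fill the gap instead.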


Methods for retrieving missing values
• Using the average of the available values for prediction
• Using a probabilistic approach for value prediction
• Leveraging the relational network structure of the data to predict values


Predicting Values - the average way

To find the missing rainfall values for the years 1938 and 1942, we can average the rainfall of the neighboring years:

• Taking the average of the rainfall in 1937 and 1939: Rainfall in 1938 = (32 + 25)/2 cm = 28.5 cm
• Taking the average of the rainfall in 1941 and 1943: Rainfall in 1942 = (30 + 28)/2 cm = 29 cm
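
The calculation above can be sketched in a few lines of Python. This is a minimal sketch, not the presenter's implementation: the function name `fill_with_neighbor_average` and the use of `None` for a missing value are assumptions, and so is the assignment of each figure to a specific year (the slide only gives the sums, so 32 is taken for 1937, 25 for 1939, 30 for 1941 and 28 for 1943).

```python
def fill_with_neighbor_average(series):
    """Fill each missing value (None) with the mean of the two adjacent years.

    A gap is left untouched when either neighbor is absent or also missing.
    """
    filled = dict(series)
    for year, value in series.items():
        if value is None:
            before = series.get(year - 1)
            after = series.get(year + 1)
            if before is not None and after is not None:
                filled[year] = (before + after) / 2
    return filled


# Rainfall in cm; the year-to-value assignment is an assumption (see above).
rainfall_cm = {1937: 32, 1938: None, 1939: 25, 1941: 30, 1942: None, 1943: 28}

filled = fill_with_neighbor_average(rainfall_cm)
print(filled[1938], filled[1942])  # 28.5 29.0
```

Only the immediate neighbors are used here, which matches the slide; a wider window or a weighted average would be a different design choice.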


Predicting Values - the probabilistic way
• Assume that we have n values and need to predict the (n+1)th value
• For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i)
• Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data
• The value v_{n+1} is then picked at random such that

p(v_{n+1} = v_i) > p(v_{n+1} = v_j) if p(v_i) > p(v_j)
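
One way to satisfy this condition is to draw the new value with probability proportional to its observed frequency. The sketch below assumes that reading; the function name `predict_by_frequency` and the sample data are hypothetical, and other sampling schemes that preserve the same ordering of probabilities would also fit the slide.

```python
import random
from collections import Counter


def predict_by_frequency(observed, rng=random):
    """Draw one value at random, weighted by how often it occurs in `observed`."""
    counts = Counter(observed)
    values = list(counts)
    # p(v_i) is estimated as the relative frequency of v_i in the data.
    probabilities = [counts[v] / len(observed) for v in values]
    return rng.choices(values, weights=probabilities, k=1)[0]


# Hypothetical data: "A" occurs 7 times, "B" twice, "C" once.
observed = ["A"] * 7 + ["B"] * 2 + ["C"]

rng = random.Random(0)  # seeded only to make the demonstration repeatable
draws = Counter(predict_by_frequency(observed, rng) for _ in range(10_000))
print(draws.most_common())
```

Over many draws the more frequent values are predicted more often, which is the ordering the inequality above asks for. A single prediction can still be a rare value.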


Predicting Values by leveraging the relational network
• This technique applies only to relational data
• A missing value is predicted as the mode of the corresponding values of its peers, that is, the instances that fit the same relational network and have no missing values


Predicting Values by leveraging the relational network
• Example 1

Network with no missing values:
Book A: Category A
Book C: Category C
Book B: Category B

Network with a missing value:
Book A: ? (predicted: Category A)
Book C: Category C
Book B: Category B


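Predicting Values by leveraging the relational network
• Example 2

Teacher, linked to Student 1, Student 2, Student 3 and Student 4:

Student 1: Age 19
Student 2: ? (predicted: 19)
Student 3: Age 18
Student 4: Age 19

The predicted age of 19 is the mode of the known ages (19, 18, 19) of the other students linked to the same teacher.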
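
The prediction in Example 2 can be sketched as a mode over the known values of the peers. This is a minimal sketch under assumptions: the function name `predict_from_peers` is hypothetical, missing values are represented as `None`, and the peers are simply the students linked to the same teacher.

```python
from collections import Counter


def predict_from_peers(peer_values):
    """Return the most common known value among the peers, or None if none is known."""
    known = [v for v in peer_values if v is not None]
    if not known:
        return None
    # most_common(1) yields the (value, count) pair with the highest count.
    return Counter(known).most_common(1)[0][0]


# Ages of the students linked to the same teacher; None marks the missing age.
student_ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}

predicted_age = predict_from_peers(student_ages.values())
print(predicted_age)  # 19
```

When two values are equally common, `Counter.most_common` returns the one encountered first, so a tie-breaking rule would have to be chosen explicitly for real data.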


Conclusion
• Missing values are harmful when the data is used for analysis, learning or mining
• Various techniques aim at predicting missing values, but none achieves 100% accuracy
• An average prediction accuracy of about 90% is still considered acceptable




Questions, anyone?
• I am shivering not because of nervousness but because of cold room temperature

-one nervous student
