
Data Cleansing: Filling Missing Values in Data

Class Presentation

CIS 764

Instructor: Dr. William Hankley

Presented by: Gaurav Chauhan

Overview
  • Problems Caused
  • Methods for retrieving missing values
  • Predicting values
    • The average way
    • The probabilistic way
    • By leveraging the relational network structure
  • Conclusions

CIS 764-Gaurav Chauhan

Problems Caused

Missing values in a data set cause problems in the following data analysis tasks:

  • Summarizing variables
  • Computing new variables
  • Comparing variables
  • Combining variables
  • Time series analysis

Methods for retrieving missing values
  • Predicting the value as the average of the available values
  • Using a probabilistic approach for value prediction
  • Leveraging the relational network structure of the data to predict values

Finding the values for the years 1938 and 1942

The rainfall for these two years can be estimated from the neighboring years.

Taking the average of the rainfall in 1937 and 1939:

Rainfall in 1938 = (32 + 25) / 2 cm = 28.5 cm

Taking the average of the rainfall in 1941 and 1943:

Rainfall in 1942 = (30 + 28) / 2 cm = 29 cm
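
The neighbor-averaging calculation can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the function name `fill_with_neighbor_average` and the year-keyed dictionary layout are assumptions, and only the four rainfall figures used in the calculation are included.

```python
from typing import Optional


def fill_with_neighbor_average(
    series: dict[int, Optional[float]],
) -> dict[int, Optional[float]]:
    """Fill a missing year with the mean of the previous and next year.

    A gap is filled only when both neighboring years are present and known;
    otherwise it is left as None.
    """
    filled = dict(series)
    for year, value in series.items():
        before = series.get(year - 1)
        after = series.get(year + 1)
        if value is None and before is not None and after is not None:
            filled[year] = (before + after) / 2
    return filled


# Rainfall in cm; None marks the two missing years.
rainfall_cm = {1937: 32, 1938: None, 1939: 25, 1941: 30, 1942: None, 1943: 28}
filled = fill_with_neighbor_average(rainfall_cm)
print(filled[1938], filled[1942])  # 28.5 29.0
```

Known values are passed through unchanged, so the function can be applied to a whole series without overwriting observed data.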

Predicting Values: the probabilistic way
  • Assume that we have n values and need to predict the (n+1)th value, v_{n+1}
  • For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i)
  • Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data
  • v_{n+1} is then picked at random such that

p(v_{n+1} = v_i) > p(v_{n+1} = v_j) if p(v_i) > p(v_j)
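
One way to satisfy this rule is to sample the new value from the empirical frequency distribution of the observed values, so that more frequent values are proportionally more likely to be picked. The sketch below assumes that reading; the function names and the sample data are hypothetical and do not come from the presentation.

```python
import random
from collections import Counter


def empirical_probabilities(observed):
    """Return p(v) for each distinct value v, based on its relative frequency."""
    counts = Counter(observed)
    total = len(observed)
    return {value: count / total for value, count in counts.items()}


def predict_next(observed, rng=random):
    """Pick the (n+1)th value at random, weighted by observed frequency."""
    probs = empirical_probabilities(observed)
    values = list(probs)
    weights = [probs[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observations, for illustration only.
observed = [19, 18, 19, 20, 19, 18]
probs = empirical_probabilities(observed)
prediction = predict_next(observed, random.Random(0))  # seeded for repeatability
```

Because the pick is random, repeated calls can return different values; only the relative likelihoods follow the frequencies.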

Predicting Values by leveraging the relational network
  • This technique applies only to relational data
  • A missing value is predicted as the mode of the values held by peers that fit the same relational network and have no missing values

Predicting Values by leveraging the relational network
  • Example 1

Book A              Book C        Book B
Category A          Category C    Category B

Book A              Book C        Book B
? (Predicted: A)    Category C    Category B

Predicting Values by leveraging the relational network
  • Example 2

Teacher

Student 1    Student 2           Student 3    Student 4
Age(19)      ? (Predicted 19)    Age(18)      Age(19)
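
The prediction in Example 2 can be reproduced with a short Python sketch. It assumes the peers are the students who share the same teacher; the function name `fill_from_peers` and the dictionary layout are hypothetical.

```python
from statistics import mode


def fill_from_peers(peer_values):
    """Replace each missing value with the mode of the peers' known values."""
    known = [v for v in peer_values.values() if v is not None]
    most_common = mode(known)  # raises StatisticsError if no peer has a value
    return {
        name: (most_common if value is None else value)
        for name, value in peer_values.items()
    }


# Ages of the students linked to the same teacher; None marks the missing age.
student_ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
filled = fill_from_peers(student_ages)
print(filled["Student 2"])  # 19
```

Age 19 occurs twice among the three known peers and age 18 once, so the mode, and therefore the predicted value, is 19.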

Conclusion
  • Missing values are a problem when the data is used for analysis, learning, or mining
  • Various techniques aim to predict missing values, but none achieves 100% accuracy
  • An average prediction accuracy of 90% is still considered acceptable


Questions, Anyone?
  • "I am shivering not because of nervousness but because of the cold room temperature."

- one nervous student
