
Data Cleansing: Filling Missing Values in Data



Presentation Transcript


  1. Data Cleansing: Filling Missing Values in Data. Class Presentation, CIS 764. Instructor: Dr. William Hankley. Presented by: Gaurav Chauhan.

  2. Overview
     • Problems Caused
     • Methods for retrieving missing values
     • Predicting values
     • The average way
     • The probabilistic way
     • By leveraging the relational network structure
     • Conclusions
     CIS 764-Gaurav Chauhan

  3. Problems Caused
     Missing values in the data cause problems in the following data analysis tasks:
     • Summarizing variables
     • Computing new variables
     • Comparing variables
     • Combining variables
     • Time series analysis
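A minimal sketch of how a single missing value disturbs summarizing and comparing. It assumes the missing reading is stored as NaN, which is one common representation and not something the slides specify; the numbers are illustrative only.

```python
import math

# Illustrative readings; the middle one is missing and stored as NaN (an assumption).
rainfall_cm = [32.0, float("nan"), 25.0]

# Summarizing: NaN propagates through the sum, so the mean is NaN as well.
total = sum(rainfall_cm)
mean = total / len(rainfall_cm)
print(math.isnan(total), math.isnan(mean))

# Comparing: every ordering comparison against NaN is False.
missing = rainfall_cm[1]
print(missing > 30.0, missing <= 30.0)

# Dropping the missing reading gives a usable summary, but over fewer observations.
known = [v for v in rainfall_cm if not math.isnan(v)]
known_mean = sum(known) / len(known)
print(known_mean)
```

Dropping incomplete readings is the simplest workaround, but it discards data, which is why the methods on the following slides try to fill the gap instead.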

  4. Methods for retrieving missing values
     • Using the average of the available values for prediction
     • Using a probabilistic approach for value prediction
     • Leveraging the relational network structure of the data to predict values

  5. Predicting Values: the average way

  6. To find the missing values for the years 1938 and 1942, we can calculate the rainfall for these two years as follows:
     • Taking the average of the rainfall in 1937 and 1939: Rainfall in 1938 = (32 + 25)/2 cm = 28.5 cm
     • Taking the average of the rainfall in 1941 and 1943: Rainfall in 1942 = (30 + 28)/2 cm = 29 cm
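The averaging step can be sketched in a few lines of Python. The function name `fill_by_neighbor_average` and the use of `None` for a missing year are illustrative assumptions; only the four known rainfall values come from the slide, and other years are left out because the transcript does not give them.

```python
from typing import Dict, Optional


def fill_by_neighbor_average(series: Dict[int, Optional[float]]) -> Dict[int, float]:
    """Fill each missing year with the mean of the years just before and after it."""
    filled = dict(series)
    for year, value in series.items():
        if value is None:
            before = series.get(year - 1)
            after = series.get(year + 1)
            if before is None or after is None:
                # Both neighbours are needed; this sketch does not look further away.
                raise ValueError(f"cannot average the neighbours of {year}")
            filled[year] = (before + after) / 2
    return filled


# Rainfall in cm; None marks the missing years (hypothetical representation).
rainfall_cm = {1937: 32, 1938: None, 1939: 25, 1941: 30, 1942: None, 1943: 28}
filled = fill_by_neighbor_average(rainfall_cm)
print(filled[1938], filled[1942])  # 28.5 29.0
```

This only works when both neighbouring years are present; a run of consecutive missing years would need a different rule, such as interpolating across the whole gap.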

  7. Predicting Values: the probabilistic way
     • Assume that we have n values and must predict the (n+1)th value, v_(n+1).
     • For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i).
     • Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data.
     • v_(n+1) is then picked at random such that p(v_(n+1) = v_i) > p(v_(n+1) = v_j) whenever p(v_i) > p(v_j).
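One way to realize this rule is to draw the new value with probability proportional to how often each value was observed. The sketch below is an assumption about how the sampling could be done, not the presenter's implementation; `predict_by_frequency` and the sample data are hypothetical.

```python
import random
from collections import Counter


def predict_by_frequency(observed, rng=random):
    """Draw one value with probability proportional to its observed frequency."""
    counts = Counter(observed)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observations: "A" occurs 3 times out of 5, "B" and "C" once each.
observed = ["A", "A", "A", "B", "C"]

rng = random.Random(0)  # seeded only so the demonstration is repeatable
draws = Counter(predict_by_frequency(observed, rng) for _ in range(10_000))
print(draws.most_common())
```

Over many draws the more frequent value is predicted more often, which matches the condition p(v_(n+1) = v_i) > p(v_(n+1) = v_j) whenever p(v_i) > p(v_j).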

  8. Predicting Values by leveraging the relational network
     • This technique applies only to relational data.
     • The value of a missing instance is predicted as the mode of its peers that fit the relational network and have no missing values.

  9. Predicting Values by leveraging the relational network

  10. Predicting Values by leveraging the relational network
     • Example 1 (diagram)
     • Complete network: Book A, Book C, Book B with Category A, Category C, Category B
     • Network with a missing value: Book A, Book C, Book B with ? (Predicted = A), Category C, Category B

  11. Predicting Values by leveraging the relational network
     • Example 2 (diagram)
     • A Teacher is linked to Student 1, Student 2, Student 3, and Student 4
     • The students' ages are listed as Age(19), ?, Age(18), Age(19); the missing age is predicted as 19
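The mode-of-peers rule from Example 2 can be sketched as follows. The mapping of ages to individual students follows the order in which the slide lists them, which is an assumption, and `predict_from_peers` is a hypothetical helper name.

```python
from collections import Counter


def predict_from_peers(peer_values):
    """Return the mode of the peers' values, ignoring peers whose value is missing."""
    known = [v for v in peer_values if v is not None]
    if not known:
        raise ValueError("no peer has a known value")
    return Counter(known).most_common(1)[0][0]


# Students who share the same teacher are treated as peers; None marks the missing age.
student_ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
predicted = predict_from_peers(student_ages.values())
print(predicted)  # 19
```

When two values are equally frequent among the peers, `most_common` returns the one encountered first, so a real application would need an explicit tie-breaking rule.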

  12. Conclusion
     • Missing values are harmful when the data is used for analysis, learning, or mining.
     • Various techniques aim at predicting the missing data, but none has reached 100% accuracy.
     • An average prediction accuracy of 90% is still acceptable.

  13. References
     • www.hrs.co.nz
     • http://dblife.cs.wisc.edu/search.cgi?entity=entity-8982

  14. Questions, anyone?
     • "I am shivering not because of nervousness but because of the cold room temperature." (one nervous student)
