
Data Cleaning




  1. Information Integration, Lecture 13: Data Cleaning. Michael Kassoff, Spring 2003

  2. Big Picture • Problem Statement – What is the desired end result? (“Business requirements”) • Data Location / Collection – Find some data to integrate, and/or make some yourself. Put the data in staging tables if desired and possible. • Data Cleaning – Get rid of bad data • Data Integration – See the first 12 lectures of CS 246

  3. Staging Tables • Local tables in your own database that mirror the schema of the imported data and still hold the ‘dirty’ data • You may add more staging tables as intermediates between the imported-schema staging tables and the local schema • Especially useful if you import data infrequently
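A minimal sketch of the staging-table idea in Python with sqlite3 (the table name, column names, and input file are hypothetical; everything is loaded as raw text precisely because nothing has been cleaned yet):

```python
import csv
import sqlite3

conn = sqlite3.connect("integration.db")

# The staging table mirrors the imported file's schema. Every column is
# TEXT because nothing has been validated yet; the 'dirty' data goes in as-is.
conn.execute("""
    CREATE TABLE IF NOT EXISTS staging_agencies (
        id TEXT, name TEXT, city TEXT, state TEXT
    )
""")

with open("agencies.csv", newline="") as f:
    rows = [tuple(row) for row in csv.reader(f)]  # assumes 4 columns per row

conn.executemany("INSERT INTO staging_agencies VALUES (?, ?, ?, ?)", rows)
conn.commit()
```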

  4. Data Cleaning • That’s what this lecture is about! • Why is data cleaning needed? • How do we clean it?

  5. Data is often of low quality • Why? • You didn’t collect it yourself! • It was probably created for some other use, and then you came along wanting to integrate it • People make mistakes (typos) • People are busy (“this is good enough”)

  6. Problems with data • Some data has problems on its own • Other data is problematic only when you want to integrate it

  7. Data with problems on their own

  8. Problems due to lack of structure • Nowadays, we have databases and XML • Before, everyone had their own way of structuring data, based on what was convenient for them • Even today, not everyone uses structured data • Convenience • Lack of time • Efficiency of representation

  9. Case study: government agency data • What we want: a table of agency records with four fields – an id, the agency name, the city, and the state (e.g., 1, Dept. of Transportation, New York, NY)

  10. First problem What’s wrong here?
1'Dept. of Transportation'New York'NY
2'Dept. of Finance'New York'NY
3'Office of Veteran's Affairs'New York'NY
• The separator is used in the data. • Easy to miss if you don’t check the # of columns when parsing each row.
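A small sketch of that column-count check while parsing (the file name and expected width are assumptions for illustration):

```python
import csv

EXPECTED_COLUMNS = 4

good, bad = [], []
with open("agencies.txt", newline="") as f:
    # The file uses ' as its field separator, so the apostrophe inside
    # "Office of Veteran's Affairs" silently splits that field in two.
    for lineno, row in enumerate(csv.reader(f, delimiter="'"), start=1):
        if len(row) == EXPECTED_COLUMNS:
            good.append(row)
        else:
            bad.append((lineno, row))

print(f"{len(good)} clean rows, {len(bad)} rows with the wrong column count")
```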

  11. Second problem What’s wrong here?
1,Dept. of Transportation,New York City,NY
2,Dept. of Finance,City of New York,NY
3,Office of Veteran's Affairs,New York,NY
• We need standardization / naming conventions • Could we enforce this in a database? In XML?

  12. Third problem What’s wrong here?
1,Dept. of Transportation,New York,NY
,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY
• A missing required field • Couldn’t occur in a database because of the primary key constraint • Could fix this in XML with a #REQUIRED attribute declaration in the DTD

  13. Fourth problem What’s wrong here?
1,Dept. of Transportation,New York,NY
Two,Dept. of Finance,New York,NY
Office of Veteran's Affairs,3,New York,NY
• No data type constraints • Can’t fix in XML w/ DTDs • XML Schema fixes this problem
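One way to catch the third and fourth problems at import time is a per-row validator; a sketch (the rules are illustrative, not a complete schema):

```python
def validate_row(row):
    """Return a list of problems with one parsed (id, name, city, state) row."""
    problems = []
    record_id, name, city, state = row
    if not record_id.strip():
        problems.append("missing required id")                  # third problem
    elif not record_id.strip().isdigit():
        problems.append(f"id {record_id!r} is not an integer")  # fourth problem
    return problems

for row in [("1", "Dept. of Transportation", "New York", "NY"),
            ("", "Dept. of Finance", "New York", "NY"),
            ("Two", "Office of Veteran's Affairs", "New York", "NY")]:
    print(row[1], "->", validate_row(row))
```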

  14. Fifth Problem What’s potentially a problem here?
1,Dept. of Transportation,New York,NY
2,Dept. of Finance,New York,NY
3,Commission for the United Nations Consular Corps and Protocol,New York,NY
• A field is longer than the documentation says it can be • XML DTDs cannot help; XML Schema can

  15. Sixth Problem What’s wrong here?
1,Dept. of Transportation,New York,NY
2,Dept. of Finance,New York,NY
2,Office of Veteran's Affairs,New York,NY
• Primary key constraint violation • XML can help us if we use an ID attribute
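Detecting the key violation is mechanical once the rows are parsed; a sketch:

```python
from collections import defaultdict

rows = [("1", "Dept. of Transportation"),
        ("2", "Dept. of Finance"),
        ("2", "Office of Veteran's Affairs")]   # two rows share key 2

names_by_key = defaultdict(list)
for record_id, name in rows:
    names_by_key[record_id].append(name)

for record_id, names in names_by_key.items():
    if len(names) > 1:
        print(f"primary key {record_id} violated by: {names}")
```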

  16. Seventh Problem What’s wrong here?
1,Dept. of Transportation,New York,NY
2,Dept. of Finance,New York,NY
3,Dept. of Finance,New York,NY
• Redundancy! • XML can’t help here • Databases can help (use constraints) • Solution is to normalize the data

  17. Eighth problem What’s wrong here?
Contract
1,Dept. of Transportation,New York,NY
2,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY
• Contractual obligations may prevent you from using the data as you like

  18. Problems not due to lack of structure (it’s in a database)

  19. Flags • More subtle if they are in a database • Most frequently used flags: 0s and 9s • Signify ‘no data’ or ‘out of range’ • Be careful: sometimes 0 is a better value for ‘out of range’ than is null (Ex: trace elements)
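A sketch of decoding such flags on import (the sentinel values are hypothetical; the real ones must come from the source’s documentation):

```python
SENTINELS = {"9", "99", "999", "-999"}   # hypothetical 'no data' codes

def decode(value, zero_is_flag=True):
    """Map sentinel flag values to None (missing)."""
    if value == "0":
        # Sometimes 0 really means 'none measured' (e.g. trace elements),
        # in which case it should be kept as a real zero, not nulled out.
        return None if zero_is_flag else 0.0
    if value in SENTINELS:
        return None
    return float(value)

print(decode("999"))                    # None
print(decode("0", zero_is_flag=False))  # 0.0
```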

  20. Typos • How to catch them? • Pay someone to manually check all data • Sort data by frequency • ‘green’ is more frequent than ‘rgeen’ • Works well for categorical data • Use constraints • Use statistical techniques
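The frequency-sort trick is a one-liner with a Counter; rare values at the bottom of the list are typo candidates:

```python
from collections import Counter

colors = ["green"] * 40 + ["blue"] * 25 + ["rgeen", "bleu"]

# Sort by frequency: common spellings float to the top,
# one-off typos like 'rgeen' sink to the bottom.
for value, n in Counter(colors).most_common():
    print(f"{n:4d}  {value}")
```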

  21. Using Constraints to Catch Corrupt Data • Foreign key constraints work well if data possibilities can be pre-enumerated • Numerical constraints • Weight can’t be negative • People can’t have more than 2 parents • Women can’t bear 80 children
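A sketch of those numerical constraints as plain checks over each record (the thresholds are the slide’s own examples):

```python
def check_person(weight_kg, num_parents, num_children):
    violations = []
    if weight_kg < 0:
        violations.append("weight can't be negative")
    if num_parents > 2:
        violations.append("people can't have more than 2 parents")
    if num_children >= 80:
        violations.append("women can't bear 80 children")
    return violations

print(check_person(weight_kg=-70, num_parents=2, num_children=3))
```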

  22. Using Statistical Techniques to Catch Corrupt Data • Check for outliers (the case of the 60 foot man) • Check for correlated outliers (“pregnant males”) • People can be male • People can be pregnant • People can’t be male AND pregnant
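A sketch of both checks: a simple two-standard-deviation rule for single-column outliers, and a hand-written rule for the correlated case (the threshold and field names are assumptions):

```python
import statistics

heights_ft = [5.4, 5.9, 6.1, 5.7, 60.0, 5.8]   # the 60-foot man

mean = statistics.mean(heights_ft)
stdev = statistics.stdev(heights_ft)
print("outliers:", [h for h in heights_ft if abs(h - mean) > 2 * stdev])

# Correlated outliers: each value is plausible alone, the combination is not.
people = [{"sex": "F", "pregnant": True}, {"sex": "M", "pregnant": True}]
print("impossible:", [p for p in people if p["sex"] == "M" and p["pregnant"]])
```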

  23. Hidden Semantics • Even in a structured document, not all data is explicit • Order may be important, even if there is no explicit ‘order tag’ • Whitespace may be important

  24. Hidden Semantics Example: MYCIN rules • MYCIN is a program that uses a knowledge base of rules to diagnose a patient using backward chaining • The order of the rules is important – using the same rules in a different order produces different results • There is nothing in the rules themselves that suggests the importance of order

  25. Hidden Semantics Example: Gene Ontology (GO) • [The slide shows a sample of GO data, not reproduced here] • The whitespace is important • What does the whitespace mean? I have no idea. I’d need to look at the documentation.

  26. Misleading data • [The slide shows two tables, not reproduced here] • The table on the top is the data as it is given to us • The table on the bottom is what the data “means” • Whitespace is used in a misleading way

  27. Data that is fine on its own, but becomes problematic when you want to integrate it

  28. Formats • Not everyone uses the same format as you • Dates are especially problematic: • 12/19/77 • 12/19/1977 • 12-19-77 • 19/12/77 • Dec 19, 1977 • 19 December 1977 • 9 Tevet 5738 (Hebrew calendar)
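A sketch of normalizing such dates to one canonical form; the priority order of the formats is an assumption, and genuinely ambiguous strings like 12/19/77 vs. 19/12/77 can only be resolved by knowing the source’s convention:

```python
from datetime import datetime

FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%m-%d-%y", "%d/%m/%y",
           "%b %d, %Y", "%d %B %Y"]

def to_iso(text):
    """Try each known format in order; return an ISO date string or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    return None   # e.g. a Hebrew calendar date needs special handling

print(to_iso("Dec 19, 1977"))   # 1977-12-19
print(to_iso("12/19/77"))       # 1977-12-19
```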

  29. Data that Moves • Be careful of taking snapshots of a moving target • Let’s say you want to store the price of a skillet in France, and the price of a pot in Germany • You can’t store it all in the same currency (say, US$) because the exchange rate changes • Price in foreign currency stays the same • Must keep the data in foreign currency and use the current exchange rate to convert
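A sketch of that policy: store the amount together with its original currency and convert only when a common currency is actually needed (the rate below is invented; a real system would fetch the current one):

```python
prices = [("skillet", 30.0, "EUR"),   # price observed in France
          ("pot",     25.0, "EUR")]   # price observed in Germany

def to_usd(amount, currency, rates):
    return amount * rates[currency]

current_rates = {"EUR": 1.10}   # hypothetical rate, refreshed at query time
for item, amount, currency in prices:
    print(item, round(to_usd(amount, currency, current_rates), 2), "USD")
```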

  30. Data at a different level of detail than you need • If it is at a finer level of detail, you can sometimes bin it • Example • I need age ranges of 20-30, 30-40, 40-50, etc. • Imported data contains birth date • No problem! Divide data into appropriate categories
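A sketch of the binning step (the reference date and 10-year bin width are assumptions):

```python
from datetime import date

def age_bin(birth_date, today=date(2003, 4, 1), width=10):
    """Bin a birth date into an age range such as '20-30'."""
    age = (today - birth_date).days // 365   # rough age; fine for binning
    low = (age // width) * width
    return f"{low}-{low + width}"

print(age_bin(date(1977, 12, 19)))   # '20-30'
```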

  31. Data at a different level of detail than you need (cont’d) • Sometimes you cannot bin it • Example • I need age ranges 20-30, 30-40, 40-50 etc. • Data is of age ranges 25-35, 35-45, etc. • What to do? • Ignore age ranges because you aren’t sure • Make an educated guess based on the imported data (e.g., assume the number of people aged 25-35 is the average of the numbers aged 20-30 and 30-40)
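One concrete version of that educated guess: assume people are spread uniformly within each imported range, and apportion each imported count to the target ranges by overlap (the counts below are invented):

```python
def overlap(a, b):
    """Length of the overlap of two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

source = {(25, 35): 120, (35, 45): 80}     # the ranges we were given
targets = [(20, 30), (30, 40), (40, 50)]   # the ranges we need

estimate = {t: 0.0 for t in targets}
for s, count in source.items():
    for t in targets:
        # Uniform-spread assumption: a source range contributes in
        # proportion to how much of it overlaps the target range.
        estimate[t] += count * overlap(s, t) / (s[1] - s[0])

print(estimate)   # {(20, 30): 60.0, (30, 40): 100.0, (40, 50): 40.0}
```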

  32. Conflicting Data • Information source #1 says that George lives in Texas • Information source #2 says that George lives in Washington, DC • What to do? • Use both (He lives in both places) • Use the most recently updated piece of info • Use the “most trusted” info • Flag row to be investigated further by hand • Use neither (We’d rather be incomplete than wrong)
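A sketch of one resolution policy from the list, most-recent-wins with trust as a tiebreaker (the dates and trust scores are invented):

```python
facts = [
    {"value": "Texas",          "updated": "2002-06-01", "trust": 0.9},
    {"value": "Washington, DC", "updated": "2003-01-15", "trust": 0.6},
]

# ISO dates compare correctly as strings, so max() finds the newest fact,
# breaking ties by the trust score.
best = max(facts, key=lambda f: (f["updated"], f["trust"]))
print("George lives in:", best["value"])   # Washington, DC
```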

  33. Data Cleaning seems like a big pain. Can I avoid it somehow?

  34. One Possibility: Do Nothing • Maybe you don’t want to clean up the data at all • Example: RiboWeb • RiboWeb consists of integrated data from scientific papers about ribosomal structure • Even if data is known to be incorrect, it is kept in the RiboWeb database • The source of the data is recorded in the schema, allowing you to judge for yourself its likely quality • Author • Journal name • Whether the paper was peer-reviewed

  35. 5 Steps to Cleaner Data

  36. 5 Steps of Data Cleaning • Semi-structure • Standardize • Local consistency check • Global consistency check • Document

  37. Example (adapted from “Dealing with Dirty Data” by Ralph Kimball, DBMS, September 1996) Address entry from an unstructured file:
Ralph B and Julianne Kimball
Ste. 116 13150 Hiway 9 Box 1234
Boulder Crk Colo 95006

  38. Semi-structuring Also known as parsing:
Addressee First Name(1): Ralph
Addressee Middle Initial(1): B
Addressee Last Name(1): Kimball
Addressee First Name(2): Julianne
Addressee Last Name(2): Kimball
Street Address Number: 13150
Street Name: Hiway 9
Suite Number: 116
Post Office Box Number: 1234
City: Boulder Crk
State: Colo
Five Digit Zip: 95006
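A toy version of that parsing step for this one address shape (real semi-structuring, as in Kimball’s article, handles many layouts; the regex below only fits this example):

```python
import re

raw = "Ste. 116 13150 Hiway 9 Box 1234 Boulder Crk Colo 95006"

m = re.match(
    r"Ste\. (?P<suite>\d+) (?P<number>\d+) (?P<street>.+?) "
    r"Box (?P<box>\d+) (?P<city>.+?) (?P<state>\S+) (?P<zip>\d{5})$",
    raw,
)
print(m.groupdict())
# {'suite': '116', 'number': '13150', 'street': 'Hiway 9', 'box': '1234',
#  'city': 'Boulder Crk', 'state': 'Colo', 'zip': '95006'}
```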

  39. Standardize • We should replace synonyms with one standard term • Hiway 9 → Highway 9 • Boulder Crk → Boulder Creek • Colo → Colorado
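A sketch of the standardization pass as a synonym table (a real table would be much larger and domain-specific):

```python
SYNONYMS = {"Hiway": "Highway", "Crk": "Creek", "Colo": "Colorado"}

def standardize(text):
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

print(standardize("Hiway 9"))       # Highway 9
print(standardize("Boulder Crk"))   # Boulder Creek
```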

  40. Local consistency check Does each piece of data make sense on its own? • Boulder Creek and Zip Code 95006 are in California • State was listed as Colorado • 2/3 attributes point to California as the correct state. Change state to California.
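A sketch of that majority vote among the three state indicators (the lookup tables are tiny stand-ins for a real zip-code and city database):

```python
ZIP_STATE = {"95006": "California"}            # stand-in for a zip database
CITY_STATE = {"Boulder Creek": "California"}   # stand-in for a city database

record = {"city": "Boulder Creek", "state": "Colorado", "zip": "95006"}

votes = [ZIP_STATE[record["zip"]], CITY_STATE[record["city"]], record["state"]]
winner = max(set(votes), key=votes.count)      # 2 of 3 say California
if record["state"] != winner:
    print(f"changing state {record['state']!r} -> {winner!r}")
    record["state"] = winner
```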

  41. Global consistency check • Find either Ralph Kimball or Julianne Kimball in other customer records and ensure that all of the elements of all the addresses are identical. • This differs from the local consistency check because it looks at other records, not just the current one

  42. Documenting Document the results of semi-structuring, standardizing, and consistency checking in metadata • Important for users of the integrated database • Important for doing future updates of the database
