Advanced Database Systems F24DS2 / F29AT2

Advanced Database SystemsF24DS2 / F29AT2 Data Quality and Data Cleaning 2 David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Acknowledgements I adapted this material from various sources, most notably: • A ppt presentation called `Data Quality and Data Cleaning: An Overview’ by Tamrapani Dasu and Theodore Johnson, at AT & T Labs • A paper called `Data Cleaning: Problems and Current Approaches’, by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany. My thanks to these researchers for making their materials freely available online. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

What Keeps DBs Dirty A good DBMS will have built in tools for: Consistency in data types Consistency in field values Constraints and checks that deal with Null values, Outliers, Duplication. Automatic timestamps Powerful query language (makes retrieval logic errors less likely) … so, why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

… all this: • Consistency constraints are often not applied, or are applied! • suppose height is not allowed to go over 2 metres in a school student DB • My postcode problem • The data are just too numerous, complex and ill-understood. `Cleaning it’ would cost too much! • Undetectable problems: incorrect values, missing entries • Metadata not maintained properly David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Single Source vs Multiple Source Schema Level v Instance Level One useful way to categorize problems, independent of how we did so in the last lecture, is according to whether the problems are the sort we can get if we have just one source of data, or whether the problem arises directly from trying to combine data from multiple sources Problems can also be schema level or instance level David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Single Source / Schema level examples David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Single Source / Instance level examples David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Multiple Source Problems/ Instance and Schema level examples The Smiths buy books and music online from company A: They also buy books and music online from company B: David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

When Companies A and B merge, various problems arise when they merge their DBs Combining customer fields and client fields – are they really the same things? How to ensure that Company A’s customer 37 and Company B’s client 37 get separate entries in the new DB. Are Luke Smith and Luke Michael Smith the same person?? Do Luke and Leia live at the same address? Etc … A forced `fast resolution’ to these problems will usually lead to errors in the new `integrated’ DB David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

A Special but Common type of Problem: Semantic Complexity Semantic Complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent. E.g. Local Police DB keep record of all crimes in an area, where the key is the victim’s name. When someone who was a victim moves to a different area, they remove all records relating to that person.The local council use this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Semantic Complexity: Missing/Default Values One source of semantic complexity is the different meanings that missing values can have. E.g. Suppose the histogram of value types in mobile phone no. field is: David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

What does NULL mean? A. This record is of someone who does not have a mobile phone? B. This record is of someone who has a mobile phone, but chose not to supply the number? C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and recorded as NULL? Maybe some are of type A and some are of type B and some are of type C. For some applications/analyses, we may wish to know the breakdown into types. What about the All zero and All nine entries? Precisely the same can be said of them. Or, perhaps the protocols for recording the entries indicated NULL for type A, 0000000 for type B and 9999999 for type C. The above relate to a quite simple form of semantic complexity – but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Data Cleaning: Phases Phases in DC: Analysis: to detect errors and inconsistencies in the DB needs detailed analysis, involving both manual inspection and automated analysis programs. This reveals where (most of) the problems are. Defining transformation and mapping rules: Having found the problems, this next phase is concerned with defining the way you are going to automate solutions to clean the data David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Data Cleaning: phases continued Verification: In this phase we test and evaluate the transformation plans we made in stage 2; without this, we may end up making the data dirtier rather than cleaner. Transformation: Do the transformation, now that you’re sure it will be done correctly. Backflow of cleaned data: Do what we can to ensure that cleaned data percolates to various repositories that may still harbour errors. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Phases in DC: Data Analysis Data Profiling: examine the instances to see how the attributes vary. E.g. Automatically generate a histogram of values for that attribute. How does the histogram help us in finding problems in this case? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

What problems does this analysis alert us to? David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Phases in DC: Data Mining Data Mining is simply about more advanced forms of data analysis. We talk about that next week. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Phases in DC: Defining Data Transformation Rules As a result of the analysis phase, you will find various problems that translate to a list of actions, such as: • Remove all entries for J. Smith (duplicates of John Smith) • Find entries with `bule’ in colour field and change these to `blue’. • Output a list of all records where the Phone number field does not match the pattern (NNNNN NNNNNN) (further steps required to then cleanse these data) • Find all entries where the Name field contains a potential DoB string, and the DoB field is NULL, and then repair these entries. • Etc … David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Phases in DC: Verification This speaks for itself! Data transformation is the main step that actually changes the data itself – so you need to be sure you will do it correctly. So, test and examine the transformation plans very carefully. It is easy to mess the data up even more if you have a faulty transformation plan. • I have a very thick C++ book where it says strict in all the places where it should say struct David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Phases in DC: Transformation Go ahead and do it. For large DBs, this task is supported by a variety of tools (as also is data analysis, often in the same tool). The list is growing. E.g. DATACLEANSER is a specialist tool for identifying and eliminating duplicates. TRILLIUM focuses on cleaning name/address data. Such tools use a huge built-in library of rules for dealing with the common problems. Alternatively or additionally you can write your own code for specialised bits of cleaning (and then verify it!). David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Phases in DC: Backflow Once the `master’ source of data – perhaps a newly integrated DB, is `cleaned’, there is the opportunity to fix errors that may have spread beyond the DB before it was cleaned. This will be a very different and varied process in every case, and the results of the first Analysis stage should start to provide clues about what could be done here. Examples of such backflow can vary between: • Refunding 1 customer 12p because he was mischarged for postage owing to a faulty postcode entry • Removing £1,000,000,000’s worth of a brand of olive oil from supermarket shelves across Europe, since a DB (and hence the label) did not correctly indicate that it contains something dangerous to those with nut allergies. David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

What this lecture was about • Why DBs are almost always not `clean’ • A single source/multi-source and instance level/schema level classification of errors • Semantic Complexity • Five Phases in a corporate Data Cleaning process David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to contact me

Advanced Database Systems F24DS2 / F29AT2