Data Preprocessing
Relational Databases - Normalization • Denormalization • Data Preprocessing • Missing Data • Missing values and the 3VL approach • Problems with 3VL approach • Special Values
Remember: Relational Databases • Model entities and relationships • Entities are the things in the real world • Information about employees and the department they work for • Employee and department are entities • Relationships are the links between these entities • Employee works for a department
Relations or Tables • Relation: a table of data • table = relation (a term from set theory; the relational model is based on predicate logic)
Columns and Rows • Each column or attribute describes some piece of data that each record in the table has • Each row in a table represents a record • Rows, records, or tuples
Keys • A superkey is a column (or a set of columns) that can be used to identify a row in a table • A key is a minimal superkey • There may be several possible keys • candidate keys • From the candidate keys we choose the primary key • The primary key is used to identify a single row (record) • Foreign keys represent links between tables
Keys • primary key • foreign key
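A minimal sketch of how these keys could be declared in SQL (table and column names follow the running example; the types and the exact dialect are assumptions):

CREATE TABLE department (
  departmentID   INTEGER PRIMARY KEY,          -- primary key of department
  departmentName VARCHAR(100) NOT NULL
);

CREATE TABLE employee (
  employeeID   INTEGER PRIMARY KEY,             -- primary key of employee
  name         VARCHAR(100) NOT NULL,
  job          VARCHAR(100),
  departmentID INTEGER REFERENCES department(departmentID)   -- foreign key: the department the employee works for
);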
Functional Dependencies • If there is a functional dependency between columns A and B in a given table, which may be written A → B • Then the value of column A determines the value of column B • Example: employeeID functionally determines name
Schema • Database schema • The structure or design of the database, i.e. the database without any data in it • employee(employeeID, name, job, departmentID)
Design • Minimize redundancy • Redundancy: the same data is repeated in different rows • employee(employeeID, name, job, departmentID, departmentName)
Reduce redundancy • employee(employeeID, name, job, departmentID, departmentName) • employee(employeeID, name, job, departmentID) • department(departmentID, departmentName)
Insert Anomalies • Insert data into a flawed table • If the new data does not match what is already in the table, it is not obvious which of the rows in the database is correct
Deletion Anomalies • Delete data from a flawed schema • When we delete all the employees of department 128, we no longer have any record that department 128 exists
Update Anomalies • Change data in a flawed schema • If we do not change the data correctly in every affected row, the table becomes inconsistent
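A minimal sketch of an update anomaly against the flawed table employee(employeeID, name, job, departmentID, departmentName); the data values are hypothetical and the SQL is generic:

-- Department 128 is renamed, but only the rows that happen to match the old spelling are touched:
UPDATE employee
SET departmentName = 'Research and Development'
WHERE departmentID = 128 AND departmentName = 'R&D';
-- Rows for department 128 that stored a different spelling keep the old name,
-- so the same departmentID now appears with contradictory department names.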
Null Values • Avoid schema designs that have large numbers of empty attributes
Normalization • Remove design flaws from a database • Normal forms, which are a set of rules describing what we should and should not do in our table structures • Breaking tables into smaller tables that form a better design
Normal Forms • First Normal Form (1NF) • Second Normal Form (2NF) • Third Normal Form (3NF) • Boyce-Codd Normal Form (BCNF) • Fourth Normal Form (4NF) • Fifth Normal Form (5NF)
First Normal Form (1NF) • Each attribute or column value must be atomic • Each attribute must contain a single value
Second Normal Form (2NF) • All attributes that are not part of the primary key are fully dependent on the primary key • Each non-key attribute must be fully functionally dependent on the key • The table is already in 1NF
2NF ? employee(employeeID,name,job,departmentID,skill)
Functional dependencies • employeeID, skill → name, job, departmentID • employeeID → name, job, departmentID • name, job, and departmentID are therefore partially functionally dependent on the primary key (employeeID, skill) • They are not fully functionally dependent on the primary key
2NF • Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key • Breaking the table into two tables • employee(employeeID, name, job, departmentID) • employeeSkills(employeeID, skill)
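As a sketch, the new employeeSkills table could be declared like this (generic SQL; types are assumptions), with a composite primary key so that skill rows no longer repeat the employee's name, job, and department:

CREATE TABLE employeeSkills (
  employeeID INTEGER REFERENCES employee(employeeID),  -- links back to the employee table
  skill      VARCHAR(100),
  PRIMARY KEY (employeeID, skill)                      -- one row per employee/skill pair
);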
Third Normal Form (3NF) • Remove all transitive dependencies • Be in 2NF
employee(employeeID, name, job, departmentID, departmentName) • employeeID → name, job, departmentID, departmentName • departmentID → departmentName
Transitive dependency • employeeID → departmentName • because employeeID → departmentID • and departmentID → departmentName
3NF • Remove the transitive dependency • Decompose into multiple tables • employee(employeeID, name, job, departmentID) • department(departmentID, departmentName)
3NF • For every functional dependency, either • the left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal), which on its own is the Boyce-Codd Normal Form condition, or • the right side of the functional dependency is part of some key of the table
BCNF • All attributes must be functionally determined by a superkey
Full normalization means lots of logically separate relations • Lots of logically separate relations means lots of physically separate files • Lots of physically separate files means a lot of I/O • Difficulties in finding dimensions for a dimensional schema, star schema • (dimension tables, fact table)
What is Denormalization? • Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn • This reduces redundancy; each projection R1, R2, ..., Rn is at the highest possible level of normalization • Denormalizing the relational variables means replacing them by their join R • This increases redundancy, by ensuring that R is at a lower level of normalization than R1, R2, ..., Rn • Problems • Once we start to denormalize, it is not clear when to stop
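As an illustration (generic SQL; the result table name is hypothetical), denormalizing the employee and department tables is just materializing their join:

-- Replace the two normalized tables by a single, wider one;
-- the department name is repeated for every employee again:
CREATE TABLE employee_denormalized AS
SELECT e.employeeID, e.name, e.job, e.departmentID, d.departmentName
FROM employee e
JOIN department d ON d.departmentID = e.departmentID;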
Dimensional Schema • Array cells are often empty • The more dimensions, the more empty cells • Empty cell = missing information • How do we treat information that is not present? • How does the system support • information that is unknown • information that has not been captured • information that is not applicable • .... • Solution?
Why Data Preprocessing? • Data in the real world is dirty • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • e.g., occupation=“ ” • noisy: containing errors or outliers • e.g., Salary=“-10” • inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records
Why Is Data Dirty? • Incomplete data may come from • “Not applicable” data value when collected • Different considerations between the time when the data was collected and when it is analyzed. • Human/hardware/software problems • Noisy data (incorrect values) may come from • Faulty data collection instruments • Human or computer error at data entry • Errors in data transmission • Inconsistent data may come from • Different data sources • Functional dependency violation (e.g., modify some linked data) • Duplicate records also need data cleaning
Why Is Data Preprocessing Important? • No quality data, no quality mining results! • Quality decisions must be based on quality data • e.g., duplicate or missing data may cause incorrect or even misleading statistics. • Data warehouse needs consistent integration of quality data • Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality • A well-accepted multidimensional view: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility
Major Tasks in Data Preprocessing • Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration • Integration of multiple databases, data cubes, or files • Data transformation • Normalization and aggregation • Data reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization • Part of data reduction but with particular importance, especially for numerical data
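As one small example of the transformation step (here, normalization means rescaling attribute values, not the schema normalization discussed earlier), a sketch of min-max normalization in SQL, assuming a dialect with window functions such as PostgreSQL; the salary column is hypothetical:

-- Rescale salary into the range [0, 1]:
SELECT employeeID,
       (salary - MIN(salary) OVER ()) * 1.0
         / NULLIF(MAX(salary) OVER () - MIN(salary) OVER (), 0) AS salary_scaled
FROM employee;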
Data Cleaning • Importance • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball • “Data cleaning is the number one problem in data warehousing”—DCI survey • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Resolve redundancy caused by data integration
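A minimal sketch of one of these tasks, filling in missing values with the attribute mean (PostgreSQL-style SQL; the customer table and its income column are hypothetical):

-- Replace missing incomes by the average of the known incomes:
UPDATE customer
SET income = (SELECT AVG(income) FROM customer WHERE income IS NOT NULL)
WHERE income IS NULL;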
Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not have been considered important at the time of entry • history or changes of the data were not recorded • Missing data may need to be inferred
Missing Values • The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL) • null corresponds to UNK, for unknown • Is 3VL a mistake?
Boolean Operators • Any scalar comparison in which either of the comparands is UNK evaluates to the unknown truth value
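In SQL, where NULL plays the role of UNK, such a comparison makes the whole condition unknown and the row is simply not returned. A sketch, assuming an EMP table like the one used in the examples below:

SELECT * FROM EMP WHERE SALARY < 40000;        -- a row with NULL SALARY is not returned
SELECT * FROM EMP WHERE NOT (SALARY < 40000);  -- ...and it is not returned here either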
MAYBE • Another important Boolean operator is MAYBE • MAYBE (p) returns true if p evaluates to the unknown truth value, and false otherwise
Example • Consider the query "Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with salary less than €40,000" EMP WHERE MAYBE ( JOB = 'PROGRAMMER' AND DOB < DATE ('1971-1-18') AND SALARY < 40000 )
Without MAYBE, we assume the existence of another operator called IS_UNK, which takes a single scalar operand and returns true if that operand evaluates to UNK, and false otherwise EMP WHERE ( JOB = 'PROGRAMMER' OR IS_UNK (JOB) ) AND ( DOB < DATE ('1971-1-18') OR IS_UNK (DOB) ) AND ( SALARY < 40000 OR IS_UNK (SALARY) ) AND NOT ( JOB = 'PROGRAMMER' AND DOB < DATE ('1971-1-18') AND SALARY < 40000 )
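For comparison, one way to express the same intent in standard SQL is to let IS NULL play the role of IS_UNK; the last condition requires at least one of the three values to be missing, so the employee is not definitely known to satisfy all three (a sketch that deliberately keeps every subcondition two-valued, not a literal transcription of the expression above):

SELECT *
FROM EMP
WHERE ( JOB = 'PROGRAMMER' OR JOB IS NULL )
  AND ( DOB < DATE '1971-01-18' OR DOB IS NULL )
  AND ( SALARY < 40000 OR SALARY IS NULL )
  AND ( JOB IS NULL OR DOB IS NULL OR SALARY IS NULL );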
Numeric expressions • WEIGHT * 454 • If WEIGHT is UNK, then the result is also UNK • Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK • Anomalies • WEIGHT - WEIGHT = UNK (intuitively it should be 0) • WEIGHT / 0 = UNK (intuitively it should raise a "zero divide" error)
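SQL nulls behave the same way in arithmetic; a sketch assuming a hypothetical parts table P with a nullable WEIGHT column:

SELECT WEIGHT * 454     FROM P;   -- NULL whenever WEIGHT is NULL
SELECT WEIGHT - WEIGHT  FROM P;   -- also NULL, even though one might expect 0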
UNK is not u (unk) • UNK (the value-unknown null) • u (unk) (the unknown truth value) • ...are not the same thing • u is a value; UNK is not a value at all! • Suppose X is of type BOOLEAN • It has three values: t (true), f (false), u (unk) • If X is u, X is known to be unk • If X is UNK, X is not known at all!
Some 3VL Consequences • The comparison x = x does not necessarily give true • In 3VL, x is not equal to itself if it happens to be UNK • The Boolean expression p OR NOT (p) does not necessarily give true • unk OR NOT (unk) = unk
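The first consequence is easy to see with SQL nulls (a sketch; the table T and column x are hypothetical):

SELECT * FROM T WHERE x = x;   -- rows where x IS NULL are not returned: x = x is unknown, not true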
Example • Get all suppliers in Porto and take the union with all suppliers not in Porto • We do not get all suppliers! • We need to add "maybe in Porto" • p OR NOT (p) in 2VL corresponds to • p OR NOT (p) OR MAYBE (p) in 3VL • While two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; it contains only knowledge about the real world
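A sketch of the supplier example in SQL, assuming a hypothetical suppliers table S(s_id, city) with a nullable city column:

SELECT s_id FROM S WHERE city = 'Porto'
UNION
SELECT s_id FROM S WHERE city <> 'Porto';
-- Suppliers whose city is NULL appear in neither branch, because both comparisons are unknown.
-- To really get all suppliers, a third ("maybe") branch is needed:
-- UNION SELECT s_id FROM S WHERE city IS NULL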