
Data Preprocessing



  1. Data Preprocessing

  2. Relational Databases - Normalization • Denormalization • Data Preprocessing • Missing Data • Missing values and the 3VL approach • Problems with 3VL approach • Special Values

  3. Remember: Relational Databases • Model entities and relationships • Entities are the things in the real world • Information about employees and the department they work for • Employee and department are entities • Relationships are the links between these entities • Employee works for a department

  4. Relations or Tables • Relation: a table of data • In relational terminology, table = relation, a term from set theory, grounded in predicate logic

  5. Columns and Rows • Each column or attribute describes some piece of data that each record in the table has • Each row in a table represents a record • Rows, records, and tuples are synonyms

  6. Keys • A superkey is a column (or a set of columns) that can be used to identify a row in a table • A key is a minimal superkey • There may be several possible keys, the candidate keys • From the candidate keys we choose the primary key • The primary key is used to identify a single row (record) • Foreign keys represent links between tables

  7. Keys • primary key • foreign key

  8. Functional Dependencies • If there is a functional dependency between columns A and B in a given table, which may be written A → B • Then the value of column A determines the value of column B • Example: employeeID functionally determines the name, employeeID → name

  9. Schema • Database schema • The structure or design of the database • The database without any data in it, e.g., employee(employeeID,name,job,departmentID)
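
  In standard SQL, a sketch of this schema's declaration could look as follows (the table and column names follow the slides, and the department table anticipates slide 27; the data types and sizes are assumptions):

      CREATE TABLE department (
        departmentID   INTEGER PRIMARY KEY,
        departmentName VARCHAR(100)
      );

      CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,   -- primary key: identifies a single row
        name         VARCHAR(100),
        job          VARCHAR(50),
        departmentID INTEGER REFERENCES department(departmentID)   -- foreign key: link between tables
      );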

  10. Design • Minimize redundancy • Redundancy: data is repeated in different rows, e.g., the department name in employee(employeeID,name,job,departmentID,departmentName)

  11. Reduce redundancy • employee(employeeID,name,job,departmentID,departmentName) decomposes into • employee(employeeID,name,job,departmentID) • department(departmentID,departmentName)

  12. Insert Anomalies • Insert data into a flawed table • The new data may not match what is already in the table • It is not obvious which of the rows in the database is correct

  13. Deletion Anomalies • Delete data from a flawed schema • When we delete all the employees of Department 128, we no longer have any record that Department 128 exists

  14. Update Anomalies • Change data in a flawed schema • The same fact is stored in several rows, so unless we change every one of them, the table becomes inconsistent (see the sketch below)
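
  To make the three anomalies concrete, here is a sketch against the flawed table employee(employeeID,name,job,departmentID,departmentName) from slide 10 (standard SQL; all sample values are invented for illustration):

      -- Insert anomaly: the new row may contradict what other rows
      -- already say about department 128
      INSERT INTO employee VALUES (42, 'Ana', 'Clerk', 128, 'Marketing');

      -- Deletion anomaly: removing the last employees of department 128
      -- also removes the only record that department 128 exists
      DELETE FROM employee WHERE departmentID = 128;

      -- Update anomaly: the rename touches only one of the rows that
      -- repeat the department name, leaving the table inconsistent
      UPDATE employee SET departmentName = 'Distribution' WHERE employeeID = 42;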

  15. Null Values • Avoid schema designs that have large numbers of empty attributes

  16. Normalization • Remove design flaws from a database • Normal forms, which are a set of rules describing what we should and should not do in our table structures • Breaking tables into smaller tables that form a better design

  17. Normal Forms • First Normal Form (1NF) • Second Normal Form (2NF) • Third Normal Form (3NF) • Boyce-Codd Normal Form (BCNF) • Fourth Normal Form (4NF) • Fifth Normal Form (5NF)

  18. First Normal Form (1NF) • Each attribute or column value must be atomic • Each attribute must contain a single value
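
  For illustration (a sketch; the phone attribute is invented, not from the slides), a table that packs several values into one attribute violates 1NF; the repair stores one atomic value per row:

      -- Violates 1NF: phones = '555-1234, 555-9876' holds two values in one attribute
      --   employeePhones(employeeID, name, phones)

      -- In 1NF: one atomic value per row
      CREATE TABLE employeePhone (
        employeeID INTEGER,
        phone      VARCHAR(20),
        PRIMARY KEY (employeeID, phone)
      );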

  19. 1NF

  20. Second Normal Form (2NF) • All attributes that are not part of the primary key are fully dependent on the primary key • Each non-key attribute must be fully functionally dependent on the key • The table must already be in 1NF

  21. 2NF? • Is this table in 2NF? employee(employeeID,name,job,departmentID,skill) • Primary key: (employeeID, skill)

  22. Functional dependencies • employeeID, skill → name, job, departmentID • employeeID → name, job, departmentID • name, job, and departmentID are only partially functionally dependent on the primary key • They are not fully functionally dependent on the primary key

  23. 2NF • Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key • Break the table into two tables • employee(employeeID,name,job,departmentID) • employeeSkills(employeeID,skill)
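
  In SQL DDL, the second table of the decomposition could be sketched as follows (data types are assumptions; employee keeps the attributes that depend on employeeID alone):

      CREATE TABLE employeeSkills (
        employeeID INTEGER REFERENCES employee(employeeID),
        skill      VARCHAR(50),
        PRIMARY KEY (employeeID, skill)   -- no non-key attribute is left to depend
      );                                  -- only partially on the composite key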

  24. Third Normal Form (3NF) • Remove all transitive dependencies • Be in 2NF

  25. employee(employeeID,name,job,departmentID,departmentName) • employeeID → name, job, departmentID, departmentName • departmentID → departmentName

  26. Transitive dependency • employeeID → departmentName, because • employeeID → departmentID • departmentID → departmentName

  27. 3NF • Remove the transitive dependency • Decompose into multiple tables • employee(employeeID,name,job,departmentID) • department(departmentID,departmentName)

  28. 3NF • A table is in 3NF if, for every nontrivial functional dependency, either • the left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal), or • the right side of the functional dependency is part of some key of the table • Boyce-Codd Normal Form (BCNF) drops the second alternative

  29. BCNF • All attributes must be functionally determined by a superkey

  30. Full normalization means lots of logically separate relations • Lots of logically separate relations means lots of physically separate files • Lots of physically separate files means a lot of I/O • Difficulties in finding dimensions for a dimensional schema / star schema (dimension tables, fact table)

  31. What is Denormalization? • Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn • This reduces redundancy: each projection R1, R2, ..., Rn is at the highest possible level of normalization • Denormalizing the relational variables means replacing them by their join R • This increases redundancy: R is at a lower level of normalization than R1, R2, ..., Rn • Problem • Once we start to denormalize, it is not clear when to stop

  32. Dimensional Schema • Array cells are often empty • The more dimensions, the more empty cells • Empty cell → missing information • How do we treat information that is not present? • How does the system support • information that is unknown • information that has not been captured • information that is not applicable • ... • Solution?

  33. Why Data Preprocessing? • Data in the real world is dirty • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • e.g., occupation=“ ” • noisy: containing errors or outliers • e.g., Salary=“-10” • inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1,2,3”, now rating “A, B, C” • e.g., discrepancy between duplicate records
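
  Discrepancies of this kind can often be flagged with a simple query; a sketch in standard SQL (the customer table and its columns are invented for illustration):

      -- Flag rows whose stored age disagrees with the age implied by the birthday
      -- (coarse check: ignores whether the birthday has already occurred this year)
      SELECT customerID, age, birthday
      FROM   customer
      WHERE  age <> EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birthday);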

  34. Why Is Data Dirty? • Incomplete data may come from • “Not applicable” data value when collected • Different considerations between the time when the data was collected and when it is analyzed. • Human/hardware/software problems • Noisy data (incorrect values) may come from • Faulty data collection instruments • Human or computer error at data entry • Errors in data transmission • Inconsistent data may come from • Different data sources • Functional dependency violation (e.g., modify some linked data) • Duplicate records also need data cleaning

  35. Why Is Data Preprocessing Important? • No quality data, no quality mining results! • Quality decisions must be based on quality data • e.g., duplicate or missing data may cause incorrect or even misleading statistics • A data warehouse needs consistent integration of quality data • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

  36. Multi-Dimensional Measure of Data Quality • A well-accepted multidimensional view: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility

  37. Major Tasks in Data Preprocessing • Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration • Integration of multiple databases, data cubes, or files • Data transformation • Normalization and aggregation • Data reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization • Part of data reduction but with particular importance, especially for numerical data
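
  As an example of the transformation step (here "normalization" means scaling values, not the relational normal forms above), a min-max normalization can be written with standard SQL window aggregates; a sketch assuming an employee table with a numeric salary column:

      -- Scale salary to [0, 1]; NULLIF avoids a zero divide when all salaries are equal
      SELECT employeeID,
             (salary - MIN(salary) OVER ()) * 1.0
               / NULLIF(MAX(salary) OVER () - MIN(salary) OVER (), 0) AS salary_scaled
      FROM   employee;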

  38. Forms of Data Preprocessing

  39. Data Cleaning • Importance • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball • “Data cleaning is the number one problem in data warehousing”—DCI survey • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Resolve redundancy caused by data integration
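
  The first task, filling in missing values, can be done directly in SQL, e.g. replacing missing incomes by the mean of the known ones (a sketch; the customer table and income column are invented, and some dialects restrict subqueries on the table being updated):

      UPDATE customer
      SET    income = (SELECT AVG(income) FROM customer)   -- AVG ignores NULLs
      WHERE  income IS NULL;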

  40. Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • data that was inconsistent with other recorded data and was thus deleted • data not entered due to misunderstanding • certain data not being considered important at the time of entry • history or changes of the data not being registered • Missing data may need to be inferred

  41. Missing Values • The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL) • null corresponds to UNK, for unknown • Is 3VL a mistake?

  42. Boolean Operators • Any scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value (unk)
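
  The familiar 3VL truth tables (t = true, u = unk, f = false) make this concrete:

      AND | t  u  f       OR | t  u  f       NOT |
      ----+---------     ----+---------     -----+---
        t | t  u  f        t | t  t  t         t | f
        u | u  u  f        u | t  u  u         u | u
        f | f  f  f        f | t  u  f         f | t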

  43. MAYBE • Another important Boolean operator is MAYBE • MAYBE(p) evaluates to true exactly when p evaluates to unk, and to false otherwise

  44. Example • Consider the query "Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with a salary less than €40,000"

      EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’
                        AND DOB < DATE (‘1971-1-18’)
                        AND SALARY < 40000 )

  45. Without MAYBE, we can assume the existence of another operator, called IS_UNK, which takes a single scalar operand and returns true if the operand evaluates to UNK, and false otherwise

      EMP WHERE ( JOB = ‘PROGRAMMER’ OR IS_UNK (JOB) )
            AND ( DOB < DATE (‘1971-1-18’) OR IS_UNK (DOB) )
            AND ( SALARY < 40000 OR IS_UNK (SALARY) )
            AND NOT ( JOB = ‘PROGRAMMER’
                      AND DOB < DATE (‘1971-1-18’)
                      AND SALARY < 40000 )
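
  SQL itself offers a direct counterpart: the truth-value test IS UNKNOWN plays the role of MAYBE (a sketch; the table and columns follow the slides):

      -- The row qualifies exactly when the condition evaluates to unknown,
      -- i.e. when it may hold but is not definitely known to hold
      SELECT *
      FROM   EMP
      WHERE  ( JOB = 'PROGRAMMER'
               AND DOB < DATE '1971-01-18'
               AND SALARY < 40000 ) IS UNKNOWN;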

  46. Numeric expressions • WEIGHT * 454 • If WEIGHT is UNK, then the result is also UNK • Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK • Anomalies • WEIGHT - WEIGHT = UNK (instead of 0) • WEIGHT / 0 = UNK (instead of a "zero divide" error)
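
  In SQL, where UNK surfaces as NULL, these anomalies are easy to reproduce (a sketch; a parts table with a nullable WEIGHT column is assumed):

      SELECT WEIGHT * 454    FROM parts;   -- NULL whenever WEIGHT is NULL
      SELECT WEIGHT - WEIGHT FROM parts;   -- also NULL, although 0 might be expected
      SELECT WEIGHT / 0      FROM parts;   -- where operators are strict (e.g. PostgreSQL),
                                           -- a NULL WEIGHT yields NULL, not a zero-divide error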

  47. UNK is not u (unk) • UNK (the value-unknown null) and u (unk, the unknown truth value) are not the same thing • u is a value; UNK is not a value at all! • Suppose X is of type BOOLEAN • It has three values: t (true), f (false), u (unk) • If X is unk, then X is known: it is known to be unk • If X is UNK, then X is not known at all!

  48. Some 3VL Consequences • The comparison x = x does not necessarily give true • In 3VL, x is not equal to itself if x happens to be UNK • The Boolean expression p OR NOT(p) does not necessarily give true • unk OR NOT(unk) = unk
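
  Both consequences can be observed directly in SQL (a sketch; the EMP table follows the earlier slides):

      -- x = x does not give true when x is NULL
      SELECT CASE WHEN NULL = NULL THEN 'true' ELSE 'not true' END;   -- 'not true'

      -- p OR NOT(p) is not necessarily true: rows where JOB is NULL are not returned
      SELECT * FROM EMP WHERE JOB = 'PROGRAMMER' OR NOT ( JOB = 'PROGRAMMER' );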

  49. Example • Get all suppliers in Porto and take the union with all suppliers not in Porto • We do not get all suppliers! • We also need the suppliers who are maybe in Porto • 2VL's p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL • While the two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; it contains only knowledge about the real world
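
  In SQL terms (a sketch; a supplier table with a nullable CITY column is assumed):

      SELECT * FROM supplier WHERE CITY = 'Porto'
      UNION
      SELECT * FROM supplier WHERE NOT ( CITY = 'Porto' );
      -- suppliers whose CITY is NULL appear in neither branch

      -- the "maybe in Porto" suppliers restore them:
      SELECT * FROM supplier WHERE ( CITY = 'Porto' ) IS UNKNOWN;   -- i.e. CITY IS NULL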
