# Data Preprocessing


### Data Preprocessing

Relational Databases - Normalization
• Denormalization
• Data Preprocessing
• Missing Data
• Missing values and the 3VL approach
• Problems with 3VL approach
• Special Values
Remember: Relational Databases
• Model entities and relationships
• Entities are the things in the real world
• Information about employees and the department they work for
• Employee and department are entities
• Relationships are the links between these entities
• Employee works for a department
Relations or Tables
• Relation: a table of data
• table = relation (a concept from set theory, based on predicate logic)
Columns and Rows
• Each column or attribute describes some piece of data that each record in the table has
• Each row in a table represents a record
• Rows are also called records or tuples
Keys
• A superkey is a column (or a set of columns) that can be used to identify a row in a table
• A key is a minimal superkey
• There may be several possible keys
• These are called candidate keys
• We choose the primary key from among the candidate keys
• The primary key is used to identify a single row (record)
• Foreign keys represent links between tables
Keys
• primary key
• foreign key
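As a sketch of how primary and foreign keys look in practice, the slides' employee/department schema can be declared in SQLite (part of the Python standard library); the data values here are illustrative assumptions:

```python
import sqlite3

# Minimal sketch of primary and foreign keys; table and column names follow
# the slides' employee/department example, the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is set

conn.execute("""
    CREATE TABLE department (
        departmentID INTEGER PRIMARY KEY,  -- primary key: identifies a single row
        name         TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT,
        job          TEXT,
        -- foreign key: the link from employee to department
        departmentID INTEGER REFERENCES department(departmentID)
    )""")

conn.execute("INSERT INTO department VALUES (128, 'Research')")
conn.execute("INSERT INTO employee VALUES (1, 'Ana', 'Programmer', 128)")

# The foreign key rejects a link to a department that does not exist
rejected = False
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Rui', 'Analyst', 999)")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```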
Functional Dependencies
• If there is a functional dependency between columns A and B in a given table, it may be written A → B
• Then the value of column A determines the value of column B
• Example: employeeID functionally determines name (employeeID → name)
Schema
• Database schema
• Structure or design of the database
• Database without any data in it

employee(employeeID,name,job,departmentID)

Design
• Minimize redundancy
• Redundancy: data is repeated in different rows

employee(employeeID,name,job,departmentID,departmentName)

Reduce redundancy

employee(employeeID,name,job,departmentID,departmentName)

employee(employeeID,name,job,departmentID)

department(departmentID,name)

Insert Anomalies
• Inserting data into a flawed table
• The new data may not match what is already in the table
• It is then not obvious which of the rows in the database is correct
Deletion Anomalies
• Deleting data from a flawed schema
• When we delete all the employees of Department 128, we no longer have any record that Department 128 exists
Update Anomalies
• Changing data in a flawed schema
• If we do not change the data in every affected row, the rows become inconsistent
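The update and deletion anomalies can be observed directly on the flawed schema from the earlier slides (departmentName repeated in every employee row); a small sketch with SQLite and made-up data:

```python
import sqlite3

# Flawed schema: departmentName is stored redundantly in every employee row.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    employeeID INTEGER PRIMARY KEY, name TEXT, job TEXT,
    departmentID INTEGER, departmentName TEXT)""")
conn.executemany("INSERT INTO employee VALUES (?,?,?,?,?)", [
    (1, "Ana", "Programmer", 128, "Research"),
    (2, "Rui", "Analyst",    128, "Research"),
    (3, "Eva", "Manager",     42, "Sales"),
])

# Update anomaly: department 128 is renamed, but only in one row
conn.execute("UPDATE employee SET departmentName = 'R&D' WHERE employeeID = 1")
names = conn.execute(
    "SELECT DISTINCT departmentName FROM employee WHERE departmentID = 128").fetchall()
print(names)  # two different names for the same department

# Deletion anomaly: deleting the last Sales employee loses department 42 entirely
conn.execute("DELETE FROM employee WHERE departmentID = 42")
left = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE departmentID = 42").fetchone()[0]
print(left)  # 0: no record that department 42 ever existed
```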
Null Values
• Avoid schema designs that have large numbers of empty attributes
Normalization
• Remove design flaws from a database
• Normal forms, which are a set of rules describing what we should and should not do in our table structures
• Breaking tables into smaller tables that form a better design
Normal Forms

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)

First Normal Form (1NF)
• Each attribute or column value must be atomic
• Each attribute must contain a single value
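A non-atomic attribute and its repair can be sketched in a few lines of Python; the sample rows and the comma-separated `skills` column are illustrative assumptions, not from the slides:

```python
# The skills cell below holds several values at once, which violates 1NF.
employees = [
    {"employeeID": 1, "name": "Ana", "skills": "SQL,Python"},
    {"employeeID": 2, "name": "Rui", "skills": "Java"},
]

# Bring the data into 1NF: one row per (employeeID, skill) pair, every cell atomic
employee_skills = [
    {"employeeID": e["employeeID"], "skill": s}
    for e in employees
    for s in e["skills"].split(",")
]
print(employee_skills)  # three rows, one skill each
```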
Second Normal Form (2NF)
• All attributes that are not part of the primary key are fully dependent on the primary key
• Each non key attribute must be functionally dependent on the key
2NF ?

employee(employeeID,name,job,departmentID,skill)

Functional dependencies
• employeeID, skill → name, job, departmentID
• employeeID → name, job, departmentID
• name, job and departmentID are therefore only partially functionally dependent on the primary key (employeeID, skill)
• They are not fully functionally dependent on the primary key
2NF
• Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key
• Breaking the table into two tables
• employee(employeeID,name,job,departmentID)
• employeeSkills(employeeID,skill)
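The 2NF decomposition above can be sketched as plain set projections (the sample rows are illustrative):

```python
# Rows of employee(employeeID, name, job, departmentID, skill); the key is
# (employeeID, skill), but name, job and departmentID depend on employeeID alone.
rows = [
    (1, "Ana", "Programmer", 128, "SQL"),
    (1, "Ana", "Programmer", 128, "Python"),
    (2, "Rui", "Analyst",    128, "Java"),
]

# Projection employee(employeeID, name, job, departmentID): using a set
# removes the duplicate rows caused by the partial dependency
employee = sorted({(eid, name, job, dep) for (eid, name, job, dep, _) in rows})

# Projection employeeSkills(employeeID, skill)
employee_skills = sorted({(eid, skill) for (eid, _, _, _, skill) in rows})

print(employee)         # 2 rows instead of 3
print(employee_skills)  # 3 rows
```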
Third Normal Form (3NF)
• Remove all transitive dependencies
• Be in 2NF
employee(employeeID,name,job,departmentID,departmentName)
• employeeID  name,job,departmentID,departmentName
• departmentID  departmentName
Transitive dependency
• employeeID  departmentName
• employeeID  deparmtentID
• departmentID  departmentName
3NF
• Remove transitive dependency
• Decompose into multiple tables
• employee(employeeID,name,job,departmentID)
• department(departmentID,departmentName)
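A useful sanity check on such a decomposition is that joining the projections rebuilds the original relation (a lossless join); a sketch with illustrative data:

```python
# employee(employeeID, name, job, departmentID, departmentName):
# departmentName depends transitively on employeeID via departmentID.
rows = {
    (1, "Ana", "Programmer", 128, "Research"),
    (2, "Rui", "Analyst",    128, "Research"),
}

# 3NF decomposition into two projections
employee   = {(eid, name, job, dep) for (eid, name, job, dep, _) in rows}
department = {(dep, dname) for (_, _, _, dep, dname) in rows}

# Lossless-join check: the natural join on departmentID rebuilds the original
rejoined = {
    (eid, name, job, dep, dname)
    for (eid, name, job, dep) in employee
    for (dep2, dname) in department
    if dep == dep2
}
print(rejoined == rows)  # True: the decomposition loses no information
```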
3NF
• For every functional dependency, the left side is a superkey (that is, a key that is not necessarily minimal)
• If this holds with no exception, the table is even in Boyce-Codd Normal Form

or

• The right side of the functional dependency is part of some key of the table
BCNF
• All attributes must be functionally determined by a superkey
• Lots of logically separate relations means lots of physically separate files
• Lots of physically separate files means a lot of I/O
• It also makes it difficult to find the dimensions for a dimensional schema (star schema)
• (dimension tables, fact table)
What is Denormalization?
• Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn
• This reduces redundancy; each projection R1, R2, ..., Rn is at the highest possible level of normalization
• Denormalizing the relational variables means replacing them by their join R
• This increases redundancy: R is at a lower level of normalization than R1, R2, ..., Rn
• Problem
• Once we start to denormalize, it is not clear when to stop
Dimensional Schema
• Array cells are often empty
• The more dimensions, the more empty cells
• Empty cell → missing information
• How should information that is not present be treated?
• How does the system support information that
• is unknown
• has not been captured
• is not applicable
• ....
• Solution?
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
• Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
• Accessibility
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains a reduced representation of the data that is smaller in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred
Missing Values
• The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)
• null corresponds to UNK for unknown
• 3VL a mistake?
Boolean Operators
• A scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value
MAYBE
• Another important Boolean operator is MAYBE: MAYBE(p) is true exactly when p evaluates to the unknown truth value
Example
• Consider the query “Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with salary less than €40,000”

EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND
                  DOB < DATE (‘1971-1-18’) AND
                  SALARY < 40000 )

Without MAYBE, we assume the existence of another operator, IS_UNK, which takes a single scalar operand and returns true if that operand evaluates to UNK, false otherwise. The same query then becomes:

EMP WHERE ( JOB = ‘PROGRAMMER’ OR IS_UNK (JOB) )
    AND ( DOB < DATE (‘1971-1-18’) OR IS_UNK (DOB) )
    AND ( SALARY < 40000 OR IS_UNK (SALARY) )
    AND NOT ( JOB = ‘PROGRAMMER’ AND
              DOB < DATE (‘1971-1-18’) AND
              SALARY < 40000 )
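The equivalence between the two formulations can be sketched in Python, with None standing in for UNK; the helper names (`is_unk`, `maybe_programmer`, `expanded`) and the sample employee are illustrative assumptions, not from the slides:

```python
from datetime import date

CUTOFF = date(1971, 1, 18)

def is_unk(x):
    return x is None  # None plays the role of the value-unknown null UNK

def maybe_programmer(job, dob, salary):
    # 3VL evaluation of MAYBE(p AND q AND r): each conjunct is True, False,
    # or None (unk); the conjunction is definitely false if any conjunct is
    # False, and MAYBE holds if no conjunct is False but at least one is unk.
    conjuncts = [
        None if is_unk(job) else job == "PROGRAMMER",
        None if is_unk(dob) else dob < CUTOFF,
        None if is_unk(salary) else salary < 40000,
    ]
    if any(c is False for c in conjuncts):
        return False
    return any(c is None for c in conjuncts)

def expanded(job, dob, salary):
    # The IS_UNK expansion: each condition "may hold", but not all three
    # definitely hold.
    return ((is_unk(job) or job == "PROGRAMMER")
            and (is_unk(dob) or dob < CUTOFF)
            and (is_unk(salary) or salary < 40000)
            and not (not is_unk(job) and job == "PROGRAMMER"
                     and not is_unk(dob) and dob < CUTOFF
                     and not is_unk(salary) and salary < 40000))

emp = ("PROGRAMMER", None, 35000)  # date of birth unknown
print(maybe_programmer(*emp), expanded(*emp))  # True True
```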

Numeric expressions
• WEIGHT * 454
• If WEIGHT is UNK, then the result is also UNK
• Any numeric expression evaluates to UNK if any operand of that expression is itself UNK
• Anomalies
• WEIGHT - WEIGHT = UNK (we might expect 0)
• WEIGHT / 0 = UNK (we might expect a “zero divide” error)
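A closely related behaviour can be observed with IEEE-754 NaN, which many systems use to represent a missing numeric value:

```python
import math

# NaN shows the same anomalies as UNK in numeric expressions.
weight = float("nan")

print(weight * 454)      # nan: any operand being NaN makes the result NaN
print(weight - weight)   # nan, although algebraically we might expect 0
print(weight == weight)  # False: like UNK, NaN is not even equal to itself
```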
UNK is not unk
• UNK (the value-unknown null)
• unk (the unknown truth value)
• ...are not the same thing
• unk is a value; UNK is not a value at all!
• Suppose X is BOOLEAN
• It has three values: t (true), f (false), unk (unknown)
• If X is unk, X is known to be unk
• If X is UNK, the value of X is not known at all!
Some 3VL Consequences
• The comparison x = x does not necessarily give true
• In 3VL, x is not equal to itself if x happens to be UNK
• The Boolean expression p OR NOT(p) does not necessarily give true
• unk OR NOT (unk) = unk
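Both consequences can be checked in any SQL system that follows the standard's 3VL; a sketch using SQLite, where a NULL result is the unk truth value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
tautology = conn.execute("SELECT NULL OR NOT NULL").fetchone()[0]

print(eq)         # None: x = x is unk, not true, when x is NULL
print(tautology)  # None: p OR NOT(p) is unk when p is unk
```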
Example
• Get all suppliers in Porto and take the union with all suppliers not in Porto
• We do not get all suppliers!
• We also need to add the suppliers who are maybe in Porto
• In 2VL, p OR NOT(p) corresponds to
• p OR NOT(p) OR MAYBE(p) in 3VL
• While the two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; it contains only knowledge about the real world
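The supplier example can be reproduced directly in SQL; the table layout and data below are illustrative (supplier 3's city is unknown, i.e. NULL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (supplierID INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [(1, "Porto"), (2, "Lisboa"), (3, None)])

# "in Porto" UNION "not in Porto" does NOT return all suppliers
both = conn.execute("""
    SELECT supplierID FROM supplier WHERE city = 'Porto'
    UNION
    SELECT supplierID FROM supplier WHERE city <> 'Porto'
""").fetchall()
print(sorted(both))  # supplier 3 is missing: for it, both comparisons are unk

# The "maybe in Porto" suppliers have to be added explicitly
maybe_porto = conn.execute(
    "SELECT supplierID FROM supplier WHERE city IS NULL").fetchall()
print(maybe_porto)
```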
Some 3VL Consequences
• The expression r JOIN r does not necessarily give r
• A = B and B = C together do not imply A = C
• ....
• Many equivalences that are valid in 2VL break down in 3VL
• We will get wrong answers
Special Values
• Drop the idea of nulls, UNK, unk, and 3VL
• Use special values instead to represent missing information
• Special values are used in the real world
• In the real world we might use the special value “?” to denote the hours worked by a certain employee if the actual value is unknown
Special Values
• General Idea:
• Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used
• The special value must be part of the applicable attribute's domain: the domain is then not just, say, the integers, but the integers plus whatever the special value is
• The approach is not very elegant, but it avoids the 3VL problems, because it stays within 2VL
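A minimal sketch of the special-values approach: a sentinel instead of a null keeps every query in plain 2VL. The sentinel -1 and the sample data are assumptions for illustration (hours worked are never negative, so -1 is distinct from all regular values):

```python
HOURS_UNKNOWN = -1  # special value, outside the regular domain of hours

hours_worked = {"Ana": 40, "Rui": HOURS_UNKNOWN, "Eva": 38}

# Every comparison below is plain true or false; no third truth value appears
known   = {name: h for name, h in hours_worked.items() if h != HOURS_UNKNOWN}
unknown = [name for name, h in hours_worked.items() if h == HOURS_UNKNOWN]
over_39 = [name for name, h in known.items() if h > 39]

print(known)    # {'Ana': 40, 'Eva': 38}
print(unknown)  # ['Rui']
print(over_39)  # ['Ana']
```

The cost, as the slide notes, is that every query must know about the sentinel and exclude it explicitly.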
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree
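The mean-based fill strategies can be sketched with the standard library alone; the customer records below (class label plus a sometimes-missing income) are hypothetical:

```python
from statistics import mean

rows = [
    {"class": "A", "income": 50}, {"class": "A", "income": None},
    {"class": "B", "income": 20}, {"class": "B", "income": 30},
    {"class": "B", "income": None},
]

# Strategy 1: the global attribute mean
global_mean = mean(r["income"] for r in rows if r["income"] is not None)

# Strategy 2: the attribute mean per class (smarter: it uses the class label)
class_means = {
    c: mean(r["income"] for r in rows
            if r["class"] == c and r["income"] is not None)
    for c in {r["class"] for r in rows}
}

# Fill each missing income with its class mean
filled = [dict(r, income=class_means[r["class"]] if r["income"] is None
               else r["income"]) for r in rows]

print(global_mean)   # mean of 50, 20, 30
print(class_means)   # class A -> 50, class B -> 25
print(filled)
```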
Relational Databases - Normalization
• Denormalization
• Data Preprocessing
• Missing Data
• Missing values and the 3VL approach
• Problems with 3VL approach
• Special Values
Next..
• Data Preprocessing
• Visual inspection
• Noise Reduction
• Data Reduction
• Data Discretization
• Data Integration