Data preprocessing
1 / 56

Data Preprocessing - PowerPoint PPT Presentation

  • Uploaded on

Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values. Remember: Relational Databases. Model entities and relationships

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' Data Preprocessing' - mariko-curry

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Remember relational databases
Remember: Relational Databases

  • Model entities and relationships

  • Entities are the things in the real world

  • Information about employees and the department they work for

    • Employee and department are entities

  • Relationships are the links between these entities

    • Employee works for a department

Relation or tables
Relation or Tables

  • Relation: a table of data

    • table = relation,(set theory, based on predicate logic)

Columns and rows
Columns and Rows

  • Each column or attribute describes some piece of data that each record in the table has

  • Each row in a table represents a record

    • Rows, records or tupels


  • A superkey is a column (or a set of columns) that can be used to identify a row in as table

    • a key is a minimal superkey

  • There are different possible keys

    • candidate keys

  • We chose form the candidate keys the primary key

  • Primary key is used to identify a single row (record)

  • Foreign keys represents links between tables


  • primary key

  • foreign key

Functional dependencies
Functional Dependencies

  • If there is a functional dependency between columns A and B in a given table which may be written

  • Then the value of column A determines the value of column B

  • employeeID functionally determines the name


  • Database schema

    • Structure or design of the database

    • Database without any data in it



  • Minimize redundancy

    • Redundancy: data is repeated in different rows


Reduce redudancy
Reduce redudancy




Insert anomalies
Insert Anomalies

  • Insert data into flawed table

  • Data does not match what is already in the table

  • It is not obvious which of the rows in the database is correct

Deletion anomalies
Deletion Anomalies

  • Delete data from a flawed schema

  • When we delete all the employees of Department 128, we no longer have any record, that the Department 128 exists

Update anomalies
Update Anomalies

  • Change data in a flawed schema

  • We do not change the data for every row correctly

Null values
Null Values

  • Avoid schema designs that have large numbers of empty attributes


  • Remove design flaws from a database

  • Normal forms, which are a set of rules describing what we should and should not do in our table structures

  • Breaking tables into smaller tables that form a better design

Normal forms
Normal Forms

1 Forma Normal

2 Forma Normal

3 Forma Normal

Forma Normal Boyce Codd

4 Forma Normal

5 Forma Normal

First normal form 1nf
First Normal Form (1NF)

  • Each attribute or column value must be atomic

  • Each attribute must contain a single value

Second normal form 2nf
Second Normal Form (2NF)

  • All attributes that are no part of the primary key are fully dependent on the primary key

    • Each non key attribute must be functionally dependent on the key

  • Is already in 1NF

2NF ?


Functional dependencies1
Functional dependencies

  • employeeID,skill  name, job, deparmentID

  • employeeID  name, job, deparmentID

    • Partially functionally dependent on the primary key

    • Not fully functionally dependent on the primary key


  • Decompose the table into tables which all the non-key attributes are fully functionally dependent on the key

  • Breaking the table into two tables

    • employee(employeeID,name,job,departmentID)

    • employeeSkills(employeeID,skill)

Third normal form 3nf
Third Normal Form (3NF)

  • Remove all transitive dependencies

  • Be in 2NF

  • employee(employeeID,name,job,departmentID,departmentName)

    • employeeID  name,job,departmentID,departmentName

    • departmentID  departmentName

Transitive dependency
Transitive dependency

  • employeeID  departmentName

  • employeeID  deparmtentID

  • departmentID  departmentName


  • Remove transitive dependency

  • Decompose into multiple tables

  • emploee(employeeID,name,jop,departmentID)

  • deparment(deparmentID,deparmtentName)


  • The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)

    • Boyce-Codd Normal Form


  • The right side of the functional dependency is a part of any key of the table


  • All attributes must be functionally determined by a superkey

  • Full normalization means lots of logically seperate relations

  • Lots of logically separate relations means a lot of physically separate files

  • Lots of physically separate files means a lot of I/O

  • Difficulties in finding dimensions for dimensional schema, star schema

    • (dimension tables, fact table)

What is denormalization
What is Denormalization? relations

  • Normalizing a relational variable R means replacing R by a set of projections R1,R2,..,Rn such that R is equal to the join R1,R2,..,Rn

    • Reduce redundancy, each projections R1,R2,..,Rn is at the highest possible value of normalization

  • Denormalizing the relational variables means replacing them by their join R

    • Increase redundancy, by ensuring that R is a lower level of normalization than R1,R2,..,Rn

  • Problems

    • Once we start to denormalize, it is not clear when to stop?

Dimensional schema
Dimensional Schema relations

  • Array cells often empty

    • The more dimensions, there more empty cells

    • Empty cell  Missing information

    • How to treat not present information ?

    • How does the system support

      • Information is unknown

      • Has been not captured

      • Not applicable

      • ....

  • Solution?

Why data preprocessing
Why Data Preprocessing? relations

  • Data in the real world is dirty

    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

      • e.g., occupation=“ ”

    • noisy: containing errors or outliers

      • e.g., Salary=“-10”

    • inconsistent: containing discrepancies in codes or names

      • e.g., Age=“42” Birthday=“03/07/1997”

      • e.g., Was rating “1,2,3”, now rating “A, B, C”

      • e.g., discrepancy between duplicate records

Why is data dirty
Why Is Data Dirty? relations

  • Incomplete data may come from

    • “Not applicable” data value when collected

    • Different considerations between the time when the data was collected and when it is analyzed.

    • Human/hardware/software problems

  • Noisy data (incorrect values) may come from

    • Faulty data collection instruments

    • Human or computer error at data entry

    • Errors in data transmission

  • Inconsistent data may come from

    • Different data sources

    • Functional dependency violation (e.g., modify some linked data)

  • Duplicate records also need data cleaning

Why is data preprocessing important
Why Is Data Preprocessing Important? relations

  • No quality data, no quality mining results!

    • Quality decisions must be based on quality data

      • e.g., duplicate or missing data may cause incorrect or even misleading statistics.

    • Data warehouse needs consistent integration of quality data

  • Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

Multi dimensional measure of data quality
Multi-Dimensional Measure of Data Quality relations

  • A well-accepted multidimensional view:

    • Accuracy

    • Completeness

    • Consistency

    • Timeliness

    • Believability

    • Value added

    • Interpretability

    • Accessibility

Major tasks in data preprocessing
Major Tasks in Data Preprocessing relations

  • Data cleaning

    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

  • Data integration

    • Integration of multiple databases, data cubes, or files

  • Data transformation

    • Normalization and aggregation

  • Data reduction

    • Obtains reduced representation in volume but produces the same or similar analytical results

  • Data discretization

    • Part of data reduction but with particular importance, especially for numerical data

Data cleaning
Data Cleaning relations

  • Importance

    • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball

    • “Data cleaning is the number one problem in data warehousing”—DCI survey

  • Data cleaning tasks

    • Fill in missing values

    • Identify outliers and smooth out noisy data

    • Correct inconsistent data

    • Resolve redundancy caused by data integration

Missing data
Missing Data relations

  • Data is not always available

    • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

  • Missing data may be due to

    • equipment malfunction

    • inconsistent with other recorded data and thus deleted

    • data not entered due to misunderstanding

    • certain data may not be considered important at the time of entry

    • not register history or changes of the data

  • Missing data may need to be inferred

Missing values
Missing Values relations

  • The approach of the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)

  • null corresponds to UNK for unknown

  • 3VL a mistake?

Boolean operators
Boolean Operators relations

  • In scalar comparison in which either of the compared is UNK evaluates the unknown truth value

MAYBE relations

  • Another important Boolean operator is MAYBE

Example relations

  • Consider the query “Get employees who may be- but are not definitely known to be- programmers born before January 18, 1971, with salary less then €40.000


    DOB < DATE (‘1971-1-18’) AND

    SALLARY < 40000 )

  • Without maybe we assume the existence of another operator called IS_UKN which takes a single scalar operand and returns true if operand evaluates UNK otherwise false


    OR IS_UKN (JOB) )

    AND ( DOB < DATE (‘1971-1-18’)

    OR IS_UKN (DOB) )

    AND ( SALLARY < 40000



    DOB < DATE (‘1971-1-18’) AND

    SALLARY < 40000 )

Numeric expression
Numeric expression called IS_UKN which takes a single scalar operand and returns

  • WEIGHT * 454

    • If WEIGHT is UKN, then the result is also UKN

    • Any numeric expression is considered to evaluate UNK if any operands of that expression is itself UNK

  • Anomalies


    • WEIGHT/0=UNK (“zero divide”)

Unk is not u unk
UNK is not called IS_UKN which takes a single scalar operand and returns u (unk)

  • UNK (the value-unknown null)

  • u (unk) (unknown truth value)

  • ...are not the same thing

    • u is a value, UNK not a value at all!

  • Suppose X is BOOLEAN

    • Has tree values: t (true),f (false), u ukn

    • X is ukn, X is known to be unk

    • X is UKN, X is not known!

Some 3vl consequences
Some 3VL Consequences called IS_UKN which takes a single scalar operand and returns

  • The comparison x=x does not give true

    • In 3VL x is not equal to itself it is happens to be UNK

  • The Boolean expression p OR NOT(p) does not give necessarily true

    • unk OR NOT (unk) = unk

Example called IS_UKN which takes a single scalar operand and returns

  • Get all suppliers in Porto and take the union with get all suppliers not in Porto

    • We do not get all suppliers!

  • We need to add maybe in Porto

    • In 2 VL p OR NOT(p) corresponds to

    • p OR NOT(p) OR MAYBE(p) in 3VL

  • While two cases my exhaust full range of possibilities in the real world, the database does not contain the real world - instead it contains only knowledge about real world

Some 3vl consequences1
Some 3VL Consequences called IS_UKN which takes a single scalar operand and returns

  • The expression r JOIN r does not necessarily give r

  • A=B and B=C together does not imply A=C

  • ....

  • Many equivalences that are valid in 2VL break down in 3VL

  • We will get wrong answers

Special values
Special Values called IS_UKN which takes a single scalar operand and returns

  • Drop the idea of null and UNK,unk 3VL

  • Use special values instead to represent missing information

  • Special values are used in the real world

    • In the real world we might use the special value „?“ to denote hours worked by a certain employee if actual value is unknown

Special values1
Special Values called IS_UKN which takes a single scalar operand and returns

  • General Idea:

  • Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used

  • The special value must be of the applicable attribute is not just integers, but integers integers plus whatever the special value is

  • Approach is not very elegant, but without 3VL problems, because it is in 2VL

How to handle missing data
How to Handle Missing Data? called IS_UKN which takes a single scalar operand and returns

  • Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably.

  • Fill in the missing value manually: tedious + infeasible?

  • Fill in it automatically with

    • a global constant : e.g., “unknown”, a new class?!

    • the attribute mean

    • the attribute mean for all samples belonging to the same class: smarter

    • the most probable value: inference-based such as Bayesian formula or decision tree

  • Relational Databases - Normalization called IS_UKN which takes a single scalar operand and returns

    • Denormalization

  • Data Preprocessing

    • Missing Data

    • Missing values and the 3VL approach

    • Problems with 3VL approach

    • Special Values

Next.. called IS_UKN which takes a single scalar operand and returns

  • Data Preprocessing

    • Visual inspection

    • Noise Reduction

    • Data Reduction

    • Data Discretization

    • Data Integration