Data Preprocessing

  • Relational Databases - Normalization

    • Denormalization

  • Data Preprocessing

    • Missing Data

    • Missing values and the 3VL approach

    • Problems with 3VL approach

    • Special Values



Remember: Relational Databases

  • Model entities and relationships

  • Entities are the things in the real world

  • Information about employees and the department they work for

    • Employee and department are entities

  • Relationships are the links between these entities

    • Employee works for a department



Relation or Tables

  • Relation: a table of data

    • table = relation (a term from set theory, based on predicate logic)



Columns and Rows

  • Each column or attribute describes some piece of data that each record in the table has

  • Each row in a table represents a record

    • Rows are also called records or tuples



Keys

  • A superkey is a column (or a set of columns) that can be used to identify a row in a table

    • A key is a minimal superkey

  • There may be several possible keys

    • candidate keys

  • We choose the primary key from among the candidate keys

  • The primary key is used to identify a single row (record)

  • Foreign keys represent links between tables



Keys

  • primary key

  • foreign key
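
A minimal SQL sketch of these ideas; the table and column names follow the slides, while the types and lengths are assumptions:

    CREATE TABLE department (
        departmentID INTEGER PRIMARY KEY,      -- primary key of department
        name         VARCHAR(100) NOT NULL
    );

    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,      -- primary key of employee
        name         VARCHAR(100) NOT NULL,
        job          VARCHAR(100),
        departmentID INTEGER REFERENCES department(departmentID)  -- foreign key
    );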



Functional Dependencies

  • If there is a functional dependency between columns A and B in a given table, it may be written A → B

  • The value of column A then determines the value of column B

  • Example: employeeID → name (employeeID functionally determines the name)



Schema

  • Database schema

    • Structure or design of the database

    • Database without any data in it

      employee(employeeID,name,job,departmentID)



Design

  • Minimize redundancy

    • Redundancy: data is repeated in different rows

      employee(employeeID,name,job,departmentID,departmentName)


Reduce Redundancy

employee(employeeID,name,job,departmentID,departmentName)

employee(employeeID,name,job,departmentID)

department(departmentID,departmentName)



Insert Anomalies

  • Insert data into a flawed table

  • The new data may not match what is already in the table

  • It is not obvious which of the rows in the database is correct



Deletion Anomalies

  • Delete data from a flawed schema

  • When we delete all the employees of department 128, we no longer have any record that department 128 exists



Update Anomalies

  • Change data in a flawed schema

  • If we do not change the data in every affected row, the table becomes inconsistent (see the SQL sketch below)
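
A minimal sketch of all three anomalies against the flawed single-table schema; the values are invented for illustration:

    -- Flawed schema: departmentName is repeated in every employee row
    -- employee(employeeID, name, job, departmentID, departmentName)

    -- Insert anomaly: nothing prevents contradictory department data
    INSERT INTO employee VALUES (7, 'Ana', 'Clerk', 128, 'Research');
    INSERT INTO employee VALUES (8, 'Rui', 'Clerk', 128, 'Accounting');

    -- Deletion anomaly: deleting the last employee of department 128
    -- also deletes the only evidence that department 128 exists
    DELETE FROM employee WHERE departmentID = 128;

    -- Update anomaly: a rename must touch every matching row;
    -- a partial update leaves the table inconsistent
    UPDATE employee SET departmentName = 'R&D' WHERE departmentID = 128;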



Null Values

  • Avoid schema designs that have large numbers of empty attributes



Normalization

  • Remove design flaws from a database

  • Normal forms are sets of rules describing what we should and should not do in our table structures

  • Breaking tables into smaller tables that form a better design



Normal Forms

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)



First Normal Form (1NF)

  • Each attribute or column value must be atomic

  • Each attribute must contain a single value



1NF
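
A hypothetical sketch of atomicity in SQL, using the employee table with a skill column that appears later in the slides:

    -- Violates 1NF: the skill column holds a list, not an atomic value
    INSERT INTO employee VALUES (1, 'Ana', 'Clerk', 128, 'SQL, Java, C');

    -- In 1NF: one atomic skill value per column, so one row per skill
    INSERT INTO employee VALUES (1, 'Ana', 'Clerk', 128, 'SQL');
    INSERT INTO employee VALUES (1, 'Ana', 'Clerk', 128, 'Java');
    INSERT INTO employee VALUES (1, 'Ana', 'Clerk', 128, 'C');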



Second Normal Form (2NF)

  • All attributes that are not part of the primary key must be fully functionally dependent on the primary key

    • Each non-key attribute must be functionally dependent on the whole key

  • The table must already be in 1NF



2NF ?

employee(employeeID,name,job,departmentID,skill)



Functional dependencies

  • employeeID, skill → name, job, departmentID

  • employeeID → name, job, departmentID

    • name, job and departmentID are partially functionally dependent on the primary key (employeeID, skill)

    • They are not fully functionally dependent on the primary key



2NF

  • Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key

  • Break the table into two tables (see the SQL sketch below)

    • employee(employeeID,name,job,departmentID)

    • employeeSkills(employeeID,skill)
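
In SQL, the decomposition might look like this; the types and constraints are assumptions:

    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         VARCHAR(100),
        job          VARCHAR(100),
        departmentID INTEGER
    );

    CREATE TABLE employeeSkills (
        employeeID INTEGER REFERENCES employee(employeeID),
        skill      VARCHAR(50),
        PRIMARY KEY (employeeID, skill)   -- composite key; skill rows now depend on the whole key
    );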



Third Normal Form (3NF)

  • Remove all transitive dependencies

  • The table must already be in 2NF



  • employee(employeeID,name,job,departmentID,departmentName)

    • employeeIDname,job,departmentID,departmentName

    • departmentIDdepartmentName



Transitive dependency

  • employeeIDdepartmentName

  • employeeIDdeparmtentID

  • departmentIDdepartmentName



3NF

  • Remove transitive dependency

  • Decompose into multiple tables

  • employee(employeeID,name,job,departmentID)

  • department(departmentID,departmentName)



3NF

  • The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)

    • If this alone holds for every dependency: Boyce-Codd Normal Form

      or

  • The right side of the functional dependency is part of some key of the table



BCNF

  • All attributes must be functionally determined by a superkey



  • Full normalization means lots of logically separate relations

  • Lots of logically separate relations means lots of physically separate files

  • Lots of physically separate files means a lot of I/O

  • It also becomes difficult to identify the dimensions for a dimensional (star) schema

    • (dimension tables, fact table; see the sketch below)
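
A minimal star-schema sketch, with all table and column names hypothetical: a central fact table references a few deliberately denormalized dimension tables.

    CREATE TABLE dimDate (
        dateID INTEGER PRIMARY KEY,
        day    DATE,
        month  INTEGER,
        year   INTEGER
    );

    CREATE TABLE dimProduct (
        productID INTEGER PRIMARY KEY,
        name      VARCHAR(100),
        category  VARCHAR(50)    -- denormalized into the dimension on purpose
    );

    CREATE TABLE factSales (
        dateID    INTEGER REFERENCES dimDate(dateID),
        productID INTEGER REFERENCES dimProduct(productID),
        amount    NUMERIC,
        PRIMARY KEY (dateID, productID)
    );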



What is Denormalization?

  • Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn

    • This reduces redundancy; each projection R1, R2, ..., Rn is at the highest possible level of normalization

  • Denormalizing the relational variables means replacing them by their join R (see the sketch below)

    • This increases redundancy; R is at a lower level of normalization than R1, R2, ..., Rn

  • Problems

    • Once we start to denormalize, it is not clear when to stop
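
A sketch of denormalization in SQL, replacing the two projections by their join; CREATE TABLE ... AS is PostgreSQL-style syntax:

    -- One wide, redundant table replaces employee + department
    CREATE TABLE employeeDenorm AS
    SELECT e.employeeID, e.name, e.job,
           e.departmentID, d.name AS departmentName
    FROM employee e
    JOIN department d ON d.departmentID = e.departmentID;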



Dimensional Schema

  • Array cells often empty

    • The more dimensions, the more empty cells

    • Empty cell → missing information

    • How should information that is not present be treated?

    • How does the system support information that is

      • unknown

      • not captured

      • not applicable

      • ....

  • Solution?



Why Data Preprocessing?

  • Data in the real world is dirty

    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

      • e.g., occupation=“ ”

    • noisy: containing errors or outliers

      • e.g., Salary=“-10”

    • inconsistent: containing discrepancies in codes or names

      • e.g., Age=“42” Birthday=“03/07/1997”

      • e.g., Was rating “1,2,3”, now rating “A, B, C”

      • e.g., discrepancy between duplicate records



Why Is Data Dirty?

  • Incomplete data may come from

    • “Not applicable” data value when collected

    • Different considerations between the time when the data was collected and when it is analyzed.

    • Human/hardware/software problems

  • Noisy data (incorrect values) may come from

    • Faulty data collection instruments

    • Human or computer error at data entry

    • Errors in data transmission

  • Inconsistent data may come from

    • Different data sources

    • Functional dependency violation (e.g., modify some linked data)

  • Duplicate records also need data cleaning



Why Is Data Preprocessing Important?

  • No quality data, no quality mining results!

    • Quality decisions must be based on quality data

      • e.g., duplicate or missing data may cause incorrect or even misleading statistics.

    • Data warehouse needs consistent integration of quality data

  • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse



Multi-Dimensional Measure of Data Quality

  • A well-accepted multidimensional view:

    • Accuracy

    • Completeness

    • Consistency

    • Timeliness

    • Believability

    • Value added

    • Interpretability

    • Accessibility



Major Tasks in Data Preprocessing

  • Data cleaning

    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

  • Data integration

    • Integration of multiple databases, data cubes, or files

  • Data transformation

    • Normalization and aggregation

  • Data reduction

    • Obtains reduced representation in volume but produces the same or similar analytical results

  • Data discretization

    • Part of data reduction but with particular importance, especially for numerical data



Forms of Data Preprocessing



Data Cleaning

  • Importance

    • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball

    • “Data cleaning is the number one problem in data warehousing”—DCI survey

  • Data cleaning tasks

    • Fill in missing values

    • Identify outliers and smooth out noisy data

    • Correct inconsistent data

    • Resolve redundancy caused by data integration



Missing Data

  • Data is not always available

    • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

  • Missing data may be due to

    • equipment malfunction

    • inconsistent with other recorded data and thus deleted

    • data not entered due to misunderstanding

    • certain data may not be considered important at the time of entry

    • history or changes of the data not being registered

  • Missing data may need to be inferred



Missing Values

  • The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)

  • null corresponds to UNK, for "unknown"

  • Is 3VL a mistake?



Boolean Operators

  • Any scalar comparison in which either comparand is UNK evaluates to the unknown truth value (unk)
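
For reference, these are the standard (Kleene) 3VL truth tables that SQL follows, with t = true, u = unknown, f = false:

    AND | t  u  f        OR | t  u  f        NOT |
    ----+---------       ---+---------       ----+---
     t  | t  u  f         t | t  t  t         t  | f
     u  | u  u  f         u | t  u  u         u  | u
     f  | f  f  f         f | t  u  f         f  | t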



MAYBE

  • Another important Boolean operator is MAYBE: MAYBE(p) evaluates to true if p evaluates to unk, and to false otherwise



Example

  • Consider the query "Get employees who may be (but are not definitely known to be) programmers born before January 18, 1971, with a salary less than €40,000"

    EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND

    DOB < DATE (‘1971-1-18’) AND

    SALARY < 40000 )



  • Without MAYBE, we can assume the existence of another operator, IS_UNK, which takes a single scalar operand and returns true if the operand evaluates to UNK and false otherwise

    EMP WHERE ( JOB = ‘PROGRAMMER’

    OR IS_UNK (JOB) )

    AND ( DOB < DATE (‘1971-1-18’)

    OR IS_UNK (DOB) )

    AND ( SALARY < 40000

    OR IS_UNK (SALARY) )

    AND NOT ( JOB = ‘PROGRAMMER’ AND

    DOB < DATE (‘1971-1-18’) AND

    SALARY < 40000 )
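
In SQL the same "maybe" query can be written with the standard boolean test IS UNKNOWN (supported, for example, in PostgreSQL), which is true exactly when the condition evaluates to unknown:

    SELECT *
    FROM emp
    WHERE (job = 'PROGRAMMER'
           AND dob < DATE '1971-01-18'
           AND salary < 40000) IS UNKNOWN;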



Numeric expression

  • WEIGHT * 454

    • If WEIGHT is UNK, then the result is also UNK

    • Any numeric expression evaluates to UNK if any operand of that expression is itself UNK

  • Anomalies

    • WEIGHT - WEIGHT = UNK (one might expect 0)

    • WEIGHT / 0 = UNK (one might expect a “zero divide” error)
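
The same behaviour can be observed in SQL, where null plays the role of UNK; in PostgreSQL, for instance, arithmetic operators are strict, so a NULL operand yields NULL before the operation is even attempted:

    SELECT CAST(NULL AS NUMERIC) * 454;    -- NULL
    SELECT CAST(NULL AS NUMERIC)
         - CAST(NULL AS NUMERIC);          -- NULL, not 0
    SELECT CAST(NULL AS NUMERIC) / 0;      -- NULL: the zero divide is never raised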



UNK is not u (unk)

  • UNK (the value-unknown null)

  • u (unk) (unknown truth value)

  • ...are not the same thing

    • u is a value; UNK is not a value at all!

  • Suppose X is BOOLEAN

    • It has three values: t (true), f (false), u (unk)

    • If X has the value u (unk), then X is known to be unk

    • If X is UNK, the value of X is not known at all!



Some 3VL Consequences

  • The comparison x = x does not give true

    • In 3VL, x is not equal to itself if x happens to be UNK

  • The Boolean expression p OR NOT(p) does not necessarily give true

    • unk OR NOT (unk) = unk



Example

  • Get all suppliers in Porto and take the union with all suppliers not in Porto

    • We do not get all suppliers! (see the sketch below)

  • We need to add: suppliers maybe in Porto

    • p OR NOT(p) in 2VL corresponds to

    • p OR NOT(p) OR MAYBE(p) in 3VL

  • While the two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; it contains only knowledge about the real world
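
A sketch in SQL, assuming a hypothetical suppliers table with a nullable city column:

    -- Suppliers whose city is NULL appear in NEITHER branch:
    SELECT sno FROM suppliers WHERE city = 'Porto'
    UNION
    SELECT sno FROM suppliers WHERE city <> 'Porto';

    -- The missing "maybe" branch:
    SELECT sno FROM suppliers WHERE city IS NULL;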



Some 3VL Consequences

  • The expression r JOIN r does not necessarily give r

  • A=B and B=C together do not imply A=C

  • ....

  • Many equivalences that are valid in 2VL break down in 3VL

  • We will get wrong answers



Special Values

  • Drop the idea of null and UNK, unk and 3VL

  • Use special values instead to represent missing information

  • Special values are used in the real world

    • In the real world we might use the special value “?” to denote the hours worked by a certain employee if the actual value is unknown



Special Values

  • General idea:

  • Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used

  • The special value must belong to the attribute’s type: the applicable domain is then not just, say, the integers, but the integers plus whatever the special value is

  • The approach is not very elegant, but it avoids the 3VL problems because it stays within 2VL (see the sketch below)
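
A sketch of the special-values approach; the table and the sentinel -1 are assumptions. The column is declared NOT NULL, so every query stays within ordinary 2VL:

    CREATE TABLE timesheet (
        employeeID  INTEGER NOT NULL,
        hoursWorked INTEGER NOT NULL    -- the special value -1 means "unknown"
    );

    SELECT employeeID FROM timesheet WHERE hoursWorked >= 40;  -- definite answers only
    SELECT employeeID FROM timesheet WHERE hoursWorked = -1;   -- the "unknown" rows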



How to Handle Missing Data?

  • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

  • Fill in the missing value manually: tedious + infeasible?

  • Fill it in automatically with (see the SQL sketches after this list):

    • a global constant : e.g., “unknown”, a new class?!

    • the attribute mean

    • the attribute mean for all samples belonging to the same class: smarter

    • the most probable value: inference-based such as Bayesian formula or decision tree
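
Sketches of the automatic fill-in strategies in SQL, using a hypothetical table customers(id, income, class); the self-referencing and correlated forms work in PostgreSQL:

    -- Global constant (here a sentinel value)
    UPDATE customers SET income = -1 WHERE income IS NULL;

    -- Attribute mean
    UPDATE customers
    SET income = (SELECT AVG(income) FROM customers)
    WHERE income IS NULL;

    -- Attribute mean for all samples belonging to the same class
    UPDATE customers c
    SET income = (SELECT AVG(c2.income) FROM customers c2
                  WHERE c2.class = c.class)
    WHERE c.income IS NULL;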


Data Preprocessing

  • Relational Databases - Normalization

    • Denormalization

  • Data Preprocessing

    • Missing Data

    • Missing values and the 3VL approach

    • Problems with 3VL approach

    • Special Values



Next..

  • Data Preprocessing

    • Visual inspection

    • Noise Reduction

    • Data Reduction

    • Data Discretization

    • Data Integration

