Data Preprocessing

Presentation Transcript
Relational Databases - Normalization
    • Denormalization
  • Data Preprocessing
    • Missing Data
    • Missing values and the 3VL approach
    • Problems with 3VL approach
    • Special Values
Remember: Relational Databases
  • Model entities and relationships
  • Entities are the things in the real world
  • Information about employees and the department they work for
    • Employee and department are entities
  • Relationships are the links between these entities
    • Employee works for a department
Relations or Tables
  • Relation: a table of data
    • table = relation (a term from set theory, based on predicate logic)
Columns and Rows
  • Each column or attribute describes some piece of data that each record in the table has
  • Each row in a table represents a record
    • Rows are also called records or tuples
  • A superkey is a column (or a set of columns) that can be used to identify a row in a table
    • A key is a minimal superkey
  • There may be several possible keys
    • These are the candidate keys
  • From the candidate keys we choose the primary key
  • The primary key is used to identify a single row (record)
  • Foreign keys represent links between tables
Functional Dependencies
  • If there is a functional dependency between columns A and B in a given table, it may be written A → B
  • The value of column A then determines the value of column B
  • e.g., employeeID → name: the employeeID functionally determines the name
  • Database schema
    • Structure or design of the database
    • Database without any data in it


  • Minimize redundancy
    • Redundancy: data is repeated in different rows


Reduce Redundancy




Insert Anomalies
  • Inserting data into a flawed table
  • The new data may not match what is already in the table
  • It is then not obvious which of the rows in the database is correct
Deletion Anomalies
  • Delete data from a flawed schema
  • When we delete all the employees of Department 128, we no longer have any record that Department 128 exists
Update Anomalies
  • Changing data in a flawed schema
  • If we do not change the data in every affected row, the rows become inconsistent
Null Values
  • Avoid schema designs that have large numbers of empty attributes
  • Remove design flaws from a database
  • Normal forms, which are a set of rules describing what we should and should not do in our table structures
  • Breaking tables into smaller tables that form a better design
Normal Forms

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)

First Normal Form (1NF)
  • Each attribute or column value must be atomic
  • Each attribute must contain a single value
Second Normal Form (2NF)
  • All attributes that are not part of the primary key are fully dependent on the primary key
    • Each non-key attribute must be fully functionally dependent on the key
  • The table must already be in 1NF
2NF ?


Functional dependencies
  • employeeID, skill → name, job, departmentID
  • employeeID → name, job, departmentID
    • name, job and departmentID are only partially functionally dependent on the primary key
    • They are not fully functionally dependent on the primary key
  • Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key
  • Break the table into two tables
    • employee(employeeID, name, job, departmentID)
    • employeeSkills(employeeID, skill)
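This decomposition can be sketched in plain Python; the rows and names below are made-up examples for illustration, not data from the slides:

```python
# Flat table: (employeeID, skill, name, job, departmentID).
# Ana's name, job and department are repeated for every skill - the
# redundancy that the 2NF decomposition removes.
flat = [
    (1, "SQL",    "Ana",   "DBA",       10),
    (1, "Python", "Ana",   "DBA",       10),
    (2, "SQL",    "Bruno", "Developer", 20),
]

# employee(employeeID, name, job, departmentID): one row per employee
employee = {eid: (name, job, dep) for eid, _, name, job, dep in flat}

# employeeSkills(employeeID, skill): one row per employee/skill pair
employee_skills = {(eid, skill) for eid, skill, *_ in flat}

print(employee)                  # Ana's data is now stored only once
print(sorted(employee_skills))
```

Joining the two tables on employeeID reconstructs the original flat table, which is exactly the lossless-decomposition property normalization relies on.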
Third Normal Form (3NF)
  • Remove all transitive dependencies
  • Be in 2NF
    • employeeID → name, job, departmentID, departmentName
    • departmentID → departmentName
Transitive Dependency
  • employeeID → departmentName
  • employeeID → departmentID
  • departmentID → departmentName
  • Remove the transitive dependency
  • Decompose into multiple tables
  • employee(employeeID, name, job, departmentID)
  • department(departmentID, departmentName)
  • The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)
    • Boyce-Codd Normal Form


  • The right side of the functional dependency is a part of any key of the table
  • All attributes must be functionally determined by a superkey
Full normalization means lots of logically separate relations
  • Lots of logically separate relations means a lot of physically separate files
  • Lots of physically separate files means a lot of I/O
  • Difficulties in finding dimensions for dimensional schema, star schema
    • (dimension tables, fact table)
What is Denormalization?
  • Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn
    • This reduces redundancy; each projection R1, R2, ..., Rn is at the highest possible level of normalization
  • Denormalizing the relational variables means replacing them by their join R
    • This increases redundancy, since R is at a lower level of normalization than R1, R2, ..., Rn
  • Problems
    • Once we start to denormalize, it is not clear when to stop
Dimensional Schema
  • Array cells are often empty
    • The more dimensions, the more empty cells
    • Empty cell → missing information
    • How should information that is not present be treated?
    • How does the system support information that
      • is unknown
      • has not been captured
      • is not applicable
      • ...
  • Solution?
Why Data Preprocessing?
  • Data in the real world is dirty
    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • e.g., occupation=“ ”
    • noisy: containing errors or outliers
      • e.g., Salary=“-10”
    • inconsistent: containing discrepancies in codes or names
      • e.g., Age=“42” Birthday=“03/07/1997”
      • e.g., Was rating “1,2,3”, now rating “A, B, C”
      • e.g., discrepancy between duplicate records
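The three kinds of dirt above can be detected with simple rule checks. A minimal sketch in Python, where the record layout and the reference date are assumptions made for illustration:

```python
from datetime import date

# A record exhibiting the slide's three examples of dirty data.
record = {"occupation": "", "salary": -10,
          "age": 42, "birthday": date(1997, 3, 7)}

AS_OF = date(2004, 1, 1)   # hypothetical reference date for the age check

problems = []
if not record["occupation"].strip():
    problems.append("incomplete: occupation is empty")
if record["salary"] < 0:
    problems.append("noisy: negative salary")
# Consistency check: age should roughly match the birthday.
derived_age = AS_OF.year - record["birthday"].year
if abs(derived_age - record["age"]) > 1:
    problems.append("inconsistent: age does not match birthday")

print(problems)   # all three checks fire for this record
```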
Why Is Data Dirty?
  • Incomplete data may come from
    • “Not applicable” data value when collected
    • Different considerations between the time when the data was collected and when it is analyzed.
    • Human/hardware/software problems
  • Noisy data (incorrect values) may come from
    • Faulty data collection instruments
    • Human or computer error at data entry
    • Errors in data transmission
  • Inconsistent data may come from
    • Different data sources
    • Functional dependency violation (e.g., modify some linked data)
  • Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
  • No quality data, no quality mining results!
    • Quality decisions must be based on quality data
      • e.g., duplicate or missing data may cause incorrect or even misleading statistics.
    • Data warehouse needs consistent integration of quality data
  • Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
  • A well-accepted multidimensional view:
    • Accuracy
    • Completeness
    • Consistency
    • Timeliness
    • Believability
    • Value added
    • Interpretability
    • Accessibility
Major Tasks in Data Preprocessing
  • Data cleaning
    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
    • Integration of multiple databases, data cubes, or files
  • Data transformation
    • Normalization and aggregation
  • Data reduction
    • Obtains reduced representation in volume but produces the same or similar analytical results
  • Data discretization
    • Part of data reduction but with particular importance, especially for numerical data
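The "normalization" listed under data transformation typically means rescaling attribute values. A sketch of min-max normalization to [0, 1] (the target range and the sample values are assumptions for illustration):

```python
# Min-max normalization: rescale an attribute so its smallest value
# maps to 0.0 and its largest to 1.0.
values = [10.0, 20.0, 40.0]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)
```

Other common choices are z-score normalization (subtract the mean, divide by the standard deviation) and decimal scaling; all serve to bring differently scaled attributes onto comparable ranges before mining.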
Data Cleaning
  • Importance
    • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
    • “Data cleaning is the number one problem in data warehousing”—DCI survey
  • Data cleaning tasks
    • Fill in missing values
    • Identify outliers and smooth out noisy data
    • Correct inconsistent data
    • Resolve redundancy caused by data integration
Missing Data
  • Data is not always available
    • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to
    • equipment malfunction
    • inconsistent with other recorded data and thus deleted
    • data not entered due to misunderstanding
    • certain data may not be considered important at the time of entry
    • history or changes of the data were not recorded
  • Missing data may need to be inferred
Missing Values
  • The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)
  • null corresponds to UNK, for "unknown"
  • Is 3VL a mistake?
Boolean Operators
  • A scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value
  • Another important Boolean operator is MAYBE
  • Consider the query "Get employees who may be - but are not definitely known to be - programmers born before January 18, 1971, with salary less than €40,000"

SELECT * FROM EMP
WHERE MAYBE ( JOB = 'Programmer' AND
              DOB < DATE ('1971-1-18') AND
              SALARY < 40000 )

  • Without MAYBE, we assume the existence of another operator called IS_UNK, which takes a single scalar operand and returns true if the operand evaluates to UNK and false otherwise

SELECT * FROM EMP
WHERE ( JOB = 'Programmer' OR IS_UNK ( JOB = 'Programmer' ) )
  AND ( DOB < DATE ('1971-1-18') OR IS_UNK ( DOB < DATE ('1971-1-18') ) )
  AND ( SALARY < 40000 OR IS_UNK ( SALARY < 40000 ) )
  AND NOT ( JOB = 'Programmer' AND
            DOB < DATE ('1971-1-18') AND
            SALARY < 40000 )
Numeric Expressions
  • WEIGHT * 454
    • If WEIGHT is UNK, then the result is also UNK
    • Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK
  • Anomalies
    • WEIGHT/0 = UNK ("zero divide")
UNK is not u (unk)
  • UNK (the value-unknown null)
  • u (unk) (the unknown truth value)
  • ...are not the same thing
    • u is a value; UNK is not a value at all!
  • Suppose X is BOOLEAN
    • It has three values: t (true), f (false), u (unk)
    • If X is unk, X is known to be unk
    • If X is UNK, X is not known!
Some 3VL Consequences
  • The comparison x = x does not give true
    • In 3VL, x is not equal to itself if it happens to be UNK
  • The Boolean expression p OR NOT(p) does not necessarily give true
    • unk OR NOT(unk) = unk
  • Get all suppliers in Porto and take the union with all suppliers not in Porto
    • We do not get all suppliers!
  • We need to add the suppliers who are maybe in Porto
    • In 2VL, p OR NOT(p) corresponds to
    • p OR NOT(p) OR MAYBE(p) in 3VL
  • While two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world - instead, it contains only knowledge about the real world
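These consequences can be demonstrated with a small sketch of 3VL in Python, using None to stand for the unknown truth value (a modelling choice for illustration; SQL engines implement this natively):

```python
# Three-valued logic sketch: True, False, and None (for unk).

def not3(p):
    # NOT unk = unk
    return None if p is None else not p

def or3(p, q):
    # true dominates OR; otherwise any unk operand makes the result unk
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def eq3(x, y):
    # Any comparison involving UNK evaluates to unk
    if x is None or y is None:
        return None
    return x == y

UNK = None

# x = x does not give true when x is UNK:
print(eq3(UNK, UNK))            # unk (None), not True

# p OR NOT(p) is not necessarily true:
p = None                        # p is unk
print(or3(p, not3(p)))          # unk (None), not True
```

This is exactly why the union of "suppliers in Porto" and "suppliers not in Porto" can miss rows: for a supplier whose city is UNK, both predicates evaluate to unk, so the row qualifies for neither query.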
Some 3VL Consequences
  • The expression r JOIN r does not necessarily give r
  • A=B and B=C together do not imply A=C
  • ....
  • Many equivalences that are valid in 2VL break down in 3VL
  • We will get wrong answers
Special Values
  • Drop the idea of nulls, UNK, unk, and 3VL
  • Use special values instead to represent missing information
  • Special values are used in the real world
    • In the real world we might use the special value "?" to denote the hours worked by a certain employee if the actual value is unknown
Special Values
  • General idea:
  • Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used
  • The domain of the applicable attribute is then not just integers, but integers plus whatever the special value is
  • The approach is not very elegant, but it avoids the 3VL problems, because it stays in 2VL
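A minimal sketch of the special-values approach, using a sentinel of -1 for "hours unknown" (the sentinel, names and data are assumptions for illustration; the sentinel just has to lie outside the regular domain of non-negative hours):

```python
# Special-values approach: no NULLs, ordinary two-valued logic, but a
# distinguished sentinel that every query must handle explicitly.
HOURS_UNKNOWN = -1

hours_worked = {"Ana": 38, "Bruno": HOURS_UNKNOWN, "Carla": 42}

# Who definitely worked more than 40 hours? The sentinel must be
# excluded by hand, or -1 would silently fail the comparison anyway.
over_40 = [e for e, h in hours_worked.items()
           if h != HOURS_UNKNOWN and h > 40]

# Whose hours are unknown? An ordinary equality test suffices.
unknown = [e for e, h in hours_worked.items() if h == HOURS_UNKNOWN]

print(over_40)   # ['Carla']
print(unknown)   # ['Bruno']
```

The inelegance the slide mentions is visible here: every query over the attribute carries an explicit sentinel check, but all comparisons evaluate to plain true or false, so none of the 3VL anomalies arise.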
How to Handle Missing Data?
  • Ignore the tuple: usually done when the class label is missing (assuming the task is classification) - not effective when the percentage of missing values per attribute varies considerably
  • Fill in the missing value manually: tedious + infeasible?
  • Fill it in automatically with
    • a global constant: e.g., "unknown" - a new class?!
    • the attribute mean
    • the attribute mean for all samples belonging to the same class: smarter
    • the most probable value: inference-based, such as a Bayesian formula or decision tree
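The automatic fill-in strategies can be sketched in plain Python; the attribute values and class labels are made up for illustration:

```python
# Missing income values (None) to be imputed; each sample has a class.
incomes = [48000, None, 52000, None, 61000]
classes = ["A",   "A",  "B",   "B",  "B"]

# Strategy 1: a global constant.
filled_const = [x if x is not None else "unknown" for x in incomes]

# Strategy 2: the overall attribute mean.
known = [x for x in incomes if x is not None]
mean = sum(known) / len(known)
filled_mean = [x if x is not None else mean for x in incomes]

# Strategy 3: the attribute mean per class (smarter - uses the label).
def class_mean(c):
    vals = [x for x, k in zip(incomes, classes) if k == c and x is not None]
    return sum(vals) / len(vals)

filled_class = [x if x is not None else class_mean(c)
                for x, c in zip(incomes, classes)]

print(filled_mean)
print(filled_class)
```

The per-class mean fills the missing class-A income from class-A samples only, which generally distorts the distribution less than a single global mean; the most-probable-value strategy generalizes this idea with a learned model.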
Relational Databases - Normalization
    • Denormalization
  • Data Preprocessing
    • Missing Data
    • Missing values and the 3VL approach
    • Problems with 3VL approach
    • Special Values
  • Data Preprocessing
    • Visual inspection
    • Noise Reduction
    • Data Reduction
    • Data Discretization
    • Data Integration