
2. Data Preparation and Preprocessing

Data and Its Forms

Preparation

Preprocessing and Data Reduction

Data Mining – Data Preprocessing

Guozhu Dong


Data Types and Forms

  • Attribute-vector data:

  • Data types

    • numeric, categorical (see the hierarchy for their relationship)

    • static, dynamic (temporal)

  • Other data forms

    • distributed data

    • text, Web, metadata

    • images, audio/video

Data Preparation

  • An important and time-consuming task in KDD

  • High-dimensional data (20, 100, 1000, … features)

  • Huge size (volume) data

  • Missing data

  • Outliers

  • Erroneous data (inconsistent, mis-recorded, distorted)

  • Raw data

Data Preparation Methods

  • Data annotation

  • Data normalization

    • Examples: image pixels, age

  • Dealing with sequential or temporal data

    • Transform to tabular form

  • Removing outliers

    • Different types

Normalization

  • Decimal scaling

    • v’(i) = v(i)/10^k for the smallest k such that max(|v’(i)|) < 1

    • For the range between -991 and 99, 10^k is 1000, so -991 → -0.991

  • Min-max normalization into new max/min range:

    • v’ = (v - minA)/(maxA - minA) * (new_maxA - new_minA) + new_minA

    • v = 73600 in [12000, 98000] → v’ = 0.716 in [0, 1] (the new range)

  • Zero-mean normalization:

    • v’ = (v - meanA) / std_devA

    • (1, 2, 3), with mean 2 and std_dev 1, becomes (-1, 0, 1)

    • If meanIncome = 54000 and std_devIncome = 16000, then v = 73600 → 1.225
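A minimal NumPy sketch of the three schemes above; the example values are illustrative, not from the slides:

    import numpy as np

    v = np.array([12000.0, 73600.0, 98000.0])      # illustrative attribute values

    # Decimal scaling: divide by the smallest power of 10 that maps all values into (-1, 1)
    k = 0
    while np.max(np.abs(v)) / 10**k >= 1:
        k += 1
    decimal_scaled = v / 10**k

    # Min-max normalization into the new range [new_min, new_max]
    new_min, new_max = 0.0, 1.0
    min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # Zero-mean (z-score) normalization
    zero_mean = (v - v.mean()) / v.std()

    print(decimal_scaled, min_max, zero_mean, sep="\n")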

Temporal Data

  • The goal is to forecast t(n+1) from previous values

    • X = {t(1), t(2), …, t(n)}

  • An example with two features and window size 3

    • How to determine the window size?
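A minimal sketch of the windowing idea for a single (univariate) series; the window size of 3 follows the slide, the series itself is made up:

    def windowize(series, w=3):
        """Turn t(1..n) into rows: features [t(i), ..., t(i+w-1)], target t(i+w)."""
        return [(series[i:i + w], series[i + w]) for i in range(len(series) - w)]

    # Each window of w past values becomes one attribute vector; the next value is the target
    print(windowize([1, 2, 3, 4, 5, 6], w=3))
    # [([1, 2, 3], 4), ([2, 3, 4], 5), ([3, 4, 5], 6)]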

Outlier Removal

  • Outlier: Data points inconsistent with the majority of data

  • Different outliers

    • Valid: a CEO’s salary

    • Noisy: One’s age = 200, widely deviated points

  • Removal methods

    • Clustering

    • Curve-fitting

    • Hypothesis-testing with a given model
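One simple deviation-based variant (not any specific method from the slide): flag points far from the mean; the threshold k and the toy ages are assumptions:

    import numpy as np

    def flag_outliers(x, k=3.0):
        """Flag points more than k standard deviations from the mean."""
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()
        return np.abs(z) > k

    ages = np.array([23, 31, 27, 45, 38, 200])   # 200 looks like a mis-recorded age
    print(flag_outliers(ages, k=2.0))            # looser threshold for this tiny sample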

Data Preprocessing

  • Data cleaning

    • missing data

    • noisy data

    • inconsistent data

  • Data reduction

    • Dimensionality reduction

    • Instance selection

    • Value discretization

Missing Data

  • Many types of missing data

    • not measured

    • not applicable

    • wrongly placed, and so on

  • Some methods

    • leave as is

    • ignore/remove the instance with missing value

    • manual fix (assign a value for implicit meaning)

    • statistical methods (majority, most likely, mean, nearest neighbor, …)
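A sketch of two of the statistical fills above, mean for a numeric attribute and majority for a categorical one; the column names and values are made up:

    import pandas as pd

    df = pd.DataFrame({"age":    [23, None, 31, 27, None],
                       "gender": ["F", "M", None, "M", "M"]})

    df["age"] = df["age"].fillna(df["age"].mean())               # mean imputation
    df["gender"] = df["gender"].fillna(df["gender"].mode()[0])   # majority imputation
    print(df)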

Noisy Data

  • Noise: Random error or variance in a measured variable

    • inconsistent values for features or classes (processing)

    • measuring errors (source)

  • Noise is normally a minority in the data set

    • Why?

  • Removing noise

    • Clustering/merging

    • Smoothing (rounding, averaging within a window)

    • Outlier detection (deviation-based or distance-based)
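Smoothing by averaging within a window, one of the options above, might look like this; the window length and the toy series are assumptions:

    import numpy as np

    def smooth(x, w=3):
        """Replace each value by the mean of a length-w window (moving average)."""
        return np.convolve(x, np.ones(w) / w, mode="valid")

    noisy = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 100.0, 7.0, 6.0])   # 100 is a noise spike
    print(smooth(noisy))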

Inconsistent Data

  • Inconsistent with our models or common sense

  • Examples

    • The same person’s name is recorded in different forms in an application

    • Different names appear the same (Dennis vs. Denis)

    • Inappropriate values (Male-Pregnant, negative age)

    • One bank’s database shows that 5% of its customers were born on 11/11/11

Dimensionality Reduction

  • Feature selection

    • select m from n features, m ≤ n

    • remove irrelevant, redundant features

    • + saving in search space

  • Feature transformation (PCA)

    • form new features (a) in a new domain from original features (f)

    • many uses, but it does not reduce the original dimensionality

    • often used in visualization of data

Feature Selection

  • Problem illustration

    • Full set

    • Empty set

    • Enumeration

  • Search

    • Exhaustive/Complete (Enumeration/B&B)

    • Heuristic (Sequential forward/backward)

    • Stochastic (generate/evaluate)

    • Individual features or subsets generation/evaluation

Feature Selection (2)

  • Goodness metrics

    • Dependency: dependence on classes

    • Distance: separating classes

    • Information: entropy

    • Consistency: 1 - #inconsistencies/N

      • Example: (F1, F2, F3) and (F1,F3)

      • Both sets have 2/6 inconsistency rate

    • Accuracy (classifier based): 1 - errorRate

  • Their comparisons

    • Time complexity, number of features, removing redundancy
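A possible implementation of the consistency metric above: instances that agree on the selected features but disagree on class count as inconsistencies; the data layout (rows as tuples indexed by feature position) is an assumption:

    from collections import Counter

    def inconsistency_rate(rows, labels, features):
        """#inconsistencies / N; the consistency measure is 1 minus this value."""
        groups = {}
        for row, label in zip(rows, labels):
            key = tuple(row[f] for f in features)          # projection onto the feature subset
            groups.setdefault(key, []).append(label)
        bad = sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
        return bad / len(rows)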

Feature Selection (3)

  • Filter vs. Wrapper Model

    • Pros and cons

      • time

      • generality

      • performance such as accuracy

  • Stopping criteria

    • thresholding (number of iterations, some accuracy, …)

    • anytime algorithms

      • providing approximate solutions

      • solutions improve over time

Feature Selection (Examples)

  • SFS using consistency (cRate)

    • select 1 of n features, then 1 of the remaining n-1, n-2, … features

    • increase the number of selected features until the pre-specified cRate is reached

  • LVF using consistency (cRate)

    • 1. randomly generate a subset S from the full set

    • 2. if S satisfies the prespecified cRate and is smaller than the best subset so far, keep S

    • 3. go back to step 1 until a stopping criterion is met

  • LVF is an anytime algorithm (see the sketch below)

  • Many other algorithms: SBS, B&B, ...
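A sketch of the LVF loop, assuming the inconsistency_rate helper from the earlier sketch and a fixed iteration budget standing in for the stopping criterion:

    import random

    def lvf(rows, labels, n_features, max_rate, iterations=1000):
        """Las Vegas Filter: randomly probe subsets, keep the smallest one meeting max_rate."""
        best = list(range(n_features))                     # start from the full feature set
        for _ in range(iterations):
            size = random.randint(1, len(best))            # never probe larger than the current best
            subset = random.sample(range(n_features), size)
            if size < len(best) and inconsistency_rate(rows, labels, subset) <= max_rate:
                best = subset
        return best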

Transformation: PCA

  • D’ = DA, where D is mean-centered (N × n)

    • Calculate and rank the eigenvalues λ of the covariance matrix

    • Select the largest λ’s such that r > threshold (e.g., 0.95)

    • The corresponding eigenvectors form A (n × m)

    • r = (λ1 + λ2 + … + λm) / (λ1 + λ2 + … + λn)

  • Example of Iris data
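A NumPy sketch of this transformation; the 0.95 threshold follows the slide, the input is whatever N × n array you pass in (the Iris loading is not shown):

    import numpy as np

    def pca_transform(D, threshold=0.95):
        """Project mean-centered data onto the top-m eigenvectors of its covariance matrix."""
        D = D - D.mean(axis=0)                          # mean-center the N x n data
        eigvals, eigvecs = np.linalg.eigh(np.cov(D, rowvar=False))
        order = np.argsort(eigvals)[::-1]               # rank eigenvalues, largest first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        r = np.cumsum(eigvals) / eigvals.sum()          # variance ratio for m = 1, 2, ...
        m = int(np.searchsorted(r, threshold)) + 1      # smallest m reaching the threshold
        A = eigvecs[:, :m]                              # n x m
        return D @ A                                    # D' = DA, N x m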

Instance Selection

  • Sampling methods

    • random sampling

    • stratified sampling

  • Search-based methods

    • Representatives

    • Prototypes

    • Sufficient statistics (N, mean, stdDev)

    • Support vectors
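Random and stratified sampling might look like this; the 10% sampling fraction is illustrative:

    import random
    from collections import defaultdict

    def random_sample(rows, frac=0.1):
        return random.sample(rows, int(len(rows) * frac))

    def stratified_sample(rows, labels, frac=0.1):
        """Take the same fraction from every class so rare classes keep their share."""
        by_class = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        sample = []
        for group in by_class.values():
            sample.extend(random.sample(group, max(1, int(len(group) * frac))))
        return sample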

Value Discretization

  • Binning methods

    • Equal-width

    • Equal-frequency

    • Class information is not used

  • Entropy-based

  • ChiMerge

    • Chi2

Binning

  • Attribute values (for one attribute e.g., age):

    • 0, 4, 12, 16, 16, 18, 24, 26, 28

  • Equi-width binning – for bin width of, e.g., 10:

    • Bin 1: 0, 4 [−∞, 10) bin

    • Bin 2: 12, 16, 16, 18 [10, 20) bin

    • Bin 3: 24, 26, 28 [20, +∞) bin

    • We write −∞ for negative infinity and +∞ for positive infinity

  • Equi-frequency binning – for bin density of, e.g., 3:

    • Bin 1: 0, 4, 12 [−∞, 14) bin

    • Bin 2: 16, 16, 18 [14, 21) bin

    • Bin 3: 24, 26, 28 [21, +∞) bin

  • Any problems with the above methods?
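A sketch of both schemes on the ages above, reporting bin indices rather than interval labels:

    ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

    # Equi-width, width 10: bins {0, 4}, {12, 16, 16, 18}, {24, 26, 28}
    width = 10
    equi_width = [a // width for a in ages]

    # Equi-frequency, 3 values per bin: bins {0, 4, 12}, {16, 16, 18}, {24, 26, 28}
    per_bin = 3
    equi_freq = [i // per_bin for i, _ in enumerate(sorted(ages))]

    print(equi_width)   # [0, 0, 1, 1, 1, 1, 2, 2, 2]
    print(equi_freq)    # [0, 0, 0, 1, 1, 1, 2, 2, 2]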

Entropy-based

  • Given attribute-value/class pairs:

    • (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)

  • Entropy-based binning via binarization:

    • Intuitively, find best split so that the bins are as pure as possible

    • Formally characterized by maximal information gain.

  • Let S denote the above 9 pairs, p=4/9 be fraction of P pairs, and n=5/9 be fraction of N pairs.

  • Entropy(S) = - p log2 p - n log2 n

    • Smaller entropy – set is relatively pure; smallest is 0.

    • Large entropy – set is mixed. Largest is 1.

Entropy-based (2)

  • Let v be a possible split. Then S is divided into two sets:

    • S1: value <= v and S2: value > v

  • Information of the split:

    • I(S1,S2) = (|S1|/|S|) Entropy(S1)+ (|S2|/|S|) Entropy(S2)

  • Information gain of the split:

    • Gain(v,S) = Entropy(S) – I(S1,S2)

  • Goal: split with maximal information gain.

  • Possible splits: midpoints between any two consecutive values.

  • For v=14, I(S1,S2) = 0 + 6/9*Entropy(S2) = 6/9 * 0.65 = 0.433

  • Gain(14,S) = Entropy(S) - 0.433

    • maximum Gain means minimum I.

  • The best split is found after examining all possible split points.
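A sketch of the scan over candidate midpoints for the 9 pairs above, using base-2 logs to match the 0.65 and 0.433 figures:

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

    def best_split(pairs):
        """pairs: (value, class) tuples; return the midpoint v with maximal Gain(v, S)."""
        pairs = sorted(pairs)
        values = [v for v, _ in pairs]
        labels = [c for _, c in pairs]
        base = entropy(labels)                     # Entropy(S)
        best = None
        for i in range(1, len(pairs)):
            if values[i] == values[i - 1]:
                continue                           # no split between equal values
            v = (values[i] + values[i - 1]) / 2    # candidate midpoint
            info = i / len(pairs) * entropy(labels[:i]) + \
                   (len(pairs) - i) / len(pairs) * entropy(labels[i:])
            if best is None or base - info > best[1]:
                best = (v, base - info)
        return best

    S = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
         (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
    print(best_split(S))                           # v = 14.0 gives the maximal gain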

ChiMerge and Chi2

  • Given attribute-value/class pairs

  • Build a contingency table for every pair of intervals

  • Chi-Squared Test (goodness-of-fit),

  • Parameters: df = k-1 and p% level of significance

    • Chi2 algorithm provides an automatic way to adjust p

χ² = Σ_{i=1..2} Σ_{j=1..k} (Aij – Eij)² / Eij, where Aij is the observed count and Eij the expected count
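A sketch of the statistic for one pair of adjacent intervals; the 2 × k table of observed counts is made up:

    def chi2_statistic(A):
        """A: 2 x k observed counts (two adjacent intervals, k classes)."""
        row_totals = [sum(row) for row in A]
        col_totals = [sum(col) for col in zip(*A)]
        N = sum(row_totals)
        chi2 = 0.0
        for i in range(2):
            for j in range(len(A[0])):
                E = row_totals[i] * col_totals[j] / N      # expected count Eij
                if E:
                    chi2 += (A[i][j] - E) ** 2 / E
        return chi2

    # ChiMerge repeatedly merges the pair of adjacent intervals with the lowest chi2
    print(chi2_statistic([[4, 1], [2, 3]]))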

Summary

  • Data have many forms

    • Attribute-vectors: the most common form

  • Raw data need to be prepared and preprocessed for data mining

    • Data miners have to work on the data provided

    • Domain expertise is important in DPP

  • Data preparation: Normalization, Transformation

  • Data preprocessing: Cleaning and Reduction

  • DPP is a critical and time-consuming task

    • Why?

Bibliography

  • H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.

  • M. Kantardzic, 2003. Data Mining - Concepts, Models, Methods, and Algorithms. IEEE and Wiley Inter-Science.

  • H. Liu & H. Motoda, edited, 2001. Instance Selection and Construction for Data Mining. Kluwer.

  • H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.
