2. Data Preparation and Preprocessing

Data and Its Forms

Preparation

Preprocessing and Data Reduction

Data Mining – Data Preprocessing

Guozhu Dong

Data Types and Forms
  • Attribute-vector data:
  • Data types
    • numeric, categorical (see the hierarchy for their relationship)
    • static, dynamic (temporal)
  • Other data forms
    • distributed data
    • text, Web, meta data
    • images, audio/video

Data Preparation
  • An important & time consuming task in KDD
  • High dimensional data (20, 100, 1000, …)
  • Huge size (volume) data
  • Missing data
  • Outliers
  • Erroneous data (inconsistent, mis-recorded, distorted)
  • Raw data

Data Preparation Methods
  • Data annotation
  • Data normalization
    • Examples: image pixels, age
  • Dealing with sequential or temporal data
    • Transform to tabular form
  • Removing outliers
    • Different types

Normalization
  • Decimal scaling
    • v’(i) = v(i)/10^k for the smallest k such that max(|v’(i)|) < 1.
    • For the range between -991 and 99, 10^k is 1000 (k = 3), so -991 → -0.991
  • Min-max normalization into a new min/max range:
    • v’ = (v - minA)/(maxA - minA) * (new_maxA - new_minA) + new_minA
    • v = 73600 in [12000, 98000] → v’ = 0.716 in [0, 1] (new range)
  • Zero-mean normalization:
    • v’ = (v - meanA) / std_devA
    • (1, 2, 3), mean and std_dev are 2 and 1, → (-1, 0, 1)
    • If meanIncome = 54000 and std_devIncome = 16000, then v = 73600 → 1.225
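The three normalizations can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the slides, and the sample standard deviation is assumed for the (1, 2, 3) example.

```python
import statistics

def decimal_scaling(values):
    """Divide by 10^k for the smallest k that makes every |v'| < 1."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def zero_mean(v, mean_a, std_a):
    """Subtract the mean and divide by the standard deviation."""
    return (v - mean_a) / std_a

scaled, k = decimal_scaling([-991, 99])
print(k, scaled)                                   # k is 3
print(round(min_max(73600, 12000, 98000), 3))      # 0.716
print(round(zero_mean(73600, 54000, 16000), 3))    # 1.225

data = [1, 2, 3]
mu, sd = statistics.mean(data), statistics.stdev(data)  # sample std dev
print([zero_mean(v, mu, sd) for v in data])
```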

Temporal Data
  • The goal is to forecast t(n+1) from previous values
    • X = {t(1), t(2), …, t(n)}
  • An example with two features and window size 3
    • How to determine the window size?
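One common way to turn a series into tabular form is a sliding window: each row holds the previous `window` values and the value to forecast. The sketch below assumes a single feature and made-up numbers; `sliding_windows` is an illustrative name, not something defined in the slides.

```python
def sliding_windows(series, window):
    """Build (inputs, target) rows: `window` past values -> the next value."""
    rows = []
    for i in range(len(series) - window):
        rows.append((series[i:i + window], series[i + window]))
    return rows

series = [10, 12, 13, 15, 18, 21]  # made-up values t(1)..t(6)
for inputs, target in sliding_windows(series, 3):
    print(inputs, "->", target)
```

With several features per time step, each element of `series` could be a tuple; the windowing logic stays the same.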

Outlier Removal
  • Outlier: Data points inconsistent with the majority of data
  • Different outliers
    • Valid: CEO’s salary,
    • Noisy: One’s age = 200, widely deviated points
  • Removal methods
    • Clustering
    • Curve-fitting
    • Hypothesis-testing with a given model

Data Preprocessing
  • Data cleaning
    • missing data
    • noisy data
    • inconsistent data
  • Data reduction
    • Dimensionality reduction
    • Instance selection
    • Value discretization

Missing Data
  • Many types of missing data
    • not measured
    • not applicable
    • wrongly placed, and ?
  • Some methods
    • leave as is
    • ignore/remove the instance with missing value
    • manual fix (assign a value for implicit meaning)
    • statistical methods (majority, most likely, mean, nearest neighbor, …)
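Two of the statistical methods, mean and majority, can be sketched as follows. The sketch assumes missing values are marked with `None`; the function names and sample data are illustrative only.

```python
import statistics
from collections import Counter

def impute_mean(values):
    """Replace missing numeric values with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = statistics.mean(known)
    return [fill if v is None else v for v in values]

def impute_majority(values):
    """Replace missing categorical values with the most frequent known one."""
    known = [v for v in values if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

print(impute_mean([20, None, 40]))              # made-up numeric attribute
print(impute_majority(["a", None, "b", "a"]))   # made-up categorical attribute
```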

Noisy Data
  • Noise: Random error or variance in a measured variable
    • inconsistent values for features or classes (processing)
    • measuring errors (source)
  • Noise is normally a minority in the data set
    • Why?
  • Removing noise
    • Clustering/merging
    • Smoothing (rounding, averaging within a window)
    • Outlier detection (deviation-based or distance-based)
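Smoothing by averaging within a window and deviation-based outlier detection might look like the sketch below. The window handling at the edges and the cutoff of 2 standard deviations are assumptions made for illustration, and the ages are made up (the value 200 echoes the noisy-age example).

```python
import statistics

def moving_average(values, window=3):
    """Average each value with its neighbours inside a centred window."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def deviation_outliers(values, k=2.0):
    """Return values more than k sample standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sd]

ages = [23, 25, 31, 28, 200, 27, 30, 26]  # made-up data
print(moving_average([1, 2, 3, 4, 5]))
print(deviation_outliers(ages))
```

Note that a single extreme value inflates both the mean and the standard deviation, so a deviation-based cutoff can miss outliers when there are several of them.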

Inconsistent Data
  • Inconsistent with our models or common sense
  • Examples
    • The same name occurs as different ones in an application
    • Different names appear the same (Dennis vs. Denis)
    • Inappropriate values (Male-Pregnant, negative age)
    • One bank’s database shows that 5% of its customers were born on 11/11/11

Dimensionality Reduction
  • Feature selection
    • select m from n features, m ≤ n
    • remove irrelevant, redundant features
    • + saving in search space
  • Feature transformation (PCA)
    • form new features (a) in a new domain from original features (f)
    • many uses, but it does not reduce the original dimensionality
    • often used in visualization of data

Feature Selection
  • Problem illustration
    • Full set
    • Empty set
    • Enumeration
  • Search
    • Exhaustive/Complete (Enumeration/B&B)
    • Heuristic (Sequential forward/backward)
    • Stochastic (generate/evaluate)
    • Individual features or subsets generation/evaluation

Feature Selection (2)
  • Goodness metrics
    • Dependency: dependence on classes
    • Distance: separating classes
    • Information: entropy
    • Consistency: 1 - #inconsistencies/N
      • Example: (F1, F2, F3) and (F1,F3)
      • Both sets have 2/6 inconsistency rate
    • Accuracy (classifier based): 1 - errorRate
  • Their comparisons
    • Time complexity, number of features, removing redundancy
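The consistency metric can be made concrete with a small sketch. It assumes one common definition of an inconsistency: among instances that agree on all selected features, every instance outside the majority class counts as one. The six-row table is made up for illustration (it is not the table from the original slide), chosen so that (F1, F2, F3) and (F1, F3) both give a rate of 2/6.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, features):
    """#inconsistencies / N for the given feature indices (assumed definition)."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in features)
        groups[key][label] += 1
    # In each group, everything outside the majority class is inconsistent.
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)

# Made-up data: columns are F1, F2, F3.
rows = [(0, 0, 0), (0, 0, 0), (1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
labels = ["P", "N", "P", "N", "P", "N"]

print(inconsistency_rate(rows, labels, (0, 1, 2)))  # F1, F2, F3
print(inconsistency_rate(rows, labels, (0, 2)))     # F1, F3
print(1 - inconsistency_rate(rows, labels, (0, 2))) # consistency
```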

Feature Selection (3)
  • Filter vs. Wrapper Model
    • Pros and cons
      • time
      • generality
      • performance such as accuracy
  • Stopping criteria
    • thresholding (number of iterations, some accuracy,…)
    • anytime algorithms
      • providing approximate solutions
      • solutions improve over time

Feature Selection (Examples)
  • SFS using consistency (cRate)
    • select 1 from n, then 1 from n-1, n-2,… features
    • increase the number of selected features until pre-specified cRate is reached.
  • LVF using consistency (cRate)
    • randomly generate a subset S from the full set
    • if it satisfies prespecified cRate, keep S with min #S
    • go back to the first step until a stopping criterion is met
  • LVF is an anytime algorithm
  • Many other algorithms: SBS, B&B, ...
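The LVF steps above can be sketched as a random search that keeps the smallest subset seen so far that meets the consistency requirement. This is a simplified illustration, not the published algorithm in full: the threshold is expressed as a maximum inconsistency rate, the stopping criterion is a fixed number of tries, and the data, names, and seed are all assumptions.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, features):
    """#inconsistencies / N, counting non-majority instances per matching group."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(rows)

def lvf(rows, labels, n_features, max_rate, tries=200, seed=0):
    """Randomly generate subsets; keep the smallest one within max_rate."""
    rng = random.Random(seed)
    best = tuple(range(n_features))  # start from the full set
    for _ in range(tries):           # stopping criterion: number of tries
        size = rng.randint(1, n_features)
        subset = tuple(sorted(rng.sample(range(n_features), size)))
        if len(subset) < len(best) and inconsistency_rate(rows, labels, subset) <= max_rate:
            best = subset
    return best

# Made-up data: columns are F1, F2, F3.
rows = [(0, 0, 0), (0, 0, 0), (1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
labels = ["P", "N", "P", "N", "P", "N"]
print(lvf(rows, labels, n_features=3, max_rate=2 / 6))
```

Because `best` is valid at every iteration, the search can be interrupted at any point and still return a usable subset, which is the anytime property mentioned above.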

Transformation: PCA
  • D’ = DA, where D is mean-centered (N × n)
    • Calculate and rank the eigenvalues λ of the covariance matrix
    • Select the m largest λ’s such that r > threshold (e.g., .95), where $r = \left(\sum_{i=1}^{m} \lambda_i\right) / \left(\sum_{i=1}^{n} \lambda_i\right)$
    • The corresponding eigenvectors form A (n × m)
  • Example of Iris data
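Once the eigenvalues are known, choosing m is a running-sum computation. The sketch below covers only that selection step (computing the eigenvalues themselves would need a linear algebra library); the eigenvalues shown are made up and are not those of the Iris data.

```python
def choose_m(eigenvalues, threshold=0.95):
    """Return (m, r): the fewest top eigenvalues whose share r exceeds threshold."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    running = 0.0
    for m, lam in enumerate(ranked, start=1):
        running += lam
        if running / total > threshold:
            return m, running / total
    return len(ranked), 1.0

# Made-up eigenvalues of a 4 x 4 covariance matrix.
m, r = choose_m([4.0, 0.5, 0.3, 0.2], threshold=0.95)
print(m, round(r, 2))
```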

Instance Selection
  • Sampling methods
    • random sampling
    • stratified sampling
  • Search-based methods
    • Representatives
    • Prototypes
    • Sufficient statistics (N, mean, stdDev)
    • Support vectors
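Stratified sampling draws from each class separately so that class proportions are roughly preserved. A minimal sketch, assuming a fixed sampling fraction per class and at least one instance kept per class; the names, data, and seed are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(items, labels, fraction, seed=0):
    """Sample the same fraction from every class (at least one per class)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(fraction * len(members)))
        sample.extend((m, label) for m in rng.sample(members, k))
    return sample

items = list(range(100))             # made-up instance ids
labels = ["A"] * 90 + ["B"] * 10     # imbalanced classes
picked = stratified_sample(items, labels, fraction=0.2)
print(len(picked), sum(1 for _, c in picked if c == "B"))
```

Plain random sampling of 20 instances from this data could easily contain no "B" instances at all; the stratified version always keeps the minority class represented.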

Value Discretization
  • Binning methods
    • Equal-width
    • Equal-frequency
    • Class information is not used
  • Entropy-based
  • ChiMerge
    • Chi2

Binning
  • Attribute values (for one attribute e.g., age):
    • 0, 4, 12, 16, 16, 18, 24, 26, 28
  • Equi-width binning – for bin width of e.g., 10:
    • Bin 1: 0, 4 [-,10) bin
    • Bin 2: 12, 16, 16, 18 [10,20) bin
    • Bin 3: 24, 26, 28 [20,+) bin
    • We use – to denote negative infinity, + for positive infinity
  • Equi-frequency binning – for bin density of e.g., 3:
    • Bin 1: 0, 4, 12 [-,14) bin
    • Bin 2: 16, 16, 18 [14,21) bin
    • Bin 3: 24, 26, 28 [21,+) bin
  • Any problems with the above methods?
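Both binning schemes can be reproduced on the age values above. In this sketch, equal-width binning is expressed through its cut points (10 and 20 for a width of 10), with intervals closed on the left as in the example; the function names are illustrative.

```python
from bisect import bisect_right

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

def bin_by_cuts(values, cuts):
    """Place each value in the left-closed interval defined by sorted cut points."""
    bins = [[] for _ in range(len(cuts) + 1)]
    for v in values:
        bins[bisect_right(cuts, v)].append(v)
    return bins

def equal_frequency(values, per_bin):
    """Sort the values and cut them into bins of per_bin values each."""
    ordered = sorted(values)
    return [ordered[i:i + per_bin] for i in range(0, len(ordered), per_bin)]

print(bin_by_cuts(ages, [10, 20]))  # equal-width, width 10
print(equal_frequency(ages, 3))     # equal-frequency, 3 per bin
```

Neither function looks at class labels, which is one of the limitations the closing question points to.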

Entropy-based
  • Given attribute-value/class pairs:
    • (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
  • Entropy-based binning via binarization:
    • Intuitively, find best split so that the bins are as pure as possible
    • Formally characterized by maximal information gain.
  • Let S denote the above 9 pairs, p=4/9 be fraction of P pairs, and n=5/9 be fraction of N pairs.
  • Entropy(S) = - p log p - n log n (log base 2).
    • Smaller entropy – set is relatively pure; smallest is 0.
    • Large entropy – set is mixed. Largest is 1 (for two classes).

Entropy-based (2)
  • Let v be a possible split. Then S is divided into two sets:
    • S1: value <= v and S2: value > v
  • Information of the split:
    • I(S1,S2) = (|S1|/|S|) Entropy(S1)+ (|S2|/|S|) Entropy(S2)
  • Information gain of the split:
    • Gain(v,S) = Entropy(S) – I(S1,S2)
  • Goal: split with maximal information gain.
  • Possible splits: mid points b/w any two consecutive values.
  • For v=14, I(S1,S2) = 0 + 6/9*Entropy(S2) = 6/9 * 0.65 = 0.433
  • Gain(14,S) = Entropy(S) - 0.433 = 0.991 - 0.433 = 0.558
    • maximum Gain means minimum I.
  • The best split is found after examining all possible split points.
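The search over split points can be sketched directly from the definitions above, using the nine attribute-value/class pairs from the previous slide. Base-2 logarithms are assumed, and the function names are illustrative.

```python
import math
from collections import Counter

pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(pairs, v):
    """I(S1,S2): size-weighted entropy of S1 (value <= v) and S2 (value > v)."""
    s1 = [c for x, c in pairs if x <= v]
    s2 = [c for x, c in pairs if x > v]
    n = len(pairs)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

def best_split(pairs):
    """Try every midpoint between consecutive distinct values; return (gain, v)."""
    values = sorted({x for x, _ in pairs})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    base = entropy([c for _, c in pairs])
    return max((base - split_info(pairs, v), v) for v in candidates)

print(round(split_info(pairs, 14), 3))  # 0.433
gain, v = best_split(pairs)
print(v, round(gain, 3))
```

For this data the search selects v = 14, the split worked out by hand above.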

ChiMerge and Chi2
  • Given attribute-value/class pairs
  • Build a contingency table for every pair of intervals
  • Chi-Squared Test (goodness-of-fit): $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} (A_{ij} - E_{ij})^2 / E_{ij}$
  • Parameters: df = k-1 and p% level of significance
    • Chi2 algorithm provides an automatic way to adjust p
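The statistic for one pair of adjacent intervals can be computed from its 2 × k contingency table. In this sketch, A_ij is the observed count of class j in interval i and E_ij is taken to be the usual expected count (row total × column total / N). Skipping cells with E_ij = 0 is a simplification assumed here, and the example table is made up from the first two equal-frequency bins of the earlier P/N data.

```python
def chi_square(table):
    """Chi-squared statistic for table[i][j] = count of class j in interval i."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # simplification: ignore cells with no expected count
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Interval 1 holds 3 P and 0 N; interval 2 holds 1 P and 2 N.
print(chi_square([[3, 0], [1, 2]]))
# Identical class distributions give 0, suggesting the intervals could be merged.
print(chi_square([[2, 1], [2, 1]]))
```

A low value means the two intervals have similar class distributions; ChiMerge repeatedly merges the adjacent pair with the lowest value until every remaining pair exceeds the threshold set by df and the significance level.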

Summary
  • Data have many forms
    • Attribute-vectors: the most common form
  • Raw data need to be prepared and preprocessed for data mining
    • Data miners have to work on the data provided
    • Domain expertise is important in DPP
  • Data preparation: Normalization, Transformation
  • Data preprocessing: Cleaning and Reduction
  • DPP is a critical and time-consuming task
    • Why?

Bibliography
  • H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
  • M. Kantardzic, 2003. Data Mining - Concepts, Models, Methods, and Algorithms. IEEE and Wiley Inter-Science.
  • H. Liu & H. Motoda, edited, 2001. Instance Selection and Construction for Data Mining. Kluwer.
  • H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.
