
### 2. Data Preparation and Preprocessing

Data and Its Forms

Preparation

Preprocessing and Data Reduction

Data Mining – Data Preprocessing

Guozhu Dong

Data Types and Forms

- Attribute-vector data:
- Data types
- numeric, categorical (see the hierarchy for their relationship)
- static, dynamic (temporal)
- Other data forms
- distributed data
- text, Web, meta data
- images, audio/video


Data Preparation

- An important and time-consuming task in KDD
- Raw data typically present several challenges:
- High-dimensional data (20, 100, 1000, … features)
- Huge-size (high-volume) data
- Missing data
- Outliers
- Erroneous data (inconsistent, mis-recorded, distorted)


Data Preparation Methods

- Data annotation
- Data normalization
- Examples: image pixels, age
- Dealing with sequential or temporal data
- Transform to tabular form
- Removing outliers
- Different types


Normalization

- Decimal scaling:
- v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
- For the range between -991 and 99, 10^k is 1000, so -991 → -0.991
- Min-max normalization into a new max/min range:
- v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- v = 73600 in [12000, 98000] → v' = 0.716 in [0, 1] (the new range)
- Zero-mean (z-score) normalization:
- v' = (v - mean_A) / std_dev_A
- For (1, 2, 3), mean and std_dev are 2 and 1 → (-1, 0, 1)
- If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 → 1.225
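The three normalization formulas above are straightforward to express in code. A minimal Python sketch (the function names are mine, not from the slides):

```python
def decimal_scaling(values):
    """v'(i) = v(i) / 10^k for the smallest k with max(|v'(i)|) < 1."""
    k = 0
    while max(abs(v) for v in values) >= 10 ** k:
        k += 1
    return [v / 10 ** k for v in values]

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max normalization of v from [lo, hi] into [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def zero_mean(v, mean, std):
    """Zero-mean (z-score) normalization."""
    return (v - mean) / std
```

With the slide's numbers, `min_max(73600, 12000, 98000)` gives about 0.716 and `zero_mean(73600, 54000, 16000)` gives 1.225.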


Temporal Data

- The goal is to forecast t(n+1) from previous values
- X = {t(1), t(2), …, t(n)}
- An example with two features and window size 3
- How to determine the window size?
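Transforming a series into tabular form with a sliding window can be sketched as follows (an illustrative sketch; the function name is mine):

```python
def windowize(series, window):
    """Turn t(1..n) into rows of `window` lagged values
    plus the next value t(i+window) as the forecasting target."""
    rows = []
    for i in range(len(series) - window):
        rows.append((series[i:i + window], series[i + window]))
    return rows
```

For example, `windowize([1, 2, 3, 4, 5], 3)` yields the rows `([1, 2, 3], 4)` and `([2, 3, 4], 5)`.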


Outlier Removal

- Outlier: Data points inconsistent with the majority of data
- Different outliers
- Valid: a CEO's salary (a true but extreme value)
- Noisy: one's age = 200; widely deviated points
- Removal methods
- Clustering
- Curve-fitting
- Hypothesis-testing with a given model
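A simple deviation-based method in the spirit of the list above, sketched in Python (the 2.5-standard-deviation threshold is an illustrative choice, not from the slides):

```python
import statistics

def deviation_outliers(values, threshold=2.5):
    """Deviation-based detection: flag points more than `threshold`
    standard deviations away from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]
```

On an age list such as `[20, 21, ..., 28, 200]` this flags the noisy value 200.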


Data Preprocessing

- Data cleaning
- missing data
- noisy data
- inconsistent data
- Data reduction
- Dimensionality reduction
- Instance selection
- Value discretization


Missing Data

- Many types of missing data
- not measured
- not applicable
- wrongly placed, and others
- Some methods
- leave as is
- ignore/remove the instance with missing value
- manual fix (assign a value for implicit meaning)
- statistical methods (majority, most likely, mean, nearest neighbor, …)
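The statistical fill-in methods can be sketched as follows (a minimal illustration; names are mine, with `None` standing for a missing value):

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values:
    the mean for numeric data, the majority (mode) for categorical data."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.mode(observed)
    return [fill if v is None else v for v in values]
```

For example, `impute([1, None, 3])` fills in 2, and `impute(["a", "a", None, "b"], "majority")` fills in the majority value "a".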


Noisy Data

- Noise: Random error or variance in a measured variable
- inconsistent values for features or classes (processing)
- measuring errors (source)
- Noise is normally a minority in the data set
- Why?
- Removing noise
- Clustering/merging
- Smoothing (rounding, averaging within a window)
- Outlier detection (deviation-based or distance-based)
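Smoothing by averaging within a window, as listed above, might look like this (an illustrative sketch; boundary windows are simply truncated):

```python
def smooth(values, window=3):
    """Replace each value by the average of the values inside a
    sliding window centred on it (truncated at the boundaries)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

A noisy spike such as the 9 in `[1, 2, 9, 2, 1]` is pulled toward its neighbours.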


Inconsistent Data

- Inconsistent with our models or common sense
- Examples
- The same name occurs as different ones in an application
- Different names appear the same (Dennis vs. Denis)
- Inappropriate values (Male-Pregnant, negative age)
- One bank’s database shows that 5% of its customers were born on 11/11/11
- …


Dimensionality Reduction

- Feature selection
- select m from n features, m ≤ n
- remove irrelevant, redundant features
- benefit: savings in search space
- Feature transformation (PCA)
- form new features (a) in a new domain from original features (f)
- many uses, but it does not reduce the original dimensionality
- often used in visualization of data


Feature Selection

- Problem illustration
- Full set
- Empty set
- Enumeration
- Search
- Exhaustive/Complete (Enumeration/B&B)
- Heuristic (Sequential forward/backward)
- Stochastic (generate/evaluate)
- Individual features or subsets generation/evaluation


Feature Selection (2)

- Goodness metrics
- Dependency: dependence on classes
- Distance: separating classes
- Information: entropy
- Consistency: 1 - #inconsistencies/N
- Example: (F1, F2, F3) and (F1,F3)
- Both sets have 2/6 inconsistency rate
- Accuracy (classifier based): 1 - errorRate
- Their comparisons
- Time complexity, number of features, removing redundancy


Feature Selection (3)

- Filter vs. Wrapper Model
- Pros and cons
- time
- generality
- performance such as accuracy
- Stopping criteria
- thresholding (number of iterations, some accuracy,…)
- anytime algorithms
- providing approximate solutions
- solutions improve over time


Feature Selection (Examples)

- SFS using consistency (cRate)
- select the best single feature from n, then the best from the remaining n-1, n-2, … features
- increase the number of selected features until the pre-specified cRate is reached
- LVF using consistency (cRate)
- 1. randomly generate a subset S from the full set
- 2. if S satisfies the prespecified cRate, keep it when it is the smallest such subset so far (min #S)
- 3. go back to step 1 until a stopping criterion is met
- LVF is an anytime algorithm
- Many other algorithms: SBS, B&B, ...
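A compact sketch of the consistency measure and the LVF loop described above (my own minimal version, assuming tabular data as tuples and an iteration-count stopping criterion):

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(data, labels, subset):
    """#inconsistencies / N: instances that agree on the selected features
    but do not belong to their group's majority class are inconsistent."""
    groups = defaultdict(Counter)
    for row, y in zip(data, labels):
        groups[tuple(row[i] for i in subset)][y] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return bad / len(data)

def lvf(data, labels, max_rate, iters=200, seed=0):
    """Las Vegas Filter: randomly generate subsets, keep the smallest one
    whose inconsistency rate stays within max_rate."""
    rng = random.Random(seed)
    n = len(data[0])
    best = list(range(n))                       # start from the full set
    for _ in range(iters):                      # stopping criterion: iteration count
        cand = sorted(rng.sample(range(n), rng.randint(1, n)))
        if len(cand) < len(best) and inconsistency_rate(data, labels, cand) <= max_rate:
            best = cand
    return best
```

On data where feature 0 alone determines the class, the loop settles on the single-feature subset `[0]`.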


Transformation: PCA

- D' = DA, where D is mean-centered data (N×n)
- Calculate and rank the eigenvalues λ of the covariance matrix
- Select the m largest λ's such that r > threshold (e.g., 0.95), where
- r = (λ_1 + … + λ_m) / (λ_1 + … + λ_n)
- the corresponding eigenvectors form A (n×m)
- Example: the Iris data
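The PCA steps above, sketched with NumPy (a minimal illustration, not production code; the function name is mine):

```python
import numpy as np

def pca_transform(D, threshold=0.95):
    """D' = D A: mean-centre D (N x n), rank the eigenvalues of its
    covariance matrix, and keep the m largest so that r > threshold."""
    D = D - D.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(evals)[::-1]          # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    r = np.cumsum(evals) / evals.sum()       # cumulative eigenvalue ratio
    m = int(np.searchsorted(r, threshold)) + 1
    return D @ evecs[:, :m]                  # the m kept eigenvectors form A (n x m)
```

For points lying on a line in 2-D, a single component already exceeds the threshold, so the output has one column.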


Instance Selection

- Sampling methods
- random sampling
- stratified sampling
- Search-based methods
- Representatives
- Prototypes
- Sufficient statistics (N, mean, stdDev)
- Support vectors
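Stratified sampling, one of the methods listed above, can be sketched as follows (the equal-fraction-per-class policy is one common choice; the function name is mine):

```python
import random
from collections import defaultdict

def stratified_sample(data, labels, frac, seed=0):
    """Draw the same fraction of instances from every class (stratum),
    so minority classes keep their share of the sample."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(data, labels):
        by_class[y].append(x)
    sample = []
    for y, xs in by_class.items():
        k = max(1, round(len(xs) * frac))   # at least one per stratum
        sample.extend((x, y) for x in rng.sample(xs, k))
    return sample
```

Sampling half of a set with 8 instances of class 0 and 2 of class 1 keeps 4 and 1 of them, respectively.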


Value Discretization

- Binning methods
- Equal-width
- Equal-frequency
- Class information is not used
- Entropy-based
- ChiMerge
- Chi2


Binning

- Attribute values (for one attribute e.g., age):
- 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning – for a bin width of e.g. 10:
- Bin 1: 0, 4 (the [-∞, 10) bin)
- Bin 2: 12, 16, 16, 18 (the [10, 20) bin)
- Bin 3: 24, 26, 28 (the [20, +∞) bin)
- Equi-frequency binning – for a bin density of e.g. 3:
- Bin 1: 0, 4, 12 (the [-∞, 14) bin)
- Bin 2: 16, 16, 18 (the [14, 21) bin)
- Bin 3: 24, 26, 28 (the [21, +∞) bin)
- Any problems with the above methods?
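Both binning schemes are short to code; an illustrative sketch using the age values above (function names are mine):

```python
def equal_width(values, width):
    """Bin index for each value with fixed-width bins: floor(v / width)."""
    return [int(v // width) for v in values]

def equal_frequency(sorted_values, per_bin):
    """Split the sorted values into consecutive bins of per_bin values each."""
    return [sorted_values[i:i + per_bin] for i in range(0, len(sorted_values), per_bin)]
```

On the ages `0, 4, 12, 16, 16, 18, 24, 26, 28` these reproduce the bins shown above.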


Entropy-based

- Given attribute-value/class pairs:
- (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization:
- Intuitively, find best split so that the bins are as pure as possible
- Formally characterized by maximal information gain.
- Let S denote the above 9 pairs, p=4/9 be fraction of P pairs, and n=5/9 be fraction of N pairs.
- Entropy(S) = -p log₂ p - n log₂ n
- Smaller entropy – the set is relatively pure; the smallest value is 0
- Larger entropy – the set is mixed; the largest value (for two classes) is 1


Entropy-based (2)

- Let v be a possible split. Then S is divided into two sets:
- S1: value <= v and S2: value > v
- Information of the split:
- I(S1,S2) = (|S1|/|S|) Entropy(S1)+ (|S2|/|S|) Entropy(S2)
- Information gain of the split:
- Gain(v,S) = Entropy(S) – I(S1,S2)
- Goal: split with maximal information gain.
- Possible splits: mid points b/w any two consecutive values.
- For v=14, I(S1,S2) = 0 + 6/9*Entropy(S2) = 6/9 * 0.65 = 0.433
- Gain(14,S) = Entropy(S) - 0.433
- maximum Gain means minimum I.
- The best split is found after examining all possible split points.
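The split search described above can be sketched as follows (using base-2 entropy; minimizing I is equivalent to maximizing the gain, and the code reproduces the slide's v = 14 split):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels; 0 for a pure set."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Try the mid-point between each pair of consecutive attribute values
    and return the split with minimal weighted entropy I(S1, S2),
    i.e. maximal information gain."""
    values = sorted({v for v, _ in pairs})
    best_v, best_info = None, float("inf")
    for a, b in zip(values, values[1:]):
        v = (a + b) / 2
        s1 = [y for x, y in pairs if x <= v]
        s2 = [y for x, y in pairs if x > v]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if info < best_info:
            best_v, best_info = v, info
    return best_v, best_info
```

On the nine attribute-value/class pairs above it returns v = 14 with I(S1, S2) ≈ 0.433.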


ChiMerge and Chi2

- Given attribute-value/class pairs
- Build a contingency table for every pair of intervals
- Chi-squared test of independence between intervals and classes
- Parameters: df = k-1 and p% level of significance
- Chi2 algorithm provides an automatic way to adjust p

- χ² = Σ_{i=1..2} Σ_{j=1..k} (A_ij – E_ij)² / E_ij
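The χ² statistic for a pair of adjacent intervals can be computed from their contingency table (an illustrative sketch; rows are the two intervals, columns are class counts, and E is the expected count under independence):

```python
def chi2(table):
    """Chi-squared statistic for a 2 x k contingency table:
    sum over all cells of (A - E)^2 / E, where E = rowTotal * colTotal / N."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, a in enumerate(r):
            e = row[i] * col[j] / total   # expected count under independence
            if e:
                stat += (a - e) ** 2 / e
    return stat
```

Two intervals with identical class distributions give χ² = 0 (candidates for merging in ChiMerge); strongly differing distributions give a large χ².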


Summary

- Data have many forms
- Attribute-vectors: the most common form
- Raw data need to be prepared and preprocessed for data mining
- Data miners have to work on the data provided
- Domain expertise is important in DPP
- Data preparation: Normalization, Transformation
- Data preprocessing: Cleaning and Reduction
- DPP is a critical and time-consuming task
- Why?


Bibliography

- H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining - Concepts, Models, Methods, and Algorithms. IEEE and Wiley Inter-Science.
- H. Liu & H. Motoda, edited, 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.
