2. Data Preparation and Preprocessing

Presentation Transcript


  1. 2. Data Preparation and Preprocessing • Data and Its Forms • Preparation • Preprocessing and Data Reduction • Data Mining – Data Preprocessing, Guozhu Dong

  2. Data Types and Forms • Attribute-vector data: • Data types • numeric, categorical (see the hierarchy for their relationship) • static, dynamic (temporal) • Other data forms • distributed data • text, Web, meta data • images, audio/video

  3. Data Preparation • An important and time-consuming task in KDD • High dimensional data (20, 100, 1000, …) • Huge size (volume) data • Missing data • Outliers • Erroneous data (inconsistent, mis-recorded, distorted) • Raw data

  4. Data Preparation Methods • Data annotation • Data normalization • Examples: image pixels, age • Dealing with sequential or temporal data • Transform to tabular form • Removing outliers • Different types

  5. Normalization • Decimal scaling • $v'(i) = v(i)/10^k$ for the smallest $k$ such that $\max(|v'(i)|) < 1$. • For the range between -991 and 99, $10^k$ is 1000 ($k = 3$), so -991 → -0.991 • Min-max normalization into a new min/max range: • $v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$ • $v = 73600$ in [12000, 98000] → $v' = 0.716$ in [0, 1] (new range) • Zero-mean normalization: • $v' = (v - \text{mean}_A) / \text{std\_dev}_A$ • For (1, 2, 3), mean and std_dev are 2 and 1, giving (-1, 0, 1) • If $\text{mean}_{Income} = 54000$ and $\text{std\_dev}_{Income} = 16000$, then $v = 73600$ → 1.225
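
The three normalizations on this slide can be checked with a short Python sketch. The function names are illustrative and not from the slides; the sample standard deviation is assumed for the (1, 2, 3) example, since that is what gives std_dev = 1.

```python
import statistics


def decimal_scaling(values):
    """Divide by 10^k for the smallest k such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k


def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def zero_mean(v, mean_a, std_dev_a):
    """Zero-mean (z-score) normalization."""
    return (v - mean_a) / std_dev_a


# The slide's own numbers.
scaled, k = decimal_scaling([-991, 99])
income_01 = min_max(73600, 12000, 98000)
income_z = zero_mean(73600, 54000, 16000)

# (1, 2, 3): sample mean 2 and sample standard deviation 1.
data = [1, 2, 3]
m, s = statistics.mean(data), statistics.stdev(data)
z_values = [zero_mean(v, m, s) for v in data]
```

With these inputs, `k` is 3, `scaled` is `[-0.991, 0.099]`, `income_01` rounds to 0.716, `income_z` is 1.225 up to floating-point rounding, and `z_values` is `[-1.0, 0.0, 1.0]`, matching the slide.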

  6. Temporal Data • The goal is to forecast t(n+1) from previous values • X = {t(1), t(2), …, t(n)} • An example with two features and window size 3 • How to determine the window size?
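
One way to read "transform to tabular form" is a sliding window: each row holds the last few values as inputs and the next value as the target. The sketch below assumes a single series (the slide's two-feature example is not reproduced in the transcript), and `to_windows` is a hypothetical helper name.

```python
def to_windows(series, window_size):
    """Turn a series into (inputs, target) rows using a sliding window."""
    rows = []
    for start in range(len(series) - window_size):
        window = series[start:start + window_size]
        target = series[start + window_size]
        rows.append((window, target))
    return rows


# Made-up series, window size 3 as on the slide.
rows = to_windows([10, 12, 15, 11, 9, 14], window_size=3)
```

Here `rows` contains three training rows, the first being `([10, 12, 15], 11)`. A larger window gives each row more context but leaves fewer rows, which is one side of the window-size question the slide raises.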

  7. Outlier Removal • Outlier: Data points inconsistent with the majority of data • Different outliers • Valid: CEO’s salary • Noisy: One’s age = 200, widely deviated points • Removal methods • Clustering • Curve-fitting • Hypothesis-testing with a given model

  8. Data Preprocessing • Data cleaning • missing data • noisy data • inconsistent data • Data reduction • Dimensionality reduction • Instance selection • Value discretization

  9. Missing Data • Many types of missing data • not measured • not applicable • wrongly placed, and ? • Some methods • leave as is • ignore/remove the instance with missing value • manual fix (assign a value for implicit meaning) • statistical methods (majority, most likely, mean, nearest neighbor, …)
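
Two of the statistical methods listed above, mean and majority, can be sketched in a few lines. The example assumes missing entries are encoded as `None`, which is a choice made for this illustration and not something the slides specify.

```python
import statistics


def impute_mean(values):
    """Fill missing numeric entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def impute_majority(values):
    """Fill missing categorical entries (None) with the most frequent value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


ages = impute_mean([20, None, 40])
colors = impute_majority(["red", None, "red", "blue"])
```

Here `ages` becomes `[20, 30, 40]` and `colors` becomes `["red", "red", "red", "blue"]`. Both methods treat every missing entry the same way, so they are only reasonable when the value is missing for reasons unrelated to the value itself.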

  10. Noisy Data • Noise: Random error or variance in a measured variable • inconsistent values for features or classes (processing) • measuring errors (source) • Noise is normally a minority in the data set • Why? • Removing noise • Clustering/merging • Smoothing (rounding, averaging within a window) • Outlier detection (deviation-based or distance-based)
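
Smoothing by averaging within a window and deviation-based outlier detection can both be sketched briefly. The window handling at the ends and the two-standard-deviation cutoff are assumptions made for this illustration, not values given on the slide.

```python
import statistics


def moving_average(values, window=3):
    """Replace each value by the mean of a centered window (truncated at the ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(statistics.mean(values[lo:hi]))
    return smoothed


def deviation_outliers(values, max_dev=2.0):
    """Return values lying more than max_dev sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > max_dev * std]


smoothed = moving_average([1, 2, 9, 4, 5])
flagged = deviation_outliers([25, 30, 28, 200, 27, 31, 29])
```

The spike at 9 is pulled down to 5 in `smoothed`, and the age of 200 is the only value in `flagged`. Note that a single extreme value also inflates the mean and standard deviation it is compared against, so this kind of test can miss outliers when there are several of them.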

  11. Inconsistent Data • Inconsistent with our models or common sense • Examples • The same name occurs as different ones in an application • Different names appear the same (Dennis vs. Denis) • Inappropriate values (Male-Pregnant, negative age) • One bank’s database shows that 5% of its customers were born on 11/11/11 • …

  12. Dimensionality Reduction • Feature selection • select m from n features, m ≤ n • remove irrelevant, redundant features • + saving in search space • Feature transformation (PCA) • form new features (a) in a new domain from original features (f) • many uses, but it does not reduce the original dimensionality • often used in visualization of data

  13. Feature Selection • Problem illustration • Full set • Empty set • Enumeration • Search • Exhaustive/Complete (Enumeration/B&B) • Heuristic (Sequential forward/backward) • Stochastic (generate/evaluate) • Individual features or subsets generation/evaluation

  14. Feature Selection (2) • Goodness metrics • Dependency: dependence on classes • Distance: separating classes • Information: entropy • Consistency: 1 - #inconsistencies/N • Example: (F1, F2, F3) and (F1,F3) • Both sets have 2/6 inconsistency rate • Accuracy (classifier based): 1 - errorRate • Their comparisons • Time complexity, number of features, removing redundancy
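
The consistency metric can be made concrete with a small sketch. It assumes a commonly used definition: instances that agree on the selected features form a group, and each group contributes its size minus the count of its majority class. The six-row table is made up for illustration (it is not the table from the slides) and is built so that dropping F2 leaves the rate unchanged.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, features):
    """#inconsistencies / N for the given feature indices."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[f] for f in features)
        groups[pattern][label] += 1
    inconsistencies = sum(
        sum(counts.values()) - max(counts.values()) for counts in groups.values()
    )
    return inconsistencies / len(rows)


# Hypothetical data: columns are F1, F2, F3.
rows = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]
labels = ["A", "A", "B", "A", "B", "B"]

rate_full = inconsistency_rate(rows, labels, [0, 1, 2])  # (F1, F2, F3)
rate_sub = inconsistency_rate(rows, labels, [0, 2])      # (F1, F3)
```

Both rates come out as 2/6 for this table, so by the consistency metric (1 - rate) the smaller subset is just as good as the full set.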

  15. Feature Selection (3) • Filter vs. Wrapper Model • Pros and cons • time • generality • performance such as accuracy • Stopping criteria • thresholding (number of iterations, some accuracy,…) • anytime algorithms • providing approximate solutions • solutions improve over time

  16. Feature Selection (Examples) • SFS using consistency (cRate) • select 1 from n, then 1 from n-1, n-2,… features • increase the number of selected features until the pre-specified cRate is reached. • LVF using consistency (cRate) • (1) randomly generate a subset S from the full set • (2) if S satisfies the prespecified cRate, keep S with the minimum #S • (3) go back to step 1 until a stopping criterion is met • LVF is an anytime algorithm • Many other algorithms: SBS, B&B, ...
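
The LVF loop above can be sketched as follows. This is a simplified reading of the three steps, not a reference implementation: the threshold is expressed as a maximum inconsistency rate (the complement of cRate), the stopping criterion is a fixed number of tries, and the data, parameter names and seed are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, features):
    """#inconsistencies / N for the given feature indices."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(rows)


def lvf(rows, labels, max_rate, max_tries=200, seed=0):
    """Keep the smallest random subset whose inconsistency rate is within max_rate."""
    n_features = len(rows[0])
    rng = random.Random(seed)
    best = list(range(n_features))  # start from the full set
    for _ in range(max_tries):
        size = rng.randint(1, len(best))
        subset = sorted(rng.sample(range(n_features), size))
        if len(subset) < len(best) and inconsistency_rate(rows, labels, subset) <= max_rate:
            best = subset  # a smaller subset that still meets the threshold
    return best


# Hypothetical data: the class is determined by feature 0 alone.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
labels = ["A", "A", "B", "B"]
selected = lvf(rows, labels, max_rate=0.0)
```

Because `best` always holds a valid subset, the loop can be stopped at any point and still return a usable answer, which is what makes it an anytime algorithm.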

  17. Transformation: PCA • $D' = DA$, where $D$ is mean-centered ($N \times n$) • Calculate and rank the eigenvalues $\lambda_i$ of the covariance matrix • Select the $m$ largest $\lambda$'s such that $r >$ threshold (e.g., 0.95), where $r = \left(\sum_{i=1}^{m} \lambda_i\right) / \left(\sum_{i=1}^{n} \lambda_i\right)$ • The corresponding eigenvectors form $A$ ($n \times m$) • Example of Iris data
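
The selection of m from the ranked eigenvalues can be sketched on its own, assuming the eigenvalues of the covariance matrix have already been computed. The eigenvalues below are made up for illustration (they are not the Iris values), and `choose_m` is a hypothetical helper name.

```python
def choose_m(eigenvalues, threshold=0.95):
    """Return (m, r): the smallest m whose top-m eigenvalue share r exceeds threshold."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    running = 0.0
    for m, lam in enumerate(ranked, start=1):
        running += lam
        if running / total > threshold:
            return m, running / total
    return len(ranked), 1.0  # threshold not exceeded: keep all components


# Hypothetical eigenvalues for n = 4 features.
m, r = choose_m([4.0, 0.5, 0.3, 0.2], threshold=0.95)
```

For these values the cumulative shares are 0.80, 0.90 and 0.96, so three components are kept and the matrix A would have the three corresponding eigenvectors as its columns.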

  18. Instance Selection • Sampling methods • random sampling • stratified sampling • Search-based methods • Representatives • Prototypes • Sufficient statistics (N, mean, stdDev) • Support vectors
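
Stratified sampling draws the same fraction from every class, so that a rare class is not lost by chance. A minimal sketch, assuming class labels define the strata and that at least one instance per class is always kept (both are choices made for this illustration):

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, labels, fraction, seed=0):
    """Sample the given fraction of instances from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(fraction * len(members)))
        sample.extend((row, label) for row in rng.sample(members, k))
    return sample


# Hypothetical imbalanced data: 90 instances of class P, 10 of class N.
rows = list(range(100))
labels = ["P"] * 90 + ["N"] * 10
sample = stratified_sample(rows, labels, fraction=0.1)
counts = Counter(label for _, label in sample)
```

The sample keeps the 9:1 class ratio of the full data, whereas plain random sampling of 10 instances would contain no N instance at all in roughly a third of the draws.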

  19. Value Discretization • Binning methods • Equal-width • Equal-frequency • Class information is not used • Entropy-based • ChiMerge • Chi2

  20. Binning • Attribute values (for one attribute e.g., age): • 0, 4, 12, 16, 16, 18, 24, 26, 28 • Equi-width binning – for bin width of e.g., 10: • Bin 1: 0, 4 (-∞, 10) bin • Bin 2: 12, 16, 16, 18 [10, 20) bin • Bin 3: 24, 26, 28 [20, +∞) bin • We use -∞ to denote negative infinity, +∞ for positive infinity • Equi-frequency binning – for bin density of e.g., 3: • Bin 1: 0, 4, 12 (-∞, 14) bin • Bin 2: 16, 16, 18 [14, 21) bin • Bin 3: 24, 26, 28 [21, +∞) bin • Any problems with the above methods?
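
Both binning schemes can be reproduced on the slide's age values with a short sketch. The function names are illustrative; the equal-width version assumes bin edges at multiples of the width and simply omits empty bins.

```python
def equal_width_bins(values, width):
    """Group values into bins [0, width), [width, 2*width), ... (empty bins omitted)."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(int(v // width), []).append(v)
    return [bins[index] for index in sorted(bins)]


def equal_frequency_bins(values, per_bin):
    """Group sorted values into consecutive bins of per_bin values each."""
    ordered = sorted(values)
    return [ordered[i:i + per_bin] for i in range(0, len(ordered), per_bin)]


ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
by_width = equal_width_bins(ages, 10)
by_frequency = equal_frequency_bins(ages, 3)
```

`by_width` is `[[0, 4], [12, 16, 16, 18], [24, 26, 28]]` and `by_frequency` is `[[0, 4, 12], [16, 16, 18], [24, 26, 28]]`, as on the slide. Neither function looks at class labels, and the equal-frequency version can place equal values in different bins when a run of duplicates straddles a boundary.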

  21. Entropy-based • Given attribute-value/class pairs: • (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N) • Entropy-based binning via binarization: • Intuitively, find the best split so that the bins are as pure as possible • Formally characterized by maximal information gain. • Let S denote the above 9 pairs, p = 4/9 be the fraction of P pairs, and n = 5/9 be the fraction of N pairs. • $\text{Entropy}(S) = -p \log_2 p - n \log_2 n$. • Smaller entropy: the set is relatively pure; the smallest is 0. • Larger entropy: the set is mixed; the largest is 1 (for two classes).

  22. Entropy-based (2) • Let v be a possible split. Then S is divided into two sets: • S1: value <= v and S2: value > v • Information of the split: • I(S1,S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2) • Information gain of the split: • Gain(v,S) = Entropy(S) - I(S1,S2) • Goal: split with maximal information gain. • Possible splits: midpoints between any two consecutive values. • For v=14, I(S1,S2) = 0 + 6/9*Entropy(S2) = 6/9 * 0.65 = 0.433 • Gain(14,S) = Entropy(S) - 0.433 = 0.991 - 0.433 = 0.558 • Maximum Gain means minimum I. • The best split is found after examining all possible split points.
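
The split search can be run on the nine pairs from the previous slide. The sketch below uses base-2 logarithms, which is what makes the largest two-class entropy equal to 1; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_info(pairs, v):
    """I(S1, S2): size-weighted entropy of the two sides of split point v."""
    s1 = [c for x, c in pairs if x <= v]
    s2 = [c for x, c in pairs if x > v]
    n = len(pairs)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)


def best_split(pairs):
    """Midpoint between consecutive distinct values with maximal information gain."""
    values = sorted({x for x, _ in pairs})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    base = entropy([c for _, c in pairs])
    return max(candidates, key=lambda v: base - split_info(pairs, v))


pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
base_entropy = entropy([c for _, c in pairs])
info_14 = split_info(pairs, 14)
split = best_split(pairs)
```

For this data `base_entropy` is about 0.991, `info_14` is about 0.433, and `split` is 14.0, so the split worked out by hand on the slide is also the one with maximal gain among all seven candidate midpoints.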

  23. ChiMerge and Chi2 • Given attribute-value/class pairs • Build a contingency table for every pair of intervals • Chi-Squared Test (goodness-of-fit): $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} (A_{ij} - E_{ij})^2 / E_{ij}$ • Parameters: df = k-1 and p% level of significance • Chi2 algorithm provides an automatic way to adjust p
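
The statistic can be computed directly from the per-class counts of two intervals. The sketch assumes the usual expected count for a contingency table, $E_{ij} = R_i C_j / N$ (row total times column total over the grand total), and skips classes that occur in neither interval; both points are assumptions of this illustration and are not spelled out on the slide.

```python
def chi_square(interval_a, interval_b):
    """Chi-squared statistic for two intervals given as per-class count lists."""
    rows = [interval_a, interval_b]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(interval_a, interval_b)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(col_totals)):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip classes absent from both intervals
                chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2


# Counts are [P, N] per interval (hypothetical).
different = chi_square([3, 0], [0, 3])  # class distributions differ completely
similar = chi_square([2, 2], [2, 2])    # identical class distributions
```

`different` is 6.0 and `similar` is 0.0. A low value means the two intervals have similar class distributions and are candidates for merging; the comparison threshold comes from the chi-squared distribution with df = k-1 at the chosen significance level.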

  24. Summary • Data have many forms • Attribute-vectors: the most common form • Raw data need to be prepared and preprocessed for data mining • Data miners have to work on the data provided • Domain expertise is important in DPP • Data preparation: Normalization, Transformation • Data preprocessing: Cleaning and Reduction • DPP is a critical and time-consuming task • Why?

  25. Bibliography • H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer. • M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE and Wiley-Interscience. • H. Liu & H. Motoda (eds.), 2001. Instance Selection and Construction for Data Mining. Kluwer. • H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.