Ch2 Data Preprocessing part3

Dr. Bernard Chen, Ph.D., University of Central Arkansas, Fall 2009.

Presentation Transcript


  1. Ch2 Data Preprocessing part3 • Dr. Bernard Chen, Ph.D. • University of Central Arkansas • Fall 2009

  2. Knowledge Discovery (KDD) Process • Data mining: core of the knowledge discovery process • Process flow (figure): Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge

  3. Forms of Data Preprocessing

  4. Data Transformation • Data transformation – the data are transformed or consolidated into forms appropriate for mining

  5. Data Transformation • Data transformation can involve the following: • Smoothing: remove noise from the data, using techniques such as binning, regression, and clustering • Aggregation • Generalization • Normalization • Attribute construction

  6. Normalization • Min-max normalization • Z-score normalization • Decimal normalization

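  7. Min-max normalization • Min-max normalization maps a value v of attribute A to [new_minA, new_maxA]: v' = (v - minA) / (maxA - minA) × (new_maxA - new_minA) + new_minA • Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 - 12,000) / (98,000 - 12,000) × (1.0 - 0.0) + 0.0 ≈ 0.709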
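
The min-max mapping can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `min_max_normalize` and its parameter names are assumptions chosen for illustration, and the formula used is the standard one, v' = (v - minA) / (maxA - minA) × (new_maxA - new_minA) + new_minA.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # The mapping is undefined when the attribute has a single value.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example from the slide: range $12,000 to $98,000, target [0.0, 1.0].
mapped = min_max_normalize(73_000, 12_000, 98_000)
print(round(mapped, 3))  # 0.709
```

Note that a later value outside the original range (say, an income above $98,000) would map outside [0.0, 1.0]; the sketch does not clip such values.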

  8. Z-score normalization • Z-score normalization (μ: mean, σ: standard deviation of attribute A): v' = (v - μ) / σ • Ex. Let μ = 54,000, σ = 16,000. Then a value v is mapped to (v - 54,000) / 16,000
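
A short Python sketch of z-score normalization, v' = (v - μ) / σ, follows. The function names are hypothetical. Two assumptions are made for illustration only: the input value 73,000 is borrowed from the min-max example (the slide text does not say which value it normalizes), and `z_score_all` uses the population standard deviation (`statistics.pstdev`), which the slides do not specify.

```python
from statistics import mean, pstdev


def z_score(v, mu, sigma):
    """Return how many standard deviations v lies from the mean mu."""
    if sigma == 0:
        raise ValueError("sigma must be non-zero")
    return (v - mu) / sigma


def z_score_all(values):
    """Normalize a whole column, estimating mu and sigma from the data.

    Assumption: population standard deviation (pstdev) is used here.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [z_score(v, mu, sigma) for v in values]


# Slide parameters mu = 54,000 and sigma = 16,000; 73,000 is an assumed input.
print(z_score(73_000, 54_000, 16_000))  # 1.1875

# Made-up column with mean 5 and population standard deviation 2.
print(z_score_all([2, 4, 4, 4, 5, 5, 7, 9]))
```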

  9. Decimal normalization • Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1 • Ex. Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3, and -986 normalizes to -0.986 while 917 normalizes to 0.917
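
Decimal scaling divides every value by 10^j for the smallest j that brings the largest absolute value below 1. The sketch below is one way to compute j; the function names are hypothetical, and it assumes j is restricted to non-negative integers (the slide does not discuss data whose magnitudes are already far below 1).

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, moving the decimal point j places."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


# Slide example: values range from -986 to 917.
print(decimal_scaling_exponent([-986, 917]))  # 3
print(decimal_scale([-986, 917]))             # [-0.986, 0.917]
```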

  10. Data Reduction • Why data reduction? • A database/data warehouse may store terabytes of data • Complex data analysis/mining may take a very long time to run on the complete data set

  11. Data Reduction • Data reduction • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

  12. Data Reduction • Data reduction strategies • Data cube aggregation • Attribute subset selection • Dimensionality reduction, e.g., remove unimportant attributes • Numerosity reduction, e.g., fit data into models • Discretization and concept hierarchy generation

  13. Data cube aggregation

  14. Data cube aggregation • Data cubes store multiple levels of aggregation • Aggregating further reduces the size of the data to deal with • Reference the appropriate level of aggregation for the task • Use the smallest representation that is sufficient to solve the task
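
As an illustration of moving to a coarser aggregation level, the sketch below rolls quarterly sales up to yearly totals. The rows, the sales figures, and the function name `roll_up_to_year` are all made up for this example; they do not come from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, sales). The figures are invented.
quarterly_sales = [
    (2008, "Q1", 224), (2008, "Q2", 408), (2008, "Q3", 350), (2008, "Q4", 586),
    (2009, "Q1", 250), (2009, "Q2", 300), (2009, "Q3", 400), (2009, "Q4", 500),
]


def roll_up_to_year(rows):
    """Aggregate quarter-level rows into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, sales in rows:
        totals[year] += sales
    return dict(totals)


yearly_sales = roll_up_to_year(quarterly_sales)
print(yearly_sales)  # {2008: 1568, 2009: 1450}
```

If the task only asks about annual sales, the two-row yearly table answers it as well as the eight quarterly rows do, which is the sense in which the smallest sufficient representation is preferred.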

  15. Attribute subset selection / Dimensionality reduction • Feature selection (i.e., attribute subset selection): • Select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the original distribution given the values of all features • Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand

  16. Attribute subset selection / Dimensionality reduction • Heuristic methods (needed because the number of possible attribute subsets is exponential): • Step-wise forward selection • Step-wise backward elimination • Combining forward selection and backward elimination • Decision-tree induction
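
Step-wise forward selection can be sketched as a greedy loop: start from the empty set and repeatedly add the attribute that improves a quality score the most, stopping when no attribute helps. The code below is a minimal sketch under stated assumptions: `forward_selection`, `toy_score`, and the weights are hypothetical, and a real application would replace `toy_score` with a measure such as classifier accuracy or information gain on held-out data.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedily add the attribute that raises score() the most.

    `score` is a caller-supplied function mapping a list of attributes
    to a number (higher is better). Stops when the best gain <= min_gain.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        top_score, top_attr = max(
            ((score(selected + [a]), a) for a in remaining),
            key=lambda pair: pair[0],
        )
        if top_score - best_score <= min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected


# Made-up usefulness weights; attributes not listed contribute nothing.
WEIGHTS = {"A1": 0.3, "A4": 0.5, "A6": 0.2}


def toy_score(subset):
    """Hypothetical score: total usefulness minus a small per-attribute cost."""
    return sum(WEIGHTS.get(a, 0.0) for a in subset) - 0.01 * len(subset)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(chosen)  # ['A4', 'A1', 'A6']
```

Backward elimination follows the same pattern in reverse: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.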

  17. Attribute subset selection / Dimensionality reduction • Initial attribute set: {A1, A2, A3, A4, A5, A6} • Decision tree (figure): tests on A4, A6, and A1, with leaves labeled Class 1 and Class 2 • Reduced attribute set: {A1, A4, A6}

  18. Numerosity reduction • Reduce data volume by choosing alternative, smaller forms of data representation • Major families: histograms, clustering, sampling

  19. Data Reduction Method: Histograms

  20. Data Reduction Method: Histograms • Divide data into buckets and store the average (or sum) for each bucket • Partitioning rules: • Equal-width: every bucket covers an equal value range • Equal-frequency (or equal-depth): every bucket holds roughly the same number of values • V-optimal: the histogram with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents • MaxDiff: place a bucket boundary between each pair of adjacent values whose difference is among the β-1 largest, where β is the number of buckets
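
The two simplest partitioning rules, equal-width and equal-frequency, can be compared on a small skewed data set. The sketch below is illustrative only: the function names and the sample values are assumptions, and the V-optimal and MaxDiff rules are not implemented.

```python
def equal_width_buckets(values, n_buckets):
    """Split the value range [min, max] into n_buckets equal-width intervals."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [list(values)] + [[] for _ in range(n_buckets - 1)]
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets


def equal_frequency_buckets(values, n_buckets):
    """Sort the values and cut them into n_buckets runs of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [i * n // n_buckets for i in range(n_buckets + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(n_buckets)]


def summarize(buckets):
    """Keep only (low, high, count, sum) per non-empty bucket."""
    return [(min(b), max(b), len(b), sum(b)) for b in buckets if b]


# Made-up skewed data: one large outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
print([len(b) for b in equal_width_buckets(data, 3)])      # [8, 0, 1]
print([len(b) for b in equal_frequency_buckets(data, 3)])  # [3, 3, 3]
print(summarize(equal_frequency_buckets(data, 3)))
```

On skewed data the equal-width rule leaves a bucket empty while equal-frequency spreads the values evenly, which is one reason to consider the other partitioning rules.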

  21. Data Reduction Method: Clustering • Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only • There are many choices of clustering definitions and clustering algorithms • Cluster analysis will be studied in depth in Chapter 7
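
Once clusters are known, each one can be replaced by a compact summary. The sketch below assumes the cluster membership is already given (no clustering algorithm is run), uses Euclidean distance, and takes the diameter to be the largest distance between any two points in the cluster; `summarize_cluster` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    # Assumption: diameter = maximum pairwise distance; 0.0 for a single point.
    diameter = max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    return centroid, diameter


# Made-up cluster: the four corners of a 4-by-3 rectangle.
cluster = [(0, 0), (4, 0), (4, 3), (0, 3)]
centroid, diameter = summarize_cluster(cluster)
print(centroid, diameter)
```

Storing only the centroid and diameter replaces the four points with three numbers here; the saving grows with cluster size.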

  22. Data Reduction Method: Sampling • Sampling: obtain a small sample of s tuples to represent the whole data set D of N tuples • Simple random sample without replacement (SRSWOR) • Simple random sample with replacement (SRSWR) • Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, then a simple random sample of s clusters can be obtained, where s < M • Stratified sample: D is divided into mutually disjoint strata and a simple random sample is drawn from each stratum
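
The four sampling schemes can be sketched with Python's standard `random` module. The function names, the proportional allocation rule in `stratified_sample`, and the toy data are assumptions made for illustration; a seeded `random.Random` instance is passed in so runs are repeatable.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return rng.choices(data, k=s)


def cluster_sample(clusters, s, rng):
    """Draw s whole clusters (s < M) and keep every tuple inside them."""
    chosen = rng.sample(clusters, s)
    return [t for cluster in chosen for t in cluster]


def stratified_sample(strata, fraction, rng):
    """Draw an SRSWOR from each stratum, sized in proportion to the stratum.

    Assumption: each non-empty stratum contributes at least one tuple.
    """
    sample = []
    for tuples in strata.values():
        k = min(len(tuples), max(1, round(len(tuples) * fraction)))
        sample.extend(rng.sample(tuples, k))
    return sample


rng = random.Random(42)  # fixed seed, chosen arbitrarily
data = list(range(100))
clusters = [data[i:i + 10] for i in range(0, 100, 10)]  # M = 10 clusters
strata = {"young": list(range(0, 20)), "senior": list(range(100, 110))}

print(len(set(srswor(data, 8, rng))))              # 8: all distinct
print(len(srswr(data, 8, rng)))                    # 8: duplicates possible
print(len(cluster_sample(clusters, 3, rng)))       # 30: three full clusters
print(len(stratified_sample(strata, 0.2, rng)))    # 6: 4 young + 2 senior
```

Stratified sampling guarantees that the small "senior" stratum is represented, which a plain simple random sample of the same size does not.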

  23. Sampling: with or without Replacement • Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)

  24. Sampling: Cluster or Stratified Sampling • Figure: raw data and the resulting cluster/stratified sample