
Data preprocessing before classification






Presentation Transcript


  1. Data preprocessing before classification In Kennedy et al.: “Solving data mining problems”

  2. Outline • Ch.7 Collecting data • Ch.8 Preparing data • Ch.9 Data preprocessing

  3. Ch.7 Collecting data

  4. Collecting data • Collecting “example patterns”: inputs (vectors of independent variables) and outputs (vectors of dependent variables) • More data is better • Begin with an elementary set of data

  5. Collecting data • Choose an appropriate sampling rate for time-series data. • Make sure the measurement units are consistent. • Keep non-essential variables out of the input vector. • Make sure no major structural (systemic) changes occurred during collection.

  6. Collecting data • How much data is enough? • Train and test using a subset of the data • If performance does not improve when the full data set is used, the data is enough (see the sketch below) • Statistical validation methods also exist (Ch.11) • Using simulated data • An option when it is difficult to collect (sufficient) real data • The simulation must be realistic and representative
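A minimal way to run the “train on a subset” check above is to fit the same model on growing fractions of the training data and watch held-out accuracy: if it plateaus before all the data is used, the data is likely enough. The sketch below uses Python with scikit-learn and placeholder data; the classifier choice and the data are illustrative assumptions, not taken from the book.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data; substitute the collected example patterns.
    rng = np.random.default_rng(0)
    X, y = rng.random((1000, 5)), rng.integers(0, 2, 1000)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for frac in (0.25, 0.5, 0.75, 1.0):
        n = int(len(X_tr) * frac)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
        print(f"{frac:.0%} of data -> accuracy {model.score(X_te, y_te):.3f}")
    # If accuracy stops improving before 100%, collecting more of the
    # same kind of data is unlikely to help.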

  7. Ch.8 Preparing data

  8. Preparing data • Handling • Missing data • Categorical data • Inconsistent data and outliers

  9. Missing data • Discard incomplete example patterns • Manually enter a reasonable, probable, or expected value • Use a statistic computed from the example patterns that do have the value • Mean, mode • Encode missing values explicitly by creating new indicator variables • Build a predictive model to predict each missing value (a sketch of the simplest options follows below)
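The first three options above can be sketched in a few lines of pandas; the DataFrame and the column name are illustrative assumptions.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [30_000, np.nan, 52_000, 41_000]})

    # Option 1: discard incomplete example patterns
    complete = df.dropna()

    # Option 2: fill with a statistic from the observed values
    # (mean here; the mode works the same way for categorical data)
    df["income_filled"] = df["income"].fillna(df["income"].mean())

    # Option 3: encode missingness explicitly with a new indicator variable
    df["income_missing"] = df["income"].isna().astype(int)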

  10. Categorical data • Ordinal: • Convert to a numerical representation in a straightforward manner • “Low”, “medium”, “high” => 0, 1, 2 • Nominal: • “One-of-n” representation • Encode the input variable as n different binary inputs when there are n distinct categories (sketch below)
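Both conversions are short in pandas; the columns and category values below are illustrative assumptions, not examples from the book.

    import pandas as pd

    df = pd.DataFrame({"size": ["low", "high", "medium"],
                       "color": ["red", "green", "blue"]})

    # Ordinal: straightforward numeric mapping
    df["size_num"] = df["size"].map({"low": 0, "medium": 1, "high": 2})

    # Nominal: one-of-n ("one-hot") encoding, n binary inputs for n categories
    df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)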

  11. Further processing of “one-of-n” • When n is too large, reduce the number of inputs in the new encoding: • Manually • PCA-based reduction • Reduce the one-of-n representation to a one-of-m representation, where m is less than n • Eigenvalue-based reduction • Output-variable-based reduction

  12. Inconsistent data and outliers • Removing erroneous data • Identifying inconsistent data • Thresholding, filtering • Outliers • Data points that lie outside the normal region of interest in the input space; they may be • Unusual situations that are “correct” • Misleading or incorrect measurements

  13. Outliers • Ways to spot outliers • Plots: box plot, histogram… • Number of standard deviations from the mean • Handling outliers • Remove them • Assumption: the region of the input space where the outliers reside is not of concern • “Winsorize” them • Clip the outlier values to upper or lower thresholds (sketch below) • Outliers can always be reintroduced into a satisfactory model to study the change in its performance.
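A minimal NumPy sketch of both steps above: flagging points by number of standard deviations from the mean, then winsorizing at percentile thresholds. The synthetic data, the 3-S.D. cutoff, and the 1st/99th percentile thresholds are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.append(rng.normal(0.0, 1.0, 200), [8.0, -9.0])  # two injected outliers

    # Spot outliers: points more than 3 standard deviations from the mean
    z = np.abs((x - x.mean()) / x.std())
    print("flagged:", x[z > 3])

    # Winsorize: clip values to the 1st/99th percentile thresholds
    lo, hi = np.percentile(x, [1, 99])
    x_winsorized = np.clip(x, lo, hi)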

  14. [Image slide: Ben Shabad]

  15. Ch.9 Data preprocessing

  16. Reasons to preprocess data • Reducing noise • Enhancing the signal • Reducing the input space • Feature extraction • Normalizing data • Modifying prior probabilities (specific to classification)

  17. Reducing noise • Averaging data values • Thresholding data • Converting numeric data into categorical form • E.g. grey-scale => monochrome image (sketch below)
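Both techniques can be sketched with NumPy; the noisy sine signal, the window size, and the threshold are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    signal = np.sin(np.linspace(0, 4 * np.pi, 200)) + rng.normal(0, 0.3, 200)

    # Averaging: a simple moving average smooths out noise
    window = 5
    smoothed = np.convolve(signal, np.ones(window) / window, mode="same")

    # Thresholding: convert numeric data to categories
    # (analogous to grey-scale => monochrome)
    binary = (smoothed > 0.0).astype(int)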

  18. Reducing input space • Principal component analysis (PCA) • Identify an m-dimensional subspace of the n-dimensional input space • The original n variables are reduced to m variables that are mutually orthogonal (uncorrelated) • Eliminating correlated input variables • Identify highly correlated input variables by • Statistical correlation tests • Visual inspection of graphed data variables • Checking whether one variable can be modeled using one or more of the others (sketch below)
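A sketch of both reductions with scikit-learn and NumPy; the 95% explained-variance target and the 0.9 correlation cutoff are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.random((500, 10))  # placeholder: n = 10 input variables

    # PCA: keep the m orthogonal components explaining 95% of the variance
    pca = PCA(n_components=0.95).fit(X)
    print(X.shape[1], "->", pca.transform(X).shape[1], "variables")

    # Correlation test: flag highly correlated pairs of input variables
    corr = np.corrcoef(X, rowvar=False)
    pairs = np.argwhere(np.triu(np.abs(corr) > 0.9, k=1))
    print("correlated pairs:", pairs.tolist())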

  19. Reducing input space • Combining non-correlated input variables • Sensitivity analysis • If variations in a particular input variable cause large changes in the model output, that variable is very significant • Sensitivity analysis prunes input variables based on information from both the input and the output data (sketch below)
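One simple form of sensitivity analysis perturbs each input in turn and measures the resulting change in the model output. The function below is a generic sketch; the linear model, the data, and the perturbation size are assumptions for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def sensitivity(model, X, delta=0.1):
        # Mean absolute change in output per small perturbation of each input
        base = model.predict(X)
        scores = []
        for j in range(X.shape[1]):
            X_pert = X.copy()
            X_pert[:, j] += delta * X[:, j].std()  # nudge variable j only
            scores.append(np.mean(np.abs(model.predict(X_pert) - base)))
        return np.array(scores)  # large score => significant variable

    rng = np.random.default_rng(0)
    X = rng.random((200, 4))
    y = 3 * X[:, 0] + 0.1 * X[:, 3]  # only variables 0 and 3 matter
    print(sensitivity(LinearRegression().fit(X, y), X))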

  20. Normalizing data • Here this does not mean “transform to a normal distribution” • Helps many models perform better • Non-parametric algorithms implicitly assume that distances in different directions carry the same weight (e.g. k-nearest neighbor, “KNN”) • Backpropagation (BP) and multi-layer perceptron (MLP) models often perform better if all inputs and outputs are normalized • Avoids numerical problems

  21. Types of normalization • Min-max normalization • Preserves all relationships among the data values exactly • Compresses the normal range if extreme values or outliers exist • Z-score normalization • Sigmoidal normalization (formulas sketched below)
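The three normalizations can be written directly from their definitions. The sigmoidal form below, a logistic squashing of the z-score, is one common variant; treating it as the book’s exact formula is an assumption.

    import numpy as np

    x = np.array([2.0, 4.0, 10.0, 6.0, 8.0])

    # Min-max: maps onto [0, 1]; relationships preserved exactly,
    # but outliers compress the normal range
    min_max = (x - x.min()) / (x.max() - x.min())

    # Z-score: zero mean, unit standard deviation
    z_score = (x - x.mean()) / x.std()

    # Sigmoidal: squashes the z-score smoothly into (0, 1)
    sigmoidal = 1.0 / (1.0 + np.exp(-z_score))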

  22. Other considerations • Preprocess according to the characteristics of the specific classifier being used for modeling • E.g. CHAID uses categorical data directly • Input variables tend to produce the best modeling accuracy when they exhibit a uniform or Gaussian distribution • Incorporate expert knowledge when preprocessing data

  23. Get prepared and then go!
