Data preprocessing before classification. Based on Kennedy et al., “Solving Data Mining Problems”
Outline • Ch.7 Collecting data • Ch.8 Preparing data • Ch.9 Data preprocessing
Collecting data • Collecting “example patterns” • Inputs (vectors of independent variables) • Outputs (vectors of dependent variables) • More data is better • Begin with an elementary set of data
Collecting data • Choose an appropriate sampling rate for time-series data. • Make sure the data measurement units are consistent. • Retain non-essential variables even if they are not in the input vector; they may prove useful later. • Make sure no major structural (systemic) changes have occurred during collection.
Collecting data • How much data is enough? • Train and test using a subset of the data; if performance does not improve when the full data set is used, the data is sufficient (see the sketch below). • There are statistical validation methods (Ch.11) • Using simulated data • When it is difficult to collect (sufficient) real data • Simulated data must be realistic and representative.
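A minimal sketch of the subset check described above, assuming a scikit-learn workflow; the synthetic dataset and the k-nearest-neighbor classifier are illustrative stand-ins, not prescribed by the book:

```python
# Train on increasing fractions of the data; if test accuracy plateaus
# before the full set is used, the data is likely sufficient.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(frac * len(X_train))
    model = KNeighborsClassifier().fit(X_train[:n], y_train[:n])
    print(f"{n:5d} training patterns -> test accuracy "
          f"{model.score(X_test, y_test):.3f}")
```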
Preparing data • Handling • Missing data • Categorical data • Inconsistent data and outliers
Missing data • Discard incomplete example patterns • Manually enter a reasonable, probable, or expected value • Use a statistic generated from the example patterns in which the value is present (mean, mode) • Encode missing values explicitly by creating new indicator variables • Build a predictive model to predict each missing value (a sketch of two of these strategies follows)
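A minimal sketch of the indicator-variable and statistic-imputation strategies using pandas; the column name is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical example patterns with two missing entries.
df = pd.DataFrame({"income": [42.0, np.nan, 58.0, np.nan, 61.0]})

# Encode missingness explicitly with a new indicator variable.
df["income_missing"] = df["income"].isna().astype(int)

# Impute the missing entries with a statistic (here, the mean).
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```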
Categorical data • Ordinal: • Convert to a numerical representation in a straightforward manner • “Low”, “medium”, “high” => 0, 1, 2 • Nominal: • “One-of-n” representation • Encode the variable as n different binary inputs when there are n distinct categories (see the sketch below)
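A minimal sketch of both conversions with pandas; the category labels are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "medium"],
                   "color": ["red", "green", "blue"]})

# Ordinal: map an ordered scale directly onto integers.
df["risk_num"] = df["risk"].map({"low": 0, "medium": 1, "high": 2})

# Nominal: one-of-n ("one-hot") encoding, n binary columns for n categories.
df = pd.get_dummies(df, columns=["color"])
print(df)
```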
Further processing of “one-of-n” • When n is too large, reduce the number of inputs in the new encoding: • Manually • PCA-based reduction: reduce the one-of-n representation to a one-of-m representation where m is less than n (see the sketch below) • Eigenvalue-based reduction • Output-variable-based reduction
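A minimal sketch of PCA-based reduction of a one-of-n encoding to one-of-m; the synthetic categories and the choice m = 3 are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# 200 hypothetical patterns drawn from n = 8 nominal categories.
codes = pd.Series(np.random.default_rng(0).choice(list("abcdefgh"), size=200))
onehot = pd.get_dummies(codes)          # n = 8 binary inputs

m = 3                                   # target number of inputs, m < n
reduced = PCA(n_components=m).fit_transform(onehot)
print(onehot.shape, "->", reduced.shape)  # (200, 8) -> (200, 3)
```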
Inconsistent data and outliers • Removing erroneous data • Identifying inconsistent data • Thresholding, filtering • Outliers • Data points that lie outside of the normal region of interest in the input space, which may be • Unusual situations that are “correct” • Misleading or incorrect measurements
Outliers • Ways to spot outliers • Plots: box plot, histogram… • Number of S.D.s from the mean • Handling outliers • Remove them (assumes the region of input space where the outliers reside is not of interest) • “Winsorize” them: convert the values of outliers to the upper or lower threshold values • Outliers can later be reintroduced into a satisfactory model to study the change in the model's performance (a sketch follows)
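A minimal sketch of spotting outliers by standard deviations from the mean and winsorizing them; the data and the 2-S.D. cutoff are illustrative choices:

```python
import numpy as np

x = np.array([9.8, 10.1, 10.3, 9.9, 55.0, 10.0])  # 55.0 is an outlier
mu, sd = x.mean(), x.std()
lo, hi = mu - 2 * sd, mu + 2 * sd   # note: outliers inflate the S.D.,
                                    # so robust cutoffs are often preferred

print("outliers:", x[(x < lo) | (x > hi)])
x_winsorized = np.clip(x, lo, hi)   # replace extremes with the thresholds
print(x_winsorized)
```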
Reasons to preprocess data • Reducing noise • Enhancing the signal • Reducing the input space • Feature extraction • Normalizing data • Modifying prior probabilities (specific to classification)
Reducing noise • Averaging data values • Thresholding data • Convert numeric data into categorical form • E.g. grey-scale image => monochrome (black/white) image (see the sketch below)
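A minimal sketch of thresholding grey-scale values (0–255) into a two-category black/white image; the cutoff of 128 is an arbitrary illustrative choice:

```python
import numpy as np

grey = np.array([[12, 200, 90],
                 [255, 30, 140]])
mono = (grey >= 128).astype(int)   # 0 = black, 1 = white
print(mono)
```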
Reducing input space • Principal component analysis (PCA) • Identify an m-dimensional subspace of the n-dimensional input space • The original n variables are reduced to m variables that are mutually orthogonal (uncorrelated) • Eliminating correlated input variables • Identify highly correlated input variables by • Statistical correlation tests • Visual inspection of graphed data variables • Checking whether a data variable can be modeled using one or more of the others (a sketch follows)
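A minimal sketch of the correlation check and PCA reduction; the synthetic variables are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=300)   # highly correlated with x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])              # n = 3 input variables

print(np.corrcoef(X, rowvar=False).round(2))   # spot |r| near 1

X_reduced = PCA(n_components=2).fit_transform(X)  # m = 2 orthogonal inputs
print(X.shape, "->", X_reduced.shape)
```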
Reducing input space • Combining non-correlated input variables • Sensitivity analysis • If variations of a particular input variable cause large changes in the estimation model's output, that variable is very significant. • Sensitivity analysis prunes input variables based on information provided by both the input and the output data (see the sketch below).
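A minimal sketch of sensitivity analysis: perturb one input at a time and measure the resulting change in the model output. The model here is a hypothetical stand-in for a fitted estimator:

```python
import numpy as np

def model(X):
    # Hypothetical fitted estimator: depends strongly on input 0,
    # weakly on input 1, and not at all on input 2.
    return 5.0 * X[:, 0] + 0.1 * X[:, 1] + 0.0 * X[:, 2]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
base = model(X)

for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] += X[:, j].std()          # perturb variable j by one S.D.
    sensitivity = np.abs(model(Xp) - base).mean()
    print(f"input {j}: mean |output change| = {sensitivity:.3f}")
# Inputs with near-zero sensitivity are candidates for pruning.
```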
Normalizing data • Not “transform to a normal distribution” • For models that perform better with normalized inputs • Non-parametric algorithms implicitly assume that distances in different directions carry the same weight (e.g. k-nearest neighbors, “KNN”) • Backpropagation (BP) and multi-layer perceptron (MLP) models often perform better if all inputs and outputs are normalized • Avoiding numerical problems
Types of normalization • Min-max normalization • Preserves all relationships among the data values exactly • Compresses the normal range if extreme values or outliers exist • Z-score normalization • Sigmoidal normalization (a sketch of all three follows)
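A minimal sketch of the three normalizations on a toy vector; one common form of sigmoidal normalization (assumed here) applies a sigmoid to the z-score:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])   # 100.0 is an extreme value

min_max = (x - x.min()) / (x.max() - x.min())   # maps into [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, S.D. 1
sigmoidal = 1.0 / (1.0 + np.exp(-z_score))      # squashes into (0, 1)

print(min_max)    # note how the extreme value compresses the normal range
print(z_score)
print(sigmoidal)
```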
Other considerations • Match preprocessing to the characteristics of the specific classifier being used for modeling • E.g. CHAID uses categorical data directly • Input variables tend to yield the best modeling accuracy when they exhibit a uniform or Gaussian distribution • Incorporate expert knowledge when preprocessing data