Data preprocessing before classification. Based on Kennedy et al., “Solving Data Mining Problems”
Outline • Ch.7 Collecting data • Ch.8 Preparing data • Ch.9 Data preprocessing
Collecting data • Collecting “example patterns” • Inputs (vectors of independent variables) • Outputs (vectors of dependent variables) • More data is better • Begin with an elementary set of data
Collecting data • Choose an appropriate sampling rate for time-series data. • Make sure the data measurement units are consistent. • Retain non-essential variables even if they are not in the input vector; they may prove useful later. • Make sure no major structural (systemic) changes have occurred during collection.
Collecting data • How much data is enough? • Train and test using a subset of the data; if performance does not improve when the full data set is used, the data is sufficient (see the sketch below). • There are statistical validation methods (Ch.11) • Using simulated data • When it is difficult to collect (sufficient) real data • Simulated data must be realistic and representative.
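A minimal sketch of the subset check described above, assuming a scikit-learn workflow; the synthetic dataset and the k-nearest-neighbor classifier are illustrative stand-ins, not prescribed by the book:

```python
# Train on increasing fractions of the data; if test accuracy plateaus
# before the full set is used, the data is likely sufficient.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(frac * len(X_train))
    model = KNeighborsClassifier().fit(X_train[:n], y_train[:n])
    print(f"{n:5d} training patterns -> test accuracy "
          f"{model.score(X_test, y_test):.3f}")
```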
Preparing data • Handling • Missing data • Categorical data • Inconsistent data and outliers
Missing data • Discard incomplete example patterns • Manually enter a reasonable, probable, or expected value • Use a statistic generated from the example patterns in which the value is present (mean, mode) • Encode missing values explicitly by creating new indicator variables • Build a predictive model to predict each missing value (a sketch of two of these strategies follows)
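A minimal sketch of the indicator-variable and statistic-imputation strategies using pandas; the column name is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical example patterns with two missing entries.
df = pd.DataFrame({"income": [42.0, np.nan, 58.0, np.nan, 61.0]})

# Encode missingness explicitly with a new indicator variable.
df["income_missing"] = df["income"].isna().astype(int)

# Impute the missing entries with a statistic (here, the mean).
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```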
Categorical data • Ordinal: • Convert to a numerical representation in a straightforward manner • “Low”, “medium”, “high” => 0, 1, 2 • Nominal: • “One-of-n” representation • Encode the variable as n different binary inputs when there are n distinct categories (see the sketch below)
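A minimal sketch of both conversions with pandas; the category labels are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "medium"],
                   "color": ["red", "green", "blue"]})

# Ordinal: map an ordered scale directly onto integers.
df["risk_num"] = df["risk"].map({"low": 0, "medium": 1, "high": 2})

# Nominal: one-of-n ("one-hot") encoding, n binary columns for n categories.
df = pd.get_dummies(df, columns=["color"])
print(df)
```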
Further processing of “one-of-n” • When n is too large, reduce the number of inputs in the new encoding: • Manually • PCA-based reduction: reduce the one-of-n representation to a one-of-m representation where m is less than n (see the sketch below) • Eigenvalue-based reduction • Output-variable-based reduction
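A minimal sketch of PCA-based reduction of a one-of-n encoding to one-of-m; the synthetic categories and the choice m = 3 are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# 200 hypothetical patterns drawn from n = 8 nominal categories.
codes = pd.Series(np.random.default_rng(0).choice(list("abcdefgh"), size=200))
onehot = pd.get_dummies(codes)          # n = 8 binary inputs

m = 3                                   # target number of inputs, m < n
reduced = PCA(n_components=m).fit_transform(onehot)
print(onehot.shape, "->", reduced.shape)  # (200, 8) -> (200, 3)
```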
Inconsistent data and outliers • Removing erroneous data • Identifying inconsistent data • Thresholding, filtering • Outliers • Data points that lie outside of the normal region of interest in the input space, which may be • Unusual situations that are “correct” • Misleading or incorrect measurements
Outliers • Ways to spot outliers • Plots: box plot, histogram… • Number of S.D.s from the mean • Handling outliers • Remove them (assumes the region of input space where the outliers reside is not of interest) • “Winsorize” them: convert the values of outliers to the upper or lower threshold values • Outliers can later be reintroduced into a satisfactory model to study the change in the model's performance (a sketch follows)
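A minimal sketch of spotting outliers by standard deviations from the mean and winsorizing them; the data and the 2-S.D. cutoff are illustrative choices:

```python
import numpy as np

x = np.array([9.8, 10.1, 10.3, 9.9, 55.0, 10.0])  # 55.0 is an outlier
mu, sd = x.mean(), x.std()
lo, hi = mu - 2 * sd, mu + 2 * sd   # note: outliers inflate the S.D.,
                                    # so robust cutoffs are often preferred

print("outliers:", x[(x < lo) | (x > hi)])
x_winsorized = np.clip(x, lo, hi)   # replace extremes with the thresholds
print(x_winsorized)
```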
Reasons to preprocess data • Reducing noise • Enhancing the signal • Reducing the input space • Feature extraction • Normalizing data • Modifying prior probabilities (specific to classification)
Reducing noise • Averaging data values • Thresholding data • Convert numeric data into categorical form • E.g. grey-scale image => monochrome (black/white) image (see the sketch below)
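A minimal sketch of thresholding grey-scale values (0–255) into a two-category black/white image; the cutoff of 128 is an arbitrary illustrative choice:

```python
import numpy as np

grey = np.array([[12, 200, 90],
                 [255, 30, 140]])
mono = (grey >= 128).astype(int)   # 0 = black, 1 = white
print(mono)
```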
Reducing input space • Principal component analysis (PCA) • Identify an m-dimensional subspace of the n-dimensional input space • The original n variables are reduced to m variables that are mutually orthogonal (uncorrelated) • Eliminating correlated input variables • Identify highly correlated input variables by • Statistical correlation tests • Visual inspection of graphed data variables • Checking whether a data variable can be modeled using one or more of the others (a sketch follows)
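A minimal sketch of the correlation check and PCA reduction; the synthetic variables are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=300)   # highly correlated with x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])              # n = 3 input variables

print(np.corrcoef(X, rowvar=False).round(2))   # spot |r| near 1

X_reduced = PCA(n_components=2).fit_transform(X)  # m = 2 orthogonal inputs
print(X.shape, "->", X_reduced.shape)
```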
Reducing input space • Combining non-correlated input variables • Sensitivity analysis • If variations of a particular input variable cause large changes in the estimation model's output, that variable is very significant. • Sensitivity analysis prunes input variables based on information provided by both the input and the output data (see the sketch below).
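A minimal sketch of sensitivity analysis: perturb one input at a time and measure the resulting change in the model output. The model here is a hypothetical stand-in for a fitted estimator:

```python
import numpy as np

def model(X):
    # Hypothetical fitted estimator: depends strongly on input 0,
    # weakly on input 1, and not at all on input 2.
    return 5.0 * X[:, 0] + 0.1 * X[:, 1] + 0.0 * X[:, 2]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
base = model(X)

for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] += X[:, j].std()          # perturb variable j by one S.D.
    sensitivity = np.abs(model(Xp) - base).mean()
    print(f"input {j}: mean |output change| = {sensitivity:.3f}")
# Inputs with near-zero sensitivity are candidates for pruning.
```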
Normalizing data • Not “transform to a normal distribution” • For models that perform better with normalized inputs • Non-parametric algorithms implicitly assume that distances in different directions carry the same weight (e.g. k-nearest neighbors, “KNN”) • Backpropagation (BP) and multi-layer perceptron (MLP) models often perform better if all inputs and outputs are normalized • Avoiding numerical problems
Types of normalization • Min-max normalization • Preserves all relationships among the data values exactly • Compresses the normal range if extreme values or outliers exist • Z-score normalization • Sigmoidal normalization (a sketch of all three follows)
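A minimal sketch of the three normalizations on a toy vector; one common form of sigmoidal normalization (assumed here) applies a sigmoid to the z-score:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])   # 100.0 is an extreme value

min_max = (x - x.min()) / (x.max() - x.min())   # maps into [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, S.D. 1
sigmoidal = 1.0 / (1.0 + np.exp(-z_score))      # squashes into (0, 1)

print(min_max)    # note how the extreme value compresses the normal range
print(z_score)
print(sigmoidal)
```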
Other considerations • Match preprocessing to the characteristics of the specific classifier being used for modeling • E.g. CHAID uses categorical data directly • Input variables tend to yield the best modeling accuracy when they exhibit a uniform or Gaussian distribution • Incorporate expert knowledge when preprocessing data