
1. Data Preprocessing (Chapter 2)

2. Chapter Objectives
• Realize the importance of data preprocessing for real-world data before data mining or the construction of data warehouses.
• Get an overview of some data preprocessing issues and techniques.

3. The Course
[Diagram: course road map. Data sources (DS) flow through data preprocessing (2) into the data warehouse (DW, 3), which supports OLAP (4) and data mining (DM): association (5), classification (6), clustering (7).]
Legend: DS = data source, DW = data warehouse, DM = data mining.

4. … - The Chapter
[Diagram: the chapter's preprocessing steps: data cleaning (2.3), data integration (2.4), data transformation (2.4), data reduction (2.5).]

5. - Chapter Outline
• Introduction (2.1, 2.2)
• Data Cleaning (2.3)
• Data Integration (2.4)
• Data Transformation (2.4)
• Data Reduction (2.5)
• Concept Hierarchy (2.6)

6. - Introduction
• Why Preprocess the Data (2.1)
• Where to Preprocess Data
• Identifying Typical Properties of Data (2.2)

7. -- Why Preprocess the Data …
• A well-accepted multi-dimensional measure of data quality:
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility

8. -- … Why Preprocess the Data
• Reasons for data cleaning
  • Incomplete data (missing values)
  • Noisy data (contains errors)
  • Inconsistent data (contains discrepancies)
• Reasons for data integration
  • Data comes from multiple sources
• Reasons for data transformation
  • Some data must be transformed to be usable for mining
• Reasons for data reduction
  • Performance
• No quality data, no quality mining results!

9. -- Where to Preprocess Data
[Diagram: data sources (DS) feed a staging database (SD), where data preprocessing is done, before loading into the data warehouse (DW); the DW supports OLAP and data mining (DM): association, classification, clustering.]
Legend: DS = data source, DW = data warehouse, DM = data mining, SD = staging database.

10. -- Identifying Typical Properties of Data
• Descriptive data summarization techniques can be used to identify the typical properties of the data and help decide which data values should be treated as noise.
• For many data preprocessing tasks it is useful to know the following measures of the data:
  • The central tendency
  • The dispersion

11. --- Measuring the Central Tendency …
• Central tendency measures:
  • Mean
  • Median
  • Mode
  • Midrange: (max() + min())/2
• For data mining purposes, we need to know how to compute these measures efficiently in large databases. It is important to know whether the measure is:
  • distributive,
  • algebraic, or
  • holistic.
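A minimal Python sketch of these four measures, using only the standard library and reusing the sorted price values from the binning example later in this chapter:

```python
import statistics

# The sorted price values reused in the binning example (slide 22)
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

mean = statistics.mean(data)              # 20.33...
median = statistics.median(data)          # 22.5 (mean of the two middle values)
mode = statistics.mode(data)              # 21 (the most frequent value)
midrange = (max(data) + min(data)) / 2    # 19.0
print(mean, median, mode, midrange)
```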

12. … --- Measuring the Central Tendency
• Distributive measure: a measure that can be computed by partitioning the data, computing the measure for each partition, and then merging the results to arrive at the measure's value for the entire data set.
  • E.g. sum(), count(), max(), min().
• Algebraic measure: a measure that can be computed by applying an algebraic function to one or more distributive measures.
  • E.g. avg(), which is sum()/count().
• Holistic measure: the entire data set is needed to compute the measure.
  • E.g. median.
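To make the distinction concrete, here is a small sketch (the partitions are illustrative, not a real database) showing that avg() can be assembled from the distributive measures sum() and count() computed per partition:

```python
# Illustrative partitions of a data set (e.g. three database fragments)
partitions = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# sum() and count() are distributive: compute per partition, then merge
total = sum(sum(p) for p in partitions)
n = sum(len(p) for p in partitions)

# avg() is algebraic: a function of the two distributive measures
avg = total / n
print(total, n, avg)   # 244 12 20.33...
```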

13. --- Measuring the Dispersion
• Dispersion (or spread) is the degree to which numerical data tend to spread out.
• The most common measures are:
  • Standard deviation
  • Range: max() – min()
  • Quartiles
  • The five-number summary
  • Interquartile range (IQR)
  • Boxplot analysis

14. ---- Quartiles
• The kth percentile of data sorted in ascending order is the value x with the property that k percent of the data entries lie at or below x.
• The first quartile, Q1, is the 25th percentile; Q2, the median, is the 50th percentile; and Q3 is the 75th percentile.
• The IQR is Q3 – Q1 and is a simple measure of the spread of the middle half of the data.
• A common rule of thumb for identifying suspected outliers is to single out values falling more than 1.5 × IQR above Q3 or below Q1.
• The five-number summary: the minimum, Q1, the median, Q3, the maximum.
• Boxplots can be plotted from the five-number summary and are useful tools for identifying outliers.
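A short sketch of the five-number summary and the 1.5 × IQR outlier rule, using NumPy's percentile function (one of several quartile conventions; exact quartile values depend on the interpolation method):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
five_number = (data.min(), q1, median, q3, data.max())

# Rule of thumb: flag values more than 1.5 * IQR beyond Q1 or Q3
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < low) | (data > high)]
print(five_number, outliers)   # no outliers in this small sample
```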

15. ---- Boxplot Analysis
• Data is represented with a box:
  • The ends of the box are Q1 and Q3, i.e., the height of the box is the IQR.
  • The median is marked by a line within the box.
  • Whiskers: two lines outside the box extend to the minimum and maximum.
[Diagram: a boxplot annotated with the lowest value, Q1, median, Q3, whisker, and highest value.]

16. - Chapter Outline
• Introduction (2.1, 2.2)
• Data Cleaning (2.3)
• Data Integration (2.4)
• Data Transformation (2.4)
• Data Reduction (2.5)
• Concept Hierarchy (2.6)

17. - Data Cleaning
• Importance: "Data cleaning is the number one problem in data warehousing."
• In data cleaning, the following data problems are resolved:
  • Incomplete data (missing values)
  • Noisy data (contains errors)
  • Inconsistent data (contains discrepancies)

18. -- Missing Data
• Data is not always available.
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
• Missing data may be due to:
  • equipment malfunction
  • deletion because the value was inconsistent with other recorded data
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry

19. --- How to Handle Missing Data?
• Fill in the missing value manually (often infeasible).
• Fill in with a global constant such as "unknown" or "n/a" (not recommended: a data mining algorithm will see this as a normal value).
• Fill in with the attribute mean or median.
• Fill in with the class mean or median (classes need to be known).
• Fill in with the most likely value (using regression, decision trees, most similar records, etc.).
• Use other attributes to predict the value (e.g. if a postcode is missing, use the suburb value).
• Ignore the record.
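A minimal pandas sketch of some of these strategies, on a hypothetical table with a missing income attribute (the column names are invented for illustration):

```python
import pandas as pd

# Hypothetical sales records with missing customer income values
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B"],
    "income": [50_000.0, None, 80_000.0, None],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the per-class mean (here the class is the customer segment)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean")
)

# Or ignore (drop) records whose income is missing
df_complete = df.dropna(subset=["income"])
print(df)
```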

20. -- Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to:
  • faulty data collection
  • data entry problems
  • data transmission problems
  • data conversion errors
  • data decay problems
  • technology limitations, e.g. buffer overflow or field size limits

21. --- How to Handle Noisy Data?
• Binning
  • First sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, or bin boundaries, etc. (see the worked example and sketch below).
• Regression
  • Smooth by fitting the data to regression functions.
• Clustering
  • Detect and remove outliers.
• Combined computer and human inspection
  • Detect suspicious values automatically and have a human check them.

22. --- Binning Methods for Data Smoothing
• Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into equal-frequency (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
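A small Python sketch that reproduces this example (standard library only; a value exactly midway between the two bin boundaries is sent to the lower one, an assumption the slide does not spell out):

```python
from statistics import mean

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
k = len(prices) // 3
bins = [prices[i:i + k] for i in range(0, len(prices), k)]  # equi-depth bins

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(mean(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer bin edge
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```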

23. --- Regression
[Diagram: a scatter plot with fitted line y = x + 1; a noisy point Y1 at X1 is smoothed to the value Y1' on the line.]

24. --- Cluster Analysis
[Diagram: points grouped into clusters; values falling outside every cluster are treated as outliers.]

25. -- Inconsistent Data
• Inconsistent data can be due to:
  • data entry errors
  • data integration errors (different formats, codes, etc.)
• Handling inconsistent data:
  • It is important to have data entry verification (check both the format and the values of data entered).
  • Correct with the help of external reference data.

26. -- Data Cleaning as a Process
• Data discrepancy detection
  • Use metadata (e.g., domain, range, correlation, distribution, DDS).
  • Check for field overloading.
  • Check for inconsistent use of codes (e.g. 5/12/2004 vs. 12/5/2004).
  • Check the uniqueness rule, consecutive rule, and null rule.
• Use commercial tools
  • Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections.
  • Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers).

27. --- Properties of the Normal Distribution Curve
• For the normal (distribution) curve:
  • From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation).
  • From μ–2σ to μ+2σ: contains about 95% of them.
  • From μ–3σ to μ+3σ: contains about 99.7% of them.
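A quick empirical check of the 68-95-99.7 rule on simulated data (illustrative, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # simulated measurements

for k in (1, 2, 3):
    within = np.mean(np.abs(x) <= k)   # fraction within mu +/- k*sigma
    print(k, within)                   # ~0.683, ~0.954, ~0.997
```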

28. --- Correlation
[Diagram: three scatter plots illustrating negative correlation, positive correlation, and no correlation.]

29. - Chapter Outline
• Introduction (2.1, 2.2)
• Data Cleaning (2.3)
• Data Integration (2.4)
• Data Transformation (2.4)
• Data Reduction (2.5)
• Concept Hierarchy (2.6)

30. - Data Integration
• Data integration combines data from multiple sources into a coherent data store.
• Main problems:
  • Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#.
  • Redundancy problem: an attribute is redundant if it can be derived from other attribute(s); inconsistencies in attribute naming can also cause redundancy.
• Solutions:
  • Entity identification problems can be resolved using metadata.
  • Some redundancy problems can also be resolved using metadata; others can be detected by correlation analysis.

31. -- Correlation
[Diagram: three scatter plots illustrating negative correlation, positive correlation, and no correlation.]

32. --- Correlation Analysis (Numerical Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

  rA,B = Σ(aᵢ – Ā)(bᵢ – B̄) / (n σA σB) = (Σ(aᵢbᵢ) – n Ā B̄) / (n σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated.
• rA,B = 0: uncorrelated (no linear relationship).
• rA,B < 0: negatively correlated.
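A minimal NumPy sketch of this formula on hypothetical paired values, cross-checked against np.corrcoef (NumPy's default standard deviation is the population one, matching the n in the denominator):

```python
import numpy as np

# Hypothetical paired values of attributes A and B
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

n = len(a)
# (sum of cross-products - n * mean_A * mean_B) / (n * sigma_A * sigma_B)
r = ((a * b).sum() - n * a.mean() * b.mean()) / (n * a.std() * b.std())

assert np.isclose(r, np.corrcoef(a, b)[0, 1])   # cross-check with NumPy
print(r)   # close to 1: strongly positively correlated
```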

33. --- Correlation Analysis (Categorical Data)
• Χ² (chi-square) test:

  Χ² = Σ (observed – expected)² / expected

  summed over all cells of the contingency table.
• The larger the Χ² value, the more likely the variables are related.
• The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count.

34. --- Chi-Square Calculation: An Example
• [Table: observed counts for like_science_fiction vs. play_chess, with expected counts, calculated from the data distribution in the two categories, shown in parentheses; the table itself is not preserved in this transcript.]
• The resulting Χ² value shows that like_science_fiction and play_chess are correlated in the group.
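Since the slide's table is not preserved, here is a sketch on a hypothetical 2×2 contingency table using SciPy's chi2_contingency, which derives the expected counts and the Χ² statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 observed counts (rows: play_chess yes/no,
# columns: like_science_fiction yes/no) -- not the slide's lost numbers
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p)     # a large chi2 (small p) suggests the attributes are related
print(expected)    # expected counts under independence: [[20. 20.] [30. 30.]]
```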

35. - Chapter Outline
• Introduction (2.1, 2.2)
• Data Cleaning (2.3)
• Data Integration (2.4)
• Data Transformation (2.4)
• Data Reduction (2.5)
• Concept Hierarchy (2.6)

36. - Data Transformation
• In data transformation, data is transformed or consolidated into forms appropriate for mining. Data transformation can involve:
  • Smoothing: remove noise from the data using binning, regression, or clustering.
  • Aggregation: e.g. daily sales data can be aggregated to monthly totals.
  • Generalization: concept hierarchy climbing, e.g. cities can be generalized to countries, and ages to youth, middle-aged, and senior.
  • Normalization: attribute data scaled to fall within a small, specified range.
  • Attribute/feature construction: new attributes constructed from the given ones.

37. -- Data Transformation: Normalization
• Min-max normalization to [new_minA, new_maxA]:

  v' = ((v – minA) / (maxA – minA)) (new_maxA – new_minA) + new_minA

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 – 12,000) / (98,000 – 12,000) = 0.709.
• Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v – μ) / σ

  Ex. Let μ = 54,000 and σ = 16,000. Then $73,000 is mapped to (73,000 – 54,000) / 16,000 = 1.19.
• Normalization by decimal scaling:

  v' = v / 10^j

  where j is the smallest integer such that max(|v'|) < 1.
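All three normalizations in a short NumPy sketch, using the slide's income figures (μ and σ as given above):

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_000.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn)        # 73,000 -> ~0.709

# Z-score normalization with the slide's mu = 54,000 and sigma = 16,000
zscore = (income - 54_000) / 16_000       # 73,000 -> 1.1875

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10 ** j                # j = 5 here, so 98,000 -> 0.98
print(minmax, zscore, decimal)
```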

38. -- Attribute/Feature Construction
• Sometimes it is helpful or necessary to construct new attributes or features.
  • Helpful for both understanding and accuracy.
  • For example: create an attribute volume based on the attributes height, depth, and width.
• Construction is based on mathematical or logical operations.
• Attribute construction can help to discover missing information about the relationships between data attributes.

39. - Chapter Outline
• Introduction (2.1, 2.2)
• Data Cleaning (2.3)
• Data Integration (2.4)
• Data Transformation (2.4)
• Data Reduction (2.5)
• Concept Hierarchy (2.6)

40. - Data Reduction
• The data is often too large, and reducing it can improve performance. Data reduction means reducing the representation of the data set while producing the same (or almost the same) analytical results.
• Data reduction includes:
  • Reducing the number of rows (objects)
  • Reducing the number of attributes (features)
  • Compression
  • Discretization (covered in the next section)

41. -- Reducing the Number of Rows
• Aggregation
  • Aggregation of data into a higher concept level.
  • We can have multiple levels of aggregation, e.g., weekly, monthly, quarterly, yearly, and so on.
  • For data reduction, use the highest aggregation level that is still sufficient for the task.
• Numerosity reduction
  • Data volume can be reduced by choosing alternative, smaller forms of data representation.

42. --- Types of Numerosity Reduction
• Parametric:
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possibly outliers).
  • E.g. linear regression: the data are modeled to fit a straight line (see the sketch below).
• Non-parametric:
  • Histograms
  • Clustering
  • Sampling
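A minimal sketch of parametric reduction via linear regression on synthetic data: only the two fitted parameters need to be stored, and approximate values can be reconstructed on demand:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)  # noisy straight line

# Parametric reduction: fit the model and keep only its two parameters
slope, intercept = np.polyfit(x, y, deg=1)

# The raw y values can now be discarded; reconstruct approximations on demand
y_approx = slope * x + intercept
print(slope, intercept)   # close to the true 3.0 and 5.0
```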

43. ---- Reduction with Histograms
• A popular data reduction technique: divide the data into buckets and store a representation of each bucket (sum, count, etc.).
• Histogram types:
  • Equal-width: divides the range into N intervals of equal size.
  • Equal-depth: divides the range into N intervals, each containing approximately the same number of samples.
  • V-optimal: considers all histograms for a given number of buckets and chooses the one with the least variance.
  • MaxDiff: after sorting the data to be approximated, bucket borders are placed where adjacent values have the largest differences.
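A sketch of equal-width vs. equal-depth bucket boundaries with NumPy (the other two types, V-optimal and MaxDiff, need custom code and are omitted):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of equal size over the range [4, 34]
counts, edges = np.histogram(data, bins=3)
print(edges)    # [ 4. 14. 24. 34.]
print(counts)   # [3 3 6] -> store (boundary, count) pairs instead of raw data

# Equal-depth: 3 intervals holding roughly the same number of samples
depth_edges = np.quantile(data, [0, 1/3, 2/3, 1])
print(depth_edges)
```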

44. ---- Example: Histogram
[Diagram: an example histogram of bucketed values.]

45. ---- Reduction with Clustering
• Partition the data into clusters based on "closeness" in space, then retain representatives of the clusters (centroids) and the outliers.
• Effectiveness depends on the distribution of the data.
• Hierarchical clustering is possible (multi-resolution).
[Diagram: clustered points with marked centroids and an outlier.]

46. ---- Reduction with Sampling
• Sampling allows a large data set to be represented by a much smaller random sample (subset) of the data.
• Key question: will the patterns in the sample represent the patterns in the full data set?
• How to select a random sample?
  • Simple random sampling without replacement (SRSWOR)
  • Simple random sampling with replacement (SRSWR)
  • Cluster sampling (SRSWOR or SRSWR applied to clusters)
  • Stratified sampling (stratum = group based on an attribute value)
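A pandas sketch of three of these schemes on a hypothetical table (the group column stands in for the stratum attribute):

```python
import pandas as pd

# Hypothetical raw data: a stratum/group attribute plus a value
df = pd.DataFrame({"group": list("AAABBBBBCC"), "value": range(10)})

srswor = df.sample(n=4, replace=False, random_state=0)   # SRSWOR
srswr = df.sample(n=4, replace=True, random_state=0)     # SRSWR

# Stratified sample: draw the same fraction from every stratum
stratified = df.groupby("group").sample(frac=0.5, random_state=0)
print(stratified)
```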

47. ---- Sampling
[Diagram: raw data sampled via SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement).]

48. ---- Sampling Example
[Diagram: raw data and a cluster/stratified sample drawn from it.]

49. -- Reducing the Number of Attributes
• Reduce the number of attributes (also called dimensions or features).
• Select a minimum set of attributes (features) that is sufficient for the data mining or analytical task.
• Purpose:
  • Avoid the curse of dimensionality, which creates a sparse data space and poor clusters.
  • Reduce the amount of time and memory required by data mining algorithms.
  • Allow the data to be more easily visualized.
  • May help to eliminate irrelevant or duplicate features, or to reduce noise.

50. --- Reducing the Number of Attributes: Techniques
• Step-wise forward selection:
  • E.g.: {} → {A1} → {A1, A3} → {A1, A3, A5}
• Step-wise backward elimination:
  • E.g.: {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}
• Combining forward selection and backward elimination.
• Decision-tree induction (covered in Chapter 5).
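As an illustration (a stand-in, not the chapter's own algorithm), scikit-learn's SequentialFeatureSelector performs exactly this kind of greedy step-wise search; here it wraps a decision-tree classifier on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy step-wise search: start from {} and repeatedly add the attribute
# that most improves cross-validated accuracy
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",    # "backward" gives step-wise elimination instead
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask over the original attributes
```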
