
Data Mining: Concepts and Techniques — Chapter 2 —



1. Data Mining: Concepts and Techniques — Chapter 2 —
   Original Slides: Jiawei Han and Micheline Kamber
   Modification: Li Xiong

2. Chapter 2: Data Preprocessing
   • Why preprocess the data?
   • Descriptive data summarization
   • Data cleaning
   • Data integration
   • Data transformation
   • Data reduction
   • Discretization and generalization

3. Why Data Preprocessing?
   • Data in the real world is dirty
     • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
       • e.g., occupation=“ ”
     • noisy: containing errors or outliers
       • e.g., Salary=“-10”
     • inconsistent: containing discrepancies in codes or names
       • e.g., Age=“42” but Birthday=“03/07/1997”
       • e.g., was rating “1, 2, 3”, now rating “A, B, C”
       • e.g., discrepancies between duplicate records

4. Why Is Data Dirty?
   • Incomplete data may come from
     • “Not applicable” data values when collected
     • Different considerations between the time when the data was collected and when it is analyzed
     • Human, hardware, or software problems
   • Noisy data (incorrect values) may come from
     • Faulty data collection instruments
     • Human or computer error at data entry
     • Errors in data transmission
   • Inconsistent data may come from
     • Different data sources
     • Functional dependency violations (e.g., modifying some linked data)
   • Duplicate records also need data cleaning

5. Multi-Dimensional Measure of Data Quality
   • A well-accepted multidimensional view:
     • Accuracy
     • Completeness
     • Consistency
     • Timeliness
     • Believability
     • Value added
     • Interpretability
     • Accessibility
   • Broad categories:
     • Intrinsic, contextual, representational, and accessibility

6. Major Tasks in Data Preprocessing
   • Data cleaning
     • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
   • Data integration
     • Integration of multiple databases, data cubes, or files
   • Data transformation
     • Normalization and aggregation
   • Data reduction
     • Obtains a reduced representation that is much smaller in volume yet produces the same or similar analytical results
   • Data discretization
     • Part of data reduction, of particular importance for numerical data

7. Forms of Data Preprocessing (figure)

8. Chapter 2: Data Preprocessing
   • Why preprocess the data?
   • Descriptive data summarization
   • Data cleaning
   • Data integration
   • Data transformation
   • Data reduction
   • Discretization and generalization

9. Descriptive Data Summarization
   • Motivation
     • To better understand the data
   • Descriptive statistics: describe basic features of data
     • Graphical description
     • Tabular description
     • Summary statistics
   • Descriptive data summarization
     • Measuring central tendency – how data seem similar
     • Measuring statistical variability or dispersion of data – how data differ
     • Graphic display of descriptive data summarization

10. Measuring the Central Tendency
   • Mean (sample vs. population): x̄ = (Σ x_i)/n vs. μ = (Σ x)/N
   • Weighted arithmetic mean: x̄ = (Σ w_i x_i) / (Σ w_i)
   • Trimmed mean: chopping extreme values before averaging
   • Median
     • Middle value if there is an odd number of values; average of the two middle values otherwise
     • Estimated by interpolation (for grouped data): median ≈ L_1 + ((n/2 − (Σ freq)_l) / freq_median) × width
   • Mode
     • Value that occurs most frequently in the data
     • Unimodal, bimodal, trimodal
     • Empirical formula: mean − mode ≈ 3 × (mean − median)
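These measures are easy to compute directly. A minimal Python sketch, not part of the original slides; the data values and weights below are made up for illustration:

```python
import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Arithmetic mean: sum of the values divided by their count
mean = sum(data) / len(data)

# Weighted arithmetic mean: each value x_i contributes with weight w_i
weights = [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1]  # hypothetical weights
wmean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

# Trimmed mean: chop the k smallest and k largest values, then average
k = 1
tmean = statistics.mean(sorted(data)[k:-k])

# Median: middle value (average of the two middle values for even n)
median = statistics.median(data)

# Mode: the most frequent value (21 occurs twice here)
mode = statistics.mode(data)

print(mean, wmean, tmean, median, mode)
```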

11. Symmetric vs. Skewed Data
   (figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data)

12. Computational Issues
   • Different types of measures
     • Distributed measure – can be computed by partitioning the data into smaller subsets, e.g., sum, count
     • Algebraic measure – can be computed by applying an algebraic function to one or more distributed measures, e.g., ?
     • Holistic measure – must be computed on the entire data set as a whole, e.g., ?
   • Selection algorithm: finding the kth smallest number in a list
     • E.g., min, max, median
     • Selection by sorting: O(n log n)
     • Linear selection based on quicksort-style partitioning (quickselect): expected O(n)
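To make the last point concrete, here is a sketch of quickselect (names and data are mine, not the slides'). It finds the kth smallest value in expected linear time because each step recurses into only the one partition that must contain the answer:

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based); expected O(n).

    Like quicksort it partitions around a pivot, but unlike quicksort
    it recurses into only one side of the partition.
    """
    pivot = random.choice(values)
    below = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    if k < len(below):
        return quickselect(below, k)
    if k < len(below) + len(equal):
        return pivot
    above = [v for v in values if v > pivot]
    return quickselect(above, k - len(below) - len(equal))

data = [26, 4, 29, 15, 21, 8, 34, 25, 9, 28, 21, 24]
print(quickselect(data, len(data) // 2))  # upper middle value: 24
```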

13. The Long Tail
   • Long tail: the low-frequency population (e.g., wealth distribution)
   • The Long Tail: current and future business and economic models
     • Previous empirical studies: Amazon, Netflix
     • Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds that of the relatively few current bestsellers and blockbusters
     • The primary value of the internet: providing access to products in the long tail
   • Business and social implications
     • Mass-market retailers: Amazon, Netflix, eBay
     • Content producers: YouTube
   • The Long Tail. Chris Anderson, Wired, Oct. 2004
   • The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson, 2006

14. Measuring the Dispersion of Data
   • Dispersion or variance: the degree to which numerical data tend to spread
   • Range and quartiles
     • Range: difference between the largest and smallest values
     • Percentile: the value of a variable below which a certain percent of the data fall (algebraic or holistic?)
     • Quartiles: Q1 (25th percentile), median (50th percentile), Q3 (75th percentile)
     • Inter-quartile range: IQR = Q3 − Q1
     • Five-number summary: min, Q1, median, Q3, max (boxplot)
     • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
   • Variance and standard deviation (sample: s, population: σ)
     • Variance: sample vs. population (algebraic or holistic?)
     • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
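A small Python sketch of these dispersion measures (my own illustration; quantile conventions vary, so library results may differ slightly):

```python
import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
s = sorted(data)
n = len(s)

def quantile(p):
    """Linear interpolation between the closest ranks."""
    idx = p * (n - 1)
    lo = int(idx)
    hi = min(lo + 1, n - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

q1, med, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
iqr = q3 - q1

# Five-number summary: min, Q1, median, Q3, max
print(s[0], q1, med, q3, s[-1])

# Usual outlier rule: more than 1.5 x IQR beyond the quartiles
print([v for v in data if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr])

# Sample variance s^2 and standard deviation s
print(statistics.variance(data), statistics.stdev(data))
```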

15. Graphic Displays of Basic Statistical Descriptions
   • Histogram
   • Boxplot
   • Quantile plot
   • Quantile-quantile (q-q) plot
   • Scatter plot
   • Loess (local regression) curve

16. Histogram Analysis
   • Graphical display of tabulated frequencies
     • Univariate graphical method (one attribute)
     • Data partitioned into disjoint buckets (typically equal-width)
     • A set of rectangles reflecting the counts or frequencies of values in each bucket
   • Bar chart for categorical values
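As an illustration (not from the slides), an equal-width histogram of the price data used in the binning example later, drawn with matplotlib:

```python
import matplotlib.pyplot as plt

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Five equal-width buckets over the attribute's range; each bar's
# height is the count of values that fall into that bucket
plt.hist(prices, bins=5, edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count")
plt.title("Equal-width histogram")
plt.show()
```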

17. Boxplot Analysis
   • Visualizes the five-number summary:
     • The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR
     • The median (M) is marked by a line within the box
     • Whiskers: two lines outside the box extend to the minimum and maximum

18. Example Boxplot: Profit Analysis (figure)

19. Quantile Plot
   • Displays all of the data for a given attribute
   • Plots quantile information
     • Each data point (x_i, f_i) indicates that approximately 100·f_i% of the data are at or below the value x_i

20. Quantile-Quantile (Q-Q) Plot
   • Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
   • Useful for diagnosing differences between the probability distributions of two data sets

21. Scatter Plot
   • Displays values of two numerical attributes (bivariate data)
   • Each pair of values is plotted as a point in the plane
   • Can suggest various kinds of correlation between the variables, with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated)

  22. Example Scatter Plot – Correlation between Wine Consumption and Heart Mortality

23. Positively and Negatively Correlated Data (figure)

24. Loess Curve
   • Locally weighted scatter plot smoothing, providing better perception of the pattern of dependence
   • Fits simple models to localized subsets of the data
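One way to draw a loess/lowess curve in Python is statsmodels' lowess smoother; the synthetic data and the frac value below are my illustration, not part of the slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic noisy, nonlinear relationship
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# frac = fraction of the data used for each localized fit;
# larger values give a smoother curve
fitted = lowess(y, x, frac=0.25)  # (x, smoothed y) pairs, sorted by x

plt.scatter(x, y, s=8, alpha=0.5)
plt.plot(fitted[:, 0], fitted[:, 1], color="red")
plt.show()
```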

25. Chapter 2: Data Preprocessing
   • Why preprocess the data?
   • Descriptive data summarization
   • Data cleaning
   • Data integration
   • Data transformation
   • Data reduction
   • Discretization and generalization

26. Data Cleaning
   • Importance
     • “Data cleaning is one of the three biggest problems in data warehousing” —Ralph Kimball
     • “Data cleaning is the number one problem in data warehousing” —DCI survey
   • Data cleaning tasks
     • Fill in missing values
     • Identify outliers and smooth out noisy data
     • Correct inconsistent data
     • Resolve redundancy caused by data integration

27. Missing Data
   • Data is not always available
     • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
   • Missing data may be due to
     • equipment malfunction
     • deletion because the value was inconsistent with other recorded data
     • data not entered due to misunderstanding
     • certain data not being considered important at the time of entry
     • failure to register history or changes of the data
   • Missing data may need to be inferred

28. How to Handle Missing Values?
   • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
   • Fill in the missing value manually
   • Fill in the missing value automatically with
     • a global constant, e.g., “unknown” (beware: it may act as a new class!)
     • the attribute mean
     • the attribute mean for all samples belonging to the same class: smarter
     • the most probable value: inference-based, e.g., Bayesian formula or decision tree (Chap. 6)
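A small pandas sketch of the automatic strategies (the toy data and column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 30_000, 34_000, np.nan],
})

# Global constant: marks the value as unknown, but a learner may
# mistake the constant for a meaningful value
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# Smarter: attribute mean within each class
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_class_mean.tolist())  # [50000.0, 50000.0, 30000.0, 34000.0, 32000.0]
```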

29. Noisy Data
   • Noise: random error or variance in a measured variable
   • Incorrect attribute values may be due to
     • faulty data collection instruments
     • data entry problems
     • data transmission problems
     • technology limitations
     • inconsistency in naming conventions
   • Other data problems that require data cleaning
     • duplicate records
     • incomplete data
     • inconsistent data

30. How to Handle Noisy Data?
   • Binning and smoothing
     • Sort the data and partition it into bins (equi-width or equi-depth)
     • Then smooth by bin means, bin medians, bin boundaries, etc.
   • Regression
     • Smooth by fitting the data to a function with regression
   • Clustering
     • Detect and remove outliers that fall outside clusters
   • Combined computer and human inspection
     • Detect suspicious values and have a human check them (e.g., to deal with possible outliers)

31. Simple Discretization Methods: Binning
   • Equal-width (distance) partitioning
     • Divides the range into N intervals of equal size: a uniform grid
     • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
     • The most straightforward approach, but outliers may dominate the presentation
     • Skewed data is not handled well
   • Equal-depth (frequency) partitioning
     • Divides the range into N intervals, each containing approximately the same number of samples
     • Good data scaling
     • Managing categorical attributes can be tricky

32. Binning Methods for Data Smoothing
   • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
   • Partition into equal-frequency (equi-depth) bins:
     - Bin 1: 4, 8, 9, 15
     - Bin 2: 21, 21, 24, 25
     - Bin 3: 26, 28, 29, 34
   • Smoothing by bin means:
     - Bin 1: 9, 9, 9, 9
     - Bin 2: 23, 23, 23, 23
     - Bin 3: 29, 29, 29, 29
   • Smoothing by bin boundaries:
     - Bin 1: 4, 4, 4, 15
     - Bin 2: 21, 21, 25, 25
     - Bin 3: 26, 26, 26, 34
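A short Python sketch that reproduces the example above (the function names are mine):

```python
def equi_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of (roughly) equal size."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min and max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```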

33. Regression
   (figure: data points in the x-y plane fitted by the line y = x + 1; the noisy value Y1 is smoothed to Y1′ on the line)

34. Cluster Analysis (figure)

35. Chapter 2: Data Preprocessing
   • Why preprocess the data?
   • Descriptive data summarization
   • Data cleaning
   • Data integration
   • Data transformation
   • Data reduction
   • Discretization and generalization

36. Data Integration
   • Data integration: combining data from multiple sources into a unified view
   • Architectures
     • Data warehouse (tightly coupled)
     • Federated database systems (loosely coupled)
   • Database heterogeneity
   • Semantic integration

37. Data Warehouse Approach
   (figure: multiple sources feed an ETL layer that loads a central warehouse, described by metadata; clients run query & analysis against the warehouse)

38. Advantages and Disadvantages of the Data Warehouse Approach
   • Advantages
     • High query performance
     • Can operate when sources are unavailable
     • Extra information at the warehouse
       • Modification, summarization (aggregates), historical information
     • Local processing at the sources is unaffected
   • Disadvantages
     • Data freshness
     • Difficult to construct when only a query interface to the local sources is available

39. Federated Database Systems
   (figure: clients send queries to a mediator, which translates them through per-source wrappers to the underlying sources)

40. Advantages and Disadvantages of Federated Database Systems
   • Advantages
     • No need to copy and store data at the mediator
     • More up-to-date data
     • Only a query interface is needed at the sources
   • Disadvantages
     • Query performance
     • Source availability

41. Database Heterogeneity
   • System heterogeneity: use of different operating systems and hardware platforms
   • Schematic or structural heterogeneity: the native model or structure used to store data differs across data sources
   • Syntactic heterogeneity: differences in the representation format of data
   • Semantic heterogeneity: differences in the interpretation of the 'meaning' of data

42. Semantic Integration
   • Problem: reconciling semantic heterogeneity
   • Levels
     • Schema matching (schema mapping)
       • e.g., A.cust-id ≡ B.cust-#
     • Data matching (data deduplication, record linkage, entity/object matching)
       • e.g., Bill Clinton = William Clinton
   • Challenges
     • Semantics can be inferred from only a few information sources (data creators, documentation) -> must rely on schema and data
     • Schema and data are unreliable and incomplete
     • Global pair-wise matching is computationally expensive
     • In practice, ?% of resources in a data sharing project are spent on reconciling semantic heterogeneity

43. Schema Matching
   • Techniques
     • Rule based
     • Learning based
   • Types of matches
     • 1-1 matches vs. complex matches (e.g., list-price = price * (1 + tax_rate))
   • Information used
     • Schema information: element names, data types, structures, number of sub-elements, integrity constraints
     • Data information: value distributions, frequency of words
     • External evidence: past matches, corpora of schemas
     • Ontologies, e.g., the Gene Ontology
   • Multi-matcher architecture
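As a toy illustration of the rule-based, name-similarity end of this spectrum (the normalization rules, threshold, and schemas are all hypothetical; real matchers combine many more signals):

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Crude element-name similarity after light normalization."""
    def norm(s):
        return s.lower().replace("-", "").replace("_", "").replace("#", "id")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

schema_a = ["cust-id", "cust-name", "list-price"]
schema_b = ["cust-#", "customer_name", "price", "tax_rate"]

# Propose the best 1-1 match for each element of schema A
for a in schema_a:
    best = max(schema_b, key=lambda b: name_similarity(a, b))
    if name_similarity(a, best) >= 0.6:  # hypothetical threshold
        print(a, "<->", best)
```

Note that a complex match such as list-price = price * (1 + tax_rate) cannot be found by name similarity alone; it needs data-level or external evidence.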

44. Data Matching, or … ?
   • One problem, many names: record linkage, data matching, object identification, entity resolution, entity disambiguation, duplicate detection, record matching, instance identification, deduplication, reference reconciliation, database hardening, …

45. Data Matching
   • Techniques
     • Rule based
     • Probabilistic record linkage (Fellegi and Sunter, 1969)
       • Similarity between pairs of attributes
       • Combined scores representing the probability of a match
       • Threshold-based decision
     • Machine learning approaches
   • New challenges
     • Complex information spaces
     • Multiple classes
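A minimal sketch in the Fellegi–Sunter spirit (the weights, threshold, and string-similarity choice are illustrative assumptions, not the original method's parameters):

```python
from difflib import SequenceMatcher

def sim(a, b):
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2, weights):
    """Weighted combination of per-attribute similarities."""
    total = sum(weights.values())
    return sum(w * sim(r1[f], r2[f]) for f, w in weights.items()) / total

r1 = {"name": "Bill Clinton",    "city": "Little Rock"}
r2 = {"name": "William Clinton", "city": "Little Rock"}

weights = {"name": 2.0, "city": 1.0}  # hypothetical attribute weights
THRESHOLD = 0.8                       # hypothetical decision threshold

score = match_score(r1, r2, weights)
print(round(score, 3), "match" if score >= THRESHOLD else "non-match")
```

Edit-distance-style similarity alone would miss pure nickname equivalences (Bill = William passes here only because the shared surname and city dominate); production record linkage adds nickname tables, phonetic codes, and learned weights.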

46. Chapter 2: Data Preprocessing
   • Why preprocess the data?
   • Descriptive data summarization
   • Data cleaning
   • Data integration
   • Data transformation
   • Data reduction
   • Discretization and generalization

47. Data Transformation
   • Smoothing: remove noise from data (data cleaning)
   • Aggregation: summarization
     • E.g., daily sales -> monthly sales
   • Discretization and generalization
     • E.g., age -> youth, middle-aged, senior
   • (Statistical) normalization: scale values to fall within a small, specified range
     • E.g., income vs. age
   • Attribute construction: construct new attributes from given ones
     • E.g., birthday -> age

48. Data Aggregation
   • Data cubes store multidimensional aggregated information
   • Multiple levels of aggregation support analysis at multiple granularities
   • More on data warehouses and cube computation in Chapters 3 and 4

49. Normalization
   • Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]:
     v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
     • Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) = 0.709
   • Z-score normalization (μ: mean, σ: standard deviation):
     v' = (v − μ) / σ
     • Ex. Let μ = 54,000 and σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000) / 16,000 = 1.188
   • Normalization by decimal scaling:
     v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
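The three normalizations as straightforward Python functions (a sketch; the printed values reproduce the examples above):

```python
def min_max(v, mn, mx, new_min=0.0, new_max=1.0):
    """Map v from [mn, mx] onto [new_min, new_max]."""
    return (v - mn) / (mx - mn) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Center on the mean and scale by the standard deviation."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by the smallest power of 10 making every |v'| < 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(min_max(73_000, 12_000, 98_000))  # 0.709...
print(z_score(73_000, 54_000, 16_000))  # 1.1875
print(decimal_scaling([-986, 917]))     # [-0.986, 0.917]
```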

50. Chapter 2: Data Preprocessing
   • Why preprocess the data?
   • Descriptive data summarization
   • Data cleaning
   • Data integration
   • Data transformation
   • Data reduction
   • Discretization and generalization
