## Data Mining: Concepts and Techniques — Chapter 2 —
Original slides: Jiawei Han and Micheline Kamber. Modification: Li Xiong.

### Chapter 2: Data Preprocessing

- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration
- Data transformation
- Data reduction
- Discretization and generalization

### Why Data Preprocessing?

- Data in the real world is dirty:
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation = “ ”
  - Noisy: containing errors or outliers
    - e.g., salary = “-10”
  - Inconsistent: containing discrepancies in codes or names
    - e.g., age = “42” but birthday = “03/07/1997”
    - e.g., rating was “1, 2, 3”, now “A, B, C”
    - e.g., discrepancies between duplicate records

### Why Is Data Dirty?

- Incomplete data may come from:
  - “Not applicable” values at collection time
  - Different considerations between the time the data was collected and the time it is analyzed
  - Human, hardware, or software problems
- Noisy data (incorrect values) may come from:
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from:
  - Different data sources
  - Functional dependency violations (e.g., modifying some linked data)
- Duplicate records also need data cleaning

### Multi-Dimensional Measure of Data Quality

- A well-accepted multidimensional view:
  - Accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility

### Major Tasks in Data Preprocessing

- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation, much smaller in volume, that produces the same or similar analytical results
- Data discretization: part of data reduction, of particular importance for numerical data

### Forms of Data Preprocessing

*(figure: the forms of preprocessing — cleaning, integration, transformation, reduction — shown as a diagram)*
### Descriptive Data Summarization

- Motivation: to better understand the data
- Descriptive statistics describe basic features of data:
  - Graphical description
  - Tabular description
  - Summary statistics
- Descriptive data summarization:
  - Measuring central tendency: how data are similar
  - Measuring statistical variability or dispersion: how data differ
  - Graphic display of descriptive data summarization

### Measuring the Central Tendency

- Mean (sample vs. population): x̄ = (1/n) Σ xᵢ vs. μ = Σ x / N
  - Weighted arithmetic mean: x̄ = Σ wᵢxᵢ / Σ wᵢ
  - Trimmed mean: chop extreme values before averaging
- Median
  - Middle value if there is an odd number of values; average of the middle two values otherwise
  - Estimated by interpolation for grouped data
- Mode
  - The value that occurs most frequently in the data
  - Distributions may be unimodal, bimodal, trimodal, …
  - Empirical formula for moderately skewed data: mean − mode ≈ 3 × (mean − median)

### Symmetric vs. Skewed Data

- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
- In a symmetric distribution the three coincide; with positive skew the mean typically lies above the median, with negative skew below

*(figure: the three distribution shapes)*

### Computational Issues

- Different types of measures:
  - Distributed measure: can be computed by partitioning the data into smaller subsets, e.g. sum, count
  - Algebraic measure: can be computed by applying an algebraic function to one or more distributed measures. E.g.?
  - Holistic measure: must be computed on the entire dataset as a whole. E.g.?
- Selection algorithm: finding the k-th smallest number in a list
  - E.g. min, max, median
  - Selection by sorting: O(n log n)
  - Linear-time selection based on quicksort-style partitioning (quickselect): O(n) expected

### The Long Tail

- Long tail: the low-frequency part of a population (e.g. a wealth distribution)
- The Long Tail: current and future business and economic models
  - Earlier empirical studies: Amazon, Netflix
  - Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters
  - The primary value of the internet: providing access to products in the long tail
- Business and social implications
  - Mass-market retailers: Amazon, Netflix, eBay
  - Content producers: YouTube
- References: “The Long Tail,” Chris Anderson, Wired, Oct. 2004; *The Long Tail: Why the Future of Business is Selling Less of More*, Chris Anderson, 2006

### Measuring the Dispersion of Data

- Dispersion or variance: the degree to which numerical data tend to spread
- Range and quartiles
  - Range: difference between the largest and smallest values
  - Percentile: the value below which a given percent of the data fall (algebraic or holistic?)
  - Quartiles: Q1 (25th percentile), median (50th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max (visualized by a boxplot)
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s; population: σ)
  - Variance: sample s² = (1/(n−1)) Σ (xᵢ − x̄)² vs. population σ² = (1/N) Σ (xᵢ − μ)² (algebraic or holistic?)
  - The standard deviation s (or σ) is the square root of the variance s² (or σ²)

### Graphic Displays of Basic Statistical Descriptions

- Histogram
- Boxplot
- Quantile plot
- Quantile-quantile (Q-Q) plot
- Scatter plot
- Loess (local regression) curve

### Histogram Analysis

- Graphical display of tabulated frequencies
- Univariate graphical method (one attribute)
- Data are partitioned into disjoint buckets (typically equal-width)
- A set of rectangles reflects the count or frequency of values in each bucket
- A bar chart plays the same role for categorical values

*(slides dated August 30, 2014)*

### Boxplot Analysis

- Visualizes the five-number summary:
  - The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR
  - The median (M) is marked by a line within the box
  - Whiskers: two lines outside the box extending to the minimum and maximum

### Example Boxplot: Profit Analysis

*(figure: boxplots of profit)*

### Quantile Plot

- Displays all of the data for a given attribute
- Plots quantile information
- Each data point (xᵢ, fᵢ) indicates that approximately 100·fᵢ% of the data are less than or equal to the value xᵢ

### Quantile-Quantile (Q-Q) Plot

- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Useful for diagnosing differences between two probability distributions

### Scatter Plot

- Displays values for two numerical attributes (bivariate data)
- Each pair of values is plotted as a point in the plane
- Can suggest correlations between variables: positive (rising), negative (falling), or null (uncorrelated)
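The dispersion measures and the boxplot's 1.5 × IQR outlier rule above can be sketched in a few lines of Python, using only the standard-library `statistics` module (the function names and sample data are my own, for illustration):

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the numbers a boxplot draws."""
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4)  # the three quartile cut points
    return xs[0], q1, median, q3, xs[-1]

def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3: the usual whisker rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical sample
print(five_number_summary(data))  # (4, 10.5, 22.5, 27.5, 34)
print(iqr_outliers(data))         # []: no value clears the 1.5 * IQR fences here
```

Note that `statistics.quantiles` interpolates between data points, so Q1 and Q3 can differ slightly from hand-computed quartiles; the median matches the slide's definition (average of the middle two values for an even count).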
### Example Scatter Plot: Correlation between Wine Consumption and Heart Mortality

*(figures: example scatter plots of positively and negatively correlated data)*

### Loess Curve

- Locally weighted scatter-plot smoothing, giving a better perception of the pattern of dependence
- Fits simple models to localized subsets of the data

### Data Cleaning

- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

### Missing Data

- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - Equipment malfunction
  - Deletion of values inconsistent with other recorded data
  - Data not entered due to misunderstanding
  - Certain data not considered important at the time of entry
  - Failure to register history or changes of the data
- Missing data may need to be inferred

### How to Handle Missing Values?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification)
- Fill in the missing value manually
- Fill in the missing value automatically, using:
  - A global constant, e.g. “unknown” (which risks looking like a new class!)
  - The attribute mean
  - The attribute mean for all samples belonging to the same class: smarter
  - The most probable value: inference-based methods such as Bayesian inference or a decision tree (Chap. 6)

### Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - Faulty data collection instruments
  - Data entry problems
  - Data transmission problems
  - Technology limitations
  - Inconsistent naming conventions
- Other data problems requiring data cleaning
  - Duplicate records
  - Incomplete data
  - Inconsistent data

### How to Handle Noisy Data?

- Binning and smoothing
  - Sort the data and partition it into bins (equi-width or equi-depth)
  - Then smooth by bin means, bin medians, bin boundaries, etc.
- Regression
  - Smooth by fitting the data to a regression function
- Clustering
  - Detect and remove outliers that fall outside clusters
- Combined computer and human inspection
  - Detect suspicious values and have a human check them (e.g., possible outliers)

### Simple Discretization Methods: Binning

- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N
  - The most straightforward approach, but outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

### Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

- Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

### Regression

*(figure: data points smoothed by fitting a line, y = x + 1)*

### Cluster Analysis

*(figure: clusters, with outliers falling outside them)*

### Data Integration

- Data integration: combines data from multiple sources into a unified view
- Architectures
  - Data warehouse (tightly coupled)
  - Federated database systems (loosely coupled)
- Database heterogeneity
- Semantic integration

### Data Warehouse Approach

*(figure: clients run query & analysis against a warehouse loaded from sources via ETL and described by metadata)*

### Advantages and Disadvantages of Data Warehouse

- Advantages
  - High query performance
  - Can operate when sources are unavailable
  - Extra information at the warehouse: modification, summarization (aggregates), historical information
  - Local processing at the sources is unaffected
- Disadvantages
  - Data freshness
  - Difficult to construct when only the query interfaces of the local sources are accessible
### Federated Database Systems

*(figure: clients query a mediator, which reaches the sources through wrappers)*

### Advantages and Disadvantages of Federated Database Systems

- Advantages
  - No need to copy and store data at the mediator
  - More up-to-date data
  - Only a query interface is needed at the sources
- Disadvantages
  - Query performance
  - Source availability

### Database Heterogeneity

- System heterogeneity: different operating systems and hardware platforms
- Schematic or structural heterogeneity: the native model or structure used to store data differs across sources
- Syntactic heterogeneity: differences in the representation format of data
- Semantic heterogeneity: differences in the interpretation of the ‘meaning’ of data

### Semantic Integration

- Problem: reconciling semantic heterogeneity
- Levels
  - Schema matching (schema mapping), e.g., A.cust-id ↔ B.cust-#
  - Data matching (data deduplication, record linkage, entity/object matching), e.g., Bill Clinton = William Clinton
- Challenges
  - Semantics can be inferred from only a few information sources (data creators, documentation), so we must rely on the schema and the data themselves
  - Schema and data are unreliable and incomplete
  - Global pair-wise matching is computationally expensive
  - In practice, ?% of the resources in a data-sharing project are spent reconciling semantic heterogeneity

### Schema Matching

- Techniques
  - Rule based
  - Learning based
- Types of matches
  - 1-1 matches vs. complex matches (e.g. list-price = price * (1 + tax_rate))
- Information used
  - Schema information: element names, data types, structures, number of sub-elements, integrity constraints
  - Data information: value distributions, frequency of words
  - External evidence: past matches, corpora of schemas
  - Ontologies, e.g. the Gene Ontology
- Multi-matcher architecture

### Data Matching Or … ?

Record linkage, data matching, object identification, entity resolution, entity disambiguation, duplicate detection, record matching, instance identification, deduplication, reference reconciliation, database hardening, …

### Data Matching

- Techniques
  - Rule based
  - Probabilistic record linkage (Fellegi and Sunter, 1969)
    - Similarity between pairs of attributes
    - Combined scores representing the probability of a match
    - Threshold-based decision
  - Machine learning approaches
- New challenges
  - Complex information spaces
  - Multiple classes

### Data Transformation

- Smoothing: remove noise from the data (data cleaning)
- Aggregation: summarization, e.g. daily sales -> monthly sales
- Discretization and generalization, e.g. age -> youth, middle-aged, senior
- (Statistical) normalization: scale values to fall within a small, specified range, e.g. income vs. age
- Attribute construction: construct new attributes from given ones, e.g. birthday -> age

### Data Aggregation

- Data cubes store multidimensional aggregated information
- Multiple levels of aggregation support analysis at multiple granularities
- More on data warehouses and cube computation in Chapters 3 and 4

### Normalization

- Min-max normalization, mapping [minA, maxA] to [new_minA, new_maxA]: v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA
  - Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 - 12,000) / (98,000 - 12,000) ≈ 0.709
- Z-score normalization (μ: mean, σ: standard deviation): v' = (v - μ) / σ
  - Ex. Let μ = 54,000, σ = 16,000.
    Then v' = (73,000 - 54,000) / 16,000 = 1.1875
- Normalization by decimal scaling: v' = v / 10ʲ, where j is the smallest integer such that max(|v'|) < 1
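A minimal Python sketch of the three normalization schemes above, reusing the income figures from the examples (the function names are my own):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: v's distance from the mean, in standard deviations."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v'| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

income = 73_000
print(min_max(income, 12_000, 98_000))            # ~0.709
print(z_score(income, 54_000, 16_000))            # 1.1875
print(decimal_scaling([12_000, 73_000, 98_000]))  # [0.12, 0.73, 0.98]
```

Min-max and z-score normalize one value at a time given range or distribution parameters; decimal scaling must first scan all values to find j, which is why it takes the whole list.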