850 likes | 1.09k Views
The Software Infrastructure for Electronic Commerce. Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes. Lectures Three and Four. Data preprocessing Multidimensional data analysis Data mining
E N D
The Software Infrastructurefor Electronic Commerce Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes
Lectures Three and Four • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering
Why Data Preprocessing? • Quality decisions come from quality data. • Problems with real life data: • Data needs to be integrated from different sources • Missing values • Noisy and inconsistent values • Data is not at the right level of aggregation
Recall: The Relational Data Model • A relational database is a set of relations • A relation has two components: • The relation instance. Basically a table, with rows (also: records, tuples) and columns (also: fields, attributes).Number of records in the relation instance: Cardinality. • The relation schema. Specifies name of the relation, plus name and type of each column. • Turing award (Nobel price in CS) for Codd in 1981 for his work on the relational model
Example: Customer Relation • Relation schema:Customers (cid: integer, name: string, byear: integer, state: string) • Relation instance:
Data Integration • Integrate data from multiple sources into a common format for data mining. • Note: A good data warehouse has already taken care of this step. Data Warehouse Server OLTPDBMSs Extract, clean,transform, aggregate,load, update Other Data Sources Data Marts
Data Integration (Contd.) Problem: Heterogeneous schema integration • Different attribute names • Different units: Sales in $, sales in Yen, sales in DM
Data Integration (Contd.) Problem: Heterogeneous schema integration • Different scales: Sales in dollars versus sales in pennies • Derived attributes: Annual salary versus monthly salary
Data Integration (Contd.) Problem: Inconsistency due to redundancy • Customer with customer-id 150 has three children in relation1 and four children in relation2 • Computation of annual salary from monthly salary in relation1 does not match “annual-salary” attribute in relation2
Missing Values Often the values of some attributes in a record are not known. Reasons for missing values: • Attribute does not apply (e.g., maiden name) • Inconsistency with other recorded data • Equipment malfunction • Human errors • Attribute introduced recently (e.g., email address)
Missing Values: Approaches • Ignore the record • Complete the missing value: • Manual completion: Tedious and likely to be infeasible • Fill in a global constant, e.g., “NULL”, “unknown” • Use the attribute mean or mode • Construct a data mining model that predicts the missing value
Noisy Data • Examples: • Faulty data collection instruments • Data entry problems, misspellings • Data transmission problems • Technology limitation • Inconsistency in naming conventions • Duplicate records with different values for a common field
Noisy Data: Smoothing y Y1 Y1’ x X1
Noisy Data: Normalization • Scale data to fall within a small, specified range • Leave out extreme order statistics • Min-max normalization • Z-score normalization • Normalization by decimal scaling
Data Reduction Problem: • Data might not be at the right scale for analysis.Example: Individual phone calls versus monthly phone call usage • Complex data mining tasks might run a very long time.Example: Multi-terabyte data warehousesOne disk drive: About 20MB/s
Data Reduction: Attribute Selection • Select the “relevant” attributes for the data mining task • If there are k attributes, there are 2k-1 different subsets • Example: {salary,children,byear}, {salary,children}, {salary,byear}, {children,byear}, {salary}, {children}, {byear} • Choice of the right subset depends on: • Data mining task • Underlying probability distribution
Attribute Selection (Contd.) • How to choose relevant attributes: • Forward selection: Select greedily one attribute at a time • Backward elimination: Start with all attributes, eliminate irrelevant attributes • Combination of forward selection and backward elimination
Data Reduction: Parametric Models • Main idea: • Fit a parametric model to the data (e.g., multivariate normal distribution) • Store the model parameters, discard the data (except for outliers)
Instead of storing (x,y) pairs, store only the x-value.Then recompute the y-value usingy = ax + b Parametric Models: Example y x
Data Reduction: Sampling • Choose a representative subset of the data • Simple random sampling may have very poor performance in the presence of skew
Data Reduction: Sampling (Contd.) Stratified sampling: Biased sampling • Example: Keep population group ratios • Example: Keep minority population group count
Data Reduction: Histograms • Divide data into buckets and store average (sum) for each bucket • Can be constructed “optimally” for one attribute using dynamic programming • Example:Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram: (range, count, sum)(1-2,12,16), (3-6,8,36), (7-9,6,48), (10-12,6,66)
Histograms (Contd.) • Equal-width histogram • Divides the domain of an attribute into k intervals of equal size • Interval width = (Max – Min)/k • Computationally easy • Problems with data skew and outliers • Example:Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram: (range, count, sum)(1-3,14,22), (4-6,6,30), (7-9,6,48), (10-12,6,66)
Histograms (Contd.) • Equal-depth histogram • Divides the domain of an attribute into k intervals, each containing the same number of records • Variable interval width • Computationally easy • Example:Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram: (range, count, sum)(1,8,8), (2-4,8,22), (5-8,8,52), (9-12,8,84)
Data Reduction: Discretization • Same concept as histograms • Divide domain of a numerical attribute into intervals. • Replace attribute value with label for interval. • Example: • Dataset (age; salary):(25;30,000),(30;80,000),(27;50,000),(60;70,000),(50;55,000),(28;25,000) • Discretized dataset (age, discretizedSalary):(25,low),(30,high),(27,medium),(60,high),(50,medium),(28,low)
Discretization: Natural Hierarchies • Replace low-level concepts with high-level concepts • Example replacements: • Product SKU by category • City by state Year Industry Country Quarter Category State Month Week Product City Day
Data Reduction: Aggregation • Natural hierarchies on attributes can be used to aggregate data along the hierarchy Year Industry Country Quarter Category State Month Week Product City Day
Data Reduction: Other Methods • Principal component analysis • Fourier transformation • Wavelet transformation
Problems during data integration: Different attribute names Different units Different scales Derived attributes Redundant data Missing values Imputation Prediction Noisy data: Outlier removal Smoothing Normalization Data Reduction: Attribute selection Fitting parametric models Sampling Histograms Discretization Aggregation Data Preprocessing: Summary
Data Analysis • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering
Multidimensional Data Analysis Recall: Transactions(ckey, timekey, pkey, units, price) Customers(ckey, cid, name, byear, city, state, country) Time(tkey, day, month, quarter, year) Products(pkey, pname, price, pid, category, industry) Hierarchies on dimensions: Year Industry Country Quarter Category State Month Week Product City Day
Multidimensional Data Analysis Year Industry Country=“USA” Quarter Category State Month Week Product City Day
Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.country = “USA”GROUP BY P.industry, C.state • We think that Industry3 in CA is interesting. Year Industry Country=“USA” Quarter Category State Month Week Product City Day
Slice and Drill-Down Year Industry=“Industry3” Country Quarter Category State=“CA” Month Week Product City Day
Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND P.industry = “Industry3” AND C.state = “CA”GROUP BY P.category, C.city • We think that Category3 is interesting. Year Industry=“Industry3” Country Quarter Category State=“CA” Month Week Product City Day
Slice and Drill-Down Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Product Day
Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3”GROUP BY P.product, C.city • Nothing new in this view of the data. Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Product Day
Pivot To (City, Year) Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Day Product
Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3”GROUP BY C.city, T.year Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Day Product
Multidimensional Data Analysis Set of data manipulation operators • Roll-up: Go up one step in a dimension hierarchy. Example: month -> quarter • Drill-down: Go down one step in a dimension hierarchy. Example: quarter -> month • Slice: Select a subset of the values of a dimension. Example: All categories -> only Category3 • Dice: Select all values of a dimension. Example: Only Category3 -> all categories • Pivot: Select new dimensions to visualize the data. Example: Pivot to Time(quarter) and Customer(state)
Visual Intuition: Cube roll-up to category Customer Data Mart roll-up to state SH SF LA Product1 Product2Product3 Product4Product5Product6 20 30 20 15 10 50 Product roll-up to week M T W Th F S S Time 50 Units of Product6 sold on Monday in LA
OLAP Server Architectures • Relational OLAP (ROLAP) • Relational DBMS stores data mart • OLAP middleware: • Aggregation and navigation logic • Optimized for DBMS in the background, but slow and complex • Basically only one vendor: Microstrategy • Multidimensional OLAP (MOLAP) • Specialized array-based storage structure • Vendors: Hyperion (Essbase), Appix (iTM1), Oracle, Microsoft
OLAP Server Architectures • Desktop OLAP (DOLAP) • Performs OLAP directly at your PC • Vendors: Cognos (Powerplay), Business Objects, Brio Technology, Hummingbird • Hybrids and Application OLAP • More: www.olapreport.com
Summary: Multidimensional Analysis • Spreadsheet style data analysis • Roll-up, drill-down, slice, dice, and pivot your way to interesting cells in the CUBE • Mainstream technology • Established enterprises already have OLAP installations • When establishing your e-business, OLAP will be your first step in data analysis
Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data):If (relationship = husband), then (gender = male). 99.6%