The Software Infrastructure for Electronic Commerce

Databases and Data Mining

Lecture 4: An Introduction To Data Mining (II)

Johannes Gehrke

[email protected]

Lectures Three and Four
  • Data preprocessing
  • Multidimensional data analysis
  • Data mining
    • Association rules
    • Classification trees
    • Clustering
Types of Attributes
  • Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
  • Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
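The three attribute types call for different operations in code: arithmetic for numerical, equality tests for categorical, and a rank mapping for ordinal. A minimal sketch (the records and attribute names are hypothetical):

```python
# Hypothetical records illustrating the three attribute types.
records = [
    {"age": 34, "occupation": "teacher", "injury": "minor"},
    {"age": 51, "occupation": "student", "injury": "severe"},
]

# Numerical: ordered, lives on the real line -> arithmetic is meaningful.
mean_age = sum(r["age"] for r in records) / len(records)

# Nominal/categorical: finite set, no order -> only equality tests make sense.
occupations = {r["occupation"] for r in records}

# Ordinal: ordered, but differences are not meaningful -> use a rank map.
injury_rank = {"minor": 0, "moderate": 1, "severe": 2}
worst = max(records, key=lambda r: injury_rank[r["injury"]])
```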

Classification

Goal: Learn a function that assigns a record to one of several predefined classes.

Classification Example
  • Example training database
    • Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
    • Age is ordered, Car-type is a categorical attribute
    • Class label indicates whether the person bought the product
    • Dependent attribute is categorical
Regression Example
  • Example training database
    • Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
    • Spent indicates how much the person spent during a recent visit to the web site
    • Dependent attribute is numerical
Types of Variables (Review)
  • Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
  • Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
  • Random variables X1, …, Xk (predictor variables) and Y (dependent variable)
  • Xi has domain dom(Xi), Y has domain dom(Y)
  • P is a probability distribution on dom(X1) × … × dom(Xk) × dom(Y)
  • Training database D is a random sample from P
  • A predictor d is a function d: dom(X1) × … × dom(Xk) → dom(Y)
Classification Problem
  • If Y is categorical, the problem is a classification problem, and we use C instead of Y. |dom(C)| = J.
  • C is called the class label, d is called a classifier.
  • Let r be a record randomly drawn from P. Define the misclassification rate of d: RT(d,P) = P(d(r.X1, …, r.Xk) ≠ r.C)
  • Problem definition: Given dataset D that is a random sample from probability distribution P, find a classifier d such that RT(d,P) is minimized.
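On a finite sample D, RT(d, P) is estimated by the fraction of misclassified records. A minimal sketch, using a hypothetical rule-based classifier over the Age/Car-type example:

```python
# Training sample: predictor attributes (age, car_type) and class label C.
D = [
    (25, "Minivan", "YES"),
    (28, "Sports",  "NO"),
    (45, "Truck",   "NO"),
    (22, "Minivan", "YES"),
    (35, "Minivan", "NO"),
]

def d(age, car_type):
    """Hypothetical classifier: young minivan drivers buy the product."""
    return "YES" if age < 30 and car_type == "Minivan" else "NO"

def misclassification_rate(classifier, sample):
    # Empirical estimate of RT(d, P) = P(d(r.X1, ..., r.Xk) != r.C).
    errors = sum(1 for age, car, c in sample if classifier(age, car) != c)
    return errors / len(sample)

rate = misclassification_rate(d, D)  # 0.0: this d fits every sample record
```

A trivial baseline that always predicts "NO" would misclassify the two buyers, giving a rate of 0.4 on this sample.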
Regression Problem
  • If Y is numerical, the problem is a regression problem.
  • Y is called the dependent variable, d is called a regression function.
  • Let r be a record randomly drawn from P. Define the mean squared error of d: RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))²]
  • Problem definition: Given dataset D that is a random sample from probability distribution P, find a regression function d such that RT(d,P) is minimized.
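The mean squared error has the same empirical estimate: average the squared residuals over the sample. A sketch with a hypothetical linear regression function (data values are made up):

```python
# Sample records: (age, car_type, spent), with numerical dependent variable.
D = [(25, "Minivan", 210.0), (40, "Sports", 95.0), (33, "Truck", 50.0)]

def d(age, car_type):
    """Hypothetical regression function: a baseline minus an age effect."""
    return 200.0 - 2.0 * (age - 25)

def mean_squared_error(reg, sample):
    # Empirical estimate of RT(d, P) = E[(r.Y - d(r.X1, ..., r.Xk))^2].
    return sum((y - reg(age, car)) ** 2 for age, car, y in sample) / len(sample)

mse = mean_squared_error(d, D)
```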
Goals and Requirements
  • Goals:
    • To produce an accurate classifier/regression function
    • To understand the structure of the problem
  • Requirements on the model:
    • High accuracy
    • Understandable by humans, interpretable
    • Fast construction for very large training databases
Different Types of Classifiers
  • Linear discriminant analysis (LDA)
  • Quadratic discriminant analysis (QDA)
  • Density estimation methods
  • Nearest neighbor methods
  • Logistic regression
  • Neural networks
  • Fuzzy set theory
  • Decision Trees
Difficulties with LDA and QDA
  • Multivariate normal assumption often not true
  • Not designed for categorical variables
  • Form of classifier in terms of linear or quadratic discriminant functions is hard to interpret
Histogram Density Estimation
  • Curse of dimensionality
  • Cell boundaries are discontinuities. Beyond boundary cells, estimate falls abruptly to zero.
Kernel Density Estimation
  • How to choose the kernel bandwidth h?
    • The optimal h depends on a criterion
    • The optimal h depends on the form of the kernel
    • The optimal h might depend on the class label
    • The optimal h might depend on the part of the predictor space
  • How to choose form of the kernel?
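A one-dimensional Gaussian kernel density estimate makes the role of the bandwidth h concrete; a sketch (the data points and bandwidth values are illustrative):

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    n = len(data)
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 1.2, 3.0, 3.1]  # two apparent groups
# A small h gives a spiky, local estimate; a large h oversmooths
# and blurs the two groups together.
spiky = kde(1.1, data, h=0.1)
smooth = kde(1.1, data, h=2.0)
```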
K-Nearest Neighbor Methods
  • Difficulties:
    • Data must be stored; for classification of a new record, all data must be available
    • Computationally expensive in high dimensions
    • Choice of k is unknown
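The method itself is short; the drawbacks listed above are visible directly in the code, since every query scans the full training set. A minimal sketch:

```python
import math
from collections import Counter

def knn_classify(query, training, k):
    """Classify by majority vote among the k nearest training records.

    Note: all training data must be stored and scanned per query --
    the storage and speed drawbacks noted above."""
    by_dist = sorted(training, key=lambda rec: math.dist(query, rec[0]))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B"),
            ((5.1, 4.8), "B"), ((4.9, 5.2), "B")]
label = knn_classify((1.1, 1.0), training, k=3)
```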
Difficulties with Logistic Regression
  • Few goodness of fit and model selection techniques
  • Categorical predictor variables have to be transformed into dummy vectors.
Neural Networks and Fuzzy Set Theory


  • Classifiers are hard to understand
  • How to choose network topology and initial weights?
  • Categorical predictor variables?
What are Decision Trees?

[Figure: an example decision tree; an internal node splits on Car Type, sending Sports and Truck down one branch]
Decision Trees
  • A decision tree T encodes d (a classifier or regression function) in form of a tree.
  • A node t in T without children is called a leaf node. Otherwise t is called an internal node.
Internal Nodes
  • Each internal node has an associated splitting predicate. Most common are binary predicates.Example predicates:
    • Age <= 20
    • Profession in {student, teacher}
    • 5000*Age + 3*Salary – 10000 > 0
Internal Nodes: Splitting Predicates
  • Binary Univariate splits:
    • Numerical or ordered X: X <= c, c in dom(X)
    • Categorical X: X in A, A subset dom(X)
  • Binary Multivariate splits:
    • Linear combination split on numerical variables:Σ aiXi <= c
  • k-ary (k > 2) splits are analogous
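All three predicate families are evaluated the same way at classification time: a boolean test on the record that sends it down one branch. A sketch (attribute names and constants are illustrative, taken from the example predicates above):

```python
# A record is a dict of attribute values.
r = {"Age": 18, "Profession": "student", "Salary": 4000}

# Binary univariate split on an ordered attribute: X <= c.
univariate_numeric = r["Age"] <= 20

# Binary univariate split on a categorical attribute: X in A.
univariate_categorical = r["Profession"] in {"student", "teacher"}

# Binary multivariate (linear combination) split: sum(a_i * X_i) <= c,
# here written in the "... > 0" form used on the Internal Nodes slide.
multivariate = 5000 * r["Age"] + 3 * r["Salary"] - 10000 > 0

# Each internal node routes the record left or right by one such test.
branch = "left" if univariate_numeric else "right"
```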
Leaf Nodes

Consider leaf node t

  • Classification problem: Node t is labeled with one class label c in dom(C)
  • Regression problem: Two choices
    • Piecewise constant model:t is labeled with a constant y in dom(Y).
    • Piecewise linear model:t is labeled with a linear model Y = yt + Σ aiXi
Encoded classifier:

If (age < 30 and carType = Minivan) then YES

If (age < 30 and (carType = Sports or carType = Truck)) then NO

If (age >= 30) then NO

[Figure: the corresponding decision tree, with an internal node splitting on Car Type (Sports, Truck vs. Minivan)]
Choice of Classification Algorithm?
  • Example study: (Lim, Loh, and Shih, Machine Learning 2000)
    • 33 classification algorithms
    • 16 (small) data sets (UC Irvine ML Repository)
    • Each algorithm applied to each data set
  • Experimental measurements:
    • Classification accuracy
    • Computational speed
    • Classifier complexity
Classification Algorithms
  • Tree-structure classifiers:
    • IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1
  • Statistical methods:
  • Neural networks:
    • LVQ, RBF
Experimental Details
  • 16 primary data sets, created 16 more data sets by adding noise
  • Converted categorical predictor variables to 0-1 dummy variables if necessary
  • Error rates for 6 data sets estimated from supplied test sets, 10-fold cross-validation used for the other data sets
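10-fold cross-validation partitions the records into ten folds and averages the error over ten train/test rounds, each fold serving once as the test set. A sketch of the index bookkeeping:

```python
def k_fold_splits(n_records, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by <= 1.
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

splits = list(k_fold_splits(25, k=10))
```

In practice the records are shuffled (or stratified by class label) before splitting; this sketch keeps them in order for clarity.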
Ranking by Mean Error Rate

Rank  Algorithm            Mean Error  Time
1     Polyclass            0.195       3 hours
2     QUEST Multivariate   0.202       4 min
3     Logistic Regression  0.204       4 min
6     LDA                  0.208       10 s
8     IND CART             0.215       47 s
12    C4.5 Rules           0.220       20 s
16    QUEST Univariate     0.221       40 s
Other Results
  • Number of leaves for tree-based classifiers varied widely (median number of leaves between 5 and 32, after removing some outliers)
  • Mean misclassification rates for the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates
Decision Trees: Summary
  • Powerful data mining model for classification (and regression) problems
  • Easy to understand and to present to non-specialists
  • TIPS:
    • Even if black-box models sometimes give higher accuracy, construct a decision tree anyway
    • Construct decision trees with different splitting variables at the root of the tree
Clustering
  • Input: Relational database with fixed schema
  • Output: k groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups
  • More difficult than classification (unsupervised learning: no record labels are given)
  • Usage:
    • Exploratory data mining
    • Preprocessing step (e.g., outlier detection)
Clustering (Contd.)
  • In clustering we partition a set of records into meaningful sub-classes called clusters.
  • Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
  • Clustering helps users to detect inherent groupings and structure in a data set.
Clustering (Contd.)

Example input database: Two numerical variables

How many groups are there?

Requirement: Need to define “similarity” between records
Clustering (Contd.)
  • Output of clustering:
    • Representative points for each cluster
    • Labeling of each record with its cluster number
    • Other description of each cluster
  • Important: Use the “right” distance function
    • Scale or normalize all attributes. Example: seconds, hours, days
    • Assign weights according to the importance of each attribute
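Scaling and weighting can be folded directly into the distance function. A sketch with min-max normalization and hypothetical per-attribute weights, using the seconds-vs.-days scale problem mentioned above:

```python
import math

def normalize(records):
    """Min-max scale each attribute to [0, 1] so units don't dominate."""
    cols = list(zip(*records))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(rec, lo, hi))
        for rec in records
    ]

def weighted_distance(a, b, weights):
    """Euclidean distance with per-attribute importance weights."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Same kind of quantity on wildly different scales (seconds vs. days):
raw = [(3600.0, 1.0), (7200.0, 2.0), (86400.0, 30.0)]
scaled = normalize(raw)
d = weighted_distance(scaled[0], scaled[1], weights=(1.0, 1.0))
```

Without the normalization step, the seconds attribute would dominate the distance entirely.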
Clustering: Summary
  • Finding natural groups in data
  • Common post-processing steps:
    • Build a decision tree with the cluster label as class label
    • Try to explain the groups using the decision tree
    • Visualize the clusters
    • Examine the differences between the clusters with respect to the fields of the dataset
  • Try different numbers of clusters
Web Usage Mining
  • Data sources:
    • Web server log
    • Information about the web site:
      • Site graph
      • Metadata about each page (type, objects shown)
      • Object concept hierarchies
  • Preprocessing:
    • Detect session and user context (Cookies, user authentication, personalization)
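Session detection from a raw server log is typically done by grouping requests on a user key (e.g., a cookie) and starting a new session after an idle gap. A sketch with a hypothetical 30-minute timeout (a common heuristic, not something fixed by the slide):

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session

def sessionize(log_entries):
    """Group (cookie, timestamp, url) entries into per-user sessions.

    Assumes entries are sorted by timestamp; a new session starts when
    the gap since the user's previous request exceeds the timeout."""
    sessions = []      # list of lists of urls
    last_seen = {}     # cookie -> (timestamp, session index)
    for cookie, ts, url in log_entries:
        prev = last_seen.get(cookie)
        if prev is None or ts - prev[0] > SESSION_TIMEOUT:
            sessions.append([url])
            last_seen[cookie] = (ts, len(sessions) - 1)
        else:
            sessions[prev[1]].append(url)
            last_seen[cookie] = (ts, prev[1])
    return sessions

log = [("c1", 0, "/home"), ("c2", 10, "/home"),
       ("c1", 600, "/products"), ("c1", 600 + 31 * 60, "/home")]
sessions = sessionize(log)  # c1's last request starts a new session
```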
Web Usage Mining (Contd.)
  • Data Mining
    • Association Rules
    • Sequential Patterns
    • Classification
  • Action
    • Personalized pages
    • Cross-selling
  • Evaluation and Measurement
    • Deploy personalized pages selectively
    • Measure effectiveness of each implemented action
Large Case Study: Churn
  • Telecommunications industry
  • Try to predict churn (whether customer will switch long-distance carrier)
  • Dataset:
    • 5000 records (tiny dataset, but manageable here in class)
    • 21 attributes, both numerical and categorical attributes (very few attributes)
    • Data is already cleaned! No missing values, inconsistencies, etc. (again, for classroom purposes)
Churn Example: Dataset Columns
  • State
  • Account length: Number of months the customer has been with the company
  • Area code
  • Phone number
  • International plan: yes/no
  • Voice mail: yes/no
  • Number of voice messages: Average number of voice messages per day
  • Total (day, evening, night, international) minutes: Average number of minutes charged
  • Total (day, evening, night, international) calls: Average number of calls made
  • Total (day, evening, night, international) charge: Average amount charged per day
  • Number customer service calls: Number of calls made to customer support in the last six months
  • Churned: Did the customer switch long-distance carriers in the last six months
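A first-pass look at this data can be as simple as tabulating churn rate against one candidate predictor. A sketch on a tiny synthetic sample (the real dataset has 5000 records; the values below are made up for illustration):

```python
# (number_customer_service_calls, churned) -- synthetic illustration only.
records = [(0, False), (1, False), (1, False), (2, False),
           (4, True), (5, True), (4, False), (6, True)]

def churn_rate(sample, min_calls):
    """Fraction of churners among customers with >= min_calls support calls."""
    group = [churned for calls, churned in sample if calls >= min_calls]
    return sum(group) / len(group) if group else 0.0

low = churn_rate(records, 0)    # overall churn rate
high = churn_rate(records, 4)   # heavy callers churn far more often
```

The same comparison is what a decision tree discovers automatically when it chooses a split on the customer-service-calls attribute.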
Churn Example: Analysis
  • We start out by getting familiar with the dataset
    • Record viewer
    • Statistics visualization
    • Evidence classifier
    • Visualizing joint distributions
    • Visualizing geographic distribution of churn
Churn Example: Analysis (Contd.)
  • Building and interpreting data mining models
    • Decision trees
    • Clustering
Evaluating Data Mining Tools
  • Checklist:
    • Integration with current applications and your data management infrastructure
    • Ease of usage
    • Automation
    • Scalability to large datasets
      • Number of records
      • Number of attributes
      • Datasets larger than main memory
      • Support of sampling
    • Export of models into your enterprise
    • Stability of the company that offers the product
Integration With Data Management
  • Proprietary storage format?
  • Native support of major database systems:
    • IBM DB2, Informix, Oracle, SQL Server, Sybase
    • ODBC
    • Support of parallel database systems
  • Integration with your data warehouse
Cost Considerations
  • Proprietary or commodity hardware and operating system
    • Client and server might be different
    • What server platforms are supported?
  • Support staff needed
  • Training of your staff members
    • Online training, tutorials
    • On-site training
    • Books, course material
Data Mining Projects
  • Checklist:
    • Start with well-defined business questions
    • Have a champion within the company
    • Define measures of success and failure
  • Main difficulty: No automation
    • Understanding the business problem
    • Selecting the relevant data
    • Data transformation
    • Selection of the right mining methods
    • Interpretation
Understand the Business Problem

Important questions:

  • What is the problem that we need to solve?
  • Are there certain aspects of the problem that are especially interesting?
  • Do we need data mining to solve the problem?
  • What information is actionable, and when?
  • Are there important business rules that constrain our solution?
  • What people should we keep in the loop, and with whom should we discuss intermediate results?
  • Who are the (internal) customers of the effort?
Hiring Outside Experts?


  • One-time problem versus ongoing process
  • Source of data
  • Deployment of data mining models
  • Availability and skills of your own staff
Hiring Experts

Types of experts:

  • Your software vendor
  • Consulting companies/centers/individuals

Your goal: Develop in-house expertise

The Data Mining Market
  • Revenues for the data mining market:$8 billion (Mega Group 1/1999)
  • Sales of data mining software (Two Crows Corporation 6/99)
    • 1998: $50 million
    • 1999: $75 million
    • 2000: $120 million
  • Hardware companies often use their data mining software as loss-leaders (Examples: IBM, SGI)
Knowledge Management in General

Percent of information technology executives citing the systems used in their knowledge management strategy (IW 4/1999)

  • Relational Database 95%
  • Text/Document Search 80%
  • Groupware 71%
  • Data Warehouse 65%
  • Data Mining Tools 58%
  • Expert Database/AI Tools 25%
Crossing the Chasm
  • Data mining is currently trying to cross this chasm.
  • Great opportunities, but also great perils.
    • You have a unique advantage by applying data mining “the right way”.
    • It is not yet common knowledge how to apply data mining “the right way”.
    • No major cooking recipes to make a data mining project work (yet).
Summary
  • Database and data mining technology is crucial for any enterprise
  • We talked about the complete data management infrastructure
    • DBMS technology
    • Querying
    • WWW/DBMS integration
    • Data warehousing and dimensional modeling
    • OLAP
    • Data mining
Additional Material: Web Sites
  • Data mining companies, jobs, courses, publications, and datasets
  • ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD)
Additional Material: Books
  • U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
  • Michael Berry and Gordon Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, 1997.
  • Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999.
  • Michael Berry and Gordon Linoff, Mastering Data Mining, John Wiley & Sons, 2000.
Additional Material: Database Systems
  • IBM DB2
  • Oracle
  • Sybase
  • Informix
  • Microsoft
  • NCR Teradata

“Prediction is very difficult, especially about the future.”

Niels Bohr