data mining chapter 26 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Data Mining Chapter 26 PowerPoint Presentation
Download Presentation
Data Mining Chapter 26

Loading in 2 Seconds...

play fullscreen
1 / 31

Data Mining Chapter 26 - PowerPoint PPT Presentation


  • 209 Views
  • Uploaded on

Data Mining Chapter 26. Chapter 1. Introduction. Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Are all the patterns interesting? Major issues in data mining. Motivation: “Necessity is the Mother of Invention”.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Data Mining Chapter 26' - paul2


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
chapter 1 introduction
Chapter 1. Introduction
  • Motivation: Why data mining?
  • What is data mining?
  • Data Mining: On what kind of data?
  • Data mining functionality
  • Are all the patterns interesting?
  • Major issues in data mining
motivation necessity is the mother of invention
Motivation: “Necessity is the Mother of Invention”
  • Data explosion problem
    • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
  • We are drowning in data, but starving for knowledge!
  • Solution: Data warehousing and data mining
    • Data warehousing andon-line analytical processing
    • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
evolution of database technology
Evolution of Database Technology
  • 1960s:
    • Data collection, database creation, IMS and network DBMS
  • 1970s:
    • Relational data model, relational DBMS implementation
  • 1980s:
    • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
  • 1990s—2000s:
    • Data mining and data warehousing, multimedia databases, and Web databases
what is data mining
What Is Data Mining?
  • Data mining (knowledge discovery in databases):
    • Extraction of interesting (non-trivial,implicit, previously unknown and potentially useful)information or patterns from data in large databases
  • Alternative names:
    • Data mining: a misnomer?
    • Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • What is not data mining?
    • (Deductive) query processing.
    • Expert systems or small ML/statistical programs
why data mining potential applications
Why Data Mining? — Potential Applications
  • Database analysis and decision support
    • Market analysis and management
      • target marketing, customer relation management, market basket analysis, cross selling, market segmentation
    • Risk analysis and management
      • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
    • Fraud detection and management
  • Other Applications
    • Text mining (news group, email, documents)
    • Stream data mining
    • Web mining.
    • DNA data analysis
market analysis and management 1
Market Analysis and Management (1)
  • Where are the data sources for analysis?
    • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
  • Target marketing
    • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  • Determine customer purchasing patterns over time
    • Conversion of single to a joint bank account: marriage, etc.
  • Cross-market analysis
    • Associations/co-relations between product sales
    • Prediction based on the association information
market analysis and management 2
Market Analysis and Management (2)
  • Customer profiling
    • data mining can tell you what types of customers buy what products (clustering or classification)
  • Identifying customer requirements
    • identifying the best products for different customers
    • use prediction to find what factors will attract new customers
  • Provides summary information
    • various multidimensional summary reports
    • statistical summary information (data central tendency and variation)
corporate analysis and risk management
Corporate Analysis and Risk Management
  • Finance planning and asset evaluation
    • cash flow analysis and prediction
    • contingent claim analysis to evaluate assets
    • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
  • Resource planning:
    • summarize and compare the resources and spending
  • Competition:
    • monitor competitors and market directions
    • group customers into classes and a class-based pricing procedure
    • set pricing strategy in a highly competitive market
fraud detection and management 1
Fraud Detection and Management (1)
  • Applications
    • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
  • Approach
    • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
  • Examples
    • auto insurance: detect a group of people who stage accidents to collect on insurance
    • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    • medical insurance: detect professional patients and ring of doctors and ring of references
fraud detection and management 2
Fraud Detection and Management (2)
  • Detecting inappropriate medical treatment
    • Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
  • Detecting telephone fraud
    • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  • Retail
    • Analysts estimate that 38% of retail shrink is due to dishonest employees.
other applications
Other Applications
  • Sports
    • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
  • Astronomy
    • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
  • Internet Web Surf-Aid
    • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
data mining a kdd process
Data Mining: A KDD Process

Knowledge

Pattern Evaluation

  • Data mining: the core of knowledge discovery process.

Data Mining

Task-relevant Data

Selection

Data Warehouse

Data Cleaning

Data Integration

Databases

steps of a kdd process
Steps of a KDD Process
  • Learning the application domain:
    • relevant prior knowledge and goals of application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing: (may take 60% of effort!)
  • Data reduction and transformation:
    • Find useful features, dimensionality/variable reduction, invariant representation.
  • Choosing functions of data mining
    • summarization, classification, regression, association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation
    • visualization, transformation, removing redundant patterns, etc.
  • Use of discovered knowledge
data mining on what kind of data
Data Mining: On What Kind of Data?
  • Relational databases
  • Data warehouses
  • Transactional databases
  • Advanced DB and information repositories
    • Object-oriented and object-relational databases
    • Spatial and temporal data
    • Time-series data and stream data
    • Text databases and multimedia databases
    • Heterogeneous and legacy databases
    • WWW
association rule mining
Association Rule Mining
  • Association rule mining:
    • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
    • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database
  • Motivation: finding regularities in data
    • What products were often purchased together? — Beer and diapers?!
    • What are the subsequent purchases after buying a PC?
    • What kinds of DNA are sensitive to this new drug?
    • Can we automatically classify web documents?
association rule mining cont

Customer

buys both

Customer

buys diapers

Customer

buys beer

Association Rule Mining (cont.)
  • Itemset X={x1, …, xk}
  • Find all the rules XYwith min confidence and support
    • support, s, probability that a transaction contains XY
    • confidence, c,conditional probability that a transaction having X also contains Y.
  • Let min_support = 50%, min_conf = 50%:
    • A  C (50%, 66.7%)
    • C  A (50%, 100%)
mining association rules an example
Mining Association Rules—an Example

Min. support 50%

Min. confidence 50%

For rule AC:

support = support({A}{C}) = 50%

confidence = support({A}{C})/support({A}) = 66.6%

apriori a candidate generation and test approach
Apriori: A Candidate Generation-and-test Approach
  • Any subset of a frequent itemset must be frequent
    • if {beer, diaper, nuts} is frequent, so is {beer, diaper}
    • every transaction having {beer, diaper, nuts} also contains {beer, diaper}
  • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
  • Method:
    • generate length (k+1) candidate itemsets from length k frequent itemsets, and
    • test the candidates against DB
  • The performance studies show its efficiency and scalability
the apriori algorithm an example
The Apriori Algorithm — An Example

L1

Database TDB

C1

1st scan

C2

C2

2nd scan

L2

L3

C3

3rd scan

the apriori algorithm
The Apriori Algorithm
  • Pseudo-code:

Ck: Candidate itemset of size k

Lk : frequent itemset of size k

L1 = {frequent items};

for(k = 1; Lk !=; k++) do begin

Ck+1 = candidates generated from Lk;

for each transaction t in database do

increment the count of all candidates in Ck+1 that are contained in t

Lk+1 = candidates in Ck+1 with min_support

end

returnkLk;

important details of apriori
Important Details of Apriori
  • How to generate candidates?
    • Step 1: self-joining Lk
    • Step 2: pruning
  • Example of Candidate-generation
    • L3={abc, abd, acd, ace, bcd}
    • Self-joining: L3*L3
      • abcd from abc and abd
      • acde from acd and ace
    • Pruning:
      • acde is removed because ade is not in L3
    • C4={abcd}
how to generate candidates
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1

insert intoCk

select p.item1, p.item2, …, p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

  • Step 2: pruning

forall itemsets c in Ckdo

forall (k-1)-subsets s of c do

if (s is not in Lk-1) then delete c from Ck

classification and prediction
Classification and Prediction
  • Finding models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Presentation: decision-tree, classification rule, neural network
  • Prediction: Predict some unknown or missing numerical values
classification process model construction

Training

Data

Classifier

(Model)

Classification Process: Model Construction

Classification

Algorithms

IF rank = ‘professor’

OR years > 6

THEN tenured = ‘yes’

classification process use the model in prediction

Classifier

Testing

Data

Unseen Data

Classification Process: Use the Model in Prediction

(Jeff, Professor, 4)

Tenured?

decision trees
Decision Trees

Training set

output a decision tree for buys computer
Output: A Decision Tree for “buys_computer”

age?

<=30

overcast

>40

30..40

student?

credit rating?

yes

no

yes

fair

excellent

no

yes

no

yes

cluster and outlier analysis
Cluster and outlier analysis
  • Cluster analysis
    • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
    • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
  • Outlier analysis
    • Outlier: a data object that does not comply with the general behavior of the data
    • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis