Fundamentals of Data Mining

Association Rules

Fernando Berzal, fberzal@decsai.ugr.es

Motivation

Association mining searches for interesting relationships among items in a given data set

EXAMPLES

  • Diapers and six-packs are bought together, especially on Thursday evenings (a myth?)
  • A sequence such as buying first a digital camera and then a memory card is a frequent (sequential) pattern
  • Motivation
  • Definition
  • Discovery
  • Variations
  • Visualization
  • Extensions
  • Applications
    • ART
    • ATBAR
Motivation

MARKET BASKET ANALYSIS

The earliest form of association rule mining

Applications:

Catalog design, store layout, cross-marketing…

Definition

Item

  • In transactional databases:

Any of the items included in a transaction.

  • In relational databases:

(Attribute, value) pair

k-itemset

Set of k items

Itemset support

support(I) = P(I)

Definition

Association rule

X ⇒ Y

  • Support

support(X⇒Y) = support(X∪Y) = P(X∪Y)

  • Confidence

confidence(X⇒Y) = support(X∪Y) / support(X) = P(Y|X)

NOTE: Both support and confidence are relative measures (fractions of the transactions, not absolute counts)
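
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy baskets are assumptions made for the example, not part of the original slides.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Relative support: fraction of transactions containing every item."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """confidence(X => Y) = support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical toy data (four market baskets).
baskets = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"milk", "bread"},
)]

print(support({"diapers", "beer"}, baskets))       # 2 of 4 baskets
print(confidence({"diapers"}, {"beer"}, baskets))  # 2 of the 3 diaper baskets
```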

Discovery

Association rule mining

Find all frequent itemsets

Generate strong association rules from the frequent itemsets

Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
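
The second step (rule generation) can be sketched as follows, assuming the frequent itemsets and their relative supports are already available in a dictionary. The name `strong_rules` and the example supports are hypothetical; the supports correspond to counts A #7, B #9, AB #6 over an assumed total of 10 transactions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float]


def strong_rules(freq: Dict[FrozenSet[str], float], min_conf: float) -> List[Rule]:
    """Return (antecedent, consequent, support, confidence) for confident rules.

    `freq` must map every frequent itemset to its relative support; since all
    subsets of a frequent itemset are frequent, every antecedent is a key.
    """
    rules = []
    for itemset, supp in freq.items():
        # Try every non-empty proper subset of the itemset as antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                conf = supp / freq[x]
                if conf >= min_conf:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Assumed supports, for illustration only.
freq = {frozenset("A"): 0.7, frozenset("B"): 0.9, frozenset("AB"): 0.6}
for x, y, supp, conf in strong_rules(freq, min_conf=0.8):
    print(sorted(x), "=>", sorted(y), f"support={supp:.2f} confidence={conf:.2f}")
```

Because every itemset in `freq` already meets the minimum support threshold, only the confidence threshold has to be checked here.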

Discovery

Apriori

Observation:

All non-empty subsets of a frequent itemset must also be frequent

Algorithm:

Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates)
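
A minimal, unoptimized sketch of this level-wise search is given below. It follows the idea stated on the slide (join frequent k-itemsets into candidates, prune candidates with an infrequent subset, then count), but the function name, the data, and the brute-force counting are assumptions made for the example rather than the published algorithm's data structures.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[str]],
            min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset with its relative support."""
    data: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    n = len(data)

    # Level 1: count single items.
    counts: Dict[FrozenSet[str], int] = {}
    for t in data:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}

    result = dict(frequent)
    k = 1
    while frequent:
        prev = list(frequent)
        # Join step: unions of two frequent k-itemsets that have k+1 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count step: one pass over the data per level.
        counts = {c: sum(c <= t for t in data) for c in candidates}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy database.
db = [{"A", "B", "D"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C", "D"}]
found = apriori(db, min_support=0.5)
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), found[itemset])
```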

Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94

Discovery

Apriori improvements (I)

  • Reducing the number of candidates
    Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95
  • Sampling
    Toivonen: "Sampling Large Databases for Association Rules", VLDB'96
    Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97
  • Partitioning
    Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95
Discovery

Apriori improvements (II)

  • Transaction reduction
    Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)
  • Dynamic itemset counting
    Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC)
    Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)
Discovery

Apriori-like algorithm:

TBAR

(Tree-based association rule mining)


Berzal, Cubero, Sánchez & Serrano

“TBAR: An efficient method for association rule mining in relational databases”

Data & Knowledge Engineering, 2001

Discovery: TBAR

[Figure: TBAR itemset tree with levels L1, L2, and L3; each node stores an item and a support count. Annotated examples: 7 instances with A (L1); 6 instances with AB, 5 instances with AD, and 6 instances with BC (L2); 5 instances with ABD (L3).]

Discovery

An alternative to Apriori:

Compress the database representing frequent items into a frequent-pattern tree (FP-tree)…

 Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000

Discovery

A challenge

When an itemset is frequent, all its subsets are also frequent, so a single long frequent itemset implies an exponential number of frequent itemsets

  • Closed itemset C: there exists no proper super-itemset S such that support(S) = support(C)
  • Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent
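
As a sketch of these two definitions, the function below filters a collection of frequent itemsets down to the closed and the maximal ones. The function name and the support counts are hypothetical. Only frequent supersets are inspected, which is sufficient because a superset with the same support as a frequent itemset is necessarily frequent too.

```python
from typing import Dict, FrozenSet, Set, Tuple


def closed_and_maximal(
    freq: Dict[FrozenSet[str], int]
) -> Tuple[Set[FrozenSet[str]], Set[FrozenSet[str]]]:
    """Split frequent itemsets (with absolute support counts) into closed and maximal."""
    closed, maximal = set(), set()
    for itemset, count in freq.items():
        supersets = [other for other in freq if itemset < other]
        # Closed: no proper superset has the same support.
        if not any(freq[other] == count for other in supersets):
            closed.add(itemset)
        # Maximal: no proper superset is frequent at all.
        if not supersets:
            maximal.add(itemset)
    return closed, maximal


# Assumed support counts over a toy database of 4 transactions (threshold: 2).
counts = {
    frozenset("A"): 3, frozenset("B"): 4, frozenset("C"): 2, frozenset("D"): 2,
    frozenset("AB"): 3, frozenset("AD"): 2, frozenset("BC"): 2, frozenset("BD"): 2,
    frozenset("ABD"): 2,
}
closed, maximal = closed_and_maximal(counts)
print(sorted("".join(sorted(s)) for s in closed))
print(sorted("".join(sorted(s)) for s in maximal))
```

Every maximal itemset is also closed, so the maximal set is the more compact (but lossy) summary: it no longer records the supports of the subsets.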
Variations

Based on the kinds of patterns to be mined:

  • Frequent itemset mining (transactional and relational data)
  • Sequential pattern mining (sequence data sets, e.g. bioinformatics)
  • Structured pattern mining (structured data, e.g. graphs)
Variations

Based on the types of values handled:

  • Boolean association rules
  • Quantitative association rules
  • Fuzzy association rules

 Delgado, Marín, Sánchez & Vila

“Fuzzy association rules: General model and applications”

IEEE Transactions on Fuzzy Systems, 2003

Variations

More options:

  • Generalized association rules (a.k.a. multilevel association rules)
  • Constraint-based association rule mining
  • Incremental algorithms
  • Top-k algorithms

ICDM FIMI: Workshop on Frequent Itemset Mining Implementations

http://fimi.cs.helsinki.fi/

Visualization

Integrated into data mining tools to help users understand data mining results:

  • Table-based approach, e.g. SAS Enterprise Miner, DBMiner…
  • 2D matrix-based approach, e.g. SGI MineSet, DBMiner…
  • Graph-based techniques, e.g. DBMiner ball graphs
Visualization: Tables
Visualization: Visual aids
Visualization: 2D Matrix
Visualization: Graphs
Visualization: VisAR

Based on parallel coordinates

(Techapichetvanich & Datta, ADMA’2005)

Extensions

Confidence is not the best possible interestingness measure for rules

e.g. A very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent

X went to war ⇒ X did not serve in Vietnam

(from the US Census)
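
A small numeric sketch shows why a very frequent consequent inflates confidence. The supports below are made up for illustration (they are not Census figures), and lift is used as the comparison measure because it divides out the consequent's own frequency.

```python
def confidence(supp_xy: float, supp_x: float) -> float:
    """confidence(X => Y) = support(X u Y) / support(X)."""
    return supp_xy / supp_x


def lift(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """lift(X => Y) = support(X u Y) / (support(X) * support(Y)); 1 means independence."""
    return supp_xy / (supp_x * supp_y)


# Hypothetical supports: the consequent Y holds for 95% of all records.
supp_x, supp_y, supp_xy = 0.10, 0.95, 0.09

conf = confidence(supp_xy, supp_x)
print(f"confidence = {conf:.2f}")                        # looks like a strong rule
print(f"lift       = {lift(supp_xy, supp_x, supp_y):.3f}")  # below 1
```

The rule is highly confident, yet the lift is below 1: under these assumed numbers, Y is slightly less likely when X holds than it is overall.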

Extensions

Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991)

P1 ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)supp(C)

P2 ACC(A⇒C) monotonically increases with supp(A⇒C)

P3 ACC(A⇒C) monotonically decreases with supp(A) (or supp(C))

Extensions

Certainty factors…

  • … satisfy Piatetsky-Shapiro’s properties
  • … are widely used in expert systems
  • … are not symmetric (unlike interest/lift)
  • … can substitute conviction when CF>0
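
The properties listed above can be checked numerically. The sketch below assumes the usual certainty-factor definition in terms of the rule confidence and the consequent support, together with the standard definition of conviction; the function names and the example supports are hypothetical.

```python
def certainty_factor(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """Certainty factor of A => C computed from relative supports."""
    conf = supp_ac / supp_a
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


def conviction(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """Conviction of A => C; undefined (division by zero) when confidence is 1."""
    conf = supp_ac / supp_a
    return (1 - supp_c) / (1 - conf)


# Hypothetical supports.
supp_a, supp_c, supp_ac = 0.4, 0.5, 0.3

cf = certainty_factor(supp_a, supp_c, supp_ac)
cv = conviction(supp_a, supp_c, supp_ac)
print(cf, 1 - 1 / cv)                             # equal when CF > 0
print(certainty_factor(supp_c, supp_a, supp_ac))  # CF of C => A differs
print(certainty_factor(0.4, 0.5, 0.4 * 0.5))      # 0 under independence (P1)
```

When CF > 0, the identity CF = 1 − 1/conviction holds, which is one way to read the claim that certainty factors can substitute for conviction in that range.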

Berzal, Blanco, Sánchez & Vila: “Measuring the accuracy and interest of association rules: A new framework”, Intelligent Data Analysis, 2002

Extensions

References:


 Hilderman & Hamilton: “Evaluation of interestingness measures for ranking discovered knowledge”. PAKDD, 2001

 Tan, Kumar & Srivastava: “Selecting the right objective measure for association analysis”. Information Systems, vol. 29, pp. 293-313, 2004.

 Berzal, Cubero, Marín, Sánchez, Serrano & Vila: “Association rule evaluation for classification purposes” TAMIDA’2005

Applications

Two sample applications where association rules have been successful

  • Classification (ART)
  • Anomaly detection (ATBAR)

Berzal, Cubero, Sánchez & Serrano

“ART: A hybrid classification model”

Machine Learning Journal, 2004

Balderas, Berzal, Cubero, Eisman & Marín “Discovering Hidden Association Rules ”

KDD’2005, Chicago, Illinois, USA

Classification

Classification models based on association rules

  • Partial classification models, e.g. Bayardo
  • “Associative” classification models, e.g. CBA (Liu et al.)
  • Bayesian classifiers, e.g. LB (Meretakis et al.)
  • Emergent patterns, e.g. CAEP (Dong et al.)
  • Rule trees, e.g. Wang et al.
  • Rules with exceptions, e.g. Liu et al.

Classification

GOAL

Simple, intelligible, and robust classification models obtained in an efficient and scalable way

MEANS

Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]

ART Classification Model

IDEA

Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.

ART = Association Rule Tree

KEY

Association rules + “else” branches

Hybrid between decision trees and decision lists
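
The shape of such a model (rules plus an “else” branch at each level) can be sketched as a small data structure. This is only a toy illustration of the structure described above, not the authors' induction algorithm; the class name, the attribute names, and the example tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Node:
    # Each rule: (attribute -> required value, predicted class label).
    rules: List[Tuple[Dict[str, str], str]]
    # Where to go when no rule fires: a subtree or a default class label.
    else_branch: Union["Node", str, None] = None


def classify(node: Union[Node, str, None], record: Dict[str, str]) -> Union[str, None]:
    """Follow 'else' branches until a rule matches or a leaf label is reached."""
    while isinstance(node, Node):
        for conditions, label in node.rules:
            if all(record.get(attr) == value for attr, value in conditions.items()):
                return label
        node = node.else_branch
    return node


# Hypothetical two-level model.
tree = Node(
    rules=[({"a": "1", "b": "1"}, "pos")],
    else_branch=Node(rules=[({"c": "0"}, "neg")], else_branch="pos"),
)

print(classify(tree, {"a": "1", "b": "1", "c": "0"}))  # first-level rule fires
print(classify(tree, {"a": "0", "b": "1", "c": "0"}))  # falls through to the else branch
print(classify(tree, {"a": "0", "b": "1", "c": "1"}))  # default label
```

Rules that test several attributes at once give the tree-like part of the model; the chain of “else” branches behaves like a decision list.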

ART Classification Model

SPLICE


ART classification model

Example: ART vs. TDIDT

[Figure: two panels, labeled ART and TDIDT]


ART classification model

Final comments

Classification models

  • Acceptable accuracy
  • Reduced complexity
  • Attribute interactions
  • Robustness (noise & primary keys)

Classifier building method

  • Efficient algorithm
  • Good scalability properties
  • Automatic parameter selection
Anomaly detection

It is often more interesting to find surprising non-frequent events than frequent ones

EXAMPLES

  • Abnormal network activity patterns in intrusion detection systems.
  • Exceptions to “common” rules in Medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…)
Anomaly detection

Anomalous association rule

Confident rule representing homogeneous deviations from common behavior.


Anomaly detection

Anomalous association rule

  • X ⇒ Y is frequent and confident: X usually implies Y (dominant rule)
  • X ¬Y ⇒ A is confident: when X does not imply Y, then it usually implies A (the anomaly)
  • X Y ⇒ ¬A is confident
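
The three conditions (X ⇒ Y frequent and confident, X ¬Y ⇒ A confident, X Y ⇒ ¬A confident) can be checked directly against a transaction list. The sketch below is a brute-force test, not ATBAR; the function name, the thresholds, and the data are assumptions, and ¬Y is read as "the transaction does not contain all of Y".

```python
from typing import FrozenSet, Iterable, List


def is_anomalous(x: Iterable[str], y: Iterable[str], a: Iterable[str],
                 transactions: List[FrozenSet[str]],
                 min_supp: float, min_conf: float) -> bool:
    """Check whether 'X => A when not Y' satisfies the three conditions."""
    xs, ys, as_ = frozenset(x), frozenset(y), frozenset(a)
    with_x = [t for t in transactions if xs <= t]
    with_xy = [t for t in with_x if ys <= t]
    with_x_not_y = [t for t in with_x if not ys <= t]
    if not with_xy or not with_x_not_y:
        return False

    # Dominant rule X => Y must be frequent and confident.
    if len(with_xy) / len(transactions) < min_supp:
        return False
    if len(with_xy) / len(with_x) < min_conf:
        return False

    # X not-Y => A must be confident.
    conf_anomaly = sum(as_ <= t for t in with_x_not_y) / len(with_x_not_y)
    # X Y => not-A must be confident.
    conf_not_a = sum(not as_ <= t for t in with_xy) / len(with_xy)
    return conf_anomaly >= min_conf and conf_not_a >= min_conf


# Hypothetical data: 8 records follow the dominant rule, 2 deviate in the same way.
records = [frozenset({"x", "y"})] * 8 + [frozenset({"x", "a"})] * 2
print(is_anomalous({"x"}, {"y"}, {"a"}, records, min_supp=0.5, min_conf=0.8))
```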

Anomaly detection

X ⇒ Y is the dominant rule

X ⇒ A when ¬Y is the anomalous rule

Anomaly detection

Suzuki et al.’s “Exception Rules”

X ⇒ Y is an association rule

X ∧ I ⇒ ¬Y is the exception rule

I is the “interacting” itemset

X ⇒ I is the reference rule

  • Too many exceptions
  • The “cause” needs to be present
Anomaly detection: ATBAR

Anomalous association rules

[Figure: ATBAR itemset tree, with steps labeled "First scan" and "Second scan". Node labels include A #7, B #9, C #7, D #8, B #6, D #5, and the row A#7 AB#6 AC#4 AD#5 AE#3 AF#3; one node is marked A* and part of the row is labeled "Non-frequent".]

Anomaly detection: ATBAR

Anomalous association rules

[Figure: ATBAR itemset tree, with steps labeled "First scan" and "Second scan". The nodes A #7, B #9, C #7, and D #8 carry the starred entries A*, B*, C*, and D*; further nodes include B #6, C #6, D #5, and D #7.]

Anomaly detection: ATBAR

Anomalous association rules

Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR

Anomaly detection: Results

Experiments on health-related datasets from the UCI Machine Learning Repository

  • Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)
  • Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)
Anomaly detection: Results

An example from the Census dataset:

if WORKCLASS: Local-gov
then CAPGAIN: [99999.0 , 99999.0] (7 out of 7)    ← “anomaly”
when not CAPGAIN: [0.0 , 20051.0]    ← usual consequent

Anomaly detection: Results
  • Anomalous association rules (novel characterization of potentially interesting knowledge)
  • An efficient algorithm for discovering anomalous association rules: ATBAR
  • Some heuristics for filtering the discovered anomalous association rules