Fundamentos de Minería de Datos (Fundamentals of Data Mining)

Reglas de asociación (Association rules)

Fernando Berzal

Motivation

Association mining searches for interesting relationships among items in a given data set.

EXAMPLES

  • Diapers and six-packs are bought together, especially on Thursday evenings (a myth?)

  • Buying a digital camera first and then a memory card is a frequent (sequential) pattern

Outline:

  • Motivation

  • Definition

  • Discovery

  • Variations

  • Visualization

  • Extensions

  • Applications

    • ART

    • ATBAR



Motivation

MARKET BASKET ANALYSIS

The earliest form of association rule mining

Applications:

Catalog design, store layout, cross-marketing…


Definition

Item

  • In transactional databases:

    Any of the items included in a transaction.

  • In relational databases:

    An (attribute, value) pair.

k-itemset

  • A set of k items.

Itemset support

  • support(I) = P(I), the fraction of transactions that contain every item in I.
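
The support definition can be illustrated with a minimal Python sketch. The transactions and the `support` helper below are hypothetical and only meant to show the computation on a tiny transactional data set.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

# {diapers, beer} is a 2-itemset contained in 2 of the 4 transactions.
print(support({"diapers", "beer"}, transactions))
print(support({"bread"}, transactions))
```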


Definition

Association rule

X ⇒ Y

  • Support

    support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)

  • Confidence

    confidence(X ⇒ Y) = support(X ∪ Y) / support(X) = P(Y|X)

    NOTE: Both support and confidence are relative values (fractions of the data set, not absolute counts).
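
As a rough illustration of the two formulas, the following Python sketch computes the support and confidence of a rule over a toy data set. The data and the helper names `support` and `confidence` are assumptions made for this example.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """confidence(X => Y) = support(X u Y) / support(X), i.e. P(Y|X)."""
    x, y = frozenset(antecedent), frozenset(consequent)
    supp_x = support(x, transactions)
    if supp_x == 0:
        raise ValueError("the antecedent never occurs in the data")
    return support(x | y, transactions) / supp_x


# Hypothetical transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

# Rule {diapers} => {beer}: support 2/4, confidence (2/4) / (3/4) = 2/3.
print(support({"diapers", "beer"}, transactions))
print(confidence({"diapers"}, {"beer"}, transactions))
```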


Discovery

Association rule mining

  • Find all frequent itemsets.

  • Generate strong association rules from the frequent itemsets.

Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
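
The second step can be sketched as follows, assuming the frequent itemsets and their supports are already available in a dictionary. The function name `strong_rules` and the support values are hypothetical; the sketch relies on every non-empty subset of a frequent itemset also being present in the dictionary.

```python
from itertools import combinations


def strong_rules(frequent, min_confidence):
    """Derive rules X => Y from frequent itemsets.

    `frequent` maps each frequent itemset (a frozenset) to its relative
    support, so the minimum support threshold is already satisfied.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset of the itemset as antecedent.
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = supp / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Hypothetical supports for three frequent itemsets.
frequent = {
    frozenset({"A"}): 0.7,
    frozenset({"B"}): 0.9,
    frozenset({"A", "B"}): 0.6,
}

for x, y, supp, conf in strong_rules(frequent, min_confidence=0.8):
    print(sorted(x), "=>", sorted(y), "support", supp, "confidence", round(conf, 3))
```

With these numbers only A ⇒ B passes the confidence threshold; B ⇒ A has the same support but a lower confidence.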


Discovery

Apriori

Observation:

All non-empty subsets of a frequent itemset must also be frequent

Algorithm:

Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates)

Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94
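
The level-wise idea can be sketched in a few lines of Python. This is a simplified illustration, not the tuned algorithm from the paper: candidates are counted by brute force, and the function name and data are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: relative support} for every frequent itemset."""
    n = len(transactions)

    def frequent_among(candidates):
        # One pass over the data to count each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = frequent_among({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 1
    while level:
        k += 1
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that contain k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_among(candidates)
        result.update(level)
    return result


# Hypothetical transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

for itemset, supp in sorted(apriori(transactions, 0.5).items(), key=lambda p: sorted(p[0])):
    print(sorted(itemset), supp)
```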


Discovery

Apriori improvements (I)

  • Reducing the number of candidates

    Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95

  • Sampling

    Toivonen: "Sampling Large Databases for Association Rules", VLDB'96

    Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97

  • Partitioning

    Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95


Discovery

Apriori improvements (II)

  • Transaction reduction

    Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)

  • Dynamic itemset counting

    Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC)

    Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)


Discovery

Apriori-like algorithm:

TBAR

(Tree-based association rule mining)


Berzal, Cubero, Sánchez & Serrano

“TBAR: An efficient method for association rule mining in relational databases”

Data & Knowledge Engineering, 2001


Discovery: TBAR

[Figure: TBAR itemset tree. Each node is labeled with an item and its support count (e.g. A #7, B #9, C #7, D #8), and the tree levels are labeled L1, L2 and L3.]

Example node readings from the figure:

  • L1: A #7 means 7 instances with A

  • L2: B #6 below A means 6 instances with AB; D #5 below A means 5 instances with AD; C #6 below B means 6 instances with BC

  • L3: D #5 below AB means 5 instances with ABD



Discovery

An alternative to Apriori:

Compress the database representing frequent items into a frequent-pattern tree (FP-tree)…

 Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000
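
To give a feel for the compression step only, here is a minimal Python sketch that builds an FP-tree from a handful of hypothetical transactions. It omits the header table and the mining phase described in the paper, and the names `FPNode` and `build_fp_tree` are assumptions made for this example.

```python
from collections import Counter


class FPNode:
    """One tree node: an item, a counter, and links to parent and children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep the frequent ones.
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None)
    # Second scan: insert each transaction as a path of frequent items,
    # most frequent first (ties broken by name) so that prefixes are shared.
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq


def show(node, depth=0):
    for child in sorted(node.children.values(), key=lambda c: c.item):
        print("  " * depth + f"{child.item} #{child.count}")
        show(child, depth + 1)


# Hypothetical transactions.
transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, freq = build_fp_tree(transactions, min_count=2)
show(root)
```

All four transactions share the prefix `b`, so the tree needs only four nodes to represent them.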


Discovery

A challenge

When an itemset is frequent, all its subsets are also frequent, so a single frequent k-itemset implies 2^k − 1 frequent non-empty itemsets (itself included).

  • Closed itemset C: there exists no proper super-itemset S such that support(S) = support(C).

  • Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent.
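
The two definitions can be checked mechanically. The sketch below assumes a dictionary holding the support counts of all frequent itemsets; the counts are hypothetical. For a frequent itemset it is enough to inspect frequent supersets, because a superset with the same support would itself be frequent.

```python
def closed_and_maximal(frequent):
    """Split frequent itemsets into closed and maximal ones.

    `frequent` maps every frequent itemset (a frozenset) to its support count.
    """
    closed, maximal = set(), set()
    for itemset, count in frequent.items():
        supersets = [s for s in frequent if itemset < s]
        if not supersets:
            maximal.add(itemset)   # no frequent proper superset
        if all(frequent[s] != count for s in supersets):
            closed.add(itemset)    # no proper superset with equal support
    return closed, maximal


# Hypothetical support counts.
frequent = {
    frozenset("A"): 7,
    frozenset("B"): 9,
    frozenset("D"): 8,
    frozenset("AB"): 6,
    frozenset("AD"): 5,
    frozenset("BD"): 7,
    frozenset("ABD"): 5,
}

closed, maximal = closed_and_maximal(frequent)
print("closed: ", sorted("".join(sorted(s)) for s in closed))
print("maximal:", sorted("".join(sorted(s)) for s in maximal))
```

Here AD is not closed because ABD has the same count, and ABD is the only maximal itemset.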


Variations

Based on the kinds of patterns to be mined:

  • Frequent itemset mining (transactional and relational data)

  • Sequential pattern mining (sequence data sets, e.g. bioinformatics)

  • Structured pattern mining (structured data, e.g. graphs)


Variations

Based on the types of values handled:

  • Boolean association rules

  • Quantitative association rules

  • Fuzzy association rules


Delgado, Marín, Sánchez & Vila

“Fuzzy association rules: General model and applications”

IEEE Transactions on Fuzzy Systems, 2003



Variations

More options:

  • Generalized association rules (a.k.a. multilevel association rules)

  • Constraint-based association rule mining

  • Incremental algorithms

  • Top-k algorithms


ICDM FIMI: Workshop on Frequent Itemset Mining Implementations

http://fimi.cs.helsinki.fi/



Visualization

Integrated into data mining tools to help users understand data mining results:

  • Table-based approach, e.g. SAS Enterprise Miner, DBMiner…

  • 2D matrix-based approach, e.g. SGI MineSet, DBMiner…

  • Graph-based techniques, e.g. DBMiner ball graphs


Visualization: Tables


Visualization: Visual aids


Visualization: 2D Matrix


Visualization: Graphs


Visualization: VisAR

Based on parallel coordinates

(Techapichetvanich & Datta, ADMA’2005)


Extensions

Confidence is not the best possible interestingness measure for rules

e.g. A very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent

X went to war ⇒ X did not serve in Vietnam

(from the US Census)
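
A small worked example makes the problem concrete. The counts below are invented for illustration (they are not the Census figures), and lift is used here only as a point of comparison.

```python
# Hypothetical counts, not taken from the Census data.
n = 1000      # people in the data set
n_x = 200     # antecedent holds (went to war)
n_y = 900     # consequent holds (did not serve in Vietnam)
n_xy = 120    # both hold

confidence = n_xy / n_x           # P(Y|X)
prior = n_y / n                   # P(Y): how common the consequent is anyway
lift = confidence / prior         # below 1 means X makes Y less likely

print(f"confidence = {confidence:.2f}")
print(f"P(Y)       = {prior:.2f}")
print(f"lift       = {lift:.2f}")

# The rule passes a 50% confidence threshold even though knowing X
# lowers the probability of Y from 0.90 to 0.60.
```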


Extensions

Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991)

P1: ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)·supp(C)

P2: ACC(A⇒C) monotonically increases with supp(A⇒C)

P3: ACC(A⇒C) monotonically decreases with supp(A) (or supp(C))
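
One measure that satisfies the three properties is the difference supp(A⇒C) − supp(A)·supp(C), often called leverage or the Piatetsky-Shapiro measure. It is not named on the slide and is used here only as an assumed example; the sketch checks P1 to P3 on a few hypothetical values, using exact fractions to avoid rounding noise.

```python
from fractions import Fraction


def ps_measure(supp_ac, supp_a, supp_c):
    """supp(A=>C) - supp(A)*supp(C): zero under statistical independence."""
    return supp_ac - supp_a * supp_c


supp_a, supp_c = Fraction(3, 10), Fraction(2, 10)

# P1: the measure is 0 when supp(A=>C) = supp(A)*supp(C).
print(ps_measure(Fraction(6, 100), supp_a, supp_c))

# P2: it grows with supp(A=>C) when supp(A) and supp(C) stay fixed.
print(ps_measure(Fraction(10, 100), supp_a, supp_c)
      > ps_measure(Fraction(6, 100), supp_a, supp_c))

# P3: it shrinks when supp(A) grows and everything else stays fixed.
print(ps_measure(Fraction(10, 100), Fraction(4, 10), supp_c)
      < ps_measure(Fraction(10, 100), supp_a, supp_c))
```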


Extensions

Certainty factors…

  • … satisfy Piatetsky-Shapiro’s properties

  • … are widely used in expert systems

  • … are not symmetric (unlike interest/lift)

  • … can substitute conviction when CF>0
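
The slide does not give the formula, so the sketch below assumes the definition commonly used for the certainty factor of a rule A ⇒ C: the gain of conf(A⇒C) over supp(C), normalized by 1 − supp(C) when positive and by supp(C) when negative. Under that assumption, a positive CF equals 1 − 1/conviction, which is one way to read the last bullet. The numeric inputs are hypothetical.

```python
def certainty_factor(conf, supp_c):
    """Certainty factor of A => C (assumed definition, see lead-in)."""
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


def conviction(conf, supp_c):
    """Conviction of A => C; undefined (division by zero) when conf == 1."""
    return (1 - supp_c) / (1 - conf)


# Positive dependence: CF and conviction carry the same information.
cf = certainty_factor(0.8, 0.5)
print(cf, 1 - 1 / conviction(0.8, 0.5))

# Negative dependence: a confident-looking rule gets a negative CF.
print(certainty_factor(0.6, 0.9))
```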


Berzal, Blanco, Sánchez & Vila: “Measuring the accuracy and interest of association rules: A new framework”, Intelligent Data Analysis, 2002



Extensions

References:


Hilderman & Hamilton: “Evaluation of interestingness measures for ranking discovered knowledge”. PAKDD, 2001

Tan, Kumar & Srivastava: “Selecting the right objective measure for association analysis”. Information Systems, vol. 29, pp. 293-313, 2004.

Berzal, Cubero, Marín, Sánchez, Serrano & Vila: “Association rule evaluation for classification purposes”. TAMIDA’2005



Applications

Two sample applications where association rules have been successful:

  • Classification (ART)

  • Anomaly detection (ATBAR)


Berzal, Cubero, Sánchez & Serrano

“ART: A hybrid classification model”

Machine Learning Journal, 2004

Balderas, Berzal, Cubero, Eisman & Marín: “Discovering Hidden Association Rules”

KDD’2005, Chicago, Illinois, USA



Classification

Classification models based on association rules

  • Partial classification models

    e.g. Bayardo

  • “Associative” classification models

    e.g. CBA (Liu et al.)

  • Bayesian classifiers

    e.g. LB (Meretakis et al.)

  • Emerging patterns

    e.g. CAEP (Dong et al.)

  • Rule trees

    e.g. Wang et al.

  • Rules with exceptions

    e.g. Liu et al.


Classification

GOAL

Simple, intelligible, and robust classification models obtained in an efficient and scalable way

MEANS

Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]



ART Classification Model

IDEA

Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.

ART = Association Rule Tree

KEY

Association rules + “else” branches

Hybrid between decision trees and decision lists
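
To picture what "association rules + else branches" might look like as a data structure, here is a hypothetical Python sketch. It is not the ART implementation: the `RuleNode` class, the `classify` function and the toy attributes are assumptions, and the sketch only shows how a model of this shape would be evaluated, not how it is built.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class RuleNode:
    # Each rule pairs an antecedent (attribute -> required value) with an
    # outcome: either a class label or a nested RuleNode.
    rules: List[Tuple[Dict[str, str], Union[str, "RuleNode"]]]
    # Outcome used when no rule at this node matches (the "else" branch).
    otherwise: Union[str, "RuleNode"]


def classify(node, example):
    """Follow matching rules and else branches until a class label is reached."""
    while isinstance(node, RuleNode):
        for antecedent, outcome in node.rules:
            if all(example.get(attr) == value for attr, value in antecedent.items()):
                node = outcome
                break
        else:
            node = node.otherwise
    return node


# Hypothetical model: rules are tried in order, as in a decision list,
# and the else branch leads to a subtree, as in a decision tree.
model = RuleNode(
    rules=[
        ({"outlook": "overcast"}, "play"),
        ({"outlook": "sunny", "humidity": "high"}, "stay"),
    ],
    otherwise=RuleNode(rules=[({"windy": "no"}, "play")], otherwise="stay"),
)

print(classify(model, {"outlook": "rain", "windy": "no"}))
print(classify(model, {"outlook": "sunny", "humidity": "high", "windy": "no"}))
```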


ART Classification Model

Example: SPLICE data set (figure not reproduced)


ART classification model

Example: ART vs. TDIDT

[Figure: two panels labeled ART and TDIDT]


ART classification model

Final comments

Classification models

  • Acceptable accuracy

  • Reduced complexity

  • Attribute interactions

  • Robustness (noise & primary keys)

Classifier building method

  • Efficient algorithm

  • Good scalability properties

  • Automatic parameter selection


Anomaly detection

It is often more interesting to find surprising non-frequent events than frequent ones

EXAMPLES

  • Abnormal network activity patterns in intrusion detection systems.

  • Exceptions to “common” rules in medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…)


Anomaly detection

Anomalous association rule

Confident rule representing homogeneous deviations from common behavior.


Anomaly detection

Anomalous association rule

  • X usually implies Y (dominant rule):

    X ⇒ Y is frequent and confident

  • When X does not imply Y, then it usually implies A (the anomaly):

    X ¬Y ⇒ A is confident

  • X Y ⇒ ¬A is confident



Anomaly detection


X ⇒ Y is the dominant rule

X ⇒ A when ¬Y is the anomalous rule
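
The conditions behind an anomalous rule (the dominant rule X ⇒ Y is frequent and confident, X ¬Y ⇒ A is confident, and X Y ⇒ ¬A is confident) can be tested by brute force on a toy data set. This is only a sketch of the definition, not the ATBAR algorithm; the function name, thresholds and data are assumptions made for the example.

```python
def is_anomalous(transactions, x, y, a, min_support, min_confidence):
    """Check whether "X => A when not Y" holds as an anomalous rule."""
    x, y, a = frozenset(x), frozenset(y), frozenset(a)
    n = len(transactions)
    with_x = [t for t in transactions if x <= t]
    with_xy = [t for t in with_x if y <= t]
    with_x_not_y = [t for t in with_x if not y <= t]
    if not with_xy or not with_x_not_y:
        return False
    # Dominant rule X => Y: frequent and confident.
    dominant = (len(with_xy) / n >= min_support
                and len(with_xy) / len(with_x) >= min_confidence)
    # X not-Y => A: when Y fails, A usually appears instead.
    deviation = sum(a <= t for t in with_x_not_y) / len(with_x_not_y) >= min_confidence
    # X Y => not-A: A is rare when the dominant rule holds.
    exclusive = sum(not a <= t for t in with_xy) / len(with_xy) >= min_confidence
    return dominant and deviation and exclusive


# Hypothetical data: x usually comes with y; otherwise it comes with a.
transactions = [frozenset({"x", "y"})] * 8 + [frozenset({"x", "a"})] * 2

print(is_anomalous(transactions, {"x"}, {"y"}, {"a"}, 0.5, 0.75))
print(is_anomalous(transactions, {"x"}, {"y"}, {"z"}, 0.5, 0.75))
```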



Anomaly detection

Suzuki et al.’s “Exception Rules”


X ⇒ Y is an association rule

X ∧ I ⇒ ¬Y is the exception rule

I is the “interacting” itemset

X  I is the reference rule

  • Too many exceptions

  • The “cause” needs to be present


Anomaly detection: ATBAR

Anomalous association rules

[Figure: ATBAR itemset tree, with callouts for the first and second database scans. Node labels include A #7, B #9, C #7 and D #8, with B #6 and D #5 further down. The extended node A* lists A #7, AB #6, AC #4, AD #5, AE #3 and AF #3, and a callout marks the non-frequent entries.]


Anomaly detection: ATBAR

Anomalous association rules

[Figure: ATBAR itemset tree, with callouts for the first and second database scans. The items A #7, B #9, C #7 and D #8 carry extended nodes A*, B*, C* and D*; other node labels include B #6, C #6, D #5 and D #7.]

Anomaly detection: ATBAR

Anomalous association rules

Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR


Anomaly detection: Results

Experiments on health-related datasets from the UCI Machine Learning Repository

  • Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)

  • Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)


Anomaly detection: Results

An example from the Census dataset:

    if WORKCLASS: Local-gov
    then CAPGAIN: [99999.0 , 99999.0] (7 out of 7)    ← “Anomaly”
    when not CAPGAIN: [0.0 , 20051.0]                 ← Usual consequent



Anomaly detection: Results

  • Anomalous association rules (novel characterization of potentially interesting knowledge)

  • An efficient algorithm for discovering anomalous association rules: ATBAR

  • Some heuristics for filtering the discovered anomalous association rules
