Efficient Mining of Graph-Based Data

Jesus Gonzalez, Istvan Jonyer,

Larry Holder and Diane Cook

University of Texas at Arlington

Department of Computer Science and Engineering

http://cygnus.uta.edu/subdue

SRL Workshop

Motivation
• Structural/relational data
• Ease of graph representation

Graph-Based Discovery

[Figure: three panels. Input Database: objects T1-T4, B1-B4, C1, and R1. Substructure S1 (graph form): an object with shape triangle "on" an object with shape square. Compressed Database: each instance of the substructure replaced by S1, alongside C1 and R1.]

Algorithm
• Create substructure for each unique vertex label

[Figure: input graph of triangle, square, circle, and rectangle vertices joined by "on" edges]

Substructures: triangle (4), square (4), circle (1), rectangle (1)

Algorithm
• Expand best substructure by an edge or edge+neighboring vertex

Substructures: [Figure: the input graph and the expanded substructures, built from triangle, square, circle, and rectangle vertices joined by "on" edges]

Algorithm
• Keep only the best beam-width substructures on the queue
• Terminate when the queue is empty or the number of discovered substructures >= limit
• Compress the graph and repeat to generate a hierarchical description

Note: polynomially constrained
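
The three Algorithm slides can be sketched as a small beam search. This is a minimal illustration under stated assumptions, not the Subdue implementation: the toy graph is a hypothetical reconstruction matching the label counts above, `key_of` groups instances by their labels as a cheap stand-in for a real isomorphism test, and `value` is a simple size-saving proxy rather than the description-length measure. The "compress and repeat" step is omitted.

```python
from collections import defaultdict

# Hypothetical toy input: four "triangle on square" pairs, a circle, a rectangle.
VERTICES = {1: "triangle", 2: "square", 3: "triangle", 4: "square",
            5: "triangle", 6: "square", 7: "triangle", 8: "square",
            9: "circle", 10: "rectangle"}
EDGES = [(1, 2, "on"), (3, 4, "on"), (5, 6, "on"), (7, 8, "on"),
         (2, 9, "on"), (4, 10, "on")]

def key_of(instance):
    """Assumed stand-in for an isomorphism test: sorted vertex and edge labels."""
    verts, edges = instance
    v = tuple(sorted(VERTICES[x] for x in verts))
    e = tuple(sorted((VERTICES[EDGES[i][0]], EDGES[i][2], VERTICES[EDGES[i][1]])
                     for i in edges))
    return (v, e)

def value(instances):
    """Size saved when each non-overlapping instance collapses to one vertex."""
    used, count = set(), 0
    for verts, _ in instances:
        if not (verts & used):
            used |= verts
            count += 1
    verts, edges = instances[0]
    size = len(verts) + len(edges)
    return count * size - count - size

def expand(instances):
    """Extend every instance by one incident edge (plus its far vertex, if new)."""
    groups = defaultdict(set)
    for verts, edges in instances:
        for i, (u, v, _) in enumerate(EDGES):
            if i not in edges and (u in verts or v in verts):
                new = (verts | {u, v}, edges | {i})
                groups[key_of(new)].add(new)
    return {k: sorted(g, key=lambda inst: sorted(inst[0])) for k, g in groups.items()}

def discover(beam_width=4, limit=10):
    # One initial substructure per unique vertex label.
    queue = defaultdict(list)
    for v in VERTICES:
        inst = (frozenset({v}), frozenset())
        queue[key_of(inst)].append(inst)
    queue = dict(queue)
    seen, discovered = set(queue), []
    while queue and len(discovered) < limit:
        best_key = max(queue, key=lambda k: value(queue[k]))
        best = queue.pop(best_key)
        discovered.append((best_key, best))
        for k, insts in expand(best).items():
            if k not in seen:
                seen.add(k)
                queue[k] = insts
        # Keep only the beam_width best substructures on the queue.
        keep = sorted(queue, key=lambda k: value(queue[k]), reverse=True)[:beam_width]
        queue = {k: queue[k] for k in keep}
    return max(discovered, key=lambda kv: value(kv[1]))

best_key, best_instances = discover()
print(best_key, len(best_instances))
```

On this toy graph the search settles on the triangle-on-square pattern, since it is the only candidate with more than one instance beyond a single vertex.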

Evaluation Metric
• Substructures evaluated based on ability to compress input graph
• Compression measured using minimum description length (DL)
• Best substructure S in graph G minimizes: DL(S) + DL(G|S)
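
As a rough illustration of the DL(S) + DL(G|S) criterion, the sketch below scores two candidate substructures with a deliberately simplified encoding. The bit costs in `dl_bits` are an assumption made for this example (label bits per vertex, two vertex pointers plus a label per edge) and are not the encoding Subdue uses; the graph sizes are hypothetical.

```python
import math

def dl_bits(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Assumed encoding: a label per vertex, two endpoints and a label per edge."""
    vertex_bits = n_vertices * math.log2(max(n_vertex_labels, 1))
    pointer_bits = 2 * math.log2(max(n_vertices, 1))
    edge_bits = n_edges * (pointer_bits + math.log2(max(n_edge_labels, 1)))
    return vertex_bits + edge_bits

def total_dl(graph, sub, n_instances, n_vertex_labels, n_edge_labels):
    """DL(S) + DL(G|S), where each instance of S collapses to one new vertex."""
    gv, ge = graph
    sv, se = sub
    dl_s = dl_bits(sv, se, n_vertex_labels, n_edge_labels)
    compressed_v = gv - n_instances * (sv - 1)
    compressed_e = ge - n_instances * se
    # The compressed graph needs one extra vertex label to name S.
    dl_g_given_s = dl_bits(compressed_v, compressed_e, n_vertex_labels + 1, n_edge_labels)
    return dl_s + dl_g_given_s

graph = (10, 6)  # hypothetical: 10 vertices, 6 edges, 4 vertex labels, 1 edge label
baseline = dl_bits(10, 6, 4, 1)
pair = total_dl(graph, (2, 1), 4, 4, 1)     # two-vertex pattern, 4 instances
single = total_dl(graph, (1, 0), 4, 4, 1)   # single-vertex pattern, 4 instances
print(baseline, pair, single)
```

Under these assumptions the two-vertex pattern gives the smaller total, because replacing its instances removes both vertices and edges from the graph.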

Examples

Inexact Graph Match
• Some variations may occur between instances
• Want to abstract over minor differences
• Difference = cost of transforming one graph into a graph isomorphic to the other
• Match if cost/size < threshold
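
The cost/size test can be sketched with a brute-force search over vertex mappings. Several details here are assumptions made for illustration: every edit (relabeling, inserting, or deleting a vertex or edge) costs 1, `size` is the vertex-plus-edge count of the larger graph, and the exhaustive search is only practical for very small graphs.

```python
from itertools import permutations

def transform_cost(g1, g2):
    """Minimum number of unit-cost edits that turn g1 into a graph isomorphic to g2."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    n = max(len(labels1), len(labels2))
    a = list(labels1) + [None] * (n - len(labels1))  # pad with "missing" vertices
    b = list(labels2) + [None] * (n - len(labels2))
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the vertex of g2 that vertex i of g1 is mapped to.
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        mapped = {(perm[u], perm[v], lab) for u, v, lab in edges1}
        cost += len(mapped ^ set(edges2))  # edges to delete plus edges to insert
        if best is None or cost < best:
            best = cost
    return best

def size(g):
    labels, edges = g
    return len(labels) + len(edges)

def inexact_match(g1, g2, threshold):
    larger = max(size(g1), size(g2)) or 1  # guard against two empty graphs
    return transform_cost(g1, g2) / larger < threshold

tri_on_square = (["triangle", "square"], {(0, 1, "on")})
tri_on_rect = (["triangle", "rectangle"], {(0, 1, "on")})
print(transform_cost(tri_on_square, tri_on_rect))
print(inexact_match(tri_on_square, tri_on_rect, 0.5))
```

Here the two graphs differ by one vertex label, so they match at a threshold of 0.5 but not at 0.2.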

Parallel/Distributed Discovery
• Divide graph into P partitions using Metis, distribute to P processors
• Each processor performs serial Subdue on local partition
• Broadcast best substructures, evaluate on other processors
• Master processor stores best global substructures
• Close to linear speedup
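
The broadcast-and-evaluate protocol can be simulated in a single process. In this sketch the partitions are given directly (no Metis call), each "processor" is a loop iteration, and `local_best` and `local_value` are hypothetical stand-ins that count labeled edges instead of running serial Subdue.

```python
from collections import Counter

def local_best(partition, k=1):
    """Stand-in for serial Subdue: the k most frequent labeled edges in a partition."""
    return [pattern for pattern, _ in Counter(partition).most_common(k)]

def local_value(partition, pattern):
    """Stand-in for the value of a broadcast pattern on one partition."""
    return partition.count(pattern)

def distributed_discover(partitions, k=1):
    # Each processor mines its own partition and broadcasts its best patterns.
    candidates = set()
    for part in partitions:
        candidates.update(local_best(part, k))
    # Every processor evaluates every candidate; the master sums and ranks them.
    totals = {c: sum(local_value(p, c) for p in partitions) for c in candidates}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

p1 = [("triangle", "on", "square")] * 2 + [("square", "on", "circle")]
p2 = [("triangle", "on", "square")] * 2 + [("square", "on", "rectangle")] * 3
ranking = distributed_discover([p1, p2])
print(ranking)
```

The second partition prefers its own local pattern, yet the global ranking favors triangle-on-square once values from both partitions are combined, which is why candidates are re-evaluated on the other processors.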

Graph-Based Concept Learning
• One graph stores positive examples
• One graph stores negative examples
• Find substructure that compresses positive graph but not negative graph
• Error measure: (PosEgsNotCovered) + (NegEgsCovered)
• Multiple iterations implement a set-covering approach
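
A minimal sketch of the set-covering loop follows. It assumes examples and candidate substructures are plain feature sets, with "covers" meaning subset, which stands in for matching a substructure against an example graph; the feature names and candidate list are hypothetical.

```python
def covers(sub, example):
    """Assumed coverage test: every feature of the substructure is in the example."""
    return sub <= example

def error(sub, positives, negatives):
    pos_not_covered = sum(1 for e in positives if not covers(sub, e))
    neg_covered = sum(1 for e in negatives if covers(sub, e))
    return pos_not_covered + neg_covered

def learn(candidates, positives, negatives):
    """Pick the lowest-error substructure, drop the positives it covers, repeat."""
    remaining, rules = list(positives), []
    while remaining:
        best = min(candidates, key=lambda s: error(s, remaining, negatives))
        if not any(covers(best, e) for e in remaining):
            break  # no candidate makes progress on the remaining positives
        rules.append(best)
        remaining = [e for e in remaining if not covers(best, e)]
    return rules

positives = [frozenset({"tri-on-sq", "a"}), frozenset({"tri-on-sq", "b"}),
             frozenset({"circ-on-sq", "a"})]
negatives = [frozenset({"sq-on-tri", "a"}), frozenset({"b"})]
candidates = [frozenset({"tri-on-sq"}), frozenset({"circ-on-sq"}), frozenset({"a"})]
rules = learn(candidates, positives, negatives)
print(rules)
```

Each pass adds one substructure to the learned description, so the result reads as a disjunction that together covers the positive examples.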

Concept-Learning Example

[Figure: example graph in which object vertices carry shape edges to triangle and square and are linked by "on" edges]

Concept-Learning Results
• Chess endgames (19,257 examples)
• Black King is (+) or is not (-) in check
• 99.8% FOIL, 99.21% Subdue

More Concept-Learning Results
• Tic-Tac-Toe endgames
  • + is win for X (958 examples)
  • 100% Subdue, 92.35% FOIL
• Bach chorales
  • Musical sequences (20 sequences)
  • 100% Subdue, 85.71% FOIL

Graph-Based Clustering
• Iterate Subdue until the graph is compressed to a single vertex
• Each cluster (substructure) is inserted into a classification lattice

Clustering Example: Animals

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph form of the mammal row: an animal vertex with edges Name (mammal), BodyCover (hair), HeartChamber (four), BodyTemp (regulated), and Fertilization (internal)]

Graph-Based Clustering Results

Animals
  HeartChamber: four; BodyTemp: regulated; Fertilization: internal
    Name: mammal; BodyCover: hair
    Name: bird; BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile; BodyCover: cornified-skin; HeartChamber: imperfect-four; Fertilization: internal
    Fertilization: external
      Name: amphibian; BodyCover: moist-skin; HeartChamber: three
      Name: fish; BodyCover: scales; HeartChamber: two

Cobweb Results

animals
  amphibian/fish
    fish
    amphibian
  mammal/bird
    mammal
    bird
  reptile

• Comparison of Subdue and Cobweb results
  • Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
  • Subdue lattice identifies overlap between (reptile) and (amphibian/fish)

Graph-Based Clustering Results

[Figure: DNA chemical structure fragments built from C, N, O, and P atoms, including O=P-OH phosphate groups and CH2]

Coverage
• 61%
• 68%
• 71%

Evaluation of Clusterings
• Existing evaluation measures are not applicable to hierarchical domains
  • Does not make sense to compare clusters in different subtrees
• They are also not applicable to relational clusterings

Properties of Good Clusterings
• Small number of clusters
  • Large coverage → good generality
• Big cluster descriptions
  • More features → more inferential power
• Minimal or no overlap between clusters
  • More distinct clusters → better defined concepts

New Evaluation Heuristic for Hierarchical Clusterings
• Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}
• distance() measured by inexact graph match
• Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7

Graph-Based Data Mining: Application Domains
• Biochemical domains
  • Protein data
  • DNA data
  • Toxicology (cancer) data
• Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
• Telecommunications data
• Program source code
• Web topology

[Figure: web topology graph fragment with web_page and home vertices]

Theoretical Analysis
• Galois lattice [Liquiere et al.]
• Conceptual graphs [Sowa et al.]
• PAC analysis [Jappy et al.]

Graph-based Data Mining
• Pattern (substructure) discovery
• Hierarchical discovery
• Distributed discovery
• Concept learning
• Clustering
• Compression heuristic based on minimum description length

Future Work
• Concept learning
  • Theoretical analysis
  • Comparison to ILP systems
• Clustering
  • Classification lattice
  • Hierarchical relational conceptual clustering evaluation metric
• Probabilistic substructures
• Domains: WWW, source code

Subdue Source Code and Data

http://cygnus.uta.edu/subdue
