
TopK Interesting Subgraph Discovery in Information Networks

Manish Gupta, Jing Gao, Xifeng Yan, Hasan Cam, Jiawei Han

gmanish@microsoft.com

Real World Problems

  • Network Bottlenecks Discovery (Computer Networks)
    • Interestingness = Lowest Bandwidth
  • Team Selection (Organization Networks)
    • Interestingness = Highest Historical Compatibility
  • Suspicious Relationships Discovery (Social Networks)
    • Interestingness = Highest Negative Association Strength of Attribute Values
  • Resource Allocation (Battlefield Networks)
    • Interestingness = Lowest Distance between Entities

The Basic Underlying Problem

  • Team Selection: Interestingness = Highest Historical Compatibility
  • Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
  • Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
  • Resource Allocation: Interestingness = Lowest Distance

  • Given
    • Edge-weighted Typed Network G
    • Typed Subgraph Query Q
    • Edge Interestingness measure
  • Find
    • TopK matching subgraphs
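
The problem inputs can be sketched in Python as follows. This is only an illustration: the names (`NODE_TYPE`, `EDGE_WEIGHT`, `is_match`, `interestingness`) and the toy network are hypothetical, not taken from the slides, and scoring a match as the sum of its edge weights is an assumption that agrees with the arithmetic on later slides (for example 0.9 + 0.6 + 0.6 = 2.1).

```python
# Hypothetical toy network G: node id -> type, plus undirected weighted edges.
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A"}
EDGE_WEIGHT = {
    frozenset((1, 2)): 0.6,
    frozenset((2, 3)): 0.8,
    frozenset((3, 4)): 0.7,
}

# Hypothetical typed query Q: a path B - A - A over local query node ids.
QUERY_TYPE = {"q1": "B", "q2": "A", "q3": "A"}
QUERY_EDGES = [("q1", "q2"), ("q2", "q3")]


def is_match(mapping):
    """mapping: query node -> graph node. Checks distinctness, types and edges."""
    if len(set(mapping.values())) != len(mapping):
        return False
    if any(NODE_TYPE[g] != QUERY_TYPE[q] for q, g in mapping.items()):
        return False
    return all(frozenset((mapping[a], mapping[b])) in EDGE_WEIGHT
               for a, b in QUERY_EDGES)


def interestingness(mapping):
    """Assumed score of a match: the sum of the weights of its matched edges."""
    return sum(EDGE_WEIGHT[frozenset((mapping[a], mapping[b]))]
               for a, b in QUERY_EDGES)
```

With this layout, the task is to return the K mappings with the largest `interestingness` among all mappings for which `is_match` holds.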

Naïve Solution: Ranking After Matching

[Figure: Network G (typed nodes A, B and C with weighted edges) and Query Q. The Matching step lists every subgraph of G that matches Q; the Ranking step then orders those matches by score.]

Why compute all matches? We need only top-2!
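
A minimal sketch of the ranking-after-matching baseline, assuming a brute-force matcher and a sum-of-edge-weights score. The function name and the toy network are hypothetical; the point is only that every match is materialised before any ranking happens.

```python
from itertools import permutations

# Hypothetical toy network (not the Network G from the slide).
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B"}
EDGE_WEIGHT = {
    frozenset((1, 2)): 0.6,
    frozenset((2, 3)): 0.8,
    frozenset((3, 4)): 0.7,
    frozenset((4, 5)): 0.9,
    frozenset((2, 4)): 0.2,
}
QUERY_TYPE = {"q1": "B", "q2": "A", "q3": "A"}
QUERY_EDGES = [("q1", "q2"), ("q2", "q3")]


def rank_after_matching(node_type, edge_weight, query_type, query_edges, k):
    """Enumerate all matches, score each one, then sort and keep the best k."""
    q_nodes = list(query_type)
    matches = []
    for combo in permutations(node_type, len(q_nodes)):
        mapping = dict(zip(q_nodes, combo))
        if any(node_type[g] != query_type[q] for q, g in mapping.items()):
            continue  # a node type does not agree with the query
        try:
            score = sum(edge_weight[frozenset((mapping[a], mapping[b]))]
                        for a, b in query_edges)
        except KeyError:
            continue  # some query edge has no counterpart in the network
        matches.append((score, combo))
    matches.sort(reverse=True)
    return matches[:k], len(matches)


top, total = rank_after_matching(NODE_TYPE, EDGE_WEIGHT, QUERY_TYPE, QUERY_EDGES, k=2)
```

On this toy input four matches are computed and scored although only two are returned, which is the waste the slide points at.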

Our Contributions
  • New notion: TopK interesting subgraph detection in information networks
  • Three new low-cost indexes
    • Graph topology index
    • Sorted edge lists
    • Graph maximum metapath weight index
  • Novel top-K algorithm to answer interestingness queries on large graphs
  • Detailed effectiveness and efficiency validation on several synthetic and real datasets

Relationship with Previous Work
  • Subgraph matching
    • Approximate: fuzzy node/edge similarity
    • Exact: Matching without ranking
    • RDF graphs, probabilistic graphs, temporal graphs
  • TopK querying on graphs
    • H-hop aggregate queries
    • Keyword queries on RDF graphs
    • K most frequent patterns
    • Twig queries

System Overview

  • Offline Index Construction (inputs: Network G, Distance D)
    • Breadth First Traversal from each Node up to Distance D
    • Sort Edges
    • Indexes built: Graph Topology Index, Sorted Edge Lists, Graph Maximum MetaPath Weight Index
  • Online Query Processing (input: Query Q)
    • Find Candidate Nodes, producing Candidate Nodes
    • Top-K Computation, producing Top-K Subgraphs


Index Structures

Notation: G=(V,E), B = average number of neighbors, T = number of types

[Figure: example Network G with typed nodes (A, B, C) and weighted edges.]
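
As a rough illustration of the sorted edge lists, the sketch below groups edges by the pair of endpoint types and sorts each group by decreasing weight. The grouping key, the function name and the toy network are assumptions made for this example; the slides only name the index.

```python
from collections import defaultdict

# Hypothetical toy network (not the Network G from the slide).
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B"}
EDGE_WEIGHT = {
    frozenset((1, 2)): 0.6,
    frozenset((2, 3)): 0.8,
    frozenset((3, 4)): 0.7,
    frozenset((4, 5)): 0.9,
    frozenset((2, 4)): 0.2,
}


def build_sorted_edge_lists(node_type, edge_weight):
    """Map each unordered type pair to its edges, sorted by descending weight."""
    lists = defaultdict(list)
    for edge, weight in edge_weight.items():
        u, v = sorted(edge)
        key = tuple(sorted((node_type[u], node_type[v])))
        lists[key].append((weight, u, v))
    for key in lists:
        lists[key].sort(reverse=True)  # most interesting edge first
    return dict(lists)


EDGE_LISTS = build_sorted_edge_lists(NODE_TYPE, EDGE_WEIGHT)
```

Reading such a list from the front yields the heaviest edges of a given type pair first, which is what lets a top-K search start from the most promising edges.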

Find Candidate Nodes

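[Figure: the Query Topology computed from Query Q is compared against the Graph Topology Index to find candidate nodes. The example Query Q has four nodes, numbered 1 to 4, with types A and B.]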
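
One possible form of this filter is sketched below. The slides do not spell out what the topology index stores, so the sketch assumes it records, for every node, how many nodes of each type lie within distance d for d = 1..D. A graph node is kept as a candidate for a query node when its type agrees and its counts are at least the query node's counts. Cumulative counts are used because a match can only shorten distances, never lengthen them. All names and the toy data are hypothetical.

```python
from collections import defaultdict, deque


def topology(node_type, edges, src, max_dist):
    """counts[d][t] = number of nodes of type t within BFS distance d of src."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond distance D
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = {d: defaultdict(int) for d in range(1, max_dist + 1)}
    for node, d_node in dist.items():
        if node == src:
            continue
        for d in range(d_node, max_dist + 1):
            counts[d][node_type[node]] += 1
    return counts


def dominates(g_topo, q_topo):
    """True if the graph node has at least the query node's counts everywhere."""
    return all(g_topo[d].get(t, 0) >= c
               for d, by_type in q_topo.items() for t, c in by_type.items())


def find_candidates(g_type, g_edges, q_type, q_edges, max_dist):
    g_topo = {g: topology(g_type, g_edges, g, max_dist) for g in g_type}
    result = {}
    for q in q_type:
        q_topo = topology(q_type, q_edges, q, max_dist)
        result[q] = [g for g in sorted(g_type)
                     if g_type[g] == q_type[q] and dominates(g_topo[g], q_topo)]
    return result


# Hypothetical toy data: node 6 is an isolated node of type B.
G_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}
G_EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 4)]
Q_TYPE = {"q1": "B", "q2": "A", "q3": "A"}
Q_EDGES = [("q1", "q2"), ("q2", "q3")]
CANDIDATES = find_candidates(G_TYPE, G_EDGES, Q_TYPE, Q_EDGES, max_dist=2)
```

In this toy run the isolated node 6 is rejected for `q1`, and node 3 is rejected for `q2` because it has no neighbor of type B.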

Finding and Scoring Matches: Key Idea

[Flowchart: Top-K Computation for Query Q, maintaining a Top-K Heap. Boxes: Start; Generate a Size-1 Candidate; Compute Actual and UB Score; Grow Candidates; Update Heap; Compute Max UB Score; Done! Decision points: TopK Quit?; Candidate Size==|Q|?; More valid edges?]

Finding and Scoring Matches: Generating Size-1 Candidates

[Figure: Size-1 Candidates for Query Q and their Candidate Growth.]

  • Cases to handle when generating size-1 candidates
    • Query Edge with both endpoints of same type
    • Multiple query edges of the same type
  • Order: (5,9), (3,4), (4,5), (2,3), (2,7)
  • Candidate Growth decisions: Prune? Grow? Heapify? Discard?

Finding and Scoring Matches: Actual Score and Upper Bound Score

[Figure: Candidate Growth example, with the Useful Edge Lists used for the upper bound.]

Actual Score = 0.9

UB Score = 0.9 + UB(NonConsidered Edges) = 0.9 + (0.6 + 0.6) = 2.1

  • Partially grown candidate
    • Prune if UBScore < min(heap)
    • Grow otherwise
  • Fully grown candidate
    • Discard if UBScore < min(heap)
    • Update heap otherwise
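
The four decisions can be written down as a small function. This is a sketch under assumptions: the top-K heap is modelled as a Python min-heap of scores, nothing is pruned until the heap holds K entries, and the bound for the non-considered edges is taken to be the best weights still available in the relevant edge lists. The function names and the heap contents in the usage lines are hypothetical; only the 0.9 and 0.6 + 0.6 figures come from the slide.

```python
import heapq


def upper_bound(actual_score, best_weights_for_remaining_edges):
    """UB Score = Actual Score + UB(non-considered edges)."""
    return actual_score + sum(best_weights_for_remaining_edges)


def decide(ub_score, fully_grown, heap, k):
    """Return the action for a candidate; heap is a min-heap of the best scores so far."""
    beaten = len(heap) >= k and ub_score < heap[0]  # heap[0] is min(heap)
    if fully_grown:
        return "discard" if beaten else "update heap"
    return "prune" if beaten else "grow"


ub = upper_bound(0.9, [0.6, 0.6])  # the slide's example, about 2.1

heap = [2.2, 2.3]                  # hypothetical current top-2 scores
heapq.heapify(heap)
action_when_heap_is_strong = decide(ub, fully_grown=False, heap=heap, k=2)

weak_heap = [1.5, 2.0]             # hypothetical weaker top-2 scores
heapq.heapify(weak_heap)
action_when_heap_is_weak = decide(ub, fully_grown=False, heap=weak_heap, k=2)
```

For a fully grown candidate there are no non-considered edges left, so its upper bound is simply its actual score.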

Finding and Scoring Matches: Global Top-K Quit

[Figure: Network G and Query Q, as in the earlier example.]

  • K=2
  • TopK Heap
    • (4,3,2,7): 2.2
    • (3,4,5,6): 2.2
  • Stop: 0.7 + 0.6 + 0.7 = 2.0 < 2.2
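
The stopping test on this slide can be sketched as below. The interpretation is an assumption: the three weights 0.7, 0.6 and 0.7 are read as the best edge weights still unprocessed for the three query edges, so their sum bounds the score of any match not yet seen. The function name `can_quit` is hypothetical.

```python
def can_quit(heap_scores, k, best_remaining_weights):
    """Global quit: no unseen match can beat the current k-th best score."""
    if len(heap_scores) < k:
        return False  # the heap is not full yet, so keep searching
    max_ub = sum(best_remaining_weights)
    return max_ub < min(heap_scores)


# Values from the slide: K=2, both heap entries score 2.2.
stop = can_quit([2.2, 2.2], k=2, best_remaining_weights=[0.7, 0.6, 0.7])
```

Because the check uses only the heads of the remaining edge lists, it is cheap enough to run each time a new size-1 candidate is about to be generated.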

Faster Query Processing using Graph Maximum MetaPath Weight Index

[Figure: example queries, each shown with a partial instantiation (Partial Candidate) in which query edge 1-2 is already matched. Labels: Paths to cover Non-Considered Edges; Edges to Consider Separately.]

  • Bounding each non-considered edge on its own
    • UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)
  • Covering non-considered edges with paths, using MMW Index!
    • UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)
  • Slight complication: some edges remain to be considered separately
    • UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)
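
A hedged sketch of how a maximum metapath weight index could be built. The slides do not define the index precisely, so this assumes it stores, for each start node and each sequence of node types of length up to D, the largest total weight of any simple path from that node following that type sequence. Weights are assumed non-negative. The function name and toy network are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy network (not the Network G from the slides).
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "B"}
EDGE_WEIGHT = {
    frozenset((1, 2)): 0.6,
    frozenset((2, 3)): 0.8,
    frozenset((3, 4)): 0.7,
    frozenset((4, 5)): 0.9,
    frozenset((2, 4)): 0.2,
}


def build_mmw_index(node_type, edge_weight, max_len):
    """index[(start, metapath)] = max weight of a simple path with that type sequence."""
    adj = defaultdict(dict)
    for edge, w in edge_weight.items():
        u, v = tuple(edge)
        adj[u][v] = w
        adj[v][u] = w
    index = {}

    def extend(start, node, visited, metapath, total):
        for nxt, w in adj[node].items():
            if nxt in visited:
                continue  # keep paths simple: matches use distinct nodes
            new_path = metapath + (node_type[nxt],)
            key = (start, new_path)
            index[key] = max(index.get(key, 0.0), total + w)
            if len(new_path) < max_len:
                extend(start, nxt, visited | {nxt}, new_path, total + w)

    for start in node_type:
        extend(start, start, {start}, (), 0.0)
    return index


MMW = build_mmw_index(NODE_TYPE, EDGE_WEIGHT, max_len=2)
```

A lookup such as `MMW[(5, ("A", "A"))]` then bounds a whole two-edge query path at once, which is never looser than adding the best single-edge bounds, since the two best edges need not form a path.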

Faster Query Processing using Graph Maximum MetaPath Weight Index

[Figure: a partially grown candidate on nodes 5 and 9. Prune? Grow?]

  • K=2
  • TopK Heap
    • (8,9,5,6): 2.1
    • (5,9,8,7): 2.0
  • Edge-based UBScore = 0.9 + 0.8 + 0.7 = 2.4 > 2.0: Grow
  • Path-based UBScore (MMW Index) = 0.9 + UB(5-A-B) = 0.9 + 0.9 = 1.8 < 2.0: Prune
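
The comparison on this slide can be replayed in a few lines. The numbers are the slide's; the dictionary layout of the index entry and the function names are hypothetical, chosen only to make the two bounds explicit.

```python
# Hypothetical index entry: (start node, metapath of node types) -> max path weight.
MMW_INDEX = {(5, ("A", "B")): 0.9}
HEAP_MIN = 2.0  # smallest score in the TopK Heap on the slide (K=2)


def edge_based_ub(actual_score, per_edge_bounds):
    """Bound every non-considered edge separately and add the bounds up."""
    return actual_score + sum(per_edge_bounds)


def path_based_ub(actual_score, start_node, metapath, mmw_index):
    """Bound the whole remaining path with a single index lookup."""
    return actual_score + mmw_index[(start_node, metapath)]


def action(ub_score, heap_min):
    return "prune" if ub_score < heap_min else "grow"


edge_ub = edge_based_ub(0.9, [0.8, 0.7])                 # about 2.4
path_ub = path_based_ub(0.9, 5, ("A", "B"), MMW_INDEX)   # 1.8
edge_action = action(edge_ub, HEAP_MIN)
path_action = action(path_ub, HEAP_MIN)
```

The same candidate is grown under the edge-based bound but pruned under the tighter path-based bound, which is where the speed-up comes from.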

Discussions
  • Queries with multiple edge semantics
  • Directed graphs
  • Homogeneous networks
  • Weighted query edges
    • Weights signify expected amount of interestingness
    • Weights signify importance of query edge
  • Faster computations versus index size

Low-cost Index Structures

Faster Query Execution

[Figures: Query Execution Time (msec) for Clique Queries, Path Queries and Subgraph Queries (Graph G2 and indexes with D=2)]

  • RAM: Ranking After Matching baseline
  • RWM0: without using the candidate node filtering
  • RWM1: without using the MMW index
  • RWM2: same as RWM1 without pruning any partially grown candidates
  • RWM3: same as RWM1 without the global top-K quit check
  • RWM4: same as RWM1 with the MMW index

Good Scalability

Good Scalability thanks to Effective Pruning

  • [Figure: Running time (msec) for different Query Sizes and Graph Sizes (D=2)]
  • [Figure: Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes]
  • [Figure: Query Execution Time for Different Values of K]

Real Dataset Case Studies

  • Queries on DBLP
    • Q1: 1: Author, 2: Conf, 3: Author
    • Q2: 1: Author, 2: Conf, 3: Author, 4: Keyword
  • Queries on Wikipedia
    • Q3: 1: Person, 2: Film, 3: Person
    • Q4: 1: Person, 2: Company, 3: Person, 4: Settlement

Real Dataset Case Studies
  • DBLP
    • 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
      • Rohit Gupta -- computer networking
      • Vipin Kumar -- Data and Information Systems
      • BICoB -- International Conference on Bioinformatics and Computational Biology
    • 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
      • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
      • "mining" -- Data and Information Systems
      • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking

Real Dataset Case Studies
  • Wikipedia
    • 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
      • Stacy Keach and John Huston starred in the movie “The Biggest Battle”
      • Stacy Keach (American), John Huston (American), movie is Italian
      • Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war)
    • 1: Medha Patkar, 2: BBC, 3: Felix D’Alviella, 4: Mogilino
      • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
      • Felix D’Alviella -- Belgian actor in the BBC soap opera Doctors
      • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria’s Abandoned Children" in 2007
      • A British company rewarding an Indian woman, covering a place in Bulgaria, or being linked to a person from Belgium is rare

Related Work (1)

  • Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]
  • Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]
  • Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]

Related Work (2)
  • Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]
  • Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]
  • Top-K queries
    • h-hop aggregate queries [Yan et al., 2010]
    • K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
    • Top-K keyword queries on RDF graphs [Tran et al., 2009]
    • Top-K similarity queries [Zou et al., 2007]
    • Twig queries [Gou and Chirkova, 2008]

Conclusion
  • Given
    • Typed unweighted query
    • A heterogeneous edge-weighted information network
    • Edge interestingness measure
  • Find
    • Top-K interesting subgraphs
  • Investigated ranking after matching baseline
  • Proposed three new graph indexes and exploited them for building a top-K solution
  • Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets

Thanks!
