SimRank : A Measure of Structural-Context Similarity. Glen Jeh and Jennifer Widom KDD 2002. CS 519 Class Presentation Presenter: Anh Pham. Outline of the talk. Introduction to Structural Context Similarity SimRank Computing SimRank Naïve method Pruning Example


SimRank: A Measure of Structural-Context Similarity

Glen Jeh and Jennifer Widom

KDD 2002

CS 519 Class Presentation

Presenter: Anh Pham

Outline of the talk
  • Introduction to Structural Context Similarity
  • SimRank
  • Computing SimRank
    • Naïve method
    • Pruning
  • Example
  • Limited information problem
  • Random surfer pair model
  • Experimental results
  • Strong and weak points
  • Quiz
The problem of finding similar objects
  • Many applications need to find similar objects
  • Find similar documents
  • Collaborative filtering:
    • Find similar users
    • Find similar items
Aspects of objects that determine similarity
  • Many aspects can make objects similar
    • Documents: common words, sentences…
    • Users: common preferences
Structure similarity
  • This paper proposes a general approach that can be applied whenever the data can be represented as a graph
    • Web pages
    • User preferences
    • Scientific networks
Example of structure similarity
  • Intuition: similar objects are related to similar objects
  • Example:

Prof. A has student A, and Prof. B has student B.

Prof. A and Prof. B are similar, since they are from the same university.

Recursively, student A and student B are similar.

If we know the similarity of Prof. A and Prof. B, we can estimate the similarity between student A and student B.

Some basic notation for graph models
  • Graph G=(V,E), where V is the set of nodes and E is the set of (directed) edges.
  • For nodes p and q, <p,q> denotes the edge from p to q.
  • I(v) denotes the set of in-neighbors of v
  • O(v) denotes the set of out-neighbors of v

Example (figure with nodes A, B, C, D, E): I(C)={A,B} and O(A)={C,D}
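The notation above can be illustrated with a minimal Python sketch. The edge list is an assumption: it contains only the three edges implied by I(C)={A,B} and O(A)={C,D}, since the remaining edges of the figure (including those of node E) are not recoverable from the transcript. The helper name `neighbor_maps` is hypothetical.

```python
from collections import defaultdict

# Assumed edge list: only the edges implied by I(C)={A,B} and O(A)={C,D}.
edges = [("A", "C"), ("A", "D"), ("B", "C")]

def neighbor_maps(edges):
    """Return (I, O): dicts mapping each node to its in- and out-neighbor sets."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for p, q in edges:          # <p,q> is an edge from p to q
        out_nbrs[p].add(q)      # q is an out-neighbor of p
        in_nbrs[q].add(p)       # p is an in-neighbor of q
    return in_nbrs, out_nbrs

I, O = neighbor_maps(edges)
print(sorted(I["C"]), sorted(O["A"]))
```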

Node-pair graph
  • Create a node-pair graph G² from G: each node of G² is an ordered pair (p,q) of nodes of G
  • <(p,q),(a,b)> is an edge of G² if <p,a> and <q,b> are edges of G
  • Example: (figure showing G and its node-pair graph G²)
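As a rough illustration of the construction rule, the sketch below builds the edge set of G² from an edge list of G. The toy graph (Univ pointing to ProfA and ProfB) and the function name `node_pair_graph` are assumptions made for this example, not taken from the slides.

```python
from itertools import product

def node_pair_graph(edges):
    """Edges of G^2: <(p,q),(a,b)> exists iff <p,a> and <q,b> are edges of G."""
    return {((p, q), (a, b)) for (p, a), (q, b) in product(edges, repeat=2)}

# Hypothetical toy graph G: Univ -> ProfA, Univ -> ProfB.
g = [("Univ", "ProfA"), ("Univ", "ProfB")]
g2 = node_pair_graph(g)

for src, dst in sorted(g2):
    print(src, "->", dst)
```

With two edges in G, the construction yields four edges in G², all leaving the pair (Univ, Univ).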
SimRank motivation
  • Intuition: similar objects are related to similar objects

Univ = Univ → Sim(Univ, Univ) = 1

Prof. A is related to Univ

Prof. B is related to Univ

→ Sim(Prof. A, Prof. B) = .414 < 1

Student A is related to Prof. A

Student B is related to Prof. B

→ Sim(SA, SB) = .331 < 1

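SimRank equation
  • Similarity between a and b (for a ≠ b), where I_i(a) is the i-th in-neighbor of a:

$$s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$$

  • Example (figure with nodes A, B, D, E, F, where I(F)={A,B} and I(D)={A}):
    • Assume C=1

$$s(F,D) = \frac{1}{2 \cdot 1} \left[ s(A,A) + s(B,A) \right] = \frac{1}{2}(1 + 0.5) = 0.75$$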
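The worked example can be checked with a small Python sketch that applies the equation once to a single pair. The function name `simrank_step` and the dictionary layout are assumptions; the inputs s(A,A)=1, s(B,A)=0.5 and C=1 are the values used on the slide.

```python
def simrank_step(a, b, in_nbrs, s, C):
    """Apply the SimRank equation once to the pair (a, b)."""
    if a == b:
        return 1.0
    Ia, Ib = in_nbrs.get(a, []), in_nbrs.get(b, [])
    if not Ia or not Ib:        # no in-neighbors: similarity is 0
        return 0.0
    total = sum(s[(x, y)] for x in Ia for y in Ib)
    return C * total / (len(Ia) * len(Ib))

in_nbrs = {"F": ["A", "B"], "D": ["A"]}          # I(F)={A,B}, I(D)={A}
s = {("A", "A"): 1.0, ("B", "A"): 0.5}           # scores given on the slide
result = simrank_step("F", "D", in_nbrs, s, C=1.0)
print(result)  # 0.75
```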

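SimRank equation (1)
  • Similarity between a and b: $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$
  • s(a,b) is symmetric: s(a,b)=s(b,a)
  • s(a,a)=1
  • s(a,x)=0 if x has no in-neighbors (and x ≠ a)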
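SimRank equation (2)
  • Similarity between a and b: $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$
  • s(a,b) is normalized into the range [0,1]
  • Proof: by induction
    • C<1
    • s(I_i(a),I_j(b)) ≤ 1, so the average over all pairs of in-neighbors is at most 1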

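SimRank equation (2)
  • Similarity between a and b: $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$
  • The factor C should be < 1
  • C represents the confidence level with which similarity is propagated from the parent nodes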

Bipartite SimRank
  • Consider a recommendation system:
  • How can we recommend an item to a new buyer?
  • A and B are similar since they both buy frosting and eggs → recommend flour to A
Bipartite SimRank (mutually-reinforcing rule)
  • Rule 1: People are similar if they purchase similar items
  • Rule 2: Items are similar if they are purchased by similar people
  • Rule 1 reinforces Rule 2, and vice versa
  • Example:
    • 1. If frosting and eggs are similar, then A and B are also similar.
    • 2. If A and B are similar, then frosting and eggs are similar.
  • Observation: we can see that sugar and flour are similar, even though they have no common customer.
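Bipartite SimRank (formula)

Rule 1: People are similar if they purchase similar items. In math form, for people A ≠ B (edges point from people to the items they purchase):

$$s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B))$$

Rule 2: Items are similar if they are purchased by similar people. In math form, for items c ≠ d:

$$s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d))$$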
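The two rules can be sketched as a pair of alternating updates. This is a minimal illustration, not the authors' implementation: the shopping baskets (A buys sugar, frosting, eggs; B buys frosting, eggs, flour) are assumed from the example in the talk, and the constants C1 = C2 = 0.8, the fixed iteration count, and the name `bipartite_simrank` are all assumptions.

```python
from itertools import combinations

def bipartite_simrank(buys, C1=0.8, C2=0.8, iterations=5):
    """Iterate Rule 1 (people) and Rule 2 (items) on a people -> items graph."""
    people = sorted(buys)
    items = sorted({i for basket in buys.values() for i in basket})
    bought_by = {i: [p for p in people if i in buys[p]] for i in items}

    def identity(nodes):
        # Initial scores: 1 for identical nodes, 0 otherwise.
        return {(x, y): float(x == y) for x in nodes for y in nodes}

    sp, si = identity(people), identity(items)   # people scores, item scores
    for _ in range(iterations):
        new_sp, new_si = identity(people), identity(items)
        for a, b in combinations(people, 2):     # Rule 1: average item similarity
            total = sum(si[(x, y)] for x in buys[a] for y in buys[b])
            new_sp[(a, b)] = new_sp[(b, a)] = C1 * total / (len(buys[a]) * len(buys[b]))
        for c, d in combinations(items, 2):      # Rule 2: average buyer similarity
            total = sum(sp[(x, y)] for x in bought_by[c] for y in bought_by[d])
            new_si[(c, d)] = new_si[(d, c)] = C2 * total / (len(bought_by[c]) * len(bought_by[d]))
        sp, si = new_sp, new_si
    return sp, si

# Assumed baskets; every person is assumed to buy at least one item.
buys = {"A": ["sugar", "frosting", "eggs"], "B": ["frosting", "eggs", "flour"]}
sp, si = bipartite_simrank(buys)
print(sp[("A", "B")], si[("sugar", "flour")])
```

In this sketch sugar and flour end up with a non-zero score although no customer buys both, which matches the observation made in the talk.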

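Bipartite SimRank (homogeneous domain extension)
  • Previously (in-links only): $s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))$
  • Why not use out-links as well? → the extension: compute one score from the out-neighbors O(·) and one from the in-neighbors I(·)
  • Depending on the application, use either score or both (recall the HITS algorithm)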
Minimax extension
  • Example: Given CS students A and B.
    • Both A and B take the required CS courses
    • For elective courses, A takes sociology
    • For elective courses, B takes English
  • Previously: every course of A is compared with every course of B
  • How can we compare only A’s CS courses with B’s CS courses, and A’s elective courses with B’s elective courses?
  • It is meaningless to compare A’s CS courses with B’s elective courses!
Minimax extension (Cont.)
  • Example: Given CS students A and B.
    • Both A and B take the required CS courses. For elective courses, A takes sociology and B takes English.

Only compare each course of A with the most similar course of B
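One way to express "compare each course only with its most similar counterpart" is sketched below. The aggregation is an assumption: the transcript does not preserve the formula, so this sketch takes the best match (max) for each course and then the weakest of those matches (min), in both directions. The course names, the similarity table, and the names `course_sim` and `minimax_similarity` are all hypothetical.

```python
# Hypothetical course-to-course similarities (symmetric lookup).
table = {
    frozenset(("sociology", "english")): 0.5,
    frozenset(("cs101", "sociology")): 0.1,
    frozenset(("cs101", "english")): 0.1,
}

def course_sim(x, y):
    return 1.0 if x == y else table.get(frozenset((x, y)), 0.0)

def minimax_similarity(courses_a, courses_b, sim, C1=0.8):
    """Assumed minimax rule: best match per course, then the weakest best match."""
    def directed(xs, ys):
        return min(max(sim(x, y) for y in ys) for x in xs)
    return C1 * min(directed(courses_a, courses_b), directed(courses_b, courses_a))

score = minimax_similarity(["cs101", "sociology"], ["cs101", "english"], course_sim)
print(score)
```

Here cs101 is matched with cs101 and sociology with English, so the low CS-versus-elective similarities never enter the score.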

Naïve method to compute SimRank
  • The naïve method is iterative
  • R_k(a,b) stores the similarity of a and b at iteration k:
    • Initialize R_0(a,b)=1 if a=b, and R_0(a,b)=0 otherwise
    • Update R_{k+1} from R_k: apply the SimRank equation with R_k on the right-hand side, keeping R_{k+1}(a,a)=1
    • Repeat until convergence
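The iteration described above might be sketched as follows. This is a minimal illustration under stated assumptions: a fixed number of iterations stands in for a convergence test, C = 0.8 is assumed, and the toy graph is a simplified tree (Univ → professors → students), not the graph from the slides, so its scores differ from the .414 and .331 shown earlier.

```python
def simrank(edges, C=0.8, iterations=5):
    """Naive iterative SimRank over all node pairs of a directed graph."""
    nodes = sorted({v for e in edges for v in e})
    in_nbrs = {v: [p for p, q in edges if q == v] for v in nodes}
    # R_0(a,b) = 1 if a == b, else 0
    R = {(a, b): float(a == b) for a in nodes for b in nodes}
    for _ in range(iterations):
        R_next = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    R_next[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    # Average similarity over all pairs of in-neighbors, scaled by C.
                    total = sum(R[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                    R_next[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    R_next[(a, b)] = 0.0
        R = R_next
    return R

# Hypothetical toy graph.
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
R = simrank(edges)
print(R[("ProfA", "ProfB")], R[("StudentA", "StudentB")])
```

On this tree the professors score C and the students C², because similarity decays by a factor of C at each level away from the shared university.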
Time analysis of the naïve method
  • 1. Assume there are n nodes in G → O(n²) space is required to store all pairs.
  • 2. Assume d is the average of |I(a)||I(b)| → each iteration takes O(d) time per pair.
  • 3. Assume K is the number of iterations.
  • 1, 2, 3 → the time complexity is O(dn²K)
  • Empirical note: K≈5 in practice
Pruning to reduce time complexity
  • Previously, we assumed the node-pair graph has n² nodes → we considered all pairs.
  • In practice, a node v that is far from a given node a is assumed to have s(v,a)=0 → it is more efficient to consider only the nodes within radius r of a

Figure: when v is far from a, s_{k+1}(a,v) = 0; when v is within radius r of a, s_{k+1}(a,v) = … s_k

Time analysis of pruning
  • Previously, all n² pairs → O(dn²K)
  • Now, only pairs within radius r → O(dnrK)
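A rough sketch of how the candidate pairs could be restricted to an r-radius neighborhood is given below. It assumes that distance is measured in hops while ignoring edge direction, which is one plausible reading of "r-radius neighbor"; the function names and the chain graph are hypothetical.

```python
from collections import deque

def within_radius(edges, source, r):
    """Nodes at most r hops from source, ignoring edge direction (BFS)."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == r:        # do not expand beyond the radius
            continue
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def candidate_pairs(edges, r):
    """Only these pairs get a score; all other pairs are treated as 0."""
    nodes = {v for e in edges for v in e}
    return {(a, v) for a in nodes for v in within_radius(edges, a, r)}

# Hypothetical chain a -> b -> c -> d -> e.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
pairs = candidate_pairs(chain, r=2)
print(len(pairs), "of", 5 * 5, "pairs kept")
```

On the five-node chain with r=2, 19 of the 25 ordered pairs are kept; the saving grows with graph size when neighborhoods stay small.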

How SimRank solves the “limited information problem”
  • Limited information problem (LIP):
    • Which papers are similar to paper A?
    • There is little information (only B cites A)
    • Among A1, A2, …, Am, which one is most similar to A?
  • The co-citation algorithm cannot solve LIP:
    • A1, A2, …, Am each share one common in-link with A → they are all equally similar to A
  • SimRank can solve LIP:
    • A is cited by B’, and B’ is similar to B → Am is more similar to A than the other Ai

(Figure: limited information problem)

Random Surfer Pair model
  • The random surfer pair model provides an intuitive interpretation of SimRank
  • Example: SimRank(m,d) can be explained in terms of random walks:

Case 1 (figure with nodes m and d): high probability that m and d meet in one step

Random Surfer Pair model (Cont.)

Case 1 (figure with nodes a, m, d): high probability that m and d meet → SimRank(m,d) is high

Case 2 (figure with nodes a, y, m, d): high probability that m and d meet, but only after more steps → SimRank(m,d) is lower

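Random Surfer Pair model (Cont.)

How to compute m(m,d) (figure with nodes a, y, m, d; the walk is shown as Step 1 and Step 2)

SimRank(m,d) = expected meeting distance of (m,d) = m(m,d)

where $m(m,d) = \sum_{t} P[t]\, C^{l(t)}$, t ranges over the walks on which the two surfers, starting at m and d and following edges backwards, first meet, P[t] is the probability of walk t, and l(t) is its length in steps.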
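The random-walk interpretation can be approximated with a small Monte Carlo sketch. It assumes that both surfers step backwards to a uniformly chosen in-neighbor at each step and that a meeting after l steps contributes C^l; the function name, the truncation at `max_steps`, and the toy graph are all assumptions.

```python
import random

def surfer_pair_estimate(in_nbrs, a, b, C=0.8, walks=1000, max_steps=10, seed=0):
    """Estimate s(a,b) as the average of C**l over sampled backward walks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        x, y = a, b
        for step in range(max_steps + 1):
            if x == y:                      # surfers met after `step` steps
                total += C ** step
                break
            if not in_nbrs.get(x) or not in_nbrs.get(y):
                break                       # a surfer is stuck: they never meet
            x = rng.choice(in_nbrs[x])      # each surfer follows a random in-edge backwards
            y = rng.choice(in_nbrs[y])
    return total / walks

# Hypothetical Case 1 graph: a -> m and a -> d, so the surfers always meet in one step.
case1 = {"m": ["a"], "d": ["a"]}
estimate = surfer_pair_estimate(case1, "m", "d")
print(estimate)
```

For this deterministic toy graph every sampled walk meets after exactly one step, so the estimate equals C; on graphs with several in-neighbors the result is only an approximation that depends on the number of walks.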

Experimental setup
  • Datasets:
    • Research Index dataset: papers and their citations
      • Almost 700,000 cross-citations among 270,000 papers
    • Student and course dataset: students and their courses (bipartite graph)
      • 1030 students, each taking around 40 courses
Experimental setup (Cont.)
  • Baseline method:
    • Co-citation: measures the number of shared neighboring objects
  • How to evaluate the algorithm:
    • Select objects p
    • For each p, select the top N most similar objects
    • Average their similarity to p, based on a domain-specific measure
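The baseline and the evaluation loop could be sketched roughly as below. Everything here is illustrative: the citation data, the stand-in `domain_sim` measure, and the function names are assumptions, and the talk does not preserve which domain-specific measures were actually used.

```python
def cocitation(in_nbrs, a, b):
    """Co-citation baseline: number of in-neighbors shared by a and b."""
    return len(set(in_nbrs.get(a, ())) & set(in_nbrs.get(b, ())))

def top_n(p, candidates, score, n):
    """The n candidates (other than p) with the highest score(p, c)."""
    ranked = sorted((c for c in candidates if c != p),
                    key=lambda c: score(p, c), reverse=True)
    return ranked[:n]

def average_quality(p, top, domain_sim):
    """Mean domain-specific similarity between p and its top-ranked objects."""
    return sum(domain_sim(p, c) for c in top) / len(top) if top else 0.0

# Hypothetical citation data: node -> list of papers citing it.
in_nbrs = {"A": ["B"], "A1": ["B"], "A2": ["B", "X"], "A3": ["Y"]}
top = top_n("A", list(in_nbrs), lambda p, c: cocitation(in_nbrs, p, c), n=2)

# Hypothetical stand-in for a domain-specific measure.
domain_sim = lambda p, c: 1.0 if c == "A1" else 0.0
quality = average_quality("A", top, domain_sim)
print(top, quality)
```

Swapping the `score` argument for a SimRank lookup would let the same loop compare both methods on equal terms.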
Trend: computing SimRank on MapReduce
  • Delta-SimRank Computing on MapReduce. Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine’12).

We only need to send values greater than zero → this saves communication cost on MapReduce.

Good points
  • The paper proposes a novel, general method to compute the similarity of objects based on the structure of the data
  • The paper proposes a method to compute SimRank and an efficient pruning technique
  • The paper provides an intuition for the method
  • Good experimental results support the idea
Weak points
  • Scalability: the paper should discuss very large graphs.
  • It could incorporate a distributed design. Since the algorithm is a fixed-point process, how to parallelize it is a research problem.
Quiz
  • Intuitively, in which graph is the SimRank of a and b higher?

(Figure: two graphs, each containing nodes a and b.)