
Search Engine Technology (11)



Presentation Transcript


  1. Search Engine Technology (11) Prof. Dragomir R. Radev radev@cs.columbia.edu

  2. SET Fall 2013 … 17. continued …

  3. [Slide from Reka Albert]

  4. [Slide from Reka Albert]

  5. The strength of weak ties • Granovetter’s study: finding jobs • Weak ties: more people can be reached through weak ties than strong ties (e.g., through your 7th and 8th best friends) • More here: http://en.wikipedia.org/wiki/Weak_tie

  6. Prestige and centrality • Degree centrality: how many neighbors each node has. • Closeness centrality: how close a node is to all of the other nodes • Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes • Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. • Prestige = same as centrality but for directed graphs.
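The first two measures on this slide can be sketched in a few lines of Python. The toy graph and the function names below are hypothetical, chosen only for illustration; closeness is computed here as (n − 1) divided by the sum of shortest-path distances, which is one common normalization and assumes a connected graph.

```python
from collections import deque

# Hypothetical undirected toy graph (adjacency lists); not taken from the slides.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}

def degree_centrality(g):
    """Degree centrality: number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in g.items()}

def bfs_distances(g, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(g):
    """Closeness: (n - 1) / sum of distances to all other nodes (connected graph assumed)."""
    n = len(g)
    return {v: (n - 1) / sum(bfs_distances(g, v).values()) for v in g}

degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
```

In this toy graph node "b" has both the highest degree and the highest closeness, since it is at most two hops from every other node.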

  7. SET Fall 2013 … 18. Graph-based methods Harmonic functions Random walks PageRank …

  8. Random walks and harmonic functions • Drunkard’s walk: • Start at a position x on a line with points 0, 1, …, N (here N = 5) • What is the probability of reaching 5 before reaching 0? • Harmonic functions: • P(0) = 0 • P(N) = 1 • P(x) = ½·P(x−1) + ½·P(x+1), for 0 < x < N • (in general, replace ½ with the bias in the walk) • [Figure: number line with points 0 to 5]
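The boundary conditions and the averaging rule on this slide form a small linear system, which can be solved exactly. The sketch below is one way to do that, assuming a walk on the points 0..n; `hitting_probabilities` and `p_right` (the probability of stepping right, i.e. the bias) are hypothetical names introduced for illustration.

```python
from fractions import Fraction

def hitting_probabilities(n, p_right=Fraction(1, 2)):
    """Solve P(0)=0, P(n)=1, P(x) = (1-p)*P(x-1) + p*P(x+1) for 0 < x < n.

    Returns [P(0), ..., P(n)] as exact fractions, using Gauss-Jordan elimination.
    """
    p_right = Fraction(p_right)
    size = n + 1
    # Augmented matrix: coefficients of P(0..n), then the right-hand side.
    rows = [[Fraction(0)] * (size + 1) for _ in range(size)]
    rows[0][0] = Fraction(1)          # boundary: P(0) = 0
    rows[n][n] = Fraction(1)          # boundary: P(n) = 1
    rows[n][size] = Fraction(1)
    for x in range(1, n):             # interior: P(x) - (1-p)P(x-1) - pP(x+1) = 0
        rows[x][x - 1] = -(1 - p_right)
        rows[x][x] = Fraction(1)
        rows[x][x + 1] = -p_right
    for col in range(size):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[r][size] for r in range(size)]

unbiased = hitting_probabilities(5)
```

For the unbiased walk the solution is linear in the starting point, P(x) = x/N, so a walk starting at 2 reaches 5 before 0 with probability 2/5.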

  9. (**) The original Dirichlet problem • Distribution of temperature in a sheet of metal. • One end of the sheet is held at temperature 0, the other end at temperature 1. • Laplace’s differential equation: $\nabla^2 u = 0$, where u is the temperature. • This is a special (steady-state) case of the (transient) heat equation: $\partial u / \partial t = \alpha \nabla^2 u$, where t is time. • In general, the solutions to Laplace’s equation are called harmonic functions.

  10. Learning harmonic functions • The method of relaxations • Discrete approximation. • Assign fixed values to the boundary points. • Assign arbitrary values to all other points. • Adjust their values to be the average of their neighbors. • Repeat until convergence. • Monte Carlo method • Perform a random walk on the discrete representation. • Compute f as the probability of a random walk ending in a particular fixed point. • Eigenvector methods • Look at the stationary distribution of a random walk
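The first two methods can be sketched on the one-dimensional walk with boundary values f(0) = 0 and f(n) = 1. This is a minimal illustration, not the lecture's own code: the function names, the tolerance, the number of walks, and the random seed are all assumptions.

```python
import random

def relax(n, tol=1e-10, max_sweeps=100_000):
    """Method of relaxations on the line 0..n with f(0)=0 and f(n)=1."""
    f = [0.0] * (n + 1)              # arbitrary starting values for interior points
    f[n] = 1.0                       # boundary values stay fixed
    for _ in range(max_sweeps):
        change = 0.0
        for x in range(1, n):        # update interior points only
            new = 0.5 * (f[x - 1] + f[x + 1])
            change = max(change, abs(new - f[x]))
            f[x] = new
        if change < tol:             # repeat until convergence
            break
    return f

def monte_carlo(n, start, walks=20_000, seed=0):
    """Fraction of unbiased random walks from `start` that reach n before 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        x = start
        while 0 < x < n:
            x += rng.choice((-1, 1))
        hits += (x == n)
    return hits / walks

relaxed = relax(5)
estimate = monte_carlo(5, start=2)
```

Both approaches approximate the same harmonic function: relaxation converges to f(x) = x/5, and the Monte Carlo estimate for a start at 2 should land near 0.4, with sampling error that shrinks as the number of walks grows.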

  11. Eigenvectors and eigenvalues • An eigenvector is an implicit “direction” for a matrix: $Av = \lambda v$, where v (the eigenvector) is non-zero, though λ (the eigenvalue) can be any complex number in principle. • Computing eigenvalues: solve $\det(A - \lambda I) = 0$.

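  12. Eigenvectors and eigenvalues • Example: $A = \begin{pmatrix} -1 & 3 \\ 2 & 0 \end{pmatrix}$ • $\det(A - \lambda I) = (-1 - \lambda)(-\lambda) - 3 \cdot 2 = 0$ • Then: $\lambda + \lambda^2 - 6 = 0$; $\lambda_1 = 2$; $\lambda_2 = -3$ • For $\lambda_1 = 2$: $(A - 2I)x = 0$, i.e. $-3x_1 + 3x_2 = 0$ and $2x_1 - 2x_2 = 0$ • Solutions: $x_1 = x_2$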
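The worked example can be checked numerically. The sketch below assumes the matrix A = [[-1, 3], [2, 0]], which is a matrix consistent with the determinant expansion and the solution x1 = x2 on this slide; `eigenvalues_2x2` is a hypothetical helper that handles only the real-eigenvalue case.

```python
import math

def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]].

    Uses det(A - lambda*I) = lambda^2 - (a + d)*lambda + (a*d - b*c) = 0.
    """
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; this sketch handles the real case only")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

A = [[-1, 3], [2, 0]]                 # assumed matrix, see the note above
lam1, lam2 = eigenvalues_2x2(A[0][0], A[0][1], A[1][0], A[1][1])

# Check that v = (1, 1), i.e. x1 = x2, satisfies A v = lam1 * v.
v = [1, 1]
Av = [A[0][0] * v[0] + A[0][1] * v[1],
      A[1][0] * v[0] + A[1][1] * v[1]]
```

Here the characteristic polynomial is λ² + λ − 6, whose roots are 2 and −3, and multiplying A by (1, 1) gives (2, 2), which is 2 times the vector.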

  13. Stochastic matrices • Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0. • Example: [matrix figure] • The largest eigenvalue of a stochastic matrix E is real: $\lambda_1 = 1$. • For $\lambda_1$, the left (principal) eigenvector is p, the right eigenvector = 1 • In other words, $E^T p = p$.

  14. Electrical networks and random walks • Ergodic (connected) Markov chain with transition matrix P • w = Pw • [Figure: circuit with nodes a, b, c, d and resistors of 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω; from Doyle and Snell 2000]

  15. Electrical networks and random walks • $v_x$ is the probability that a random walk starting at x will reach a before reaching b. • The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits. • [Figure: circuit with nodes a, b, c, d, resistors of 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω, and a 1 V source]

  16. Markov chains • A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E. • Path = sequence $(x_0, x_1, \ldots, x_n)$, with $x_i = x_{i-1} E$ • The probability of a path can be computed as a product of probabilities for each step i. • Random walk = find $x_j$ given $x_0$, E, and j.
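Both operations on this slide, propagating a distribution and scoring a path, reduce to a few sums and products. The three-state kernel `E` and the helper names below are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
def step(x, E):
    """One step of the chain: row vector x times kernel E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def distribution_after(x0, E, j):
    """Distribution x_j obtained by applying x_i = x_{i-1} E a total of j times."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

def path_probability(x0, E, path):
    """Probability of a state sequence: initial probability times each transition."""
    prob = x0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= E[a][b]
    return prob

# Hypothetical 3-state kernel; every row sums to 1.
E = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
x0 = [1.0, 0.0, 0.0]                  # start in state 0 with certainty

x2 = distribution_after(x0, E, 2)
p_012 = path_probability(x0, E, [0, 1, 2])
```

Starting in state 0, the distribution after two steps is [0.375, 0.5, 0.125], and the single path 0 → 1 → 2 has probability 0.5 × 0.25 = 0.125, which is exactly the mass that reaches state 2.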

  17. Stationary solutions • The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: • E is stochastic • E is irreducible • E is aperiodic • To make these conditions true: • All rows of E add up to 1 (and no value is negative) • Make sure that E is strongly connected • Make sure that E is not bipartite • Example: PageRank [Brin and Page 1998]: use “teleportation”
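One way to satisfy all three conditions at once is the teleportation trick mentioned in the last bullet: mix the link-following walk with a uniform jump to any node. The sketch below is an illustration under stated assumptions, not the exact PageRank formulation: the damping value 0.85, the uniform treatment of nodes without out-links, and the name `add_teleportation` are all choices made here.

```python
def add_teleportation(adj, damping=0.85):
    """Turn a 0/1 adjacency matrix into a stochastic, irreducible, aperiodic kernel.

    With probability `damping` follow a random out-link; otherwise jump to a
    uniformly chosen node. Nodes with no out-links jump uniformly (assumption).
    """
    n = len(adj)
    kernel = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            link = [1.0 / n] * n
        else:
            link = [value / out_degree for value in row]
        kernel.append([damping * l + (1.0 - damping) / n for l in link])
    return kernel

# Hypothetical 4-node graph: 0 -> 1, 1 -> 0 (a periodic two-cycle), 2 -> 0, 3 has no out-links.
adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 0]]
E = add_teleportation(adj)
```

Every entry of the resulting kernel is strictly positive, so any state can be reached from any other in one step (irreducible), self-transitions are possible (aperiodic), and each row still sums to 1 (stochastic).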

  18. Example • [Figure: graph with nodes 1 to 8, shown at t=0 and t=1] • This graph E has a second graph E′ (not drawn) superimposed on it: E′ is the uniform transition graph.

  19. Eigenvectors • An eigenvector is an implicit “direction” for a matrix. $Ev = \lambda v$, where v is non-zero, though λ can be any complex number in principle. • The largest eigenvalue of a stochastic matrix E is real: $\lambda_1 = 1$. • For $\lambda_1$, the left (principal) eigenvector is p, the right eigenvector = 1 • In other words, $E^T p = p$.

  20. Computing the stationary distribution
      function PowerStatDist(E):
      begin
        p(0) = u;  (or p(0) = [1,0,…,0])
        i = 1;
        repeat
          p(i) = E^T p(i-1);
          L = ||p(i) − p(i-1)||_1;
          i = i + 1;
        until L < ε
        return p(i-1)
      end
      • Solution for the stationary distribution • Convergence rate is O(m)
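The pseudocode above translates almost line for line into Python. This is a minimal sketch assuming that u is the uniform distribution and that E is a row-stochastic kernel given as a list of lists; the threshold `eps`, the iteration cap, and the example kernel are assumptions.

```python
def power_stat_dist(E, eps=1e-12, max_iter=10_000):
    """Power method for the stationary distribution p satisfying E^T p = p."""
    n = len(E)
    p = [1.0 / n] * n                              # p(0) = u, the uniform vector
    for _ in range(max_iter):
        # new[j] = sum_i E[i][j] * p[i], i.e. new = E^T p
        new = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        diff = sum(abs(a - b) for a, b in zip(new, p))   # L1 distance
        p = new
        if diff < eps:
            break
    return p

# Hypothetical 3-state kernel; irreducible and aperiodic because of the self-loops.
E = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
stationary = power_stat_dist(E)
```

For this kernel the result is [0.25, 0.5, 0.25], which can be verified by hand: applying E^T to that vector returns the same vector.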

  21. Example • [Figure: graph with nodes 1 to 8, shown at t=0, t=1, and t=10]

  22. PageRank • Developed at Stanford and allegedly still being used at Google. • Not query-specific, although query-specific varieties exist. • In general, each page is indexed along with the anchor texts pointing to it. • Among the pages that match the user’s query, Google shows the ones with the largest PageRank. • Google also uses vector-space matching, keyword proximity, anchor text, etc.

  23. SET Fall 2013 … 19. Hubs and authorities Bipartite graphs HITS and SALSA Models of the web …

  24. HITS • Hypertext-induced topic selection. • Developed by Jon Kleinberg and colleagues at IBM Almaden as part of the CLEVER engine. • HITS is query-specific. • Hubs and authorities, e.g. collections of bookmarks about cars vs. actual sites about cars. • [Figure: example graph with nodes Honda, Ford, VW, and Car and Driver]

  25. HITS • Each node in the graph is ranked for hubness (h) and authoritativeness (a). • Some nodes may have high scores on both. • Example authorities for the query “java”: • www.gamelan.com • java.sun.com • digitalfocus.com/digitalfocus/… (The Java developer) • lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html • sunsite.unc.edu/javafaq/javafaq.html

  26. HITS • HITS algorithm: • obtain root set (using a search engine) related to the input query • expand the root set by radius one on either side (typically to size 1000-5000) • run iterations on the hub and authority scores together • report top-ranking authorities and hubs • Eigenvector interpretation: with adjacency matrix A, $a = A^T h$ and $h = A a$, so a is the principal eigenvector of $A^T A$ and h is the principal eigenvector of $A A^T$.
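The iteration step can be sketched as alternating updates of the two score vectors followed by normalization. This is an illustrative sketch of the standard hub/authority update, not code from the lecture: the five-node graph, the fixed iteration count, and the L2 normalization are assumptions.

```python
import math

def hits(adj, iterations=50):
    """Iterate authority and hub scores on adjacency matrix adj (adj[i][j] = 1 if i links to j)."""
    n = len(adj)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of hub scores of the pages linking to j (a = A^T h).
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub score of i: sum of authority scores of the pages i links to (h = A a).
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        # Normalize both vectors to unit length so the scores stay bounded.
        auth_norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        hub_norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        auth = [a / auth_norm for a in auth]
        hub = [h / hub_norm for h in hub]
    return hub, auth

# Hypothetical graph: nodes 0 and 1 are link collections, nodes 2-4 are content sites.
# Node 0 links to 2, 3, 4; node 1 links to 2, 3.
adj = [[0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]]
hub, auth = hits(adj)
```

In this graph the link collections end up with all of the hub weight and the content sites with all of the authority weight; nodes 2 and 3 tie for top authority because both hubs point to them.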

  27. Example [slide from Baldi et al.]

  28. HITS • HITS is now used by Ask.com and Teoma.com. • It can also be used to identify communities (e.g., based on synonyms as well as controversial topics). • Example for “jaguar” • Principal eigenvector gives pages about the animal • The positive end of the second nonprincipal eigenvector gives pages about the football team • The positive end of the third nonprincipal eigenvector gives pages about the car. • Example for “abortion” • The positive end of the second nonprincipal eigenvector gives pages on “planned parenthood” and “reproductive rights” • The negative end of the same eigenvector includes “pro-life” sites. • SALSA (Lempel and Moran 2001)

  29. Models of the Web • Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology • Erdős/Rényi 59, 60 • Barabási/Albert 99 • Watts/Strogatz 98 • Kleinberg 98 • Menczer 02 • Radev 03 • [Figure: diagram with nodes labeled a, A, B, b]

  30. Evolving Word-based Web • Observations: • Links are made based on topics • Topics are expressed with words • Words are distributed very unevenly (Zipf, Benford, self-triggerability laws) • Model: • Pick n • Generate n lengths according to a power-law distribution • Generate n documents using a trigram model • Model (cont’d): • Pick words in decreasing order of r. • Generate hyperlinks with random directionality • Outcome: • Generates power-law degree distributions • Generates topical communities • Natural variation of PageRank: LexRank

  31. Readings • paper by Church and Gale (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.3957)
