
Web Information retrieval (Web IR)


Presentation Transcript


  1. Web Information retrieval (Web IR) Handout #9: Connectivity Ranking Ali Mohammad Zareh Bidoki ECE Department, Yazd University alizareh@yaduni.ac.ir

  2. Outline • PageRank • HITS • Personalized PageRank • HostRank • Distance Rank

  3. Ranking: Definition • Ranking is the process that estimates the quality of a set of results retrieved by a search engine • Ranking is the most important part of a search engine

  4. Ranking Types • Content-based • Classical IR • Connectivity based (web) • Query independent • Query dependent • User-behavior based

  5. Web information retrieval • Queries are short: 2.35 terms on average • Huge variety in documents: language, quality, duplication • Huge vocabulary: hundreds of millions of terms • Deliberate misinformation • Spamming! • With content-based ranking, a page's rank is completely under the control of the page's author

  6. Ranking in Web IR [figure: query terms mapped to words and documents, alongside the Web graph] • Ranking is a function of the query terms and of the hyperlink structure • Uses the content of other pages to rank the current page • This is out of the control of the page's author • Spamming is hard

  7. Connectivity-based Ranking • Query independent • PageRank • Query dependent • HITS

  8. Google’s PageRank Algorithm • Idea: Mine structure of the web graph • Each web page is a node • Each hyperlink is a directed edge

  9. PageRank • Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A) • Quality of a page is related to its in-degree • Recursion: Quality of a page is related to • its in-degree, and to • the quality of pages linking to it [figure: page A linking to its successor B]

  10. Definition of PageRank [figure: surfer at page p] • Consider the following infinite random walk (surf): • Initially the surfer is at a random page • At each step, the surfer proceeds to a randomly chosen successor of the current page (with probability 1/outdegree) • The PageRank of a page p is the fraction of steps the surfer spends at p in the limit • This is the random surfer model
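The random-surfer definition can be checked empirically: simulate the walk and count visit frequencies. A minimal Monte Carlo sketch, assuming a small strongly connected example graph (so the pure walk without a damping factor, introduced on a later slide, still converges):

```python
# Monte Carlo sketch of the random-surfer model; the example graph is an assumption.
import random

def random_surfer(graph, steps=100_000, seed=0):
    """graph: dict page -> list of successor pages. Returns visit frequencies."""
    rng = random.Random(seed)
    current = rng.choice(list(graph))            # initially at a random page
    visits = {p: 0 for p in graph}
    for _ in range(steps):
        visits[current] += 1
        current = rng.choice(graph[current])     # move to a random successor
    return {p: c / steps for p, c in visits.items()}

g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = random_surfer(g)
```

For this graph, the fraction of steps spent at each page approximates the stationary distribution of the walk, i.e. the (undamped) PageRank.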

  11. PageRank (cont.) • By the previous theorem, PageRank is the stationary probability of this Markov chain, i.e. PR(p) = Σ_{q → p} PR(q) / outdegree(q)

  12. PageRank (cont.) [figure: pages A and B both link to page P; A has out-degree 4 and B has out-degree 3] • PageRank of P is PR(A)/4 + PR(B)/3

  13. PageRank (cont.)

  14. Damping Factor (d) • Web graph is not strongly connected • Convergence of PageRank is not guaranteed • Effects of sinking web pages • Pages without outputs • Trapping pages • Damping factor (d): the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability (1-d), giving PR(p) = (1-d)/n + d · Σ_{q → p} PR(q)/outdegree(q), where n is the total number of nodes in the graph

  15. PageRank Vector (Linear Algebra) • R is the rank vector (eigenvector); ri is the rank value of page i • P is a matrix with pij = 1/O(i) if i points to j, else pij = 0 • Goal is to find the eigenvector of matrix P with eigenvalue one • It iterates to convergence (power method) • Using the damping factor we have R = d · P^T R + (1-d) · E, where ei = 1/n
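The power method above can be sketched directly. A minimal implementation, assuming a tiny example graph (the sink handling, spreading a dangling page's rank uniformly, is one common convention rather than the only one):

```python
# Power-method PageRank with damping; the example graph is an assumption.
def pagerank(graph, d=0.85, iters=100):
    """graph: dict node -> list of successor nodes."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # start from the uniform vector
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}  # teleport term (1-d)/n
        for u, succs in graph.items():
            if succs:                            # distribute u's rank over successors
                share = d * rank[u] / len(succs)
                for v in succs:
                    new[v] += share
            else:                                # sink page: spread rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(g)
```

Each iteration multiplies the rank vector by the damped transition matrix, so the total rank mass stays 1 and the vector converges to the principal eigenvector.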

  16. PageRank Properties • Advantages • Finds popularity • It is offline • Disadvantages • It is query independent • All pages compete together • Unfairness

  17. HITS (an online, query-dependent algorithm) • Hypertext Induced Topic Search • By Kleinberg

  18. HITS (Hypertext Induced Topic Search) • The algorithm produces two types of pages: - Authority: a page is very authoritative if it receives many citations. Citations from important pages weigh more than citations from less-important pages - Hub: hubness shows the importance of a page. A good hub is a page that links to many authoritative sites • For each vertex v ∈ V in a graph of interest: • a(v) - the authority of v • h(v) - the hubness of v

  19. HITS [figure: pages 2, 3, 4 link to page 1; page 1 links to pages 5, 6, 7] • h(1) = a(5) + a(6) + a(7) • a(1) = h(2) + h(3) + h(4)

  20. Authority and Hubness Convergence • Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs
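The mutually reinforcing updates can be sketched as alternating sums with normalization. A minimal version, assuming a small example graph (the L2 normalization is the usual convention for making the iteration converge):

```python
# Minimal HITS iteration; the example graph below is an assumption.
def hits(graph, iters=50):
    """graph: dict node -> list of nodes it links to."""
    nodes = set(graph) | {v for succs in graph.values() for v in succs}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # a(v) = sum of hub scores of pages linking to v
        auth = {v: sum(hub[u] for u in graph if v in graph[u]) for v in nodes}
        norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        auth = {v: x / norm for v, x in auth.items()}
        # h(v) = sum of authority scores of pages v links to
        hub = {v: sum(auth[w] for w in graph.get(v, [])) for v in nodes}
        norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        hub = {v: x / norm for v, x in hub.items()}
    return auth, hub

g = {1: [2, 3], 4: [2, 3], 5: [3]}
a, h = hits(g)
```

In this example page 3 is cited by 1, 4 and 5 while page 2 is cited only by 1 and 4, so 3 ends up with the higher authority; pages 1 and 4 point at both good authorities, so they become the better hubs.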

  21. HITS Example • Find a base subgraph: • Start with a root set R = {1, 2, 3, 4} - nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents (d) of the nodes in R • This gives a new set S (the base subgraph) • The real version of HITS is based on site relations

  22. Topic Sensitive PageRank (TSPR) • It precomputes the importance scores offline, as with ordinary PageRank • However, it computes multiple importance scores for each page: a set of scores of the importance of a page with respect to various topics • At query time, these importance scores are combined based on the topics of the query to form a composite PageRank score for those pages matching the query

  23. TSPR (Cont.) • We have n topics on the web, and a rank of page v for each topic t • Difference with original PageRank is in the E vector (it is not uniform, and we have n E vectors) • There are n ranking values for each page • Problem: finding the topic of a page and of a query (we do not know the user's interest)

  24. TSPR (Cont.) • Cj = category j • Given a query q, let q' be the context of q (here q' = q) • P(q'|cj) is computed from the class term-vector Dj (the term counts of the documents below each of the 16 top-level categories; Djt simply gives the total number of occurrences of term t in documents listed below class cj)

  25. TSPR (Cont.) • The quantity P(cj ) is not as straightforward. It is used uniformly, although we could personalize the query results for deferent users by varying this distribution. • In other words, for some user k, we can use a prior distribution Pk(cj ) that reflects the interests of user k. • This method provides an alternative framework for user-based personalization, rather than directly varying the damping vector E

  26. TrustRank • Spamming on the Web: good and bad pages • TrustRank is used to overcome spamming • It proposes techniques to semi-automatically separate reputable, good pages from spam • It first selects a small set of seed pages to be evaluated by an expert • Once the reputable seed pages are manually identified, it uses the link structure of the web to discover other pages that are likely to be good

  27. TrustRank (cont.) • Idea: good pages link to other good pages, and bad pages link to other bad pages

  28. TrustRank (cont.) • It formalizes the notion of a human checking a page for spam by a binary oracle function O over all pages p: O(p) = 1 if p is good, and O(p) = 0 if p is spam

  29. Trust damping and Trust Splitting

  30. Computing Trustiness of Each Page • Goal: propagate trust from the seed pages over the link structure, using the PageRank recursion with a biased jump vector • Trust Propagation: E(i) is computed from the normalized oracle vector; for example, if O(5) = O(10) = O(15) = 1 then E(5) = E(10) = E(15) = 0.33
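Trust propagation can be sketched as PageRank whose jump vector is the normalized oracle vector E, so all teleportation mass flows to the seed pages. A minimal version, assuming a toy graph with one seed and an unreachable spam cluster (sink handling is omitted for brevity):

```python
# Sketch of TrustRank as seed-biased PageRank; the example graph is an assumption.
def trustrank(graph, seeds, d=0.85, iters=100):
    """graph: dict node -> successors; seeds: pages the oracle judged good."""
    nodes = list(graph)
    # Normalized oracle vector: uniform over seeds, zero elsewhere.
    e = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    trust = dict(e)
    for _ in range(iters):
        new = {v: (1.0 - d) * e[v] for v in nodes}   # teleport only to seeds
        for u, succs in graph.items():
            for v in succs:                          # trust flows along links
                new[v] += d * trust[u] / len(succs)
        trust = new
    return trust

g = {"seed": ["good"], "good": ["seed"], "spam": ["spam2"], "spam2": ["spam"]}
t = trustrank(g, seeds={"seed"})
```

Pages reachable from the seeds accumulate trust (attenuated with link distance), while the disconnected spam cluster receives none, which is exactly the separation the slide describes.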

  31. HostRank • Previous link-analysis algorithms generally work on a flat link graph, ignoring the hierarchical structure of the Web graph • They suffer from two problems: the sparsity of the link graph and biased ranking of newly-emerging pages • HostRank considers both the hierarchical structure and the link structure of the Web

  32. Example of Domain, Host, Directory

  33. Supernode & Hierarchical Structure of a Web Graph • The upper-layer is an aggregated link graph which consists of supernodes (such as domain, host and directory) • The lower-layer graph is the hierarchical tree structure, in which each node is an individual Web page in the supernode, and the edges are the hierarchical links between the pages

  34. Hierarchical Random Walk Model • 1. At the beginning of each browsing session, a user randomly selects a supernode • 2. After the user finishes reading a page in a supernode, he may take one of the following three actions, each with a certain probability: • Going to another page within the current supernode • Jumping to another supernode that is linked by the current supernode • Ending the browsing session
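The browsing session above can be sketched as a small simulation. The supernode membership, link structure, and action probabilities below are illustrative assumptions, not values from HostRank:

```python
# Hedged simulation of the hierarchical random-walk model; all data is assumed.
import random

def browse_session(supernodes, superlinks, p_within=0.6, p_jump=0.3, seed=0):
    """supernodes: dict supernode -> its pages; superlinks: supernode -> linked supernodes.
    Returns the sequence of (supernode, page) visits in one session."""
    rng = random.Random(seed)
    s = rng.choice(list(supernodes))                   # 1. start at a random supernode
    visits = [(s, rng.choice(supernodes[s]))]
    while True:
        r = rng.random()                               # 2. pick one of three actions
        if r < p_within:                               # stay within the supernode
            visits.append((s, rng.choice(supernodes[s])))
        elif r < p_within + p_jump and superlinks.get(s):
            s = rng.choice(superlinks[s])              # jump to a linked supernode
            visits.append((s, rng.choice(supernodes[s])))
        else:                                          # end the browsing session
            return visits

hosts = {"a.com": ["a/1", "a/2"], "b.com": ["b/1"]}
links = {"a.com": ["b.com"], "b.com": ["a.com"]}
session = browse_session(hosts, links)
```

Averaging visit counts over many such sessions approximates the stationary behavior that the two-stage HostRank computation models analytically.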

  35. Two Stages in HostRank • First, compute the score of each supernode by a random walk (PageRank) • Second, propagate the score among pages inside a supernode using the Dissipative Heat Conductance (DHC) model

  36. Ranking & Crawling Challenges • Rich-get-richer problem • Unfairness • Low precision • Spamming phenomenon

  37. Popularity and Quality • Definition 1: We define the popularity of page p at time t, P(p, t), as the fraction of Web users who like the page • We can interpret the PageRank of a page as its popularity on the web • Definition 2: We define the quality of a page p, Q(p), as the probability that an average user will like the page when the user sees the page for the first time

  38. Rich-get-richer Problem • It causes young, high-quality pages to receive less popularity than they deserve • It stems from search-engine bias • Entrenchment Effect

  39. Entrenchment Effect [figure: user attention concentrated on entrenched pages at the top of results, with new unpopular pages below] • Search engines show entrenched (already-popular) pages at the top • Users discover pages via search engines and tend to focus on the top results

  40. Popularity as a Surrogate for Quality [figure: popularity used as a stand-in for quality] • Search engines want to measure the "quality" of pages • Quality is hard to define and measure • Various "popularity" measures are used in ranking • e.g., in-links, PageRank, user traffic

  41. Measuring Search-Engine Bias • Random-surfer model • Users follow links randomly • Never use search engines • Search-dominant model • Users always start with a search engine • Only visit pages returned by search engines • It has been found that it takes 60 times longer for a new page to become popular under the search-dominant model than under the random-surfer model

  42. Popularity Evaluation [figure: popularity-evolution curves under the random-surfer and search-dominant models]

  43. Some Definitions

  44. Relation between Popularity & Visit rate in Random Surfer Model • r1 is constant • We can consider PageRank as Popularity (the current PageRank of a page represents the probability that a person arrives at the page if the person follows links on the Web randomly)

  45. Popularity evolution

  46. Popularity evolution (Q(p)=1)

  47. Relation in Search Dominant Model

  48. Random Surfer vs. Search Dominant

  49. Search Dominant Formula Detail (found from AltaVista log: power law)

  50. Rank Promotion (by Pandey)
