
LIS618 lecture 9 Web retrieval




  1. LIS618 lecture 9 Web retrieval
     Thomas Krichel, 2011-11-21

  2. structure
     • A mini Web theory
     • Google “theory”, mainly page rank

  3. literature
     • Brin and Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, http://infolab.stanford.edu/~backrub/google.html
     • Calishain and Dornfest, “Google hacks”, O'Reilly, 2003
     • Schneider & alii, “How to do everything with Google”, McGraw Hill Osborne, 2004
     • Google web site

  4. web information retrieval
     • We can think of the web as a pile of documents called pages.
     • Some pages are hard to index:
       • PDF documents
       • pictures
       • sound files
     • But a majority of pages are written in HTML:
       • easy to index
       • have a loose structure

  5. Google uses the structure of HTML
     • Google finds the title of the page, i.e. the contents of the <title> element.
     • Google analyzes headings and large font sizes and gives priority weight to terms found there.
     • Most importantly, Google uses the link structure of the web to find important pages.
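A toy indexer can pull out those HTML fields with Python's standard html.parser module. The sketch below is illustrative only (the class name and the choice of fields are assumptions, not Google's code):

```python
from html.parser import HTMLParser

# Toy extractor for the HTML fields a search engine might weight
# more heavily: the <title> text and heading (h1-h3) text.
class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "headings": []}
        self._stack = []          # open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        tag = self._stack[-1]
        if tag == "title":
            self.fields["title"] += data
        elif tag in ("h1", "h2", "h3"):
            self.fields["headings"].append(data.strip())
```

A term found in `fields["title"]` or `fields["headings"]` would then get a higher weight than one found in the body text.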

  6. classic IR and the web
     • In classic information retrieval, every document has the same importance. Documents differ only as to their relevance to a query.
     • In classic information retrieval, a document d is relevant if the query terms appear relatively frequently in d rather than in other documents.
     • But if a web page contains only the words “Bill Clinton sucks” and a picture, it is not a good hit for the query “Bill Clinton”.
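One standard way to make “frequently in d rather than in other documents” precise is tf-idf weighting. A toy sketch, with made-up documents that are not part of the lecture:

```python
import math

# Toy tf-idf: term frequency in the document, discounted by how
# many documents in the collection contain the term at all.
docs = {
    "d1": "bill clinton speech bill clinton policy",
    "d2": "bill of sale",
    "d3": "gardening tips",
}

def tf_idf(term, doc_id):
    words = docs[doc_id].split()
    tf = words.count(term) / len(words)               # frequent here?
    df = sum(1 for d in docs.values() if term in d.split())
    idf = math.log(len(docs) / df)                    # rare elsewhere?
    return tf * idf
```

Under this scoring, "clinton" in d1 scores higher than the very common "bill" in d2, which is the classic-IR notion of relevance the slide describes.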

  7. Google finds important pages
     • The idea is that the documents on the web have different degrees of importance.
     • Google will show the most important pages first.
     • The assumption is that more important pages are likely to be more relevant to any query than non-important pages.

  8. Google’s monkey
     • Imagine that the web has P pages. Each page has its own address (URL).
     • Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in the address of a random page out of those P.
     • Will the monkey visit all pages with equal probability?

  9. page rank
     • The Google page rank of a page is the probability that Google’s monkey will visit the page.
     • The monkey will come frequently to pages that have a lot of links to them.
     • Once he is on an important page, he will likely move on to one of the pages that it links to.
     • The structure of all the links on the entire web reveals the importance of a page.

  10. many page ranks
     • There is an infinite number of ways to calculate the page rank, depending on
       • how likely the monkey is to get bored.
       • the probability with which the bored monkey visits each page.
     • Potentially, there is a page rank for each user of the web. Google tries to observe users and may associate a personal page rank with each.

  11. notation
     • Assume that the monkey gets bored with probability d. If bored, he will visit page p with probability π_p.
     • For any page p, let o_p be the number of outgoing links on p.
     • Let l(p',p) be the number of links from page p' to page p.

  12. page rank formula
     • The page rank of a page p is
       r_p = π_p d + (1-d) ∑_{p'} l(p',p) r_p' / o_p'
     • In words, it is the likelihood that, if bored, the monkey goes to page p, plus the likelihood that he gets there from another page p'. The likelihood of getting there from p' is the likelihood of being on p', times the number of links from p' to p, divided by the number of outgoing links on p'.
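The update on this slide can be sketched in a few lines of Python. This is an illustrative implementation, not Google's: the function name, the iteration count, and the handling of pages without outgoing links (they are simply skipped) are all assumptions.

```python
# Iterative page rank, applying the slide's formula:
#   r_p = pi_p * d + (1 - d) * sum over p' of l(p', p) * r_p' / o_p'
# `links` maps each page to the list of pages it links to.

def page_rank(links, d, pi, iterations=1000):
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}   # uniform starting guess
    for _ in range(iterations):
        new_ranks = {}
        for p in pages:
            # probability of arriving at p by following a link from some p'
            inflow = sum(links[q].count(p) * ranks[q] / len(links[q])
                         for q in pages if links[q])
            new_ranks[p] = pi[p] * d + (1 - d) * inflow
        ranks = new_ranks
    return ranks
```

On the four-page web of the next slide, with d = 1/4 and uniform π, this converges to the ranks listed on slide 15.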

  13. example
     • Let there be a web of four pages A, B, C, D.
     • A links to B.
     • B links to C.
     • C links to A and D.
     • D links to A.
     • Let the probability of getting bored be ¼, and, when bored, let the monkey move to each page with probability ¼.
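The monkey story can also be checked empirically on this little web by simulating him. A minimal Monte-Carlo sketch (the function name, step count, and random seed are arbitrary choices, and the long-run visit frequencies are only approximate):

```python
import random

# Simulate the monkey on the four-page web A->B, B->C, C->{A,D}, D->A,
# with boredom probability 1/4 and a uniform jump when bored.
web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}

def visit_frequencies(steps, seed=0):
    rng = random.Random(seed)
    visits = {p: 0 for p in web}
    page = "A"
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < 0.25:                # the monkey gets bored
            page = rng.choice(sorted(web))     # jump to a random page
        else:
            page = rng.choice(web[page])       # follow a random link
    return {p: visits[p] / steps for p in web}
```

With enough steps, the observed frequencies approach the page ranks computed on the following slides.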

  14. page ranks
     • The following system determines the ranks:
       r_A = ¼·¼ + ¾·(r_C/2 + r_D)
       r_B = ¼·¼ + ¾·r_A
       r_C = ¼·¼ + ¾·r_B
       r_D = ¼·¼ + ¾·r_C/2
     • Since this is fairly complicated, Google uses an iterative approximation to calculate the rank.
     • Note that the sum of all ranks is 1.

  15. a sample program
     • Imagine we start, in our previous web, from a situation where all pages have a rank of .25.
     • Then we run updates of the rank according to our formula, say 1000 times.
     • We find
       • r_A is 0.287151702786378
       • r_B is 0.277863777089783
       • r_C is 0.270897832817337
       • r_D is 0.164086687306502
       • sum is 1

  16. # start every page at rank .25
      my $r_A=.25; my $r_B=.25; my $r_C=.25; my $r_D=.25;
      my $count=0;
      # iterate the page rank equations of slide 14, 1000 times
      while ($count < 1000) {
        $r_A = 1/16 + 3/4 * ($r_C / 2 + $r_D);
        $r_B = 1/16 + 3/4 * $r_A;
        $r_C = 1/16 + 3/4 * $r_B;
        $r_D = 1/16 + 3/4 * $r_C / 2;
        $count++;
      }
      print "r_A is $r_A, r_B is $r_B, r_C is $r_C, r_D is $r_D\n";
      my $sum = $r_A + $r_B + $r_C + $r_D;
      print "sum is $sum\n";

  17. http://openlib.org/home/krichel
      Please shut down the computers when you are done. Thank you for your attention!
