
LIS618 lecture 9 Web retrieval




  1. LIS618 lecture 9 Web retrieval
     Thomas Krichel, 2011-11-21

  2. structure
     • A mini Web theory
     • Google “theory”, mainly page rank

  3. literature
     • Brin and Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, http://infolab.stanford.edu/~backrub/google.html
     • Calishain and Dornfest, “Google hacks”, O'Reilly, 2003
     • Schneider & alii, “How to do everything with Google”, McGraw Hill Osborne, 2004
     • Google web site

  4. web information retrieval
     • We can think of the web as a pile of documents called pages.
     • Some pages are hard to index:
       • PDF documents
       • pictures
       • sound files
     • But a majority of pages are written in HTML:
       • easy to index
       • have a loose structure

  5. Google uses the structure of HTML
     • Google finds the title of the page, i.e. the contents of the <title> element.
     • Google analyzes headings and large font sizes and gives priority weight to terms found there.
     • Most importantly, Google uses the link structure of the web to find important pages.
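A toy indexer can pull out those HTML fields with Python's standard html.parser module. The sketch below is illustrative only (the class name and the choice of fields are assumptions, not Google's code):

```python
from html.parser import HTMLParser

# Toy extractor for the HTML fields a search engine might weight
# more heavily: the <title> text and heading (h1-h3) text.
class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "headings": []}
        self._stack = []          # open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        tag = self._stack[-1]
        if tag == "title":
            self.fields["title"] += data
        elif tag in ("h1", "h2", "h3"):
            self.fields["headings"].append(data.strip())
```

A term found in `fields["title"]` or `fields["headings"]` would then get a higher weight than one found in the body text.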

  6. classic IR and the web
     • In classic information retrieval, every document has the same importance. Documents differ only as to their relevance to a query.
     • In classic information retrieval, a document d is relevant if the query terms appear relatively frequently in d rather than in other documents.
     • But if a web page contains only the words “Bill Clinton sucks” and a picture, it is not a good hit for the query “Bill Clinton”.
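One standard way to make “frequently in d rather than in other documents” precise is tf-idf weighting. A toy sketch, with made-up documents that are not part of the lecture:

```python
import math

# Toy tf-idf: term frequency in the document, discounted by how
# many documents in the collection contain the term at all.
docs = {
    "d1": "bill clinton speech bill clinton policy",
    "d2": "bill of sale",
    "d3": "gardening tips",
}

def tf_idf(term, doc_id):
    words = docs[doc_id].split()
    tf = words.count(term) / len(words)               # frequent here?
    df = sum(1 for d in docs.values() if term in d.split())
    idf = math.log(len(docs) / df)                    # rare elsewhere?
    return tf * idf
```

Under this scoring, "clinton" in d1 scores higher than the very common "bill" in d2, which is the classic-IR notion of relevance the slide describes.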

  7. Google finds important pages
     • The idea is that the documents on the web have different degrees of importance.
     • Google will show the most important pages first.
     • The assumption is that more important pages are likely to be more relevant to any query than non-important pages.

  8. Google’s monkey
     • Imagine that the web has P pages. Each page has its own address (URL).
     • Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in the address of a random page out of those P.
     • Will the monkey visit all pages with equal probability?

  9. page rank
     • The Google page rank of a page is the probability that Google’s monkey will visit the page.
     • The monkey will come frequently to pages that have a lot of links to them.
     • Once he is on an important page, he will likely move on to one of the pages that it links to.
     • The structure of all the links on the entire web reveals the importance of a page.

  10. many page ranks
     • There is an infinite number of ways to calculate the page rank, depending on
       • how likely the monkey is to get bored.
       • the probability with which the bored monkey visits each page.
     • Potentially, there is a page rank for each user of the web. Google tries to observe users and may associate a personal page rank with each.

  11. notation
     • Assume that the monkey gets bored with probability d. If bored, he will visit page p with probability π_p.
     • For any page p, let o_p be the number of outgoing links on p.
     • Let l(p',p) be the number of links from page p' to page p.

  12. page rank formula
     • The page rank of a page p is
       r_p = π_p d + (1-d) ∑_{p'} l(p',p) r_p' / o_p'
     • In words, it is the likelihood that, if bored, the monkey goes to page p, plus the likelihood that he gets there from another page p'. The likelihood of getting there from p' is the likelihood of being on p', times the number of links from p' to p, divided by the number of outgoing links on p'.
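The update on this slide can be sketched in a few lines of Python. This is an illustrative implementation, not Google's: the function name, the iteration count, and the handling of pages without outgoing links (they are simply skipped) are all assumptions.

```python
# Iterative page rank, applying the slide's formula:
#   r_p = pi_p * d + (1 - d) * sum over p' of l(p', p) * r_p' / o_p'
# `links` maps each page to the list of pages it links to.

def page_rank(links, d, pi, iterations=1000):
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}   # uniform starting guess
    for _ in range(iterations):
        new_ranks = {}
        for p in pages:
            # probability of arriving at p by following a link from some p'
            inflow = sum(links[q].count(p) * ranks[q] / len(links[q])
                         for q in pages if links[q])
            new_ranks[p] = pi[p] * d + (1 - d) * inflow
        ranks = new_ranks
    return ranks
```

On the four-page web of the next slide, with d = 1/4 and uniform π, this converges to the ranks listed on slide 15.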

  13. example
     • Let there be a web of four pages A, B, C, D.
     • A links to B.
     • B links to C.
     • C links to A and D.
     • D links to A.
     • Let the probability of getting bored be ¼, and, when bored, let the monkey move to each page with probability ¼.
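The monkey story can also be checked empirically on this little web by simulating him. A minimal Monte-Carlo sketch (the function name, step count, and random seed are arbitrary choices, and the long-run visit frequencies are only approximate):

```python
import random

# Simulate the monkey on the four-page web A->B, B->C, C->{A,D}, D->A,
# with boredom probability 1/4 and a uniform jump when bored.
web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}

def visit_frequencies(steps, seed=0):
    rng = random.Random(seed)
    visits = {p: 0 for p in web}
    page = "A"
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < 0.25:                # the monkey gets bored
            page = rng.choice(sorted(web))     # jump to a random page
        else:
            page = rng.choice(web[page])       # follow a random link
    return {p: visits[p] / steps for p in web}
```

With enough steps, the observed frequencies approach the page ranks computed on the following slides.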

  14. page ranks
     • The following system determines the ranks:
       r_A = ¼·¼ + ¾·(r_C/2 + r_D)
       r_B = ¼·¼ + ¾·r_A
       r_C = ¼·¼ + ¾·r_B
       r_D = ¼·¼ + ¾·r_C/2
     • Since this is fairly complicated, Google uses an iterative approximation to calculate the rank.
     • Note that the sum of all ranks is 1.

  15. a sample program
     • Imagine we start, in our previous web, from a situation where all pages have a rank of .25.
     • Then we run updates of the rank according to our formula, say 1000 times.
     • We find
       • r_A is 0.287151702786378
       • r_B is 0.277863777089783
       • r_C is 0.270897832817337
       • r_D is 0.164086687306502
       • sum is 1

  16. # start every page at rank .25
      my $r_A=.25; my $r_B=.25; my $r_C=.25; my $r_D=.25;
      my $count=0;
      # iterate the page rank equations of slide 14, 1000 times
      while ($count < 1000) {
        $r_A = 1/16 + 3/4 * ($r_C / 2 + $r_D);
        $r_B = 1/16 + 3/4 * $r_A;
        $r_C = 1/16 + 3/4 * $r_B;
        $r_D = 1/16 + 3/4 * $r_C / 2;
        $count++;
      }
      print "r_A is $r_A, r_B is $r_B, r_C is $r_C, r_D is $r_D\n";
      my $sum = $r_A + $r_B + $r_C + $r_D;
      print "sum is $sum\n";

  17. http://openlib.org/home/krichel
      Please shut down the computers when you are done. Thank you for your attention!
