Web Search Engines

Web Search Engines PowerPoint PPT Presentation


  • 180 Views
  • Updated On :
  • Presentation posted in: Internet / Web

Do a search on any given day = 33 M. Have used Internet to search = 85% //www.pewinternet. ... Search on the Web. Corpus:The publicly accessible Web: static ...

Download Presentation

Web Search Engines

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Slide 1:Search Engines & Question Answering

Giuseppe Attardi Universitŕ di Pisa

Slide 2:Topics

Web Search Search engines Architecture Crawling: parallel/distributed, focused Link analysis (Google PageRank) Scaling

Source: Jupiter Communications, 2000

Slide 3:Top Online Activities

Slide 4:Pew Study (US users July 2002)

Total Internet users = 111 M Do a search on any given day = 33 M Have used Internet to search = 85% //www.pewinternet.org/reports/toc.asp?Report=64

Slide 5:Search on the Web

Corpus:The publicly accessible Web: static + dynamic Goal: Retrieve high quality results relevant to the user’s need (not docs!) Need Informational – want to learn about something (~40%) Navigational – want to go to that page (~25%) Transactional – want to do something (web-mediated) (~35%) Access a service Downloads Shop Gray areas Find a good hub Exploratory search “see what’s there” Low hemoglobin United Airlines Car rental Finland

Slide 6:Results

Static pages (documents) text, mp3, images, video, ... Dynamic pages = generated on request data base access “the invisible web” proprietary content, etc.

Slide 7:Terminology

http://www.cism.it/cism/hotels_2001.htm Host name Page name Access method URL = Universal Resource Locator

Slide 8:Scale

Immense amount of content 2-10B static pages, doubling every 8-12 months Lexicon Size: 10s-100s of millions of words Authors galore (1 in 4 hosts run a web server) http://www.netcraft.com/Survey

Slide 9:Diversity

Languages/Encodings Hundreds (thousands ?) of languages, W3C encodings: 55 (Jul01) [W3C01] Home pages (1997): English 82%, Next 15: 13% [Babe97] Google (mid 2001): English: 53%, JGCFSKRIP: 30% Document & query topic Popular Query Topics (from 1 million Google queries, Apr 2000)

Slide 10:Rate of change

[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999 Mathematically, what does this seem to be?

Slide 11:Web idiosyncrasies

Distributed authorship Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods … Not all have the purest motives in providing high-quality information - commercial motives drive “spamming” - 100s of millions of pages. The open web is largely a marketing tool. IBM’s home page does not contain computer.

Slide 12:Other characteristics

Significant duplication Syntactic - 30%-40% (near) duplicates [Brod97, Shiv99b] Semantic - ??? High linkage ~ 8 links/page in the average Complex graph topology Not a small world; bow-tie structure [Brod00] More on these corpus characteristics later how do we measure them?

Slide 13:Web search users

Ill-defined queries Short AV 2001: 2.54 terms avg, 80% < 3 words) Imprecise terms Sub-optimal syntax (80% queries without operator) Low effort Wide variance in Needs Expectations Knowledge Bandwidth Specific behavior 85% look over one result screen only (mostly above the fold) 78% of queries are not modified (one query/session) Follow links – “the scent of information” ...

Slide 14:Evolution of search engines

First generation -- use only “on page”, text data Word frequency, language Second generation -- use off-page, web-specific data Link (or connectivity) analysis Click-through data (What results people click on) Anchor-text (How people refer to this page) Third generation -- answer “the need behind the query” Semantic analysis -- what is this about? Focus on user need, rather than on query Context determination Helping the user Integration of search and text analysis 1995-1997 AV, Excite, Lycos, etc From 1998. Made popular by Google but everyone now Still experimental

Slide 15:Third generation search engine: answering “the need behind the query”

Query language determination Different ranking (if query Japanese do not return English) Hard & soft matches Personalities (triggered on names) Cities (travel info, maps) Medical info (triggered on names and/or results) Stock quotes, news (triggered on stock symbol) Company info, … Integration of Search and Text Analysis

Slide 16:Answering “the need behind the query” Context determination

Context determination spatial (user location/target location) query stream (previous queries) personal (user profile) explicit (vertical search, family friendly) implicit (use AltaVista from AltaVista France) Context use Result restriction Ranking modulation

Slide 17:The spatial context - geo-search

Two aspects Geo-coding encode geographic coordinates to make search effective Geo-parsing the process of identifying geographic context. Geo-coding Geometrical hierarchy (squares) Natural hierarchy (country, state, county, city, zip-codes, etc) Geo-parsing Pages (infer from phone nos, zip, etc). About 10% feasible. Queries (use dictionary of place names) Users From IP data Mobile phones In its infancy, many issues (display size, privacy, etc)

Slide 18:AV barry bonds

Slide 19:Lycos palo alto

Slide 20:Geo-search example - Northern Light (Now Divine Inc)

Slide 21:Helping the user

UI spell checking query refinement query suggestion context transfer …

Slide 22:Context sensitive spell check

Crawl Control

Slide 23:Search Engine Architecture

Crawlers Ranking Page Repository Query Engine Link Analysis Text Structure Document Store Queries Results Snippet Extraction Indexer

Slide 24:Terms

Crawler Crawler control Indexes – text, structure, utility Page repository Indexer Collection analysis module Query engine Ranking module

Slide 25:Repository

“Hidden Treasures”

Slide 26:Storage

The page repository is a scalable storage system for web pages Allows the Crawler to store pages Allows the Indexer and Collection Analysis to retrieve them Similar to other data storage systems – DB or file systems Does not have to provide some of the other systems’ features: transactions, logging, directory

Slide 27:Storage Issues

Scalability and seamless load distribution Dual access modes Random access (used by the query engine for cached pages) Streaming access (used by the Indexer and Collection Analysis) Large bulk update – reclaim old space, avoid access/update conflicts Obsolete pages - remove pages no longer on the web

Slide 28:Designing a Distributed Web Repository

Repository designed to work over a cluster of interconnected nodes Page distribution across nodes Physical organization within a node Update strategy

Slide 29:Page Distribution

How to choose a node to store a page Uniform distribution – any page can be sent to any node Hash distribution policy – hash page ID space into node ID space

Slide 30:Organization Within a Node

Several operations required Add / remove a page High speed streaming Random page access Hashed organization Treat each disk as a hash bucket Assign according to a page’s ID Log organization Treat the disk as one file, and add the page at the end Support random access using a B-tree Hybrid Hash map a page to an extent and use log structure within an extent.

Slide 31:Distribution Performance

Slide 32:Update Strategies

Updates are generated by the crawler Several characteristics Time in which the crawl occurs and the repository receives information Whether the crawl’s information replaces the entire database or modifies parts of it

Slide 33:Batch vs. Steady

Batch mode Periodically executed Allocated a certain amount of time Steady mode Run all the time Always send results back to the repository

Slide 34:Partial vs. Complete Crawls

A batch mode crawler can Do a complete crawl every run, and replace entire collection Recrawl only a specific subset, and apply updates to the existing collection – partial crawl The repository can implement In place update Quickly refresh pages Shadowing, update as another stage Avoid refresh-access conflicts

Slide 35:Partial vs. Complete Crawls

Shadowing resolves the conflicts between updates and read for the queries Batch mode suits well with shadowing Steady crawler suits with in place updates

Slide 36:Indexing

Slide 37: The Indexer Module

Creates Two indexes : Text (content) index : Uses “Traditional” indexing methods like Inverted Indexing Structure(Links( index : Uses a directed graph of pages and links. Sometimes also creates an inverted graph ???? ?????? indexing ????? ?????????. ???? ?? ?????? ????. ??? ????? : (??? ????, ???? ????) ????? ??????? ????? ??? ???? ?? ???????. ??? ? ???? ????? inverted graph ???? ???? ?? ????? ?? ??, ??? ?? ?? ?? ????? ???. ??? ?? ????? ?? ????? ????? ???? “????” ???? ?????? indexing ????? ?????????. ???? ?? ?????? ????. ??? ????? : (??? ????, ???? ????) ????? ??????? ????? ??? ???? ?? ???????. ??? ? ???? ????? inverted graph ???? ???? ?? ????? ?? ??, ??? ?? ?? ?? ????? ???. ??? ?? ????? ?? ????? ????? ???? “????”

Slide 38:The Link Analysis Module

Uses the 2 basic indexes created by the indexer module in order to assemble “Utility Indexes” e.g. : A site index. ???? ?????? ?? ????? ???? ?????? ????? ???? ???? ?????? ?? ???? ????? ????. ???? : ??? ??? ????? ?????? ???? (?????? ?????? ?? postings, ???? ???? ?? ???????, ?????????).???? ?????? ?? ????? ???? ?????? ????? ???? ???? ?????? ?? ???? ????? ????. ???? : ??? ??? ????? ?????? ???? (?????? ?????? ?? postings, ???? ???? ?? ???????, ?????????).

Slide 39:Inverted Index

A Set of inverted lists, one per each index term (word) Inverted list of a term: A sorted list of locations in which the term appeared. Posting: A pair (w,l) where w is word and l is one of its locations Lexicon: Holds all index’s terms with statistics about the term (not the posting) . .???? ????? ?? ???? ???? ?????? ???? “??? <b> “ ?? “??? <h1>” . ???? ???? ??????? ?? ?term ??? ?????. ??? ?????? ??? ????? ????? ?????? postings, ???? ?postings ??????? ???????.. .???? ????? ?? ???? ???? ?????? ???? “??? <b> “ ?? “??? <h1>” . ???? ???? ??????? ?? ?term ??? ?????. ??? ?????? ??? ????? ????? ?????? postings, ???? ?postings ??????? ???????.

Slide 40:Challenges

Index build must be: Fast Economic (unlike traditional index buildings) Incremental Indexing must be supported Storage: compression vs. speed .????????? ?????? ??????? ????? ?? ?????? ??? ????? ???? ?????. ???? ?????? ?????? ????, ?? ????? ?? ?????. ?? ?? ???? ???? ??? ?? ???? ?????? ??? ??????. ?? ????? ?????? ?? ?????? ?????. ???? : ??? ??? ????? ?????? ?? ???????? ?? ??? ???? ?????? ? .????????? ?????? ??????? ????? ?? ?????? ??? ????? ???? ?????. ???? ?????? ?????? ????, ?? ????? ?? ?????. ?? ?? ???? ???? ??? ?? ???? ?????? ??? ??????. ?? ????? ?????? ?? ?????? ?????. ???? : ??? ??? ????? ?????? ?? ???????? ?? ??? ???? ?????? ?

Slide 41:Index Partitioning

A distributed text indexing can be done by: Local inverted file (IFL) Each nodes contain disjoint random pages Query is broadcasted Result is the joined query answers Global inverted file (IFG) Each node is responsible only for a subset of terms in the collection Query sent only to the apropriate node

Slide 42:Indexing, Conclusion

Web pages indexing is complicated due to it’s scale (millions of pages, hundreds of gigabytes) Challenges: Incremental indexing and personalization ????? ???? ????? ?? ???? … ??????????? : ????? ?????? ????? ??????? ????? ?????? ????? ?????? ??? (???? ????? ???? ???????????? ????? ?????? ????? ?????? ???).????? ???? ????? ?? ???? … ??????????? : ????? ?????? ????? ??????? ????? ?????? ????? ?????? ??? (???? ????? ???? ???????????? ????? ?????? ????? ?????? ???).

Slide 43:Scaling

Slide 44:Scaling

Google (Nov 2002): Number of pages: 3 billion Refresh interval: 1 month (1200 pag/sec) Queries/day: 150 million = 1700 q/s Avg page size:10KB Avg query size: 40 B Avg result size: 5 KB Avg links/page: 8

Slide 45:Size of Dataset

Total raw HTML data size: 3G x 10 KB = 30 TB Inverted index ~= corpus = 30 TB Using compression 3:1 20 TB data on disk

Slide 46:Single copy of index

Index (10 TB) / (100 GB per disk) = 100 disk Document (10 TB) / (100 GB per disk) = 100 disk

Slide 47:Query Load

1700 queries/sec Rule of thumb: 20 q/s per CPU 85 clusters to answer queries Cluster: 100 machine Total = 85 x 100 = 8500 Document server Snippet search: 1000 snippet/s (1700 * 10 / 1000) * 100 = 1700

Slide 48:Limits

Redirector: 4000 req/sec Bandwidth: 1100 req/sec Server: 22 q/s each Cluster: 50 nodes = 1100 q/s = 95 million q/day

Slide 49:Scaling the Index

Slide 50:Pooled Shard Architecture

Slide 51:Replicated Index Architecture

Slide 52:Index Replication

100 Mb/s bandwidth 20 TB x 8 / 100 20 TB require one full day

Slide 53:Ranking

Slide 54:First generation ranking

Extended Boolean model Matches: exact, prefix, phrase,… Operators: AND, OR, AND NOT, NEAR, … Fields: TITLE:, URL:, HOST:,… AND is somewhat easier to implement, maybe preferable as default for short queries Ranking TF like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language

Slide 55:Second generation search engine

Ranking -- use off-page, web-specific data Link (or connectivity) analysis Click-through data (What results people click on) Anchor-text (How people refer to this page) Crawling Algorithms to create the best possible corpus

Slide 56:Connectivity analysis

Idea: mine hyperlink information in the Web Assumptions: Links often connect related pages A link between pages is a recommendation “people vote with their links”

Slide 57:Citation Analysis

Citation frequency Co-citation coupling frequency Cocitations with a given author measures “impact” Cocitation analysis [Mcca90] Bibliographic coupling frequency Articles that co-cite the same articles are related Citation indexing Who is a given author cited by? (Garfield [Garf72]) Pinsker and Narin

Slide 58:Query-independent ordering

First generation: using link counts as simple measures of popularity Two basic suggestions: Undirected popularity: Each page gets a score = the number of in-links plus the number of out-links (3+2=5) Directed popularity: Score of a page = number of its in-links (3)

Slide 59:Query processing

First retrieve all pages meeting the text query (say venture capital) Order these by their link popularity (either variant on the previous page)

Slide 60:Spamming simple popularity

Exercise: How do you spam each of the following heuristics so your page gets a high score? Each page gets a score = the number of in-links plus the number of out-links Score of a page = number of its in-links

Slide 61:PageRank scoring

Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably “In the steady state” each page has a long-term visit rate - use this as the page’s score

Slide 62:Not quite enough

The web is full of dead-ends Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates. ??

Slide 63:Teleporting

At each step, with probability 10%, jump to a random web page With remaining probability (90%), go out on a random link If no out-link, stay put in this case

Slide 64:Result of teleporting

Now cannot get stuck locally There is a long-term rate at which any page is visited (not obvious, will show this) How do we compute this visit rate?

Slide 65:PageRank

Tries to capture the notion of “Importance of a page” Uses Backlinks for ranking Avoids trivial spamming: Distributes pages’ “voting power” among the pages they are linking to “Important” page linking to a page will raise it’s rank more the “Not Important” one * * ?? ????? ???? ???? ?????? ????, ?? ?????. * ???? ?????? ??? ???? ?? ??? ?????? ????? ??? ????? ??. * ?? ???? ????? ?? ??, ?? ???? ??? ???? ??? ????? ?? ??.* * ?? ????? ???? ???? ?????? ????, ?? ?????. * ???? ?????? ??? ???? ?? ??? ?????? ????? ??? ????? ??. * ?? ???? ????? ?? ??, ?? ???? ??? ???? ??? ????? ?? ??.

Slide 66:Simple PageRank

Given by : Where B(i) : set of pages links to i N(j) : number of outgoing links from j Well defined if link graph is strongly connected Based on “Random Surfer Model” - Rank of page equals to the probability of being in this page ?????? ??? ?? ?? ?? ???? ??? ????? ?? ??? ?? ???? ????? ??? ????? ???????. ???? ???????? ?????? : ??? ?????? ?????. ?? ?? ???? ????? ??? ?? ??? ?? ?? … ???? ????? ?random walk ?? ??? ????? - ?? ???????? ?????? ???? ??? ??? ???? ???? ???, ???? ??? ???? ??? ???? ???? ??? ?????. ????? ???? ????? ?? ????. ????? ??? ????? ???? ?????? ?transpose. ?????? ??? ?? ?? ?? ???? ??? ????? ?? ??? ?? ???? ????? ??? ????? ???????. ???? ???????? ?????? : ??? ?????? ?????. ?? ?? ???? ????? ??? ?? ??? ?? ?? … ???? ????? ?random walk ?? ??? ????? - ?? ???????? ?????? ???? ??? ??? ???? ???? ???, ???? ??? ???? ??? ???? ???? ??? ?????. ????? ???? ????? ?? ????. ????? ??? ????? ???? ?????? ?transpose.

Slide 67:Computation Of PageRank (1)

??? ????? : ???? ?? ??? ?????? ??????? ?? ???? ?????? ???? ???? ?????? ?? ?? (????? ???? ?? 1) ?? ?????? ??? ?????? ???????? ???? ???? ???????? ?????. ?? ????? ???? ?????. ??? ????? : ???? ?? ??? ?????? ??????? ?? ???? ?????? ???? ???? ?????? ?? ?? (????? ???? ?? 1) ?? ?????? ??? ?????? ???????? ???? ???? ???????? ?????. ?? ????? ???? ?????.

Slide 68:Given a matrix A, an eigenvalue c and the corresponding eigenvector v is defined if Av = cv Hence r is eigenvector of At for eigenvalue “1” If G is strongly connected then r is unique

Computation of PageRank (2) ???? ????? ????? ???? ??????? ????? DB. ??? ?? ????? ?? Hub & Authorities . ????? ????? ? 100,000 ?????? ????? ????? ????? trawling. * HITS ???? ?? ?? : ??? ??? ???? ???? ?”? ?? ? ?Authority ???? ???? ?Hub ??? ?? ?? ???? Hubs ? Authorities ????? ???? ????. * ??? ????????? ?? ????? ? alta vista. ?? ??????? ?? ?HITS ????? ??.???? ????? ????? ???? ??????? ????? DB. ??? ?? ????? ?? Hub & Authorities . ????? ????? ? 100,000 ?????? ????? ????? ????? trawling. * HITS ???? ?? ?? : ??? ??? ???? ???? ?”? ?? ? ?Authority ???? ???? ?Hub ??? ?? ?? ???? Hubs ? Authorities ????? ???? ????. * ??? ????????? ?? ????? ? alta vista. ?? ??????? ?? ?HITS ????? ??.

Slide 69:Computation of PageRank (3)

Simple PageRank can be computed by: ?? ????? ??”? ?????? ??? ??? ???? ? 1000 ?????????? ????? ??”? ?????? ??? ??? ???? ? 1000 ????????

Slide 70:PageRank Example

2 ???? ?? ??? ??? ? 1 ?? 3. 1 ???? ?????? (??? ? : 2,3,5) 5 ???? ?? ? 4. ??? ????? ?????? ??? 1 (???? ???????? ????? ????? ????? ???? ????? 1)2 ???? ?? ??? ??? ? 1 ?? 3. 1 ???? ?????? (??? ? : 2,3,5) 5 ???? ?? ? 4. ??? ????? ?????? ??? 1 (???? ???????? ????? ????? ????? ???? ????? 1)

Slide 71:Practical PageRank: Problem

Web is not a strongly connected graph. It contains: “Rank Sinks”: cluster of pages without outgoing links. Pages outside cluster will be ranked 0. “Rank Leaks”: a page without outgoing links. All pages will be ranked 0. ??? ???? ???? ????? ????????? ???? ???? ?? ????? ? 1 ???? ???? ????? Rank Lick ?????? ??, ?? ??? ?? ????? ???? ???? ?????? ????? ????? ???? ?????? ???? ????? ?0. ???????? ???? ???? ????? ????? ???? ???.??? ???? ???? ????? ????????? ???? ???? ?? ????? ? 1 ???? ???? ????? Rank Lick ?????? ??, ?? ??? ?? ????? ???? ???? ?????? ????? ????? ???? ?????? ???? ????? ?0. ???????? ???? ???? ????? ????? ???? ???.

Slide 72:Practical PageRank: Solution

Remove all Page Leaks Add decay factor d to Simple PageRank Based on “Board Surfer Model” ?? ?????? ????? ?? ?? ????????.???? ????? ??? ??? ?????? ???? ??? ?????? ?????. ??? ?? ???? ?????? ????? ??? ???? ?????? ???? ????? ???? ???? ???? ?????? ????? ??? ???? ?? ??????. ???? ????? ????? ?? d= 1 ??? ???? ???? ?? ??? ?? ????? ????????? ?????? ??????. ?? ?????? ????? ?? ?? ????????.???? ????? ??? ??? ?????? ???? ??? ?????? ?????. ??? ?? ???? ?????? ????? ??? ???? ?????? ???? ????? ???? ???? ???? ?????? ????? ??? ???? ?? ??????. ???? ????? ????? ?? d= 1 ??? ???? ???? ?? ??? ?? ????? ????????? ?????? ??????.

Slide 73:HITS: Hypertext Induced Topic Search

A query dependent technique Produces two scores: Authority: A most likely to be relevant page to a given query Hub: Points to many Authorities Contains two part: Identifying the focused subgraph Link analysis ?????? ?PageRank ???? ?????? ?????? ????? ?????? ???? ????? ????? ?????? ????????? ????? ???? ??????? ??????? ?? IR/ Hub : ?????? ?? ????? ?????, Authority : ????. ??? ?????? ?? ????? ???? ?????. ?? mutually reinforcement ???? ???? ???? ???: ??? ????? ???? authorities ? hubs ??????. ?????? ?PageRank ???? ?????? ?????? ????? ?????? ???? ????? ????? ?????? ????????? ????? ???? ??????? ??????? ?? IR/ Hub : ?????? ?? ????? ?????, Authority : ????. ??? ?????? ?? ????? ???? ?????. ?? mutually reinforcement ???? ???? ???? ???: ??? ????? ???? authorities ? hubs ??????.

Slide 74:HITS: Identifying The Focused Subgraph

Subgraph creation from t-sized page set: (d reduces the influence of extremely popular pages like yahoo.com) ?? ????? ?? d ???? ?????? ?? ???? ???????? ?? ??? ?????? ??? ???? ??????? ??? ???? ?? ????? ??????? ?????? ?? ?????? (?? ?????? ???? ?????? ???) ?????? S ????? ????? ????? ? Authorities & hubs.?? ????? ?? d ???? ?????? ?? ???? ???????? ?? ??? ?????? ??? ???? ??????? ??? ???? ?? ????? ??????? ?????? ?? ?????? (?? ?????? ???? ?????? ???) ?????? S ????? ????? ????? ? Authorities & hubs.

Slide 75:HITS: Link Analysis

Calculates Authorities & Hubs scores (ai & hi) for each page in S

Slide 76:HITS:Link Analysis Computation

Eigenvectors computation can be used by: Where a: Vector of Authorities’ scores h: Vector of Hubs’ scores A: Adjacency matrix in which ai,j = 1 if points to j

Slide 77:Markov chains

A Markov chain consists of n states, plus an n?n transition probability matrix P At each step, we are in exactly one of the states For 1 ? i,j ? n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i Pii>0 is OK.

Slide 78:Markov chains

Clearly, for all i, Markov chains are abstractions of random walks Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case:

Slide 79:Ergodic Markov chains

A Markov chain is ergodic if you have a path from any state to any other you can be in any state at every time step, with non-zero probability Not ergodic (even/ odd).

Slide 80:Ergodic Markov chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state Steady-state distribution Over a long time-period, we visit each state in proportion to this rate It doesn’t matter where we start

Slide 81:Probability vectors

A probability (row) vector x = (x1, … xn) tells us where the walk is at any point E.g., (000…1…000) means we’re in state i i n 1 More generally, the vector x = (x1, … xn) means the walk is in state i with probability xi

Slide 82:Change in probability vector

If the probability vector is x = (x1, … xn) at this step, what is it at the next step? Recall that row i of the transition prob. Matrix P tells us where we go next from state i So from x, our next state is distributed as xP

Slide 83:Computing the visit rate

The steady state looks like a vector of probabilities a = (a1, … an): ai is the probability that we are in state i 1 2 3/4 1/4 3/4 1/4 For this example, a1=1/4 and a2=3/4

Slide 84:How do we compute this vector?

Let a = (a1, … an) denote the row vector of steady-state probabilities If we our current position is described by a, then the next step is distributed as aP But a is the steady state, so a=aP Solving this matrix equation gives us a So a is the (left) eigenvector for P (Corresponds to the “principal” eigenvector of P with the largest eigenvalue)

Slide 85:One way of computing a

Recall, regardless of where we start, we eventually reach the steady state a Start with any distribution (say x=(10…0)) After one step, we’re at xP after two steps at xP2 , then xP3 and so on “Eventually” means for “large” k, xPk = a Algorithm: multiply x by increasing powers of P until the product looks stable

Slide 86:Lempel: Salsa

By applying ergodic theorem, Lempel has proved that: ai is proportional to number of incoming links

Slide 87:Pagerank summary

Preprocessing: Given graph of links, build matrix P From it compute a The entry ai is a number between 0 and 1: the PageRank of page i Query processing: Retrieve pages meeting query Rank them by their PageRank Order is query-independent

Slide 88:The reality

Pagerank is used in Google, but so are many other clever heuristics

  • Login