
Search Engines

Indexing

Page Ranking


The WWW

[Diagram: the WWW as a graph of websites (WebSite1 ... WebSite5), each containing several pages (Page 1, Page 2, ...), with hyperlinks connecting pages within and across sites]


The Web Search Problem

[Diagram: a query flows into the search engine, which returns a response]

  • Query: a set of keywords or a phrase
  • Response: a list of documents (pages) containing the keywords or phrase
  • Important requirements:
    • The response must be quick
    • The documents must be relevant


Tasks of a Search Engine

  • Discover documents around the WWW
    • Done by web crawlers (spiders, bots, wanderers, etc.), based on graph-searching algorithms (BFS or DFS?); a minimal sketch follows below
  • Search for keywords in documents
    • For obvious performance reasons, this cannot be done by string searching after every query!
    • Solution: indexing (see the Web Search Engine Architecture slide)
  • Filter/rank documents according to their relevance
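
To make the crawling task concrete, here is a minimal BFS-style crawler sketch in Java. The fetchPage and extractLinks helpers are hypothetical placeholders for real HTTP download and HTML link extraction, which the slides do not specify:

    import java.util.*;

    public class Crawler {
        // Hypothetical helpers standing in for real HTTP download and HTML link extraction
        String fetchPage(String url) { return ""; }
        List<String> extractLinks(String html) { return new ArrayList<>(); }

        // BFS over the page graph, starting from a seed URL
        public Set<String> crawl(String seed, int maxPages) {
            Set<String> visited = new HashSet<>();
            Queue<String> frontier = new ArrayDeque<>();
            frontier.add(seed);
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;       // already crawled
                String html = fetchPage(url);          // download the page
                for (String link : extractLinks(html)) // discover outgoing links
                    if (!visited.contains(link))
                        frontier.add(link);            // BFS: enqueue newly found pages
            }
            return visited;
        }
    }

Replacing the queue with a stack would turn this into DFS; real crawlers additionally respect robots.txt, politeness delays, and revisit policies.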


Web Search Engine Architecture

[Diagram: the WebCrawler fills the Page Repository; Text Analysis builds the Text Index, and Link Analysis computes PageRank; at query time, the Ranker combines the Text Index and PageRank to answer the Query]


Outline

  • Data structures and algorithms for indexing the web
  • The PageRank algorithm


Indexing the Web

  • Once a crawl has collected pages, their text is compressed and stored in a repository
  • Each URL is mapped to a unique ID
  • A lexicon (a sorted list of all words) is created
  • A hit list ("inverted index") is created for every word in the lexicon
  • Terminology:
    • Forward index: document -> list of contained words
    • Inverted index: word -> list of containing documents


Simple Inverted Indexing: Words -> PageIDs

[Table: each word in the lexicon mapped to the list of page IDs containing it, e.g. "cat" -> 1, 3; "dog" -> 3]


Using Simple Inverted Indexes for Queries

  • Simple indexes support searching for keywords or sets of keywords (a minimal sketch follows after this slide)
    • Example:
      • Search "cat" => found in pages 1 and 3
      • Search "cat" AND "dog" => found in page 3
  • Simple indexes cannot support phrase queries:
    • Example:
      • Search "cat sat" => found in pages 1 and 3, but only page 1 actually contains the phrase "cat sat"
    • Solution: the index also stores the in-page location
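
A minimal sketch of a simple inverted index and a keyword AND query, using the toy data from the example above ("cat" in pages 1 and 3, "dog" in page 3):

    import java.util.*;

    public class SimpleInvertedIndex {
        // word -> set of pages (docIDs) that contain the word
        private final Map<String, Set<Integer>> index = new HashMap<>();

        public void add(String word, int pageId) {
            index.computeIfAbsent(word, w -> new TreeSet<>()).add(pageId);
        }

        // AND query: intersect the page sets of all the keywords
        public Set<Integer> search(String... words) {
            Set<Integer> result = null;
            for (String w : words) {
                Set<Integer> pages = index.getOrDefault(w, Collections.emptySet());
                if (result == null) result = new TreeSet<>(pages);
                else result.retainAll(pages);
            }
            return result == null ? Collections.emptySet() : result;
        }
    }

Indexing the example pages and calling search("cat", "dog") returns {3}, matching the slide's example.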


Fully Inverted Indexing: Words -> PageIDs + In-Page Locations

[Table: each word mapped to (page, position) pairs, e.g. "cat" -> 1-2, 3-2; "sat" -> 1-3, 3-7]


Using Fully Inverted Indexing for Queries

  • Performing queries for phrases:
    • Search "cat sat"
      • "cat" found at 1-2, 3-2
      • "sat" found at 1-3, 3-7
      • "cat" AND "sat":
        • in page 1, at 1-2 AND 1-3 => distance 1 between the words
        • in page 3, at 3-2 AND 3-7 => distance 5 between the words
        • Using the distance between the words, only page 1 matches the search phrase (a sketch of this check follows below)
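
A sketch of the position-distance check from this example, using a fully inverted (positional) index; the structure mirrors the page-position pairs above:

    import java.util.*;

    public class PositionalIndex {
        // word -> (pageId -> positions of the word inside that page)
        private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

        public void add(String word, int pageId, int position) {
            index.computeIfAbsent(word, w -> new HashMap<>())
                 .computeIfAbsent(pageId, p -> new ArrayList<>())
                 .add(position);
        }

        // Pages where word2 occurs exactly one position after word1 (the phrase "word1 word2")
        public Set<Integer> phrase(String word1, String word2) {
            Set<Integer> result = new TreeSet<>();
            Map<Integer, List<Integer>> occ1 = index.getOrDefault(word1, Collections.emptyMap());
            Map<Integer, List<Integer>> occ2 = index.getOrDefault(word2, Collections.emptyMap());
            for (Map.Entry<Integer, List<Integer>> e : occ1.entrySet()) {
                List<Integer> positions2 = occ2.get(e.getKey());
                if (positions2 == null) continue;       // word2 not in this page
                for (int pos : e.getValue())
                    if (positions2.contains(pos + 1)) { // distance 1 => phrase match
                        result.add(e.getKey());
                        break;
                    }
            }
            return result;
        }
    }

With "cat" at 1-2, 3-2 and "sat" at 1-3, 3-7, phrase("cat", "sat") returns only page 1, as in the example.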


Using Metainformation

If the searched word is part of a title, the document is probably more relevant to the query.


Indexing the Web

  • Once a crawl has collected pages, their text is compressed and stored in a repository
  • Each URL (document) is mapped to a unique ID
  • A lexicon (a sorted list of all words) is created
  • A hit list ("inverted index") is created for every word in the lexicon
    • It records the occurrences of a word in a particular document, including position, font, capitalization, and metainformation (e.g. being part of a title)


Google's Indexing – Step 1

  • Each document is parsed and transformed into a collection of "hit lists" that are put into "barrels", sorted by docID
  • Hit: <wordID, position in doc, font info, hit type>
    • Hit type: plain or fancy
    • Fancy hit: occurs in a URL, title, anchor text, or metatag


Google's Forward Barrels

[Diagram: the forward barrels (Barrel i, Barrel i+1, ...), sorted by docID. Each barrel holds rows of the form
  docID: (wordID, #hits, hit list), (wordID, #hits, hit list), ...
so a document's entry lists, for every word it contains, the number of hits and the hits themselves; the sketch below models this layout]
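
The hit and barrel-row structures from the last two slides can be modeled with a few plain Java records. This is only a structural sketch of the layout; the real hits are reportedly bit-packed into about two bytes each, which is omitted here:

    import java.util.List;

    // Structural sketch of the forward-barrel rows shown above
    record Hit(int position, int fontInfo, boolean fancy) { }    // one occurrence of a word
    record WordHits(int wordId, List<Hit> hits) { }              // wordID, #hits (= hits.size()), hit list
    record ForwardEntry(int docId, List<WordHits> words) { }     // one row: a docID and its per-word hit lists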


Google's Indexing – Step 2

  • Each barrel is then sorted by wordID to create the inverted index. This sorting also creates the lexicon file.
    • Lexicon: <wordID, offset into inverted index>
    • The lexicon is mostly cached in memory (a lookup sketch follows below)
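
A sketch of the lookup path this creates: the in-memory lexicon maps a wordID to a byte offset in the on-disk inverted barrels. The on-disk record layout assumed below (#docs followed by docIDs, each a 4-byte integer) is invented for illustration; the real format also stores the per-document hit lists:

    import java.io.*;
    import java.util.*;

    public class Lexicon {
        // wordID -> byte offset of the word's posting list in the inverted-barrel file
        private final Map<Integer, Long> offsets = new HashMap<>();
        private final RandomAccessFile barrels;

        public Lexicon(Map<Integer, Long> offsets, File barrelFile) throws IOException {
            this.offsets.putAll(offsets);
            this.barrels = new RandomAccessFile(barrelFile, "r");
        }

        // Read the docIDs of a word's posting list (layout assumed: <#docs><docID>*)
        public int[] docsFor(int wordId) throws IOException {
            Long offset = offsets.get(wordId);
            if (offset == null) return new int[0];   // word not in the lexicon
            barrels.seek(offset);
            int nDocs = barrels.readInt();
            int[] docs = new int[nDocs];
            for (int i = 0; i < nDocs; i++) docs[i] = barrels.readInt();
            return docs;
        }
    }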


Google's Inverted Index

[Diagram: the lexicon (in-memory) maps each wordID to an offset into the postings ("inverted barrels", on disk). The barrels (Barrel i, Barrel i+1, ...) are sorted by wordID; each entry has the form
  wordID, #docs: (docID, #hits, hit list), (docID, #hits, hit list), ...]


Outline

  • Data structures and algorithms for indexing the web

  • The PageRank algorithm


Motivation

  • Efficient matching: indexing helps find pages that contain the search phrase, giving priority to pages that contain it in titles or other privileged positions. Still, there can be a huge number of such matches!
  • Also needed for an effective search: a measure of the importance of the pages that matched the search criteria
  • Problem: assessing the importance of web pages without human evaluation of the content
    • First solution: the PageRank algorithm


PageRank History

  • Proposed in 1998 by two PhD students at Stanford, Sergey Brin and Lawrence Page:
    • "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
    • "The PageRank citation ranking: Bringing order to the web", http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
  • The algorithm of the first generation of the Google search engine


PageRank Principles

  • Measure the importance of a Web page based on the link structure alone
  • The importance of a page is given by the number of pages linking to it (the number of "votes" received) as well as by their importance (the importance of the voters)
  • If a page contains links to l pages, its contribution to the importance of each of them is a fraction 1/l of its own importance (it "splits" its votes)


PageRank Principles – Example

[Diagram: P1, with Importance(P1)=100 and Outdegree(P1)=2, contributes 50 through each of its two links; P2, with Importance(P2)=9 and Outdegree(P2)=3, contributes 3 through each of its three links; P3 receives 50 from P1 and 3 from P2, so Importance(P3)=53; P4 receives only 3 from P2, so Importance(P4)=3]


Issues with Computing PageRank

  • The simplified PageRank computation principles presented before cannot be applied directly:
    • Pages without inlinks: what should their PR value be? (it cannot be zero, otherwise nothing gets propagated)
    • Cycles in the page graph: we cannot go round the cycle forever, always increasing the scores
  • The solution to this problem can be formulated from either of two possible viewpoints on PageRank:
    • The algebraic point of view
    • The probabilistic point of view


PageRank – The Probabilistic Point of View

  • The Random Surfer Model
  • Since the importance of a web page P is measured by its popularity (how many incoming links it has), we can view the importance of the page as the probability that a random surfer, who starts browsing at an arbitrary page and follows hyperlinks, arrives at page P
  • If the random surfer is at a page with k outlinks, he moves next to any one of the k linked pages with probability 1/k


The Random Surfer Model

  • Initial data:
    • The page graph contains N pages Pi, i = 1..N
    • We denote by Bi the set of all pages Pj that have links to Pi
    • We denote by lj the outdegree of page Pj (the number of its outgoing links)
  • Initially, each page Pi has probability 1/N of being chosen as the start page. This is the initial probability (at moment 0) of the page being reached: PR(i, 0) = 1/N


The Random Surfer Model

  • Updating probabilities:
    • At a moment t, each page Pi has a probability PR(i, t) of being reached
    • At the next moment t', the probability PR(i, t') of page Pi is the sum of the probabilities of the pages linking to it, weighted by their outdegrees (see the formula on the next slide)


The Random Surfer Model

  • Updating probabilities:

    PR(i, t') = Σ (Pj in Bi) PR(j, t) / lj

[Diagram: each page Pj with Outdegree(Pj) = lj that links to Pi passes a share PR(j, t)/lj of its probability to Pi]


The Random Surfer Model

  • Convergence:
    • The values PR(i, t) converge, as t → ∞, to values PR(i)
    • The fact that PR converges to a unique probability vector (the stationary distribution) can be proved mathematically (see: stochastic matrices, eigenvectors, and the power method for finding eigenvectors)


Example

N = 4
l1 = 3, l2 = 2, l3 = 1, l4 = 2

Initially (t=0):
PR(1,0) = 1/4
PR(2,0) = 1/4
PR(3,0) = 1/4
PR(4,0) = 1/4

[Diagram: the page graph. P1 links to P2, P3, P4 (edges labeled 1/3); P2 links to P3 and P4 (edges labeled 1/2); P3 links to P1 (edge labeled 1); P4 links to P1 and P3 (edges labeled 1/2). Each node is annotated with its current PR value, 1/4]


Example (cont.)

t=1:
PR(1,1) = 1*PR(3,0) + 1/2*PR(4,0) = 1*0.25 + 1/2*0.25 = 0.37
PR(2,1) = 1/3*PR(1,0) = 1/3*0.25 = 0.08
PR(3,1) = 1/3*PR(1,0) + 1/2*PR(2,0) + 1/2*PR(4,0) = 1/3*0.25 + 1/2*0.25 + 1/2*0.25 = 0.33
PR(4,1) = 1/3*PR(1,0) + 1/2*PR(2,0) = 1/3*0.25 + 1/2*0.25 = 0.20

[Diagram: the same page graph, with the t=0 values PR(1,0)=0.25, PR(2,0)=0.25, PR(3,0)=0.25, PR(4,0)=0.25 at the nodes]


Example (cont.)

t=2:
PR(1,2) = 1*PR(3,1) + 1/2*PR(4,1) = 1*0.33 + 1/2*0.20 = 0.43
PR(2,2) = 1/3*PR(1,1) = 1/3*0.37 = 0.12
PR(3,2) = 1/3*PR(1,1) + 1/2*PR(2,1) + 1/2*PR(4,1) = 1/3*0.37 + 1/2*0.08 + 1/2*0.20 = 0.27
PR(4,2) = 1/3*PR(1,1) + 1/2*PR(2,1) = 1/3*0.37 + 1/2*0.08 = 0.16

[Diagram: the same page graph, with the t=1 values PR(1,1)=0.37, PR(2,1)=0.08, PR(3,1)=0.33, PR(4,1)=0.20 at the nodes]


Example (cont.)

t=3:
PR(1,3) = 1*PR(3,2) + 1/2*PR(4,2) = 1*0.27 + 1/2*0.16 = 0.35
PR(2,3) = 1/3*PR(1,2) = 1/3*0.43 = 0.14
PR(3,3) = 1/3*PR(1,2) + 1/2*PR(2,2) + 1/2*PR(4,2) = 1/3*0.43 + 1/2*0.12 + 1/2*0.16 = 0.29
PR(4,3) = 1/3*PR(1,2) + 1/2*PR(2,2) = 1/3*0.43 + 1/2*0.12 = 0.20

[Diagram: the same page graph, with the t=2 values PR(1,2)=0.43, PR(2,2)=0.12, PR(3,2)=0.27, PR(4,2)=0.16 at the nodes]


Example (cont.)

The values of PR calculated so far:
t=0: [0.25, 0.25, 0.25, 0.25]
t=1: [0.37, 0.08, 0.33, 0.20]
t=2: [0.43, 0.12, 0.27, 0.16]
t=3: [0.35, 0.14, 0.29, 0.20]

Continuing the iterations, we get:
t=4: [0.39, 0.11, 0.29, 0.19]
t=5: [0.39, 0.13, 0.28, 0.19]
t=6: [0.38, 0.13, 0.29, 0.19]
t=7: [0.38, 0.12, 0.29, 0.19]
t=8: [0.38, 0.12, 0.29, 0.19]

The values have converged:
PR(1) = 0.38, PR(2) = 0.12, PR(3) = 0.29, PR(4) = 0.19


Dangling Nodes and Disconnected Components

  • Problems with the initial Random Surfer Model:
    • If the random surfer arrives at a page Pj that has no outlinks (a dangling node), he has nowhere to go. The accumulated importance of Pj "gets lost", since it is not transferred further to any other page
    • If the web consists of several connected components, the random surfer will never reach pages that are in a different connected component than the one of the starting node


Example – The Dangling Node Problem

N = 3
l1 = 2, l2 = 2, l3 = 0

Initially (t=0):
PR(1,0) = 1/3
PR(2,0) = 1/3
PR(3,0) = 1/3

Update rules:
PR(1,t') = 1/2 * PR(2,t)
PR(2,t') = 1/2 * PR(1,t)
PR(3,t') = 1/2 * PR(1,t) + 1/2 * PR(2,t)

[Diagram: P1 and P2 link to each other, and both link to P3; P3 has no outlinks; all edges labeled 1/2]


Example – The Dangling Node Problem (cont.)

Applying the update rules, we get:
t=0: [1/3, 1/3, 1/3]
t=1: [1/6, 1/6, 1/3]
t=2: [1/12, 1/12, 1/6]
t=3: [1/24, 1/24, 1/12]
...

Result: PR(1) = PR(2) = PR(3) = 0!

This result has no meaning as a ranking, so a solution must be found for dangling nodes.


Solution for Dangling Nodes and Disconnected Components

  • The PageRank random surfer model is updated as follows:
    • Most of the time (a fraction d), the surfer follows links from the current page, as in the model before. A page with no outlinks is treated as if it had links to all N pages, so from a dangling node the surfer continues at a random page
    • The rest of the time (a fraction 1-d), the surfer abandons the current page, chooses an arbitrary page from the web, and "teleports" there


Computing PageRank

    PR(i, t') = d * [ Σ (Pj in Bi) PR(j, t)/lj  +  Σ (Pj dangling) PR(j, t)/N ] + (1-d)/N

  • PR(i, t'): the probability of reaching page Pi
  • Σ (Pj in Bi) PR(j, t)/lj: the probability of arriving from a page Pj that has a link to Pi
  • Σ (Pj dangling) PR(j, t)/N: the probability of arriving from a page Pj that has no outlinks
  • (1-d)/N: the probability of arriving through teleporting at a random time
  • d = damping factor (heuristic)


The Damping Factor

  • The damping factor d can have values in [0,1]
  • If d=0: all the surfer's moves are random jumps (teleports); no links are followed
  • If d=1: the surfer makes no teleports; he only follows links, except at dangling nodes
  • The value of d also influences how fast the vector converges to the stationary distribution (the number of iterations needed)
  • Usual value (proposed by Brin and Page): d = 0.85
  • Convergence is then reached in fewer than 100 iterations


Computing PageRank – Implementation

public Map<Vertex, Double> computePageRank(Digraph<Vertex> g) {
    double d = 0.85;                  // damping factor
    int iterations = 100;
    int N = g.getNumberOfNodes();
    List<Vertex> nodes = g.getAllNodes();
    List<Vertex> nodesWithoutOutlinks = g.getNodesWithoutOutlinks();
    Map<Vertex, Double> opr = new HashMap<Vertex, Double>(); // old pageranks
    Map<Vertex, Double> npr = new HashMap<Vertex, Double>(); // new pageranks
    for (Vertex n : nodes) opr.put(n, 1.0 / N); // init pageranks with 1/N
    for (Vertex n : nodes) npr.put(n, 1.0 / N);
    while (iterations > 0) {
        // dp: rank mass accumulated in dangling nodes, redistributed over all N pages
        double dp = 0;
        for (Vertex p : nodesWithoutOutlinks)
            dp = dp + opr.get(p) / N;
        for (Vertex p : nodes) {
            // teleport term plus the damped dangling-node mass,
            // matching G = d*S + (1-d)/N
            double nprp = d * dp + (1 - d) / N;
            // add the damped contributions of the pages linking to p
            for (Vertex ip : g.inboundNeighbors(p))
                nprp = nprp + d * opr.get(ip) / g.outDegree(ip);
            npr.put(p, nprp);
        }
        // swap the maps: the new values become the old ones for the next iteration
        Map<Vertex, Double> temp = opr;
        opr = npr;
        npr = temp;
        iterations = iterations - 1;
    }
    return opr; // after the final swap, opr holds the latest values
}


PageRank – The Algebraic Point of View

  • Initial data:
    • The page graph contains N pages Pi, i = 1..N
    • We denote by Bi the set of all pages Pj that have links to Pi
    • We denote by lj the outdegree of page Pj (the number of its outgoing links)
  • The hyperlink matrix A: a square matrix with rows and columns corresponding to web pages, where A[i,j] = 1/lj if there is a link from j to i, and A[i,j] = 0 if not


Example – The Hyperlink Matrix

For the 4-page graph used before (P1 -> P2, P3, P4; P2 -> P3, P4; P3 -> P1; P4 -> P1, P3):

         1    2    3    4
  1  [   0    0    1   1/2 ]
  2  [  1/3   0    0    0  ]
  3  [  1/3  1/2   0   1/2 ]
  4  [  1/3  1/2   0    0  ]


Properties of the Hyperlink Matrix

  • All entries are nonnegative
  • The entries of a column j sum to 1 if j has outgoing links
  • All entries of a column j are 0 if j has no outgoing links (j is a dangling node)
  • If the web has no dangling nodes, the hyperlink matrix is stochastic


Stochastic Matrices

  • A column stochastic matrix (probability matrix, Markov matrix) is a square matrix of nonnegative real numbers, with each column summing to 1.


Stochastic Matrices

  • The Perron–Frobenius Theorem: every positive column-stochastic matrix A has a unique stationary column vector X (an eigenvector with eigenvalue 1): A*X = X
  • The Power Method Convergence Theorem: let A be a positive column-stochastic matrix of size n*n and X its stationary column vector. Then X can be calculated by the following procedure: initialize the column vector Z with all entries equal to 1/n; the sequence Z, A*Z, A^2*Z, ..., A^k*Z then converges to the vector X


The Google Matrix

  • A = the transition (hyperlink) matrix
  • S = the matrix obtained from A by setting all the elements of each all-zero column to 1/N
  • G = the Google matrix:

    G[i,j] = d*S[i,j] + (1-d)/N

  • Property: the Google matrix is a stochastic matrix (and, for d < 1, all its entries are positive)
  • The stationary vector of G contains the PageRank values (see the power-method sketch below)
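
As a sketch of this algebraic view, the power method applied to an explicitly built dense G; this is fine only for tiny examples like the 4-page graph, since the real web graph is far too large for dense matrices:

    public class GoogleMatrix {
        // Power method on G[i][j] = d*S[i][j] + (1-d)/N, where S is A with
        // every all-zero (dangling) column replaced by 1/N
        public static double[] pageRank(double[][] A, double d, int iterations) {
            int n = A.length;
            double[][] G = new double[n][n];
            for (int j = 0; j < n; j++) {
                double colSum = 0;
                for (int i = 0; i < n; i++) colSum += A[i][j];
                for (int i = 0; i < n; i++) {
                    double s = (colSum == 0) ? 1.0 / n : A[i][j]; // dangling column fix
                    G[i][j] = d * s + (1 - d) / n;
                }
            }
            double[] x = new double[n];
            java.util.Arrays.fill(x, 1.0 / n);        // Z: the uniform start vector
            for (int k = 0; k < iterations; k++) {    // x <- G * x
                double[] y = new double[n];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        y[i] += G[i][j] * x[j];
                x = y;
            }
            return x;                                 // converges to the stationary vector
        }
    }

Running it on the example hyperlink matrix above with d = 0.85 yields the PageRank vector directly, without maintaining any explicit graph structure.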


PageRank and the History of Search Engines

  • PageRank (1998) was the first algorithm to introduce the concept of the "importance of a webpage" and to calculate it without relying on external information
    • A crucial factor in Google's rise
  • Drawbacks:
    • PageRank can be manipulated
    • SEO ("Search Engine Optimisation")


PageRank and the Future of Search Engines

  • 2011: Google Panda
    • Introduces filters that prevent low-quality sites and/or pages from ranking well in the search results
    • Uses human feedback and machine learning algorithms
  • 2012: Google Penguin
    • Decreases the ranking of sites identified as using "black-hat SEO techniques"
  • 2013: Google Hummingbird
    • Judges the context of a query, thereby judging the intent of the person carrying out the search, to determine what they are trying to find out


Other Uses of PageRank

  • Ranking scientific articles according to their citations

  • Ranking streets for predicting human movement and street congestion

  • Automatic summarization – extracting the most relevant sentences from a text


Tool Project #3

  • Optional: Automatic Summarization Tool, based on PageRank
    • The text is represented as a graph of sentences
    • Edges are given by the "similarity" of two sentences (what can be used as a form of "recommendation" or "vote" between sentences?)
    • Apply PageRank (or a modified version, able to cope with undirected, possibly weighted graphs) and take the top x% of sentences to form the abstract
    • http://bigfoot.cs.upt.ro/~ioana/algo/lab_pagerank.html


Bibliography

  • John MacCormick, Nine Algorithms That Changed the Future, Chapters 2 & 3
  • Page, L., Brin, S., Motwani, R., Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
  • David Austin, How Google Finds Your Needle in the Web's Haystack, AMS Feature Column. http://www.ams.org/samplings/feature-column/fcarc-pagerank

