
Page Rank - PowerPoint PPT Presentation



Page Rank. Overview: two dimensional arrays, Monte Carlo algorithms, searching the World Wide Web, big data, page rank. Goal: we will write a program to compute the relevancy of WWW documents based on the static structure of the WWW.


PowerPoint Slideshow about 'Page Rank' - glenda


Presentation Transcript

Overview

  • Two dimensional arrays

  • Monte Carlo algorithms

  • Searching the world wide web

  • Big data

  • Page rank

    Goal: we will write a program to compute the relevancy of WWW documents based on the static structure of the WWW.


Two Dimensional Arrays

  • Significance (a topic on the AP Computer Science A exam)

  • Syntax

  • Example of matrix multiplication

  • Arrays of arrays


Significance of Two Dimensional Arrays

  • Tables; for instance, assignments for each student in a class, quarterly sales for each item in inventory, etc.

  • Matrices and binary relations in mathematics. For example, is there a direct road from city1 in USA to city2 in USA?

  • For our goal in this section, we will need the number of links from doc1 in the WWW to doc2 in the WWW.


Syntax

  • int[][] frequency = new int [26][26];

  • Elements are accessed: frequency[4][7] and not frequency[4,7]

  • Array indices in Java (like C, C++, C#) always begin with 0; in other words, the element with index 1 is the second element of the array.
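
A minimal sketch of the syntax above. The slide does not say what `frequency` counts; the sketch assumes it counts adjacent letter pairs, and the class and method names are made up for illustration.

```java
public class PairFrequency {
    // Counts how often each ordered pair of adjacent letters occurs in the text.
    // Assumption: row = first letter, column = second letter, with 'a' at index 0.
    static int[][] countPairs(String text) {
        int[][] frequency = new int[26][26];   // every element starts at 0
        String s = text.toLowerCase();
        for (int i = 0; i + 1 < s.length(); i++) {
            char a = s.charAt(i);
            char b = s.charAt(i + 1);
            if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z') {
                frequency[a - 'a'][b - 'a']++;  // two pairs of brackets, not [x, y]
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        int[][] frequency = countPairs("the other theory");
        // 't' has index 19 and 'h' has index 7 because indices begin at 0.
        System.out.println("th occurs " + frequency[19][7] + " times");
    }
}
```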



Matrix Multiplication Exercise

  • http://cs.fit.edu/~ryan/java/programs/basic_algorithms/MatrixMultiplication2.java
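
For orientation, here is one way the exercise could look. This is a sketch of the textbook triple loop, not the contents of the linked file; the class name `MatrixProduct` is hypothetical.

```java
public class MatrixProduct {
    // Returns the product c = a * b, where a is n-by-m and b is m-by-p.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        if (a[0].length != m) {
            throw new IllegalArgumentException("inner dimensions do not match");
        }
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                for (int k = 0; k < m; k++) {
                    c[i][j] += a[i][k] * b[k][j];   // row i of a times column j of b
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b);
        for (double[] row : c) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```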


Arrays of Arrays

  • Logically: arrays of arrays in the tradition of C and C++. Very simple.

  • Unfortunately: introduces pointers, memory allocation, etc. Very complicated.
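
A small sketch of what "arrays of arrays" means in Java: the outer array holds references to separately allocated rows, so rows may even differ in length. Pascal's triangle is used only as a convenient example; it is not from the slides.

```java
public class ArraysOfArrays {
    // Builds the first rows of Pascal's triangle; row i has i + 1 elements.
    static int[][] triangle(int rows) {
        int[][] t = new int[rows][];       // allocates only the outer array of row references
        for (int i = 0; i < rows; i++) {
            t[i] = new int[i + 1];         // each row is its own array object
            t[i][0] = 1;
            t[i][i] = 1;
            for (int j = 1; j < i; j++) {
                t[i][j] = t[i - 1][j - 1] + t[i - 1][j];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        for (int[] row : triangle(5)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```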


Monte Carlo Methods

  • Introduction

  • Examples: a Monte Carlo estimate for pi (Java exercise), fair shuffling (Java exercise), and random walk (important in financial analysis)

  • Used in path tracing to create realistic images

  • Percolation – an example of the power of a Monte Carlo algorithm

    Goal: we will write a Monte Carlo algorithm to estimate the relevancy of WWW documents based on the static structure of the WWW.


Monte Carlo Casino

  • The name refers to the grand casino in the Principality of Monaco at Monte Carlo, which is well-known around the world as an icon of gambling.


Monte Carlo estimate for Pi

Java exercise: http://cs.fit.edu/~ryan/java/programs/basic_algorithms/ComputePi2.java

Since we know the value of pi it is not really necessary to invent an algorithm to estimate its value.
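
A sketch of the usual dart-throwing estimate, assuming the exercise follows the common approach: pick random points in the unit square and count the fraction that land inside the quarter circle of radius 1. The linked file may do it differently; `EstimatePi` and its parameters are illustrative names.

```java
import java.util.Random;

public class EstimatePi {
    // The quarter circle has area pi/4, so 4 * (fraction of points inside) estimates pi.
    static double estimate(int trials, long seed) {
        Random rng = new Random(seed);     // a fixed seed makes runs repeatable
        int inside = 0;
        for (int i = 0; i < trials; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return 4.0 * inside / trials;
    }

    public static void main(String[] args) {
        for (int trials = 100; trials <= 1_000_000; trials *= 100) {
            System.out.println(trials + " trials: " + estimate(trials, 42L));
        }
    }
}
```

The estimate improves slowly: the error shrinks roughly with the square root of the number of trials.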


Fair shuffling (Java exercise)

  • How would you test an algorithm for shuffling, say, cards? In particular, how would you know if all of the many possible results are equally likely?

  • Main program http://cs.fit.edu/~ryan/java/programs/basic_algorithms/Experiment.java. Nothing to write; requires the method to shuffle.

  • http://cs.fit.edu/~ryan/java/programs/basic_algorithms/Shuffle.java contains two methods of shuffling cards.

  • Run the experiment with multiple trials and convince yourself both methods are fair
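
One possible shape for such an experiment, as a sketch. It assumes a Fisher-Yates shuffle (the two methods in Shuffle.java are not shown in the slides) and tallies how often each card lands in each position. Even tallies are necessary for fairness but do not prove it, since they do not check whole orderings.

```java
import java.util.Random;

public class ShuffleExperiment {
    // Fisher-Yates shuffle: swap each position with a random position at or before it.
    static void shuffle(int[] a, Random rng) {
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);    // uniform in 0..i inclusive
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    // counts[card][position] = number of trials in which card ended up at position.
    static int[][] positionCounts(int n, int trials, long seed) {
        Random rng = new Random(seed);
        int[][] counts = new int[n][n];
        for (int t = 0; t < trials; t++) {
            int[] deck = new int[n];
            for (int i = 0; i < n; i++) deck[i] = i;
            shuffle(deck, rng);
            for (int pos = 0; pos < n; pos++) counts[deck[pos]][pos]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // With 4 cards and 40000 trials, every count should be near 10000.
        for (int[] row : positionCounts(4, 40_000, 1L)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```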


Percolation Theory

Percolation. Pour liquid on top of some porous material. Will liquid reach the bottom? Many applications in chemistry, materials science, etc.

  • Spread of forest fires.

  • Natural gas through semi-porous rock.

  • Flow of electricity through network of resistors.

  • Permeation of gas in coal mine through a gas mask filter.


Percolation Theory

Given an N-by-N system where each site is vacant with probability p, what is the probability that the system percolates?

Remark. Famous open question in statistical physics. No known mathematical solution. Computational thinking creates new science.

Recourse. Take a computational approach: Monte Carlo simulation.

The simulation uses a recursive depth-first search (DFS) algorithm, which is a digression from the present topic. (Recursion is a topic on the AP Computer Science A exam.)

p = 0.3 (does not percolate)

p = 0.4 (does not percolate)

p = 0.5 (does not percolate)

p = 0.6 (percolates)

p = 0.7 (percolates)
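
A sketch of the Monte Carlo approach, under stated assumptions: liquid moves only up, down, left, and right between vacant sites, and the system percolates when some vacant site in the bottom row is reachable from the top row. The class and method names are illustrative.

```java
import java.util.Random;

public class PercolationEstimate {
    // Recursive depth-first search: mark every vacant site reachable from (i, j).
    static void flow(boolean[][] vacant, boolean[][] full, int i, int j) {
        int n = vacant.length;
        if (i < 0 || i >= n || j < 0 || j >= n) return;   // off the grid
        if (!vacant[i][j] || full[i][j]) return;          // blocked or already visited
        full[i][j] = true;
        flow(vacant, full, i + 1, j);
        flow(vacant, full, i, j + 1);
        flow(vacant, full, i, j - 1);
        flow(vacant, full, i - 1, j);
    }

    // True if liquid poured on the top row reaches the bottom row.
    static boolean percolates(boolean[][] vacant) {
        int n = vacant.length;
        boolean[][] full = new boolean[n][n];
        for (int j = 0; j < n; j++) flow(vacant, full, 0, j);
        for (int j = 0; j < n; j++) if (full[n - 1][j]) return true;
        return false;
    }

    // Fraction of random n-by-n systems (each site vacant with probability p) that percolate.
    static double estimate(int n, double p, int trials, long seed) {
        Random rng = new Random(seed);
        int count = 0;
        for (int t = 0; t < trials; t++) {
            boolean[][] vacant = new boolean[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    vacant[i][j] = rng.nextDouble() < p;
            if (percolates(vacant)) count++;
        }
        return (double) count / trials;
    }

    public static void main(String[] args) {
        for (int k = 3; k <= 7; k++) {
            double p = k / 10.0;
            System.out.println("p = " + p + ": " + estimate(20, p, 1000, 7L));
        }
    }
}
```

The grid is kept small (20-by-20) because the recursion depth can grow with the number of sites.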



Random Walk

  • Page rank can be computed much like a random walk

  • See the Java applet (1 dim) at http://www.math.uah.edu/stat/applets/RandomWalkExperiment.html

  • See the Java applet (2 dim) at http://vlab.infotech.monash.edu.au/simulations/swarms/random-walk/


Searching the World Wide Web

  • History of Search Engines

  • Hypertext

  • Crawling the World Wide Web

  • Indexing


History of Search Engines

  • History of Search by Larry Kim of WordStream


Markup and Hypertext

  • Documents served up through the WWW are generally “marked up” for presentation in a structured, standard format called Hypertext Markup Language (HTML).

  • The most important feature of HTML is the referencing (via URLs) of other WWW documents which enables easy, non-sequential, and varied paths of reading the documents.


Hypertext


WWW Spiders

  • Google and others continually crawl around the WWW, recording what they see to enable searching.


44% of hits and 35% of bandwidth are attributable to bots (and other odd things).

July 2013 (up to 9:30 am 26 Jul 2013) on the WWW server cs.fit.edu

Russian search engine


Indexing

  • Finding a relevant document in a vast ocean of linked HTML documents requires a very large index.

  • An index is a (sorted) list of keywords (terms), each with the list of values (URLs) of the documents that contain it.


An example index of WWW documents

Bourgeois   .../manifesto.txt
Hero        …/lilwomen.txt, …/muchado.txt, …/war+peace.txt
His         .../manifesto.txt, …/lilwomen.txt, …/mobydick.txt, …/muchado.txt, …/war+peace.txt
Treachery   …/war+peace.txt
Whale       …/mobydick.txt
Yellowish   …/lilwomen.txt, …/war+peace.txt
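
An index like the one above could be built along these lines. This is a sketch: the document names `a.txt` and `b.txt` and their one-line contents are made-up stand-ins, and words are taken to be runs of the letters a to z.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndex {
    // Maps each keyword to the sorted set of documents that contain it.
    static SortedMap<String, SortedSet<String>> build(Map<String, String> docs) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();   // keeps keywords sorted
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().toLowerCase().split("[^a-z]+")) {
                if (word.isEmpty()) continue;   // split can yield an empty leading piece
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical documents, for illustration only.
        Map<String, String> docs = Map.of(
            "a.txt", "The whale and his hero",
            "b.txt", "His yellowish whale");
        build(docs).forEach((word, urls) -> System.out.println(word + " " + urls));
    }
}
```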


Several Issues

  • Pick out the words from the mark-up

  • What’s a word? 2nd, abc’s, CSTA

  • Normalize: lowercase, stemming

  • Some words are not worth indexing

    • “the”, “a”, etc.

    • A so-called stop list, e.g., words ignored in Wikipedia search

    • Java exercise: http://cs.fit.edu/~ryan/java/programs/xml/URLtoText.java

      First some preliminary remarks before doing the exercise.
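
Before opening the exercise, a rough sketch of the issues listed above may help. It strips tags with a crude regular expression, lowercases, and drops words on a tiny made-up stop list. It does not do stemming, and its choice of what counts as a word (letters, digits, and the straight apostrophe) is only one possible answer; the linked exercise may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Tokenizer {
    // Hypothetical stop list; a real one would be much longer.
    static final Set<String> STOP_LIST = Set.of("the", "a", "an", "of", "and");

    // Returns the index-worthy terms of a marked-up document, in order of appearance.
    static List<String> terms(String html) {
        String text = html.replaceAll("<[^>]*>", " ");   // crude: drops anything between < and >
        List<String> result = new ArrayList<>();
        for (String word : text.toLowerCase().split("[^a-z0-9']+")) {
            if (word.isEmpty() || STOP_LIST.contains(word)) continue;
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(terms("<p>The <b>Whale</b> and a hero, 2nd edition</p>"));
    }
}
```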


Searching and Sorting

Problem: Determine if the word is in the stop list. What is the best approach?

  • Searching: linear search, binary search. (These are topics on the AP Computer Science A exam.) Binary search requires the data (the index, for example) to be sorted.

  • Sorting: selection sort, insertion sort, merge sort, quick sort; external sorting. (The first three of these sorts are topics on the AP Computer Science A exam.)
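
A sketch of both searches applied to the stop-list problem. The five-word stop list is hypothetical, and the binary search assumes the array is already sorted in `String.compareTo` order.

```java
public class StopListSearch {
    // Linear search: examine every element until a match is found.
    static boolean linearContains(String[] list, String word) {
        for (String s : list) {
            if (s.equals(word)) return true;
        }
        return false;
    }

    // Binary search: halve the remaining range on each comparison. Requires sorted input.
    static boolean binaryContains(String[] sorted, String word) {
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = word.compareTo(sorted[mid]);
            if (cmp == 0) return true;
            if (cmp < 0) hi = mid - 1;
            else lo = mid + 1;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] stopList = {"a", "an", "and", "of", "the"};   // already sorted
        System.out.println(binaryContains(stopList, "of"));
        System.out.println(binaryContains(stopList, "whale"));
    }
}
```

The standard library offers the same operation as `java.util.Arrays.binarySearch`, which returns a negative value when the key is absent.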


Linear versus Binary Search

Suppose each comparison takes one millisecond (0.001 seconds).
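
The slide's table did not survive in this transcript. As a stand-in, the sketch below computes worst-case comparison counts under that one-millisecond assumption: n comparisons for linear search, and floor(log2 n) + 1 for binary search. The list sizes are chosen for illustration and are not from the slide.

```java
public class SearchCost {
    // Worst case for linear search: every element is compared.
    static long linearWorst(long n) {
        return n;
    }

    // Worst case for binary search: the number of times n can be halved before reaching 0.
    static long binaryWorst(long n) {
        long comparisons = 0;
        while (n > 0) {
            comparisons++;
            n /= 2;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        long[] sizes = {1_000L, 1_000_000L, 1_000_000_000L};
        for (long n : sizes) {
            double linearSeconds = linearWorst(n) * 0.001;   // one millisecond per comparison
            System.out.println(n + " items: linear " + linearSeconds + " s, binary "
                    + binaryWorst(n) + " ms");
        }
    }
}
```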


Obama at Google

  • https://www.youtube.com/watch?v=k4RRi_ntQc8


Sorting Demo

  • http://cs.fit.edu/~ryan/cse1002/sort.html

  • See also sorting illustrated by folk dancers at Algo-rythmics: http://algo-rythmics.ms.sapientia.ro


Now do the exercise

  • Java exercise: http://cs.fit.edu/~ryan/java/programs/xml/URLtoText.java

  • PS. How do students really program?

    • http://xkcd.com/1185 Observe the tool tip!


OK, we have a keyword index. It is likely we still have a “gazillion” documents for most of the terms. (See Googlewhack and Googlewhackblatt: one- and two-word search terms that return one document.)

How do we find the most relevant pages?


Big Data

  • The problem

  • Count-Min Algorithm


The problem with Big Data

Consider a popular website which wants to keep track of statistics on the queries used to search the site. One could keep track of the full log of queries, and answer exactly the frequency of any search query at the site. However, the log can quickly become very large. This problem is an instance of the count tracking problem. Even known sophisticated solutions for fast querying such as a tree-structure or hash table to count up the multiple occurrences of the same query, can prove to be slow and wasteful of resources. Notice that in this scenario, we can tolerate a little imprecision. In general, we are interested only in the queries that are asked frequently. So it is acceptable if there is some fuzziness in the counts. Thus, we can tradeoff some precision in the answers for a more efficient and lightweight solution. This tradeoff is at the heart of sketches.

Cormode and Muthukrishnan, 2011
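
A minimal sketch of a Count-Min sketch for the query-counting scenario in the quotation. Each item is hashed into one counter per row, and the estimate is the minimum of its counters, so the estimate can be too high but never too low. The hash mixing here is an ad hoc assumption chosen for brevity; it does not give the independence the published analysis relies on, and two strings with equal `hashCode` values collide in every row.

```java
import java.util.Random;

public class CountMinSketch {
    private final int[][] table;   // depth rows of width counters
    private final int[] seeds;     // one hash seed per row
    private final int width;

    CountMinSketch(int depth, int width, long seed) {
        this.width = width;
        this.table = new int[depth][width];
        this.seeds = new int[depth];
        Random rng = new Random(seed);
        for (int row = 0; row < depth; row++) seeds[row] = rng.nextInt();
    }

    // Chooses the counter for an item in a given row (ad hoc mixing, see lead-in).
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    // Records one occurrence of the item.
    void add(String item) {
        for (int row = 0; row < table.length; row++) table[row][bucket(item, row)]++;
    }

    // Returns an estimate that is at least the true count of the item.
    int estimate(String item) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 64, 1L);
        String[] queries = {"page rank", "java", "page rank", "monte carlo", "page rank"};
        for (String q : queries) sketch.add(q);
        System.out.println("page rank: about " + sketch.estimate("page rank"));
    }
}
```

The memory used is depth times width counters, regardless of how many distinct queries arrive.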


Page Rank

  • Not finding pages, but ordering the found pages

  • Makes a big difference in the user’s experience, if “good” or “relevant” pages come first.

  • The upcoming algorithm gave Google a competitive advantage

  • How would you rank pages?

  • The approach/algorithm called “page rank” is not based on the WWW surfer as voter (popularity), but on the WWW author as voter (hence relatively static)

  • Conceptually in the Page Rank algorithm a random surfer mindlessly follows the hyperlinks of the entire WWW
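
The random surfer can be simulated directly, as a Monte Carlo sketch. Assumptions: the surfer follows a random link on the current page 90% of the time and otherwise jumps to a random page (one reading of the "90-10 rule" named later in these slides), and the graph is the five-page S&W Tiny example given later as an adjacency list. The fraction of time spent on each page serves as its rank.

```java
import java.util.Random;

public class RandomSurfer {
    // links[i] lists the pages that page i links to (repeats allowed).
    // Returns the fraction of steps on which the surfer landed on each page.
    static double[] visitFrequencies(int[][] links, int steps, double follow, long seed) {
        Random rng = new Random(seed);
        int n = links.length;
        int[] visits = new int[n];
        int page = 0;
        for (int s = 0; s < steps; s++) {
            if (links[page].length > 0 && rng.nextDouble() < follow) {
                page = links[page][rng.nextInt(links[page].length)];   // follow a random link
            } else {
                page = rng.nextInt(n);                                 // jump to a random page
            }
            visits[page]++;
        }
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) freq[i] = (double) visits[i] / steps;
        return freq;
    }

    public static void main(String[] args) {
        int[][] tiny = {{1}, {2, 2, 3, 3, 4}, {3}, {0}, {0, 2}};
        double[] rank = visitFrequencies(tiny, 1_000_000, 0.9, 1L);
        for (int i = 0; i < rank.length; i++) {
            System.out.printf("page %d: %.3f%n", i, rank[i]);
        }
    }
}
```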


Input/Output

  • What is the input? The entire WWW modeled as a graph.

  • What is the output? The ranking of every page in the WWW.

  • With one number assigned to every page, a search query can order the found pages by rank and present the most relevant pages to the user first.


S&W Tiny Hypertext


S&W Tiny Graph


S&W Tiny: Adj list & Adj matrix

Adjacency list (number of pages, then the links as from-to pairs):

5
0 1
1 2 1 2
1 3 1 3 1 4
2 3
3 0
4 0 4 2

Adjacency matrix (dimensions, then the number of links from each row page to each column page):

5 5
0 1 0 0 0
0 0 2 2 1
0 0 0 1 0
1 0 0 0 0
1 0 1 0 0
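
The conversion from the adjacency list to the adjacency matrix can be sketched as follows. The input is assumed to be the page count followed by from-to pairs, as in the data above; the class name is illustrative.

```java
import java.util.Scanner;

public class AdjacencyMatrix {
    // Reads "n from to from to ..." and returns counts[from][to] = number of links.
    static int[][] read(String input) {
        Scanner in = new Scanner(input);
        int n = in.nextInt();
        int[][] counts = new int[n][n];
        while (in.hasNextInt()) {
            int from = in.nextInt();
            int to = in.nextInt();
            counts[from][to]++;        // repeated links add up
        }
        return counts;
    }

    public static void main(String[] args) {
        String tiny = "5  0 1  1 2 1 2  1 3 1 3 1 4  2 3  3 0  4 0 4 2";
        int[][] counts = read(tiny);
        System.out.println(counts.length + " " + counts.length);
        for (int[] row : counts) {
            StringBuilder line = new StringBuilder();
            for (int c : row) line.append(c).append(' ');
            System.out.println(line.toString().trim());
        }
    }
}
```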


Wiki2 Hypertext


Wiki2 Graph


Wiki2: Adj List & Adj Matrix

Adjacency list (number of pages, then the links as from-to pairs):

7
0 1 0 2 0 3 0 4 0 6
1 0
2 0 2 1
3 1 3 2 3 4
4 0 4 2 4 3 4 5
5 0 5 4
6 4

Adjacency matrix (dimensions, then the number of links from each row page to each column page):

7 7
0 1 1 1 1 0 1
1 0 0 0 0 0 0
1 1 0 0 0 0 0
0 1 1 0 1 0 0
1 0 1 1 0 1 0
1 0 0 0 1 0 0
0 0 0 0 1 0 0


Wiki1 Hypertext


Wiki1 Graph


Java Exercise

  • Modify Adajency1.java

    • Print adjacency matrix

    • Print probability matrix

    • Print probability matrix with 90-10 rule
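
A sketch of the last two steps, not a solution to the actual file. It assumes the 90-10 rule means: with probability 0.9 follow one of the page's links chosen at random, and with probability 0.1 jump to any page. Treating a page with no links as jumping anywhere is a further assumption. Passing `follow = 1.0` gives the plain probability matrix.

```java
public class TransitionMatrix {
    // counts[i][j] = number of links from page i to page j.
    // Returns p[i][j] = probability that the surfer moves from page i to page j.
    static double[][] probabilities(int[][] counts, double follow) {
        int n = counts.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += counts[i][j];
            for (int j = 0; j < n; j++) {
                if (outDegree == 0) {
                    p[i][j] = 1.0 / n;   // assumption: a page without links jumps anywhere
                } else {
                    p[i][j] = follow * counts[i][j] / outDegree + (1.0 - follow) / n;
                }
            }
        }
        return p;
    }

    public static void main(String[] args) {
        int[][] tiny = {{0, 1, 0, 0, 0}, {0, 0, 2, 2, 1}, {0, 0, 0, 1, 0},
                        {1, 0, 0, 0, 0}, {1, 0, 1, 0, 0}};
        for (double[] row : probabilities(tiny, 0.9)) {
            for (double v : row) System.out.printf("%.2f ", v);
            System.out.println();
        }
    }
}
```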


Interactive WWW Page for PageRank

  • http://williamcotton.com/pagerank-explained-with-javascript


Reachability, Markov Theory

Can node 2 reach node 4? Yes, using a path of length 2 through node 3.


Final Challenge

  • Raise the page rank of page “23” by modifying only the links on page “23”

  • Decrease the page rank of page “23” by modifying only the links on page “23”

  • Can you find the maximum/minimum page rank?



TED Talks: Brin & Page: The Genesis of Google

  • http://www.ted.com/talks/sergey_brin_and_larry_page_on_google.html

