Page Rank

Overview: two-dimensional arrays, Monte Carlo algorithms, searching the world wide web, big data, page rank.

Goal: we will write a program to compute the relevancy of WWW documents based on the static structure of the WWW.

Two Dimensional Arrays
Page Rank
Goal: we will write a Monte Carlo algorithm to estimate the relevancy of WWW documents based on the static structure of the WWW.
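The Monte Carlo idea we are building toward can be sketched as a "random surfer": repeatedly either follow a random outgoing link or leap to a random page, and count visits. This is an assumed minimal version (the class name, the 0.90/0.10 leap probability, and the dead-end handling are conventional choices, not necessarily the course's exact program):

```java
import java.util.Random;

public class RandomSurfer {
    // Monte Carlo page rank: with probability 0.90 the surfer follows a
    // random outgoing link (weighted by link counts); with probability
    // 0.10 the surfer leaps to a page chosen uniformly at random.
    // The fraction of visits to each page estimates its relevancy.
    public static double[] ranks(int[][] counts, int moves, long seed) {
        int n = counts.length;
        Random rng = new Random(seed);
        int[] visits = new int[n];
        int page = 0;
        for (int t = 0; t < moves; t++) {
            if (rng.nextDouble() < 0.90) {
                int out = 0;                         // total outgoing links
                for (int j = 0; j < n; j++) out += counts[page][j];
                if (out > 0) {
                    int r = rng.nextInt(out);        // pick a link at random
                    for (int j = 0; j < n; j++) {
                        r -= counts[page][j];
                        if (r < 0) { page = j; break; }
                    }
                } else {
                    page = rng.nextInt(n);           // dead end: jump anywhere
                }
            } else {
                page = rng.nextInt(n);               // random leap
            }
            visits[page]++;
        }
        double[] rank = new double[n];
        for (int i = 0; i < n; i++) rank[i] = (double) visits[i] / moves;
        return rank;
    }
}
```

The estimates converge as the number of moves grows; the two-dimensional `counts` array that drives the simulation is the data structure developed later in these notes.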
Java exercise: http://cs.fit.edu/~ryan/java/programs/basic_algorithms/ComputePi2.java
Since we already know the value of pi, it is not really necessary to invent an algorithm to estimate it, but it makes a good warm-up for Monte Carlo methods.
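The linked exercise presumably works along the following lines; this is a sketch, not the actual contents of ComputePi2.java (the class name and sampling details are assumptions): sample random points in the unit square, and observe that the fraction landing inside the quarter circle approaches pi/4.

```java
import java.util.Random;

public class EstimatePi {
    // Monte Carlo estimate of pi: the area of the quarter circle of
    // radius 1 is pi/4, so (hits / trials) * 4 approximates pi.
    public static double estimate(int trials, long seed) {
        Random rng = new Random(seed);
        int hits = 0;
        for (int i = 0; i < trials; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
        }
        return 4.0 * hits / trials;
    }

    public static void main(String[] args) {
        System.out.println(estimate(1_000_000, 42));
    }
}
```

With a million trials the estimate is typically within a few thousandths of the true value; the accuracy improves slowly, roughly with the square root of the number of trials.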
Percolation. Pour liquid on top of some porous material. Will the liquid reach the bottom? Many applications in chemistry, materials science, etc.
Given an N-by-N system where each site is vacant with probability p, what is the probability that the system percolates?
Remark. Famous open question in statistical physics. No known mathematical solution. Computational thinking creates new science.
Recourse. Take a computational approach: Monte Carlo simulation.
The simulation uses a recursive, depth-first search (DFS) algorithm, but that diverges from the present topic. (Recursion is a topic on the AP Computer Science A exam.)
p = 0.3 (does not percolate)
p = 0.4 (does not percolate)
p = 0.5 (does not percolate)
p = 0.6 (percolates)
p = 0.7 (percolates)
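The simulation behind results like those above can be sketched as follows. This is an assumed minimal version (vertical percolation, recursive DFS flood fill from the top row), not the course's exact program:

```java
import java.util.Random;

public class Percolation {
    // Recursive DFS flood fill: mark every vacant site reachable from
    // the top row; the system percolates if flow reaches the bottom row.
    static void flow(boolean[][] open, boolean[][] full, int i, int j) {
        int n = open.length;
        if (i < 0 || i >= n || j < 0 || j >= n) return;
        if (!open[i][j] || full[i][j]) return;
        full[i][j] = true;
        flow(open, full, i + 1, j);   // down
        flow(open, full, i, j - 1);   // left
        flow(open, full, i, j + 1);   // right
        flow(open, full, i - 1, j);   // up
    }

    static boolean percolates(boolean[][] open) {
        int n = open.length;
        boolean[][] full = new boolean[n][n];
        for (int j = 0; j < n; j++) flow(open, full, 0, j);
        for (int j = 0; j < n; j++)
            if (full[n - 1][j]) return true;
        return false;
    }

    // Monte Carlo estimate: the fraction of random n-by-n systems with
    // vacancy probability p that percolate.
    static double estimate(int n, double p, int trials, long seed) {
        Random rng = new Random(seed);
        int count = 0;
        for (int t = 0; t < trials; t++) {
            boolean[][] open = new boolean[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    open[i][j] = rng.nextDouble() < p;
            if (percolates(open)) count++;
        }
        return (double) count / trials;
    }
}
```

Running `estimate` for a range of p values reproduces the qualitative picture above: a sharp transition from "almost never percolates" to "almost always percolates" near a threshold value of p.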
We will examine a Monte Carlo algorithm for estimating the relevancy of WWW documents.
44% of hits and 35% of bandwidth are attributable to bots (and other odd things).
July 2013 (up to 9:30 am 26 Jul 2013) on the WWW server cs.fit.edu
Russian search engine
A (partial) keyword index: each term maps to the documents that contain it.

Bourgeois	…/manifesto.txt
Hero	…/lilwomen.txt, …/muchado.txt, …/war+peace.txt
His	…/manifesto.txt, …/lilwomen.txt, …/mobydick.txt, …/muchado.txt, …/war+peace.txt
Treachery	…/war+peace.txt
Whale	…/mobydick.txt
Yellowish	…/lilwomen.txt, …/war+peace.txt
First some preliminary remarks before doing the exercise.
Problem: Determine if the word is in the stop list. What is the best approach?
Suppose each comparison takes one millisecond (0.001 seconds).
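At one millisecond per comparison, the choice of data structure matters: linear search over a stop list of, say, 500 words averages about 250 comparisons (250 ms) per lookup, binary search needs about log2(500), roughly 9 comparisons (9 ms), and a hash set needs a roughly constant number. A sketch of the three approaches (the tiny stop list here is illustrative only):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopList {
    // A tiny illustrative stop list; real lists have a few hundred words.
    // The array must be kept in sorted order for binary search to work.
    static final String[] SORTED = { "a", "an", "and", "of", "the", "to" };
    static final Set<String> SET = new HashSet<>(Arrays.asList(SORTED));

    // Linear search: about n/2 comparisons on average.
    static boolean linear(String w) {
        for (String s : SORTED) if (s.equals(w)) return true;
        return false;
    }

    // Binary search on the sorted array: about log2(n) comparisons.
    static boolean binary(String w) {
        return Arrays.binarySearch(SORTED, w) >= 0;
    }

    // Hash set: expected constant time per lookup.
    static boolean hashed(String w) {
        return SET.contains(w);
    }
}
```

For a fixed, known-in-advance list like a stop list, either the sorted array with binary search or the hash set is a reasonable answer; linear search is the one to avoid.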
OK, we have a keyword index. For most terms, it is likely we still have a "gazillion" matching documents. (See Googlewhacks and Googlewhackblatts: two-word and one-word search terms that return exactly one document.)
How do we find the most relevant pages?
Consider a popular website that wants to keep track of statistics on the queries used to search the site. One could keep the full log of queries and answer exactly the frequency of any search query at the site. However, the log can quickly become very large. This problem is an instance of the count-tracking problem. Even sophisticated solutions for fast querying, such as a tree structure or a hash table that counts the multiple occurrences of the same query, can prove slow and wasteful of resources. Notice that in this scenario we can tolerate a little imprecision: in general, we are interested only in the queries that are asked frequently, so it is acceptable if there is some fuzziness in the counts. Thus, we can trade off some precision in the answers for a more efficient and lightweight solution. This tradeoff is at the heart of sketches.
Cormode and Muthukrishnan, 2011
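The best-known such sketch is the Count-Min sketch from the cited paper: d small hash tables of width w, where an item's estimated count is the minimum of its d counters. The sketch below is an assumed minimal implementation (the bit-mixing scheme in `bucket` is my own choice, not from the paper):

```java
import java.util.Random;

public class CountMinSketch {
    // Count-Min sketch: d rows of w counters, one hash function per row.
    // Estimates are never below the true count; collisions can only
    // inflate them, and taking the minimum over rows limits the damage.
    private final int d, w;
    private final long[][] table;
    private final int[] seeds;

    public CountMinSketch(int d, int w, long seed) {
        this.d = d;
        this.w = w;
        this.table = new long[d][w];
        this.seeds = new int[d];
        Random rng = new Random(seed);
        for (int i = 0; i < d; i++) seeds[i] = rng.nextInt();
    }

    private int bucket(int row, String item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);                     // mix the high bits in
        return Math.floorMod(h, w);
    }

    public void add(String item) {
        for (int i = 0; i < d; i++) table[i][bucket(i, item)]++;
    }

    // Estimated count: the minimum over the d rows.
    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < d; i++)
            min = Math.min(min, table[i][bucket(i, item)]);
        return min;
    }
}
```

The space used is d * w counters regardless of how many distinct queries arrive, which is exactly the "fuzziness for efficiency" trade described above: heavy hitters are reported accurately, while rare items may be slightly overcounted.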
Input: 5 pages, then links given as pairs i j, meaning page i links to page j:

5
0 1
1 2  1 2
1 3  1 3  1 4
2 3
3 0
4 0  4 2

Count matrix (dimensions, then entry [i][j] = number of links from page i to page j):

5 5
0 1 0 0 0
0 0 2 2 1
0 0 0 1 0
1 0 0 0 0
1 0 1 0 0
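Building the count matrix from the input above is a direct two-dimensional-array exercise: read the number of pages, then accumulate one count per link pair. A sketch (the class name and the choice of `Scanner` for input are assumptions):

```java
import java.util.Scanner;

public class Transition {
    // Read the number of pages n, then pairs "i j" (page i links to
    // page j), accumulating counts[i][j] in an n-by-n array.
    public static int[][] counts(Scanner in) {
        int n = in.nextInt();
        int[][] counts = new int[n][n];
        while (in.hasNextInt()) {
            int i = in.nextInt();
            int j = in.nextInt();
            counts[i][j]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // The 5-page example: duplicate links (e.g. 1 2 appearing twice)
        // simply increment the same entry again.
        String input = "5  0 1  1 2  1 2  1 3  1 3  1 4  2 3  3 0  4 0  4 2";
        int[][] c = counts(new Scanner(input));
        for (int[] row : c) System.out.println(java.util.Arrays.toString(row));
    }
}
```

Running this on the 5-page input reproduces the 5-by-5 count matrix shown above.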
Input: 7 pages and their links:

7
0 1  0 2  0 3  0 4  0 6
1 0
2 0  2 1
3 1  3 2  3 4
4 0  4 2  4 3  4 5
5 0  5 4
6 4

Count matrix (dimensions, then entry [i][j] = number of links from page i to page j):

7 7
0 1 1 1 1 0 1
1 0 0 0 0 0 0
1 1 0 0 0 0 0
0 1 1 0 1 0 0
1 0 1 1 0 1 0
1 0 0 0 1 0 0
0 0 0 0 1 0 0
Can node 2 reach node 4? Yes, using a path of length 2 through node 0 (first 2 → 0, then 0 → 4; note that this directed graph has no link from 2 to 3, so there is no length-2 path through node 3).
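Length-2 reachability questions like this one can be answered for all pairs at once by taking the boolean "square" of the adjacency matrix: there is a length-2 path from i to j exactly when some intermediate k has both a link i to k and a link k to j. A sketch, using the 7-node matrix above:

```java
public class Reach {
    // r[i][j] is true when some length-2 path i -> k -> j exists:
    // the boolean matrix product of a with itself.
    static boolean[][] square(int[][] a) {
        int n = a.length;
        boolean[][] r = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                if (a[i][k] != 0)
                    for (int j = 0; j < n; j++)
                        if (a[k][j] != 0) r[i][j] = true;
        return r;
    }

    public static void main(String[] args) {
        int[][] a = {
            { 0, 1, 1, 1, 1, 0, 1 },
            { 1, 0, 0, 0, 0, 0, 0 },
            { 1, 1, 0, 0, 0, 0, 0 },
            { 0, 1, 1, 0, 1, 0, 0 },
            { 1, 0, 1, 1, 0, 1, 0 },
            { 1, 0, 0, 0, 1, 0, 0 },
            { 0, 0, 0, 0, 1, 0, 0 }
        };
        System.out.println(square(a)[2][4]);  // is there a 2-step path 2 -> 4?
    }
}
```

Repeated squaring (or multiplying) generalizes this to longer paths, which connects the static link structure back to the page rank computation.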