ICS 215: Advances in Database Management System Technology
Spring 2004
Professor Chen Li
Information and Computer Science
University of California, Irvine
Course Web Server
• URL: http://www.ics.uci.edu/~ics215/
• All course info will be posted online
• Instructor: Chen Li
  • ICS 424B, chenli@ics.uci.edu
• Course general info: http://www.ics.uci.edu/~ics215/geninfo.html
Topic today: Web Search
• How did earlier search engines work?
• How does Google work?
• Readings:
  • Lawrence and Giles, Searching the World Wide Web, Science, 1998.
  • Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7 / Computer Networks 30(1-7): 107-117, 1998.
Earlier Search Engines
• HotBot, Yahoo, AltaVista, Northern Light, Excite, Infoseek, Lycos, …
• Main technique: the "inverted index"
• Conceptually: use a matrix that records how many times each term appears in each page
  • # of columns = # of pages (huge!)
  • # of rows = # of terms (also huge!)

             Page1  Page2  Page3  Page4  …
  'car'        1      0      1      0
  'toyota'     0      2      0      1      (page 2 mentions 'toyota' twice)
  'honda'      2      1      0      0
  …
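As an illustration (not from the original slides), here is a minimal Python sketch of how such term-page counts might be kept as an inverted index; the page names and counts reuse the toy matrix above, and build_index is a hypothetical helper.

```python
from collections import defaultdict

# Inverted index: term -> {page: term frequency}
# Hard-coded from the toy matrix above; a real engine would build it
# by tokenizing crawled pages.
inverted_index = {
    "car":    {"Page1": 1, "Page3": 1},
    "toyota": {"Page2": 2, "Page4": 1},   # page 2 mentions 'toyota' twice
    "honda":  {"Page1": 2, "Page2": 1},
}

def build_index(pages):
    """pages: dict mapping page id -> raw text. Returns term -> {page: count}."""
    index = defaultdict(lambda: defaultdict(int))
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term][page_id] += 1
    return index
```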
Search by Keywords
• If the query has one keyword, just return all the pages that contain the word
  • E.g., "toyota" → all pages containing "toyota": page2, page4, …
• There could be very many such pages!
• Solution: return the pages in which the word appears most frequently first
Multi-keyword Search
• For each keyword W, find the set of pages mentioning W
• Intersect all the sets of pages
  • Assuming an "AND" semantics over the keywords
• Example:
  • A search "toyota honda" will return all the pages that mention both "toyota" and "honda"
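A minimal sketch (my illustration, not from the slides) of single-keyword ranking and multi-keyword intersection over a toy inverted index; names and data are hypothetical.

```python
# Toy inverted index: term -> {page: term frequency}
index = {
    "toyota": {"Page2": 2, "Page4": 1},
    "honda":  {"Page1": 2, "Page2": 1},
}

def search_one(index, term):
    """Single keyword: pages containing `term`, most frequent first."""
    postings = index.get(term, {})
    return sorted(postings, key=postings.get, reverse=True)

def search_and(index, terms):
    """Multi-keyword AND: intersect the page sets of all query terms."""
    page_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*page_sets) if page_sets else set()

print(search_one(index, "toyota"))             # ['Page2', 'Page4']
print(search_and(index, ["toyota", "honda"]))  # {'Page2'}
```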
Observations
• The "matrix" can be huge:
  • The Web now has 4.2 billion pages!
  • There are many "terms" on the Web; many of them are typos.
• It's not easy to do the computation efficiently:
  • Given a word, find all the pages…
  • Intersect many sets of pages…
• For these reasons, search engines never store this "matrix" so naively.
Problems
• Spamming:
  • People want their pages ranked at the very top of a word search (e.g., "toyota"), so they repeat the word many times
  • Yet these pages may be unimportant compared to www.toyota.com, even if the latter mentions "toyota" only once (or not at all).
• Search engines can be easily "fooled"
Closer look at the problems
• The ranking lacks a concept of the "importance" of each page on each topic
  • E.g.: our ICS215 class page is not as "important" as Yahoo's main page.
  • A link from Yahoo is more important than a link from our class page
• But how do we capture the importance of a page?
  • A guess: # of hits? But where do we get that info?
  • # of inlinks to a page → Google's main idea.
Google's History
• Started at the Stanford DB group as a research project (Brin and Page)
• Used to be at: google.stanford.edu
• Very soon many people started liking it
• Incorporated in 1998: www.google.com
• The "largest" search engine now
• Started other businesses: Froogle, Gmail, …
PageRank
• Intuition:
  • The importance of each page should be decided by what other pages "say" about this page
  • One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
• Problem:
  • We can easily fool this technique by generating many dummy pages that point to our class page
Details of PageRank
• At the beginning, each page has weight 1
• In each iteration, each page propagates its current weight W to all its N forward neighbors; each of them gets weight W/N
• Meanwhile, a page accumulates the weights from its backward neighbors
• Iterate until all weights converge; usually 6-7 iterations are enough
• The final weight of each page is its importance
• NOTE: Google currently uses many other techniques/heuristics to do search; here we only cover some of the initial ideas.
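A minimal sketch (my illustration, not code from the slides) of this basic iteration; the 3-page graph at the bottom is hypothetical, and the sketch deliberately omits the dead-end and spider-trap fixes discussed below.

```python
def pagerank(graph, iterations=7):
    """graph: page -> list of forward neighbors. Returns page -> weight.
    Basic iteration only: no handling of dead ends or spider traps."""
    weights = {page: 1.0 for page in graph}            # every page starts at weight 1
    for _ in range(iterations):
        new_weights = {page: 0.0 for page in graph}
        for page, neighbors in graph.items():
            if not neighbors:
                continue                                # dead end: its weight leaks out
            share = weights[page] / len(neighbors)      # W/N to each forward neighbor
            for target in neighbors:
                new_weights[target] += share
        weights = new_weights
    return weights

# Hypothetical 3-page web: A -> B, C;  B -> C;  C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```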
Example: MiniWeb
• (Materials used by courtesy of Jeff Ullman)
• Our "MiniWeb" has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS).
• Their weights are represented as a vector over (Ne, MS, Am)
• (The slide shows the MiniWeb link graph.) For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.
Iterative computation
• (The slide traces the weight vector (Ne, MS, Am) through the iterations.)
• Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft.
• Does it capture the intuition? Yes.
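The slides show the link graph only as a figure; the sketch below assumes the link structure commonly used with this example (Ne → Ne, Am; MS → Am; Am → Ne, MS), which is consistent with the note above about Am and reproduces the stated final result.

```python
import numpy as np

# Assumed MiniWeb links (inferred; the slide shows them only as a figure):
#   Ne -> Ne, Am     MS -> Am     Am -> Ne, MS
# Column-stochastic matrix M: column j splits page j's weight equally
# among its forward neighbors.  Page order: [Ne, MS, Am].
M = np.array([
    [0.5, 0.0, 0.5],   # weight flowing into Ne
    [0.0, 0.0, 0.5],   # weight flowing into MS
    [0.5, 1.0, 0.0],   # weight flowing into Am
])

w = np.array([1.0, 1.0, 1.0])    # every page starts with weight 1
for _ in range(30):
    w = M @ w
print(w)    # converges to about [1.2, 0.6, 1.2]: Ne = Am = 2 * MS
```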
Observations
• We cannot get absolute weights:
  • We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (the sum of each column is 1), so the iterations converge and compute the principal eigenvector of the link matrix, i.e., the weight vector w satisfying w = M·w.
Problem 1 of the algorithm: dead ends
• MS does not point to anybody
• Result: the weights of the Web "leak out"
Problem 2 of the algorithm: spider traps
• MS only points to itself
• Result: all weights go to MS!
Google's solution: "tax each page"
• Like people paying taxes, each page pays some of its weight into a public pool, which is then distributed to all pages.
• Example: assume a 20% tax rate in the "spider trap" example.
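A sketch (my illustration) of the taxed iteration on the spider-trap variant of MiniWeb (MS pointing only to itself), with a 20% tax rate as in the slide; the link structure is inferred from the earlier example.

```python
import numpy as np

beta = 0.8     # each page keeps 80% of the propagated weight; 20% is "taxed"

# Spider-trap variant:  Ne -> Ne, Am;   MS -> MS (only itself);   Am -> Ne, MS
# Page order: [Ne, MS, Am]
M = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
])

w = np.ones(3)
for _ in range(50):
    # propagate the taxed weight, then redistribute the tax pool evenly
    w = beta * (M @ w) + (1 - beta) * np.ones(3)
print(w)   # about [0.64, 1.91, 0.45]: MS is still largest, but Ne and Am no longer go to 0
```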
The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Microsoft, Yahoo!, etc.
Next: HITS / Web communities
• Readings:
  • Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.
  • Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for Emerging Cyber-communities, WWW 1999.
Hubs and Authorities
• Motivation: find web pages related to a topic
  • E.g.: "find all web sites about automobiles"
• "Authority": a page that offers info about a topic
  • E.g.: DBLP is a page about papers
  • E.g.: google.com, aj.com, teoma.com, lycos.com
• "Hub": a page that doesn't provide much info itself, but tells us where to find pages about a topic
  • E.g.: our ICS215 page linking to pages about papers
  • E.g.: www.searchenginewatch.com is a hub of search engines
Two values of a page
• Each page has a hub value and an authority value.
  • In PageRank, each page has only one value: its "weight"
• Two vectors:
  • H: hub values
  • A: authority values
HITS algorithm: find hubs and authorities
• First step: find pages related to the topic (e.g., "automobile"), and construct the corresponding "focused subgraph" (see the sketch below):
  • Find the set S of pages containing the keyword ("automobile"); this is the root set
  • Find all pages these S pages point to, i.e., their forward neighbors
  • Find all pages that point to S pages, i.e., their backward neighbors
  • Compute the subgraph induced by these pages
• (The slide shows the root set expanding into the focused subgraph.)
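A sketch (my own, with hypothetical data structures) of building the focused subgraph from the root set; link_graph maps each page to its forward neighbors and matches_keyword stands in for the keyword-search step.

```python
def focused_subgraph(link_graph, matches_keyword):
    """link_graph: page -> set of forward neighbors (toy in-memory structure).
    matches_keyword: predicate telling whether a page contains the query keyword."""
    root = {p for p in link_graph if matches_keyword(p)}        # root set S
    expanded = set(root)
    for page in root:
        expanded |= link_graph.get(page, set())                 # forward neighbors of S
    for page, targets in link_graph.items():
        if targets & root:
            expanded.add(page)                                  # backward neighbors of S
    # keep only the edges between pages in the expanded set
    return {p: link_graph.get(p, set()) & expanded for p in expanded}

# Toy usage: pages "p1" and "p2" mention the keyword
toy_graph = {"p1": {"p3"}, "p2": {"p3", "p4"}, "p5": {"p1"}, "p3": set(), "p4": set()}
print(focused_subgraph(toy_graph, lambda p: p in {"p1", "p2"}))
```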
Step 2: computing H and A
• Initially: set each hub value and authority value to 1
• In each iteration, the hub score of a page is the total authority value of its forward neighbors (after normalization)
• The authority value of each page is the total hub value of its backward neighbors (after normalization)
• Iterate until the values converge
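A sketch (my illustration) of this iteration in Python; the tiny focused subgraph at the bottom is hypothetical, and I normalize with the L2 norm, which is one common choice.

```python
from math import sqrt

def hits(graph, iterations=20):
    """graph: page -> set of forward neighbors (the focused subgraph).
    Returns (hub, authority) dicts; both vectors are normalized each round."""
    hub = {p: 1.0 for p in graph}
    auth = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority of p = total hub value of the pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        _normalize(auth)
        # hub of p = total authority value of the pages p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        _normalize(hub)
    return hub, auth

def _normalize(scores):
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    for p in scores:
        scores[p] /= norm

# Hypothetical focused subgraph: two hub-like pages pointing to two authorities
graph = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
print(hits(graph))
```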
Example: MiniWeb
• (The slides trace the hub and authority vectors for Ne, MS, and Am through the iterations, normalizing after each step.)
Trawling: finding online communities
• Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to "hubs")
• Examples:
  • Web pages of NBA fans
  • Community of Turkish student organizations in the US
  • Fans of movie star Jack Lemmon
• Applications:
  • Provide valuable and timely info for interested people
  • Represent the sociology of the Web
  • Target advertising
How: analyzing web structure
• These pages often do not reference each other
  • Competition
  • Different viewpoints
• Main idea: "co-citations"
  • These pages often point to a large number of common pages
  • Example: the following two web sites share many common linked pages
    • http://kcm.co.kr/English/
    • www.cyberkorean.com/church
Bipartite subgraphs
• Bipartite graph: two sets of nodes, F ("fans") and C ("centers")
• Dense bipartite graph: there are "enough" edges between F and C
• Complete bipartite graph: there is an edge between each node in F and each node in C
• (i,j)-Core: a complete bipartite subgraph with at least i nodes in F and j nodes in C
• An (i,j)-Core is a good signature for finding online communities
• Usually i and j are between 3 and 9
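As an illustration (not from the slides), a small check of whether given fan and center sets form an (i,j)-core; the link representation is a hypothetical toy structure.

```python
def is_core(fans, centers, links, i, j):
    """links: page -> set of pages it points to (toy representation).
    True if every fan links to every center and the sizes reach (i, j)."""
    if len(fans) < i or len(centers) < j:
        return False
    return all(centers <= links.get(f, set()) for f in fans)

# Toy example: 3 fans all pointing to the same 3 centers -> a (3,3)-core
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3", "x"},
    "f3": {"c1", "c2", "c3"},
}
print(is_core({"f1", "f2", "f3"}, {"c1", "c2", "c3"}, links, 3, 3))   # True
```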
"Trawling": finding cores
• Find all (i,j)-cores in the Web graph
  • In particular: find the "fans" (or "hubs") in the graph
  • "centers" = "authorities"
• Challenge: the Web is huge. How do we find cores efficiently?
  • Experiments: 200M pages, 1 TB of data
• Main idea: pruning
• Step 1: use out-degrees
  • Rule: each fan must point to at least 6 different websites
  • Pruning result: 12% of all pages (= 24M pages) are potential fans
  • Retain only the links, and ignore page contents
Step 2: eliminate mirror pages
• Many pages are mirrors (exactly the same page)
  • They can produce many spurious fans
• Use a "shingling" method to identify and eliminate duplicates
• Results:
  • 60% of the 24M potential-fan pages are removed
  • The # of potential centers is about 30 times the # of potential fans
Step 3: use in-degrees of pages
• Delete highly referenced pages, e.g., Yahoo, AltaVista
  • Reason: they are referenced for many reasons and are unlikely to be part of an emerging community
• Formally: remove all pages with more than k inlinks (k = 50, for instance)
• Results:
  • 60M pages pointing to 20M pages
  • 2M potential fans
Step 4: iterative pruning
• To find (i,j)-cores:
  • Remove all potential fans whose # of out-links is < j (they cannot point to j centers)
  • Remove all potential centers whose # of in-links is < i (they cannot be pointed to by i fans)
  • Repeat, since each removal can lower other pages' degrees
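A sketch (my illustration) of this iterative pruning over a toy link graph; treating potential fans and potential centers as separate candidate sets is my reading of the procedure, and the data representation is assumed.

```python
def iterative_prune(out_links, i, j):
    """out_links: candidate fan -> set of pages it points to (toy representation).
    Repeatedly drop fans pointing to fewer than j surviving centers and
    centers pointed to by fewer than i surviving fans."""
    fans = set(out_links)
    centers = {c for targets in out_links.values() for c in targets}
    changed = True
    while changed:
        new_fans = {f for f in fans if len(out_links[f] & centers) >= j}
        in_degree = {c: sum(c in out_links[f] for f in new_fans) for c in centers}
        new_centers = {c for c in centers if in_degree[c] >= i}
        changed = (new_fans != fans) or (new_centers != centers)
        fans, centers = new_fans, new_centers
    return fans, centers

# Toy usage: three strong fans survive, the weak fan "f4" is pruned
out_links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1"},
}
print(iterative_prune(out_links, 3, 3))
```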
Step 5: inclusion-exclusion pruning
• Idea: in each step, we
  • either "include" a community,
  • or "exclude" a page from further contention
• Check a page x with out-degree exactly j: x is a fan of an (i,j)-core if
  • there are i-1 other fans that point to all the forward neighbors of x
  • This check can be done easily using the index on fans and centers
• Result: for (3,3)-cores, 5M pages remained
• Final step:
  • Since the graph is now much smaller, we can afford to "enumerate" the remaining cores
• Results:
  • (3,3)-cores: about 75K
  • High-quality communities
  • Check a few of them in the paper yourself
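A sketch (my own, with an assumed index structure) of the fan check described above: for a candidate fan x with exactly j out-links, count how many other candidate fans point to all of x's forward neighbors.

```python
def is_fan_of_core(x, out_links, i, j):
    """out_links: candidate fan -> set of pages it points to (assumed index).
    x qualifies if it has exactly j out-links and at least i-1 other
    candidate fans each point to all of x's forward neighbors."""
    neighbors = out_links[x]
    if len(neighbors) != j:
        return False
    supporters = sum(
        1 for fan, targets in out_links.items()
        if fan != x and neighbors <= targets
    )
    return supporters >= i - 1
```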