This course, led by Professor Chen Li at the University of California, Irvine, delves into advances in database management, focusing on web search mechanisms. Key concepts include the operation of early search engines, Google’s groundbreaking techniques, and the significance of PageRank in determining web page importance. Through insightful readings and discussions, students explore keyword search strategies, multi-keyword intersections, and challenges like spam in search engine results. The course leverages historical examples and current frameworks to understand the complexities of modern web search.
ICS 215: Advances in Database Management System Technology
Spring 2004
Professor Chen Li
Information and Computer Science, University of California, Irvine
Course Web Server
• URL: http://www.ics.uci.edu/~ics215/
• All course info will be posted online
• Instructor: Chen Li, ICS 424B, chenli@ics.uci.edu
• Course general info: http://www.ics.uci.edu/~ics215/geninfo.html
Topic today: Web Search
• How did earlier search engines work?
• How does Google work?
• Readings:
• Lawrence and Giles, Searching the World Wide Web, Science, 1998.
• Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7 / Computer Networks 30(1-7): 107-117, 1998.
Earlier Search Engines
• HotBot, Yahoo, AltaVista, Northern Light, Excite, Infoseek, Lycos, …
• Main technique: the "inverted index"
• Conceptually: use a matrix to record how many times each term appears in each page
• # of columns = # of pages (huge!)
• # of rows = # of terms (also huge!)

           Page1  Page2  Page3  Page4  …
'car'        1      0      1      0
'toyota'     0      2      0      1     (page 2 mentions 'toyota' twice)
'honda'      2      1      0      0
…
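In practice the term/page matrix is stored sparsely as posting lists rather than as a dense matrix. As a rough illustration only (the page names and texts below are made up, not from the lecture), here is how such an inverted index might be built:

```python
from collections import defaultdict

# Toy pages with invented contents, for illustration only
pages = {
    "page1": "car dealer sells honda honda",
    "page2": "toyota toyota prices versus honda",
    "page3": "car rental service",
    "page4": "new toyota models",
}

# Inverted index: term -> {page: number of occurrences}.
# This is the sparse form of the term/page matrix shown above.
inverted_index = defaultdict(lambda: defaultdict(int))
for page_id, text in pages.items():
    for term in text.lower().split():
        inverted_index[term][page_id] += 1

print(dict(inverted_index["toyota"]))   # {'page2': 2, 'page4': 1}
```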
Search by Keywords
• If the query has one keyword, just return all the pages that contain the word
• E.g., "toyota" → all pages containing "toyota": page2, page4, …
• There could be many, many pages!
• Solution: return the pages in which the word appears most frequently first
Multi-keyword Search
• For each keyword W, find the set of pages mentioning W
• Intersect all the sets of pages
• This assumes an "AND" semantics over the keywords
• Example:
• A search for "toyota honda" returns all the pages that mention both "toyota" and "honda"
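A minimal sketch of both query modes, using the toy counts from the matrix slide as posting lists (the plain occurrence-count ranking simply mirrors the slides, not any real engine's scoring):

```python
# Posting lists corresponding to the term/page matrix a few slides back:
# term -> {page: occurrence count}
inverted_index = {
    "car":    {"page1": 1, "page3": 1},
    "toyota": {"page2": 2, "page4": 1},
    "honda":  {"page1": 2, "page2": 1},
}

def search(query):
    """One keyword: rank pages by frequency. Several keywords: AND-intersect."""
    postings = [inverted_index.get(term, {}) for term in query.lower().split()]
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)          # keep only pages mentioning every keyword
    # return pages with the highest total occurrence count first
    return sorted(result, key=lambda page: -sum(p.get(page, 0) for p in postings))

print(search("toyota"))           # ['page2', 'page4']
print(search("toyota honda"))     # ['page2']
```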
Observations
• The "matrix" can be huge:
• The Web now has 4.2 billion pages!
• There are many "terms" on the Web; many of them are typos
• It's not easy to do the computation efficiently:
• Given a word, find all the pages…
• Intersect many sets of pages…
• For these reasons, search engines never store this "matrix" so naively.
Problems
• Spamming:
• People want their pages ranked at the very top for a keyword search (e.g., "toyota"), so they repeat the word many, many times
• Yet such pages may be unimportant compared to www.toyota.com, even if the latter mentions "toyota" only once (or not at all)
• Search engines can be easily "fooled"
Closer look at the problems
• The approach lacks any notion of the "importance" of each page on each topic
• E.g.: our ICS 215 class page is not as "important" as Yahoo's main page
• A link from Yahoo is more important than a link from our class page
• But how do we capture the importance of a page?
• A guess: # of hits? But where would that info come from?
• # of inlinks to a page → Google's main idea
Google's History
• Started at the Stanford DB group as a research project (Brin and Page)
• Used to be at: google.stanford.edu
• Very soon many people started liking it
• Incorporated in 1998: www.google.com
• The "largest" search engine now
• Has started other businesses: Froogle, Gmail, …
PageRank
• Intuition:
• The importance of each page should be decided by what other pages "say" about this page
• One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
• Problem:
• We can easily fool this technique by generating many dummy pages that point to our class page
Details of PageRank
• At the beginning, each page has weight 1
• In each iteration, each page propagates its current weight W to all of its N forward neighbors; each of them receives weight W/N
• Meanwhile, a page accumulates the weights sent from its backward neighbors
• Iterate until all the weights converge; usually 6-7 iterations are enough
• The final weight of each page is its importance
• NOTE: Google currently uses many other techniques/heuristics to do search; here we only cover some of the initial ideas.
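A rough sketch of this propagation (the representation below is illustrative, not Google's actual implementation; dead ends and the "tax" are only dealt with on the later slides):

```python
def pagerank(graph, iterations=7):
    """Basic PageRank iteration as described above.

    graph maps each page to the list of its forward neighbors.
    Every page starts with weight 1; in each round it splits its
    current weight evenly among the pages it links to.
    """
    weights = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_weights = {page: 0.0 for page in graph}
        for page, forward_neighbors in graph.items():
            if not forward_neighbors:          # a dead end (Problem 1 below):
                continue                       # its weight simply leaks out
            share = weights[page] / len(forward_neighbors)
            for neighbor in forward_neighbors:
                new_weights[neighbor] += share
        weights = new_weights
    return weights
```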
Example: MiniWeb
• (Materials used by courtesy of Jeff Ullman)
• Our "MiniWeb" has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS)
• Their weights are represented as a vector (ne, ms, am)
• [Figure: the link graph among Ne, MS, and Am]
• For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS
Iterative computation
• [Figure: the weight vector (ne, ms, am) after each iteration]
• Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft
• Does it capture the intuition? Yes.
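The iteration table itself was a figure on the original slide. The sketch below reruns the propagation on the MiniWeb; the exact link structure (Ne pointing to itself and Am, MS pointing to Am, Am pointing to Ne and MS) is reconstructed from the slides' description and the stated final result, so read it as an assumption:

```python
# MiniWeb links, reconstructed from the slides (an assumption, not shown explicitly)
graph = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}

weights = {page: 1.0 for page in graph}
for _ in range(50):                            # far more rounds than needed
    new_weights = {page: 0.0 for page in graph}
    for page, neighbors in graph.items():
        for neighbor in neighbors:
            new_weights[neighbor] += weights[page] / len(neighbors)
    weights = new_weights

print(weights)   # converges to roughly {'Ne': 1.2, 'MS': 0.6, 'Am': 1.2}
```

Starting from weight 1 each (total 3), Netscape and Amazon settle at about 1.2 and Microsoft at about 0.6, matching the stated result.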
Observations
• We cannot get absolute weights:
• We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (the sum of each column is 1), so the iterations converge and compute the principal eigenvector w of the transition matrix M, i.e., the solution of the matrix equation w = M·w
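For a quick numerical check, that principal eigenvector can also be computed directly. The column-stochastic matrix below encodes the same reconstructed MiniWeb links used above, so it is an assumption rather than something printed on the slide:

```python
import numpy as np

# Columns: Ne, MS, Am (how each page splits its weight); rows: Ne, MS, Am
M = np.array([[0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5],
              [0.5, 1.0, 0.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)
w = eigenvectors[:, np.argmax(eigenvalues.real)].real
w = w / w.sum() * 3            # rescale so the three weights sum to 3
print(w)                       # approximately [1.2, 0.6, 1.2]
```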
Problem 1 of the algorithm: dead ends
• MS does not point to anybody
• Result: the weights of the Web "leak out"
• [Figure: the MiniWeb graph with MS's outgoing link removed]
Problem 2 of the algorithm: spider traps
• MS only points to itself
• Result: all the weight goes to MS!
• [Figure: the MiniWeb graph with MS linking only to itself]
Google's solution: "tax" each page
• Like people paying taxes, each page pays some of its weight into a public pool, which is then distributed to all pages
• Example: assume a 20% tax rate in the "spider trap" example
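A sketch of the taxed iteration on the spider-trap version of the (reconstructed) MiniWeb, with the 20% rate from the slide:

```python
# Spider-trap MiniWeb: MS now points only to itself (other links as assumed above)
graph = {"Ne": ["Ne", "Am"], "MS": ["MS"], "Am": ["Ne", "MS"]}
tax = 0.2                                  # each page pays 20% into the public pool
n = len(graph)

weights = {page: 1.0 for page in graph}
for _ in range(50):
    new_weights = {page: 0.0 for page in graph}
    for page, neighbors in graph.items():
        for neighbor in neighbors:
            # only the untaxed 80% travels along the links
            new_weights[neighbor] += (1 - tax) * weights[page] / len(neighbors)
    pool = tax * sum(weights.values())
    for page in new_weights:               # the pool is shared equally by all pages
        new_weights[page] += pool / n
    weights = new_weights

print(weights)   # roughly {'Ne': 0.64, 'MS': 1.91, 'Am': 0.45}
```

MS still ends up with the largest weight, but the tax keeps Ne and Am from being drained to zero, which is the point of the fix.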
The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Microsoft, Yahoo!, etc.
Next: HITS / Web communities
• Readings:
• Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.
• Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for Emerging Cyber-communities, WWW 1999.
Hubs and Authorities
• Motivation: find web pages related to a topic
• E.g.: "find all web sites about automobiles"
• "Authority": a page that offers information about a topic
• E.g.: DBLP is a page about papers
• E.g.: google.com, aj.com, teoma.com, lycos.com
• "Hub": a page that doesn't provide much information itself, but tells us where to find pages about a topic
• E.g.: our ICS 215 page linking to pages about papers
• E.g.: www.searchenginewatch.com is a hub for search engines
Two values of a page
• Each page has a hub value and an authority value
• (In PageRank, each page has only one value: its "weight")
• Two vectors:
• H: hub values
• A: authority values
HITS algorithm: find hubs and authorities
• First step: find pages related to the topic (e.g., "automobile") and construct the corresponding "focused subgraph":
• Find the set S of pages containing the keyword ("automobile"): the root set
• Find all pages these S pages point to, i.e., their forward neighbors
• Find all pages that point to the S pages, i.e., their backward neighbors
• Compute the subgraph induced by all of these pages
• [Figure: the root set and the focused subgraph built around it]
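A rough sketch of the focused-subgraph construction; `contains_keyword`, `forward_links`, and `backward_links` are hypothetical helpers standing in for a text index and a link database, not real APIs:

```python
def focused_subgraph(keyword, contains_keyword, forward_links, backward_links):
    """Build the focused subgraph around the pages mentioning `keyword`."""
    root = set(contains_keyword(keyword))      # the root set S
    selected = set(root)
    for page in root:
        selected.update(forward_links(page))   # pages the root set points to
        selected.update(backward_links(page))  # pages pointing into the root set
    # keep only links among the selected pages (the induced subgraph)
    return {page: [q for q in forward_links(page) if q in selected]
            for page in selected}
```

The original algorithm also limits how many backward neighbors are added per root page; that refinement is omitted in this sketch.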
Step 2: computing H and A
• Initially: set every hub and authority value to 1
• In each iteration, the hub value of a page becomes the total authority value of its forward neighbors (after normalization)
• The authority value of each page becomes the total hub value of its backward neighbors (after normalization)
• Iterate until the values converge
• [Figure: hub pages pointing to authority pages]
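A minimal sketch of the two update rules (the normalization here divides by the largest value, which is one common choice in course notes; Kleinberg's paper normalizes by the L2 norm):

```python
def hits(graph, iterations=20):
    """HITS on a focused subgraph given as: page -> list of forward neighbors."""
    hubs = {p: 1.0 for p in graph}
    auths = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority of p = total hub value of the pages pointing to p
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
        norm = max(auths.values()) or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # hub value of p = total authority value of the pages p points to
        hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
        norm = max(hubs.values()) or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths

# Running it on the reconstructed MiniWeb from the PageRank slides:
graph = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}
print(hits(graph))
```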
Example: MiniWeb
• [Figures: the hub and authority vectors of the MiniWeb, updated and normalized over successive iterations]
Trawling: finding online communities
• Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to "hubs")
• Examples:
• Web pages of NBA fans
• The community of Turkish student organizations in the US
• Fans of the movie star Jack Lemmon
• Applications:
• Provide valuable and timely info for interested people
• Represent the sociology of the Web
• Target advertising
How: analyzing web structure
• These pages often do not reference each other:
• Competition
• Different viewpoints
• Main idea: "co-citation"
• These pages often point to a large number of common pages
• Example: the following two web sites share many pages
• http://kcm.co.kr/English/
• www.cyberkorean.com/church
Bipartite subgraphs
• Bipartite graph: two sets of nodes, F ("fans") and C ("centers"), with edges going from F to C
• Dense bipartite graph: there are "enough" edges between F and C
• Complete bipartite graph: there is an edge from each node in F to each node in C
• (i,j)-core: a complete bipartite subgraph with at least i nodes in F and at least j nodes in C
• An (i,j)-core is a good signature for finding online communities
• Usually i and j are between 3 and 9
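To make the definition concrete, here is a brute-force (i,j)-core finder on a tiny made-up graph; it is purely illustrative, since trawling's whole point is to avoid this kind of enumeration on the full Web:

```python
from itertools import combinations

def find_ij_cores(fan_links, i, j):
    """Yield groups of i fans that all point to at least j common centers.

    fan_links: dict mapping each fan page to the set of centers it links to.
    """
    for fan_group in combinations(fan_links, i):
        shared = set.intersection(*(fan_links[f] for f in fan_group))
        if len(shared) >= j:
            yield fan_group, shared

# Made-up example: three fan pages co-citing the same three centers
fan_links = {
    "fanA": {"c1", "c2", "c3", "c4"},
    "fanB": {"c1", "c2", "c3"},
    "fanC": {"c1", "c2", "c3", "c5"},
    "fanD": {"c6"},
}
print(list(find_ij_cores(fan_links, i=3, j=3)))   # fanA, fanB, fanC share c1, c2, c3
```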
"Trawling": finding cores
• Find all (i,j)-cores in the Web graph
• In particular: find the "fans" (or "hubs") in the graph
• "Centers" correspond to "authorities"
• Challenge: the Web is huge. How can cores be found efficiently?
• Experiments: 200M pages, 1 TB of data
• Main idea: pruning
• Step 1: pruning by out-degree
• Rule: each fan must point to at least 6 different websites
• Pruning result: 12% of all pages (= 24M pages) remain as potential fans
• Retain only the links, and ignore page contents
Step 2: eliminate mirror pages
• Many pages are mirrors (exact copies of the same page)
• They can produce many spurious fans
• Use a "shingling" method to identify and eliminate duplicates
• Results:
• 60% of the 24M potential-fan pages are removed
• The # of potential centers is about 30 times the # of potential fans
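A rough sketch of shingle-based duplicate detection; the 4-word shingle size and the use of plain Jaccard similarity are illustrative choices, not parameters taken from the paper:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of a page's text."""
    words = text.lower().split()
    return {tuple(words[n:n + k]) for n in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

# Two made-up, nearly identical pages
page1 = "welcome to the automobile fan club of irvine updated daily"
page2 = "welcome to the automobile fan club of irvine updated weekly"
print(jaccard(shingles(page1), shingles(page2)))   # 0.75: likely mirrors
```

Pages whose shingle sets are almost identical are treated as mirrors, and all but one copy is dropped.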
Step 3: using the in-degrees of pages
• Delete pages that are very highly referenced, e.g., Yahoo, AltaVista
• Reason: they are referenced for many different reasons and are not likely to be part of an emerging community
• Formally: remove all pages with more than k inlinks (k = 50, for instance)
• Results:
• 60M pages pointing to 20M pages
• 2M potential fans
Step 4: iterative pruning
• Goal: find (i,j)-cores
• Remove all pages whose # of out-links is < i
• Remove all pages whose # of in-links is < j
• Repeat iteratively until nothing more can be removed
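A minimal sketch of this pruning loop, assuming the surviving portion of the Web graph is given as fan page -> set of centers it links to:

```python
def iterative_prune(fan_links, i, j):
    """Repeatedly drop fans with out-degree < i and centers with in-degree < j."""
    fan_links = {fan: set(centers) for fan, centers in fan_links.items()}
    changed = True
    while changed:
        changed = False
        # fans that can no longer belong to an (i,j)-core
        for fan in [f for f, cs in fan_links.items() if len(cs) < i]:
            del fan_links[fan]
            changed = True
        # in-degrees of the remaining centers
        indegree = {}
        for centers in fan_links.values():
            for c in centers:
                indegree[c] = indegree.get(c, 0) + 1
        weak_centers = {c for c, d in indegree.items() if d < j}
        if weak_centers:
            changed = True
            for fan in fan_links:
                fan_links[fan] -= weak_centers   # may push some fans below i
    return fan_links
```

Each removal can lower the degrees of other pages, which is why the two rules are applied repeatedly until nothing changes.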
Step 5: inclusion-exclusion pruning
• Idea: in each step, we either
• "include" a community, or
• "exclude" a page from further contention
• Check a page x with out-degree j: x is a fan of an (i,j)-core if there are i-1 other fans that point to all the forward neighbors of x
• This condition can be checked easily using the indexes on fans and centers
• Result: for (3,3)-cores, 5M pages remained
• Final step:
• Since the graph is now much smaller, we can afford to "enumerate" the remaining cores
• Results:
• (3,3)-cores: about 75K
• High-quality communities
• Check a few of them in the paper yourself