This course, led by Professor Chen Li at the University of California, Irvine, delves into advances in database management, focusing on web search mechanisms. Key concepts include the operation of early search engines, Google’s groundbreaking techniques, and the significance of PageRank in determining web page importance. Through insightful readings and discussions, students explore keyword search strategies, multi-keyword intersections, and challenges like spam in search engine results. The course leverages historical examples and current frameworks to understand the complexities of modern web search.
ICS 215: Advances in Database Management System Technology
Spring 2004
Professor Chen Li
Information and Computer Science, University of California, Irvine
Course Web Server
• URL: http://www.ics.uci.edu/~ics215/
• All course info will be posted online
• Instructor: Chen Li, ICS 424B, chenli@ics.uci.edu
• Course general info: http://www.ics.uci.edu/~ics215/geninfo.html
Topic today: Web Search
• How did earlier search engines work?
• How does Google work?
• Readings:
• Lawrence and Giles, Searching the World Wide Web, Science, 1998.
• Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7 / Computer Networks 30(1-7): 107-117, 1998.
Earlier Search Engines
• HotBot, Yahoo, AltaVista, Northern Light, Excite, Infoseek, Lycos, …
• Main technique: the "inverted index"
• Conceptually: use a matrix to record how many times each term appears in each page
• # of columns = # of pages (huge!)
• # of rows = # of terms (also huge!)

           Page1  Page2  Page3  Page4  …
'car'        1      0      1      0
'toyota'     0      2      0      1     (page 2 mentions 'toyota' twice)
'honda'      2      1      0      0
…
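In practice the term/page matrix is stored sparsely as posting lists rather than as a dense matrix. As a rough illustration only (the page names and texts below are made up, not from the lecture), here is how such an inverted index might be built:

```python
from collections import defaultdict

# Toy pages with invented contents, for illustration only
pages = {
    "page1": "car dealer sells honda honda",
    "page2": "toyota toyota prices versus honda",
    "page3": "car rental service",
    "page4": "new toyota models",
}

# Inverted index: term -> {page: number of occurrences}.
# This is the sparse form of the term/page matrix shown above.
inverted_index = defaultdict(lambda: defaultdict(int))
for page_id, text in pages.items():
    for term in text.lower().split():
        inverted_index[term][page_id] += 1

print(dict(inverted_index["toyota"]))   # {'page2': 2, 'page4': 1}
```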
Search by Keywords
• If the query has one keyword, just return all the pages that contain the word
• E.g., "toyota" → all pages containing "toyota": page2, page4, …
• There could be many, many pages!
• Solution: return the pages in which the word appears most frequently first
Multi-keyword Search
• For each keyword W, find the set of pages mentioning W
• Intersect all the sets of pages
• This assumes an "AND" semantics over the keywords
• Example:
• A search for "toyota honda" returns all the pages that mention both "toyota" and "honda"
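A minimal sketch of both query modes, using the toy counts from the matrix slide as posting lists (the plain occurrence-count ranking simply mirrors the slides, not any real engine's scoring):

```python
# Posting lists corresponding to the term/page matrix a few slides back:
# term -> {page: occurrence count}
inverted_index = {
    "car":    {"page1": 1, "page3": 1},
    "toyota": {"page2": 2, "page4": 1},
    "honda":  {"page1": 2, "page2": 1},
}

def search(query):
    """One keyword: rank pages by frequency. Several keywords: AND-intersect."""
    postings = [inverted_index.get(term, {}) for term in query.lower().split()]
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)          # keep only pages mentioning every keyword
    # return pages with the highest total occurrence count first
    return sorted(result, key=lambda page: -sum(p.get(page, 0) for p in postings))

print(search("toyota"))           # ['page2', 'page4']
print(search("toyota honda"))     # ['page2']
```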
Observations
• The "matrix" can be huge:
• The Web now has 4.2 billion pages!
• There are many "terms" on the Web; many of them are typos
• It's not easy to do the computation efficiently:
• Given a word, find all the pages…
• Intersect many sets of pages…
• For these reasons, search engines never store this "matrix" so naively.
Problems
• Spamming:
• People want their pages ranked at the very top for a keyword search (e.g., "toyota"), so they repeat the word many, many times
• Yet such pages may be unimportant compared to www.toyota.com, even if the latter mentions "toyota" only once (or not at all)
• Search engines can be easily "fooled"
Closer look at the problems
• The approach lacks any notion of the "importance" of each page on each topic
• E.g.: our ICS 215 class page is not as "important" as Yahoo's main page
• A link from Yahoo is more important than a link from our class page
• But how do we capture the importance of a page?
• A guess: # of hits? But where would that info come from?
• # of inlinks to a page → Google's main idea
Google's History
• Started at the Stanford DB group as a research project (Brin and Page)
• Used to be at: google.stanford.edu
• Very soon many people started liking it
• Incorporated in 1998: www.google.com
• The "largest" search engine now
• Has started other businesses: Froogle, Gmail, …
PageRank
• Intuition:
• The importance of each page should be decided by what other pages "say" about this page
• One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
• Problem:
• We can easily fool this technique by generating many dummy pages that point to our class page
Details of PageRank
• At the beginning, each page has weight 1
• In each iteration, each page propagates its current weight W to all of its N forward neighbors; each of them receives weight W/N
• Meanwhile, a page accumulates the weights sent from its backward neighbors
• Iterate until all the weights converge; usually 6-7 iterations are enough
• The final weight of each page is its importance
• NOTE: Google currently uses many other techniques/heuristics to do search; here we only cover some of the initial ideas.
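A rough sketch of this propagation (the representation below is illustrative, not Google's actual implementation; dead ends and the "tax" are only dealt with on the later slides):

```python
def pagerank(graph, iterations=7):
    """Basic PageRank iteration as described above.

    graph maps each page to the list of its forward neighbors.
    Every page starts with weight 1; in each round it splits its
    current weight evenly among the pages it links to.
    """
    weights = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_weights = {page: 0.0 for page in graph}
        for page, forward_neighbors in graph.items():
            if not forward_neighbors:          # a dead end (Problem 1 below):
                continue                       # its weight simply leaks out
            share = weights[page] / len(forward_neighbors)
            for neighbor in forward_neighbors:
                new_weights[neighbor] += share
        weights = new_weights
    return weights
```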
Example: MiniWeb
• (Materials used by courtesy of Jeff Ullman)
• Our "MiniWeb" has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS)
• Their weights are represented as a vector (ne, ms, am)
• [Figure: the link graph among Ne, MS, and Am]
• For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS
Iterative computation
• [Figure: the weight vector (ne, ms, am) after each iteration]
• Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft
• Does it capture the intuition? Yes.
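The iteration table itself was a figure on the original slide. The sketch below reruns the propagation on the MiniWeb; the exact link structure (Ne pointing to itself and Am, MS pointing to Am, Am pointing to Ne and MS) is reconstructed from the slides' description and the stated final result, so read it as an assumption:

```python
# MiniWeb links, reconstructed from the slides (an assumption, not shown explicitly)
graph = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}

weights = {page: 1.0 for page in graph}
for _ in range(50):                            # far more rounds than needed
    new_weights = {page: 0.0 for page in graph}
    for page, neighbors in graph.items():
        for neighbor in neighbors:
            new_weights[neighbor] += weights[page] / len(neighbors)
    weights = new_weights

print(weights)   # converges to roughly {'Ne': 1.2, 'MS': 0.6, 'Am': 1.2}
```

Starting from weight 1 each (total 3), Netscape and Amazon settle at about 1.2 and Microsoft at about 0.6, matching the stated result.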
Observations
• We cannot get absolute weights:
• We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (the sum of each column is 1), so the iterations converge and compute the principal eigenvector w of the transition matrix M, i.e., the solution of the matrix equation w = M·w
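For a quick numerical check, that principal eigenvector can also be computed directly. The column-stochastic matrix below encodes the same reconstructed MiniWeb links used above, so it is an assumption rather than something printed on the slide:

```python
import numpy as np

# Columns: Ne, MS, Am (how each page splits its weight); rows: Ne, MS, Am
M = np.array([[0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5],
              [0.5, 1.0, 0.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)
w = eigenvectors[:, np.argmax(eigenvalues.real)].real
w = w / w.sum() * 3            # rescale so the three weights sum to 3
print(w)                       # approximately [1.2, 0.6, 1.2]
```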
Problem 1 of the algorithm: dead ends
• MS does not point to anybody
• Result: the weights of the Web "leak out"
• [Figure: the MiniWeb graph with MS's outgoing link removed]
Problem 2 of the algorithm: spider traps
• MS only points to itself
• Result: all the weight goes to MS!
• [Figure: the MiniWeb graph with MS linking only to itself]
Google's solution: "tax" each page
• Like people paying taxes, each page pays some of its weight into a public pool, which is then distributed to all pages
• Example: assume a 20% tax rate in the "spider trap" example
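A sketch of the taxed iteration on the spider-trap version of the (reconstructed) MiniWeb, with the 20% rate from the slide:

```python
# Spider-trap MiniWeb: MS now points only to itself (other links as assumed above)
graph = {"Ne": ["Ne", "Am"], "MS": ["MS"], "Am": ["Ne", "MS"]}
tax = 0.2                                  # each page pays 20% into the public pool
n = len(graph)

weights = {page: 1.0 for page in graph}
for _ in range(50):
    new_weights = {page: 0.0 for page in graph}
    for page, neighbors in graph.items():
        for neighbor in neighbors:
            # only the untaxed 80% travels along the links
            new_weights[neighbor] += (1 - tax) * weights[page] / len(neighbors)
    pool = tax * sum(weights.values())
    for page in new_weights:               # the pool is shared equally by all pages
        new_weights[page] += pool / n
    weights = new_weights

print(weights)   # roughly {'Ne': 0.64, 'MS': 1.91, 'Am': 0.45}
```

MS still ends up with the largest weight, but the tax keeps Ne and Am from being drained to zero, which is the point of the fix.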
The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Microsoft, Yahoo!, etc.
Next: HITS / Web communities
• Readings:
• Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.
• Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for Emerging Cyber-communities, WWW 1999.
Hubs and Authorities
• Motivation: find web pages related to a topic
• E.g.: "find all web sites about automobiles"
• "Authority": a page that offers information about a topic
• E.g.: DBLP is a page about papers
• E.g.: google.com, aj.com, teoma.com, lycos.com
• "Hub": a page that doesn't provide much information itself, but tells us where to find pages about a topic
• E.g.: our ICS 215 page linking to pages about papers
• E.g.: www.searchenginewatch.com is a hub for search engines
Two values of a page
• Each page has a hub value and an authority value
• (In PageRank, each page has only one value: its "weight")
• Two vectors:
• H: hub values
• A: authority values
HITS algorithm: find hubs and authorities
• First step: find pages related to the topic (e.g., "automobile") and construct the corresponding "focused subgraph":
• Find the set S of pages containing the keyword ("automobile"): the root set
• Find all pages these S pages point to, i.e., their forward neighbors
• Find all pages that point to the S pages, i.e., their backward neighbors
• Compute the subgraph induced by all of these pages
• [Figure: the root set and the focused subgraph built around it]
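A rough sketch of the focused-subgraph construction; `contains_keyword`, `forward_links`, and `backward_links` are hypothetical helpers standing in for a text index and a link database, not real APIs:

```python
def focused_subgraph(keyword, contains_keyword, forward_links, backward_links):
    """Build the focused subgraph around the pages mentioning `keyword`."""
    root = set(contains_keyword(keyword))      # the root set S
    selected = set(root)
    for page in root:
        selected.update(forward_links(page))   # pages the root set points to
        selected.update(backward_links(page))  # pages pointing into the root set
    # keep only links among the selected pages (the induced subgraph)
    return {page: [q for q in forward_links(page) if q in selected]
            for page in selected}
```

The original algorithm also limits how many backward neighbors are added per root page; that refinement is omitted in this sketch.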
Step 2: computing H and A
• Initially: set every hub and authority value to 1
• In each iteration, the hub value of a page becomes the total authority value of its forward neighbors (after normalization)
• The authority value of each page becomes the total hub value of its backward neighbors (after normalization)
• Iterate until the values converge
• [Figure: hub pages pointing to authority pages]
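A minimal sketch of the two update rules (the normalization here divides by the largest value, which is one common choice in course notes; Kleinberg's paper normalizes by the L2 norm):

```python
def hits(graph, iterations=20):
    """HITS on a focused subgraph given as: page -> list of forward neighbors."""
    hubs = {p: 1.0 for p in graph}
    auths = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority of p = total hub value of the pages pointing to p
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
        norm = max(auths.values()) or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # hub value of p = total authority value of the pages p points to
        hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
        norm = max(hubs.values()) or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths

# Running it on the reconstructed MiniWeb from the PageRank slides:
graph = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}
print(hits(graph))
```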
Example: MiniWeb
• [Figures: the hub and authority vectors of the MiniWeb, updated and normalized over successive iterations]
Trawling: finding online communities
• Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to "hubs")
• Examples:
• Web pages of NBA fans
• The community of Turkish student organizations in the US
• Fans of the movie star Jack Lemmon
• Applications:
• Provide valuable and timely info for interested people
• Represent the sociology of the Web
• Target advertising
How: analyzing web structure
• These pages often do not reference each other:
• Competition
• Different viewpoints
• Main idea: "co-citation"
• These pages often point to a large number of common pages
• Example: the following two web sites share many pages
• http://kcm.co.kr/English/
• www.cyberkorean.com/church
Bipartite subgraphs
• Bipartite graph: two sets of nodes, F ("fans") and C ("centers"), with edges going from F to C
• Dense bipartite graph: there are "enough" edges between F and C
• Complete bipartite graph: there is an edge from each node in F to each node in C
• (i,j)-core: a complete bipartite subgraph with at least i nodes in F and at least j nodes in C
• An (i,j)-core is a good signature for finding online communities
• Usually i and j are between 3 and 9
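To make the definition concrete, here is a brute-force (i,j)-core finder on a tiny made-up graph; it is purely illustrative, since trawling's whole point is to avoid this kind of enumeration on the full Web:

```python
from itertools import combinations

def find_ij_cores(fan_links, i, j):
    """Yield groups of i fans that all point to at least j common centers.

    fan_links: dict mapping each fan page to the set of centers it links to.
    """
    for fan_group in combinations(fan_links, i):
        shared = set.intersection(*(fan_links[f] for f in fan_group))
        if len(shared) >= j:
            yield fan_group, shared

# Made-up example: three fan pages co-citing the same three centers
fan_links = {
    "fanA": {"c1", "c2", "c3", "c4"},
    "fanB": {"c1", "c2", "c3"},
    "fanC": {"c1", "c2", "c3", "c5"},
    "fanD": {"c6"},
}
print(list(find_ij_cores(fan_links, i=3, j=3)))   # fanA, fanB, fanC share c1, c2, c3
```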
"Trawling": finding cores
• Find all (i,j)-cores in the Web graph
• In particular: find the "fans" (or "hubs") in the graph
• "Centers" correspond to "authorities"
• Challenge: the Web is huge. How can cores be found efficiently?
• Experiments: 200M pages, 1 TB of data
• Main idea: pruning
• Step 1: pruning by out-degree
• Rule: each fan must point to at least 6 different websites
• Pruning result: 12% of all pages (= 24M pages) remain as potential fans
• Retain only the links, and ignore page contents
Step 2: eliminate mirror pages
• Many pages are mirrors (exact copies of the same page)
• They can produce many spurious fans
• Use a "shingling" method to identify and eliminate duplicates
• Results:
• 60% of the 24M potential-fan pages are removed
• The # of potential centers is about 30 times the # of potential fans
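A rough sketch of shingle-based duplicate detection; the 4-word shingle size and the use of plain Jaccard similarity are illustrative choices, not parameters taken from the paper:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of a page's text."""
    words = text.lower().split()
    return {tuple(words[n:n + k]) for n in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

# Two made-up, nearly identical pages
page1 = "welcome to the automobile fan club of irvine updated daily"
page2 = "welcome to the automobile fan club of irvine updated weekly"
print(jaccard(shingles(page1), shingles(page2)))   # 0.75: likely mirrors
```

Pages whose shingle sets are almost identical are treated as mirrors, and all but one copy is dropped.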
Step 3: using the in-degrees of pages
• Delete pages that are very highly referenced, e.g., Yahoo, AltaVista
• Reason: they are referenced for many different reasons and are not likely to be part of an emerging community
• Formally: remove all pages with more than k inlinks (k = 50, for instance)
• Results:
• 60M pages pointing to 20M pages
• 2M potential fans
Step 4: iterative pruning
• Goal: find (i,j)-cores
• Remove all pages whose # of out-links is < i
• Remove all pages whose # of in-links is < j
• Repeat iteratively until nothing more can be removed
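A minimal sketch of this pruning loop, assuming the surviving portion of the Web graph is given as fan page -> set of centers it links to:

```python
def iterative_prune(fan_links, i, j):
    """Repeatedly drop fans with out-degree < i and centers with in-degree < j."""
    fan_links = {fan: set(centers) for fan, centers in fan_links.items()}
    changed = True
    while changed:
        changed = False
        # fans that can no longer belong to an (i,j)-core
        for fan in [f for f, cs in fan_links.items() if len(cs) < i]:
            del fan_links[fan]
            changed = True
        # in-degrees of the remaining centers
        indegree = {}
        for centers in fan_links.values():
            for c in centers:
                indegree[c] = indegree.get(c, 0) + 1
        weak_centers = {c for c, d in indegree.items() if d < j}
        if weak_centers:
            changed = True
            for fan in fan_links:
                fan_links[fan] -= weak_centers   # may push some fans below i
    return fan_links
```

Each removal can lower the degrees of other pages, which is why the two rules are applied repeatedly until nothing changes.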
Step 5: inclusion-exclusion pruning
• Idea: in each step, we either
• "include" a community, or
• "exclude" a page from further contention
• Check a page x with out-degree j: x is a fan of an (i,j)-core if there are i-1 other fans that point to all the forward neighbors of x
• This condition can be checked easily using the indexes on fans and centers
• Result: for (3,3)-cores, 5M pages remained
• Final step:
• Since the graph is now much smaller, we can afford to "enumerate" the remaining cores
• Results:
• (3,3)-cores: about 75K
• High-quality communities
• Check a few of them in the paper yourself