C20.0046: Database Management Systems Lecture #27

  1. C20.0046: Database Management Systems Lecture #27
  M.P. Johnson
  Stern School of Business, NYU
  Spring, 2005

  2. Agenda
  • Last time:
    • Data Mining
    • RAID
    • Websearch
    • Etc.

  3. Goals after today:
  • Understand what RAID is
  • Be able to perform RAID 4 parity computation and recovery
  • Understand some issues in websearch
  • Be able to perform PageRank

  4. New topic: Recovery

  5. System Failures (skip?)
  • Each transaction has internal state
  • When the system crashes, that internal state is lost
    • We don't know which parts executed and which didn't
  • Remedy: use a log
    • A file that records each action of each xact
    • A trail of breadcrumbs
  • See text for details…

  6. Media Failures
  • Rule of thumb: Pr(hard drive has a head crash within 10 years) = 50%
  • Simpler rule of thumb: Pr(hard drive has a head crash within 1 year) = (say) 10%
  • If you have many drives, head crashes are a regular occurrence
  • Soln: different RAID strategies
    • RAID: Redundant Arrays of Independent Disks

  7. RAID levels
  • RAID level 1: each disk gets a mirror
  • RAID level 4: one disk is the xor of all the others
    • Each bit is the sum mod 2 of the corresponding bits
  • E.g.:
    • Disk 1: 11110000
    • Disk 2: 10101010
    • Disk 3: 00111000
    • Disk 4:
  • How to recover?
  • Various other RAID levels in text…

  8. RAID levels
  • RAID level 1: each disk gets a mirror
  • RAID level 4: one disk is the xor of all the others
    • Each bit is the sum mod 2 of the corresponding bits
  • E.g.:
    • Disk 1:
    • Disk 2: 10101010
    • Disk 3: 00111000
    • Disk 4:
  • How to recover?
  • Various other RAID levels in text…
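
  A minimal Java sketch of the RAID 4 arithmetic above: the parity disk is the xor of the data disks, and a lost disk is recovered as the xor of the survivors. The blanks on slides 7 and 8 are left as an exercise, so the values here are computed from the xor rule rather than taken from the slides; the class and method names are invented for illustration.

// RAID 4 sketch using the 8-bit disk contents from slide 7.
public class Raid4Demo {
    static int xorAll(int... disks) {
        int parity = 0;
        for (int d : disks) parity ^= d;   // per-bit sum mod 2
        return parity;
    }

    static String bits(int b) {
        return String.format("%8s", Integer.toBinaryString(b)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int disk1 = 0b11110000, disk2 = 0b10101010, disk3 = 0b00111000;

        // Parity disk 4 is the xor of disks 1-3.
        int disk4 = xorAll(disk1, disk2, disk3);
        System.out.println("Disk 4 (parity)  = " + bits(disk4));

        // Recovery: if disk 1 is lost, xor the surviving disks, including parity.
        int recovered = xorAll(disk2, disk3, disk4);
        System.out.println("Recovered disk 1 = " + bits(recovered));
    }
}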

  9. Next topic: Websearch
  • Create a search engine for searching the web
  • DBMS queries use tables and (optionally) indices
  • First thing to understand about websearch: we never run queries on the web itself
    • Way too expensive, for several reasons
  • Instead:
    • Build an index of the web
    • Search the index
    • Return the results

  10. Crawling
  • To obtain the data for the index, we crawl the web
    • Automated web-surfing
    • Conceptually very simple
    • But difficult to do robustly
  • First, must get pages
    • Prof. Davis (NYU/CS)'s example: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java
    • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java
    • Rule of thumb: 1 page per minute
  • Run program:
    sales% cd ~mjohnson/public_html/dbms/eg
    sales% java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms 200
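
  In the same spirit as the WebCrawler.java example linked above (but not that program itself), here is a minimal breadth-first crawler sketch; the regex-based link extraction, the fixed page limit, and the absence of politeness delays and robots.txt handling are simplifications, and the class name is invented.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

// Minimal breadth-first crawler: fetch a page, pull out href links,
// enqueue unseen URLs, stop after a fixed number of pages.
public class TinyCrawler {
    public static void main(String[] args) throws Exception {
        String start = args.length > 0 ? args[0] : "http://pages.stern.nyu.edu/~mjohnson/dbms";
        int maxPages = args.length > 1 ? Integer.parseInt(args[1]) : 20;

        HttpClient client = HttpClient.newHttpClient();
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);
        Deque<String> frontier = new ArrayDeque<>(List.of(start));
        Set<String> seen = new HashSet<>(List.of(start));  // URL-seen test; a content-seen test would hash page bodies instead

        int crawled = 0;
        while (!frontier.isEmpty() && crawled < maxPages) {
            String url = frontier.poll();
            try {
                HttpResponse<String> page = client.send(
                        HttpRequest.newBuilder(URI.create(url)).build(),
                        HttpResponse.BodyHandlers.ofString());
                crawled++;
                System.out.println("Crawled: " + url);
                Matcher m = href.matcher(page.body());
                while (m.find()) {
                    String link = m.group(1);
                    if (seen.add(link)) frontier.add(link);  // BFS search strategy
                }
            } catch (Exception e) {
                System.out.println("Skipping non-responsive or malformed URL: " + url);
            }
        }
    }
}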

  11. Crawling issues in practice
  • DNS bottleneck
    • to fetch a page found via a text link, we must first resolve its address
    • BP claim: 87% of crawling time ≈ DNS look-up
  • Search strategy?
  • Refresh strategy?
  • Primary key for webpages
    • Use artificial IDs, not URLs
    • more popular pages get shorter DocIDs (why?)

  12. Crawling issues in practice
  • Content-seen test
    • compute a fingerprint/hash (again!) of the page content
  • robots.txt
    • http://www.robotstxt.org/wc/robots.html
  • Bad HTML
    • Tolerant parsing
  • Non-responsive servers
  • Spurious text

  13. Inverted indices
  • Basic idea of finding pages:
    • Create an inverted index mapping words to pages
  • First, think of each webpage as a tuple
    • One column for each possible word
    • True means the word appears on the page
    • Index on all columns
  • Now can search: john bolton
    → select * from T where john = true and bolton = true

  14. Inverted indices
  • Can simplify somewhat:
    • For each field index, delete the False entries
    • The True entries for each index become a bucket
  • Create an inverted index:
    • One entry for each search word (the lexicon)
    • Each search word's entry points to its corresponding bucket
    • The bucket points to the pages containing its word (the postings file)
  • Final intuition: the inverted index doesn't map URLs to words
    • It maps words to URLs
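
  A minimal sketch of the lexicon/postings idea just described: the lexicon maps each word to a bucket (postings list) of the documents containing it. The documents, docIDs, and class name are made up for illustration.

import java.util.*;

// Minimal inverted-index sketch: word -> postings list of docIDs.
public class InvertedIndexDemo {
    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "databases index the rows of a table",
                2, "search engines index the pages of the web",
                3, "the web is crawled before it is indexed");

        // lexicon: word -> sorted postings list of docIDs (the buckets)
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        docs.forEach((docId, text) -> {
            for (String word : text.toLowerCase().split("\\W+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        });

        // Single-word lookup: follow the lexicon entry to its bucket.
        System.out.println("index -> " + index.get("index"));   // [1, 2]
        System.out.println("web   -> " + index.get("web"));     // [2, 3]
    }
}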

  15. Inverted Indices
  • What's stored?
  • For each word W, for each doc D:
    • relevance of D to W
    • number/percentage of occurrences of W in D
    • meta-data/context: bold, font size, title, etc.
  • In addition to page importance, keep in mind: this info is used to determine the relevance of particular words appearing on the page

  16. Search engine infrastructure
  • Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf

  17. Google-like infrastructure
  • Very large distributed system
  • File sizes routinely in GBs → Google File System
    • Block size = 64MB (not KB)!
  • 100k+ low-quality Linux boxes
    • → system failures are the rule, not the exception
  • Divide the index up by words into many barrels
    • the lexicon maps word ids to each word's barrel
  • Also, a RAID-like strategy → a 2-D matrix of servers
    • many commodity machines → frequent crashes
    • Draw picture
  • May have more duplication for popular pages…
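
  A sketch of the word-to-barrel routing in the 2-D server matrix just described. The hash-based partitioning, the barrel/replica counts, and the server names are assumptions for illustration, not Google's actual scheme.

import java.util.Random;

// Barrels are the columns (index partitions by word); each barrel is
// replicated across several servers (the rows).
public class BarrelRouter {
    static final int NUM_BARRELS = 4;          // columns
    static final int REPLICAS_PER_BARREL = 3;  // rows

    static String serverFor(String word, Random rng) {
        int barrel = Math.floorMod(word.hashCode(), NUM_BARRELS);  // which column holds this word
        int replica = rng.nextInt(REPLICAS_PER_BARREL);            // pick a random row in that column
        return "barrel-" + barrel + "-replica-" + replica;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (String w : new String[]{"database", "web", "search"}) {
            System.out.println(w + " -> " + serverFor(w, rng));
        }
    }
}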

  18. Google-like infrastructure
  • To respond to a single-word query Q(w):
    • send it to the barrel column for word w
    • pick a random server in that column
    • return (some) sorted results
  • To respond to a multi-word query Q(w1…wn):
    • for each word wi, send it to the barrel column for wi
    • pick a random server in each column
    • for all words in parallel, merge and prune
      • step through the lists until we find docs containing all the words; add them to the results
      • the index is ordered on (word, docID), so the merge takes linear time
    • return (some) sorted results
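
  A sketch of the linear-time merge step for a two-word query: because postings lists are sorted by docID, one pass over both lists finds the documents containing both words. The docIDs and class name are made up for illustration.

import java.util.*;

// Intersect two sorted postings lists in O(n + m) time.
public class PostingsMerge {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;   // advance whichever list is behind
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] postingsW1 = {2, 5, 9, 14, 21};   // docs containing word 1
        int[] postingsW2 = {3, 5, 9, 20, 21};   // docs containing word 2
        System.out.println(intersect(postingsW1, postingsW2));  // [5, 9, 21]
    }
}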

  19. Websearch v. DBMS

  20. New topic: Sorting Results
  • How to respond to Q(w1,w2,…,wn)?
    • Search the index for pages with w1,w2,…,wn
    • Return them in sorted order (how?)
  • Soln 1: current order
    • Return 100,000 (mostly) useless results
    • Sturgeon's Law: "Ninety percent of everything is crud."
  • Soln 2: ways from Information Retrieval Theory
    • library science + CS = IR

  21. Simple IR-style approach
  • For each word W in a doc D, compute:
    • (# occurrences of W in D) / (total # of word occurrences in D)
  • → each document becomes a point in a space
    • one dimension for every possible word
    • Like k-NN and k-means
    • the value in that dimension is the ratio from above (maybe weighted, etc.)
  • Choose pages with high values for the query words
  • A little more precisely: each doc becomes a vector in the space
    • Values same as above
    • But: think of the query itself as a document vector
    • Similarity between query and doc = dot product / cosine of the angle between them
    • Draw picture
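
  A minimal sketch of this vector-space scoring: each document, and the query itself, becomes a term-frequency vector, and similarity is the cosine between query and document vectors. The texts, class name, and the plain unweighted term frequencies are assumptions for illustration.

import java.util.*;

// Term-frequency vectors and cosine similarity.
public class CosineScore {
    static Map<String, Double> tfVector(String text) {
        Map<String, Double> tf = new HashMap<>();
        String[] words = text.toLowerCase().split("\\W+");
        for (String w : words) tf.merge(w, 1.0, Double::sum);
        tf.replaceAll((w, c) -> c / words.length);   // occurrences / total word occurrences
        return tf;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (var e : a.entrySet()) dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double v : a.values()) na += v * v;
        for (double v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Double> query = tfVector("bill clinton");
        Map<String, Double> doc1  = tfVector("bill clinton speaks at the stern school");
        Map<String, Double> doc2  = tfVector("bill clinton sucks");  // very short page scores higher (the problem on the next slide)
        System.out.println("doc1: " + cosine(query, doc1));
        System.out.println("doc2: " + cosine(query, doc2));
    }
}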

  22. Information Retrieval Theory
  • With some extensions, this works well for relatively small sets of quality documents
  • But the web has 8 billion documents
  • Problem: if scoring is based just on percentages, very short pages containing the query words score very high
    • BP: query a "major search engine" for "bill clinton"
    • → the "Bill Clinton Sucks" page

  23. Soln 3: sort by "quality"
  • What do you mean by quality?
  • Hire readers to rate my webpage (early Yahoo)
  • Problem: doesn't scale well
    • more webpages than Yahoo employees…

  24. Soln 4: count # citations (links)
  • Idea: you don't have to hire webpage raters
  • The rest of the web has already voted on the quality of my webpage
    • 1 link to my page = 1 vote
  • Similar to counting academic citations
    • Peer review

  25. Soln 5: Google's PageRank
  • Count citations, but not equally: a weighted sum
  • Motiv: we said we believe that some pages are better than others
    • → those pages' votes should count for more
  • A page can get a high PageRank many ways
  • Two cases at the ends of a continuum:
    • many pages link to you
    • yahoo.com links to you
  • PageRank, not PigeonRank
    • Search for "PigeonRank"…

  26. PageRank
  • More precisely, let P be a page;
    • for each page Li that links to P,
    • let C(Li) be the number of pages Li links to.
  • Then PR0(P) = SUM_i ( PR0(Li) / C(Li) )
  • Motiv: each page votes with its quality; its quality is divided among the pages it votes for
  • Extensions: bold/large-type/etc. links may get larger proportions…
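
  A small worked instance of the formula, with made-up numbers: suppose P has exactly two in-links, L1 with PR0(L1) = 0.6 and C(L1) = 3 out-links, and L2 with PR0(L2) = 0.3 and C(L2) = 2 out-links. Then PR0(P) = 0.6/3 + 0.3/2 = 0.2 + 0.15 = 0.35.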

  27. Understanding PageRank (skip?)
  • Analogy 1: Friendster/Orkut
    • someone "good" invites you in
    • someone else "good" invited that person in, etc.
  • Analogy 2: PKE certificates
    • my cert is authenticated by your cert
    • your cert is endorsed by someone else's…
    • Both cases here: eventually reach a foundation
  • Analogy 3: job/school recommendations
    • three people recommend you
    • why should anyone believe them?
    • three other people recommended them, etc.
    • eventually, we take a leap of faith

  28. Understanding PageRank
  • Analogy 4: Random Surfer Model
  • Idealized web surfer:
    • First, start at some page
    • Then, at each page, pick a random link…
  • Turns out: after a long time surfing,
    • Pr(we're at some page P right now) = PR0(P)
    • PRs are normalized

  29. Computing PageRank
  • For each page P, we want:
    • PR(P) = SUM_i ( PR(Li) / C(Li) )
  • But it's circular: how to compute it?
  • Meth 1: for n pages, we've got n linear equations in n unknowns
    • can solve for all the PR(P)s, but too expensive at web scale
    • see your linear algebra course…
  • Meth 2: iteratively
    • start with PR0(P) set to E for each P
    • iterate until there is no more significant change
    • BP report roughly 50 iterations for roughly 30M pages / 300M links
    • the number of iterations required grows only with the log of the web size
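
  A sketch of Meth 2 above: start every page at the same value and repeatedly apply the update PR(P) = SUM_i ( PR(Li) / C(Li) ) until the values settle. The tiny three-page graph and class name are made up for illustration (every page has an out-link, so no rank dissipates).

import java.util.Arrays;

// Iterative (power-iteration) computation of the basic PageRank.
public class PageRankIteration {
    public static void main(String[] args) {
        // links[i] = pages that page i links to
        int[][] links = { {1, 2}, {0, 1}, {0} };
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                 // start each PR at the same value E

        for (int iter = 0; iter < 50; iter++) {   // a few dozen iterations suffice here
            double[] next = new double[n];
            for (int p = 0; p < n; p++) {
                for (int q : links[p]) {
                    next[q] += pr[p] / links[p].length;   // p's vote, split among its out-links
                }
            }
            pr = next;
        }
        System.out.println(Arrays.toString(pr));  // converges to roughly [0.4, 0.4, 0.2]
    }
}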

  30. Problems with PageRank
  • Example (from Ullman):
    • A points to Y, M;
    • Y points to itself and A;
    • M points nowhere (draw picture)
  • Start A, Y, M at 1:
    • (1,1,1) → … → (0,0,0)
    • The rank dissipates
  • Soln: add an (implicit) self-link to any dead-end
    sales% cd ~mjohnson/public_html/dbms/eg
    stern% java PageRank

  31. Problems with PageRank
  • Example (from Ullman):
    • A points to Y, M;
    • Y points to itself and A;
    • M points to itself
  • Start A, Y, M at 1:
    • (1,1,1) → … → (0,0,3)
    • Now M becomes a rank sink
    • Random-surfer interpretation: we eventually end up at M and then get stuck
  • Soln: add "inherent quality" E to each page
    stern% java PageRank2

  32. Modified PageRank
  • Apart from inherited quality, each page also has inherent quality E:
    • PR(P) = E + SUM_i ( PR(Li) / C(Li) )
  • More precisely, take a weighted sum of the two terms:
    • PR(P) = 0.15*E + 0.85*SUM_i ( PR(Li) / C(Li) )
  • Leads to a modified random surfer model
    stern% java PageRank3
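
  A sketch of this modified update, run on the Ullman example from slides 30-31 (A points to Y and M; Y points to itself and A; M points to itself). Without the 0.15*E term M soaks up all the rank; with it, A and Y keep nonzero rank. Using E = 1.0 for every page, and the class name, are assumptions for illustration (this is not the instructor's PageRank3 program).

import java.util.Arrays;

// Iterative computation of PR(P) = 0.15*E + 0.85*SUM( PR(Li)/C(Li) ).
public class DampedPageRank {
    public static void main(String[] args) {
        String[] names = {"A", "Y", "M"};
        int[][] links = { {1, 2}, {1, 0}, {2} };   // out-links of A, Y, M
        int n = links.length;
        double E = 1.0;                            // assumed uniform inherent quality
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, 0.15 * E);           // inherent-quality term
            for (int p = 0; p < n; p++)
                for (int q : links[p])
                    next[q] += 0.85 * pr[p] / links[p].length;  // inherited-quality term
            pr = next;
        }
        for (int p = 0; p < n; p++)
            System.out.printf("%s: %.3f%n", names[p], pr[p]);   // M is largest, but A and Y stay > 0
    }
}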

  33. Random Surfer Model'
  • Motiv: if we (qua random surfer) end up at page M, we don't really stay there forever
    • We type in a new URL
  • Idealized web surfer:
    • First, start at some page
    • Then, at each page, pick a random link
    • But occasionally, we get bored and jump to a random new page
  • Turns out: after a long time surfing,
    • Pr(we're at some page P right now) = PR(P)

  34. Understanding PageRank
  • One more interpretation: the hydraulic model
    • picture the web graph again
    • imagine each link as a tube between two nodes
    • imagine quality as fluid
    • each node is a reservoir initialized with amount E of fluid
  • Now let it flow…
  • Steady state: each node P ends up with PR(P) worth of fluid
    • PR(P) of fluid eventually settles in node P
    • equilibrium

  35. Somewhat analogous systems (skip?)
  • Sornette: "Why Stock Markets Crash"
    • Si(t+1) = sign( ei + SUM_j Sj(t) )
    • a trader buys/sells based on his own inclination and what his associates are saying
  • the direction of a magnet is determined by its old direction and the directions of its neighbors
  • the activation of a neuron is determined by its own properties and the activation of the neighbors connected to it by synapses
  • the PR of P is based on its inherent value and the PR of its in-links

  36. Non-uniform Es (skip?)
  • So far, we assumed E was constant across all pages
  • But we can make E a function E(P)
    • vary it by page
  • How do we choose E(P)?
    • Idea 1: set it high for pages with high PR from earlier iterations
    • Idea 2: set it high for pages I like
  • The BP paper gave high E to John McCarthy's homepage
    • → pages he links to get high PR, etc.
    • Result: his own personalized search engine
  • Q: How would google.com get your prefs?

  37. Tricking search engines
  • "Search Engine Optimization"
  • Challenge: include on your page lots of words you think people will query on
    • maybe hidden with the same color as the background
  • Response: popularity ranking
    • the pages doing this probably aren't linked to that much
    • but…

  38. Tricking search engines
  • I can try to make my page look popular to the search engine
  • Challenge: create a page with 1000 links to my page
    • does this work?
  • Challenge: create 1000 other pages linking to it
    • Response: limit the weight a single domain can give to itself
  • Challenge: buy a second domain and put the 1000 pages there
    • Response: limit the weight from any single domain…

  39. Using anchor text
  • Another good idea: use anchor text
  • Motiv: pages may not give the best descriptions of themselves
    • most search engines' own pages don't contain the phrase "search engine"
    • BP claim: only 1 of the 4 "top search engines" could find itself on the query "search engine"
  • Anchor text also describes the page:
    • many pages link to google.com
    • many of them likely say "search engine" in/near the link
    • → Treat anchor-text words as part of the page
  • Search for "US West" or for "g++"

  40. Tricking search engines
  • This provides a new way to trick the search engine
  • Use of anchor text is a big part of result quality
    • but it has potential for abuse
    • It lets you influence the appearance of other people's pages
  • Google Bombs
    • put up lots of pages linking to my page, using some particular phrase in the anchor text
    • result: a search for the words you chose produces my page
    • Examples: "talentless hack", "miserable failure", "waffles", the last name of a prominent US senator…

  41. Bidding for ads
  • Google had two really great ideas:
    • PageRank
    • AdWords/AdSense
  • Fundamental difficulty with mass-advertising:
    • Most of the audience doesn't want it
    • Most people don't want what you're selling
      • Think of car commercials on TV
    • But some of them do!

  42. Bidding for ads
  • If you're selling widgets, how do you know who wants them?
  • Hard question, so answer its inverse instead:
    • If someone is searching for widgets, what should you try to sell them?
    • Easy: widgets!
  • Whatever the user searches for, display ads relevant to that query

  43. Bidding for ads
  • Q: How to divvy up these query/ad correspondences?
  • A: Create a market, and let the divvying take care of itself
    • Each company places the bid it's willing to pay for an ad shown in response to a particular query
    • The ad auction "takes place" at query time
    • Relevant ads are displayed in descending bid order
    • The company pays only if the user clicks
  • AdSense: place ads on external webpages, with the auction based on page content instead of the query
  • Huge huge huge business
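
  A minimal sketch of the query-time auction just described: advertisers bid on a query, ads are shown in descending bid order, and an advertiser pays only when its ad is clicked. The bidders, bid amounts, class name, and the pay-your-own-bid pricing are simplifying assumptions for illustration.

import java.util.*;

// Query-time ad auction: rank bids for the query, charge only on a click.
public class AdAuctionDemo {
    record Bid(String advertiser, double amount) {}

    public static void main(String[] args) {
        Map<String, List<Bid>> bidsByQuery = Map.of(
                "widgets", List.of(
                        new Bid("AcmeWidgets", 0.40),
                        new Bid("WidgetWorld", 0.75),
                        new Bid("CheapWidgetsInc", 0.25)));

        String query = "widgets";
        List<Bid> ranked = new ArrayList<>(bidsByQuery.get(query));
        ranked.sort(Comparator.comparingDouble(Bid::amount).reversed());  // descending bid order

        for (Bid b : ranked) System.out.println(b.advertiser() + " bids $" + b.amount());

        boolean clicked = true;                  // advertiser pays only if the user clicks
        Bid top = ranked.get(0);
        if (clicked) System.out.println(top.advertiser() + " pays $" + top.amount());
    }
}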

  44. Click Fraud
  • The latest challenge:
    • Users who click on ad links just to cost their competitors money
    • Or pay Indian housewives $.25/click
  • http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article
  • http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms

  45. For more info
  • See sources drawn upon here:
  • Prof. Davis (NYU/CS) search engines course
    • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
  • Original research papers by Page & Brin:
    • The PageRank Citation Ranking: Bringing Order to the Web
    • The Anatomy of a Large-Scale Hypertextual Web Search Engine
    • Links on class page
    • Interesting and very accessible
  • Google Labs: http://labs.google.com

  46. You mean that's it?
  • Final Exam: next Thursday, 5/5, 10-11:50am
    • Final exam info is up
    • Course grades are curved
    • Interest in a review session?
  • Please fill out course evals!
    • https://ais.stern.nyu.edu/
  • Comments by email, etc., are welcome
  • Thanks!
