
C20.0046: Database Management Systems Lecture #23


Presentation Transcript


  1. C20.0046: Database Management Systems, Lecture #23 • M.P. Johnson • Stern School of Business, NYU • Spring 2008

  2. Agenda • Ads • Data mining • RAID? • Regexes? • Set theory? • Indices? • Etc.?

  3. Fixing PR bugs • Apart from inherited quality, each page also has inherent quality E: • PR(P) = E + SUM(PR(Li)/C(Li)), summing over the pages Li that link to P, where C(Li) is the number of links out of Li • More precisely, take a weighted sum of the two terms: • PR(P) = .15*E + .85*SUM(PR(Li)/C(Li)) • Leads to a modified random-surfer model • sales% java PageRank3
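
A minimal Python sketch of the damped update above, iterated to a fixpoint. The three-page graph and the uniform E are made up for illustration; the class demo PageRank3 is a separate Java program not reproduced here.

    # Damped PageRank by repeated application of the update above.
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # p -> pages p links to
    pages = list(links)
    E = {p: 1.0 / len(pages) for p in pages}    # inherent quality (assumed uniform)
    pr = {p: 1.0 / len(pages) for p in pages}   # initial guess

    for _ in range(50):                         # iterate until (near) convergence
        pr = {p: 0.15 * E[p] +
                 0.85 * sum(pr[L] / len(links[L])       # PR(L)/C(L)
                            for L in pages if p in links[L])
              for p in pages}

    print(pr)   # C, with two in-links, ends up ranked highest in this toy graph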

  4. Anchor text • Motivation: pages may not give the best descriptions of themselves • most search engines don’t contain the phrase “search engine” • Brin & Page’s claim: only 1 of the 4 “top search engines” could find itself on the query "search engine" • Anchor text offers more descriptions of the page: • many pages link to google.com • many of them likely say "search engine" in/near the link • → Treat anchor-text words as part of the page • Search for “US West” or for “g++”

  5. Next: Bidding for ads • Google had two big ideas: • PageRank • AdWords/AdSense • Fundamental difficulty with mass-advertising: • Most of the audience isn’t interested • Most people don’t want what you’re selling • Think of car commercials on TV • But some of them do!

  6. Bidding for ads • If you’re selling widgets, how do you know who wants them? • Hard question, so invert the question • If someone is searching for widgets, what should you try to sell them? • Easy – widgets! • Whatever the user searches for, display ads relevant to that query • Or: whatever’s on the page he’s viewing

  7. Bidding for ads • Q: How to divvy up the query-to-advertiser correspondences? • A: Create a market, and let the divvying take care of itself • Each company places the bid it’s willing to pay for an ad responding to a particular query • The ad auction “takes place” at query time • Relevant ads displayed in descending bid order (e.g.) • Company pays only if the user clicks • Huge huge huge business

  8. How to assign ads? • On each query, must assign a bidder to the top slot • Matching problem • Online problem • Also: each bidder has a daily budget • Many-to-one • One idea: choose the highest bid (in expectation) • Can’t be better than a ½-approx (why?) • Another: choose the highest remaining budget • Definitely suboptimal (why?) • “Trade-off” algorithm: (1-1/e)-approx (~.6321) • Best possible guarantee (see Vazirani et al.; a sketch follows)
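
A sketch of one plausible trade-off rule, under the assumption that it is the Mehta/Saberi/Vazirani/Vazirani scoring bid * psi(fraction of budget spent) with psi(f) = 1 - e^(f-1); the bidders, bids, and budgets are made up.

    import math

    bids    = {"acme": 0.50, "bolt": 0.40}   # hypothetical bid per click
    budgets = {"acme": 10.0, "bolt": 100.0}  # hypothetical daily budgets
    spent   = {"acme": 0.0,  "bolt": 0.0}

    def psi(f):
        # Discount factor: about 1 - 1/e with budget untouched, 0 when exhausted.
        return 1.0 - math.exp(f - 1.0)

    def assign(bidders):
        # Pick the bidder maximizing bid * psi(fraction of budget spent).
        live = [b for b in bidders if spent[b] + bids[b] <= budgets[b]]
        if not live:
            return None
        best = max(live, key=lambda b: bids[b] * psi(spent[b] / budgets[b]))
        spent[best] += bids[best]   # in reality, charged only on a click
        return best

    for _ in range(30):
        assign(["acme", "bolt"])
    print(spent)   # bolt overtakes acme as acme's budget fraction climbs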

  9. Click Fraud • Medium-sized challenge: • Users who click on ad links to cost their competitors money • Or pay housewives in India $.25/click • http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article • http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms

  10. For more info • See sources drawn upon here: • Prof. Davis (NYU/CS) search engines course • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/ • Original research papers by Page & Brin: • The PageRank Citation Ranking: Bringing Order to the Web • The Anatomy of a Large-Scale Hypertextual Web Search Engine • Links on class page • Interesting and very accessible • Why search the web when you could download it? • Webaroo

  11. New topic: Data Mining • Situation: data-rich but knowledge-poor • Terabytes and terabytes of data • Searching for needles in haystacks • Can collect lots of data • credit card transactions • bar codes • club cards • webserver logs • call centers • cameras • 311 calls in NYC • http://blogs.law.harvard.edu/jim/discuss/msgReader$351 • http://www.nytimes.com/2003/12/01/nyregion/01PHON.html?ex=1385701200&en=0d74f2f58599fe35&ei=5007&partner=USERLAND

  12. Lots of data • Can store this data • DBMSs, data warehousing • Can query it • SQL • DW/OLAP queries (CUBE, ROLLUP, etc.) • Find the things such that… • Find the totals for each of… • Can get answers to specific questions, but what does it all mean?

  13. But how to learn from this data? • What kinds of people will buy my widgets? • Whom should I send my direct-mail literature to? • How can I segment my market? • http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html • Whom should we approve for a gold card? • Who gets approved for car insurance? • And what should we charge?

  14. ML/DM: two basic kinds • Supervised learning • Classification • Regression • Unsupervised learning • Clustering • Find something interesting, frequent sets…

  15. Supervised learning • Situation: you have many particular instances • Transactions • Customers • Credit applications • Have various fields describing them • But there’s one other property you’re interested in • Goal: infer this dependent property from the other data

  16. Supervised learning • Supervised learning starts with training data • Many instances including the dependent property • Use the training data to build a model • Many different algorithms • genetic algorithms, neural networks, decision trees, etc. • Given the model, you can now determine the dependent property for new instances • And ideally, the answers are “correct” • Categorical property → “classification” • Numerical property → “regression”

  17. Classification applications • Construe as inference of one property from others: • Spam filtering • Network intrusion detection • Trading strategies • Fraud detection • See also: Zipf’s & Benford’s laws • Political marketing? • Mail-order tulip bulb buyers → Conservative party voters

  18. k-Nearest Neighbor • Very simple algorithm • Sometimes works well • Map training data points into a space • Given any two points, can compute the distance between them • e.g., Euclidean distance • Given an unlabeled point, label it this way: • Find the k nearest points • Let them vote on the label • (a sketch follows)
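
A minimal Python sketch of k-NN as just described; the four training points and their labels are made up.

    import math
    from collections import Counter

    # Toy labeled training data: (features, label).
    train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
             ((5.0, 5.0), "no"),  ((4.8, 5.2), "no")]

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def knn_label(point, k=3):
        # Find the k nearest training points and let them vote on the label.
        nearest = sorted(train, key=lambda t: euclidean(point, t[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    print(knn_label((1.1, 0.9)))   # "yes": 2 of its 3 nearest neighbors vote yes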

  19. Naïve Bayes Classifier • Bayes’ Theorem: Pr(B|A) = Pr(B,A)/Pr(A) = Pr(A|B)*Pr(B)/Pr(A) • Idea: Pr(B|A) can be defined in terms of Pr(A|B) • Here: Pr(S|W) = Pr(S,W)/Pr(W) = Pr(W|S)*Pr(S)/Pr(W) • W means: “the msg has the word W (or several words…)” • S means: “it’s spam” • Easy: we can (mathematically) compute Pr(W|S) • Goal: we want to get Pr(S|W) – i.e., is this spam?

  20. Naïve Bayes Classifier • This is supervised learning, but training and use are intermingled • Spam filter improves over time • But let’s assume we’re given labeled sets of spam and good mail • Training phase: • For each word Wi, compute Pr(Wi) • For each word Wi, compute Pr(Wi|S) • Compute Pr(S) • That’s it! • Now, we wait for email to arrive…

  21. Naïve Bayes Classifier • Msgs may have many words W = W1..Wn • So we need to predict Pr(S|W) = Pr(S|W1..Wn) = Pr(W1..Wn|S)*Pr(S)/Pr(W1..Wn) • Assuming total independence*, we get • Pr(W1…Wn) = Pr(W1)*…*Pr(Wn) • Pr(W1..Wn|S) = Pr(W1|S)*…*Pr(Wn|S) • We’ve precomputed the numbers on the right • Now we can compute what we need • * Which isn’t true…

  22. Naïve Bayes Classifier • To decide the spam status of a new msg, we just compute Pr(S|W) • If this is > 90% (say), call it spam • False positives are very bad (why?) • For each new message (spam/not), we update our probabilities • Simple and unprincipled, but works very well • can write it in a page of Perl code (a Python sketch follows) • See also Paul Graham: http://www.paulgraham.com/spam.html
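
A toy Python sketch of the scheme from the last three slides (the slides mention a page of Perl; this is the same idea). The two corpora and the smoothing constant are made up, and it normalizes the spam score against the good-mail score, a standard equivalent of dividing by Pr(W) under the independence assumption.

    from collections import Counter

    # Toy labeled corpora (made up); a real filter trains on thousands of msgs.
    spam = ["win cash now", "cheap pills now"]
    good = ["lunch at noon", "project meeting now"]

    def word_probs(msgs):
        words = [w for m in msgs for w in m.split()]
        return {w: c / len(words) for w, c in Counter(words).items()}

    pr_w_given_s = word_probs(spam)    # Pr(Wi|S), from the training phase
    pr_w_given_g = word_probs(good)    # Pr(Wi|not-S)
    pr_s = len(spam) / (len(spam) + len(good))   # Pr(S)

    def pr_spam(msg):
        # Pr(S|W1..Wn) is proportional to Pr(W1|S)*...*Pr(Wn|S)*Pr(S) under
        # the independence assumption; normalizing against the not-spam
        # score plays the role of dividing by Pr(W1..Wn).
        ps, pg = pr_s, 1.0 - pr_s
        for w in msg.split():
            ps *= pr_w_given_s.get(w, 1e-6)   # smooth unseen words (assumption)
            pg *= pr_w_given_g.get(w, 1e-6)
        return ps / (ps + pg)

    print(pr_spam("cheap cash now") > 0.9)   # True: call it spam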

  23. New topic: Clustering • Unsupervised learning • partition items into clusters of the same type • What different types of customers do we have? • What different types of webpages are there? • Each item becomes a point in the space • One dimension for every property • Possible plotting for webpages: a dimension for each word • Value in dimension d is (# occurrences of d) / (# of all word occurrences), as in the sketch below • k-means
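
A quick Python sketch of that page-to-point mapping; the sample text is made up.

    from collections import Counter

    def doc_vector(text):
        # One dimension per word; value = occurrences of the word / total words.
        words = text.lower().split()
        return {w: c / len(words) for w, c in Counter(words).items()}

    print(doc_vector("buy cheap widgets buy now"))
    # {'buy': 0.4, 'cheap': 0.2, 'widgets': 0.2, 'now': 0.2}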

  24. k-means • Simple clustering algorithm • Want to partition data points into sets of “similar” instances • Plot in n-space • → partition points into clumps in space • Like an unsupervised analog to k-NN • But iterative

  25. k-means • Alg: • Choose initial means (centers) • do { assign each point to the closest mean; recompute the means of the clumps } until change in means < e • Visualization: • http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html • Gone! (a runnable sketch follows)
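
Since the visualization link is dead, here is a minimal runnable Python sketch of the loop above; the five points and k = 2 are made up.

    import random

    points = [(1.0, 1.0), (1.5, 1.8), (5.0, 8.0), (6.0, 8.5), (1.2, 0.6)]
    k = 2

    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    means = random.sample(points, k)              # choose initial means
    while True:
        # assign each point to the closest mean
        clumps = [[] for _ in range(k)]
        for p in points:
            clumps[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        # recompute the means of the clumps (keep the old mean if one is empty)
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else means[i] for i, c in enumerate(clumps)]
        if max(dist2(m, n) for m, n in zip(means, new)) < 1e-12:  # change < e
            break
        means = new

    print(means)   # roughly the centers of the two clumps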

  26. New topic: Frequent Sets • Find sets of items that go together • Intuition: market-basket data • Suppose you’re Wal-Mart, with 460TB of data • http://developers.slashdot.org/article.pl?sid=04/11/14/2057228 • Might not know customer history (but maybe: club cards!) • One thing you do know: groupings of items • Each set of items rung up in an individual purchase • Famous (claimed) example: beer and diapers are positively correlated • Intuition: fathers can’t take babies to the bar • → colocate beer and diapers

  27. Finding Frequent Sets • Situation: one table • Baskets(id, item) • One row for each purchase of each item • Alternative interpretation: words on webpages • Goal: given support s, find sets of items that appear together in >= s baskets • For pairs, the obvious attempt:
      SELECT B1.item, B2.item, COUNT(*)
      FROM Baskets B1, Baskets B2
      WHERE B1.id = B2.id AND B1.item < B2.item
      GROUP BY B1.item, B2.item
      HAVING COUNT(*) >= s;
  • Can we do better?

  28. A Priori Property • a priori ~ prior to • Prior to investigation • Deductive/from reason/non-empirical • As opposed to a posteriori ~ post- • After investigation/experience • Logical observation: if support(X) >= s, then for every individual item x in X, support(x) >= s • → any item y with support < s can be ignored

  29. A Priori e.g. • You sell 10,000 items, store records of 1,000,000 baskets, with an avg of 4 items each • → 4,000,000 rows in Baskets • # pairs of items from 1,000,000 baskets = C(4,2)*1,000,000 = 6,000,000 pairs! • One idea: • Find the support-s items • Run the old query on just them

  30. A Priori Property • Suppose we’re looking for sup-10,000 pairs • Each item in the pair must itself be sup-10,000 • Counting argument: there are only 4,000,000 individual items purchased • Can have at most 400 = 4,000,000/10,000 popular items • → we get to ignore most of the rows • The naïve solution above is now much cheaper • Can we do better?

  31. A Priori Algorithm • But: a frequent itemset may have > 2 items • Idea: build frequent sets (not just pairs) iteratively • Alg (sketched below): • First, find all frequent items • These are the size-1 frequent sets • Next, for each size-k frequent set • For each frequent item • Check whether the union is frequent • If so, add it to the collection of size-(k+1) frequent sets
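
A minimal Python sketch of the iteration just described; the four toy baskets and the threshold s = 2 are made up.

    from collections import Counter

    baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"},
               {"milk", "chips"}, {"beer", "diapers", "milk"}]
    s = 2   # support threshold

    # Size-1 frequent sets: items appearing in >= s baskets.
    counts = Counter(item for b in baskets for item in b)
    frequent = [frozenset([i]) for i, c in counts.items() if c >= s]
    items = {i for f in frequent for i in f}

    # Size-(k+1): extend each size-k frequent set by one frequent item.
    level = frequent
    while level:
        candidates = {f | {i} for f in level for i in items if i not in f}
        level = [c for c in candidates
                 if sum(c <= b for b in baskets) >= s]   # support of the union
        frequent += level

    print(frequent)   # includes {beer, diapers}, which appears in 3 baskets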

  32. Related goal: rules • Would like to infer: bought A → interested in B • Related to frequent sets, but now • A,B should be frequent (have high support) and • The rule should be true (have high confidence: support(A and B together) / support(A) is high; see below) • More advanced algs…
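
A quick check of the confidence formula in Python, on the same toy baskets as the sketch above; the rule {beer} → {diapers} is hypothetical.

    # Toy baskets (same as the a priori sketch above).
    baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"},
               {"milk", "chips"}, {"beer", "diapers", "milk"}]

    # Confidence of the hypothetical rule {beer} -> {diapers}:
    support_beer = sum("beer" in b for b in baskets)               # 3
    support_both = sum({"beer", "diapers"} <= b for b in baskets)  # 3
    print(support_both / support_beer)   # 1.0: every beer basket had diapers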

  33. Next: Media Failures • Rule of thumb: Pr(hard drive has a head crash within 10 years) = 50% • Simpler rule of thumb: Pr(hard drive has a head crash within 1 year) = (say) 10% • If you have many drives, failures are a regular occurrence (e.g., 1,000 drives at 10%/year means ~100 expected failures per year) • Solution: different RAID strategies • RAID: Redundant Arrays of Independent Disks

  34. RAID levels • RAID level 1: each disk gets a mirror • RAID level 4: one disk is the xor of all the others • Each bit is the sum mod 2 of the corresponding bits • E.g.: • Disk 1: 10110011 • Disk 2: 10101010 • Disk 3: 00111000 • Disk 4: ? (computed in the sketch below) • How to recover? • What’s the disadvantage of R4? • Various other RAID levels in text…
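
A minimal Python sketch of the parity arithmetic, using the three disk bytes from the slide; it also shows the recovery step.

    # RAID-4 parity: the parity disk holds the XOR (sum mod 2) of the others.
    disk1 = 0b10110011
    disk2 = 0b10101010
    disk3 = 0b00111000

    parity = disk1 ^ disk2 ^ disk3    # what Disk 4 stores
    print(f"{parity:08b}")            # fills in the slide's blank

    # Recovery: if Disk 2 dies, XOR together everything that's left.
    assert disk1 ^ disk3 ^ parity == disk2   # XOR is its own inverse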

  35. New topic: Regular Expressions • Finite Automata are studied in automata theory • FAs are the simplest/weakest kind of computer • Turing Machines are the strongest • Chomsky’s Hierarchy • Each FA is equivalent to some regular expression • Expressions that specify a pattern • Can check whether a string matches the pattern • The regex generates words • the equivalent FA accepts them

  36. RegEx matching • Use REGEXP_LIKE in Oracle • The metachar for any single char is . • First, get the employee_comment table: • http://pages.stern.nyu.edu/~mjohnson/oracle/empcomm.sql • And install…

  37. Working with SQL*Plus • Command on sales: sqlplus • See my login.sql for settings… • Run a local file of scripts: • SQL> @filename • Get list of my tables: • SQL> select * from cat; • Describe a table: • SQL> describe table-name

  38. RegEx matching • Now do a search • So far, a lot like LIKE:
      SELECT emp_id, text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '...-....');

  39. RegEx matching • Can also pull out the matching text with REGEXP_SUBSTR:
      SELECT emp_id, REGEXP_SUBSTR(text, '...-....') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '...-....');
  • If we want only numbers, can specify a set of chars rather than a dot:
      SELECT emp_id, REGEXP_SUBSTR(text, '[0123456789]..-...[0123456789]') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '[0123456789]..-...[0123456789]');

  40. RegEx matching • Or can specify a range of chars:
      SELECT emp_id, REGEXP_SUBSTR(text, '[0-9]..-...[0-9]') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '[0-9]..-...[0-9]');
  • Or, finally, can repeat a match several times:
      SELECT emp_id, REGEXP_SUBSTR(text, '[0-9]{3}-[0-9]{4}') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '[0-9]{3}-[0-9]{4}');

  41. RegEx matching • Other operators: • * - 0 or more matches • + - 1 or more matches • ? - 0 or 1 match • | - OR two options together • Here: we should catch area codes, if they appear:
      SELECT emp_id, REGEXP_SUBSTR(text,
             '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}');

  42. RegEx matching • There’s shared structure between the two alternatives, but the area code is just optional • Can use the ? op • Must use parens to indicate that the whole area-code regex is optional:
      SELECT emp_id, REGEXP_SUBSTR(text, '([0-9]{3}-)?[0-9]{3}-[0-9]{4}') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '([0-9]{3}-)?[0-9]{3}-[0-9]{4}');

  43. RegEx matching • Also, different kinds of phone-number separators: • dash, dot, blank • One solution: • Write a RegEx for each case • OR them all together • Better: just use a set of choices for each separator • Characters in [ ] mean: match any one of them:
      SELECT emp_id, REGEXP_SUBSTR(text, '([0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '([0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}');

  44. RegEx matching • One other thing: area codes in parentheses • Parens are part of the regex language • So these parens must be escaped: \( \)
      SELECT emp_id, REGEXP_SUBSTR(text,
             '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}') text
      FROM employee_comment
      WHERE REGEXP_LIKE(text, '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}');

  45. Future • Project Presentations • Practice hw3 • Final Exam • Info up soon…
