1 / 42

C20.0046: Database Management Systems Lecture #25

C20.0046: Database Management Systems Lecture #25. M.P. Johnson Stern School of Business, NYU Spring, 2005. Agenda. Querying XML Data Warehousing Next week: Data Mining Websearch Etc. Goals after today:. Be aware of some of the important XML standards

varuna
Download Presentation

C20.0046: Database Management Systems Lecture #25

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. C20.0046: Database Management SystemsLecture #25 M.P. Johnson Stern School of Business, NYU Spring, 2005 M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  2. Agenda • Querying XML • Data Warehousing • Next week: • Data Mining • Websearch • Etc. M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  3. Goals after today: • Be aware of some of the important XML standards • Know how to write some DW queries in Oracle M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  4. XML: Semi-structured data • Not too random • Data organized into entities • Similar/related grouped to form other entities • Not too structured • Some attributes may be missing • Size of attributes may vary • Support of lists/sets • Juuust Right • Data is self-describing M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  5. <movieinfo> <movieid="o111"> <title>Lost in Translation</title> <year>2003</year> <starsidref="o333 o444"/> </movie> <movieid="o222"> <title>Hamlet</title> <year>1999</year> <starsidref="o333"/> </movie> <personid="o111"> <name>Bill Murray</name> <moviesidref="o111 o222"/> </person> </movieinfo> M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  6. New topic: Querying XML • XPath • Simple protocol for accessing node • Will use in XQuery and conversion from relations • XQuery • SQL : relations :: XQuery : XML • XSLT • sophisticated transformations • Sometimes for presentation M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  7. XQuery • Queries are FLWR expressions • Based on Quilt and XML-QL FOR/LET... WHERE... RETURN... FOR $b IN document("bib.xml")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998" RETURN $b/title M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  8. XQuery • Find all book titles published after 1995: FOR $x IN document("bib.xml")/bib/book WHERE $x/year > 1995 RETURN { $x/title } Result: <title>abc</title> <title>def</title> <title>ghi</title> M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  9. SQL v. XQuery Product(pid, name, maker)Company(cid, name, city) Find all products made in NYC SELECT x.name FROM Product x, Company y WHERE x.maker=y.cid and y.city="NYC" SQL FOR $r in document("db.xml")/db, $x in $r/Product/row, $y in $r/Company/row WHERE $x/maker/text()=$y/cid/text() and $y/city/text() = "NYC" RETURN { $x/name } XQuery M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  10. SQL v. XQuery For each company with revenues < 1M count the products over $100 SELECT y.name, count(*) FROM Product x, Company y WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000GROUP BY y.cid, y.name FOR $r in document("db.xml")/db, $y in $r/Company/row[revenue/text()<1000000] RETURN <Company> <companyName>{ $y/name/text() }</companyName> <numberOfExpensiveProducts> { count( $r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) } </numberOfExpensiveProducts> </Company> M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  11. XSLT: XSL Transformations • Converts XML docs to other XML docs • Or to HTML, PDF, etc. • E.g.: Have data in XML, want to display to all users • Users view web with IE, Firefox, Treo… • Have XSLT convert to HTML that looks good on each • XSLT processor takes XML doc and XSL template for view M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  12. XSLT v. XQuery • FLWR expressions: • Often much simpler than XSLT • XSLT v. XQuery: • http://www.xmlportfolio.com/xquery.html <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <xsl:for-each select="document('bib.xml')//book"> <xsl:if test="publisher='Morgan Kaufmann' and year='1998'"> <xsl:copy-of select="title"/> </xsl:if> </xsl:for-each> </xsl:template> </xsl:transform> FOR $b IN document("bib.xml")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998“ RETURN $b/title M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  13. Displaying XML with XSL/XSLT • XSL: style sheet language for XML • XSL : XML :: CSS : HTML • Menu in XML: • http://www.w3schools.com/xml/simple.xml • XSL file for displaying it: • http://www.w3schools.com/xml/simple.xsl • XSL applied to the XML: • http://www.w3schools.com/xml/simplexsl.xml • More info on Java with XSLT and XPath: • http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  14. From XML to relations (Oracle) • To move single values from XML to tables, can simply use extractvalue in UPDATE statements: SQL> UPDATE purchase_order SET order_nbr = 7101, customer_po_nbr = extractvalue(purchase_order_doc, '/purchase_order/po_number'), customer_inception_date = to_date(extractvalue(purchase_order_doc, '/purchase_order/po_date'), 'yyyy-mm-dd'); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  15. From relations to XML (Oracle) • Saw how to put XML in a table • Conversely, can convert ordinary rel data to XML • XMLElement() generates an XML node • Now can call XMLElement ftn to wrap vals in tags: • And can build it up recursively: SELECT XMLElement("supplier_id", s.supplier_id) || XMLElement("name", s.name) xml_fragment FROM supplier s; SELECT XMLElement("supplier", XMLElement("supplier_id", s.supplier_id), XMLElement("name", s.name)) FROM supplier s; M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  16. Why XML matters • Hugely popular • To past few years what Java was to mid-90s • Buzzword-compliant • XML databases won’t likely replace RDBMSs (remember OODBMSs?), but: • Allows for comm. between DBMSs disparate architectures, tools, languages, etc. • Basis for Web Services • DBMS vendors are adding XML support • MS, Oracle, et al. M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  17. For more info • APIs: SAX, JAXP • Editors: XML Spy, MS XML Notepad: http://www.webattack.com/get/xmlnotepad.shtml • Parsers: Saxon, Xalan, MS XML Parser • Lectures drew on resources from: • Nine-week course on XML: • http://www.cs.rpi.edu/~puninj/XMLJ/classes.html • W3C XML Tutorial: • http://www.w3schools.com/xml/default.asp • http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  18. Recent XML news/etc. • Group at Sun planning “binary XML” • http://developers.slashdot.org/article.pl?sid=05/01/14/1650206&tid=156 • XML is “simple and sloppy” • http://www.adambosworth.net/archives/000031.html • RDF: Resource Definition Framework • Metadata for the web  “Semantic web” • Content, authors, relations to other content • http://www.w3.org/DesignIssues/RDFnot.html • Web + XML = the “global mind” • http://novaspivack.typepad.com/nova_spivacks_weblog/2004/06/minding_the_pla.html M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  19. New topic: Data Warehousing • Physical warehouse: stores different kinds of items • combined from different sources in supply chain • access items as a combined package • “Synergy” • DW is the sys containing the data from many DBs • OLAP is the system for easily querying the DW • Online analytical processing • front-end to DW & stats M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  20. Integrating Data • Ad hoc combination of DBs from different sources can be problematic • Data may be spread across many systems • geographically • by division • different systems from before mergers… M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  21. Conversion/scrubbing/merging • Lots of issues… • different types of data • Varchar(255) v. char(30) • Different values for data • ‘GREEN’/’GR/’2 • Semantic differences • Cars v. Automobiles • Missing values • Handle with nulls or XML M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  22. Federated DBs • Situ: n different DBs must work together • One idea: write programs for each to talk to each other one • How many programs required? • Like ambassadors for each country M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  23. Federated DBs • Better idea: introduce another DB • write programs for it to talk to each other DB • Now how many programs? • English in business, French in diplomacy •  Warehousing • Refreshed nightly M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  24. OLTP v. OLAP • DWs usually not updated in real-time • data is usually not live • but care about higher-level, longer-term patterns • For “knowledge workers”/decision-makers • Live data is in system used by OLTP • online transaction processing • E.g., airline reservations • OLTP data loaded into DW periodically, say nightly M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  25. Utilizing Data • Situ: each time manager has hunch •  requests custom reports •  direct programmers to write/modify SQL app to produce these results • on higher or lower levels, for different specifics • Problem: too difficult/expensive/slow • too great a time lag M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  26. EISs • Could just write queries at command-prompt • But decision makes aren’t (all) SQL programmers • Soln: create an executive information system • provides friendly front-end to common, important queries • basically a simple DB front-end • your project part 5 • GROUP BY queries are particularly applicable… M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  27. EISs v. OLAP • Okay for fixed set of queries • But what if queries are open-ended? • Q: What’s driving sales in the Northeast? • What’s the source cause? • Result from one query influences next query tried • OLAP systems are interactive: • run query • analyze results • think of new query • repeat M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  28. Star Schemas • Popular schema for DW data • One central DB surrounded by specific DBs • Center: fact table • Extremities: data tables • Fields in fact table are foreign keys to data tables • Normalization  Snowflake Schema • May not be worthwhile… M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  29. Dates and star schemas • OLAP behaves as though you had a Days table, with every possible row • Dates(day, week, month, year, DID) • (5, 27, 7, 2000) • Can join on Days like any other table M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  30. Dates and star schemas • E.g.: products x salesperson x region x date • Products sold by salespeople in regions on dates • Regular dim tables: • Product(PID, name, color) • Emp(name, SSN, sal) • Region(name, RID) • Fact table: • Sales(PID, DID, SSN, RID) • Interpret as a cube (cross product of all dimensions) • Can have both data and stats M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  31. Drill-down & roll-up • Imagine: notice some region’s sales way up • Why? Good salesperson? Some popular product there? • Maybe need to search by month, or month and product, abstract back up to just product… • “slicing & dicing” M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  32. OLAP and data warehousing • Could write GROUP BY queries for each • OLAP systems provide simpler, non-SQL interface for this sort of thing • Vendors: MicroStrategy, SAP, etc. • Otoh: DW-style operators have been added to SQL and some DBMSs… M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  33. DW extensions in SQL: ROLLUP (Oracle) • Suppose have orders table (from two years), with region and date info: • Can select total sales: • Examples derived/from Mastering Oracle SQL, 2e (O’Reilly) • Get data here: http://examples.oreilly.com/mastorasql2/mosql2_data.sql SQL> column month format a10 SQL> @mosql2_data SQL> describe all_orders; SELECT sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id; M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  34. DW extensions in SQL: ROLLUP (Oracle) • Can write GROUP BY queries for year or region or both: SELECT r.name region, o.year, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY (r.name, o.year); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  35. DW extensions in SQL: ROLLUP (Oracle) • ROLLUP operator • Extension of GROUP BY • Does GROUP BY on several levels, simultaneously • Order matters • Get sales totals for each region/year pair each region, and the grand total: SELECT r.name region, o.year, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP (r.name, o.year); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  36. DW extensions in SQL: ROLLUP (Oracle) • Change the order of the group fields to get a different sequence of groups • To get totals for each year/region pair, each year, and the grand total, and just reverse group-by order: SELECT o.year, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP (o.year, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  37. DW extensions in SQL: ROLLUP (Oracle) • Adding more dimensions, like month, is easy (apart from formatting): • NB: summing happens on each level SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP (o.year, o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  38. DW extensions in SQL: ROLLUP (Oracle) • If desired, can combine fields for the sake of grouping: SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY ROLLUP ((o.year, o.month), r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  39. DW extensions in SQL: CUBE (Oracle) • Another GROUP BY extension: CUBE • Subtotals all possible combins of group-by fields (powerset) • Syntax: “ROLLUP”  “CUBE” • Order of fields doesn’t matter (apart from ordering) • To get subtotals for each region/month pair, each region, each month, and the grand total: SELECT to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY CUBE (o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  40. DW extensions in SQL: CUBE (Oracle) • Again, can easily add more dimensions: SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY CUBE (o.year, o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  41. DW SQL exts: GROUPING SETS (Oracle) • That’s a lot of rows • Instead of a cube of all combinations, maybe we just want the totals for each individual field: SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales) FROM all_orders o join region r ON r.region_id = o.region_id GROUP BY GROUPING SETS (o.year, o.month, r.name); M.P. Johnson, DBMS, Stern/NYU, Spring 2005

  42. Next time • Overview of data mining • Some other odds & ends… M.P. Johnson, DBMS, Stern/NYU, Spring 2005

More Related