Organization - PowerPoint PPT Presentation

Presentation Transcript

  1. Semistructured data -- June 2001 • Semistructured data: from practice to theory • Serge Abiteboul, INRIA & Xyleme • Serge.Abiteboul@xyleme.com

  2. Organization • Motivations • XML • Typing XML • Querying XML • XML and the Web • Illustrations: 2 problems • Incomplete information • Xyleme • Conclusion

  3. Motivations

  4. Motivation: Complex data • Structure is irregular (missing/extra data…) • Schema does not exist or is unknown • Schema is rapidly evolving • Relational and ODB models are too rigid • Example: BibTex, HTML, SGML, XML, ASN.1, STEP/Express…

  5. Complex data: mediation • [Figure: User, Mediator with ontology and meta-data, six wrappers, six Sources] • Many data sources coming and going

  6. Motivations: The Web today • Terabytes of data • Private web: pages that are not publicly available • Deep web: data hidden behind forms • A lot of public pages • The standard is a document/hypertext language: HTML

  7. The Web today [Raghavan ’00] • Browsing • Search engines • in: list of words • out: sorted list of URLs • Applications: hand-made wrappers • Expensive • Incomplete • Short-lived, not adapted to the Web's constant changes

  8. A new standard: XML • HTML is not appropriate for data exchange on the Web • Standard database models are too constraining for the Web • The solution: a semistructured data model, XML • Reminder: a data model consists of a type definition language, a query/update language + more

  9. The most successful semistructured data model: XML

  10. The origin of XML • Parents • SGML • Relational and OO databases • SGML: markup language for documents • HTML and the Web: billions of pages • Not appropriate for data exchange • XML eXtensible Mark-up Language • W3C and most industrial companies [B2B] • Main idea: separate content and presentation • Use tags to represent structure and semantics

  11. HTML vs. XML • HTML: comes from SGML; XML: also • HTML: hypertext language; XML: semistructured data • HTML: fixed number of tags; XML: not fixed • HTML: content and presentation are mixed; XML: not mixed • HTML: very difficult to extract data from a page; XML: much easier • HTML: old standard for the Web; XML: new standard • XML: documents + databases

  12. HTML = Hypertext Language • Information System (Ref, Name, Price): X23, Camera, 359.99; R2D2, Robot, 19350.00; Z25, PC, 1299.99 • HTML page: The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. • Text + presentation • Where is the data? • Extraction: hard

  13. XML = Semistructured Data • Information System (Ref, Name, Price): X23, Camera, 359.99; R2D2, Robot, 19350.00; Z25, PC, 1299.99; ... • XML: <product-table> <product reference="X23"> <designation> camera </designation> <price unit="Dollars"> 359.99 </price> <description> … </description> </product> <product reference="R2D2"> <designation> Robot </designation> <price unit="Dollars"> 19350 </price> <description> … </description> </product> ... </product-table> • Data + Structure • Extraction: easy • Semistructured: more flexible

  14. XML: example • <dealer> <UsedCars> <ad> <model>Honda</model> <year>96</year></ad> </UsedCars> <NewCars> <ad> <model>Acura</model> </ad> </NewCars> <NewCars> <ad> <model>R406</model> </ad> </NewCars> </dealer> • [Tree: dealer with children UsedCars (ad: model Honda, year 96), NewCars (ad: model Acura), NewCars (ad: model R406)] • It is just an unranked, tagged, ordered tree
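
The "unranked tagged ordered tree" reading of the dealer example can be sketched in Python with the standard `xml.etree.ElementTree` module. The helper name `outline` is hypothetical and only serves this illustration; it walks the children in document order and returns one indented line per element.

```python
import xml.etree.ElementTree as ET

# The dealer document from the slide.
DEALER_XML = """
<dealer>
  <UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>
  <NewCars><ad><model>Acura</model></ad></NewCars>
  <NewCars><ad><model>R406</model></ad></NewCars>
</dealer>
"""

def outline(node, depth=0):
    """Return one line per element, indented by depth, in document order."""
    text = (node.text or "").strip()
    lines = ["  " * depth + node.tag + (f": {text}" if text else "")]
    for child in node:  # any number of children (unranked), order preserved
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DEALER_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Note that `dealer` has two children with the same tag `NewCars`, which is exactly what "unranked" and "tags need not be distinct" allow.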

  15. XML • Tree or graph • Data and structure/semantics are mixed • Tags contain typing information • Core constructor is a list of tag/value pairs • Details • Each node may have an arbitrary number of children, whose tags may or may not be distinct • Nodes also have attributes, which are unordered and unique per node • Standard means to represent cyclic data: ID/IDREFS

  16. XML • Very active/noisy field: standards • Types (DTD/XML Schema), style sheets (XSL), resource description (RDF...) • DOM, SAX… • WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... • How fast will XML conquer the Web? • So far rather slowly (about 1% of the visible Web now; much more in intranets), but accelerating (e.g., with Explorer 5.5)

  17. Typing XML

  18. Typing XML • This is heresy for the freedom of the Web • Essential for data management: query optimization, user interfaces, applications • Differences with standard database typing • Collections are sequences instead of sets • Types may be very large (e.g., from integration) • Data is more irregular so types should be more permissive • Sometimes a new issue: given the data, extract its type, or an approximate type

  19. Intuition: the type is a tree • Semantics and structure are in paths • dealer/UsedCars/ad • dealer/UsedCars/ad/model • [Type tree: dealer with children UsedCars (ad: model, year) and NewCars (ad: model); leaves are text]

  20. DTD: a grammar • Catalog → Product* • Product → Name Price? Cat (Part Quantity)* • Part → BasicPart + ComposedPart • BasicPart → Name • ComposedPart → Name (Part Quantity)* • Nice and simple • Shortcoming: type of an element is independent of its context
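
The grammar view of a DTD can be sketched as one regular expression per element over the sequence of its children's tags. This is a minimal illustration, not a real DTD validator. Two readings are assumptions: the rule for BasicPart is taken to be BasicPart → Name, and "BasicPart + ComposedPart" is read as a choice between the two. The names `CONTENT_MODELS` and `is_valid` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One content model per element, as a regex over space-terminated child tags.
# Assumption: "+" in the Part rule denotes a choice (BasicPart or ComposedPart).
CONTENT_MODELS = {
    "Catalog": r"(Product )*",
    "Product": r"Name (Price )?Cat (Part Quantity )*",
    "Part": r"(BasicPart|ComposedPart) ",
    "BasicPart": r"Name ",
    "ComposedPart": r"Name (Part Quantity )*",
}

def is_valid(node):
    """Check a node and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(node.tag)
    if model is None:
        # Elements without a rule (Name, Price, Cat, Quantity) hold text only.
        return len(node) == 0
    children = "".join(child.tag + " " for child in node)
    if not re.fullmatch(model, children):
        return False
    return all(is_valid(child) for child in node)

good = ET.fromstring(
    "<Catalog><Product><Name>cam</Name><Price>10</Price><Cat>elec</Cat>"
    "<Part><BasicPart><Name>lens</Name></BasicPart></Part>"
    "<Quantity>1</Quantity></Product></Catalog>"
)
bad = ET.fromstring(
    "<Catalog><Product><Cat>elec</Cat><Name>cam</Name></Product></Catalog>"
)
print(is_valid(good), is_valid(bad))
```

The rule applied to an element depends only on its tag, never on where the element occurs, which is the shortcoming the slide points out.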

  21. More complex: specialization • Type of ad depends on its context • One way to view it: homomorphism • [Figure: a type tree where UsedCars contains adused (model, year) and NewCars contains adnew (model), next to the document tree where both are tagged ad]
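
Specialization can be sketched by separating type names from tags: `adused` and `adnew` are two types that both carry the tag `ad`, and content models range over type names. The content models below are assumptions made for this illustration (the slide only shows the type tree), and `TYPES` and `has_type` are hypothetical names. The check tries every assignment of types to children, which is fine for a toy example but not efficient.

```python
import re
import xml.etree.ElementTree as ET
from itertools import product

# type name -> (tag, content model over space-terminated type names).
# Assumed models: used ads have model and year, new ads have model only.
TYPES = {
    "dealer":   ("dealer",   r"(UsedCars |NewCars )*"),
    "UsedCars": ("UsedCars", r"(adused )*"),
    "NewCars":  ("NewCars",  r"(adnew )*"),
    "adused":   ("ad",       r"model year "),
    "adnew":    ("ad",       r"model "),
    "model":    ("model",    r""),
    "year":     ("year",     r""),
}

def has_type(node, type_name):
    """True if node can be given type_name under TYPES."""
    tag, model = TYPES[type_name]
    if node.tag != tag:
        return False
    # For every child, collect the types it can take.
    options = [[t for t in TYPES if has_type(child, t)] for child in node]
    # Accept if some choice of child types spells a word of the content model.
    return any(
        re.fullmatch(model, "".join(t + " " for t in choice))
        for choice in product(*options)
    )

ok_doc = ET.fromstring(
    "<dealer><UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>"
    "<NewCars><ad><model>Acura</model></ad></NewCars></dealer>"
)
bad_doc = ET.fromstring(
    "<dealer><NewCars><ad><model>Acura</model><year>01</year></ad></NewCars></dealer>"
)
print(has_type(ok_doc, "dealer"), has_type(bad_doc, "dealer"))
```

Erasing the difference between `adused` and `adnew` (mapping both to `ad`) is the homomorphism view mentioned on the slide.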

  22. Regular tree automata • Set of accepted trees: regular tree languages • Definable in monadic second-order logic • Acceptance: there is a computation such that all leaves are labeled qf • Variants: top-down/bottom-up, nondeterminism, unranked trees • [Figure: a run on the dealer tree: dealer in state q0; Used in p and New in q; ads in r under Used and in s under New; leaves m and y in qf]

  23. DTDs + specialization • Result: DTDs + specialization = regular tree languages • Closure (intersection, union, complement) • Tests for validation, inclusion • Static analysis

  24. Situation today • Many people are using DTDs • Nice and simple in spite of ugly syntax • New proposal: XML Schema • More powerful but too complicated? • Other proposals: RELAX, TREX • Usually based on some kind of regular tree automata • From experience: one will win, and not necessarily the best

  25. Query languages for XML

  26. Query Languages for XML • Extensions of SQL • First-order logic • Information retrieval: keyword search • Navigation via regular expressions + pattern matching: Lorel, XML-QL, XMAS… • Structural recursion: UnQL, XSLT… • No official winner; the leader is XQuery

  27. Pattern matching • Tree with variables and constraints • Pattern matching between the query and the data • Each match provides a valuation for X, Y, Z • [Figure: pattern tree rooted at catalog/product, with nodes name, price (<200), cat (=elec), subcategory and variables X, Y, Z]
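
A minimal sketch of this kind of pattern matching, assuming X binds the name, Y the price and Z the subcategory (the slide does not spell out the binding), and assuming a flat layout where name, price, cat and subcategory are direct children of product. The first three products use the values shown later in the catalog illustration; the last two are hypothetical fillers added so that some products fail the constraints.

```python
import xml.etree.ElementTree as ET

CATALOG_XML = """
<catalog>
  <product><name>canon</name><price>120</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>nikon</name><price>199</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>sony</name><price>175</price><cat>elec</cat><subcategory>cdplayer</subcategory></product>
  <product><name>hypothetical-tv</name><price>450</price><cat>elec</cat><subcategory>tv</subcategory></product>
  <product><name>hypothetical-chair</name><price>80</price><cat>furniture</cat><subcategory>seat</subcategory></product>
</catalog>
"""

def match_pattern(catalog):
    """Yield one valuation {X, Y, Z} per product matching cat = elec, price < 200."""
    for prod in catalog.findall("product"):
        price = prod.findtext("price")
        if prod.findtext("cat") != "elec" or price is None or float(price) >= 200:
            continue
        yield {
            "X": prod.findtext("name"),
            "Y": float(price),
            "Z": prod.findtext("subcategory"),
        }

valuations = list(match_pattern(ET.fromstring(CATALOG_XML)))
for v in valuations:
    print(v)
```

Each dictionary is one valuation of the pattern's variables; the answer to the query is built from the set of valuations.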

  28. Example in Lorel • select <offer> Z/name, P/name, P’/price </offer> from P in catalog/product, Z in discount_stores/store, Z/storecatalog/product P’ where P/category = "camera" and P/make = "canon" and P’/id = P/id • Joins like in relational databases • Construction of complex results • Regular expressions for paths (e.g., W/*/name = "Gates")
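
The join and the constructed result in the Lorel query can be mimicked in Python as nested loops over path lookups. This is only a sketch of the query's semantics, not of how Lorel evaluates it. The sample database, including the store name and product values, is entirely hypothetical, and the child element names inside each constructed offer are chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data shaped after the paths used in the query.
DB_XML = """
<db>
  <catalog>
    <product><id>1</id><name>A1</name><category>camera</category><make>canon</make></product>
    <product><id>2</id><name>B2</name><category>camera</category><make>nikon</make></product>
  </catalog>
  <discount_stores>
    <store><name>ShopOne</name>
      <storecatalog>
        <product><id>1</id><price>99</price></product>
        <product><id>2</id><price>150</price></product>
      </storecatalog>
    </store>
  </discount_stores>
</db>
"""

def offers(db):
    """Build one <offer> per (P, Z, P') combination satisfying the where clause."""
    result = ET.Element("offers")
    for p in db.findall("catalog/product"):
        if p.findtext("category") != "camera" or p.findtext("make") != "canon":
            continue
        for z in db.findall("discount_stores/store"):
            for p2 in z.findall("storecatalog/product"):
                if p2.findtext("id") == p.findtext("id"):  # the join condition
                    offer = ET.SubElement(result, "offer")
                    ET.SubElement(offer, "store").text = z.findtext("name")
                    ET.SubElement(offer, "name").text = p.findtext("name")
                    ET.SubElement(offer, "price").text = p2.findtext("price")
    return result

offers_xml = ET.tostring(offers(ET.fromstring(DB_XML)), encoding="unicode")
print(offers_xml)
```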

  29. What is new in XML queries • A bit new: limited recursion (like in deductive databases) • A bit new but no big deal: constructed answers (like in OODB) • Very new: ordered data • Troubling • Theoretical base is a bit messy: FO, tree automata, bisimulation • No yardstick like relational calculus/algebra

  30. Proposal: k-pebble transducers [Milo, Suciu, Vianu] • [Figure label: stack]

  31. k-pebble transducers: result • [Figure: tree with a root and nodes labeled a, b, c]

  32. XML and the Web

  33. Why it is the same old story • Massive amounts of data • Providers export data, users access data • Query languages, indexing, optimization • Database paradigm: still effective on the Web

  34. Why it is not the same old story • Databases • rigid structure • transactions, concurrency control • data independence • controlled (e.g., known cost model) • coherent system, very polished artifact • The Web • flexible, no schema • flexible protocols • fuzzy separation • perfect mess (and that’s why people like it?) • closer to a natural ecosystem!

  35. The principles of the Web • The uncertainty principle: you can never be sure of anything, or that the data is consistent • The incompleteness principle: they do not give you all the data you want (but some you don’t want :-) • The chaos principle: you can rarely assume the existence of some global schema • The instability principle: everything keeps changing • Every piece of data you get is probably wrong, incomplete, does not conform to its expected type and is probably already stale

  36. What can be reused? • Some technology? indexes, B-trees, distributed query processing (concurrency control and transactions not yet) • Database theory? little • Algebra and rewrite rules for optimization • Dependency theory • First order and other logics • Because the data is ordered, many more tools seem to become applicable, such as regular/tree languages

  37. Metaphor [AV]: the Web is infinite • What are the pages pointing to my homepage? • Google solution: milliseconds – stale data • Freeze the Web: weeks to get exact answer • Exact answer: no means to get it • Leads to reconsider the notion of computation

  38. Computability • Finitely computable: give the answer in finite time • All pages reached from my homepage in fewer than 3 links • Eventually computable: each solution is given in finite time; computation may be infinite • All pages reached from my homepage • Not computable • Can my homepage be reached starting from my homepage? • Also: approximate, partial, stale, pipelined answers
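
The difference between the first two notions can be sketched with breadth-first search over a link function. The function `links` below is a hypothetical stand-in for an infinite Web (pages are integers and page n links to n + 1 and 2n); it is not part of the talk. The bounded search returns a finite answer, while the generator emits every reachable page after finitely many steps but never finishes on an infinite graph.

```python
from collections import deque
from itertools import islice

def links(page):
    """Hypothetical infinite Web: page n links to pages n + 1 and 2 * n."""
    return [page + 1, 2 * page]

def reachable_within(links, start, max_links):
    """Finitely computable: pages reachable from start in at most max_links links."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        page, dist = frontier.popleft()
        if dist == max_links:
            continue  # do not follow links beyond the bound
        for nxt in links(page):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

def reachable(links, start):
    """Eventually computable: yields each reachable page in finite time; may run forever."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        yield page
        for nxt in links(page):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)

bounded = reachable_within(links, 1, 2)          # fewer than 3 links
first_five = list(islice(reachable(links, 1), 5))  # a finite prefix of an infinite answer
print(sorted(bounded), first_five)
```

Breadth-first order matters here: a depth-first traversal could follow one infinite path and never emit some reachable pages.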

  39. Tough life: the Web is huge • Relational calculus/algebra: logspace data complexity (also AC0) • What is the data complexity of an XQuery query over the Web? • Complexity of computing on the Web • Logspace in the Web? • Need to trade quality for performance

  40. The Web keeps changing • Classical: versions, temporal queries • Less classical: monitoring of the Web [Xyleme] • Smart crawling of the Web: flow of docs • Query subscription: query on this flow • Continuous queries • What is the underlying theory?
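
Query subscription over a flow of documents can be sketched as a set of standing path queries evaluated against each document as it arrives. This is an illustration only and not how Xyleme implements monitoring. The subscription names, the path expressions and the two sample documents are hypothetical; the paths use the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical subscriptions: subscriber name -> path that must match something.
SUBSCRIPTIONS = {
    "cameras": ".//product[subcategory='camera']",
    "cars": ".//ad",
}

def matching_subscriptions(subscriptions, doc_xml):
    """Return the sorted names of subscriptions whose path matches the document."""
    root = ET.fromstring(doc_xml)
    return sorted(
        name for name, path in subscriptions.items()
        if root.find(path) is not None
    )

def monitor(subscriptions, doc_flow):
    """Continuous query: lazily yield (document index, notified subscribers)."""
    for index, doc_xml in enumerate(doc_flow):
        notified = matching_subscriptions(subscriptions, doc_xml)
        if notified:
            yield index, notified

# A hypothetical flow of crawled documents.
FLOW = [
    "<catalog><product><name>canon</name><subcategory>camera</subcategory></product></catalog>",
    "<news><item>nothing relevant</item></news>",
    "<dealer><NewCars><ad><model>Acura</model></ad></NewCars></dealer>",
]
notifications = list(monitor(SUBSCRIPTIONS, FLOW))
print(notifications)
```

Because `monitor` is a generator, it processes documents one at a time and would also work on an unbounded flow.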

  41. Illustration: incomplete information • Work with Victor Vianu

  42. Example • Access to an electronic catalog • Q1: name, subcat, price of electronic products with price less than $200 • Q2: name, pictures of cameras pictured at least once

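  43. Q1: name, subcat, price of electronic products with price less than 200 • [Figure: catalog tree after Q1. Known products (name, price, cat, subcategory): canon, 120, elec, camera; nikon, 199, elec, camera; sony, 175, elec, cdplayer. The remaining products are missing and typed product1* and product2*]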

  44. Missing data after Q1 • [Figure: two types of missing products, each with name, price, cat, picture and a subcategory node: product1* with cat = elec and price > 200; product2* with cat != elec]

  45. Q2: name, pictures of cameras pictured at least once • [Figure: catalog tree after Q2. Known products: canon, 120, elec, camera; nikon, 199, elec, camera; sony, 175, elec, cdplayer; plus nodes labeled akai, elec, camera, a.jpg, c.jpg. Missing products are typed product1*, product2a, product2b*, product2c*]

  46. Known data and missing data • [Figure: types of the missing products after Q1 and Q2: product1, product2a, product2b, product2c and product3, each constraining name, price, cat, subcategory and picture with conditions such as cat = elec, cat != elec, price > 200, subcategory != camera and no picture]

  47. After two queries • Known information: prefix of the real data tree • Missing information: complex type • Q3: name, price, pictures of cameras costing less than $100 and pictured at least once • can be completely answered using A1, A2 • Q4: list all cameras • can be partially answered using A1, A2

  48. Illustration: Xyleme

  49. A dynamic warehouse of Web data • Warehouse • Xyleme stores huge quantities of data (terabytes) • Xyleme is not a search engine (only an index) or a mediator (only virtual data) • XML • Xyleme is focused on XML • Dynamic • Xyleme is interested in data evolution/changes

  50. Technical Challenges • 1. Data Acquisition and Maintenance: discover data of interest and maintain it up to date • 2. Repository: store and index this data • 3. Efficient Query Processing • 4. Semantic Integration: provide a simple view of each semantic domain • 5. Change Control: monitor the Web and offer services such as Query Subscription