
From Searching Text to Querying XML Streams

From Searching Text to Querying XML Streams. Dan Suciu www.cs.washington.edu/homes/suciu. About Me. Born 1957, Romania BS: Bucharest, PhD: University of Pennsylvania Now: University of Washington (Seattle) My work is on semistructured data


Presentation Transcript


  1. From Searching Text to Querying XML Streams Dan Suciu www.cs.washington.edu/homes/suciu XML Toolkit

  2. About Me • Born 1957, Romania • BS: Bucharest, PhD: University of Pennsylvania • Now: University of Washington (Seattle) My work is on semistructured data • Book: Data on the Web: From relations, to semistructured data and XML Past/present projects: • XML-QL = precursor of XQuery • XMill = the XML compressor • XML toolkit XML Toolkit

  3. Motivation • Text databases • Studied over the past 15 years • Traditional client/server model • Struggled with lack of standard text syntax • Recently, new standard: XML • Traditional client/server: in today’s dbms • New applications: stream processing • This talk: processing stream XML data • My motivation: work on the XML Toolkit project XML Toolkit

  4. Outline • Background • The XML stream processing problem • Basic XML processing with automata • Adapting automata to XML • Stream indexes • Conclusions XML Toolkit

  5. Background: Relational Databases • Structured, stored in tables • Schema separate from data • Queries: precise, refer to schema and data (SQL) Hard to publish, easy to query precisely

  6. Background: Text Databases • Unstructured, stored in documents • No schema, only data • Queries: imprecise, refer to data only (keywords) Example documents: Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995; Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999 Easy to publish, hard to query precisely

  7. Background: XML Data • Semistructured • Schema and data are together: self-describing • Queries: precise, refer to schema and data (SQL) <bib> <book> <title> Foundations… </title> <author> <name> Abiteboul </name> <country> FR </country> </author> <author> <name> Hull </name> <country> USA </country> </author> <author> <name> Vianu </name> <country> USA </country> </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bib> XML: Easier to publish, easy to query precisely

  8. Background: XML Data Data model = tree [Figure: tree rooted at bib with paper and book children; inner nodes include title, author, journal, publisher, name, country; leaf values include Addison Wesley, Data on the Web, Buneman, UK, Abiteboul, FR]

  9. Background: XML Data • Querying with XPath (and XQuery) • This talk: XPath queries restricted to: tag / // * [ ] path="constant"

  10. Background: XPath in One Slide • tag, / : /bib/book/author/name • //, * (navigate partially known structure): /bib/book//name/*/zip • [ ] (conjunctive queries à la SQL): /bib/book[author/name="Abiteboul"] and /bib/book[year="1995" and author[name="Abiteboul" and country="FR"]] This is precisely the “region algebra”. E.g. use proximal nodes [Navarro&Baeza-Yates’97]
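The three kinds of path on this slide can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. This is a minimal sketch, not part of the talk: the sample document is assembled from the books named on the earlier slides, and because ElementTree has no `and` operator the conjunctive query is written as a small Python helper (`is_hit`, a hypothetical name).

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <title>Foundations of Databases</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Hull</name><country>USA</country></author>
    <author><name>Vianu</name><country>USA</country></author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Buneman</name><country>UK</country></author>
    <author><name>Suciu</name><country>USA</country></author>
    <publisher>Morgan Kaufmann</publisher>
    <year>1999</year>
  </book>
</bib>
"""

root = ET.fromstring(BIB)  # root is the <bib> element

# /bib/book/author/name : child steps only
names = [n.text for n in root.findall("./book/author/name")]

# /bib/book//name : descendant step, structure below <book> only partially known
deep_names = [n.text for n in root.findall("./book//name")]

# /bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
# ElementTree has no "and", so the conjunction is checked in Python.
def is_hit(book):
    return book.findtext("year") == "1995" and any(
        a.findtext("name") == "Abiteboul" and a.findtext("country") == "FR"
        for a in book.findall("author")
    )

hits = [b.findtext("title") for b in root.findall("./book") if is_hit(b)]
print(names, deep_names, hits)
```

Note that this loads the whole document into memory, which is exactly what the stream-processing techniques in the rest of the talk avoid.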

  11. Outline • Background • The XML stream processing problem • Basic XML processing with automata • Adapting automata to XML • Stream indexes • Conclusions XML Toolkit

  12. Main Application: XML Packet Routing • Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02] • XML content routing [Snoeren et al.’01] • SOAP Message routing in Application Servers

  13. XML Packet Routing [Figure: many XML packets, each of the form <doc> <tag> value </tag> </doc>, moving through the network]

  14. XPath expressions: /bib/book/publisher="MK"; /bib/book[category="recent"]/title="Web"; /bib/book//address//*/zip="123"; /bib/book//address//*="Galaxy"; /bib/book/category="recent"; /bib/book/address="123"; /bib/book/address/field="567" [Figure: one input XML stream (<bib> <book> ... </bib>) and several output XML streams, selected by these expressions]

  15. The XML Stream Processing Problem • Given: • A set of XPath expressions • An incoming stream of XML documents • Decide: • For each document, which expressions it matches Hard: • Large number of XPath expressions, e.g. 10^3 to 10^6 • Streaming XML data, high throughput, e.g. 5MB/s Easy: • Shallow XML data, e.g. depth=20 • Short XPath expressions

  16. The Approaches Basic techniques • NFA plus optimizations: • XFilter/YFilter [Altinel&Franklin’00] • XTrie [Chan et al.’02] • DFA: • XML Toolkit Beyond the obvious • Stream indexes (XML Toolkit) • Stream views

  17. Outline • Background • The XML stream processing problem • Basic XML processing with automata • Adapting automata to XML • Stream indexes • Conclusions XML Toolkit

  18. From XPath to NFA /catalog/product[category="tools"][*/price = 200]/quantity //price [Figure: NFA for these two expressions, with ε-transitions and edges labeled catalog, product, category, "tools", *, price, 200, quantity] Extra processing needed to combine branches (not in this talk)

  19. Basic NFA Evaluation [Figure: a large set of XPath expressions, e.g. /bib/book/publisher="MK", /bib/book[category="recent"]/title, /bib/book//address//*/zip="123", /bib/book//address//*="Galaxy", /bib/book/category="recent", /bib/book/address="123", /bib/book/address/field="567", . . . , compiled into one NFA; SAX events from the input <bib> <book> ... </bib> drive its set of current states]

  20. Basic NFA Evaluation Properties: • Space = linear in the number of XPath expressions • Throughput = decreases linearly as expressions are added Systems: • XFilter [Altinel&Franklin’00], YFilter • XTrie [Chan et al.’02]
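The basic NFA evaluation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the XFilter, YFilter or XTrie implementation: it handles only linear paths built from `/`, `//` and `*` (no predicates), and the names `parse_xpath` and `NFAFilter` are hypothetical. Each NFA state is a pair (query, number of steps matched), and a stack of state sets follows the open elements of the document.

```python
import re
import xml.sax

def parse_xpath(path):
    """Split a linear path such as /bib/book//name into (is_descendant, tag) steps."""
    return [(sep == "//", tag) for sep, tag in re.findall(r"(//|/)([^/]+)", path)]

class NFAFilter(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.names = list(queries)
        self.steps = [parse_xpath(q) for q in queries]
        # One stack entry per open element; each entry is the set of active NFA states.
        self.stack = [{(q, 0) for q in range(len(self.names))}]
        self.matched = set()

    def startElement(self, name, attrs):
        nxt = set()
        for q, i in self.stack[-1]:          # work grows with the number of active states
            steps = self.steps[q]
            if i == len(steps):              # accepting state: no outgoing edges
                continue
            is_desc, tag = steps[i]
            if tag in (name, "*"):           # advance to the next step
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    self.matched.add(self.names[q])
            if is_desc:                      # "//" keeps waiting at deeper levels
                nxt.add((q, i))
        self.stack.append(nxt)

    def endElement(self, name):
        self.stack.pop()

queries = ["/bib/book/author/name", "/bib/book//address/*/zip",
           "//name", "/bib/paper/title", "/bib/title"]
doc = (b"<bib><book><title>t</title><author><name>Hull</name>"
       b"<address><home><zip>123</zip></home></address></author></book></bib>")
nfa = NFAFilter(queries)
xml.sax.parseString(doc, nfa)
print(sorted(nfa.matched))
```

The loop in `startElement` visits every active state on every SAX event, which is where the linear drop in throughput comes from as more expressions are added.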

  21. Basic DFA Evaluation [Figure: the same set of XPath expressions, e.g. /bib/book/publisher="MK", /bib/book[category="recent"]/title, /bib/book//address//*/zip="123", /bib/book//address//*="Galaxy", /bib/book/category="recent", /bib/book/address="123", /bib/book/address/field="567", . . . , compiled into DFAs; SAX events from the input <bib> <book> ... </bib> drive a single current state]

  22. Basic DFA Evaluation Properties: • Throughput = constant! • Space = GOOD QUESTION System: • XML Toolkit [University of Washington] http://xmltk.sourceforge.net

  23. XMLTK: An XML Toolkit for Scalable XML Stream Processing I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu XML Toolkit

  24. Motivation • Lots of data sits in large text files • ad hoc data formats • “Queried” with Unix command line tools • grep, sort, tail, etc • Would be nice to XML-ize it... • ...but then the Unix command line tools won’t work any more. XML Toolkit

  25. Example Text file • In the old Unix world… (columns: score decision paperID title) • accept P054 "Theory of XML parsing" • reject P021 "Experience with an XML optimizer" • accept P069 "Towards a unified theory of data models" • . . . . . . • Find the top ten rejected papers (in score order): grep "reject" papers.txt | sort | tail -n 10

  26. Example (cont’d) • In the new XML world… <submissions> <paper> <score> 6 </score> <decision> accept </decision> <paperID> P054 </paperID> <title> Theory of XML parsing </title> </paper> <paper> <score> 3 </score> <decision> reject </decision> <paperID> P021 </paperID> <title> Experience with an XML optimizer </title> </paper> . . . . . … can’t use those tools anymore

  27. Example (cont’d) Doing it with the XML Toolkit: xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10 Finds top ten rejected <paper>s, in <score> order

  28. Goals of the XML Toolkit Simple, scalable tools for XML processing • Provides service: there are people who need this • Provides a research platform: for XML stream processing XML Toolkit

  29. Outline • The tools • The XPath processing engine • Conclusions XML Toolkit

  30. The Tools Current tools: • xsort (will talk only about this one) • xagg • xnest • xflatten • xdelete • xpair • xhead • xtail • file2xml • xmill May look plenty, but actually still incomplete...

  31. XSort: Definition General form: xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)* • -c = the context, i.e. where to sort • -e = the item, i.e. what to sort • -k = the key, i.e. what to sort on

  32. XSort: Definition [Figure: context nodes c, each containing items (e1 . . . e9) that XSort reorders]

  33. XSort Examples Examples illustrated on data like this: <bib> <book> <author>Elliotte Rusty Harold</author> <author>W. Scott Means</author> <title>XML in a Nutshell</title> <publisher>O'Reilly</publisher> <year>2001</year> <isbn>0-596-00058-8</isbn> </book> <paper> <author>Sylvain Devillers</author> <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title> <year>2001</year> <booktitle>WWW Posters</booktitle> <ee>http://www10.org/cdrom/posters/1112.pdf</ee> <url>db/conf/www/www2001p.html#Devillers01</url> </paper> . . . . . XML Toolkit

  34. XSort: Examples xsort -c /bib -e paper -k title/text() Sorts the <paper>s by <title>; the <book>s are dropped from the output: <bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . </bib> Compare to… xsort -c /bib -e * -k title/text() and xsort -c /bib -e paper -k title/text() -e book -k title/text()
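The behaviour of the first command on this slide (one context, one item expression, one key) can be imitated in memory with Python's `xml.etree.ElementTree`. This is only a sketch of the semantics described on the slide, not the toolkit's implementation, which sorts with an external merge sort; the function name `xsort_sketch` and the sample titles other than the one from slide 33 are assumptions.

```python
import xml.etree.ElementTree as ET

def xsort_sketch(context, item_path, key_path):
    """Return a copy of `context` holding only the items selected by
    `item_path`, sorted by the text found at `key_path` inside each item."""
    items = context.findall(item_path)            # -e : what to sort
    items.sort(key=lambda e: e.findtext(key_path, default=""))  # -k : sort key
    out = ET.Element(context.tag)                 # -c : where to sort
    out.extend(items)                             # non-items are dropped
    return out

bib = ET.fromstring(
    "<bib>"
    "<paper><title>XML and XSLT Modeling</title></paper>"
    "<book><title>XML in a Nutshell</title></book>"
    "<paper><title>A Hypothetical Paper</title></paper>"
    "</bib>"
)

# Roughly: xsort -c /bib -e paper -k title/text()
papers_only = xsort_sketch(bib, "paper", "title")
# Roughly: xsort -c /bib -e * -k title/text()
everything = xsort_sketch(bib, "*", "title")

print([e.findtext("title") for e in papers_only])
print([e.tag for e in everything])
```

With `-e paper` the `<book>` disappears from the output, while `-e *` keeps every child of the context and sorts them together.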

  35. XSort: Examples xsort -c /bib -e paper/author -k lastName/text() -k firstName/text() Sorts the <author>s by <lastName>, then <firstName>: <bib> <author> . . . </author> <author> . . . </author> . . . . . </bib>

  36. XSort: Examples xsort -c /bib -e paper -e article -e book -e * <paper>s first, then <article>s, then <book>s, then all the rest: <bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . <article> . . . </article> . . . . . <book> . . . </book> . . . . . </bib>

  37. XSort: Examples xsort -c /bib/* -e author -e title -e year -e * Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements xsort -c /bib/paper -e author -e * -c /bib/book -e title -e * In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

  38. XSort: Implementation • Sorts one context at a time, copies the rest • For each context: • Create a “global key” for each item • Sort items, with a two-pass, multiway merge sort • Quote from Databases 101 (news from the trenches): • with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes ! XML Toolkit
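The 4TB figure follows from the usual two-pass merge sort bound, which the short calculation below reproduces. It assumes binary units (1KB = 1024 bytes), one input buffer per run during the merge, and ignores the output buffer.

```python
BLOCK = 4 * 1024            # 4KB disk block, in bytes
MEMORY = 128 * 1024 ** 2    # 128MB of main memory, in bytes

# Pass 1 writes sorted runs, each as large as main memory.
run_size = MEMORY
# Pass 2 merges all runs at once, using one block-sized buffer per run.
max_runs = MEMORY // BLOCK

max_file = run_size * max_runs
print(max_runs)                  # number of runs that can be merged in one pass
print(max_file / 1024 ** 4)      # largest sortable file, in TB
```

The bound is (memory size)^2 / (block size), so doubling main memory quadruples the largest file that can be sorted in two passes.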

  39. XSort: Performance xsort -c /dblp -e * -k title/text() 1GB: 8 minutes

  40. Outline • The tools • The XPath processing engine • Conclusions XML Toolkit

  41. The XPath Processor Common to all tools is the following problem: Given: • Set of correlated XPath expressions • Stream of SAX events Decide: • When are the expressions true → variable events

  42. The XPath Processor How we did it: • All XPath expressions → Deterministic Finite Automaton • Restriction: no predicates yet (current work...) • Does this scale to many, many XPath expressions? • Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper) • Evaluation time = parsing time • Can do even better with a Stream IndeX (next)
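Computing the DFA lazily can be illustrated with a small sketch: a DFA state is a set of NFA states, and a transition is built only the first time the input actually requires it, then cached. This is an assumption-laden illustration of the idea, not the XML Toolkit's code; it covers only linear paths with `/`, `//` and `*`, and the names `parse_xpath` and `LazyDFA` are hypothetical.

```python
import re
import xml.sax

def parse_xpath(path):
    """Split a linear path such as /bib/book//name into (is_descendant, tag) steps."""
    return [(sep == "//", tag) for sep, tag in re.findall(r"(//|/)([^/]+)", path)]

class LazyDFA(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.names = list(queries)
        self.steps = [parse_xpath(q) for q in queries]
        start = frozenset((q, 0) for q in range(len(self.names)))
        self.table = {}        # (dfa_state, tag) -> (dfa_state, accepted queries)
        self.stack = [start]   # one DFA state per open element
        self.matched = set()

    def _build(self, state, tag):
        """Subset construction for a single transition, done on demand."""
        nxt, accepted = set(), set()
        for q, i in state:
            steps = self.steps[q]
            if i == len(steps):
                continue
            is_desc, want = steps[i]
            if want in (tag, "*"):
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    accepted.add(self.names[q])
            if is_desc:
                nxt.add((q, i))
        return frozenset(nxt), frozenset(accepted)

    def startElement(self, name, attrs):
        key = (self.stack[-1], name)
        if key not in self.table:          # first time this transition is needed
            self.table[key] = self._build(*key)
        state, accepted = self.table[key]  # afterwards: a single table lookup
        self.stack.append(state)
        self.matched |= accepted

    def endElement(self, name):
        self.stack.pop()

dfa = LazyDFA(["/bib/book/title", "//title", "/bib/paper"])
xml.sax.parseString(
    b"<bib><book><title>a</title></book><book><title>b</title></book></bib>", dfa)
print(sorted(dfa.matched), len(dfa.table))
```

Only the transitions that the data exercises are ever materialized: the second `<book>` in the example reuses the entries created for the first, so the table stays at three transitions.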

  43. Stream IndeX (SIX) News: The parser is the main bottleneck in XPath stream processing! Solution: “Index” the XML stream, parse only partially Definition: The SIX = a table of (start, end) offsets

  44. Stream IndeX (SIX): Construction XML SIX <bib><book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book> </bib> XML Toolkit

  45. Stream IndeX (SIX): Skip Parsing XPath: /bib/paper/title . . . XML: <bib><book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book><paper>. . . . . . </bib> [Figure: two regions of the stream are marked “Skip Parsing”]
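Skip parsing with a table of (start, end) offsets can be sketched as below. This is a toy illustration, not the toolkit's SIX format: it assumes tags without attributes, no comments or CDATA, offsets measured as character positions in a Python string, and a child-only path; `build_six` and `find_texts` are hypothetical names.

```python
import re

TAG = re.compile(r"<(/?)([\w.:-]+)>")   # assumption: tags carry no attributes

def build_six(xml):
    """Map the offset of each open tag to the offset just past its close tag."""
    six, open_offsets = {}, []
    for m in TAG.finditer(xml):
        if m.group(1):                       # close tag: element ends here
            six[open_offsets.pop()] = m.end()
        else:
            open_offsets.append(m.start())
    return six

def find_texts(xml, path, six):
    """Evaluate a child-only path such as /bib/paper/title, jumping over every
    subtree that cannot match. Returns (texts, number of tags looked at)."""
    want = path.strip("/").split("/")
    texts, depth, pos, tags_seen = [], 0, 0, 0
    while True:
        m = TAG.search(xml, pos)
        if m is None:
            return texts, tags_seen
        tags_seen += 1
        closing, name = m.groups()
        if closing:                          # leaving an element on the path
            depth -= 1
            pos = m.end()
        elif name == want[depth] and depth < len(want) - 1:
            depth += 1                       # on the path: keep parsing inside
            pos = m.end()
        else:
            end = six[m.start()]
            if name == want[depth]:          # last step matched: collect text
                texts.append(xml[m.end():xml.rfind("<", m.end(), end)].strip())
            pos = end                        # skip the rest of this element

doc = ("<bib><book><title>Foundations of Databases</title><year>1995</year></book>"
       "<paper><title>XML and XSLT Modeling</title><year>2001</year></paper></bib>")
six = build_six(doc)
texts, tags_seen = find_texts(doc, "/bib/paper/title", six)
print(texts, tags_seen, len(TAG.findall(doc)))
```

For /bib/paper/title the whole `<book>` subtree is jumped over with one table lookup, so far fewer tags are tokenized than the document contains.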

  46. Stream IndeX (SIX) in XML Stream Processing [Figure: the SIX stream (e.g. DIME) shown together with the XML stream <bib> <book> ... </bib>] The SIX stream is about 6% of the data stream, and can be made MUCH smaller

  47. XML Toolkit

  48. XML Toolkit

  49. Outline • The tools • The XPath processing engine • Conclusions XML Toolkit

  50. Conclusions • The toolkit is already available: • http://www.cs.washington.edu/homes/suciu/XMLTK • http://xmltk.sourceforge.net • What it does so far it does very well: • Sorting, aggregation, nest/unnest • But doesn’t do too much: • Restricted selections, no projections, no restructurings yet • Volunteers welcome! • Can one process XML data without parsing it completely? • SIX
