
Monitoring XML Data on the Web

Monitoring XML Data on the Web. Benjamin Nguyen, Serge Abiteboul, Grégory Cobéna and Mihaï Preda. INRIA Rocquencourt, Projet Verso and Xyleme S.A., France. Contact: firstname.lastname@inria.fr or mihai.preda@xyleme.com. http://www-rocq.inria.fr/verso/ and http://www.xyleme.com.


Presentation Transcript


  1. Monitoring XML Data on the Web Benjamin Nguyen, Serge Abiteboul, Grégory Cobéna and Mihaï Preda INRIA Rocquencourt, Projet Verso and Xyleme S.A. FRANCE Contact: firstname.lastname@inria.fr or mihai.preda@xyleme.com http://www-rocq.inria.fr/verso/ and http://www.xyleme.com

  2. Organization
     • Introduction
     • Query Subscription
       • Motivations
       • Subscription System Architecture
       • Subscription Language
       • Complex Event Detection Algorithm
       • Alerters
     • Conclusion
     SIGMOD'01, Santa Barbara

  3. Xyleme: A Dynamic Warehouse for the XML Data of the Web
     “A complex tissue in the vascular system of higher plants… functions chiefly in conduction but also in support and storage.” (Webster)

  4. A brief look back…
     • 1999/2000: a group of researchers from
       • Inria Rocquencourt, Verso Group
       • U. of Mannheim, Database Group
       • U. of Orsay, IASI Group
       • CNAM, Vertigo Group
     • October 2000: creation of a start-up, http://www.xyleme.com/

  5. The three aspects of Xyleme
     • Webhouse
       • Xyleme stores huge quantities of data (terabytes)
       • Xyleme is more than a search engine (index only) or a mediator (virtual data only)
     • XML
       • Xyleme is focused on XML, i.e., trees
     • Dynamic
       • Xyleme is interested in data evolution/changes

  6. Xyleme Global Architecture
     • (Diagram labels: Internet; User Interface, Web Interface, Xyleme Interface; Query Processor; Change Control; Semantic Module; Acquisition & Crawler; Loader; Repository and Index Manager.)
     • Runs on a cluster of Linux PCs.
     • Implemented in C++.

  7. Query Subscription: 1. Motivations

  8. The Web changes all the time
     • Data acquisition + maintenance
       • keep the warehouse up-to-date: “Acquisition and Maintenance of XML Data from the Web”, L. Mignet, M. Preda, S. Abiteboul, B. Amann, A. Marian, Tech. Report
     • Version management
       • “Change-Centric Management of Versions in an XML Warehouse”, A. Marian, S. Abiteboul, G. Cobena, L. Mignet, VLDB’01
     • Change monitoring
       • query subscription

  9. Query Subscription
     • Users may subscribe to certain events
       • Changes in a page or a set of pages
       • Changes in pages from a particular semantic domain, containing some specific words, or with a particular DTD
       • Changes of particular elements somewhere (new products in a catalog)
     • Users may request to be notified
       • Immediately, at the time the event is detected
       • Regularly, e.g., weekly
       • After a certain number of event detections
     • Users want to be notified
       • By email
       • Upon login to our site
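The three notification timings listed on this slide can be illustrated with a minimal Python sketch. It is not Xyleme's Reporter (which the slides say is implemented in C++); the `Subscription` class, the mode names and the `on_event` / `flush_if_due` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Subscription:
    # All field names are assumptions made for this sketch.
    name: str
    mode: str               # "immediate", "periodic" or "count"
    period_s: float = 0.0   # used only by "periodic" (e.g. one week)
    threshold: int = 0      # used only by "count"
    pending: list = field(default_factory=list)
    last_sent: float = 0.0

def flush_if_due(sub, now):
    """Return the buffered events if a report is due at time `now`, else None."""
    due = (
        sub.mode == "immediate"
        or (sub.mode == "count" and len(sub.pending) >= sub.threshold)
        or (sub.mode == "periodic" and now - sub.last_sent >= sub.period_s)
    )
    if not due or not sub.pending:
        return None
    report, sub.pending = sub.pending, []
    sub.last_sent = now
    return report

def on_event(sub, event, now):
    """Buffer a detected event, then report if the subscription's policy says so."""
    sub.pending.append(event)
    return flush_if_due(sub, now)

WEEK = 7 * 24 * 3600
weekly = Subscription("myPaintings", mode="periodic", period_s=WEEK)
on_event(weekly, "newPainting", now=10.0)   # buffered, nothing sent yet
print(flush_if_due(weekly, now=WEEK))       # a clock tick one week in sends the report
```

In this sketch a periodic subscription relies on an external clock calling `flush_if_due`, while the immediate and count-based modes are driven entirely by incoming events.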

  10. Query Subscription: 2. Architecture

  11. Architecture
     • (Diagram labels: Xyleme Query Processor; Trigger Engine; Xyleme Alerter; Complex Event Detection; Xyleme Reporter; Xyleme Subscription Manager; Web Browser; flows labeled “documents” and “SQL”.)

  12. Step 1: Atomic Event Detection
     • 5 million pages/day
     • Atomic event 46: URL matches pattern www.musee-orsay.fr/*
     • Atomic event 67: XML document contains the tag <painter> with the value “Monet”
     • (Diagram labels: loading; metadata manager, HTML parser, XML loader; document & alerts d, d/46, d/46,67; complex event detection.)
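The two atomic events on this slide can be illustrated with a small Python sketch. It is only an illustration of the idea, not Xyleme's alerter code: the function names and the `ALERTERS` list are hypothetical, shell-style matching via `fnmatch` stands in for whatever URL pattern syntax the system uses, and "with the value Monet" is read here as a substring test on the element text.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

def url_alerter(url, xml_text):
    """Atomic event 46: the URL matches the pattern www.musee-orsay.fr/*."""
    return {46} if fnmatchcase(url, "www.musee-orsay.fr/*") else set()

def painter_alerter(url, xml_text):
    """Atomic event 67: some <painter> element has text containing "Monet"."""
    root = ET.fromstring(xml_text)
    hit = any("Monet" in "".join(el.itertext()) for el in root.iter("painter"))
    return {67} if hit else set()

# Each alerter is a small function over the document flow; more can be appended.
ALERTERS = [url_alerter, painter_alerter]

def detect_atomic_events(url, xml_text):
    """Run every alerter and return the set of atomic event codes raised."""
    events = set()
    for alerter in ALERTERS:
        events |= alerter(url, xml_text)
    return events

doc = "<painting><title>Nympheas</title><painter>Claude Monet</painter></painting>"
print(sorted(detect_atomic_events("www.musee-orsay.fr/collections/1.xml", doc)))
# prints [46, 67]
```

The resulting set of codes plays the role of the `d/46,67` annotation in the diagram: a document together with the atomic events detected on it.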

  13. Step 2: Complex Event Detection
     • Millions of alerts per day
     • Millions of subscriptions
     • Complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
     • (Diagram labels: HTML parser, XML loader, complex event detection.)

  14. Step 3: Notification
     • Millions of notifications/day
     • (Diagram labels: complex event detection; notification/monitoring alerts; triggers; Processor; continuous queries; notification/results; Reporter; clock.)

  15. Query Subscription: 3. The language

  16. Subscription Language
     • SQL-like language.
     • Combines the use of monitoring queries and continuous queries.
     • The language can be extended by adding new types of atomic events.
     • Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

  17. Example
     subscription myPaintings
       % what are the new painting entries in Musee d’Orsay site
       monitoring newPainting
         select URL
         where URL extends www.musee-orsay.fr/*
           and <painter> contains “Monet”
       % manage the changes in the expositions
       continuous delta Exposition
         select ... from ... where
         when monthly
       notify daily   % send me a daily report

  18. Query Subscription: 4. Complex Event Detection

  19. Atomic Event Set Algorithm
     • Complex events, each defined as a set of atomic events:
       • C0 = a0
       • C1 = a0 a4
       • C2 = a0 a1 a3
       • C3 = a2
       • C4 = a4 a5 a6 a7
     • (Figure: structure linking the atomic events a0 … a7 to the complex events C0 … C4.)

  20. Atomic Event Set Algorithm
     • Set of atomic events detected on the document: S = {a0 a2 a4}
     • Detected events: C0, C1, C3
     • (Figure: the same structure, with the atomic events a0, a2, a4 and the detected complex events C0, C1, C3 highlighted.)
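The detection problem on these two slides (given the set S of atomic events raised by a document, find every complex event whose atomic events are all in S) can be sketched in Python as follows. The slides do not spell out the data structure, so the prefix tree over sorted atomic event numbers used here is an assumption, and `build_trie` and `detect` are hypothetical names, not the authors' implementation.

```python
def build_trie(complex_events):
    """Index each complex event under the sorted sequence of its atomic events."""
    root = {"children": {}, "events": []}
    for name, atoms in complex_events.items():
        node = root
        for a in sorted(atoms):
            node = node["children"].setdefault(a, {"children": {}, "events": []})
        node["events"].append(name)
    return root

def detect(root, detected_atoms):
    """Return every complex event whose atomic events are a subset of detected_atoms."""
    s = sorted(detected_atoms)
    found = []

    def walk(node, start):
        # Every complex event stored here has all its atomic events in s.
        found.extend(node["events"])
        # Try to extend the matched prefix with each later atomic event of s.
        for i in range(start, len(s)):
            child = node["children"].get(s[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return sorted(found)

# The complex events of slide 19, with atomic event aN written as the integer N.
COMPLEX = {"C0": [0], "C1": [0, 4], "C2": [0, 1, 3], "C3": [2], "C4": [4, 5, 6, 7]}
trie = build_trie(COMPLEX)
print(detect(trie, {0, 2, 4}))   # prints ['C0', 'C1', 'C3']
```

With S = {a0, a2, a4} the sketch reports C0, C1 and C3, the same detected events as on slide 20. The walk only follows branches whose atomic events occur in S, so its cost depends on the events detected on the document rather than on the total number of subscriptions.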

  21. Complexity results
     • A formal study has been conducted.
     • Experimental (simulation) values concur with this study.
     • Results show that the algorithm is well suited for our application:
       • 10 million complex events
       • 1 million atomic events
       • 100 atomic events detected per document
     • 0.8 ms to process a document.
     • ~2 million documents per day.

  22. Query Subscription: 5. Alerters

  23. Alerters
     • Each alerter can be viewed as a plugin that acts on a document flow.
     • All sorts of atomic events can be detected: URL pattern detection, keywords, XML structure, page rank…
     • Can be distributed.

  24. Conclusion and Perspectives
     • This work has been implemented and integrated in the Xyleme system.
     • The core of our system is reusable.
     • The system is expandable and can be used to trigger various other modules:
       • versioning of documents
       • semantic classification

  25. Perspectives
     • Re-use of the core of our system.
     • Triggering of various other modules:
       • versioning documents
       • semantic classification

  26. The Coming of XML
     • HTML
       • comes from SGML
       • hypertext language
       • fixed number of tags
       • content and presentation are mixed
       • very difficult to extract data from a page
       • old standard
     • XML
       • also comes from SGML
       • semistructured data
       • number of tags not fixed
       • content and presentation not mixed
       • very easy to extract data
       • new standard

  27. XML = Semistructured Data
     Information System:
       Ref    Name     Price
       X23    Camera   359.99
       R2D2   Robot    19350.00
       Z25    PC       1299.99
       ...
     XML:
       <product-table>
         <product reference="X23">
           <designation> camera </designation>
           <price unit="Dollars"> 359.99 </price>
           <description> … </description>
         </product>
         <product reference="R2D2">
           <designation> Robot </designation>
           <price unit="Dollars"> 19350 </price>
           <description> … </description>
         </product>
         ...
       </product-table>
     Data + Structure. Semistructured: more flexible.
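As a small illustration of the "data + structure" point, the Python sketch below reads product entries like those on this slide back into table rows using the standard library's `xml.etree.ElementTree`. The sample document is a trimmed, well-formed variant of the slide's example (the elided `<description>` elements are left out), and `to_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed variant of the slide's example (an assumption of this sketch).
XML = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
  </product>
</product-table>
"""

def to_rows(xml_text):
    """Turn each <product> element into a (Ref, Name, Price) tuple."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for product in root.iter("product"):
        ref = product.get("reference")
        name = product.findtext("designation").strip()
        price = float(product.findtext("price"))  # float() tolerates surrounding spaces
        rows.append((ref, name, price))
    return rows

for row in to_rows(XML):
    print(row)
```

Because the tags carry the structure, the same code keeps working if a product gains extra child elements, which is the flexibility the slide attributes to semistructured data.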

  28. The Web and XML
