1 / 36

Organization

A Dynamic Warehouse for the XML data of the Web Grégory COBENA INRIA & Xyleme SA ( Gregory.Cobena@inria.fr ) Serge Abiteboul, INRIA & Xyleme SA ( Serge.Abiteboul@inria.fr ) http://www-rocq.inria.fr/verso/ http://www.xyleme.com/. Organization. 1. The Web and XML 2. Xyleme

akamu
Download Presentation

Organization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Xyleme, 2001 A Dynamic Warehouse for the XML data of the WebGrégory COBENAINRIA & Xyleme SA( Gregory.Cobena@inria.fr )Serge Abiteboul, INRIA & Xyleme SA( Serge.Abiteboul@inria.fr )http://www-rocq.inria.fr/verso/ http://www.xyleme.com/

  2. Organization • 1. The Web and XML • 2. Xyleme • 3. Data Acquisition and Maintenance • XML Repository, Semantic Data Integration and Query Processing • 4. Query Subscription • Conclusion

  3. Xyleme, 2001 1. The Web and XML

  4. The Web today • Terabytes of data • A lot of public pages • 1 billion in [06/2000] • several millions of servers • Private web: not publicly available pages • Deep web: data hidden behind forms

  5. The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. Ref Name Price X23 Camera 359.99 R2D2 Robot 19350.00 Z25 PC 1299.99 Information System HTML HTML = Hypertext Language hard Text + presentation Where is the data ?

  6. Ref Name Price X23 Camera 359.99 R2D2 Robot 19350.00 Z25 PC 1299.99 ... Information System XML = Semistructured Data <product-table> < product reference=”X23"> <designation> camera </designation> <price unit=Dollars> 359.99 </price> <description> … </description> </product> < product reference=”R2D2"> <designation> Robot </designation> <price unit=Dollars> 19350 </price> <description> … </description> ... </product-table> easy Data + Structure Semistructured: more flexible XML

  7. XML : Tree Types product-table • Semantics and structure are in paths • product-table/product/reference • product-table/product/price product reference price designation description

  8. Xyleme, 2001 2. A Dynamic Warehouse for the XML Data of the Web Xyleme

  9. Xyleme Research Project Xyleme at INRIA (1999-2000) : Explore XML + Web + SGBD to make the Web a Knowledge Database • INRIA • Sophie Cluet:Databases (OQL…) • Serge Abiteboul: semi-structured data + web • Guy Ferran: ex O2 Technology • Mannheim University • Guido Moerkotte • Université d’Orsay • Marie Christine Rousset • CNAM • Dan Vodislav

  10. Xyleme Company • Started September 2000 (25 employees end of 2001) • Market Challenges: • Few XML documents available on the Web (because of weak software support) • Company is focusing on private XML: • Press, Editors, Financial Data, Biology… • Technology: • Scalability for large amount of data • Internet (+focus) / Intranet support • Monitoring and Version Management • Heterogeneous Data Integration

  11. Architecture • Cluster of PCs • Developed with Linux and C++ • Communications • local: Corba • external: HTTP • Distribution between autonomous machines • Now Web Services

  12. User Interface Xyleme Interface Acquisition & Crawler Change Control Semantic Module Loader Functional Architecture -------------------- I N T E R N E T ----------------------- Web Interface Query Processor Repository and Index Manager

  13. Change Control and Semantic Integration Change Control and Semantic Integration Acquisition and Maintenance Acquisition and Maintenance Index Index Index Loader |Query Loader |Query Repository Repository Repositorry Repository Architecture -------------------- I N T E R N E T ----------------------- E T H E R N E T

  14. Xyleme, 2001 3. Data Acquisition and Maintenance,Page Importance

  15. Goals • Discover XML pages on the web that are of interest for customers • For this crawl the web (HTML+XML) • Maintain them up to date • Do this under bounded resources: • Memory for known URLs • Bandwidth

  16. Life Cycle of a page in Xyleme • The URL of D is discovered as a link in another page (or published by a customer) • The page scheduler decides to read D • The meta data of D is read • type, last_date_update... • The document D is loaded • The document D is re(read) regularly

  17. Main Issues • Loading of pages • we can load up to 5 millions of pages/day on a standard PC • main cost is Internet connection • Metadata management (access to disk) • Page scheduling • decide which page to read or refresh next

  18. Page Importance • Definition: Important pages are linked to by important pages • Offline algorithm (used by Google) • Our Online algorithm (M. Preda, S. Abiteboul, G. Cobena) • does not require to maintain graph information • faster convergence with focused crawling

  19. Xyleme, 2001 ( XML Repository,Semantic Data Integrationand Query Processing )

  20. Querying Language • Today: A mix of OQL and XQL • We are currently moving to X-Query (which is also a mix of OQL and XQL…) Select boss/Name, boss/Phone From comp in BusinessDomain, boss in comp//Manager Where comp/Product contains “Xyleme”

  21. Web Heterogeneity • Semantic domains, e.g., cinema • Many possible types for data in this domain, many DTDs • Semantic Integration • one abstract DTD for the domain • gives the illusion that the system maintains an homogeneous database for this domain 1 domain = 1abstract DTD

  22. Indexing • Standard inverted index • word  documents that contain this word • Xyleme index • word  elements that contain this word document + element identifier • Goal: more work can be performed without accessing data

  23. Xyleme, 2001 4. Change Control

  24. The Web changes all the time • Data acquisition + maintenance • keep the warehouse up-to-date • Version management • representation and storage of changes • Change monitoring • query subscription

  25. Subscription Language • SQL-like language based on ‘atomic events’. • Combines the use of monitoring queries and continuous queries. • The language can be extended by adding new types of atomic events. • Uses the XML Query Language for continuous queries. “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

  26. Example subscription myPaintings % what are the new painting entries in Musee d’Orsay site monitoring newPainting select URL where URL extends www.musee-orsay.fr/* and <painter> contains “Monet” % manage the changes in the expositions continuous delta Exposition select ... from ... where when monthly notify daily % send me a daily report Atomic events

  27. d document & alerts d/46 d/46,67 loading Step 1: Atomic Event Detection 5 millions of pages/day atomic event 46: URL matches pattern www.musee-orsay.fr/* atomic event 67: XML document contains the tag <painter> with the value “Monet” metadata manager HTML parser complex event detection XML loader

  28. Step 2: Complex Event Detection Millions of alerts of pages/day Millions of subscriptions HTML parser complex event detection complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*) XML loader

  29. notification/monitoring alerts triggers Millions of notifications/day notification/results clock Step 3: Notification Processor complex event detection Reporter continuous queries

  30. Architecture Xyleme Query Processor documents Trigger Engine Xyleme Alerter Complex Event Detection Reporter Xyleme Reporter Subscription Manager SQL Xyleme Subscription Manager Web Browser SQL

  31. Complex Events Algorithm • The formal problem is NP-hard • We proposed several possible algorithms • Experimental (simulation) values proved the effectiveness of our solutions • The Hash-Tree based algorithm is well suited for our application: • 10 million Complex Events • 1 million Atomic Events • 100 Atomic events detected per document 0.8 ms to process a document. ~2 million documents per day.

  32. Alerters • Each Alerter can be viewed as a plug-in that acts on a document flow. • All sorts of Atomic events can be detected: URL pattern detection, Keywords, XPath expressions, Page rank… • Can be distributed.

  33. Some Advanced Alerts • Process document flow (single pass) • Full strings • Context Stack • Reversed look-up • XML Alerts • Reversed XPath expressions • Dual context stack for ‘/’ and ‘//’

  34. Versions • Objectives: • Temporal Queries (persistent identification of nodes) • Version some documents or some sites (store a ‘delta’) • Change Monitoring (query changes) • We proposed a representation of changes “Change-Centric Management of Versions” (VLDB 2001) • We developed a Diff algorithm for XML “Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian ICDE 2002 (San Jose)

  35. Conclusion & Prospectives • Focus crawling on important pages • Refine notion of importance • Improve important pages discovery • Improve Change control accuracy • Semantic web • Real-time advanced processing

  36. Xyleme, 2001 Merci

More Related