Content warehouses around XML and Web services (Entrepôts de contenu autour de XML et des services Web) • Serge Abiteboul • INRIA-Futurs and LRI-Paris 11 • EDA06 - Entrepôts de contenu
Joint work: some participants & projects • Xyleme: Sophie Cluet, Guy Ferran & many others • Acware within the Edot project: Benjamin Nguyen, Gabriela Ruberg, Gregory Cobena • Active XML within the DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo & many others • KadoP within the Edos project: Ioana Manolescu, Nicoleta Preda & many others
Success stories in the time of the Internet bubble: Information management • Google: management of Web pages • MapQuest: management of maps • Amazon: book catalogue • eBay: product catalogue • Napster (eMule, BearShare, etc.): music database • Flickr: picture database • Wikipedia: dictionary • Even in France: • Meetic: dating database • Kelkoo: comparative shopping
The trend is towards peer-to-peer infoware • Why? • The Web is switching from centralized servers to communities and syndication • Buzzwords such as Web 2.0 (?) • Infoware: a class of software whose purpose is no longer to process information but to manage it globally, because the volumes involved keep growing • Analogy: • Software development by very structured and controlled groups of programmers vs. • open-source software produced by large communities of autonomous developers
Outline • Introduction • Content warehouse • Concept • XML and Web services • Xyleme • Peer-to-Peer content warehouse • Concept • Active XML • KadoP • Conclusion
Warehouse • Goal: integrated access to heterogeneous, autonomous, distributed sources of information • Main functionalities: acquire, transform, filter, clean and integrate data, support for queries • Warehouse vs. mediation • Warehouse: information is acquired in advance • Mediation: information is acquired when needed • Classical tradeoff between updates and queries • Typically a mix of both
Content warehouse • All kinds of content • Mail, reports, news, web pages, contacts, catalogs, annotations, etc. • Text, multimedia, etc. • Little is numerical vs. OLAP • Some may be mixed, e.g., financial reports • Typically found on the Web and not in relational databases
Content vs. data warehouse
XML Warehouse [Figure: operational data sources feed the XML warehouse, which applications exploit] • Same as a relational warehouse • Import data from many sources • Add value to it without interfering with operational data • Export integrated views of it
The basis of content management • Standard for data exchange • XML, XML Schema… • Extensible Markup Language • Labeled ordered trees • Foundations: tree automata • Query languages • XPath, XQuery… • Foundations: tree automata • Not perfect, but at least they exist [Figure labels: XML, XQuery, XPath, SOAP, WSDL]
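To make "labeled ordered trees" and path queries concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The sample document and its element names are invented for illustration and do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a labeled, ordered tree.
DOC = """
<catalog>
  <book year="2001"><title>XML basics</title></book>
  <book year="2004"><title>Web services</title></book>
  <report year="2006"><title>Content warehouses</title></report>
</catalog>
"""

root = ET.fromstring(DOC)

# Path expression: titles of the <book> children of the root, in document order.
book_titles = [t.text for t in root.findall("./book/title")]

# Path expression with an attribute predicate, over all descendants.
from_2006 = [e.find("title").text for e in root.findall(".//*[@year='2006']")]
```

Full XPath and XQuery offer far more (axes, joins, construction of new trees); this subset only shows navigation by label and filtering by attribute.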
Functionalities [Figure: components of the warehouse: Feeding (from the Web); Store & Index; Query Processing; View and Semantic (stemming, integration, classification…); Exploiting (GUI, Web services, reporting…)]
Functionalities: Feeding • Loading from the Web (Internet and Intranet) • Web search • Web crawl • Access Web data via forms or Web services • Plug-ins to load from • File systems, document management systems • Databases, LDAP • Newsgroups, e-mail • Other applications • Extraction and transformation • XSLT or XQuery mappings for XML sources • XML-izers to load data from other formats • Monitoring of the feeding
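An "XML-izer" can be as simple as a converter from a flat format to XML. The following is a minimal sketch for CSV input, assuming that the column names are valid XML element names; the function name `xmlize_csv` and the sample data are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xmlize_csv(text, root_tag="rows", row_tag="row"):
    """Turn CSV text with a header line into an XML tree, one element per column.

    Assumption: every column name is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in csv.DictReader(io.StringIO(text)):
        row = ET.SubElement(root, row_tag)
        for column, value in record.items():
            ET.SubElement(row, column).text = value
    return root

# Hypothetical input, not taken from the talk.
sample = "name,city\nAspen Lodge,Aspen\nSnow Inn,Vail\n"
tree = xmlize_csv(sample, root_tag="hotels", row_tag="hotel")
xml_text = ET.tostring(tree, encoding="unicode")
```

Once the data is in XML, the same XSLT or XQuery mappings used for native XML sources can be applied to it.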
Functionalities: More feeding • User feeding • Document editing • Metadata editing • Publication • API: SOAP and WebDAV
Functionalities: Storage • Storage of (massive volume of) XML (terabytes) • Indexing of (massive volume of) XML • By structure • By full-text • Linguistic support: multi language, stemming, synonyms, etc. • Very efficient XML query processing • Importance ranking • Monitoring of the warehouse (support for subscriptions) • Access control and security • Versioning, archiving • Recovery • Possibly transaction mechanism
Functionalities: Enrichment • Global organization • Global schema management • Management of collections • Incorporate domain ontologies and thesauri • Document classification • Cleaning by filtering out documents from collections, etc. • Document enrichment • Concept extraction and tagging • Cleaning inside the document • Summarization, etc. • Relationships between documents • Tables of contents • Indexes • Cross referencing, etc.
Functionalities: View & integration • View management • Document restructuring/mapping • Schema to schema mapping • Semantic integration • Manual for complex ones and (semi-) automatic for simple ones • Tools to analyze a set of schemas • Tools to integrate them • Processing for queries on integration view • Management of virtual data in a mediator style
Functionalities: Exploitation • Access to the warehouse • Browsing • Querying by keywords, XPath or XQuery • Temporal queries • Query subscription • Reporting • Generation of complex reports with pointers to documents, counts, abstracts… • Organized by collections, content, domains… • By GUI or from programs (Web service-based API)
Xyleme – in short • 1999: Xyleme research project at INRIA • 2000: Creation of a spin-off • 2006: About 40 people • Technology: a content warehouse built around a very efficient and scalable XML repository • Application example: all articles of Le Monde in XML
Xyleme Functionalities [Figure: components of the warehouse: Feeding (from the Web); Store & Index; Query Processing; View and Semantic (stemming, integration, classification…); Exploiting (GUI, Web services, reporting…)]
Xyleme Architecture [Figure labels: client-side applications (IE/Java/C++/.Net or any platform); HTTP | Web Service API; server side or application server (Tomcat | SOAP); Name Server; User Manager; URL Manager; Notification Manager; Global Query Manager; Java/C++ API; Corba; three machines, each with Loader | Local | Query over an XML store and an index]
Structural identifiers and indexing • Structural IDs = Prefix-Postfix-Level, i.e. (pre, post, level) • X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y) • X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1 • Index entries are placed by hashing the key (LAN): hash(C) gives Put(C;[d,p,6,6,1]), hash("John") gives Put("John";[d,p,3,1,2]) [Figure: example tree with nodes A to G and a text node "John", each annotated with its pre, post and level numbers]
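The two rules above can be checked on a small tree. This is a minimal sketch, not Xyleme's code: it assigns (pre, post, level) by one depth-first traversal and applies the ancestor and parent tests exactly as stated on the slide. The tree below is hypothetical and is not the tree from the figure.

```python
def structural_ids(tree):
    """Map each label to (pre, post, level).

    `tree` is a nested tuple (label, [children]); labels are assumed unique.
    pre = rank in preorder, post = rank in postorder, level = depth (root is 0).
    """
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        counters["pre"] += 1
        pre = counters["pre"]
        for child in children:
            visit(child, level + 1)
        counters["post"] += 1
        ids[label] = (pre, counters["post"], level)

    visit(tree, 0)
    return ids

def is_ancestor(x, y):
    """X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]

def is_parent(x, y):
    """X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x[2] == y[2] - 1

# Hypothetical tree: A has children B and C; B has D and E; C has F.
TREE = ("A", [("B", [("D", []), ("E", [])]), ("C", [("F", [])])])
IDS = structural_ids(TREE)
```

The point of the encoding is that both tests need only the two identifiers, never the document itself, so they can be evaluated directly on index entries.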
Query evaluation based on holistic twig joins [Figure: tree pattern over A, C, D and "John", matched against the identifiers (d1, 201, 400), (d1, 224, 201), (d1, 228, 237), (d1, 228, 237)]
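A holistic twig join matches a whole tree pattern in one pass over the posting lists. As a simpler illustration of the building block, the sketch below is a binary, stack-based ancestor/descendant join over two lists of (doc, pre, post) identifiers sorted by (doc, pre). It is a swapped-in, simplified technique, not the holistic algorithm and not Xyleme's implementation; the sample identifiers are hypothetical.

```python
def contains(a, x):
    """True if node a is an ancestor of node x (same document, pre/post test)."""
    return a[0] == x[0] and a[1] < x[1] and a[2] > x[2]

def structural_join(ancestors, descendants):
    """Return all (ancestor, descendant) pairs.

    Both inputs are lists of (doc, pre, post), sorted by (doc, pre).
    The stack always holds a chain of nested candidate ancestors.
    """
    result = []
    stack = []
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][:2] < d[:2]:
            a = ancestors[i]
            while stack and not contains(stack[-1], a):
                stack.pop()  # that subtree is finished
            stack.append(a)
            i += 1
        # Discard candidates whose subtree ended before d.
        while stack and not contains(stack[-1], d):
            stack.pop()
        result.extend((a, d) for a in stack)
    return result

# Hypothetical posting lists for one document "d1".
ANCESTORS = [("d1", 1, 6), ("d1", 2, 3)]
DESCENDANTS = [("d1", 3, 1), ("d1", 6, 4)]
PAIRS = structural_join(ANCESTORS, DESCENDANTS)
```

Each list is scanned once, which is what makes sorted structural identifiers attractive for large posting lists.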
The golden triangle of distributed content management on the Web • Standard for data exchange • XML, XML Schema… • Extensible Markup Language • Labeled ordered trees • Foundations: tree automata • Query languages • XPath, XQuery… • Standards for distributed computing: Web services • SOAP, WSDL, UDDI… • Simple Object Access Protocol • Like CORBA, but simpler and on the Web [Figure labels: XML, XQuery, XPath, SOAP, WSDL]
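Since a SOAP message is itself an XML document, a request can be built with the same tools as any other XML. The sketch below assembles a SOAP 1.1 envelope with the standard library; the operation name, the parameter and the service namespace `urn:example:snow` are hypothetical, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

def soap_request(operation, params, service_ns="urn:example:snow"):
    """Build a SOAP 1.1 envelope whose body holds one call element.

    `service_ns` is a made-up namespace standing in for a real service's.
    """
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return envelope

REQUEST = soap_request("snow", {"resort": "Aspen"})
REQUEST_TEXT = ET.tostring(REQUEST, encoding="unicode")
```

In practice the element names and types of the call would be taken from the service's WSDL description rather than chosen by hand.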
Peer-to-peer • A large and varying number of computers cooperate to solve some particular task without any centralized authority • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network • Examples • SETI@home: search for extraterrestrial intelligence • Kazaa: obtain free music/video over the net • Cabal: breaking of a 512-bit RSA key • Grub: P2P Web search
An XML warehouse in P2P • Warehouse: a very centralized system • P2P: an ultra distributed system (no authority) • P2P warehouse: an oxymoron? • No! • A warehouse: from a logical viewpoint • P2P system: from a physical viewpoint
[Figure: four configurations. Centralized mediation: a mediator over data sources. P2P mediation: a P2P mediator over data sources. Centralized warehouse: the warehouse is both logical and physical. P2P warehouse: a logical warehouse on top of a physical P2P warehouse]
P2P XML Warehouse • Data sources and peers are distributed, transient and autonomous • Information is distributed and replicated • Nothing is centralized • Not the control, storage, indexing… • The machines are “cooperating” with some level of trust to provide the functionalities of an XML warehouse
Advantages • Performance: optimization of parallelism, avoid bottleneck, replication • Availability: replication • Cost: avoid the cost of a server, share operational cost • Dynamicity: add/remove new data sources, better scaling • Disadvantages • Performance: cost for complex queries, communication cost • Availability: peers can leave • Consistency maintenance: difficult to support transactions • Quality: difficult to guarantee quality
Centralized vs. distributed data management
Two classes of P2P networks • Unstructured P2P networks • Local exchange: mappings relate content on different peers • Queries are propagated (flooding) • SomeWhere, ... • Structured P2P networks • Content is indexed globally and located via the index • Local content, global access • KadoP, ...
Active XML [Figure labels: XML, XQuery, XPath, SOAP, WSDL: the standards of distributed data management] • Active XML = XML documents with embedded Web service calls, where service calls are typically in XQuery • Intensional & dynamic • This is not a new idea • Procedural attributes in relational systems • Basis of object databases • Sun’s JSP, PHP+MySQL, Apache Jelly…
Active XML = XML + embedded service calls (omitting syntactic details)
  <resorts state="Colorado">
    <resort>
      <name> Aspen </name>
      <scond> Unisys.com/snow("Aspen") </scond>
      <hotels ID="AspHotels"> …. Yahoo.com/GetHotels(<city name="Aspen"/>) </hotels>
    </resort>
    …
  </resorts>
Example of a call result: <depth unit="meter">1</depth>
• May contain calls • to any SOAP Web service • to any AXML Web service (to be defined)
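The idea of replacing an embedded call by its result can be sketched in a few lines. Everything specific here is an assumption made for illustration: the `<sc service="…" arg="…"/>` element is an invented notation (not the actual Active XML syntax), and the two Python functions stand in for remote Web services that a real peer would invoke over SOAP.

```python
import xml.etree.ElementTree as ET

# Local stand-ins for remote services (hypothetical).
def snow(resort):
    return ET.fromstring('<depth unit="meter">1</depth>')

def get_hotels(city):
    return ET.fromstring(f"<hotel><name>{city} Lodge</name></hotel>")

SERVICES = {"snow": snow, "GetHotels": get_hotels}

# Invented notation: <sc service="..." arg="..."/> marks an embedded call.
DOC = """
<resorts state="Colorado">
  <resort>
    <name>Aspen</name>
    <scond><sc service="snow" arg="Aspen"/></scond>
    <hotels ID="AspHotels"><sc service="GetHotels" arg="Aspen"/></hotels>
  </resort>
</resorts>
"""

def materialize(root, services):
    """Replace every <sc> element by the result of the call it describes.

    Calls contained in a result are left untouched (one round only).
    """
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "sc":
                result = services[child.get("service")](child.get("arg"))
                result.tail = child.tail
                parent[index] = result
    return root

ROOT = materialize(ET.fromstring(DOC), SERVICES)
```

A document that still contains calls is intensional; after `materialize` it is extensional. Whether and when to perform that step is exactly the question raised on the following slides.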
Not a new idea in databases; not a new idea on the Web • Mixing calls and data is an old idea • Procedural attributes in relational systems • Basis of object databases • In the HTML world • Sun’s JSP, PHP+MySQL • Calls to Web services inside documents • Macromedia MX, Apache Jelly
What exactly to exchange • A parameter of a call contains some service calls • The result of a call contains some service calls • Do we evaluate these calls before transmitting the data, or not? • Example: "Hi John, what is the phone number of the CEO of INRIA?" Three possible answers, from fully evaluated to fully intensional: • (33 1) 39 66 00 01 • Look in the INRIA directory under Michel Cosnard • Find his name at www.inria.fr, then look in the directory
When to activate the call • Explicit pull mode • Frequency: daily, weekly, etc. • After some event: e.g., when another service call has completed • This aspect of the problem is related to active databases • Implicit pull mode: lazy • When the data is requested • Difficulty: detect that the result of a particular request may be affected by a particular call • This is related to deductive databases • Push mode • E.g., based on a query subscription; the web server pushes information to the client • E.g., synchronization with an external source • This is related to stream and subscription queries
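The three activation modes can be contrasted on a single object that wraps one call. This is a deliberately simplified sketch: the class `ServiceCall` and its methods are hypothetical, and the lazy mode here only checks whether the cached value is missing or older than a refresh period. It does not attempt the hard part mentioned above, namely detecting which calls are relevant to a given request.

```python
import time

class ServiceCall:
    """One embedded call with a cached result (hypothetical helper)."""

    def __init__(self, fetch, refresh_every=None, clock=time.monotonic):
        self.fetch = fetch                  # function performing the call
        self.refresh_every = refresh_every  # seconds, or None for "never stale"
        self.clock = clock
        self.value = None
        self.fetched_at = None

    def activate(self):
        """Explicit pull: perform the call now (e.g. from a scheduler or event)."""
        self.value = self.fetch()
        self.fetched_at = self.clock()
        return self.value

    def get(self):
        """Implicit (lazy) pull: call only when the data is requested and stale."""
        stale = self.fetched_at is None or (
            self.refresh_every is not None
            and self.clock() - self.fetched_at >= self.refresh_every
        )
        if stale:
            self.activate()
        return self.value

    def push(self, value):
        """Push mode: the provider sends a new value, e.g. for a subscription."""
        self.value = value
        self.fetched_at = self.clock()
```

Passing the clock as a parameter keeps the sketch testable without waiting for real time to elapse.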
Peer-to-peer architecture • Each Active XML peer • Repository: manages Active XML data with embedded web service calls • Web client: uses Web services • Web server: provides (parameterized) queries/updates over the repository as web services • Open source system • Sun’s Java SDK 1.4 • XML parser, XPath processor, XSLT engine • Apache Tomcat 4.0 servlet engine • Apache Axis SOAP toolkit 1.0 • X-OQL query processor • Persistent DOM repository • JSP-based user interface • JSTL 1.0 standard tag library • See http://activexml.net [Figure: Active XML peers exchanging SOAP messages]
KadoP model • Data: XML documents; views; Active XML; Web services • Simple semantics: concepts, namespaces, DTDs, isA, partOf, relatedTo, context documents (for services) • Queries: tree pattern queries with joins • KadoP • XML data distributed in the P2P network • Index is distributed via a DHT • Goal: efficient processing of terabytes of XML with no centralized authority
Distributed hash tables • Typically on a WAN • Peers come and go • Small number of messages to “locate” the peer in charge of key k: O(log n) • Standard interface: put, get • We tried Pastry, Chord and JXTA • We now use Pastry [Figure: put(k;v1), put(k;v2) and put(k;v3) are routed by hash(k) to the peer in charge of k in the DHT; get(k) returns v1,v2,v3]
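The put/get interface in the figure can be imitated in a single process. The sketch below is a toy, assuming a fixed set of peers placed on a hash ring; it is not Pastry, Chord or JXTA, it performs no message routing (so the log n hop count is not modelled), and peers never join or leave. The class name `ToyDHT` is hypothetical.

```python
import bisect
import hashlib

def h(key):
    """Hash a string key to a point on the ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

class ToyDHT:
    """Single-process imitation of a DHT with a fixed set of peers."""

    def __init__(self, peer_names):
        self.ring = sorted((h(name), name) for name in peer_names)
        self.store = {name: {} for name in peer_names}

    def peer_for(self, key):
        """The peer in charge of a key: first peer at or after hash(key), wrapping."""
        points = [point for point, _ in self.ring]
        index = bisect.bisect_left(points, h(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key, value):
        """Append a value to the list kept for this key, as in put(k;v)."""
        self.store[self.peer_for(key)].setdefault(key, []).append(value)

    def get(self, key):
        """Return all values published under this key."""
        return list(self.store[self.peer_for(key)].get(key, []))

dht = ToyDHT(["peer1", "peer2", "peer3", "peer4"])
for v in ("v1", "v2", "v3"):
    dht.put("k", v)
```

Note that `put` accumulates values instead of overwriting them, which is what an index needs: the entry for a key is a posting list.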
Indexing in KadoP • Use structural IDs as in Xyleme • Publish them in a DHT • Use holistic twig join • Main issue: communications • WAN vs. LAN • Long posting lists • Optimization techniques • Use only docID [wisconsin] • Ship smallest list • Semi-join techniques • Intensional indexing [Figure: hash(C) and hash("John") route put(C;[d,p,6,6,1]) and put("John";[d,p,3,1,2]) to peers of the DHT]
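Two of the optimizations listed above, using only document identifiers first and starting from the smallest list, can be illustrated together. This is a sketch of the general semi-join idea under simplifying assumptions, not KadoP's algorithm: all lists live in one process, postings are tuples (doc, pre, post, level), and the function name `semi_join_filter` is hypothetical.

```python
def doc_ids(postings):
    """Project a posting list onto its set of document identifiers."""
    return {posting[0] for posting in postings}

def semi_join_filter(lists_by_term):
    """Keep only postings from documents that contain every term.

    Starts from the shortest list, so the set of docIDs that would have to be
    shipped between peers is as small as possible, and shrinks at each step.
    Returns (candidate docIDs, filtered posting lists).
    """
    order = sorted(lists_by_term, key=lambda term: len(lists_by_term[term]))
    candidates = doc_ids(lists_by_term[order[0]])
    for term in order[1:]:
        candidates &= doc_ids(lists_by_term[term])
    filtered = {
        term: [p for p in postings if p[0] in candidates]
        for term, postings in lists_by_term.items()
    }
    return candidates, filtered

# Hypothetical index content: (doc, pre, post, level) postings per term.
INDEX = {
    "C": [("d1", 5, 5, 1), ("d2", 2, 2, 1), ("d3", 4, 4, 1)],
    "John": [("d1", 3, 1, 2)],
}
CANDIDATES, FILTERED = semi_join_filter(INDEX)
```

Only the postings that survive this filter need to be shipped for the structural join itself, which is where the savings on a WAN would come from.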
Ongoing work • AXML and distributed data management on the Web • Opinion: XQuery is a language for local XML management • Language for distributed query management • Active XML? • What else? • Foundation of distributed query optimization • Recent proposal: AXML + send/receive • KadoP and P2P (Active) XML indexing • Now being tested; optimization in progress • ActiveXML is open source (see activexml.net) • KadoP soon will be; already available upon request • Application: distribution of open-source software (with Mandriva)
Other issues for turning the network into a scalable database • Take an arbitrary problem for data or knowledge management and look at it in the P2P setting with gigabytes of data • Examples • Self tuning (joint work with Alkis Polyzotis) • Semantic integration (lots of work in Gemo) • Distributed access control (joint work with Bogdan Cautis) • Monitoring (joint work with DistribCom group in INRIA-Rennes)
Advertisement • Launch of webContent • An RNTL platform • Web data warehouse for monitoring • EADS, Thales, Bongrain, Xyleme, Exalead, NewPhoenix • Looking for young engineers to work on webContent
Thank you