
Content Integration for E-Business

Content Integration for E-Business. Joe Hellerstein. A new generation of e-business on the Internet: companies are moving beyond marketing and storefronts and attempting to run operations on the Internet (procurement, supply chain, customer relationships, etc.) in a cross-enterprise environment.


Presentation Transcript


  1. Content Integration for E-Business Joe Hellerstein

  2. New Generation of e-Business on the Internet • Companies moving beyond marketing, storefronts • Attempting to do operations on the Internet • procurement • supply chain • customer relationships • etc. • In a cross-enterprise environment • Requires cross-enterprise content integration • catalog integration is the procurement instance of this problem

  3. Content Integration • Content integration across enterprises • Not the “in-house” data warehousing problem • Not the Enterprise App Integration (EAI) problem • “Operational” data must be integrated • As opposed to historical (trend) data • E.g. pricing, availability, supply chain • Structured and unstructured data • Not just relational or XML queries • Not just text search • A combination of the two: logic meets statistics

  4. The “Butterfly” • Everybody’s favorite picture c. 1/2000 • At issue (6/2001) is how many butterflies there are, and who owns them • Not a startup opportunity (Transora vs. Chemdex) • Perhaps one of the wings is smaller than the other (Home Depot) • [Figure labels: Marketplace, Suppliers, Buyers]

  5. Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Research Evangelism

  6. Some Scenarios for Content Integration • Catalog Management: Integration and Syndication • “MRO” (Maintenance, Repair and Operations) a la Grainger • Thousands of suppliers, run by a “content manager” • Availability and Pricing • Travel industry • Necessitates live, cross-enterprise querying • Supply Chain Management • E.g. auto industry • Increase in production requires the entire supply chain (“the cows”) • Contractual information along with catalog and availability

  7. Marketing: The EcoSystem and its Terminology • Enterprise Application Integration (EAI): App Glue • Imperative, message-oriented programming (scripting languages) • Transactional networking (persistent queues) • Gateways to popular packaged apps • Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQseries, etc. • Data Integration: Warehousing and associated processes • Intra-enterprise, for “business intelligence” (historical trends) • Vendors: Informatica, Ascential, DBMS vendors • Content Management: Tools for content creation • Web page and graphic design • Versioning and configuration management • Vendors: Vignette, Interwoven, etc.

  8. Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Content Access, Mapping and Transformation • Query Processing • Research Evangelism

  9. Content Integration: Characteristics and Challenges • New integration challenges for e-business • cross-enterprise • operational • data-centric (not app-centric) • structured/unstructured • Two main thrusts • Content Access, Mapping and Transformation • Query Processing

  10. Content Access: Relationships with Providers • Varying relationships with content providers • Direct DBMS access (typically in-house) • Direct access to federated apps (SAP, etc.) • Gateway vendors a la Merant, NEON, Attunity, etc. • Arm’s-length relationships • HTML screen scraping • XML messaging • Relationships evolve over time! • MySimon example

  11. Content Mapping • Syntactic and semantic integration • Formatting/normalization is one piece of the puzzle • XML, HTML, Relational, etc. • Semantics is much harder • E.g. “price”. E.g. “delivery”. • Semantics gate the process • A “content manager” must own the transformation task • Ease of use critical • Home Depot has 60,000 suppliers! • Standards can help a bit (e.g. UDDI) • But graphical tools are the name of the game

  12. Cohera Workbench

  13. Schemas and Taxonomies • Cross-enterprise = multiple schemas • Even if standards prevail (very optimistic) • Early e-catalog systems were locked into one schema • Great for service companies, e.g. Requisite • Tools are sounding the death knell • Taxonomies are critical • Natural for browsing, especially with dirty data • “Black Ink”, “India ink”, “fountain pen ink, black” • Taxonomy per vertical markets, plus standards like UNSPSC • Office Supplies->Ink and lead refills->India ink • Taxonomy as data: query it, browse it, etc. • Integration task includes taxonomy integration!
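The "taxonomy as data" bullet on slide 13 can be illustrated with a minimal sketch. It assumes the taxonomy is stored as child-to-parent rows; the `PARENT` table and the helper names `path` and `children` are hypothetical and not taken from the talk or from any product.

```python
# Hypothetical taxonomy fragment stored as data: child -> parent rows.
# The category names reuse the example path from the slide.
PARENT = {
    "Office Supplies": None,
    "Ink and lead refills": "Office Supplies",
    "India ink": "Ink and lead refills",
}

def path(node):
    """Query the taxonomy: walk parent links and return the root-to-node path."""
    parts = []
    while node is not None:
        parts.append(node)
        node = PARENT[node]
    return "->".join(reversed(parts))

def children(node):
    """Browse the taxonomy: list the direct children of a category."""
    return sorted(c for c, p in PARENT.items() if p == node)

print(path("India ink"))            # Office Supplies->Ink and lead refills->India ink
print(children("Office Supplies"))  # ['Ink and lead refills']
```

Because the taxonomy is just rows, the same table can back both browsing and querying, and integrating two taxonomies becomes a matter of mapping one set of rows onto another.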

  14. Themes in Content Access and Mapping • Scalability in human terms • “Content managers”, not geeks • The name of the game: semi-automatic tools • Statistical (“fuzzy”) techniques to provide hints (not silver bullets) • Integrated into graphical programming-by-example interfaces • Problem domains: • Wrapper generation • Data cleaning • Schema mapping • Taxonomy mapping • Syndication • One of the key “systems” challenges today
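The idea on slide 14 of statistical techniques that provide hints rather than silver bullets can be sketched for schema mapping. This is an illustrative assumption, not the method used by any tool named in the talk: it ranks candidate target columns by plain string similarity using Python's standard `difflib`, and leaves the final decision to the content manager. The function name `mapping_hints` and the column names are hypothetical.

```python
from difflib import SequenceMatcher

def mapping_hints(source_cols, target_cols, top_k=2):
    """For each source column, return the top_k target columns ranked by
    string similarity. These are suggestions for a person to confirm."""
    hints = {}
    for s in source_cols:
        scored = [
            (SequenceMatcher(None, s.lower(), t.lower()).ratio(), t)
            for t in target_cols
        ]
        # Highest similarity first; break ties alphabetically for stable output.
        scored.sort(key=lambda st: (-st[0], st[1]))
        hints[s] = [(t, round(score, 2)) for score, t in scored[:top_k]]
    return hints

hints = mapping_hints(["unit_price", "descr"], ["price", "description", "sku"])
for source, candidates in hints.items():
    print(source, "->", candidates)
```

A graphical tool would present these ranked candidates and let the user accept or override them, which keeps a human in the loop for the semantic part of the mapping.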

  15. Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Research Evangelism

  16. Query Processing Issues • Content to be integrated is increasingly “uncacheable” • Arm’s-length accessibility • Business rules, not data • E.g. custom content throughout the dataflow • Volatile information • E.g. Availability • Yet a great deal of content is cacheable and slowly changing • Upshot: need a combined technology • Prefetch/Cache/Replicate when possible • Query live when impossible
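The upshot of slide 16 (cache when possible, query live when impossible) can be sketched as follows. This is a minimal sketch under stated assumptions: each field has a time-to-live in seconds, fields with no entry are treated as uncacheable, and `live_fetch` stands in for whatever call reaches the remote provider. The class name `HybridFetcher` and the field names are hypothetical.

```python
import time

class HybridFetcher:
    """Serve slowly changing fields from a cache and volatile fields live."""

    def __init__(self, live_fetch, ttl_seconds, clock=time.monotonic):
        self.live_fetch = live_fetch  # callable(field, key) -> value
        self.ttl = ttl_seconds        # field -> TTL in seconds; missing = uncacheable
        self.clock = clock
        self.cache = {}               # (field, key) -> (value, time fetched)

    def get(self, field, key):
        ttl = self.ttl.get(field, 0)
        now = self.clock()
        hit = self.cache.get((field, key))
        if ttl > 0 and hit is not None and now - hit[1] <= ttl:
            return hit[0], "cache"
        value = self.live_fetch(field, key)  # cache miss or uncacheable field
        if ttl > 0:
            self.cache[(field, key)] = (value, now)
        return value, "live"

def fake_supplier(field, key):
    # Stand-in for a remote supplier; a real system would query it over the network.
    return f"{field} of {key}"

fetcher = HybridFetcher(fake_supplier, {"description": 3600})
print(fetcher.get("description", "sku-1"))   # fetched live, then cached
print(fetcher.get("description", "sku-1"))   # served from the cache
print(fetcher.get("availability", "sku-1"))  # always fetched live
```

Keeping the cacheability decision per field reflects the slide's point that one product record can mix slowly changing content with volatile content such as availability.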

  17. Federated Query Processing • DBMS community must shed our materialization myopia! • ETL/Warehousing was inelegant and limited • What do we do on a “cache miss”?? • Should be no distinction between materialized views and queries! • Federated Query Processing • Query across multiple sources • Choose among multiple replicas, materialized views • Consider staleness • This is the natural extension of the modern database vision • Cohera uses Mariposa’s economic model to do this • Decouples optimization, cost estimation, storage and processing
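Choosing among replicas and materialized views while considering staleness, as slide 17 describes, can be sketched with a toy bidding scheme. This is only loosely inspired by the idea of economic optimization; it is not the Mariposa or Cohera protocol. The `Bid` fields, the site names, and the rule of picking the cheapest bid within a staleness bound are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bid:
    site: str           # hypothetical site name
    cost: float         # price the site asks to answer the subquery
    staleness_s: float  # age of the site's copy in seconds; 0 means live

def choose_source(bids, max_staleness_s):
    """Pick the cheapest bid whose data is fresh enough for this query.
    Returns None when no site can meet the staleness bound."""
    eligible = [b for b in bids if b.staleness_s <= max_staleness_s]
    if not eligible:
        return None
    # Cheapest first; among equal costs prefer the fresher copy.
    return min(eligible, key=lambda b: (b.cost, b.staleness_s))

bids = [
    Bid("supplier_live", cost=10.0, staleness_s=0),
    Bid("replica_a", cost=2.0, staleness_s=300),
    Bid("warehouse_view", cost=1.0, staleness_s=86400),
]
print(choose_source(bids, max_staleness_s=600).site)  # replica_a
print(choose_source(bids, max_staleness_s=0).site)    # supplier_live
```

In this sketch a materialized view is just another bidder with nonzero staleness, so the choice between a view and a live query falls out of one rule rather than two code paths.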

  18. Standard Queries Required • Hand-coded queries are brittle: you want ad-hoc • Don’t buy a handful of beans • Need support for standard query languages • SQL and XPath today • SQL/XQuery tomorrow • Everybody knows this! • Part of industrial religion • Oracle on one side • Dotcoms on the other side • You might get by claiming to be “XML compliant” • But most people have cottoned on by now

  19. IR capabilities need to be in the engine • The best-integrated data will still be noisy (product names, etc) • Text search on taxonomies, names, descriptions • Still no good integration of DBMS and IR engines • Storage (compression huge in IR) • Index concurrency (many updates per doc in IR) • Query optimization challenges • Note: this is not semi-structured querying! • Integration of logic + statistics is the real model/query challenge • Plus HCI issues • Unify: “query”, “browse”, “mine”, “rank” • Cohera integrates AltaVista into the engine & optimizer
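The "logic + statistics" combination from slide 19 can be sketched as a single query that applies an exact predicate and a ranked text match together. This is a toy illustration, not how any engine named in the talk works: the scoring uses Jaccard overlap of word sets, and the rows, prices, and function names are hypothetical (the product names reuse the ink examples from slide 13).

```python
def tokens(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().replace(",", " ").split())

def search(rows, query, max_price):
    """Filter by an exact predicate, then rank survivors by text similarity."""
    q = tokens(query)
    results = []
    for row in rows:
        if row["price"] > max_price:           # logic: exact predicate
            continue
        name = tokens(row["name"])
        score = len(q & name) / len(q | name)  # statistics: Jaccard overlap
        if score > 0:
            results.append((score, row["name"]))
    # Best match first; break ties alphabetically for stable output.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [name for _, name in results]

rows = [
    {"name": "Black Ink", "price": 4.0},
    {"name": "India ink", "price": 6.0},
    {"name": "fountain pen ink, black", "price": 12.0},
    {"name": "Stapler", "price": 5.0},
]
print(search(rows, "black ink", max_price=10))  # ['Black Ink', 'India ink']
```

An engine that integrates the two would let its optimizer decide whether to evaluate the predicate or the text match first, which is the kind of query optimization challenge the slide mentions.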

  20. Core Systems Issues Remain Important • Availability, Scalability, Load Balancing • All critically important in the B2B space • Availability: you don’t even control the components! Outage=news. • Scalability: MRO wants to grow up to very big installations • Load Balancing: need to respect SLAs, etc. • Need adaptive, load-balancing, federated QP • 100s to 1000s of “sites” • Replication is key to availability, but optimizer must understand it • Cohera’s economic model adapts for each query • Other models being studied (see DE Bulletin 6/2000) • Compile-time, centralized optimizers (R*, et al.) will break

  21. Query Processing: Themes • Standards • Logic + Statistics • Adaptivity to changing performance, load, failures • Optimizer Scalability

  22. So What Really Matters Today? • Cohera sells because… • Customers need the content integration workbench today • They are in integration pain! • Comes in multiple guises (e-catalog, supplier enablement, etc.) • Smart tools start cutting the pain immediately • Customers want an open, standard solution • Plain old SQL and relational schemas (vs. Requisite, e.g.) • XML “in the bottom”, “out the top” for messaging/integration • Customers want federated querying…tomorrow • For today, they’ll settle for a centralized solution • Want the flexibility to grow in that direction • Federated query engine works fine centralized • The converse clearly not true

  23. Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Research Evangelism

  24. Research Evangelism • Semi-Automatic Tools • Statistical + logical techniques, with a user in the loop • E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ’01] http://control.cs.berkeley.edu • schema integration algebra • interactive visualization • programming-by-example • statistical inferencing for discrepancies and domain detection • A new class of “systems” work! • “Tools”/“Apps” must be part of our agenda • Many systems challenges here, especially on the stat/HCI side • Architectural elegance, API design, extensibility, scalability, etc.

  25. Research Evangelism, Cont. • Adaptive Query Processing • Critical to the federated B2B space • Unpredictable world, you don’t control the components • Also critical to the ubiquitous computing space • Sensors are the next challenge • Who’s the DBA of your housepaint? The freeway lines? • Economic optimization (Mariposa) is one model • Finer-Grained adaptivity possible (Eddies, SIGMOD 2K) • See http://telegraph.cs.berkeley.edu for examples, ideas, SW

  26. Research Evangelism, Cont. • Tired of research on relational? Choose wisely! • One big direction here is to integrate IR • Another is to abandon languages in favor of interfaces • query+browse+mine: semi-automatic GUIs again! • XML is critical to business, but under control • We’re doing fine in this space, thank you • XQuery will push (merge with?) SQL • The end-result will resemble things you’ve seen before • But text search is eating our lunch! • Intellectual impact in the last decade? • Industrial impact in the last decade? • Text search is mostly “just” an access method + a sort metric • Integrate into our composable algebras and architectures! • Teach it in our undergrad classes

  27. Summary • Content Integration is a new, challenging industrial space • Cohera provides the first complete solution • Access with varying relationships, formats • Support for multiple schemas and taxonomies • Support for custom syndication • Support for distributed data, both cacheable and uncacheable • Ad hoc querying • Fuzzy & structured search • Availability, Scalability, Load Balancing • Smart graphical tools for content managers • A fertile area for research as well • Join the fun!
