Content Integration for E-Business

Joe Hellerstein


New Generation of e-Business on the Internet

  • Companies moving beyond marketing, storefronts

  • Attempting to do operations on the Internet

    • procurement

    • supply chain

    • customer relationships

    • etc.

  • In a cross-enterprise environment

  • Requires cross-enterprise content integration

    • catalog integration is the procurement instance of this problem

Content Integration

  • Content integration across enterprises

    • Not the “in-house” data warehousing problem

    • Not the Enterprise App Integration (EAI) problem

  • “Operational” data must be integrated

    • As opposed to historical (trend) data

    • E.g. pricing, availability, supply chain

  • Structured and unstructured data

    • Not just relational or XML queries

    • Not just text search

    • A combination of the two: logic meets statistics

The “Butterfly”

  • Everybody’s favorite picture c. 1/2000:

  • The open question (6/2001) is how many butterflies there are, and who owns them

    • Not a startup opportunity (Transora vs. Chemdex)

    • Perhaps one of the wings is smaller than the other (HomeDepot)




Road Map

  • Setting

  • Scenarios & Terminology

  • Characteristics and Challenges of Content Integration

  • Research Evangelism

Some Scenarios for Content Integration

  • Catalog Management: Integration and Syndication

    • “MRO” (Maintenance, Repair and Operations) a la Grainger

    • Thousands of suppliers, run by a “content manager”

  • Availability and Pricing

    • Travel industry

    • Necessitates live, cross-enterprise querying

  • Supply Chain Management

    • E.g. auto industry

    • Increase in production requires the entire supply chain (“the cows”)

    • Contractual information along with catalog and availability

Marketing: The EcoSystem and its Terminology

  • Enterprise Application Integration (EAI): App Glue

    • Imperative, message-oriented programming (scripting languages)

    • Transactional networking (persistent queues)

    • Gateways to popular packaged apps

    • Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQseries, etc.

  • Data Integration: Warehousing and associated processes

    • Intra-enterprise, for “business intelligence” (historical trends)

    • Vendors: Informatica, Ascential, DBMS vendors

  • Content Management: Tools for content creation

    • Web page and graphic design

    • Versioning and configuration management

    • Vendors: Vignette, Interwoven, etc.

Road Map

  • Setting

  • Scenarios & Terminology

  • Characteristics and Challenges of Content Integration

    • Content Access, Mapping and Transformation

    • Query Processing

  • Research Evangelism

Content Integration: Characteristics and Challenges

  • New integration challenges for e-business

    • cross-enterprise

    • operational

    • data-centric (not app-centric)

    • structured/unstructured

  • Two main thrusts

    • Content Access, Mapping and Transformation

    • Query Processing

Content Access: Relationships with Providers

  • Varying relationships with content providers

    • Direct DBMS access (typically in-house)

    • Direct access to federated apps (SAP, etc.)

      • Gateway vendors a la Merant, NEON, Attunity, etc.

    • Arm’s-length relationships

      • HTML screen scraping

      • XML messaging

  • Relationships evolve over time!

    • MySimon example

Content Mapping

  • Syntactic and semantic integration

    • Formatting/normalization is one piece of the puzzle

      • XML, HTML, Relational, etc.

    • Semantics is much harder

      • E.g. “price”. E.g. “delivery”.

  • Semantics gate the process

    • A “content manager” must own the transformation task

    • Ease of use critical

      • Home Depot has 60,000 suppliers!

      • Standards can help a bit (e.g. UDDI)

      • But graphical tools are the name of the game

Schemas and Taxonomies

  • Cross-enterprise = multiple schemas

    • Even if standards prevail (very optimistic)

    • Early e-catalog systems were locked into one schema

      • Great for service companies, e.g. Requisite

      • Tools are sounding the death knell

  • Taxonomies are critical

    • Natural for browsing, especially with dirty data

      • “Black Ink”, “India ink”, “fountain pen ink, black”

    • One taxonomy per vertical market, plus standards like UNSPSC

      • Office Supplies->Ink and lead refills->India ink

    • Taxonomy as data: query it, browse it, etc.

  • Integration task includes taxonomy integration!
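
The "taxonomy as data" point can be made concrete with a small sketch. The code below is only an illustration, not any product's data model: the `TaxonomyNode` class and its method names are hypothetical, and the "Pencil lead" category and "HB lead refill" item are made up to give the tree a second branch. It files the dirty item names from this slide under the UNSPSC-style path shown above, then browses and queries the tree.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One category in a product taxonomy (hypothetical structure)."""
    name: str
    children: dict = field(default_factory=dict)   # child name -> TaxonomyNode
    items: list = field(default_factory=list)      # items filed directly here

    def add_path(self, path, item=None):
        """Create the categories along `path`; optionally file an item at the leaf."""
        node = self
        for name in path:
            node = node.children.setdefault(name, TaxonomyNode(name))
        if item is not None:
            node.items.append(item)
        return node

    def find(self, path):
        """Browse: walk down `path`, returning None if a category is missing."""
        node = self
        for name in path:
            node = node.children.get(name)
            if node is None:
                return None
        return node

    def all_items(self):
        """Query: every item filed at or below this category."""
        found = list(self.items)
        for child in self.children.values():
            found.extend(child.all_items())
        return found


root = TaxonomyNode("root")
india_ink = ["Office Supplies", "Ink and lead refills", "India ink"]
for dirty_name in ["Black Ink", "India ink", "fountain pen ink, black"]:
    root.add_path(india_ink, dirty_name)
# Made-up second branch, for illustration only.
root.add_path(["Office Supplies", "Ink and lead refills", "Pencil lead"], "HB lead refill")

refills = root.find(["Office Supplies", "Ink and lead refills"])
subcategories = sorted(refills.children)   # browsing one level down
everything = refills.all_items()           # querying the whole subtree
```

Because the taxonomy is ordinary data, the same structure supports both browsing (`children`) and querying (`all_items`), and merging two suppliers' taxonomies becomes a data integration task like any other.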

Themes in Content Access and Mapping

  • Scalability in human terms

  • “Content managers”, not geeks

  • The name of the game: semi-automatic tools

    • Statistical (“fuzzy”) techniques to provide hints (not silver bullets)

    • Integrated into graphical programming-by-example interfaces

    • Problem domains:

      • Wrapper generation

      • Data cleaning

      • Schema mapping

      • Taxonomy mapping

      • Syndication

  • One of the key “systems” challenges today
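
As a rough illustration of "statistical hints, not silver bullets" for schema mapping, the sketch below ranks candidate target fields for each supplier field by plain string similarity, using Python's standard `difflib.SequenceMatcher`. The function name `suggest_mappings`, the field names, and the 0.5 threshold are all assumptions made for this example; a real tool would also look at the data values and leave the final decision to the content manager.

```python
from difflib import SequenceMatcher


def suggest_mappings(source_fields, target_fields, threshold=0.5):
    """For each source field, list (score, target) hints, best first.

    An empty list means "no confident hint": the content manager decides.
    """
    hints = {}
    for src in source_fields:
        scored = []
        for tgt in target_fields:
            score = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
            if score >= threshold:
                scored.append((round(score, 2), tgt))
        hints[src] = sorted(scored, reverse=True)
    return hints


# Hypothetical supplier schema and hypothetical catalog schema.
hints = suggest_mappings(
    ["unit_price", "descr", "sku"],
    ["price", "description", "part_number"],
)
for field_name, candidates in hints.items():
    print(field_name, "->", candidates or "no hint, ask the user")
```

Here the name-based hint finds plausible matches for `unit_price` and `descr` but has nothing to offer for `sku`, which is exactly where a graphical, user-in-the-loop tool has to take over.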

Road Map

  • Setting

  • Scenarios & Terminology

  • Characteristics and Challenges of Content Integration

  • Research Evangelism

Query Processing Issues

  • Content to be integrated is increasingly “uncacheable”

    • Arm’s-length accessibility

    • Business rules, not data

      • E.g. custom content throughout the dataflow

    • Volatile information

      • E.g. Availability

  • Yet a great deal of content is cacheable and slowly changing

  • Upshot: need a combined technology

    • Prefetch/Cache/Replicate when possible

    • Query live when impossible
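
The "cache when possible, query live when impossible" upshot can be sketched as a thin wrapper around a data source. This is a minimal sketch under stated assumptions: the `HybridSource` class, its time-to-live policy, and the two example sources are hypothetical and stand in for whatever prefetching, caching, or replication machinery a real system would use.

```python
import time


class HybridSource:
    """Serve cached values while they are fresh; otherwise query the source live."""

    def __init__(self, fetch_live, ttl_seconds, clock=time.monotonic):
        self.fetch_live = fetch_live   # callable: key -> value (e.g. a remote query)
        self.ttl = ttl_seconds         # 0 marks the content as uncacheable
        self.clock = clock
        self._cache = {}               # key -> (value, time stored)
        self.live_calls = 0            # how often we had to go to the source

    def get(self, key):
        now = self.clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]            # cache hit, still fresh
        value = self.fetch_live(key)   # cache miss or stale: query live
        self.live_calls += 1
        if self.ttl > 0:
            self._cache[key] = (value, now)
        return value


# Slowly changing content (descriptions) versus volatile content (availability).
descriptions = HybridSource(lambda sku: f"description of {sku}", ttl_seconds=3600)
availability = HybridSource(lambda sku: 7, ttl_seconds=0)

for _ in range(3):
    descriptions.get("A1")
    availability.get("A1")
```

After the loop, the description source has been queried live once and the availability source three times, which is the intended split between cacheable and volatile content.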

Federated Query Processing

  • DBMS community must shed our materialization myopia!

    • ETL/Warehousing was inelegant and limited

    • What do we do on a “cache miss”??

    • Should be no distinction between materialized views and queries!

  • Federated Query Processing

    • Query across multiple sources

    • Choose among multiple replicas, materialized views

      • Consider staleness

  • This is the natural extension of the modern database vision

    • Cohera uses Mariposa’s economic model to do this

    • Decouples optimization, cost estimation, storage and processing
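
To make "choose among multiple replicas, materialized views" and "consider staleness" concrete, here is a loose sketch of bid-based site selection. It is not the Mariposa or Cohera protocol: the `Bid` record, the `choose_site` function, the site names, and the prices are all hypothetical, and a real economic optimizer would also handle budgets, delays, and subquery decomposition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    site: str
    price: float         # what the site asks to run the subquery
    staleness_s: float   # age of the site's copy in seconds; 0.0 for the live source


def choose_site(bids, max_staleness_s):
    """Pick the cheapest bid whose data is fresh enough, or None if none qualifies."""
    eligible = [b for b in bids if b.staleness_s <= max_staleness_s]
    return min(eligible, key=lambda b: b.price, default=None)


bids = [
    Bid("supplier-live", price=9.0, staleness_s=0.0),
    Bid("regional-replica", price=2.0, staleness_s=300.0),
    Bid("nightly-view", price=0.5, staleness_s=86_400.0),
]

# Catalog descriptions tolerate week-old data; availability does not.
catalog_choice = choose_site(bids, max_staleness_s=7 * 86_400)
availability_choice = choose_site(bids, max_staleness_s=60)
```

The same query interface serves both cases: the staleness bound decides whether a materialized view, a replica, or the live source answers, so the caller never distinguishes a "view" from a "query".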

Standard Queries Required

  • Hand-coded queries are brittle: you want ad-hoc

    • Don’t buy a handful of beans

  • Need support for standard query languages

    • SQL and XPath today

    • SQL/XQuery tomorrow

  • Everybody knows this!

    • Part of industrial religion

      • Oracle on one side

      • Dotcoms on the other side

    • You might get by claiming to be “XML compliant”

      • But most people have cottoned on by now
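
As a small illustration of "SQL and XPath today", the sketch below asks the same ad hoc question (which items are in the ink category?) of a relational table and of an XML document, using only Python's standard library. The table, the XML layout, and the prices are made up for the example. Note that `xml.etree.ElementTree` supports only a subset of XPath, which is enough here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational side: an in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [("India ink", "ink", 4.50), ("Pencil lead", "lead", 1.25), ("Black ink", "ink", 3.75)],
)
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM item WHERE category = ? ORDER BY price", ("ink",)
    )
]

# XML side: the same made-up catalog as a document.
CATALOG_XML = """
<catalog>
  <item category="ink"><name>India ink</name><price>4.50</price></item>
  <item category="lead"><name>Pencil lead</name><price>1.25</price></item>
  <item category="ink"><name>Black ink</name><price>3.75</price></item>
</catalog>
"""
doc = ET.fromstring(CATALOG_XML.strip())
xpath_names = [e.text for e in doc.findall(".//item[@category='ink']/name")]

print(sql_names)     # ordered by price
print(xpath_names)   # in document order
```

Neither query was anticipated when the data was loaded, which is the point of supporting standard query languages rather than hand-coded access paths.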

IR Capabilities Need to Be in the Engine

  • The best-integrated data will still be noisy (product names, etc.)

  • Text search on taxonomies, names, descriptions

  • Still no good integration of DBMS and IR engines

    • Storage (compression huge in IR)

    • Index concurrency (many updates per doc in IR)

    • Query optimization challenges

  • Note: this is not semi-structured querying!

    • Integration of logic + statistics is the real model/query challenge

      • Plus HCI issues

    • Unify: “query”, “browse”, “mine”, “rank”

  • Cohera integrates AltaVista into the engine & optimizer
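
A toy version of "logic + statistics" in a single query might look like the sketch below: exact predicates on structured fields decide which items qualify, and a similarity score on the noisy name decides their order. The `search` function, the Jaccard token overlap used as the ranking metric, and the item data are assumptions for illustration; this is not a description of how Cohera or AltaVista work.

```python
def tokens(text):
    """Lower-case word set, with commas treated as separators."""
    return set(text.lower().replace(",", " ").split())


def search(items, query, category, max_price):
    """Filter on structured fields (logic), then rank by text similarity (statistics)."""
    q = tokens(query)
    results = []
    for item in items:
        if item["category"] != category or item["price"] > max_price:
            continue                              # structured predicate fails
        t = tokens(item["name"])
        score = len(q & t) / len(q | t)           # Jaccard overlap of word sets
        if score > 0:
            results.append((score, item["name"]))
    # Best score first; ties broken by name for a stable order.
    return [name for score, name in sorted(results, key=lambda r: (-r[0], r[1]))]


# Made-up catalog rows; the names echo the dirty-data example from the taxonomy slide.
items = [
    {"name": "Black Ink", "category": "ink", "price": 3.75},
    {"name": "India ink", "category": "ink", "price": 4.50},
    {"name": "fountain pen ink, black", "category": "ink", "price": 12.00},
    {"name": "Black toner", "category": "toner", "price": 40.00},
]

cheap = search(items, "black ink", category="ink", max_price=10)
any_price = search(items, "black ink", category="ink", max_price=20)
```

An optimizer that understands both halves can choose whether to apply the price predicate or the text match first, which is the kind of decision a bolted-on search engine cannot make.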

Core Systems Issues Remain Important

  • Availability, Scalability, Load Balancing

    • All critically important in the B2B space

    • Availability: you don’t even control the components! Outage=news.

    • Scalability: MRO wants to grow up to very big installations

    • Load Balancing: need to respect SLAs, etc.

  • Need adaptive, load balancing, federated QP

    • 100s to 1000s of “sites”

    • Replication is key to availability, but optimizer must understand it

    • Cohera’s economic model adapts for each query

    • Other models being studied (see DE Bulletin 6/2000)

    • Compile-time, centralized optimizers (R* et al.) will break

Query Processing: Themes

  • Standards

  • Logic + Statistics

  • Adaptivity to changing performance, load, failures

  • Optimizer Scalability

So What Really Matters Today?

  • Cohera sells because…

    • Customers need the content integration workbench today

      • They are in integration pain!

      • Comes in multiple guises (e-catalog, supplier enablement, etc.)

      • Smart tools start cutting the pain immediately

    • Customers want an open, standard solution

      • Plain old SQL and relational schemas (vs. Requisite, e.g.)

      • XML “in the bottom”, “out the top” for messaging/integration

    • Customers want federated querying…tomorrow

      • For today, they’ll settle for a centralized solution

      • Want the flexibility to grow in that direction

        • Federated query engine works fine centralized

        • The converse clearly not true

Road Map

  • Setting

  • Scenarios & Terminology

  • Characteristics and Challenges of Content Integration

  • Research Evangelism

Research Evangelism

  • Semi-Automatic Tools

    • Statistical + logical techniques, with a user in the loop

    • E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ’01]

      • schema integration algebra

      • interactive visualization

      • programming-by-example

      • statistical inferencing for discrepancies and domain detection

    • A new class of “systems” work!

      • “Tools”/“Apps” must be part of our agenda

      • Many systems challenges here, especially on the stat/HCI side

        • Architectural elegance, API design, extensibility, scalability, etc.

Research Evangelism, Cont.

  • Adaptive Query Processing

    • Critical to the federated B2B space

      • Unpredictable world, you don’t control the components

    • Also critical to the ubiquitous computing space

      • Sensors are the next challenge

      • Who’s the DBA of your housepaint? The freeway lines?

    • Economic optimization (Mariposa) is one model

    • Finer-Grained adaptivity possible (Eddies, SIGMOD 2K)

    • See for examples, ideas, SW
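
The idea of finer-grained adaptivity can be illustrated with a much simplified, eddy-style router for selection predicates. This is a sketch in the spirit of the idea only: the `Eddy` class, its bookkeeping, and the routing policy (try the filter with the lowest observed pass rate first) are assumptions chosen for brevity, not the algorithm from the Eddies paper.

```python
class Eddy:
    """Route each row through all filters, reordering them as statistics accumulate."""

    def __init__(self, filters):
        # name -> [predicate, rows seen, rows passed]
        self.stats = {name: [pred, 0, 0] for name, pred in filters.items()}

    def _pass_rate(self, name):
        _, seen, passed = self.stats[name]
        return passed / seen if seen else 0.5   # neutral guess before any evidence

    def route(self, row):
        """Return True if the row passes every filter; stop at the first failure."""
        order = sorted(self.stats, key=self._pass_rate)   # most selective so far first
        for name in order:
            entry = self.stats[name]
            entry[1] += 1
            if not entry[0](row):
                return False
            entry[2] += 1
        return True

    def run(self, rows):
        return [row for row in rows if self.route(row)]

    def evaluations(self):
        """Total number of predicate evaluations performed so far."""
        return sum(entry[1] for entry in self.stats.values())


eddy = Eddy({
    "even": lambda x: x % 2 == 0,
    "small": lambda x: x < 10,   # passes early rows, then rejects everything
})
result = eddy.run(range(100))
```

No compile-time ordering is fixed: once the "small" filter starts rejecting rows, it drifts to the front and most rows are discarded after a single predicate evaluation, so the total work stays well below the 200 evaluations a naive plan would need.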

Research Evangelism, Cont.

  • Tired of research on relational? Choose wisely!

    • One big direction here is to integrate IR

    • Another is to abandon languages in favor of interfaces

      • query+browse+mine: semi-automatic GUIs again!

  • XML is critical to business, but under control

    • We’re doing fine in this space, thank you

    • XQuery will push (merge with?) SQL

    • The end-result will resemble things you’ve seen before

  • But text search is eating our lunch!

    • Intellectual impact in the last decade?

    • Industrial impact in the last decade?

    • Text search is mostly “just” an access method + a sort metric

      • Integrate into our composable algebras and architectures!

      • Teach it in our undergrad classes
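
The claim that text search is mostly "an access method + a sort metric" can be sketched in a few lines: an inverted index plays the role of the access method, and a TF-IDF-style score plays the role of the sort metric. The function names, the exact weighting formula, and the three documents are assumptions for illustration; production engines add compression, better tokenization, and tuned ranking functions.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Access method: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def ranked_scan(index, n_docs, query):
    """Sort metric: return (doc_id, score) pairs, highest TF-IDF-style score first."""
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))   # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up documents for the example.
docs = {
    1: "black india ink",
    2: "black toner cartridge",
    3: "fountain pen ink black ink",
}
index = build_index(docs)
ink_hits = [doc_id for doc_id, _ in ranked_scan(index, len(docs), "ink")]
india_ink_hits = [doc_id for doc_id, _ in ranked_scan(index, len(docs), "india ink")]
```

Seen this way, `ranked_scan` is just another operator that produces tuples in a useful order, so it can be composed with joins and selections like any other scan.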

Summary

  • Content Integration is a new, challenging industrial space

    • Cohera provides the first complete solution

      • Access with varying relationships, formats

      • Support for multiple schemas and taxonomies

      • Support for custom syndication

      • Support for distributed data, both cacheable and uncacheable

      • Ad hoc querying

      • Fuzzy & structured search

      • Availability, Scalability, Load Balancing

      • Smart graphical tools for content managers

    • A fertile area for research as well

      • Join the fun!