A Dynamic Warehouse for the XML data of the Web
Grégory COBENA, INRIA & Xyleme SA (Gregory.Cobena@inria.fr)
Serge Abiteboul, INRIA & Xyleme SA (Serge.Abiteboul@inria.fr)
http://www-rocq.inria.fr/verso/
http://www.xyleme.com/

Xyleme, 2001


Organization
  • 1. The Web and XML
  • 2. Xyleme
  • 3. Data Acquisition and Maintenance
  • XML Repository, Semantic Data Integration and Query Processing
  • 4. Query Subscription
  • Conclusion
The Web today
  • Terabytes of data
  • A lot of public pages
    • 1 billion as of June 2000
    • several million servers
  • Private web: not publicly available pages
  • Deep web: data hidden behind forms
HTML = Hypertext Language
  • Information System (source data):

Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99

  • HTML page:

The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.

  • HTML = Text + presentation
  • Where is the data? (hard)
XML = Semistructured Data
  • Information System (source data):

Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
...

  • XML document:

<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>

  • XML = Data + Structure (easy)
  • Semistructured: more flexible
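To make the "Data + Structure" point concrete, here is a minimal Python sketch that recovers the Ref / Name / Price rows from a product table like the one on the slide. The function name `extract_rows` and the placeholder description text are assumptions for illustration, not part of Xyleme.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the slide's fragment; the description text is a placeholder.
PRODUCT_TABLE = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> ... </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> ... </description>
  </product>
</product-table>
"""

def extract_rows(xml_text):
    """Return one (reference, designation, price) tuple per <product> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        rows.append((
            product.get("reference"),                 # attribute value
            product.findtext("designation").strip(),  # element text
            float(product.findtext("price")),         # float() ignores the padding spaces
        ))
    return rows

rows = extract_rows(PRODUCT_TABLE)
```

Because the structure travels with the data, no text scraping or guessing at presentation markup is needed, which is the contrast the slides draw with HTML.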

XML : Tree Types
  • Semantics and structure are in paths
    • product-table/product/reference
    • product-table/product/price
  • Tree (figure):
    • product-table
      • product
        • reference
        • price
        • designation
        • description
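The idea that structure lives in paths can be sketched in a few lines of Python. The helper `label_paths` is hypothetical. It treats an attribute as one more path step, which is how the slide writes product-table/product/reference.

```python
import xml.etree.ElementTree as ET

def label_paths(element, prefix=""):
    """Collect the distinct root-to-node label paths of an XML tree.

    Attributes are treated as one more path step (an assumption of this
    sketch, chosen to mirror product-table/product/reference).
    """
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    paths = {path}
    for name in element.attrib:
        paths.add(f"{path}/{name}")
    for child in element:
        paths |= label_paths(child, path)
    return paths

doc = ET.fromstring(
    '<product-table><product reference="X23">'
    '<designation>camera</designation><price unit="Dollars">359.99</price>'
    '</product></product-table>'
)
paths = sorted(label_paths(doc))
```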

Xyleme Research

Project Xyleme at INRIA (1999-2000):

Explore XML + Web + DBMS to make the Web a Knowledge Database

  • INRIA
    • Sophie Cluet: Databases (OQL…)
    • Serge Abiteboul: semi-structured data + web
    • Guy Ferran: ex O2 Technology
  • Mannheim University
    • Guido Moerkotte
  • Université d’Orsay
    • Marie Christine Rousset
  • CNAM
    • Dan Vodislav
Xyleme Company
  • Started September 2000 (25 employees at the end of 2001)
  • Market Challenges:
    • Few XML documents available on the Web (because of weak software support)
    • Company is focusing on private XML:
      • Press, Editors, Financial Data, Biology…
  • Technology:
    • Scalability for large amounts of data
    • Internet (+focus) / Intranet support
    • Monitoring and Version Management
    • Heterogeneous Data Integration
Architecture
  • Cluster of PCs
  • Developed with Linux and C++
  • Communications
    • local: CORBA
    • external: HTTP
  • Distribution between autonomous machines
  • Now Web Services
Functional Architecture

[Figure: functional architecture. Components shown: User Interface, Web Interface, Xyleme Interface, Acquisition & Crawler, Change Control, Semantic Module, Loader, Query Processor, Repository and Index Manager. A line labelled INTERNET marks the boundary with the outside world.]

Architecture

[Figure: cluster architecture. Machines shown: Change Control and Semantic Integration (×2), Acquisition and Maintenance (×2), Index (×3), Loader | Query (×2), Repository (×4). Labels: INTERNET, ETHERNET.]

Goals
  • Discover XML pages on the web that are of interest for customers
    • To do this, crawl the web (HTML+XML)
  • Keep them up to date
  • Do this under bounded resources:
    • Memory for known URLs
    • Bandwidth
Life Cycle of a page in Xyleme
  • The URL of D is discovered as a link in another page (or published by a customer)
  • The page scheduler decides to read D
    • The meta data of D is read
      • type, last_date_update...
    • The document D is loaded
  • The document D is (re)read regularly
Main Issues
  • Loading of pages
    • we can load up to 5 million pages/day on a standard PC
    • the main cost is the Internet connection
  • Metadata management (access to disk)
  • Page scheduling
    • decide which page to read or refresh next
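The slides do not say how the scheduler ranks pages, so the following is only a toy Python sketch of the "read or refresh next" decision. It assumes each page carries a fixed refresh interval and that the page due earliest goes first. The class `PageScheduler` and all of its methods are hypothetical.

```python
import heapq

class PageScheduler:
    """Toy scheduler: always hands out the page whose next read is due first."""

    def __init__(self):
        self._heap = []       # entries: (due_time, url)
        self._known = set()   # a real system must bound this; the sketch does not

    def discover(self, url, now):
        """A newly discovered URL is due immediately."""
        if url not in self._known:
            self._known.add(url)
            heapq.heappush(self._heap, (now, url))

    def next_page(self, now):
        """Return the URL that is due earliest, or None if nothing is due yet."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def loaded(self, url, now, refresh_interval):
        """After a (re)read, schedule the next refresh of the page."""
        heapq.heappush(self._heap, (now + refresh_interval, url))

scheduler = PageScheduler()
scheduler.discover("a", now=0)
scheduler.discover("b", now=0)
first = scheduler.next_page(now=0)
scheduler.loaded(first, now=0, refresh_interval=10)
second = scheduler.next_page(now=0)
scheduler.loaded(second, now=0, refresh_interval=5)
idle = scheduler.next_page(now=1)
refreshed = scheduler.next_page(now=5)
```

A production scheduler would weigh page importance and observed change rate against the memory and bandwidth bounds listed under Goals. The fixed interval here only stands in for that policy.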
Page Importance
  • Definition: Important pages are linked to by important pages
  • Offline algorithm (used by Google)
  • Our Online algorithm (M. Preda, S. Abiteboul, G. Cobena)
    • does not require maintaining graph information
    • faster convergence with focused crawling
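One way to compute importance online is a cash-distribution scheme. Each page holds some "cash"; when a page is read, its cash is recorded in its history and split among the pages it links to. The Python sketch below is a simplified illustration in that spirit and is not claimed to be the authors' exact algorithm. The function name, the random visiting order, and the requirement that every page has at least one outgoing link are assumptions.

```python
import random

def online_importance(links, steps=20000, seed=0):
    """Cash-distribution sketch of an online importance computation.

    links maps each page to the list of pages it points to. Every page
    must have at least one outgoing link (an assumption of this sketch).
    """
    rng = random.Random(seed)
    pages = list(links)
    cash = {p: 1.0 / len(pages) for p in pages}   # cash currently held by each page
    history = {p: 0.0 for p in pages}             # cash collected at past visits
    for _ in range(steps):
        page = rng.choice(pages)                  # stand-in for the crawler's visiting order
        history[page] += cash[page]
        share = cash[page] / len(links[page])
        cash[page] = 0.0
        for target in links[page]:                # only this page's own links are needed
            cash[target] += share
    total = sum(history.values())
    return {p: history[p] / total for p in pages}

importance = online_importance({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Each step uses only the outgoing links of the page being read, so the full link graph never has to be stored. In this three-page example, "a" and "c" end up more important than "b", which receives only half of the cash of "a".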
Querying Language
  • Today: A mix of OQL and XQL
  • We are currently moving to X-Query (which is also a mix of OQL and XQL…)

Select boss/Name, boss/Phone
From comp in BusinessDomain,
     boss in comp//Manager
Where comp/Product contains “Xyleme”
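To show what the query computes, here is a Python sketch that evaluates the same selection over two made-up documents. The sample companies, the function `bosses_of`, and the reading of "contains" as a substring test are assumptions. The real query processor works on the repository and its indexes, not on in-memory trees.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents standing in for the BusinessDomain collection.
BUSINESS_DOMAIN = [
    ET.fromstring(
        "<Company><Product>Xyleme warehouse</Product>"
        "<Staff><Manager><Name>Ann</Name><Phone>555-0100</Phone></Manager></Staff>"
        "</Company>"
    ),
    ET.fromstring(
        "<Company><Product>Other tool</Product>"
        "<Manager><Name>Bob</Name><Phone>555-0101</Phone></Manager>"
        "</Company>"
    ),
]

def bosses_of(domain, keyword):
    """Return (Name, Phone) of every manager of a company whose Product mentions keyword."""
    results = []
    for comp in domain:                                   # From comp in BusinessDomain
        products = comp.findall("Product")                # comp/Product (direct children)
        if not any(keyword in (p.text or "") for p in products):
            continue                                      # Where ... contains keyword
        for boss in comp.iter("Manager"):                 # boss in comp//Manager (any depth)
            results.append((boss.findtext("Name"), boss.findtext("Phone")))
    return results

result = bosses_of(BUSINESS_DOMAIN, "Xyleme")
```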

Web Heterogeneity
  • Semantic domains, e.g., cinema
  • Many possible types for data in this domain, many DTDs
  • Semantic Integration
    • one abstract DTD for the domain
    • gives the illusion that the system maintains a homogeneous database for this domain

1 domain = 1 abstract DTD

Indexing
  • Standard inverted index
    • word → documents that contain this word
  • Xyleme index
    • word → elements that contain this word (document + element identifier)
  • Goal: more work can be performed without accessing data
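The difference between the two index granularities can be sketched in Python as follows. The element identifier used here (the element's preorder position in its document) and the function name are assumptions made for the example. The slides do not describe Xyleme's actual identifier scheme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_element_index(documents):
    """Map each word to the (document id, element id) pairs whose own text contains it.

    Element ids are preorder positions within the document (an assumption
    of this sketch).
    """
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element_id, element in enumerate(root.iter()):
            for word in (element.text or "").lower().split():
                index[word].add((doc_id, element_id))
    return index

docs = {
    "d1": "<painting><painter>Claude Monet</painter><title>Water Lilies</title></painting>",
    "d2": "<book><title>Monet in Giverny</title></book>",
}
index = build_element_index(docs)
hits = index["monet"]
```

A document-level index would only report that d1 and d2 both contain "monet". The element-level postings also say where in each document the word occurs, so structural conditions can be checked without loading the documents.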
The Web changes all the time
  • Data acquisition + maintenance
    • keep the warehouse up-to-date
  • Version management
    • representation and storage of changes
  • Change monitoring
    • query subscription
Subscription Language
  • SQL-like language based on ‘atomic events’.
  • Combines the use of monitoring queries and continuous queries.
  • The language can be extended by adding new types of atomic events.
  • Uses the XML Query Language for continuous queries (see “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report).
Example

subscription myPaintings
  % what are the new painting entries in Musee d’Orsay site
  monitoring newPainting
    select URL
    where URL extends www.musee-orsay.fr/*
      and <painter> contains “Monet”
  % manage the changes in the expositions
  continuous delta Exposition
    select ... from ... where
    when monthly
  notify daily   % send me a daily report

The two conditions in the where clause of the monitoring query are atomic events.

Step 1: Atomic Event Detection
  • 5 million pages/day
  • Atomic event 46: URL matches pattern www.musee-orsay.fr/*
  • Atomic event 67: XML document contains the tag <painter> with the value “Monet”

[Figure: during loading, the metadata manager, HTML parser and XML loader annotate document d with the atomic events they detect (d, d/46, d/46,67) and pass the document and its alerts on to complex event detection.]
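The two atomic events of the example can be sketched in Python as small predicates applied to each loaded document. The event codes 46 and 67 come from the slide. The registry `ATOMIC_EVENTS`, the helper names, and the use of shell-style wildcards for the URL pattern are assumptions.

```python
import fnmatch
import xml.etree.ElementTree as ET

def url_matches(pattern):
    """Atomic event: the document URL matches a wildcard pattern."""
    return lambda url, root: fnmatch.fnmatchcase(url, pattern)

def tag_contains(tag, value):
    """Atomic event: some <tag> element has text containing the given value."""
    return lambda url, root: any(value in (e.text or "") for e in root.iter(tag))

# Event codes follow the slide; the registry itself is hypothetical.
ATOMIC_EVENTS = {
    46: url_matches("www.musee-orsay.fr/*"),
    67: tag_contains("painter", "Monet"),
}

def detect_atomic_events(url, xml_text):
    """Return the sorted codes of the atomic events raised by one document."""
    root = ET.fromstring(xml_text)
    return sorted(code for code, test in ATOMIC_EVENTS.items() if test(url, root))

DOC = "<painting><painter>Claude Monet</painter></painting>"
alerts = detect_atomic_events("www.musee-orsay.fr/collections/42.xml", DOC)
other = detect_atomic_events("www.example.org/a.xml", DOC)
```

The list of codes attached to a document corresponds to the d/46,67 annotation in the figure and is the input to complex event detection.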

Step 2: Complex Event Detection
  • Millions of page alerts per day
  • Millions of subscriptions
  • Complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)

[Figure: the HTML parser and the XML loader feed complex event detection.]

Step 3: Notification Processor
  • Millions of notifications/day

[Figure: components: complex event detection, triggers (with a clock), continuous queries, Reporter. Flows are labelled alerts, notification/monitoring and notification/results.]

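Architecture

[Figure: subscription system architecture. Components: Xyleme Query Processor, Trigger Engine, Xyleme Alerter, Complex Event Detection, Xyleme Reporter, Xyleme Subscription Manager, Web Browser, and SQL storage attached to the Reporter and the Subscription Manager. The flow into the Alerter is labelled documents.]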

Complex Events Algorithm
  • The formal problem is NP-hard
  • We proposed several possible algorithms
  • Simulation experiments showed the effectiveness of our solutions
  • The Hash-Tree based algorithm is well suited for our application:
    • 10 million Complex Events
    • 1 million Atomic Events
    • 100 Atomic events detected per document

0.8 ms to process a document. ~2 million documents per day.
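As an illustration of the matching problem, the Python sketch below stores each complex event (a conjunction of atomic events) in nested hash tables keyed by its sorted atomic event codes. It then finds every complex event contained in the set of atomic events detected on a document. This is a hash tree in the generic sense only. The authors' data structure and its optimizations are not described in the slides, and the function names and the extra complex events 13 and 14 are made up.

```python
def build_hash_tree(complex_events):
    """complex_events maps a complex event code to the atomic event codes it requires."""
    root = {"children": {}, "matches": []}
    for code, atoms in complex_events.items():
        node = root
        for atom in sorted(set(atoms)):       # canonical order: ascending atomic codes
            node = node["children"].setdefault(atom, {"children": {}, "matches": []})
        node["matches"].append(code)
    return root

def detect_complex_events(tree, detected_atoms):
    """Return the complex events whose atomic events all appear in detected_atoms."""
    atoms = sorted(set(detected_atoms))
    found = []

    def walk(node, start):
        found.extend(node["matches"])
        # Try to extend the current prefix with each remaining detected atom.
        for i in range(start, len(atoms)):
            child = node["children"].get(atoms[i])
            if child is not None:
                walk(child, i + 1)

    walk(tree, 0)
    return sorted(found)

# Complex event 12 is the slide's example; 13 and 14 are hypothetical.
tree = build_hash_tree({12: [67, 46], 13: [46], 14: [46, 99]})
matched = detect_complex_events(tree, [46, 67])
```

Only branches whose prefix has already matched are explored, so the work per document depends on the atomic events actually detected and not on the total number of subscriptions.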

Alerters
  • Each Alerter can be viewed as a plug-in that acts on a document flow.
  • All sorts of Atomic events can be detected: URL pattern detection, Keywords, XPath expressions, Page rank…
  • Can be distributed.
Some Advanced Alerts
  • Process document flow (single pass)
  • Full strings
    • Context Stack
    • Reversed look-up
  • XML Alerts
    • Reversed XPath expressions
    • Dual context stack for ‘/’ and ‘//’
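A single-pass XML alerter can be sketched in Python with a context stack of open tags and a look-up keyed by the last step of each path pattern, so that patterns are checked backwards from the current element. The sketch handles only two pattern forms, absolute paths such as /a/b/c and suffix paths such as //b/c. It does not reproduce the dual context stack mentioned above for mixing ‘/’ and ‘//’. All names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def compile_patterns(patterns):
    """Index path patterns by their last step (the reversed look-up key)."""
    by_last_step = defaultdict(list)
    for pattern in patterns:
        anywhere = pattern.startswith("//")       # '//b/c' may match at any depth
        steps = pattern.lstrip("/").split("/")
        by_last_step[steps[-1]].append((pattern, anywhere, steps))
    return by_last_step

def stream_alerts(xml_text, patterns):
    """Make a single pass over the document and return the patterns that matched."""
    by_last_step = compile_patterns(patterns)
    stack, matched = [], set()
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(element.tag)             # context stack of open elements
            for pattern, anywhere, steps in by_last_step.get(element.tag, ()):
                if anywhere:
                    if stack[-len(steps):] == steps:
                        matched.add(pattern)
                elif stack == steps:
                    matched.add(pattern)
        else:
            stack.pop()
            element.clear()                       # drop content already processed
    return matched

DOC = "<museum><painting><painter>Monet</painter></painting></museum>"
PATTERNS = ["//painter", "/museum/painting/painter", "/painter",
            "//painting/painter", "//museum/painter"]
matches = stream_alerts(DOC, PATTERNS)
```

Only patterns ending in the current tag are examined at each element, which keeps the per-element cost low when many subscriptions are registered.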
Versions
  • Objectives:
    • Temporal Queries (persistent identification of nodes)
    • Version some documents or some sites (store a ‘delta’)
    • Change Monitoring (query changes)
  • We proposed a representation of changes: “Change-Centric Management of Versions” (VLDB 2001)
  • We developed a Diff algorithm for XML: “Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)

Conclusion & Perspectives
  • Focus crawling on important pages
    • Refine notion of importance
    • Improve important pages discovery
  • Improve Change control accuracy
    • Semantic web
    • Real-time advanced processing