Information management in p2p serge abiteboul inria futurs and univ paris 11
Download
1 / 89

Information Management in P2P - PowerPoint PPT Presentation


  • 331 Views
  • Uploaded on

Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11. Introduction. Success stories at the time of the Internet bubble. Google: management of Web pages Mapquest: management of maps Amazone: book catalogue eBay: product catalogue

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Information Management in P2P' - Jimmy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Information management in p2p serge abiteboul inria futurs and univ paris 11 l.jpg
Information Management in P2PSerge AbiteboulINRIA-Futurs and Univ. Paris 11

P2P Data Management, 2006, S. Abiteboul



Success stories at the time of the internet bubble l.jpg
Success stories at the time of the Internet bubble

  • Google: management of Web pages

  • Mapquest: management of maps

  • Amazone: book catalogue

  • eBay: product catalogue

  • Napster (emule, bearshare, etc.): music database

  • Flickr: picture database

  • Wikipedia: dictionary

  • del.icio.us: annotations

  • InFrance:

    Meetic: dating database

    Kelkoo: comparative shopping

They are all about

publishing some database

P2P Data Management, 2006, S. Abiteboul


The trend is towards peer to peer and interactivity l.jpg
The trend is towards peer-to-peerand interactivity

  • P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority

  • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network

  • [email protected]; kazaa; cabal

  • Switch from centralized servers to communities and syndication

  • Interaction and Web 2.0

  • Motivations: Social, organizational

P2P Data Management, 2006, S. Abiteboul


Information management in a p2p network l.jpg
Information management in a P2P network

  • Private terminology: data ring

  • Information is heterogeneous, distributed, replicated, dynamic

  • Which info: Data + meta-data + knowledge + services

  • Peers are heterogeneous, autonomous and possibly mobile

    • From sensors to PDA to mainframe

  • Typically very large number of peers

  • Variety of requirements: QoS, performance, security, etc.

P2P Data Management, 2006, S. Abiteboul


Acknowledgement l.jpg
Acknowledgement

  • Xyleme:Scalable XML warehousing

  • Sophie Cluet, Guy Ferran (Xyleme) & many others

  • ActiveXML:Language for P2P data management

  • Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others

  • KadoP: P2P scalable XML indexing

  • Ioana Manolescu, Nicoleta Preda & others

  • Data Ring: Infrastructure for P2P data management

  • Alkis Polyzotis (UC Santa Cruz)

P2P Data Management, 2006, S. Abiteboul


Outline l.jpg
Outline

  • Introduction – the data ring

  • Calculus for P2P data management (ActiveXML)

  • Algebra for P2P data management (ActiveXML algebra)

  • Indexing in P2P (KadoP)

  • Conclusion

  • Goal of the tutorial: present issues and technology on p2p information management

  • Warning: it is very biased – it is not a survey

P2P Data Management, 2006, S. Abiteboul


Outline8 l.jpg
Outline

  • Introduction – the data ring

  • Calculus for P2P data management (ActiveXML)

  • Algebra for P2P data management (ActiveXML algebra)

  • Indexing in P2P (KadoP)

  • Conclusion

P2P Data Management, 2006, S. Abiteboul



The information in a data ring l.jpg
The information in a data ring

  • Data: tuples, collections, documents, relations…

  • Services: data sources, possibly some processing

  • Meta-data about resources: attribute/values pairs, annotations

  • Ontologies to explain data and metadata

  • View definitions

  • Data integration information, e.g., mappings between ontologies

  • Physical data: Indices and materialized views

P2P Data Management, 2006, S. Abiteboul


Functionalities of the data ring l.jpg
Functionalities of the data ring

  • Storage, persistence, replication

  • Indexing, caching, querying, updating, optimization

  • Schema management, access control

  • Fault tolerance, self tuning, monitoring

  • Resource discovery, history, provenance, annotations, multi-linguism,

  • Semantic enrichment, uncertain data

  • Each functionality may be achieved by a peer or by the network

P2P Data Management, 2006, S. Abiteboul


And now what is a peer l.jpg

A mainframe database

A file system

Web server

A PC

A PDA

A telephone

A sensor

A home appliance

A car

A manufacturing tool

A telecom equipment

A toy

Another data ring

And now, what is a peer?

Any connected device or software with some information to share

A net address and some names of resources (e.g. document or service)

P2P Data Management, 2006, S. Abiteboul


Advantages and disadvantages of p2p l.jpg
Advantages and disadvantages of P2P

  • Scaling

  • Performance

    • Optimization of parallelism

    • Avoid bottleneck

    • Replication

  • Availability

    • Replication

  • Cost

    • Avoid the cost of server

    • Share operational cost

  • Dynamicity

    • add/remove new data sources

  • Complexity

  • Performance

    • Cost for complex queries

    • Communication cost

  • Availability

    • Peers can leave

  • Consistency maintenance

    • Difficult to support transaction

  • Quality

    • Difficult to guarantee quality

P2P Data Management, 2006, S. Abiteboul


Crash course on web standards l.jpg
Crash course on Web standards

Owl

RDFS

XML

  • Data exchange format = XML

    • Labeled ordered trees

    • Its main asset = XML schema

    • There is much more

  • Distributed computing protocol = Web services

    • SOAP Simple Object Access Protocols

    • WSDL Web Service Definition Language

    • UDDIUniversal Description, Discovery and Integration

    • BPEL Business Process Execution Language

  • Query languages = XPath and XQuery

    • Declarative query language for XML + full-text + update language

  • Knowledge representation = Owl or RDF/S

Xquery

Xpath

SOAP

WSDL

P2P Data Management, 2006, S. Abiteboul


Slide15 l.jpg

Information used to live in islands but with the Web, this is changing:uniform access to information… …the dream for distributed data management


Do you like the standards l.jpg
Do you like the standards? is changing:

  • It is the wrong question!

  • Correct questions: What can you do with it? What is missing?

  • Is Xquery the ultimate query language for the Web? No

    • It is a language for querying centralized XML

    • We will see what it is missing

  • We will not talk much about semantics

P2P Data Management, 2006, S. Abiteboul


Automatic and distributed management of the data ring l.jpg
Automatic and distributed management of the data ring is changing:

  • No centralized server

  • No information administrator (no info manager)

  • Most users are non-experts

    • E.g., scientists

  • Requirements

    • Ease of deployment (zero-effort)

    • Ease of administration (zero-effort)

    • Ease of publication (epsilon-effort)

    • Ease of exploitation (epsilon-effort)

    • Participation in community building notably via annotations

Happy database admin

P2P Data Management, 2006, S. Abiteboul


What should be made automatic l.jpg
What should be made automatic is changing:

  • Self-statistics from the monitoring of the data ring

    • In particular, define the statistics that are needed*

  • Self-tuning based on the self-statistics

    • Choose the most appropriate organization

    • Decide to install access structures: indexes, views, etc.

    • Control replication of data and services

  • Self-healing

    • Recovery from errors

    • E.g., replacement of a failing Web service

  • And automatic file management

P2P Data Management, 2006, S. Abiteboul


Any hope l.jpg
Any hope? is changing:

  • Technology exists (database self-tuning, machine learning, etc.)

  • But self-tuning for databases has advanced very slowly

  • Why can this work?

  • There is no alternative (for db, this was just a cool gadget)

  • KISS (keep it simple stupid!)

  • The power of parallelism

  • This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)

P2P Data Management, 2006, S. Abiteboul


Distributed access control l.jpg
Distributed access control is changing:

  • Goal: Control access to ring resources

  • Access to resources is based on access rights (ACL)

  • Who is controlling ACLs?

  • A node manages ACLs for a collection of distributed resources

    • Easy but against the spirit and possible bottleneck

  • The network manages access control

    • Anybody can get the data

    • The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes

    • Some techniques exist

P2P Data Management, 2006, S. Abiteboul


Monitoring l.jpg
Monitoring is changing:

  • What is monitored?

  • Web service calls and database updates

  • The Web

    • Web pages

    • RSS feed

  • What is produced?

  • A stream of events

    • As a continuous service

    • As a RSS feed

    • As a Web site/page

  • Info-surveillance

  • Self-statistics and tracing

  • Basis for error diagnosis

  • P2P Data Management, 2006, S. Abiteboul


    Streams are everywhere l.jpg
    Streams are everywhere is changing:

    • In query processing

    • In indexing (KadoP)

    • In recursive queries (AXML-QSQ)

    • In messaging, monitoring and pub/sub

    • That is why we will use an algebra over streams of trees and not simply an algebra over trees

    P2P Data Management, 2006, S. Abiteboul


    Example edos distribution system l.jpg
    Example: Edos distribution system is changing:

    • A system for the management of Linux distribution

    • Joint work with Mandriva Software and U. Tel Aviv

    • Community of open-source developers: thousands

    • System releases: about 10 000 software packages + metadata

    • Functionalities

      • Query the metadata

      • Query subscription

      • Retrieve packages

      • Publish a new release or update an existing one

    P2P Data Management, 2006, S. Abiteboul


    Exemple webcontent l.jpg
    Exemple: WebContent is changing:

    • WebContent: an ANR platform for the management of web content

    • Web surveillance

      • Business, technical, web watching…

    • Participation of Gemo

      • WP3: knowledge

      • WP5: P2P content management

    • Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)

    P2P Data Management, 2006, S. Abiteboul


    Taxonomy of such applications l.jpg
    Taxonomy of such applications is changing:

    • Parameters

      • Number of peers and quantity of data

      • How volatile the peers are

      • The query/update workload

      • The functionalities that are desired

    • Edos: peers and documents in thousands, mostly append for updates, peers not too volatile

    • An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers

    • Mostly interested in the first case

    P2P Data Management, 2006, S. Abiteboul


    Thesis l.jpg
    Thesis is changing:

    • XQuery is fine for local XML processing and publishing

    • Not sufficient for distributed data management

    • The success of the relational model, i.e., of tables on a server :

      • A logic for defining tables

      • An algebra for describing query plans over tables

    • By analogy, we need for trees in a P2P system

      • A logic for defining distributed tree data and data services

      • An algebra for describing query plans over these

    • Proposal: ActiveXML logic and algebra

    P2P Data Management, 2006, S. Abiteboul


    Outline27 l.jpg
    Outline is changing:

    • Introduction – the data ring

    • Calculus for P2P data management (ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    2 active xml a logic for distributed data management l.jpg

    2. Active XML: is changing:a logic for distributed data management


    The basis l.jpg
    The basis is changing:

    • AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework

    • Simple idea: XML documents with embedded service calls

    • Intensional data

      • Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given

    • Dynamic data

      • If the data sources change, the same document will provide different information

    P2P Data Management, 2006, S. Abiteboul


    Example omitting syntactic details l.jpg
    Example is changing:(omitting syntactic details)

    <resorts state=‘Colorado’>

    <resort>

    <name> Aspen </name>

    <sc> Unisys.com/snow(“Aspen”) </sc>

    <depth unit=“meter”>1</depth>

    <hotels ID=AspHotels > ….

    Yahoo.com/GetHotels(<city name=“Aspen”/>)

    </hotels>

    </resort> …

    </resorts>

    • May contain calls

    • to any SOAP web service :

      • e-bay.net, google.com…

    • to any AXML web services

      • to be defined

    P2P Data Management, 2006, S. Abiteboul


    Marketing philosophy l.jpg
    Marketing is changing: Philosophy

    Active answer = intensional and dynamic and flexible

    Embedding calls in data is an old idea in database

    Manon: What’s the capital of Brazil?

    Dad: Let’s ask Wikipedia.com!

    Manon: How do I get a cheap ticket to Galapagos?

    Dad: Let’s place a subscription on LastMinute.com!

    Manon: What are the countries in the EC?

    Dad: France, Germany, Holland, Belgium, and hum… Let’s ask YouLists.com for more!

    P2P Data Management, 2006, S. Abiteboul


    Active xml peer l.jpg
    Active XML peer is changing:

    AXML

    peer

    soap

    • Peer-to-peer architecture

    • Each Active XML peer

      • Repository: manages Active XML data

      • Web client: calls the services inside a document

      • Web server: provides (parameterized) queries/updates over the repository as web services

    • Exchange of AXML instead of XML

    P2P Data Management, 2006, S. Abiteboul


    What is an axml peer l.jpg
    What is an AXML peer? is changing:

    • PC

      • Now open source ObjectWeb – queries in OQL

    • Peer on a mass storage system

      • eXist (open source XML database) queries in XQuery

      • Xyleme queries in XyQL

    • PDA or cell phone

      • Persistence in file system and XPATH

    • On going: the entire network

      • Data is stored in a P2P network - KadoP

    • More: java card, a relational database…

    P2P Data Management, 2006, S. Abiteboul


    A key issue call activation l.jpg

    When to activate the call? is changing:

    Explicit pull mode: active databases

    Implicit pull mode: deductive databases

    Push mode: query subscription

    What to do with its result?

    How long is the returned data valid?

    Mediation and caching

    Where to find the arguments?

    Under the service call: XML,XPATH or a service call

    A key issue: call activation

    P2P Data Management, 2006, S. Abiteboul


    Another key issue what to send l.jpg
    Another key issue: what to send? is changing:

    • Send some AXML tree t

      • As result of a query or as parameter of a call

    • The tree t contains calls, do we have to evaluate them?

      • If I do, I may introduce service calls, do we have to evaluate all these calls before transmitting the data?

    • Hi John, what is the phone number of the Prime Minister of France?

    • Find his name at whoswho.com then look in the phone dir

    • Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr

    • (33) 01 56 00 01

    P2P Data Management, 2006, S. Abiteboul


    Active xml cool idea complex problems l.jpg

    Blasphemous claim: is changing:

    Active XML is the proper paradigm for data exchange!

    Not XML + not XQuery

    Brings to a unique setting

    distributed db, deductive db, active db, stream data

    warehousing, mediation

    This is unreasonable? Yes!

    Plenty of works ahead… to make it work

    But first, the algebra

    Active XMLcool idea – complex problems

    P2P Data Management, 2006, S. Abiteboul


    Outline37 l.jpg
    Outline is changing:

    • Introduction – the data ring

    • Calculus for P2P data management (ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

      • Query processing

      • Query optimization

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    3 active xml algebra l.jpg

    3. Active XML algebra is changing:


    Motivation l.jpg
    Motivation is changing:

    • Relational model: centralized tables

    • optimization: algebraic expression and rewriting

    • Active XML model: distributed trees

    • optimization: algebraic expression and rewriting

    • Distributed query optimization based on algebraic rewriting of Active XML trees

    • Based on experiences with AXML optimization

    P2P Data Management, 2006, S. Abiteboul


    Active xml peers l.jpg

    We focus on positive AXML is changing:

    Set-oriented data

    Positive/monotone services

    Services = tree-pattern-query-with-join queries

    Services produce streams

    Optimized by a local query optimizer

    Evaluated by a local query processor

    Out of our scope

    Active XML peers

    output

    stream

    π

    Local

    query

    processing

    join

    π

    input

    stream

    input

    stream

    P2P Data Management, 2006, S. Abiteboul


    The problem l.jpg
    The problem is changing:

    • An AXML system

      • A set of peers

      • For each peer a set of documents and services

      • Extensional data is distributed

      • Intensional data (knowledge) is distributed

        • Defined using query services (TPQJ queries)

        • These services are generic: any peer can evaluate a query

    • A query q to some peer

    • Evaluate the answer to q with optimal response time

    P2P Data Management, 2006, S. Abiteboul


    Axml algebra l.jpg
    AXML algebra is changing:

    l

    En

    E1

    E2

    [email protected]

    En

    E1

    E2

    [email protected]

    [email protected]

    E1

    [email protected]

    [email protected]

    E1

    E1

    • (AXML) algebraic expressions:

    AXML logic

    [email protected]

    Each such expression lives at some peer

    Includes the AXML trees

    P2P Data Management, 2006, S. Abiteboul


    Algebraic expressions annotations l.jpg
    Algebraic expressions annotations is changing:

    • Executing service call:

    • Terminated service call: 

    • Subtlety

      [email protected](5): definition of intensional data

      eval([email protected](5)): request to evaluate it; during query optimization

      [email protected](5): query is being evaluated; during query processing

      [email protected](5): query evaluation is complete

    P2P Data Management, 2006, S. Abiteboul


    Evaluation rules local rules l.jpg
    Evaluation rules: local rules is changing:

    l

    l

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    tn

    t1

    t2

    t1

    t2

    tn

    t2

    t1

    tn

    E1

    [email protected]

    E1

    [email protected]

    tn

    t1

    t2

    for l ≠ sc, s ≠ send, receive

    P2P Data Management, 2006, S. Abiteboul


    Evaluation rules transfer rules l.jpg
    Evaluation rules: transfer rules is changing:

    [email protected]

    [email protected]

    newRoot()@p’

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    t1

    t1

    t1

    t2

    t2

    t2

    &

    [email protected]

    • Site p asks p’ to do the work and send the result to p

    P2P Data Management, 2006, S. Abiteboul


    Evaluation rules more transfer rules l.jpg
    Evaluation rules: more transfer rules is changing:

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    Z

    Z

    • When a query is evaluated, results appear

    • They are sent to the place that requested them

    • Also some rules for eof

    [email protected]

    &

    Z

    P2P Data Management, 2006, S. Abiteboul


    Evaluation l.jpg
    Evaluation is changing:

    • Reminder: setting

      • An AXML system

      • A request to evaluate query q at peer p – [email protected]( q )

    • Rewrite the trees in peer workspaces until termination of the process

    • Results

    • For positive XML, this process converges… to a possibly infinite state

    • This process computes the answer to q

    • May be fairly inefficient: need for optimization!

    P2P Data Management, 2006, S. Abiteboul



    Query optimization l.jpg
    Query optimization is changing:

    • Well-known optimization techniques for distributed data management

    • Pushing selections

    • Semijoin reducers

    • Horizontal, vertical, hybrid decomposition

    • Recursive query processing and query-subquery

    • Some “specific” AXML optimizations

    • Pushing queries over service calls

    • Lazy service call evaluation …

    • Optimizing subscription management

    • All are captured by the algebraic framework

    P2P Data Management, 2006, S. Abiteboul


    Example pushing selections l.jpg
    Example: pushing selections is changing:

    [email protected]

    [email protected]

    [email protected]

    q

    q1

    [email protected]

    [email protected]

    (q2([email protected]))

    [email protected]

    [email protected]

    [email protected]

    @p2([email protected]([email protected]))

    (q2(d))

    • Same rule applies if [email protected] is replaced by a continuous query

    Suppose q ≡ q1((q2)):

    P2P Data Management, 2006, S. Abiteboul


    Example interleaving of processing and optimization l.jpg
    Example: interleaving of is changing:processing and optimization

    (r2)

    (r3)

    (r4)

    (r1)

    • At peer i: di = ri di+1

    • Query at p1: (d1)

    • (d1)→ (r1)(d2)[email protected]((r1)(d2)) → [email protected]((r1))  [email protected]((d2))[email protected]((r1))→●(r1) (starts streaming data)

    • (d2)→(r2)(d3); (r2) starts streaming data

    • (d3)→ (r3)(d4) …

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules l.jpg
    Transfer and load balancing rules is changing:

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]()

    [email protected]

    [email protected]

    E

    E

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules53 l.jpg
    Transfer and load balancing rules is changing:

    [email protected]

    E

    [email protected]

    [email protected]

    [email protected]

    eval(E)

    [email protected]

    [email protected]

    [email protected]()

    [email protected]

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules54 l.jpg
    Transfer and load balancing rules is changing:

    [email protected]

    E

    [email protected]

    [email protected]()

    [email protected]

    &

    [email protected]

    eval(E)

    [email protected]

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules55 l.jpg
    Transfer and load balancing rules is changing:

    eval(E)

    [email protected]

    [email protected]()

    [email protected]

    &

    [email protected]

    eval(E)

    [email protected]

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules56 l.jpg
    Transfer and load balancing rules is changing:

    eval(E)

    eval(E)

    eval(E)

    [email protected]

    [email protected]()

    [email protected]

    &

    [email protected]

    [email protected]

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules57 l.jpg
    Transfer and load balancing rules is changing:

    [email protected]

    [email protected]

    E

    E

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]()

    eval@p2

    #x@p1

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Back to interleaved execution and optimization l.jpg
    Back to interleaved execution and optimization is changing:

    (r2)

    (r3)

    (r4)

    (r1)

    Repeated transfers

    (r2)

    (r3)

    (r4)

    (r1)

    Data transfers reduced

    More work for p1: merging all the streams

    Hierarchical stream merging

    P2P Data Management, 2006, S. Abiteboul


    Example horizontal and vertical decomposition l.jpg
    Example: is changing:Horizontal and vertical decomposition

    • A relation d over ABC that is split both horizontally and vertically

      d = (d1  d2) d3

      d1 = B<5 (d') and d2 = B>=5 (d')

      d', d1, d2 over AB and d3 over BC; each di is at a peer pi

    • Consider the query B=0@p(d)

    • → B=0@p( (B<5 (d') B>=5 (d'))) d3@p3 )

    • → B=0 @p( d1@p1 d3@p3 )

    • → @p (#x@preceive(d1@p1), #y@preceive(d3@p3))

    • & send@p1(#x@p; B=0@p1(d1@p1) )

    • &  send@p3(#y@p; d3@p3)

    P2P Data Management, 2006, S. Abiteboul


    Common sub expression elimination l.jpg
    Common sub-expression elimination is changing:

    #x@p

    receive@p

    eval@p

    E

    E

    E

    • eval@p(E), #x@preceive@p(E) →

    • eval@p(#x@p), #x@preceive@p(E)

    eval@p

    #x@p

    &

    &

    receive@p

    #x@p

    P2P Data Management, 2006, S. Abiteboul


    Common sub expression elimination61 l.jpg
    Common sub-expression elimination is changing:

    newRoot()@p

    newRoot()@p

    q@p’

    eval@p’

    send@p

    send@p

    &

    q@p’

    eval@p’

    #x@p’

    s@p

    s@p

    s@p

    s@p

    #x@p’

    #x@p’

    s@p

    receive@p’

    s@p

    q@p’

    newRoot()@p’

    &

    &

    send@p’

    #x@p’

    #y@p’

    receive@p’

    receive@p’

    #y@p’

    #x@p’

    s@p

    s@p

    In spite of the two calls to s@p, the function is evaluated only once

    P2P Data Management, 2006, S. Abiteboul


    Example recursive query processing l.jpg
    Example: recursive query processing is changing:

    • Using a pseudo Datalog syntax

    • s1@p($x, $y)  d2@p'($x, $z), s2@p'($z, $y)

    • s2@p'($x, $y)  d1@p($x, $z), s1@p($y, $z)

    • After rewriting:

    • on p : #x@p receive@p(q1@p'(d2@p', s2@p') ) 

    • root@p send@p(#y@p', q2@p( d1@p, #x@p) ) 

    • on p' : root@p'send@p'(#x@p,

    •  q1@p'(d2@p', #y@p' receive@p'(s2@p')  ) ) 

    P2P Data Management, 2006, S. Abiteboul


    Generic and global services l.jpg
    Generic and global services is changing:

    • q@any: where q is a query

      • Any peer that has some query processor for q can do it

    • f@any: where f is a processing service call

      • Example: decryption or gene comparison

    • q over a P2P collection

    eval@p

    eval@p

    eval@p

    eval@p

    q@p2

    q@p1

    index

    q

    q

    coll

    @

    @

    q

    P2P Data Management, 2006, S. Abiteboul


    The axml algebra conclusion l.jpg
    The AXML algebra – conclusion is changing:

    • Captures distributed XML query processing/optimization

    • Based on a communication model a la CCS

    • Algebraic – stream-oriented

    • Orthogonal to the local XML query optimizer

    • Orthogonal to the network support (DHT, small world etc.)

    • What is not yet available? A cost model and heuristics

    P2P Data Management, 2006, S. Abiteboul


    Outline65 l.jpg
    Outline is changing:

    • Introduction – the data ring

    • Calculus for P2P data management (ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul



    Efficient evaluation of tree pattern queries l.jpg
    Efficient evaluation of tree-pattern-queries is changing:

    • Many optimization techniques

    • We are interested here in distributed query evaluation/optimization

    • 1) We consider XML indexing

    • 2) Holistic twig join that is based on indexing

    • 3) P2P indexing

    • 4) P2P query processing

    • 5) Optimizing P2P indexing

    P2P Data Management, 2006, S. Abiteboul


    Xml indexing structural identifiers l.jpg
    XML indexing: structural identifiers is changing:

    1

    A

    8

    0

    7

    2

    B

    C

    8

    6

    1

    1

    X ancestor of Y <=>

    pre(X) < pre(Y) and

    post(X) ≥ post(Y)

    3

    8

    5

    D

    F

    E

    4

    8

    6

    2

    2

    2

    6

    4

    G

    X parent of Y <=>

    X ancestor of Y and

    level(X) = level(Y) - 1

    “John”

    6

    4

    3

    3

    -Level

    Structural IDs = Prefix-Postfix

    P2P Data Management, 2006, S. Abiteboul


    Holistic twig join l.jpg
    Holistic Twig Join is changing:

    • Input a document and a tree pattern query

    • Find the bindings of the query in the document

    • Holistic = holistique

    • (le tout et pas juste les parties)

    • Twig = brindille

    • Join = you know…

    • Sounds like Harry Potter?

    P2P Data Management, 2006, S. Abiteboul


    Query evaluation over a document l.jpg
    Query evaluation over a document is changing:

    A

    D

    C

    Ids for A

    (1,8,0)…

    Ids for C

    Ids for D

    “John”

    Ids for “John”

    Ids are sorted in lexicographical order

    Goals is to find “matching Ids”

    P2P Data Management, 2006, S. Abiteboul


    The holistic twig join algorithm l.jpg
    The Holistic Twig Join Algorithm is changing:

    a

    b

    c

    a (2,8)

    a (18,25)

    c (9,14)

    a (15,17)

    b (12,14)

    c (6,8)

    c (23,25)

    a (3,5

    c (7,8)

    b (24,25)

    c (4,5)

    b (13,14)

    a (8,8)

    c (25,25)

    c (14,14)

    a (5,5)

    level

    0

    r (1,25)

    1

    b (10,11)

    a (16,17)

    b (19,22)

    2

    c (11,11)

    c (17,17)

    b (20,21)

    3

    4

    c (21,21)

    c (22,22)

    P2P Data Management, 2006, S. Abiteboul


    The holistic twig join algorithm72 l.jpg
    The Holistic Twig Join Algorithm is changing:

    (a7, b4, c8),

    (a7, b5, c8),

    Sa

    b5

    Sb

    Sc

    Stacks

    (a7, b4 ,c9)

    (a7 ,b6 ,c11)

    a

    a7

    a1

    a5

    a7

    a4

    a2

    a6

    a3

    b6

    b4

    b1

    b2

    b4

    b6

    b

    b5

    b3

    c1

    c2

    c10

    c5

    c11

    c9

    c8

    c6

    c7

    c4

    c3

    c8

    c9

    c11

    c

    Legend:

    This is the end

    Head of the stream

    Find the match for the query sub-tree determined by this node !!!

    The ID is present also in the stack

    P2P Data Management, 2006, S. Abiteboul


    P2p xml processing l.jpg

    P2P XML processing is changing:


    Xml indexing in xyleme l.jpg

    History is changing:

    1999: INRIA research project

    2000: Creation of a spin-off

    2006: About 25 people

    Technology

    A scalable XML repository

    A content warehouse

    On a cluster of Linux PC

    XML query processing

    Twig join

    Index is distributed

    Keyword-based vs. document based

    XML indexing in Xyleme

    hash(C)

    LAN

    hash(“John)

    Put(C;[d,p,6,6,1])

    Put(“John”;[d,p,3,1,2])

    P2P Data Management, 2006, S. Abiteboul


    Query processing over a distributed collection l.jpg
    Query processing over a distributed collection is changing:

    A

    Ids for A

    (p12,d456, 1,7,0)…

    C

    D

    Ids for C

    Ids for D

    “John”

    Ids include peerId and docId

    Ids are sorted in lexicographical order

    Goals is to find “matching Ids” in the collection

    Ids for John

    P2P Data Management, 2006, S. Abiteboul


    Xml indexing in kadop l.jpg

    Use structural Ids is changing:

    Publish them via a DHT

    Distributed Hash Table

    Peers come and go

    Locate(k):: log(n) messages to fing the peer in charge of key k

    Put(k,v)

    Get(k): retrieves all the values for k

    We use Pastry

    We also tried P2PSim and JXTA

    XML indexing in KadoP

    hash(C)

    DHT

    posting for C

    hash(“John)

    put(C;[d,p,6,6,1])

    put(“John”;[d,p,3,1,2])

    put(C;[d,p,6,6,1])

    P2P Data Management, 2006, S. Abiteboul


    Xml query processing in kadop l.jpg
    XML query processing in KadoP is changing:

    • Given a tree pattern query Q

    • Evaluate an index query indexQ to locate the peers that can provide some answers

      • indexQ is a twig join

    • Ship Q to these peers and evaluate it there

    • If indexQ is imprecise, many false positive

      • Example: ship Q to all peers (maximal parallelism)

      • Example: Instead of structural Ids, just use (label/word,peerId,docId)

    P2P Data Management, 2006, S. Abiteboul


    Kadop architecture l.jpg
    KadoP architecture is changing:

    KadoP peer

    publish & query

    Semantic layer

    Web

    interface

    External

    Layer

    ActiveXML

    engine

    KadoP Engine

    Indexing

    Logical

    Layer

    Query

    processing

    DHT

    locate, put, get & delete

    Physical

    Layer

    Index

    P2P Data Management, 2006, S. Abiteboul


    Some technical issues l.jpg
    Some technical issues is changing:

    • Our goal: manage millions of documents with a large number of peers

    • First experiments were a disaster

    • Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)

    • Extend the API of the DHT to have Append and not only Read/Write

    • Extend the API of the DHT to have a streaming exchange of postings

      • Useful because the XML algebra works better with streams

    • Now it scales but there is the issue of long postings

    P2P Data Management, 2006, S. Abiteboul


    The issue of long postings google in p2p l.jpg

    Using keyword distribution is changing:

    Suppose

    Peer for Ullman is in Europe

    Peer for XML is in US

    we have to ship one long posting between US and Europe

    For a large number of users, we absorb all the bandwidth of Internet backbone

    Need for replication

    Even for thousands of peers, the exchange of long postings is an issue

    The issue of long postings: Google in P2P

    Ullmann & xml?

    DHT

    Ullman

    xml

    P2P Data Management, 2006, S. Abiteboul


    Intensional indexing in kadop l.jpg
    Intensional indexing in KadoP is changing:

    Distributed B-tree

    • Long posting = bad response time

    • No long posting

    • get h(name) then parallel fetch

    • Possibility to optimize further

    • f(docId55..docId75)

    • may be it does not match

    • no need to call f

    long posting

    h(Name)

    f g h i

    h(Name)

    P2P Data Management, 2006, S. Abiteboul


    More optimization l.jpg
    More optimization is changing:

    • Standard for P2P keyword search

      • Gap compression and adaptive set intersection

    • Standard distributed query optimization techniques

      • Ship smallest list

      • Load balancing

      • Caching

      • Replication

    • Semi-join techniques notably Bloom semi-join

    P2P Data Management, 2006, S. Abiteboul


    Outline83 l.jpg
    Outline is changing:

    • Introduction – the data ring

    • Calculus for P2P data management (ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    6 conclusion l.jpg

    6. Conclusion is changing:


    Slide85 l.jpg

    Conclusion is changing:

    • Logic for distributed data management

      • Opinion: XQuery is a language for local XML management

      • Proposal: ActiveXML

    • Algebraic foundation of distributed query optimization

      • Proposal: ActiveXML algebra

    • P2P (Active) XML indexing

      • KadoP is now being tested and we are working on optimization

    • Software

      • ActiveXML is open-source – see activexml.net

      • KadoP soon will be – already available upon request

      • EDOS distribution system as well

    P2P Data Management, 2006, S. Abiteboul


    Lots of related work and related systems l.jpg
    Lots of related work and related systems is changing:

    • This is going very fast in system devepments

    • Structured P2P nets: Pastry, Chord

    • Content delivery net: Coral, Akamai

    • XML repositories: Xyleme, DBMonet

    • Multicas systemst: Avalanche, Bullet

    • File sharing systems: BitTorrent, Kazaa

    • Pub/Sub systems: Scribe, Hyper

    • Distributed storage systems: OceanStore, GoogleFS

    • Etc.

    • Fundamental research is somewhat left behind

    P2P Data Management, 2006, S. Abiteboul


    Issues l.jpg
    Issues is changing:

    • P2P query optimization

    • P2P access control

    • P2P archiving

    • P2P self tuning

    • P2P monitoring

    • P2P knowledge management [SomeWhere]

    • Also: analysis and verification of these systems

      • E.g., termination, error detection, diagnosis

    P2P Data Management, 2006, S. Abiteboul


    Find your own topic l.jpg
    Find your own topic is changing:

    • Pick your favorite problem for data or knowledge management and study it in a P2P setting

    • with gigabytes of data and thousands of peers

    • If you find it boring, consider it

    • with terabytes of data and millions of peers

    P2P Data Management, 2006, S. Abiteboul


    Merci l.jpg
    Merci is changing:

    Merci

    P2P Data Management, 2006, S. Abiteboul


    ad