Information management in p2p serge abiteboul inria futurs and univ paris 11
Download
1 / 89

Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11 - PowerPoint PPT Presentation


Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11. Introduction. Success stories at the time of the Internet bubble. Google: management of Web pages Mapquest: management of maps Amazone: book catalogue eBay: product catalogue

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha

Download Presentation

Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Information Management in P2PSerge AbiteboulINRIA-Futurs and Univ. Paris 11

P2P Data Management, 2006, S. Abiteboul


Introduction


Success stories at the time of the Internet bubble

  • Google: management of Web pages

  • Mapquest: management of maps

  • Amazone: book catalogue

  • eBay: product catalogue

  • Napster (emule, bearshare, etc.): music database

  • Flickr: picture database

  • Wikipedia: dictionary

  • del.icio.us: annotations

  • InFrance:

    Meetic: dating database

    Kelkoo: comparative shopping

They are all about

publishing some database

P2P Data Management, 2006, S. Abiteboul


The trend is towards peer-to-peerand interactivity

  • P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority

  • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network

  • seti@home; kazaa; cabal

  • Switch from centralized servers to communities and syndication

  • Interaction and Web 2.0

  • Motivations: Social, organizational

P2P Data Management, 2006, S. Abiteboul


Information management in a P2P network

  • Private terminology: data ring

  • Information is heterogeneous, distributed, replicated, dynamic

  • Which info: Data + meta-data + knowledge + services

  • Peers are heterogeneous, autonomous and possibly mobile

    • From sensors to PDA to mainframe

  • Typically very large number of peers

  • Variety of requirements: QoS, performance, security, etc.

P2P Data Management, 2006, S. Abiteboul


Acknowledgement

  • Xyleme:Scalable XML warehousing

  • Sophie Cluet, Guy Ferran (Xyleme) & many others

  • ActiveXML:Language for P2P data management

  • Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others

  • KadoP: P2P scalable XML indexing

  • Ioana Manolescu, Nicoleta Preda & others

  • Data Ring: Infrastructure for P2P data management

  • Alkis Polyzotis (UC Santa Cruz)

P2P Data Management, 2006, S. Abiteboul


Outline

  • Introduction – the data ring

  • Calculus for P2P data management(ActiveXML)

  • Algebra for P2P data management (ActiveXML algebra)

  • Indexing in P2P (KadoP)

  • Conclusion

  • Goal of the tutorial: present issues and technology on p2p information management

  • Warning: it is very biased – it is not a survey

P2P Data Management, 2006, S. Abiteboul


Outline

  • Introduction – the data ring

  • Calculus for P2P data management(ActiveXML)

  • Algebra for P2P data management (ActiveXML algebra)

  • Indexing in P2P (KadoP)

  • Conclusion

P2P Data Management, 2006, S. Abiteboul


1. Introduction – the data ring


The information in a data ring

  • Data: tuples, collections, documents, relations…

  • Services: data sources, possibly some processing

  • Meta-data about resources: attribute/values pairs, annotations

  • Ontologies to explain data and metadata

  • View definitions

  • Data integration information, e.g., mappings between ontologies

  • Physical data: Indices and materialized views

P2P Data Management, 2006, S. Abiteboul


Functionalities of the data ring

  • Storage, persistence, replication

  • Indexing, caching, querying, updating, optimization

  • Schema management, access control

  • Fault tolerance, self tuning, monitoring

  • Resource discovery, history, provenance, annotations, multi-linguism,

  • Semantic enrichment, uncertain data

  • Each functionality may be achieved by a peer or by the network

P2P Data Management, 2006, S. Abiteboul


A mainframe database

A file system

Web server

A PC

A PDA

A telephone

A sensor

A home appliance

A car

A manufacturing tool

A telecom equipment

A toy

Another data ring

And now, what is a peer?

Any connected device or software with some information to share

A net address and some names of resources (e.g. document or service)

P2P Data Management, 2006, S. Abiteboul


Advantages and disadvantages of P2P

  • Scaling

  • Performance

    • Optimization of parallelism

    • Avoid bottleneck

    • Replication

  • Availability

    • Replication

  • Cost

    • Avoid the cost of server

    • Share operational cost

  • Dynamicity

    • add/remove new data sources

  • Complexity

  • Performance

    • Cost for complex queries

    • Communication cost

  • Availability

    • Peers can leave

  • Consistency maintenance

    • Difficult to support transaction

  • Quality

    • Difficult to guarantee quality

P2P Data Management, 2006, S. Abiteboul


Crash course on Web standards

Owl

RDFS

XML

  • Data exchange format = XML

    • Labeled ordered trees

    • Its main asset = XML schema

    • There is much more

  • Distributed computing protocol = Web services

    • SOAPSimple Object Access Protocols

    • WSDL Web Service Definition Language

    • UDDIUniversal Description, Discovery and Integration

    • BPEL Business Process Execution Language

  • Query languages = XPath and XQuery

    • Declarative query language for XML + full-text + update language

  • Knowledge representation = Owl or RDF/S

Xquery

Xpath

SOAP

WSDL

P2P Data Management, 2006, S. Abiteboul


Information used to live in islands but with the Web, this is changing:uniform access to information… …the dream for distributed data management


Do you like the standards?

  • It is the wrong question!

  • Correct questions: What can you do with it? What is missing?

  • Is Xquery the ultimate query language for the Web? No

    • It is a language for querying centralized XML

    • We will see what it is missing

  • We will not talk much about semantics

P2P Data Management, 2006, S. Abiteboul


Automatic and distributed management of the data ring

  • No centralized server

  • No information administrator (no info manager)

  • Most users are non-experts

    • E.g., scientists

  • Requirements

    • Ease of deployment (zero-effort)

    • Ease of administration (zero-effort)

    • Ease of publication (epsilon-effort)

    • Ease of exploitation (epsilon-effort)

    • Participation in community building notably via annotations

Happy database admin

P2P Data Management, 2006, S. Abiteboul


What should be made automatic

  • Self-statistics from the monitoring of the data ring

    • In particular, define the statistics that are needed*

  • Self-tuning based on the self-statistics

    • Choose the most appropriate organization

    • Decide to install access structures: indexes, views, etc.

    • Control replication of data and services

  • Self-healing

    • Recovery from errors

    • E.g., replacement of a failing Web service

  • And automatic file management

P2P Data Management, 2006, S. Abiteboul


Any hope?

  • Technology exists (database self-tuning, machine learning, etc.)

  • But self-tuning for databases has advanced very slowly

  • Why can this work?

  • There is no alternative (for db, this was just a cool gadget)

  • KISS (keep it simple stupid!)

  • The power of parallelism

  • This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)

P2P Data Management, 2006, S. Abiteboul


Distributed access control

  • Goal: Control access to ring resources

  • Access to resources is based on access rights (ACL)

  • Who is controlling ACLs?

  • A node manages ACLs for a collection of distributed resources

    • Easy but against the spirit and possible bottleneck

  • The network manages access control

    • Anybody can get the data

    • The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes

    • Some techniques exist

P2P Data Management, 2006, S. Abiteboul


Monitoring

  • What is monitored?

  • Web service calls and database updates

  • The Web

    • Web pages

    • RSS feed

  • What is produced?

  • A stream of events

    • As a continuous service

    • As a RSS feed

    • As a Web site/page

  • Info-surveillance

  • Self-statistics and tracing

  • Basis for error diagnosis

  • P2P Data Management, 2006, S. Abiteboul


    Streams are everywhere

    • In query processing

    • In indexing (KadoP)

    • In recursive queries (AXML-QSQ)

    • In messaging, monitoring and pub/sub

    • That is why we will use an algebra over streams of trees and not simply an algebra over trees

    P2P Data Management, 2006, S. Abiteboul


    Example: Edos distribution system

    • A system for the management of Linux distribution

    • Joint work with Mandriva Software and U. Tel Aviv

    • Community of open-source developers: thousands

    • System releases: about 10 000 software packages + metadata

    • Functionalities

      • Query the metadata

      • Query subscription

      • Retrieve packages

      • Publish a new release or update an existing one

    P2P Data Management, 2006, S. Abiteboul


    Exemple: WebContent

    • WebContent: an ANR platform for the management of web content

    • Web surveillance

      • Business, technical, web watching…

    • Participation of Gemo

      • WP3: knowledge

      • WP5: P2P content management

    • Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)

    P2P Data Management, 2006, S. Abiteboul


    Taxonomy of such applications

    • Parameters

      • Number of peers and quantity of data

      • How volatile the peers are

      • The query/update workload

      • The functionalities that are desired

    • Edos: peers and documents in thousands, mostly append for updates, peers not too volatile

    • An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers

    • Mostly interested in the first case

    P2P Data Management, 2006, S. Abiteboul


    Thesis

    • XQuery is fine for local XML processing and publishing

    • Not sufficient for distributed data management

    • The success of the relational model, i.e., of tables on a server :

      • A logic for defining tables

      • An algebra for describing query plans over tables

    • By analogy, we need for trees in a P2P system

      • A logic for defining distributed tree data and data services

      • An algebra for describing query plans over these

    • Proposal: ActiveXML logic and algebra

    P2P Data Management, 2006, S. Abiteboul


    Outline

    • Introduction – the data ring

    • Calculus for P2P data management(ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    2. Active XML:a logic for distributed data management


    The basis

    • AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework

    • Simple idea: XML documents with embedded service calls

    • Intensional data

      • Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given

    • Dynamic data

      • If the data sources change, the same document will provide different information

    P2P Data Management, 2006, S. Abiteboul


    Example(omitting syntactic details)

    <resorts state=‘Colorado’>

    <resort>

    <name> Aspen </name>

    <sc> Unisys.com/snow(“Aspen”) </sc>

    <depth unit=“meter”>1</depth>

    <hotels ID=AspHotels > ….

    Yahoo.com/GetHotels(<city name=“Aspen”/>)

    </hotels>

    </resort> …

    </resorts>

    • May contain calls

    • to any SOAP web service :

      • e-bay.net, google.com…

    • to any AXML web services

      • to be defined

    P2P Data Management, 2006, S. Abiteboul


    Marketing  Philosophy

    Active answer = intensional and dynamic and flexible

    Embedding calls in data is an old idea in database

    Manon: What’s the capital of Brazil?

    Dad: Let’s ask Wikipedia.com!

    Manon: How do I get a cheap ticket to Galapagos?

    Dad: Let’s place a subscription on LastMinute.com!

    Manon: What are the countries in the EC?

    Dad: France, Germany, Holland, Belgium, and hum… Let’s ask YouLists.com for more!

    P2P Data Management, 2006, S. Abiteboul


    Active XML peer

    AXML

    peer

    soap

    • Peer-to-peer architecture

    • Each Active XML peer

      • Repository: manages Active XML data

      • Web client: calls the services inside a document

      • Web server: provides (parameterized) queries/updates over the repository as web services

    • Exchange of AXML instead of XML

    P2P Data Management, 2006, S. Abiteboul


    What is an AXML peer?

    • PC

      • Now open source ObjectWeb – queries in OQL

    • Peer on a mass storage system

      • eXist (open source XML database) queries in XQuery

      • Xyleme queries in XyQL

    • PDA or cell phone

      • Persistence in file system and XPATH

    • On going: the entire network

      • Data is stored in a P2P network - KadoP

    • More: java card, a relational database…

    P2P Data Management, 2006, S. Abiteboul


    When to activate the call?

    Explicit pull mode: active databases

    Implicit pull mode: deductive databases

    Push mode: query subscription

    What to do with its result?

    How long is the returned data valid?

    Mediation and caching

    Where to find the arguments?

    Under the service call: XML,XPATH or a service call

    A key issue: call activation

    P2P Data Management, 2006, S. Abiteboul


    Another key issue: what to send?

    • Send some AXML tree t

      • As result of a query or as parameter of a call

    • The tree t contains calls, do we have to evaluate them?

      • If I do, I may introduce service calls, do we have to evaluate all these calls before transmitting the data?

    • Hi John, what is the phone number of the Prime Minister of France?

    • Find his name at whoswho.com then look in the phone dir

    • Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr

    • (33) 01 56 00 01

    P2P Data Management, 2006, S. Abiteboul


    Blasphemous claim:

    Active XML is the proper paradigm for data exchange!

    Not XML + not XQuery

    Brings to a unique setting

    distributed db, deductive db, active db, stream data

    warehousing, mediation

    This is unreasonable? Yes!

    Plenty of works ahead… to make it work

    But first, the algebra

    Active XMLcool idea – complex problems

    P2P Data Management, 2006, S. Abiteboul


    Outline

    • Introduction – the data ring

    • Calculus for P2P data management(ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

      • Query processing

      • Query optimization

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    3. Active XML algebra


    Motivation

    • Relational model: centralized tables

    • optimization: algebraic expression and rewriting

    • Active XML model: distributed trees

    • optimization: algebraic expression and rewriting

    • Distributed query optimization based on algebraic rewriting of Active XML trees

    • Based on experiences with AXML optimization

    P2P Data Management, 2006, S. Abiteboul


    We focus on positive AXML

    Set-oriented data

    Positive/monotone services

    Services = tree-pattern-query-with-join queries

    Services produce streams

    Optimized by a local query optimizer

    Evaluated by a local query processor

    Out of our scope

    Active XML peers

    output

    stream

    π

    Local

    query

    processing

    join

    π

    input

    stream

    input

    stream

    P2P Data Management, 2006, S. Abiteboul


    The problem

    • An AXML system

      • A set of peers

      • For each peer a set of documents and services

      • Extensional data is distributed

      • Intensional data (knowledge) is distributed

        • Defined using query services (TPQJ queries)

        • These services are generic: any peer can evaluate a query

    • A query q to some peer

    • Evaluate the answer to q with optimal response time

    P2P Data Management, 2006, S. Abiteboul


    AXML algebra

    l

    En

    E1

    E2

    s@p

    En

    E1

    E2

    send@p

    #n@p’

    E1

    eval@p

    receive@p

    E1

    E1

    • (AXML) algebraic expressions:

    AXML logic

    d@p

    Each such expression lives at some peer

    Includes the AXML trees

    P2P Data Management, 2006, S. Abiteboul


    Algebraic expressions annotations

    • Executing service call:

    • Terminated service call: 

    • Subtlety

      q@p(5): definition of intensional data

      eval(q@p(5)): request to evaluate it; during query optimization

       q@p(5): query is being evaluated; during query processing

       q@p(5): query evaluation is complete

    P2P Data Management, 2006, S. Abiteboul


    Evaluation rules: local rules

    l

    l

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    eval@p

    tn

    t1

    t2

    t1

    t2

    tn

    t2

    t1

    tn

    E1

    ●s@p

    E1

    s@p

    tn

    t1

    t2

    for l ≠ sc, s ≠ send, receive

    P2P Data Management, 2006, S. Abiteboul


    Evaluation rules: transfer rules

    #x@p

    #x@p

    newRoot()@p’

    eval@p

     receive@p

    eval@p’

    send@p’

    s@p’

    s@p’

    s@p’

    t1

    t1

    t1

    t2

    t2

    t2

    &

    #x@p

    • Site p asks p’ to do the work and send the result to p

    P2P Data Management, 2006, S. Abiteboul


    Evaluation rules: more transfer rules

    eval@p’

    eval@p’

    send@p’

    send@p’

    ●s@p’

    ●s@p’

    #x@p

    #x@p

    Z

    Z

    • When a query is evaluated, results appear

    • They are sent to the place that requested them

    • Also some rules for eof

    #x@p

    &

    Z

    P2P Data Management, 2006, S. Abiteboul


    Evaluation

    • Reminder: setting

      • An AXML system

      • A request to evaluate query q at peer p – eval@p( q )

    • Rewrite the trees in peer workspaces until termination of the process

    • Results

    • For positive XML, this process converges… to a possibly infinite state

    • This process computes the answer to q

    • May be fairly inefficient: need for optimization!

    P2P Data Management, 2006, S. Abiteboul


    More rewrite rules to evaluate a query more efficiently

    Optimization


    Query optimization

    • Well-known optimization techniques for distributed data management

    • Pushing selections

    • Semijoin reducers

    • Horizontal, vertical, hybrid decomposition

    • Recursive query processing and query-subquery

    • Some “specific” AXML optimizations

    • Pushing queries over service calls

    • Lazy service call evaluation …

    • Optimizing subscription management

    • All are captured by the algebraic framework

    P2P Data Management, 2006, S. Abiteboul


    Example: pushing selections

    eval@p

    eval@p

    eval@p

    q

    q1

    q1@p

    eval@p

    (q2(d@p2))

    d@p2

    q1@p

    eval@p2

    @p2(q2@p2(d@p2))

    (q2(d))

    • Same rule applies if d@p2 is replaced by a continuous query

    Suppose q ≡ q1((q2)):

    P2P Data Management, 2006, S. Abiteboul


    Example: interleaving of processing and optimization

    (r2)

    (r3)

    (r4)

    (r1)

    • At peer i: di = ri di+1

    • Query at p1: (d1)

    • (d1)→ (r1)(d2)eval@p1((r1)(d2)) → eval@p1((r1))  eval@p1((d2))eval@p1((r1))→●(r1) (starts streaming data)

    • (d2)→(r2)(d3); (r2) starts streaming data

    • (d3)→ (r3)(d4) …

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules

    #x@p1

    #x@p1

    eval@p1

    eval@p1

    send@p1

    send@p2

    newRoot@p2()

    eval@p2

    #x@p1

    E

    E

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules

    eval@p2

    E

    #x@p1

    #x@p1

    eval@p1

    eval(E)

    send@p1

    send@p2

    newRoot@p2()

    #x@p1

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules

    eval@p2

    E

    #x@p1

    newRoot@p2()

    #x@p1

    &

    send@p2

    eval(E)

    #x@p1

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules

    eval(E)

    #x@p1

    newRoot@p2()

    #x@p1

    &

    send@p2

    eval(E)

    #x@p1

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules

    eval(E)

    eval(E)

    eval(E)

    #x@p1

    newRoot@p2()

    #x@p1

    &

    send@p2

    #x@p1

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Transfer and load balancing rules

    #x@p1

    eval@p1

    E

    E

    #x@p1

    eval@p1

    send@p1

    send@p2

    newRoot@p2()

    eval@p2

    #x@p1

    Peer p1 delegates the evaluation of E to p2

    P2P Data Management, 2006, S. Abiteboul


    Back to interleaved execution and optimization

    (r2)

    (r3)

    (r4)

    (r1)

    Repeated transfers

    (r2)

    (r3)

    (r4)

    (r1)

    Data transfers reduced

    More work for p1: merging all the streams

    Hierarchical stream merging

    P2P Data Management, 2006, S. Abiteboul


    Example: Horizontal and vertical decomposition

    • A relation d over ABC that is split both horizontally and vertically

      d = (d1  d2) d3

      d1 = B<5 (d') and d2 = B>=5 (d')

      d', d1, d2 over AB and d3 over BC; each di is at a peer pi

    • Consider the query B=0@p(d)

    • → B=0@p( (B<5 (d') B>=5 (d'))) d3@p3 )

    • → B=0 @p( d1@p1 d3@p3 )

    • → @p (#x@preceive(d1@p1), #y@preceive(d3@p3))

    • & send@p1(#x@p; B=0@p1(d1@p1) )

    • &  send@p3(#y@p; d3@p3)

    P2P Data Management, 2006, S. Abiteboul


    Common sub-expression elimination

    #x@p

    receive@p

    eval@p

    E

    E

    E

    • eval@p(E), #x@preceive@p(E) →

    • eval@p(#x@p), #x@preceive@p(E)

    eval@p

    #x@p

    &

    &

    receive@p

    #x@p

    P2P Data Management, 2006, S. Abiteboul


    Common sub-expression elimination

    newRoot()@p

    newRoot()@p

    q@p’

    eval@p’

    send@p

    send@p

    &

    q@p’

    eval@p’

    #x@p’

    s@p

    s@p

    s@p

    s@p

    #x@p’

    #x@p’

    s@p

    receive@p’

    s@p

    q@p’

    newRoot()@p’

    &

    &

    send@p’

    #x@p’

    #y@p’

    receive@p’

    receive@p’

    #y@p’

    #x@p’

    s@p

    s@p

    In spite of the two calls to s@p, the function is evaluated only once

    P2P Data Management, 2006, S. Abiteboul


    Example: recursive query processing

    • Using a pseudo Datalog syntax

    • s1@p($x, $y)  d2@p'($x, $z), s2@p'($z, $y)

    • s2@p'($x, $y)  d1@p($x, $z), s1@p($y, $z)

    • After rewriting:

    • on p : #x@p receive@p(q1@p'(d2@p', s2@p') ) 

    • root@p send@p(#y@p', q2@p( d1@p, #x@p) ) 

    • on p' : root@p'send@p'(#x@p,

    •  q1@p'(d2@p', #y@p' receive@p'(s2@p')  ) ) 

    P2P Data Management, 2006, S. Abiteboul


    Generic and global services

    • q@any: where q is a query

      • Any peer that has some query processor for q can do it

    • f@any: where f is a processing service call

      • Example: decryption or gene comparison

    • q over a P2P collection

    eval@p

    eval@p

    eval@p

    eval@p

    q@p2

    q@p1

    index

    q

    q

    coll

    @

    @

    q

    P2P Data Management, 2006, S. Abiteboul


    The AXML algebra – conclusion

    • Captures distributed XML query processing/optimization

    • Based on a communication model a la CCS

    • Algebraic – stream-oriented

    • Orthogonal to the local XML query optimizer

    • Orthogonal to the network support (DHT, small world etc.)

    • What is not yet available? A cost model and heuristics

    P2P Data Management, 2006, S. Abiteboul


    Outline

    • Introduction – the data ring

    • Calculus for P2P data management(ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    4. P2P XML indexing and query processing


    Efficient evaluation of tree-pattern-queries

    • Many optimization techniques

    • We are interested here in distributed query evaluation/optimization

    • 1) We consider XML indexing

    • 2) Holistic twig join that is based on indexing

    • 3) P2P indexing

    • 4) P2P query processing

    • 5) Optimizing P2P indexing

    P2P Data Management, 2006, S. Abiteboul


    XML indexing: structural identifiers

    1

    A

    8

    0

    7

    2

    B

    C

    8

    6

    1

    1

    X ancestor of Y <=>

    pre(X) < pre(Y) and

    post(X) ≥ post(Y)

    3

    8

    5

    D

    F

    E

    4

    8

    6

    2

    2

    2

    6

    4

    G

    X parent of Y <=>

    X ancestor of Y and

    level(X) = level(Y) - 1

    “John”

    6

    4

    3

    3

    -Level

    Structural IDs = Prefix-Postfix

    P2P Data Management, 2006, S. Abiteboul


    Holistic Twig Join

    • Input a document and a tree pattern query

    • Find the bindings of the query in the document

    • Holistic = holistique

    • (le tout et pas juste les parties)

    • Twig = brindille

    • Join = you know…

    • Sounds like Harry Potter?

    P2P Data Management, 2006, S. Abiteboul


    Query evaluation over a document

    A

    D

    C

    Ids for A

    (1,8,0)…

    Ids for C

    Ids for D

    “John”

    Ids for “John”

    Ids are sorted in lexicographical order

    Goals is to find “matching Ids”

    P2P Data Management, 2006, S. Abiteboul


    The Holistic Twig Join Algorithm

    a

    b

    c

    a (2,8)

    a (18,25)

    c (9,14)

    a (15,17)

    b (12,14)

    c (6,8)

    c (23,25)

    a (3,5

    c (7,8)

    b (24,25)

    c (4,5)

    b (13,14)

    a (8,8)

    c (25,25)

    c (14,14)

    a (5,5)

    level

    0

    r (1,25)

    1

    b (10,11)

    a (16,17)

    b (19,22)

    2

    c (11,11)

    c (17,17)

    b (20,21)

    3

    4

    c (21,21)

    c (22,22)

    P2P Data Management, 2006, S. Abiteboul


    The Holistic Twig Join Algorithm

    (a7, b4, c8),

    (a7, b5, c8),

    Sa

    b5

    Sb

    Sc

    Stacks

    (a7, b4 ,c9)

    (a7 ,b6 ,c11)

    a

    a7

    a1

    a5

    a7

    a4

    a2

    a6

    a3

    b6

    b4

    b1

    b2

    b4

    b6

    b

    b5

    b3

    c1

    c2

    c10

    c5

    c11

    c9

    c8

    c6

    c7

    c4

    c3

    c8

    c9

    c11

    c

    Legend:

    This is the end

    Head of the stream

    Find the match for the query sub-tree determined by this node !!!

    The ID is present also in the stack

    P2P Data Management, 2006, S. Abiteboul


    P2P XML processing


    History

    1999: INRIA research project

    2000: Creation of a spin-off

    2006: About 25 people

    Technology

    A scalable XML repository

    A content warehouse

    On a cluster of Linux PC

    XML query processing

    Twig join

    Index is distributed

    Keyword-based vs. document based

    XML indexing in Xyleme

    hash(C)

    LAN

    hash(“John)

    Put(C;[d,p,6,6,1])

    Put(“John”;[d,p,3,1,2])

    P2P Data Management, 2006, S. Abiteboul


    Query processing over a distributed collection

    A

    Ids for A

    (p12,d456, 1,7,0)…

    C

    D

    Ids for C

    Ids for D

    “John”

    Ids include peerId and docId

    Ids are sorted in lexicographical order

    Goals is to find “matching Ids” in the collection

    Ids for John

    P2P Data Management, 2006, S. Abiteboul


    Use structural Ids

    Publish them via a DHT

    Distributed Hash Table

    Peers come and go

    Locate(k):: log(n) messages to fing the peer in charge of key k

    Put(k,v)

    Get(k): retrieves all the values for k

    We use Pastry

    We also tried P2PSim and JXTA

    XML indexing in KadoP

    hash(C)

    DHT

    posting for C

    hash(“John)

    put(C;[d,p,6,6,1])

    put(“John”;[d,p,3,1,2])

    put(C;[d,p,6,6,1])

    P2P Data Management, 2006, S. Abiteboul


    XML query processing in KadoP

    • Given a tree pattern query Q

    • Evaluate an index query indexQ to locate the peers that can provide some answers

      • indexQ is a twig join

    • Ship Q to these peers and evaluate it there

    • If indexQ is imprecise, many false positive

      • Example: ship Q to all peers (maximal parallelism)

      • Example: Instead of structural Ids, just use (label/word,peerId,docId)

    P2P Data Management, 2006, S. Abiteboul


    KadoP architecture

    KadoP peer

    publish & query

    Semantic layer

    Web

    interface

    External

    Layer

    ActiveXML

    engine

    KadoP Engine

    Indexing

    Logical

    Layer

    Query

    processing

    DHT

    locate, put, get & delete

    Physical

    Layer

    Index

    P2P Data Management, 2006, S. Abiteboul


    Some technical issues

    • Our goal: manage millions of documents with a large number of peers

    • First experiments were a disaster

    • Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)

    • Extend the API of the DHT to have Append and not only Read/Write

    • Extend the API of the DHT to have a streaming exchange of postings

      • Useful because the XML algebra works better with streams

    • Now it scales but there is the issue of long postings

    P2P Data Management, 2006, S. Abiteboul


    Using keyword distribution

    Suppose

    Peer for Ullman is in Europe

    Peer for XML is in US

    we have to ship one long posting between US and Europe

    For a large number of users, we absorb all the bandwidth of Internet backbone

    Need for replication

    Even for thousands of peers, the exchange of long postings is an issue

    The issue of long postings: Google in P2P

    Ullmann & xml?

    DHT

    Ullman

    xml

    P2P Data Management, 2006, S. Abiteboul


    Intensional indexing in KadoP

    Distributed B-tree

    • Long posting = bad response time

    • No long posting

    • get h(name) then parallel fetch

    • Possibility to optimize further

    • f(docId55..docId75)

    • may be it does not match

    • no need to call f

    long posting

    h(Name)

    f g h i

    h(Name)

    P2P Data Management, 2006, S. Abiteboul


    More optimization

    • Standard for P2P keyword search

      • Gap compression and adaptive set intersection

    • Standard distributed query optimization techniques

      • Ship smallest list

      • Load balancing

      • Caching

      • Replication

    • Semi-join techniques notably Bloom semi-join

    P2P Data Management, 2006, S. Abiteboul


    Outline

    • Introduction – the data ring

    • Calculus for P2P data management(ActiveXML)

    • Algebra for P2P data management (ActiveXML algebra)

    • Indexing in P2P (KadoP)

    • Conclusion

    P2P Data Management, 2006, S. Abiteboul


    6. Conclusion


    Conclusion

    • Logic for distributed data management

      • Opinion: XQuery is a language for local XML management

      • Proposal: ActiveXML

    • Algebraic foundation of distributed query optimization

      • Proposal: ActiveXML algebra

    • P2P (Active) XML indexing

      • KadoP is now being tested and we are working on optimization

    • Software

      • ActiveXML is open-source – see activexml.net

      • KadoP soon will be – already available upon request

      • EDOS distribution system as well

    P2P Data Management, 2006, S. Abiteboul


    Lots of related work and related systems

    • This is going very fast in system devepments

    • Structured P2P nets: Pastry, Chord

    • Content delivery net: Coral, Akamai

    • XML repositories: Xyleme, DBMonet

    • Multicas systemst: Avalanche, Bullet

    • File sharing systems: BitTorrent, Kazaa

    • Pub/Sub systems: Scribe, Hyper

    • Distributed storage systems: OceanStore, GoogleFS

    • Etc.

    • Fundamental research is somewhat left behind

    P2P Data Management, 2006, S. Abiteboul


    Issues

    • P2P query optimization

    • P2P access control

    • P2P archiving

    • P2P self tuning

    • P2P monitoring

    • P2P knowledge management [SomeWhere]

    • Also: analysis and verification of these systems

      • E.g., termination, error detection, diagnosis

    P2P Data Management, 2006, S. Abiteboul


    Find your own topic

    • Pick your favorite problem for data or knowledge management and study it in a P2P setting

    • with gigabytes of data and thousands of peers

    • If you find it boring, consider it

    • with terabytes of data and millions of peers

    P2P Data Management, 2006, S. Abiteboul


    Merci

    Merci

    P2P Data Management, 2006, S. Abiteboul


    ad
  • Login