information management in p2p serge abiteboul inria futurs and univ paris 11
Download
Skip this Video
Download Presentation
Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

Loading in 2 Seconds...

play fullscreen
1 / 89

Information Management in P2P - PowerPoint PPT Presentation


  • 338 Views
  • Uploaded on

Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11. Introduction. Success stories at the time of the Internet bubble. Google: management of Web pages Mapquest: management of maps Amazone: book catalogue eBay: product catalogue

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Information Management in P2P' - Jimmy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
information management in p2p serge abiteboul inria futurs and univ paris 11
Information Management in P2PSerge AbiteboulINRIA-Futurs and Univ. Paris 11

P2P Data Management, 2006, S. Abiteboul

success stories at the time of the internet bubble
Success stories at the time of the Internet bubble
  • Google: management of Web pages
  • Mapquest: management of maps
  • Amazone: book catalogue
  • eBay: product catalogue
  • Napster (emule, bearshare, etc.): music database
  • Flickr: picture database
  • Wikipedia: dictionary
  • del.icio.us: annotations
  • InFrance:

Meetic: dating database

Kelkoo: comparative shopping

They are all about

publishing some database

P2P Data Management, 2006, S. Abiteboul

the trend is towards peer to peer and interactivity
The trend is towards peer-to-peerand interactivity
  • P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority
  • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
  • [email protected]; kazaa; cabal
  • Switch from centralized servers to communities and syndication
  • Interaction and Web 2.0
  • Motivations: Social, organizational

P2P Data Management, 2006, S. Abiteboul

information management in a p2p network
Information management in a P2P network
  • Private terminology: data ring
  • Information is heterogeneous, distributed, replicated, dynamic
  • Which info: Data + meta-data + knowledge + services
  • Peers are heterogeneous, autonomous and possibly mobile
    • From sensors to PDA to mainframe
  • Typically very large number of peers
  • Variety of requirements: QoS, performance, security, etc.

P2P Data Management, 2006, S. Abiteboul

acknowledgement
Acknowledgement
  • Xyleme:Scalable XML warehousing
  • Sophie Cluet, Guy Ferran (Xyleme) & many others
  • ActiveXML:Language for P2P data management
  • Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others
  • KadoP: P2P scalable XML indexing
  • Ioana Manolescu, Nicoleta Preda & others
  • Data Ring: Infrastructure for P2P data management
  • Alkis Polyzotis (UC Santa Cruz)

P2P Data Management, 2006, S. Abiteboul

outline
Outline
  • Introduction – the data ring
  • Calculus for P2P data management (ActiveXML)
  • Algebra for P2P data management (ActiveXML algebra)
  • Indexing in P2P (KadoP)
  • Conclusion
  • Goal of the tutorial: present issues and technology on p2p information management
  • Warning: it is very biased – it is not a survey

P2P Data Management, 2006, S. Abiteboul

outline8
Outline
  • Introduction – the data ring
  • Calculus for P2P data management (ActiveXML)
  • Algebra for P2P data management (ActiveXML algebra)
  • Indexing in P2P (KadoP)
  • Conclusion

P2P Data Management, 2006, S. Abiteboul

the information in a data ring
The information in a data ring
  • Data: tuples, collections, documents, relations…
  • Services: data sources, possibly some processing
  • Meta-data about resources: attribute/values pairs, annotations
  • Ontologies to explain data and metadata
  • View definitions
  • Data integration information, e.g., mappings between ontologies
  • Physical data: Indices and materialized views

P2P Data Management, 2006, S. Abiteboul

functionalities of the data ring
Functionalities of the data ring
  • Storage, persistence, replication
  • Indexing, caching, querying, updating, optimization
  • Schema management, access control
  • Fault tolerance, self tuning, monitoring
  • Resource discovery, history, provenance, annotations, multi-linguism,
  • Semantic enrichment, uncertain data
  • Each functionality may be achieved by a peer or by the network

P2P Data Management, 2006, S. Abiteboul

and now what is a peer
A mainframe database

A file system

Web server

A PC

A PDA

A telephone

A sensor

A home appliance

A car

A manufacturing tool

A telecom equipment

A toy

Another data ring

And now, what is a peer?

Any connected device or software with some information to share

A net address and some names of resources (e.g. document or service)

P2P Data Management, 2006, S. Abiteboul

advantages and disadvantages of p2p
Advantages and disadvantages of P2P
  • Scaling
  • Performance
    • Optimization of parallelism
    • Avoid bottleneck
    • Replication
  • Availability
    • Replication
  • Cost
    • Avoid the cost of server
    • Share operational cost
  • Dynamicity
    • add/remove new data sources
  • Complexity
  • Performance
    • Cost for complex queries
    • Communication cost
  • Availability
    • Peers can leave
  • Consistency maintenance
    • Difficult to support transaction
  • Quality
    • Difficult to guarantee quality

P2P Data Management, 2006, S. Abiteboul

crash course on web standards
Crash course on Web standards

Owl

RDFS

XML

  • Data exchange format = XML
    • Labeled ordered trees
    • Its main asset = XML schema
    • There is much more
  • Distributed computing protocol = Web services
    • SOAP Simple Object Access Protocols
    • WSDL Web Service Definition Language
    • UDDIUniversal Description, Discovery and Integration
    • BPEL Business Process Execution Language
  • Query languages = XPath and XQuery
    • Declarative query language for XML + full-text + update language
  • Knowledge representation = Owl or RDF/S

Xquery

Xpath

SOAP

WSDL

P2P Data Management, 2006, S. Abiteboul

slide15

Information used to live in islands but with the Web, this is changing:uniform access to information… …the dream for distributed data management

do you like the standards
Do you like the standards?
  • It is the wrong question!
  • Correct questions: What can you do with it? What is missing?
  • Is Xquery the ultimate query language for the Web? No
    • It is a language for querying centralized XML
    • We will see what it is missing
  • We will not talk much about semantics

P2P Data Management, 2006, S. Abiteboul

automatic and distributed management of the data ring
Automatic and distributed management of the data ring
  • No centralized server
  • No information administrator (no info manager)
  • Most users are non-experts
    • E.g., scientists
  • Requirements
    • Ease of deployment (zero-effort)
    • Ease of administration (zero-effort)
    • Ease of publication (epsilon-effort)
    • Ease of exploitation (epsilon-effort)
    • Participation in community building notably via annotations

Happy database admin

P2P Data Management, 2006, S. Abiteboul

what should be made automatic
What should be made automatic
  • Self-statistics from the monitoring of the data ring
    • In particular, define the statistics that are needed*
  • Self-tuning based on the self-statistics
    • Choose the most appropriate organization
    • Decide to install access structures: indexes, views, etc.
    • Control replication of data and services
  • Self-healing
    • Recovery from errors
    • E.g., replacement of a failing Web service
  • And automatic file management

P2P Data Management, 2006, S. Abiteboul

any hope
Any hope?
  • Technology exists (database self-tuning, machine learning, etc.)
  • But self-tuning for databases has advanced very slowly
  • Why can this work?
  • There is no alternative (for db, this was just a cool gadget)
  • KISS (keep it simple stupid!)
  • The power of parallelism
  • This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)

P2P Data Management, 2006, S. Abiteboul

distributed access control
Distributed access control
  • Goal: Control access to ring resources
  • Access to resources is based on access rights (ACL)
  • Who is controlling ACLs?
  • A node manages ACLs for a collection of distributed resources
    • Easy but against the spirit and possible bottleneck
  • The network manages access control
    • Anybody can get the data
    • The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes
    • Some techniques exist

P2P Data Management, 2006, S. Abiteboul

monitoring
Monitoring
  • What is monitored?
  • Web service calls and database updates
  • The Web
      • Web pages
      • RSS feed
  • What is produced?
  • A stream of events
      • As a continuous service
      • As a RSS feed
      • As a Web site/page
  • Info-surveillance
  • Self-statistics and tracing
  • Basis for error diagnosis

P2P Data Management, 2006, S. Abiteboul

streams are everywhere
Streams are everywhere
  • In query processing
  • In indexing (KadoP)
  • In recursive queries (AXML-QSQ)
  • In messaging, monitoring and pub/sub
  • That is why we will use an algebra over streams of trees and not simply an algebra over trees

P2P Data Management, 2006, S. Abiteboul

example edos distribution system
Example: Edos distribution system
  • A system for the management of Linux distribution
  • Joint work with Mandriva Software and U. Tel Aviv
  • Community of open-source developers: thousands
  • System releases: about 10 000 software packages + metadata
  • Functionalities
    • Query the metadata
    • Query subscription
    • Retrieve packages
    • Publish a new release or update an existing one

P2P Data Management, 2006, S. Abiteboul

exemple webcontent
Exemple: WebContent
  • WebContent: an ANR platform for the management of web content
  • Web surveillance
    • Business, technical, web watching…
  • Participation of Gemo
    • WP3: knowledge
    • WP5: P2P content management
  • Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)

P2P Data Management, 2006, S. Abiteboul

taxonomy of such applications
Taxonomy of such applications
  • Parameters
    • Number of peers and quantity of data
    • How volatile the peers are
    • The query/update workload
    • The functionalities that are desired
  • Edos: peers and documents in thousands, mostly append for updates, peers not too volatile
  • An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers
  • Mostly interested in the first case

P2P Data Management, 2006, S. Abiteboul

thesis
Thesis
  • XQuery is fine for local XML processing and publishing
  • Not sufficient for distributed data management
  • The success of the relational model, i.e., of tables on a server :
    • A logic for defining tables
    • An algebra for describing query plans over tables
  • By analogy, we need for trees in a P2P system
    • A logic for defining distributed tree data and data services
    • An algebra for describing query plans over these
  • Proposal: ActiveXML logic and algebra

P2P Data Management, 2006, S. Abiteboul

outline27
Outline
  • Introduction – the data ring
  • Calculus for P2P data management (ActiveXML)
  • Algebra for P2P data management (ActiveXML algebra)
  • Indexing in P2P (KadoP)
  • Conclusion

P2P Data Management, 2006, S. Abiteboul

the basis
The basis
  • AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework
  • Simple idea: XML documents with embedded service calls
  • Intensional data
    • Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given
  • Dynamic data
    • If the data sources change, the same document will provide different information

P2P Data Management, 2006, S. Abiteboul

example omitting syntactic details
Example(omitting syntactic details)

<resorts state=‘Colorado’>

<resort>

<name> Aspen </name>

<sc> Unisys.com/snow(“Aspen”) </sc>

<depth unit=“meter”>1</depth>

<hotels ID=AspHotels > ….

Yahoo.com/GetHotels(<city name=“Aspen”/>)

</hotels>

</resort> …

</resorts>

  • May contain calls
  • to any SOAP web service :
    • e-bay.net, google.com…
  • to any AXML web services
    • to be defined

P2P Data Management, 2006, S. Abiteboul

marketing philosophy
Marketing  Philosophy

Active answer = intensional and dynamic and flexible

Embedding calls in data is an old idea in database

Manon: What’s the capital of Brazil?

Dad: Let’s ask Wikipedia.com!

Manon: How do I get a cheap ticket to Galapagos?

Dad: Let’s place a subscription on LastMinute.com!

Manon: What are the countries in the EC?

Dad: France, Germany, Holland, Belgium, and hum… Let’s ask YouLists.com for more!

P2P Data Management, 2006, S. Abiteboul

active xml peer
Active XML peer

AXML

peer

soap

  • Peer-to-peer architecture
  • Each Active XML peer
    • Repository: manages Active XML data
    • Web client: calls the services inside a document
    • Web server: provides (parameterized) queries/updates over the repository as web services
  • Exchange of AXML instead of XML

P2P Data Management, 2006, S. Abiteboul

what is an axml peer
What is an AXML peer?
  • PC
    • Now open source ObjectWeb – queries in OQL
  • Peer on a mass storage system
    • eXist (open source XML database) queries in XQuery
    • Xyleme queries in XyQL
  • PDA or cell phone
    • Persistence in file system and XPATH
  • On going: the entire network
    • Data is stored in a P2P network - KadoP
  • More: java card, a relational database…

P2P Data Management, 2006, S. Abiteboul

a key issue call activation
When to activate the call?

Explicit pull mode: active databases

Implicit pull mode: deductive databases

Push mode: query subscription

What to do with its result?

How long is the returned data valid?

Mediation and caching

Where to find the arguments?

Under the service call: XML,XPATH or a service call

A key issue: call activation

P2P Data Management, 2006, S. Abiteboul

another key issue what to send
Another key issue: what to send?
  • Send some AXML tree t
    • As result of a query or as parameter of a call
  • The tree t contains calls, do we have to evaluate them?
    • If I do, I may introduce service calls, do we have to evaluate all these calls before transmitting the data?
  • Hi John, what is the phone number of the Prime Minister of France?
  • Find his name at whoswho.com then look in the phone dir
  • Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr
  • (33) 01 56 00 01

P2P Data Management, 2006, S. Abiteboul

active xml cool idea complex problems
Blasphemous claim:

Active XML is the proper paradigm for data exchange!

Not XML + not XQuery

Brings to a unique setting

distributed db, deductive db, active db, stream data

warehousing, mediation

This is unreasonable? Yes!

Plenty of works ahead… to make it work

But first, the algebra

Active XMLcool idea – complex problems

P2P Data Management, 2006, S. Abiteboul

outline37
Outline
  • Introduction – the data ring
  • Calculus for P2P data management (ActiveXML)
  • Algebra for P2P data management (ActiveXML algebra)
    • Query processing
    • Query optimization
  • Indexing in P2P (KadoP)
  • Conclusion

P2P Data Management, 2006, S. Abiteboul

motivation
Motivation
  • Relational model: centralized tables
  • optimization: algebraic expression and rewriting
  • Active XML model: distributed trees
  • optimization: algebraic expression and rewriting
  • Distributed query optimization based on algebraic rewriting of Active XML trees
  • Based on experiences with AXML optimization

P2P Data Management, 2006, S. Abiteboul

active xml peers
We focus on positive AXML

Set-oriented data

Positive/monotone services

Services = tree-pattern-query-with-join queries

Services produce streams

Optimized by a local query optimizer

Evaluated by a local query processor

Out of our scope

Active XML peers

output

stream

π

Local

query

processing

join

π

input

stream

input

stream

P2P Data Management, 2006, S. Abiteboul

the problem
The problem
  • An AXML system
    • A set of peers
    • For each peer a set of documents and services
    • Extensional data is distributed
    • Intensional data (knowledge) is distributed
      • Defined using query services (TPQJ queries)
      • These services are generic: any peer can evaluate a query
  • A query q to some peer
  • Evaluate the answer to q with optimal response time

P2P Data Management, 2006, S. Abiteboul

axml algebra
AXML algebra

l

En

E1

E2

[email protected]

En

E1

E2

[email protected]

#[email protected]

E1

[email protected]

[email protected]

E1

E1

  • (AXML) algebraic expressions:

AXML logic

[email protected]

Each such expression lives at some peer

Includes the AXML trees

P2P Data Management, 2006, S. Abiteboul

algebraic expressions annotations
Algebraic expressions annotations
  • Executing service call:
  • Terminated service call: 
  • Subtlety

[email protected](5): definition of intensional data

eval([email protected](5)): request to evaluate it; during query optimization

[email protected](5): query is being evaluated; during query processing

[email protected](5): query evaluation is complete

P2P Data Management, 2006, S. Abiteboul

evaluation rules transfer rules
Evaluation rules: transfer rules

#[email protected]

#[email protected]

newRoot()@p’

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

t1

t1

t1

t2

t2

t2

&

#[email protected]

  • Site p asks p’ to do the work and send the result to p

P2P Data Management, 2006, S. Abiteboul

evaluation rules more transfer rules
Evaluation rules: more transfer rules

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

#[email protected]

#[email protected]

Z

Z

  • When a query is evaluated, results appear
  • They are sent to the place that requested them
  • Also some rules for eof

#[email protected]

&

Z

P2P Data Management, 2006, S. Abiteboul

evaluation
Evaluation
  • Reminder: setting
  • Rewrite the trees in peer workspaces until termination of the process
  • Results
  • For positive XML, this process converges… to a possibly infinite state
  • This process computes the answer to q
  • May be fairly inefficient: need for optimization!

P2P Data Management, 2006, S. Abiteboul

query optimization
Query optimization
  • Well-known optimization techniques for distributed data management
  • Pushing selections
  • Semijoin reducers
  • Horizontal, vertical, hybrid decomposition
  • Recursive query processing and query-subquery
  • Some “specific” AXML optimizations
  • Pushing queries over service calls
  • Lazy service call evaluation …
  • Optimizing subscription management
  • All are captured by the algebraic framework

P2P Data Management, 2006, S. Abiteboul

example pushing selections
Example: pushing selections

[email protected]

[email protected]

[email protected]

q

q1

[email protected]

[email protected]

(q2([email protected]))

[email protected]

[email protected]

[email protected]

@p2([email protected]([email protected]))

(q2(d))

Suppose q ≡ q1((q2)):

P2P Data Management, 2006, S. Abiteboul

example interleaving of processing and optimization
Example: interleaving of processing and optimization

(r2)

(r3)

(r4)

(r1)

P2P Data Management, 2006, S. Abiteboul

transfer and load balancing rules54
Transfer and load balancing rules

[email protected]

E

#[email protected]

[email protected]()

#[email protected]

&

[email protected]

eval(E)

#[email protected]

Peer p1 delegates the evaluation of E to p2

P2P Data Management, 2006, S. Abiteboul

transfer and load balancing rules55
Transfer and load balancing rules

eval(E)

#[email protected]

[email protected]()

#[email protected]

&

[email protected]

eval(E)

#[email protected]

Peer p1 delegates the evaluation of E to p2

P2P Data Management, 2006, S. Abiteboul

transfer and load balancing rules56
Transfer and load balancing rules

eval(E)

eval(E)

eval(E)

#[email protected]

[email protected]()

#[email protected]

&

[email protected]

#[email protected]

Peer p1 delegates the evaluation of E to p2

P2P Data Management, 2006, S. Abiteboul

back to interleaved execution and optimization
Back to interleaved execution and optimization

(r2)

(r3)

(r4)

(r1)

Repeated transfers

(r2)

(r3)

(r4)

(r1)

Data transfers reduced

More work for p1: merging all the streams

Hierarchical stream merging

P2P Data Management, 2006, S. Abiteboul

example horizontal and vertical decomposition
Example: Horizontal and vertical decomposition
  • A relation d over ABC that is split both horizontally and vertically

d = (d1  d2) d3

d1 = B<5 (d\') and d2 = B>=5 (d\')

d\', d1, d2 over AB and d3 over BC; each di is at a peer pi

P2P Data Management, 2006, S. Abiteboul

generic and global services
Generic and global services
  • [email protected]: where q is a query
    • Any peer that has some query processor for q can do it
  • [email protected]: where f is a processing service call
    • Example: decryption or gene comparison
  • q over a P2P collection

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

index

q

q

coll

@

@

q

P2P Data Management, 2006, S. Abiteboul

the axml algebra conclusion
The AXML algebra – conclusion
  • Captures distributed XML query processing/optimization
  • Based on a communication model a la CCS
  • Algebraic – stream-oriented
  • Orthogonal to the local XML query optimizer
  • Orthogonal to the network support (DHT, small world etc.)
  • What is not yet available? A cost model and heuristics

P2P Data Management, 2006, S. Abiteboul

outline65
Outline
  • Introduction – the data ring
  • Calculus for P2P data management (ActiveXML)
  • Algebra for P2P data management (ActiveXML algebra)
  • Indexing in P2P (KadoP)
  • Conclusion

P2P Data Management, 2006, S. Abiteboul

efficient evaluation of tree pattern queries
Efficient evaluation of tree-pattern-queries
  • Many optimization techniques
  • We are interested here in distributed query evaluation/optimization
  • 1) We consider XML indexing
  • 2) Holistic twig join that is based on indexing
  • 3) P2P indexing
  • 4) P2P query processing
  • 5) Optimizing P2P indexing

P2P Data Management, 2006, S. Abiteboul

xml indexing structural identifiers
XML indexing: structural identifiers

1

A

8

0

7

2

B

C

8

6

1

1

X ancestor of Y <=>

pre(X) < pre(Y) and

post(X) ≥ post(Y)

3

8

5

D

F

E

4

8

6

2

2

2

6

4

G

X parent of Y <=>

X ancestor of Y and

level(X) = level(Y) - 1

“John”

6

4

3

3

-Level

Structural IDs = Prefix-Postfix

P2P Data Management, 2006, S. Abiteboul

holistic twig join
Holistic Twig Join
  • Input a document and a tree pattern query
  • Find the bindings of the query in the document
  • Holistic = holistique
  • (le tout et pas juste les parties)
  • Twig = brindille
  • Join = you know…
  • Sounds like Harry Potter?

P2P Data Management, 2006, S. Abiteboul

query evaluation over a document
Query evaluation over a document

A

D

C

Ids for A

(1,8,0)…

Ids for C

Ids for D

“John”

Ids for “John”

Ids are sorted in lexicographical order

Goals is to find “matching Ids”

P2P Data Management, 2006, S. Abiteboul

the holistic twig join algorithm
The Holistic Twig Join Algorithm

a

b

c

a (2,8)

a (18,25)

c (9,14)

a (15,17)

b (12,14)

c (6,8)

c (23,25)

a (3,5

c (7,8)

b (24,25)

c (4,5)

b (13,14)

a (8,8)

c (25,25)

c (14,14)

a (5,5)

level

0

r (1,25)

1

b (10,11)

a (16,17)

b (19,22)

2

c (11,11)

c (17,17)

b (20,21)

3

4

c (21,21)

c (22,22)

P2P Data Management, 2006, S. Abiteboul

the holistic twig join algorithm72
The Holistic Twig Join Algorithm

(a7, b4, c8),

(a7, b5, c8),

Sa

b5

Sb

Sc

Stacks

(a7, b4 ,c9)

(a7 ,b6 ,c11)

a

a7

a1

a5

a7

a4

a2

a6

a3

b6

b4

b1

b2

b4

b6

b

b5

b3

c1

c2

c10

c5

c11

c9

c8

c6

c7

c4

c3

c8

c9

c11

c

Legend:

This is the end

Head of the stream

Find the match for the query sub-tree determined by this node !!!

The ID is present also in the stack

P2P Data Management, 2006, S. Abiteboul

xml indexing in xyleme
History

1999: INRIA research project

2000: Creation of a spin-off

2006: About 25 people

Technology

A scalable XML repository

A content warehouse

On a cluster of Linux PC

XML query processing

Twig join

Index is distributed

Keyword-based vs. document based

XML indexing in Xyleme

hash(C)

LAN

hash(“John)

Put(C;[d,p,6,6,1])

Put(“John”;[d,p,3,1,2])

P2P Data Management, 2006, S. Abiteboul

query processing over a distributed collection
Query processing over a distributed collection

A

Ids for A

(p12,d456, 1,7,0)…

C

D

Ids for C

Ids for D

“John”

Ids include peerId and docId

Ids are sorted in lexicographical order

Goals is to find “matching Ids” in the collection

Ids for John

P2P Data Management, 2006, S. Abiteboul

xml indexing in kadop
Use structural Ids

Publish them via a DHT

Distributed Hash Table

Peers come and go

Locate(k):: log(n) messages to fing the peer in charge of key k

Put(k,v)

Get(k): retrieves all the values for k

We use Pastry

We also tried P2PSim and JXTA

XML indexing in KadoP

hash(C)

DHT

posting for C

hash(“John)

put(C;[d,p,6,6,1])

put(“John”;[d,p,3,1,2])

put(C;[d,p,6,6,1])

P2P Data Management, 2006, S. Abiteboul

xml query processing in kadop
XML query processing in KadoP
  • Given a tree pattern query Q
  • Evaluate an index query indexQ to locate the peers that can provide some answers
    • indexQ is a twig join
  • Ship Q to these peers and evaluate it there
  • If indexQ is imprecise, many false positive
    • Example: ship Q to all peers (maximal parallelism)
    • Example: Instead of structural Ids, just use (label/word,peerId,docId)

P2P Data Management, 2006, S. Abiteboul

kadop architecture
KadoP architecture

KadoP peer

publish & query

Semantic layer

Web

interface

External

Layer

ActiveXML

engine

KadoP Engine

Indexing

Logical

Layer

Query

processing

DHT

locate, put, get & delete

Physical

Layer

Index

P2P Data Management, 2006, S. Abiteboul

some technical issues
Some technical issues
  • Our goal: manage millions of documents with a large number of peers
  • First experiments were a disaster
  • Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)
  • Extend the API of the DHT to have Append and not only Read/Write
  • Extend the API of the DHT to have a streaming exchange of postings
    • Useful because the XML algebra works better with streams
  • Now it scales but there is the issue of long postings

P2P Data Management, 2006, S. Abiteboul

the issue of long postings google in p2p
Using keyword distribution

Suppose

Peer for Ullman is in Europe

Peer for XML is in US

we have to ship one long posting between US and Europe

For a large number of users, we absorb all the bandwidth of Internet backbone

Need for replication

Even for thousands of peers, the exchange of long postings is an issue

The issue of long postings: Google in P2P

Ullmann & xml?

DHT

Ullman

xml

P2P Data Management, 2006, S. Abiteboul

intensional indexing in kadop
Intensional indexing in KadoP

Distributed B-tree

  • Long posting = bad response time
  • No long posting
  • get h(name) then parallel fetch
  • Possibility to optimize further
  • f(docId55..docId75)
  • may be it does not match
  • no need to call f

long posting

h(Name)

f g h i

h(Name)

P2P Data Management, 2006, S. Abiteboul

more optimization
More optimization
  • Standard for P2P keyword search
    • Gap compression and adaptive set intersection
  • Standard distributed query optimization techniques
    • Ship smallest list
    • Load balancing
    • Caching
    • Replication
  • Semi-join techniques notably Bloom semi-join

P2P Data Management, 2006, S. Abiteboul

outline83
Outline
  • Introduction – the data ring
  • Calculus for P2P data management (ActiveXML)
  • Algebra for P2P data management (ActiveXML algebra)
  • Indexing in P2P (KadoP)
  • Conclusion

P2P Data Management, 2006, S. Abiteboul

slide85

Conclusion

  • Logic for distributed data management
    • Opinion: XQuery is a language for local XML management
    • Proposal: ActiveXML
  • Algebraic foundation of distributed query optimization
    • Proposal: ActiveXML algebra
  • P2P (Active) XML indexing
    • KadoP is now being tested and we are working on optimization
  • Software
    • ActiveXML is open-source – see activexml.net
    • KadoP soon will be – already available upon request
    • EDOS distribution system as well

P2P Data Management, 2006, S. Abiteboul

lots of related work and related systems
Lots of related work and related systems
  • This is going very fast in system devepments
  • Structured P2P nets: Pastry, Chord
  • Content delivery net: Coral, Akamai
  • XML repositories: Xyleme, DBMonet
  • Multicas systemst: Avalanche, Bullet
  • File sharing systems: BitTorrent, Kazaa
  • Pub/Sub systems: Scribe, Hyper
  • Distributed storage systems: OceanStore, GoogleFS
  • Etc.
  • Fundamental research is somewhat left behind

P2P Data Management, 2006, S. Abiteboul

issues
Issues
  • P2P query optimization
  • P2P access control
  • P2P archiving
  • P2P self tuning
  • P2P monitoring
  • P2P knowledge management [SomeWhere]
  • Also: analysis and verification of these systems
    • E.g., termination, error detection, diagnosis

P2P Data Management, 2006, S. Abiteboul

find your own topic
Find your own topic
  • Pick your favorite problem for data or knowledge management and study it in a P2P setting
  • with gigabytes of data and thousands of peers
  • If you find it boring, consider it
  • with terabytes of data and millions of peers

P2P Data Management, 2006, S. Abiteboul

merci
Merci

Merci

P2P Data Management, 2006, S. Abiteboul

ad