Making Peer Databases Interact – A Vision for an Architecture Supporting Data Coordination

Fausto Giunchiglia (1)
Madrid, 20 September 2002
Working Group (in alphabetical order):
• Bernstein Phil (4)
• Kementsietsidis Tasos (2)
• Kuper Gabriel (1)
• Mylopoulos John (2)
• Serafini Luciano (3)
• Shvaiko Pavel (1)
• Zaihrayeu Ilya (1)
Sites:
• (1) University of Trento
• (2) University of Toronto
• (3) ITC-Irst, Trento
• (4) Microsoft Research

The Talk
• Peer-to-Peer Databases: The Intuition
• Preliminary Logical Architecture
• The Running Example
• Conclusion
• … and Agents???

PEER-TO-PEER DATABASES – THE INTUITION

The Peer-to-Peer (P2P)
“Peer-to-peer is a class of applications that takes advantage of resources – storage, cycles, content, human presence – available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, peer-to-peer nodes must … have significant or total autonomy from central servers.”
Quote from Clay Shirky (www.shirky.com)

Examples of P2P Computing
• Napster: a shared directory of available music, plus client software that lets users, for instance, import and export files
• Gnutella: a decentralized group membership and search protocol, mainly used for file sharing
• Groove: a system that implements a secure shared space among peers
• JXTA: aims to create a common platform that makes it simple to build a wide range of distributed services and applications in which every device is addressable as a peer
Is there a place for databases?

Motivating Example: Databases of Medical Patients
• One patient may be described in several databases: pharmacist, family doctor and hospital
• But the databases can use different patient ID formats, disease descriptions, etc.
• Nevertheless they may still need to interoperate
• At this point data integration may suffice, if the patient goes to the same doctor, pharmacist and hospital
• When a patient is injured on a ski holiday in another country, yet more databases need to get involved
• Complete integration is likely to be infeasible
• But dynamic integration of the databases relevant to one patient could have high value

Data (base) Coordination
“... Coordination is managing dependencies between interacting databases”
Why is it different from data (base) integration?
• There is no statically maintained global schema
• Many of the parameters (metadata) influencing the interaction among peer databases are decided at run time, whereas integration is done at design time
• A change in the content of a node does not affect the overall system performance
… and
• For any given query, nodes coordinate in order to define and use the most “appropriate” (virtual) schema. This is crucial for dealing with the strong dynamics of a P2P network

The Three Variances
Data integration mechanisms become impractical for randomly acquainted databases.
Three kinds of unpredictable run-time factors influence the answer to a given query in a P2P network:
• Network (dependent) variance: the network changes over time
• Database (dependent) variance: different databases, if asked the same global query, will provide different answers
• Query (dependent) variance: different queries, even if posed to the same database, will impose different points of view on the network

Good Enough Answers
• In data coordination, it becomes hard to maintain a high quality level in the answers provided by the P2P network
• High quality data can flow among the databases preserving (at the best possible level of approximation) soundness and completeness
• Good Enough Answer (intuition): an answer whose quality level serves its purpose, given the amount of effort made in computing it

Example of a Good Enough Answer
• When planning his vacation in Trentino, John goes to a local travel agency (TA)
• Unfortunately, TA cannot offer John anything from its own database
• Instead, TA searches for individual operators in the Trentino region (hotels, ski resorts, etc.)
• TA starts communication sessions with some operators
• TA queries for the necessary information (e.g., prices, conditions, availability)
• As long as, for instance, TA gets a hotel John likes, this is Good Enough
• Compared to the Motivating Example, much lower quality data coordination will probably suffice
[Figure: sample offer. Cost: 150 $; Avail: 05/01/03 – 15/01/03; Services: …]

Tuning Coordination Over Time
A lot of metadata needs to be produced and maintained. Due to the strong dynamics of a P2P network, this is a crucial and hard task because:
• A node will never know the full list of its peers
• A node will never know everything about its peers
• Its knowledge will be hard to maintain and will easily become obsolete
There is a need to tune and improve, on each peer, the quality of the interaction (for instance, with the help of learning algorithms, metadata editors, and so on).
There is an obvious trade-off between the quality of the answers and the effort made in maintaining coordination.

VERY PRELIMINARY HINTS OF A LOGICAL ARCHITECTURE

A Proposed Architecture
Four basic ingredients:
• Interest Groups
• Acquaintances
• Coordination Rules
• Correspondence Rules

Interest Groups
• Peer nodes know very little about the other nodes of the P2P network, and about the topics (e.g., Tourism, Medical care, …) on which their peers are able to answer queries
• An Interest Group is a set of nodes which are able to answer queries about a certain topic
• There is a Group Manager (GM) which is in charge of managing the metadata needed to run the group
• The main goal of GM is to compute the Query Scope (QS): the set of nodes a query should be propagated to
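A minimal sketch of how a Group Manager might compute a Query Scope, assuming the GM simply keeps a registry from topics to member nodes. The class `GroupManager`, its method names, and the exact-string topic matching are illustrative assumptions; the slides do not say how group metadata is stored or how topics are matched.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GroupManager:
    """Hypothetical GM: group metadata kept as topic -> set of member nodes."""
    members: dict[str, set[str]] = field(default_factory=dict)

    def join(self, node: str, topic: str) -> None:
        # Register a node as able to answer queries about a topic.
        self.members.setdefault(topic, set()).add(node)

    def leave(self, node: str) -> None:
        # Nodes may disappear at any time in a P2P network.
        for nodes in self.members.values():
            nodes.discard(node)

    def query_scope(self, topic: str, asker: str) -> set[str]:
        # QS: the nodes the query should be propagated to, excluding the asker.
        return self.members.get(topic, set()) - {asker}


gm = GroupManager()
for node in ("F", "H", "P"):
    gm.join(node, "Medical Care in Toronto")
scope = gm.query_scope("Medical Care in Toronto", asker="M")
print(sorted(scope))
```

Because the scope is recomputed from the current membership on every request, a node that leaves the group drops out of later scopes without any global schema having to be updated.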

Acquaintances
• Acquaintances are nodes that a node knows about and that have data relevant to answering specific queries
• A node is an acquaintance of another node only with respect to (possibly, a schematic representation of) a query
• There must be a way to compute how to propagate a query, how to propagate results back, and how to reconcile them with the results coming from the other acquaintances

Coordination Rules
• Each acquaintance may be associated with one or more Coordination Rules
• Coordination rules specify under what conditions, when, how and where to propagate queries or updates
• A proposed implementation of coordination rules is as Event-Condition-Action (ECA) rules:
• Event can be an update or a query coming from the user or from another node
• Condition refers to properties of the update or query (e.g., the type of query and/or which data are referenced by the query)
• Action can be the translation and propagation of a given update or query to a particular acquaintance
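The ECA reading of a coordination rule can be sketched as follows. This is only an illustration under stated assumptions: the `Message` and `CoordinationRule` types, their fields, and the `fire` helper are hypothetical names, and the action here forwards the message unchanged instead of translating it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    kind: str          # "query" or "update" (the event type)
    source: str        # user or node the message came from
    attributes: set    # attributes referenced by the query or update


@dataclass
class CoordinationRule:
    event: str                             # message kind that triggers the rule
    condition: Callable[[Message], bool]   # property of the query or update
    action: Callable[[Message], tuple]     # returns (acquaintance, message to send)


def fire(rules, msg):
    """Return the (acquaintance, message) pairs produced by all matching rules."""
    return [r.action(msg) for r in rules if r.event == msg.kind and r.condition(msg)]


# Hypothetical rule: forward queries that mention Address_Reason to acquaintance H.
rule = CoordinationRule(
    event="query",
    condition=lambda m: "Address_Reason" in m.attributes,
    action=lambda m: ("H", m),  # translation omitted in this sketch
)

out = fire([rule], Message("query", "M", {"FN", "Address_Reason"}))
print(out[0][0])
```

A rule fires only when both the event type and the condition match, so an update, or a query that does not reference `Address_Reason`, produces no propagation.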

Correspondence Rules
• Each acquaintance is associated with one or more Correspondence Rules
• Correspondence Rules translate queries and query results (semantic heterogeneity)
• They are implemented as rewrite rules and are called by coordination rules, in their action and condition components
• They can be used, for instance, to translate attribute or element names (Domain Relations)
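As a rough illustration of a correspondence rule acting as a rewrite rule, the sketch below renames attributes and maps identifier constants between two schemas. The mapping tables and the `translate_query` helper are assumptions made for the example, loosely modelled on the M and H databases of the running example; the slides do not give the actual rules.

```python
# Hypothetical attribute correspondences from M:Accidents to H:Patients.
M_TO_H = {
    "Address_Reason": "Disease",
    "Treatment_Taken": "Treatment_Desc",
}
# Hypothetical domain relation: patient IDs used in M -> IDs used in H.
ID_M_TO_H = {"A13": "P12"}


def translate_query(select, where, attr_map, id_map):
    """Rewrite attribute names and ID constants.

    Selected attributes with no correspondence are dropped, which is one
    reason the translated query may return an incomplete answer.
    """
    new_select = [attr_map[a] for a in select if a in attr_map]
    new_where = {attr_map.get(k, k): id_map.get(v, v) for k, v in where.items()}
    return new_select, new_where


select_h, where_h = translate_query(
    ["Address_Reason", "Treatment_Taken", "Date"],
    {"PID": "A13"},
    M_TO_H,
    ID_M_TO_H,
)
print(select_h, where_h)
```

Here `Date` has no counterpart in the assumed mapping and is silently dropped, while the patient identifier is rewritten through the domain relation.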

Level One Architecture
• P2P Layer: the P2P functionality add-on
• Local Data Source (LDS): database, file system, web site, …
• User Interface (UI): user queries, results, …
• Query Manager (QM) and Update Manager (UM): responsible for query and update propagation; manage coordination and correspondence rules, acquaintances, and interest groups
• Wrapper: provides a translation layer between QM and UM, and LDS

User submits query Q () Node defines query topic Node sends to Group Manager (GM) request to define Query Scope (QS) GM computes and sends back QS Node 1 sends query to acquaintances in QS, and reports this fact to GM Nodes 2 and 4 send answer to node 1 Nodes propagate the query to theirs acquaintances from QS and report this fact to GM And so on… Nodes which do not propagate any further, report this fact to GM Propagation stops when “no more propagation” received from all boundary nodes A Proposed Strategy for Query Propagation “no more propagation from 8” “no more propagation from 9” 5. “nodes 2 and 4 are reached” “node 8 is reached” “node 6 is reached” GM 3. QS (, topic) = ? 4. QS (, topic)= (2, 4, 6, 8, 9, 11) 9 6 2 2. Q (, topic) ←Res2 10 7 1. Q () ←Res4 1 4 11 3 5 8

THE RUNNING EXAMPLE

“Toy” Databases
Recall the Motivating Example:
• Family Doctor DB F: Prescription (PatID, P_Name, Illness_Desc, StartDate, RecoveryDate, Treatment, Type, Prescriptions)
• Hospital DB H: Patients (PID, Name, Disease, Treatment_Desc, In, Out)
• Medical Office DB M: Accidents (P_id, FN, LN, Address_Reason, Treatment_Taken, Prescription_Given, Date)
John, who suffers the accident, is described in H with ID “P12” and in F with ID “8”; when he comes to M, he is assigned ID “A13”.

Query Example
Let's suppose the query QM is posed to M:
    Select FN, LN, Address_Reason, Treatment_Taken, Prescription_Given, Date
    From “M:Accidents”
    Where Address_Reason Like (‘%Fracture%’ Or ‘%Dislocation%’) And PID = ‘A13’
QM is marked as a global query with topic T = “Medical Care in Canada”.
After some search, T is matched with the topic “Medical Care in Toronto” of the interest group G.

Group G
• GM computes the query scope QS = G = {F, H, P} for query QM
• M gets acquainted with H
• M: Accidents and H: Patients are matched
• As a result, a set of Coordination Rules is generated
[Figure: H is acquainted with F and P is acquainted with F; dashed lines are group metadata channels; H is the GM of G]

Examples of Coordination Rules
Coor # 1
    Event: M:Q
    Condition: Q:(Address_Reason Select OR Treatment_Taken Select) AND (PID = ‘A13’ Where)
    Action: Q = Apply (Q, Corr_Rules_Query)
            Send (Q, H)
Coor # 2
    Event: M:RH
    Condition: None
    Action: RM = Apply (RH, Corr_Rules_Results)
Here Corr_Rules_Query and Corr_Rules_Results are correspondence rules which translate the outgoing query and the incoming results, respectively.

Query Propagation
P is not reachable because there is no acquaintance path from M to P.
The following queries are circulating in the graph:
    QH = Select Name, Disease, Treatment_Desc
         From “H:Patients”
         Where Disease Like (‘%Fracture%’ Or ‘%Dislocation%’) And PID = ‘P12’
    QF = Select P_Name, Illness_Desc, Treatment
         From “F:Prescriptions”
         Where Illness_Desc Like (‘%Fracture%’ Or ‘%Dislocation%’) And PID = ‘8’

Results Propagation and Reconciliation
H and F generate the following results:
    ResH = <‘John’, ‘Forearm dislocation’, ‘Bandage’>
    ResF = <‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
When they reach M, the results are reconciled as follows:
    ResM = <‘John’, ‘Forearm dislocation’, ‘Bandage’>
           <‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
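The reconciliation shown here looks like a plain union of the translated result sets. A minimal sketch under that assumption follows; the `reconcile` helper is hypothetical, and a real reconciliation step might also need to merge tuples that describe the same fact in different words, which this sketch does not attempt.

```python
def reconcile(*result_sets):
    """Merge result sets from several acquaintances, dropping exact duplicates."""
    merged, seen = [], set()
    for results in result_sets:
        for row in results:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


res_h = [("John", "Forearm dislocation", "Bandage")]
res_f = [("John", "Leg fracture", "Leg put in plaster")]
res_m = reconcile(res_h, res_f)
print(res_m)
```

Arrival order is preserved, so the merged answer lists the tuple from H before the tuple from F, as on the slide.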

Variance and Good Enough Answers
• Good Enough answers: ResM is incomplete, since some fields from H: Patients and F: Prescription are missing. Nevertheless the results are good enough because they still serve the needs of M
• Network Variance: if F is down, the results are even more incomplete
• Database Variance: if M gets acquainted with F instead of H, only ResF is retrievable. F has a different “vision” of the world, as it is not acquainted with H
• Query Variance: if the ID of John in QM is replaced by the ID of another patient who is not shared, then no Coordination Rules fire and therefore there is no propagation
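Network and database variance in the running example can be illustrated by checking which nodes are able to contribute answers under different conditions. This is a sketch under assumptions: acquaintance links are treated as directed edges (M to H, H to F, P to F), and the `answering_nodes` helper is a hypothetical name. Query variance depends on rule conditions and is not modelled here.

```python
def answering_nodes(start, acquaintances, down=frozenset()):
    """Nodes that can contribute answers: reachable from start and currently up."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for peer in acquaintances.get(node, ()):
            if peer not in seen and peer not in down:
                seen.add(peer)
                stack.append(peer)
    return seen


# Running example: M is acquainted with H, H with F, and P with F.
base = {"M": ["H"], "H": ["F"], "P": ["F"]}
normal = answering_nodes("M", base)
f_down = answering_nodes("M", base, down={"F"})   # network variance

# Database variance: M gets acquainted with F instead of H.
alt = {"M": ["F"], "H": ["F"], "P": ["F"]}
via_f = answering_nodes("M", alt)
print(sorted(normal), sorted(f_down), sorted(via_f))
```

In every case P stays outside the answer, because no acquaintance link leads to it from M.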

Conclusion
This is a first investigation of how to make databases interact in a P2P network. There are four main dimensions:
• We must integrate data coming from autonomous, most often semantically heterogeneous, databases
• We must deal with network, database, and query variance. This is why we talk of data coordination, as distinct from data integration
• We will almost never get correct and complete answers. We must be content with answers which are good enough
• There is a need to tune metadata. This is required in order to cope with the dynamics of a P2P network

References
• Project website: http://www.dit.unitn.it/~p2p/
• P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu, “Data Management for Peer-to-Peer Computing: A Vision”, WebDB 2002
• L. Serafini, F. Giunchiglia, J. Mylopoulos, and P. Bernstein, “The Local Relational Model: Model and Proof Theory”, tech. rep., IRST, Trento
