Large-scale data sharing by exploiting gossiping

1st Gossple Workshop on Social Networking (December 2010)
Large-scale data sharing by exploiting gossiping
Esther Pacitti
Saphir, Sophia Antipolis - Méditerranée

Context: P2P Data Sharing
• We consider P2P online communities where participants can be
  • Professionals (researchers, engineers, support staff, etc.) who use web-scale collaboration in their workplace
• Large scale of users and data (clouds, grids, Internet)
• Example applications:
  • P2P recommendation systems
    • Useful for processing scientific workflows among participants’ peers
  • P2P query reformulation
    • Clinical case sharing among doctors or physicians
  • P2P CDN
• Projects:
  • ANR DataRing (2009-2012, P2P online communities)
  • Datluge (2010-2012, with UFRJ, Brazil, on P2P scientific workflows)

Motivations
• Bioinformatics
• Chemistry, Materials Science and Physics
• Computer Science

P2PRec: document recommender
• Huge graph: G = (D, U, E, T), where
  • D is the set of shared documents
  • U is the set of users in the system
  • E is the set of edges between the users, such that there is an edge e(u,v) if users u and v are friends
  • T is the set of users’ topics of interest
• Problem: given a query, recommend the most relevant documents
• Our approach
  • Reduce the search space by identifying relevant users
• Identify relevant users
  • Users that store/download enough high-quality documents and become a kind of provider in specific topics
  • Recommended by trusted friends
• P2P overlay: Semantic-Gossiping
  • Disseminate relevant users and their topics of interest

P2PRec*: document recommender
• Topics of interest
  • With respect to the documents a user stores
  • Extracted automatically
• Friendship network
  • Explicit friendship (may be leveraged with implicit friendship)
  • Expresses users' trust
  • Implemented as FOAF files (friend-of-a-friend files, a machine-readable vocabulary serialized in RDF/XML)
• Keyword queries
  • Mapped to topics
  • Mostly related to the user's topics of interest
• Measure to
  • Check the similarity of users with respect to their topics (Dice coefficient)
  • Compute the relevance of a user
*Joint work with F. Draidi, P. Valduriez, B. Kemme, to appear as an Inria report
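The Dice coefficient used to compare users by their topics can be sketched as follows. This is a minimal illustration that assumes each user's interests are held as a plain set of topic labels; the function name and the handling of empty profiles are assumptions, not details from the slides.

```python
def dice_coefficient(topics_u, topics_v):
    """Dice coefficient between two topic sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(topics_u), set(topics_v)
    if not a and not b:
        # Assumption: two empty profiles are treated as dissimilar.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Illustrative topic sets (hypothetical values).
u1_topics = {"t1", "t2"}
u5_topics = {"t2", "t3"}
print(dice_coefficient(u1_topics, u5_topics))  # 0.5
```

The value ranges from 0 (no shared topic) to 1 (identical topic sets), which makes it easy to compare against a threshold such as τ.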

Semantic-Gossiping
[Figure: local views of u1 and u5 before and after a gossip exchange. Each local view lists users with their topics (for example u1 topics: t1, t2). u1's FOAF file holds its friends: a link to u5's FOAF file and u5's topics.]
• The Dice coefficient is computed between the users' topics
• If the distance between u_u and u_v > τ, ask for friendship
• If the friendship is accepted, add u_v to the FOAF file

Relevant Users
• Users' topics of interest are automatically extracted using LDA*
  • by inspecting the documents' topic vectors
• A user u is considered relevant on a topic $t \in T_u$ if a percentage of its documents have high quality in topic t
• Each document doc at user u has
  • A rate given to doc: $rate_{doc}$
  • A topic vector (extracted using LDA): $V_{doc} = \{w_{doc}^{t_1}, \ldots, w_{doc}^{t_d}\}$
• doc is considered high quality in a topic t, written $quality_t(doc, u)$,
  • if $w_{doc}^{t} \cdot rate_{doc}$ > a threshold value
• A user can be relevant in more than one topic
*Latent Dirichlet Allocation (topic classifier)
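The relevance test above can be sketched in a few lines. This is only an illustration: the rating scale, the quality threshold, the minimum fraction of high-quality documents and the function names are all assumptions, since the slides do not give concrete values.

```python
def is_high_quality(weight, rate, threshold):
    """quality_t(doc, u): topic weight times rating exceeds a threshold."""
    return weight * rate > threshold


def relevant_topics(docs, quality_threshold, min_fraction):
    """Return the topics on which a user is relevant.

    docs: list of (rate, topic_vector) pairs, where topic_vector maps a
    topic label to its weight in the document (e.g. produced by LDA).
    min_fraction: hypothetical share of the user's documents that must be
    high quality in a topic for the user to count as relevant on it.
    """
    if not docs:
        return set()
    topics = {t for _, vector in docs for t in vector}
    relevant = set()
    for t in topics:
        good = sum(
            1 for rate, vector in docs
            if is_high_quality(vector.get(t, 0.0), rate, quality_threshold)
        )
        if good / len(docs) >= min_fraction:
            relevant.add(t)
    return relevant


# Hypothetical user with three rated documents (ratings on a 1-5 scale).
docs = [
    (5, {"t1": 0.8, "t2": 0.1}),
    (4, {"t1": 0.7, "t2": 0.2}),
    (2, {"t1": 0.1, "t2": 0.9}),
]
print(relevant_topics(docs, quality_threshold=2.0, min_fraction=0.5))
```

With these values two of the three documents are high quality in t1 and none in t2, so the user is relevant on t1 only.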

Query Processing
• Implements recommendation
• Input: keywords
• Output:
  • Links to a set of good-quality documents. May include links to documents on the topics of interest of a friend (query expansion)
  • Popularity and similarity info
• Example: doctors studying the behavior of a gene X may be glad to learn about the diseases it can cause and to check some experimental data sets

Query Processing
[Figure: requester u1 issues query q with q.t = t1 and q.TTL = 2. The query is redirected through FOAF friends with q.TTL decreasing to 1 and then 0. Each contacted peer computes sim(doc, q) and returns recommended documents together with a summary of their similarity and classification info. u1's FOAF file holds u1's topics of interest and its friends: a link to u5's FOAF file and u5's topics.]
• 1) Query q is mapped to a topic or topics Tq
• 2) Select the top-k friends in the FOAF file with respect to the query topics (cosine similarity)
• 3) Redirect the query
• 4) Repeat 2) and 3) recursively until the TTL expires
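The four steps can be sketched as a recursive routine. This is a simplified, in-memory illustration, not the P2PRec implementation: the `peers` dictionary, its field names and the use of cosine similarity for sim(doc, q) are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse topic vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_query(peer_id, q_topics, ttl, peers, k=1, visited=None):
    """Score local documents, then forward q to the top-k friends until TTL is 0."""
    visited = set() if visited is None else visited
    visited.add(peer_id)
    peer = peers[peer_id]
    # Each contacted peer computes sim(doc, q) on its own documents.
    results = {doc: cosine(vec, q_topics) for doc, vec in peer["docs"].items()}
    if ttl > 0:
        candidates = [f for f in peer["friends"] if f not in visited]
        ranked = sorted(
            candidates,
            key=lambda f: cosine(peers[f]["topics"], q_topics),
            reverse=True,
        )
        for friend in ranked[:k]:
            if friend in visited:  # may have been reached through another branch
                continue
            sub = route_query(friend, q_topics, ttl - 1, peers, k, visited)
            for doc, score in sub.items():
                results[doc] = max(score, results.get(doc, 0.0))
    return results


# Hypothetical network: u1 knows u5 and u2; u5 knows u4.
peers = {
    "u1": {"topics": {"t1": 1.0, "t2": 1.0}, "friends": ["u5", "u2"], "docs": {}},
    "u5": {"topics": {"t1": 1.0}, "friends": ["u4"], "docs": {"d5": {"t1": 1.0}}},
    "u2": {"topics": {"t3": 1.0}, "friends": [], "docs": {"d2": {"t3": 1.0}}},
    "u4": {"topics": {"t1": 1.0, "t2": 1.0}, "friends": [],
           "docs": {"d4": {"t1": 1.0, "t2": 1.0}}},
}
results = route_query("u1", {"t1": 1.0}, ttl=2, peers=peers, k=1)
print(sorted(results.items(), key=lambda item: item[1], reverse=True))
```

With k = 1 and TTL = 2 the query goes from u1 to u5 and then to u4, so u2 and its document are never contacted. A larger k trades more messages for potentially better recall.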

Conclusions: P2PRec
• P2PRec (BDA 2010)
  • Find friends (relevant users on similar topics) while gossiping
  • Query processing exploits relevant users with respect to the query topics, recursively (FOAF friends)
• Performance evaluation
  • Recall x Precision x Response Times
• Limitation of LDA: it needs some centralization for training, but it is good enough to validate our general approach
• However, there are other possibilities:
  • Ontology-based automatic annotation
  • This exists for biomedical documents

P2P Query Reformulation*
• P2P Data Management System (PDMS)
• Each peer has:
  • Its own schema (and data)
  • One or more mapping acquaintances, to/from which at least one mapping rule exists
• Goal: given a query, exploit mapping acquaintances as much as possible to enhance query responses
[Figure: peers A and B, each with its own schema (Schema A, Schema B) and data, linked by mapping Mb,a; example query ?= Hospital(x, “San Francisco”).]
*Joint work with A. Bonifati, G. Summa, P. Valduriez, to appear as an Inria report

Concepts
[Figure: peers A and B, each with a schema and data, linked by mapping Mb,a.]
• Query Q: ?= Hospital($X, “San Francisco”)
• Rewriting Q’ (along Mb,a): ?= HealthCareInst($X, “San Francisco”, $Z)
• Source schema: Hospital [0..*] (name, location); Grant [0..*] (amount, institution, manager); Doctor [0..*] (name, salary)
• Target schema: HealthCareInst [0..*] (name, city, id); Grant [0..*] (amount, scientist)
• Mapping rule Mb,a: Hospital(x, y) ⇢ HealthCareInst(x, y, z)
  • Body: Hospital(x, y)
  • Head: HealthCareInst(x, y, z)
  • Body and head are made of atoms

Mapping Relevance
• Each time a query gets translated by exploiting a mapping, we get a relevant rewriting
• The relevance can be forward (along) or backward (against), depending on the matched side of the mapping
• Goal:
  • Collect as many rewritings as possible
  • Find the most interesting paths to take (avoid useless paths)
• Example for ?= Hospital(x, “San Francisco”)
  • M1: Hospital(x, y) ⇢ HealthCareInst(x, y, z)
  • M2: Institution(x, y, z) ⇢ Hospital(x, y)

Problem
[Figure: a network of peers (A, B, C, D, Z, ...) linked by mappings such as Mb,a, Mc,a, Mb,d and Mb,z; query Q is rewritten into Q’ and Q’’ along or against the mappings.]
• 1) How to choose the most relevant paths to undertake in the reformulation task?
• 2) Are there other peers in the network which can be contacted?

Acquaintances
• Gossiping acquaintances
  • Potential friends that dynamically appear in the local semantic view (LSV)
• Mapping acquaintance
  • There is at least one direct mapping towards it (friend)
  • Established manually
• Social acquaintance (FOAF friend)
  • No direct mapping is needed towards it
  • There are some common interests
  • Established explicitly

Our Approach
• Gossip to disseminate mapping rule information in order to find friends
• Users' topics of interest
  • are expressed according to the schema information or the topics of past queries
• Measure to
  • Compute the relevance of a mapping with respect to a query
  • Compute the similarity between users
• Exploit recursively (to translate a query)
  • Mapping acquaintances
  • Social acquaintances

Gossiping Acquaintances

Social Acquaintances
• Friend
  • Shares common topics of interest
• Interests
  • Formulated by queries
  • Elements of the peer’s schema
• Approach: use the semantic view to discover friends
• Example queries against a peer's schema:
  • ?= Hospital(x, “San Francisco”)
  • ?= State(y, z, “California”)
  • ?= Doctor(w, k)
  • ?= Patology(“heart”, x)
  • …

Compute Relevance
• Goal: given a query and a mapping rule, determine if the mapping is relevant to the query
• Method (standard match semantics)
  • Atom label matching
  • Parameter compatibility
• Example for ?= Hospital(x, “San Francisco”)
  • M1: Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)
  • M2: Hospital(x, y, w) AND State(x, z) ⇢ HealthCareInst(x, y, z)
  • M3: Ospedale(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)
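One possible reading of the two matching conditions is sketched below. It assumes that "parameter compatibility" means equal arity plus no clash between constants, and that variables are written with a leading `$` as in `$X`; both conventions, and all function names, are assumptions made for illustration.

```python
def is_var(term):
    """Assumed convention: variables start with '$', anything else is a constant."""
    return term.startswith("$")


def atoms_match(query_atom, mapping_atom):
    """Atom label matching plus parameter compatibility."""
    (q_label, q_args), (m_label, m_args) = query_atom, mapping_atom
    if q_label != m_label:            # atom label matching
        return False
    if len(q_args) != len(m_args):    # same number of parameters
        return False
    # Two constants must be equal; a variable is compatible with anything.
    return all(is_var(q) or is_var(m) or q == m for q, m in zip(q_args, m_args))


def relevance_direction(query_atom, mapping):
    """Return 'forward' if the body matches, 'backward' if the head matches, else None."""
    body, head = mapping
    if any(atoms_match(query_atom, atom) for atom in body):
        return "forward"
    if any(atoms_match(query_atom, atom) for atom in head):
        return "backward"
    return None


q = ("Hospital", ("$x", "San Francisco"))
head = [("HealthCareInst", ("$x", "$y", "$z"))]
m1 = ([("Hospital", ("$x", "$y")), ("State", ("$x", "$z"))], head)
m2 = ([("Hospital", ("$x", "$y", "$w")), ("State", ("$x", "$z"))], head)
m3 = ([("Ospedale", ("$x", "$y")), ("State", ("$x", "$z"))], head)
m_back = ([("Institution", ("$x", "$y", "$z"))], [("Hospital", ("$x", "$y"))])

for name, m in [("M1", m1), ("M2", m2), ("M3", m3), ("Institution rule", m_back)]:
    print(name, relevance_direction(q, m))
```

Under this reading M1 is relevant in the forward direction, M2 fails on arity, M3 fails on the atom label, and the Institution rule from the Mapping Relevance slide is relevant in the backward direction.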

Compute Relevance
• AF-IMF measure, inspired by TF-IDF*
• AF (Atom Frequency)
  • Local measure, establishing the importance of the query atom in the current mapping
• IMF (Inverse Mapping Frequency)
  • Distributed measure, establishing the overall importance of the query atom
• The relevance of a mapping with respect to q is AF * IMF
*term frequency-inverse document frequency

Compute Relevance (AF)
• About the applied measure
  • To increase the effectiveness of the measure we distinguish, again, forward/backward relevance
  • Forward measure: computed on the body of the mapping
  • Backward measure: computed on the head of the mapping
• Example for ?= Hospital(x, “San Francisco”)
  • M1: Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z), AF = 1/2
  • M2: Institution(x, y, z) ⇢ Hospital(x, y), AF = 1

Compute Relevance (IMF)
• IMF requires a way to get a value for
  • The total number of mappings
  • The total number of mappings containing that atom
• To do that, we can inspect the semantic view of the peer
• Also by sending inquiries to peers in the FOAF file
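A small sketch of the AF-IMF product follows. AF is taken as the share of atoms on the matched side that carry the query atom's label, which reproduces the 1/2 and 1 values of the AF example. The slides do not give the IMF formula, so the logarithmic form below is an assumption borrowed from the usual IDF definition; the `Doctor`, `Physician`, `Grant` and `Funding` mappings are hypothetical filler.

```python
import math


def atom_frequency(query_label, side):
    """Share of the atoms on one side of a mapping that match the query atom."""
    if not side:
        return 0.0
    return sum(1 for label in side if label == query_label) / len(side)


def inverse_mapping_frequency(query_label, known_mappings):
    """Assumed IDF-style form: log(total mappings / mappings containing the atom)."""
    containing = sum(
        1 for body, head in known_mappings
        if query_label in body or query_label in head
    )
    if containing == 0:
        return 0.0
    return math.log(len(known_mappings) / containing)


def af_imf(query_label, mapping, known_mappings, direction="forward"):
    """Relevance of a mapping w.r.t. a query atom; forward uses the body, backward the head."""
    body, head = mapping
    side = body if direction == "forward" else head
    return atom_frequency(query_label, side) * inverse_mapping_frequency(
        query_label, known_mappings
    )


# Mappings reduced to atom labels: (body labels, head labels).
m1 = (["Hospital", "State"], ["HealthCareInst"])
m2 = (["Institution"], ["Hospital"])
m3 = (["Doctor"], ["Physician"])   # hypothetical
m4 = (["Grant"], ["Funding"])      # hypothetical
known = [m1, m2, m3, m4]           # e.g. collected from the local semantic view

print(atom_frequency("Hospital", m1[0]))             # 0.5
print(atom_frequency("Hospital", m2[1]))             # 1.0
print(af_imf("Hospital", m1, known, "forward"))
print(af_imf("Hospital", m2, known, "backward"))
```

With a logarithmic IMF, an atom that appears in every known mapping gets a score of 0, the same behavior as a term that appears in every document under TF-IDF.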

Translate-Query
• Compute the relevance of the local mappings with respect to Q
• Choose the top-k mappings
• Apply the translation semantics, along/against the mapping direction
• Trigger Translate-Query on the mapping acquaintance, recursively (until TTL)
• Select the FOAF friends to be contacted
  • By looking at the best mapping summaries with respect to Q
• Trigger Translate-Query on the social acquaintance, recursively (until TTL)
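The procedure above can be sketched as a recursive function over an in-memory network. This is a strong simplification under stated assumptions: queries and mappings are reduced to single atom labels, the scoring function is a stand-in for AF-IMF, and the `peers` structure and its field names are hypothetical.

```python
def label_score(query_label, mapping):
    """Stand-in for the relevance measure: 1.0 if the mapping mentions the query atom."""
    return 1.0 if query_label in (mapping["body"], mapping["head"]) else 0.0


def translate(query_label, mapping):
    """Rewrite along (body -> head) or against (head -> body); None if no match."""
    if query_label == mapping["body"]:
        return mapping["head"]
    if query_label == mapping["head"]:
        return mapping["body"]
    return None


def translate_query(peer_id, query_label, ttl, peers, score, top_k=2, seen=None):
    """Return the set of (peer, query) pairs reached by the reformulation."""
    seen = set() if seen is None else seen
    if (peer_id, query_label) in seen:
        return seen
    seen.add((peer_id, query_label))
    if ttl == 0:
        return seen
    peer = peers[peer_id]
    # Rank local mappings by relevance to Q and keep the top-k.
    ranked = sorted(peer["mappings"], key=lambda m: score(query_label, m), reverse=True)
    for mapping in ranked[:top_k]:
        rewritten = translate(query_label, mapping)
        if rewritten is not None:
            # Recurse on the mapping acquaintance with the translated query.
            translate_query(mapping["peer"], rewritten, ttl - 1, peers, score, top_k, seen)
    # Contact FOAF friends whose mapping summaries look relevant to Q.
    for friend in peer["friends"]:
        if any(score(query_label, m) > 0 for m in peers[friend]["mappings"]):
            translate_query(friend, query_label, ttl - 1, peers, score, top_k, seen)
    return seen


# Hypothetical network: A maps to B; C is a social acquaintance of A and maps to D.
peers = {
    "A": {"mappings": [{"body": "Hospital", "head": "HealthCareInst", "peer": "B"}],
          "friends": ["C"]},
    "B": {"mappings": [], "friends": []},
    "C": {"mappings": [{"body": "Institution", "head": "Hospital", "peer": "D"}],
          "friends": []},
    "D": {"mappings": [], "friends": []},
}
reached = translate_query("A", "Hospital", ttl=2, peers=peers, score=label_score)
print(sorted(reached))
```

In this example the query reaches B through a forward rewriting, and reaches D through the social acquaintance C and a backward rewriting, which is the extra coverage that FOAF links are meant to provide.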

Performance Evaluation
• Baseline
  • No gossiping, original query propagated
• Baseline+
  • No gossiping, translated query propagated
• Baseline#
  • No gossiping, translated query propagated, local measure to sort mappings (by using AF only)
• Full-
  • Gossiping, translated query propagated, AF-IMF measure to sort mappings, no FOAF links (only local mappings)
• Full (P2PRec)
  • Gossiping, translated query propagated, AF-IMF measure to sort mappings, FOAF links exploited
• What is evaluated: effectiveness of AF-IMF, LSV and gossiping

Conclusions
• P2P Query Reformulation
  • Gossiping is used to disseminate mapping rule information
  • Exploits relevant mappings recursively
    • Mapping acquaintances
    • Social acquaintances
• Initial performance results:
  • Very good recall results (over 90%)
  • Linear scale-up
  • Trade-off between recall and response time
• Previous work uses
  • DHTs or a centralized mediation model

About Montpellier
• Best quality of life in France
• Important laboratories (LIRMM) and research institutes (INRA, CIRAD, etc.)
• University of Montpellier is part of the « opération campus »
• Soon we will have a direct TGV line to Barcelona (1 hour)
