
Presentation Transcript


  1. Odysseas Papapetrou, 18 April 2011. Approximate algorithms for efficient indexing, clustering, and classification in Peer-to-peer networks. L3S Research Center, University of Hannover, Germany

  2. Introduction • Application scenarios of Peer-to-peer • File sharing, IP telephony, video streaming, data analysis, collaborative spam filtering, … • Frequent building blocks • Information retrieval • Data mining • Challenges • Large networks • High churn • High network cost

  3. Introduction • Information retrieval and data mining in P2P networks • Information retrieval • Maintaining an inverted index for keyword search • Near-duplicate detection • Data mining • Clustering over a P2P network • Classification over a P2P network

  4. Outline • Introduction • PCIR: Maintaining the inverted index for keyword search • Related work • Basic PCIR • Clustering-enhanced PCIR • Experimental evaluation • PCP2P: P2P text clustering • Related work • PCP2P • Experimental evaluation • Brief summary • POND: P2P near duplicate detection • CSVM: P2P classification • Conclusions

  5. Information retrieval over P2P • The P2P information retrieval model • Thousands of nodes, constantly changing! • Standard users • Digital libraries • No central server! • Google-style search • [Figure: example files shared by different peers, e.g., photos (chania.png, crete.png, hannover.png), music (christmas carol.mp3), and documents (recipes.pdf, les miserables.doc, the king speech.mpeg)]

  6. Unstructured P2P networks • Peers form a connected graph • Query flooding with a time-to-live • Synopses: Gnutella-QRP [Gnu], EDBFs [Infocom05], PlanetP [HPDC] • Super-peers: Gnutella 0.6, FastTrack [ComNet06], [ICDE03], [WWW03] • Scalability to large networks and quality of results • Rodrigues and Druschel: ‘Good at finding hay, but bad at finding needles’ [CACM10]

  7. Structured P2P over DHT • Distributed Hash Tables (DHTs) • Functionality of a hash table: put(key, value) and get(key) – similar to centralized hash tables • Chord: Peers organized in a ring structure • Finger tables: peers establish links to peers at exponentially increasing distances on the ring • Similar to binary search → log(n) messages per DHT lookup
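
A toy sketch of the put/get interface described on this slide, under simplifying assumptions of mine: keys are stored at the successor of their hash on the ring, and the routing that a real Chord network would do via finger tables (in O(log n) messages) is replaced by direct dictionary access. All names and parameters are illustrative.

```python
import hashlib
from bisect import bisect_left

M = 32                                   # identifier bits; ring positions 0 .. 2**M - 1

def ring_id(key: str) -> int:
    """Map a key (a term or a peer address) onto the identifier ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** M)

class ToyRing:
    """Toy DHT: each key lives at the first peer clockwise from its hash (its successor).
    A real Chord deployment reaches that peer via finger tables in O(log n) messages."""
    def __init__(self, peer_ids):
        self.peers = sorted(set(peer_ids))
        self.store = {p: {} for p in self.peers}

    def successor(self, point):
        i = bisect_left(self.peers, point % (2 ** M))
        return self.peers[i % len(self.peers)]     # wrap around at the end of the ring

    def put(self, key, value):
        self.store[self.successor(ring_id(key))].setdefault(key, []).append(value)

    def get(self, key):
        return self.store[self.successor(ring_id(key))].get(key, [])

ring = ToyRing([ring_id(f"peer-{i}") for i in range(50)])
ring.put("crisis", ("peer-7", 12))       # posting: term -> (peer, term frequency)
print(ring.get("crisis"))
```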

  8. Structured P2P over DHT • The DHT inverted index maps each term (DHT key) to the list of relevant peers for that term (DHT value) • State-of-the-art approaches vary in index granularity: • Minerva • Alvis • sk-Stat, mk-Stat • …

  9. IR and P2P: DHT publishing steps • Each peer extracts the frequencies for all its terms • Each peer publishes its scores in the DHT inverted index • One DHT lookup for each of its terms → log(n) messages per lookup • Periodic execution

  10. Structured P2P over DHT • DHT-based indexes for distributed search • O(log(n)) messages per term lookup per peer → total publishing cost: • 5000 peers, 1000 terms per peer: ~61 million messages • How can we reduce the network cost? • Key insight: Some terms are very popular across peers! Can we exploit this to reduce the indexing cost?
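
As a quick check of the figure quoted above: the flat approach pays one lookup of roughly log2(n) messages per term per peer.

```python
import math

peers, terms_per_peer = 5000, 1000
msgs_per_lookup = math.log2(peers)                   # ~12.3 routing hops per DHT lookup
total_msgs = peers * terms_per_peer * msgs_per_lookup
print(f"~{total_msgs / 1e6:.1f} million messages")   # ~61.4 million
```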

  11. PCIR: Peer Clusters for Inf. Retrieval • Basic approach • All peers are part of the global DHT • Peers also form groups • Each peer submits its index to its super-peer • Super-peers perform: • DHT lookups • DHT updates for all distinct group terms

  12. Updating the super-peers • Step 1: Peer joins a group, or creates a group itself • Prob[newGroup]=0.1 • Used to determine the ratio of peers/super-peers

  13. Updating the super-peers • Step 2: Peers submit their terms to the group’s super peer • No DHT lookup required

  14. Updating the DHT • Step 3: Super peer publishes the group’s terms to the DHT • Exploits term overlap! • 1 DHT lookup per term per group

  15. Updating the DHT • Step 3: Super peer publishes the group’s terms to the DHT • Exploits term overlap! • 1 DHT lookup per term per group

  16. PCIR algorithm • Steps • Peer joins a group or forms its own • Peer submits its terms to the super-peer of its group • Super-peer publishes the group’s data to the DHT • Steps 2-3 repeated periodically to compensate for churn • Result: a superset of the SOTA inverted index – no information loss → query execution as in the SOTA!
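
A minimal sketch of steps 2-3 above, with a dictionary-backed stand-in for the DHT (each put would cost O(log n) messages in the real network); all names are illustrative. The point is that the super-peer issues one DHT operation per distinct group term, so overlapping vocabularies are paid for only once.

```python
from collections import defaultdict

class ToyDHT:
    """Stand-in for the DHT index; counts how many lookups/updates were issued."""
    def __init__(self):
        self.table, self.lookups = {}, 0
    def put(self, key, value):
        self.lookups += 1
        self.table.setdefault(key, []).extend(value)

class SuperPeer:
    def __init__(self, dht):
        self.dht = dht
        self.postings = defaultdict(list)            # term -> [(peer, score), ...]

    def submit(self, peer_id, term_scores):
        """Step 2: a group member submits its terms (no DHT lookups needed)."""
        for term, score in term_scores.items():
            self.postings[term].append((peer_id, score))

    def publish(self):
        """Step 3: one DHT operation per *distinct* group term, exploiting overlap."""
        for term, entries in self.postings.items():
            self.dht.put(term, entries)

# Two peers sharing 2 of their 3 terms -> 4 DHT operations instead of 6.
dht = ToyDHT()
sp = SuperPeer(dht)
sp.submit("peerA", {"crisis": 5, "market": 3, "chania": 1})
sp.submit("peerB", {"crisis": 2, "market": 7, "hannover": 4})
sp.publish()
print(dht.lookups)                                   # 4
```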

  17. How many super-peers? • Tradeoff: with 1 super-peer only, maximum term overlap but the super-peer gets overloaded (not a P2P solution anymore); with many super-peers, low workload at each super-peer but less overlap • Balance the super-peer workload and term overlap • User sets an acceptable load per super-peer • Maximum network cost • Analysis relying on network statistics → number of super-peers • Still high overlap

  18. Clustering-enhanced PCIR • Clustering-enhanced PCIR • Cluster similar peers together to increase term overlap • Larger term overlap → fewer distinct terms per cluster → even fewer DHT lookups

  19. How to cluster the peers • Clustering a peer: • Peers and super-peers summarize their term sets as Bloom filters • The peer selects the most promising super-peers using the DHT, and sends its Bloom filter to them • Probabilistic guarantees that the peer joins the best cluster • [Figure: peer Bloom filter BFp compared against super-peer Bloom filters BFsp1–BFsp4]
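
The thesis' Bloom-filter machinery (cardinality estimation, dynamic length adaptation) is more involved; the sketch below only illustrates the basic idea under assumed parameters (filter size 1024, 3 hash functions): term sets become fixed-size bit arrays, and overlapping bit patterns hint at overlapping vocabularies.

```python
import hashlib

class BloomFilter:
    """Compact summary of a peer's term set (size and hash count are illustrative)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def shared_bits(self, other):
        """Bits set in both filters: a rough proxy for the term overlap of two peers."""
        return sum(a & b for a, b in zip(self.bits, other.bits))

bf_peer, bf_superpeer = BloomFilter(), BloomFilter()
for t in ["crisis", "market", "shares", "politics"]:
    bf_peer.add(t)
for t in ["crisis", "market", "politics", "merkel", "election"]:
    bf_superpeer.add(t)
print(bf_peer.shared_bits(bf_superpeer))   # higher value -> more promising cluster
```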

  20. Evaluation • Measures • Average messages per peer • Average transfer volume per peer • More results in the thesis • Datasets • Reuters Corpus Volume 1, 160,000 articles • Medline, 100,000 abstracts • Comparisons • Flat DHT indexing (e.g., Minerva, Alvis, mk-Stat, sk-Stat) • Basic PCIR • Clustering-enhanced PCIR

  21. Network cost vs. super-peer workload • Baseline (100%): Minerva – peer-granularity index

  22. Network cost at super peers

  23. PCIR: Indexing for keyword search • Conclusions • Basic and clustering-enhanced PCIR • Exploit term overlap across peers • Maintain the same inverted index as SOTA approaches • No peer gets overloaded • Odysseas Papapetrou, Wolf Siberski, Wolfgang Nejdl: PCIR: Combining DHTs and peer clusters for efficient full-text P2P indexing. Computer Networks 54(12): 2019-2040 (2010) • Odysseas Papapetrou, Wolf Siberski, Wolfgang Nejdl: Cardinality estimation and dynamic length adaptation for Bloom filters. Distributed and Parallel Databases 28(2): 119-156 (2010) • Odysseas Papapetrou. Full-text Indexing and Information Retrieval in P2P systems, in: Proc. Extending Database Technology PhD Workshop (EDBT), 2008, Nantes, France. • Odysseas Papapetrou, Wolf Siberski, Wolf-Tilo Balke, Wolfgang Nejdl. DHTs over Peer Clusters for Distributed Information Retrieval, in: Proc. IEEE 21st International Conference on Advanced Information Networking and Applications (AINA), 2007, Niagara Falls, Canada.

  24. P2P text clustering • Clustering of documents without a central server • Important data mining technique • Useful for information retrieval • Challenging because of network size, and high dimensionality of documents and cluster centroids!

  25. Related work • LSP2P [TKDE09] • Unstructured P2P network • Peers gossip their centroids • Algorithm repeats until convergence • Assumption: Peers have documents from all classes!

  26. Related work • HP2PC [TKDE08] • Peers organized in a hierarchy • Each level divided into neighborhoods • Super-peers at each neighborhood • [Figure: hierarchy of neighborhoods with a single root]

  27. Related work: K-Means • Initialize k random cluster centroids • Assign each document to the nearest cluster • Repeat until convergence • Example in two dimensions • [Figure: documents (o) and two centroids (C) plotted against dimensions 1 and 2]

  28. Related work: K-Means • Initialize k random cluster centroids • Assign each document to the nearest cluster • Repeat until convergence • Example in two dimensions • [Figure: a document with cosine 0.5 to one centroid and 0.8 to the other is assigned to the latter]

  29. Related work: K-Means • Initialize k random cluster centroids • Assign each document to the nearest cluster • Repeat until convergence • Example in two dimensions • [Figure: the assignment step repeated for the remaining documents]

  30. Related work: K-Means • Initialize k random cluster centroids • Assign each document to the nearest cluster • Repeat until convergence • Example in two dimensions • [Figure: centroids recomputed after all documents have been assigned]
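
For reference, a compact centralized K-Means over sparse term vectors with cosine similarity, following the three steps on the slides above; the vector representation, random initialization, and fixed iteration count are simplifications of mine.

```python
import random
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def kmeans(docs, k, iters=10):
    centroids = [Counter(d) for d in random.sample(docs, k)]       # 1. random init
    for _ in range(iters):                                         # 3. repeat
        clusters = [[] for _ in range(k)]
        for d in docs:                                             # 2. nearest = max cosine
            best = max(range(k), key=lambda i: cosine(Counter(d), centroids[i]))
            clusters[best].append(d)
        for i, members in enumerate(clusters):                     # recompute centroids
            if members:                                            # (sum is fine: cosine is scale-invariant)
                centroids[i] = sum((Counter(m) for m in members), Counter())
    return centroids

docs = [{"crisis": 3, "market": 2}, {"shares": 1, "market": 4},
        {"football": 5, "tennis": 2}, {"tennis": 3, "basket": 1}]
print(kmeans(docs, k=2))
```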

  31. Distributing K-Means • DKMeans: An unoptimized distributed K-Means • Assign maintenance of each cluster to one peer: cluster holders • Peer P1 wants to cluster its document d • Send d to all cluster holders • Cluster holders compute cosine(d, c) • P1 assigns d to the cluster with max. cosine, and notifies the cluster holder • Problem • Each document sent to all cluster holders • Network cost: O(|docs| × k) • Cluster holders get overloaded • [Figure: peer P1 sends d to the cluster holders of clusters 1 and 2, which compute cos(d, c1) and cos(d, c2)]
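
A sketch of the unoptimized DKMeans assignment described above, with message passing simulated by direct calls and all names illustrative; the counter makes the k messages per document explicit, which is exactly the cost PCP2P attacks next.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ClusterHolder:
    """Peer maintaining one centroid; counts how many documents were shipped to it."""
    def __init__(self, centroid):
        self.centroid, self.received = centroid, 0

    def score(self, doc):
        self.received += 1                         # the document crossed the network
        return cosine(doc, self.centroid)

def dkmeans_assign(doc, holders):
    """Unoptimized DKMeans: d is sent to ALL k cluster holders, O(|docs| * k) overall."""
    return max(holders, key=lambda h: h.score(doc))

holders = [ClusterHolder({"market": 3, "crisis": 2}), ClusterHolder({"tennis": 4})]
dkmeans_assign({"crisis": 1, "shares": 2}, holders)
print(sum(h.received for h in holders))            # k messages for a single document
```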

  32. PCP2P: Probabilistic Clustering over P2P • PCP2P: Approximation to reduce the network and computational cost… • Compare each document only with the most promising clusters • Pre-filtering step: Find candidate clusters for a document using an inverted index • Full comparison step: Use compact cluster summaries to exclude more candidate clusters

  33. PCP2P: Probabilistic Clustering over P2P • Approximation to reduce the network and computational cost… • Compare each document only with the most promising clusters • Key insight: probabilistic topic models – a cluster and a document about the same topic will share some of the most frequent topic terms, e.g., Topic “Economy”: crisis, shares, financial, market, … • Estimate these terms, and use them as rendezvous terms between the documents and the clusters of each topic • [Figure: probabilistic topic model for topic “Economy” generating the terms crisis, shares, market]

  34. PCP2P: Probabilistic Clustering over P2P • Identifying the rendezvous terms • Frequent cluster/document terms: term freq. > thres1 / thres2 • Clusters index their summaries at all terms with TF > thres1 • Cluster summary: <cluster holder IP address, frequent cluster terms, length> • E.g., <132.11.23.32, (politics, 157), (merkel, 149), 3211>; with thres1 = 140, the summary of cluster 1 is added to the DHT under “politics” and “merkel”

  35. Pre-filtering step • Approximation to reduce the network cost… • Pre-filtering step: Efficiently locate the most promising centroids using the DHT and the rendezvous terms • Look up only the document’s most frequent terms → candidate clusters • Send d only to these clusters for comparison • Assign d to the most similar cluster • Example: with thres2 = 12, the peer asks the DHT which clusters published “politics” and which published “germany”, retrieves the summaries of clusters 1, 7, and 4 as candidates, sends d to them, and assigns d to the candidate with the highest cosine (0.4, vs. 0.3 and 0.2 for the others)
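
The two sides of this rendezvous mechanism (slides 34-35) can be sketched as follows, assuming illustrative thresholds and a dictionary-backed stand-in for the DHT: cluster holders index their summaries under their frequent terms, and a peer retrieves candidate clusters by looking up only its document's frequent terms; the full comparison then involves only those candidates.

```python
class SummaryIndex:
    """Stand-in for the DHT inverted index: rendezvous term -> cluster summaries."""
    def __init__(self):
        self.table = {}
    def put(self, term, summary):
        self.table.setdefault(term, []).append(summary)
    def get(self, term):
        return self.table.get(term, [])

def publish_summary(index, holder_ip, centroid, thres1):
    """Cluster-holder side: index the summary under every centroid term with tf > thres1."""
    frequent = {t: f for t, f in centroid.items() if f > thres1}
    length = sum(f * f for f in centroid.values()) ** 0.5      # centroid length, for cosine
    summary = (holder_ip, frequent, length)
    for term in frequent:                                      # the cluster's rendezvous terms
        index.put(term, summary)

def prefilter(doc, index, thres2):
    """Peer side: look up only the document's frequent terms (tf > thres2) and
    collect the candidate clusters; d is then sent only to these holders."""
    candidates = {}
    for term, tf in doc.items():
        if tf > thres2:
            for summary in index.get(term):                    # one DHT lookup per term
                candidates[summary[0]] = summary
    return candidates

index = SummaryIndex()
publish_summary(index, "132.11.23.32", {"politics": 157, "merkel": 149, "economy": 90}, thres1=140)
publish_summary(index, "132.11.23.57", {"tennis": 210, "football": 180}, thres1=140)
doc = {"politics": 20, "germany": 15, "weather": 2}
print(list(prefilter(doc, index, thres2=12)))                  # only the politics cluster remains
```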

  36. Pre-filtering step • Probabilistic guarantees • User selects correctness probability Pr_pre → cost/quality tradeoff • Cluster holders/peers determine the frequent-term thresholds per cluster/document (thres1 and thres2) • The optimal cluster will be included in the candidate set with probability > Pr_pre • Key idea: Probabilistic topic models + Chernoff bounds to get the probability that a term will not be published • Error case: a top topic term stays below its threshold, e.g., Pr[tf(crisis) < 4 | doc about Economy], bounded for all top terms of the topic (crisis, shares, market, …)
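
A small illustration of the kind of tail bound involved, not the exact derivation from the thesis: if the topic model assigns a term occurrence probability p, its frequency in a document of length L behaves like a sum of Bernoulli variables with mean mu = p*L, and a multiplicative Chernoff bound limits the probability that it stays below the frequent-term threshold. The numbers below are hypothetical.

```python
from math import exp

def chernoff_lower_tail(mu, thres):
    """Pr[X < thres] <= exp(-mu * delta^2 / 2) for X a sum of independent Bernoulli
    variables with mean mu, where thres = (1 - delta) * mu and thres < mu."""
    if thres >= mu:
        return 1.0                        # the bound is only meaningful below the mean
    delta = 1.0 - thres / mu
    return exp(-mu * delta * delta / 2.0)

# Hypothetical numbers: a 500-term document about 'Economy' where the topic model gives
# 'crisis' probability 0.02, so mu = 10; chance its tf stays below the threshold of 4:
print(chernoff_lower_tail(mu=500 * 0.02, thres=4))   # <= exp(-1.8) ~= 0.165
```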

  37. Full comparison step • Use the summaries collected from the DHT to estimate the cosine similarity for all candidate clusters • Use the estimates to filter out unpromising clusters → send d only to the remaining candidates • Three strategies to estimate the cosine similarity • Conservative: upper bound → always correct • Zipf-based and Poisson-based: assumptions about the term distribution → small error probability • Poisson-based PCP2P • Tight probabilistic guarantees • Enables fine-tuning of the cost/quality ratio
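
One way to obtain a conservative upper bound from a summary alone (an assumption of mine, not necessarily the thesis' exact estimator): any centroid term missing from the summary has frequency at most thres1, so substituting thres1 for the unknown frequencies can only overestimate the dot product. Candidates whose bound falls below an already-known exact score can then be skipped without sending the document to them.

```python
def cosine_upper_bound(doc, summary, thres1):
    """Upper-bound cos(d, c) given only <holder_ip, frequent terms, centroid length>."""
    _holder_ip, freq_terms, centroid_len = summary
    # Unknown centroid frequencies are at most thres1, so this dot product is an overestimate.
    dot_max = sum(tf * freq_terms.get(t, thres1) for t, tf in doc.items())
    doc_len = sum(v * v for v in doc.values()) ** 0.5
    return dot_max / (doc_len * centroid_len) if doc_len and centroid_len else 0.0

summary = ("132.11.23.32", {"politics": 157, "merkel": 149}, 3211)
print(cosine_upper_bound({"politics": 20, "germany": 15, "weather": 2}, summary, thres1=140))
```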

  38. Evaluation • Evaluation objectives • Clustering quality • Network efficiency • Document collections • Reuters, Medline (100,000 documents) • Synthetic collections created using generative topic models • More results in the thesis • Baselines • DKMeans: Baseline distributed K-Means • LSP2P: State-of-the-art in P2P clustering based on gossiping

  39. Evaluation – Clustering quality • Increasing the desired probabilistic guarantees improves quality • Correctness probability always satisfied • LSP2P performs poorly on high-dimensional datasets • More results in the thesis: • Quality independent of network and dataset size • Independent of the number of clusters and collection characteristics

  40. Evaluation – Network cost • At least an order of magnitude less cost than baseline • Efficiency: Poisson ~ Zipf > Conservative >> DKMeans • Performance gains increase with number of clusters

  41. P2P text clustering • Conclusions • Probabilistic text clustering over P2P networks using probabilistic topic models • Pre-filtering step relying on an inverted index • Full comparison step: Conservative, Zipf-based, Poisson-based • Odysseas Papapetrou, Wolf Siberski, Norbert Fuhr. Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees, in: Proc. ECIR 2010. • Odysseas Papapetrou. Full-text Indexing and Information Retrieval in P2P systems, in: Proc. EDBT PhD workshop 2008. • Odysseas Papapetrou, Wolf Siberski, Fabian Leitritz, Wolfgang Nejdl. Exploiting Distribution Skew for Scalable P2P Text Clustering, in: Proc. DBISP2P 2008. • Odysseas Papapetrou, Wolf Siberski, Norbert Fuhr. Decentralized Probabilistic Text Clustering, under revision at TKDE, 2010.

  42. Additional work in the thesis… • POND: Efficient and effective near duplicate detection in P2P networks with probabilistic guarantees (P2P 2010:1-10) • Locality Sensitive Hashing for NDD of multimedia and text files • POND: Finding the most efficient configuration to satisfy the probabilistic guarantees • CSVM: Collaborative classification in P2P networks (WWW (Companion Volume) 2011: 97-98, extended version under submission) • Dimensionality reduction • Share classifiers to construct meta-classifiers • Avoids privacy issues • Closely approximates the centralized case without centralization
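
POND builds on Locality Sensitive Hashing. As a flavor of the underlying idea only (not POND's configuration-selection algorithm), a minimal MinHash sketch for text: the fraction of agreeing signature components estimates the Jaccard similarity of two term sets, so near-duplicates can be detected from compact signatures.

```python
import hashlib

def minhash_signature(terms, num_hashes=64):
    """MinHash signature of a term set; similar sets yield mostly identical components."""
    return [min(int(hashlib.sha1(f"{i}:{t}".encode()).hexdigest(), 16) for t in terms)
            for i in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing components estimates the Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"the", "king", "speech", "colin", "firth"})
b = minhash_signature({"the", "kings", "speech", "colin", "firth"})
print(estimated_jaccard(a, b))       # high value -> likely near-duplicates
```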

  43. Future work • PCIR and PCP2P extensions • Consider differences in update rate: Some information is more ‘static’ than others • Apply the core clustering idea to different scenarios • Index-based clustering for streaming data • Other clustering algorithms and other similarity measures • Bloom filter extensions for different scenarios, e.g., sensor networks • A good synopsis is always useful

  44. References
  [Gnu] I. J. Taylor. “Gnutella”. In From P2P to Web Services and Grids, Computer Communications and Networks, pages 101–116. Springer London, 2005.
  [Infocom05] A. Kumar, J. Xu, E. Zegura. “Efficient and scalable query routing for unstructured peer-to-peer networks”. INFOCOM 2005.
  [HPDC] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, T. D. Nguyen. “PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities”. HPDC 2003.
  [ComNet06] J. Liang, R. Kumar, K. W. Ross. “The FastTrack overlay: A measurement study”. Computer Networks, 50(6): 842–858, 2006.
  [ICDE03] B. Yang, H. Garcia-Molina. “Designing a Super-Peer Network”. ICDE 2003.
  [WWW03] W. Nejdl et al. “Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks”. WWW 2003.
  [CACM10] R. Rodrigues, P. Druschel. “Peer-to-peer systems”. Communications of the ACM, 53(10): 72–82, 2010.

  45. Support slides

  46. Presented papers • Journals • Computer Networks • Distributed and Parallel Databases • TKDE (under revision) • Papers • WWW’11 poster • ECIR’10 • P2P’10 • DBISP2P’08 • EDBT PhD workshop 2008 • AINA 2007 • Total published • 3 journals • 19 peer-reviewed conferences • 2 peer-reviewed workshops

  47. Why P2P research is important • Some solutions just scale better and are cheaper when done in P2P • Video streaming, telephony, search on distributed data • P2P results can be directly applied to different problems • Apache Hadoop: Builds on data-locality optimization for assigning jobs: execute the job next to the data. Combines key ideas from P2P and mobile agents • Amazon Dynamo: A key-value store, inheriting the key concept of DHTs • Reliability, robustness, reputation: Widely considered in P2P networks • Ad-hoc collaboration and distributed computing: Einstein@home, SETI@home, … • Query optimization for distributed databases and P2P

  48. PCIR

  49. Super-peers • Peers send summaries to super-peers • Super-peers form a connected graph • A peer broadcasts its query to the super-peers, with a TTL • E.g., Gnutella 0.6, FastTrack [ComNet06], [ICDE03], [WWW03] • Does not scale to large networks • [Figure: queries (Q) flooded among super-peers, answers (A) returned]

  50. Gossip-based • Peers form a connected graph • Query flooding with a time-to-live • Top-k results returned following the same path • E.g., Gnutella, Gnutella-QRP [Gnu], EDBFs [Infocom05], PlanetP [HPDC] • Does not scale to large networks • [Figure: query (Q) flooded across peers, answer (A) returned along the reverse path]
