
DisCo : Distributed Co-clustering with Map-Reduce



Presentation Transcript


  1. DisCo: Distributed Co-clustering with Map-Reduce Spiros Papadimitriou Jimeng Sun IBM T.J. Watson Research Center Hawthorne, NY, USA Reporter: Nai-Hui, Ku

  2. Outline • Introduction • Related Work • Distributed Mining Process • Co-clustering Huge Datasets • Experiments • Conclusions

  3. Introduction • Problems • Huge datasets • Data from natural sources comes in an impure, raw form • Proposed Method • A comprehensive Distributed Co-clustering (DisCo) solution • Using Hadoop • DisCo is a scalable framework under which various co-clustering algorithms can be implemented

  4. Related Work • Map-Reduce framework • Employs a distributed storage cluster with block-addressable storage and a centralized metadata server • Provides a convenient data access and storage API for Map-Reduce tasks

  5. Related Work • Co-clustering • Algorithms differ in cluster shape: checkerboard partitions, a single bi-cluster, exclusive row and column partitions, or overlapping partitions • and in optimization criteria, e.g., code length

  6. Distributed Mining Process • Identify the source and obtain the data • Transform the raw data into the appropriate format for data analysis • Present the results visually, or turn them into input for other applications

  7. Distributed Mining Process (cont.) • Data pre-processing • Extracting source/destination IP pairs from a 350 GB raw network event log takes over 5 hours • A few commodity nodes running Hadoop achieve much better performance (a sketch of such an extraction mapper follows below) • Setting up Hadoop required minimal effort
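
A minimal Hadoop Streaming-style sketch of this pre-processing step, in Python. The ISS log format is not shown in the slides, so the whitespace-delimited layout and field positions below are assumptions for illustration only; the real extraction job would parse the actual event-log schema.

    import sys

    def extract_ip_pairs(lines):
        # Assumed layout: whitespace-delimited records with the source IP in
        # the first field and the destination IP in the second (hypothetical).
        for line in lines:
            fields = line.split()
            if len(fields) < 2:
                continue                 # skip malformed records
            yield fields[0], fields[1]

    if __name__ == "__main__":
        for src, dst in extract_ip_pairs(sys.stdin):
            # Tab-separated key/value pairs, as Hadoop Streaming expects.
            print("%s\t%s" % (src, dst))

Each mapper processes one input split of the log in parallel, which is where the speed-up over a single-machine script comes from.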

  8. Distributed Mining Process (cont.) • Specifically for co-clustering, there are two main pre-processing tasks: • Building the graph from raw data • Pre-computing the transpose • During co-clustering optimization we need to iterate over both rows and columns, so adjacency lists must be pre-computed for both the original graph and its transpose (see the sketch below)
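
A standalone Python sketch of this pre-computation, assuming the raw data has already been reduced to an edge list. The DisCo pre-processing runs as Hadoop jobs over HDFS, but the map/reduce structure is the same: each edge is emitted once under its source (original graph) and once under its destination (transpose).

    from collections import defaultdict

    def map_edges(edges):
        for src, dst in edges:
            yield ("row", src), dst      # adjacency list of the original graph
            yield ("col", dst), src      # adjacency list of the transpose

    def reduce_adjacency(pairs):
        # In Hadoop the values would arrive grouped by key; here we group them.
        adj = defaultdict(list)
        for key, value in pairs:
            adj[key].append(value)
        return adj

    edges = [("a", "x"), ("a", "y"), ("b", "x")]
    adj = reduce_adjacency(map_edges(edges))
    # adj[("row", "a")] == ["x", "y"]   and   adj[("col", "x")] == ["a", "b"]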

  9. Co-clustering Huge Datasets • Definitions and overview • Matrices are denoted by boldface capital letters, vectors by boldface lowercase letters • aij: the (i, j)-th element of matrix A • Co-clustering algorithms employ a checkerboard structure: the original adjacency matrix is partitioned into a grid of sub-matrices • For an m × n matrix, a co-clustering is a pair of row and column labeling vectors • r(i): the row-group label assigned to the i-th row of the matrix • G: the k×ℓ group matrix
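
Written out, the notation on this slide amounts to the following (a sketch of the standard checkerboard formulation, using the symbols defined above):

    \mathbf{A} \in \mathbb{R}^{m \times n}, \qquad
    \mathbf{r} \in \{1, \dots, k\}^{m}, \qquad
    \mathbf{c} \in \{1, \dots, \ell\}^{n},

    g_{pq} = \sum_{i \,:\, r(i) = p} \; \sum_{j \,:\, c(j) = q} a_{ij},
    \qquad 1 \le p \le k, \quad 1 \le q \le \ell .

Here r(i) = p means that row i is assigned to row group p, and likewise c(j) = q for columns.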

  10. Co-clustering Huge Datasets (cont.) • gpq gives the sufficient statistics for the (p, q) sub-matrix
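
A minimal Python sketch of how the group matrix is obtained from the current labelings, assuming a dense 0/1 adjacency matrix so that g_pq is simply the nonzero count of the (p, q) sub-matrix; the actual implementation works on sparse adjacency lists, and 0-based labels are used here for convenience.

    def group_matrix(A, r, c, k, l):
        # G[p][q] = sum of the entries of A in row group p and column group q.
        G = [[0] * l for _ in range(k)]
        for i, row in enumerate(A):
            for j, a_ij in enumerate(row):
                G[r[i]][c[j]] += a_ij
        return G

    A = [[1, 0, 1],
         [0, 1, 0]]
    r = [0, 1]        # row labels
    c = [0, 1, 0]     # column labels
    print(group_matrix(A, r, c, k=2, l=2))   # [[2, 0], [0, 1]]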

  11. Co-clustering Huge Datasets (cont.) • Map function
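
The slide shows the map function as a figure; the standalone Python sketch below only illustrates its general shape. Each map call receives one row's adjacency list together with the broadcast column labeling c and group matrix G, computes the row's per-column-group statistics, picks the best-fitting row group, and emits (new row label, (row id, statistics)). The squared-distance criterion and the rows_per_group argument are illustrative stand-ins, since the actual cost depends on which co-clustering objective is being optimized.

    def row_stats(adj_list, c, l):
        # Count this row's nonzeros per column group.
        stats = [0] * l
        for j in adj_list:
            stats[c[j]] += 1
        return stats

    def map_row(i, adj_list, c, G, rows_per_group):
        stats = row_stats(adj_list, c, len(G[0]))
        best_p, best_cost = 0, float("inf")
        for p, g_row in enumerate(G):
            # Average per-column-group count of row group p (illustrative cost).
            avg = [g / max(rows_per_group[p], 1) for g in g_row]
            cost = sum((s - a) ** 2 for s, a in zip(stats, avg))
            if cost < best_cost:
                best_p, best_cost = p, cost
        return best_p, (i, stats)    # key = new row label, value = (row id, stats)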

  12. Co-clustering Huge Datasets (cont.) • Reduce function
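
Correspondingly, a sketch of the reduce step: for each row group p it accumulates the per-row statistics emitted by the mappers into that group's row of the group matrix and records which rows now belong to the group. Again, this illustrates the structure only, not the paper's exact implementation.

    def reduce_group(p, values, l):
        # values: iterable of (row id, per-column-group stats) for row group p.
        members, g_row = [], [0] * l
        for i, stats in values:
            members.append(i)
            g_row = [g + s for g, s in zip(g_row, stats)]
        return p, (members, g_row)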

  13. Co-clustering Huge Datasets (cont.) • Global sync
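
The global synchronization is the small amount of work the driver does between Map-Reduce passes: merge the reducer outputs into a new row labeling vector and group matrix, check whether the cost improved, and then alternate, running the same map/reduce over the pre-computed transpose to update the column labels, until convergence. A hedged sketch, with names chosen for illustration:

    def global_sync(reduce_outputs, m, k, l):
        # reduce_outputs: iterable of (p, (member row ids, accumulated stats)).
        r_new = [0] * m
        G_new = [[0] * l for _ in range(k)]
        for p, (members, g_row) in reduce_outputs:
            G_new[p] = g_row
            for i in members:
                r_new[i] = p
        return r_new, G_new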

  14. Experiments • Setup • 39 nodes • Two dual-core processors per node • 8 GB RAM • Linux RHEL4 • 4 Gbps Ethernet • SATA disks, 65 MB/sec or roughly 500 Mbps • The total capacity of our HDFS cluster was just 2.4 terabytes • HDFS block size was set to 64 MB (the default value) • Java: Sun JDK version 1.6.0_03

  15. Experiments (cont.) • The pre-processing step on the ISS data • Default values: • 39 nodes • 6 concurrent maps per node • 5 reduce tasks • 256 MB input split size

  16. Experiments (cont.)

  17. Conclusions • Using relatively low-cost components, DisCo achieves I/O rates that exceed those of high-performance storage systems • Performance scales almost linearly with the number of machines/disks
