DisCo: Distributed Co-clustering with Map-Reduce

2008 IEEE International Conference on Data Engineering (ICDE)

S. Papadimitriou, J. Sun

Tzu-Li Tai, Tse-En Liu

Kai-Wei Chan, He-Chuan Hoh

IBM T.J. Watson Research Center

NY, USA

National Cheng Kung University

Dept. of Electrical Engineering

HPDS Laboratory


Agenda

Motivation

Background: Co-Clustering + MapReduce

Proposed Distributed Co-Clustering Process

Implementation Details

Experimental Evaluation

Conclusions

Discussion


Motivation

Fast Growth in Volume of Data

  • Google processes 20 petabytes of data per day
  • Amazon and eBay handle petabytes of transactional data every day

Highly variable structure of data

  • Data sources naturally generate data in impure forms
  • Unstructured, semi-structured


Motivation

Problems with Big Data Mining for DBMSs

  • Significant preprocessing costs for the majority of data mining tasks
  • DBMSs lack the performance needed for large amounts of data


Motivation

Why distributed processing can solve these issues:

  • MapReduce is agnostic to the schema or form of the input data
  • Many preprocessing tasks are naturally expressible with MapReduce
  • Highly scalable with commodity machines


Motivation

Contributions of this paper:

  • Presents the complete process for distributed data mining
  • Specifically, focuses on the Co-Clustering mining task, and designs a distributed co-clustering method using MapReduce


Background: Co-Clustering

  • Also called biclustering or two-mode clustering
  • Input format: a matrix of rows and columns
  • Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of the columns (a minimal sketch follows below)

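To make the input and output concrete, here is a minimal sketch (mine, not from the slides) of how a pair of row and column label vectors carves a small binary matrix into co-clusters; the 4x5 matrix and the 2x2 grouping only loosely mirror the running example:

    import numpy as np

    # A toy 4x5 binary matrix: rows could be students or source IPs,
    # columns could be subjects or destination IPs.
    A = np.array([
        [0, 1, 0, 1, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 1],
        [1, 0, 1, 0, 0],
    ])

    # A co-clustering is just two label vectors: one per row, one per column.
    row_labels = np.array([0, 1, 0, 1])      # rows {0, 2} vs rows {1, 3}
    col_labels = np.array([0, 1, 0, 1, 1])   # columns {0, 2} vs columns {1, 3, 4}

    # Each (row group, column group) pair selects a sub-matrix: one co-cluster.
    for rg in range(2):
        for cg in range(2):
            block = A[np.ix_(row_labels == rg, col_labels == cg)]
            print(f"co-cluster ({rg}, {cg}):\n{block}")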

Background: Co-Clustering

Why Co-Clustering?

Traditional Clustering:

[Figure: a student-by-subject score matrix (Chinese, English, Math, Science, Social Studies) for students A-D, with traditional row clustering grouping A & C and B & D.]

Traditional clustering can only tell us that students A & C (and B & D) have similar scores overall.

Background: Co-Clustering

Why Co-Clustering?

[Figure: the same student-by-subject matrix with rows and columns reordered. Co-clustering splits the students into two groups (A & C and B & D), one good at Science + Math and the other good at Chinese + English + Social Studies.]

Co-clustering finds rows that have similar values for a selected subset of the columns, i.e., it clusters rows and columns simultaneously.

Background: MapReduce

The MapReduce Paradigm

[Figure: the MapReduce dataflow. Parallel map tasks turn input records into intermediate key-value pairs, the pairs are shuffled and sorted by key, and parallel reduce tasks aggregate the values of each key.]
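As a minimal illustration of the paradigm (mine, not from the slides), the sketch below simulates the map, shuffle, and reduce phases in plain Python on a word-count job:

    from collections import defaultdict

    def map_phase(lines):
        # Map: emit (word, 1) for every word of every input line.
        for line in lines:
            for word in line.split():
                yield word, 1

    def shuffle(pairs):
        # Shuffle and sort: group all intermediate values by their key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: aggregate the values of each key (here, sum the counts).
        return {key: sum(values) for key, values in groups.items()}

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle(map_phase(lines))))  # {'the': 3, 'fox': 2, ...}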

Distributed Co-Clustering Process

Mining Network Logs to Co-Cluster Communication Behavior


Distributed Co-Clustering Process

The Preprocessing Process

[Figure: the preprocessing pipeline. Raw network logs stored in HDFS go through a chain of MapReduce jobs: extract the SrcIP and DstIP of each record and build the adjacency matrix (one 0/1 row per source IP, one column per destination IP), build the adjacency list of each row, and build the transposed adjacency list of each column. Every intermediate result is written back to HDFS.]
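A rough sketch (mine, with a made-up log format) of the first preprocessing step as a map/reduce pair: the mapper extracts (SrcIP, DstIP) from each log line and the reducer collects each source's destination list, i.e., one adjacency-list row:

    from collections import defaultdict

    def parse_log_line(line):
        # Assumed log format 'timestamp src_ip dst_ip ...' (made up for illustration).
        fields = line.split()
        return fields[1], fields[2]

    def mapper(line):
        # Emit (SrcIP, DstIP) for each log record.
        src, dst = parse_log_line(line)
        yield src, dst

    def reducer(src, dsts):
        # Collect all destinations contacted by one source IP.
        yield src, sorted(set(dsts))

    # Local simulation of the shuffle between map and reduce.
    logs = [
        "1200 10.0.0.1 10.0.0.2",
        "1201 10.0.0.1 10.0.0.4",
        "1202 10.0.0.3 10.0.0.2",
    ]
    groups = defaultdict(list)
    for line in logs:
        for key, value in mapper(line):
            groups[key].append(value)
    for src, dsts in groups.items():
        print(next(reducer(src, dsts)))  # ('10.0.0.1', ['10.0.0.2', '10.0.0.4']), ...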

Distributed Co-Clustering Process

Co-Clustering (Generalized Algorithm)

Goal: co-cluster the 4x5 example matrix into 2x2 = 4 sub-matrices, i.e., assign every row a label r(i) in {1, 2} and every column a label c(j) in {1, 2}.

Random initialization: r(1) = r(2) = r(3) = 1, r(4) = 2; c(1) = c(2) = c(3) = 1, c(4) = c(5) = 2.

Distributed Co-Clustering Process

Co-Clustering (Generalized Algorithm)

Row iteration: fix the column labels c(.) and iterate through the rows, moving each row to the row group that gives a lower cost. In the example, row 2 moves from group 1 to group 2: r(2) = 2.

Distributed Co-Clustering Process

Co-Clustering (Generalized Algorithm)

Column iteration: fix the row labels r(.) and iterate through the columns in the same way. In the example, column 2 moves from group 1 to group 2: c(2) = 2. Row and column iterations alternate until no label changes; a sequential sketch follows below.
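The following is a compact sequential sketch (mine) of this alternating scheme. The sum-of-squared-errors-to-block-mean cost is only a stand-in, since the paper allows any cost expressed over per-group statistics; in DisCo the two inner sweeps become MapReduce jobs (sketched later).

    import numpy as np

    def block_cost(A, row_labels, col_labels, k, l):
        # Sum of squared deviations of every entry from its co-cluster mean
        # (a stand-in cost; DisCo allows any cost over per-group statistics).
        cost = 0.0
        for rg in range(k):
            for cg in range(l):
                block = A[np.ix_(row_labels == rg, col_labels == cg)]
                if block.size:
                    cost += ((block - block.mean()) ** 2).sum()
        return cost

    def cocluster(A, k, l, max_sweeps=20, seed=0):
        rng = np.random.default_rng(seed)
        n, m = A.shape
        row_labels = rng.integers(0, k, size=n)   # random initialization
        col_labels = rng.integers(0, l, size=m)
        for _ in range(max_sweeps):
            changed = False
            # Row iteration: column labels fixed, re-assign each row greedily.
            for i in range(n):
                best = min(range(k), key=lambda g: block_cost(
                    A, np.where(np.arange(n) == i, g, row_labels), col_labels, k, l))
                if best != row_labels[i]:
                    row_labels[i], changed = best, True
            # Column iteration: row labels fixed, re-assign each column greedily.
            for j in range(m):
                best = min(range(l), key=lambda g: block_cost(
                    A, row_labels, np.where(np.arange(m) == j, g, col_labels), k, l))
                if best != col_labels[j]:
                    col_labels[j], changed = best, True
            if not changed:   # converged: a full sweep moved nothing
                break
        return row_labels, col_labels

With the 4x5 matrix A from the earlier sketch, cocluster(A, 2, 2) typically recovers the two row groups and the two column groups (up to relabeling).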

Distributed Co-Clustering Process

Co-Clustering with MapReduce

[Figure: the adjacency-list rows of the example matrix (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3) are the input records of a MapReduce job.]

Distributed Co-Clustering Process

Co-Clustering with MapReduce

[Figure: the same adjacency lists are fed to a MapReduce job that is configured with the current co-clustering parameters (the row and column labels and the group statistics).]

Distributed Co-Clustering Process

[Figure: each map task receives adjacency-list rows (1 -> 2,4,5; 2 -> 1,3; 3 -> 2,4,5; 4 -> 1,3) together with the current column labels c(1) = c(2) = c(3) = 1, c(4) = c(5) = 2.]

Mapper Function: for each key-value input (a row k and its adjacency list),
  • Calculate the row's per-column-group counts (using the current column labels and group statistics)
  • Change the row label r(k) if doing so results in a lower cost (a function of the group statistics)
  • Emit (r(k), (per-column-group counts, k))

Example: for row 1 (1 -> 2,4,5), setting r(1) = 2 would make the cost higher, so r(1) stays 1 and the mapper emits (1, {(1,2), 1}): key = row label 1, value = the counts (1,2) over the two column groups plus the row id 1.

Distributed Co-Clustering Process

[Figure: the same map task, now processing row 2 (2 -> 1,3) with the same column labels.]

Example: for row 2, setting r(2) = 2 makes the cost lower, so the mapper re-labels the row and emits (2, {(2,0), 2}): key = new row label 2, value = the counts (2,0) over the two column groups plus the row id 2.
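A plain-Python sketch (mine) of this row-iteration mapper; the real job ships the current labels and group statistics to every map task as job parameters, and the cost function here is left abstract:

    def row_group_counts(adj, col_labels, l):
        # Per-column-group counts of one row, e.g. row 1 -> {2,4,5} gives (1, 2).
        counts = [0] * l
        for col in adj:
            counts[col_labels[col] - 1] += 1
        return tuple(counts)

    def row_iteration_mapper(row_id, adj, col_labels, k, l, cost_of_label):
        # One (row id, adjacency list) record in, one (new label, value) pair out.
        counts = row_group_counts(adj, col_labels, l)
        best_label = min(range(1, k + 1), key=lambda g: cost_of_label(g, counts))
        yield best_label, (counts, row_id)

    col_labels = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # c(.) from the running example
    print(row_group_counts([2, 4, 5], col_labels, 2))   # (1, 2), as on the slide

Here cost_of_label stands in for the cost of assigning the row to group g given the current group statistics; it is whatever cost the chosen co-clustering variant uses.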

Distributed Co-Clustering Process

[Figure: the four map tasks emit their (row label, (counts, row id)) pairs, which are shuffled by row label to the reduce tasks.]

Distributed Co-Clustering Process

Reducer Function: for each key-value input (a row label g and all the (counts, row id) values shuffled to it),
  • Accumulate all the per-column-group counts into the statistics of row group g
  • Take the union of all the row ids now assigned to group g
  • Emit (g, (group statistics, set of row ids))
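A matching sketch (mine) of the reducer:

    def row_iteration_reducer(group_label, values):
        # values: iterable of (per-column-group counts, row id) pairs for one group.
        totals, members = None, set()
        for counts, row_id in values:
            totals = list(counts) if totals is None else [
                a + b for a, b in zip(totals, counts)]
            members.add(row_id)
        yield group_label, (tuple(totals), members)

    # Rows 1 and 3 were both labelled 1 by the mappers in the running example.
    print(next(row_iteration_reducer(1, [((1, 2), 1), ((1, 2), 3)])))
    # -> (1, ((2, 4), {1, 3}))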

Distributed Co-Clustering Process

[Figure: the complete flow. Preprocessing MapReduce jobs build the adjacency list and its transpose in HDFS. Co-clustering then starts from random (or given) labels and alternates two MapReduce jobs: a row iteration (column labels fixed) over the adjacency list and a column iteration (row labels fixed) over the transposed list. The results are synced after each job, and the final co-clustering result with the best permutations is written to HDFS.]
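A driver-level sketch (mine, in terms of hypothetical run_row_iteration / run_column_iteration helpers that each launch one MapReduce job) of how the two jobs could be chained, with the global sync of labels in between:

    def cocluster_driver(run_row_iteration, run_column_iteration,
                         row_labels, col_labels, max_iters=20):
        # run_* are assumed to launch one job and return (new labels, stats, cost).
        best_cost = float("inf")
        for _ in range(max_iters):
            # Row iteration: the column labels are fixed job parameters.
            row_labels, row_stats, cost = run_row_iteration(row_labels, col_labels)
            # Global sync point: the driver must gather the new row labels
            # before it can configure the column-iteration job with them.
            col_labels, col_stats, cost = run_column_iteration(row_labels, col_labels)
            if cost >= best_cost:   # no improvement: stop
                break
            best_cost = cost
        return row_labels, col_labels, best_cost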

Implementation Details

Tuning the Number of Reduce Tasks

  • The number of reduce tasks should be matched to the number of intermediate keys that reach the shuffle-and-sort phase
  • For the co-clustering row-iteration and column-iteration jobs, the number of intermediate keys is only k (or ℓ, respectively), i.e., the number of row (or column) groups


Implementation Details

[Figure: in the row-iteration job, the map tasks emit at most k distinct intermediate keys (the row-group labels), so a single reduce task receives all of them.]

Implementation Details

Tuning the Number of Reduce Tasks

  • So, for the row-iteration/column-iteration jobs, 1 reduce task is enough
  • However, preprocessing tasks such as graph construction produce a very large number of intermediate keys and therefore need many more reduce tasks (see the sketch below)

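To see why extra reduce tasks would sit idle in the iteration jobs, recall that intermediate keys are typically spread over reduce tasks by hashing; a simplified sketch (mine, not Hadoop's exact partitioner):

    def partition(key, num_reduce_tasks):
        # Assign an intermediate key to a reduce task by hashing (simplified).
        return hash(key) % num_reduce_tasks

    # Row iteration with k = 2 row groups: only 2 distinct keys exist,
    # so at most 2 reduce tasks ever receive any data.
    row_keys = [1, 2, 1, 2]                # labels emitted by the mappers
    print({key: partition(key, 10) for key in row_keys})

    # Graph construction: roughly one key per IP address, so many keys
    # and (almost surely) every reduce task gets work.
    ips = [f"10.0.0.{i}" for i in range(1000)]
    print(len({partition(ip, 10) for ip in ips}))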

Implementation Details

The Preprocessing Process

[Figure: the preprocessing pipeline shown earlier. The graph-construction jobs (extract SrcIP + DstIP and build the adjacency matrix, build the adjacency list, build the transposed adjacency list) emit many intermediate keys, which is why they need more reduce tasks.]

Experimental Evaluation

Environment

  • 39 nodes in four blade enclosures
  • Gigabit Ethernet interconnect
  • Each blade server:
      • CPU: two dual-core Intel Xeon 2.66 GHz processors
      • Memory: 8 GB
      • OS: Red Hat Enterprise Linux
  • Hadoop Distributed File System (HDFS) capacity: 2.4 TB


Experimental Evaluation

Preprocessing ISS Data

Optimal values found for this workload:
  • Number of map tasks: 6
  • Number of reduce tasks: 5
  • Input split size: 256 MB


Experimental Evaluation

Co-Clustering TREC Data

Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, which is better than what can be achieved on a single machine with 48 GB of RAM.



Conclusion

  • The authors share the lessons they learned from mining vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
  • They design a general MapReduce approach for co-clustering algorithms
  • They show that the MapReduce co-clustering framework scales well on large real-world datasets (ISS, TREC)



Discussion

  • Necessity of the global sync action
  • Questionable scalability of DisCo


Discussion

Necessity of the global sync action

[Figure: the co-clustering flow shown earlier. After every row-iteration or column-iteration MapReduce job, the updated labels and statistics must be globally synced before the next job can be launched, and the final result is the best permutation found.]

Discussion

[Figure: the row-iteration map/reduce dataflow shown earlier, illustrating where the sync takes place.]

Discussion

Questionable Scalability of DisCo

  • For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed at k (or ℓ)
  • This implies that, for a given k and ℓ, the reducer size* increases dramatically as the input matrix gets larger (see the sketch below)
  • Since a single reducer (a key plus all of its associated values) is sent to one reduce task, the memory capacity of a single computing node becomes a severe bottleneck for overall performance

*Reference: Upper Bound and Lower Bound of a MapReduce Computation, VLDB 2013

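A back-of-the-envelope illustration (mine, with made-up sizes) of that growth: with k fixed, each reduce task in the row iteration receives about n/k of the rows' values no matter how many machines are available:

    def records_per_reducer(num_rows, k):
        # One (counts, row id) value per row, spread over only k keys.
        return num_rows // k

    for n in (10_000, 1_000_000, 100_000_000):
        print(f"{n:>11} rows -> about {records_per_reducer(n, k=2):>10} records per reducer")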

Discussion

[Figure: the row-iteration dataflow once more: all the values for one row-group key are funneled to a single reduce task, so that task's input grows with the number of rows.]