
### DisCo: Distributed Co-clustering with Map-Reduce


2008 IEEE International Conference on Data Engineering (ICDE)

S. Papadimitriou, J. Sun
IBM T.J. Watson Research Center, NY, USA

Presenters: Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh
HPDS Laboratory, Dept. of Electrical Engineering, National Cheng Kung University

### Agenda

- Motivation
- Background: Co-Clustering + MapReduce
- Proposed Distributed Co-Clustering Process
- Implementation Details
- Experimental Evaluation
- Conclusions
- Discussion

### Motivation

Explosive growth of data:

- Google processes 20 petabytes of data per day
- Amazon and eBay handle petabytes of transactional data every day

Highly variant structure of data:

- Data sources naturally generate data in impure forms
- Unstructured or semi-structured

Problems with big-data mining in DBMSs:

- Significant preprocessing costs for the majority of data mining tasks
- DBMSs lack performance on very large amounts of data

Why distributed processing can solve these issues:

- MapReduce is agnostic to the schema or form of the input data
- Many preprocessing tasks are naturally expressible in MapReduce
- Highly scalable with commodity machines

This paper:

- Presents the whole process for distributed data mining
- Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce

### Background: Co-Clustering

- Also named biclustering, or two-mode clustering
- Input format: a matrix of rows and columns (a 4 x 5 matrix in the running example)
- Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns
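As a minimal sketch of this input/output format, the snippet below uses a hypothetical 4 x 5 binary matrix matching the running example, with the row and column labels the example eventually converges to; each (row-group, column-group) pair induces one co-cluster sub-matrix:

```python
# Minimal sketch: a co-clustering assigns every row and every column to a
# group; each (row-group, column-group) pair induces one sub-matrix.
A = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
]
row_label = {0: 1, 1: 2, 2: 1, 3: 2}          # r(k): row -> row group
col_label = {0: 2, 1: 1, 2: 2, 3: 1, 4: 1}    # c(k): column -> column group

def sub_matrix(rg, cg):
    """Rows in row group rg restricted to columns in column group cg."""
    rows = [i for i in sorted(row_label) if row_label[i] == rg]
    cols = [j for j in sorted(col_label) if col_label[j] == cg]
    return [[A[i][j] for j in cols] for i in rows]

print(sub_matrix(1, 1))  # rows {0, 2} x columns {1, 3, 4}: all ones
```

With these labels every sub-matrix is uniform (all ones or all zeros), which is the kind of block structure co-clustering looks for.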

Why co-clustering?

Traditional clustering: given a score matrix (students A-D as rows; Social, Science, Chinese, English, Math as columns), clustering along rows alone groups students A & C and students B & D. We can only know that students A & C (and B & D) have similar scores overall.

Co-clustering: reordering both rows and columns of the same score matrix reveals two co-clusters: students B & D are good at Science + Math (Cluster 1), while students A & C are good at English + Chinese + Social Studies (Cluster 2). Co-clustering groups rows that have similar properties for a subset of selected columns.

### Distributed Co-Clustering Process

Preprocessing pipeline (network-traffic example), with HDFS as the storage between stages:

1. MapReduce job: extract SrcIP + DstIP pairs from the raw logs and build the adjacency matrix (rows of 0/1 entries such as `0 1 0 1 1 0 0 0 1 1 1 0 ...`)
2. MapReduce job: build the adjacency list (SrcIP -> list of DstIPs)
3. MapReduce job: build the transpose adjacency list (DstIP -> list of SrcIPs)
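The pipeline above can be simulated in-process. The sketch below is only an illustration of the map/reduce structure: the log format, IP addresses, and helper names are invented for the example, and the real jobs run on Hadoop over HDFS:

```python
from collections import defaultdict

# In-process simulation of the preprocessing jobs; the record format here is
# hypothetical (the real jobs run on Hadoop over raw logs stored in HDFS).
raw_logs = [
    "10.0.0.1 10.0.0.2",
    "10.0.0.1 10.0.0.4",
    "10.0.0.3 10.0.0.2",
]

def map_extract(line):
    """Map: extract the (SrcIP, DstIP) pair from one log record."""
    src, dst = line.split()
    yield src, dst

def reduce_group(pairs):
    """Reduce: group values by key -> one adjacency list per IP."""
    adj = defaultdict(list)
    for key, val in pairs:
        adj[key].append(val)
    return dict(adj)

# Job: build the SrcIP -> [DstIP, ...] adjacency list.
adj = reduce_group(kv for line in raw_logs for kv in map_extract(line))
# Job: build the transpose (DstIP -> [SrcIP, ...]) by swapping key and value.
transpose = reduce_group((d, s) for s, ds in adj.items() for d in ds)
print(adj)        # {'10.0.0.1': ['10.0.0.2', '10.0.0.4'], '10.0.0.3': ['10.0.0.2']}
print(transpose)  # {'10.0.0.2': ['10.0.0.1', '10.0.0.3'], '10.0.0.4': ['10.0.0.1']}
```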

Co-Clustering (Generalized Algorithm)

Goal: co-cluster the matrix into 2 x 2 = 4 sub-matrices, i.e., every row label r(i) and every column label c(j) is either 1 or 2.

Random initialization:

- Row labels: r(1) = 1, r(2) = 1, r(3) = 1, r(4) = 2
- Column labels: c(1) = 1, c(2) = 1, c(3) = 1, c(4) = 2, c(5) = 2

Fix the column labels and iterate through the rows, reassigning each row label when that lowers the cost: here r(2) changes from 1 to 2, while r(1), r(3), and r(4) stay the same.

Then fix the row labels and iterate through the columns in the same way: here c(2) changes from 1 to 2, while the other column labels stay the same.
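The alternating sweeps above can be sketched end-to-end on the running example. The concrete cost function below (entries disagreeing with the majority value of their block) is an assumption made for illustration; the generalized algorithm admits other objectives:

```python
# Sketch of the alternating row/column sweeps on the running example's 4x5
# binary matrix. The cost (minority entries per block) is a stand-in objective.
A = [
    [0, 1, 0, 1, 1],   # row 1: adjacency list 1 -> 2, 4, 5
    [1, 0, 1, 0, 0],   # row 2: adjacency list 2 -> 1, 3
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
]
k, l = 2, 2                  # 2 x 2 = 4 sub-matrices
r = [1, 1, 1, 2]             # initial row labels r(1..4), as on the slide
c = [1, 1, 1, 2, 2]          # initial column labels c(1..5)

def cost(r, c):
    """Sum over all blocks of the number of minority entries in the block."""
    total = 0
    for rg in range(1, k + 1):
        for cg in range(1, l + 1):
            block = [A[i][j]
                     for i in range(len(A)) if r[i] == rg
                     for j in range(len(A[0])) if c[j] == cg]
            if block:
                total += min(sum(block), len(block) - sum(block))
    return total

def sweep(labels, groups):
    """One pass over labels; greedily move each label to its cheapest group."""
    changed = False
    for i in range(len(labels)):
        old = labels[i]
        costs = {}
        for g in range(1, groups + 1):
            labels[i] = g
            costs[g] = cost(r, c)
        labels[i] = min(costs, key=costs.get)
        changed |= labels[i] != old
    return changed

while sweep(r, k) | sweep(c, l):   # '|' so both sweeps always run
    pass
print(r, c, cost(r, c))            # [1, 2, 1, 2] [1, 2, 1, 2, 2] 0
```

Under this cost the first row sweep moves r(2) to 2 and the first column sweep moves c(2) to 2, matching the label changes shown on the slides, after which the labeling is stable.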

The matrix is stored as adjacency lists, one per row:

    1 -> 2, 4, 5
    2 -> 1, 3
    3 -> 2, 4, 5
    4 -> 1, 3

Each row iteration (or column iteration) runs as one MapReduce job over these lists, parameterized by the current row and column labels.

Mapper function:

For each key-value input (a row k and its adjacency list), calculate the row's per-column-group counts, change the row label r(k) if doing so results in a lower cost, and emit (r(k), (counts, k)).

Example: row 1 (adjacency list 1 -> 2, 4, 5) has counts (1, 2) under the current column labels. If r(1) were changed to 2 the cost would become higher, so r(1) stays 1 and the mapper emits (r(k), (counts, k)) = (1, {(1, 2), 1}).

Example: row 2 (adjacency list 2 -> 1, 3) has counts (2, 0). Changing r(2) to 2 makes the cost lower, so the mapper sets r(2) = 2 and emits (r(k), (counts, k)) = (2, {(2, 0), 2}).

Reducer function:

For each key-value input (a row label and all the values emitted with it), accumulate all the per-group counts into one total, take the union of all the row indices, and emit the label with the aggregated result.

Overall flow: starting from randomly initialized labels (with k and l given), the process alternates a MapReduce job that fixes the column labels and runs a row iteration with a MapReduce job that fixes the row labels and runs a column iteration (a further MapReduce job builds the transpose adjacency list for the column side). After each job the results are synced globally, keeping the best permutation so far, and the final co-clustering result with the best permutations is written back to HDFS.

### Implementation Details

Tuning the number of reduce tasks:

- The number of reduce tasks is related to the number of intermediate keys during the shuffle-and-sort phase
- For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is either k (row groups) or l (column groups)
- So, for the row-iteration/column-iteration jobs, one reduce task is enough
- However, some preprocessing tasks, such as graph construction, produce many intermediate keys and need far more reduce tasks


### Experimental Evaluation

Cluster setup:

- 39 nodes in four different blade enclosures, connected by Gigabit Ethernet
- Each blade server: two dual-core CPUs (Intel Xeon 2.66 GHz), 8 GB memory, Red Hat Enterprise Linux
- Hadoop Distributed File System (HDFS) capacity: 2.4 TB

Optimal values found for each parameter:

| Parameter | Optimal value |
| --- | --- |
| Number of map tasks | 6 |
| Number of reduce tasks | 5 |
| Input split size | 256 MB |

Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, which is better than what can be obtained on a single machine with 48 GB of RAM.

### Conclusion

- The authors share lessons learned from data mining over vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
- They design a general MapReduce approach for co-clustering algorithms
- They show that the MapReduce co-clustering framework scales well on real-world large datasets (ICC, TREC)

### Discussion

Necessity of the global sync action: between the row-iteration and column-iteration MapReduce jobs, the results must be synced globally with the best permutation found so far before the next job starts; the final co-clustering result, with the best permutations, depends on this sync step.

Questionable scalability of DisCo:

- For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed at k (or l)
- This implies that, for a given k and l, as the input matrix gets larger, the reducer size* will increase dramatically
- Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a computing node becomes a severe bottleneck for overall performance

*Reference: Upper and Lower Bounds on the Cost of a Map-Reduce Computation, VLDB 2013
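A quick back-of-the-envelope illustration of this point (the numbers are hypothetical, not from the paper):

```python
# Illustration of the bottleneck: the number of reduce keys in a row iteration
# is fixed at k, so the payload per reduce key grows linearly with the number
# of rows m, no matter how many reduce slots the cluster has.
k = 16                                   # row groups -> intermediate keys
for m in (10**6, 10**8, 10**10):
    print(f"m = {m:>14,} rows -> ~{m // k:,} rows per reduce key")
```

With k fixed, adding nodes cannot shrink a reducer's payload below m / k rows, which is the memory-bound behavior the discussion highlights.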
