Techniques for pipelined broadcast on ethernet switched clusters

### Techniques for pipelined broadcast on ethernet switched clusters

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

UNIVERSITY PUTRA OF MALAYSIA

SELECTED TOPICS FOR DISTRIBUTED COMPUTING [SKR 5800]

DEPARTMENT OF COMMUNICATION TECHNOLOGY AND NETWORKING

LECTURER: DR. NOR ASILA WATI BT ABD HAMID

PREPARED BY : MUHAMAD RAFIQ BIN OSMAN

METRIC NO.: GS18838

Contents

- Introduction
- Literature review
- Problem statements
- Objectives
- Methodology
- Cluster designs
- Broadcast trees
- Contention-free linear tree
- Contention-free binary tree
- Heuristic algorithms
- Model for computing appropriate segment sizes
- Experiments
- Results/findings
- Conclusion

Introduction

- Broadcast = the root process sends a message to all other processes in the system.

Literature review

- Binomial tree based pipelined broadcast algorithms have been developed [11], [13], [23], [24], [25].
- The k-binomial tree algorithm [13] has been shown to perform better than traditional binomial trees.
- The paper does not propose new pipelined broadcast schemes; instead, it develops practical techniques to facilitate the deployment of pipelined broadcast on clusters connected by multiple Ethernet switches.

Problem statements

- The problems addressed here are:
- The proper broadcast tree must be determined for use with pipelined broadcast.
- Two or more communications can be active at the same time only when they come from different branches.
- Appropriate segment sizes must be selected: a small segment size may incur excessive start-up overheads, while a large segment size may decrease pipeline efficiency.

Objectives

- The paper has a few objectives:

a) Broadcast large messages using the pipelined broadcast approach.

b) Develop adaptive MPI routines that use different algorithms according to the message size.

c) Allow these algorithms and the complementary algorithms for broadcasting small messages to co-exist in one MPI routine.

Methodology

[Figure: example cluster topology with switches s0-s3 and machines n0-n5]

- Example of path (n0 -> n3) = {(n0,s0),(s0,s1),(s1,s3),(s3,n3)}
- A contention-free pattern is a pattern in which no two communications have contention.
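The path and contention definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the links on the n0 -> n3 path are taken from the example, every other attachment (n1, n2, n4, n5, s2) is a hypothetical placement, and two communications are assumed to contend when their paths share a directed edge (full-duplex links).

```python
from collections import deque

# Undirected links of a small tree-shaped cluster. The first four links
# come from the n0 -> n3 example; the remaining ones are assumed.
LINKS = [
    ("n0", "s0"), ("s0", "s1"), ("s1", "s3"), ("s3", "n3"),   # from the example
    ("n1", "s0"), ("n5", "s1"), ("s1", "s2"), ("n2", "s2"),   # hypothetical
    ("n4", "s3"),                                             # hypothetical
]

def build_adjacency(links):
    """Build an adjacency list for an undirected graph."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def path(adj, src, dst):
    """Return the unique tree path src -> dst as a list of directed edges."""
    parent = {src: None}
    queue = deque([src])
    while queue:                      # breadth-first search from src
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    edges, node = [], dst
    while parent[node] is not None:   # walk back from dst to src
        edges.append((parent[node], node))
        node = parent[node]
    return edges[::-1]

def has_contention(adj, comm_a, comm_b):
    """Assumed rule: contention means the two paths share a directed edge."""
    return bool(set(path(adj, *comm_a)) & set(path(adj, *comm_b)))

ADJ = build_adjacency(LINKS)
```

With this topology, `path(ADJ, "n0", "n3")` reproduces the edge list in the example, and `has_contention(ADJ, ("n0", "n3"), ("n1", "n4"))` is true because both paths cross (s0,s1) and (s1,s3).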

Cont..(1) Broadcast trees

[Figure: example broadcast trees over eight nodes numbered 0-7: linear tree, binary tree, 3-ary tree, binomial tree, flat tree]
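As a rough illustration of the five tree shapes named on this slide, the sketch below builds each one as a parent-to-children map over nodes 0..P-1 and reports its height. The node numbering (level order for k-ary trees, and the particular binomial numbering) is an assumption made for the example and may not match the numbering in the original figure.

```python
def linear_tree(p):
    """Chain 0 -> 1 -> ... -> p-1."""
    return {i: ([i + 1] if i + 1 < p else []) for i in range(p)}

def kary_tree(p, k):
    """k-ary tree in level order (assumed): children of i are k*i+1 .. k*i+k."""
    return {i: [c for c in range(k * i + 1, k * i + k + 1) if c < p]
            for i in range(p)}

def binomial_tree(p):
    """Binomial tree (assumed numbering): node i gets children i + 2^j
    for every power of two 2^j that is larger than i."""
    children = {i: [] for i in range(p)}
    for i in range(p):
        bit = 1
        while bit <= i:           # skip powers of two not above i
            bit <<= 1
        while i + bit < p:
            children[i].append(i + bit)
            bit <<= 1
    return children

def flat_tree(p):
    """The root sends directly to every other node."""
    children = {i: [] for i in range(p)}
    children[0] = list(range(1, p))
    return children

def height(children, root=0):
    """Number of edges on the longest root-to-leaf path."""
    return max((1 + height(children, c) for c in children[root]), default=0)
```

For eight nodes, the heights are 7 (linear), 3 (binary), 2 (3-ary), 3 (binomial), and 1 (flat), which shows the trade-off between tree height and the number of children each node must serve.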

Contention-free linear trees

- All communications in contention-free linear tree must be contention-free.
- The cluster is modeled as a tree graph G = (S ∪ M, E).
- S = switches, M = machines, E = edges.
- P = |M|, and G' = (S, E') is the subgraph of G that contains only the switches.
- Step 1:
- Starting from the switch that the root nr is connected to, perform a Depth First Search (DFS) on G'.
- Number the switches based on the DFS arrival order.
- Step 2:
- Number the machines attached to switch si as ni,0, ni,1, …, ni,Xi-1. Xi = 0 when no machine is attached to si.
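The two steps above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the switch links, and the machine attachments are hypothetical, and chaining the machines in the resulting order to form the linear tree is my reading of how the numbering is meant to be used.

```python
def contention_free_linear_order(switch_links, attached, root):
    """Step 1: number switches in DFS arrival order, starting at the root's switch.
    Step 2: list the machines switch by switch, with the root machine first."""
    adj = {}
    for a, b in switch_links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    root_switch = next(s for s, ms in attached.items() if root in ms)

    # Iterative depth-first search over the switch-only subgraph G'.
    switch_order, seen, stack = [], set(), [root_switch]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        switch_order.append(s)
        stack.extend(reversed(adj.get(s, [])))

    machines = []
    for s in switch_order:
        ms = attached.get(s, [])          # Xi = 0 when the list is empty
        if s == root_switch:
            ms = [root] + [m for m in ms if m != root]
        machines.extend(ms)
    return switch_order, machines

# Hypothetical cluster: four switches, six machines.
SWITCH_LINKS = [("s0", "s1"), ("s1", "s2"), ("s1", "s3")]
ATTACHED = {"s0": ["n0", "n1"], "s1": ["n5"], "s2": ["n2"], "s3": ["n3", "n4"]}

switches, order = contention_free_linear_order(SWITCH_LINKS, ATTACHED, "n0")
```

Here `order` is `["n0", "n1", "n5", "n2", "n3", "n4"]`; sending along consecutive pairs of that list uses each directed link at most once in this example.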

Contention-free binary tree

- Tree height affects the time to complete the operation; the smallest tree height is ideal for a pipelined broadcast binary tree.
- Example (i<j≤k<l and a≤b≤c≤d):
- path(mi -> mj) has three components: (mi,sa), path(sa -> sb), and (sb,mj).
- path(mk -> ml) has three components: (mk,sc), path(sc -> sd), and (sd,ml).
- When a=b, communication mi -> mj does not have contention with communication mk -> ml, since (mi,sa) and (sb,mj) are not in path(sc -> sd), and vice versa.
- Question: what about k-ary broadcast trees? Is there a contention-free tree in which a node has up to k children?

Heuristic algorithms

- Tree[i][j] stores tree(i,j) and best[i][j] stores the height of tree(i,j).
- tree(i,j), j > i+2, is formed by having mi as the root, tree(i+1,k-1) as the left child, and tree(k,j) as the right child.
- Make sure that mi -> mk does not have contention with communications in tree(i+1,k-1), which ensures that the binary tree is contention-free.
- Choose the k with the smallest max(best[i+1][k-1], best[k][j]) + 1, which minimizes the tree height.
- Tree[0][P-1] stores the contention-free binary tree.
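The table-filling idea above can be sketched as a small dynamic program. This is an illustrative reading of the slide, not the paper's implementation: the `conflicts` predicate is a hypothetical hook that should report whether communication mi -> mk contends with any edge of the left subtree, and the single-child fallback used for short ranges is an assumption of this sketch.

```python
def build_binary_tree(p, conflicts):
    """Return (height, edges) of tree(0, p-1) over machines m0..m(p-1).

    conflicts(i, k, edges) is a caller-supplied (hypothetical) predicate:
    True if communication mi -> mk contends with any (parent, child)
    pair in `edges`.
    """
    best = [[0] * p for _ in range(p)]        # best[i][j]: height of tree(i, j)
    tree = [[[] for _ in range(p)] for _ in range(p)]  # tree[i][j]: edge list

    for span in range(1, p):
        for i in range(p - span):
            j = i + span
            # Assumed fallback: mi has the single child tree(i+1, j).
            cand_h = best[i + 1][j] + 1
            cand_t = [(i, i + 1)] + tree[i + 1][j]
            # Try every split: left = tree(i+1, k-1), right = tree(k, j).
            for k in range(i + 2, j + 1):
                left, right = tree[i + 1][k - 1], tree[k][j]
                if conflicts(i, k, left):
                    continue                  # mi -> mk would contend; skip k
                h = max(best[i + 1][k - 1], best[k][j]) + 1
                if h < cand_h:
                    cand_h = h
                    cand_t = [(i, i + 1), (i, k)] + left + right
            best[i][j], tree[i][j] = cand_h, cand_t
    return best[0][p - 1], tree[0][p - 1]

# With no contention at all (e.g. a single switch), 7 machines give height 2.
height_free, edges_free = build_binary_tree(7, lambda i, k, edges: False)
```

Storing full edge lists in every table cell is wasteful but keeps the sketch short; a real implementation would more likely store only the chosen k per cell and rebuild the tree afterwards.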

Model for computing appropriate segment sizes

- The point-to-point communication performance is characterized by five parameters:
- L = Latency
- os(m) = the time that the CPU is busy sending a message of size m.
- or(m) = the time that the CPU is busy receiving a message of size m.
- g(m) = the minimum time interval between consecutive transmissions or receptions of messages of size m.
- P = the number of processors in the system.
- os, or, and g are functions of the message size, which allows the communication time of large messages to be modeled more accurately.

Experiments

- Evaluate the performance of pipelined broadcast with various broadcast trees on 100 Mbps (Fast) Ethernet and 1000 Mbps (Gigabit) Ethernet clusters with different physical topologies.
- Topology (1) contains 16 machines connected by a single switch.
- Topologies (2), (3), (4), and (5) are 32-machine clusters with different network connectivity.
- Topologies (4) and (5) have the same physical topology but different node assignments.

Cont..(1)

- Machine specifications:

Cont..(2)

- The extended parameterized LogP model characterizes the system with five parameters: L(m), os(m), or(m), g(m), and P.
- The range of potential segment sizes is 256 B to 32 kB.
- To obtain L(m), use a pingpong program to measure the round-trip time for messages of size m, RTT(m), and derive L(m) from the formula RTT(m) = L(m) + g(m) + L(m) + g(m).
- With 1000 Mbps Ethernet, the CPU is the bottleneck when the message size is more than 8 kB. That is why L(m) decreases when m increases from 8 kB to 32 kB on 1000 Mbps.
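Solving the round-trip formula above for L(m) is a one-line rearrangement, assuming g(m) has already been measured separately:

```latex
\begin{align*}
\mathrm{RTT}(m) &= L(m) + g(m) + L(m) + g(m) = 2\,\bigl(L(m) + g(m)\bigr) \\
L(m) &= \frac{\mathrm{RTT}(m)}{2} - g(m)
\end{align*}
```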

Cont..(3)

- Sometimes the predicted optimal segment sizes differ from the measured ones.
- Factors: a) the model assumes a 1-port model in which each node can send and receive at the link speed. The assumption holds for clusters with 100 Mbps Ethernet, but the processors cannot keep up with sending and receiving at 1000 Mbps at the same time. b) inaccuracy in the performance parameter measurements.

Results on 100 Mbps Ethernet switched clusters

- The time for binary trees is about twice the time to send a single message.
- The segment size has little impact on pipelined broadcast performance: changing the segment size from 512 bytes to 2048 bytes does not significantly affect performance, especially when compared with the differences between algorithms.

Performance of different broadcast trees, 100 Mbps

- The linear tree offers the best performance when the message size is large (>=32kB).
- The binary tree offers the best performance for medium-sized messages (8-16kB).
- The communication completion time for linear trees is very close to T(msize).

Performance of different algorithms, 100 Mbps (LAM+MPICH) – topology 4

- Poor performance (plot annotation).
- MPICH (scatter followed by all-gather algorithm) gradually approaches the performance of pipelined broadcast with binary trees.
- Pipelined broadcast with a linear tree is about twice as fast as MPICH.

Performance of different algorithms, 100 Mbps (LAM+MPICH) – topology 5

- All algorithms in LAM and MPICH perform poorly.
- Topology-unaware algorithms are sensitive to the physical topology, which manifests the advantage of pipelined broadcast with contention-free trees.

Results on 1000 Mbps Ethernet switched clusters

- The linear tree performs better than the binary tree when the message is larger than 1 MB.
- Factors: a) the processors cannot keep up with sending and receiving data at 1000 Mbps at the same time. The binary tree pipelined broadcast algorithm is less computationally intensive than the linear tree algorithm. b) larger software start-up overheads in 1000 Mbps Ethernet.

Performance with different broadcast trees, 1000 Mbps

- The 3-ary tree is always worse than the binary tree, which confirms that k-ary trees with k>2 are not effective.
- Insufficient CPU speed significantly affects the linear tree algorithm.

Performance for different algorithms, 1000 Mbps (LAM+MPICH) – topology 4

- The recursive-doubling algorithm introduces severe network contention and yields extremely poor performance.
- Although MPICH performs well, it is still 64% slower than the contention-free broadcast tree.

Performance for different algorithms, 1000 Mbps (LAM+MPICH) – topology 5

- Severe network contention (plot annotation).
- Pipelined broadcast performs better than the algorithms in MPICH and LAM on 1000 Mbps clusters in all situations.
- All algorithms used by LAM and MPICH incur severe network contention and perform much worse across all message sizes.

Properties of pipelined broadcast algorithms

- Two conditions for pipelined broadcast to be effective:
- The software overhead for splitting a large message into segments should not be excessive.
- The pipeline term must dominate the delay term.
- For 100 Mbps, when the segment size ≥ 1024 bytes, X*T(msize/X) is within 10% of T(msize), where X is the number of segments.
- For 1000 Mbps, when the segment size ≥ 8 kB, X*T(msize/X) is within 10% of T(msize).
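The trade-off behind these conditions can be illustrated with a small cost model. Everything in this sketch is an assumption rather than a formula quoted from the slides: T(m) is taken to be a start-up cost plus m divided by the bandwidth, and a linear-tree pipelined broadcast of X segments over P processes is taken to finish in about (X + P - 2) * T(msize/X), where X*T is the pipeline term and (P - 2)*T is the delay term. The parameter values in the usage line are hypothetical.

```python
import math

# Candidate segment sizes from 256 B to 32 kB, doubling at each step.
CANDIDATES = [256 * 2 ** i for i in range(8)]

def t_p2p(m, startup, bandwidth):
    """Assumed point-to-point time: start-up cost plus transfer time."""
    return startup + m / bandwidth

def linear_pipeline_time(msize, seg, p, startup, bandwidth):
    """Assumed model: (X + P - 2) * T(seg), with X = number of segments."""
    x = math.ceil(msize / seg)
    return (x + p - 2) * t_p2p(seg, startup, bandwidth)

def best_segment(msize, p, startup, bandwidth, candidates=CANDIDATES):
    """Pick the candidate segment size with the smallest modeled time."""
    return min(candidates,
               key=lambda s: linear_pipeline_time(msize, s, p, startup, bandwidth))

# Hypothetical values: 1 MB message, 16 processes, 50 us start-up,
# 12.5e6 bytes/s (roughly 100 Mbps).
seg = best_segment(1 << 20, 16, 50e-6, 12.5e6)
```

With zero start-up cost the model always prefers the smallest segment, and with a very large start-up cost it prefers the largest one, which mirrors the compromise between software overhead and pipeline efficiency.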

Cont..(1)

- When the segment size is smaller than these thresholds, the communication start-up overheads increase more dramatically. However, the optimal segment size may be less than the thresholds because it is a compromise between software overhead and pipeline efficiency.
- The linear tree pipelined algorithm is efficient for broadcasting on a small number of processes, while the binary tree algorithm may be applied to a large number of processes.

Conclusions

- The modeled optimal segment size may differ from the measured one, but the modeled performance matches the measured performance.
- Pipelined broadcast is more efficient than other commonly used broadcast algorithms on contemporary 100 Mbps and 1000 Mbps Ethernet switched clusters in many situations.
- The techniques can be applied to other types of clusters.
- Near-optimal broadcast performance can be achieved on irregular topologies by finding a spanning tree and then applying the techniques.

References

- [1] O. Beaumont, A. Legrand, L. Marchal, Y. Robert, Pipelined broadcasts on heterogeneous platforms, IEEE Transactions on Parallel and Distributed Systems 16 (4) (2005) 300-313.
- [2] O. Beaumont, L. Marchal, Y. Robert, Broadcast trees for heterogeneous platforms, in: The 9th IEEE Int’l Parallel and Distributed Processing Symposium, 2005, p. 80b.
- [3] K.W. Cameron, X.-H. Sun, Quantifying locality effect in data access delay: Memory LogP, in: IEEE Int’l Parallel and Distributed Processing Symposium IPDPS, 2003, p. 48b.
- [4] K.W. Cameron, R. Ge, X.-H. Sun, LognP and log3P: Accurate analytical models of point-to-point communication in distributed systems, IEEE Transactions on Computers 56 (3) (2007) 314-327.
- [5] D. Culler, et al., LogP: Towards a realistic model of parallel computation, in: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP, 1993, pp. 1-12.
- [6] A. Faraj, X. Yuan, Automatic generation and tuning of MPI collective communication routines, in: The 19th ACM International Conference on Supercomputing, 2005, pp. 393-402.
- [7] A. Faraj, X. Yuan, P. Patarasuk, A message scheduling scheme for all-to-all personalized communication on Ethernet switched clusters, IEEE Transactions on Parallel and Distributed Systems 18 (2) (2007) 264-276.
- [8] A. Faraj, P. Patarasuk, X. Yuan, A study of process arrival patterns for MPI collective operations, International Journal of Parallel Programming (in press).
- [9] A. Faraj, P. Patarasuk, X. Yuan, Bandwidth efficient all-to-all broadcast on switched clusters, International Journal of Parallel Programming (in press).
- [10] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G.E. Fagg, E. Gabriel, J. Dongarra, Performance analysis on MPI collective operations, in: The 19th IEEE International Parallel and Distributed Processing Symposium, 2005, pp. 8-8.
- [11] S.L. Johnsson, C.T. Ho, Optimum broadcasting and personalized communication in hypercube, IEEE Transactions on Computers 38 (9) (1989) 1249-1268.
- [12] A. Karwande, X. Yuan, D.K. Lowenthal, An MPI prototype for compiled communication on Ethernet switched clusters, Journal of Parallel and Distributed Computing 65 (10) (2005) 1123-1133.
- [13] R. Kesavan, D.K. Panda, Optimal multicast with packetization and network interface support, in: Proceedings of International Conference on Parallel Processing, 1997, pp. 370-377.

References (2)

- [14] T. Kielmann, H.E. Bal, K. Verstoep, Fast measurement of LogP parameters for message passing platforms, in: Proceedings of 2000 IPDPS Workshop on Parallel and Distributed Processing, Cancun, Mexico, May 2000, pp. 1176-1183.
- [15] LAM/MPI Parallel Computing, http://www.lam-mpi.org/.
- [16] R.G. Lane, S. Daniels, X. Yuan, An empirical study of reliable multicast protocols over Ethernet-connected networks, Performance Evaluation Journal 64 (2007) 210-228.
- [17] P.K. McKinley, H. Xu, A. Esfahanian, L.M. Ni, Unicast-based multicast communication in wormhole-routed networks, IEEE Transactions on Parallel and Distributed Systems 5 (12) (1994) 1252-1264.
- [18] The MPI Forum. The MPI-2: Extensions to the Message Passing Interface, July 1997. Available at: http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html.
- [19] MPICH- A portable implementation of MPI. http://www.mcs.anl.gov/mpi/mpich.
- [20] P. Sanders, J.F. Sibeyn, A bandwidth latency tradeoff for broadcast and reduction, Information Processing Letters 86 (1) (2003) 33-38.
- [21] SCI-MPICH: MPI for SCI-connected Clusters. Available at: www.lfbs.rwth-aachen.de/users/joachim/SCI-MPICH/pcast/html.
- [22] A. Tanenbaum, Computer Networks, 4th Edition, 2004.
- [23] J.-Y. Tien, C.-T. Ho, W.-P. Yang, Broadcasting on incomplete hypercubes, IEEE Transactions on Computers 42 (11) (1993) 1393-1398.
- [24] J.L. Traff, A. Ripke, Optimal broadcast for fully connected networks, in: Proceedings of High-Performance Computing and Communication (HPPC-05), 2005, pp. 45-46.
- [25] J.L. Traff, A. Ripke, An optimal broadcast algorithm adapted to SMP-clusters, EURO PVM/MPI (2005) 48-56.
- [26] S.S. Vadhiyar, G.E. Fagg, J. Dongarra, Automatically tuned collective communications, in: Proceedings of SC’00: High Performance Networking and Computing (CDROM Proceedings), 2000.
- [27] J. Watts, R. van de Geijn, A pipelined broadcast for multidimensional meshes, Parallel Processing Letters 5 (2) (1995) 281-292.
- [28] X. Yuan, R. Melhem, R. Gupta, Algorithms for supporting compiled communication, IEEE Transactions on Parallel and Distributed Systems 14 (2) (2003) 107-118.

THE END

- Thank you.
- Questions and answers.
