Dealing with the Scale Problem. Innovative Computing Laboratory MPI Team. Binomial graph. Undirected graph G :=( V , E ),  V = n (any size) Node i ={0,1,2,…, n 1} has links to a set of nodes U U ={ i ±1, i ±2,…, i ±2 k  2 k ≤ n } in a circular space
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Undirected graph G:=(V, E), V=n(any size)
Node i={0,1,2,…,n1} has links to a set of nodes U
U={i±1, i±2,…, i±2k  2k ≤ n} in a circular space
U={ (i+1)mod n, (i+2)mod n,…, (i+2k)mod n  2k ≤ n } and
{ (n+i1)mod n, (n+i2)mod n,…, (n+i2k)mod n  2k ≤ n }
Merging all links create binomial graph from each node of the graph
Broadcast from any node in Log2(n) steps
Runtime scalability
Average Distance
NodeConnectivity (κ)
LinkConnectivity (λ)
Optimally Connected
κ = λ =δmin
Binomial graph propertiesDegree = number of neighbors = number of connections = resource consumption
Runtime scalability
Runtime scalability
# of new
connections
# of added nodes
# of nodes
Runtime scalability
Collective communications
Collective communications
Collective communications
P2
P3
P4
Diskless checkpointing4 available processors
Add a fifth and perform
a checkpoint(Allreduce)
P1
P2
P3
P4
P4
Pc
+
+
+
=
P1
P2
P3
P4
P4
Pc
Ready to continue
....
P1
P2
P3
P4
P4
Pc
Failure
P1
P3
P4
P4
Pc
Ready for recovery
Pc
P1
P3
P4
P2



=
Recover the processor/data
Fault tolerance
Fault tolerance
Performance of PCG with different MPI libraries
For ckpt we generate one ckpt every 2000 iterations
Fault tolerance
Fault tolerance
Fault tolerance
Fault tolerance
Testing
Design
and code
Release
Interactive Debugging
Fault tolerance
Fault tolerance
Fault tolerance
Log size per process on NAS Parallel Benchmarks (kB)
Log size on NAS parallel benchmarksFault tolerance
Dr. George Bosilca
Innovative Computing LaboratoryElectrical Engineering and Computer Science DepartmentUniversity of Tennessee