Algorithms for Dynamical Fermions


### Algorithms for Dynamical Fermions

A D Kennedy

School of Physics, The University of Edinburgh

NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics

Outline
• Monte Carlo integration
• Importance sampling
• Markov chains
• Detailed balance, Metropolis algorithm
• Symplectic integrators
• Hybrid Monte Carlo
• Pseudofermions
• RHMC
• QCD Computers


Monte Carlo methods: I
• Monte Carlo integration is based on the identification of probabilities with measures
• There are much better methods of carrying out low dimensional quadrature
• All other methods become hopelessly expensive for large dimensions
• In lattice QFT there is one integration per degree of freedom
• We are approximating an infinite dimensional functional integral


Monte Carlo methods: II
• Generate a sequence of random field configurations chosen from the probability distribution
• Measure the value of Ω on each configuration and compute the average
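The two steps, generating configurations and averaging an observable over them, can be sketched in a few lines (a plain Monte Carlo toy, not lattice code; the integrand x² is my choice for illustration):

```python
import random

def mc_average(f, n, seed=0):
    """Plain Monte Carlo: average f over configurations drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    samples = [f(rng.random()) for _ in range(n)]
    mean = sum(samples) / n
    # The variance of the estimator itself falls as 1/n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1) / n
    return mean, var

# Estimate the integral of x^2 over [0, 1], whose exact value is 1/3
est, var = mc_average(lambda x: x * x, 100_000)
```

The error estimate sqrt(var) falls as 1/√n, which is exactly what the Law of Large Numbers and Central Limit Theorem below quantify.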


Central Limit Theorem: I
• Distribution of values for a single sample Ω = Ω(φ)
• Law of Large Numbers: the sample average converges to the true expectation ⟨Ω⟩
• Central Limit Theorem: the Laplace–DeMoivre Central Limit theorem is an asymptotic expansion for the probability distribution of the sample average
• where the variance of the distribution of Ω is V = ⟨Ω²⟩ − ⟨Ω⟩²


Central Limit Theorem: II
• Generating function for connected moments (the cumulants)
• The first few cumulants are the mean, the variance, and higher connected moments
• Note that this is an asymptotic expansion


Central Limit Theorem: III
• Distribution of the average of N samples
• Connected generating function


Importance Sampling: I
• Integral I = ∫ dx Ω(x)
• Sample from a distribution with probability density ρ(x) > 0 and normalisation ∫ dx ρ(x) = 1
• Estimator of the integral: the average of Ω(x)/ρ(x) over samples drawn from ρ
• Estimator of the variance: the corresponding sample variance


Importance Sampling: II
• Minimise the variance with respect to ρ
• Constraint: normalisation N = ∫ dx ρ(x) = 1
• Lagrange multiplier λ
• Optimal measure: ρ(x) ∝ |Ω(x)|
• Optimal variance: (∫ dx |Ω(x)|)² − (∫ dx Ω(x))²


Example:

Optimal weight

Optimal variance

Importance Sampling: III


Importance Sampling: IV

1 Construct cheap approximation to |sin(x)|

2 Calculate relative area within each rectangle

3 Choose a random number uniformly

4 Select corresponding rectangle

5 Choose another random number uniformly

6 Select corresponding x value within rectangle

7 Compute |sin(x)|
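The seven steps can be sketched directly in Python (my own illustration, not the slide's code; I take f(x) = |sin x| on [0, π], whose integral is 2):

```python
import math
import random

def build_envelope(f, a, b, n):
    """Cheap piecewise-constant approximation to f: n rectangles covering [a, b]."""
    w = (b - a) / n
    heights = [max(f(a + i * w), f(a + (i + 1) * w)) for i in range(n)]
    areas = [h * w for h in heights]
    return w, heights, areas, sum(areas)

def sample_envelope(a, w, heights, areas, total, rng):
    # Select a rectangle with probability proportional to its relative area ...
    u = rng.random() * total
    for i, area in enumerate(areas):
        if u < area:
            break
        u -= area
    # ... then choose an x value uniformly within that rectangle
    x = a + (i + rng.random()) * w
    return x, heights[i] / total   # density of the sampling distribution at x

f = lambda x: abs(math.sin(x))
a, b = 0.0, math.pi
w, heights, areas, total = build_envelope(f, a, b, 100)

rng = random.Random(1)
n = 20_000
est = sum(f(x) / rho for x, rho in
          (sample_envelope(a, w, heights, areas, total, rng) for _ in range(n))) / n
```

Because the rectangles track |sin x| closely, the weights f(x)/ρ(x) are nearly constant, so the variance is far below that of uniform sampling.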


Importance Sampling: V
• With 100 rectangles we have V = 16.02328561
• But we can do better!
• With 100 rectangles we have V = 0.011642808


Markov Chains: I
• State space Ω
• (Ergodic) stochastic transitions P′: Ω → Ω
• This induces a deterministic evolution P: Q → Q on the space of probability distributions Q
• The distribution converges to a unique fixed point


Convergence of Markov Chains: I
• Define a metric d(Q₁, Q₂) on the space of (equivalence classes of) probability distributions
• Prove that d(PQ₁, PQ₂) ≤ (1 − α) d(Q₁, Q₂) with α > 0, so the Markov process P is a contraction mapping
• The sequence Q, PQ, P²Q, P³Q,… is therefore Cauchy
• The space of probability distributions is complete, so the sequence converges to a unique fixed point


Markov Chains: II
• Use Markov chains to sample from Q
• Suppose we can construct an ergodic Markov process P which has distribution Q as its fixed point
• Iterate the Markov process until it has converged (“thermalized”)
• Thereafter, successive configurations will be distributed according to Q
• But in general they will be correlated
• To construct P we only need relative probabilities of states
• Do not know the normalisation of Q
• Cannot use Markov chains to compute integrals directly
• We can compute ratios of integrals


Markov Chains: III
• Detailed balance: Q(x) P(x → y) = Q(y) P(y → x)
• Integrate w.r.t. y to obtain the fixed point condition
• Detailed balance is sufficient but not necessary for a fixed point
• Metropolis algorithm: accept a proposed change from x to y with probability min[1, Q(y)/Q(x)]
• Consider the cases Q(y) ≥ Q(x) and Q(y) < Q(x) separately to obtain the detailed balance condition
• Other choices are possible, e.g., Q(y)/[Q(x) + Q(y)]
• Such choices are sufficient but not necessary for detailed balance
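A minimal random-walk Metropolis sketch (illustrative; the unnormalised Gaussian target and the step size are my choices). Note that only the ratio Q(y)/Q(x) appears, so the normalisation of Q is never needed:

```python
import math
import random

def metropolis(logq, x0, step, n, seed=0):
    """Random-walk Metropolis: the fixed point is q(x) proportional to exp(logq(x)).

    Only *relative* probabilities enter, via the difference logq(y) - logq(x)."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        y = x + rng.uniform(-step, step)          # symmetric proposal
        if math.log(rng.random() + 1e-300) < logq(y) - logq(x):
            x = y                                  # accept with prob min[1, q(y)/q(x)]
        chain.append(x)
    return chain

# Sample from an unnormalised Gaussian q(x) = exp(-x^2 / 2)
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 2.5, 50_000)
```

Successive entries of `chain` are correlated, as the slide above warns; independent-looking estimates require thermalisation and binning.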


Markov Chains: IV
• Composition of Markov steps
• Let P₁ and P₂ be two Markov steps which have the desired fixed point distribution
• They need not be ergodic
• Then the composition of the two steps P₂P₁ will also have the desired fixed point
• And it may be ergodic
• This trivially generalises to any (fixed) number of steps
• For the case where P₁ is not ergodic but (P₁)ⁿ is, the terminology weakly and strongly ergodic is sometimes used


Markov Chains: V
• This result justifies “sweeping” through a lattice performing single site updates
• Each individual single site update has the desired fixed point because it satisfies detailed balance
• The entire sweep therefore has the desired fixed point, and is ergodic
• But the entire sweep does not satisfy detailed balance
• Of course it would satisfy detailed balance if the sites were updated in a random order
• But this is not necessary
• And it is undesirable because it puts too much randomness into the system


Markov Chains: VI
• Coupling from the Past
• Propp and Wilson (1996)
• Use a fixed set of random numbers
• Flypaper principle: if states coalesce they stay together forever
• Eventually, all states coalesce to some state α with probability one
• Any state from t = −∞ will coalesce to α
• α is a sample from the fixed point distribution
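A toy version of Propp and Wilson's algorithm, hedged: the monotone random walk on {0,…,10} is my stand-in for a real model. The same fixed random numbers drive every starting state, and we double the look-back into the past until the extremal states coalesce:

```python
import random

def cftp_monotone(update, bottom, top, seed=0):
    """Coupling from the Past for a monotone Markov chain.

    `update(state, u)` applies one step driven by the random number u, and must
    preserve the partial order.  We reuse a *fixed* set of random numbers
    (us[i] always drives the step at time -(i+1)) and double the starting time
    into the past until the largest and smallest states coalesce."""
    rng = random.Random(seed)
    us = []
    t = 1
    while True:
        while len(us) < t:
            us.append(rng.random())
        lo, hi = bottom, top
        for i in range(t - 1, -1, -1):       # run from time -t up to time 0
            lo, hi = update(lo, us[i]), update(hi, us[i])
        if lo == hi:
            return lo                         # a perfect sample from the fixed point
        t *= 2

# Monotone random walk on {0,...,10} with reflecting boundaries
def walk(x, u):
    return min(10, x + 1) if u < 0.5 else max(0, x - 1)

sample = cftp_monotone(walk, bottom=0, top=10)
```

Because the update is monotone, tracking only the largest and smallest states suffices, which is the point of the lattice-ordering slide that follows.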


Markov Chains: VII
• Suppose we have a lattice
• That is, a partial ordering with a largest and smallest element
• And an update step that preserves it
• Then once the largest and smallest states have coalesced all the others must have too



Markov Chains: VIII
• A non-trivial example of where this is possible is the ferromagnetic Ising model
• β ≥ 0
• The ordering is: spin configuration A ≥ B iff As ≥ Bs for every site s
• The update is a local heatbath


Autocorrelations: I
• Exponential autocorrelations
• The unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1
• All its other eigenvalues must lie within the unit circle
• In particular, the largest subleading eigenvalue λmax corresponds to the exponential autocorrelation time Nexp = −1/ln|λmax|


Autocorrelations: II
• Integrated autocorrelations
• Consider the autocorrelation of some operator Ω
• Without loss of generality we may assume ⟨Ω⟩ = 0
• Define the autocorrelation function C(ℓ) = ⟨Ω(t+ℓ) Ω(t)⟩ / ⟨Ω²⟩


Autocorrelations: III
• The autocorrelation function must fall faster than the exponential autocorrelation
• For a sufficiently large number of samples the variance of the sample average is enhanced by a factor of the integrated autocorrelation time
• Define the integrated autocorrelation function A = 1 + 2 Σ_{ℓ≥1} C(ℓ)
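An estimator of the integrated autocorrelation time with a truncation window (a sketch; the AR(1) test chain with a = 0.8, whose exact value is (1 + a)/(1 − a) = 9, is my choice):

```python
import random

def autocorr_time(xs, wmax):
    """Estimate 1 + 2*sum_l C(l) of a time series, truncated at lag wmax."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    tau = 1.0
    for lag in range(1, wmax):
        c = sum((xs[i] - mean) * (xs[i + lag] - mean)
                for i in range(n - lag)) / ((n - lag) * var)
        tau += 2.0 * c
    return tau

# AR(1) process x' = a x + noise has autocorrelation C(l) = a^l,
# so its integrated autocorrelation time is (1 + a)/(1 - a)
rng = random.Random(0)
a, x, xs = 0.8, 0.0, []
for _ in range(50_000):
    x = a * x + rng.gauss(0.0, 1.0)
    xs.append(x)
tau = autocorr_time(xs, wmax=50)
```

The truncation window matters in practice: too small a window biases tau downward, too large a window adds noise.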


Hybrid Monte Carlo: I
• In order to carry out Monte Carlo computations including the effects of dynamical fermions we would like to find an algorithm which
• Updates the fields globally
• Because single link updates are not cheap if the action is not local
• Takes large steps through configuration space
• Because small-step methods carry out a random walk which leads to critical slowing down with a dynamical critical exponent z = 2
• Does not introduce any systematic errors


Hybrid Monte Carlo: II
• A useful class of algorithms with these properties is the (Generalised) Hybrid Monte Carlo (HMC) method
• Introduce a “fictitious” momentum p corresponding to each dynamical degree of freedom q
• Find a Markov chain with fixed point ∝ exp[−H(q,p)] where H is the “fictitious” Hamiltonian ½p² + S(q)
• The action S of the underlying QFT plays the rôle of the potential in the “fictitious” classical mechanical system
• This gives the evolution of the system in a fifth dimension, “fictitious” or computer time
• This generates the desired distribution exp[−S(q)] if we ignore the momenta p (i.e., the marginal distribution)


Hybrid Monte Carlo: III
• The HMC Markov chain alternates two Markov steps
• Molecular Dynamics Monte Carlo (MDMC)
• (Partial) Momentum Refreshment
• Both have the desired fixed point
• Together they are ergodic


MDMC: I
• If we could integrate Hamilton’s equations exactly we could follow a trajectory of constant fictitious energy
• This corresponds to a set of equiprobable fictitious phase space configurations
• Liouville’s theorem tells us that this also preserves the functional integral measure dp ∧ dq as required
• Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test
• With acceptance probability min[1, exp(−δH)]
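A one-degree-of-freedom HMC sketch (the Gaussian action S(q) = q²/2 is assumed purely for illustration). The leapfrog integrator below is reversible and area preserving, so the Metropolis test with acceptance probability min[1, exp(−δH)] makes the algorithm exact:

```python
import math
import random

def leapfrog(q, p, dSdq, dt, nsteps):
    """Reversible, area-preserving leapfrog integrator for H = p^2/2 + S(q)."""
    p -= 0.5 * dt * dSdq(q)
    for _ in range(nsteps - 1):
        q += dt * p
        p -= dt * dSdq(q)
    q += dt * p
    p -= 0.5 * dt * dSdq(q)
    return q, p

def hmc(q, S, dSdq, dt, nsteps, niter, seed=0):
    """HMC: momentum heatbath + MDMC; the fixed point is exp(-S(q))."""
    rng = random.Random(seed)
    chain = []
    for _ in range(niter):
        p = rng.gauss(0.0, 1.0)
        h0 = 0.5 * p * p + S(q)
        q1, p1 = leapfrog(q, p, dSdq, dt, nsteps)
        dh = 0.5 * p1 * p1 + S(q1) - h0
        if dh < 0 or rng.random() < math.exp(-dh):   # min[1, exp(-dH)]
            q = q1
        chain.append(q)
    return chain

S = lambda q: 0.5 * q * q                 # toy "action": target exp(-q^2/2)
dSdq = lambda q: q
chain = hmc(0.0, S, dSdq, dt=0.1, nsteps=10, niter=20_000)

# Reversibility check: integrate, flip the momentum, integrate again
q1, p1 = leapfrog(0.3, -1.2, dSdq, 0.1, 10)
q2, p2 = leapfrog(q1, -p1, dSdq, 0.1, 10)
```

Flipping the momentum and integrating again returns the initial point (up to rounding error), which is the reversibility property the accept/reject step relies on.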


MDMC: II
• We build the MDMC step out of three parts
• Molecular Dynamics (MD), an approximate integrator U(τ) which is exactly
• Area preserving
• Reversible
• A momentum flip F: p → −p
• A Metropolis accept/reject step, with y being a uniformly distributed random number in [0,1]
• The composition of these gives the MDMC update


Partial Momentum Refreshment
• Mix the momentum with a Gaussian noise vector ξ through a rotation by an angle θ: p → p cos θ + ξ sin θ
• The Gaussian distribution of p is invariant under F
• The extra momentum flip F ensures that for small θ the momenta are reversed after a rejection rather than after an acceptance
• For θ = π/2 all momentum flips are irrelevant


Symplectic Integrators: I
• Baker-Campbell-Hausdorff (BCH) formula
• If A and B belong to any (non-commutative) algebra then e^A e^B = e^{A+B+δ}, where δ is constructed from commutators of A and B (i.e., it is in the Free Lie Algebra generated by {A, B})
• More precisely, ln(e^A e^B) = A + B plus a sum of higher-order multiple commutators of A and B


Symplectic Integrators: II
• Explicitly, the first few terms are ln(e^A e^B) = A + B + ½[A,B] + ¹⁄₁₂([A,[A,B]] − [B,[A,B]]) + …
• The following identity follows directly from the BCH formula: in the symmetric product ln(e^{A/2} e^B e^{A/2}) the ½[A,B] term cancels, leaving only corrections built from double commutators
• In order to construct reversible integrators we use symmetric symplectic integrators


Symplectic Integrators: III
• We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian H(q,p) = T(p) + S(q)
• The basic idea of such a symplectic integrator is to write the time evolution operator as the exponential of the Hamiltonian vector field and split it into exactly integrable pieces

Symplectic Integrators: IV
• Define the linear operators ĥ_T ≡ {·, T} and ĥ_S ≡ {·, S}, so that ĥ = ĥ_T + ĥ_S
• Since the kinetic energy T is a function only of p and the potential energy S is a function only of q, it follows that the action of e^{τ ĥ_T} and e^{τ ĥ_S} may be evaluated trivially


Symplectic Integrators: V
• From the BCH formula we find that the PQP symmetric symplectic integrator is given by U(δτ) = [e^{½δτ ĥ_S} e^{δτ ĥ_T} e^{½δτ ĥ_S}]^{τ/δτ}
• In addition to conserving energy to O(δτ²) such symmetric symplectic integrators are manifestly area preserving and reversible


Symplectic Integrators: VI
• For each symplectic integrator there exists a Hamiltonian H’ which is exactly conserved
• This is given by substituting Poisson brackets for commutators in the BCH formula


Symplectic Integrators: VII
• Poisson brackets form a Lie algebra: {A, B} = ∂A/∂q ∂B/∂p − ∂A/∂p ∂B/∂q
• Homework exercise: verify the Jacobi identity {{A,B},C} + {{B,C},A} + {{C,A},B} = 0


Symplectic Integrators: VIII
• The explicit result for H′ is H plus O(δτ²) corrections built from double Poisson brackets such as {S,{S,T}} and {T,{S,T}}
• Note that H′ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential term
• As H′ is conserved, δH is of O(δτ²) for trajectories of arbitrary length
• Even if τ = O(δτ⁻ᵏ) with k > 1
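The O(δτ²) scaling is easy to verify numerically; a sketch for the harmonic oscillator H = ½p² + ½q² (my toy choice, not from the slides): halving the step size at fixed trajectory length should reduce δH by a factor of about four.

```python
def leapfrog(q, p, dt, nsteps):
    """PQP leapfrog for H = p^2/2 + q^2/2 (harmonic oscillator)."""
    for _ in range(nsteps):
        p -= 0.5 * dt * q
        q += dt * p
        p -= 0.5 * dt * q
    return q, p

def dH(dt, tau=10.0):
    """Energy violation after a trajectory of fixed length tau."""
    q0, p0 = 1.0, 0.0
    q, p = leapfrog(q0, p0, dt, int(round(tau / dt)))
    return abs(0.5 * (p * p + q * q) - 0.5 * (p0 * p0 + q0 * q0))

r = dH(0.02) / dH(0.01)   # a second-order integrator gives r close to 4
```

Note that δH does not grow with trajectory length here: the shadow Hamiltonian H′ is exactly conserved, so the violation merely oscillates.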


Symplectic Integrators: IX
• Multiple timescales
• Split the Hamiltonian into pieces, H = T + S₁ + S₂ + …
• Define the corresponding operators ĥ_{S₁}, ĥ_{S₂}, … as before
• Introduce a symmetric symplectic integrator in which each piece is integrated with its own (nested) step size
• If the step sizes are chosen so that each sub-step contributes equally then the instability in the integrator is tickled equally by each sub-step
• This helps if the most expensive force computation does not correspond to the largest force


Dynamical fermions: I
• Pseudofermions
• Direct simulation of Grassmann fields is not feasible
• The problem is not that of manipulating anticommuting values in a computer
• It is that the fermionic measure is not positive, and thus we get poor importance sampling
• The overall sign of the exponent is unimportant
• We therefore integrate out the fermion fields to obtain the fermion determinant


Dynamical fermions: II
• Any operator Ω can be expressed solely in terms of the bosonic fields
• Use Schwinger’s source method and integrate out the fermions
• E.g., the fermion propagator is ⟨ψ(x) ψ̄(y)⟩ = M⁻¹(x,y)


Dynamical fermions: III
• Including the determinant as part of the observable to be measured is not feasible
• The determinant is extensive in the lattice volume, thus again we get poor importance sampling
• Instead, represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel: det M ∝ ∫ dφ† dφ exp(−φ† M⁻¹ φ)
• The fermion kernel must be positive definite (all its eigenvalues must have positive real parts) otherwise the bosonic integral will not converge
• The new bosonic fields are called pseudofermions


Dynamical fermions: IV
• It is usually convenient to introduce two flavours of fermion and to write det M†M = ∫ dφ† dφ exp(−φ† (M†M)⁻¹ φ)
• This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying M† to a random Gaussian distributed field
• The evaluation of the pseudofermion action, and of the corresponding force in the equations of motion for the boson (gauge) fields, then requires us to find the solution of a (large) set of linear equations
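The heatbath works because the pseudofermion action collapses algebraically: with φ = M†ξ one has φ†(M†M)⁻¹φ = ξ†ξ exactly. A small numpy check (the random matrix M stands in for a fermion kernel; it is not a lattice operator):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random non-singular complex "fermion matrix" M (purely illustrative)
n = 8
M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)) + 4 * np.eye(n)

# Global heatbath: phi = M† xi with xi Gaussian distributed
xi = (rng.normal(size=n) + 1j * rng.normal(size=n)) / np.sqrt(2)
phi = M.conj().T @ xi

# S_pf = phi† (M†M)⁻¹ phi reduces identically to |xi|², so phi is drawn
# from exp(-phi† (M†M)⁻¹ phi) exactly, with no Markov chain needed
S_pf = (phi.conj() @ np.linalg.solve(M.conj().T @ M, phi)).real
```

In production code the dense `solve` is of course replaced by an iterative Krylov solver, which is the point of the next slide.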


Dynamical fermions: V
• It is not necessary to carry out the inversions required for the equations of motion exactly
• There is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step
• The inversions required to compute the pseudofermion action for the accept/reject step do need to be carried out exactly, however
• We usually take “exactly” to be synonymous with “to machine precision”


Reversibility: I
• Are HMC trajectories reversible and area preserving in practice?
• The only fundamental source of irreversibility is the rounding error caused by using finite precision floating point arithmetic
• For fermionic systems we can also introduce irreversibility by choosing the starting vector for the iterative linear equation solver time-asymmetrically
• We do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solutions as the starting vector
• Floating point arithmetic is not associative
• It is more natural to store compact variables as scaled integers (fixed point)
• This saves memory
• But it does not solve the precision problem


Reversibility: II
• Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially
• The underlying continuous time equations of motion are chaotic
• Ляпунов exponents characterise the divergence of nearby trajectories
• The instability in the integrator occurs when δH » 1
• In that regime the acceptance rate is zero anyhow


Reversibility: III
• In QCD the Ляпунов exponents appear to scale as the system approaches the continuum limit β → ∞, their product with the relevant physical scale remaining constant
• This can be interpreted as saying that the Ляпунов exponent characterises the chaotic nature of the continuum classical equations of motion, and is not a lattice artefact
• Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit
• Caveat: data is only for small lattices, and is not conclusive


Polynomial approximation
• What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0,1]?
• Best with respect to the appropriate norm ‖p − f‖ₙ = [∫₀¹ dx |p(x) − f(x)|ⁿ]^{1/n} where n ≥ 1


Weierstraß’ theorem
• Weierstraß: Any continuous function can be arbitrarily well approximated by a polynomial
• Taking n → ∞ the norm above becomes the minimax norm ‖p − f‖∞ = max_{0≤x≤1} |p(x) − f(x)|


Чебышев’s theorem
• Чебышев: There is always a unique polynomial of any degree d which minimises ‖p − f‖∞
• The error |p(x) − f(x)| reaches its maximum at exactly d+2 points on the unit interval


Чебышев’s theorem: Necessity
• Suppose p − f has fewer than d+2 extrema of equal magnitude
• Then at most d+1 maxima exceed some magnitude
• This defines a “gap”
• We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation), and whose magnitude is smaller than the “gap”
• The polynomial p + q is then a better approximation than p to f


Чебышев’s theorem: Sufficiency
• Suppose there is a better polynomial p′ of degree d
• Then at each of the d+2 extrema the difference p′ − p = (p′ − f) − (p − f) must alternate in sign
• Therefore p′ − p must have d+1 zeros on the unit interval
• Thus p′ − p = 0 as it is a polynomial of degree d


Чебышев polynomials
• The best approximation of degree d−1 over [−1,1] to x^d is obtained by subtracting the leading Чебышев polynomial: p(x) = x^d − 2^{1−d} T_d(x)
• Where the Чебышев polynomials are T_d(x) = cos(d cos⁻¹ x)
• The error is 2^{1−d}
• Convergence is often exponential in d
• The notation T_d is an old transliteration of Чебышев!
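This can be checked with numpy's Chebyshev utilities (an illustration of the theorem, using d = 6 as an arbitrary example):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Best degree-(d-1) approximation to x^d on [-1, 1]: subtract 2^(1-d) T_d(x)
d = 6
Td = C.cheb2poly([0] * d + [1])     # power-series coefficients of T_d
p = -(2.0 ** (1 - d)) * Td          # start from -2^(1-d) T_d ...
p[d] += 1.0                         # ... and add x^d: the x^d terms cancel,
assert abs(p[d]) < 1e-12            # leaving a polynomial of degree d-1

x = np.linspace(-1.0, 1.0, 10_001)
err = np.abs(x ** d - np.polynomial.polynomial.polyval(x, p))
# The maximum error is 2^(1-d) = 1/32, attained at the equioscillation points
```

The error curve is just 2^{1−d} T_d(x), which equioscillates between ±2^{1−d}, exactly as Чебышев's theorem requires.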


Чебышев rational functions
• Чебышев’s theorem is easily extended to rational approximations
• Rational functions with nearly equal degree numerator and denominator are usually best
• Convergence is still often exponential
• Rational functions usually give a much better approximation
• A simple (but somewhat slow) numerical algorithm for finding the optimal Чебышев rational approximation was given by Ремез


Чебышев rationals: Example
• A realistic example is a rational approximation with its partial fraction expansion
• This is accurate to within almost 0.1% over the range [0.003, 1]
• Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly
• This appears to be numerically stable


Polynomials versus rationals
• The optimal L₂ approximation with a suitable weight is polynomial, and has L₂ error of O(1/n)
• Золотарев’s formula for the optimal rational approximation has L∞ error that falls exponentially with the degree n
• The optimal L∞ polynomial approximation cannot be too much better than the L₂ one (or it would itself lead to a better L₂ approximation)


Non-linearity of CG solver
• Suppose we want to solve A²x = b for Hermitian A by CG
• It is better to solve Ax = y, Ay = b successively
• Condition number κ(A²) = κ(A)²
• Cost is thus 2κ(A) < κ(A²) in general
• Suppose we want to solve Ax = b
• Why don’t we solve A^{1/2}x = y, A^{1/2}y = b successively?
• The square root of A is uniquely defined if A > 0
• This is the case for fermion kernels
• All this generalises trivially to nth roots
• No tuning is needed to split the condition number evenly
• How do we apply the square root of a matrix?
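A CG sketch illustrating the point (my own toy implementation and test matrix, not lattice code): solving A²x = b as two successive well-conditioned solves.

```python
import numpy as np

def cg(A, b, tol=1e-10, maxiter=500):
    """Conjugate gradients for Ax = b with A Hermitian positive definite."""
    x = np.zeros_like(b)
    r = b.copy()                       # residual b - A x
    p = r.copy()                       # search direction
    rs = np.vdot(r, r).real
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rs / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n = 30
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)            # Hermitian positive definite test matrix
b = rng.normal(size=n)

# Solve A²x = b by two successive solves of the better-conditioned A
y = cg(A, b)                           # A y = b
x = cg(A, y)                           # A x = y, so A² x = b
```

Each solve sees condition number κ(A) rather than κ(A)², which is the non-linearity the slide exploits.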


Rational matrix approximation
• Functions on matrices
• Defined for a Hermitian matrix by diagonalisation: H = U D U⁻¹, so f(H) = f(U D U⁻¹) = U f(D) U⁻¹
• Rational functions do not require diagonalisation
• H^m + H^n = U(D^m + D^n)U⁻¹ and H⁻¹ = U D⁻¹ U⁻¹
• Rational functions have nice properties
• Cheap (relatively)
• Accurate
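A numpy sketch of both definitions (the matrix and the partial-fraction coefficients are made up for illustration; they are not an optimal approximation): applying a rational function of H to a vector needs only shifted solves, and agrees with the diagonalisation definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
B = rng.normal(size=(n, n))
H = B @ B.T + np.eye(n)                  # Hermitian positive definite

# A rational function in partial fraction form: r(x) = a0 + sum_k a_k / (x + b_k)
a0, a, b = 0.3, [0.5, 1.1], [0.2, 2.0]

# Apply r(H) to a vector using only shifted solves (H + b_k)⁻¹ v ...
v = rng.normal(size=n)
rv = a0 * v + sum(ak * np.linalg.solve(H + bk * np.eye(n), v)
                  for ak, bk in zip(a, b))

# ... and compare with the diagonalisation definition f(H) = U f(D) U⁻¹
d, U = np.linalg.eigh(H)
rv_diag = U @ ((a0 + sum(ak / (d + bk) for ak, bk in zip(a, b))) * (U.T @ v))
```

On the lattice the dense solves become Krylov-space solves against the shifted fermion kernel, which is why the partial fraction form is the useful one.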


No Free Lunch Theorem
• We must apply the rational approximation M^{1/n} ≈ r(M) with each CG iteration
• The condition number for each term in the partial fraction expansion is approximately κ(M)
• So the cost of applying M^{1/n} is proportional to κ(M)
• Even though the condition number κ(M^{1/n}) = κ(M)^{1/n}
• And even though κ(r(M)) = κ(M)^{1/n}
• So we don’t win this way…


Pseudofermions
• We want to evaluate a functional integral including the fermionic determinant det M
• We write this as a bosonic functional integral over a pseudofermion field with kernel M⁻¹


Multipseudofermions
• We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral
• This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields
• This increases the maximum fermion force
• This triggers the integrator instability
• This requires decreasing the integration step size
• A better estimate is det M = [det M^{1/n}]^n, with each factor represented by its own pseudofermion field


Violation of NFL Theorem
• So let’s try using our nth root trick to implement multipseudofermions
• Condition number κ(r(M)) = κ(M)^{1/n}
• So the maximum force is reduced by a factor of n κ(M)^{(1/n)−1}
• This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues
• This is so in the case of interest
• Cost is reduced by a factor of n κ(M)^{(1/n)−1}
• Optimal value n_opt ≈ ln κ(M)
• So the optimal cost reduction is (e ln κ)/κ
• This works!
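The arithmetic behind these factors can be checked in a few lines (κ = 10⁴ is an arbitrary illustrative value):

```python
import math

def cost_factor(n, kappa):
    """Relative force/cost factor n * kappa^((1/n) - 1) for n pseudofermions."""
    return n * kappa ** (1.0 / n - 1.0)

kappa = 1.0e4
n_opt = min(range(1, 50), key=lambda n: cost_factor(n, kappa))
# n_opt lands at ln(kappa) ≈ 9.2, and the minimum approaches e*ln(kappa)/kappa
```

Minimising n κ^{(1/n)−1} over continuous n gives n = ln κ, and substituting back gives the quoted (e ln κ)/κ reduction.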


Rational Hybrid Monte Carlo: I
• RHMC algorithm for the fermionic kernel
• Generate the pseudofermion from a Gaussian heatbath, using an accurate rational approximation
• Use a less accurate approximation for the MD, chosen so there are no double poles
• Use the accurate approximation again for the Metropolis acceptance step


Rational Hybrid Monte Carlo: II
• Reminders
• Apply rational approximations using their partial fraction expansions
• Denominators are all just shifts of the original fermion kernel
• All poles of optimal rational approximations are real and positive for cases of interest (Miracle #1)
• Only simple poles appear (by construction!)
• Use multishift solver to invert all the partial fractions using a single Krylov space
• Cost is dominated by Krylov space construction, at least for O(20) shifts
• Result is numerically stable, even in 32-bit precision
• All partial fractions have positive coefficients (Miracle #2)
• MD force term is of the usual form for each partial fraction
• Applicable to any kernel


Comparison with R algorithm: I

Binder cumulant of chiral condensate, B4, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 83×4, mud = 0.0076, ms = 0.25, τ= 1.0)


Comparison with R algorithm: II

The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 163×32×8, DBW2 gauge action, β= 0.72)

The step-size variation of the plaquette with mud =0.02


Comparison with R algorithm: III

The integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with mud = 0.04


Multipseudofermions with multiple timescales
• Semiempirical observation: the largest force from a single pseudofermion does not come from the smallest shift
• For example, look at the numerators in the partial fraction expansion we exhibited earlier
• Make use of this by using a coarser timescale for the more expensive smaller shifts


Berlin Wall for Wilson fermions
• HMC Results
• C Urbach, K Jansen, A Shindler, and U Wenger, hep-lat/0506011, hep-lat/0510064
• Comparable performance to Lüscher’s SAP algorithm
• RHMC?
• t.b.a.


Conclusions (RHMC)
• Exact
• No step-size errors; no step-size extrapolations
• Significantly cheaper than the R algorithm
• Allows easy implementation of Hasenbusch (multipseudofermion) acceleration
• Further improvements possible
• Such as multiple timescales for different terms in the partial fraction expansion
• ???


QCD Machines: I
• We want a cost-effective computer to solve interesting scientific problems
• In fact we wanted a computer to solve lattice QCD
• But it turned out that there is almost nothing that was not applicable to many other problems too
• Not necessary to solve all problems with one architecture
• Demise of the general-purpose computer?
• Development cost « hardware cost for one large machine
• Simple OS and software model
• Interleave a few long jobs without time- or space-sharing


QCD Machines: II
• Take advantage of mass market components
• It is not cost- or time-effective to compete with PC market in designing custom chips
• Use standard software and tools whenever possible
• Do not expect compilers or optimisers to do anything particularly smart
• Parallelism has to be built into algorithms and programs from the start
• Hand code critical kernels in assembler
• And develop these along with the hardware design


QCD Machines: III
• Parallel applications
• Many real-world applications are intrinsically parallel
• Because they are approximations to continuous systems
• Lattice QCD is a good example
• Lattice is a discretisation of four dimensional space-time
• Lots of arithmetic on small complex matrices and vectors
• Relatively tiny amount of I/O required
• Amdahl’s law
• The amount of parallel work may be increased by working on a larger volume
• Relevant parameter is σ, the number of sites per processor


Strong Scaling
• The amount of computation V^δ required to equilibrate a system of volume V increases faster than linearly
• If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one
• For lattice QCD we have algorithms with δ = 5/4
• We are therefore driven to as small a value of σ as possible
• This corresponds to “thin nodes,” as opposed to the “fat nodes” of PC clusters
• Clusters are competitive in price/performance up to a certain maximum problem size
• This borderline increases with time, of course


Data Parallelism
• All processors run the same code
• Not necessarily SIMD, where they share a common clock
• Synchronization on communication, or at explicit barriers
• Types of data parallel operations
• Pointwise arithmetic
• Nearest neighbour shifts
• Perhaps simultaneously in several directions
• Global operations
• Broadcasts, sums or other reductions
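In a data-parallel style these three operation types look like the following (numpy on a toy 2-d lattice stands in for the parallel machine; this is a sketch, not QCD code):

```python
import numpy as np

# A toy 2-d "lattice field": one value per site, updated data-parallel style
rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 8))

# Pointwise arithmetic: every site applies the same operation
phi2 = phi * phi

# Nearest-neighbour shifts: periodic boundaries, one hop in each direction
up    = np.roll(phi, -1, axis=0)
down  = np.roll(phi,  1, axis=0)
left  = np.roll(phi, -1, axis=1)
right = np.roll(phi,  1, axis=1)
laplacian = up + down + left + right - 4 * phi

# Global operation: a sum reduction over all sites
total = phi.sum()
```

On a real machine each `roll` is a face exchange with a neighbouring processor, and the reduction uses the combining network or grid wires discussed later.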


• Parallelism comes from running many separate more-or-less independent threads
• Recent architectures propose running many light-weight threads on each processor to overcome memory latency
• What are the threads that need almost no memory?
• Calculating zillions of digits of π?
• Cryptography?


Computational Grids

In the future carrying out large scale computations using the Grid will be as easy as plugging into an electric socket


Hardware Choices
• Cost/MFlop
• Goal is to be about 10 times more cost-effective than commercial machines
• Otherwise it is not worth the effort and risk of building our own machine
• Our current Teraflops machines cost about $1/MFlop
• For a Petaflops machine we will need to reach about $1/GFlop
• Power/cooling
• Most cost effective to use low-power components and high-density packaging
• Life is much easier if the machine can be air cooled


Clock Speed
• Peak/Sustained speed
• The sustained performance should be about 20%—50% of peak
• Otherwise there is either too much or too little floating point hardware
• Real applications rarely have equal numbers of adds and multiplies
• Clock speed
• “Sweet point” is currently at 0.5—1 GHz chips
• High-performance chips running at 3 GHz are both hot and expensive
• Using moderate clock speed makes electrical design issues such as clock distribution much simpler


Memory Systems
• Memory bandwidth
• This is currently the main bottleneck in most architectures
• There are two obvious solutions
• Data prefetching
• Vector processing is one way of doing this
• Requires more sophisticated software
• Feasible for our class of applications because the control flow is essentially static (almost no data dependencies)
• Hierarchical memory system (NUMA)
• We make use of both approaches


Memory Systems
• On-chip memory
• For QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM memory
• Cache
• Whether the on-chip memory is managed as a cache or as directly addressable memory is not too important
• Cache flushing for communications DMA is a nuisance
• Caches are built into most μ-processors, not worth designing our own!
• Off-chip memory
• The amount of off-chip memory is determined by cost
• If the cost of the memory is » 50% of the total cost then buy more processing nodes instead
• After all, the processors are almost free!


Communications Network: I
• This is where a massively parallel computer differs from a desktop or server machine
• In the future the network will become the principal bottleneck
• For large data parallel applications
• We will end up designing networks and decorating them with processors and memories which are almost free


Communications Network
• Topology
• Grid
• This is easiest to build
• How many dimensions?
• QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
• Extra dimensions allow easier partitioning of the machine
• Hypercube/Fat Tree/Ω network/Butterfly network
• These are all essentially an infinite dimensional grid
• Good for FFTs
• Switch
• Expensive, and does not scale well


[Figure: a global sum over the machine — the values 1 to 8 are combined to give the total 36]

Global Operations
• Combining network
• A tree which can perform arithmetic at each node
• Useful for global reductions
• But global operations can be performed using grid wires
• It is all a matter of cost
• Used in BGL, not in QCDx
• Grid wires
• Not very good error propagation
• O(N^{1/4}) latency (grows as the perimeter of the grid)
• O(N) hardware


[Figure: a combining tree — the leaves 1 to 8 are summed pairwise (3, 7, 11, 15), then 10 and 26, giving the global sum 36]

Global Operations
• Combining tree
• Good error propagation
• O(log N) latency
• O(N log N) hardware
• Bit-wise operations
• Allow arbitrary precision
• Used on QCDSP
• Not used on QCDOC or BGL
• Because data is sent byte-wise (for ECC)


Global Operations

[Figure: bit-serial global reductions on example operands]

• Sum: operands are streamed LSB first; each node adds (+) and propagates a carry
• Max: operands are streamed MSB first; each node compares (>) and selects the larger operand


Communications Network: II
• Parameters
• Bandwidth
• Latency
• Packet size
• The ability to send small [O(10²) bytes] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)
• Control (DMA)
• The DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU
• Block-strided moves
• I/O completion can be polled (or interrupt driven)


Packet Size



Block-strided Moves

[Figure: an example block-strided move, showing the block size, the stride, and the number of blocks]

For each direction separately we specify in memory mapped registers:

• Block size
• Stride
• Number of blocks
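In software terms, the DMA engine's job is a gather like this (a hedged sketch of the semantics, not the hardware):

```python
def block_strided_gather(mem, start, block_size, stride, nblocks):
    """Gather `nblocks` blocks of `block_size` words, spaced `stride` words apart.

    A software model of what the DMA engine does from its memory-mapped
    Block size / Stride / Number-of-blocks registers."""
    out = []
    for blk in range(nblocks):
        base = start + blk * stride
        out.extend(mem[base:base + block_size])
    return out

# Gathering a face of a 4x4 array stored row-major: every 4th word, 4 times
mem = list(range(16))
face = block_strided_gather(mem, start=0, block_size=1, stride=4, nblocks=4)
# face == [0, 4, 8, 12], the first column of the 4x4 array
```

This is exactly the access pattern needed to interchange the faces of a lattice sub-volume without interrupting the CPU.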


Hardware Design Process
• Optimise machine for applications of interest
• We run simulations of lattice QCD to tune our design
• Most design tradeoffs can be evaluated using a spreadsheet as the applications are mainly static (data-independent)
• It also helps to debug the machine if you use the actual kernels you will run in production as test cases!
• But running QCD even on a RTL simulator is painfully slow
• Circuit design done using hardware description languages (e.g., VHDL)
• Time to completion is critical
• Trade-off between risk of respin and delay in tape-out


VHDL

entity lcount5 is
  port (
    signal clk, res   : in vlbit;
    signal reset, set : in vlbit;
    signal count      : inout vlbit_1d(4 downto 0)
  );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res = '1';
    if res = '1' then
      count <= b"00000";
    elsif reset = '1' then
      count <= b"00000";
    elsif set = '1' then
      count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
      count(3) <= count(3) xor (count(0) and count(1) and count(2));
      count(2) <= count(2) xor (count(0) and count(1));
      count(1) <= count(1) xor count(0);
      count(0) <= not count(0);
    end if;
  end process;
end bhv;


QCDSP

[Photos: the 0.4 TFlops QCDSP machine at Columbia University and the 0.6 TFlops machine at BNL]


QCDOC ASIC Design

[Block diagram: a complete processor node for QCD computer On a Chip, fabricated by IBM using the SA27E 0.18μ logic + embedded DRAM process. The diagram shows the 440 core with FPU, the Prefetching EDRAM Controller (PEC), the DCR SDRAM controller, OFB and PLB buses with their arbiters and bridge, the DMA controller, HSSLs, Ethernet (EMAC, MII, MCMAL), interrupt controller, boot/clock support, and serial/location number logic, built from IBM ASIC library components plus custom designed logic.]

• 1 (0.8) GFlops double precision RISC processor
• 4 Mbytes of embedded DRAM (EDRAM = Embedded DRAM; DRAM = Dynamic RAM; RAM = Random Access Memory)
• 8 Gbyte/sec memory-processor bandwidth
• 2.6 Gbyte/sec EDRAM/SDRAM DMA, and a 2.6 Gbyte/sec interface to external memory
• 12 Gbit/sec communications bandwidth
• 100 Mbit/sec Fast Ethernet for communications control


QCDOC Components


QCDOC


Edinburgh ACF

UKQCDOC is located in the ACF near Edinburgh

Our insurers will not let us show photographs of the building for security reasons!


Blue Gene/L


Vicious Design Cycle
• Communications DMA controller required
• Otherwise control by CPU needed
• Requires interrupts at each I/O completion
• CPU becomes bottleneck
• Introduce more sophisticated DMA instructions
• Block-strided moves for QCDSP
• Sequences of block-strided moves for QCDOC
• Why not? It is cheap
• Use CPU as DMA engine
• Second CPU in Blue Gene/L
• Why not? It is cheap, and you double your peak Flops
• Use second FPU for computation
• Otherwise sustained fraction of peak is embarrassingly small
• Go back to step 1


What Next?
• We need a Petaflop computer for QCD
• New fermion formulations solve a lot of problems, but cost significantly more in computer time
• Algorithms are improving
• Slowly
• By factors of about 2
• Can we build a machine about 10 times more cost effective than commercial machines?
• Otherwise not worth the effort & risk of building our own
• Blue Gene/L and its successors provide a new opportunity
