
CSE 160 – Lecture 2





  1. CSE 160 – Lecture 2

  2. Today’s Topics • Flynn’s Taxonomy • Bit-Serial, Vector, Pipelined Processors • Interconnection Networks • Topologies • Routing • Embedding • Network Bisection

  3. Taxonomy • Flynn (1966) Classified machines by data and control streams

  4. SIMD • All processors execute the same program in lockstep • Each processor sees different data • Single control processor • Individual processors can be turned on/off at each cycle • Illiac IV, CM-2, MasPar are some examples • Silicon Graphics Reality Graphics engine

  5. MIMD • All processors execute their own set of instructions • Processors operate on separate data streams • No centralized clock implied • SP-2, T3E, clusters, Crays, etc.

  6. SPMD/MPMD • Single/Multiple Program Multiple Data • SPMD processors run the same program, but the processors are not necessarily run in lockstep • Very popular and scalable programming style • MPMD is similar except that different processors run different programs • PVM distribution has some simple examples

  7. Processor Types • Four types • Bit serial • Vector • Cache-based, pipelined • Custom (eg. Tera MTA or KSR-1)

  8. Bit Serial • Only seen in SIMD machines like CM-2 or MasPar • Each clock cycle, one bit of the data is loaded/written • Simplifies memory system and memory trace count • Popular for very dense (64K) processor arrays

  9. Cache-based, Pipelined • Garden-variety microprocessor • Sparc, Intel x86, MC68xxx, MIPS, … • Register-based ALUs and FPUs • Registers are of scalar type • Pipelined execution to improve performance of individual chips • Splits up components of a basic operation like addition into stages • The more stages, the greater the potential speedup, but the more problems with branching and data/control hazards • Per-processor caches make it challenging to build SMPs (coherency issues) • Now dominates the high-end market

  10. Vector Processors • Very specialized (e.g. $$$$$) machines • Registers are true vectors with power-of-2 lengths • Designed to efficiently perform matrix-style operations • Ax = b (b(I) = Σ_J A(I,J)*x(J)) • Vector registers V1, V2, V3 • V1 = A(I,*), V2 = x(*) • MULV V3(I), V1, V2 • “Chaining” to efficiently handle larger vectors than size of vector registers • Cray, Hitachi, SGI (now Cray SV-1) are examples

  11. Some Custom Processors • Denelcor HEP/Tera MTA • Multiple register sets • Stack Pointer, Instruction Pointer, Frame Pointer, etc. • Facilitates hardware threads • Switch each clock cycle to different register set • Why? Stalls to memory subsystem in one thread can be hidden by concurrency • KSR-1 • Cache-only memory processor • Basically 2 generations behind standard micros

  12. Going Parallel • Late 70’s, even vector “monsters” started to go parallel • For parallel processing to work, individual processors must synchronize • SIMD – Synchronize every clock cycle • MIMD – Explicit synchronization • Message passing • Semaphores, monitors, fetch-and-increment • Focus on interconnection networks for rest of lecture

  13. Characterizing Networks • Bandwidth • Device/switch latency • Switching types • Circuit switched (eg. Telephone) • Packet switched (eg. Internet) • Store and forward • Virtual Cut Through • Wormhole routed • Topology • Number of connections • Diameter (how many hops through switches)

  14. Latency • Latency is the time between issuing a command and seeing its first effect • Time from pushing on the gas pedal until the car goes forward • Time from entering a line until the cashier starts on your job • Time from the first bit leaving computer A until the first bit arrives at computer B, OR • (Message latency) Time from the first bit leaving computer A until the last bit arrives at computer B • Startup latency is the amount of time to send a zero-length message

  15. Bandwidth • Bits/second that can travel through a connection • A really simple model for calculating the time to send a message of N bytes • Time = latency + N/bandwidth (with bandwidth expressed in bytes/second) • Bisection is the minimum number of wires that must be cut to divide a network of machines into two equal halves • Bisection bandwidth is the total bandwidth through the bisection
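The simple cost model on this slide can be sketched in a few lines of Python. The function name and the link parameters below (50 microseconds of latency, 100 MB/s of bandwidth) are illustrative assumptions, not figures from the lecture.

```python
def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model from the slide: Time = latency + N / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


# Hypothetical link parameters, chosen only for illustration.
LATENCY = 50e-6      # seconds
BANDWIDTH = 100e6    # bytes per second

small = message_time(8, LATENCY, BANDWIDTH)          # dominated by latency
large = message_time(8_000_000, LATENCY, BANDWIDTH)  # dominated by N / bandwidth

print(f"8 B message:  {small * 1e6:.2f} microseconds")
print(f"8 MB message: {large * 1e3:.2f} milliseconds")
```

Under these assumed numbers, a short message costs almost exactly the latency, while a long one costs almost exactly N/bandwidth, which is why both terms matter when characterizing a network.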

  16. Interconnection Topologies • Completely connected • Every node has a direct wire connection to every other node • (N x (N-1))/2 wires • Clearly impractical for large N

  17. Line/Ring • [Figure: nodes 1 2 3 4 5 6 7 connected in a line] • Simple interconnection • First topology where routing is an issue • Routing is needed when no direct connection exists between nodes • To go from node 2 to node 4, a message has to pass through node 3 • What happens if 2 wants to communicate with 3 at the same time that 1 wants to communicate with 4? • What is the bisection of a line/ring? • If the links are of bandwidth B, what is the bisection bandwidth? • What is the aggregate bandwidth of the network?

  18. Mesh/Torus • Generalization of line/ring to multiple dimensions • More routes between nodes • What is the bisection of this network? • [Figure: 2D mesh built from rows of nodes 1 2 3 4 5 6 7]

  19. Hop Count • Networks are measured by diameter • This is the minimum number of hops that a message must traverse between the two nodes that are furthest apart • Line: Diameter = N-1 • 2D (NxM) Mesh: Diameter = N+M-2
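As a sanity check on the two diameter formulas, the sketch below measures the diameter of a mesh by brute-force breadth-first search and compares it with N+M-2. The helper names are hypothetical, and a line is treated as a 1 x N mesh.

```python
from collections import deque
from itertools import product


def mesh_neighbors(node, n, m):
    """Yield the up/down/left/right neighbors of node in an n x m mesh (no wraparound)."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n and 0 <= cc < m:
            yield (rr, cc)


def hops_from(src, n, m):
    """Breadth-first search: minimum hop count from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in mesh_neighbors(u, n, m):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mesh_diameter(n, m):
    """Largest shortest-path hop count over all pairs of nodes."""
    return max(
        max(hops_from(src, n, m).values())
        for src in product(range(n), range(m))
    )


print(mesh_diameter(1, 7))  # line of 7 nodes
print(mesh_diameter(4, 5))  # 4 x 5 mesh
```

The measured values agree with N-1 for the line and N+M-2 for the mesh, since the furthest pair is always a pair of opposite corners.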

  20. Tree-based Networks • Nodes organized in a tree fashion (important for some global algorithms) • What is the diameter of this network? • What are its bisection and bisection bandwidth?

  21. Hypercubes • [Figure: 1D, 2D, 3D, and 4D hypercubes]

  22. Hypercubes 2 • A dimension-N hypercube is constructed by connecting the corresponding “corners” of two dimension-(N-1) hypercubes • Relatively low wire count to build large networks • Multiple routes from any source to any destination • Exercise for the reader: what is the diameter of a K-dimensional hypercube?
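To put "relatively low wire count" in numbers, the sketch below compares the link count of a hypercube with that of a completely connected network of the same size. It relies on the standard counting argument that a dimension-d hypercube has 2^d nodes, each with d links, and every link is shared by two nodes; the function names are hypothetical.

```python
def complete_links(nodes):
    """Links in a completely connected network: N(N-1)/2."""
    return nodes * (nodes - 1) // 2


def hypercube_links(dim):
    """Links in a dim-dimensional hypercube: 2^dim nodes, dim links each, each link counted twice."""
    return dim * (1 << dim) // 2


for dim in (3, 10, 16):
    nodes = 1 << dim
    print(f"{nodes:6d} nodes: hypercube {hypercube_links(dim):8d} links, "
          f"completely connected {complete_links(nodes):12d} links")
```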

  23. Labeling/Routing in a Hypercube • Nodes are labeled in Gray code • Connected neighbors have binary node numbers that differ by exactly one bit • [Figure: 3D cube with corners labeled 000 001 101 100 010 011 110 111]
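A minimal sketch of the labeling idea, assuming the standard reflected binary Gray code (the slide does not say which Gray code is meant). Consecutive code words differ in one bit, and the hypercube neighbors of a node are the labels obtained by flipping each bit in turn. The function names are illustrative.

```python
def gray(i):
    """i-th word of the reflected binary Gray code."""
    return i ^ (i >> 1)


def neighbors(node, dim):
    """Hypercube neighbors of node: flip each of the dim address bits in turn."""
    return [node ^ (1 << k) for k in range(dim)]


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


DIM = 3
codes = [gray(i) for i in range(1 << DIM)]
print([format(c, "03b") for c in codes])

# Every step along the Gray sequence (including the wraparound) crosses one cube edge.
steps = [hamming(codes[i], codes[(i + 1) % len(codes)]) for i in range(len(codes))]
print(steps)
print([format(n, "03b") for n in neighbors(0b101, DIM)])
```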

  24. The e-cube routing algorithm • Source address S = S0 S1 S2 … Sn • Destination address D = D0 D1 D2 … Dn • Let R = R0 R1 R2 … Rn = S ⊕ D (bitwise XOR) • The number of one bits in R is the distance between S and D • Starting at S, go to the neighbor across the first dimension j where Rj = 1 (if Sj = 0, go to the neighbor where Sj = 1, and vice versa) • From this intermediate node, go to the neighbor across the next dimension k (k > j) where Rk = 1, and repeat until D is reached

  25. E-cube routing example • 8-dimensional hypercube (256 nodes) • S = 134 = 0x86 = 10000110 • D = 215 = 0xD7 = 11010111 • S ⊕ D = 0x51 = 01010001 • Distance = 3 • S → 11000110 (198) → 11010110 (214) → 11010111 (215)
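The routing example can be reproduced with a short sketch. It assumes, as the example itself does, that address bits are scanned from the leftmost (most significant) bit to the rightmost; the function name ecube_route is hypothetical.

```python
def ecube_route(src, dst, dim):
    """Return the list of nodes visited when routing from src to dst.

    R = src XOR dst marks the dimensions in which the two addresses differ.
    Bits are corrected one at a time, leftmost first, so each hop moves to a
    neighbor whose address differs from the current node in exactly one bit.
    """
    r = src ^ dst
    node = src
    path = [src]
    for bit in range(dim - 1, -1, -1):
        if r & (1 << bit):
            node ^= 1 << bit
            path.append(node)
    return path


path = ecube_route(134, 215, 8)
distance = bin(134 ^ 215).count("1")

print(distance)
for node in path:
    print(f"{node:08b} ({node})")
```

With S = 134 and D = 215 this visits 134, 198, 214, 215, matching the three hops listed on the slide.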

  26. Embedding • A network is embeddable if its nodes and links can be mapped onto a target network • A mesh is embeddable in a hypercube • There is a mapping of mesh nodes and links onto hypercube nodes and links • The dilation of an embedding is the maximum number of links in the target network needed to represent a single link of the embedded network • Perfect embeddings have dilation 1 • Embedding a tree into a mesh has a dilation of 2 (See example in book)
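One common way to embed a mesh in a hypercube is sketched below. It is a standard construction but an assumption here, not necessarily the one used in the course text: for a mesh whose side lengths are powers of two, Gray-code the row and column indices and concatenate them. The helper names are hypothetical. Since the hop distance between two hypercube nodes is the number of differing address bits, the dilation can be measured directly.

```python
def gray(i):
    """i-th word of the reflected binary Gray code."""
    return i ^ (i >> 1)


def embed_mesh_node(row, col, col_bits):
    """Map mesh node (row, col) to a hypercube address: gray(row) followed by gray(col)."""
    return (gray(row) << col_bits) | gray(col)


def dilation(row_bits, col_bits):
    """Largest hypercube distance between the images of two adjacent mesh nodes."""
    rows, cols = 1 << row_bits, 1 << col_bits
    worst = 0
    for r in range(rows):
        for c in range(cols):
            here = embed_mesh_node(r, c, col_bits)
            for rr, cc in ((r + 1, c), (r, c + 1)):  # down and right neighbors
                if rr < rows and cc < cols:
                    there = embed_mesh_node(rr, cc, col_bits)
                    worst = max(worst, bin(here ^ there).count("1"))
    return worst


# A 4 x 8 mesh mapped into a 5-dimensional hypercube (32 nodes).
addresses = {embed_mesh_node(r, c, 3) for r in range(4) for c in range(8)}
print(len(addresses))  # every mesh node gets its own hypercube node
print(dilation(2, 3))  # each mesh link maps to a single hypercube link
```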

  27. Modern Parallel Machines are Packet Switched • Break message into smaller blocks and send these pieces through the network • Network intermediate points (routers) can be store-and-forward or virtual cut-through • Store-and-forward requires buffering at each switch if an incoming packet has packets ahead of it on an outgoing port (congestion) • Virtual cut-through avoids the unconditional buffering of store-and-forward by “cutting through” the switch when the output port is free

  28. Wormhole Routing • Wormhole routing is a variation of virtual cut through • Small headers (flow control digits == Flits) pass through the network. • When a flit is allowed to cut through a switch, the original sender is guaranteed a clear path through that switch. • A tail flit closes the “connection” • Wormhole was defined by Seitz and is used in Myrinet, a very popular cluster interconnect.

  29. Latency of Circuit Switched and Virtual Cut Through • Circuit Switch Latency • (Lc/B) l + (L/B) • Lc = length of control packet • B = bandwidth • l = number of links • L = Length of Packet • Virtual Cut-through latency • (Lh/B) l + (L/B) • Lh = length of header packet

  30. Store-Forward and Wormhole routing Latency • Wormhole Routing Latency • (Lf/B) l + (L/B) • Lf = Length of flit • Store-Forward Latency • (L/B) l • Store and forward latency can be much worse for many hops. • Virtual Cut Through, Wormhole, and Circuit Switch reach (L/B) as message length increases
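The latency formulas from these two slides can be compared numerically. The sketch below codes them exactly as written; the packet size, header size, bandwidth, and hop count are made-up values for illustration only.

```python
def pipelined_latency(header_len, packet_len, bandwidth, links):
    """(Lx/B) * l + (L/B): the form shared by circuit switching (Lx = Lc),
    virtual cut-through (Lx = Lh), and wormhole routing (Lx = Lf)."""
    return (header_len / bandwidth) * links + packet_len / bandwidth


def store_and_forward_latency(packet_len, bandwidth, links):
    """(L/B) * l: the whole packet is received before being sent on each link."""
    return (packet_len / bandwidth) * links


# Hypothetical parameters.
B = 100e6    # bandwidth, bytes per second
L = 10_000   # packet length, bytes
LH = 10      # header (or flit, or control packet) length, bytes
HOPS = 10    # number of links l

cut_through = pipelined_latency(LH, L, B, HOPS)
store_forward = store_and_forward_latency(L, B, HOPS)

print(f"cut-through / wormhole / circuit: {cut_through * 1e6:.1f} microseconds")
print(f"store-and-forward:                {store_forward * 1e6:.1f} microseconds")
```

With these assumed values the store-and-forward time grows with the hop count, while the other three stay close to L/B because only the short header pays the per-link cost.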

  31. Deadlock/Livelock • Livelock/deadlock is a potential problem in any network design • Livelock occurs in adaptive routing algorithms when a packet never reaches its destination • Deadlock occurs when packets cannot be forwarded because they are waiting for other packets to move out of the way, while the blocking packet is in turn waiting for the blocked packet to move

  32. Next Time … • All about clusters • Introduction to PVM (and MPI)
