
Computer Architecture Guidance




Presentation Transcript


  1. Computer Architecture Guidance Keio University AMANO, Hideharu hunga@am.ics.keio.ac.jp

  2. Contents Techniques of two key architectures for future system LSIs • Parallel Architectures • Reconfigurable Architectures • Advanced uni-processor architecture → Special Course of Microprocessors (by Prof. Yamasaki, Fall term)

  3. Class • Lecture using PowerPoint (70 mins.) • The ppt file is uploaded on the web site http://www.am.ics.keio.ac.jp, and you can download/print it before the lecture. • When the file is uploaded, a notification is sent to you by e-mail. • Textbook: “Parallel Computers” by H. Amano (Sho-ko-do) • Exercise (20 mins.) • Simple design or calculation on design issues

  4. Evaluation • Exercise on Parallel Programming using SCore on RHiNET-2 (50%) • Exercise after every lecture (50%)

  5. Computer Architecture 1: Introduction to Parallel Architectures Keio University AMANO, Hideharu hunga@am.ics.keio.ac.jp

  6. Parallel Architecture A parallel architecture consists of multiple processing units which work simultaneously. • Purposes • Classifications • Terms • Trends

  7. Boundary between parallel machines and uniprocessors • Uniprocessors: ILP (Instruction Level Parallelism) • A single program counter • Parallelism inside/between instructions • Parallel machines: TLP (Thread Level Parallelism) • Multiple program counters • Parallelism between processes and jobs • Definition from Hennessy & Patterson’s Computer Architecture: A Quantitative Approach
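A minimal sketch (my own illustration, not from the lecture) of the TLP side of this boundary: POSIX threads give one program several program counters, each advancing independently, whereas ILP is extracted by the hardware inside a single instruction stream.

#include <pthread.h>
#include <stdio.h>

/* each thread is a separate instruction stream with its own
   program counter: thread-level parallelism (TLP) */
void *worker(void *arg) {
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}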

  8. Performance improvement • Two directions: increasing the number of simultaneously issued instructions, and tighter coupling [Figure: single pipeline → multiple instruction issue → multiple thread execution → on-chip implementation (shared memory, shared registers) → connecting multiple processors]

  9. Purposes of providing multiple processors • Performance • A job can be executed quickly with multiple processors • Fault tolerance • If a processing unit is damaged, the total system remains available: redundant systems • Resource sharing • Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems • Low power • High performance with low-frequency operation • Parallel architecture: performance centric!

  10. Flynn’s Classification • The number of instruction streams: M (Multiple) / S (Single) • The number of data streams: M/S • SISD • Uniprocessors (including superscalar, VLIW) • MISD: no practical machine exists (analog computers) • SIMD • MIMD

  11. SIMD (Single Instruction Stream, Multiple Data Streams) • All processing units execute the same instruction • Low degree of flexibility • Illiac-IV/MMX (coarse grain) • CM-2 type (fine grain) [Figure: one instruction memory broadcasts to the processing units, each with its own data memory]

  12. Two types of SIMD • Coarse grain: each node performs floating point numerical operations • ILLIAC-IV, BSP, GF-11 • Multimedia instructions in recent high-end CPUs • Dedicated on-chip approach: NEC’s IMAP • Fine grain: each node only performs operations on a few bits • ICL DAP, CM-2, MP-2 • Image/signal processing • Connection Machines extended the application to artificial intelligence (CmLisp)
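As a concrete example of the multimedia-instruction style of coarse grain SIMD mentioned above, here is a minimal sketch (my own, assuming an x86 CPU with SSE) in which a single instruction adds four single-precision floats at once:

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);     /* load four floats into one register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one SIMD instruction, four adds */
    _mm_storeu_ps(c, vc);
    for (int i = 0; i < 4; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}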

  13. A processing unit of CM-2 [Figure: a 1-bit serial ALU with operand flags A, B, F, an operation input, carry and context bits, and a 256-bit memory per processing unit]

  14. Element of CM-2 [Figure: one LSI chip holds a 4×4 processor array (16 PEs) and a router with 12 links; 4096 chips are connected as a hypercube, giving 64K PEs in total; each chip is paired with RAM providing 256 bits × 16 PEs]

  15. Thinking Machines’ CM-2 (1987)

  16. The future of SIMD • Coarse grain SIMD • A large scale supercomputer like Illiac-IV/GF-11 will not revive • Multimedia instructions will continue to be used • Special purpose on-chip systems will become popular • Fine grain SIMD • Advantageous for specific applications like image processing • General purpose machines are difficult to build (ex. CM-2 → CM-5)

  17. MIMD • Each processor executes individual instructions • Synchronization is required • High degree of flexibility • Various structures are possible [Figure: processors connected through interconnection networks to memory modules holding instructions and data]

  18. Classification of MIMD machines by the structure of shared memory • UMA (Uniform Memory Access model) provides shared memory which can be accessed from all processors in the same manner. • NUMA (Non-Uniform Memory Access model) provides shared memory, but it is not uniformly accessed. • NORA/NORMA (No Remote Memory Access model) provides no shared memory; communication is done with message passing.

  19. UMA • The simplest structure of shared memory machine • The extension of uniprocessors • An OS extended from the single-processor one can be used • Programming is easy • System size is limited • Bus connected • Switch connected • A total system can be implemented on a single chip: on-chip multiprocessor / chip multiprocessor / single chip multiprocessor
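To illustrate why programming a UMA machine is easy: every thread sees one uniform shared memory, so a sequential loop parallelizes with almost no change. A minimal sketch using OpenMP (my own illustration; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double x[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    /* all threads read the same shared array; the reduction
       clause combines their partial sums safely */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %g (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}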

  20. An example of UMA: bus connected [Figure: several PUs, each with a snoop cache, share a common bus to the main memory] • SMP (Symmetric MultiProcessor) • On-chip multiprocessor

  21. Switch connected UMA [Figure: CPUs, each with a local memory and interface, reach the main memory modules through a switch] • The gap between switch and bus becomes small

  22. NUMA • Each processor provides a local memory, and accesses other processors’ memories through the network • Address translation and cache control often make the hardware structure complicated • Scalable: programs for UMA can run without modification, and the performance improves with the system size • Competitive with WS/PC clusters using software DSM
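A minimal sketch of the local-vs-remote point above (an assumption on my part: a Linux machine with libnuma installed; link with -lnuma). Data placed in a node's local memory is cheap for threads on that node to touch, while other nodes reach it through the interconnect:

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("not a NUMA system\n");
        return 0;
    }
    size_t sz = 1 << 20;
    /* place 1 MB in node 0's local memory */
    double *local = numa_alloc_onnode(sz, 0);
    if (local) {
        local[0] = 42.0;  /* fast for a thread running on node 0 */
        numa_free(local, sz);
    }
    printf("nodes: 0..%d\n", numa_max_node());
    return 0;
}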

  23. Typical structure of NUMA [Figure: Nodes 0–3 connected by an interconnection network; each node owns one quarter (0–3) of the single logical address space]

  24. Classification of NUMA • Simple NUMA: remote memory is not cached • Simple structure, but the access cost of remote memory is large • CC-NUMA (Cache Coherent NUMA): cache consistency is maintained with hardware • The structure tends to be complicated • COMA (Cache Only Memory Architecture): no home memory • Complicated control mechanism

  25. Cray’s T3D: A simple NUMA supercomputer (1993) • Using Alpha 21064

  26. The Earth Simulator (2002)

  27. SGI Origin: bristled hypercube [Figure: the main memory is connected directly to the hub chip, which links to the network] • 1 cluster consists of 2 PEs

  28. SGI’s CC-NUMA Origin3000 (2000) • Using R12000

  29. DDM (Data Diffusion Machine) [Figure: a hierarchical COMA structure built from directories (D)]

  30. NORA/NORMA • No shared memory • Communication is done with message passing • Simple structure but high peak performance • Hard to program: cluster computing requires explicit inter-PU communication • The fastest processor has always been a NORA machine (except the Earth Simulator)
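Since a NORA/NORMA machine has no shared memory, all communication must be written explicitly. A minimal sketch of this message-passing style using MPI (my own illustration; run e.g. with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 123;
        /* no shared memory: the data must be sent explicitly */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}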

  31. Early hypercube machine: nCUBE2

  32. Fujitsu’s NORA AP1000 (1990) • Mesh connection • SPARC

  33. Intel’s Paragon XP/S (1991) • Mesh connection • i860

  34. PC Cluster • Beowulf Cluster (NASA’s Beowulf Project, 1994, by Sterling) • Commodity components • TCP/IP • Free software • Others • Commodity components • High performance networks like Myrinet / Infiniband • Dedicated software

  35. RHiNET-2 cluster

  36. Terms (1) • Multiprocessors: MIMD machines with shared memory • Strict definition (by Enslow Jr.): shared memory, shared I/O, distributed OS, homogeneous • Extended definition: all parallel machines (wrong usage)

  37. Terms (2) • Multicomputer (don’t use if possible) • MIMD machines without shared memory, that is, NORA/NORMA • Array computer • A machine consisting of an array of processing elements: SIMD • A supercomputer for array calculation (wrong usage) • Loosely coupled / tightly coupled • Loosely coupled: NORA; tightly coupled: UMA • But are NORAs really loosely coupled?

  38. Classification • Stored programming based • SIMD: fine grain / coarse grain • MIMD • Multiprocessors • UMA: bus connected / switch connected • NUMA: simple NUMA / CC-NUMA / COMA • NORA: multicomputers • Others • Systolic architecture • Data flow architecture / demand driven architecture • Mixed control

  39. Systolic architecture • Data streams are inserted into an array of special purpose computing nodes in a certain rhythm [Figure: data streams x and y flow through a computational array] • Introduced in Reconfigurable Architectures
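To make the "certain rhythm" concrete, here is a minimal software simulation (my own sketch, not the lecture's) of a 1-D systolic array computing an FIR filter: each cell holds one fixed weight, input samples shift one cell per clock tick, and the products are summed across the array:

#include <stdio.h>

#define TAPS 3
#define N    8

int main(void) {
    double w[TAPS] = {0.5, 0.3, 0.2};        /* weights fixed in the cells */
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};  /* input stream */
    double cell_x[TAPS] = {0};               /* sample held in each cell */

    for (int n = 0; n < N; n++) {
        /* one clock tick: the stream advances one cell */
        for (int k = TAPS - 1; k > 0; k--)
            cell_x[k] = cell_x[k - 1];
        cell_x[0] = x[n];

        /* every cell multiplies in the same rhythm; sums flow out */
        double y = 0.0;
        for (int k = 0; k < TAPS; k++)
            y += w[k] * cell_x[k];
        printf("y[%d] = %.2f\n", n, y);
    }
    return 0;
}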

  40. Data flow machines • A process is driven by the data [Figure: a dataflow graph with + and × nodes computing (a+b)×(c+(d×e))] • Also introduced in Reconfigurable Systems
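A minimal sketch (my own illustration) of the firing rule behind the slide's example: a node fires as soon as both of its input tokens have arrived, in whatever order, so (a+b)×(c+(d×e)) is evaluated with no program counter at all:

#include <stdio.h>

typedef struct {
    char   op;         /* '+' or '*' */
    double in[2];      /* input tokens */
    int    arrived;    /* tokens received so far */
    int    dest, port; /* where the result goes; dest -1 = output */
} Node;

/* graph for (a+b)*(c+(d*e)) */
Node g[4] = {
    {'+', {0, 0}, 0, 3, 0},   /* node 0: a+b   -> node 3, port 0 */
    {'*', {0, 0}, 0, 2, 1},   /* node 1: d*e   -> node 2, port 1 */
    {'+', {0, 0}, 0, 3, 1},   /* node 2: c+(.) -> node 3, port 1 */
    {'*', {0, 0}, 0, -1, 0},  /* node 3: final multiply -> output */
};

void send_token(int node, int port, double v);

void fire(int i) {
    double r = (g[i].op == '+') ? g[i].in[0] + g[i].in[1]
                                : g[i].in[0] * g[i].in[1];
    if (g[i].dest < 0)
        printf("result = %g\n", r);
    else
        send_token(g[i].dest, g[i].port, r);
}

void send_token(int node, int port, double v) {
    g[node].in[port] = v;
    if (++g[node].arrived == 2)  /* firing rule: both operands ready */
        fire(node);
}

int main(void) {
    double a = 1, b = 2, c = 3, d = 4, e = 5;
    /* tokens may arrive in any order; the data drives the computation */
    send_token(1, 0, d); send_token(1, 1, e);
    send_token(2, 0, c);
    send_token(0, 0, a); send_token(0, 1, b);
    return 0;  /* prints result = 69 for these inputs */
}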

  41. Exercise 1 • In this class, a PC cluster, RHiNET, is used for exercising parallel programming. Is RHiNET classified as a Beowulf cluster? Why do you think so? • If you take this class, send the answer with your name and student number to hunga@am.ics.keio.ac.jp
