

  1. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi • Prerequisites Revisited • 1.3. Assembler level, CPU architecture, and I/O

  2. Uniprocessor architecture
The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems. The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, memory, and I/O subsystems), each unit having its own Control Part and its own microprogram. Units cooperate through explicit message passing.
[Figure: a uniprocessor architecture — levels Applications / Processes / Assembler / Firmware / Hardware alongside a block diagram: CPU (Processor, Memory Management Unit, Cache Memory), Main Memory, DMA Bus, I/O Bus, I/O units I/O1 … I/Ok, Interrupt Arbiter]
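A toy C model of this organization, invented purely for illustration: the CPU does not touch memory directly, but sends an explicit request message to the memory unit, which interprets it with its own "microprogram" and replies (the message format and unit names are assumptions, not D-RISC specifics):

    #include <stdio.h>

    /* Explicit message between units: the CPU's memory interface and the
       memory unit cooperate by request/reply, each with its own control. */
    struct mem_request { enum { READ, WRITE } op; unsigned addr; long data; };
    struct mem_reply   { long data; };

    static long memory[256];   /* state owned by the memory unit */

    /* The memory unit's own "microprogram": interpret one request. */
    static struct mem_reply memory_unit(struct mem_request r) {
        struct mem_reply rep = { 0 };
        if (r.op == READ)  rep.data = memory[r.addr];
        else               memory[r.addr] = r.data;
        return rep;
    }

    int main(void) {
        /* The CPU side of a STORE and a LOAD: send a request, use the reply. */
        struct mem_request st = { WRITE, 5, 99 };
        memory_unit(st);
        struct mem_request ld = { READ, 5, 0 };
        printf("loaded: %ld\n", memory_unit(ld).data);
        return 0;
    }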

  3. Assembly level characteristics
• RISC (Reduced Instruction Set Computer) vs CISC (Complex Instruction Set Computer)
• “The MIPS/PowerPC vs the x86 worlds”
RISC:
• basic elementary arithmetic operations on integers,
• simple addressing modes,
• intensive use of General Registers, LOAD/STORE architecture
• complex functionalities are implemented at the assembler level by (long) sequences of simple instructions
• rationale: powerful optimizations are (feasible and) made much easier:
• firmware architecture
• code compilation for caching and Instruction Level Parallelism

  4. Assembly level characteristics
CISC:
• includes complex instructions for arithmetic operations (reals, and more),
• complex nested addressing modes,
• instructions corresponding to (parts of) system primitives and services:
• process cooperation and management
• memory management
• graphics
• networking
• …

  5. RISC and CISC approaches
[Figure: the Applications / Processes / Assembler / Firmware / Hardware levels compared across the two architectures — in a RISC architecture, Primitive 1 is compiled into a long sequence of instructions, each implemented by a short microprogram; in a CISC architecture, Primitive 2 is compiled into a short sequence of proper instructions, or a single instruction, implemented by a firmware microprogram]

  6. Test: to be submitted and discussed
Basically, CISC implements at the firmware level what RISC implements at the assembler level. Assume:
• no Instruction Level Parallelism
• same technology: clock cycle, memory access time, etc.
• Why could the completion time be reduced when implementing the same functionality at the firmware level (“in hardware” … !) compared to the implementation at the assembler level (“in software” … !)?
• Under which conditions is the ratio of completion times significant? Which order of magnitude?
• Exercise: answer these questions in a non-trivial way, i.e. exploiting the concepts and features of the levels and level structuring.
• Write the answer (1–2 pages), submit the work to me, and discuss it at Question Time.
• Goal: to highlight background problems, weaknesses, … to gain aptitude.

  7. Performance parameters
Average SERVICE TIME per instruction:
• average service time of individual instructions
• probability MIX (occurrence frequencies of instructions) for a given APPLICATION AREA
• together these give the global average SERVICE TIME per instruction
CPI = average number of Clock cycles Per Instruction
Performance = average number of executed instructions per second (MIPS, GIPS, TIPS, …) = processing bandwidth of the system at the assembler–firmware levels
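The slide lists these quantities without their formulas; a minimal reconstruction consistent with the definitions above (the symbols p_i, T_i and τ are our notation, not from the slide):

    T = Σ_i p_i · T_i       global average service time per instruction
    CPI = T / τ             where τ is the clock cycle
    P = 1 / T               performance: instructions executed per second

where p_i is the occurrence frequency of instruction class i in the application-area mix and T_i the average service time of class i.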

  8. Performance parameters
Completion time of a program:
Tc = m · T
where m = average number of executed instructions and T = global average service time per instruction.
• Benchmark approach to performance evaluation (e.g., SPEC benchmarks): comparison of distinct machines, differing at the assembler (and firmware) level.
• On the contrary, the Performance parameter is meaningful only for the comparison of different firmware implementations of the same assembler machine.
This simple formula holds rigorously for uniprocessors without Instruction Level Parallelism. It holds with good approximation for Instruction Level Parallelism (m = stream length), or for parallel architectures.
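A minimal C sketch of these formulas, with a made-up instruction mix, clock cycle, and program length (all numbers are illustrative assumptions, not measurements):

    #include <stdio.h>

    /* Hypothetical instruction mix: occurrence frequency p_i and
       service time (in clock cycles) per instruction class.      */
    struct insn_class { const char *name; double p; double cycles; };

    int main(void) {
        const double tau = 1e-9;              /* clock cycle: 1 ns (assumed)  */
        const double m   = 1e9;               /* executed instructions        */
        struct insn_class mix[] = {
            { "arith",  0.50, 1.0 },
            { "load",   0.25, 3.0 },
            { "store",  0.15, 3.0 },
            { "branch", 0.10, 2.0 },
        };
        double cpi = 0.0;
        for (int i = 0; i < 4; i++)
            cpi += mix[i].p * mix[i].cycles;  /* CPI = sum_i p_i * cycles_i   */
        double T  = cpi * tau;                /* average service time         */
        double Tc = m * T;                    /* completion time: Tc = m * T  */
        printf("CPI = %.2f, T = %.2e s, Tc = %.2f s, P = %.2f GIPS\n",
               cpi, T, Tc, 1.0 / T / 1e9);
        return 0;
    }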

  9. RISC vs CISC
• RISC decreases T and increases m
• CISC increases T and decreases m
• Where is the best trade-off?
• For Instruction Level Parallelism machines, RISC optimizations may be able to achieve a decrease in T greater than the increase in m
• cases in which the execution bandwidth has greater impact than the latency
• Where execution latency dominates, CISC achieves lower completion time
• e.g. process cooperation and management, services, special functionalities, …
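An illustrative instance of this trade-off, using Tc = m · T from the previous slide (all numbers are invented for the sake of the example): suppose a functionality compiles into m = 20 RISC instructions of average service time 2τ, or into a single CISC instruction of service time 30τ. Then Tc,RISC = 20 · 2τ = 40τ > Tc,CISC = 30τ, and the CISC machine wins on latency; but if Instruction Level Parallelism overlaps the RISC instructions down to an effective service time of τ, Tc,RISC = 20τ < 30τ and the ordering reverses.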

  10. A possible trade-off
• Basically a RISC machine with the addition of specialized co-processors or functional units for complex tasks:
• graphic parallel co-processors
• vectorized functional units
• network/communication co-processors
• …
• Concept: modularity can lead to optimizations
• separation of concerns and its effectiveness
• Of course, a CISC machine with “very optimized” compilers can achieve the same or better results
• A matter of COMPLEXITY and COSTS / INVESTMENTS

  11. Modularity and optimization of the complexity / efficiency ratio
• In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD co-processor / GPU / …), or a powerful multiprocessor-based network interface, or ….
• For interprocess communication and management: communication co-processors.
• In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units, e.g. a pipelined hardware implementation of arithmetic operations on reals, or vector instructions.
[Figure: the uniprocessor architecture of slide 2 — CPU (Processor, Memory Management Unit, Cache Memory), Main Memory, I/O units I/O1 … I/Ok, Interrupt Arbiter — annotated with these specialization points]

  12. A “didactic RISC”
• See Patterson – Hennessy: MIPS assembler machine
• In the courses of Computer Architecture in Pisa: D-RISC
• a didactic version (on paper) of MIPS with a few inessential simplifications
• useful for explaining concepts and techniques in computer architecture and compiler optimizations
• includes many features for code optimizations

  13. D-RISC in a nutshell
• 64 General Registers: RG[64]
• Any instruction represented in a single 32-bit word
• All arithmetic-logical and branch/jump instructions operate on RG
• Integer arithmetic only
• Target addresses of branch/jump: relative to the Program Counter, or RG contents
• Memory accesses can be done by LOAD and STORE instructions only
• Logical addresses are generated, i.e. the CPU “sees” the Virtual Memory of the running process
• Logical address computed as the sum of two RG contents (Base + Index mode)
• Input/output: Memory Mapped I/O through LOAD/STORE — no special instructions
• Special instructions for:
• interrupt control: mask, enable/disable
• process termination
• context switching: minimal support for MMU
• caching optimizations: annotations for prefetching, reuse, de-allocation, memory re-write
• indivisible sequences of memory accesses and annotations (multiprocessor applications)

  14. Example of compilation

int A[N], B[N];
for (i = 0; i < N; i++)
    A[i] = A[i] + B[i];

• R = keyword denoting “register address”
• For clarity of examples, register addresses are indicated by symbols (e.g. RA = R27; then the base address of A is the content of RG[27])
• REGISTER ALLOCATIONS and INITIALIZATIONS at COMPILE TIME:
• RA, RB: addresses of RG initialized at the base addresses of A, B
• Ri: address of RG containing variable i, initialized at 0
• RN: address of RG initialized at constant N
• Ra, Rb, Rc: addresses of RG containing temporaries (not initialized)
• Virtual Memory denoted by VM; Program Counter denoted by IC
• “NORMAL” SEMANTICS OF SOME D-RISC INSTRUCTIONS:
• LOAD RA, Ri, Ra :: RG[Ra] = VM[RG[RA] + RG[Ri]], IC = IC + 1
• ADD Ra, Rb, Rc :: RG[Rc] = RG[Ra] + RG[Rb], IC = IC + 1
• IF< Ri, RN, LOOP :: if RG[Ri] < RG[RN] then IC = IC + offset(LOOP) else IC = IC + 1

Compiled loop:

LOOP:  LOAD  RA, Ri, Ra
       LOAD  RB, Ri, Rb
       ADD   Ra, Rb, Rc
       STORE RA, Ri, Rc
       INCR  Ri
       IF<   Ri, RN, LOOP
       END
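A minimal C sketch of the semantics above, modelling RG and VM as arrays and executing the compiled loop body-first, as the D-RISC code does (the memory layout — A at address 0, B at address N — and the sample data are assumptions for illustration):

    #include <stdio.h>
    #define N 4

    long RG[64];         /* general registers            */
    long VM[2 * N];      /* toy virtual memory: A then B */

    int main(void) {
        /* register allocation, as in the slide: symbolic names for RG addresses */
        enum { RA = 27, RB = 28, Ri = 29, RN = 30, Ra = 31, Rb = 32, Rc = 33 };
        for (int k = 0; k < N; k++) { VM[k] = k; VM[N + k] = 10 * k; } /* sample data */
        RG[RA] = 0; RG[RB] = N; RG[Ri] = 0; RG[RN] = N;  /* compile-time inits */

        do {
            RG[Ra] = VM[RG[RA] + RG[Ri]];    /* LOAD  RA, Ri, Ra   */
            RG[Rb] = VM[RG[RB] + RG[Ri]];    /* LOAD  RB, Ri, Rb   */
            RG[Rc] = RG[Ra] + RG[Rb];        /* ADD   Ra, Rb, Rc   */
            VM[RG[RA] + RG[Ri]] = RG[Rc];    /* STORE RA, Ri, Rc   */
            RG[Ri] = RG[Ri] + 1;             /* INCR  Ri           */
        } while (RG[Ri] < RG[RN]);           /* IF<   Ri, RN, LOOP */

        for (int k = 0; k < N; k++) printf("A[%d] = %ld\n", k, VM[k]);
        return 0;
    }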

  15. Compilation rules

compile(if C then B) →
        IF (not C) CONTINUE
        compile(B)
CONTINUE: …

compile(if C then B1 else B2) →
        IF (C) THEN
        compile(B2)
        GOTO CONTINUE
THEN:   compile(B1)
CONTINUE: …

compile(while C do B) →
LOOP:   IF (not C) CONTINUE
        compile(B)
        GOTO LOOP
CONTINUE: …

compile(do B while C) →
LOOP:   compile(B)
        IF (C) LOOP
CONTINUE: …
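A sketch of the while rule as a tiny C code emitter; the label counter, the callback scheme, and the negated-condition instruction IF>= (assumed by analogy with IF<) are hypothetical scaffolding, not part of D-RISC:

    #include <stdio.h>

    static int label_id = 0;

    /* Emit the D-RISC skeleton for: while (RG[r1] < RG[r2]) do <body>.
       The body is passed as a callback so rules can nest recursively. */
    void compile_while_lt(int r1, int r2, void (*body)(void)) {
        int loop = label_id++, cont = label_id++;
        printf("L%d:  IF>= R%d, R%d, L%d\n", loop, r1, r2, cont); /* IF (not C) CONTINUE */
        body();                                                   /* compile(B)          */
        printf("      GOTO L%d\n", loop);                         /* GOTO LOOP           */
        printf("L%d:  ...\n", cont);                              /* CONTINUE            */
    }

    void body_example(void) { printf("      INCR R29\n"); }       /* i++ */

    int main(void) {
        compile_while_lt(29, 30, body_example);  /* while (i < N) i++; */
        return 0;
    }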

  16. I/O transfers
• DIRECT MEMORY ACCESS: I/O units can access the Main Memory.
• MEMORY MAPPED I/O: the internal memories of I/O units (MI/O1 … MI/Ok) are extensions of the Main Memory.
• Fully transparent to the process: LOAD and STORE contain logical addresses.
• Compiler annotations drive the physical memory allocation.
• SHARED MEMORY: both techniques support sharing of memory between CPU processes and I/O processes (I/O Memory and/or Main Memory).
[Figure: the uniprocessor architecture with the I/O memories MI/O1 … MI/Ok attached to the I/O units, and the DMA Bus and I/O Bus connecting them to the Main Memory and CPU]
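A C sketch of what Memory Mapped I/O means for the program: a device register is just an address, read and written with ordinary LOAD/STORE instructions. Here a static array stands in for the I/O unit's internal memory, purely so the sketch runs; on a real machine it would be a fixed region of the process's logical address space:

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for the internal memory of an I/O unit. The volatile
       qualifier is what keeps the compiler from caching device accesses
       in registers, so every access really becomes a LOAD or STORE.    */
    static uint32_t io_unit_memory[16];
    static volatile uint32_t *const IO_DATA = &io_unit_memory[0];

    int main(void) {
        *IO_DATA = 42;                              /* an ordinary STORE */
        printf("device register: %u\n", *IO_DATA);  /* an ordinary LOAD  */
        return 0;
    }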

  17. Interrupt handling
• Signaling of asynchronous events (asynchronous wrt the CPU process) from the I/O subsystem
• In the microprogram of every instruction:
• Firmware phase: test of the interrupt signal and call of a proper assembler procedure (HANDLER)
• HANDLER: specific treatment of the event signaled by the I/O unit
• Exceptions, instead, are synchronous events (e.g. memory fault, protection violation, arithmetic error, and so on), i.e. generated by the running process
• Similar technique for exception handling (firmware phase + handler)

  18. Firmware phase of interrupt handling
[Figure: timing diagram — P, MMU, Arbiter and the I/O unit exchange the interrupt acknowledgement to the selected I/O unit (latency Ttr), followed by the communication of 2 words from the I/O unit and the procedure call]
• The I/O unit identifier is not associated with the interrupt signal.
• Once the interrupt has been acknowledged by P, the I/O unit sends a message to the CPU consisting of 2 words (via the I/O Bus):
• an event identifier
• a parameter
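A C sketch of how the handler side of this protocol could look: the 2-word message selects the HANDLER and carries its argument. The table size, names, and example event are illustrative assumptions, not D-RISC specifics:

    #include <stdio.h>

    #define MAX_EVENTS 8

    /* The 2-word message sent by the I/O unit after the acknowledgement. */
    struct intr_msg { unsigned event_id; unsigned parameter; };

    typedef void (*handler_t)(unsigned parameter);

    static handler_t handler_table[MAX_EVENTS];   /* one HANDLER per event */

    static void disk_done(unsigned block) { printf("disk: block %u ready\n", block); }

    /* Firmware phase, modelled in C: after the running instruction ends,
       the interrupt is acknowledged, the 2 words are read from the I/O
       Bus, and the proper assembler procedure (HANDLER) is called.      */
    static void firmware_phase(struct intr_msg m) {
        if (m.event_id < MAX_EVENTS && handler_table[m.event_id])
            handler_table[m.event_id](m.parameter);
    }

    int main(void) {
        handler_table[3] = disk_done;             /* register a handler    */
        struct intr_msg m = { 3, 117 };           /* message from I/O unit */
        firmware_phase(m);
        return 0;
    }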
