
Computing Systems CC513

Magdy Saeb

Arab Academy for Science, Technology & Maritime Transport


Computing Systems CC513

  • Deeper understanding of:

    • Computer Architecture concepts

    • design trade-offs for cost/performance

    • Advanced Architectures

    • trends for the future

  • Why?

    • match/choose hardware and software to solve a problem

    • design better software (for many programmers)

    • design better hardware (for a chosen few)


Computing Systems CC513

Course planning

  • 15 Lectures

  • Labs/assignments

  • 3 Written exams, project, report and presentation


Course Outline

Course objectives:

This course gives thorough knowledge of advanced computer architecture concepts, parallel architectures, and parallel processing. The main aim is to develop the students’ research skills and their knowledge of state-of-the-art architectures. This topic is strongly related to areas such as computer graphics acceleration, cryptography, coding, and hardware design.

Department Home page:

www.aast-compeng.info

(Here you will find many course handouts, VHDL lectures, solutions to homework problems, and sample exams.)

Topics:

1. Course Overview, Computational Models
2. Introduction to Parallel Processing
3. ILP Processors
4. Pipelined Processors
5. VLIW Processors
6. Superscalar Processors
7. Midterm 1
8. Code Scheduling for ILP Processors
9. Branch Processing
10. Memory Systems
11. SIMD Architectures
12. Midterm 2
13. MIMD Architectures, Memory Systems
14. Distributed & Shared-Memory MIMD
15. Multi-threaded Architectures / Future Directions (Optical Computing, Bio-electronic Computing, FPGA Special-Purpose Computers)
16. Final Exam


Texts & Grading

Text:

Sima, Fountain, Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison-Wesley, 1998.

References:

J.L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, Morgan Kaufmann, 2003.

Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.

Grading:

Homework 10%

Quizzes 10%

Project 10%

Midterm1 15%

Midterm2 15%

Final 40%

Lecturer: Magdy Saeb, Ph.D.

Assistant Lecturer: Reham Mahdi, B.S.


Computing Systems CC513

Main text:

Advanced Computer Architectures: A Design Space Approach

Sima, Fountain, Kacsuk

Supplementary text:

Computer Architecture: A Quantitative Approach

Hennessy, Patterson


Computing Systems CC513

See

http://www.aast-compeng.info

for information on:

  • News

  • Lectures

  • Sample Exams

  • Lab Status

  • Grading

Questions?



Computational Models

Arab Academy for Science, Technology & Maritime Transport


Advanced Computer Architecture: Part I

  • Computational Models

  • The Concept of Computer Architecture

  • Introduction To Parallel Processing

    Sima et al. introduce a design space approach to computer architecture (design aspects are broken down into atomic pieces).



Part I, Computational Models

  • Turing, Type-0 language

    • Infinite memory, not feasible

  • Von Neumann, Imperative (C)

    • Traditional architecture, Control/Memory

    • Finite State Machine (FSM)

    • Multiple Assignment gives side effects

    • Sequential in nature

    • Control statements


Part I, Computational Models

  • Dataflow, Single Assignment language

    • Dataflow machines

  • Applicative, Functional (Haskell/ML)

    • Reduction machines

  • Object Based, Object Oriented (C++)

    • Object-oriented computers (similar to von Neumann machines, but depend on message passing)

  • Predicate Logic Based, Logic Based (Prolog)

    • Has not been realized


Part I, Computational Models

Computational models can be “emulated” on von Neumann machines.

Hard to beat on cost/performance!


Part I, Sima, The Concept of Computer Architecture

  • Abstract architecture

    • Deals with functional specification

    • For example: programmer’s model / instruction set

  • Concrete architecture

    • Deals with aspects of the implementation

    • For example: logic design as block diagram of functional units


Part I, The Concept of Computer Architecture

  • DS Trees to define design space

  • con: consists of

  • pex: can be exclusively performed by

  • per: can be performed by

  • example = con(pex(A,B), per(C, con(D,E)))

[Figure: the DS tree for this example, with leaves A, B, C, D, E]

Reading the example with the definitions above: the design consists of one aspect that is exclusively either A or B, together with an aspect that can be performed by C, which in turn consists of D and E.


Part I, Introduction to Parallel Processing

  • Process/Process trees/Threads

    • Process Control Block (PCB)

    • Resource mapping per process

    • Threads (lightweight processes) inherit/share resources

  • Concurrent/Parallel execution

    • Concurrent (time sliced)

      • Multi threaded architectures

    • Parallel (multiple CPUs)

      • Parallel architectures, multi processors, multi computers (clusters)


Part I, Introduction to Parallel Processing

  • Types of Parallelism

    • Available, inherent in problem

    • Utilized by architecture implementation

    • Functional, from problem solution (usually irregular)

      • ILP, multi-threading, MIMD

    • Data, from computations (regular, like vectors…)

      • SIMD

  • Flynn’s Classification



Introduction to Instruction-Level Parallelism (ILP)

Arab Academy for Science, Technology & Maritime Transport


Introduction to Instruction-Level Parallelism (ILP)

The design space of ILP processors:

  • Traditional von Neumann processors: sequential issue, sequential execution (typical implementation: non-pipelined processors)

  • Scalar ILP processors: sequential issue, parallel execution (typical implementations: pipelined processors, and processors with multiple non-pipelined EUs)

  • Superscalar ILP processors: parallel issue, parallel execution (typical implementations: VLIW and superscalar processors with multiple pipelined EUs)

The two axes of the design space are parallelism of instruction execution and parallelism of instruction issue.


Some Definitions

  • A pipelined processor:

    • Achieves instruction-level parallelism by having one instruction in each stage of the pipeline

  • An execution unit (EU) is a block that performs some function that helps complete an instruction:

    • The integer ALU, Floating-Point Unit (FPU), Branch Unit (BU), and Load/Store Unit are examples of execution units.


Methods of achieving parallelism

There are two major methods of achieving parallelism:

  • Pipelining

  • Replication


More Definitions

  • A superscalar processor:

    • Issues multiple instructions per clock cycle from a sequential stream

    • Dynamic scheduling of execution units (scheduling done in hardware)

  • A Very Long Instruction Word (VLIW) processor:

    • Issues one very ‘wide’ instruction per clock cycle; this instruction contains multiple operations

    • Static scheduling of execution units (done by the compiler).
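
As an illustrative contrast (the bundle syntax below is invented for this sketch, not a real ISA): a superscalar processor receives a conventional sequential stream, e.g.

add  r1, r2, r3
mul  r4, r5, r6
load r7, 0(r8)

and its hardware discovers at run time that the three operations are independent and issues them together. A VLIW compiler instead emits the same work as a single wide instruction with one operation slot per execution unit:

{ add r1,r2,r3 | mul r4,r5,r6 | load r7,0(r8) | nop }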


Pipelined vs. VLIW/Superscalar

[Figure: pipelined operation passes instructions through stages EU1 → EU2 → EU3 in a pipelined processor; parallel operation drives EU1, EU2, and EU3 side by side in VLIW and superscalar processors.]

Execution units in VLIW and Superscalar processors can be pipelined!


Typical Pipeline

[Figure: a load instruction flowing through the five stages, one per cycle: Ifetch (cycle 1), Reg/Dec (cycle 2), Exec (cycle 3), Mem (cycle 4), Wr (cycle 5).]

  • Ifetch: Instruction Fetch

    • Fetch the instruction from the Instruction Memory

  • Reg/Dec: Registers Fetch and Instruction Decode

  • Exec: Calculate the memory address

  • Mem: Read the data from the Data Memory

  • Wr: Write the data back to the register file


Execution Units can be pipelined (PowerPC 601 Example)

Branch instructions: Fetch → Issue/Decode → Execute → Predict

Integer instructions: Fetch → Issue/Decode → Execute → Write-back

Load/Store instructions: Fetch → Issue/Decode → Addr Gen → Cache → Write-back


Execution Units can be pipelined (PowerPC 601 Example) (cont.)

FP instructions: Fetch → Issue → Decode → Execute 1 → Execute 2 → Writeback


Data Dependencies

  • Data Dependencies present problems for instruction level parallelism

  • Types of data dependencies:

    • Straight line code

      • Read After Write (RAW)

      • Write After Read (WAR)

      • Write After Write (WAW)

    • Loops

      • Recurrence or inter-iteration dependencies


Straight Line Dependencies

Read After Write (RAW)

i1: load r1, a
i2: add r2, r1, r1

Assume a pipeline of Fetch/Decode/Execute/Mem/Writeback.

When the add is in the DECODE stage (which fetches r1), the load is in the EXECUTE stage and the true value of r1 has not been fetched from memory yet! (The load fetches it in the Mem stage.)

Solve this by either stalling the ‘add’ until the value of r1 is ready, or by forwarding the value of r1 from the Mem stage to the Execute stage.
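
A cycle-by-cycle sketch of the fix (timing is illustrative; F/D/X/M/W abbreviate the five stages above):

cycle:              1    2    3    4      5    6    7
i1: load r1, a      F    D    X    M      W
i2: add r2, r1, r1       F    D    stall  X    M    W

Here i2’s Execute in cycle 5 receives r1 forwarded from the result i1 latched at the end of its Mem stage (cycle 4), costing one stall cycle. Without forwarding, i2 would have to stall until r1 is actually written back to the register file.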


Straight Line Dependencies (cont)

Write after Read (WAR)

i1: mul r1, r2, r3    ; r1 <= r2 * r3
i2: add r2, r4, r5    ; r2 <= r4 + r5

If instruction i2 (add) is executed before instruction i1 (mul) for some reason, then i1 (mul) could read the wrong value for r2.

One reason for delaying i1 would be a stall for the ‘r3’ value being produced by a previous instruction. Instruction i2 could proceed because it has all its operands, thus causing the WAR hazard.

Use register renaming to eliminate the WAR dependency: replace r2 in i2 with some other register that has not been used yet.
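
A minimal sketch of the renaming fix (r6 is an arbitrary free register, chosen for illustration):

i1: mul r1, r2, r3    ; still reads the old r2
i2: add r6, r4, r5    ; writes r6 instead of r2; later readers of the new value use r6

Now i1 and i2 may execute and complete in either order without a hazard.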


Straight Line Dependencies (cont.)

Write after Write (WAW)

i1: mul r1, r2, r3    ; r1 <= r2 * r3
i2: add r1, r4, r5    ; r1 <= r4 + r5

If instruction i1 (mul) finishes AFTER instruction i2 (add), then register r1 would get the wrong value. Instruction i1 could finish after instruction i2 if separate execution units were used for instructions i1 and i2.

One way to solve this hazard is to simply let instruction i1 proceed normally, but disable its write stage.


Loop Dependencies

Recurrences:

do I = 2, n
  X(I) = A * X(I-1) + B
enddo

One way to try to parallelize this loop would be to ‘unroll’ it (replicate the loop body once per iteration). However, a dependency exists between the current X value and the value from the previous iteration, so loop unrolling will not give us any more parallelism.

This type of data dependency cannot be solved at the implementation level, but must be addressed at the compiler level.
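
A 0-based C rendering of the recurrence (a minimal sketch; function and variable names are illustrative). Unrolling by two merely restates the serial dependence chain:

#include <stddef.h>

/* x[i] = a * x[i-1] + b: each iteration reads the value written by
   the previous one, a loop-carried RAW dependence. */
void recurrence(double *x, double a, double b, size_t n) {
    for (size_t i = 1; i < n; i++)
        x[i] = a * x[i - 1] + b;
}

/* Unrolled by two: the second statement still needs x[i] from the
   first, so the two copies cannot execute in parallel. */
void recurrence_unrolled(double *x, double a, double b, size_t n) {
    size_t i;
    for (i = 1; i + 1 < n; i += 2) {
        x[i]     = a * x[i - 1] + b;
        x[i + 1] = a * x[i]     + b;   /* depends on the line above */
    }
    if (i < n)                          /* leftover iteration */
        x[i] = a * x[i - 1] + b;
}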


Control Dependencies

  • Control dependencies (i.e., branches) are a major obstacle to instruction level parallelism

    • In a pipelined machine, the branch condition computation is normally done as EARLY as possible in the pipeline, to lessen the impact of an incorrect branch prediction (taken or not taken)

  • Conditional branch instructions make up roughly 20% of general-purpose code and 5-10% of scientific code.


Branch Strategies

  • Static

    • Always predict taken or not-taken

  • Dynamic

    • Keep a history of code execution and modify predictions based on execution history

  • Multi-way

    • Execute both branch paths and kill incorrect path as soon as branch condition is resolved.
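
As a sketch of the dynamic strategy, here is the classic 2-bit saturating-counter scheme in C (the table size, index hashing, and names are illustrative assumptions, not from the slides):

#include <stdint.h>

#define BHT_SIZE 1024                 /* branch history table entries */
static uint8_t bht[BHT_SIZE];         /* 2-bit counters: 0,1 = not taken; 2,3 = taken */

/* Predict a branch at address pc: taken iff the counter is in the upper half. */
int predict_taken(uint32_t pc) {
    return bht[(pc >> 2) % BHT_SIZE] >= 2;
}

/* After the branch resolves, nudge the counter toward the actual outcome. */
void train(uint32_t pc, int taken) {
    uint8_t *c = &bht[(pc >> 2) % BHT_SIZE];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

The two-bit hysteresis means a single atypical outcome (e.g., a loop exit) does not flip the prediction, which is why this scheme outperforms a one-bit history.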


Control Dependency Graph

i0: r1 = op1;
i1: r2 = op2;
i2: r3 = op3;
i3: if (r2 > r1) {
i4:   if (r3 > r1)
i5:     r4 = r3;
i6:   else r4 = r1;
i7: } else r4 = r2;
i8: r5 = r4 * r4;

[Figure: the control dependency graph of this code; branch i3 guards i4 and i7, branch i4 guards i5 and i6, and all paths rejoin at i8.]


Resource Dependencies

  • A resource dependency arises when an instruction requires a hardware resource that is being used by a previously issued instruction (also known as a structural hazard)

    • Execution units, busses (e.g., the external address/data bus)

  • A resource dependency can only be solved by resource duplication

    • The Harvard architecture has separate address/data busses for instructions and data


Instruction Scheduling

  • Instruction Scheduling is the assignment of instructions to hardware resources.

    • Hardware resources are busses, registers, and execution units

  • Static scheduling is done by the compiler or by a human (see the sketch below)

    • The hardware assumes that ALL hazards have been eliminated.

    • Lessens the amount of control logic needed, which hopefully raises the maximum clock speed
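
A minimal sketch of what a static scheduler does (register numbers and the one-cycle load-use delay are illustrative). The compiler moves an independent instruction into the load’s delay slot so the hardware never sees a hazard:

Before scheduling (the add stalls one cycle waiting for r1):

load r1, 0(r2)
add  r3, r1, r4
sub  r5, r6, r7

After scheduling (the independent sub fills the slot; no stall):

load r1, 0(r2)
sub  r5, r6, r7
add  r3, r1, r4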


Instruction Scheduling (cont).

  • Dynamic scheduling is implemented in hardware inside the processor.

    • All instruction streams are ‘legal’

    • Control logic and hardware resources needed for dynamic scheduling can be significant.

  • If trying to execute legacy code streams, then dynamic scheduling may be the only option.



Pipelined Processors

Arab Academy for Science, Technology & Maritime Transport


Definitions

  • FX pipeline - Fixed point pipeline (integer pipeline)

  • FP pipeline - Floating Point pipeline

  • Cycle time - length of clock period for pipeline, determined by slowest stage.

  • Latency, used in reference to RAW hazards - the amount of time that the result of a particular instruction takes to become available in the pipeline for a subsequent dependent instruction (measured in multiples of clock cycles)


RAW Dependency, Latencies

  • Define-use latency is the time delay after decoding and issue of an instruction until the result becomes available for a subsequent RAW-dependent instruction.

    add r1, r2, r3
    add r5, r1, r6    ; define-use dependency

    Usually one cycle for simple instructions.

  • Define-use delay of an instruction is the time a subsequent RAW-dependent instruction has to be stalled in the pipeline. It is one cycle less than the define-use latency.


RAW Dependency, Latencies (cont)

  • If define-use latency = 1, then define-use delay is 0 and the pipeline is not stalled.

    • This is the case for most simple instructions in the FX pipeline

    • Non-pipelined FP operations can have define-use latencies from a few cycles to tens of cycles.

  • Load-use dependency, load-use latency, and load-use delay refer to load instructions:

    load r1, 4(r2)
    add r3, r1, r2    ; load-use dependency

    The definitions are the same as for define-use dependency, latency, and delay.


More Definitions

Repetition Rate R (throughput) - shortest possible time interval between subsequent independent instructions in the pipeline

Performance Potential of a Pipeline - the number of independent instructions which can be executed in a unit interval of time:

P = 1 / (R × tc)

R: repetition rate in clock cycles
tc: cycle time of the pipeline
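
A worked example with illustrative numbers: a pipeline with tc = 2 ns that can accept a new independent instruction every cycle (R = 1) has P = 1 / (1 × 2 ns) = 500 million independent instructions per second; a non-pipelined unit that accepts one only every R = 4 cycles drops to P = 125 million.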


Table 5.1 from Text (latency/repetition rate)

(Entries are latency / repetition rate in clock cycles; s = single precision, d = double precision.)

Processor     Cycle time (ns)   Prec.   FAdd   FMult   FDiv   FSqrt
Alpha 21064   7/5/2             s       6/1    6/1     34     -
                                d       6/1    6/1     63     -
Pentium       6/5/3.3           s       3/1    3/1     39     70
                                d       3/1    3/1     30     70
Pentium Pro   6.7/5/3.3         s       3/1    5/2     18     29
                                d       3/1    5/2
HP PA 8000    5.6               s       3/1    3/1     17     17
                                d       3/1    3/1     31     31
SuperSparc    20/17             s       1/1    3/1     6/4    8/6
                                d                      9/7    12/10


How many stages?

  • The more stages, the less combinational logic within a stage, the higher the possible clock frequency

    • More stages can complicate control. The DEC Alpha has 7 stages for FX instructions, and even basic FX instructions have a define-use delay of one cycle

    • It becomes difficult to divide the logic evenly between stages

    • Clock skew between stages becomes more difficult to manage

    • Diminishing returns as the number of stages grows

  • Superpipelining is the term used for processors that use a high number of stages.


Dedicated Pipelines versus Multifunctional Pipelines

  • The trend in current high-performance CPUs is to use different logical AND physical pipelines for different instruction classes

    • FX pipeline (integer)

    • FP pipeline (floating point)

    • L/S pipeline (Load/Store)

    • B pipeline (Branch)

  • Allows more concurrency, more optimization

    • Silicon area more plentiful


Sequential Consistency

  • With multiple pipelines, how do we maintain sequential consistency when instructions are finishing at different times?

    • With just two pipelines (FX and FP), we can lengthen the shorter pipeline, either statically or dynamically. Dynamic lengthening would be used only when hazards are detected.

    • We can force the pipelines to write to a special unit called a Renaming Buffer or Reorder Buffer. It is the job of this unit to maintain sequential consistency. We will look at this in detail in Chapter 7 (superscalar).


RISC versus CISC pipelines

  • Pipelines for CISC machines are required to handle complex memory-to-register addressing

    • mov r4, 4(r3,r2)   ; effective address = r3 + r2 + 4

    • Will have an extra stage for effective-address calculation (see Figures 5.40, 5.41, 5.43)

    • Some CISC pipelines avoid a load-use delay penalty (Fig. 5.54, 5.56)

  • RISC pipelines have a load-use penalty of at least one cycle

  • Determining load-use penalties when multiple pipelines are in action is instruction-sequence dependent (i.e., 1, 2, or more than 2 cycles)


Some other important Figures in Chapter 5

  • Figure 5.26 (illustrates use of both clock phases for performing pipeline tasks)

  • Figure 5.31, Figure 5.32 (Pentium Pipeline, shows difference between logical and physical pipelines)

  • Figure 5.33, Figure 5.34 (PowerPC 604 - first look at a modern superscalar processor)



CC513 Computing Systems, Part 3: VLIW Architecture

Arab Academy for Science, Technology & Maritime Transport


Basic Working Principles of VLIW

  • Aim at speeding up computation by exploiting instruction-level parallelism.

  • Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.

  • An instruction consists of multiple operations; typical word lengths range from 52 bits to 1 Kbit.

  • All operations in an instruction are executed in a lock-step mode.

  • One or multiple register files for FX and FP data.

  • Rely on the compiler to find parallelism and schedule dependency-free program code.
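
A minimal C sketch of the instruction-word idea (field widths and slot names are illustrative assumptions, not a real format):

#include <stdint.h>

/* One VLIW instruction: an operation slot per execution unit, all
   issued together and executed in lock step. Slots the compiler
   cannot fill hold explicit no-ops. */
typedef struct {
    uint32_t fx_op;   /* integer (FX) unit slot        */
    uint32_t fp_op;   /* floating-point (FP) unit slot */
    uint32_t ls_op;   /* load/store unit slot          */
    uint32_t br_op;   /* branch unit slot              */
} vliw_word;          /* 4 x 32 bits = 128-bit word    */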



Register File Structure for VLIW

What is the challenge to the register file in VLIW? The number of R/W ports, since all execution units read and write it simultaneously.



Differences Between VLIW & Superscalar Architecture (II)

  • Instruction formulation:

    • Superscalar:

      • Receive conventional instructions conceived for sequential processors.

    • VLIW:

      • Receive (very) long instruction words, each comprising a field (or opcode) for each execution unit.

      • The instruction word length depends on (a) the number of execution units and (b) the code length needed to control each unit (opcode length, register names, …).

      • Typical word length is 256 – 1024 bits, much longer than conventional machine word length.


Differences Between VLIW & Superscalar Architecture (III)

  • Instruction scheduling:

    • Superscalar:

      • Performed dynamically at run-time by the hardware.

      • Data dependency is checked and resolved in hardware.

      • Need a look-ahead hardware window for instruction fetch.


Differences Between VLIW & Superscalar Architecture (IV)

  • Instruction scheduling (cont’d):

    • VLIW:

      • Static scheduling done at compile-time by the compiler.

      • Advantages:

        • Reduce hardware complexity.

        • Tasks such as decoding, data dependency detection, instruction issue, etc. become simpler.

        • Potentially higher clock rate.

        • Higher degree of parallelism with global program information.


Differences Between VLIW & Superscalar Architecture (V)

  • Instruction scheduling (cont’d):

    • VLIW:

      • Disadvantages

        • Higher complexity of the compiler.

        • Compiler optimization needs to consider technology-dependent parameters such as latencies and the load-use time of the cache.

          (Question: What happens to the software if the hardware is updated?)

        • The non-deterministic nature of cache misses forces worst-case assumptions during code scheduling.

        • Unfilled operation slots in a (V)LIW waste memory space and instruction bandwidth.




Case Study of VLIW: Trace 200 Family (II)

  • Only two branches may be used in the Trace 7/200


Code Expansion in VLIW

  • It is found that code in VLIW is expanded roughly by a factor of three.

  • For “long” VLIW words, more operation fields will be left empty, wasting instruction bandwidth and storage space.

    Can you propose a solution?

