Superscalar and VLIW Architectures


Miodrag Bolic, CEG3151

Outline
  • Types of architectures
  • Superscalar
  • Differences between CISC, RISC and VLIW
  • VLIW
Parallel processing [2]

Processing instructions in parallel requires three major tasks:

  • checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;
  • assigning instructions to the functional units on the hardware;
  • determining when instructions are initiated.
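
The first task can be illustrated with a minimal sketch. It assumes each instruction writes one register and reads a few others, and it treats read-after-write, write-after-read, and write-after-write conflicts conservatively as reasons not to share a group. The `Instr` type, the register names, and the greedy in-order grouping are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str          # register written
    srcs: tuple = ()   # registers read


def depends(later, earlier):
    """True if `later` should not share a group with `earlier`."""
    raw = earlier.dest in later.srcs   # read after write
    war = later.dest in earlier.srcs   # write after read (conservative)
    waw = later.dest == earlier.dest   # write after write
    return raw or war or waw


def group_in_order(program, width):
    """Greedily pack consecutive independent instructions, at most `width` per group."""
    groups, current = [], []
    for ins in program:
        if len(current) == width or any(depends(ins, e) for e in current):
            groups.append(current)
            current = []
        current.append(ins)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("load_a", "r0", ("r5",)),
    Instr("load_x", "r1", ("r6",)),
    Instr("mul", "r3", ("r0", "r1")),
    Instr("add", "r4", ("r3", "r4")),
    Instr("dec", "r2", ("r2",)),
]
groups = group_in_order(program, width=2)
print([[i.name for i in g] for g in groups])
```

In a superscalar processor this check is done by hardware at run time; in a VLIW design the compiler does it ahead of time.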
Major categories [2]

VLIW – Very Long Instruction Word

EPIC – Explicitly Parallel Instruction Computing

From Mark Smotherman, “Understanding EPIC Architectures and Implementations”


Superscalar Processors [1]
  • Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
  • Only independent instructions can be executed in parallel without causing a wait state.
  • The amount of instruction-level parallelism varies widely depending on the type of code being executed.
Pipelining in Superscalar Processors [1]
  • To fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This condition may not hold in every clock cycle; when it does not, some of the pipelines stall in a wait state.
  • In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.
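
A rough way to see the effect of degree m is to count cycles for in-order issue under simplifying assumptions: every operation has one-cycle latency, at most m instructions issue per cycle, and an instruction waits if it needs a register written in the same cycle. The program encoding as `(dest, srcs)` pairs and the function name are hypothetical.

```python
def cycles_in_order(program, m):
    """Count cycles for in-order issue of width m with one-cycle latency.

    program: list of (dest, srcs) pairs, where srcs is a tuple of register names.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()   # registers written during this cycle
        issued = 0
        while i < len(program) and issued < m:
            dest, srcs = program[i]
            if written & set(srcs) or dest in written:
                break     # needs a result produced this cycle: stall
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


independent = [("r1", ("a",)), ("r2", ("b",)), ("r3", ("c",)), ("r4", ("d",))]
chain = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

print(cycles_in_order(independent, 2))  # four independent instructions, degree 2
print(cycles_in_order(chain, 2))        # each instruction needs the previous result
```

With these inputs the independent sequence finishes in half the cycles of the dependent chain, which gains nothing from the second pipeline.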
Superscalar Implementation
  • Simultaneously fetch multiple instructions
  • Logic to determine true dependencies involving register values
  • Mechanisms to communicate these values
  • Mechanisms to initiate multiple instructions in parallel
  • Resources for parallel execution of multiple instructions
  • Mechanisms for committing process state in correct order
Some Architectures
  • PowerPC 604
    • six independent execution units:
      • Branch execution unit
      • Load/Store unit
      • 3 Integer units
      • Floating-point unit
    • in-order issue
    • register renaming
  • PowerPC 620
    • adds out-of-order issue to the features of the 604
  • Pentium
    • three independent execution units:
      • 2 Integer units
      • Floating point unit
    • in-order issue
The VLIW Architecture [4]
  • A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length.
  • Multiple functional units are used concurrently in a VLIW processor.
  • All functional units share the use of a common large register file.
Advantages of VLIW

Compiler prepares fixed packets of multiple operations that give the full "plan of execution"

  • dependencies are determined by compiler and used to schedule according to function unit latencies
  • function units are assigned by compiler and correspond to the position within the instruction packet ("slotting")
  • compiler produces fully-scheduled, hazard-free code => hardware doesn't have to "rediscover" dependencies or schedule
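
The compiler's role can be sketched as a tiny scheduler for straight-line code. The three slot names, the latencies, and the restriction that each register is written at most once are assumptions made for illustration; they do not describe any particular VLIW machine.

```python
SLOTS = ("D", "M", "L")               # hypothetical slots: load/store, multiply, ALU
LATENCY = {"D": 2, "M": 2, "L": 1}    # assumed latencies in cycles, illustrative only


def schedule(ops):
    """Place each op in the earliest bundle where its inputs are ready and its slot is free.

    ops: list of (name, unit, dest, srcs); each register is written at most once.
    Returns a list of bundles, each mapping slot -> op name or "nop".
    """
    ready = {}     # register -> first cycle in which its value can be read
    bundles = []
    for name, unit, dest, srcs in ops:
        cycle = max((ready.get(s, 0) for s in srcs), default=0)
        while cycle < len(bundles) and bundles[cycle][unit] != "nop":
            cycle += 1                 # slot already taken, try the next bundle
        while len(bundles) <= cycle:
            bundles.append({s: "nop" for s in SLOTS})
        bundles[cycle][unit] = name    # the position in the bundle selects the unit
        ready[dest] = cycle + LATENCY[unit]
    return bundles


ops = [
    ("ld_a", "D", "a0", ("p5",)),
    ("ld_x", "D", "a1", ("p6",)),
    ("mpy", "M", "a3", ("a0", "a1")),
    ("add", "L", "a4", ("a3", "acc")),
]
bundles = schedule(ops)
for cycle, bundle in enumerate(bundles):
    print(cycle, [bundle[s] for s in SLOTS])
```

Because the latencies are baked into the schedule, the hardware can issue each bundle without checking dependencies again.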
Disadvantages of VLIW

Compatibility across implementations is a major problem

  • VLIW code won't run properly on an implementation with a different number of function units or different latencies
  • unscheduled events (e.g., cache miss) stall entire processor

Code density is another problem

  • low slot utilization (mostly nops)
  • reduce nops by compression ("flexible VLIW", "variable-length VLIW")
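
The code-density problem and one generic way to reduce nops can be sketched as follows. The bundle contents are made up, and the mask-plus-operations encoding is only an illustration of the idea of compression; real "flexible" or "variable-length" VLIW formats differ in detail.

```python
def utilization(bundles):
    """Fraction of slots that hold a real operation (None stands for a nop)."""
    total = sum(len(b) for b in bundles)
    used = sum(op is not None for b in bundles for op in b)
    return used / total


def compress(bundle):
    """Encode a bundle as (mask, ops): one mask bit per occupied slot."""
    mask = 0
    ops = []
    for slot, op in enumerate(bundle):
        if op is not None:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops


def decompress(mask, ops, width):
    """Rebuild the fixed-width bundle, reinserting nops where the mask bit is clear."""
    it = iter(ops)
    return tuple(next(it) if (mask >> slot) & 1 else None for slot in range(width))


bundles = [
    ("ld_a", None, None),
    ("ld_x", None, None),
    (None, None, None),
    (None, "mpy", None),
    (None, None, None),
    (None, None, "add"),
]
print(round(utilization(bundles), 3))
packed = [compress(b) for b in bundles]
stored_ops = sum(len(ops) for _, ops in packed)
print(stored_ops, "operations stored instead of", sum(len(b) for b in bundles))
```

The decoder needs the mask to put each operation back in its slot, so the saving comes at the cost of extra decode logic.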
Example: Vector Dot Product
  • A vector dot product is common in filtering
  • Store a(n) and x(n) in arrays of N elements each
  • C6x peak performance: 8 RISC instructions/cycle
    • Peak RISC instructions per sample: 300,000 for speech; 54,421 for audio; and 290 for luminance NTSC video
    • Generally requires hand coding for peak performance
  • First dot product example will not be optimized
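
The per-sample figures for speech and audio can be reproduced with simple arithmetic under assumptions the slide does not state: a 300 MHz clock, 8 kHz sampling for speech, and 44.1 kHz sampling for audio. The video figure depends on a luminance sampling rate that is not given, so it is not reproduced here.

```python
def peak_instructions_per_sample(clock_hz, instr_per_cycle, sample_rate_hz):
    """Peak instructions available between two consecutive samples (rounded down)."""
    return (clock_hz * instr_per_cycle) // sample_rate_hz


CLOCK_HZ = 300_000_000      # assumed clock rate, not stated on the slide
INSTR_PER_CYCLE = 8         # C6x peak, from the slide

speech = peak_instructions_per_sample(CLOCK_HZ, INSTR_PER_CYCLE, 8_000)    # assumed 8 kHz
audio = peak_instructions_per_sample(CLOCK_HZ, INSTR_PER_CYCLE, 44_100)    # assumed 44.1 kHz
print(speech, audio)
```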
Example: Vector Dot Product
  • Prologue
    • Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
    • Move the number of times to loop (N) into A2
    • Set accumulator (A4) to zero
  • Inner loop
    • Put a(n) into A0 and x(n) into A1
    • Multiply a(n) and x(n)
    • Accumulate multiplication result into A4
    • Decrement loop counter (A2)
    • Continue inner loop if counter is not zero
  • Epilogue
    • Store the result into Y
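
The same prologue, inner loop, and epilogue can be written as a short reference model, with variables named after the registers in the outline. This is a sketch only: list indices stand in for the pointers, the function returns the sum instead of storing it through A7, it assumes N is at least 1, and it does not model 16-bit data widths.

```python
def dot_product(a, x):
    # Prologue
    a5 = 0          # index into a(n), stands in for pointer A5
    a6 = 0          # index into x(n), stands in for pointer A6
    a2 = len(a)     # loop counter N
    a4 = 0          # accumulator
    # Inner loop: the body runs before the counter is tested, so N must be >= 1
    while True:
        a0 = a[a5]
        a5 += 1
        a1 = x[a6]
        a6 += 1
        a3 = a0 * a1    # multiply a(n) and x(n)
        a4 += a3        # accumulate
        a2 -= 1         # decrement loop counter
        if a2 == 0:
            break
    # Epilogue: return the result instead of storing it into Y
    return a4


print(dot_product([1, 2, 3], [4, 5, 6]))
```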
Example: Vector Dot Product

Using the A data path only, with coefficients a(n) and data x(n):

        ; clear A4 and initialize pointers A5, A6, and A7
        MVK   .S1   40,A2       ; A2 = 40 (loop counter)
loop    LDH   .D1   *A5++,A0    ; A0 = a(n)
        LDH   .D1   *A6++,A1    ; A1 = x(n)
        MPY   .M1   A0,A1,A3    ; A3 = a(n) * x(n)
        ADD   .L1   A3,A4,A4    ; Y = Y + A3
        SUB   .L1   A2,1,A2     ; decrement loop counter
 [A2]   B     .S1   loop        ; if A2 != 0, then branch
        STH   .D1   A4,*A7      ; *A7 = Y

References

  • [1] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, 1993.
  • [2] M. Smotherman, "Understanding EPIC Architectures and Implementations" (pdf).
  • [3] Lecture notes of Mark Smotherman.
  • [4] An Introduction To Very-Long Instruction Word (VLIW) Computer Architecture, Philips Semiconductors.
  • [5] Lecture 6 and Lecture 7 by Paul Pop.
  • [6] Texas Instruments, Tutorial on TMS320C6000 VelociTI Advanced VLIW Architecture.
  • [7] Morgan Kaufmann Website: Companion Web Site for Computer Organization and Design.