OR682/Math685/CSI700


Lecture 12

Fall 2000

High Performance Computing
  • Computer architectures
  • Computer memory
  • Floating-point operations
  • Compilers
  • Profiling
  • Optimization of programs
My Goals
  • Provide you with resources for dealing with large computational problems
  • Explain the basic workings of high-performance computers
  • Talk about compilers and their capabilities
  • Discuss debugging (this week) and profiling (12/13) tools in Matlab
Changes in Architectures
  • Then (1980s):
    • supercomputers (cost: $10M and up)
    • only a few in existence (often at government laboratories); custom made
    • (peak) speed: several hundred “megaflops” (millions of floating-point operations per second)
  • Now:
    • (clusters of) microprocessors (inexpensive)
    • can be easily assembled by almost anyone
    • commercial, off-the-shelf components
    • (peak) speed: gigaflops and higher
Modern “Supercomputers”
  • Multiprocessor
  • Based on commercial RISC (reduced instruction set computer) processors
  • Linked by high-speed interconnect or network
  • Communication by message passing (perhaps disguised from the user)
  • Hierarchy of local/non-local memory
Why Learn This?
  • Compilers have limited ability to match your algorithm/calculation to the computer
  • You will be better able to write software that will execute efficiently, by playing to the strengths of the compiler and the machine
Some Basics
  • Memory
    • main memory
    • cache
    • registers
  • Languages
    • machine
    • assembly
    • high-level (Fortran, C/C++)
    • Matlab?
  • Old Technology: CISC (complex instruction set computer)
    • assembly language instructions that resembled high-level language instructions
    • many tasks could be performed in hardware
    • reduced (slow) memory fetches for instructions
    • reduced (precious) memory requirements
Weaknesses of CISC?
  • None until relatively recently
  • Harder for compilers to exploit
  • Complicated processor design
    • hard to fit on a single chip
  • Hard to pipeline
    • pipeline: processing multiple instructions simultaneously in small stages
RISC Processors
  • Reduce # of instructions, and fit processor on a single chip (faster, cheaper, more reliable)
  • Other operations must be performed in software (slower)
  • All instructions the same length (32 bits); pipelining is possible
  • More instructions must be fetched from memory
  • Programs take up more space in memory
Early Examples
  • First became prominent in (Unix-based) scientific workstations:
    • Sun
    • Silicon Graphics
    • Apollo
    • IBM RS-6000
Characteristics of RISC
  • Instruction pipelining
  • Pipelining of floating-point operations
  • Uniform instruction length
  • Delayed branching
  • Load/Store architecture
  • Simple addressing modes
  • Note: modern RISC processors are no longer “simple” architectures
  • Clock & clock speed (cycles)
  • Goal: 1 instruction per clock cycle
  • Divide instruction into stages, & overlap:
    • instruction fetch (from memory)
    • instruction decode
    • operand fetch (from register or memory)
    • execute
    • write back (of results to register or memory)
  • Complicated memory fetch
    • stalls pipeline
  • Branch
    • may be a “no op” [harmless]
    • otherwise, need to flush pipeline (wasteful)
  • Branches occur every 5-10 instructions in many programs
Pipelined Floating-Point
  • Execution of a floating-point instruction can take many clock cycles (especially for multiplication and division)
  • These operations can also be pipelined
  • Modern hardware has reduced the time for a floating-point operation to 1-3 cycles
Uniform Instruction Length
  • CISC instructions came in varying lengths
    • length not known until it was decoded
    • this could stall the pipeline
  • For RISC processors, instructions are uniform length (32 bits)
    • no additional memory access required to decode instruction
  • Pipeline flows more smoothly
Delayed Branches
  • Branches lead to pipeline inefficiencies
  • Three possible approaches:
    • branch delay slot
      • potentially useful instruction inserted (by compiler) after the branch instruction
    • branch prediction
      • based on previous result of branch during execution of program
    • conditional execution (next slide)
Conditional Execution
  • Replace a branch with a conditional instruction:
  • Pipeline operates effectively.
Load/Store Architectures
  • Instructions limit memory references:
    • only explicit load and store instructions (no implicit or cascaded memory references)
    • only one memory reference per instruction
  • Keeps instructions the same length
  • Keeps pipeline simple (only one execution stage)
  • Memory load/store requests are already “slower” (complications would further stall the pipeline)
    • by the time the result is needed, the load/store is complete (you hope)
Simple Addressing Models
  • Avoid:
    • complicated address calculations
    • multiple memory references per instruction
  • Simulate complicated requests with a sequence of simple instructions
2nd Generation RISC Processors
  • Faster clock rate (smaller processor)
  • “Superscalar” processors: Duplicate compute elements (execute two instructions at once)
    • hard for compiler writers, hardware designers
  • “Superpipelining”: double the number of stages in the pipeline (each one twice as fast)
  • Speculative computation
For Next Class
  • Homework: see web site
  • Reading:
    • Dowd: chapters 3, 4, and 5