
Herbert G. Mayer, PSU CS

Status 10/9/2012

CS 201 Computer Systems Programming, Chapter 3: “Architecture Overview”

Syllabus

  • Computing History

  • Evolution of Microprocessor µP Performance

  • Processor Performance Growth

  • Key Architecture Messages

  • Code Sequences for Different Architectures

  • Dependencies, AKA Dependences

  • Score Board

  • References


Computing History

Before 1940

1643 Pascal’s Arithmetic Machine

About 1660 Leibniz Four Function Calculator

1710 -1750 Punched Cards by Bouchon, Falcon, Jacquard

1810 Babbage Difference Engine, unfinished; the first programmer ever was the poet Lord Byron’s daughter, Lady Ada Lovelace, after whom the language Ada was named

1835 Babbage Analytical Engine, also unfinished

1920 Hollerith Tabulating Machine to help with census in the USA


Computing History

Decade of 1940s

1939 – 1942 John Atanasoff built a programmable, electronic computer at Iowa State University

1936 - 1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; a colleague advised the use of vacuum tubes

1946 John von Neumann’s stored-program computer design

1946 Mauchly and Eckert complete ENIAC (Electronic Numerical Integrator and Computer) at the University of Pennsylvania, modeled in part on Atanasoff’s ideas; a 30-ton monster

1980s John Atanasoff finally receives official acknowledgment for his pioneering work


Computing History

Decade of the 1950s

UNIVAC uniprocessor, based on ENIAC, commercially viable; developed by John Mauchly and John Presper Eckert

Commercial systems sold by Remington Rand

Mark III computer

Decade of the 1960s

IBM’s 360 family co-developed with GE, Siemens, et al.

Transistor replaces vacuum tube

Burroughs stack machines, compete with GPR architectures

All still von Neumann architectures


Cache and VMM developed, first at Manchester University


Computing History

Decade of the 1970s

Birth of Microprocessor at Intel, see Gordon Moore

High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series

Architecture advances: Caches, VMM ubiquitous, since real memories were expensive

Intel 4004, Intel 8080, single-chip microprocessors

Programmable controllers

Minicomputers: PDP-11, HP 3000 16-bit computer

Height of Digital Equipment Corp. (DEC)

Birth of personal computers, which DEC misses


Computing History

Decade of the 1980s

Decrease of minicomputer use

32-bit computing even on minis

Architecture advances: superscalar, faster caches, larger caches

Multitude of Supercomputer manufacturers

Compiler complexity: trace-scheduling, VLIW

Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.


Computing History

Decade of the 1990s

Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution

Powerful desktops

End of mini-computer and of many super-computer manufacturers

Microprocessors as powerful as early supercomputers

Consolidation of many computer companies into a few large ones

End of Soviet Union marked the end of several supercomputer companies


Evolution of µP Performance (by: James C. Hoe @ CMU)


Processor Performance Growth

Moore’s Law --from Webopedia 8/27/2004:

“The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.

In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for at least another two decades.”

Others coin a more general law, stating that “the circuit density increases predictably over time.”


Processor Performance Growth

As of 2012, Moore’s Law has held true since ~1968.

Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semiconductor material.

This phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:

Cars would travel at 2,400,000 MPH and get 600,000 MPG

Air travel from LA to NYC would be at Mach 36,000, or take 0.5 seconds


Message 1: Memory is Slow

The inner core of the processor, the CPU or the µP, is getting faster at a steady rate

Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories

It is not uncommon on an MP server that a processor has to wait >100 cycles before a memory access completes. On a multi-processor the bus protocol is more complex due to snooping, backing off, arbitration, etc., which is why the number of cycles to complete an access can grow so high


Message 1: Memory is Slow

Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option, due to the price differential between cache and regular RAM

Another way of seeing this: Using solely reasonably-priced cache memories (say at <= 10 times the cost of regular memory) is not feasible: resulting address space would be too small

Almost all intellectual efforts in computer architecture focus on reducing the performance impact of fast processors accessing slow memories

All else seems easy compared to this fundamental problem!
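To put numbers on this, a common back-of-the-envelope model (not from the slides) is the average memory access time, AMAT = hit time + miss rate * miss penalty. A minimal C sketch, with assumed cycle costs (1-cycle hit, 100-cycle miss penalty), shows how even small miss rates inflate the average:

#include <stdio.h>

/* AMAT sketch: the hit and miss-penalty cycle counts are assumptions for
 * illustration, not figures from the slides.                             */
int main(void) {
    const double hit_time     = 1.0;    /* cycles for a cache hit (assumed)  */
    const double miss_penalty = 100.0;  /* cycles to reach memory (assumed)  */
    for (int pct = 1; pct <= 10; pct++) {
        double miss_rate = pct / 100.0;
        double amat = hit_time + miss_rate * miss_penalty;
        printf("miss rate %2d%%  ->  average access time %5.1f cycles\n",
               pct, amat);
    }
    return 0;
}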


Message 1: Memory is Slow





[Figure: processor vs. memory performance over time, per “Moore’s Law”; the processor/memory performance gap grows ~50% per year. Source: David Patterson, UC Berkeley]


Message 2: Events Tend to Cluster

A strange thing happens during program execution: Seemingly unrelated events tend to cluster

Memory accesses tend to concentrate the majority of their referenced addresses in a small portion of the total address space. Even if all of memory is eventually accessed, this phenomenon is observed over shorter periods of time: one memory access seems independent of another, yet both happen to fall onto the same page (or working set of pages)

We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches, and to increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial versus temporal locality
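A small sketch (mine, not from the slides) of spatial locality: the two functions below touch exactly the same matrix elements, but the row-major walk visits consecutive addresses and therefore reuses cache lines and pages, while the column-major walk keeps jumping across them. The matrix dimensions are arbitrary.

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static double m[ROWS][COLS];   /* C stores this row-major */

/* Row-major traversal: consecutive accesses are neighbors in memory, so
 * most of them fall into the same cache line / page: good spatial locality. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Column-major traversal of the same data: each access jumps by a whole
 * row (COLS * sizeof(double) bytes), defeating spatial locality and
 * typically running noticeably slower.                                   */
double sum_col_major(void) {
    double s = 0.0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}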


Message 2: Events Tend to Cluster

Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries

Incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key, happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search

The programmer must watch out for the phenomenon of clustering, as it is undesirable in hashing!
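A minimal sketch (mine, not from the slides) of such clustering: hashing identifiers by the sum of their character codes sends every permutation of the same characters to the same bucket. The identifiers and the table size are made up for illustration.

#include <stddef.h>
#include <stdio.h>

#define TABLE_SIZE 16   /* deliberately small; assumed for illustration */

/* A poor hash: sum of character codes. Identifiers that are permutations
 * of each other collide in the same bucket.                              */
static unsigned bad_hash(const char *key) {
    unsigned h = 0;
    for (; *key; key++)
        h += (unsigned char)*key;
    return h % TABLE_SIZE;
}

int main(void) {
    const char *ids[] = { "tmp1", "tmp2", "1tmp", "2tmp", "ptm1", "mpt2" };
    for (size_t i = 0; i < sizeof ids / sizeof ids[0]; i++)
        printf("%-5s -> bucket %u\n", ids[i], bad_hash(ids[i]));
    /* "tmp1", "1tmp", "ptm1" all land in the same bucket: clustering. */
    return 0;
}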


Message 2: Events Tend to Cluster

Clustering happens in many diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by keeping a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices

This is due to Data Locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future

Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon
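As a small illustration of why nearby data cluster into the same line (assuming a 64-byte cache line, a size the slides do not specify), the index of the line an address falls into changes only every 64 bytes, so one miss pulls in several neighboring array elements:

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* bytes per cache line; an assumption for illustration */

/* Index of the cache line an address falls into: addresses that share a
 * line differ only below the LINE_SIZE boundary.                         */
static uintptr_t line_index(const void *p) {
    return (uintptr_t)p / LINE_SIZE;
}

int main(void) {
    double a[16];
    /* Consecutive doubles usually share a line, so after one miss the
     * next several accesses are nearly free.                             */
    for (int i = 0; i < 16; i++)
        printf("&a[%2d] -> cache line %llu\n", i,
               (unsigned long long)line_index(&a[i]));
    return 0;
}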


Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) increases performance and thus generally “is good”

Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable

This comes at the cost of higher current and thus more heat generated in the same physical space, the geometry (the real estate) of the silicon processor or chipset

But the silicon part acts like a conductor whose resistance drops as it gets warmer (a negative temperature coefficient, NTC). Since the power supply is a constant-current source, the lower resistance causes a lower voltage, shown as VDroop in the accompanying figure


Message 3: Heat is Bad

This in turn means the voltage must be increased artificially to sustain the clock rate, creating more heat and ultimately leading to self-destruction of the part

Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep states of the silicon part (processor as well as chipset) and Turbo Boost mode, to contain heat generation while boosting the clock speed just at the right time

It is fortunate that, to date, silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Otherwise CPUs would become larger, more expensive, and above all hotter.


Message 4: Resource Replication

Architects cannot increase clock speed beyond physical limitations

One cannot decrease the die size beyond evolving technology

Yet speed improvements are desired, and achieved

This conflict can partly be overcome with replicated resources! But careful!


Message 4: Resource Replication

The key obstacle to parallel execution is data dependence in the SW under execution: a datum cannot be used before it has been computed

Compiler optimization technology calls this use-def dependence (short for use-definition dependence), AKA true dependence, AKA data dependence

Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus:


Message 4: Resource Replication

At the very low level of registers, at the machine level – done by HW

At the low level of individual machine instructions – done by HW

At the medium level of subexpressions in a program – done by compiler

At the higher level of distinct statements in a high-level program – done by optimizing compiler or by programmer

Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results – done by the user


Message 4: Resource Replication

Whenever program portions are independent of one another, they can be computed at the same time: in parallel

Architects provide resources for this parallelism

Compilers need to uncover opportunities for parallelism

If two actions are independent of one another, they can be computed simultaneously

Provided that HW resources exist, that the absence of dependence has been proven, and that the independent execution paths are scheduled on these replicated HW resources! Generally this is a complex undertaking!
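A tiny sketch (mine, not from the slides) of the distinction at the statement level: the first two assignments below are independent of each other and could run simultaneously on replicated resources, while the third carries a true dependence on both and must wait.

/* Independence vs. true dependence at the statement level -- illustrative only. */
int combine(int a, int b, int c, int d) {
    int x = a * b;   /* independent of y: may execute in parallel        */
    int y = c * d;   /* independent of x: may execute in parallel        */
    int z = x + y;   /* use-def (true) dependence on x and y: must wait  */
    return z;
}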


Code 1 for Different Architectures

Example 1: Object Code Sequence Without Optimization

Strict left-to-right translation, no smarts in mapping

Consider non-commutative subtraction and division operators

No common subexpression elimination (CSE), and no register reuse

Conventional operator precedence

For Single Accumulator SAA, Three-Address GPR, Stack Architectures

Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c
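As a rough C-level illustration (mine, not the slide’s own code table for the three architectures), a strictly left-to-right translation with one temporary per operation looks like this: the common subexpression ( a + 3 ) is computed twice, and the non-commutative subtraction and division keep their source operand order.

/* d = (a + 3) * b - (a + 3) / c, translated naively: one temporary per
 * operation, no common-subexpression elimination, no register reuse.    */
int d_unoptimized(int a, int b, int c) {
    int t1 = a + 3;    /* first (a + 3)                                   */
    int t2 = t1 * b;   /* (a + 3) * b                                     */
    int t3 = a + 3;    /* second (a + 3), recomputed: no CSE              */
    int t4 = t3 / c;   /* (a + 3) / c -- division is non-commutative      */
    return t2 - t4;    /* subtraction is non-commutative: order preserved */
}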


Code 1 for Different Architectures

Three-address code looks shortest, w.r.t. number of instructions

Maybe an optical illusion: must also consider the number of bits per instruction

Must consider number of I-fetches, operand fetches

Must consider total number of stores

Numerous memory accesses on SAA due to temporary values held in memory

Most memory accesses on the Stack Architecture, since everything requires a memory access

Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order

Important architectural feature? Only if SW cannot handle this; compiler can

No need for reverse-operation opcodes for Three-Address architecture

Decide in Three-Address architecture how to encode operand types

Numerous stack instructions, i.e. many bits for opcodes, since each operand fetch is a separate instruction


Code 2 for Different Architectures

This time we eliminate the common subexpression

Compiler handles left-to-right order for non-commutative operators on SAA

Better code for: d = ( a+3 ) * b - ( a+3 ) / c
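Again at the C level (a sketch of the idea, not the slide’s code table), with the common subexpression computed once and reused:

/* d = (a + 3) * b - (a + 3) / c with common-subexpression elimination:
 * (a + 3) is evaluated once, saving one add and its operand fetches.    */
int d_cse(int a, int b, int c) {
    int t = a + 3;           /* the common subexpression, computed once */
    return t * b - t / c;
}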


Code 2 for Different Architectures

Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register!!

SAA could use negate instruction or reverse subtract

Register-use optimized for Three-Address architecture; but dup and xch are newly added instructions

Common subexpression optimized on the Stack Machine by duplicating, exchanging, etc.

20% reduction for Three-Address, 18% for SAA, only 8% for the Stack Machine


Code 3 for Different Architectures

Analyze similar source expressions but with reversed operator precedence

One operator sequence associates right-to-left, due to precedence

Compiler uses commutativity

The other left-to-right, due to explicit parentheses

Use simple-minded code model: no cache, no optimization

Will there be advantages/disadvantages due to architecture?

Expression 1 is: e ← a + b * c ^ d


Code 3 for Different Architectures

  • Expression 1 is: e ← a + b * c ^ d


Code 3 for Different Architectures

  • Expression 2 is: f ← ( ( g + h ) * i ) ^ j
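A C-level sketch of the two association orders (mine, not the slide’s code tables); on the slides ^ denotes exponentiation, rendered here with pow() since C has no exponentiation operator:

#include <math.h>

/* Expression 1: e = a + b * c ^ d.  Exponentiation binds tightest and
 * associates right-to-left, so c ^ d is evaluated first, then *, then +. */
double expr1(double a, double b, double c, double d) {
    double t1 = pow(c, d);   /* c ^ d       */
    double t2 = b * t1;      /* b * (c ^ d) */
    return a + t2;
}

/* Expression 2: f = ((g + h) * i) ^ j.  The explicit parentheses force
 * left-to-right evaluation before the final exponentiation.              */
double expr2(double g, double h, double i, double j) {
    double t1 = g + h;
    double t2 = t1 * i;
    return pow(t2, j);
}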


Code For Stack Architecture

A Stack Machine with no registers is inherently slow: memory accesses!!!

Implement a few top-of-stack elements via HW shadow registers → a cache

Measure equivalent code sequences with/without consideration for cache

Top-of-stack register tos points to last valid word on physical stack

Two shadow registers may hold 0, 1, or 2 true top words

Top of stack cache counter tcc specifies number of shadow registers in use

Thus tos plus tcc jointly specify true top of stack


Code For Stack Architecture

Timings for push, pushlit, add, pop operations depend on tcc

Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself

Generally, further memory access adds 2 cycles

For stack changes use some defined policy, e.g. keep tcc 50% full

Table below refines timings for stack with shadow registers

Note: push x into cache with free space requires 2 cycles: cache adjustment is done at the same time as memory fetch
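A small cycle-accounting sketch (mine; the cycle costs follow the statements above where given, and the spill policy is an assumption) of how the cost of push depends on whether a shadow register is free:

#include <stdio.h>

#define SHADOW 2           /* two shadow registers, as on the slides      */

static int  tcc    = 0;    /* shadow registers currently in use           */
static int  tos    = 0;    /* words on the in-memory stack                */
static long cycles = 0;

/* push of a variable: 2 cycles when a shadow register is free (the cache
 * adjustment overlaps the operand fetch); when the cache is full, one
 * shadow entry is first spilled to memory, a further memory access that
 * adds 2 cycles.                                                          */
static void push_var(void) {
    if (tcc < SHADOW) {
        tcc += 1;
        cycles += 2;
    } else {
        tos += 1;          /* spill the oldest shadow entry                */
        cycles += 2 + 2;
    }
}

int main(void) {
    for (int i = 0; i < 5; i++)    /* e.g. push a, b, c, d, e              */
        push_var();
    printf("5 pushes, %ld cycles (2 + 2 + 4 + 4 + 4)\n", cycles);
    return 0;
}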


Code For Stack Architecture

Code emission for: a + b * c ^ ( d + e * f ^ g )

Let + and * be commutative, by language rule

Architecture here has 2 shadow registers, compiler exploits this

Assume initially empty 2-word cache


Code For Stack Architecture


[Table: two emitted code sequences for a + b * c ^ ( d + e * f ^ g ), with per-instruction cycle counts. The left column pushes operands in source order (push a, push b, push c, push d, push e, push f, push g, ...); the right column reorders the operand fetches (push f, push g, push e, push d, push c, push b, push a, ...) and uses swap and expo so the operations complete in the two shadow registers.]

Code For Stack Architecture

Blind code emission, i.e. not taking advantage of the tcc knowledge, costs 40 cycles: it costs performance

Code emission with shadow register consideration costs 20 cycles

True penalty for memory access is worse in practice

A tremendous speed-up is always possible when fixing a system with severe flaws

The return on investment for two registers is twice the original performance

Such strong speedup is an indicator that the starting architecture was poor

Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance

Note that indexing, looping, indirection, call/return are not addressed here


Register Dependencies

Inter-instruction dependencies, also known as dependences, arise between registers being defined and used

One instruction computes a result into a register (or memory), another instruction needs that result from the register (or that memory location)

Or, one instruction uses a datum; only after this use may that same datum be recomputed


Register Dependencies

True Dependence, AKA Data Dependence:

r3 ← r1 op r2

r5 ← r3 op r4    Read after Write, RAW

Anti-Dependence, not a true dependence

can be parallelized under the right conditions

r3 ← r1 op r2

r1 ← r5 op r4    Write after Read, WAR

Output Dependence

r3 ← r1 op r2

r5 ← r3 op r4

r3 ← r6 op r7    Write after Write, WAW, use in between
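The same three dependence kinds expressed on plain C variables, as a compact sketch (mine, not from the slides):

/* RAW, WAR, and WAW on ordinary variables -- illustrative only. */
int deps(int r1, int r2, int r4, int r6, int r7) {
    int r3, r5;
    r3 = r1 + r2;   /* defines r3                                              */
    r5 = r3 + r4;   /* RAW (true dependence): reads the r3 just written        */
    r1 = r5 + r4;   /* WAR (anti-dependence): writes r1, which was read above  */
    r3 = r6 + r7;   /* WAW (output dependence): rewrites r3 after its use      */
    return r1 + r3 + r5;
}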


Register Dependencies

Control Dependence:

if ( condition1 ) {
    r3 = r1 op r2;
} else {              // see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );


Register Renaming

Only a true dependence is a real dependence, AKA a data dependence

Others are artifacts of insufficient resources, generally register resources

But that means that, if only more registers were available, replacing the conflicting registers with new ones would make the conflict disappear

Anti- and Output-Dependences are such false dependencies


Register Renaming

Original Dependences:

L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7

Renamed Situation, Dependences Gone:

L1: r10 ← r2 op r30   -- r30 has r3 copy
L2: r4 ← r10 op r5
L3: r1 ← r30 op r6
L4: r3 ← r1 op r7

The dependences before:

L1, L2 true-Dep with r1
L1, L3 output-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3

The dependences after:

L1, L2 true-Dep with r10
L3, L4 true-Dep with r1


Register Renaming

With additional or renamed regs, the new code runs in half the time!

First: compute into r10 instead of r1, at no cost

Also: compute into r30, with no added copy operations, just more registers a priori

The registers live afterwards are: r1, r3, r4

While r10 and r30 are don’t cares


Score Board

The score board is an array of programmable bits sb[]

Manages HW resources, specifically registers

Single-bit array, any one bit associated with one specific register

Association by index, i.e. by name: sb[i] belongs to register ri

Only if sb[i] = 0 does register ri have valid data

If sb[i] = 0, then register ri is NOT in the process of being written

If bit i is set, i.e. if sb[i] = 1, then register ri has stale data

Initially all sb[*] are stale, i.e. set to 1


Score Board

Execution constraints:

rd ← rs op rt

If sb[s] or sb[t] is set → RAW dependence, hence stall the computation; wait until both sb[s] and sb[t] are 0

If sb[d] is set → WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes choose to use another register instead of rd

Else dispatch the instruction immediately
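A minimal C sketch of the dispatch check (my own illustration of the rules above, not the slide’s HW design): sb[i] = 1 marks register ri as still being written; an instruction rd ← rs op rt may dispatch only when sb[s], sb[t], and sb[d] are all 0, and writeback clears sb[d].

#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 32

static bool sb[NUM_REGS];   /* sb[i] == 1: register ri holds stale data,
                               i.e. a write to ri is still in flight      */

/* Returns true and claims rd if "rd <- rs op rt" may dispatch now;
 * returns false if the instruction must stall.                           */
static bool try_dispatch(int d, int s, int t) {
    if (sb[s] || sb[t]) return false;   /* RAW: an operand is not ready   */
    if (sb[d])          return false;   /* WAW: earlier write to rd pends */
    sb[d] = true;                       /* rd is now being written        */
    return true;
}

/* Writeback: the value of rd has been computed, so clear sb[d]. */
static void writeback(int d) {
    sb[d] = false;
}

int main(void) {
    /* For simplicity this sketch starts with all registers valid.
     * r3 <- r1 op r2, then r5 <- r3 op r4: the second must wait for r3.  */
    if (try_dispatch(3, 1, 2))  puts("dispatched: r3 <- r1 op r2");
    if (!try_dispatch(5, 3, 4)) puts("stalled:    r5 <- r3 op r4 (RAW on r3)");
    writeback(3);
    if (try_dispatch(5, 3, 4))  puts("dispatched: r5 <- r3 op r4");
    return 0;
}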


Score Board

To allow out of order (ooo) execution, upon computing the value of rd

Update rd, and clear sb[d]

For uses (references), HW may use any register i, whose sb[i] is 0

For definitions (assignments), HW may set any register j, whose sb[j] is 0

Independent of the original order in which the source program was written, i.e. possibly ooo

References

  • The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html

  • Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations

  • http://en.wikipedia.org/wiki/Moore's_law

  • C. A. R. Hoare’s comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf

  • Gibbons, P. B., and Steven Muchnick [1986]. “Efficient Instruction Scheduling for a Pipelined Architecture”, ACM SIGPLAN Notices, Proceedings of the 1986 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16

  • Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/

  • Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm

  • Words of wisdom: http://www.cs.yale.edu/quotes.html

  • John von Neumann’s computer design: A.H. Taub (ed.), “Collected Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., New York 1963
