
  1. Herbert G. Mayer, PSU CS. Status 1/28/2013. CS 201 Computer Systems Programming, Chapter 3 "Architecture Overview"

  2. Syllabus • Computing History • Evolution of Microprocessor µP Performance • Processor Performance Growth • Key Architecture Messages • Code Sequences for Different Architectures • Dependencies, AKA Dependences • Score Board • References

  3. Computing History
      Before 1940
      1643        Pascal's Arithmetic Machine
      About 1660  Leibniz Four-Function Calculator
      1710-1750   Punched cards by Bouchon, Falcon, Jacquard
      1810        Babbage Difference Engine, unfinished; the first programmer ever in the world was Ada, poet Lord Byron's daughter, after whom the language Ada was named: Lady Ada Lovelace
      1835        Babbage Analytical Engine, also unfinished
      1920        Hollerith Tabulating Machine to help with the census in the USA

  4. Computing History
      Decade of the 1940s
      1939-1942   John Atanasoff built an early electronic digital computer at Iowa State University
      1936-1945   Konrad Zuse's Z3 and Z4, early electro-mechanical computers based on relays; a colleague advised the use of "vacuum tubes"
      1946        John von Neumann's computer design of the stored program
      1946        Mauchly and Eckert built ENIAC, modeled after Atanasoff's ideas, at the University of Pennsylvania: Electronic Numerical Integrator and Computer, a 30-ton monster
      1980s       John Atanasoff got official acknowledgment and patent credit

  5. Computing History
      Decade of the 1950s
      • UNIVAC uniprocessor, based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert
      • Commercial systems sold by Remington Rand
      • Mark III computer
      Decade of the 1960s
      • IBM's 360 family, co-developed with GE, Siemens, et al.
      • Transistor replaces vacuum tube
      • Burroughs stack machines compete with GPR architectures
      • All still von Neumann architectures
      • 1969 ARPANET
      • Cache and VMM developed, first at Manchester University

  6. Computing History
      Decade of the 1970s
      • Birth of the microprocessor at Intel; see Gordon Moore
      • High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series
      • Architecture advances: caches and virtual memories (VMM) ubiquitous, since real memories were expensive
      • Intel 4004, Intel 8080, single-chip microprocessors
      • Programmable controllers
      • Mini-computers: PDP 11, HP 3000 16-bit computer
      • Height of Digital Equipment Corp. (DEC)
      • Birth of personal computers, which DEC misses!

  7. Computing History
      Decade of the 1980s
      • Decrease of mini-computer use
      • 32-bit computing, even on minis
      • Architecture advances: superscalar, faster caches, larger caches
      • Multitude of supercomputer manufacturers
      • Compiler complexity: trace-scheduling, VLIW
      • Workstations common: Apollo, HP, DEC (Ken Olsen trying to catch up), Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.

  8. Computing History
      Decade of the 1990s
      • Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution
      • Powerful desktops
      • End of the mini-computer and of many supercomputer manufacturers
      • Microprocessors as powerful as early supercomputers
      • Consolidation of many computer companies into a few large ones
      • End of the Soviet Union marked the end of several supercomputer companies

  9. Evolution of µP Performance (by: James C. Hoe @ CMU)

  10. Processor Performance Growth
      Moore's Law, from Webopedia 8/27/2004:
      "The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades. Others coin a more general law, stating that 'the circuit density increases predictably over time.'"

  11. Processor Performance Growth
      • As of 2013, Moore's Law has held true since roughly 1968
      • Some Intel fellows believe that an end to Moore's Law will be reached around 2018, due to physical limitations in the process of manufacturing transistors from semiconductor material
      • This phenomenal growth is unknown in any other industry. For example, if a doubling of performance had been achieved every 18 months, then by 2001 other industries would have achieved the following: cars would travel at 2,400,000 mph and get 600,000 MpG; air travel from LA to NYC would be at Mach 36,000, or take 0.5 seconds

  12. Message 1: Memory is Slow
      • The inner core of the processor, the CPU or µP, is getting faster at a steady rate
      • Access to memory is also getting faster over time, but at a slower rate
      • This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories
      • It is not uncommon on an MP server that the processor has to wait more than 100 cycles before a single memory access completes
      • On a multi-processor, the bus protocol is more complex due to snooping, backing-off, and arbitration, so the number of cycles to complete a memory access can grow even higher
      • IO simply compounds the problem of slow memory access

  13. Message 1: Memory is Slow
      • Discarding conventional memory altogether and relying only on cache-like memories is NOT an option for 64-bit architectures, due to the price, size, and power required if you pursue full memory population with 2^64 bytes
      • Another way of seeing this: using solely reasonably-priced cache memories (say, at < 10 times the cost of regular memory) is not feasible; the resulting physical address space would be too small, or the price too high
      • Significant intellectual effort in computer architecture focuses on reducing the performance impact of fast processors accessing slow memories
      • All else, except IO, seems easy compared to this fundamental problem!
      • IO is even slower, by orders of magnitude
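
      To make the latency gap concrete, here is a hedged micro-benchmark sketch in C (not from the slides): it times dependent loads that chase a randomly shuffled chain through a large array, so each load must finish before the next can start, approximating raw memory latency. The array size, the home-grown shuffle generator, and the POSIX clock_gettime timing are illustrative assumptions.

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      #define N (1u << 24)                /* 16M pointers (128 MB), far larger than any cache */

      /* Small 64-bit LCG, used only for shuffling; rand() may be too narrow here. */
      static size_t xrand(void)
      {
          static unsigned long long s = 88172645463325252ULL;
          s = s * 6364136223846793005ULL + 1442695040888963407ULL;
          return (size_t)(s >> 16);
      }

      int main(void)
      {
          size_t *next = malloc((size_t)N * sizeof *next);
          if (!next) return 1;

          /* Sattolo's algorithm builds a single random cycle: loads rarely hit the
           * cache, and each load depends on the previous one, so none can overlap. */
          for (size_t i = 0; i < N; i++) next[i] = i;
          for (size_t i = N - 1; i > 0; i--) {
              size_t j = xrand() % i;
              size_t t = next[i]; next[i] = next[j]; next[j] = t;
          }

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);

          size_t idx = 0;                 /* chase the chain: dependent loads only */
          for (size_t i = 0; i < N; i++) idx = next[idx];

          clock_gettime(CLOCK_MONOTONIC, &t1);
          double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
          printf("avg dependent-load latency: %.1f ns (idx = %zu)\n", ns / N, idx);
          free(next);
          return 0;
      }

      On typical hardware the reported latency is tens of nanoseconds per load, i.e. on the order of 100+ CPU cycles, which is the point of this message.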

  14. Message 1: Memory is Slow
      [Figure: processor vs. DRAM performance, 1980-2002. CPU performance ("Moore's Law") grows ~60% per year, DRAM performance ~7% per year; the processor-memory performance gap grows ~50% per year. Source: David Patterson, UC Berkeley]

  15. Message 2: Events Tend to Cluster
      • A strange thing happens during program execution: seemingly unrelated events tend to cluster
      • Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space
      • Even if all of memory is accessed, during some periods of time such clustering is observed
      • Intuitively, one memory access seems independent of another, but both happen to fall onto the same page (or working set of pages)
      • We call this phenomenon Locality! Architects exploit locality to speed up memory access via caches, and to increase the address range beyond physical memory via Virtual Memory Management
      • Distinguish spatial versus temporal locality; a small illustration follows below
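
      As a hedged illustration of spatial locality (not from the slides), the C sketch below sums a matrix twice: once row by row, touching consecutive addresses that share cache lines, and once column by column, jumping a full row between accesses. The matrix dimension and the clock_gettime timing are illustrative assumptions; on most machines the row-major walk is noticeably faster.

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      #define N 4096                          /* 4096 x 4096 doubles = 128 MB */

      static double sum_rowwise(const double *m)   /* consecutive addresses: good spatial locality */
      {
          double s = 0.0;
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  s += m[i * N + j];
          return s;
      }

      static double sum_colwise(const double *m)   /* stride of N doubles: poor spatial locality */
      {
          double s = 0.0;
          for (int j = 0; j < N; j++)
              for (int i = 0; i < N; i++)
                  s += m[i * N + j];
          return s;
      }

      static double seconds(void)
      {
          struct timespec t;
          clock_gettime(CLOCK_MONOTONIC, &t);
          return t.tv_sec + t.tv_nsec * 1e-9;
      }

      int main(void)
      {
          double *m = malloc((size_t)N * N * sizeof *m);
          if (!m) return 1;
          for (size_t k = 0; k < (size_t)N * N; k++) m[k] = 1.0;

          double t0 = seconds();
          double r  = sum_rowwise(m);
          double t1 = seconds();
          double c  = sum_colwise(m);
          double t2 = seconds();

          printf("row-major: %.3f s   column-major: %.3f s   (sums %.0f %.0f)\n",
                 t1 - t0, t2 - t1, r, c);
          free(m);
          return 0;
      }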

  16. Message 2: Events Tend to Cluster
      • Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries
      • An incoming search key (say, a C++ program identifier) is mapped to an index, but the next, completely unrelated key happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search
      • The programmer must watch out for the phenomenon of clustering, as it is undesired in hashing! A small sketch follows below
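
      A hedged sketch of such clustering (the hash functions and sample identifiers are illustrative assumptions, not from the slides): a naive hash that merely sums characters sends many short, similar identifiers to the same few buckets, while a mixing hash spreads them out.

      #include <stdio.h>

      #define TABLE_SIZE 64

      /* Naive hash: sum of characters. Anagrams and near-duplicates collide heavily. */
      static unsigned hash_sum(const char *s)
      {
          unsigned h = 0;
          while (*s) h += (unsigned char)*s++;
          return h % TABLE_SIZE;
      }

      /* Mixing hash (djb2-style): spreads similar keys apart. */
      static unsigned hash_mix(const char *s)
      {
          unsigned h = 5381;
          while (*s) h = h * 33 + (unsigned char)*s++;
          return h % TABLE_SIZE;
      }

      int main(void)
      {
          /* Typical clusters of program identifiers: anagrams and near-duplicates. */
          const char *keys[] = { "tmp1", "tmp2", "tmp3", "pmt1", "mpt2", "tpm3",
                                 "abc",  "acb",  "bac",  "bca",  "cab",  "cba" };
          int n = (int)(sizeof keys / sizeof keys[0]);
          int sum_buckets[TABLE_SIZE] = {0}, mix_buckets[TABLE_SIZE] = {0};

          for (int i = 0; i < n; i++) {
              sum_buckets[hash_sum(keys[i])]++;
              mix_buckets[hash_mix(keys[i])]++;
          }

          int sum_max = 0, mix_max = 0;
          for (int b = 0; b < TABLE_SIZE; b++) {
              if (sum_buckets[b] > sum_max) sum_max = sum_buckets[b];
              if (mix_buckets[b] > mix_max) mix_max = mix_buckets[b];
          }
          printf("largest cluster: sum-hash %d keys, mixing hash %d keys (of %d)\n",
                 sum_max, mix_max, n);
          return 0;
      }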

  17. Message 2: Events Tend to Cluster
      • Clustering happens in many diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by keeping a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices to speed up execution
      • This is due to data locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future
      • Thus they happen to reside in the same cache line. Architects exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon

  18. Message 3: Heat is Bad
      • Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally "is good"
      • Other performance parameters, such as memory access speed, peripheral access, etc., do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable
      • This comes at the cost of higher current, and thus more heat generated in the identical physical geometry (the real estate) of the silicon processor or chipset
      • But the silicon part conducts better as it gets warmer; it behaves like a negative temperature coefficient resistor, or NTC. Since the power supply is a constant-current source, a lower resistance causes a lower voltage, shown as VDroop in the figure below

  19. Message 3: Heat is Bad

  20. Message 3: Heat is Bad
      • This in turn means the voltage must be increased artificially to sustain the clock rate, which creates more heat and ultimately leads to self-destruction of the part
      • Great efforts are being made to increase the clock speed, which requires more voltage, while at the same time reducing heat generation. Current technologies include sleep states of the silicon part (processor as well as chipset) and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time
      • It is fortunate that, to date, silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Otherwise CPUs would become larger, more expensive, and above all: hotter

  21. Message 4: Resource Replication
      • Architects cannot increase clock speed beyond physical limitations
      • One cannot decrease the die size beyond evolving technology
      • Yet speed improvements are desired, and achieved
      • This conflict can partly be overcome with replicated resources! But careful!

  22. Message 4: Resource Replication
      • The key obstacle to parallel execution is data dependence in the SW under execution: a datum cannot be used before it has been computed
      • Compiler optimization technology calls this use-def dependence (short for use-before-definition and definition-before-use dependence), AKA true dependence, AKA data dependence
      • The goal is to search for program portions that are independent of one another. This can be done at multiple levels of focus

  23. Message 4: Resource Replication
      • At the very low level of registers, at the machine level: done by HW; see also score board
      • At the low level of individual machine instructions: done by HW; see also superscalar architecture
      • At the medium level of subexpressions in a program: done by the compiler; see CSE
      • At the higher level of several statements written in sequence in a high-level language program: done by the optimizing compiler or by the programmer
      • Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results: done by the user running concurrent programs

  24. Message 4: Resource Replication
      • Whenever program portions are independent of one another, they can be computed at the same time: in parallel
      • Architects provide resources for this parallelism; compilers need to uncover opportunities for parallelism
      • If two actions are independent of one another, they can be computed simultaneously
      • Provided that HW resources exist, that the absence of dependence has been proven, and that independent execution paths are scheduled on these replicated HW resources; a small sketch follows below
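
      A hedged sketch (not from the slides) of independence exploited at the "very high level": two computations that share no data run on replicated HW resources via two POSIX threads. The function, the input values, and the use of pthreads (compile with -pthread) are illustrative assumptions; a true dependence between the two computations would force sequential execution.

      #include <pthread.h>
      #include <stdio.h>

      /* Two computations with independent inputs and independent results:
       * no use-def dependence exists between them, so they may run in parallel. */
      static void *sum_to(void *arg)
      {
          long long n = *(long long *)arg;
          long long s = 0;
          for (long long i = 1; i <= n; i++) s += i;
          *(long long *)arg = s;              /* result overwrites the private input */
          return NULL;
      }

      int main(void)
      {
          long long a = 100000000LL, b = 200000000LL;   /* independent data */
          pthread_t ta, tb;

          pthread_create(&ta, NULL, sum_to, &a);   /* replicated resource: core 1 */
          pthread_create(&tb, NULL, sum_to, &b);   /* replicated resource: core 2 */
          pthread_join(ta, NULL);
          pthread_join(tb, NULL);

          /* Only here do the two results meet; the joins are the first dependence. */
          printf("sums: %lld and %lld\n", a, b);
          return 0;
      }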

  25. Code 1 for Different Architectures
      • Example 1: object code sequence without optimization
      • Strict left-to-right translation, no smarts in mapping
      • Consider non-commutative subtraction and division operators
      • No common subexpression elimination (CSE), and no register reuse
      • Conventional operator precedence
      • For Single Accumulator SAA, Three-Address GPR, and Stack architectures
      • Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c

  26. Code 1 for Different Architectures

  27. Code 1 for Different Architectures
      • Three-address code looks shortest w.r.t. number of instructions
      • Maybe an optical illusion: must also consider the number of bits per instruction
      • Must consider the number of I-fetches, operand fetches, and the total number of stores
      • Numerous memory accesses on SAA (Single Accumulator Architecture) due to temporary values held in memory
      • Most memory accesses on SA (Stack Architecture), since everything requires a memory access
      • The Three-Address architecture is immune to the commutativity constraint, since operands may be placed in registers in either order
      • No need for reverse-operation opcodes on the Three-Address architecture
      • Must decide in the Three-Address architecture how to encode operand types

  28. Code 2 for Different Architectures
      • This time we eliminate the common subexpression (CSE)
      • Compiler handles left-to-right order for non-commutative operators on SAA
      • Better: d ← ( a + 3 ) * b - ( a + 3 ) / c; see the source-level sketch below
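
      A hedged source-level view of what CSE does here (the temporary name t and the operand values are illustrative assumptions): the compiler computes a + 3 once into a temporary and reuses it, which each of the three architectures can then map to a register, a memory temporary, or a duplicated stack entry.

      #include <stdio.h>

      int main(void)
      {
          int a = 5, b = 7, c = 3, d;

          /* Unoptimized: (a + 3) is evaluated twice. */
          d = (a + 3) * b - (a + 3) / c;
          printf("d = %d\n", d);

          /* After common subexpression elimination: evaluate (a + 3) once. */
          int t = a + 3;              /* the common subexpression, kept in a temp/register */
          d = t * b - t / c;
          printf("d = %d\n", d);

          return 0;
      }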

  29. Code 2 for Different Architectures

  30. Code 2 for Different Architectures
      • The Single Accumulator Architecture (SAA), even optimized, still needs temporary storage; it uses temp1 for the common subexpression; it has no other register!!
      • SAA could use a negate instruction or reverse subtract
      • Register use is optimized for the Three-Address architecture
      • The common subexpression is optimized on the Stack Machine by duplicating, exchanging, etc.; dup and xch are newly added instructions
      • Reduction of 20% for Three-Address, 18% for SAA, only 8% for the Stack Machine

  31. Code 3 for Different Architectures
      • Analyze similar source expressions, but with reversed operator precedence
      • One operator sequence associates right-to-left, due to precedence
      • Compiler uses commutativity
      • The other associates left-to-right, due to explicit parentheses
      • Use a simple-minded code model: no cache, no optimization
      • Will there be advantages/disadvantages due to architecture?
      • Expression 1 is: e ← a + b * c ^ d

  32. Code 3 for Different Architectures
      • Expression 1 is: e ← a + b * c ^ d

  33. Code 3 for Different Architectures
      • Expression 2 is: f ← ( ( g + h ) * i ) ^ j

  34. Code For Stack Architecture
      • A Stack Machine with no registers is inherently slow: memory accesses!!!
      • Implement a few top-of-stack elements via HW shadow registers → a cache
      • Measure equivalent code sequences with/without consideration for this cache
      • The top-of-stack register tos points to the last valid word on the physical stack
      • Two shadow registers may hold 0, 1, or 2 true top words
      • The top-of-stack cache counter tcc specifies the number of shadow registers in use
      • Thus tos plus tcc jointly specify the true top of stack

  35. Code For Stack Architecture

  36. Code For Stack Architecture
      • Timings for push, pushlit, add, and pop operations depend on tcc
      • Operations in shadow registers are fastest, typically 1 cycle, including register access and the operation itself
      • Generally, a further memory access adds 2 cycles
      • For stack changes use some defined policy, e.g. keep tcc 50% full
      • The table below refines the timings for a stack with shadow registers
      • Note: push x into a cache with free space requires 2 cycles: the cache adjustment is done at the same time as the memory fetch
      • A small simulation sketch of this top-of-stack cache follows below
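
      A hedged simulation sketch (not from the slides) of the two shadow registers: it models tos, tcc, and a cycle counter for push and add, using the deck's cost model of 2 cycles per memory transfer and 1 cycle per in-cache operation. Only push and add are modeled, and the spill/refill policy is simplified; the deck's own timing table is authoritative.

      #include <stdio.h>

      #define STACK_WORDS 1024

      static int  memstack[STACK_WORDS];  /* the stack kept in (slow) memory        */
      static int  tos = -1;               /* index of last valid word in memory     */
      static int  shadow[2];              /* HW shadow registers: true top of stack */
      static int  tcc = 0;                /* top-of-stack cache counter: 0, 1, or 2 */
      static long cycles = 0;

      /* Bring one word from the memory stack into the shadow registers: 2 cycles. */
      static void refill(void)
      {
          if (tcc == 1) shadow[1] = shadow[0];     /* keep the newer word on top    */
          shadow[0] = memstack[tos--];
          tcc++;
          cycles += 2;
      }

      /* push x: operand fetch plus cache insert = 2 cycles; +2 if a spill is needed. */
      static void push(int x)
      {
          if (tcc == 2) {                          /* spill the oldest shadow word  */
              memstack[++tos] = shadow[0];
              shadow[0] = shadow[1];
              tcc = 1;
              cycles += 2;
          }
          shadow[tcc++] = x;
          cycles += 2;
      }

      /* add: 1 cycle if both operands are in shadow registers; +2 per needed refill. */
      static void add(void)
      {
          while (tcc < 2) refill();
          shadow[0] += shadow[1];                  /* result becomes the new top    */
          tcc = 1;
          cycles += 1;
      }

      int main(void)
      {
          push(1); push(2); add();                 /* 2 + 2 + 1 cycles              */
          push(3); add();                          /* 2 + 1 cycles                  */
          printf("top of stack = %d after %ld cycles\n", shadow[tcc - 1], cycles);
          return 0;
      }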

  37. Code For Stack Architecture

  38. Code For Stack Architecture
      • Code emission for: a + b * c ^ ( d + e * f ^ g )
      • Let + and * be commutative, by language rule
      • The architecture here has 2 shadow registers; the compiler exploits this
      • Assume an initially empty 2-word cache

  39. Code For Stack Architecture

      #    1: Left-to-Right       cycles    2: Exploit Cache          cycles
      1    push a                 2         push f                    2
      2    push b                 2         push g                    2
      3    push c                 4         expo                      1
      4    push d                 4         push e                    2
      5    push e                 4         mult                      1
      6    push f                 4         push d                    2
      7    push g                 4         add                       1
      8    expo                   1         push c                    2
      9    mult                   3         r_expo = swap + expo      1
      10   add                    3         push b                    2
      11   expo                   3         mult                      1
      12   mult                   3         push a                    2
      13   add                    3         add                       1
           total                  40        total                     20

  40. Code For Stack Architecture
      • Blind code emission, i.e. not taking advantage of tcc knowledge, costs 40 cycles: it costs performance
      • Code emission with shadow-register consideration costs 20 cycles
      • The true penalty for memory access is worse in practice
      • A tremendous speed-up is always possible when fixing a system with severe flaws
      • The return on investment for 2 registers is twice the original performance
      • Such strong speedup is an indicator that the starting architecture was poor
      • A Stack Machine can be fast, if the purity of top-of-stack access is sacrificed for performance
      • Note that indexing, looping, indirection, and call/return are not addressed here

  41. Register Dependencies
      • Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers being defined and used
      • One instruction computes a result into a register (or memory); another instruction needs that result from that same register (or that memory location)
      • Or, one instruction uses a datum, and after such use the same item is reset, i.e. recomputed

  42. Register Dependencies
      True-Dependence, AKA Data Dependence: <- note the synonym!
          r3 ← r1 op r2
          r5 ← r3 op r4
          Read After Write, RAW

      Anti-Dependence, not a true dependence; can parallelize under the right conditions:
          r3 ← r1 op r2
          r1 ← r5 op r4
          Write After Read, WAR

      Output Dependence:
          r3 ← r1 op r2
          r5 ← r3 op r4
          r3 ← r6 op r7
          Write After Write, WAW, with a use in between
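
      A hedged source-level analogue (the variables and values are illustrative assumptions, not from the slides), showing the same three hazards among ordinary assignments:

      #include <stdio.h>

      int main(void)
      {
          int r1 = 2, r2 = 3, r4 = 5, r5 = 7, r6 = 11, r7 = 13;
          int r3;

          r3 = r1 + r2;      /* defines r3                                       */
          r5 = r3 + r4;      /* RAW: reads r3 right after it is written (true)   */

          r1 = r5 + r4;      /* WAR: writes r1, which the first statement read   */
                             /*      (anti-dependence, not a true dependence)    */

          r3 = r6 + r7;      /* WAW: writes r3 again, with a use in between      */
                             /*      (output dependence)                         */

          printf("%d %d %d\n", r1, r3, r5);
          return 0;
      }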

  43. Register Dependencies
      Control Dependence:

      if ( condition1 ) {
          r3 = r1 op r2;
      } else {              // <- see the jump here?
          r5 = r3 op r4;
      } // end if
      write( r3 );

  44. Register Renaming
      • Only the data dependence is a real dependence, hence called true-dependence
      • The other dependences are artifacts of insufficient resources, generally not enough registers
      • This means: if additional registers were available, then replacing some of these conflicting registers with new registers could make the conflict disappear
      • Anti- and Output-Dependences are indeed such false dependences

  45. Register Renaming
      Original:                      Renamed situation, dependences gone:
      L1: r1 ← r2 op r3              r10 ← r2 op r30    -- r30 has a copy of r3
      L2: r4 ← r1 op r5              r4  ← r10 op r5
      L3: r1 ← r3 op r6              r1  ← r30 op r6
      L4: r3 ← r1 op r7              r3  ← r1 op r7

      The dependences before:        after:
      L1, L2 true-dep on r1          L1, L2 true-dep on r10
      L1, L3 output-dep on r1        L3, L4 true-dep on r1
      L1, L4 anti-dep on r3
      L2, L3 anti-dep on r1
      L3, L4 true-dep on r1
      L3, L4 anti-dep on r3

  46. Register Renaming
      • With these additional or renamed registers, the new code could possibly run in half the time!
      • First: compute into r10 instead of r1; but you need to have the additional register
      • Also: compute into r30; no added copy operations, just more registers a priori
      • Afterwards the live registers are r1, r3, r4, while r10 and r30 are don't-cares
      • A renaming sketch follows below
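
      A hedged sketch (an illustration, not the deck's mechanism) of how a rename stage could remove the false dependences: every destination gets a fresh physical register while sources read the current architectural-to-physical map, so only the true (RAW) dependences survive. The register counts and the Instr encoding are illustrative assumptions; a real renamer also recycles physical registers from a free list.

      #include <stdio.h>

      #define NUM_ARCH 8                  /* architectural registers r0..r7 */

      static int map[NUM_ARCH];           /* architectural -> current physical register */
      static int next_phys = 0;

      typedef struct { int dst, src1, src2; } Instr;   /* dst ← src1 op src2 */

      static void rename_instr(Instr in)
      {
          int p1 = map[in.src1];          /* sources read the current mapping (RAW kept) */
          int p2 = map[in.src2];
          int pd = next_phys++;           /* fresh physical register for the destination */
          map[in.dst] = pd;               /* WAR and WAW on the old name disappear       */
          printf("r%d <- r%d op r%d   becomes   p%d <- p%d op p%d\n",
                 in.dst, in.src1, in.src2, pd, p1, p2);
      }

      int main(void)
      {
          for (int i = 0; i < NUM_ARCH; i++) map[i] = next_phys++;  /* initial mapping */

          /* The slide's example, L1..L4 */
          Instr code[] = {
              { 1, 2, 3 },   /* L1: r1 <- r2 op r3 */
              { 4, 1, 5 },   /* L2: r4 <- r1 op r5 */
              { 1, 3, 6 },   /* L3: r1 <- r3 op r6 */
              { 3, 1, 7 },   /* L4: r3 <- r1 op r7 */
          };
          for (int i = 0; i < 4; i++) rename_instr(code[i]);
          return 0;
      }

      Running this, L3 and L4 no longer touch the physical register written by L1, which mirrors the renamed code on the previous slide.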

  47. Score Board
      • The score-board is an array of HW programmable bits sb[]
      • It manages other HW resources, specifically registers
      • It is a single-bit HW array; every bit i in sb[i] is associated with one specific, dedicated register ri
      • The association is by index, i.e. by name: sb[i] belongs to register ri
      • Only if sb[i] = 0 does register ri have valid data
      • If sb[i] = 0 then register ri is NOT in the process of being written
      • If bit i is set, i.e. if sb[i] = 1, then that register ri has stale data
      • Initially all sb[*] are stale, i.e. set to 1

  48. Score Board
      Execution constraints for rd ← rs op rt:
      • if sb[s] or sb[t] is set → RAW dependence; stall the computation and wait until both rs and rt are available
      • if sb[d] is set → WAW dependence; stall the write and wait until rd has been used; SW can sometimes determine to use another register instead of rd
      • else dispatch the instruction immediately

  49. Score Board
      • To allow out-of-order (ooo) execution, upon computing the value of rd: update rd, and clear sb[d]
      • For uses (references), HW may use any register ri whose sb[i] is 0
      • For definitions (assignments), HW may set any register rj whose sb[j] is 0
      • Independent of the original order in which the source program was written, i.e. possibly ooo
      • A minimal sketch of this issue logic follows below
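
      A hedged, minimal sketch in C (an illustration, not the deck's hardware) of the scoreboard test described on the last two slides: sb[] holds one bit per register, an instruction stalls on RAW if a source's bit is set or on WAW if the destination's bit is set, the bit is set at dispatch, and it is cleared on completion. The Instr type and register count are illustrative assumptions, and the initial state is simplified: registers with preloaded values start out valid (sb = 0) rather than stale.

      #include <stdbool.h>
      #include <stdio.h>

      #define NUM_REGS 32

      static bool sb[NUM_REGS];           /* sb[i] = 1: register ri is being written */
      static int  regs[NUM_REGS];

      typedef struct { int d, s, t; } Instr;            /* rd ← rs op rt */

      /* Try to dispatch: return false (stall) on a RAW or WAW hazard. */
      static bool try_dispatch(Instr in)
      {
          if (sb[in.s] || sb[in.t]) {
              printf("stall: RAW, source r%d or r%d not ready\n", in.s, in.t);
              return false;
          }
          if (sb[in.d]) {
              printf("stall: WAW, destination r%d still being written\n", in.d);
              return false;
          }
          sb[in.d] = true;                /* mark the destination as in flight */
          printf("dispatch: r%d <- r%d op r%d\n", in.d, in.s, in.t);
          return true;
      }

      /* On completion (possibly out of order): write the result, clear the bit. */
      static void complete(Instr in, int result)
      {
          regs[in.d] = result;
          sb[in.d] = false;
          printf("complete: r%d = %d\n", in.d, result);
      }

      int main(void)
      {
          Instr i1 = { 3, 1, 2 };         /* r3 ← r1 op r2                    */
          Instr i2 = { 5, 3, 4 };         /* r5 ← r3 op r4: RAW on r3         */

          regs[1] = 6; regs[2] = 7; regs[4] = 5;   /* preloaded, valid values */

          try_dispatch(i1);               /* dispatches, sets sb[3]           */
          try_dispatch(i2);               /* stalls: sb[3] is still set       */
          complete(i1, regs[1] + regs[2]);/* r3 becomes valid, sb[3] cleared  */
          try_dispatch(i2);               /* now dispatches                   */
          complete(i2, regs[3] + regs[4]);
          return 0;
      }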

  50. References • The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html • Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations • http://en.wikipedia.org/wiki/Moore's_law • C. A. R. Hoare’s comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf • Gibbons, P. B, and Steven Muchnick [1986]. “Efficient Instruction Scheduling for a Pipelined Architecture”, ACM Sigplan Notices, Proceeding of ’86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp 11-16 • Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/ • Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm • Words of wisdom: http://www.cs.yale.edu/quotes.html • John von Neumann’s computer design: A.H. Taub (ed.), “Collected Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., New York 1963
