Self-Scheduling and Cache Memories in Processors

This lecture discusses self-scheduling techniques and cache memories in processors, including the concepts of non-uniform workloads and dynamic workload assignment. It also explores the tradeoffs in dynamic scheduling and the impact of decreasing chunk size on performance. Additionally, the lecture covers the principles of locality and memory hierarchies in optimizing memory access.

Presentation Transcript


  1. Lecture 5 (Scott B. Baden / CSE 160 / Wi '16)

  2. Today's lecture
  • Processor self-scheduling
  • Cache memories

  3. Non-uniform workloads
  • In the Mandelbrot set assignment, some points take longer to complete than others, so the work is not uniformly distributed
  • A block cyclic decomposition can balance the workload: core k gets strips starting at rows CHUNK*k, CHUNK*k + 1*NT*CHUNK, CHUNK*k + 2*NT*CHUNK, ...
  • With NT = 3 and CHUNK = 2, core 1 gets strips at rows 2*1, 2 + 1*2*3, 2 + 2*2*3, i.e. rows 2, 8, 14
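
  To make the mapping concrete, here is a minimal C++ sketch (not from the deck; the row count N is an assumption, the other constants mirror the slide's example) that prints each core's strip starts:

    #include <cstdio>

    // Block cyclic decomposition: core k owns strips starting at rows
    // CHUNK*k, CHUNK*k + 1*NT*CHUNK, CHUNK*k + 2*NT*CHUNK, ...
    int main() {
        const int NT = 3, CHUNK = 2, N = 18;   // threads, rows per strip, total rows (N assumed)
        for (int k = 0; k < NT; k++) {
            printf("core %d gets strips at rows:", k);
            for (int start = CHUNK * k; start < N; start += NT * CHUNK)
                printf(" %d", start);
            printf("\n");                      // core 1 prints 2 8 14, matching the slide
        }
        return 0;
    }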

  4. Second approach: dynamic scheduling
  • Block cyclic decomposition is a static method: the workload assignment can't respond to the workload distribution; it is the same for all runs
  • Problem: what if the workload distribution correlates with the block cyclic thread assignment?
  • Processor self-scheduling: threads pick up work dynamically, on demand, with a given chunk size (e.g. NT = 2, CHUNK = 2)

  5. How does self-scheduling work?

    SelfScheduler S(n, NT, Chunk);
    h = 1/(n-1)
    while (S.getChunk(startRow))
        for i = startRow to startRow+Chunk-1
            for j = 0 to n-1
                z = complex(i*h, j*h)
                do z = z^2 + c while (|z| < 2)
            end for
        end for
    end while

  6. Implementation
  • We increment a counter inside a critical section
  • OMP and C++ implementations
  • _counter is a protected member of SelfScheduler

    bool getChunk(int& startRow) {
        // begin critical section
        k = _counter;
        _counter += _chunk;
        // end critical section
        if (k > (_n - _chunk))
            return false;
        startRow = k;
        return true;
    }
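
  For reference, a runnable C++ sketch of the whole class (my reconstruction, assuming a std::mutex guards the counter; the deck's OMP variant would use a critical region instead):

    #include <mutex>

    class SelfScheduler {
      public:
        SelfScheduler(int n, int nt, int chunk)
            : _n(n), _chunk(chunk), _counter(0) { (void)nt; }  // NT kept only to match the slide's interface
        // Atomically claim the next chunk of rows; returns false when no work remains
        bool getChunk(int& startRow) {
            std::lock_guard<std::mutex> guard(_mtx);  // critical section: held to end of function
            int k = _counter;
            _counter += _chunk;
            if (k > _n - _chunk)   // as on the slide: any final partial chunk is not handed out
                return false;
            startRow = k;
            return true;
        }
      private:
        int _n, _chunk, _counter;
        std::mutex _mtx;
    };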

  7. Tradeoffs in dynamic scheduling
  • Dynamic workload assignments: each thread samples a unique (and disjoint) set of indices, which changes from run to run
  • A shared counter or work queue represents the work
  • The user tunes the work granularity (chunk size), trading off the overhead of workload assignment against increased load imbalance
  • Finest granularity: each point is a separate task
  • Coarsest granularity: one block per processor
  [Figure: running time vs. granularity; overheads are high at the finest granularity, and load imbalance grows as granularity increases]

  8. When might decreasing the chunk size hurt performance?
  A. We will hand out more chunks of work
  B. Data from adjacent rows is less likely to reside in cache
  C. The workloads will be balanced less evenly
  D. A & B
  E. B & C

  9. (Answer) D: A & B. Smaller chunks mean more scheduling overhead, and a thread's next chunk is less likely to be adjacent to rows already in its cache; load balance actually improves with smaller chunks, so C is wrong.

  10. Iteration-to-thread mapping with block cyclic

    for (i = 0; i < N; i++)
        iters[i] = thread assigned to row i

  • N = 9, # of threads = 3 (BLOCK): 0 0 0 1 1 1 2 2 2
  • N = 16, # of threads = 4, BLOCK CYCLIC (Chunk = 2): 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3
  • N = 9, # of threads = 3, BLOCK CYCLIC (Chunk = 2): 0 0 1 1 2 2 0 0 1
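
  These mappings can be generated mechanically; a small sketch (the owner() helper is mine, not the deck's):

    #include <cstdio>

    // Static block-cyclic owner of row i: chunks of `chunk` rows dealt round-robin to nt threads
    int owner(int i, int nt, int chunk) { return (i / chunk) % nt; }

    int main() {
        // BLOCK is the special case chunk = N/nt: N = 9, nt = 3 gives 0 0 0 1 1 1 2 2 2
        for (int i = 0; i < 9; i++) printf("%d ", owner(i, 3, 3));
        printf("\n");
        // N = 16, nt = 4, chunk = 2: 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3
        for (int i = 0; i < 16; i++) printf("%d ", owner(i, 4, 2));
        printf("\n");
        // N = 9, nt = 3, chunk = 2: 0 0 1 1 2 2 0 0 1
        for (int i = 0; i < 9; i++) printf("%d ", owner(i, 3, 2));
        printf("\n");
        return 0;
    }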

  11. Which are plausible dynamic assignments of iterations to threads with N = 16, NT = 4, Chunk = 2?
  A. 3 3 0 0 1 1 2 2 3 3 3 3 3 3 3 3
  B. 2 3 3 2 0 0 1 1 2 2 2 2 3 3 2 1
  C. 2 2 3 3 0 0 1 1 2 2 2 2 0 0 2 2
  D. A & B
  E. A & C

    while (S.getChunk(startRow))
        iters[startRow : startRow+Chunk-1] = ThreadID();

  12. (Answer) E: A & C. Every chunk of Chunk = 2 consecutive rows must go to a single thread; in B the first two rows go to threads 2 and 3, which getChunk cannot produce.

  13. Today's lecture
  • Processor self-scheduling
  • Cache memories

  14. The processor-memory gap
  • The difference between processing and memory speeds has been growing exponentially over time
  • The result of technological trends
  http://www.extremetech.com (see http://tinyurl.com/ohmbo7y)

  15. An important principle: locality
  • Memory accesses exhibit two forms of locality:
  • Temporal locality (time): the processor is likely to access a location again in the near future
  • Spatial locality (space): the processor is likely to access a location near the current access
  • Often involves loops
  • Opportunities for reuse:

    for t = 0 to T-1
        for i = 1 to N-2
            u[i] = (u[i-1] + u[i+1])/2

  16. Which of these loops exhibit spatial locality?
  A. for i = 1 to N-2: u[i] = (u[i-1] + u[i+1])/2
  B. for i = N-2 to 1 by -1: u[i] = (u[i-1] + u[i+1])/2
  C. for i = 1 to N-2 by 1024: u[i] = (u[i-1] + u[i+1])/2
  D. for i = 1 to N-2 by 1024: u[J[i]] = sqrt(u[J[i]])
  E. None of them do

  17. (Answer) A and B: both sweep u with unit stride, in either direction, so consecutive accesses fall within the same cache line. C strides by 1024 elements and skips whole lines; D's indirect accesses through J[i] follow no predictable spatial pattern.

  18. Memory hierarchies
  • Enable reuse through a hierarchy of smaller but faster memories
  • Put things in faster memory that we re-use frequently
  Typical levels (capacity; access latency in clock periods (CP); transfer granularity):
  • CPU registers: 1 CP (1 word)
  • L1: 32 to 64 KB; 2-3 CP (10 to 100 B)
  • L2: 256 KB to 4 MB; O(10) CP (10-100 B)
  • DRAM: GBs; O(100) CP
  • Disk: many GB or TB; O(10^6) CP

  19. Bang's memory hierarchy
  • Intel "Clovertown" processor: Intel Xeon E5355 (introduced 2006)
  • Two "Woodcrest" dies (Core2) on a multi-chip module; two "sockets"
  • Each core has a 32K L1; each die pairs two cores with a 4MB shared L2; each socket connects over a 10.66 GB/s front-side bus (FSB) to the chipset (4x64b controllers), which reads at 21.3 GB/s and writes at 10.6 GB/s to 667 MHz FBDIMMs
  • Line size: 64B (L1 and L2); associativity: 8-way (L1), 16-way (L2)
  • Access latency, throughput in clocks: L1 3, 1; L2 14*, 2 (* will vary depending on access patterns and other factors)
  • Write update policy: writeback
  Sources: Intel 64 and IA-32 Architectures Optimization Reference Manual, Table 2.16; techreport.com/articles.x/10021/2; http://goo.gl/QxvSVr; Sam Williams et al.

  20. How does a cache work?
  • Organize memory and cache into blocks called lines (typically 64 bytes)
  • Simplest cache: a direct-mapped cache
  • We map each line in main memory to a unique line in cache: a many-to-1 mapping
  • The first time we access an address within a given line in main memory, we load the line into cache: a cache miss
  • The next time we access any address within that line, we have a cache hit: we pay the cost of accessing the cache line only, not memory
  • With a 1-level cache, the miss penalty is more or less the cost of accessing memory

  21. The benefits of cache memory
  • Let's say we have a small fast memory whose access time is 10 times faster than main memory
  • If we find what we are looking for 90% of the time (the cache hit rate), the access time approaches that of the fast memory:
    T_access = 0.9*1 + (1-0.9)*10 = 1.9
  • Memory appears to be about 5 times faster
  • We can have multiple levels of cache

  22. If our L1 hit rate is 95% and our L2 hit rate is 99%, what is the effective memory access time on Bang, where an L1 hit = 3 CP, L2 = 12 CP, and memory = 160 CP?
  A. 0.95*3 + (1-0.95)*0.99*12 + 0.05*0.01*160 = 3.5
  B. 0.95*3 + 0.99*12 + (1-0.99)*160 = 16.3
  C. Neither

  23. (Answer) A. Each level's cost must be weighted by the probability of reaching it: the L2 and memory terms apply only to the fraction of accesses that miss the levels above.
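
  The arithmetic behind choice A, written out as a checkable sketch (numbers from the slide):

    #include <cstdio>

    int main() {
        double hL1 = 0.95, hL2 = 0.99;          // hit rates
        double tL1 = 3, tL2 = 12, tMem = 160;   // access times in clock periods (CP)
        // Weight each level's cost by the probability of reaching that level
        double t = hL1 * tL1
                 + (1 - hL1) * hL2 * tL2
                 + (1 - hL1) * (1 - hL2) * tMem;
        printf("effective access time = %.2f CP\n", t);  // prints 3.52
        return 0;
    }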

  24. Direct-mapped cache
  • Look up the line selected by the line-index bits of the address
  • Like a hash table: match the stored tag against the higher-order address bits
  • The m-bit address splits into t tag bits, s line-index bits, and b block-offset bits
  • Each of the L cache lines (Line 0 ... Line L-1) holds a valid bit, a tag, and a cache block
  Randal E. Bryant and David R. O'Hallaron
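
  A sketch of the address split (the cache geometry here, 64-byte lines and 512 lines, is an assumption for illustration, not Bang's):

    #include <cstdint>
    #include <cstdio>

    int main() {
        const int b = 6, s = 9;                          // 2^6 = 64 bytes/line, 2^9 = 512 lines
        uint64_t addr = 0x12345678;                      // example address
        uint64_t offset = addr & ((1u << b) - 1);        // low b bits: byte within the line
        uint64_t index  = (addr >> b) & ((1u << s) - 1); // next s bits: which cache line
        uint64_t tag    = addr >> (b + s);               // remaining high bits: the stored tag
        printf("tag=%llx index=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)index,
               (unsigned long long)offset);
        return 0;
    }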

  25. Issues in using a cache
  • Where can we place the block?
  • How do we find the block?
  • Which block should we replace on a miss?
  • What happens on a write?
  A lookup selects line i using the line-index bits; it is a cache hit when (1) the selected line's valid bit is set, and (2) the tag bits stored in the cache line match the tag bits in the address
  Randal E. Bryant and David R. O'Hallaron

  26. Why do we use the middle bits for indexing the cache rather than the higher-order bits?
  A. Consecutive memory lines map to different cache lines
  B. Consecutive memory lines map to the same cache entry
  (Address layout: t tag bits, s line-index bits, b block-offset bits; B = 2^b bytes per cache block)

  27. (Answer) A. With middle-bit indexing, consecutive memory lines map to different cache lines, so a contiguous array can occupy the whole cache; indexing with the high-order bits would map large contiguous regions onto the same few lines.

  28. What happens on a write?
  • Write through / write back
  • Write through: write to memory on every store
  • Write back: write to cache only; when do we update memory?
  • Allocate on write / no allocate on write: whether or not we load the line into cache on a write miss, or write to main memory only
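
  A tiny sketch of the write-back mechanism (the dirty bit is the standard device; this is illustrative, not the deck's code):

    // Write-back: a store only marks the cached line dirty; memory is
    // updated once, when the dirty line is finally evicted.
    struct CacheLine { bool valid = false; bool dirty = false; unsigned tag = 0; };

    void store(CacheLine& line) {
        line.dirty = true;               // absorb the write in cache
    }

    void evict(CacheLine& line) {
        if (line.valid && line.dirty) {
            // a write-back to memory would go here: one memory write
            // absorbs any number of earlier stores to this line
        }
        line.valid = line.dirty = false;
    }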

  29. Why might a write-back cache be useful?
  A. We expect to write to that location several times
  B. We expect to read from that location several times
  C. A & B
  D. Neither A nor B

  30. (Answer) A. Repeated writes to a cached line are absorbed by the cache and memory is updated once, on eviction. Reads are served from cache under either write policy, so B does not distinguish write-back.

  31. The 3 C's of cache misses
  • Compulsory (AKA cold start)
  • Capacity
  • Conflict

  32. What is the cache miss rate we'll observe if we have an infinite cache?
  A. Compulsory
  B. Capacity
  C. Conflict

  33. (Answer) A. With infinite capacity there are no capacity or conflict misses; only compulsory (cold-start) misses remain, from the first touch of each line.

  34. Other cache design issues
  • Separate instruction (I) and data (D) caches vs. unified (I+D)
  • Last-level cache (LLC): the slowest cache, furthest away from the processor
  • Translation lookaside buffer (TLB)
  [Diagram: processor with registers, L1 i-cache, L1 d-cache, and TLB, backed by a unified L2 cache, an L3 cache (the LLC), and memory]

  35. Two main types of caches
  • Direct mapped
  • Set associative
  • Block placement
  • Eviction policy on a miss/write

  36. Set-associative cache
  • More than 1 line per set
  • Search each valid line in the set for matching tag bits
  • On a miss/write: evict the least recently used (LRU) line
  • Address: t tag bits, s set-index bits, b block-offset bits; each line has 1 valid bit, t tag bits, and B = 2^b bytes per cache block; sets 0 ... S-1
  Randal E. Bryant and David R. O'Hallaron
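
  A minimal sketch of one set's lookup with LRU replacement (my illustration; real hardware tracks recency with a few bits per set, not timestamps):

    #include <cstdint>
    #include <vector>

    struct Line { bool valid = false; uint64_t tag = 0; uint64_t lastUsed = 0; };

    // Probe one set of a set-associative cache; returns true on a hit.
    bool access(std::vector<Line>& set, uint64_t tag, uint64_t now) {
        for (Line& l : set)
            if (l.valid && l.tag == tag) {   // valid line with matching tag: hit
                l.lastUsed = now;
                return true;
            }
        // Miss: fill the first invalid line, else evict the least recently used
        Line* victim = &set[0];
        for (Line& l : set) {
            if (!l.valid) { victim = &l; break; }
            if (l.lastUsed < victim->lastUsed) victim = &l;
        }
        *victim = Line{true, tag, now};
        return false;
    }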

  37. Examining Bang's memory hierarchy
  • /proc/cpuinfo summarizes the processor:

    vendor_id  : GenuineIntel
    model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
    cache size : 4096 KB
    cpu cores  : 4
    processor  : 0 through processor : 7

  38. Detailed memory hierarchy information
  • /sys/devices/system/cpu/cpu*/cache/index*/*
  • Log in to bang and view the files

  39. How can we improve cache performance?
  A. Reduce the miss rate
  B. Reduce the miss penalty
  C. Reduce the hit time
  D. A, B and C

  40. (Answer) D. Average access time = hit time + miss rate x miss penalty, so reducing any of the three helps.

  41. Optimizing for re-use
  • The success of caching depends on the ability to re-use previously cached data
  • Data access order affects re-use
  • Assume a cache with 2 entries, each 2 words wide:

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            a[i][j] += b[i][j];

    for (j=0; j<N; j++)
        for (i=0; i<N; i++)
            a[i][j] += b[i][j];

  (The 3 C's: compulsory, capacity, conflict)
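
  A runnable sketch of the experiment (the array size and timing harness are my assumptions):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 2048;
        std::vector<float> a(N * N, 1.0f), b(N * N, 2.0f);
        auto run = [&](bool ijOrder) {
            auto t0 = std::chrono::steady_clock::now();
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++) {
                    int i = ijOrder ? x : y, j = ijOrder ? y : x;
                    a[i * N + j] += b[i * N + j];
                }
            return std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        };
        printf("i-j order (unit stride): %.3f s\n", run(true));   // walks each row: spatial locality
        printf("j-i order (stride of N): %.3f s\n", run(false));  // jumps a row per access: poor locality
        return 0;
    }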

  42. Testbed
  • 2.7 GHz PowerPC G5 (970fx)
  • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/DC3D43B729FDAD2C00257419006FB955/$file/970FX_user_manual.v1.7.2008MAR14_pub.pdf
  • Caches: 128-byte line size
  • 512KB L2 (8-way, 12 CP hit time)
  • 32K L1 (2-way, 2 CP hit time)
  • TLB: 1024 entries, 4-way
  • gcc version 4.0.1 (Apple Computer, Inc. build 5370), -O2 optimization
  • Single-precision floating point

  43. The results

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            a[i][j] += b[i][j];

    for (j=0; j<N; j++)
        for (i=0; i<N; i++)
            a[i][j] += b[i][j];
