Thread Level Parallelism

Thread Level Parallelism PowerPoint PPT Presentation

  • Uploaded on 19-05-2012
  • Presentation posted in: General

Approaches to TLP. We want to enhance our current processorsuperscalar with dynamic schedulingFine-grained multi-threadingswitches between threads at each clock cyclethus, threads are executed in an interleaved fashionas the processor switches from one thread to the next, a thread that is curre - PowerPoint PPT Presentation

Download Presentation

Thread Level Parallelism

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

1. Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? a thread is defined as a separate process with its own instructions and data this is unlike the traditional (OS) definition of a thread which shares instructions with other threads but they each have their own stack and data (a thread in this case is multiple versions of the same process) a thread may be a traditional thread or a separate process or a single program executing in parallel the idea here is that the thread offers different instructions and data so that the processor, when it has to stall, can switch to another thread and continue execution so that it does not cause time consuming stalls TLP exploits a different kind of parallelism than ILP

2. Approaches to TLP We want to enhance our current processor superscalar with dynamic scheduling Fine-grained multi-threading switches between threads at each clock cycle thus, threads are executed in an interleaved fashion as the processor switches from one thread to the next, a thread that is currently stalled is skipped over CPU must be able to switch between threads at every clock cycle so that it needs extra hardware support Coarse-grained multi-threading switches between threads only when current thread is likely to stall for some time (e.g., level 2 cache miss) the switching process can be more time consuming since we are not switching nearly as often and therefore does not need extra hardware support

3. Advantages/Disadvantages Fine-grained Adv: less susceptible to stalling situations Adv: throughput costs can be hidden because stalls are often unnoticed Disadv: slows down execution of each thread Disadv: requires a switching process that does not cost any cycles – this can be done at the expense of more hardware (we will require at a minimum a PC for every thread) Coarse-grained Adv: more natural flow for any given thread Adv: easier to implement switching process Adv: can take advantage of current processors to implement coarse-grained, but not fine-grained Disadv: limited in its ability to overcome throughput losses because of short stalling situations because the cost of starting the pipeline on a new thread is expensive (in comparison to fine-grained)

4. Simultaneous Multi-threading (SMT) SMT uses multiple issue and dynamic scheduling on our superscalar architecture but adds multi-threading (a) is the traditional approach with idle slots caused by stalls and a lack of ILP (b) and (c) are fine-grained and coarse-grained MT respectively (d) shows the potential payoff for SMT (e) goes one step further to illustrate multiprocessing

5. Four Approaches Superscalar on a single thread (a) we are limited to ILP or, if we switch threads when one is going to stall, then the switch is equivalent to a context switch, which takes many (dozens or hundreds) of cycles Superscalar + coarse-grained MT (c) fairly easy to implement, performance increase over no MT support, but still contains empty instruction slots due to short stalling situations (as opposed to lengthier stalls associated with cache miss) Superscalar + fine-grained MT (b) requires switching between threads at each cycle which requires more complex and expensive hardware, but eliminates most stalls, the only problem is that a thread that lacks ILP or cannot make use of all instruction issue slots will not take full advantage of the hardware Superscalar + SMT (d) most efficient way to use hardware and multithreading so that as many functional units as possible can be occupied

6. Superscalar Limitations for SMT In spite of the performance increase by combining our superscalar hardware and SMT, there are still inherent limitations how many active threads can be considered at one time? we will be limited by resources such as number of PCs available to keep track of each thread, size of bus to accommodate multiple threads having instruction fetches at the same time, how many threads can be stored in main memory, etc finite limitation on buffers used to support the superscalar reorder buffer, instruction queue, issue buffer limitations on bandwidth between CPU and cache/memory limitation on the combination of instructions that can be issued at the same time consider four threads, each of which contains an abnormally large number of FP * but no FP +, then the multiplier functional unit(s) will be very busy while the adder remains idle

7. SMT Design Challenges Superscalars best perform on lengthier pipelines We will only implement SMT using fine-grained MT so we need large register file to accommodate multiple threads per-thread renaming table and more registers for renaming separate PCs for each thread ability to commit instructions of multiple threads in the same cycle added logic that does not require an increase in clock cycle time cache and TLB setups that can handle simultaneous thread access without a degradation in their performance (miss rate, hit time) In spite of the design challenges, we will find performance on each individual thread will decrease (this is natural since every thread will be interrupted as the CPU switches to other threads, cycle-by-cycle) One alternative strategy is to have a “preferred” thread of which instructions are issued every cycle as is possible the functional unit slots not used are filled by alternate threads if the preferred thread reaches a substantial stall, other threads fill in until the stall ends

8. SMT Example Design The IBM Power5 was built on top of the Power4 pipeline but in this case, the Power5 implements SMT simple design choices whenever possible increase associativity of L1 instruction cache and TLB to offset the impact that might arise because of multithreading access to the cache and TLB add per-thread load/store queues increase size of L2 and L3 caches to permit more threads to be represented in these caches add separate instruction prefetch and buffering hardware increase number of virtual registers for renaming increase size of instruction issue queues the cost for these enhancements is not extreme (although it does take up more space on the chip) – are the performance payoffs worthwhile?

9. Performance Improvement of SMT As it turns out, the improvement gains of SMT over a single thread processor is only modest in part this is because multi-issue processors have not increased their issue size over the past few years – to best take advantage of SMT, issue size should increase from maybe 4 to 8 or more, but this is not practical Pentium IV Extreme had improvements of 1.01 and 1.07 for SPEC int and SPEC FP benchmarks respectively over Pentium IV (Extreme = Pentium IV + SMT support) When running 2 SPEC benchmarks at the same time in SMT mode, improvements ranged from 0.9 to 1.58 with an average improvement of 1.20 Conclusions SMT has benefits but the costs do not necessarily pay for the improvement another option: use multiple CPU cores on a single processor (see (e) from the figure on slide 4) another factor discussed in the text (but skipped here) is the increasing demands on power consumption as we continue to add support for ILP/TLP/SMT

10. Advanced Multi-Issue Processors Here, we wrap up chapter 3 with a brief comparison of multi-issue superscalar processors

11. Comparison on Integer Benchmarks

12. Comparison on FP Benchmarks

13. Introduction to Multiprocessors Taking parallelism to the next level, we have entirely independent processors unlike the superscalar which permitted parallel processing on one or more threads but had an overhead by having to keep track of which thread it was executing and how to switch between threads The multiprocessor can execute multiple programs simultaneously one program distributed across processors using shared memory or interprocessor communication Flynn defined the classification of architecture types in the 1960s and this has largely remained the same SISD – single instruction on single data, the traditional processor whether it is pipelined or not SIMD – single instruction on multiple data, this achieves data-level parallelism, useful for vector/array-based operations MISD – multiple instructions on a single datum, never developed MIMD – multiple instructions on multiple data, this is the true multiprocessor

14. SIMD Architectures Bit-slice processors processors execute bit-level operations on bit-slices from memory a group of processors perform one word’s worth of operations by distributing the operation bit-by-bit Processor arrays operate on 1-D or 2-D data so that each processing element executes the current instruction on one datum two flavors – vector machines and matrix machines Hypercube processors similar to a processor array except that processors are connected to nearest neighbor processors for communication purposes in a machine with 2n processors, each processor connects to n neighbors

15. Bit-Slice Example We have an array of boolean values we want to know if any of the bits is equal to 1 (do a parallel OR) assume that processors can read the single datum in parallel but only one processor can write a result on a tie, the processor to write the result will be the processor with the lowest ID show a parallel algorithm and determine the complexity of the algorithm assuming that you have n processors for an n-bit value The following solution takes O(1) time (instead of O(n) time for a normal processor):

16. Vector-Array Example An array stores n int values using an n/2 vector processor, show how we can find the largest value in O(log n) time

17. Development of Multiprocessors Multiprocessor systems have been developed since the 1960s but it wasn’t until the 1980s when processors and memories were more affordable that systems could largely be built using off-the-shelf components attached to a single bus Two flavors of multiprocessor systems tightly coupled or shared memory loosely coupled or distributed memory this category is similar to a network of computers

18. Multiprocessor Problems There are two inherent problems with multiprocessors accessing remote memory is very time consuming processes are often sequential in nature making them hard to parallelize Examples we have 100 processors from which we want to achieve 80 times speedup over a uniprocessor on a given application, how much of the time can the application be executing sequentially? 100 = 1 / (1 – x + x / 80), solving for x gives .9975, so our application must be running in parallel 99.75% of the time, or it can be sequential only 1 – 99.75% = 0.25% of the time! 32 processor system has 200 ns remote access time, a 2 GHz clock speed and a computation CPI = 0.5, assuming all local memory accesses are hits, how much faster is the machine if all memory accesses are local as opposed to 2% remote? remote memory access time = 200 ns / .5 ns (clock cycle time) = 400 clock cycles CPI with remote access = .5 + 2% * 400 = 1.3 CPI without remote access = .5 application with no remote accesses = 1.3 / .5 = 2.6 times faster!

19. Symmetric Shared Memory Architecture The typical form of a tightly coupled architecture is one where we expect memory access to be symmetric that is, where we expect all memory accesses to take about the same amount of time SSM limit the number of processors because the shared memory becomes a bottleneck this can be alleviated with memories distributed across chips and high order interleaving, and multiple buses (but that gets expensive) One way to reduce the impact of the bottleneck is large, local multi-level caches although caching seems an obvious way to reduce memory contention and the bottleneck, in addition to improving processor CPI, it comes with a cost in a shared memory system: cache coherence

20. Cache Coherence A memory system is considered coherent if any read obtains the most recently written version of the datum And 2 writes to the same datum are serialized so that all processors would see the data in the same order Example: cache coherence is required so that this problem does not arise: how can we prevent it? two solutions: directory-based and snooping caches

21. Snooping Protocols All caches monitor some centralized communication mechanism typically a single bus that connects each cache to memory Upon any write, the processor signals that the particular datum (denoted by memory address) has been modified all other caches are “snooping” this line (listening for such a message) upon receiving a write message, each cache checks to see if the updated value is stored locally (this value will now be invalid) Two common versions of snooping are write invalidate where the cache marks the updated address as invalid so that upon a future read it is treated as a cache miss write update where the updated datum is distributed itself so that all caches that have that datum can update their values immediately note: in either case, it is easier to implement this if the cache write protocol is write through rather than write back

22. Write Invalidate Write update sounds like the better approach no data will have to become obsolete But in fact it is more expensive to implement in terms of bandwidth usage and so is less common Write invalidate only costs in terms of increased miss rates to ensure that the write can proceed, the cache must invalidate all other caches’ values of the datum prior to the write the processor must first gain control of the bus in case of a tie (two processors wanting to invalidate a datum simultaneously), the bus arbiter will cause both processors to wait and will randomly select between them example:

23. MESI Protocol The common implementation for a snoopy cache is to use the MESI Protocol M – modified the data (or block) has been modified, thus other cached versions are invalidated, continued reads or writes to this block cause no changes locally E – exclusive the datum is stored exclusively in this cache so can be written to without invalidation elsewhere S – shared the datum cannot be written to since it is shared, a write will require first an invalidation I – invalid same as a miss, the datum has been flagged as invalid The table on the next slide (figure 4.5 page 213) demonstrates how the MESI protocol works A variation of the MESI protocol is the MOESI protocol where O stands for ownership where a given cache block can be shared, but the cache which is denoted as the “owner” is responsible for updating other processors when the block has been modified

25. Implementing the MESI Protocol

26. Performance of SSM Multiprocessors Overall cache performance is a combination of uniprocessor cache miss rate traffic caused by communication (which includes cache invalidations) these are complicated by true sharing misses, caused by invalidations false sharing misses, caused when a block is invalidated but the requested word in that block has not been modified Changing cache size, block size, number of processors can impact the performance in complex ways here, we consider an AlphaServer 4 Alpha 21164 processors with 300 MHz clock cycles 3 levels of cache (1st level is a write-through cache to permit easier snooping, 2nd and 3rd level caches are write-back), latencies are 7 cycles, 21 cycles and 80 cycles for misses at each level respectively cache to cache transfer of shared data takes 125 cycles % of the time CPU is idle because of cache misses is < 1% when run on 3 server benchmark programs (online transaction-processing, decision support system, web index search)

27. Benchmark Performance

28. Distributed Shared Memory A 16 processor multiprocessor with 64 byte block sizes and 512 KB data caches can require as much as 170 GB/sec of bus bandwidth! this is obviously a huge problem where modern processors might have a memory/bus bandwidth capable of supporting 12 GB/sec In order to get around this problem, we must use a distributed memory layout this will lower the local bandwidth to an acceptable amount because most of the traffic will be local between a processor and its own memory But a distributed memory makes snooping much more difficult all changes to data would have to be broadcast over the interconnection network and all caches would have to snoop there instead, we turn to a different cache coherence mechanism – the directory-based approach A directory will keep track of every block that might be in a cache what blocks are in which caches, whether the block has been modified or not, and other useful information

29. Implementing Directory-Based Caches We store every memory block as an entry in the directory this method will work for reasonably large number of processors (e.g., 200 or less) as we would expect a decent amount of overhead for the directory for extremely large systems with enormous memories, we might need better data structures (such as storing fewer bits per block entry) the directory is actually distributed or interleaved so that it does not become a bottleneck a directory on a single site would quickly become a bottleneck as an example, a portion of the directory might be placed with every processor’s local memory (see figure 4.19) the directory will have to handle two operations: handling a read miss handling a write to shared data and store the status of every block shared, uncached, modified communication between processors and between instances of the directory are performed by message passing

31. Implementation Details When a block is currently uncached copy in memory is the current value read miss – requesting processor is sent the block from memory and requestor is made the only sharing node write miss – requesting processor is sent the datum and made the owner, the block is made exclusive although is indicated as shared When a block is shared the memory value is up to date but there is at least one cached copy, maybe more read miss – requesting processor is sent the block from memory and processor is added to the sharing list write miss – requesting processor is sent block from memory, requestor added to sharers list, all other sharers are sent invalidation messages, state of the block is made exclusive

32. Continued When a block is exclusive current block stored by “owner” (the block that has been made exclusive) is the up-to-date value read miss – owner transmits the copy to requestor, directory adds requestor to sharer’s list, owner also updates memory at the same time, changing status from exclusive to shared data write back – updates memory copy, the home directory becomes the owner again, block is now uncached and sharers set to empty write miss – block ownership given to requestor, message sent to old owner to invalidate block and send value to requestor, which updates value, sharers set to new owner and block remains exclusive See figures 4.21 and 4.22 for more details

33. Sun T1 Server We wrap up this chapter with a brief look at the Sun T1 multiprocessor server T1 uses an 8-core processor (1.2 GHz) that is, there are 8 pipelines T1 is a single issue pipeline where each pipeline performs fine-grained multithreading on up to 4 threads the limitation of 4 threads per core allows us to support multithreading directly in hardware by using 4 PCs per pipeline T1 pipeline is 6 stage this is the same as the MIPS 5 stage pipeline with 1 added stage to perform thread switching branches and loads each have a 3 cycle penalty, which can be hidden if the other 3 threads are active (not idle or stalled) as this is a server, FP operations are not emphasized and therefore all 8 pipelines share the same floating point unit cache coherency is enforced using a directory approach where directories are distributed across the L2 caches, keeping track of which L1 caches have copies of data that are stored in their L2 cache

34. The T1 Architecture

35. T1 Performance 4 thread core hides some latency that arises from limited ILP manifesting itself by lowering the L1 cache miss penalty (latency) figure 4.26 shows that a single thread will have 1.1 to 1.2 times the latency from cache misses than the 4 thread core Similarly, larger L2 caches with bigger blocks hide latency the increased size should have an obvious result, but the impact is not as much as one would expect compare for instance the decrease from a 3 MB L2 cache with 32 Byte blocks versus a 6 MB L2 cache with 32 Byte blocks increasing the block size causes additional message traffic in the interconnection network resulting in larger latencies so caches with smaller block sizes (fewer words per block) are preferred figure 4.28 shows these impacts For a 4 thread core, the ideal per-thread CPI is 4 the T1 averages between 5.6 and 6.6 but the per core CPI (ideal is 1) ranges from 1.4 to 1.8 For all 8 cores, the effective CPI is between .175 and .225 compare this to multi-issue processors that might have CPIs of .3 or .4 so while the per-thread performance is not impressive, the overall throughput and effective CPI of the entire processor is

36. Comparison The T1 as just described is compared to the AMD Opteron (2 cores, 3 instruction issue per cycle, 2.4 GHz, does not support multithreading) Intel Pentium D (2 cores, 3 instruction issue per cycle, 3.2 GHz, supports SMT) IBM Power 5 (2 cores, 4 instruction issue per cycle, 1.9 GHz, supports SMT) we don’t compare the T1 on FP benchmarks because, being a server, the only FP hardware is shared and therefore FP benchmarks will perform poorly Details are given in figure 4.33 where the T1 clearly outperforms the other processors in all non-FP benchmarks except for the Spec integer benchmarks where it is marginally better than the Power5 and Opteron the moral of this story appears to be that proper support of fine-grained threads coupled with multiple non-FP threads provides a better performance than superscalar pipelines

37. Sample Problem #1 How many processors will it take a processor array to find the maximum value of an array of n values in O(1) time? Solution: n2 as follows Denote each processor as pi,j where a processor pi,j is assigned the value of a[i] (so that n processors will have each array value) each processor pi,j compares the two array values a[i] and a[j] if a[i] < a[j] then write 1 to array location b[i] else write 1 to b[j] there may be multiple writes so that the processor with the lowest ID writes to b[i] (or b[j]), but we are only writing 1, so a 1 will be written to some elements of b Using n of the original processors, assign a processor to each element of b if b[i] = = 0 then write i to datum k a[k] will be the maximum item can you figure out why? This algorithm executes in parallel taking a total of 2 loads, two comparisons and two writes no matter the size of n, so is O(1)

38. Sample Problem #2 Assume the memory setup as shown to the right Determine the resulting state of cache and memory of each operation using write invalidate P0: read 120 P0 B0: (S, 120, 00, 20) reads 20 P0: write 120 ? 80 P0 B0 modified to (M, 120, 00, 80) 120 ? 80 is broadcast over the bus P15 invalidates B0: (I, 120, 00, 20)

39. Continued P15: write 120 ? 80 P15 B0 (M, 120, 00, 80) P0 is unchanged since it is the same value as in P0 B0 P1: read 110 P0 B2: (S, 110, 00, 30) P1 B2: (S, 110, 00, 30) M[110]: (00, 30), read (returns 30) P0: write 108 ? 48 P0 B1: (M, 108, 00, 48) P15 B1 becomes invalid: (I, 108, 00, 08) P0: write 130 ? 78 P0 B2: (M, 130, 00, 78) M[110]: (00, 30) P15: write 130 ? 78 P0 B2: (M, 130, 00, 78)

40. Sample Problem #3 We use the figure from the previous example as our memory/cache layout use MESI protocol where memory access takes 100 cycles, remote cache access takes 70 cycles, invalidate takes 15 cycles and write back takes 10 cycles

  • Login