COE-502 Paper Presentation 2: Transactional Coherence and Consistency. Presenter: Muhammad Mohsin Butt (g201103010)
Outline • Introduction • Current Hardware • TCC in Hardware • TCC in Software • Performance Evaluation • Conclusion
Introduction • Transactional Coherence and Consistency (TCC) provides a lock-free transactional model that simplifies both parallel hardware and software. • Programmer-defined transactions are the basic unit of parallel work. • Memory coherence, communication, and memory consistency are handled implicitly within transactions.
Current Hardware • Provides the illusion of a single shared memory to all processors. • A problem is divided into parallel tasks that operate on shared data in shared memory. • Complex cache coherence protocols are required. • Memory consistency models are also needed to ensure program correctness. • Locks are used to prevent data races and serialize access to shared data. • The overhead of too many locks can degrade performance.
TCC in Hardware • Processors execute speculative transactions in a continuous cycle. • A transaction is a sequence of instructions, marked by software, that is guaranteed to execute and complete atomically. • Provides an "all transactions, all the time" model that simplifies parallel hardware and software.
TCC in Hardware • While a transaction executes, its writes are collected in a local write buffer. • When the transaction completes, the hardware arbitrates system-wide for permission to commit it. • Once permission is granted, the node broadcasts all of the transaction's writes as a single packet. • Transmitting a single packet reduces the number of inter-processor messages and arbitrations. • Other processors snoop on these commit packets to detect dependence violations.
TCC in Hardware • TCC simplifies cache design: • The processor holds data in either unmodified or speculatively modified form. • During snooping, a line is invalidated if the commit packet contains only its address. • A line is updated if the commit packet contains both address and data. • Protection against data dependencies: • If a processor has read from any address in a commit packet, its current transaction is re-executed.
TCC in Hardware • Current CMPs need features for speculative buffering of memory references and for commit arbitration control. • A mechanism is required to gather all cache lines modified by a transaction into a single commit packet. • Write buffer: completely separate from the cache. • Address buffer: holds the list of tags for lines whose data is to be committed.
TCC in Hardware • Read bits • Set on a speculative read during a transaction. • The current transaction is violated and restarted if the snoop protocol sees a commit packet containing the address of a location whose read bit is set. • Modified bits • Set to 1 by stores during a transaction. • On a violation, lines whose modified bit is set are invalidated.
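The read-bit/modified-bit rules above can be sketched in software. This is a minimal illustrative model, not the paper's hardware design; the struct fields and `snoop_commit` helper are assumptions made for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_LINES 8

/* Hypothetical per-line speculative state (illustrative field names). */
typedef struct {
    unsigned tag;
    bool read_bit;      /* set on a speculative read during a transaction */
    bool modified_bit;  /* set by speculative stores */
    bool valid;
} CacheLine;

typedef struct {
    CacheLine lines[NUM_LINES];
    bool violated;
} Node;

/* Snoop one committed address: a match against a line whose read bit is
 * set violates the current transaction; on violation, all lines with the
 * modified bit set are invalidated so speculative state is discarded. */
static void snoop_commit(Node *n, unsigned addr_tag) {
    for (int i = 0; i < NUM_LINES; i++) {
        CacheLine *l = &n->lines[i];
        if (l->valid && l->tag == addr_tag && l->read_bit)
            n->violated = true;
    }
    if (n->violated) {
        for (int i = 0; i < NUM_LINES; i++)
            if (n->lines[i].modified_bit)
                n->lines[i].valid = false;
    }
}
```

The sketch captures the key asymmetry: read bits detect the violation, modified bits determine what must be thrown away afterward.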
TCC in Software • Programming with TCC is a three-step process: • Divide the program into transactions. • Specify the transaction order. • Ordering can be relaxed when it is not required. • Tune performance. • TCC provides feedback on where in the program violations occur frequently.
Loop-Based Parallelization • Consider a histogram calculation over 1000 integer percentages:

/* input */
int *in = load_data();
int i, buckets[101] = {0};
for (i = 0; i < 1000; i++) {
    buckets[in[i]]++;
}
/* output */
print_buckets(buckets);
Loop-Based Parallelization • The loop can be parallelized using: • t_for (i = 0; i < 1000; i++) • Each loop body becomes a separate transaction. • When two parallel iterations try to update the same histogram bucket, the TCC hardware causes the later transaction to violate, forcing it to re-execute. • A conventional shared-memory model would require locks to protect the histogram bins. • Can be further optimized using: • t_for_unordered()
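A quick way to see why `t_for_unordered()` is safe here: bucket increments commute, so the order in which the loop's transactions commit cannot change the final histogram. A minimal sketch of that argument (the data values and the `reversed` flag are made up for illustration; this runs sequentially, it does not model TCC itself):

```c
#include <assert.h>
#include <string.h>

#define N 8
#define BUCKETS 101

/* Build the histogram visiting elements in one of two opposite orders,
 * standing in for two different transaction commit orders. */
static void histogram(const int *data, int n, int *buckets, int reversed) {
    memset(buckets, 0, BUCKETS * sizeof *buckets);
    for (int k = 0; k < n; k++) {
        int i = reversed ? n - 1 - k : k;  /* pick the visiting order */
        buckets[data[i]]++;
    }
}
```

Because both orders produce identical buckets, the programmer can drop the commit-ordering constraint and let transactions commit as they finish.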
Fork-Based Parallelization • t_fork() forces the parent transaction to commit and creates two completely new transactions: • one continues executing the remaining code; • the other starts executing the function passed as a parameter. E.g.:

/* Initial setup */
int PC = INITIAL_PC;
int opcode = i_fetch(PC);
while (opcode != END_CODE) {
    t_fork(execute, &opcode, 1, 1, 1);
    increment_PC(opcode, &PC);
    opcode = i_fetch(PC);
}
Explicit Transaction Commit Ordering • Provides partial ordering. • Done by assigning two parameters to each transaction: • a sequence number and a phase number. • Transactions with the same sequence number commit in an order defined by the programmer. • Transactions with different sequence numbers are independent. • Among transactions with the same sequence number, order is determined by the phase number. • The transaction with the lowest phase number commits first.
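The ordering rule can be expressed as a comparator: phase number decides order only within a sequence, while different sequences are mutually unconstrained. A small sketch using `qsort` (the `Txn` struct and comparator are illustrative, not the paper's API; grouping by sequence here is only a convenience, since any interleaving across sequences would be legal):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int id;
    int sequence;  /* transactions in different sequences are independent */
    int phase;     /* lowest phase commits first within a sequence */
} Txn;

/* Comparator producing one legal commit order: group by sequence, then
 * order by phase number within each sequence. */
static int commit_order(const void *pa, const void *pb) {
    const Txn *a = pa, *b = pb;
    if (a->sequence != b->sequence)
        return a->sequence - b->sequence;  /* grouping only; any order is legal */
    return a->phase - b->phase;            /* required order within a sequence */
}
```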
Performance Evaluation • Maximize parallelism: • create as many transactions as possible. • Minimize violations: • keep transactions small to reduce the amount of work lost on a violation. • Minimize transaction overhead: • do not make transactions too small. • Avoid buffer overflow: • it can result in excessive serialization.
Performance Evaluation • Base case: • simple parallelization without any optimization. • Unordered: • find loops that can be made unordered. • Reduction: • find areas that can exploit reduction operations. • Privatization: • give each transaction a private copy of the variables that cause violations. • Using t_commit(): • break large transactions into small ones that execute on the same processor; this reduces work lost to violations and prevents buffer overflow. • Loop adjustments: • apply the loop optimizations provided by the compiler.
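The privatization + reduction optimizations can be sketched together on the histogram example: each chunk (standing in for a transaction) updates a private copy of the buckets, so parallel transactions would never violate on the shared counters, and a final reduction merges the copies. This is a sequential illustration of the data layout, not TCC itself; the chunking and names are assumptions:

```c
#include <assert.h>
#include <string.h>

#define N 8
#define BUCKETS 101
#define CHUNKS 2

/* Each chunk increments only its private bucket array; a reduction pass
 * at the end combines the private copies into the shared result. */
static void histogram_privatized(const int *data, int n, int *out) {
    int priv[CHUNKS][BUCKETS];
    memset(priv, 0, sizeof priv);
    int per = n / CHUNKS;
    for (int c = 0; c < CHUNKS; c++)            /* "transactions" */
        for (int k = c * per; k < (c + 1) * per; k++)
            priv[c][data[k]]++;
    memset(out, 0, BUCKETS * sizeof *out);
    for (int c = 0; c < CHUNKS; c++)            /* reduction step */
        for (int b = 0; b < BUCKETS; b++)
            out[b] += priv[c][b];
}
```

The trade-off is extra memory and a merge pass in exchange for removing the shared-counter conflicts that caused violations in the naive version.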
Performance Evaluation • Inner loops had too many violations. • Using outer loop_adjust improved results. • Privatization and t_commit() improve performance.
Performance Evaluation • CMP performance is close to ideal TCC for a small number of processors.
Conclusions • Bandwidth limitation remains a problem for scaling TCC to more processors. • No support for nested for loops. • Dynamic optimization techniques are still required to automate performance tuning on TCC.