This work explores the implementation of application-specific signatures to enhance transactional memory (TM) functionality in soft processors on FPGA platforms. By efficiently detecting memory access conflicts and optimizing atomic operations, we aim to improve parallel processing capabilities within large system-on-chip architectures. Our approach modifies memory access strategies using tailored hash functions to maximize performance and minimize resource usage, enabling better management of concurrency. This study presents a novel framework for efficient TM operation leveraging FPGA reconfigurability and specific application characteristics.
Application-Specific Signatures for Transactional Memory in Soft Processors
Martin Labrecque, Mark Jeffrey, Gregory Steffan
ECE Dept., University of Toronto
FPGAs for Systems-on-Chip
• Increasingly large Systems-on-Chip
• Many CPUs, accelerators, IP blocks
• Processors are easier to program than hardware
• FPGAs & multicores: similar parallel programming challenge
Figure: FPGA containing a soft processor, a DDR controller, and Ethernet MAC controllers.
Why are parallel programs challenging?
Packet Processing Example
SINGLE-THREADED and MULTI-THREADED (same code in both columns; two regions are marked Atomic):
    packet = get_packet();
    …
    connection = database->lookup(packet);
    if (connection == NULL)
        connection = database->add(packet);
    connection->count++;
    …
    global_packet_count++;
Challenges:
1- Must correctly delimit atomic operations
2- Improve performance by finer-grain locking
Packet Processing Example
MULTI-THREADED
    packet = get_packet();
    …
    connection = database->lookup(packet);
    if (connection == NULL)
        connection = database->add(packet);
    connection->count++;
    …
    global_packet_count++;
Annotations on the two Atomic regions:
• Opportunity for Parallelism: Optimistic Parallelism across Connections
• No Parallelism
Exploit Opportunity for Parallelism
• Allow more than 1 thread in a critical section
• Will succeed if threads access different data
• Transactional Memory
  • the new hot topic for multiprocessor computers
  • how to map TM to FPGAs?
Our Transactional Approach
• Modify main memory directly: reduce copies, faster commit
• Detect conflicts prior to corrupting main memory
• Undo changes on transaction abort
Figure: processor1 and processor2 access data through a data cache backed by off-chip DDR.
• How to efficiently detect conflicts?
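The in-place update and undo-on-abort behaviour listed above can be sketched in software as an undo log. This is only an illustration of the idea, not the hardware design: the names `undo_log`, `tx_store`, `tx_commit`, `tx_abort` and the capacity `UNDO_CAPACITY` are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_CAPACITY 64  /* assumed log depth, not taken from the slides */

/* Hypothetical undo log: one entry per speculative store. */
typedef struct {
    uint32_t *addr[UNDO_CAPACITY];  /* location that was overwritten */
    uint32_t  old[UNDO_CAPACITY];   /* value it held before the write */
    size_t    count;
} undo_log;

/* Transactional store: save the old value, then update memory in place. */
void tx_store(undo_log *log, uint32_t *addr, uint32_t value) {
    assert(log->count < UNDO_CAPACITY);
    log->addr[log->count] = addr;
    log->old[log->count]  = *addr;
    log->count++;
    *addr = value;
}

/* Commit: memory already holds the new values, so only the log is dropped. */
void tx_commit(undo_log *log) {
    log->count = 0;
}

/* Abort: restore old values newest-first so the oldest saved value wins. */
void tx_abort(undo_log *log) {
    while (log->count > 0) {
        log->count--;
        *log->addr[log->count] = log->old[log->count];
    }
}
```

Because stores go straight to memory, commit is cheap (the log is simply discarded), while abort pays the cost of walking the log backwards.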
Conflict Detection
• Tracking speculative reads and writes
• Compare accesses across transactions (Transaction1 vs. Transaction2):
  • Read A vs. Read A: OK
  • Read B vs. Write B: CONFLICT
  • Write C vs. Read C: CONFLICT
  • Write D vs. Write D: CONFLICT
• Must detect all conflicts for correctness
• Reporting false conflicts is acceptable
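The comparison rule above (read/read is fine, any overlap involving a write is a conflict) can be written down as an exact set check. This is a reference sketch only; the `tx_sets` type and the function names are hypothetical, and a linear search stands in for whatever structure a real system would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exact read and write sets of one transaction. */
typedef struct {
    const uint32_t *reads;  size_t n_reads;
    const uint32_t *writes; size_t n_writes;
} tx_sets;

static bool contains(const uint32_t *set, size_t n, uint32_t addr) {
    for (size_t i = 0; i < n; i++)
        if (set[i] == addr) return true;
    return false;
}

/* True if any address written by `w` is read or written by `other`. */
static bool writes_hit(const tx_sets *w, const tx_sets *other) {
    for (size_t i = 0; i < w->n_writes; i++) {
        uint32_t a = w->writes[i];
        if (contains(other->reads, other->n_reads, a) ||
            contains(other->writes, other->n_writes, a))
            return true;
    }
    return false;
}

/* Read-read sharing is allowed; any overlap involving a write conflicts. */
bool tx_conflict(const tx_sets *t1, const tx_sets *t2) {
    assert(t1 && t2);
    return writes_hit(t1, t2) || writes_hit(t2, t1);
}
```

An exact check like this never reports a false conflict, but it needs storage proportional to the number of accesses, which is the cost that signatures trade away.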
Related Work on Conflict Detection
• FPGAs: test speculative bits in the cache
• Complex to evict cache lines
• Lots of additional state
• Too restrictive in terms of storage capacity
• ASIC: compare signatures
• Signature: bit vector recording TM memory accesses
• No previous signature FPGA implementation
Signatures well suited to FPGA bitwise operations
How can signatures be efficiently implemented?
Conflict Detection with Signatures
• Hash of an address indexes into a bit vector
Figure: for processor1 and processor2, load and store addresses pass through a Hash Function into Read and Write Signatures, which are combined with an AND.
• More bits per signature: more resolution
• FPGA timing and area limit the number of bits
• Hash functions have varying complexity/accuracy
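A software model of the signature scheme described above might look like the following. It is a minimal sketch under stated assumptions: the 64-bit signature size, the `signature` type, and the placeholder hash `sig_hash` (low-order word-address bits) are chosen for the example and are not the hardware parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 64u  /* assumed signature size for the sketch (power of two) */

typedef struct {
    uint64_t read;   /* bits set by hashed load addresses  */
    uint64_t write;  /* bits set by hashed store addresses */
} signature;

/* Placeholder hash (assumption): low-order word-address bits pick the bit. */
static unsigned sig_hash(uint32_t addr) {
    return (addr >> 2) & (SIG_BITS - 1u);
}

void sig_load(signature *s, uint32_t addr) {
    s->read |= UINT64_C(1) << sig_hash(addr);
}

void sig_store(signature *s, uint32_t addr) {
    s->write |= UINT64_C(1) << sig_hash(addr);
}

void sig_clear(signature *s) {
    s->read = 0;
    s->write = 0;
}

/* Conflict if a write bit of one side overlaps a read or write bit of the other. */
bool sig_conflict(const signature *a, const signature *b) {
    assert(a && b);
    return (a->write & (b->read | b->write)) != 0 ||
           (b->write & a->read) != 0;
}
```

Two different addresses that hash to the same bit are reported as a conflict even though they do not overlap. That is a false conflict: safe, but it costs an unnecessary rollback, which is why the choice of hash function matters.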
Goals of this Work
• Implement efficient signatures for TM on FPGAs
• Use FPGA reconfigurability for better/more-efficient TM
• Evaluate with real system
Existing Hash Functions
Bit Selection
• A 4-bit hash indexes into 16 signature bits
Figure: a subset of the address bits is selected to form the Hash.
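Bit selection can be sketched in a few lines. The choice of which address bits are selected is an assumption here (a contiguous field starting at bit 2); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Select `nbits` consecutive address bits starting at bit `lsb`. */
unsigned bit_select_hash(uint32_t addr, unsigned lsb, unsigned nbits) {
    assert(nbits >= 1 && nbits <= 16 && lsb + nbits <= 32);
    return (addr >> lsb) & ((1u << nbits) - 1u);
}

/* Set the selected bit in a 16-bit signature (4 index bits, assumed lsb = 2). */
uint16_t sig16_insert(uint16_t sig, uint32_t addr) {
    return (uint16_t)(sig | (1u << bit_select_hash(addr, 2, 4)));
}
```

Any two addresses that agree on the selected bits share a signature bit, so the hash is cheap but ignores all other address bits.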
Existing Hash Functions (continued)
H3: XOR random address bits
• Multiple hash functions index different parts of the signature
• We use 4 hash functions to improve performance/length
Figure: Hash_1 and Hash_2 are each formed by XORing a different selection of address bits.
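An H3-style hash computes each output bit as the XOR (parity) of the address bits picked out by a fixed mask. The sketch below assumes 4 hash functions of 4 bits each, every one owning a 16-bit slice of a 64-bit signature; those sizes, the function names, and the masks supplied by the caller are illustrative assumptions, not the values used in the evaluated design.

```c
#include <assert.h>
#include <stdint.h>

/* XOR of all 32 bits of x: 1 if an odd number of bits are set. */
static unsigned parity32(uint32_t x) {
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* One H3-style hash: output bit i is the XOR of the address bits in masks[i]. */
unsigned h3_hash(uint32_t addr, const uint32_t *masks, unsigned nbits) {
    assert(masks && nbits >= 1 && nbits <= 16);
    unsigned h = 0;
    for (unsigned i = 0; i < nbits; i++)
        h |= parity32(addr & masks[i]) << i;
    return h;
}

/* Four hash functions, each owning one 16-bit slice of a 64-bit signature.
 * `masks` holds 4 groups of 4 masks (16 entries in total). */
uint64_t h3_insert(uint64_t sig, uint32_t addr, const uint32_t *masks) {
    for (unsigned k = 0; k < 4; k++) {
        unsigned idx = h3_hash(addr, masks + 4 * k, 4);
        sig |= UINT64_C(1) << (16u * k + idx);
    }
    return sig;
}
```

Each output bit costs one XOR tree over the masked address bits, which gives a sense of why these hashes consume more area than plain bit selection.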
Existing Hash Functions (continued)
PBX: XOR high-order bits with low-order ones
LE-PBX: XOR high-order bits with low-order ones, progressively omit low-order bits in hash functions
Figure: address-bit diagrams for the hash functions (Hash_1, Hash_2).
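The XOR of a high-order field with a low-order field can be sketched as below. The field widths and positions are assumptions, and the `skip` parameter is only one possible reading of "progressively omit low-order bits" (each successive hash function would use a larger `skip`); it is not a confirmed definition of LE-PBX.

```c
#include <assert.h>
#include <stdint.h>

/* XOR an n-bit low-order field with the n-bit field directly above it.
 * `skip` low-order bits are dropped first (skip = 0 gives the plain form). */
unsigned pbx_hash(uint32_t addr, unsigned nbits, unsigned skip) {
    assert(nbits >= 1 && nbits <= 8 && skip + 2u * nbits <= 32u);
    uint32_t mask = (1u << nbits) - 1u;
    uint32_t low  = (addr >> skip) & mask;            /* low-order field  */
    uint32_t high = (addr >> (skip + nbits)) & mask;  /* high-order field */
    return low ^ high;
}
```

For example, with 4-bit fields the address 0xA5 hashes to 0x5 XOR 0xA = 0xF.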
Signatures: an Opportunity for FPGAs
• ASIC hash functions on FPGA: very area consuming
• Due to locality:
  • applications access certain memory locations more frequently
  • certain locations will have more conflicts than others
• Via app-specific signatures:
  • increase tracking resolution of conflicting memory locations
  • decrease tracking resolution of others
• FPGAs allow customized hash function for each application
Application-specific signatures!
Trie-based Hashing for Signatures
Figure: trie built from the binary addresses seen during profiling (root; 1xx, 0xx; 11x, 10x, 01x, 00x; 111, 110, 101, 100, 011, 000). Leaves are distinct addresses and map to signature bits.
• Trie gives control over the resolution for different memory regions
• Complete trie of all TM accesses is HUGE
• Which leaves in the trie can/cannot be merged?
Trie-Based Conflict Detection
Figure: the load/store address bits A2, A1, A0 are evaluated against the trie (xxx; 1xx, 0xx; 11x, 10x, 01x, 00x; 111, 110, 101, 100, 011, 000). With simulation feedback, the remaining leaves are A2 & A0, A2 & !A0, and !A2.
• 3 leaves in trie: 3 signature bits encompass all accesses
• Compact trie by only evaluating nodes with remaining branching
• Representation is very efficient!
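The three-leaf example above maps directly to combinational logic on the address bits. The sketch below models it for 3-bit addresses; the function name and the assignment of leaves to particular bit positions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 3-bit address (A2 A1 A0) to one of three signature bits:
 *   bit 0: !A2       (the whole 0xx subtree shares one bit)
 *   bit 1: A2 & !A0
 *   bit 2: A2 &  A0
 * A1 is never evaluated because no remaining leaf depends on it. */
uint8_t trie_sig_bit(uint8_t addr) {
    assert(addr < 8);
    unsigned a2 = (addr >> 2) & 1u;
    unsigned a0 = addr & 1u;
    if (!a2)
        return (uint8_t)(1u << 0);
    return (uint8_t)(a0 ? (1u << 2) : (1u << 1));
}
```

Every address sets exactly one of the three bits, so all accesses are covered while only the bits that still distinguish leaves are examined.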
Trie-based Hash Function Evaluation
• Training packet trace is different from test packet trace
Multiprocessor System
Figure: processor1 and processor2 (1 thread each), each with an instruction cache (I$), together with a Synch. Unit, Instr. and Data memories, Input and Output memories, Input and Output Buffers for packet input and packet output, and a Shared Data Cache backed by off-chip DDR.
• NetFPGA: Virtex II Pro 50, 4 GigE + 1 PCI interfaces
• 2 processors @ 125 MHz (limited by FPGA)
• 64 MB DDR2 SDRAM @ 200 MHz
Real system executing real applications
Simulated Ratio of False Conflicts versus Number of Signature Bits
Figure: NAT, percent false conflicts.
- Trie-based hashing function requires far fewer signature bits
Simulated Ratio of False Conflicts versus Number of Signature Bits
Figure: one panel per application (UDHCP, NAT, Classifier, Intruder).
- Trie-based hashing function requires far fewer signature bits
Simulated Packet Rate Normalized to Ideal Conflict Detection vs Trie-Based Signature Length
Figure: packet rate versus signature length; legend: Ideal.
Signatures are Critical to Performance
2 Best Implementation Options
Maximum Design @ 125MHz
• Registers: arbitrary hash function, ~100 signature bits per thread
  We use trie-based signatures: they perform best at that size
• Block RAM: Bit-Select hash function, 2048 signature bits per thread
Let’s Compare!
Trie-based Hashing Normalized to BitSelection
Figure: Area and Throughput charts; data labels +71%, +58%, +12%, +9%.
- At most 5% area overhead
- Significantly fewer rollbacks: packet rate increase
Conclusions
• Conflict detection significantly impacts performance
• Trie-based hashing reduces required signature bits
• Trie-based hashing can be implemented in LUTs
• Preserve frequency, 5% area overhead
• Retiming is required to implement in RAMs
• Increased performance (up to 71%) versus other best implementation (RAM-based bit-select)
- Application-specific signatures enable first fully integrated TM processor for FPGA
- We now have an extended version working with 8 threads
Thank you!
Martin Labrecque, Mark Jeffrey, Gregory Steffan
ECE Dept., University of Toronto
martinL/markJ@eecg.utoronto.ca
Transactional Memory: Parallel Programming Made Easy
• Alleviate need for fine-grained synchronization (two versions, labeled BEFORE and AFTER):
    Bool val = f(shared_1);
    if (val) {
        Lock();
        if ( f(shared_1) ) shared_1 = 0;
        Unlock();
    }

    Lock();
    if ( f(shared_1) ) shared_1 = 0;
    Unlock();
• Reduce conservative synchronization overhead
    Lock();
    if (shared_1) array[i] = 0;
    Unlock();
  Only serialized when truly necessary
Our Transactional Approach
• No program change required
• Modify main memory directly
• Detect conflicts prior to corrupting main memory
• Undo changes on transaction abort
Figure: two processors access data through a data cache backed by off-chip DDR.
Transactional Single-Threaded Processor (simplified)
Figure: pipeline with PC (+4), Instr. Cache, Reg. Array, ALU, Data Cache, and Hazard Detection Logic.
Hazard detection is too slow: use static hazard detection
Transactional Single-Threaded Processor (simplified)
Figure: pipeline with two copies of the PC and Reg. Array, Instr. Cache, ALU, +4, Data Cache, plus Conflict Detection and an Undo Log.
Transactional Packet Processing
• Hardware support to revert speculative changes to:
  • Register file
  • Program counter
  • Data memory
• To detect failed speculation:
  • Record read and write sets of speculative threads
  • Compare sets across threads
When does the set comparison take place?
Conflict Detection with Signatures
• Suited for FPGA bitwise operations
• Hash of an address sets bits in a bit vector
• Set comparison is an AND operation
• Clearing sets is done in 1 cycle
Figure: signatures for Thread 0 and Thread 1 next to the processor, each showing W 01000000, R 00000000, W 00000000, R 00000000.
• Requires many bits per thread
• Timing constraints allow read and write set tracking for 2 threads
• Made a single-threaded 2-processor implementation
Figure: trie with nodes root; 1xx, 0xx; 11x, 00x; 111, 110, 000.
A New Meaning for Locks
• Optimistically consider locks
• No program change required
    Lock();
    if ( f( ) ) shared_1 = a();
    else shared_2 = b();
    Unlock();
Figure: Thread1 to Thread4 executing the critical section under LOCKS and under TRANSACTIONAL execution.
• Reduce conservative synchronization overhead
• Reduce challenge of fine-grained synchronization