
IXP Lab 2012: Part 3

Programming Tips

Outline
  • Memory Independent Techniques
    • Instruction Selection
    • Task Partition
  • Memory Dependent Techniques
    • Reducing Overhead
      • Reduce the number of memory accesses
      • Reduce average access latency
    • Hiding Overhead

NCKU CSIE CIAL Lab

Memory Independent Techniques
  • Instruction Selection
    • General Coding Skill
    • Use Hardware Instruction
  • Task Partition
    • Multi-Processing
    • Context-Pipelining

General Coding Skill
  • Remove loop
  • Shift Operation
    • Avoid using multiply and divide
  • Inline Function
    • __inline & __forceinline
  • Branch Prediction
    • Branch Prediction Penalty
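As a minimal sketch of the shift-operation tip, assuming unsigned operands and power-of-two constants, multiplication, division and modulo can be replaced with shifts and masks. The function names below are illustrative and do not come from the slides.

```c
#include <stdint.h>

/* x * 8 written as a left shift by log2(8) = 3 */
static uint32_t times_8(uint32_t x)
{
    return x << 3;
}

/* x / 4 written as a right shift by log2(4) = 2 (unsigned only) */
static uint32_t div_by_4(uint32_t x)
{
    return x >> 2;
}

/* x % 16 written as a mask with 16 - 1 (unsigned only) */
static uint32_t mod_16(uint32_t x)
{
    return x & 15u;
}
```

For example, `times_8(5)` yields 40 and `mod_16(37)` yields 5. The trick only applies when the constant is a power of two.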

Hardware Instruction
  • POP_COUNT
  • FFS
  • Multiply
  • CRC
  • Hashing
  • CAM

POP_COUNT--Brief
  • Population Count
  • Reports the number of bits set in a 32-bit register
  • 3 cycles latency
  • Example:
    • pop_count( 0x3121 ) = ?
    • 0011 0001 0010 0001
    • Result = 5

POP_COUNT--Naïve Implementation

unsigned int pop_count_for (unsigned int x)
{
    unsigned int y=0;
    unsigned int i;

    for(i=0; i<32; i++)
    {
        if( (x&1)==1 )
            y++;
        x=x>>1;
    }
    return y;
}

POP_COUNT--Faster Implementation

unsigned int pop_count_agg(unsigned int x)
{
    /* Sum adjacent bit fields, doubling the field width at each step */
    x -= ((x >> 1) & 0x55555555);
    x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
    x = (((x >> 4) + x) & 0x0f0f0f0f);
    x += (x >> 8);
    x += (x >> 16);
    /* The count (0..32) is in the low 6 bits */
    return(x & 0x0000003f);
}

Reference: http://aggregate.org/MAGIC/

POP_COUNT--Hardware Instruction

unsigned int pop_count_hardware(unsigned int x)
{
    return pop_count (x);
}

POP_COUNT--Additional Information
  • Bitmap-RFC (Liu, TECS 2008)
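One common reason bitmap-compressed lookup tables rely on population count is to turn a bit position into an index into a packed array. The sketch below illustrates that general idea only; it is not taken from the cited paper, and `popcount32` and `packed_index` are hypothetical names.

```c
#include <stdint.h>

/* Portable population count, same bit trick as pop_count_agg */
static unsigned popcount32(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;
    x = ((x >> 2) & 0x33333333u) + (x & 0x33333333u);
    x = ((x >> 4) + x) & 0x0f0f0f0fu;
    x += x >> 8;
    x += x >> 16;
    return x & 0x3fu;
}

/*
 * Assumption: bit i of `bitmap` is set when slot i (0..31) holds an
 * element, and only the present elements are stored, in order, in a
 * packed array. Returns the packed-array index of `slot`, or -1 if
 * the slot is empty.
 */
static int packed_index(uint32_t bitmap, unsigned slot)
{
    if (((bitmap >> slot) & 1u) == 0)
        return -1;
    /* Count the set bits strictly below `slot` */
    return (int)popcount32(bitmap & ((1u << slot) - 1u));
}
```

With the bitmap 0x3121 used earlier (bits 0, 5, 8, 12 and 13 set), slot 8 maps to packed index 2.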

FFS
  • Finds the first (least significant) bit set in the data and returns its zero-based position
  • Example:
    • ffs ( 0x3121 ) = 0
      • 0011 0001 0010 0001
    • ffs ( 0x3120 ) = 5
      • 0011 0001 0010 0000
    • ffs ( 0x3100 ) = 8
      • 0011 0001 0000 0000
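For comparison with the hardware instruction, here is a portable software sketch that reproduces the three examples above. The name `ffs_sw` and the return value of -1 for a zero input are assumptions made for illustration; the slides do not say what the hardware returns when no bit is set.

```c
#include <stdint.h>

/*
 * Returns the zero-based position of the least significant set bit.
 * Returning -1 for x == 0 is an assumption of this sketch.
 */
static int ffs_sw(uint32_t x)
{
    int pos = 0;

    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {
        x >>= 1;
        pos++;
    }
    return pos;
}
```

The loop runs up to 31 iterations in the worst case, which is the cost a single hardware instruction avoids.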

Multiply
  • Dedicated multiply instructions
    • Multiply_24x8()
    • Multiply_16x16()
    • Multiply_32x32_hi()
    • Multiply_32x32_lo()
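Assuming, as the names suggest, that the `_hi` and `_lo` variants return the upper and lower 32 bits of the 64-bit product, their effect can be sketched in portable C as follows. The helper names are hypothetical stand-ins, not the hardware intrinsics.

```c
#include <stdint.h>

/* Lower 32 bits of the 64-bit product a * b */
static uint32_t mul32x32_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Upper 32 bits of the 64-bit product a * b */
static uint32_t mul32x32_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```

For instance, 0xFFFFFFFF times 2 is 0x1FFFFFFFE, so the high half is 1 and the low half is 0xFFFFFFFE.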

CRC
  • 14 cycles latency
  • Example of CRC operation

crc_write( 0x42424242);
crc_32_be( source_address, bytes_0_3 );
crc_32_be( dest_address, bytes_0_3 );
Cache_index = crc_read();
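The same pattern (seed the remainder, feed in the two addresses, read the remainder back as a cache index) can be sketched in software. This sketch assumes the standard CRC-32 polynomial 0x04C11DB7 processed most significant bit first and a 256-entry cache; it is not guaranteed to be bit-exact with the hardware CRC unit, and all names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY    0x04C11DB7u  /* assumed polynomial */
#define CACHE_ENTRIES 256u         /* assumed cache size, power of two */

/* Fold one byte into the remainder, most significant bit first */
static uint32_t crc32_be_byte(uint32_t crc, uint8_t byte)
{
    int bit;

    crc ^= (uint32_t)byte << 24;
    for (bit = 0; bit < 8; bit++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ CRC32_POLY : (crc << 1);
    return crc;
}

/* Fold one 32-bit word (bytes 0 to 3) into the remainder */
static uint32_t crc32_be_word(uint32_t crc, uint32_t word)
{
    crc = crc32_be_byte(crc, (uint8_t)(word >> 24));
    crc = crc32_be_byte(crc, (uint8_t)(word >> 16));
    crc = crc32_be_byte(crc, (uint8_t)(word >> 8));
    crc = crc32_be_byte(crc, (uint8_t)word);
    return crc;
}

/* Seed, fold in both addresses, then reduce to a cache index */
static uint32_t flow_cache_index(uint32_t src, uint32_t dst)
{
    uint32_t crc = 0x42424242u;    /* same seed as the slide */

    crc = crc32_be_word(crc, src);
    crc = crc32_be_word(crc, dst);
    return crc & (CACHE_ENTRIES - 1u);
}
```

The bitwise loop makes the cost of a software CRC visible: eight shift-and-test steps per byte, compared with the fixed latency of the hardware unit.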

Hash
  • hash_48()
  • hash_64()
  • hash_128()
  • Example:

SIGNAL sig_hash;

hash_48(data_out, data_in, count, sig_done, &sig_hash);
__wait_for_all(&sig_hash);

CAM--Brief
  • Content Addressable Memory
  • Each ME has 16 32-bit CAM entries
  • Each ME's CAM is private; other MEs cannot access it
  • A lookup searches all entries in parallel
  • On a hit, the index of the matching entry is returned
  • On a miss, the index of the entry to be replaced (the least recently used one) is returned

CAM--Structure
  • cam_lookup_t

CAM--Usage

cam_lookup_t cam_result;

cam_result = cam_lookup( data );
if( cam_result.hit == 1 ) {
    /* Hit: access the entry at index cam_result.entry_num */
}
else {
    ……
    /* Miss: overwrite the entry returned by the lookup */
    cam_write( cam_result.entry_num, data, 15 );
}
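To make the hit and miss behaviour concrete, the following is a software model of a 16-entry CAM. It assumes least-recently-used replacement on a miss and models the parallel search as a loop; the `sw_cam_*` names and the struct layout are hypothetical and do not mirror the real `cam_lookup_t`.

```c
#include <stdint.h>
#include <string.h>

#define SW_CAM_ENTRIES 16

typedef struct {
    uint32_t tag[SW_CAM_ENTRIES];
    uint8_t  valid[SW_CAM_ENTRIES];
    uint8_t  state[SW_CAM_ENTRIES];
    uint32_t last_use[SW_CAM_ENTRIES];  /* larger value = used more recently */
    uint32_t clock;
} sw_cam_t;

typedef struct {
    unsigned hit;        /* 1 on a match, 0 on a miss */
    unsigned entry_num;  /* matching entry, or entry to replace */
    unsigned state;      /* state stored with a matching entry */
} sw_cam_result_t;

static void sw_cam_clear(sw_cam_t *cam)
{
    memset(cam, 0, sizeof *cam);
}

/* On a hit, marks the entry as most recently used.
 * On a miss, reports the least recently used entry. */
static sw_cam_result_t sw_cam_lookup(sw_cam_t *cam, uint32_t tag)
{
    sw_cam_result_t r = {0, 0, 0};
    unsigned i, lru = 0;

    for (i = 0; i < SW_CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == tag) {
            cam->last_use[i] = ++cam->clock;
            r.hit = 1;
            r.entry_num = i;
            r.state = cam->state[i];
            return r;
        }
        if (cam->last_use[i] < cam->last_use[lru])
            lru = i;
    }
    r.entry_num = lru;
    return r;
}

/* Stores a tag and state, and marks the entry as most recently used */
static void sw_cam_write(sw_cam_t *cam, unsigned entry, uint32_t tag,
                         unsigned state)
{
    cam->tag[entry] = tag;
    cam->state[entry] = (uint8_t)state;
    cam->valid[entry] = 1;
    cam->last_use[entry] = ++cam->clock;
}
```

A caller follows the same shape as the usage above: look up, use `entry_num` directly on a hit, and on a miss fetch the data and write the new tag into the returned entry.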

Task Partition
  • Multi-Processing
    • More Computing Power
    • Easy to implement
  • Context-Pipelining
    • More usable resources
    • Hard to balance

Memory Dependent Techniques--Reducing Overhead
  • Reduce the number of memory accesses
    • Wide-word Accesses
    • Result Caches
  • Reduce average access latency
    • Multi-level Memory Hierarchy
    • Data Cache

Wide-Word Accesses--Brief
  • Fetch the needed data in batches
  • Reduces the number of accesses required
  • Useful when the data is stored contiguously

Wide-Word Accesses--Usage (One Node per Access)

__declspec(sram_read_reg) UINT32 A;
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read);
__wait_for_all( &sig_read );
/* Access A ...... */

----------------------------------------------

Result: 8 accesses are needed

Wide-Word Accesses--Usage (Two Nodes per Access)

__declspec(sram_read_reg) UINT32 A[2];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read);
__wait_for_all( &sig_read );
/* Access A ...... */

----------------------------------------------

Result: 4 accesses are needed

Wide-Word Accesses--Usage (Four Nodes per Access)

__declspec(sram_read_reg) UINT32 A[4];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read);
__wait_for_all( &sig_read );
/* Access A ...... */

----------------------------------------------

Result: 2 accesses are needed
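The access counts in the three usage variants (8, 4 and 2) can be reproduced with a small host-side simulation. This is a sketch under stated assumptions: `mem_read` is a hypothetical stand-in for one SRAM read of up to four consecutive 32-bit words, and a global counter records how many reads a linear search issues.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static unsigned access_count;  /* number of simulated memory reads */

/* Stand-in for one SRAM read of `words` consecutive 32-bit words */
static void mem_read(uint32_t *dst, const uint32_t *mem, size_t index,
                     size_t words)
{
    memcpy(dst, mem + index, words * sizeof *dst);
    access_count++;
}

/*
 * Linear search over `total` words, fetching `batch` words (1..4)
 * per read. Returns the index of `key`, or -1 if it is not found.
 */
static int linear_search(const uint32_t *mem, size_t total, size_t batch,
                         uint32_t key)
{
    uint32_t buf[4];
    size_t base, j;

    for (base = 0; base < total; base += batch) {
        size_t n = (total - base < batch) ? total - base : batch;

        mem_read(buf, mem, base, n);
        for (j = 0; j < n; j++)
            if (buf[j] == key)
                return (int)(base + j);
    }
    return -1;
}
```

Searching all 8 words for a key that is absent costs 8 reads with a batch of 1, 4 reads with a batch of 2, and 2 reads with a batch of 4.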

Wide-Word Accesses--Experiment
  • Platform: IXP2800
  • Total accesses: 8 longwords (8 × 4 bytes)

Wide-Word Accesses--Limitation
  • Data must be contiguous
    • Suitable for linear search
    • Does not suit random accesses
  • The number of transfer registers is fixed
    • Each thread has 16 read / write transfer registers
    • Some transfer registers may already be reserved for other uses

Result Cache--Brief
  • Caches the result computed by the application
  • If the same fields appear again, the cached result is returned
  • Memory accesses are reduced on a cache hit
  • Effectiveness depends on the temporal locality of the traffic
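A minimal sketch of the idea, assuming a direct-mapped cache keyed on a single 32-bit value: `slow_classify` is a hypothetical placeholder for the expensive memory-bound lookup, and the table size is arbitrary. It is not the design used in the lab.

```c
#include <stdint.h>

#define RC_ENTRIES 64u  /* assumed cache size */

typedef struct {
    uint32_t key;
    uint32_t result;
    int      valid;
} rc_entry_t;

static rc_entry_t rc_table[RC_ENTRIES];
static unsigned slow_lookups;  /* counts trips down the slow path */

/* Placeholder for the real, memory-intensive lookup */
static uint32_t slow_classify(uint32_t key)
{
    slow_lookups++;
    return (key * 2654435761u) >> 28;
}

/* Returns the cached result on a hit; fills the slot on a miss */
static uint32_t cached_classify(uint32_t key)
{
    rc_entry_t *e = &rc_table[key % RC_ENTRIES];

    if (e->valid && e->key == key)
        return e->result;

    e->key = key;
    e->result = slow_classify(key);
    e->valid = 1;
    return e->result;
}
```

Repeated keys skip the slow path entirely, while two keys that map to the same slot evict each other, which is why the benefit depends on how often recent keys recur.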

Result Cache--IXP2400
  • The IXP2400 ME provides no hardware cache
  • A set-associative cache is not easy to implement
  • The replacement policy adds further overhead

Result Cache--Design Consideration
  • Shared or private cache?
  • Size of the cache?
  • Does it work with specific hardware?
  • How is the miss penalty handled?

Result Cache--Example

Multi-Level Memory Hierarchy--Brief
  • Reduce the average access latency
  • The number of accesses remains unchanged
  • If the data fits in a faster memory, place it there

Multi-Level Memory Hierarchy--Data Placement
  • Small and read-only
    • Hard-code it in the program
  • Small but needs updating
    • Local Memory
  • Larger
    • Scratchpad
  • Largest
    • SRAM

Multi-Level Memory Hierarchy--Packet Data Type
  • Packet related data
    • Temporary Data
    • Valid with specific packet
    • Local Memory
  • Flow related data
    • Related to specific flow
    • Spatial Locality
    • Wide-Word Access
  • Application related data
    • Valid with specific application
    • Temporal Locality
    • Result Cache

Split-Cache (Z. Liu, IET-COM 2007)
  • Two separate hardware caches, one for application data and one for flow data

Data Cache--Brief
  • Hardware cache mechanism that caches the data used in packet processing
    • App-Cache
    • Flow-Cache
  • However, it is not supported by the IXP2400 (additional hardware is needed)

Data Cache--CAM + Local Memory
  • The CAM combined with Local Memory acts like a hardware cache
  • However, the number of CAM entries is limited
  • Each CAM entry may be associated with several Local Memory cache entries

Memory Dependent Techniques--Hiding Overhead
  • Does not actually reduce the overhead, but overlaps it with other work
    • Hardware Multi-Threading
    • Asynchronous Memory

Hardware Multi-Threading
  • A thread swaps itself out while it accesses memory so that another thread can execute
  • Each thread keeps its own set of registers, so no stack is needed for thread swapping
  • Round-robin scheduling
  • No preemption between threads

Asynchronous Memory--Brief
  • A thread is not blocked when it issues a memory request
  • Thus, a thread can issue multiple memory requests at a time

Asynchronous Memory--Example (1 Issue)

Read X
__wait_for_all ( &sig_x )
Read Y
__wait_for_all ( &sig_y )
// Use X and Y …

Asynchronous Memory--Example (2 Issues)

Read X
Read Y
__wait_for_all ( &sig_x, &sig_y )
// Use X and Y …
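The benefit of issuing both reads before waiting can be estimated with a deliberately simplified model. It assumes each request has a fixed latency, that issuing a request costs nothing, and that outstanding requests do not slow each other down; the function names are illustrative.

```c
/* Issue one request, wait for it, then issue the next:
 * the waits add up. */
static unsigned wait_one_at_a_time(const unsigned *latency, unsigned n)
{
    unsigned i, total = 0;

    for (i = 0; i < n; i++)
        total += latency[i];
    return total;
}

/* Issue every request first, then wait for all of them:
 * the waits overlap, so only the longest one is visible. */
static unsigned wait_issue_all_first(const unsigned *latency, unsigned n)
{
    unsigned i, longest = 0;

    for (i = 0; i < n; i++)
        if (latency[i] > longest)
            longest = latency[i];
    return longest;
}
```

With two reads of 90 cycles each (an illustrative figure, not a measurement), the first pattern waits 180 cycles and the second waits 90.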

Reference (1)
  • Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, Proc. ANCS 2005.
  • Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008.

Reference (2)
  • Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007.
