Instruction and data address trace compression

Instruction and Data Address Trace Compression. Aleksandar Milenković (collaborative work with Milena Milenković and Martin Burtscher) Electrical and Computer Engineering Department The University of Alabama in Huntsville Email: [email protected] Web: http://www.ece.uah.edu/~milenka


Instruction and Data Address Trace Compression

Aleksandar Milenković

(collaborative work with Milena Milenković and Martin Burtscher)

Electrical and Computer Engineering Department

The University of Alabama in Huntsville

Email: [email protected]

Web: http://www.ece.uah.edu/~milenka

http://www.ece.uah.edu/~lacasa



Outline

  • Program Execution Traces

  • Trace Compression

  • Trace Compression in Hardware

    • Stream caches and predictors for instruction address trace compression

    • Data address stride caches for data address trace compression

  • Results

  • Conclusions



Program Execution Traces

  • Streams of recorded events

    • Basic block traces

    • Address traces

    • Instruction words

    • Operands

  • Trace uses

    • Computer architects for evaluation of new architectures

    • Computer analysts for workload characterization

    • Software developers for program tuning, optimization, and debugging


Instruction and Data Address Traces: An Example

Source loop:

    for (i = 0; i < 100; i++) {
        c[i] = s*a[i] + b[i];
        sum = sum + c[i];
    }

Loop body as executed:

    @ 0x020001f4: mov r1,r12, lsl #2
    @ 0x020001f8: ldr r2,[r4, r1]
    @ 0x020001fc: ldr r3,[r14, r1]
    @ 0x02000200: mla r0,r2,r8,r3
    @ 0x02000204: add r12,r12,#1 (1 >>> 0)
    @ 0x02000208: cmp r12,#99 (99 >>> 0)
    @ 0x0200020c: add r6,r6,r0
    @ 0x02000210: str r0,[r5, r1]
    @ 0x02000214: ble 0x20001f4

Dinero+ Execution Trace

    Type  Instruction Address  Data Address
    2     0x020001f4
    0     0x020001f8           0xbfffbe24
    0     0x020001fc           0xbfffbc94
    2     0x02000200
    2     0x02000204
    2     0x02000208
    2     0x0200020c
    1     0x02000210           0xbfffbb04
    2     0x02000214
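The trace above can be regenerated with a short sketch. It reads the type column as 0 = data read, 1 = data write, 2 = instruction fetch (which matches the ldr, str, and remaining instructions in the listing), and it assumes 4-byte array elements. The names `loop_trace` and `format_record` are made up for illustration.

```python
# Addresses are copied from the slide; 4-byte array elements are an assumption.
LOOP_START = 0x020001F4
A_BASE, B_BASE, C_BASE = 0xBFFFBE24, 0xBFFFBC94, 0xBFFFBB04

# One (record type, array base) pair per instruction of the 9-instruction loop body.
# Types follow the trace above: 0 = data read, 1 = data write, 2 = instruction fetch.
LOOP_BODY = [
    (2, None), (0, A_BASE), (0, B_BASE), (2, None), (2, None),
    (2, None), (2, None), (1, C_BASE), (2, None),
]

def loop_trace(iterations=100):
    """Return (type, instruction address, data address or None) records."""
    records = []
    for i in range(iterations):
        for k, (rtype, base) in enumerate(LOOP_BODY):
            pc = LOOP_START + 4 * k
            da = None if base is None else base + 4 * i
            records.append((rtype, pc, da))
    return records

def format_record(record):
    """Render one record in the 'type address [data address]' layout."""
    rtype, pc, da = record
    text = f"{rtype} 0x{pc:08x}"
    return text if da is None else f"{text} 0x{da:08x}"

if __name__ == "__main__":
    for record in loop_trace(iterations=1):
        print(format_record(record))
```

Even this tiny loop produces 900 records for 100 iterations, which gives a feel for how quickly uncompressed traces grow.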



Trace Issues

  • Trace issues

    • Capture

    • Compression

    • Processing

  • Traces tend to be very large

    • In terabytes for a minute of program execution

    • Expensive to store, transfer, and use

  • Requirements for effective reduction techniques:

    • Lossless

    • High compression ratio

    • Fast decompression





Trace Compression

  • General purpose compression algorithms

    • Lempel-Ziv (gzip)

    • Burrows-Wheeler transform (bzip2)

    • Sequitur

  • Trace specific compression techniques

    • Tuned to exploit redundancy in traces

    • Better compression, faster, can be further combined with general-purpose compression algorithms
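To illustrate how a trace-specific step can be combined with a general-purpose compressor, here is a minimal sketch that delta-encodes a data address trace before handing it to zlib (the library behind gzip) and bz2. Delta encoding stands in for the trace-specific techniques and is not one of the methods proposed in this talk; the synthetic trace and the helper names are assumptions.

```python
import bz2
import struct
import zlib

def pack_addresses(addresses):
    """Serialize 32-bit addresses as little-endian 4-byte words."""
    return struct.pack(f"<{len(addresses)}I", *addresses)

def delta_encode(addresses):
    """Replace each address with its difference from the previous one (mod 2^32)."""
    prev, deltas = 0, []
    for addr in addresses:
        deltas.append((addr - prev) & 0xFFFFFFFF)
        prev = addr
    return deltas

def delta_decode(deltas):
    """Inverse of delta_encode: running sum modulo 2^32."""
    prev, addresses = 0, []
    for delta in deltas:
        prev = (prev + delta) & 0xFFFFFFFF
        addresses.append(prev)
    return addresses

# Hypothetical data address trace: one array walked with a 4-byte stride.
trace = [0xBFFFBE24 + 4 * i for i in range(10_000)]

raw = pack_addresses(trace)
pre = pack_addresses(delta_encode(trace))
sizes = {
    "zlib, raw": len(zlib.compress(raw)),
    "zlib, delta first": len(zlib.compress(pre)),
    "bz2, raw": len(bz2.compress(raw)),
    "bz2, delta first": len(bz2.compress(pre)),
}

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name}: {size} bytes")
```

The preprocessing turns a stream of distinct addresses into a stream of identical deltas, which is the kind of redundancy the general-purpose back end handles best.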


Trace-Specific Compression Techniques

Lossless compression, applied to instruction traces or to instruction + data traces:

  • Offset: Mache [Samples 1989], LBTC [Luo and John 2004]

  • Offset + repetitions: PDATS [Johnson, Ha and Zaidi 2001]

  • Link data addresses to dynamic basic block: [Pleszkun 1994], SBC [Milenkovic and Milenkovic, 2003]

  • Link data addresses to loop: [Elnozahy 1999], SIGMA [DeRose, et al. 2002]

  • Replacing an execution sequence with its identifier

    • Acyclic path (WPP [Larus 1999], Time Stamped WPP [Zhang and Gupta 2001])

    • N-tuple [Milenkovic, Milenkovic and Kulick 2003]

    • Instruction (PDI [Johnson, Ha and Zaidi 2001])

  • Control flow graph + trace of transitions: QPT [Larus 1993]

  • Regenerate addresses (abstract execution): [Eggers, et al. 1990], [Larus 1993]

  • Graph with number of repetitions in nodes: [Hamou-Lhadj and Lethbridge 2002]

  • Value predictor: VPC [Burtscher and Jeeradit 2003], TCGEN [Burtscher and Sam 2005]





Why Trace Compression in Hardware?

  • Problem #1: Capture program traces

    • In software: trap after each instruction or taken branch

      • E.g., IBM’s Performance Inspector

      • Slowdown > 100 times

    • Multiple cores on a single chip + more detailed information needed (e.g., time stamps of events)

  • Problem #2: Debugging is far from fun

    • Stop execution on breakpoints, examine the state

    • Time-consuming, difficult, may miss a critical state leading to erroneous behavior

    • Stopping the CPU may perturb the sequence of events making your bugs disappear

  • => Need an unobtrusive real-time tracing mechanism



Trace Compression in Hardware

  • Goals

    • Small on-chip area and small number of pins

    • Real-time compression (never stall the processor)

    • Achieve a good compression ratio

  • Solution

    • A set of compression algorithms targeting on-the-fly compression of instruction and data address traces


Exploiting Streams and Strides

  • Instruction address trace compression

    • Limited number and strong temporal locality of instruction streams

    • => Replace an instruction stream with its identifier

  • Data address trace compression

    • Spatial and temporal locality of data addresses

    • => Recognize regular strides


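Trace Compressor: System Overview

[Figure: block diagram. The system under test (processor core and memory) supplies the program counter (PC), data address (DA), and task-switch signals to the trace compressor. Inside the compressor, the Stream Cache (SC) produces the SCIT and SCMT traces, and SCIT is further reduced by a predictor and a byte-repetition FSM. A data address buffer feeds the Data Address Stride Cache (DASC), which produces the DT and DMT traces, and DT is further reduced by a byte-repetition FSM. A trace output controller drives the trace port to an external trace unit for storing/processing (a PC or an intelligent drive).]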




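Stream Detector + Stream Cache

[Figure: the stream detector compares the current PC with the previous PC (PPC). When PC - PPC != 4, the current instruction stream ends and its descriptor, start address S.SA and length S.SL, is placed in the instruction stream buffer. The Stream Cache (SC) is organized as NSET sets by NWAY ways. The set index is iSet = F(S.SA, S.SL), and the (SA, SL) pairs stored in that set are compared with the incoming descriptor to produce Hit/Miss. On a hit the index (iSet & iWay) goes to the Stream Cache Index Trace (SCIT); on a miss the reserved index ’00…0’ goes to SCIT and the descriptor (SA, SL) goes to the Stream Cache Miss Trace (SCMT).]

Example for the loop at 0x020001f4 ... 0x02000214, stream descriptor (0x020001f4, 0x09):

  • Iteration 0: SCIT = 0x00, SCMT = (0x020001f4, 0x09)

  • Iterations 1 to 99: SCIT = 0x0E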
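The stream detection step can be sketched in software as follows. This is a minimal model rather than the hardware design: it assumes fixed 4-byte instructions, and it splits a stream that would exceed 255 instructions so the length fits in one byte, which is an assumption. The function names are hypothetical.

```python
def detect_streams(pcs, instr_size=4, max_len=255):
    """Split a PC trace into (start address, length) stream descriptors.

    A stream ends when the next PC is not the sequential successor of the
    previous one (PC - PPC != instr_size) or when it reaches max_len.
    """
    streams = []
    if not pcs:
        return streams
    sa, sl = pcs[0], 1
    for ppc, pc in zip(pcs, pcs[1:]):
        if pc - ppc != instr_size or sl == max_len:
            streams.append((sa, sl))
            sa, sl = pc, 1
        else:
            sl += 1
    streams.append((sa, sl))  # flush the stream still being built
    return streams

def expand_streams(streams, instr_size=4):
    """Regenerate the PC trace from stream descriptors (decompression side)."""
    return [sa + instr_size * k for sa, sl in streams for k in range(sl)]

if __name__ == "__main__":
    body = [0x020001F4 + 4 * k for k in range(9)]  # the loop from the example
    for sa, sl in detect_streams(body * 3):
        print(f"(0x{sa:08x}, 0x{sl:02x})")
```

For the example loop, every iteration yields the same descriptor (0x020001f4, 0x09), which is what makes caching descriptors worthwhile.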


SC Itrace Compression

// Compress instruction stream
Get the next instruction stream record (S.SA, S.SL) from the instruction stream buffer;
Look up the stream cache with iSet = F(S.SA, S.SL);
if (hit)
    Emit (iSet & iWay) to SCIT;    // '&' is concatenation, not a logical AND
else {
    Emit reserved value 0 to SCIT;
    Emit stream descriptor (S.SA, S.SL) to SCMT;
    Select an entry (iWay) in set iSet to be replaced;
    Update stream cache entry:
        SC[iSet][iWay].Valid = 1;
        SC[iSet][iWay].SA = S.SA;
        SC[iSet][iWay].SL = S.SL;
}
Update stream cache replacement indicators;

Design Decisions:

  • Instruction stream buffer size

    • Large enough not to stall the processor (e.g., on consecutive very short instruction streams)

  • Stream cache

    • Size

    • Associativity

    • Replacement policy

    • Mapping function
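The compression step can be modeled in a few lines. This is a sketch under stated assumptions, not the hardware implementation: it uses FIFO replacement for simplicity, it never allocates the entry whose index would be 0 (so that 0 can signal a miss), and it borrows the mapping function reported in the findings later in the deck. The class and function names are hypothetical.

```python
class StreamCache:
    """Set-associative cache of (SA, SL) stream descriptors, FIFO replacement."""

    def __init__(self, nset=64, nway=4):
        # nset and nway are assumed to be powers of two
        self.nset, self.nway = nset, nway
        self.way_bits = nway.bit_length() - 1
        self.entries = [[None] * nway for _ in range(nset)]
        self.fifo = [0] * nset  # next way to replace in each set

    def set_index(self, sa, sl):
        # S.SA<5+n:6> xor S.SL<n-1:0>, with n = log2(NSET)
        return ((sa >> 6) ^ sl) & (self.nset - 1)

    def access(self, sa, sl):
        """Return (SCIT index, SCMT record or None) for one descriptor."""
        iset = self.set_index(sa, sl)
        ways = self.entries[iset]
        if (sa, sl) in ways:
            return (iset << self.way_bits) | ways.index((sa, sl)), None
        iway = self.fifo[iset]
        self.fifo[iset] = (iway + 1) % self.nway
        if (iset, iway) != (0, 0):  # index 0 is reserved to signal a miss
            ways[iway] = (sa, sl)
        return 0, (sa, sl)

def sc_compress(streams, nset=64, nway=4):
    """Turn stream descriptors into the SCIT (indices) and SCMT (misses) traces."""
    sc = StreamCache(nset, nway)
    scit, scmt = [], []
    for sa, sl in streams:
        index, miss = sc.access(sa, sl)
        scit.append(index)
        if miss is not None:
            scmt.append(miss)
    return scit, scmt

def sc_decompress(scit, scmt, nset=64, nway=4):
    """Rebuild the descriptors by replaying misses into an identical cache."""
    sc = StreamCache(nset, nway)
    misses = iter(scmt)
    streams = []
    for index in scit:
        if index == 0:
            sa, sl = next(misses)
            sc.access(sa, sl)  # keeps the decompressor cache in step
        else:
            sa, sl = sc.entries[index >> sc.way_bits][index & (nway - 1)]
        streams.append((sa, sl))
    return streams
```

With a direct-mapped 256-entry configuration (`nset=256, nway=1`), the loop stream (0x020001f4, 0x09) hits at index 0x0E, matching the value shown in the stream cache figure. FIFO replacement keeps the decompressor simple because hits do not change cache state; a pseudo-LRU policy would need the decompressor to replay hits as well.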



SC Itrace Compression: An Analytical Model

Legend:

  • CR(SC.I) – compression ratio

  • N – number of instructions

  • SL.Dyn – average stream length (dynamic)

  • SC.Hit(Nset,Nway) – SC hit rate

  • Assumptions:

    • Stream length < 256 (1 byte for SL)

    • 4 bytes for stream starting address
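The formula itself did not survive in this transcript. The following is a reconstruction from the legend and the two assumptions above, so it should be read as a plausible form rather than the slide's exact expression. It assumes an uncompressed trace of 4 bytes per instruction address, one SCIT index of log2(Nset·Nway) bits per stream, and a 5-byte SCMT record (4-byte start address plus 1-byte length) per miss.

```latex
\[
CR(SC.I)
  = \frac{4N}
         {\dfrac{N}{SL.Dyn}\left[\dfrac{\log_2(N_{set}\,N_{way})}{8}
           + 5\,\bigl(1 - SC.Hit(N_{set},N_{way})\bigr)\right]}
  = \frac{4\,SL.Dyn}
         {0.125\,\log_2(N_{set}\,N_{way}) + 5\,\bigl(1 - SC.Hit(N_{set},N_{way})\bigr)}
\]
```

As a purely illustrative example, an 8-bit index with a 98% hit rate costs 1 + 0.1 = 1.1 bytes per stream, so a hypothetical average stream length of 10 instructions would give a ratio of 40 / 1.1, or about 36.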



2nd Level Itrace Compression

  • Size(SCIT) >> Size(SCMT)

    • HitRate = 98%, 8-bit index => Size(SCIT) = 10*Size(SCMT)

  • Redundancy in SCIT

    • Temporal and spatial locality of instruction streams

  • Reduce SCIT trace

    • Global Predictor

    • N-tuple compression using Tuple History Table

    • N-tuple compression using SCIT History Buffer


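Global Predictor Structure

[Figure: indices from the SCIT trace are shifted into a History Buffer. A function F of the buffered indices produces pindex, which selects one of MaxP predictor entries (0 to MaxP-1). The selected entry is compared with the incoming index next.sid to produce Hit/Miss: a ’1’ (hit) or ’0’ (miss) goes to the SCIT PRED Trace, and on a miss next.sid goes to the SCIT PRED Miss Trace.]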


SCIT Compression

// Predict SCIT index
Get the incoming index, next.sid, from the SCIT trace;
Calculate the SCIT predictor index from the indices in the History Buffer: pindex = F(indices in the History Buffer);
Perform lookup in the SCIT Predictor with pindex;
if (SCIT.Predictor[pindex] == next.sid)
    Emit(‘1’) to SCIT PRED trace;
else {
    Emit(‘0’) to SCIT PRED trace;
    Emit next.sid to SCIT PRED Miss trace;
    SCIT.Predictor[pindex] = next.sid;
}
Shift next.sid into the History Buffer;

Design Decisions:

  • Length of history buffer

  • Global predictor

    • Size

    • Mapping function
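The predictor loop can be sketched as below. The sketch assumes a history length of 1, so the predictor is indexed directly by the previous SCIT index, with a zero-initialized table; the function names and default table size are hypothetical.

```python
def gpred_compress(scit, size=256):
    """Split a SCIT index stream into hit/miss bits and a miss trace.

    History length is 1: the previous index (modulo the table size)
    selects the predictor entry.
    """
    table = [0] * size
    prev = 0                      # one-entry history buffer
    bits, misses = [], []
    for sid in scit:
        pindex = prev % size
        if table[pindex] == sid:
            bits.append(1)
        else:
            bits.append(0)
            misses.append(sid)
            table[pindex] = sid
        prev = sid                # shift next.sid into the history
    return bits, misses

def gpred_decompress(bits, misses, size=256):
    """Rebuild the SCIT index stream by mirroring the predictor updates."""
    table = [0] * size
    prev = 0
    pending = iter(misses)
    scit = []
    for bit in bits:
        pindex = prev % size
        if bit:
            sid = table[pindex]
        else:
            sid = next(pending)
            table[pindex] = sid
        scit.append(sid)
        prev = sid
    return scit

if __name__ == "__main__":
    bits, misses = gpred_compress([0] + [56] * 99)
    print(sum(bits), "hits,", len(misses), "misses")
```

For a loop that repeats one stream, the predictor misses only while it learns the transition, after which every index costs a single bit.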


Redundancy in SCIT Pred Trace

  • High predictor hit rates, and therefore long runs of 0xFF bytes (eight consecutive hits per byte), are expected in the SCIT PRED trace

  • Use a simple FSM to exploit byte repetitions

// Detect byte repetitions in SCIT PRED trace
Get next SCIT PRED byte, Next.BYTE;
if (Next.BYTE == Prev.BYTE) CNT++;
else {
    if (CNT == 0) {
        Emit Prev.BYTE to SCIT PRED Repetition Trace;
        Emit ‘0’ to SCIT PRED Header;
    } else {
        Emit (Prev.BYTE, CNT) pair to SCIT PRED Repetition Trace;
        Emit ‘1’ to SCIT PRED Header;
    }
    Prev.BYTE = Next.BYTE;
    CNT = 0;    // restart the repetition count for the new byte
}

[Figure: FSM datapath. The incoming SCIT PRED byte is compared with Prev.BYTE, a counter CNT tracks repetitions, and the outputs are the SCIT PRED Repetition Trace and the SCIT PRED Header.]
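The byte-repetition FSM amounts to a small run-length coder with a separate one-bit header stream. The sketch below follows the pseudocode and adds two things it leaves open, both assumptions: the counter is capped at 255 so it fits in one byte, and the last pending byte is flushed at the end of the input. The function names are hypothetical.

```python
def brep_encode(data, max_cnt=255):
    """Encode bytes as (header bits, repetition trace).

    Header bit 0: one literal byte follows in the repetition trace.
    Header bit 1: a (byte, CNT) pair follows, standing for CNT + 1 copies.
    """
    header, rep = [], bytearray()

    def flush(byte, cnt):
        if cnt == 0:
            rep.append(byte)
            header.append(0)
        else:
            rep.extend((byte, cnt))
            header.append(1)

    prev, cnt = None, 0
    for byte in data:
        if prev is None:
            prev = byte
        elif byte == prev and cnt < max_cnt:
            cnt += 1
        else:
            flush(prev, cnt)
            prev, cnt = byte, 0
    if prev is not None:
        flush(prev, cnt)  # emit whatever is still pending
    return header, bytes(rep)

def brep_decode(header, rep):
    """Inverse of brep_encode."""
    out = bytearray()
    pos = 0
    for flag in header:
        byte = rep[pos]
        if flag:
            out.extend([byte] * (rep[pos + 1] + 1))
            pos += 2
        else:
            out.append(byte)
            pos += 1
    return bytes(out)

if __name__ == "__main__":
    header, rep = brep_encode(b"\xff" * 300 + b"\x7f" + b"\xff\xff")
    print(header, rep.hex())
```

A run of 300 0xFF bytes shrinks to two (byte, CNT) pairs here, which is the effect the slide relies on when predictor hit rates are high.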





Data Address Trace Compression

  • More challenging task

  • Data addresses rarely stay constant during program execution

  • However, they often have a regular stride

  • => Use Data Address Stride Cache (DASC) to exploit locality of memory referencing instructions and regularity in data address strides


Data Address Stride Cache (DASC)

  • Tagless structure

  • Indexed by PC of the corresponding instruction

  • Entry fields

    • LDA – Last Data Address

    • Stride

[Figure: DASC datapath. The PC selects an entry through index = G(PC) in a table of N entries (0 to N - 1). The incoming data address DA minus the entry's LDA is compared with the entry's Stride to produce Stride.Hit. A ’1’ (hit) or ’0’ (miss) goes to the data trace (DT), and on a miss DA goes to the Data Miss Trace (DMT). Example values shown: PC 0x020001f8 with data addresses 0xbfffbe24, 0xbfffbe20, and 0xbfffbe1c.]


DASC Compression

// Compress data address stream
Get the next (PC, DA) pair from the data address buffer;
Look up the data address stride cache: iSet = G(PC);
cStride = DA - DASC[iSet].LDA;
if (cStride == DASC[iSet].Stride) {
    Emit(‘1’) to DT;    // 1-bit info
} else {
    Emit(‘0’) to DT;
    Emit DA to DMT;
    DASC[iSet].Stride = lsb(cStride);
}
DASC[iSet].LDA = DA;

Design Decisions:

  • Number of entries

  • Index function G

  • Stride length

  • Data address buffer depth
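A software model of the DASC loop, together with a matching decompressor, might look like this. It is a sketch under assumptions: the index function G(PC) is taken to be the word address modulo the number of entries, the stored stride is the sign-extended low byte of the computed stride, and the decompressor is given the PCs of the memory-referencing instructions (in practice recovered from the instruction trace). All names are hypothetical.

```python
def sign_extend8(value):
    """Keep the least significant byte of a stride as a signed 8-bit value."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value

class DASC:
    """Tagless stride cache: one (LDA, Stride) pair per entry."""

    def __init__(self, entries=1024):
        self.n = entries
        self.lda = [0] * entries
        self.stride = [0] * entries

    def index(self, pc):
        return (pc >> 2) % self.n  # hypothetical G(PC)

def dasc_compress(refs, entries=1024):
    """refs is a list of (PC, DA) pairs; returns (DT bits, DMT addresses)."""
    dasc = DASC(entries)
    dt, dmt = [], []
    for pc, da in refs:
        i = dasc.index(pc)
        c_stride = da - dasc.lda[i]
        if c_stride == dasc.stride[i]:
            dt.append(1)
        else:
            dt.append(0)
            dmt.append(da)
            dasc.stride[i] = sign_extend8(c_stride)
        dasc.lda[i] = da
    return dt, dmt

def dasc_decompress(pcs, dt, dmt, entries=1024):
    """Rebuild data addresses from the PCs, the DT bits, and the DMT."""
    dasc = DASC(entries)
    misses = iter(dmt)
    addresses = []
    for pc, bit in zip(pcs, dt):
        i = dasc.index(pc)
        if bit:
            da = dasc.lda[i] + dasc.stride[i]
        else:
            da = next(misses)
            dasc.stride[i] = sign_extend8(da - dasc.lda[i])
        dasc.lda[i] = da
        addresses.append(da)
    return addresses

if __name__ == "__main__":
    # The three memory instructions of the example loop, 4-byte elements assumed.
    bases = {0x020001F8: 0xBFFFBE24, 0x020001FC: 0xBFFFBC94, 0x02000210: 0xBFFFBB04}
    refs = [(pc, base + 4 * i) for i in range(100) for pc, base in bases.items()]
    dt, dmt = dasc_compress(refs)
    print(len(dmt), "addresses in DMT,", sum(dt), "one-bit hits")
```

For the 300 references of the example loop, only five full addresses reach the DMT in this model; everything else is a single hit bit in DT.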



DASC Dtrace Compression: An Analytical Model

Legend:

  • CR(SC.D) – compression ratio

  • Nmemref – number of memory referencing instructions

  • DASC.Hit – DASC hit rate

  • Assumptions:

    • 4 bytes for a data address
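As with the instruction-side model, the formula is missing from this transcript. A reconstruction consistent with the legend and with the DASC algorithm (one DT bit per memory reference, plus a 4-byte address in the DMT on each miss) would be the following; treat it as an assumed form, not the slide's exact expression.

```latex
\[
CR(SC.D)
  = \frac{4\,N_{memref}}
         {N_{memref}\left[\dfrac{1}{8} + 4\,\bigl(1 - DASC.Hit\bigr)\right]}
  = \frac{4}{0.125 + 4\,\bigl(1 - DASC.Hit\bigr)}
\]
```

For instance, a hypothetical hit rate of 90% would give 4 / (0.125 + 0.4), or roughly 7.6.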


Redundancy in DT Trace

  • High hit rates, and therefore long runs of 0xFF bytes (eight consecutive hits per byte), are expected in the DT trace

  • Use a simple FSM to exploit byte repetitions

// Detect data repetitions
Get next DT byte, DT;
if (DT == Prev.DT) CNT++;
else {
    if (CNT == 0) {
        Emit Prev.DT to DRT;
        Emit ‘0’ to DH;
    } else {
        Emit (Prev.DT, CNT) pair to DRT;
        Emit ‘1’ to DH;
    }
    Prev.DT = DT;
    CNT = 0;    // restart the repetition count for the new byte
}

[Figure: FSM datapath. The incoming DT byte is compared with Prev.DT, a counter CNT tracks repetitions, and the outputs are the Data Repetition Trace (DRT) and the Data Header (DH).]





Experimental Evaluation

  • Goals

    • Assess the effectiveness of the proposed algorithms

    • Explore the feasibility of the proposed hardware implementations

    • Determine optimal size and organization of HW structures

  • Workload

    • 16 MiBench benchmarks

    • ARM architecture

  • Legend:

  • IC – Instruction count

  • NUS – Number of unique instruction streams

  • maxSL – Maximum stream length

  • SL.Dyn – Average stream length (dynamic)



Findings about SC Size/Organization

  • Good compression ratio

    • Outperforms fast GZIP

    • High stream cache hit rates for all applications (> 98%)

    • Smaller SCs work well too

  • Replacement policy

    • Pseudo-LRU vs. FIFO

  • Associativity

    • 4-way is a reasonable choice

    • 8-way and 16-way desirable

  • Mapping function

    • S.SA<5+n:6> xor S.SL<n-1:0>, where n = log2(NSET)
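The mapping function reads as a bit-field XOR: n bits of the stream start address, starting at bit 6, are combined with the low n bits of the stream length. A small sketch, assuming NSET is a power of two (the function name is hypothetical):

```python
def sc_set_index(sa, sl, nset):
    """Set index = S.SA<5+n:6> xor S.SL<n-1:0>, with n = log2(NSET)."""
    n = nset.bit_length() - 1      # log2 of a power-of-two NSET
    mask = (1 << n) - 1
    sa_field = (sa >> 6) & mask    # bits 6 .. 5+n of the start address
    sl_field = sl & mask           # bits 0 .. n-1 of the stream length
    return sa_field ^ sl_field

if __name__ == "__main__":
    for nset in (32, 64, 256):
        print(nset, hex(sc_set_index(0x020001F4, 0x09, nset)))
```

For the example stream (0x020001f4, 0x09) this yields set 0x0E for 32, 64, and 256 sets alike, the same value that appears in the stream cache figure.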



Findings about Global Predictor

  • Number of entries should not exceed the number of entries in SC

    • Having longer histories and larger predictors gives only marginal improvements for all applications except ghostscript, blowfish, and stringsearch

  • History length = 1

    • Index GPRED using the previous SCIT index



Putting It All Together (SC+GPRED+BREP): Itrace Compression



Findings about DASC

  • Stride size

    • 1 byte is optimal

    • 2 byte stride improves compression for  10%

  • DASC with 1K entries is an optimal choice

  • Tagged (multi-way) DASC further improves overall compression ratio

    • Increased complexity



DASC Compression Ratio



Hardware Complexity Estimation

  • CPU model

    • In-order, XScale-like

    • Vary SC and DASC parameters

  • SC and DASC timings

    • SC: Hit latency = 1 clock, Miss latency = 2 clocks

    • DASC: Hit latency = 2 clocks, Miss latency = 2 clocks

  • To avoid any stalls

    • Instruction stream input buffer: MIN = 2 entries

    • Data address input buffer: MIN = 8 entries

    • Results are relatively independent of SC and DASC organization



Trace Port Bandwidth Analysis





Conclusions

  • A set of algorithms and hardware structures for instruction and data address trace compression

    • Stream Caches + Global Predictor + Byte repetition FSM for instruction traces

    • Data Address Stride Cache + Byte repetition FSM for data traces

  • Benefits

    • Enabling real-time trace compression with high compression ratio

    • Low complexity (small structures, small number of external pins)

  • Analytical & simulation analysis focusing on compression ratio and optimal sizing/organization of the structures as well as real-time trace port bandwidth requirements



Laboratory for Advanced Computer Architectures and Systems at Alabama: Research Overview

Aleksandar Milenković

The LaCASA Laboratory

Electrical and Computer Engineering Department

The University of Alabama in Huntsville

Email: [email protected]

Web: http://www.ece.uah.edu/~milenka

http://www.ece.uah.edu/~lacasa


Secure Processors

  • Computer Security is Critical

    • Software & physical attacks

  • Sign & Verify for Guaranteed Integrity and Confidentiality of Code

  • Improvements

    • PMAC (Parallel MACs) for reduced cryptographic latency

    • A variation of the one-time-pad for code encryption

    • Instruction Verification Buffer for conditional execution before verification

http://www.ece.uah.edu/~lacasa/research.htm#secure_processors


Microbenchmarks for Architectural Analysis

  • Microbenchmarks

    • Small programs for uncovering architectural parameters (usually not publicly disclosed) of modern processors

    • Relatively simple, so their behavior can be understood

  • Benefits

    • Architecture-aware compiler optimization

    • Processor design evaluation and verification

    • Testing

    • Competitive analysis

  • Results

    • Microbenchmarks for BTB analysis

    • Experimental flow for outcome predictor

    • Tested on P6 and NetBurst (Northwood core)

  • Challenge

    • Dothan (Pentium M) predictor

[Figure: experimental flow. Microbenchmarks target the BTB (size, organization, indexing) and the outcome predictor (local history, global history); branch-related events are observed through performance counters.]

http://www.ece.uah.edu/~lacasa/bp_mbs/bp_microbench.htm



TinyHMS

Prototype

Concept

Software

http://www.ece.uah.edu/~lacasa/research.htm#tinyHMS


TinyHMS

[Figure: timing diagram. An ECG sensor (TS1) detects heart beats and a motion sensor (TS2) detects steps. Each frame (Frame i-1, Frame i) starts with a beacon message and contains slots for NC, TS1, TS2, and TS3; sensors report events as event messages with timestamps.]

