Architectural Support for Security in the Many-core Age: Threats and Opportunities

Architectural Support for Security in the Many-core Age: Threats and Opportunities

Dr. Nael Abu-Ghazaleh and Dr. Dmitry Ponomarev

Department of Computer Science

SUNY-Binghamton

{nael,dima}@cs.binghamton.edu


Multi-cores --> Many-cores

  • Moore's law coming to an end

    • Power wall; ILP wall; memory wall

    • “End of lazy-boy programming era”

  • Multi-cores offer a way out

    • New Moore's law: 2x number of cores every 1.5 years

    • The many-core era is about to get started

    • We will have more cores than we can power at once -> likely to see many accelerators, including ones for security

  • How to best support trusted computing?

  • Critical to anticipate and defuse security threats


Security Challenges for Many-cores

  • Diverse applications, both parallel and sequential

  • New vulnerabilities due to resource sharing

    • Side-Channel and Denial-of-Service Attacks

  • Performance impact is a critical consideration

    • Can use spare cores/thread contexts to accelerate security mechanisms

    • Can use speculative checks to lower latency


Defending against Attacks on Shared Resources


Attacks on Shared Resources

  • Resource sharing (specifically the sharing of the cache hierarchy) opens the door for two types of attacks

    • Side-Channel Attacks

    • Denial-of-Service Attacks

  • Our first target: software cache-based side channel attacks.

    • First, some cache background...


Background: Set-Associative Caches


L1 Cache Sharing in SMT Processor

[Figure: SMT processor datapath. Private per-thread resources: PCs, re-order buffers, architectural state. Shared resources: fetch unit, instruction cache, decode, register rename, issue queue, register file, execution units, load/store queues, LDST units, data cache.]


Last-level Cache Sharing on Multicores (Intel Xeon)

2 × quad-core Intel Xeon E5345 (Clovertown)


Advanced Encryption Standard (AES)

  • One of the most popular algorithms in symmetric key cryptography

    • 16-byte input (plaintext)

    • 16-byte output (ciphertext)

    • 16-byte secret key (for standard 128-bit encryption)

    • Several rounds of 16 XOR operations and 16 table lookups

[Figure: first-round AES lookup — an input byte XORed with a secret key byte forms the index into a lookup table.]
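As a concrete sketch of why these lookups leak: the first-round table index is just an input byte XORed with a key byte, so learning which cache line of the table the victim touched reveals the high bits of that key byte. The line and entry sizes below are assumptions for illustration, not values from the talk:

```c
#include <stdint.h>

#define ENTRIES_PER_LINE 16  /* 64-byte line / 4-byte table entry (assumed) */

/* Index into the first-round lookup table: input byte XOR key byte. */
uint8_t first_round_index(uint8_t pt, uint8_t key) {
    return pt ^ key;
}

/* If the attacker observes which table line the victim touched, the high
 * bits of the key byte follow directly (since index = pt ^ key). */
uint8_t key_high_bits(uint8_t pt, unsigned observed_line) {
    return (uint8_t)(((observed_line * ENTRIES_PER_LINE) ^ pt) & 0xF0);
}
```

With a known plaintext byte, each observed line number eliminates all key bytes whose high nibble disagrees.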


Cache-Based Side Channel Attacks

  • An attacker and a victim process (e.g. AES) run together using a shared cache

  • Access-Driven Attack:

    • Attacker occupies the cache, evicting victim’s data

    • When victim accesses cache, attacker’s data is evicted

    • By timing its accesses, attacker can detect intervening accesses by the victim

  • Time-Driven Attack

    • Attacker fills the cache

    • Times victim’s execution for various inputs

    • Performs correlation analysis


Attack Example

[Figure: the attacker's data occupies cache lines a, b, c, d; after the victim (AES) runs, re-accessing line b is slow while a, c, d are fast (b > a ≈ c ≈ d), revealing that the victim touched the set holding b.]

Can exploit knowledge of the cache replacement policy to optimize attack


Simple Attack Code Example

#define ASSOC 8
#define NSETS 128
#define LINESIZE 32
#define ARRAYSIZE (ASSOC*NSETS*LINESIZE/sizeof(int))

static int the_array[ARRAYSIZE];

int fine_grain_timer(); // implemented as inline assembler

void time_cache() {
    register int i, time, x;
    for (i = 0; i < ARRAYSIZE; i++) {
        time = fine_grain_timer();
        x = the_array[i];                 // probe one word
        time = fine_grain_timer() - time; // latency reveals hit vs. miss
        the_array[i] = time;              // record the access time in place
    }
}
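A hedged sketch of how the attacker might interpret these timings (the threshold and helper are our illustration, not from the slides): after `time_cache()` runs, any probed line whose latency exceeds a miss threshold was evicted, implicating the set it maps to.

```c
#define ASSOC 8
#define NSETS 128
#define MISS_THRESHOLD 40  /* cycles; machine-dependent, illustrative only */

/* times[] holds one latency per probed cache line, in probe order: line k
 * of the array maps to set (k % NSETS), so the ASSOC lines of a given set
 * sit at stride NSETS. A slow line means the victim evicted it. */
int set_was_touched(const int *times, int set) {
    for (int way = 0; way < ASSOC; way++)
        if (times[way * NSETS + set] > MISS_THRESHOLD)
            return 1;  /* a line in this set missed: victim touched the set */
    return 0;
}
```

Repeating this per encryption and correlating touched sets against known plaintexts recovers the access-driven channel described above.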


Existing Solutions

  • Avoid using pre-computed tables – too slow

  • Lock critical data in the cache (Lee, ISCA 07)

    • Impacts performance

    • Requires OS/ISA support for identifying critical data

  • Randomize the victim selection (Lee, ISCA 07)

    • Significant cache re-engineering -> impractical

    • High complexity

    • Requires OS/ISA support to limit the extent to critical data only


Desired Features and Our Proposal

  • Desired solution:

    • Hardware-only (no OS, ISA or language support)

    • Low performance impact

    • Low complexity

    • Strong security guarantee

    • Ability to simultaneously protect against denial-of-service (a by-product of access-driven attack)

  • Our solution: Non-Monopolizable (NoMo) Caches


NoMo Caches

  • Key idea: An application sharing cache cannot use all lines in a set

  • NoMo invariant: For an N-way cache, a thread can use at most N – Y lines

    • Y – NoMo degree

    • Essentially, we reserve Y cache ways for each co-executing application and dynamically share the rest

    • If Y=N/2, we have static non-overlapping cache partitioning

    • Implementation is very simple – just need to check the reservation bits at the time of replacement
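A minimal sketch of what that replacement-time check could look like, assuming a two-thread SMT and a per-way owner bit (purely illustrative, not the actual RTL):

```c
#include <stdbool.h>

#define NWAYS 8   /* cache associativity */
#define NOMO_Y 2  /* NoMo degree: ways reserved for the co-running thread */

/* NoMo invariant: in an N-way set a thread may hold at most N - Y lines.
 * At replacement time, if the inserting thread is already at its quota,
 * the victim must be one of its own lines; otherwise normal replacement
 * (e.g. LRU) applies. owner[w] is the thread id holding way w. */
bool victim_ok(const int owner[NWAYS], int victim_way, int thread) {
    int mine = 0;
    for (int w = 0; w < NWAYS; w++)
        if (owner[w] == thread)
            mine++;
    if (mine >= NWAYS - NOMO_Y)              /* quota reached */
        return owner[victim_way] == thread;  /* must self-evict */
    return true;
}
```

With Y = N/2 the check degenerates to static partitioning, matching the slide's observation.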


NoMo Replacement Logic


NoMo Example for an 8-way Cache

  • Showing 4 lines of an 8-way cache with NoMo-2

  • X:N means data X from thread N


Why Does NoMo Work?

  • Victim’s accesses become visible to attacker only if the victim has accesses outside of its allocated partition between two cache fills by the attacker.

  • In this example: NoMo-1


Evaluation Methodology

  • We used the M-Sim 3.0 cycle-accurate simulator (a multithreaded and multicore derivative of SimpleScalar) developed at SUNY Binghamton

    • http://www.cs.binghamton.edu/~msim

  • Evaluated security for AES and Blowfish encryption/decryption

  • Ran security benchmarks for 3M blocks of randomly generated input

  • Implemented the attacker as a separate thread and ran it alongside the crypto processes

  • Assumed that the attacker is able to synchronize at the block encryption boundaries (i.e., it fills the cache after each block encryption and checks the cache after the encryption)

  • Evaluated performance on a set of SPEC 2006 Benchmarks. Used Pin-based trace-driven simulator with Pintools.


Aggregate Exposure of Critical Data


Sets with Critical Exposure




NoMo Design Summary

  • Practical and low-overhead hardware-only design for defeating access-driven cache-based side channel attacks

  • Can easily adjust security-performance trade-offs by manipulating degree of NoMo

  • Can support unrestricted cache usage in single-threaded mode

  • Performance impact is very low in all cases

  • No OS or ISA support required


NoMo Results Summary

(for an 8-way L1 cache)

  • NoMo-4 (static partitioning): complete application isolation with 1.2% average (5% max) performance and fairness impact on SPEC 2006 benchmarks

  • NoMo-3: No side channel for AES, and 0.07% critical leakage for Blowfish. 0.8% average (4% max) performance impact on SPEC 2006 benchmarks

  • NoMo-2: Leaks 0.6% of critical accesses for AES and 1.6% for Blowfish. 0.5% average (3% max) performance impact on SPEC 2006 benchmarks

  • NoMo-1: Leaks 15% of critical accesses for AES and 18% for Blowfish. 0.3% average (2% max) performance impact on SPEC 2006 benchmarks


Extending NoMo to Last-level Caches

  • Side-channel attacks are possible at the L2/L3 level, especially with cache hierarchies that explicitly guarantee inclusion

  • Attacker can invalidate victim’s lines in L2/L3, thus forcing their evictions from private L1s.

  • Effect of partitioning is much more profound at that level.

  • Have to address the possibility of a multithreaded attack.

  • Examine other designs for protecting L2/L3 caches. Latency is less critical there.

  • Investigations in progress...



Using Extra Cores/Threads for Security

  • Main opportunity: using extra cores and core extensions to support security:

    • Improve performance by offloading security-related computations

    • Reduce design complexity

  • Applications that we consider:

    • Dynamic Information Flow Tracking (DIFT)

    • Dynamic Bounds Checking (not covered in this talk)


Dynamic Information Flow Tracking (DIFT)

  • Basic Idea:

    • Attacks come from outside of processor

    • Mark data coming from the outside as tainted

    • Propagate taint inside processor during program execution

    • Flag the use of tainted data in unsafe ways


Security Checking Policies

  • Memory address AND data tainted

  • Load address is tainted

  • Store address is tainted

  • Jump destination is tainted*

  • Branch condition is tainted*

  • System call arguments are tainted*

  • Return address register is tainted

  • Stack pointer is tainted

  • Memory address OR data tainted*
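A minimal software sketch of taint propagation plus one of the checks above (illustrative; the designs in this talk do this in hardware or via compiler-inserted instructions, not in C):

```c
#include <stdbool.h>

#define NREGS 32
static bool taint[NREGS];  /* one taint bit per architectural register */

/* add rd, rs1, rs2: the result is tainted if either source is. */
void propagate_add(int rd, int rs1, int rs2) {
    taint[rd] = taint[rs1] || taint[rs2];
}

/* Policy check: a jump through a tainted register is unsafe. Returns
 * false when a security exception should be raised. */
bool check_jump(int rs) {
    return !taint[rs];
}
```

Loads and stores extend the same idea with taint bits on memory, which is where the checking policies above differ (address tainted, data tainted, or either).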


Existing DIFT Schemes and Limitations

  • Hardware solutions:

    • Taint propagation with extra busses

    • Additional checking units

      Limitations

    • Intrusive changes to datapath design

  • Software solutions:

    • More instructions to propagate and check taint

      Limitations

    • High performance cost

    • Source code recompilation


Hardware-based DIFT

[Figure: hardware-based DIFT datapath executing add r3,r1,r4 — a taint bit accompanies each value through decode, the register file, the ALU, memory, and writeback; taint computation logic ORs the source taints into the destination, and exception checking logic flags unsafe uses of tainted data.]


Software-based DIFT

[Figure: software-based DIFT — the compiler expands add r3 r1 r4 into the original instruction plus shift/and/or instructions that propagate taint. A small region of memory stores taint information for memory, and register r5 is reserved to hold taint information for the rest of the register file.]


Our Proposal: SIFT (SMT-based DIFT)

  • Execute two threads on SMT processor

    • Primary thread executes real program

    • Security thread executes taint tracking instructions

  • Committed instructions from main thread generate taint checking instruction(s) for security thread

  • Instruction generation is done in hardware

  • Taint tracking instructions are stored into a buffer from where they are fed to the second thread context

  • Threads are synchronized at system calls
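A sketch of the per-instruction translation, assuming (as in the SIFT example slide) that an `add` in the primary thread maps to an `or` over the corresponding taint registers; the struct layout and opcode strings are hypothetical stand-ins for the hardware's decoded form:

```c
#include <string.h>

/* Hypothetical decoded-instruction record; field names are ours. */
typedef struct { char op[8]; int rd, rs1, rs2; } insn_t;

/* One committed primary-thread instruction yields one taint-tracking
 * instruction for the security thread: arithmetic becomes an OR of the
 * source taints into the destination taint (register numbers are reused
 * to index the taint register file in the second context). */
insn_t gen_taint_insn(insn_t committed) {
    insn_t t = committed;
    if (strcmp(committed.op, "add") == 0 || strcmp(committed.op, "sub") == 0)
        strcpy(t.op, "or");
    return t;
}
```

In SIFT this generation happens in hardware at commit, so neither the compiler nor the user is involved.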


Instruction Flow in SIFT


SMT Datapath with SIFT Logic


SIFT Example

[Figure: SIFT example on an SMT datapath — context 1 commits add r3 r1 r4; the SIFT generator emits the taint-tracking instruction or r3 r1 r4 into IFQ 2, and context 2 executes it against the taint register file (RF 2). The two contexts share the pipeline's fetch, decode, execution, and memory resources.]


SIFT Instruction Generation Logic

1. Taint code generation

2. Security instruction opcodes are read from the COT

3. The rest of the instruction is taken from the Register Organizer and stored in the Instruction Buffer

4. Load and store instructions' memory addresses are stored in the Address Buffer


Die Floorplan with SIFT Logic

  • SUN T1 Open Source Core

  • IGL synthesized with Synopsys Design Compiler using a TSMC 90nm standard cell library

  • COT, IB and AB implemented using Cadence Virtuoso

  • The integrated processor netlist was placed and routed using Cadence SoC Encounter

  • Costs 4.5% of the whole processor area


Benefits of Taint Checking with SMT

  • Software is not involved, transparent to user and applications (although the checking code can also be generated in software)

  • Hardware instruction generation is faster than software generation

  • Additional hardware is at the back end of the pipeline, so it is not on the critical path

  • No inter-core communication




SIFT Performance Optimizations

  • Reduce the number of checking instructions by eliminating the ones that never change the taint state.

  • Reduce data dependencies in the checker by preloading taint values into its cache once the main program encounters the corresponding address

  • Reduce the number of security instructions depending on taint state of registers and TLB
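One reading of the first optimization, sketched as a filter predicate (our interpretation of the SIFT-F idea, not the exact hardware rule): a generated checking instruction that merely recomputes a register's own taint can never change taint state and can be dropped.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical decoded-instruction record; field names are ours. */
typedef struct { char op[8]; int rd, rs1, rs2; } insn_t;

/* A checking instruction whose destination and both sources are the same
 * register recomputes taint(r) = taint(r) | taint(r): a no-op that can be
 * filtered out before it reaches the security thread. */
bool is_redundant(insn_t t) {
    return strcmp(t.op, "or") == 0 && t.rd == t.rs1 && t.rd == t.rs2;
}
```

Filtering such no-ops shrinks the security thread's instruction stream without losing any taint information.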


Eliminating Checking Instructions

[Table: three side-by-side Alpha instruction streams — the primary thread's code, the SIFT security thread (one taint-tracking instruction generated per committed instruction), and the SIFT-F security thread, in which checking instructions that can never change the taint state have been filtered out.]



Performance Impact of Eliminating Security Instructions

Performance Loss of SIFT and SIFT-F compared to Baseline Single Thread execution

Percentage of Filtered Instructions





Details of TLB-Based Optimization

  • For Stores

  • tainted register -> clean page – generate instructions

  • tainted register -> tainted page – generate instructions

  • clean register -> clean page – don't generate instructions

  • clean register -> tainted page – generate instructions

  • For Loads

  • tainted page -> tainted register – generate instructions

  • tainted page -> clean register – generate instructions

  • clean page -> tainted register – generate instructions

  • clean page -> clean register – don't generate instructions
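The table above collapses to a single rule, sketched here as a predicate (a direct restatement of the slide, with our naming):

```c
#include <stdbool.h>

/* TLB-based filtering: generate a taint-tracking instruction unless both
 * the register and the page involved in the load/store are clean. */
bool must_generate(bool reg_tainted, bool page_tainted) {
    return reg_tainted || page_tainted;
}
```

Since most pages and registers stay clean in typical runs, the clean/clean case dominates and most checking instructions are skipped.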




Future Work

  • To consider in the future:

  • Collapse multiple checking instructions via ISA

  • Optimize resource sharing between two threads

  • Provide additional execution units for the checker

  • Implementation of Register Taint Vector and TLB based instruction elimination


Thank you! Any Questions?

