

  1. Number seven of a series Drinking from the Firehose Defense against malice and error - security and reliability in the Mill™ CPU Architecture Naughty, naughty! Bad program, mustn’t do that!

  2. Talks in this series: Encoding; The Belt; Memory; Prediction; Metadata and speculation; Execution; Security and reliability (you are here); Specification; Software pipelines. Slides and videos of other talks are at: http://ootbcomp.com/docs

  3. The Mill CPU The Mill is a new general-purpose commercial CPU family. The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite. This talk will explain: the Mill memory and security models; how calls can cross security boundaries safely; how to replace task switches – and save >1000X; and how to make most exploits impossible (not all, mind you!).

  4. Caution! Gross over-simplification! This talk tries to convey an intuitive understanding to the non-specialist. The reality is more complicated. (we try not to over-simplify, but sometimes…)

  5. Motivating example – buggy drivers Device drivers need access to special parts of memory to make the device work – MMIO, on-device buffers, etc. They shouldn’t have access to the OS or application state. Ideally, each driver should be its own process, with relevant device-specific memory regions mapped in. [Figure: application, OS, driver, and device as separate boxes] Clean, simple – and too expensive.

  6. Mechanism vs. policy This talk is about mechanism – how Mill security works. It is not about policy – how the mechanism is used. The Mill is a general-purpose CPU architecture. It is not a Unix machine. It is not a Windows, …, machine. It is not a C machine. It is not a Java, …, machine. It is a platform in which each of those can implement their own security model. To the extent that they have one.

  7. Some philosophy Security must be unobtrusive, unavoidable, and cheap or it won’t be used.

  8. Some philosophy All must have equal security, none more equal than others No pigs on this farm.

  9. The Mill protection model You can see only what I give you I can see only what you give me Fast, cheap, no third-parties

  10. What about the OS? The operating system is an application - like any other. There are no privileged operations. There is no Supervisor Mode. All protection is by memory address. Byte address.

  11. A review: protection vs. translation – no longer coupled.

  12. Memory hierarchy from 40,000 ft. [Figure: CPU core (decode, retire stations, load/store FUs); Harvard level-1 caches (I$0e/I$0f and I$1e/I$1f behind the iPLB, D$1 behind the dPLB); shared level-2 cache (L$2); TLB; then device controllers, MMIO, DRAM, ROM, and devices] The view is representative; the actual hierarchy is configured in each chip specification. The Mill uses virtual caching and the single address space model.

  13. Memory hierarchy from 40,000 ft. [Same figure, annotated: everything from the retire stations and load/store FUs through the level-1 and level-2 caches and the PLBs operates on virtual addresses; the TLB translates, and everything below it – device controllers, MMIO, DRAM, ROM, devices – operates on physical addresses] The Mill uses virtual caching and the single address space model.

  14. Memory model Traditional: program addresses must be translated to physical addresses before being looked up in cache, so the TLB (translation and protection, which may fault) sits as a bottleneck between the CPU’s load operation and the cache, data lines, and registers. Mill: the load operation’s virtual address goes straight to the cache and data returns to the belt, with the PLB protection check (which may fault) running in parallel. All tasks use the same virtual addresses; there is no aliasing or translation across tasks or the OS.

  15. Why put translation in front of the cache? [Figure: the traditional path, with the TLB as the bottleneck between the load operation and the cache] To fit in 32-bit memory, different programs must overlap addresses (aliasing). Translation gives each program private memory, even while using the same bit patterns as pointers. The cost: being on the critical path, TLBs must be very fast, small, and power-hungry, and are frequently multilevel. Big programs can see 20% or more TLB overhead.

  16. Why put translation after the cache? The TLB is out of the critical path, referenced only on cache misses and evicts; it can be big, single-level, and low-power. Pointers can be passed to the OS or other tasks without translation, which simplifies sharing and protection for apps. Protection checking is done in parallel with cache access. [Figure: the Mill path – load operation, virtual address, cache, belt – with the PLB protection check alongside] All tasks use the same virtual addresses, with no aliasing or translation across tasks or the OS. (A sketch of the load path follows.)
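
Below is a minimal C sketch of the Mill load path just described, for intuition only: the PLB rights check runs alongside a virtually-addressed cache lookup, and the TLB is consulted only on a miss. All helper functions are toy stand-ins, not Mill hardware interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins so the sketch compiles; not real hardware interfaces. */
static bool plb_permits(uint64_t va) { (void)va; return true; }
static bool vcache_lookup(uint64_t va, uint64_t *out) { (void)va; (void)out; return false; }
static uint64_t tlb_translate(uint64_t va) { return va; }   /* identity stub */
static uint64_t memory_fetch(uint64_t pa) { (void)pa; return 0; }
static void protection_fault(uint64_t va) { (void)va; }

uint64_t mill_load(uint64_t va)
{
    /* In hardware the rights check and the cache lookup run in
     * parallel; they are sequential here only for readability. */
    if (!plb_permits(va))
        protection_fault(va);

    uint64_t data;
    if (vcache_lookup(va, &data))
        return data;                 /* hit: the address is never translated */

    /* Miss: only now is the TLB consulted. Off the critical path it
     * can be big, single-level, and low-power. */
    return memory_fetch(tlb_translate(va));
}
```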

  17. The address space Addresses are 60 bits; the other four bits in a pointer are not part of the address. Regions are parts of the address space, not of memory. Regions may overlap, and a region has byte granularity, bounded by a lower bound (LWB) and an upper bound (UPB). The whole potential data space of a program, including unallocated heap, may be one region.

  18. Region descriptors Regions have descriptors, kept in OS tables and cached in the PLB. A descriptor gives: the location (LWB and UPB), the access rights (read, write, execute, portal, …), and identifications. A user matching the identifications can reference the location in the way indicated by the rights.

  19. Turf – a collection of regions A turf has a non-forgeable, globally unique id, and each region descriptor carries a turf id: a turf comprises all regions with descriptors carrying that turf id. [Region descriptor: LWB, UPB, rights, turf ID] Region descriptor turf ids may be wild-carded. A region descriptor contains only one turf id, but the same region can have several descriptors with different turf ids.

  20. Threads – lines of execution A thread runs in a turf – one turf at a time, but it can change. [Figure: a thread running in turf 5]

  21. Threads – lines of execution [Figure: the same thread, now in turf 17] Note that the descriptors of a turf can describe overlapping regions, possibly with different rights, and that the descriptors of two different turfs can describe the same region, possibly with different rights.

  22. Threads – lines of execution [Figure: what the thread can see and use while running in turf 5 versus while running in turf 17] A register holds the current turf ID for the thread. Many threads can be in the same turf concurrently.

  23. Threads – lines of execution A thread also has a unique, non-forgeable global id, and a region belongs to a thread if the thread id is in the descriptor. [Region descriptor: LWB, UPB, rights, turf ID, thread ID] Region descriptor thread ids may be wild-carded. At power-up, hardware starts an initial thread in the All region: the whole 60-bit address space with all rights. “Your vision increases as you approach the All.” – Swami Suchananda
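
Pulling slides 18–23 together, here is a minimal C sketch of a region descriptor and the check a PLB entry implies. The field names, widths, and the wildcard encoding are assumptions for illustration, not the Mill’s actual layout.

```c
#include <stdbool.h>
#include <stdint.h>

enum rights {                 /* slide 18: read, write, execute, portal */
    RIGHT_READ    = 1u << 0,
    RIGHT_WRITE   = 1u << 1,
    RIGHT_EXECUTE = 1u << 2,
    RIGHT_PORTAL  = 1u << 3,
};

#define ID_WILDCARD UINT64_MAX   /* assumed encoding of a wild-carded id */

typedef struct region_desc {
    uint64_t lwb, upb;    /* byte-granular bounds in the 60-bit space */
    uint32_t rights;      /* bitmask of enum rights                   */
    uint64_t turf_id;     /* slide 19: may be wild-carded             */
    uint64_t thread_id;   /* slide 23: may be wild-carded             */
} region_desc;

/* Would this descriptor let <thread, turf> touch addr in the way asked? */
bool desc_permits(const region_desc *d,
                  uint64_t thread_id, uint64_t turf_id,
                  uint64_t addr, uint32_t want)
{
    if (addr < d->lwb || addr >= d->upb)
        return false;                                  /* outside region */
    if (d->turf_id != ID_WILDCARD && d->turf_id != turf_id)
        return false;                                  /* wrong turf     */
    if (d->thread_id != ID_WILDCARD && d->thread_id != thread_id)
        return false;                                  /* wrong thread   */
    return (d->rights & want) == want;                 /* rights suffice */
}
```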

  24. Granting Each thread runs in a turf, and has the rights of every region of that turf, as well as the thread’s own rights. A thread can grant a subset of one of its regions to another turf or thread, with a subset of its rights. [Example: owned descriptor LWB, UPB, R/W, turf 17, thread *; granted descriptor LWB, UPB, R, turf 22, thread 5] Grant is a hardware operation; granted region descriptors are pushed to the PLB. (A sketch follows.)
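
Continuing the region_desc sketch above, a hedged model of the grant operation: the granted window must lie within the owned region, and the granted rights must be a subset of the owned rights. In hardware this is one operation; the PLB push is shown as a comment.

```c
/* Grant a sub-range of an owned region, with a subset of its rights,
 * to another turf and/or thread. The slide's example: an owned
 * <R/W, turf 17, thread *> descriptor yields a granted
 * <R, turf 22, thread 5> descriptor. Illustrative only. */
bool grant(const region_desc *owned, region_desc *out,
           uint64_t lwb, uint64_t upb, uint32_t rights,
           uint64_t to_turf, uint64_t to_thread)
{
    if (lwb < owned->lwb || upb > owned->upb || lwb >= upb)
        return false;                 /* must be a subset of the region */
    if ((rights & ~owned->rights) != 0)
        return false;                 /* must be a subset of the rights */

    out->lwb = lwb;
    out->upb = upb;
    out->rights = rights;
    out->turf_id = to_turf;
    out->thread_id = to_thread;
    /* Hardware pushes *out into the PLB with the Novel bit set
     * (see slides 25-32). */
    return true;
}
```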

  25. The Region Table Region descriptors are kept in the Region Table in memory and cached in the PLB. The table is an augmented interval tree (Cormen 2001) searched by address range; insertion, deletion, and search are O(log N). Newly granted region descriptors have a Novel bit set in the PLB.

  26. The Region Table Evicted novel descriptors are copied to the table.

  27. The Region Table The Novel bit is not set in descriptors loaded from the table.

  28. The Region Table Evicted non-novel descriptors are discarded.

  29. Revocation Granted regions may be revoked, implicitly or explicitly. Region descriptors pushed to the PLB by a grant have the Novel bit set.

  30. Revocation Revoked novel descriptors are simply discarded.

  31. Revocation Descriptors loaded from the table have the Novel bit clear.

  32. Revocation Revoked non-novel descriptors are discarded in the PLB and lazily removed from the table. By use of the Novel bit, the great majority of transient grants exist only in the PLB and never go to the table. (The whole protocol is sketched below.)
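
The Novel-bit traffic of slides 25–32 fits in one hedged sketch. The Region Table (the augmented interval tree) is abstracted to two stubbed operations, and the PLB entry format is illustrative, not the Mill’s.

```c
#include <stdbool.h>

typedef struct plb_entry {
    region_desc desc;   /* from the earlier sketch                   */
    bool novel;         /* set only by a grant, never by table refill */
    bool valid;
} plb_entry;

/* The O(log N) interval tree of slide 25, stubbed out. */
static void table_insert(const region_desc *d) { (void)d; }
static void table_lazy_remove(const region_desc *d) { (void)d; }

void plb_on_grant(plb_entry *e, const region_desc *d)
{
    e->desc = *d;
    e->novel = true;            /* exists only in the PLB so far (29) */
    e->valid = true;
}

void plb_on_evict(plb_entry *e)
{
    if (e->novel)
        table_insert(&e->desc); /* first write-back to the table (26) */
    e->valid = false;           /* non-novel copy is already there (28) */
}

void plb_on_refill(plb_entry *e, const region_desc *d)
{
    e->desc = *d;
    e->novel = false;           /* loaded from the table (27, 31) */
    e->valid = true;
}

void plb_on_revoke(plb_entry *e)
{
    if (e->valid && !e->novel)
        table_lazy_remove(&e->desc); /* table copy removed lazily (32) */
    e->valid = false;                /* a novel grant just vanishes (30) */
}
```

This is the point of the Novel bit: most transient grants are born novel in the PLB and die there on revocation, never touching memory at all.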

  33. Avoiding the PLB: Well Known Regions

  34. Avoiding the PLB Every turf has three Well Known Region descriptors held in registers, not in the PLB: code (cpReg), constant pool (cppReg), and initialized data (dpReg), covering the load module’s binary code, constants, and initialized data as mapped in memory. Well Known Regions are created by the loader.

  35. Avoiding the PLB Every thread has two Well Known Region descriptors held in registers, not the PLB: stack and thread-local. The stack region covers only the space between base and spReg (with fpReg marking the current frame), and dynamically adjusts to track call/return; loads through pointers are checked against it.

  36. Avoiding the PLB Hardware initializes every new frame to zero (see http://ootbcomp.com/docs/Memory). Beyond the top of stack is inaccessible: you cannot browse in stack rubble. Nor can anyone else. (A sketch follows.)
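
A hedged model of the stack Well Known Region from slides 35–36, with the stack growing upward for simplicity (real stacks typically grow downward): accessibility stops at spReg, the region tracks call and return, and each new frame is zeroed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct stack_wkr {
    uint64_t base;    /* fixed bound of the stack region          */
    uint64_t spReg;   /* current top: region covers [base, spReg) */
    uint64_t limit;   /* hard limit of the underlying allocation  */
} stack_wkr;

/* Beyond the top is inaccessible - no browsing in stack rubble. */
bool stack_access_ok(const stack_wkr *s, uint64_t addr, uint64_t len)
{
    return addr >= s->base && addr <= s->spReg && len <= s->spReg - addr;
}

/* Call: grow the region and zero the new frame, as the hardware does,
 * so a callee can never read a predecessor's leftovers. */
bool push_frame(stack_wkr *s, uint8_t *backing, uint64_t frame_size)
{
    if (frame_size > s->limit - s->spReg)
        return false;                          /* stack overflow */
    memset(backing + (s->spReg - s->base), 0, frame_size);
    s->spReg += frame_size;
    return true;
}

/* Return: the region shrinks, putting the old frame out of reach. */
void pop_frame(stack_wkr *s, uint64_t frame_size)
{
    s->spReg -= frame_size;
}
```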

  37. Smash and grab: stack protection

  38. Smash and grab Return-oriented programming is an exploit that permits an attacker to execute arbitrary code, even if all code is in ROM and the hardware prevents execution of data. It works by smashing the stack (typically via a buffer overrun) and then changing the return addresses saved on the stack to point to desired instructions already in memory. The target instruction(s) must be followed by a return instruction, which follows another modified address on the stack to the next instructions the attacker wants to execute. Various defenses make these attacks harder to do; none make them impossible.
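
For the non-specialist, the enabling bug is usually as small as the deliberately vulnerable C below. On a conventional machine the saved return address sits in the same stack frame as buf; on a Mill it sits in spiller space (next slides), where no overrun can reach it.

```c
#include <string.h>

/* Deliberately vulnerable: no bounds check on the copy. */
void handle_request(const char *untrusted_input)
{
    char buf[16];
    /* Input longer than 16 bytes overruns buf. On a conventional
     * stack, the bytes that follow include the saved return address,
     * which the attacker overwrites with a chain of addresses of
     * instruction sequences ("gadgets"), each ending in a return. */
    strcpy(buf, untrusted_input);
}
```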

  39. Mill spiller The Mill has a stack for application data. [Figure: application frames in the stack region]

  40. Mill spiller Mill program state is not kept on the data stack. Return addresses and other state are kept by the spiller engine in the core, in spiller space – not in the app. Return-oriented exploits are impossible on a Mill. [Figure: the core’s spiller engine saving state to spiller space, separate from the stack region’s frames]

  41. How about debuggers? Apps cannot see the call chain – whence a backtrace? The Trace Service is a callable API that has read rights in spiller space. Trace will return spill-state information about a frame to anyone who has read rights to the frame. [Figure: the application’s stack in app space, spilled state in spiller space, and the Trace Service spanning the two] (A sketch follows.)
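
A hedged sketch of the Trace Service contract as stated on the slide: read rights in spiller space belong to the service, and spill state for a frame is returned only to a caller who can read that frame. The function names and signature here are hypothetical; the slide specifies only the behavior.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct frame_info {
    uint64_t function;      /* function owning the frame              */
    uint64_t return_addr;   /* saved by the spiller (slide 40)        */
} frame_info;

/* Hypothetical helpers standing in for the real checks. */
static bool caller_can_read_frame(uint64_t thread, unsigned depth)
{ (void)thread; (void)depth; return false; }
static bool spiller_read(uint64_t thread, unsigned depth, frame_info *out)
{ (void)thread; (void)depth; (void)out; return false; }

/* Entered by portal call; runs with the Trace Service's rights. */
bool trace_frame(uint64_t thread, unsigned depth, frame_info *out)
{
    /* Spill state is disclosed only for frames the caller could read
     * anyway - read rights to the frame gate the backtrace. */
    if (!caller_can_read_frame(thread, depth))
        return false;
    return spiller_read(thread, depth, out);  /* service-only right */
}
```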

  42. Service-oriented programming: services

  43. Service-oriented programming A service is a secure, stateful, callable behavior provider. Secure: you can’t tromp on it; it can’t tromp on you. Stateful: it remembers what it was doing for you, and it may still be working for you while you’re gone. Callable: you reach it by a normal function call, not a task switch; the cost is two cache loads per call.

  44. Service access A service function is accessed via a portal; a pointer to a portal can be called like any other function pointer. Portal layout: the entry address, the turf id, and the data, code, and pool descriptors. The portal is one I$1 cache line, and one fetch to access; the whole line must have Portal permission. A portal call: the spiller saves the Well Known Region descriptors; the WKR descriptors are loaded from the portal; the turf ID register is switched to the new turf; and the call proceeds to the entry address normally. (A sketch follows.)
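
A hedged sketch of the portal layout and the four-step call sequence, reusing region_desc from earlier. The register names follow slide 34 (cpReg, cppReg, dpReg); the helpers and globals are illustrative models of hardware state, not real interfaces.

```c
#include <stdint.h>

typedef struct portal {              /* one I$1 line, one fetch;      */
    uint64_t    entry;               /* whole line needs Portal right */
    uint64_t    turf_id;
    region_desc data, code, pool;    /* the callee turf's WKRs        */
} portal;

/* Hardware state modeled as globals. */
static uint64_t    turf_reg;                 /* current-turf register */
static region_desc dpReg, cpReg, cppReg;     /* WKR registers         */

static void spiller_save_wkrs(void) { /* spiller engine; elided */ }
static void call_entry(uint64_t entry) { (void)entry; /* ordinary call */ }

void portal_call(const portal *p)
{
    spiller_save_wkrs();     /* 1. spiller saves the caller's WKRs     */
    dpReg  = p->data;        /* 2. load WKRs from the portal           */
    cpReg  = p->code;
    cppReg = p->pool;
    turf_reg = p->turf_id;   /* 3. switch the turf ID register         */
    call_entry(p->entry);    /* 4. call to the entry address normally  */
}
```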

  45. Service access A portal call is not a process switch or thread switch. [Figure: thread 17 running application code in turf 9, with its stack frames, WKRs, and state, about to make a portal call]

  46. Service access A portal call is not a process switch or thread switch. After a portal call, the same thread is running service code in the service environment, with no access to its old environment. [Figure: after the call through the portal’s PLB entry, thread 17 is running service code in turf 5 with new WKRs; the application’s state in turf 9 is out of reach]

  47. But that doesn’t quite work… You can get a fragmented stack if an application and a service call each other back, or if services cross-call. Also: what happens on stack overflow? A service should not be faulted just because a caller was close to its limit. And lots of little regions give poor PLB performance. [Figure: application and service frames interleaved on a single stack]

  48. Stacklets A thread in a service needs its own stack. The logical stack of each thread is a chain of stacklets, one for each turf entered by a nested portal call. [Figure: the application portal-calls service A, which portal-calls service B; each link adds a stacklet to the chain] But – how can you allocate a stacklet in the middle of a portal call?

  49. Stacklets A stacklet per turf solves the callback problem. [Figure: the application portal-calls the service; each runs on its own stacklet, with its own stack WKR]

  50. Stacklets A stacklet per turf solves the callback problem. [Figure: the service back-calls the application; the new frame lands on the application’s existing stacklet] (A sketch follows.)
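
A hedged sketch of the stacklet chain from slides 48–50: one stacklet per turf a thread has entered, found or created on each portal call. Re-entry (a callback) finds the turf’s existing stacklet and stacks new frames on it, which is why cross-calls don’t fragment anything. The list structure is illustrative; the actual return linkage lives in spiller space, not in any stacklet.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct stacklet {
    uint64_t turf_id;          /* the turf this stacklet serves       */
    struct stacklet *next;     /* all stacklets of this thread        */
    /* frame storage and its stack WKR (slide 35) elided              */
} stacklet;

/* On a portal call into `turf`, switch to that turf's stacklet,
 * allocating one on first entry. `head` is this thread's chain. */
stacklet *enter_turf(stacklet **head, uint64_t turf)
{
    for (stacklet *s = *head; s != NULL; s = s->next)
        if (s->turf_id == turf)
            return s;          /* callback: reuse, frames just stack  */

    stacklet *s = calloc(1, sizeof *s);
    if (s == NULL)
        abort();               /* allocation policy is out of scope   */
    s->turf_id = turf;
    s->next = *head;
    *head = s;
    return s;
}
```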
