Secure Compiler Seminar 9/12 Survey on Design of Secure Low-Level Languages

Secure Compiler Seminar 9/12Survey on Design ofSecure Low-Level Languages Toshihiro YOSHINO, Yonezawa Lab. <tossy-2@yl.is.s.u-tokyo.ac.jp>

High-LevelProgram SecureLow-LevelLanguage MachineExecutable A Secure Low-Level Language is Needed • As the secure compiler target language • “secure” means it has a method to prove program’s properties • Memory safe, control flow safe, … • For this, its concrete formal model should be given • It should also below-level • To reduce complexity of JIT compiler (In other words, TCB) Secure Compiler Verifier JIT compilation

Existing Researches • Two major approaches • TAL, PCC • Extension to conventional assembly languages • Utilizes certain logic (such as type theories) to prove safety • Virtual Machines • Introduce intermediate languages of their own • Many of them adopt safe-by-construction design e.g. Java VM is semantically safe in memory operation • Java VM, Microsoft CIL, mvm [Franz et al. 2003],Jinja [Klein et al. 2006], ADL [Yoshino 2006], …

Comparison: PCC vs. Java VM Machine Independent • PCC • (Extended) Machine code + Proof • Generated code and proof are machine-dependent • Requires one VC implementation for each architecture • Java VM • Verifier ensures type and control flow safety • It often restricts optimizations • Leads to performance degradation • High cost to perform verification • Stack is nothing more than a set of untyped (variable-number) registers Machine Dependent

Limitations ofJava Bytecode Verification • Initialization check is incomplete • Correlation with other variables is not taken into account • Example: class Test { int test(boolean b) { int i; try { if(b) return 1; i = 2; } finally { if(b) i = 3; } return i; }} • Incomplete common subexpression elimination • Cannot eliminate c.s. in address calculations (array refs)

Then What Should We Do? • Maximize machine-independent part • Avoids porting cost of the system • Tradeoff against the size of TCB (Trusted Computing Base) • But recent works in PCC and TAL (e.g. Foundational PCC) aim solely to minimize TCB • Reduce proof size and generation cost • PCC requires much effort to produce proof, because the target’s level is very low • Registers and memory are untyped, etc.

mvm [Franz et al. 2003] • Aimed to find the semantics level that: • Is effective at supporting proof-carrying code • Can also be translated efficientlyinto highly performing native code(on many platforms) • Separated design between VMlayer and PCC layer [1] M. Franz et al. A Portable Virtual Machine Target for Proof-Carrying Code. IVME ’03.

mvm: Virtual Machine Design • Register-based architecture • The number of registers is not bounded • Registers are categorized by the type of values: Integer, Boolean, Pointer, Address • Pointer registers are used to store pointers to heap objects (more specifically, array heads) • Address registers are for storing results of address arithmetic • Bounds check is not performed in arithmetic, so it has to be checked in higher layer • Heap can be used to store objects • Heap model is explained next

1 false 42 true … … … … mvm: Virtual Machine Design mvm Virtual Machine Integer Boolean Pointer Address label1: instr instr … label2: … Heap

mvm: Heap Model • mvm heap consists of arrays of objects • Object representation in mvm • Each object is tagged • Tag can only be written with new operation and is immutable after creation • Two sections of data area: values and pointers • Integers, booleans are stored into the first section • Pointers are stored into the second section 1 42

mvm: Heap Model • A type is associated with its tag value, layout and structure • This association is managed by compiler • Layout describes the sizes of data sections • Structure describes the possible substructure inside pointer section • Example of type information {} means disjunction datatype T = Int of int | Pair of int * int… T list … <> means a tuple

mvm: Heap Model • Example of a T list object tree int*int T T list

mvm: Instructions • Arithmetics, Logical calculation • Similar to many other languages ;-) • Branch • Unconditional: goto label • Conditional: brtrue bi, label / brfalse bi, label • Condition must be taken from a boolean register • Jump is allowed only to a label • Conditional by object tag (RTTI): iftag

mvm: Instructions • Object creation and access • pj = new(tag, ik) • Creates an array of ik objects with type tag • r = load([sizev, sizep] | tag, pk, offset) • store([sizev, sizep] | tag, pk, offset, r) • sizes and tag are used to check memory safety • Pointer registers and address registers • Object access also permit address registers ak • This distinction is for supporting garbage collection • Address registers always contain “derived” pointers

mvm: Instructions • Accessing arrays • an = adda([sizev, sizep] | tag, pk, il) • Calculates address of the il-th element in an array of type tag stored at pk • in = getlen(pk) • Guards • Bounds checking: CHECKLEN(pk, il) • Validity checking: CHECKNOTNULL(pk) • Type checking: CHECKTAG(pi, [sizev, sizep] | tag) • These guards are inserted when static checking failed

An Example mvm Program

Type Safety in mvm Programs • Operations on primitives are all type-safe • Because registers to store values are distinct • Type-safety proofs are needed only for non-primitive operations • Pointers, arrays and records • For every pointer operation, check that result pointer: • Points to the beginning of an array, record or value • Points to an object of the correct type

Jinja [Klein et al. 2006] • A Java-like programming language built on Isabelle/HOL • Formal description of Jinja language, Jinja VM and compiler are given • Several properties were machine-checked • Big step evaluation and small step evaluation (atomic operations) are equivalent • Compiler correctness [2] G. Klein, T. Nipkow. A Machine-Checked Model for a Java-Like Language, Virtual Machine, and Compiler. TOPLAS 28(4), 2006.

Jinja Language • “Jinja is not Java” • Object-oriented language with exceptions • A program is a set of class definitions and,a class consists of several fields and methods • Method body is an expression • Overriding is supported as in Java • But not overloading, because it is complicated • Language is statically typed • Type system ensures that the execution of a well-typed program never gets stuck

Jinja Language: Language Elements • Values • Boolean Bool b, Integer Intg i, Reference Addr a • Null reference Null, Dummy value Unit • Expressions • Val v , binary operations e1 op e2 , Var V , V := e , e1; e2 , … • Conditional: if (e) e1 else e2 , while (e) e’ / Block: {V:T, e} • Object construction: new C • Casting: Cast C e • Field access: e.F{D} , e.F{D} := e • D is annotation added in preprocessing (e.g., by typechecker) • Method call: e.M(e, e, …) • Exception: throw e , try e1 catch(C V) e2

Jinja Language: Semantics • Big step semantics • Typical operational semantics • State = <Heap, Local Variables> • Detail abbreviated because nothing special • Small step semantics • Finer-grained semantics • One-step evaluation • Useful for formalizing parallelism (?) • Each (small) operation is considered atomic • Not discussed in the paper

Jinja Language: Semantics • Big and small semantics are proven to be equivalent • wwf-J-prog means “weak well-formedness”, which is defined by the following properties: • Number of parameter types and of parameter names are equal • “this” is not included in parameter list • Free variables in the method body only refer to this or these parameters

Jinja VM • Similar to Java VM • Stack-based machine with heap • State = <addr option, heap, frame list> • First element is possibly a generated exception • Third element is a call-stack • Frame = <stack, registers, cname, mname, pc> where stack = value list, registers = value list • Evaluation of operands are done on stack • Registers are for storing local variables

Jinja VM: Instructions • Basic operations • Push v / Pop • Register operations: Load n / Store n • Arithmetics: IAdd, … • Logical operations: CmpEq, … • Object manipulation • Construction: New cname • Casting: Checkcast cname • Field access: Getfield vnamecname / Putfield vname cname • Method invocation: Invoke mnamen

Jinja VM: Instructions • Control flow operation • Branching: Goto n / IfFalse n • n is relative offset from the instruction • Exit from a method: Return • Exception • Throwing an exception: Throw • Information about exception handlers (try-catch) are attached to method declarations • Handler is retrieved from there when needed

Jinja VM: Semantics • Please refer to the paper for detail • Basically, straightforward and intuitive • In this level, there are no runtime checks • For example, IAdd (Integer addition) does not check whether its argument is really integers • Otherwise, the result is unspecified • This kind of checks is performed by a bytecode verifier

Jinja VM Bytecode Verification • JVM relies on the following assumptions: • Types are correct • No overflow or underflow in stack • Code containment • Register initialization before use • Just the same as Java VM • Bytecode verifier statically ensures these assumptions

Intg 1 Addr 1 Int Class A Class B Addr 2 Addr 1 Class A Class A Addr 1 Addr 3 Class B Class B Addr 3 Int 0 Int 0 JVM Program Jinja VM Bytecode Verification • Abstract interpretation • Instead of values, consider only types State State Type

Jinja Compiler and its Correctness • 2-staged compilation • Map parameter names to register indices • Assign local variables to registers • Gather variable occurrences and use it to lookup • Code generation • expression → instruction list (compE2) • Straightforward definition • Exception table generation (compEx2) • Separated from compE2, because exception table must contain global addresses

Jinja Compiler and its Correctness • Correctness of compilation • If a program is weakly well-formed, then: Heap,Vars Heap,[Frame] compilation Jinjaprogram JVMbytecode Heap,Vars Heap, []

Implementation of Jinja • http://afp.sourceforge.net/entries/Jinja.shtml • About 20kLoC in Isabelle/HOL • Over 1,000 theorems are defined • It takes about 25 min. to process these proofs on a 3GHz Pentium 4 machine with 1GB RAM

Summary of Today’s Talk • We would need a secure low-level language for the target of a secure compiler • Minimize machine-dependent part to reduce implementation cost • Also reduce cost for proof generation • To answer this, surveyed two VM projects • mvm • Aimed to find the “sweet spot” that reconciles high performance and small type-safety proofs • Jinja • Constructed a unified formal model of a Java-like language, the underlying VM and compiler • In contrast to mvm, this research is oriented toward higher-level languages and compiler’s properties

How about ADL [Yoshino 2006] …? • The position of ADL is close to mvm • To provide a common basis of implementing verifier for low-level languages • Assumed translation direction is opposite • mvm is an intermediate code of compilation • ADL is designed to simulate real machines JVM, mvm MachineCode ADL Secure L3

How about ADL [Yoshino 2006] …? • ADL takes minimalist approach • Only 7 kinds of commands • Instead, expression-based design to allow complex formulae to be easily written • ADL can be used as an intermediate language? • Probably some modification needed • Register allocation is done, but except for variables • Minimalist design, however, may increase complexity in constructing a verification logic • Abstract interpretation is often not sufficient, so a verification logic may want to calculate exact values

More References • LLVM Project [Lattner 2000] • http://www.llvm.org/ • Use VM for interprocedural optimization • SafeTSA [Amme et al. 2001] • SSA-based language for mobile code security • Dis virtual machine [Winterbottom et al. 1997] • Omniware system [Adl-Tabatabai et al. 1996]

High-LevelProgram MachineExecutable (Typical) Compiler Construction andSeveral Intermediate Languages Lexing /Parsing TypeChecking Normalize(SSA, etc.) LLVM,SafeTSA Java, CIL,Jinja(VM) Intermediate CodeGeneration Optimize TAL,PCC mvm ADL Target Code Generation RegisterAllocation PrettyPrinting

Secure Compiler Seminar 9/12 Survey on Design of Secure Low-Level Languages