Incorporating domain specific information into the compilation process

Incorporating Domain-Specific Information into the Compilation Process

Samuel Z. Guyer

Supervisor: Calvin Lin

April 14, 2003


Motivation

Two different views of software:

  • Compiler’s view

    • Abstractions: numbers, pointers, loops

    • Operators: +, -, *, ->, []

  • Programmer’s view

    • Abstractions: files, matrices, locks, graphics

    • Operators: read, factor, lock, draw

      This discrepancy is a problem...


Find the error – part 1

  • Example:

switch (var_83) {
  case 0: func_24();
          break;
  case 1: func_29();
          break;
}
case 2: func_78();    /* error: case outside of switch */

  • Error: case outside of switch statement

    • Part of the language definition

    • Error reported at compile time

    • Compiler indicates the location and nature of error


Find the error – part 2

  • Example:

struct __sue_23 * var_72;
char var_81[100];

var_72 = libfunc_84(__str_14, __str_65);
libfunc_44(var_72);
libfunc_38(var_81, 100, 1, var_72);    /* improper call */

  • Improper call to libfunc_38

    • Syntax is correct – no compiler message

    • Fails at run-time

  • Problem: what does libfunc_38 do?

    This is how compilers view reusables


Find the error – part 3

  • Example:

FILE * my_file;
char buffer[100];

my_file = fopen("my_data", "r");
fclose(my_file);
fread(buffer, 100, 1, my_file);    /* improper call: file already closed */

  • Improper call to fread() after fclose()

    • The names reveal the mistake

      No traditional compiler reports this error

    • Run-time system: how does the code fail?

    • Code review: rarely this easy to spot
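A library-aware checker could catch this mistake by tracking an Open/Closed state for each file handle. The following is a minimal sketch of that idea only, not Broadway's implementation: the `FileOp` trace encoding and the function `first_bad_access` are hypothetical names introduced for illustration, and the trace is assumed to be straight-line code on a single handle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstract operations on a single file handle. */
typedef enum { OP_OPEN, OP_CLOSE, OP_READ } FileOp;

/* Abstract state of the handle, following the Open/Closed idea. */
typedef enum { ST_CLOSED, ST_OPEN } FileState;

/*
 * Walks a straight-line trace of operations and returns the index of the
 * first read that happens while the handle is closed, or -1 if none does.
 */
int first_bad_access(const FileOp *trace, size_t n)
{
    assert(trace != NULL || n == 0);
    FileState state = ST_CLOSED;          /* nothing is open before fopen() */
    for (size_t i = 0; i < n; i++) {
        switch (trace[i]) {
        case OP_OPEN:  state = ST_OPEN;   break;
        case OP_CLOSE: state = ST_CLOSED; break;
        case OP_READ:
            if (state != ST_OPEN)
                return (int)i;            /* access to a closed file */
            break;
        }
    }
    return -1;
}
```

Encoding the slide's example as open, close, read makes the check point at the third operation, which is the `fread()` call.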


Problem

  • Compilers are unaware of library semantics

    • Library calls have no special meaning

    • The compiler cannot provide any assistance

  • Burden is on the programmer:

    • Use library routines correctly

    • Use library routines efficiently and effectively

      These are difficult manual tasks

    • Tedious and error-prone

    • Can require considerable expertise


Solution

  • A library-level compiler

    • Compiler support for software libraries

    • Treat library routines more like built-in operators

  • Compile at the library interface level

    • Check programs for library-level errors

    • Improve performance with library-level optimizations

      Key: Libraries represent domains

    • Capture domain-specific semantics and expertise

    • Encode in a form that the compiler can use


The Broadway Compiler

[Diagram: Broadway takes the application's source code together with the library's annotations, header files, and source code. Its analyzer produces error reports with library-specific messages; its optimizer produces integrated application+library source code.]

  • Broadway – source-to-source C compiler

    Domain-independent compiler mechanisms

  • Annotations – lightweight specification language

    Domain-specific analyses and transformations

    Many libraries, one compiler


Benefits

  • Improves capabilities of the compiler

    • Adds many new error checks and optimizations

    • Qualitatively different

  • Works with existing systems

    • Domain-specific compilation without recoding

    • For us: more thorough and convincing validation

  • Improve productivity

    • Less time spent on manual tasks

    • All users benefit from one set of annotations


Outline

  • Motivation

  • The Broadway Compiler

  • Recent work on scalable program analysis

    • Problem: Error checking demands powerful analysis

    • Solution: Client-driven analysis algorithm

    • Example: Detecting security vulnerabilities

  • Contributions

  • Related work

  • Conclusions and future work


Security vulnerabilities

  • How does remote hacking work?

    • Most are not direct attacks (e.g., cracking passwords)

    • Idea: trick a program into unintended behavior

  • Automated vulnerability detection:

    • How do we define “intended”?

    • Difficult to formalize and check application logic

      Libraries control all critical system services

    • Communication, file access, process control

    • Analyze routines to approximate vulnerability


Remote access vulnerability

  • Example:

int sock;
char buffer[100];

sock = socket(AF_INET, SOCK_STREAM, 0);
read(sock, buffer, 100);
execl(buffer);    /* vulnerability */

  • Vulnerability: executes any remote command

    • What if this program runs as root?

    • Clearly domain-specific: sockets, processes, etc.

    • Requirement: Data from an Internet socket should not specify a program to execute

  • Why is detecting this vulnerability hard?
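The requirement can be read as a flow property: a value that came from a remote socket must not reach an exec call. The sketch below illustrates this on a tiny straight-line program. The `Stmt` encoding and `count_remote_exec` are hypothetical names invented here, variables are plain indices, and pointers and procedure calls are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VARS 8

/* Hypothetical abstract statements for a straight-line program. */
typedef enum { STMT_READ_SOCKET, STMT_READ_LOCAL, STMT_COPY, STMT_EXEC } StmtKind;

typedef struct {
    StmtKind kind;
    int dst;   /* variable written (READ_*, COPY) or executed (EXEC) */
    int src;   /* variable read by COPY; ignored otherwise */
} Stmt;

/* Counts exec statements whose argument may hold data from a remote socket. */
int count_remote_exec(const Stmt *prog, size_t n)
{
    bool remote[MAX_VARS] = { false };
    int reports = 0;
    for (size_t i = 0; i < n; i++) {
        const Stmt *s = &prog[i];
        assert(s->dst >= 0 && s->dst < MAX_VARS);
        switch (s->kind) {
        case STMT_READ_SOCKET: remote[s->dst] = true;  break;
        case STMT_READ_LOCAL:  remote[s->dst] = false; break;
        case STMT_COPY:
            assert(s->src >= 0 && s->src < MAX_VARS);
            remote[s->dst] = remote[s->src];   /* property follows the data */
            break;
        case STMT_EXEC:
            if (remote[s->dst])
                reports++;                     /* remote data names a program */
            break;
        }
    }
    return reports;
}
```

The slide's example corresponds to a socket read into variable 0 followed by an exec of variable 0, which yields one report.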


Challenge 1: Pointers

  • Example:

int sock;
char buffer[100];
char * ref = buffer;

sock = socket(AF_INET, SOCK_STREAM, 0);
read(sock, buffer, 100);
execl(ref);    /* vulnerability */

  • Still contains a vulnerability

    • Only one buffer

    • Variables buffer and ref are aliases

      We need an accurate model of memory
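One way to see why the memory model matters is to attach the "remote" property to memory objects rather than to variable names. The sketch below does that for the example above. All names in it (`Memory`, `mark_remote`, `holds_remote_data`, the enum constants) are hypothetical, and each name is assumed to refer to exactly one object, which a real pointer analysis cannot assume.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the variables and memory objects in the example. */
enum { VAR_BUFFER, VAR_REF, VAR_OTHER, NUM_VARS };
enum { OBJ_BUFFER, OBJ_OTHER, NUM_OBJS };

typedef struct {
    int  refers_to[NUM_VARS];  /* object each name denotes or points to */
    bool remote[NUM_OBJS];     /* property is stored per object, not per name */
} Memory;

/* Builds the slide's situation: buffer and ref name the same storage. */
Memory memory_for_example(void)
{
    Memory m = { { 0 }, { false } };
    m.refers_to[VAR_BUFFER] = OBJ_BUFFER;
    m.refers_to[VAR_REF]    = OBJ_BUFFER;   /* char *ref = buffer; */
    m.refers_to[VAR_OTHER]  = OBJ_OTHER;
    return m;
}

/* read(sock, var, ...) from a remote socket marks the underlying object. */
void mark_remote(Memory *m, int var)
{
    assert(var >= 0 && var < NUM_VARS);
    m->remote[m->refers_to[var]] = true;
}

/* execl(var) is suspicious if the object behind var holds remote data. */
bool holds_remote_data(const Memory *m, int var)
{
    assert(var >= 0 && var < NUM_VARS);
    return m->remote[m->refers_to[var]];
}
```

Marking `VAR_BUFFER` as remote makes the query on `VAR_REF` succeed as well, because both names resolve to `OBJ_BUFFER`.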


Challenge 2: Scope

  • Call graph:

[Diagram: call graph rooted at main; the socket, read, and execl calls from the previous example are reached through different procedures.]

  • Objects flow throughout program

    • No scoping constraints

    • Objects referenced through pointers

      We need whole-program analysis


Challenge 3: Precision

  • Static analysis is always an approximation

  • Precision: level of detail or sensitivity

    • Multiple calls to a procedure

      • Context-sensitive: analyze each call separately

      • Context-insensitive: merge information from all calls

    • Multiple assignments to a variable

      • Flow-sensitive: record each value separately

      • Flow-insensitive: merge values from all assignments

        Lower precision reduces the cost of analysis:

        exponential → polynomial → ~linear
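The flow-sensitive versus flow-insensitive distinction can be illustrated with a variable that is assigned several times. The sketch below is only an illustration: the bit encoding of values and both function names are assumptions made here, and the input is the list of values assigned to one variable in program order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstract values, encoded as bits so they can be merged. */
#define VAL_LOCAL  1u
#define VAL_REMOTE 2u

/* Flow-sensitive view: the value at the end is the last one assigned. */
unsigned value_flow_sensitive(const unsigned *assigned, size_t n)
{
    assert(assigned != NULL && n > 0);
    return assigned[n - 1];
}

/* Flow-insensitive view: one merged value covers every assignment. */
unsigned value_flow_insensitive(const unsigned *assigned, size_t n)
{
    assert(assigned != NULL && n > 0);
    unsigned merged = 0;
    for (size_t i = 0; i < n; i++)
        merged |= assigned[i];
    return merged;
}
```

For a variable assigned a remote value and then a local one, the flow-sensitive view ends with only `VAL_LOCAL`, while the flow-insensitive view still says the variable could be remote. The merged answer is cheaper to compute but can produce a false report.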


Insufficient precision

  • Example:

[Diagram: call graph from main with nodes stdin, socket, read, and two execl calls; both execl calls are marked as possible errors.]

    Context-insensitivity

  • Information merged at call

    • Analyzer reports 2 possible errors

    • Only 1 real error

      Imprecision leads to false positives
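The effect of merging at a call can be reproduced with a toy model in which one helper procedure is called from several sites and each result is passed to an exec call. This is a sketch under stated assumptions: `helper`, the bit encoding, and both counting functions are hypothetical, and the helper simply returns the kind of data it was given.

```c
#include <assert.h>
#include <stddef.h>

#define KIND_LOCAL  1u
#define KIND_REMOTE 2u

/* Hypothetical helper procedure: hands back the data it was given. */
static unsigned helper(unsigned kind_in) { return kind_in; }

/* Context-insensitive: all call sites share one merged input and result. */
int reports_context_insensitive(const unsigned *call_args, size_t n_calls)
{
    assert(call_args != NULL);
    unsigned merged = 0;
    for (size_t i = 0; i < n_calls; i++)
        merged |= call_args[i];
    unsigned result = helper(merged);
    int reports = 0;
    for (size_t i = 0; i < n_calls; i++)
        if (result & KIND_REMOTE)
            reports++;              /* every caller sees the merged result */
    return reports;
}

/* Context-sensitive: each call site is analyzed with its own input. */
int reports_context_sensitive(const unsigned *call_args, size_t n_calls)
{
    assert(call_args != NULL);
    int reports = 0;
    for (size_t i = 0; i < n_calls; i++)
        if (helper(call_args[i]) & KIND_REMOTE)
            reports++;
    return reports;
}
```

With one call passing local data and one passing remote data, the merged analysis reports two possible errors and the per-call analysis reports one, matching the counts on the slide.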


Cost versus precision

  • Problem: A tradeoff

    • Precise analysis is prohibitively expensive

    • Cheap analysis yields too many false positives

  • Idea: Mixed precision analysis

    • Focus effort on the parts of the program that matter

    • Don’t waste time over-analyzing the rest

      Key: Let error detection problem drive precision

      Client-driven program analysis


Client-Driven Algorithm

  • Client: Error detection analysis problem

  • Algorithm:

    • Start with fast cheap analysis – monitor imprecision

    • Determine extra precision – reanalyze

[Diagram: the pointer analyzer and the client analysis work over a shared memory model and produce error reports. A monitor records information loss in a dependence graph; an adaptor turns it into a precision policy for the next pass.]


Algorithm components

  • Monitor

    • Runs alongside main analysis

    • Records imprecision

  • Adaptor

    • Start at the locations of reported errors

    • Trace back to the cause and diagnose


Sources of imprecision

  • Polluting assignments

[Diagram panels: multiple procedure calls (several calls to foo( )); multiple assignments (several x = ...); conditions (if (cond)); pointer dereference (*ptr), with either a polluted target or a polluted pointer.]


In action...

[Diagram: call graph with main, stdin, socket, read, and execl nodes, annotated as the algorithm proceeds.]

  • Monitor analysis

  • Polluting assignments

  • Diagnose and apply “fix”

    • In this case: one procedure context-sensitive

  • Reanalyze
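The monitor, diagnose, and reanalyze steps can be sketched as a small driver around a toy analysis. This is a heavily simplified illustration, not the published algorithm: there is a single helper procedure, the only precision knob is whether that procedure is context-sensitive, and every name (`Policy`, `Result`, `analyze`, `client_driven`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KIND_LOCAL  1u
#define KIND_REMOTE 2u

typedef struct { bool context_sensitive; } Policy;   /* precision policy */

typedef struct {
    int  reports;    /* possible errors seen by the client */
    bool polluted;   /* monitor: merging changed what some caller sees */
} Result;

/* One analysis pass over the call sites of a single helper procedure. */
static Result analyze(const unsigned *call_args, size_t n, Policy p)
{
    Result r = { 0, false };
    unsigned merged = 0;
    for (size_t i = 0; i < n; i++)
        merged |= call_args[i];
    for (size_t i = 0; i < n; i++) {
        unsigned seen = p.context_sensitive ? call_args[i] : merged;
        if (seen & KIND_REMOTE)
            r.reports++;
        if (seen != call_args[i])
            r.polluted = true;       /* information was lost by merging */
    }
    return r;
}

/* Starts cheap; adds precision only if reports may stem from imprecision. */
int client_driven(const unsigned *call_args, size_t n, int *passes)
{
    assert(call_args != NULL && passes != NULL);
    Policy p = { false };
    Result r = analyze(call_args, n, p);
    *passes = 1;
    if (r.reports > 0 && r.polluted) {   /* adaptor: diagnose and fix */
        p.context_sensitive = true;
        r = analyze(call_args, n, p);    /* reanalyze */
        *passes = 2;
    }
    return r.reports;
}
```

When no errors are reported, or when merging loses nothing, the driver stops after the cheap pass. Extra precision is paid for only when it could change the answer the client cares about.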


Methodology

  • Compare with commonly-used fixed precision

  • Metrics

    • Accuracy – number of errors reported

      Includes false positives – fewer is better

    • Performance – only when accuracy is the same


Programs

  • 18 real C programs

    • Unmodified source – all the issues of production code

    • Many are system tools – run in privileged mode

  • Representative examples:


Error detection problems

  • File access: Files must be open when accessed

  • Remote access vulnerability: Data from an Internet socket should not specify a program to execute

  • Format string vulnerability (FSV): Format string may not contain untrusted data

  • Remote FSV: Check if FSV is remotely exploitable

  • FTP behavior: Can this program be tricked into reading and transmitting arbitrary files?
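To make the format string rule concrete, the sketch below shows the usual safe pattern in C: untrusted text is passed as an argument to a constant format instead of being used as the format itself. The function name `log_untrusted` is hypothetical; `snprintf` is the standard library routine.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Copies untrusted text into out using a constant format string, so any
 * '%' directives inside the text are treated as plain characters.
 * Returns what snprintf returns: the length the full text would need.
 */
int log_untrusted(char *out, size_t out_size, const char *untrusted)
{
    assert(out != NULL && untrusted != NULL && out_size > 0);
    /* Unsafe variant, shown only as a comment:
     *     snprintf(out, out_size, untrusted);
     * Here the untrusted text would be interpreted as the format. */
    return snprintf(out, out_size, "%s", untrusted);
}
```

A checker for this rule needs to know which routines take a format argument and where untrusted data enters the program, which is the kind of library knowledge the annotations are meant to carry.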


Results

[Chart: Remote access vulnerability. Normalized performance (log scale: 10X, 100X, 1000X) of five algorithms, Full (CS-FS), Medium (CI-FS), Slow (CS-FI), Fast (CI-FI), and Client-Driven, across programs ordered by increasing number of CFG nodes. Programs: nn, fcron, make, named, SQLite, muh (I), pfinger, pureftp, apache, stunnel, privoxy, muh (II), cfengine, wu-ftp (I), wu-ftp (II), blackhole, ssh client, ssh server.]


Overall results

  • 90 test cases: 18 programs, 5 problems

    • Test case: 1 program and 1 error detection problem

    • Compare algorithms: client-driven vs. fixed precision

As accurate as any other algorithm: 87 out of 90

Runs faster than best fixed algorithm: 64 out of 87

Performance not an issue: 19 of 23

Both most accurate and fastest: 29 out of 64


Why does it work?

  • Validates our hypothesis

    • Different errors have different precision requirements

    • Amount of extra precision is small


Outline

  • Motivation

  • The Broadway Compiler

  • Recent work on scalable program analysis

    • Problem: Error checking demands powerful analysis

    • Solution: Client-driven analysis algorithm

    • Example: Detecting security vulnerabilities

  • Contributions

  • Related work

  • Conclusions and future work


Central contribution

  • Library-level compilation

    • Opportunity: library interfaces make domains explicit in existing programming practice

    • Key: a separate language for codifying domain-specific knowledge

    • Result: our compiler can automate previously manual error checks and performance improvements

             Knowledge representation   Applying knowledge   Results
Old way:     Informal                   Manual               Difficult, unpredictable
Broadway:    Codified                   Compiler             Easy, automatic, reliable


Specific contributions

  • Broadway compiler implementation

    • Working system (43K C-Breeze, 23K pointers, 30K Broadway)

  • Client-driven pointer analysis algorithm [SAS’03]

    • Precise and scalable whole-program analysis

  • Library-level error checking experiments [CSTR’01]

    • No false positives for format string vulnerability

  • Library-level optimization experiments [LCPC’00]

    • Solid improvements for PLAPACK programs

  • Annotation language [DSL’99]

    • Balance expressive power and ease of use


Related work

  • Configurable compilers

    • Power versus usability – who is the user?

  • Active libraries

    • Previous work focusing on specific domains

    • Few complete, working systems

  • Error detection

    • Partial program verification – paucity of results

  • Scalable pointer analysis

    • Many studies of cost/precision tradeoff

    • Few mixed-precision approaches


Future work

  • Language

    • More analysis capabilities

  • Optimization

    • We have only scratched the surface

  • Error checking

    • Resource leaks

    • Path sensitivity – conditional transfer functions

  • Scalable analysis

    • Start with cheaper analysis – unification-based

    • Refine to more expensive analysis – shape analysis


Thank You


Annotations (I)

  • Dependence and pointer information

    • Describe pointer structures

    • Indicate which objects are accessed and modified

procedure fopen(pathname, mode)
{
  on_entry { pathname --> path_string
             mode --> mode_string }
  access { path_string, mode_string }
  on_exit { return --> new file_stream }
}


Annotations (II)

  • Library-specific properties

    Dataflow lattices

property State : { Open, Closed }
  initially Open

property Kind : { File,
                  Socket { Local, Remote } }

[Diagram: lattice drawings for State (Open, Closed) and for Kind (File and Socket, with Local and Remote under Socket), each with top and bottom elements.]
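A property such as Kind can be modeled as a small set-based lattice. The encoding below is an assumption made for illustration, since the transcript does not preserve how Broadway orients or represents its lattices: each leaf kind is one bit, `Socket` is the union of `Local` and `Remote`, and merging two values is set union.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit encoding of the Kind property's leaf values. */
#define KIND_FILE   1u
#define KIND_LOCAL  2u
#define KIND_REMOTE 4u
#define KIND_SOCKET (KIND_LOCAL | KIND_REMOTE)   /* Socket { Local, Remote } */
#define KIND_ALL    (KIND_FILE | KIND_SOCKET)

/* Merge of two abstract values where control-flow paths meet. */
unsigned kind_merge(unsigned a, unsigned b)
{
    assert((a & ~KIND_ALL) == 0 && (b & ~KIND_ALL) == 0);
    return a | b;
}

/* "could-be" test: does the value allow the given kind at all? */
bool kind_could_be(unsigned value, unsigned kind)
{
    return (value & kind) != 0;
}
```

Under this encoding, merging `Local` with `Remote` gives exactly `Socket`, which mirrors the nesting in the property declaration.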


Annotations (III)

  • Library routine effects

    Dataflow transfer functions

procedure socket(domain, type, protocol)
{
  analyze Kind {
    if (domain == AF_UNIX) IOHandle <- Local
    if (domain == AF_INET) IOHandle <- Remote
  }
  analyze State { IOHandle <- Open }
  on_exit { return --> new IOHandle }
}
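Read as a dataflow transfer function, the socket annotation maps the call's arguments to the properties of the new handle. The C sketch below mirrors that reading. The `Domain` enum stands in for the real `AF_UNIX` and `AF_INET` constants, `IOHandle` and `transfer_socket` are hypothetical names, and the behavior for an unknown domain (assume either kind) is an assumption, since the annotation does not say what happens when neither condition holds.

```c
#include <assert.h>

/* Hypothetical stand-ins for the AF_UNIX / AF_INET constants. */
typedef enum { DOMAIN_UNIX, DOMAIN_INET, DOMAIN_UNKNOWN } Domain;

#define KIND_LOCAL   1u
#define KIND_REMOTE  2u
#define STATE_OPEN   1u
#define STATE_CLOSED 2u

/* Abstract description of the handle returned by socket(). */
typedef struct {
    unsigned kind;    /* set of possible Kind values */
    unsigned state;   /* set of possible State values */
} IOHandle;

/* Transfer function: abstract effect of calling socket(domain, ...). */
IOHandle transfer_socket(Domain domain)
{
    assert(domain == DOMAIN_UNIX || domain == DOMAIN_INET ||
           domain == DOMAIN_UNKNOWN);
    IOHandle h;
    switch (domain) {
    case DOMAIN_UNIX: h.kind = KIND_LOCAL;  break;
    case DOMAIN_INET: h.kind = KIND_REMOTE; break;
    default:          h.kind = KIND_LOCAL | KIND_REMOTE; break;  /* assumed */
    }
    h.state = STATE_OPEN;   /* a freshly created socket starts out open */
    return h;
}
```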


Annotations (IV)

  • Reports and transformations

procedure execl(path, args)
{
  on_entry { path --> path_string }
  reportif (Kind : path_string could-be Remote)
    "Error at " ++ $callsite ++ ": remote access";
}

procedure slow_routine(first, second)
{
  when (condition)
    replace-with %{ quick_check($first);
                    fast_routine($first, $second); }%
}
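The `reportif` clause amounts to a check on the analysis result at each call site. A rough C equivalent is sketched below. The function name, the bit encoding of Kind, and the call-site string format are all assumptions made for illustration; only the message text follows the annotation.

```c
#include <assert.h>
#include <stdio.h>

#define KIND_LOCAL  1u
#define KIND_REMOTE 2u

/*
 * If the path argument could be Remote, writes the report into msg and
 * returns 1. Otherwise leaves msg empty and returns 0.
 */
int report_if_remote(unsigned path_kind, const char *callsite,
                     char *msg, size_t msg_size)
{
    assert(callsite != NULL && msg != NULL && msg_size > 0);
    msg[0] = '\0';
    if ((path_kind & KIND_REMOTE) == 0)
        return 0;                       /* definitely not remote */
    snprintf(msg, msg_size, "Error at %s: remote access", callsite);
    return 1;
}
```

Because the test is "could-be", a value that may be either local or remote still triggers the report. That is why imprecise merging upstream shows up as false positives.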


Why does it work?

  • Validates our hypothesis

    • Different clients have different precision requirements

    • Amount of extra precision is small


Time


Validation

  • Optimization experiments

    • Cons: One library – three applications

    • Pros: Complex library – consistent results

  • Error checking experiments

    • Cons: Quibble about different errors

    • Pros: We set the standard for experiments

  • Overall

    • Same system designed for optimizations is among the best for detecting errors and security vulnerabilities


Type Theory

  • Equivalent to dataflow analysis (heresy?)

      Flow values         ↔  Types
      Transfer functions  ↔  Inference rules

  • Different in practice

    • Dataflow: flow-sensitive problems, iterative analysis

    • Types: flow-insensitive problems, constraint solver

  • Commonality

    • No magic bullet: same cost for the same precision

    • Extracting the store model is a primary concern

Remember Phil Wadler’s talk?


Generators

  • Direct support for domain-specific programming

    • Language extensions or new language

    • Generate implementation from specification

  • Our ideas are complementary

    • Provides a way to analyze component compositions

    • Unifies common algorithms:

      • Redundancy elimination

      • Dependence-based optimizations


Is it correct?

Three separate questions:

  • Are Sam Guyer’s experiments correct?

    • Yes, to the best of our knowledge

      • Checked PLAPACK results

      • Checked detected errors against known errors

  • Is our compiler implemented correctly?

    • Flip answer: whose is?

    • Better answer: testing suites

  • How do we validate a set of annotations?


Annotation correctness

Not addressed in my dissertation, but...

  • Theoretical approach

    • Does the library implement the domain?

    • Formally verify annotations against implementation

  • Practical approach

    • Annotation debugger: interactive

    • Automated assistance in early stages of development

  • Middle approach

    • Basic consistency checks


Error Checking vs Optimization

Error checking                     Optimization
Optimistic                         Pessimistic
  False positives allowed            Must preserve semantics
  It can even be unsound             Soundness mandatory
  Tend to be “may” analyses          Tend to be “must” analyses
Correctness is absolute            Performance is relative
  “Black and white”                  Spectrum of results
  Certify programs bug-free          No guarantees
Cost tolerant                      Cost sensitive
  Explore costly analysis            Compile-time is a factor


Complexity

  • Pointer analysis

    • Address taken: linear

    • Steensgaard: almost linear (log log n factor)

    • Andersen: polynomial (cubic)

    • Shape analysis: double exponential

  • Dataflow analysis

    • Intraprocedural: polynomial (height of lattice)

    • Context-sensitivity: exponential (call graph)

  • Rarely see worst-case
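The near-linear cost of Steensgaard-style analysis comes from unification: whenever two pointers may refer to the same thing, their targets are merged into one equivalence class with a union-find structure. The sketch below is a simplified illustration of that idea for two statement forms, `x = &y` and `x = y`. The `Steens` type and all function names are hypothetical, and it uses path halving without union by rank, so it makes no claim about matching the stated bound.

```c
#include <assert.h>

#define MAX_NODES 16

typedef struct {
    int parent[MAX_NODES];   /* union-find forest over abstract locations */
    int pointee[MAX_NODES];  /* per class: location it points to, or -1 */
    int count;               /* locations in use, including fresh ones */
} Steens;

void st_init(Steens *s, int n_vars)
{
    assert(n_vars >= 0 && n_vars <= MAX_NODES);
    s->count = n_vars;
    for (int i = 0; i < n_vars; i++) { s->parent[i] = i; s->pointee[i] = -1; }
}

int st_find(Steens *s, int x)
{
    assert(x >= 0 && x < s->count);
    while (s->parent[x] != x) {
        s->parent[x] = s->parent[s->parent[x]];   /* path halving */
        x = s->parent[x];
    }
    return x;
}

/* Unifies two classes and, recursively, whatever they point to. */
void st_join(Steens *s, int a, int b)
{
    a = st_find(s, a);
    b = st_find(s, b);
    if (a == b) return;
    int pa = s->pointee[a], pb = s->pointee[b];
    s->parent[b] = a;
    if (pa == -1) s->pointee[a] = pb;
    else if (pb != -1) st_join(s, pa, pb);
}

/* Class that x points to, created on demand as a fresh placeholder. */
int st_target(Steens *s, int x)
{
    int rx = st_find(s, x);
    if (s->pointee[rx] == -1) {
        assert(s->count < MAX_NODES);
        int fresh = s->count++;
        s->parent[fresh] = fresh;
        s->pointee[fresh] = -1;
        s->pointee[rx] = fresh;
    }
    return st_find(s, s->pointee[rx]);
}

void st_address_of(Steens *s, int x, int y) { st_join(s, st_target(s, x), y); }  /* x = &y */
void st_copy(Steens *s, int x, int y) { st_join(s, st_target(s, x), st_target(s, y)); }  /* x = y */

/* True if p and q have been given the same target class. */
int st_same_target(Steens *s, int p, int q)
{
    int tp = s->pointee[st_find(s, p)];
    int tq = s->pointee[st_find(s, q)];
    return tp != -1 && tq != -1 && st_find(s, tp) == st_find(s, tq);
}
```

After `p = &a; q = &b; p = q;` the locations `a` and `b` end up in one class, even though `q` never points to `a`. That loss of precision is the price of the low cost, and it is the kind of imprecision a mixed-precision approach would refine only where a client needs it.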


Optimization

  • Overall strategy

    • Exploit layers and modularity

    • Customize lower-level layers in context

  • Compiler strategy: Top-down layer processing

    • Preserve high-level semantics as long as possible

    • Systematically dissolve layer boundaries

  • Annotation strategy

    • General-purpose specialization

    • Idiomatic code substitutions


PLAPACK Optimizations

  • PLAPACK matrices are distributed

  • Optimizations exploit special cases

    • Example: Matrix multiply

[Diagram: matrices distributed over a processor grid; depending on how its operands are distributed, a PLA_Gemm( , , ) call is replaced by PLA_Local_gemm or by PLA_Rankk.]


Results


Find the error – part 3

  • State-of-the-art compiler

struct __sue_23 * var_72;

struct __sue_25 * new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25));

_IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));

(& new_f->fp)->vtable = & _IO_file_jumps;

_IO_file_init(& new_f->fp);

if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) {

  var_72 = & new_f->fp.file;

  if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {

    if (var_72->_mode <= 0) ((struct __sue_23 *) var_72)->vtable = & _IO_file_jumps_maybe_mmap;

    else ((struct __sue_23 *) var_72)->vtable = & _IO_wfile_jumps_maybe_mmap;

    var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap;

  }

}

if (var_72->_flags & 8192U) _IO_un_link((struct __sue_23 *) var_72);

if (var_72->_flags & 8192U) status = _IO_file_close_it(var_72);

  else status = var_72->_flags & 32U ? - 1 : 0;

((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable) +

                             (var_72)->_vtable_offset))->__finish)(var_72, 0);

if (var_72->_mode <= 0)

  if (((var_72)->_IO_save_base != ((void *) 0))) _IO_free_backup_area(var_72);

if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_)) &&

var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_)) &&

    var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) { var_72->_flags = 0;

  free(var_72); }

bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);


End backup slides