
Incorporating Domain-Specific Information into the Compilation Process

Samuel Z. Guyer

Supervisor: Calvin Lin

April 14, 2003

Motivation

Two different views of software:

  • Compiler’s view
    • Abstractions: numbers, pointers, loops
    • Operators: +, -, *, ->, []
  • Programmer’s view
    • Abstractions: files, matrices, locks, graphics
    • Operators: read, factor, lock, draw

This discrepancy is a problem...

Find the error – part 1
  • Example:
  • Error: case outside of switch statement
    • Part of the language definition
    • Error reported at compile time
    • Compiler indicates the location and nature of error

switch (var_83) {
    case 0: func_24(); break;
    case 1: func_29(); break;
}
case 2: func_78();   /* ! error: case outside of switch */

Find the error – part 2
  • Example:
  • Improper call to libfunc_38
    • Syntax is correct – no compiler message
    • Fails at run-time
  • Problem: what does libfunc_38 do?

This is how compilers view reusables

struct __sue_23 * var_72;
char var_81[100];

var_72 = libfunc_84(__str_14, __str_65);
libfunc_44(var_72);
libfunc_38(var_81, 100, 1, var_72);   /* ! fails at run time */

Find the error – part 3
  • Example:
  • Improper call to fread() after fclose()
    • The names reveal the mistake

No traditional compiler reports this error

    • Run-time system: how does the code fail?
    • Code review: rarely this easy to spot

FILE * my_file;
char buffer[100];

my_file = fopen("my_data", "r");
fclose(my_file);
fread(buffer, 100, 1, my_file);   /* ! read from a closed file */

Problem
  • Compilers are unaware of library semantics
    • Library calls have no special meaning
    • The compiler cannot provide any assistance
  • Burden is on the programmer:
    • Use library routines correctly
    • Use library routines efficiently and effectively

These are difficult manual tasks

    • Tedious and error-prone
    • Can require considerable expertise
Solution
  • A library-level compiler
    • Compiler support for software libraries
    • Treat library routines more like built-in operators
  • Compile at the library interface level
    • Check programs for library-level errors
    • Improve performance with library-level optimizations

Key: Libraries represent domains

    • Capture domain-specific semantics and expertise
    • Encode in a form that the compiler can use
The Broadway Compiler
  • Broadway – source-to-source C compiler
    • Domain-independent compiler mechanisms
  • Annotations – lightweight specification language
    • Domain-specific analyses and transformations

Many libraries, one compiler

[Diagram: the application source code and the library (header files, source code, and annotations) feed Broadway; its Analyzer produces library-specific error reports, and its Optimizer produces integrated application+library source code.]

Benefits
  • Improves capabilities of the compiler
    • Adds many new error checks and optimizations
    • Qualitatively different
  • Works with existing systems
    • Domain-specific compilation without recoding
    • For us: more thorough and convincing validation
  • Improves productivity
    • Less time spent on manual tasks
    • All users benefit from one set of annotations
Outline
  • Motivation
  • The Broadway Compiler
  • Recent work on scalable program analysis
    • Problem: Error checking demands powerful analysis
    • Solution: Client-driven analysis algorithm
    • Example: Detecting security vulnerabilities
  • Contributions
  • Related work
  • Conclusions and future work
Security vulnerabilities
  • How does remote hacking work?
    • Most are not direct attacks (e.g., cracking passwords)
    • Idea: trick a program into unintended behavior
  • Automated vulnerability detection:
    • How do we define “intended”?
    • Difficult to formalize and check application logic

Libraries control all critical system services

    • Communication, file access, process control
    • Analyze routines to approximate vulnerability
Remote access vulnerability
  • Example:

    int sock;
    char buffer[100];

    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(buffer);   /* ! executes a remote command */

  • Vulnerability: executes any remote command
    • What if this program runs as root?
    • Clearly domain-specific: sockets, processes, etc.
    • Requirement: Data from an Internet socket should not specify a program to execute
  • Why is detecting this vulnerability hard?

Challenge 1: Pointers
  • Example:

    int sock;
    char buffer[100];
    char * ref = buffer;

    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(ref);   /* ! same vulnerability, reached through the alias */

  • Still contains a vulnerability
    • Only one buffer
    • Variables buffer and ref are aliases

We need an accurate model of memory

Challenge 2: Scope
  • Call graph: main calls socket, read, and execl
  • Objects flow throughout the program
    • No scoping constraints
    • Objects referenced through pointers

We need whole-program analysis

[Diagram: call graph rooted at main, with calls socket(AF_INET, SOCK_STREAM, 0), read(sock, buffer, 100), and execl(ref); the vulnerability is flagged at the execl node.]

Challenge 3: Precision
  • Static analysis is always an approximation
  • Precision: level of detail or sensitivity
    • Multiple calls to a procedure
      • Context-sensitive: analyze each call separately
      • Context-insensitive: merge information from all calls
    • Multiple assignments to a variable
      • Flow-sensitive: record each value separately
      • Flow-insensitive: merge values from all assignments

Lower precision reduces the cost of analysis: exponential → polynomial → ~linear
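To make the distinction concrete, here is a minimal, purely illustrative C fragment (the identifiers are hypothetical, not from the benchmark programs):

int x, y;

int * id(int * p) { return p; }   /* called from two sites below */

void example(void)
{
  int * a = id(&x);   /* context-sensitive:   a --> x                */
  int * b = id(&y);   /* context-sensitive:   b --> y                */
                      /* context-insensitive: a and b --> { x, y }   */
  int v;
  v = 1;              /* flow-sensitive: v is 1 here                 */
  v = 2;              /* flow-sensitive: v is 2 here                 */
                      /* flow-insensitive: v is { 1, 2 } everywhere  */
  (void) a; (void) b; (void) v;
}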

Insufficient precision
  • Example: context-insensitivity
  • Information merged at the call
    • Analyzer reports 2 possible errors
    • Only 1 real error

Imprecision leads to false positives

[Diagram: with the two calling contexts merged, both stdin and the Internet socket appear to reach the read, so both execl sites are flagged even though only the socket-derived one is a real error.]
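The following hypothetical C program is a sketch of that situation (the helper open_input() is invented for illustration): a context-insensitive analysis merges its two calling contexts, so both execl calls are reported, although only the second is a real error.

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

/* Hypothetical helper: returns either a network descriptor or stdin. */
static int open_input(int use_net)
{
  if (use_net)
    return socket(AF_INET, SOCK_STREAM, 0);   /* remote data */
  return fileno(stdin);                       /* local data  */
}

int main(void)
{
  char net_buf[100], local_buf[100];
  int net_fd   = open_input(1);
  int local_fd = open_input(0);

  read(net_fd, net_buf, sizeof net_buf);
  read(local_fd, local_buf, sizeof local_buf);

  execl(local_buf, local_buf, (char *) 0);   /* safe under the policy  */
  execl(net_buf, net_buf, (char *) 0);       /* the real vulnerability */
  return 0;
}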

Cost versus precision
  • Problem: A tradeoff
    • Precise analysis is prohibitively expensive
    • Cheap analysis produces too many false positives
  • Idea: Mixed precision analysis
    • Focus effort on the parts of the program that matter
    • Don’t waste time over-analyzing the rest

Key: Let error detection problem drive precision

Client-driven program analysis

Client-Driven Algorithm
  • Client: the error detection analysis problem
  • Algorithm:
    • Start with a fast, cheap analysis – monitor imprecision
    • Determine where extra precision is needed – reanalyze

[Diagram: the client analysis and the pointer analyzer share a memory model; a monitor records information loss in a dependence graph, and an adaptor combines the error reports with that graph to produce a new precision policy for reanalysis.]

Algorithm components
  • Monitor
    • Runs alongside main analysis
    • Records imprecision
  • Adaptor
    • Start at the locations of reported errors
    • Trace back to the cause and diagnose
Sources of imprecision
  • Polluting assignments
    • Multiple procedure calls: x = foo( ) reached from several call sites
    • Multiple assignments: x = ... on several paths
    • Conditions: if (cond) x = ...
  • Pointer dereference: (*ptr)
    • Polluted target: the object ptr refers to holds merged values
    • Polluted pointer: ptr itself may refer to several objects
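For instance, a small hypothetical C function illustrating the pointer-dereference cases:

#include <unistd.h>

/* Illustrative only: with imprecise points-to information the analysis
   cannot tell which buffer p names, so the read() pollutes both.      */
void fill(int fd, int cond)
{
  char a[100], b[100];
  char * p = cond ? a : b;   /* polluted pointer: p --> { a, b }           */
  read(fd, p, sizeof a);     /* polluted targets: both a and b are written */
}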

In action...
  • Monitor the analysis
  • Find polluting assignments
  • Diagnose and apply a "fix"
    • In this case: make one procedure context-sensitive
  • Reanalyze

[Diagram: after reanalysis with read handled context-sensitively, the stdin and socket descriptors stay separate, and only the execl call that receives socket data is still flagged.]

Methodology
  • Compare with commonly-used fixed precision
  • Metrics
    • Accuracy – number of errors reported

Includes false positives – fewer is better

    • Performance – only when accuracy is the same
Programs
  • 18 real C programs
    • Unmodified source – all the issues of production code
    • Many are system tools – run in privileged mode
  • Representative examples:
Error detection problems
  • File access: Files must be open when accessed
  • Remote access vulnerability: Data from an Internet socket should not specify a program to execute
  • Format string vulnerability (FSV): A format string may not contain untrusted data
  • Remote FSV: Check whether an FSV is remotely exploitable
  • FTP behavior: Can this program be tricked into reading and transmitting arbitrary files?
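For example, the FSV check flags code like the following (hypothetical, not from the benchmark programs):

#include <stdio.h>

/* Illustrative only: the first call lets untrusted data act as the format
   string (a format string vulnerability); the second is the safe idiom.  */
void log_request(const char * untrusted)
{
  printf(untrusted);          /* FSV: untrusted data may contain %n, %s, ... */
  printf("%s", untrusted);    /* safe: untrusted data is only an argument    */
}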


Full (CS-FS)

Medium (CI-FS)

Slow (CS-FI)

Fast (CI-FI)

Client-Driven

?

?

?

?

?

?

?

?

?

?

?

?

0

100X

29

26

15

7

85

0

31

5

0

0

41

1

0

28

89

0

0

7

0

5

Increasing number of CFG nodes

1

41

6

4

7

15

26

0

0

1

18

88

0

0

6

0

26

0

0

7

15

4

18

nn

fcron

make

named

SQLite

muh (I)

pfinger

pureftp

apache

stunnel

privoxy

muh (II)

cfengine

wu-ftp (I)

wu-ftp (II)

blackhole

ssh client

ssh server

0

7

0

0

29

0

6

0

85

28

2

0

31

4

5

93

0

41

Results

Remote access vulnerability

1000X

Normalized performance

10X

Overall results
  • 90 test cases: 18 programs, 5 problems
    • Test case: 1 program and 1 error detection problem
    • Compare algorithms: client-driven vs. fixed precision

  • As accurate as any other algorithm: 87 out of 90
  • Runs faster than the best fixed algorithm: 64 out of 87
    • In the remaining cases, performance is not an issue: 19 of 23
  • Both most accurate and fastest: 29 out of 64

Why does it work?
  • Validates our hypothesis
    • Different errors have different precision requirements
    • Amount of extra precision is small
Outline
  • Motivation
  • The Broadway Compiler
  • Recent work on scalable program analysis
    • Problem: Error checking demands powerful analysis
    • Solution: Client-driven analysis algorithm
    • Example: Detecting security vulnerabilities
  • Contributions
  • Related work
  • Conclusions and future work
Central contribution
  • Library-level compilation
    • Opportunity: library interfaces make domains explicit in existing programming practice
    • Key: a separate language for codifying domain-specific knowledge
    • Result: our compiler can automate previously manual error checks and performance improvements

             Knowledge representation | Applying knowledge | Results
  Old way:   Informal                 | Manual             | Difficult, unpredictable
  Broadway:  Codified                 | Compiler           | Easy, automatic, reliable

Specific contributions
  • Broadway compiler implementation
    • Working system (43K C-Breeze, 23K pointers, 30K Broadway)
  • Client-driven pointer analysis algorithm [SAS’03]
    • Precise and scalable whole-program analysis
  • Library-level error checking experiments [CSTR’01]
    • No false positives for format string vulnerability
  • Library-level optimization experiments [LCPC’00]
    • Solid improvements for PLAPACK programs
  • Annotation language [DSL’99]
    • Balance expressive power and ease of use
Related work
  • Configurable compilers
    • Power versus usability – who is the user?
  • Active libraries
    • Previous work focusing on specific domains
    • Few complete, working systems
  • Error detection
    • Partial program verification – paucity of results
  • Scalable pointer analysis
    • Many studies of cost/precision tradeoff
    • Few mixed-precision approaches
Future work
  • Language
    • More analysis capabilities
  • Optimization
    • We have only scratched the surface
  • Error checking
    • Resource leaks
    • Path sensitivity – conditional transfer functions
  • Scalable analysis
    • Start with cheaper analysis – unification-based
    • Refine to more expensive analysis – shape analysis
Annotations (I)
  • Dependence and pointer information
    • Describe pointer structures
    • Indicate which objects are accessed and modified

procedure fopen(pathname, mode)
{
  on_entry { pathname --> path_string
             mode --> mode_string }
  access   { path_string, mode_string }
  on_exit  { return --> new file_stream }
}

Annotations (II)
  • Library-specific properties

Dataflow lattices

property State : { Open, Closed } initially Open
property Kind  : { File, Socket { Local, Remote } }

[Lattice diagrams for the State and Kind properties, showing Open/Closed and File/Socket{Local, Remote} between top and bottom elements.]

Annotations (III)
  • Library routine effects

Dataflow transfer functions

procedure socket(domain, type, protocol)
{
  analyze Kind {
    if (domain == AF_UNIX) IOHandle <- Local
    if (domain == AF_INET) IOHandle <- Remote
  }
  analyze State { IOHandle <- Open }
  on_exit { return --> new IOHandle }
}

Annotations (IV)
  • Reports and transformations

procedure execl(path, args)
{
  on_entry { path --> path_string }
  report if (Kind : path_string could-be Remote)
    "Error at " ++ $callsite ++ ": remote access";
}

procedure slow_routine(first, second)
{
  when (condition)
    replace-with %{ quick_check($first);
                    fast_routine($first, $second); }%
}
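As a sketch only – these annotations are written in the style of the examples above and are not taken from the actual Broadway annotation files – the fread-after-fclose error from the motivation could be expressed with the State property:

procedure fclose(stream)
{
  on_entry { stream --> file_stream }
  access   { file_stream }
  analyze State { file_stream <- Closed }
}

procedure fread(buffer, size, count, stream)
{
  on_entry { stream --> file_stream }
  report if (State : file_stream could-be Closed)
    "Error at " ++ $callsite ++ ": read from a closed file";
}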

Why does it work?
  • Validates our hypothesis
    • Different clients have different precision requirements
    • Amount of extra precision is small
Validation
  • Optimization experiments
    • Cons: One library – three applications
    • Pros: Complex library – consistent results
  • Error checking experiments
    • Cons: Quibble about different errors
    • Pros: We set the standard for experiments
  • Overall
    • The same system, designed for optimization, is among the best at detecting errors and security vulnerabilities

Type Theory
  • Equivalent to dataflow analysis (heresy?)
    • Flow values ↔ types; transfer functions ↔ inference rules
    • (Remember Phil Wadler's talk?)
  • Different in practice
    • Dataflow: flow-sensitive problems, iterative analysis
    • Types: flow-insensitive problems, constraint solver
  • Commonality
    • No magic bullet: same cost for the same precision
    • Extracting the store model is a primary concern

Generators
  • Direct support for domain-specific programming
    • Language extensions or new language
    • Generate implementation from specification
  • Our ideas are complementary
    • Provides a way to analyze component compositions
    • Unifies common algorithms:
      • Redundancy elimination
      • Dependence-based optimizations
Is it correct?

Three separate questions:

  • Are Sam Guyer’s experiments correct?
    • Yes, to the best of our knowledge
      • Checked PLAPACK results
      • Checked detected errors against known errors
  • Is our compiler implemented correctly?
    • Flip answer: whose is?
    • Better answer: testing suites
  • How do we validate a set of annotations?
Annotation correctness

Not addressed in my dissertation, but...

  • Theoretical approach
    • Does the library implement the domain?
    • Formally verify annotations against implementation
  • Practical approach
    • Annotation debugger: interactive
    • Automated assistance in early stages of development
  • Middle approach
    • Basic consistency checks
Error Checking vs Optimization

  Error checking                 Optimization
  Optimistic                     Pessimistic
  False positives allowed        Must preserve semantics
  It can even be unsound         Soundness mandatory
  Tend to be "may" analyses      Tend to be "must" analyses
  Correctness is absolute        Performance is relative
  "Black and white"              Spectrum of results
  Certify programs bug-free      No guarantees
  Cost tolerant                  Cost sensitive
  Explore costly analysis        Compile-time is a factor
Complexity
  • Pointer analysis
    • Address taken: linear
    • Steensgaard: almost linear (log log n factor)
    • Andersen: polynomial (cubic)
    • Shape analysis: double exponential
  • Dataflow analysis
    • Intraprocedural: polynomial (height of lattice)
    • Context-sensitivity: exponential (call graph)
  • Rarely see worst-case
Optimization
  • Overall strategy
    • Exploit layers and modularity
    • Customize lower-level layers in context
  • Compiler strategy: Top-down layer processing
    • Preserve high-level semantics as long as possible
    • Systematically dissolve layer boundaries
  • Annotation strategy
    • General-purpose specialization
    • Idiomatic code substitutions
PLAPACK Optimizations
  • PLAPACK matrices are distributed over a processor grid
  • Optimizations exploit special cases
    • Example: matrix multiply – a general PLA_Gemm( , , ) call is rewritten as PLA_Local_gemm or PLA_Rankk when the operand distributions allow it
Find the error – part 3
  • State-of-the-art compiler

struct __sue_23 * var_72;
struct __sue_25 * new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25));

_IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));
(& new_f->fp)->vtable = & _IO_file_jumps;
_IO_file_init(& new_f->fp);
if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) {
  var_72 = & new_f->fp.file;
  if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {
    if (var_72->_mode <= 0) ((struct __sue_23 *) var_72)->vtable = & _IO_file_jumps_maybe_mmap;
    else ((struct __sue_23 *) var_72)->vtable = & _IO_wfile_jumps_maybe_mmap;
    var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap;
  }
}
if (var_72->_flags & 8192U) _IO_un_link((struct __sue_23 *) var_72);
if (var_72->_flags & 8192U) status = _IO_file_close_it(var_72);
else status = var_72->_flags & 32U ? - 1 : 0;
((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable) +
                             (var_72)->_vtable_offset))->__finish)(var_72, 0);
if (var_72->_mode <= 0)
  if (((var_72)->_IO_save_base != ((void *) 0))) _IO_free_backup_area(var_72);
if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_)) &&
    var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_)) &&
    var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) { var_72->_flags = 0;
  free(var_72); }
bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);
