

Machine-Learning Assisted Binary Code Analysis

N. Rosenblum, X. Zhu, B. Miller

Computer Sciences Department

University of Wisconsin - Madison

{nater,jerryzhu,[email protected]

K. Hunt

National Security Agency

[email protected]


Supporting Static Binary Analysis

Binary Analysis is a Foundational Technique for Many Areas

Example Uses

  • Malware detection

  • Vulnerability analysis

  • Static and Dynamic Instrumentation

  • Formal verification

Why Analyze Binaries?

  • Source code unavailable

    • e.g., malware

  • Source code is inaccurate

    • Compiler transforms structure

  • Provides most accurate representation

Code is found through symbol information and parsing

MUCH HARDER without symbols

Rosenblum, Zhu, Miller, Hunt


Many Binaries are Stripped

Stripped binaries lack symbol & debug information

Examples:

  • Malicious programs

  • Operating system distributions

  • Commercial software packages

  • Legacy codes

[Figure: binary layout: Headers, Code Segment (functions?), Data Segment]

Standard Approach: Parse from entry point
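Parsing from the entry point can be illustrated with a small recursive-traversal sketch. The instruction table below is a hypothetical toy encoding, not real x86 decoding and not the Dyninst parser: each address maps to a mnemonic, a size, and an optional direct branch target.

```python
# Toy program: address -> (mnemonic, size_in_bytes, direct_target or None).
# This encoding is hypothetical; a real parser would decode machine code.
PROGRAM = {
    0x00: ("call", 5, 0x10),   # direct call: target is known statically
    0x05: ("ret", 1, None),
    0x10: ("push", 1, None),
    0x11: ("jmp", 2, 0x20),    # unconditional jump, no fall-through
    0x20: ("ret", 1, None),
    0x30: ("push", 1, None),   # only reached through a function pointer
    0x31: ("ret", 1, None),
}

def recursive_traversal(program, entry):
    """Return the set of instruction addresses reachable from entry."""
    seen, worklist = set(), [entry]
    while worklist:
        addr = worklist.pop()
        if addr in seen or addr not in program:
            continue
        seen.add(addr)
        mnemonic, size, target = program[addr]
        if target is not None:
            worklist.append(target)        # follow direct control flow
        if mnemonic not in ("ret", "jmp"):
            worklist.append(addr + size)   # fall through to the next instruction
    return seen

reached = recursive_traversal(PROGRAM, 0x00)
unreached = sorted(set(PROGRAM) - reached)
```

In this toy case the function at 0x30 is never reached, because no direct call or jump names it. That is the situation symbol information would normally resolve.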


Stripped Binaries Exhibit Gaps

After static parsing, gap regions remain in the code segment

  • Indirect (pointer-based) control ambiguity

  • Deliberate call/branch obfuscation

  • Gaps in code segment may not contain code
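A minimal sketch of how gap regions could be computed once parsing has finished, assuming the parser reports the byte ranges it covered as half-open `(start, end)` pairs. The function name and the addresses are illustrative only.

```python
def find_gaps(segment_start, segment_end, covered):
    """Return the half-open ranges of the segment not covered by parsed code."""
    gaps, cursor = [], segment_start
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, start))   # bytes between two parsed regions
        cursor = max(cursor, end)          # tolerate overlapping ranges
    if cursor < segment_end:
        gaps.append((cursor, segment_end))
    return gaps

# Hypothetical code segment and parsed function extents.
gaps = find_gaps(0x1000, 0x1100,
                 [(0x1000, 0x1040), (0x1080, 0x10c0), (0x1030, 0x1050)])
```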


Stripped Binaries Exhibit Gaps

Gap contents may vary

[Figure: code segment]

String data

  • Dialog Constants

  • Import names

  • Other strings

Example:

.__gmon_start__.libc.so.6.stpcpy.strcpy.__divdi3.printf.stdout.strerror.memmove.getopt_long.re_syntax_options.__ctype_b.getenv.__strtol_internal.getpagesize.re_search_2.memcpy.puts.feof.malloc.optarg.btowc._obstack_newchunk.re_match.__ctype_toupper.__xstat64.abort.strrchr._obstack_begin.calloc.re_set_registers.fprintf.
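String data in a gap can often be spotted as runs of printable ASCII. The sketch below is a simple heuristic in the spirit of the `strings` utility, not part of the presented model; the sample bytes and the minimum length of 4 are assumptions.

```python
import re

def printable_runs(data, min_len=4):
    """Return (offset, text) for runs of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

# Hypothetical gap: a few code-like bytes around two NUL-terminated import names.
gap = b"\x55\x89\xe5\x00libc.so.6\x00strcpy\x00\xc3\x90"
runs = printable_runs(gap)
```

Short printable bytes such as `0x55` are ignored by the length threshold, which is why the threshold matters: many opcodes fall in the printable range.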


Stripped Binaries Exhibit Gaps

Gap contents may vary

[Figure: code segment]

Tables or lists of addresses

  • Jump tables

  • Virtual function tables

  • Data objects

Example:

0x8022346 0x802434b 0x80243ad 0x80403d0
0x80503d0 0x8052140 0x8053142 0x806000b
0x802321a 0x8023332 0x804132a 0x8050ca0
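One way to recognize address tables is to look for consecutive words that all point into the code segment. This is a hedged sketch assuming 32-bit little-endian, 4-byte-aligned entries; the segment bounds and the threshold of three entries are assumptions, and the sample addresses are taken from the slide.

```python
import struct

def address_table_runs(data, code_start, code_end, min_entries=3):
    """Find runs of aligned 32-bit words whose values fall inside the code segment."""
    runs, current = [], []
    for offset in range(0, len(data) - 3, 4):
        (word,) = struct.unpack_from("<I", data, offset)
        if code_start <= word < code_end:
            current.append(word)
        else:
            if len(current) >= min_entries:
                runs.append(current)
            current = []
    if len(current) >= min_entries:
        runs.append(current)
    return runs

table = struct.pack("<4I", 0x8022346, 0x802434b, 0x80243ad, 0x80403d0)
gap = b"\x90\x90\x90\x90" + table + b"\x00\x00\x00\x00"
runs = address_table_runs(gap, 0x8000000, 0x8100000)
```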


Stripped Binaries Exhibit Gaps

Gap contents may vary

[Figure: code segment containing gap_funcA { . . . }, gap_funcB { . . . }, gap_funcC { . . . }]

Code unreachable through standard static parsing

  • Function pointers

  • Virtual methods

  • Obfuscated calls


Stripped Binaries Exhibit Gaps

Gap contents may vary

[Figure: code segment]

7a 01 00 fd a2 b3
74 68 69 73 20 65
78 61 6d 70 6c 65
20 69 73 20 62 6f
67 75 73 2e 2e 2e
7a 01 00 fd a2 b3
74 68 69 73 20 65
78 61 6d 70 6c 65
20 69 73 20 62 6f
67 75 73 2e 2e 2e
7a 01 00 fd a2 b3
74 68 69 73 20 65
78 61 6d 70 6c 65
20 69 73 20 62 6f
But… all of these just look like bytes

Every byte in gaps may be the start of a function

How can we find code in gaps?

Our approach: Use information in known code to model code in gaps

Previous work (Vigna et al., 2007) augments parsing with simple instruction frequency information

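The hex rows on this slide make the point concretely: read as ASCII, most of them are text rather than code. The short sketch below decodes the first five rows and enumerates the candidate entry points a naive search would have to consider. The variable names are illustrative.

```python
rows = [
    "7a 01 00 fd a2 b3",
    "74 68 69 73 20 65",
    "78 61 6d 70 6c 65",
    "20 69 73 20 62 6f",
    "67 75 73 2e 2e 2e",
]
gap = bytes.fromhex(" ".join(rows))

# Everything after the first six bytes happens to be printable text.
text = gap[6:].decode("ascii")

# Without a model, every byte offset is a candidate function entry point.
candidates = list(range(len(gap)))
```

Here `text` is `"this example is bogus..."`, yet nothing in the raw bytes alone distinguishes it from instructions.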


Modeling Binary Code

Problem reduces to finding function entry points

Task: Classifying every byte in a gap as entry point or non-entry point

Two types of features:

  • Content: Idiom features of function entry points

    • Based on instruction sequences

  • Structure: Control flow & conflict features

    • Capture relationship of candidate function entry points

    • Requires joint assignment over all function entry point candidates


Content-based Features

Entry idioms are common patterns at function entry points

Idioms are preceding and succeeding instruction sequences with wildcards

For each idiom u,

[Figure: candidate C1]

Entry idioms

push ebp

push ebp|mov esp,ebp

push ebp|*|sub esp

push ebp|*|mov esp,ebp

*|mov esp,ebp

*|sub 0x8,esp

*|mov 0x8(ebp),eax

PRE nop

PRE ret|nop

PRE pop ebp|*|nop
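A minimal sketch of idiom matching, under several assumptions not stated on the slide: instructions are already disassembled into normalized strings, `*` matches exactly one instruction, a non-wildcard token matches as a string prefix, and a `PRE` idiom is matched against the instructions immediately before the candidate. The function names are hypothetical.

```python
def parse_idiom(text):
    """Split an idiom string into (is_prefix_idiom, tokens)."""
    is_prefix = text.startswith("PRE ")
    body = text[4:] if is_prefix else text
    return is_prefix, body.split("|")

def matches(idiom, instructions, index):
    """True if the idiom matches at candidate position index."""
    is_prefix, tokens = parse_idiom(idiom)
    start = index - len(tokens) if is_prefix else index
    if start < 0 or start + len(tokens) > len(instructions):
        return False
    window = instructions[start:start + len(tokens)]
    # "*" matches any single instruction; other tokens match as a prefix.
    return all(t == "*" or ins.startswith(t) for t, ins in zip(tokens, window))

stream = ["ret", "nop", "push ebp", "mov esp,ebp", "sub 0x8,esp"]
idioms = ["push ebp", "push ebp|*|sub", "PRE ret|nop", "*|mov 0x8(ebp),eax"]

# Binary feature vector for the candidate at index 2 ("push ebp").
features = [int(matches(u, stream, 2)) for u in idioms]
```

Each idiom contributes one binary feature per candidate, so a candidate is summarized by a 0/1 vector that a classifier can weight.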


Call Consistency & Overlap

Call & conflict features relate candidate function entry points (FEPs) over the entire gap

Candidates: C1, C2, C3, C4

Example joint assignment: y1 = 1, y2 = 1, y3 = -1, y4 = 1
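The slide does not define the two structural features precisely, so the sketch below is one plausible reading rather than the authors' definition. It treats two candidates as overlapping when they decode a shared byte as part of different instructions, and a call as inconsistent when an accepted candidate calls an entry that was rejected. All names and data are hypothetical.

```python
def instruction_bytes(instrs):
    """Map each covered byte address to the start of the instruction covering it."""
    return {addr + k: addr for addr, size in instrs for k in range(size)}

def overlap_conflict(a, b):
    """True if a and b decode some shared byte as part of different instructions."""
    cover_a, cover_b = instruction_bytes(a), instruction_bytes(b)
    return any(cover_a[x] != cover_b[x] for x in cover_a.keys() & cover_b.keys())

def call_inconsistent(caller_calls, y_caller, callee_entry, y_callee):
    """True if an accepted candidate calls an entry that was rejected."""
    return y_caller == 1 and callee_entry in caller_calls and y_callee == -1

# Each candidate is a list of (address, size) instructions decoded from its entry.
c1 = [(0x10, 2), (0x12, 3)]   # decodes bytes 0x10 through 0x14
c2 = [(0x11, 3), (0x14, 1)]   # starts in the middle of c1's first instruction
c3 = [(0x20, 1)]

conflict_12 = overlap_conflict(c1, c2)
conflict_13 = overlap_conflict(c1, c3)
bad_call = call_inconsistent({0x20}, 1, 0x20, -1)
```

Both features depend on pairs of candidates, which is why labels cannot be chosen one candidate at a time.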


Experimental Setup

  • Large set (hundreds) of binaries from department Linux servers and Windows workstations

  • Additional binaries compiled with Intel compiler

  • Binaries have full symbol information

  • Model implemented as extensions to Dyninst instrumentation library

1. Strip binary copies and parse to obtain training set

2. Select top idiom features by forward feature selection

3. Perform logistic regression to build idiom model

4. Evaluate model on test data from gap regions in Step 1

    • Unstripped copies of binaries provide reference set
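The logistic regression step can be sketched in plain Python. This is a generic batch gradient-descent trainer on toy data, not the authors' implementation; the feature columns, learning rate, and epoch count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=2000, lr=0.5):
    """Batch gradient descent on the mean logistic loss; labels are 0 or 1."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, xi):
    """Label a candidate as an entry point (1) or not (0)."""
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0

# Toy training set: each row holds hypothetical idiom indicators for one
# candidate; label 1 marks a true function entry point.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

In the setup described above, the positive labels would come from functions found by parsing the stripped copies, and the unstripped copies would supply the reference labels for evaluation.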


Preliminary Results

  • GNU C Compiler

    • Simple, regular function preamble

  • MS Visual Studio

    • High variation in function entry points

  • Intel C Compiler

    • Most variation in entry points; highly optimized



Preliminary Results

Comparison of three binary analysis tools:

  • Original Dyninst

    • Scans for common entry preamble

  • IDA Pro Disassembler

    • Scans for common entry preamble

    • List of Library Fingerprints (Windows)

  • Dyninst w/ Model

    • Model replaces entry preamble heuristic



Preliminary Results

  • Classifier maintains high precision with good recall

  • Model performance highly system-dependent

    • MS Visual Studio & Intel C Compiler FEPs are highly variable

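Precision and recall here compare the entry points a tool reports on a stripped binary against the reference set from the unstripped copy. A minimal sketch with made-up addresses:

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for sets of function entry addresses."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall

reference = {0x1000, 0x1040, 0x1080, 0x10c0}   # hypothetical: from the unstripped copy
predicted = {0x1000, 0x1040, 0x1044}           # hypothetical: tool output on the stripped copy
precision, recall = precision_recall(predicted, reference)
```

High precision means few reported entry points are spurious; high recall means few true entry points are missed.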


Backup Slides



Idiom Feature Selection & Training

1. Obtain training data from traditional parse

    • Statically reachable functions

    • Corpus is hundreds of stripped binaries

2. Use Condor HTC to drive forward feature selection on idioms

    • Features: Feat1, Feat2, Feat3, ..., Featk

3. Perform logistic regression on the selected idiom features to obtain the model parameters
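Forward feature selection can be sketched as a greedy loop that repeatedly adds the single feature giving the best score. The slides do not state the selection criterion, so the scorer below is a deliberately simple stand-in and the early stop on no improvement is an assumption; in practice the score would come from a trained classifier.

```python
def forward_select(n_features, score, k):
    """Greedily add the feature that yields the best score until k are chosen."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        if selected and score(selected + [best]) <= score(selected):
            break   # no remaining feature improves the score
        selected.append(best)
    return selected

# Toy data: rows are candidates, columns are hypothetical idiom indicators.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]

def accuracy_of_vote(features):
    """Stand-in scorer: predict 1 when at least half of the chosen features fire."""
    correct = 0
    for xi, yi in zip(X, y):
        fired = sum(xi[f] for f in features)
        correct += int((2 * fired >= len(features)) == bool(yi))
    return correct / len(X)

chosen = forward_select(3, accuracy_of_vote, k=2)
```

Within one round, each remaining feature is scored independently of the others, so the evaluations can run in parallel on a high-throughput system.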


Model Formalization

  • Joint assignment of yi ∈ {1, -1} for each candidate FEP xi in binary P

  • Unary idiom features fu

    • Weights on fu trained through logistic regression

  • Binary features fo (overlap), fc (call consistency)

    • Weights on fo, fc large, negative
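To illustrate what a joint assignment means, the sketch below scores a full labeling as the sum of unary scores for accepted candidates plus a large negative penalty for each conflicting pair that is accepted together. It models only the overlap feature, uses invented scores, and finds the best labeling by exhaustive search; none of this is claimed to be the authors' inference procedure.

```python
from itertools import product

def joint_score(y, unary, conflicts, penalty=-10.0):
    """Unary scores of accepted candidates plus a penalty per accepted conflicting pair."""
    total = sum(u for yi, u in zip(y, unary) if yi == 1)
    total += sum(penalty for i, j in conflicts if y[i] == 1 and y[j] == 1)
    return total

def best_assignment(unary, conflicts):
    """Exhaustive search over all +1/-1 labelings; only feasible for tiny examples."""
    n = len(unary)
    return max(product((1, -1), repeat=n),
               key=lambda y: joint_score(y, unary, conflicts))

unary = [2.0, 1.5, 1.0, 0.5]   # hypothetical idiom scores for C1..C4
conflicts = [(1, 2)]           # hypothetical: C2 and C3 overlap
y = best_assignment(unary, conflicts)
```

With these numbers the search accepts C1, C2, and C4 and rejects C3, the same pattern as the example assignment y1 = 1, y2 = 1, y3 = -1, y4 = 1: the large negative weight makes accepting both conflicting candidates worse than dropping the weaker one.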

