Discovering Similarity of Short Programs by Canonical Form
Baohua Wu
University of Pennsylvania

Scenario
• Given a known malicious program P1 that targets a security hole and an unknown suspicious program P2, how can we measure the similarity of P2 to P1?
• Given known polymorphic malicious programs P1, P2, … Pn, how can we identify their common “fingerprints”?

Assumption
• Malicious programs are short, for example:
  • Scripts < 500 lines
  • Assembly code < 10 kilobytes

Obfuscation Techniques
• Dead-Code Insertion
  • NOP, CLI, STI, etc.
  • More complicated ones: inc/dec, push/pop
• Code Transposition
  • Add (unconditional) branches
  • Reorder independent instructions

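To make the dead-code insertion case concrete, here is a minimal Python sketch of a peephole pass that strips no-ops and adjacent cancelling pairs. The tuple instruction format, the names NO_OPS, CANCELLING and strip_dead_code, and the choice to ignore processor flags are all assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical instruction format: a tuple of mnemonic plus operands,
# e.g. ("nop",) or ("inc", "eax"). This is an illustration only.
NO_OPS = {"nop"}  # assumed set; extend it with other mnemonics as needed

# Pairs that cancel when they touch the same operand back to back.
# Assumption: side effects on flags and on memory below the stack are ignored.
CANCELLING = {("inc", "dec"), ("dec", "inc"), ("push", "pop")}


def strip_dead_code(instructions):
    """Drop no-ops and adjacent cancelling pairs from a straight-line block."""
    out = []
    for ins in instructions:
        if ins[0] in NO_OPS:
            continue  # a plain no-op never reaches the output
        same_operands = bool(out) and out[-1][1:] == ins[1:]
        if same_operands and (out[-1][0], ins[0]) in CANCELLING:
            out.pop()  # the pair cancels, so remove the first half too
            continue
        out.append(ins)
    return out


block = [
    ("mov", "eax", "1"),
    ("nop",),
    ("inc", "ebx"),
    ("push", "ecx"),
    ("pop", "ecx"),
    ("dec", "ebx"),
    ("add", "eax", "2"),
]
cleaned = strip_dead_code(block)
```

Because the output list is used as a stack, nested pairs such as inc / push / pop / dec collapse from the inside out.
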
Obfuscation Techniques
• Register Reassignment
  • Replace eax with ebx if ebx is unused in a live range
  • Prologue/epilogue code to swap registers
• Instruction Substitution
  • The IA-32 instruction set has many equivalent instructions

Obfuscation Techniques
• Data Modification
  • Replace a boolean variable with two integers
  • X → a < b
• Encryption
  • Polymorphic engine
  • Variable keys, algorithms, decryptors

Obfuscation Summary
• Changing instructions inside a basic block
• Changing control flow
• Dynamic code generation
• How do we deal with them?

Objective of Canonical Form of Programs
• Reducing polymorphism
• Identifying tokens for statistical analysis

Canonical Form of Programs
• Compact intermediate instructions
  • No or few alternative instructions
• Simplified programming model
  • Code segment – read only
  • Data segment – heap only (no stack, no registers)
  • No function calls except system calls
  • Conditional and loop instructions are kept

More about Canonical Form
• Encrypted code is processed in advance
  • Multiple phases of compilation
  • Or simply report it as suspicious
• No user-defined function calls
  • Recursive function elimination
  • Inline function expansion
• Code optimization by compiler techniques
  • No dead or useless code
  • No or few redundant common subexpressions

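The "no dead or useless code" requirement can be illustrated with a standard backward liveness pass over straight-line code. This is a minimal sketch under stated assumptions: the (dest, op, args) instruction shape, the convention that string arguments are variables and everything else is a constant, and the name eliminate_dead_code are hypothetical and do not come from the slides.

```python
def eliminate_dead_code(code, live_out):
    """Remove assignments whose result is never used.

    code     -- list of (dest, op, args) tuples in execution order (assumed format)
    live_out -- variables that must survive the block, e.g. system-call arguments
    """
    live = set(live_out)
    kept = []
    # Walk backwards: an assignment matters only if its target is live afterwards.
    for dest, op, args in reversed(code):
        if dest in live:
            live.discard(dest)  # this definition satisfies the later uses
            live.update(a for a in args if isinstance(a, str))  # operands become live
            kept.append((dest, op, args))
    kept.reverse()
    return kept


block = [
    ("a", "const", (1,)),
    ("b", "const", (2,)),       # only feeds d, which is never used
    ("c", "add", ("a", "a")),
    ("d", "mul", ("b", "b")),   # result is never used
]
optimized = eliminate_dead_code(block, live_out={"c"})
```

The sketch assumes instructions have no side effects other than writing their destination; system calls would have to be kept unconditionally.
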
More about Canonical Form
• For assembly programs, treat registers as variables
  • No limit on the number of registers
  • No unnecessary swapping instructions
• Rename variables in some total order (v1, v2, …)
  • Definition position in the program is a total order
  • But it may be changed by polymorphism
  • Main order by data dependency
  • Secondary order by variable type, length, name, definition position
• Reorder interchangeable instructions in alphabetical order

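The renaming and reordering rules above can be sketched as follows. This is one possible reading, not the author's algorithm: variables are ordered first by data-dependency depth, then by operation name and constant operands, with definition position as the final tie-breaker. The single-assignment (dest, op, args) format and the name canonicalize are assumptions for illustration.

```python
def canonicalize(code):
    """Rename variables to v1, v2, ... and order independent instructions.

    code -- list of (dest, op, args); each variable is assumed to be
            defined exactly once, before its first use.
    """
    # Data-dependency depth: 1 for instructions that use no variables,
    # otherwise one more than the deepest operand.
    depth = {}
    for dest, op, args in code:
        depth[dest] = 1 + max((depth[a] for a in args if a in depth), default=0)

    def key(ins):
        dest, op, args = ins
        consts = tuple(str(a) for a in args if a not in depth)
        # Main order: dependency depth. Secondary: operation and constants.
        # sorted() is stable, so remaining ties keep definition position.
        return (depth[dest], op, consts)

    ordered = sorted(code, key=key)
    names = {dest: f"v{i}" for i, (dest, _, _) in enumerate(ordered, start=1)}
    return [
        (names[dest], op, tuple(names.get(a, a) for a in args))
        for dest, op, args in ordered
    ]


# Two variants: registers reassigned and the independent loads swapped.
variant_1 = [("eax", "load", (1,)), ("ebx", "load", (2,)), ("ecx", "add", ("eax", "ebx"))]
variant_2 = [("edx", "load", (2,)), ("esi", "load", (1,)), ("edi", "add", ("esi", "edx"))]
canon_1 = canonicalize(variant_1)
canon_2 = canonicalize(variant_2)
```

Instructions at the same depth cannot depend on each other, so sorting within a depth level never violates a data dependency. Operand order of commutative operations is not normalized in this sketch.
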
What else for polymorphism?
• Changes in algorithm
  • Not in my scope…
• Changes in control flow
  • Unconditional branch insertion
  • Combination of conditional branches
  • Exchanging inner and outer loops
  • Useless branches

Unconditional branch insertion
Before:
  A; B; C;
After:
  goto 3;
  1: C; goto 4;
  2: B; goto 1;
  3: A; goto 2;
  4:

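Inserted unconditional branches can be undone by simply following them. The sketch below replays the goto example from this slide; the ("goto", n) / ("label", n) / ("stmt", text) encoding and the name straighten are hypothetical choices made for the illustration.

```python
def straighten(program):
    """Follow unconditional gotos and return statements in execution order.

    program -- list of ("goto", n), ("label", n) or ("stmt", text) items
               (an assumed encoding, not taken from the slides)
    """
    labels = {item[1]: i for i, item in enumerate(program) if item[0] == "label"}
    out, pc, seen = [], 0, set()
    while pc < len(program):
        if pc in seen:
            # With only unconditional jumps, revisiting a position means an endless loop.
            raise ValueError("cycle of unconditional jumps")
        seen.add(pc)
        kind, value = program[pc]
        if kind == "goto":
            pc = labels[value]
            continue
        if kind == "stmt":
            out.append(value)
        pc += 1  # labels and statements fall through to the next item
    return out


obfuscated = [
    ("goto", 3),
    ("label", 1), ("stmt", "C"), ("goto", 4),
    ("label", 2), ("stmt", "B"), ("goto", 1),
    ("label", 3), ("stmt", "A"), ("goto", 2),
    ("label", 4),
]
linear = straighten(obfuscated)
```
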
Combination of conditional branches
Before:
  If a < b Then A; Else B;
  If c < d Then C; Else D;
After:
  If a < b and c < d Then A; C;
  Else if a < b and c >= d Then A; D;
  Else if a >= b and c < d Then B; C;
  Else B; D;

Exchanging inner and outer loops
Before:
  Sum(matrix a)
    For (i=0;i<10;i++)
      For (j=0;j<10;j++)
        sum += a[i][j];
After:
  Sum(matrix a)
    For (j=0;j<10;j++)
      For (i=0;i<10;i++)
        sum += a[i][j];

Useless branches
With a useless branch:
  A;
  If date<1900 Goto End;
  B; C; . . .
  End: D;
Without it:
  A;
  B; C; . . .
  End: D;

Linearizing Control Flow
• …So far, no semantics is lost. Now it is different!
• Remove backward branches
  • Replace them (such as a loop) with repeated conditional statements
  • Number of repetitions is set to N (e.g. 2)
• Remove forward branches by enumerating possible combinations of executed branches
• Further change each path into canonical form
• CPS: Canonical Path Set
• A Critical Canonical Path in a CPS is a sub-path of an actual execution path that causes damage

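The two linearization steps can be sketched on a small structured program. This is a minimal illustration under assumptions: the nested-tuple program format, the "assume ..." tokens that record which branch was taken, the nested style of unrolling, and the names unroll and paths are all hypothetical.

```python
def unroll(cond, body, n):
    """Replace a loop by n nested 'if cond: body' statements (assumed unrolling style)."""
    block = []
    for _ in range(n):
        block = [("if", cond, body + block, [])]
    return block


def paths(block, n=2):
    """Enumerate every linear path through a block as a list of tokens.

    A block is a list whose items are either a statement string,
    ("if", cond, then_block, else_block) or ("loop", cond, body_block).
    """
    result = [[]]
    for node in block:
        if isinstance(node, str):
            alternatives = [[node]]
        elif node[0] == "if":
            _, cond, then_block, else_block = node
            alternatives = (
                [[f"assume {cond}"] + p for p in paths(then_block, n)]
                + [[f"assume not {cond}"] + p for p in paths(else_block, n)]
            )
        elif node[0] == "loop":
            _, cond, body = node
            alternatives = paths(unroll(cond, body, n), n)
        else:
            raise ValueError(f"unknown node: {node!r}")
        # Every path so far can continue with every alternative of this node.
        result = [prefix + alt for prefix in result for alt in alternatives]
    return result


program = ["A", ("if", "a<b", ["B"], ["C"]), ("loop", "i<10", ["D"])]
path_set = paths(program, n=2)
```

With N = 2 the loop contributes three alternatives (zero, one or two iterations) and the conditional contributes two, giving six paths. The number of paths multiplies with every branch, which is one reason the approach depends on the programs being short.
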
Similarity of Canonical Programs
• P1 is a known malicious program
• P2 is an unknown program
• Similarity(P1, P2) =

PathSim: Similarity of Canonical Paths
• Recall that canonical paths have
  • Linear execution
  • No control flow
  • No redundant common subexpressions
  • No useless code
  • No dead code
  • No registers
  • Variables renamed in some total order
  • Independent instructions sorted in alphabetical order
• Similarity algorithms for text documents can therefore be used

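Since a canonical path is just a sequence of tokens, any sequence-similarity measure for text can stand in for PathSim. The slides do not name a specific measure, so the sketch below swaps in Python's standard difflib.SequenceMatcher as one possible choice; the name path_sim and the sample instruction strings are hypothetical.

```python
from difflib import SequenceMatcher


def path_sim(path1, path2):
    """Similarity in [0, 1] between two canonical paths given as token lists.

    Uses difflib's ratio: 2 * (matched tokens) / (total tokens in both paths).
    This is a stand-in measure, not the one used in the original work.
    """
    return SequenceMatcher(None, path1, path2, autojunk=False).ratio()


known = ["v1 = load 1", "v2 = load 2", "v3 = add v1 v2", "syscall write v3"]
suspect = ["v1 = load 1", "v2 = load 2", "v3 = sub v1 v2", "syscall write v3"]
score = path_sim(known, suspect)
```

Here three of the four instructions match in order, so the score is 2 * 3 / 8 = 0.75.
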
Identifying Critical Canonical Path (CCP)
• P1, P2, P3, … Pn are known malicious programs
• A CCP must have at least one similar path in each of the Canonical Path Sets CPS(P1), CPS(P2), … CPS(Pn)
• Statistical algorithms can be applied, e.g. the Gibbs sampler

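The "at least one similar path in every CPS" condition can be checked directly by brute force. The sketch below is deliberately not a Gibbs sampler: it is a plain threshold filter, with difflib.SequenceMatcher as an assumed similarity measure and an arbitrary threshold of 0.8. The names path_sim and ccp_candidates and the sample paths are hypothetical.

```python
from difflib import SequenceMatcher


def path_sim(p, q):
    """Assumed stand-in similarity between two token lists, in [0, 1]."""
    return SequenceMatcher(None, p, q, autojunk=False).ratio()


def ccp_candidates(path_sets, threshold=0.8):
    """Return paths of the first CPS that have a similar path in every other CPS.

    path_sets -- [CPS(P1), CPS(P2), ...], each a list of token lists
    threshold -- assumed cut-off for calling two paths 'similar'
    """
    first, rest = path_sets[0], path_sets[1:]
    return [
        p for p in first
        if all(any(path_sim(p, q) >= threshold for q in cps) for cps in rest)
    ]


cps_1 = [["open", "write", "close"], ["print"]]
cps_2 = [["open", "write", "close"], ["sleep"]]
cps_3 = [["log", "open", "write", "close"]]
candidates = ccp_candidates([cps_1, cps_2, cps_3])
```

The cost is quadratic in the number of paths per program, which is tolerable only because the canonical path sets are assumed to be small.
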
Summary
• Assumption: malicious programs are short
• Canonical form for comparison
• Limited number of canonical linear paths
• Reduces to a similarity problem for text documents
• Statistical methods to identify common fingerprints

Acknowledgement
Thank You All!