1. Decompilers and beyond
2. 2 (c) 2008 Hex-Rays SA Presentation Outline Why do we need decompilers?
Complexity must be justified
Typical decompiler design
There are some misconceptions
Decompiler based analysis
New analysis type and tools become possible
Future
“...is bright and sunny”
Your feedback
Online copy of this presentation is available at http://www.hex-rays.com/idapro/ppt/decompilers_and_beyond.ppt
3. 3 (c) 2008 Hex-Rays SA Disassemblers We need disassemblers to analyze binary code
Simple disassemblers produce a listing with instructions
Better disassemblers assist in analysis by annotating the code, offering good navigation, etc. You know the difference.
Even the ideal disassembler stays at a low level: the output is an assembly listing
The main output of a disassembler is still a one-to-one mapping of opcodes to instruction mnemonics
No leverage, no abstractions, little insight
The analyst must mentally map assembly instructions to higher level abstractions and concepts
A boring and routine task after a while
4. 4 (c) 2008 Hex-Rays SA Disassembler limitations The output is
Boring
Inhuman
Repetitive
Error prone
Requires special skills
Did I say repetitive?
Yet some geeks like it?...
5. 5 (c) 2008 Hex-Rays SA Decompilers The need:
Software grows like gas
Time spent on analysis skyrockets
Malware proliferates and mutates
We need better tools to handle this
Decompilation is the next logical step, yet a tough one
6. 6 (c) 2008 Hex-Rays SA Building an ideal decompiler The answer is clear and easy to give: ideal decompilers do not exist
It is customary to compare compilers and decompilers:
Preprocessing
Lexical analysis
Syntax analysis
Code generation
Optimization
This comparison is correct but superficial
7. 7 (c) 2008 Hex-Rays SA Compilers are privileged Strictly defined input language
Anything nonconforming – spit out an error message
Reasonable amount of information on all functions, variables, types, etc.
The output may be ugly
Who will ever read it but some geeks? :)
8. 8 (c) 2008 Hex-Rays SA Machine code decompilers are impossible Informal and sometimes hostile input
Many problems are unsolved or proved to be unsolvable in general
The output is examined in detail by a human being, any suboptimality is noticed because it annoys the analyst
Conclusion: robust decompilers are impossible
What if we address only the common cases? For example, if we cover 90%, can the rest be handled manually?
9. 9 (c) 2008 Hex-Rays SA Easy for humans, hard for computers In fact, many (all?) problems encountered during decompilation are hard
For every problem, there is a naïve solution, which, unfortunately, does not work
Just a few examples...
10. 10 (c) 2008 Hex-Rays SA Function calls are a problem Function calls require answering the following questions:
Where does the function expect its input registers?
Where does it return the result?
What registers or memory cells does it spoil?
How does it change the stack pointer?
Does it return to the caller or somewhere else?
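To make the questions above concrete, here is a minimal C sketch (mine, not from the slides; the function names are made up) assuming MSVC-style 32-bit x86 calling-convention keywords. The same trivial function compiles to three different binary interfaces, and the decompiler has to recover these facts at every call site:

    int __cdecl    sum_cdecl(int a, int b)    { return a + b; } /* args on the stack, caller cleans up */
    int __stdcall  sum_stdcall(int a, int b)  { return a + b; } /* args on the stack, callee cleans up */
    int __fastcall sum_fastcall(int a, int b) { return a + b; } /* args in ecx/edx, result in eax      */

    int caller(int x, int y)
    {
        /* For each call the decompiler must determine: where the inputs are,
           where the result is, which registers get clobbered, and how the
           stack pointer changes once the call returns.                       */
        return sum_cdecl(x, y) + sum_stdcall(x, y) + sum_fastcall(x, y);
    }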
11. 11 (c) 2008 Hex-Rays SA Function return values are a problem Does the function return anything?
How big is the return value?
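An illustrative sketch (again mine, with hypothetical names) of why even this is hard: at the binary level every function simply ends, leaving something in the registers. On 32-bit x86 the usual conventions are noted in the comments:

    char      ret_char(void)   { return 'x'; }             /* result in al      */
    int       ret_int(void)    { return 42; }              /* result in eax     */
    long long ret_int64(void)  { return 1234567890123LL; } /* result in edx:eax */
    void      set_flag(int *p) { *p = 1; }                 /* returns nothing, yet
                                                               eax may still hold a
                                                               leftover value   */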
12. 12 (c) 2008 Hex-Rays SA Function input arguments are a problem When a register is accessed, it can be
To save its value
To allocate a stack frame
To pass a function argument
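A hedged illustration (my example): a 32-bit compiler may emit the very same push ecx instruction in each of the three situations above, and only data flow analysis of the surrounding code tells the decompiler which meaning applies.

    void callee(int arg) { (void)arg; }

    int reason_save(int x)       /* 1: push ecx may preserve a live value across the call */
    {
        callee(0);
        return x;
    }

    void reason_frame(void)      /* 2: push ecx may simply reserve 4 bytes of stack,
                                       a cheap alternative to sub esp, 4               */
    {
        volatile int scratch = 0;
        (void)scratch;
    }

    void reason_argument(int x)  /* 3: push ecx may pass x as an on-stack argument      */
    {
        callee(x);
    }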
13. 13 (c) 2008 Hex-Rays SA Indirect accesses are a problem Pointer aliases
No precise object boundaries
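A minimal aliasing sketch (mine, not from the slides): if p may point at g, the decompiler cannot safely propagate the stored constant across the indirect write, because there are no precise object boundaries to rule the overlap out.

    int g;

    int alias_example(int *p)
    {
        g = 1;
        *p = 2;       /* may or may not overwrite g            */
        return g;     /* 1 or 2? only alias analysis can tell  */
    }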
14. 14 (c) 2008 Hex-Rays SA Indirect jumps are a problem Indirect jumps are used for switch idioms and tail calls
Recognizing them is necessary to build the control flow graph
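For illustration (my example), a dense switch is typically compiled into a jump table, i.e. an indirect jmp through a table indexed by the switch value. Unless the decompiler recognizes the idiom and recovers the table, it cannot even finish the control flow graph of the function:

    int classify(int c)
    {
        switch (c) {              /* likely becomes: jmp [jump_table + eax*4]  */
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        case 3:  return 40;
        default: return -1;
        }
    }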
15. 15 (c) 2008 Hex-Rays SA Problems, problems, problems... Save-restore (push/pop) pairs
Partial register accesses (al/ah/ax/eax)
64-bit arithmetic
Compiler idioms
Variable live ranges (for stack variables)
Lost type information
Pointers vs. numbers
Virtual functions
Recursive functions
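A hedged sketch (mine) combining two of the items above, 64-bit arithmetic and partial register accesses: on 32-bit x86 the long long value is split across the edx:eax pair, and the char truncation is merely a use of al, the low byte of eax. The decompiler has to glue the halves back into one variable and keep the partial access consistent with the full register.

    unsigned long long widen_and_add(unsigned int a, unsigned long long b)
    {
        unsigned char low = (unsigned char)a;  /* partial register access: al vs eax  */
        return b + low;                        /* typically an add/adc pair on halves */
    }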
16. 16 (c) 2008 Hex-Rays SA Hopeless situation? Well, yes and no
While a fully automatic decompiler capable of handling arbitrary input is impossible, approximate solutions exist
We could start with a “simple” case:
Compiler-generated output (no hostile adversary generating increasingly complex input)
Only 32-bit code
No floating point, exception handling and other fancy stuff
17. 17 (c) 2008 Hex-Rays SA Basic ideas Make some configurable assumptions about the input (calling conventions, stack frames, memory model, etc.)
Use a sound theoretical approach for solvable problems (data flow analysis on registers, peephole optimization within basic blocks, instruction simplification, etc.)
Use heuristics for unsolvable problems (indirect jumps, function prologs/epilogs, call arguments)
Prefer to generate ugly but correct output rather than nice but incorrect code
Let the user guide the decompilation in difficult cases (specify indirect call targets, function prototypes, etc.)
Interactivity is necessary to achieve good results
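As a flavor of the "sound theoretical approach" item, here is a toy instruction-simplification pass (purely illustrative; it is not the real microcode API and all names are invented). It rewrites x ^ 0 and x | 0 into plain x, and x & 0 into the constant 0, the kind of local rule a decompiler applies thousands of times:

    enum op { OP_CONST, OP_REG, OP_XOR, OP_OR, OP_AND };

    struct expr {
        enum op      op;
        long         value;          /* meaningful for OP_CONST       */
        struct expr *left, *right;   /* operands of binary operators  */
    };

    static int is_zero(const struct expr *e)
    {
        return e && e->op == OP_CONST && e->value == 0;
    }

    /* Return the simplified expression; the caller keeps ownership of all nodes. */
    struct expr *simplify(struct expr *e)
    {
        if (!e || !e->left || !e->right)
            return e;                /* constants and registers stay as they are */
        if ((e->op == OP_XOR || e->op == OP_OR) && is_zero(e->right))
            return e->left;          /* x ^ 0 == x,  x | 0 == x                  */
        if (e->op == OP_AND && is_zero(e->right))
            return e->right;         /* x & 0 == 0                               */
        return e;
    }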
18. 18 (c) 2008 Hex-Rays SA Decompiler architecture Overall, it could look like this:
19. 19 (c) 2008 Hex-Rays SA Decompilation phases - 1
20. 20 (c) 2008 Hex-Rays SA Decompilation phases - 2
21. 21 (c) 2008 Hex-Rays SA Microcode – just generated It is very detailed
Redundant
One basic block at a time
22. 22 (c) 2008 Hex-Rays SA After preoptimization
23. 23 (c) 2008 Hex-Rays SA After local optimization This is much better
Please note that the condition codes are still present because they might be used by other blocks
Use-def lists are calculated dynamically
24. 24 (c) 2008 Hex-Rays SA After global optimization Condition codes are gone
The LDX instruction was propagated into the jz, and all references to eax are gone
Note that the jz target has changed (@3) since global optimization removed some unused code and blocks
We are ready for local variable allocation
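A C-level illustration of the propagation described above (the real transformation happens on microcode, not on C source; the pointer name is made up): the load into a temporary register is folded into the comparison that consumes it, so the temporary, and every reference to it, disappears.

    int before(int *esi)
    {
        int eax = *esi;        /* ldx: load through the pointer into a register */
        if (eax == 0)          /* jz: the only use of the loaded value          */
            return 1;
        return 0;
    }

    int after(int *esi)
    {
        return *esi == 0;      /* load propagated into the comparison; eax gone */
    }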
25. 25 (c) 2008 Hex-Rays SA After local variable allocation All registers have been replaced by local variables (ecx0, esi1; except ds)
Use-def lists are no longer meaningful at this point, but we do not need them anymore
Now we will perform structural analysis and create pseudocode
26. 26 (c) 2008 Hex-Rays SA Control graphs Original graph view Control flow graph
27. 27 (c) 2008 Hex-Rays SA Graph structure as a tree Structural analysis extracts the standard control flow constructs from CFG
The result is a tree similar to the one below. It will be used to generate pseudocode
The structural analysis algorithm is robust and can handle any graph, including irreducible ones
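An illustrative pair of functions (mine): the flat, goto-based form is what a raw control flow graph corresponds to; structural analysis recognizes the loop and the exit condition and yields the structured version.

    int flat_cfg(int n)                 /* one block per label, edges as gotos  */
    {
        int sum = 0;
    block_1:
        if (n <= 0) goto block_3;
        sum += n;
        n--;
        goto block_1;
    block_3:
        return sum;
    }

    int structured(int n)               /* the same function after structuring  */
    {
        int sum = 0;
        while (n > 0) {
            sum += n;
            n--;
        }
        return sum;
    }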
28. 28 (c) 2008 Hex-Rays SA Initial pseudocode is ugly Almost unreadable...
29. 29 (c) 2008 Hex-Rays SA Transformations improve it Some casts still remain
30. 30 (c) 2008 Hex-Rays SA Interactive operation allows us to fine-tune it Final result after some renamings and type adjustments:
The initial assembly is too long to be displayed on a slide
Pseudocode is much shorter and more readable
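Since the slide's own listing is not reproduced here, a hypothetical before/after (all names, offsets, and types are invented) shows what renaming and retyping typically do to raw pseudocode:

    /* raw decompiler output: */
    int sub_401000(int a1, int a2)
    {
        return *(int *)(a1 + 4 * a2 + 8);
    }

    /* after the analyst renames the function and fixes the argument types: */
    typedef struct item { int id; int flags; int values[1]; } item_t;

    int get_value(item_t *item, int index)
    {
        return item->values[index];
    }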
31. 31 (c) 2008 Hex-Rays SA What decompilation gives us Obvious benefits
Saves time
Eliminates routine tasks
Makes source code recovery easier (...)
New things
Next abstraction level - closer to application domain
Data flow based tools (vulnerability scanner, anyone? :)
Binary translation
32. 32 (c) 2008 Hex-Rays SA Base to build on... To be useful and make other tools possible, the decompiler must have a programmable API
It already exists but it needs some refinement
Microcode is not accessible yet
The decompiler is retargetable (x86 now, ARM will be next)
Both interactive and batch modes are possible
In addition to being a tool to examine binaries, decompiler could be used for...
33. 33 (c) 2008 Hex-Rays SA ...program verification Well, “verification” won't be strict but it can help to spot interesting locations in the code:
Missing return value validations (e.g. for NULL pointers)
Missing input value validations
Taint analysis
Insecure code patterns
Uninitialized variables
etc..
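A minimal example (mine, not from the talk) of the "missing return value validation" pattern such a scanner could flag in decompiled output: the malloc() result is dereferenced without a NULL check.

    #include <stdlib.h>
    #include <string.h>

    char *duplicate(const char *s)
    {
        char *copy = malloc(strlen(s) + 1);
        strcpy(copy, s);       /* flagged: copy may be NULL here */
        return copy;
    }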
34. 34 (c) 2008 Hex-Rays SA ...assembly listing improvement Hardcore users who prefer to work with assembly instructions can benefit from data flow analysis results
Hover the mouse over a register or data to get:
Its possible values or value ranges
Locations where it is defined
Locations where it is used
Highlight definitions or uses of the current register in two different colors
Show list of indirect call targets, calling conventions, etc
Gray out dead instructions
Determine if a value comes from a system call (ReadFile)
etc...
35. 35 (c) 2008 Hex-Rays SA ...more insight into the application domain One could reconstruct data types used by the application
In fact, serious reverse engineering is impossible without knowing data types
Fortunately, the API already exposes all the necessary information for type handling
Plenty of work ahead
36. 36 (c) 2008 Hex-Rays SA ...more abstract representations Tools to build more abstract representations
Function clustering (think of modules or libraries)
Global data flow diagrams (functions exposed to tainted data in red)
Statistical analysis of pseudocode
C++ template detection, generic code detection
37. 37 (c) 2008 Hex-Rays SA ...binary code comparison You know the possible applications better than I do
To find code plagiarisms
To detect changes between program versions
To find library functions (high-gear FLIRT)
etc... (you know better than me :)
38. 38 (c) 2008 Hex-Rays SA Back to the earth The tools and possibilities described on the previous slides do not exist yet
Yet they become possible thanks to decompilation
We have a long way to go
More processors and platforms
Floating point calculations
Exception handling
Type recovery
Handling hostile code
In fact, too many ideas to enumerate them here
The future is bright... is it?...
39. 39 (c) 2008 Hex-Rays SA The “thank you” slide Thank you for your attention! Questions?