
Discoverer: Automatic Protocol Reverse Engineering from Network Traces

USENIX Security (Security ’07). Discoverer: Automatic Protocol Reverse Engineering from Network Traces. Weidong Cui, Jayanthkumar Kannan, Helen J. Wang, Microsoft Research. Presented by Mike Hsiao, 20080125.





Presentation Transcript


  1. USENIX Security (Security ’07) Discoverer: Automatic Protocol Reverse Engineering from Network Traces. Weidong Cui, Jayanthkumar Kannan, Helen J. Wang, Microsoft Research. Presented by Mike Hsiao, 20080125

  2. Outline • Introduction • Problem Statement • Common protocol idioms and the scope of Discoverer • Design • Evaluation • Related Work* • Limitations and Future Work • Conclusion and Comment

  3. Section 1 Application-level protocol specifications: usage • Application-level protocol specifications are useful for many security applications. • intrusion prevention and detection • deep packet inspection • protocol analyzers • penetration testing (generating network inputs to an application to uncover potential vulnerabilities) • Current practice for obtaining these specifications is mostly manual.

  4. Section 1 Discoverer • is a tool for automatically reverse engineering the protocol message formats of an application from its network trace • operates in a protocol-independent fashion • by inferring protocol idioms commonly seen in message formats of many application-level protocols • is evaluated on one text protocol and two binary protocols

  5. Section 1 Application-level protocol specifications • From documentation or reverse engineered manually • Time-consuming and error-prone • “It took the open-source SAMBA project 12 years to manually reverse engineer the Microsoft SMB protocol.” • “Yahoo messenger protocol has also been persistently reverse engineered, despite which, the open source clients regularly require patching to support proprietary changes in the Yahoo protocol.” • the period between the availability of an official client and an open-source client has been a month

  6. Section 1 Automatically reverse engineer message formats • Challenges • Very few hints from the network trace (byte streams) • Protocols are significantly different from each other • Protocol message formats are often context-sensitive • where earlier fields dictate the parsing of the subsequent part of the message • The authors dissect the formless byte streams into text and binary segments or tokens • as a starting point for clustering messages with similar patterns, where each cluster approximates a message format.

  7. Section 1 Evaluation Metrics • Correctness • does one inferred format correspond to exactly one true format? • Conciseness • how many inferred formats is a single true format reflected in? • Coverage • how many messages are covered by the inferred formats?

  8. Section 2 Problem Statement: Common Protocol Idioms • Application session • consists of a series of messages between two hosts that accomplishes a specific task. • Message format specification • a sequence of fields and their semantics • length, offset (byte offset of another field) • pointer (an offset that specifies the index of a field) • cookie (session-specific opaque data, e.g., a session ID) • endpoint-address (IP, port) • set (a group of fields that can be put in an arbitrary order)

  9. Section 2 Common Protocol Idioms: Format Distinguisher • Format Distinguisher (FD) • It serves to differentiate the format of the subsequent part of the message • A message may have a sequence of FD fields, particularly when multiple protocols are encapsulated. • E.g., SMB consists of a NetBIOS header • This implies that applications need to scan a message from left to right, decoding an FD field before parsing the subsequent part of the message.

  10. Section 2 Scope of Discoverer • derive the message format specification • not the protocol finite state machine • assume synchronous protocols • A message is a consecutive chunk of application-level data sent in one direction • over one or more TCP or UDP connections • a UDP connection is a pair of unidirectional UDP flows • focus on applications that do not obfuscate payloads • do not capture timing semantics

  11. Section 3 Design: Overview • Cluster messages with the same format together and infer the message format by comparing messages in a single cluster • Tokenization and Initial Clustering • Recursive Clustering • Merging

  12. Section 3 1-1 Tokenization (1/2) • Text • Identify text bytes by comparing them with the ASCII values of printable characters • Consider a sequence of text bytes sandwiched between two binary bytes as a text segment • Require the sequence to have a minimum length • Use a set of delimiters (e.g., space and tab) to divide a text segment into tokens

  13. Section 3 1-1 Tokenization (2/2) • Binary • They simply declare a single binary byte to be a binary token in its own right. • Error 1: consecutive binary bytes with ASCII values of printable characters are wrongly marked as a text token • Error 2: a text string shorter than the minimum length is wrongly marked as binary tokens • Error 3: a text field consisting of some white space characters is wrongly divided into multiple text tokens
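The tokenization rules on the two slides above can be sketched as follows. This is a minimal illustration, not Discoverer's implementation: the minimum length of 3, the delimiter set, and the name `tokenize` are assumptions, and a printable run at the very start or end of a message is treated as a text segment for simplicity.

```python
import string

# Assumed parameters for illustration; the slides do not give concrete values.
MIN_TEXT_LEN = 3
DELIMITERS = b" \t"
PRINTABLE = set(string.printable.encode("ascii"))


def tokenize(message: bytes, min_text_len: int = MIN_TEXT_LEN):
    """Split a message into ('text' | 'binary', value) tokens."""
    tokens = []
    i, n = 0, len(message)
    while i < n:
        # Extend a run of printable bytes starting at position i.
        j = i
        while j < n and message[j] in PRINTABLE:
            j += 1
        run = message[i:j]
        if len(run) >= min_text_len:
            # Long enough: a text segment, divided into tokens by delimiters.
            word = bytearray()
            for b in run:
                if b in DELIMITERS:
                    if word:
                        tokens.append(("text", bytes(word)))
                        word = bytearray()
                else:
                    word.append(b)
            if word:
                tokens.append(("text", bytes(word)))
            i = j
        elif run:
            # Too short to count as text: each byte is its own binary token.
            tokens.extend(("binary", bytes([b])) for b in run)
            i = j
        else:
            # A non-printable byte is a binary token in its own right.
            tokens.append(("binary", bytes([message[i]])))
            i += 1
    return tokens
```

The trailing `ab` in an input such as `b"\x00\x01GET /index\x02ab"` falls below the minimum length and is emitted as two binary tokens, which is an instance of Error 2 on the slide.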

  14. Section 3 1-2 Initial Clustering by Token Patterns • The authors cluster messages based on their token patterns. • The token pattern assigned to a message is a tuple: (dir, class of token 1, class of token 2, …) • E.g., (client to server, text, binary, text) • Note that this initial clustering is coarse-grained since messages with different formats may have the same token pattern.
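A small sketch of this grouping step, assuming each message has already been tokenized into `(class, value)` pairs. The function names and the direction strings are hypothetical.

```python
from collections import defaultdict


def token_pattern(direction, tokens):
    """Build the tuple (dir, class of token 1, class of token 2, ...)."""
    return (direction,) + tuple(cls for cls, _ in tokens)


def initial_clusters(messages):
    """Group (direction, tokens) messages that share a token pattern."""
    clusters = defaultdict(list)
    for direction, tokens in messages:
        clusters[token_pattern(direction, tokens)].append(tokens)
    return dict(clusters)


messages = [
    ("client to server", [("text", b"GET"), ("binary", b"\x00"), ("text", b"/a")]),
    ("client to server", [("text", b"PUT"), ("binary", b"\x01"), ("text", b"/b")]),
    ("server to client", [("text", b"OK")]),
]
clusters = initial_clusters(messages)
```

The first two messages land in the same cluster even though they might follow different formats, which is why the slide calls this step coarse-grained.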

  15. Section 3 2 Recursive Clustering • The recursive clustering relies on identifying format distinguisher (FD) tokens • To find FD tokens, we need to invoke both format inference and format comparison

  16. Section 3 2-1 Format Inference • This phase takes as input a set of messages and infers a format that succinctly captures the content of the set of messages. • Property Inference • Token class is already identified during the tokenization phase. • Constant or variable tokens can also be easily identified. • Since the set of messages comes from a single token-pattern cluster, tokens in one message can be directly compared against their counterparts by simply using the token offset. • Thus, constant tokens are those that take the same value across the entire set of messages, and variable tokens are those that take more than one value.
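Property inference as described here reduces to a column-wise comparison. A minimal sketch, assuming a cluster is a list of token lists that all share one token pattern (the name `infer_properties` is hypothetical):

```python
def infer_properties(cluster):
    """Label each token position as constant or variable across a cluster."""
    properties = []
    # zip(*cluster) walks the messages position by position (by token offset).
    for column in zip(*cluster):
        classes = {cls for cls, _ in column}
        if len(classes) != 1:
            raise ValueError("messages do not share one token pattern")
        values = {value for _, value in column}
        kind = "constant" if len(values) == 1 else "variable"
        properties.append((column[0][0], kind))
    return properties


cluster = [
    [("text", b"GET"), ("text", b"/a"), ("binary", b"\x00")],
    [("text", b"GET"), ("text", b"/b"), ("binary", b"\x00")],
]
props = infer_properties(cluster)
```

Note that a token labeled constant here may only look constant because the cluster is small, which is the trace-dependency limitation raised later in the talk.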

  17. Section 3 2-1 Format Inference • Semantic Inference • length • intuition: for a specific pair of messages, the difference in the values of potential length fields reflects the difference in the sizes of the messages • potential length field: at most four consecutive binary tokens, or a text token in decimal or hex format • offset • compare the value difference with the difference of the offsets of some subsequent tokens • cookie • inferred at the end of the merging phase, following RolePlayer [3]
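The length-field intuition can be sketched on raw bytes. The check below uses the fact that "value difference equals size difference for every pair" is equivalent to "value minus message size is the same for every message". Several simplifications are assumptions of this sketch: candidates sit at fixed byte offsets, integers are big-endian, and text-encoded lengths are ignored.

```python
def find_length_fields(messages, max_width=4, byteorder="big"):
    """Return (offset, width) windows whose value tracks the message size."""
    shortest = min(len(m) for m in messages)
    hits = []
    for offset in range(shortest):
        for width in range(1, max_width + 1):
            if offset + width > shortest:
                break
            values = [
                int.from_bytes(m[offset:offset + width], byteorder)
                for m in messages
            ]
            if len(set(values)) < 2:
                continue  # a constant window gives no evidence either way
            # Same (value - size) for all messages means every pairwise
            # value difference equals the pairwise size difference.
            deltas = {v - len(m) for v, m in zip(values, messages)}
            if len(deltas) == 1:
                hits.append((offset, width))
    return hits


msgs = [b"\x01\x00\x03abc", b"\x01\x00\x05hello", b"\x01\x00\x01z"]
hits = find_length_fields(msgs)
```

In this toy trace the two-byte window at offset 1 is reported, along with overlapping windows that contain it; choosing among overlapping candidates would need an extra rule that the slide does not describe.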

  18. Section 3 2-2 Format Comparison • Decide whether two inferred message formats are the same • token-by-token • from left to right • Ideally, two tokens can be considered to match if their semantics match.

  19. Section 3 2-3 Recursive Clustering by Format Distinguishers • Three criteria to determine if a token is an FD • the number of unique values taken by this token across the set of messages is less than a threshold • (if the first criterion is satisfied) the cluster is divided into sub-clusters, one per unique token value • the size of the largest sub-cluster exceeds a threshold • this guarantees a meaningful format inference in at least one sub-cluster • (if the potential FD passes the second criterion) format comparison is invoked across sub-clusters

  20. Section 3 2-3 Recursive Clustering by Format Distinguishers • This process is recursively performed on each of the sub-clusters because a message may have more than one FD token. • They find the next FD token by scanning further down the message towards the right (end) of the message. • The format inference is invoked again on the set of messages in each sub-cluster. • The inferred token properties and semantics might change because the set of messages has become smaller.
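The two slides above can be read as a left-to-right recursive split. The sketch below is one simplified interpretation: both thresholds are made-up values, messages are reduced to tuples of token values, and the format comparison is replaced by a stand-in that only compares constant/variable patterns of the sub-clusters.

```python
from collections import defaultdict

# Assumed thresholds; the slides do not state the actual values.
MAX_FD_VALUES = 10
MIN_LARGEST_SUBCLUSTER = 2


def infer_format(cluster):
    """Stand-in format: per-position 'const' / 'var' labels."""
    width = len(cluster[0])
    return tuple(
        "const" if len({m[i] for m in cluster}) == 1 else "var"
        for i in range(width)
    )


def split_by_fd(cluster, start=0):
    """Recursively split a cluster on format-distinguisher tokens."""
    for pos in range(start, len(cluster[0])):
        groups = defaultdict(list)
        for message in cluster:
            groups[message[pos]].append(message)
        # Criterion 1: few unique values (and more than one, or no split).
        if not 1 < len(groups) <= MAX_FD_VALUES:
            continue
        # Criterion 2: the largest sub-cluster is big enough to infer from.
        if max(len(g) for g in groups.values()) < MIN_LARGEST_SUBCLUSTER:
            continue
        # Criterion 3 (simplified): sub-cluster formats must differ.
        if len({infer_format(g) for g in groups.values()}) < 2:
            continue
        result = []
        for group in groups.values():
            # Look for the next FD further to the right in each sub-cluster.
            result.extend(split_by_fd(group, pos + 1))
        return result
    return [cluster]


cluster = [
    (b"\x01", b"A", b"x"),
    (b"\x01", b"A", b"y"),
    (b"\x02", b"B", b"z"),
    (b"\x02", b"C", b"z"),
]
subclusters = split_by_fd(cluster)
```

Here the first token is accepted as an FD because its two sub-clusters show different constant/variable patterns; later positions are rejected because they would produce only singleton sub-clusters.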

  21. Section 3 3 Merging with Type-Based Sequence Alignment • In previous phases, we are conservative to ensure that the format inference procedure operates correctly on a set of messages of the same format. • this leads to a new problem of over-classification • E.g., an SMB trace with 4M messages can yield 7,000 clusters (formats), while the total number of true formats is 130.

  22. Section 3 3 Merging with Type-Based Sequence Alignment • Type-based sequence alignment • It only allows two tokens of the same class (binary or text) to align with each other. • They claim two aligned tokens are matched if they either have the same semantics or share at least one value. • Extra gap constraints
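Type-based alignment can be illustrated with a Needleman-Wunsch style dynamic program in which tokens of different classes are forbidden from aligning. The scores (2 for a shared value, 0 otherwise, -1 per gap) are assumptions of this sketch, token semantics are ignored, and the extra gap constraints mentioned on the slide are not modeled.

```python
NEG_INF = float("-inf")
GAP = -1  # assumed gap penalty


def pair_score(a, b):
    """Score aligning two format tokens given as (class, set of values)."""
    if a[0] != b[0]:
        return NEG_INF  # text may never align with binary
    return 2 if a[1] & b[1] else 0  # sharing a value counts as a match


def align(fmt_a, fmt_b):
    """Return (score, aligned index pairs) for two inferred formats."""
    n, m = len(fmt_a), len(fmt_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(fmt_a[i - 1], fmt_b[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + pair_score(fmt_a[i - 1], fmt_b[j - 1])
        if score[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


fmt_a = [("binary", {b"\x01"}), ("text", {b"GET"}), ("text", {b"/a"})]
fmt_b = [("binary", {b"\x01"}), ("text", {b"GET"}),
         ("binary", {b"\x00"}), ("text", {b"/b"})]
total, pairs = align(fmt_a, fmt_b)
```

In this example the extra binary token in `fmt_b` is skipped with a gap, and the last text tokens are aligned but not matched because they share no value; a merging step would then decide from the matched pairs whether the two formats should be combined.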

  23. Section 3 An Example: true message from Ethereal

  24. Section 3 An Example: the final inferred format by Discoverer (1/2)

  25. Section 3 An Example: the final inferred format by Discoverer (2/2) • The inferred format is a sequence of tokens with token properties (binary vs. text, constant vs. variable) and semantics (e.g., length fields).

  26. Section 4 Evaluation • 5,700 lines of C++ code on Windows • un-optimized implementation takes about 6-12 hours for a trace of several million messages • Data Sets • a honeyfarm site (which responds to unsolicited, mostly malicious traffic); SMB only. • a busy enterprise (which has diverse and high-volume traffic); HTTP, SMB, RPC.

  27. Section 4 Evaluation Methodology • Correctness • If a cluster contains messages from more than one true format, then Discoverer will make an incorrect inference. • For all three protocols, over 90% of clusters contain messages from a single true format. • Conciseness • A large number of redundant formats will affect the conciseness of the protocol specifications generated • Measured as the ratio of the number of inferred formats to the number of true formats followed by their messages (5:1) • almost 80% of true formats are scattered into at most five clusters.

  28. Section 4 Evaluation Methodology (cont’d) • Coverage • the fraction of messages covered by our inferred formats • the fraction of true formats followed by covered messages • For all the three protocols, the message coverage is above 95% while the format coverage is around 30-40%.
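The three metrics can be computed from a labeled trace as sketched below. This assumes each covered message is available as a `(cluster id, true format)` pair, with the true format taken from a reference parser; the function name and the dictionary keys are hypothetical.

```python
from collections import defaultdict


def evaluate(assignments, total_messages, total_true_formats):
    """Compute correctness, conciseness and coverage from labeled messages."""
    formats_in_cluster = defaultdict(set)
    clusters_of_format = defaultdict(set)
    for cluster_id, true_format in assignments:
        formats_in_cluster[cluster_id].add(true_format)
        clusters_of_format[true_format].add(cluster_id)
    pure = sum(1 for f in formats_in_cluster.values() if len(f) == 1)
    return {
        # Fraction of clusters holding messages of a single true format.
        "correctness": pure / len(formats_in_cluster),
        # Inferred formats per true format that the covered messages follow.
        "conciseness": len(formats_in_cluster) / len(clusters_of_format),
        "message_coverage": len(assignments) / total_messages,
        "format_coverage": len(clusters_of_format) / total_true_formats,
    }


assignments = [(0, "A"), (0, "A"), (1, "A"), (2, "B"), (2, "C")]
metrics = evaluate(assignments, total_messages=10, total_true_formats=4)
```

In the toy input, cluster 2 mixes two true formats and so counts against correctness, while true format A is split across clusters 0 and 1, which is the redundancy that conciseness is meant to capture.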

  29. Section 4 Tunable Parameters

  30. Section 4 HTTP • The HTTP protocol allows an arbitrary number of “parameter: value” pairs in an arbitrary order. • 1. Most messages (more than 99%) fall in the top 1,000 true formats; a similar trend holds for RPC and CIFS/SMB. • 2. They inferred 3,926 formats, which covered 5,938,511 out of 5,950,453 messages (99.8%). • 3. The covered messages belong to 865 out of 2,696 true formats (32%).
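The two coverage percentages on this slide follow directly from the counts it reports, as this short check shows (variable names are chosen here for readability):

```python
# Counts as reported on the HTTP slide.
covered_messages = 5_938_511
total_messages = 5_950_453
covered_true_formats = 865
total_true_formats = 2_696

message_coverage = covered_messages / total_messages
format_coverage = covered_true_formats / total_true_formats

print(f"message coverage: {message_coverage:.1%}")  # 99.8%
print(f"format coverage: {format_coverage:.0%}")    # 32%
```

The gap between the two numbers is consistent with the first point on the slide: a small set of popular formats accounts for almost all messages.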

  31. Section 6 Limitations and Future Work • Trace Dependency • some message formats never occur in the trace • certain variable fields never take more than one value in the trace • Pre-Defined Semantics • Only a set of pre-defined semantics can be inferred. • Coalescing Fields • Unlike text fields, no clue may be available for delimiting binary fields • only a few approaches exist (e.g., does this byte vary as much as the other one?)

  32. Section 6 Limitations and Future Work (cont’d) • Asynchronous Protocols • messages in one direction may be interrupted by those in the other direction • messages in one direction may be delayed, allowing two back-to-back messages in the other direction. • Application Sessions • Currently, Discoverer analyzes each connection in isolation. • State Machine Inference • captures the sequences of messages in all sessions in the trace

  33. Conclusion and Comment • Discoverer is a tool that aims to automate the protocol reverse engineering process • Protocol knowledge is very difficult to model automatically. • so far they only model the semantics (offset, length…) • How about the communication interaction? (user intention …)
