
Smarter Searching for a Network Packet Database William (Bill) Kenworthy



  1. Smarter Searching for a Network Packet Database William (Bill) Kenworthy School of Information Technology Murdoch University Perth, Western Australia

  2. Content • This presentation is about an alternative way to search and/or classify data travelling over a network • I will describe the background, methodology and results of the research • This research covers two seemingly disparate disciplines, so, as the conference has a communication focus, there is some background on bioinformatics to set the scene • This presentation is (almost) maths free! • if you want the maths, see the paper :)

  3. Motivation • Test and validate alternate ways to view network data • Better visualise the intrinsic relationships between packets of data based on structure rather than content • As part of: • Investigating the possibilities inherent in mining data via structure-based methods • Searching using statistical ranking of possible answers

  4. About • Searching for information in high-speed network traffic is difficult - but is basically a "solved" problem! • What is still a problem, though, is searching for partial, obfuscated or spatially separated (in the data stream) search terms • The work described here is a successful attempt to use characteristics more commonly associated with biological systems to identify areas of interest in a network data stream

  5. Searching • Traditional database search results are the result of exact (yes/no) matching based on some regular expression system • e.g., [Bb]ank* • Instead, the algorithms I am recommending match on the low-level structure of a sequence of characters • character value and position/relationship in the stream • character/term substitution • results are ranked according to identity score and include false error rate data
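The contrast on this slide can be sketched in a few lines of Python, using `difflib.SequenceMatcher` as a stand-in for a proper identity-scoring alignment (the actual work uses bioinformatics alignment software; the sample strings and scores here are purely illustrative):

```python
import re
from difflib import SequenceMatcher

query = "bank"
samples = ["Bankers", "b4nk", "bnak", "weather"]

# Traditional regex search: binary yes/no, so obfuscated terms are missed
pattern = re.compile(r"[Bb]ank")
exact = [bool(pattern.search(s)) for s in samples]
print(exact)  # [True, False, False, False]

# Ranked similarity: every candidate gets an identity-style score,
# so near-matches like "b4nk" surface instead of disappearing
scores = sorted(
    ((SequenceMatcher(None, query, s.lower()).ratio(), s) for s in samples),
    reverse=True,
)
for score, s in scores:
    print(f"{score:.2f}  {s}")
```

Note how the obfuscated "b4nk" fails the regex outright but still ranks near the top of the scored list, which is the behaviour the structure-based approach relies on.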

  6. Problem: dealing with raw bits on a network

  7. What do we mean by "bioinformatics" algorithms? • There are useful parallels between the way data is structured in a stream of network data and a biological genome • Target the “structures” within a data stream for searching • Very sophisticated, statistically valid search algorithms were developed for use in searching biological data • Results can be statistically correlated and ranked

  8. What is constant? - Structure! • The property of the algorithms developed for bioinformatics that we are using primarily targets “structure”: • IP numbers will change in the header of an IP packet • BUT the position and placement of other tokens near the IP number do not (fixed-size fields) • This property extends to data fields • Example: • DNS data packets will have a similar signature, with slight differences depending on the mutable data

  9. Structure • 00 => A • 01 => C • 10 => G • 11 => T
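A minimal sketch of this 2-bit mapping in Python; the `transcode` helper and its bit ordering are assumptions for illustration, only the 00→A, 01→C, 10→G, 11→T table comes from the slide:

```python
# Transcode raw packet bytes into the DNA alphabet used by the
# bioinformatics tools: each 2-bit pair becomes one nucleotide.
NUCLEOTIDES = "ACGT"  # indexed by 0b00, 0b01, 0b10, 0b11

def transcode(packet: bytes) -> str:
    """Map each byte to four nucleotides, most significant bits first."""
    out = []
    for byte in packet:
        for shift in (6, 4, 2, 0):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)

# 0x1b = 0b00_01_10_11, i.e. the four pairs 00, 01, 10, 11
print(transcode(b"\x1b"))  # -> "ACGT"
```

Every byte expands to exactly four characters, so the transcoded stream preserves the position and spacing of the original fields, which is what the structure-based search exploits.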

  10. Example plot of relationships

  11. Methodology • The software used was standard bioinformatics software, with the input data modified to suit • Most bioinformatics software has been implemented by large teams over many years – it is not practical for an individual to re-implement it for a different purpose :( • Solution: translate packetised network data into bioinformatics-compatible data files by mapping ones and zeros to the DNA alphabet – basic data abstraction

  12. What? • What we are proposing is to intelligently identify network traffic in a way that uses relationships between structural elements embedded in the data, rather than the literal content of the data • Use this as a method to identify and classify network data into categories against which an event can be notified • We have created a database of known good and bad data samples which allows us to place network data in one of three possible categories: • known good • known bad • unknown

  13. Database Creation • Created with isolated island networks using generic PCs with various operating systems • database pollution was a problem • "Good" samples were typical email, database and browsing traffic • "Bad" samples were from PCs intentionally infected with botnet, virus and worm examples • Database is in the form of indexed motifs in a "BLAST"-formatted flat-file design

  14. Process • Processing flow starts by extracting a packet of data to user space (via the Linux kernel netfilter nfqueue module) • The packet (as a whole) is transcoded and searched against the database • returned is a set of "motifs" with score and false error rate statistics for each motif matched in the database • Event notification is decided on a threshold basis by an election process over the top-rated N hits returned (hits are ranked in order of identity score)
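The election step at the end of this flow might look like the following sketch; the hit tuple format, `n=5` window and 0.6 threshold are illustrative assumptions, not values from the paper:

```python
from collections import Counter

def classify(hits, n=5, threshold=0.6):
    """Majority vote over the top-n hits ranked by identity score.

    Each hit is a (label, identity_score, e_value) tuple; the winning
    label must hold at least `threshold` of the top-n votes, otherwise
    the packet is left as "unknown".
    """
    if not hits:
        return "unknown"
    top = sorted(hits, key=lambda h: h[1], reverse=True)[:n]
    votes = Counter(label for label, _score, _evalue in top)
    label, count = votes.most_common(1)[0]
    return label if count / len(top) >= threshold else "unknown"

# Hypothetical hits returned for one transcoded packet
hits = [
    ("bad", 0.97, 1e-12),
    ("bad", 0.94, 3e-10),
    ("good", 0.91, 2e-9),
    ("bad", 0.88, 1e-8),
    ("good", 0.60, 1e-3),
]
print(classify(hits))  # -> "bad" (3 of 5 top hits vote "bad")
```

The "unknown" fallback is what gives the three-way good/bad/unknown split described earlier: a packet whose best hits disagree simply fails the threshold.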

  15. Implementation • The test design has proven less than reliable at higher packet rates • mainly due to inefficient design • The next step is to implement the reference design as a Snort IDS module and link it to Snort's event notification process, where the well-designed data handling will alleviate the problems mentioned above

  16. The Future • These techniques have wide applicability to search problems where data is structured but mutable • And now for something completely different :) • Using a similar process for detecting collusion between student assignments by detecting structural similarities in software coding styles • Create a database of motifs based on code … search!

  17. Existing work? • Considering the advantages I have found, very little work has been undertaken using these algorithms • IBM proposed “An Intrusion-Detection System Based on the Teiresias Pattern-Discovery Algorithm” in 1999 • In 2004, IBM proposed using the Teiresias algorithm for SPAM filtering • Commentators thought it was “interesting”, but there has been little further activity ...

  18. Conclusions • It works :) • Known/unknown sorting might be a unique “niche” application • The ability to statistically rank similarity is a useful tool, opening up alternate ways to view search results

  19. Questions? • William (Bill) Kenworthy • W.Kenworthy@murdoch.edu.au • Thank you!
