
Smarter Searching for a Network Packet Database William (Bill) Kenworthy



  1. Smarter Searching for a Network Packet Database William (Bill) Kenworthy School of Information Technology Murdoch University Perth, Western Australia

  2. Content • This presentation is about an alternative way to search and/or classify data travelling over a network • I will describe the background, methodology and results of the research • This research covers two seemingly disparate disciplines, so, as the conference has a communication focus, there is some background on bioinformatics to set the scene • This presentation is (almost) maths free! • if you want the maths, see the paper :)

  3. Motivation • Test and validate alternate ways to view network data • Better visualise the intrinsic relationships between packets of data based on structure rather than content • As part of: • Investigating the possibilities inherent in mining data via structure-based methods • Searching using statistical ranking of possible answers

  4. About • Searching for information in high-speed network traffic is difficult - but is basically a "solved" problem! • What is still a problem, though, is searching for partial, obfuscated or spatially separated (in the data stream) search terms • The work described here is a successful attempt to use characteristics more commonly associated with biological systems to identify areas of interest in a network data stream

  5. Searching • Traditional database search results are the result of exact (yes/no) matching based on some regular expression system • e.g., [Bb]ank* • Instead, the algorithms I am recommending match on the low-level structure of a sequence of characters • character value and position/relationship in the stream • character/term substitution • results are ranked according to identity score and include false error rate data
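The contrast on this slide can be sketched in a few lines of Python, using `difflib.SequenceMatcher` as a stand-in for a proper identity-scoring alignment (the actual work uses bioinformatics alignment software; the sample strings and scores here are purely illustrative):

```python
import re
from difflib import SequenceMatcher

query = "bank"
samples = ["Bankers", "b4nk", "bnak", "weather"]

# Traditional regex search: binary yes/no, so obfuscated terms are missed
pattern = re.compile(r"[Bb]ank")
exact = [bool(pattern.search(s)) for s in samples]
print(exact)  # [True, False, False, False]

# Ranked similarity: every candidate gets an identity-style score,
# so near-matches like "b4nk" surface instead of disappearing
scores = sorted(
    ((SequenceMatcher(None, query, s.lower()).ratio(), s) for s in samples),
    reverse=True,
)
for score, s in scores:
    print(f"{score:.2f}  {s}")
```

Note how the obfuscated "b4nk" fails the regex outright but still ranks near the top of the scored list, which is the behaviour the structure-based approach relies on.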

  6. Problem: dealing with raw bits on a network

  7. What do we mean by "bioinformatics" algorithms? • There are useful parallels between the way data is structured in a stream of network data and a biological genome • Target the “structures” within a data stream for searching • Very sophisticated, statistically valid search algorithms were developed for use in searching biological data • Results can be statistically correlated and ranked

  8. What is constant? - Structure! • The property of the algorithms developed for bioinformatics that we are using primarily targets “structure”: • IP numbers will change in the header of an IP packet • BUT the position and placement of other tokens near the IP number do not (fixed-size fields) • This property extends to data fields • Example: • DNS data packets will have a similar signature, with slight differences depending on the mutable data

  9. Structure • 00 => A • 01 => C • 10 => G • 11 => T
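A minimal sketch of this 2-bit mapping in Python; the `transcode` helper and its bit ordering are assumptions for illustration, only the 00→A, 01→C, 10→G, 11→T table comes from the slide:

```python
# Transcode raw packet bytes into the DNA alphabet used by the
# bioinformatics tools: each 2-bit pair becomes one nucleotide.
NUCLEOTIDES = "ACGT"  # indexed by 0b00, 0b01, 0b10, 0b11

def transcode(packet: bytes) -> str:
    """Map each byte to four nucleotides, most significant bits first."""
    out = []
    for byte in packet:
        for shift in (6, 4, 2, 0):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)

# 0x1b = 0b00_01_10_11, i.e. the four pairs 00, 01, 10, 11
print(transcode(b"\x1b"))  # -> "ACGT"
```

Every byte expands to exactly four characters, so the transcoded stream preserves the position and spacing of the original fields, which is what the structure-based search exploits.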

  10. Example plot of relationships

  11. Methodology • The software used was standard bioinformatics software, with the input data modified to suit • Most bioinformatics software has been implemented by large teams over many years – it is not practical for an individual to re-implement it for a different purpose :( • Solution: translate packetised network data into bioinformatics-compatible data files by mapping ones and zeros to the DNA alphabet – basic data abstraction

  12. What? • What we are proposing is to intelligently identify network traffic in a way that uses relationships between structural elements embedded in the data, rather than the literal content of the data • Use this as a method to identify and classify network data into categories against which an event can be notified • We have created a database of known good and bad data samples which allows us to place network data in one of three possible categories: • known good • known bad • unknown

  13. Database Creation • Created with isolated island networks using generic PCs with various operating systems • database pollution was a problem • "Good" samples were typical email, database and browsing traffic • "Bad" samples were from PCs intentionally infected with botnet, virus and worm examples • Database is in the form of indexed motifs in a "BLAST"-formatted flat-file design

  14. Process • Processing flow starts by extracting a packet of data to user space (via the Linux kernel netfilter nfqueue module) • The packet (as a whole) is transcoded and searched against the database • returned is a set of "motifs" with score and false error rate statistics for each motif matched in the database • Event notification is decided on a threshold basis by an election process over the top-rated N hits returned (hits are ranked in order of identity score)
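The election step at the end of this flow might look like the following sketch; the hit tuple format, `n=5` window and 0.6 threshold are illustrative assumptions, not values from the paper:

```python
from collections import Counter

def classify(hits, n=5, threshold=0.6):
    """Majority vote over the top-n hits ranked by identity score.

    Each hit is a (label, identity_score, e_value) tuple; the winning
    label must hold at least `threshold` of the top-n votes, otherwise
    the packet is left as "unknown".
    """
    if not hits:
        return "unknown"
    top = sorted(hits, key=lambda h: h[1], reverse=True)[:n]
    votes = Counter(label for label, _score, _evalue in top)
    label, count = votes.most_common(1)[0]
    return label if count / len(top) >= threshold else "unknown"

# Hypothetical hits returned for one transcoded packet
hits = [
    ("bad", 0.97, 1e-12),
    ("bad", 0.94, 3e-10),
    ("good", 0.91, 2e-9),
    ("bad", 0.88, 1e-8),
    ("good", 0.60, 1e-3),
]
print(classify(hits))  # -> "bad" (3 of 5 top hits vote "bad")
```

The "unknown" fallback is what gives the three-way good/bad/unknown split described earlier: a packet whose best hits disagree simply fails the threshold.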

  15. Implementation • The test design has proven less than reliable at higher packet rates • mainly due to inefficient design • The next step is to implement the reference design as a Snort IDS module and link it to Snort's event notification process, where the well-designed data handling will alleviate the problems mentioned above

  16. The Future • These techniques have wide applicability to search problems where data is structured but mutable • And now for something completely different :) • Using a similar process for detecting collusion between student assignments by detecting structural similarities in software coding styles • Create a database of motifs based on code … search!

  17. Existing work? • Considering the advantages I have found, very little work has been undertaken using these algorithms • IBM proposed “An Intrusion-Detection System Based on the Teiresias Pattern-Discovery Algorithm” in 1999 • In 2004, IBM proposed using the Teiresias algorithm for SPAM filtering • Commentators thought it was “interesting”, but there has been little further activity ...

  18. Conclusions • It works :) • Known/unknown sorting might be a unique “niche” application • The ability to statistically rank similarity is a useful tool, opening up alternate ways to view search results

  19. Questions? • William (Bill) Kenworthy • W.Kenworthy@murdoch.edu.au • Thank you!
