
Statistical Methods for Detecting Computer Attacks from Streaming Internet Data




Presentation Transcript


  1. Statistical Methods for Detecting Computer Attacks from Streaming Internet Data Ginger Davis, University of Virginia Systems & Information Engineering Department Joint work with: David Marchette & Karen Kafadar INTERFACE 2008 May 22, 2008

  2. Outline • Motivation • Data • TCP Classification • Graphical Displays

  3. Motivation • Cyber attacks on computer networks are threats to nearly all operations in society. • We need computational tools and statistical methods to identify attacks and stop them before they force shutdowns. • Use patterns in Internet traffic data to • Perform user profiling • Detect anomalies, network interruptions, unusual behavior, masquerades

  4. Project Background Facts: • The Internet is growing • Computer network attacks are increasing • Need for network security research & tools [Images: Personal Computer + The Internet (circa 2006) = Burning Power Transformer (May 2007)]

  5. Previous Work in Detecting Aberrations • Examples • Disease surveillance • Nuclear product manufacturing • Fraud detection (credit cards; phone use) • These data sets are often • Reasonably small (say, fewer than 100 observations per day) • Easily stratified (by disease, site, cardholder) • Approximately independent • Statistical Process Control tools can often be applied
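The Statistical Process Control tools mentioned above can be sketched concretely: a minimal Shewhart 3-sigma chart that flags daily counts falling outside limits estimated from an in-control baseline. The baseline numbers below are invented for illustration; this is a sketch of the generic SPC idea, not the talk's method.

```python
# Minimal Shewhart control-chart sketch: flag counts outside mean +/- 3 sigma.
from statistics import mean, stdev

def control_limits(baseline):
    """3-sigma limits estimated from an in-control baseline sample."""
    m, s = mean(baseline), stdev(baseline)
    return m - 3 * s, m + 3 * s

def out_of_control(counts, baseline):
    lo, hi = control_limits(baseline)
    return [x for x in counts if x < lo or x > hi]

baseline = [42, 45, 40, 44, 43, 41, 46, 44]    # typical daily case counts
print(out_of_control([43, 44, 90], baseline))  # only 90 is flagged
```

Such charts work well exactly when the data are small, stratified, and approximately independent, which is why they break down on streaming Internet traffic.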

  6. Features of Internet Traffic Data • Relentless (“streaming”) • Not independent of other systems: thousands of messages from thousands of ports/addresses each minute • Diverse (text, numeric, image) • Dispersed (geographically) • Data often do not follow a convenient mathematical probability density function (pdf)

  7. Four Stages of Data Graphics • Static Graphics • Scatterplot, conditioning plot, density plot • Interactive Graphics • Brushing, cropping, cutting, coloring, rotating, linked plots • Dynamic Graphics (interact directly with fixed size data set on the client) • Recursive or dynamically smoothed plot, mode tree • Evolutionary Graphics (continually evolving streaming data sets) • Waterfall diagram, streaming chart, skyline plot

  8. Challenges • Internet traffic data are streaming • Unusable in raw form and require pre-processing • Detecting anomalies requires characterizing typical behavior

  9. Specific challenges for streaming data • Data value • what to collect/discard/save for later • Data warehouse • acquisition, storage, distribution • Tools/algorithms for pre-processing • Methods for analysis • Robustness, sufficiency • Informative visual displays

  10. Internet Traffic Data • All Internet communications are transmitted via packets • The packet is the fundamental unit of information • A packet consists of data and headers that control the communication • Internet Protocol (IP) addresses • Transmission Control Protocol (TCP)
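As a rough illustration of the header structure discussed in the following slides, the fixed 20-byte IPv4 header can be decoded with Python's standard `struct` module. This is a sketch, not the authors' toolkit; the addresses are made up.

```python
# Sketch: decode the fixed 20-byte IPv4 header of a raw packet.
import struct
import socket

def parse_ipv4_header(data: bytes) -> dict:
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, cksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL field is in 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# A hand-built header: version 4, IHL 5, TCP (proto 6), 10.0.0.1 -> 10.0.0.2
hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                  socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
fields = parse_ipv4_header(hdr)
print(fields["src"], fields["dst"], fields["protocol"])
```

The source and destination addresses, protocol, and TTL pulled out here are the kinds of header fields the analyses below treat as data.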

  11. Internet Traffic Data

  12. Internet Traffic Data

  13. Internet Traffic Data

  14. IP Header (Marchette 2001)

  15. TCP

  16. TCP

  17. TCP Header

  18. Hierarchy of Data • Packets • Identifying characteristics • Bytes of information being sent • Flows • Communication between source-destination • Connection • Collection of source flows and destination flows • Activity • Collection of similar connections • User session • Collection of activities
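The first two levels of the hierarchy above can be sketched in Python: individual packets are grouped into flows keyed by the (source, source port, destination, destination port, protocol) 5-tuple. The packet records are invented for illustration.

```python
# Sketch: aggregate packets into flows keyed by the 5-tuple.
from collections import defaultdict

def packets_to_flows(packets):
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["sport"], p["dst"], p["dport"], p["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["size"]
    return dict(flows)

pkts = [
    {"src": "10.0.0.1", "sport": 5000, "dst": "10.0.0.2", "dport": 80,
     "proto": "tcp", "size": 1500},
    {"src": "10.0.0.1", "sport": 5000, "dst": "10.0.0.2", "dport": 80,
     "proto": "tcp", "size": 400},
]
flows = packets_to_flows(pkts)
print(flows)  # one flow containing both packets
```

Connections, activities, and user sessions are then successively coarser groupings of these flow records.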

  19. Hierarchy of Data Example

  20. Goal for Data Hierarchy • Develop models for each level of the hierarchy that depend on the models for the other levels

  21. TCP Classification • Detecting anomalies requires characterizing typical behavior • We will classify network traffic according to its application

  22. Background Motivation: • Port numbers map packets to their respective applications • All that matters is that the two communicating hosts know which port number to look for • Malicious users can send other traffic over a well-known port like 80 (web traffic) and as a result are less likely to be noticed.
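The weakness described above is easy to see in code: a naive port-to-application map, sketched below with a few invented entries, labels anything on port 80 as web traffic regardless of what the session actually carries.

```python
# A naive port-to-protocol lookup: exactly what a masquerading user defeats.
WELL_KNOWN_PORTS = {80: "http", 443: "https", 110: "pop3", 25: "smtp"}

def guess_application(dport: int) -> str:
    return WELL_KNOWN_PORTS.get(dport, "unknown")

print(guess_application(80))  # "http", even if the traffic is not HTTP
```

Classifying by session characteristics rather than by port number is the project's answer to this.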

  23. Goal and Objective Goal: To prevent malicious users from masquerading their activities. Objective: To develop classification tree and multinomial logit models which could be used to correctly identify application protocols by looking at session variable characteristics

  24. Data • Preliminary Data Processing Methodology: Convert: Binary -> Text -> SQL • Proved to be slow and inefficient • Produced inadequate session aggregation results

  25. Data • Revised Data Processing Methodology: • Convert: Binary -> Text -> SQL • Faster, more efficient, tracks more variables for each session

  26. Data Session Aggregation Process: • Ordered observations in database by time • Logically grouped each packet into a session using standard TCP semantics • Created unique session definitions • Maintained averages and variances for each session’s variables • Session completion status is determined and marked according to TCP semantics • Packet and session tables were linked by foreign keys
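One standard way to maintain each session's averages and variances without storing the whole packet stream is Welford's online algorithm; the sketch below illustrates the idea (it is not necessarily what the team implemented).

```python
# Sketch: running mean and variance of a session variable, updated per packet
# (Welford's online algorithm).
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of the values seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for size in [100, 200, 300]:  # e.g. packet sizes within one session
    stats.update(size)
print(stats.mean, stats.variance)  # 200.0 10000.0
```

Keeping only (count, mean, M2) per session variable is what makes per-session statistics feasible on streaming data.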

  27. Data • Enterprise Data Set (collected by Lawrence Berkeley National Laboratory): 129,903,861 TCP Packets, 453,135 TCP Sessions • GMU Data Set (collected by George Mason University): 7,024,590 TCP Packets, 91,016 TCP Sessions • House Data Set (collected by Capstone Team 8): 1,110,335 TCP Packets, 21,311 TCP Sessions

  28. Model Creation Training and Testing Data Set Creation

  29. Model Creation Scenarios Used in Data Analysis Real World Corporate Scenario Used all application ports present in the datasets Idealized Scenario Used only “top” application ports in the data sets Home Network Scenario Used only http, https, pop, and smtp application ports present in the data sets

  30. Model Creation • Classification Tree Algorithm Parameters • RPART – originally developed for R • Dependent Variable – Application Port • Independent Variables – 39 session variables • Splitting Criteria – Gini Index
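The trees in the talk were grown with RPART in R; the Gini splitting criterion it uses can be illustrated with a single best-split search in plain Python. The session data below (mean packet size vs. application port) are invented for illustration.

```python
# Sketch: find the single threshold on one session variable that minimizes
# the weighted Gini impurity, as a tree-growing algorithm like RPART does.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Threshold on one variable minimizing weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# toy sessions: mean packet size vs. application port (class label)
sizes = [80, 90, 600, 650, 1100, 1200]
ports = [25, 25, 110, 110, 80, 80]
print(best_split(sizes, ports))  # splits off the small-packet smtp sessions
```

A full tree recurses this search over all 39 session variables on each side of the split.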

  31. Model Creation Classification Tree Snapshot

  32. Model Creation • Multinomial Logit Models • Dependent variable – Application Port • Independent variables – 39 session variables
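Once a multinomial logit model is fitted, scoring a session reduces to a softmax over per-class linear scores; that scoring step can be sketched in plain Python. The coefficients and session variables below are invented for illustration.

```python
# Sketch: multinomial-logit scoring -- pick the class (application port)
# with the highest softmax probability.
import math

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]  # numerically stable
    total = sum(exps)
    return [e / total for e in exps]

def classify(x, coefs, classes):
    """x: session-variable vector; coefs: one weight vector per class."""
    scores = [sum(w * v for w, v in zip(ws, x)) for ws in coefs]
    probs = softmax(scores)
    return classes[probs.index(max(probs))]

# two toy session variables, three candidate application protocols
coefs = [[0.01, -0.5], [-0.02, 0.1], [0.0, 0.3]]
print(classify([1200, 2], coefs, ["http", "smtp", "pop3"]))
```

Because scoring is just a dot product and a softmax per class, the model is cheap enough to run in near real time, which motivates the practicality claim in the results.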

  33. Results: Classification Trees Real World Corporate Scenario: All Ports and All Variables Takeaway: Good prediction capability within the same data set; inconsistent results when benchmarked against different data sets.

  34. Results: Classification Trees Idealized Scenario: Top Ports and All Variables Takeaway: Significant prediction improvement for the Enterprise data set; limiting the ports cleansed the noise from the data.

  35. Results: Classification Trees Home Network Scenario: Four Ports and All Variables Takeaway: Improved prediction results both within and across data sets.

  36. Results: Classification Trees Port 80 Across Data Sets – 4 Application Ports Takeaway: HTTP traffic (port 80) predictions appear to be robust across the models when only the four application ports are considered.

  37. Results: Multinomial Logit Models Idealized Scenario: Top Ports – All Variables Takeaway: Weaker prediction results in the Enterprise data set. Practical in a real-time environment given an appropriate implementation.

  38. Conclusion • Project Takeaways: • Replicated / expanded prior research work successfully on real network data • Used a fast/exportable model creation and classification process -> Classification Trees • Created a robust toolkit for processing and storing network data

  39. Future Work • Implement classification trees in a real network security application • Handle minority class presence in the data • Make use of pruning to develop smaller models

  40. Evolutionary Displays for EDA

  41. Waterfall Diagrams (Wegman & Marchette 2003)

  42. Summary / Future Work
