
An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection


Presentation Transcript


  1. An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection. Matt Mahoney, mmahoney@cs.fit.edu, Feb. 18, 2003.

  2. Is the DARPA/Lincoln Labs IDS Evaluation Realistic?
  • The most widely used intrusion detection evaluation data set.
  • 1998 data used in the KDD Cup competition with 25 participants.
  • 8 participating organizations submitted 18 systems to the 1999 evaluation.
  • Tests host- or network-based IDS.
  • Tests signature or anomaly detection.
  • 58 types of attacks (more than any other evaluation).
  • 4 target operating systems.
  • Training and test data released after the evaluation to encourage IDS development.

  3. Problems with the LL Evaluation
  • Background network data is synthetic.
  • SAD (Simple Anomaly Detector) detects too many attacks.
  • Compared with real traffic, the range of attribute values is too small and static (TTL, TCP options, client addresses, ...).
  • Injecting real traffic removes suspect detections from PHAD, ALAD, LERAD, NETAD, and SPADE.

  4. 1. Simple Anomaly Detector (SAD)
  • Examines only inbound client TCP SYN packets.
  • Examines only one byte of the packet.
  • Trains on attack-free data (week 1 or 3).
  • A value never seen in training is an anomaly.
  • If there have been no anomalies for 60 seconds, then output an alarm with score 1 (see the sketch below).
  Example — Train: 001110111  Test: 010203001323011  (60 sec. quiet periods marked on the test stream).
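A minimal Python sketch of the detector described on this slide, assuming the traffic has already been reduced to (timestamp, monitored_byte) pairs for inbound client TCP SYN packets; packet capture and byte extraction are not shown.

```python
def sad_train(training_packets):
    """Collect the set of values of the monitored byte seen in attack-free training."""
    return {value for _, value in training_packets}

def sad_detect(test_packets, allowed, quiet_period=60.0):
    """Yield (timestamp, value) alarms, each with implicit score 1.

    A value never seen in training is an anomaly; an alarm is issued only if
    there has been no anomaly during the preceding 60 seconds.
    """
    last_anomaly = None
    for t, value in test_packets:
        if value in allowed:
            continue                      # seen in training: not an anomaly
        if last_anomaly is None or t - last_anomaly >= quiet_period:
            yield (t, value)              # alarm with score 1
        last_anomaly = t                  # any anomaly restarts the quiet period
```

With the third byte of the source IP address as the monitored byte, this is the variant whose results are quoted on slide 8.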

  5. DARPA/Lincoln Labs Evaluation
  • Weeks 1 and 3: attack-free training data.
  • Week 2: training data with 43 labeled attacks.
  • Weeks 4 and 5: 201 test attacks.
  (Testbed diagram: attacks arrive from the Internet through a router, past a sniffer, to four victim hosts: SunOS, Solaris, Linux, NT.)

  6. SAD Evaluation
  • Develop on weeks 1-2 (available in advance of the 1999 evaluation) to find good bytes.
  • Train on week 3 (no attacks).
  • Test on weeks 4-5 inside sniffer traffic (177 visible attacks).
  • Count detections and false alarms using the 1999 evaluation criteria.

  7. SAD Results
  • Variants (bytes) that do well: source IP address (any of 4 bytes), TTL, TCP options, IP packet size, TCP header size, TCP window size, source and destination ports.
  • Variants that do well on weeks 1-2 (available in advance) usually do well on weeks 3-5 (evaluation).
  • Very low false alarm rates.
  • Most detections are not credible.

  8. SAD vs. 1999 Evaluation
  • The top system in the 1999 evaluation, Expert 1, detects 85 of 169 visible attacks (50%) at 100 false alarms (10 per day) using a combination of host- and network-based signature and anomaly detection.
  • SAD detects 79 of 177 visible attacks (45%) with 43 false alarms using the third byte of the source IP address.

  9. 1999 IDS Evaluation vs. SAD

  10. SAD Detections by Source Address (that should have been missed)
  • DOS on public services: apache2, back, crashiis, ls_domain, neptune, warezclient, warezmaster
  • R2L on public services: guessftp, ncftp, netbus, netcat, phf, ppmacro, sendmail
  • U2R: anypw, eject, ffbconfig, perl, sechole, sqlattack, xterm, yaga

  11. 2. Comparison with Real Traffic
  • Anomaly detection systems flag rare events (e.g. previously unseen addresses or ports).
  • “Allowed” values are learned during training on attack-free traffic.
  • Novel values in background traffic would cause false alarms.
  • Are novel values more common in real traffic?

  12. Measuring the Rate of Novel Values
  • r = number of values observed in training.
  • r1 = fraction of values seen exactly once (Good-Turing probability estimate that the next value will be novel).
  • rh = fraction of values seen only in the second half of training.
  • rt = fraction of training time needed to observe half of all values.
  Larger values in real data would suggest a higher false alarm rate (see the sketch below).
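A minimal sketch of how these four statistics could be computed for one attribute, assuming the training traffic has been reduced to a time-ordered list of (timestamp, value) pairs; the denominator used for r1 follows the slide's literal wording and is noted as such.

```python
from collections import Counter

def novelty_stats(training):
    """Compute r, r1, rh, rt for one attribute from time-sorted (timestamp, value) pairs."""
    counts = Counter(v for _, v in training)
    r = len(counts)                                      # number of values observed in training

    n1 = sum(1 for c in counts.values() if c == 1)       # values seen exactly once
    r1 = n1 / r        # per the slide's wording; classical Good-Turing divides by len(training)

    t_start, t_end = training[0][0], training[-1][0]
    first_seen = {}
    for t, v in training:
        first_seen.setdefault(v, t)                      # first appearance of each value

    # rh: fraction of values first seen only in the second half of training
    rh = sum(1 for t in first_seen.values() if t > (t_start + t_end) / 2) / r

    # rt: fraction of training time elapsed when half of all values have appeared
    t_half = sorted(first_seen.values())[(r + 1) // 2 - 1]
    rt = (t_half - t_start) / (t_end - t_start)
    return r, r1, rh, rt
```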

  13. Network Data for Comparison
  • Simulated data: inside sniffer traffic from weeks 1 and 3, filtered from 32M packets to 0.6M packets.
  • Real data: collected from www.cs.fit.edu, Oct.-Dec. 2002, filtered from 100M to 1.6M packets.
  • Traffic is filtered and rate limited to extract the start of inbound client sessions (NETAD filter, which passes most attacks).

  14. Attributes Measured
  • Packet header fields (all filtered packets) for Ethernet, IP, TCP, UDP, ICMP.
  • Inbound TCP SYN packet header fields.
  • HTTP, SMTP, and SSH requests (other application protocols are not present in both sets).

  15. Comparison Results
  • Synthetic attributes are too predictable: TTL, TOS, TCP options, TCP window size, HTTP and SMTP command formatting.
  • Too few sources: client addresses, HTTP user agents, SSH versions.
  • Too “clean”: no checksum errors, fragmentation, garbage data in reserved fields, or malformed commands.

  16. TCP SYN Source Address
  r1 ≈ rh ≈ rt ≈ 50% is consistent with a Zipf distribution and a constant growth rate of r.

  17. Real Traffic is Less Predictable
  (Plot: r, the number of values observed, vs. time, with one curve for real and one for synthetic traffic.)

  18. 3. Injecting Real Traffic
  • Mix equal durations of real traffic into weeks 3-5 (both sets filtered, 344 hours each); a merge sketch follows below.
  • We expect r ≥ max(rSIM, rREAL), i.e. a realistic false alarm rate.
  • Modify PHAD, ALAD, LERAD, NETAD, and SPADE so that they do not separate the two traffic sources.
  • Test at 100 false alarms (10 per day) on 3 mixed sets.
  • Compare the fraction of “legitimate” detections on simulated and mixed traffic for the median mixed result.
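A minimal sketch of the mixing step, assuming both traces are lists of (timestamp, packet) pairs, each already sorted by time and covering equal durations; shifting the real timestamps onto the simulation clock is shown as a simple offset, which is an assumption for illustration.

```python
import heapq

def shift_to(trace, new_start):
    """Shift a trace's timestamps so that it starts at new_start."""
    offset = new_start - trace[0][0]
    return [(t + offset, pkt) for t, pkt in trace]

def mix_traces(simulated, real):
    """Interleave two time-sorted traces into one mixed trace in timestamp order."""
    real = shift_to(real, simulated[0][0])
    return list(heapq.merge(simulated, real, key=lambda pkt: pkt[0]))
```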

  19. PHAD
  • Models 34 packet header fields: Ethernet, IP, TCP, UDP, ICMP.
  • Global model (no rule antecedents).
  • Only novel values are anomalous.
  • Anomaly score = tn/r (see the sketch below), where
    t = time since the last anomaly,
    n = number of training packets,
    r = number of allowed values.
  • No modifications needed.
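A minimal sketch of the tn/r score for a single header field; PHAD itself keeps such a model for each of the 34 fields, and field parsing and the combination of per-field scores are omitted here.

```python
class FieldModel:
    """tn/r anomaly scoring for one packet header field."""

    def __init__(self):
        self.allowed = set()      # values observed in training
        self.n = 0                # number of training packets
        self.last_anomaly = 0.0   # time of the last anomaly for this field

    def train(self, value):
        self.allowed.add(value)
        self.n += 1

    def score(self, timestamp, value):
        """Return t*n/r for a novel value, 0 for a value seen in training."""
        if value in self.allowed:
            return 0.0
        t = timestamp - self.last_anomaly          # time since the last anomaly
        self.last_anomaly = timestamp
        r = max(len(self.allowed), 1)              # number of allowed values
        return t * self.n / r
```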

  20. ALAD
  • Models inbound TCP client requests: addresses, ports, flags, application keywords.
  • Score = tn/r.
  • Conditioned on destination port/address.
  • Modified to remove address conditions and protocols not present in the real traffic (telnet, FTP).

  21. LERAD
  • Models inbound client TCP (addresses, ports, flags, 8 words in the payload).
  • Learns conditional rules with high n/r, e.g.: if port = 80 then word1 = GET, POST (n/r = 10000/2).
  • Discards rules that generate false alarms in the last 10% of the training data (see the sketch below).
  • Modified to weight rules by the fraction of real traffic.
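A minimal sketch of a LERAD-style rule and the validation step that discards rules violated in the last 10% of training; how candidate rules are proposed is omitted, and the record format (a dict of attributes such as "port" and "word1") is an assumption for illustration.

```python
class Rule:
    """A conditional rule such as: if port = 80 then word1 in {GET, POST}."""

    def __init__(self, antecedent, attribute):
        self.antecedent = antecedent        # e.g. {"port": 80}
        self.attribute = attribute          # e.g. "word1"
        self.allowed = set()                # consequent values, e.g. {"GET", "POST"}
        self.n = 0                          # training records matching the antecedent

    def matches(self, record):
        return all(record.get(k) == v for k, v in self.antecedent.items())

    def train(self, record):
        if self.matches(record):
            self.n += 1
            self.allowed.add(record.get(self.attribute))

    def n_over_r(self):
        return self.n / max(len(self.allowed), 1)

def validate(rules, training, holdout=0.10):
    """Train on the first 90% of records, drop rules that would raise an alarm
    on the last 10%, and return the rest sorted by n/r (highest first)."""
    split = int(len(training) * (1 - holdout))
    for record in training[:split]:
        for rule in rules:
            rule.train(record)
    kept = [rule for rule in rules
            if all(not rule.matches(rec) or rec.get(rule.attribute) in rule.allowed
                   for rec in training[split:])]
    kept.sort(key=Rule.n_over_r, reverse=True)
    return kept
```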

  22. NETAD
  • Models inbound client request packet bytes: IP, TCP, TCP SYN, HTTP, SMTP, FTP, telnet.
  • Score = tn/r + ti/fi, allowing previously seen values, where
    ti = time since value i was last seen,
    fi = frequency of i in training.
  • Modified to remove telnet and FTP.

  23. SPADE (Hoagland)
  • Models inbound TCP SYN packets.
  • Score = 1/P(src IP, dest IP, dest port), with the probability estimated by counting (see the sketch below).
  • Always in training mode.
  • Modified by randomly replacing the real destination IP with one of the 4 simulated targets.
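A minimal sketch of SPADE-style scoring as described on this slide: the joint probability of (source IP, destination IP, destination port) is estimated by counting, the score is its reciprocal, and the counts are updated on every packet because the detector is always in training mode.

```python
from collections import Counter

class Spade:
    """Score inbound TCP SYN packets by 1 / P(src IP, dest IP, dest port)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score(self, src_ip, dst_ip, dst_port):
        key = (src_ip, dst_ip, dst_port)
        self.counts[key] += 1                  # always in training mode
        self.total += 1
        p = self.counts[key] / self.total      # probability by counting
        return 1.0 / p                         # rarer combinations score higher
```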

  24. Criteria for Legitimate Detection
  • Source address: the target server must authenticate the source.
  • Destination address/port: the attack must use or scan that address/port.
  • Packet header field: the attack must write/modify the packet header (probe or DOS).
  • No U2R or Data attacks.

  25. Mixed Traffic: Fewer Detections, but More are Legitimate
  Detections out of 177 at 100 false alarms.

  26. Conclusions
  • SAD suggests the presence of simulation artifacts and artificially low false alarm rates.
  • The simulated traffic is too clean, static, and predictable.
  • Injecting real traffic reduces suspect detections in all 5 systems tested.

  27. Limitations and Future Work
  • Only one real data source was tested; the results may not generalize.
  • Tests on real traffic cannot be replicated due to privacy concerns (root passwords in the data, etc.).
  • Each IDS must be analyzed and modified to prevent data separation.
  • Is host data affected (BSM, audit logs)?

  28. Limitations and Future Work
  • Real data may contain unlabeled attacks. We found over 30 suspicious HTTP requests in our data (to a Solaris-based host), for example:
  IIS exploit with double URL encoding (IDS evasion?):
  GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir
  Probe for the Code Red backdoor:
  GET /MSADC/root.exe?/c+dir HTTP/1.0

  29. Further Reading
  Matthew V. Mahoney and Philip K. Chan, "An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection," Dept. of Computer Sciences Technical Report CS-2003-02. http://cs.fit.edu/~mmahoney/paper7.pdf
