
Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory By John McHugh Presented by Hongyu Gao Feb. 5, 2009
Outline • Lincoln Lab’s evaluation in 1998 • Critique of data generation • Critique of taxonomy • Critique of the evaluation process • Brief discussion of the 1999 evaluation • Conclusion
The 1998 evaluation • The most comprehensive evaluation of research on intrusion detection systems that has been performed to date
The 1998 evaluation cont’d • Objective: • “To provide unbiased measurement of current performance levels.” • “To provide a common shared corpus of experimental data that is available to a wide range of researchers”
The 1998 evaluation, cont’d • Simulated a typical air force base network
The 1998 evaluation, cont’d • Collected synthetic traffic data
The 1998 evaluation cont’d • Researchers tested their systems using the traffic • Receiver Operating Characteristic (ROC) curves were used to present the results
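A minimal sketch (not Lincoln Lab’s scoring code) of how ROC points, i.e. (false alarm rate, detection rate) pairs, can be computed from labeled per-session scores; the scores, labels, and function name below are illustrative assumptions:

```python
# Minimal sketch of deriving ROC points (detection rate vs. false alarm
# rate) from per-session anomaly scores and attack labels.
# Names and data are illustrative, not from the Lincoln Lab scoring code.

def roc_points(scored_sessions):
    """scored_sessions: list of (score, is_attack) pairs."""
    thresholds = sorted({score for score, _ in scored_sessions}, reverse=True)
    attacks = sum(1 for _, is_attack in scored_sessions if is_attack)
    normals = len(scored_sessions) - attacks
    points = []
    for t in thresholds:
        tp = sum(1 for s, a in scored_sessions if a and s >= t)
        fp = sum(1 for s, a in scored_sessions if not a and s >= t)
        points.append((fp / normals, tp / attacks))  # (false alarm rate, detection rate)
    return points

# Example: a handful of labeled sessions (made-up scores)
sessions = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.2, False)]
for far, dr in roc_points(sessions):
    print(f"false alarm rate={far:.2f}  detection rate={dr:.2f}")
```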
1. Critique of data generation • Both background (normal) and attack data are synthesized. • Said to represent traffic to and from a typical air force base. • For the results to be meaningful, the synthesized data must reflect system performance in realistic scenarios.
Critique of background data • Counterpoint 1 • Real traffic is not well-behaved. • E.g. spontaneous packet storms that are indistinguishable from malicious attempts at flooding. • Such behavior is not present in the synthesized background traffic
Critique of background data, cont’d • Counterpoint 2 • The average data rate is low
Critique of background data, cont’d • Possible negative consequences • The system may produce a larger number of false positives (FP) in realistic scenarios. • The system may drop packets in realistic scenarios
Critique of attack data • The distribution of attacks is not realistic • The numbers of U2R, R2L, DoS, and Probing attacks are all of the same order of magnitude
Critique of attack data, cont’d • Possible negative consequences • The aggregate detection rate does not reflect the detection rate that would be seen in real traffic
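A worked sketch of this point, using made-up per-category detection rates and attack mixes (none of these numbers come from the evaluation): the aggregate detection rate changes substantially when the same detector is weighted by a more realistic, DoS/probe-heavy attack distribution.

```python
# Illustrative (made-up) per-category detection rates and attack mixes,
# showing how the aggregate detection rate depends on the attack
# distribution used in the evaluation.

detection_rate = {"DoS": 0.9, "Probe": 0.8, "R2L": 0.3, "U2R": 0.4}

# Evaluation mix: roughly the same number of attacks per category.
eval_mix = {"DoS": 0.25, "Probe": 0.25, "R2L": 0.25, "U2R": 0.25}

# Hypothetical "real traffic" mix dominated by DoS and probing.
real_mix = {"DoS": 0.60, "Probe": 0.35, "R2L": 0.04, "U2R": 0.01}

def aggregate(mix):
    """Detection rate averaged over attacks drawn from the given mix."""
    return sum(mix[c] * detection_rate[c] for c in mix)

print(f"aggregate on evaluation mix: {aggregate(eval_mix):.2f}")
print(f"aggregate on realistic mix:  {aggregate(real_mix):.2f}")
```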
Critique of the simulated AFB network • Not likely to be realistic • Only 4 real machines • 3 fixed attack targets • Flat network architecture • Possible negative consequences • An IDS can be tuned to look only at traffic targeting certain hosts • The topology precludes the execution of “smurf” or other ICMP echo attacks
2. Critique of taxonomy • Based on the attacker’s point of view • Denial of service (DoS) • Remote to local (R2L) • User to root (U2R) • Probing • Not useful for describing what an IDS might see
Critique of taxonomy, cont’d • Alternative taxonomies • Classify by protocol layer • Classify by whether a completed protocol handshake is necessary • Classify by severity of attack • Many others…
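As a hedged illustration of such an IDS-centric taxonomy (field names and example entries are hypothetical, not taken from the paper), one could tag each attack by the properties a network sensor can actually observe:

```python
# Hypothetical record for an IDS-centric attack taxonomy, tagging each
# attack by what a network sensor can observe rather than by the
# attacker's goal. Field names and example entries are illustrative.

from dataclasses import dataclass

@dataclass
class AttackDescriptor:
    name: str
    protocol_layer: str    # e.g. "IP/ICMP", "TCP", "application/HTTP"
    needs_handshake: bool  # does detection require a completed protocol handshake?
    severity: str          # e.g. "low", "medium", "high"

examples = [
    AttackDescriptor("smurf", "IP/ICMP", needs_handshake=False, severity="high"),
    AttackDescriptor("phf", "application/HTTP", needs_handshake=True, severity="medium"),
]
for attack in examples:
    print(attack)
```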
3. Critique of the evaluation • The unit of evaluation • The session is used • Some traffic (e.g. messages originating from Ethernet hubs) is not part of any session • Is the “session” an appropriate unit?
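A minimal sketch, assuming a TCP-style 5-tuple definition of “session”, of why some traffic can never be scored under a per-session evaluation; the packet records below are illustrative:

```python
# Minimal sketch of grouping packets into "sessions" by a TCP-style
# 5-tuple key. Packet records are made up; the point is that
# non-connection traffic (e.g. hub/broadcast messages) falls outside
# every session and so is never scored by a per-session evaluation.

from collections import defaultdict

packets = [
    {"proto": "TCP", "src": "10.0.0.5", "sport": 1025, "dst": "10.0.0.9", "dport": 80},
    {"proto": "TCP", "src": "10.0.0.5", "sport": 1025, "dst": "10.0.0.9", "dport": 80},
    {"proto": "HUB", "src": "hub-01", "sport": None, "dst": "broadcast", "dport": None},
]

sessions = defaultdict(list)
unsessioned = []
for p in packets:
    if p["proto"] == "TCP":
        key = (p["proto"], p["src"], p["sport"], p["dst"], p["dport"])
        sessions[key].append(p)
    else:
        unsessioned.append(p)  # invisible to a session-based score

print(f"sessions: {len(sessions)}, packets outside any session: {len(unsessioned)}")
```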
Critique of the evaluation, cont’d • Scoring and the ROC curve • What should the denominator of the false alarm rate be?
Critique of the evaluation, cont’d • A non-standard variation of the ROC curve is used • The x-axis is replaced with false alarms per day • Possible problem • The number of false alarms per unit time may increase significantly as the data rate increases • Suggested alternatives • Use the total number of alerts (both TP and FP) on the x-axis • Or use the standard ROC curve
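A small numeric sketch of the scaling problem, with assumed traffic volumes: at a fixed per-session false alarm rate, the “false alarms per day” axis grows linearly with the number of sessions, so the same detector is plotted very differently at different data rates.

```python
# Illustrative numbers only: at a fixed per-session false alarm rate,
# "false alarms per day" grows linearly with traffic volume, so the
# same detector looks very different at different data rates.

false_alarm_rate = 0.001  # 0.1% of sessions incorrectly flagged
for sessions_per_day in (10_000, 100_000, 1_000_000):
    alarms_per_day = false_alarm_rate * sessions_per_day
    print(f"{sessions_per_day:>9,} sessions/day -> {alarms_per_day:,.0f} false alarms/day")
```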
Evaluation of Snort, cont’d • Poor performance on DoS and Probe • Good performance on R2L and U2R • Conclusion on Snort: • The evaluation is not sufficient to draw any conclusion
Critique of the evaluation, cont’d • False alarm rate • A crucial concern • The designated maximum value (0.1%) is inconsistent with the maximum operator load set by Lincoln Lab (100 false alarms per day)
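A hedged back-of-the-envelope check of this inconsistency: the 0.1% rate and the 100-alarms-per-day limit coincide only at one particular traffic volume, which may or may not match the simulated load (the session counts here are assumptions, not figures from the evaluation).

```python
# Back-of-the-envelope check with assumed session counts: a 0.1%
# false alarm rate matches the 100 false-alarms/day operator limit
# only at one particular traffic volume.

max_rate = 0.001      # 0.1% false alarm rate shown on the ROC plots
operator_limit = 100  # stated maximum false alarms an operator can handle per day

break_even_sessions = operator_limit / max_rate
print(f"The two limits coincide only at {break_even_sessions:,.0f} sessions/day;")
print("at any other volume, one of the two thresholds is the binding one.")
```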
Critique of the evaluation, cont’d • Do the evaluation results really mean something? • The ROC curve only reflects the ability to detect attacks against the given normal traffic • What does a good IDS consist of? • Algorithm • Reliability • Good signatures • …
Brief discussion of the 1999 evaluation • It makes some superficial improvements • Additional hosts and host types are added • New attacks are added • None of these addresses the flaws listed above
Brief discussion of the 1999 evaluation, cont’d • The security policy is not clear • What is an attack, and what is not? • E.g. scans and probes
Conclusion • The Lincoln Lab evaluation is a major and impressive effort. • This paper critiques the evaluation from several different aspects.
Follow-up Work • DETER - A testbed for network security technology. • A public facility for medium-scale, repeatable experiments in computer security. • Located at USC ISI and UC Berkeley. • 300 PC systems running Utah's Emulab software. • Experimenters can access DETER remotely to develop, configure, and manipulate collections of nodes and links with arbitrary network topologies. • The current problem is that the framework has no realistic attack module or background-noise generator plugin; attack distribution is a problem. • PREDICT - A large trace repository. It is not public, and there are several legal issues in working with it.
Follow-up Work • KDD Cup - Its goal is to provide datasets from real-world problems to demonstrate the applicability of different knowledge discovery and machine learning techniques. • The 1999 KDD intrusion detection contest uses a labelled version of the 1998 DARPA dataset, • annotated with connection features. • There are several problems with the KDD Cup data: recently, people have found that average TCP packet size is the best correlating metric for attacks, which clearly points out the inefficacy of the dataset.
Discussion • Can the aforementioned problems be addressed? • Dataset • Taxonomy • Unit of analysis • Approach for comparing IDSes • …
The End Thank you