
KDD’99 Classifier Learning Contest~Network Intrusion



  1. KDD’99 Classifier Learning Contest~Network Intrusion Advisor: Dr. Hsu Graduate: Min-Hong Lin IDSL seminar

  2. Outline • Motivation • Objective • Results of the KDD’99 Classifier Learning • The Winning Entry : Bagged Boosting • Second-place : Kernel Miner • Third-place : The MP13 Approach • Conclusions • Personal Opinion IDSL

  3. Motivation • Network security is an important issue. • The goal is to detect and prevent network intrusions in advance. • Classifier learning can help address this problem. IDSL

  4. Objective • To learn a predictive model capable of distinguishing between legitimate and illegitimate connections in a computer network. IDSL

  5. Introduction • 24 entries were submitted for the contest. • The training and test data were prepared by Prof. Sal Stolfo and Prof. Wenke Lee. • A data quality issue with the labels of the test data was discovered by Ramesh Agarwal and Mahesh Joshi. • Each entry was scored against the corrected test data by an awk scoring script using the cost matrix. IDSL

  6. The Winning Entries • The winning entry was submitted by Dr. Bernhard Pfahringer of the Austrian Research Institute for Artificial Intelligence. • Second-place performance was achieved by Itzhak Levin from LLSoft, Inc. in Israel. • Third-place performance was achieved by Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin of the company MP13 in Moscow, Russia. • The difference in performance between the three best entries is only of marginal statistical significance. IDSL

  7. Performance Of The Winning Entry • The winning entry achieved an average cost of 0.2331 per test example and obtained the confusion matrix shown on the original slide. IDSL

  8. Statistical Significance • The mean score is 0.2331. • The standard deviation is 0.8334. • The standard error is 0.8334/sqrt(N). • The test dataset contains 311,029 examples, but these are not all independent, so a smaller effective sample size is used. • The standard error is 0.8334/sqrt(77291) ≈ 0.0030. IDSL

  9. Statistical Significance (contd.) • At the two-standard-error level, the winning entry is significantly superior to all others except the second- and third-best entries. • The first significant difference is between the 17th and 18th best entries: 0.2952 - 0.2684 = 0.0268, about 9 standard errors. IDSL
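
As a quick check of the arithmetic on these two slides, a minimal sketch (all numbers are taken from the slides themselves):

```python
import math

std_dev = 0.8334       # standard deviation of per-example cost
n_effective = 77291    # effective sample size used on the slide

std_err = std_dev / math.sqrt(n_effective)
print(f"standard error ~= {std_err:.4f}")                       # ~0.0030

# Gap between the 17th and 18th best entries, expressed in standard errors.
diff = 0.2952 - 0.2684
print(f"difference = {diff:.4f} (~{diff / std_err:.1f} s.e.)")  # ~9 s.e.
```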

  10. A Simple Method Performs Well • One entry was simply "the trusty old 1-nearest neighbor classifier", which scored 0.2523. • Only nine entries scored better than 1-nearest neighbor, and only six of those were statistically significantly better. IDSL
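
For reference, a 1-nearest-neighbor baseline of the kind mentioned here can be sketched as follows. This is not the contestant's actual code; the feature scaling and the placeholder data are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# In practice X_train/X_test would be numeric matrices built from the 41
# KDD'99 connection features (categorical fields one-hot encoded); random
# placeholder data is used here so the sketch runs on its own.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 41))
y_train = rng.integers(0, 5, size=1000)   # 5 classes: normal, probe, DOS, U2R, R2L
X_test = rng.normal(size=(200, 41))

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```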

  11. Cost-Based Scoring • The cost matrix used for scoring entries is shown on the original slide; the sketch below illustrates how such a matrix is applied. IDSL
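
A minimal sketch of cost-based scoring. The matrix values below follow the commonly cited KDD'99 cost matrix; since the slide image is not reproduced in this transcript, treat them as an assumption rather than a transcription of the slide.

```python
import numpy as np

# Rows = actual class, columns = predicted class,
# in the order: normal, probe, DOS, U2R, R2L.
COST = np.array([
    [0, 1, 2, 2, 2],   # actual: normal
    [1, 0, 2, 2, 2],   # actual: probe
    [2, 1, 0, 2, 2],   # actual: DOS
    [3, 2, 2, 0, 2],   # actual: U2R
    [4, 2, 2, 2, 0],   # actual: R2L
])

def average_cost(y_true, y_pred):
    """Average cost per test example, the quantity used to rank contest entries."""
    return COST[y_true, y_pred].mean()

# Example: three test records with classes encoded 0..4.
y_true = np.array([0, 4, 2])
y_pred = np.array([0, 0, 2])          # mislabelling an R2L record as normal costs 4
print(average_cost(y_true, y_pred))   # (0 + 4 + 0) / 3 = 1.333...
```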

  12. Training vs. Test Distribution • Some basic domain knowledge about network intrusions suggests that the U2R and R2L categories are intrinsically rare. • The actual distributions of attack types in the training (10%) and test datasets are shown on the original slide. IDSL

  13. The Winning Entry: Bagged Boosting • The solution is essentially a mixture of bagging and boosting. • Asymmetric error costs are taken into account by minimizing the conditional risk. • The standard sampling-with-replacement methodology of bagging was modified to put a specific focus on the smaller but expensive-if-predicted-wrongly classes (see the sketch below). IDSL
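
The slide does not spell out the modified sampling scheme, so the following is only a rough sketch of one way to bias bagging samples toward the rare, expensive classes; the class weights are illustrative assumptions, not the contestant's actual values.

```python
import numpy as np

def biased_bag_sample(y, class_weight, rng, n_samples=None):
    """Sample indices with replacement, oversampling rare/expensive classes."""
    n_samples = n_samples or len(y)
    w = np.array([class_weight[c] for c in y], dtype=float)
    return rng.choice(len(y), size=n_samples, replace=True, p=w / w.sum())

rng = np.random.default_rng(42)
y = np.array([0] * 950 + [3] * 5 + [4] * 45)           # mostly normal, few U2R/R2L
weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 20.0, 4: 10.0}   # assumed emphasis on U2R/R2L
bag_indices = biased_bag_sample(y, weights, rng)       # indices of one bagging sample
```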

  14. Bagged Boosting: Preliminary Exploration • In an initial test stage, various standard learning algorithms were applied: C5, Ripper, naive Bayes, nearest neighbor, and a back-propagation neural network. • This initial scenario was a kind of inverted cross-validation, where the data was split into ten folds: only one fold was used for learning and the other nine folds for testing (see the sketch below). • All variants of C5 performed much better than naive Bayes. • Boosted trees showed a small but significant lead. IDSL
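
A minimal sketch of the inverted cross-validation described above (assuming ten folds; scikit-learn's KFold is used only as a convenient fold generator):

```python
import numpy as np
from sklearn.model_selection import KFold

def inverted_cv_splits(X, n_folds=10, seed=0):
    """Yield (learn_idx, test_idx) pairs where a single fold is used for
    learning and the remaining nine folds for testing -- the reverse of
    ordinary cross-validation."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for other_folds_idx, single_fold_idx in kf.split(X):
        yield single_fold_idx, other_folds_idx

X = np.arange(100).reshape(-1, 1)          # placeholder data
for learn_idx, test_idx in inverted_cv_splits(X):
    assert len(learn_idx) < len(test_idx)  # 1 fold to learn on, 9 to test on
```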

  15. Bagged Boosting: The Final Predictor • Fifty samples were drawn from the original set of roughly five million examples. • For each sample, an ensemble of ten C5 decision trees was induced using both C5's error-cost and boosting options. • The final predictions were computed on top of the 50 single predictions of the sub-ensembles by minimizing the conditional risk. • This risk is defined as the sum, over classes, of the error cost of predicting a specific class times the probability of the respective class (see the sketch below). IDSL
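
A minimal sketch of conditional-risk minimization as defined above. The class probabilities are placeholders; the cost matrix is the same commonly cited one as in the scoring sketch earlier, so it is likewise an assumption.

```python
import numpy as np

COST = np.array([          # COST[actual, predicted]; see the scoring sketch above
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
])

def min_risk_prediction(class_probs, cost=COST):
    """Pick the class with the smallest conditional risk:
    risk(j) = sum_i P(class i) * cost(i, j)."""
    risk = class_probs @ cost        # expected cost of each possible prediction
    return int(np.argmin(risk))

# Example: averaged ensemble probabilities for one record. "normal" is the most
# probable class, but the cost-weighted risk shifts the decision to class 1.
probs = np.array([0.6, 0.05, 0.05, 0.05, 0.25])
print(min_risk_prediction(probs))    # prints 1
```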

  16. Bagged Boosting: Miscellaneous • Training sets of about half a million examples took C5 less than an hour to process on a two-processor machine. • 50 such samples were processed, yielding 50x10 trees. • The final production run took more than a day. IDSL

  17. LLSoft’s Results: Kernel Miner • Kernel Miner is a new data-mining tool based on building the optimal decision forest. • Kernel Miner is a tool for the description, classification, and generalization of data, and for predicting new cases. • Kernel Miner is a fully automated tool that provides solutions to database users. IDSL

  18. LLSoft’s Results: General Model And Algorithm • Kernel Miner is based upon a global optimization model developed by the authors. • This global model is decomposed into a system of interrelated, inter-coordinated, and mutually consistent models and criteria. • As a result, Kernel Miner constructs a set of locally optimal decision trees (the decision forest), from which it selects the optimal subset of trees (the subforest) used for predicting new cases. • Taking reliability and stability parameters into account during prediction helps avoid overfitting. IDSL

  19. LLSoft’s Results: Task • Training dataset: 494,021 records. • Each record contains the values of 41 independent variables. • The dependent variable is labeled either as normal (0) or as one of four attack categories (1-4). • Test dataset: 311,029 records. IDSL
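
A sketch of how records with this layout might be loaded and the raw label collapsed into the five classes. The column layout matches the slide (41 features plus a label); the attack-to-category mapping shown is partial and illustrative, not the complete list from the task description.

```python
import pandas as pd

# Partial mapping from raw attack labels to the five contest classes
# (0 = normal, 1 = probe, 2 = DOS, 3 = U2R, 4 = R2L).
CATEGORY = {
    "normal": 0,
    "ipsweep": 1, "portsweep": 1, "nmap": 1, "satan": 1,
    "smurf": 2, "neptune": 2, "back": 2, "teardrop": 2,
    "buffer_overflow": 3, "rootkit": 3,
    "guess_passwd": 4, "warezclient": 4, "ftp_write": 4,
}

def load_kdd(path):
    """Load a KDD'99-style CSV: 41 feature columns followed by the label column."""
    df = pd.read_csv(path, header=None)
    features = df.iloc[:, :41]
    # Raw labels look like "smurf."; strip the trailing dot and map to 0..4.
    labels = df.iloc[:, 41].str.rstrip(".").map(CATEGORY)
    return features, labels
```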

  20. LLSoft’s Results: Approach And Method Used • 1. Coding of Values of Variables • 2. Constructing the Set of Initial “Good” Partitions • 3. Constructing the Decision Trees • 4. Selection of the Optimal Decision Subforest • 5. Prediction on the Test Dataset IDSL

  21. The type is "smurf" if and only if (519 < src_bytes <= 1032) and (service is ecr_i). IDSL
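
This rule can be transcribed directly as a predicate over the two connection features it names (a plain restatement of the slide, not Kernel Miner's internal representation):

```python
def is_smurf(src_bytes: int, service: str) -> bool:
    """Rule from the slide: smurf iff 519 < src_bytes <= 1032 and service is ecr_i."""
    return 519 < src_bytes <= 1032 and service == "ecr_i"

assert is_smurf(1032, "ecr_i")
assert not is_smurf(519, "ecr_i")
assert not is_smurf(700, "http")
```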


  23. Compare To The Winning Results • Kernel Miner correctly classified 657 more test examples than the winning entry (289,006 versus 288,349). • Equivalently, Kernel Miner made 657 fewer misclassifications (22,023 versus 22,680). • However, Kernel Miner made more misclassifications in the (R2L, Normal) cell of the confusion matrix (14,994 versus 14,527), which carries the highest cost. IDSL

  24. Analysis of Results • The majority of misclassifications belong to new attack types that were not present in the training data. • There were 4,804 errors where "normal" was predicted for "R2L" records. • The majority of these records were labeled "guess_passwd" in the test dataset (4,110 out of 4,804). • Note that the 10% training dataset contained only 53 records labeled "guess_passwd". • Kernel Miner nevertheless identified a precise pattern for such records, expressed as 10 decision trees. IDSL


  28. The MP13 Approach • The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space. • The approach employed by the MP13 team builds on the idea of so-called "Partner Systems". • It is aimed at effective data analysis and problem resolution based on formalizing expert knowledge. IDSL

  29. The MP13 Approach: Steps • Verbal rules constructed by an expert proficient in network security technology and familiar with KDD methods • First echelon of voting decision trees • Second echelon of voting decision trees IDSL

  30. The MP13 Approach: Work Details • In a preliminary stage, 13 decision trees were generated based on a subset of the training data. • The training dataset was randomly split into three subsamples: 25% for tree generation, 25% for tree tuning, and 50% for estimating model quality. • The learning dataset was prepared as 10% of the complete training database (about 400,000 entries). • Some of the DOS and "normal" connections were randomly removed from the full training database. • Learning proceeded on the "one against the rest" principle. • The testing dataset was converted into a "potential space" representation. IDSL

  31. The MP13 Approach: Training Algorithm • A version of the "Fragment" algorithm is used, originally invented at the IITP (Russian Academy of Sciences) in the "Partner Systems" division. • To construct a decision tree, the training dataset is split into a learning sample and a testing sample. • The learning sample is used to find the structure of a tree and to generate a hierarchy of models on that tree. • The testing sample is used to select a sub-tree of optimal complexity. • The algorithm is applied repeatedly to various splits of the training data and to different subspaces of the initial data description, generating a set of voting decision trees (see the sketch below). IDSL
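
The "Fragment" algorithm itself is not described on the slide, so the following is only a rough sketch of the general scheme it outlines (random splits, per-tree complexity chosen on a held-out testing sample, prediction by voting), using scikit-learn trees as a stand-in:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def train_voting_trees(X, y, n_trees=13, n_features=20, seed=0):
    """Grow each tree on a random split and feature subspace; pick its depth
    (a stand-in for 'optimal complexity') on a held-out testing sample."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(n_trees):
        cols = rng.choice(X.shape[1], size=min(n_features, X.shape[1]), replace=False)
        X_learn, X_test, y_learn, y_test = train_test_split(
            X[:, cols], y, test_size=0.5, random_state=seed + i)
        best = max(
            (DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_learn, y_learn)
             for d in (2, 4, 6, 8)),
            key=lambda t: t.score(X_test, y_test))
        trees.append((cols, best))
    return trees

def vote(trees, X):
    """Majority vote of the individual trees."""
    preds = np.stack([t.predict(X[:, cols]) for cols, t in trees]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)

X = np.random.default_rng(1).normal(size=(300, 41))   # placeholder data
y = np.random.default_rng(2).integers(0, 5, size=300)
predictions = vote(train_voting_trees(X, y), X)
```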

  32. Conclusions • The winning solution was not significantly better than the two runners-up. • Kernel Miner is a continually developing tool, and additional methods and algorithms are planned for its next versions. IDSL

  33. Personal Opinion • The different distributions of the training and testing datasets may influence the final result. • A simple method can perform well. • Time complexity should also be taken into account. IDSL
