
Predicting zero-day software vulnerabilities through data mining --Second Presentation



Presentation Transcript


  1. Predicting zero-day software vulnerabilities through data mining: Second Presentation • Su Zhang

  2. Outline • Quick Review • Data Source – NVD • Six Most Popular/Vulnerable Vendors For Our Experiments • Why The Six Vendors Are Chosen • Data Preprocessing • Functions Available For Our Approach • Statistical Results • Plan For Next Phase

  3. Quick Review

  4. Source Database – NVD • National Vulnerability Database: the U.S. government repository of standards-based vulnerability management data. • Data included in each NVD entry: published date and time; the vulnerable software’s CPE specification. • Derived data: published date and time → month; published date and time → day; CPE of two adjacent vulnerabilities → diff(v1, v2), the version diff; CPE specification → software name; gap between adjacent distinct published dates → ttpv (time to previous vulnerability) and ttnv (time to next vulnerability).
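
The derivation of software name, version, month and day from one NVD entry could look roughly like the sketch below. It assumes the CPE 2.2 URI layout `cpe:/part:vendor:product:version` (as in the `cpe:/o:linux:linux_kernel:...` string quoted on a later slide) and an ISO-style timestamp that starts with `YYYY-MM-DD`. The function names `parse_cpe` and `derive_month_day` are hypothetical, not taken from the presentation.

```python
from datetime import datetime


def parse_cpe(cpe_uri):
    """Split a CPE 2.2 URI such as 'cpe:/o:linux:linux_kernel:2.6.18'."""
    prefix = "cpe:/"
    if not cpe_uri.startswith(prefix):
        raise ValueError(f"not a CPE 2.2 URI: {cpe_uri!r}")
    fields = cpe_uri[len(prefix):].split(":")
    # Pad so that missing trailing components become empty strings.
    fields += [""] * (4 - len(fields))
    part, vendor, product, version = fields[:4]
    return {"part": part, "vendor": vendor, "software": product, "version": version}


def derive_month_day(published):
    """Return (month, day) from a timestamp that starts with YYYY-MM-DD (assumed format)."""
    ts = datetime.strptime(published[:10], "%Y-%m-%d")
    return ts.month, ts.day


entry = parse_cpe("cpe:/o:linux:linux_kernel:2.6.18")
month, day = derive_month_day("2007-05-02T10:19:00")  # made-up timestamp
```

Only the first four CPE components are kept here, since the slides use vendor, software name and version and nothing else.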

  5. Six Most Vulnerable/Popular Vendors • Linux: 56925 instances • Sun: 24726 instances • Cisco: 20120 instances • Mozilla: 19965 instances • Microsoft: 16703 instances • Apple: 14809 instances.

  6. Why We Only Choose Instances Of Pop Vendors—Instances Table

  7. Why We Only Choose Instances Of Pop Vendors—Vulnerability Table

  8. Why We Only Choose Instances Of Pop Vendors • The large number of distinct nominal values (vendors and software names) causes a scalability problem. • NVD contains too many vendors (10411). • The top six vendors account for 43.4% of all instances. • The seventh most popular/vulnerable vendor has far fewer instances than the sixth. • Vendors are treated as independent in our approach.

  9. Data Preprocessing • NVD data is used as the training/testing dataset. • Only entries from 2005 onward are used, because earlier data looks unstable. • Some obvious errors in NVD are corrected (e.g. “cpe:/o:linux:linux_kernel:390”). • Attributes • Published time: only month and day are used. • Version diff: a normalized difference between two versions. • Vendor: removed.

  10. Data Preprocessing (cont) • Attributes • Vulnerabilities published on the same day are grouped, which guarantees that ttnv and ttpv are non-zero. • ttnv is the predicted attribute. • For each software • Delete its first batch of instances. • Delete its last batch of instances.
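
One possible reading of this grouping step is sketched below: collapse each product's publication dates to distinct days, then measure the gap in days to the previous and next distinct day. Dropping the first and last group is interpreted here as removing the rows that have no previous or no next neighbour, which is an assumption rather than something the slide states. The function name `time_gaps` and the sample dates (including their years) are made up so that the gaps match the example slide.

```python
from datetime import date


def time_gaps(published_dates):
    """Return (day, ttpv, ttnv) rows for one software product."""
    # One group per distinct publication day, in chronological order.
    days = sorted(set(published_dates))
    rows = []
    # The first group has no previous day and the last has no next day,
    # so both are skipped (assumed meaning of "delete first/last batch").
    for prev, cur, nxt in zip(days, days[1:], days[2:]):
        rows.append((cur, (cur - prev).days, (nxt - cur).days))
    return rows


sample = [
    date(2007, 2, 21),
    date(2007, 5, 2),
    date(2007, 5, 2),   # same-day duplicate, merged into one group
    date(2007, 5, 7),
    date(2008, 2, 12),
]
rows = time_gaps(sample)
```

Because same-day entries are merged before the gaps are computed, every ttpv and ttnv is at least one day.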

  11. Version diff Calculation • v1 = 3.6.4; v2 = 3.6; MaxVersionLength = 4 • v1 = expand(v1, 4) = 3.6.4.0 • v2 = expand(v2, 4) = 3.6.0.0 • diff(v1, v2) = (3-3) * 100^0 + (6-6) * 100^-1 + (4-0) * 100^-2 + (0-0) * 100^-3 = 4E-4
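
The calculation on this slide can be sketched in a few lines. The names `expand` and `version_diff` are illustrative, and the sketch assumes every version component is a plain integer, so a string such as `2.6.18-rc1` would need extra handling that the slides do not describe.

```python
def expand(version, length):
    """Pad a dotted version with zeros: expand('3.6', 4) gives [3, 6, 0, 0]."""
    parts = [int(p) for p in version.split(".")]
    return parts + [0] * (length - len(parts))


def version_diff(v1, v2, max_len=4):
    """Weighted component difference; position i has weight 100 ** -i."""
    a, b = expand(v1, max_len), expand(v2, max_len)
    return sum((x - y) * 100.0 ** -i for i, (x, y) in enumerate(zip(a, b)))


d_slide = version_diff("3.6.4", "3.6")          # about 4E-4
d_kernel = version_diff("2.6.19.2", "2.6.18")   # about 1.02E-4
```

The second call gives the vdiff value listed for the kernel versions 2.6.18 and 2.6.19.2 on the example slide.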

  12. An Example • Columns: vendor, soft, version, month, day, vdiff, ttpv, ttnv • linux, kernel, 2.6.18, 05, 02, 0, 70, 5 • linux, kernel, 2.6.19.2, 05, 07, 1.02E-4, 5, 281

  13. Functions Available For Our Approach In Weka • Least Mean Square • Linear Regression • Multilayer Perceptron • SMOreg • RBF Network • Gaussian Processes

  14. Several Statistical Results • Function: Linear Regression • Training dataset: 66% of the Linux instances (randomly picked, 2005 onward) • Test dataset: the remaining 34% • Test result: • Correlation coefficient 0.5127 • Mean absolute error 11.2358 • Root mean squared error 25.4037 • Relative absolute error 107.629 % • Root relative squared error 86.0388 % • Total number of instances 17967

  15. Correlation Coefficient
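
Assuming the correlation coefficient reported in the results is the usual Pearson correlation between the actual and the predicted ttnv values, it could be computed as in this sketch. The function name and the sample numbers are illustrative only.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


perfect = correlation([1, 2, 3], [2, 4, 6])    # predictions rise with actuals
inverted = correlation([1, 2, 3], [3, 2, 1])   # predictions move the wrong way
```

A value near 1 means the predictions rise and fall with the actual values, while a negative value means they tend to move in the opposite direction.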

  16. Several Definitions About “Error” • Mean absolute error • Root mean squared error

  17. Several Definitions About “Error” (Cont) • Relative absolute error • Root relative squared error
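
The four error measures named on these two slides have standard textbook definitions, sketched below. This is a minimal sketch, not Weka's code: the relative measures here use the mean of the test-set actual values as the baseline predictor, and Weka's own baseline may be derived differently (for example from the training data). The function name `error_metrics` is hypothetical.

```python
import math


def error_metrics(actual, predicted):
    """Return MAE, RMSE, relative absolute error and root relative squared error."""
    n = len(actual)
    mean_a = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of a baseline that always predicts the mean (assumed baseline).
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }


metrics = error_metrics([1, 2, 3, 4], [2, 2, 3, 3])
```

Under these definitions, a relative error above 100 % means the model does worse on that measure than always predicting the mean.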

  18. Several Statistical Results • Function: Least Mean Square • Training dataset: 66% of the Linux instances (randomly picked, 2005 onward) • Test dataset: the remaining 34% • Test result: • Correlation coefficient -0.1501 • Mean absolute error 7.6676 • Root mean squared error 30.6038 • Relative absolute error 73.449 % • Root relative squared error 103.6507 % • Total number of instances 17967

  19. Several Statistical Results • Function: Multilayer Perceptron • Training dataset: 66% of the Linux instances (randomly picked, 2005 onward) • Test dataset: the remaining 34% • Test result: • Correlation coefficient 0.9886 • Mean absolute error 0.4068 • Root mean squared error 4.6905 • Relative absolute error 3.7802 % • Root relative squared error 15.1644 % • Total number of instances 17967

  20. Several Statistical Results • Function: RBF Network • Training dataset: 66% of the Linux instances (randomly picked, 2005 onward) • Test dataset: the remaining 34% • Test result: • Linear regression model: ttnv = -15.3206 * pCluster_0_1 + 21.6205 • Correlation coefficient 0.1822 • Mean absolute error 10.5857 • Root mean squared error 29.048 • Relative absolute error 101.4023 % • Root relative squared error 98.3814 % • Total number of instances 17967

  21. Summary Of Current Results • Linear Regression: not accurate enough, but looks promising (correlation coefficient 0.5127). • Least Mean Square: probably not suitable for our approach (negative correlation coefficient). • Multilayer Perceptron: looks good, but it cannot provide us with a linear model.

  22. Summary Of Current Results (Cont) • SMOreg: for most vendors it takes too long to finish (usually more than 80 hours). • RBF Network: not very accurate. • Gaussian Processes: runs out of heap memory in most of our experiments.

  23. Possible Ways To Improve The Accuracy Of Our Models • Add CVSS metrics as predictive attributes. • Binarize our predictive attributes (e.g. divide ttnv/ttpv into several categories). • Use regression SVM with multiple kernels.
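
Dividing ttnv or ttpv into categories could be done with a simple binning function like the one below. The bin edges (7, 30 and 90 days) and the labels are purely hypothetical; the slides do not say which categories would be used.

```python
import bisect

# Hypothetical upper bounds in days (inclusive) for the first three bins.
BIN_EDGES = [7, 30, 90]
BIN_LABELS = ["within_week", "within_month", "within_quarter", "longer"]


def bin_days(days):
    """Map a day count such as ttnv to a categorical label."""
    return BIN_LABELS[bisect.bisect_left(BIN_EDGES, days)]


labels = [bin_days(d) for d in (5, 70, 281)]  # values from the example slide
```

With categorical values like these, the prediction task becomes classification instead of regression.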

  24. Plan For Next Phase • Try to find an optimal model for our prediction. • If we get a good model, investigate how to apply it with MulVAL; otherwise, find out why it is not accurate enough.

  25. Thank you!
