In this guest lecture for Dr. Bhavani Thuraisingham's course at The University of Texas at Dallas, Mamoun Awad discusses data mining techniques for intrusion detection. The lecture covers the main types of intrusions, the identification of malicious traffic with anomaly and misuse detection systems, and the problem of false positives and false negatives in an IDS. It then presents an approach that combines support vector machines (SVM) with the Dynamically Growing Self-Organizing Tree (DGSOT) clustering algorithm.
Data and Applications Security Developments and Directions Dr. Bhavani Thuraisingham The University of Texas at Dallas Lecture #20 Guest Lecture Data Mining for Intrusion Detection By Mamoun Awad March 24, 2005
Data Mining & Intrusion Detection Systems Mamoun Awad Dept. of Computer Science University of Texas at Dallas
Outline • Intrusion Detection • Data Mining • Approach • Data set & Results
What is an intrusion? • An intrusion can be defined as “any set of actions that attempt to compromise the: • integrity, • confidentiality, or • availability of a resource”.
Intrusion Examples • Virus • Buffer-overflows • 2000 Outlook Express vulnerability. • Denial of Service (DOS) • explicit attempt by attackers to prevent legitimate users of a service from using that service. • Address spoofing • a malicious user uses a fake IP address to send malicious packets to a target. • Many others • R2L, U2R, Probe, …
Intrusion Detection System (IDS) • An Intrusion Detection System (IDS) inspects all inbound and outbound network activity and identifies suspicious patterns that may indicate a network or system attack from someone attempting to break into or compromise a system.
Attack Types • Host-based attacks • Gain access to privileged services or resources on a machine. • Network-based attacks • Make it difficult for legitimate users to access various network services
IDS Categories • Intrusion detection systems are split into two groups: • Anomaly detection systems • Identify malicious traffic based on deviations from established normal network behavior. • Misuse detection systems • Identify intrusions based on known patterns (signatures) of malicious activity.
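The two categories can be contrasted with a minimal sketch. Everything in it is an assumption made for illustration and not part of the lecture: the connection-rate baseline, the three-standard-deviation threshold, and the byte-pattern signatures are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical baseline: connections per minute seen during normal operation.
normal_rates = [20, 22, 19, 21, 23, 20, 18, 22]

# Hypothetical signatures: byte patterns associated with known attacks.
signatures = [b"/etc/passwd", b"\x90\x90\x90\x90"]


def is_anomalous(rate, baseline, k=3.0):
    """Anomaly detection: flag a value more than k standard deviations
    away from the mean of the normal baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(rate - mu) > k * sigma


def matches_signature(payload, known):
    """Misuse detection: flag a payload that contains a known pattern."""
    return any(sig in payload for sig in known)


print(is_anomalous(21, normal_rates))    # False: close to the baseline
print(is_anomalous(200, normal_rates))   # True: far from the baseline
print(matches_signature(b"GET /../../etc/passwd HTTP/1.0", signatures))  # True
print(matches_signature(b"GET /index.html HTTP/1.0", signatures))        # False
```

The anomaly check can flag attacks it has never seen but also flags unusual legitimate traffic, while the signature check only catches patterns already in its list.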
Problem Statement • Goal of Intrusion Detection Systems (IDS): • To detect an intrusion as it happens and be able to respond to it. • False positives: • A false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion. • Too many false positives: • Users will stop monitoring the IDS because of the noise. • False negatives: • A false negative is a situation where an intrusion is really happening, but the IDS does not catch it.
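False positive and false negative rates can be computed from labeled events as sketched below. The slide does not give formulas, so this assumes the conventional definitions: false positives divided by all normal events, and false negatives divided by all intrusions. The event labels are made up.

```python
def error_rates(actual, predicted):
    """Return (false_positive_rate, false_negative_rate).

    actual and predicted are equal-length sequences of booleans,
    where True means "intrusion".
    """
    pairs = list(zip(actual, predicted))
    fp = sum(1 for a, p in pairs if not a and p)   # alarm, but no intrusion
    fn = sum(1 for a, p in pairs if a and not p)   # intrusion, but no alarm
    normals = sum(1 for a in actual if not a)
    attacks = sum(1 for a in actual if a)
    fp_rate = fp / normals if normals else 0.0
    fn_rate = fn / attacks if attacks else 0.0
    return fp_rate, fn_rate


# Hypothetical ground truth and IDS output for ten events.
actual    = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

fp_rate, fn_rate = error_rates(actual, predicted)
print(f"FP rate: {fp_rate:.3f}, FN rate: {fn_rate:.3f}")  # FP rate: 0.167, FN rate: 0.250
```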
Problem Statement • Misuse Detection
Firewall Rules • Columns: Order | Protocol | Source IP | Source Port | Destination IP | Destination Port | Action
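A rule table with these columns (order, protocol, source IP and port, destination IP and port, action) is commonly evaluated first-match-wins. The sketch below assumes that evaluation order, a `*` wildcard, and a default of `deny`; none of these details is stated on the slide, and the rules and addresses are hypothetical.

```python
from dataclasses import dataclass

ANY = "*"  # wildcard convention assumed for this sketch


@dataclass(frozen=True)
class Rule:
    order: int
    protocol: str
    src_ip: str
    src_port: str
    dst_ip: str
    dst_port: str
    action: str


def field_matches(rule_value, packet_value):
    """A rule field matches if it is the wildcard or equals the packet field."""
    return rule_value == ANY or rule_value == packet_value


def decide(rules, packet, default="deny"):
    """Return the action of the lowest-order rule that matches the packet.

    packet is a tuple: (protocol, src_ip, src_port, dst_ip, dst_port).
    """
    for rule in sorted(rules, key=lambda r: r.order):
        fields = (rule.protocol, rule.src_ip, rule.src_port,
                  rule.dst_ip, rule.dst_port)
        if all(field_matches(r, p) for r, p in zip(fields, packet)):
            return rule.action
    return default


# Hypothetical rule table (addresses come from documentation ranges).
rules = [
    Rule(1, "tcp", "198.51.100.7", ANY, ANY, ANY, "deny"),   # block one host
    Rule(2, "tcp", ANY, ANY, "192.0.2.10", "80", "allow"),   # allow web traffic
]

print(decide(rules, ("tcp", "203.0.113.5", "40000", "192.0.2.10", "80")))   # allow
print(decide(rules, ("tcp", "198.51.100.7", "40000", "192.0.2.10", "80")))  # deny
print(decide(rules, ("udp", "203.0.113.5", "40000", "192.0.2.10", "53")))   # deny
```

The second packet would also match rule 2, but rule 1 has the lower order and decides first, which is why rule ordering matters in this kind of misuse detection.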
Problem Statement • Anomaly Detection
Our Approach • [Diagram labels: Training Data, SVM Training, Testing Data, SVM Testing, Class, Problem???]
Our Approach • [Diagram labels: Training Data, Hierarchical Clustering (DGSOT), SVM Training, Testing Data, SVM Testing, Class]
DGSOT • Learning Process • Winner Node • Update the Tree • Stopping Criteria
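The slide names the steps of the learning process without detail. The sketch below shows one generic self-organizing scheme for those steps, assuming Euclidean distance for winner selection, a fixed learning rate, and a stopping test on the relative change of the average error. It works on a flat list of nodes and omits the tree growth that DGSOT adds; all names and parameter values are hypothetical and not taken from the lecture.

```python
import math


def winner(nodes, x):
    """Winner node: index of the reference vector closest to x."""
    return min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))


def update(nodes, i, x, rate):
    """Update step: move the winner's reference vector toward x."""
    nodes[i] = [w + rate * (xi - w) for w, xi in zip(nodes[i], x)]


def learn(nodes, data, rate=0.5, tol=1e-3, max_epochs=100):
    """Repeat winner selection and update until the average error
    changes by less than tol (relative) between two epochs."""
    prev = None
    for epoch in range(1, max_epochs + 1):
        total = 0.0
        for x in data:
            i = winner(nodes, x)
            total += math.dist(nodes[i], x)
            update(nodes, i, x, rate)
        avg = total / len(data)
        if prev is not None and abs(prev - avg) <= tol * max(prev, 1e-12):
            return epoch          # stopping criterion met
        prev = avg
    return max_epochs


# Hypothetical data: two well-separated groups of points.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
nodes = [[1.0, 1.0], [9.0, 9.0]]

epochs = learn(nodes, data)
print(epochs, winner(nodes, [0, 0]), winner(nodes, [10, 10]))
```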
Support Vector Machine • Support Vector Machines (SVM) • One of the most powerful classification techniques • Finds a hyperplane that separates the classes • Based on the idea of mapping data points to a high-dimensional feature space where a separating hyperplane can be found
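Once a separating hyperplane is known, classifying a point only requires checking which side of it the point falls on. The sketch below shows that decision step for a linear hyperplane w·x + b = 0; the weights and bias are hypothetical values chosen by hand, and the training step that an SVM uses to find them is not shown.

```python
def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane in two dimensions: x1 + x2 = 3.
w, b = [1.0, 1.0], -3.0

print(classify(w, b, [2.0, 2.0]))   # 1
print(classify(w, b, [0.5, 1.0]))   # -1
```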
Feature Mapping • Feature mapping from a two-dimensional input space to a two-dimensional feature space.
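One possible two-dimensional to two-dimensional mapping is (x1, x2) → (x1², x2²). This particular map and the sample points are assumptions chosen for illustration; the slide does not say which mapping its figure uses. In the input space the inner class cannot be separated from the outer class by a line, because the inner point (0, 0) lies midway between the outer points (2, 0) and (-2, 0). After the mapping, a single line separates them.

```python
def phi(x):
    """Feature map from the 2-D input space to a 2-D feature space."""
    x1, x2 = x
    return (x1 * x1, x2 * x2)


# Hypothetical data: one class near the origin, the other around it.
inner = [(0.0, 0.0), (0.5, 0.0), (0.0, -0.8), (-0.6, 0.6)]   # class -1
outer = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]   # class +1

# Hand-picked separating line in feature space: z1 + z2 = 2.5.
w, b = (1.0, 1.0), -2.5


def side(z):
    """Side of the line on which the feature-space point z falls."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1


inner_sides = [side(phi(x)) for x in inner]
outer_sides = [side(phi(x)) for x in outer]
print(inner_sides)   # [-1, -1, -1, -1]
print(outer_sides)   # [1, 1, 1, 1]
```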
SVM Limitations • Long training time limits its use. • Clustering has a positive impact on the training of an SVM: each cluster is represented by only one reference • Reduces training time • May degrade generalization, because fewer points are used for training.
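The idea of representing each cluster by a single reference can be sketched as follows. This assumes the cluster assignments are already known, uses the centroid as the reference, and gives it the majority label of the cluster; these choices are illustrative assumptions and may differ from how the lecture's method picks references.

```python
from collections import Counter


def representatives(points, labels, cluster_ids):
    """Replace each cluster by one (centroid, majority_label) pair."""
    groups = {}
    for p, y, c in zip(points, labels, cluster_ids):
        groups.setdefault(c, []).append((p, y))

    reduced = []
    for c in sorted(groups):
        members = groups[c]
        dim = len(members[0][0])
        centroid = [sum(p[d] for p, _ in members) / len(members)
                    for d in range(dim)]
        label = Counter(y for _, y in members).most_common(1)[0][0]
        reduced.append((centroid, label))
    return reduced


# Hypothetical training points with precomputed cluster assignments.
points = [[0, 0], [0, 2], [10, 10], [12, 10], [11, 13]]
labels = ["normal", "normal", "attack", "attack", "normal"]
cluster_ids = [0, 0, 1, 1, 1]

reduced = representatives(points, labels, cluster_ids)
print(reduced)   # [([0.0, 1.0], 'normal'), ([11.0, 11.0], 'attack')]
```

Five training points shrink to two, which is where the training-time saving comes from. The "normal" point absorbed into the second cluster is lost, which illustrates how generalization can suffer.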
Training set • 1998 DARPA data that originated from the MIT Lincoln Lab • http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html • Size: 1,012,477 data points
Data set / Attack Types • DOS • denial-of-service • R2L • unauthorized access from a remote machine, e.g. guessing password; • U2R • unauthorized access to local super user (root) privileges, e.g., various ``buffer overflow'' attacks; • Probing • surveillance and other probing, e.g., port scanning.
Results
Methods | Weighted Accuracy | Average Accuracy | Average Training Time | Average FP rate | Average FN rate
Random Selection | 62.5% | 62.61% | 0.049 hours | 22.40% | 37.38%
Pure SVM | 62.74% | 62.75% | 0.51 hours | 30.75% | 37.24%
SVM + Rocchio Bundling | 63.09% | 63.11% | 0.93 hours | 30.98% | 36.89%
SVM + DGSOT | 63.34% | 63.36% | 0.26 hours | 51.56% | 36.64%
Relevant and Important Publications • “A Dynamically Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering Gene Expression Profiles,” Feng Luo, Latifur Khan, Farokh Bastani, I-Ling Yen, and J. Zhou, Bioinformatics, Oxford University Press, UK, 20(16), November 2004, pp. 2605-2617. • “Automatic Image Annotation and Retrieval using Weighted Feature Selection,” Lei Wang and Latifur Khan, to appear in a special issue of Multimedia Tools and Applications, Kluwer Publishers. • “Hierarchical Clustering for Complex Data,” Latifur Khan and Feng Luo, to appear in International Journal on Artificial Intelligence Tools, World Scientific Publishers. • “A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering,” Latifur Khan, Mamoun Awad, and Bhavani Thuraisingham, to appear in VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing.
Relevant and Important Publications • R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das, “The 1999 DARPA off-line intrusion detection evaluation,” Computer Networks, 34, pp. 579-595, 2000.