Data Mining & Intrusion Detection

Data Mining & Intrusion Detection • Shan Bai • Instructor: Dr. Yingshu Li • CSC 8712, Spring 08

Outline • Intrusion Detection • Data Mining • Data Mining in Intrusion Detection • Reference

What is an intrusion? • An intrusion can be defined as “any set of actions that attempt to compromise the: • integrity, • confidentiality, or • availability of a resource”. • [Figure: Incidents reported to the Computer Emergency Response Team/Coordination Center, 1990 to 2002] • [Figure: Spread of the SQL Slammer worm 10 minutes after its deployment]

Intrusion Examples • DoS • denial of service • R2L • unauthorized access from a remote machine, e.g., guessing a password • U2R • unauthorized access to local superuser (root) privileges, e.g., various “buffer overflow” attacks • Probing • surveillance and other probing, e.g., port scanning • Trojan horse / worm • Address spoofing • a malicious user uses a fake IP address to send malicious packets to a target • Many others…

Intrusion Detection System (IDS) • Intrusion Detection System • A combination of software and hardware that attempts to perform intrusion detection • Raises an alarm when a possible intrusion happens

IDS Categories • Intrusion detection systems are split into two groups: • Anomaly detection systems • Identify malicious traffic based on deviations from established normal network behavior. • Misuse detection systems • Identify intrusions based on known patterns (signatures) of malicious activity.

Anomaly Detection • [Figure: activity measures, with deviations marked as probable intrusion] • Baseline the normal traffic and then look for things that are out of the norm • Relatively high false positive rate: anomalies can just be new normal activities

Misuse Detection • [Figure: activities are pattern-matched against known intrusion patterns to flag intrusions] • Example: if (src_ip == dst_ip) then “land attack” • Look for known indicators • ICMP scans, port scans, connection attempts • CPU, RAM, and I/O utilization; file system activity; modification of system files; permission modifications • Can’t detect new attacks
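The land-attack rule on this slide can be sketched as a tiny signature table. This is only an illustration: the `Packet` fields, the `SIGNATURES` table, and the sample addresses are hypothetical names chosen for the example, not part of any real IDS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical, minimal view of a packet header.
    src_ip: str
    dst_ip: str
    dst_port: int


# Hypothetical signature table: attack name -> predicate over a packet.
SIGNATURES = {
    "land attack": lambda p: p.src_ip == p.dst_ip,
}


def match_signatures(packet):
    """Return the names of all known signatures the packet matches."""
    return [name for name, rule in SIGNATURES.items() if rule(packet)]


spoofed = Packet("10.0.0.5", "10.0.0.5", 80)
normal = Packet("10.0.0.7", "10.0.0.5", 80)
print(match_signatures(spoofed))  # ['land attack']
print(match_signatures(normal))   # []
```

A packet that matches no entry passes silently, which is the "can't detect new attacks" limitation in miniature.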

Goal of Intrusion Detection Systems (IDS): • To detect an intrusion as it happens and be able to respond to it. • False positives: • A false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion. • Too many false positives • User will quit monitoring IDS because of noise. • False negatives: • A false negative is a situation where an intrusion is really happening, but IDS doesn't catch it.
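To make the two error types concrete, the following sketch counts them for a toy alarm log. The `labels` and `alarms` lists are made-up data, and `alarm_rates` is a hypothetical helper name.

```python
def alarm_rates(labels, alarms):
    """labels[i] is True for a real intrusion; alarms[i] is True if the IDS raised an alarm."""
    false_pos = sum(1 for y, a in zip(labels, alarms) if a and not y)
    false_neg = sum(1 for y, a in zip(labels, alarms) if y and not a)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }


# Made-up ground truth and IDS output for six events.
labels = [False, False, False, False, True, True]
alarms = [True, False, False, False, True, False]
rates = alarm_rates(labels, alarms)
print(rates)
```

Here one of the four benign events raised an alarm (a false positive) and one of the two intrusions was missed (a false negative).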

Why do we need Data Mining? • Despite the enormous amount of data, particular events of interest are still quite rare; their frequency ranges from 0.1% to less than 10% • We are drowning in data, but starving for knowledge!

Data Mining vs. KDD • Knowledge Discovery in Databases (KDD): The whole process of finding useful information and patterns in data • Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process • Data mining is the core of the knowledge discovery process

KDD Process • Selection: Obtain data from various sources. • Preprocessing: Cleanse data. • Transformation: Convert to common format. Transform to new format. • Data Mining: Obtain desired results. • Interpretation/Evaluation: Present results to user in meaningful manner

Data Mining: A KDD Process • Data mining: core of the knowledge discovery process • [Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

Typical Data Mining Architecture • [Figure, layers from top to bottom: Graphical user interface • Pattern evaluation • Data mining engine (with Knowledge base) • Database or data warehouse server • Data cleaning, data integration, and filtering • Databases and Data Warehouse]

Network intrusion detection • The number of intrusions on the network is typically a very small fraction of the total network traffic

Why Can Data Mining Help? • Learn from traffic data • Supervised learning: learn precise models from past intrusions • Unsupervised learning: identify suspicious activities • Maintain models on dynamic data • Correlation of suspicious events across network sites • Helps detect sophisticated attacks not identifiable by single site analyses • Analysis of long term data (months/years) • Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)

Intrusion Detection • Traditional intrusion detection system (IDS) tools (e.g. SNORT) are based on signatures of known attacks • Limitations • The signature database has to be manually revised for each new type of discovered intrusion • They cannot detect emerging cyber threats • Substantial latency in deploying newly created signatures across the computer system

Data Mining for Intrusion Detection: Techniques and Applications • Frequent pattern mining • Classification • Clustering • Mining data streams

Frequent pattern mining • Frequent patterns: patterns that occur frequently in a database • Mining frequent patterns: finding regularities • Process of mining frequent patterns for intrusion detection • Phase I: mine a repository of normal frequent itemsets from attack-free data • Phase II: find frequent itemsets in the last n connections and compare the patterns to the normal profile
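The two phases can be sketched as a set difference between itemsets that are frequent in recent traffic and those in the normal profile. This is a minimal illustration that counts itemsets by brute force; the connection attributes, support thresholds, and the `frequent_itemsets` helper are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(connections, min_support, max_size=2):
    """Brute-force count of itemsets up to max_size; keep those whose
    fraction of supporting connections is at least min_support."""
    counts = Counter()
    for conn in connections:
        items = sorted(conn)
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    threshold = min_support * len(connections)
    return {itemset for itemset, c in counts.items() if c >= threshold}


# Phase I: build the normal profile from (made-up) attack-free connections.
attack_free = [
    {"proto=tcp", "dport=80"},
    {"proto=tcp", "dport=80"},
    {"proto=tcp", "dport=443"},
    {"proto=udp", "dport=53"},
]
normal_profile = frequent_itemsets(attack_free, min_support=0.25)

# Phase II: mine the last n connections and keep what the profile has not seen.
last_n = [
    {"proto=tcp", "dport=80"},
    {"proto=tcp", "dport=27374"},
    {"proto=tcp", "dport=27374"},
    {"proto=tcp", "dport=27374"},
]
suspicious = frequent_itemsets(last_n, min_support=0.5) - normal_profile
print(sorted(suspicious))
```

Only the itemsets involving port 27374 survive the comparison, because `proto=tcp` on its own is already part of the normal profile.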

Frequent pattern mining • Apriori: • Any subset of a frequent itemset must also be frequent (an anti-monotone property) • A transaction containing {beer, diaper, nuts} also contains {beer, diaper} • If {beer, diaper, nuts} is frequent, then {beer, diaper} must also be frequent • No superset of any infrequent itemset should be generated or tested • Many item combinations can be pruned
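A compact sketch of level-wise Apriori shows where the pruning happens. It is a teaching version rather than an optimized implementation; the baskets and the `min_count` threshold are made-up.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Level-wise frequent itemset mining with Apriori pruning.
    Returns {itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count the support of every surviving candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into candidates of size k + 1.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop any candidate that has an infrequent (k - 1)-subset.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "nuts"},
    {"milk", "nuts"},
]
result = apriori(baskets, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Because {milk} is infrequent, no itemset containing milk is ever generated as a candidate, which is the pruning the slide describes.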

Sequential Pattern Analysis • Models sequence patterns • (Temporal) order is important in many situations • Time-series databases and sequence databases • Frequent patterns → (frequent) sequential patterns • Sequential patterns for intrusion detection • Capture the signatures for attacks in a series of packets

Sequential Pattern Mining Given a set of sequences, find the complete set of frequent subsequences
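The problem statement can be illustrated with a small prefix-growth search: a pattern is extended only if it is itself frequent, since any frequent subsequence has a frequent prefix. The event sequences, the `max_len` cap, and the function names are assumptions for this sketch, and it treats each sequence element as a single item.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's items occur in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def frequent_subsequences(sequences, min_count, max_len=3):
    """Return {pattern (tuple): support} for all frequent patterns up to max_len."""
    alphabet = sorted({x for s in sequences for x in s})
    frequent = {}
    level = [()]
    for _ in range(max_len):
        next_level = []
        for prefix in level:
            for item in alphabet:
                candidate = prefix + (item,)
                support = sum(is_subsequence(candidate, s) for s in sequences)
                if support >= min_count:
                    frequent[candidate] = support
                    next_level.append(candidate)
        level = next_level
    return frequent


# Made-up packet-flag sequences, one per connection.
sequences = [
    ["syn", "syn", "ack"],
    ["syn", "ack", "fin"],
    ["syn", "fin"],
]
patterns = frequent_subsequences(sequences, min_count=2)
print(patterns)
```

With a minimum support of two sequences, the frequent patterns of length two are ("syn", "ack") and ("syn", "fin"); no length-three pattern is frequent.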

Apriori Property in Sequences

Classification: A Two-Step Process • Model construction: describe a set of predetermined classes • Training dataset: tuples for model construction • Each tuple/sample belongs to a predefined class • Classification rules, decision trees, or math formulae • Model application: classify unseen objects • Estimate accuracy of the model using an independent test set • Acceptable accuracy → apply the model to classify data tuples with unknown class labels
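The two steps can be sketched with a deliberately simple nearest-centroid classifier standing in for rules or trees. The feature pairs, class labels, and the 0.9 accuracy bar are made-up for illustration.

```python
def train_centroids(samples):
    """Step 1, model construction: one mean feature vector per class label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def accuracy(model, samples):
    return sum(classify(model, f) == y for f, y in samples) / len(samples)


# Made-up features: (connections per second, fraction of SYN-only connections).
train = [((2, 0.1), "normal"), ((3, 0.2), "normal"),
         ((50, 0.9), "intrusive"), ((60, 0.8), "intrusive")]
test = [((4, 0.1), "normal"), ((55, 0.95), "intrusive")]

model = train_centroids(train)
acc = accuracy(model, test)  # estimated on an independent test set
# Step 2, model application: only used if the accuracy is acceptable.
prediction = classify(model, (48, 0.7)) if acc >= 0.9 else None
print(acc, prediction)
```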

Classification

Classification: Decision Tree • A node in the tree: a test of some attribute • A branch: a possible value of the attribute • Classification • Start at the root • Test the attribute • Move down the tree branch
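The traversal described above fits in a few lines once the tree is written as nested dictionaries. The tree below is hand-made for illustration (its attributes and labels are hypothetical), not learned from data.

```python
# Internal node: {"attribute": name, "branches": {value: subtree}}.
# Leaf: a class label string.
tree = {
    "attribute": "protocol",
    "branches": {
        "icmp": "probe",
        "tcp": {
            "attribute": "flag",
            "branches": {"SYN": "suspicious", "ACK": "normal"},
        },
        "udp": "normal",
    },
}


def classify(node, record, default="unknown"):
    """Start at the root, test the node's attribute, and follow the matching branch."""
    while isinstance(node, dict):
        value = record.get(node["attribute"])
        if value not in node["branches"]:
            return default  # no branch for this attribute value
        node = node["branches"][value]
    return node


print(classify(tree, {"protocol": "tcp", "flag": "SYN"}))  # suspicious
print(classify(tree, {"protocol": "udp"}))                 # normal
print(classify(tree, {"protocol": "gre"}))                 # unknown
```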

Neural classification: HIDE • “A hierarchical network intrusion detection system using statistical processing and neural network classification” by Zheng et al. • Five major components • Probes collect traffic data • Event preprocessor preprocesses traffic data and feeds the statistical model • Statistical processor maintains a model for normal activities and generates vectors for new events • Neural network classifies the vectors of new events • Post processor generates reports

Clustering • What Is Clustering? • Group data into clusters • – Similar to one another within the same cluster • – Dissimilar to the objects in other clusters • – Unsupervised learning: no predefined classes

Clustering • What Is A Good Clustering? • High intra-class similarity and low inter-class similarity • Depending on the similarity measure • The ability to discover some or all of the hidden patterns

Clustering • Clustering Approaches • Partitioning algorithms • – Partition the objects into k clusters • – Iteratively reallocate objects to improve the clustering • Hierarchical algorithms • – Agglomerative: each object starts as its own cluster; merge clusters to form larger ones • – Divisive: all objects start in one cluster; split it up into smaller clusters

Clustering • K-Means: Example
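Since the slide's figure is not available here, the following is a stand-in example of k-means as a partitioning algorithm: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing changes. The 2-D points and starting centroids are made-up, and the starting centroids are passed in explicitly so the run is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (2, 1)])
print(centroids)
print(clusters)
```

Even from a poor start (both initial centroids sit in the lower-left group), the reallocation steps separate the two groups.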

Mining Data Streams for Intrusion Detection • Maintaining profiles of normal activities • The profiles of normal activities may drift • Identifying novel attacks • Identifying clusters and outliers in traffic data streams • Reduce the future alarm load by writing filtering rules that automatically discard well-understood false positives
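One simple way to maintain a drifting profile of a single traffic measure is an exponentially weighted mean and variance, flagging values far from the current mean as outliers. This is an illustrative sketch, not a method named in the slides; the `alpha`, `threshold`, and `warmup` parameters and the sample stream are assumptions.

```python
class DriftingProfile:
    """Exponentially weighted mean/variance of one traffic measure."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha = alpha          # how quickly the profile follows drift
        self.threshold = threshold  # outlier cutoff, in standard deviations
        self.warmup = warmup        # observations needed before flagging
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def observe(self, x):
        """Return True if x is an outlier under the current profile.
        The profile is updated only with values that are not flagged."""
        if self.mean is None:
            self.mean = float(x)
            self.seen = 1
            return False
        deviation = x - self.mean
        std = self.var ** 0.5
        is_outlier = (
            self.seen >= self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        if not is_outlier:
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.seen += 1
        return is_outlier


# Made-up stream of connections per minute with one burst.
stream = [100, 102, 98, 101, 99, 100, 101, 500, 101]
profile = DriftingProfile()
flags = [profile.observe(x) for x in stream]
print(flags)
```

Skipping the update for flagged values keeps a burst from being absorbed into the normal profile, at the cost of adapting more slowly to a genuine change in level.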

Data Mining for Intrusion Detection • Misuse detection • Predictive models are built from labeled data sets (instances are labeled as “normal” or “intrusive”) • These models can be more sophisticated and precise than manually created signatures • Recent research e.g. JAM (Java Agents for Metalearning)


JAM (Java Agents for Metalearning) • JAM (developed at Columbia University) uses data mining techniques to discover patterns of intrusions. It then applies a meta-learning classifier to learn the signature of attacks. • The association rules algorithm determines relationships between fields in the audit trail records, and the frequent episodes algorithm models sequential patterns of audit events. Features are then extracted from both algorithms and used to compute models of intrusion behavior. • The classifiers build the signature of attacks. Thus, data mining in JAM builds a misuse detection model. • Classifiers in JAM are generated by running a rule learning program on training data of system usage. After training, the resulting classification rules are used to recognize anomalies and detect known intrusions. • The system has been tested with data from Sendmail-based attacks, and with network attacks using TCP dump data.

Data Mining for Intrusion Detection • Anomaly detection • Identifies anomalies as deviations from “normal” behavior • E.g. ADAM: Audit Data Analysis and Mining; MINDS – MINnesota INtrusion Detection System


ADAM: Audit Data Analysis and Mining • Detecting intrusion by data mining • Combination of association rules and classification rules • First, ADAM collects known frequent datasets with an off-line algorithm • Second, ADAM runs an online algorithm • Finds the latest frequent connection records • Compares them with the known mined data • Discards those that seem to be normal • Suspicious ones are forwarded to the classifier • The trained classifier then classifies the suspicious data as one of the following: • Known type of attack • Unknown type of attack • False alarm

ADAM: Detecting Intrusion by Data Mining

ADAM: Audit Data Analysis and Mining • ADAM has two phases in its model • 1st Phase: Train the classifier • Offline process • Takes place only once • Before the main experiment • 2nd Phase: Use the trained classifier • The trained classifier is then used to detect anomalies • Online process

The MINDS Project • MINDS: MINnesota INtrusion Detection System • Learning from Rare Class: building rare class prediction models • Anomaly/outlier detection • Summarization of attacks using association pattern analysis • [Figure: example of discovered association rules: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}]

MINDS - Learning from Rare Class • Problem: Building models for rare network attacks (Mining needle in a haystack) • Standard data mining models are not suitable for rare classes • Models must be able to handle skewed class distributions • Learning from data streams - intrusions are sequences of events
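A short worked example shows why plain accuracy misleads on a skewed class distribution. The counts are made-up (10 attacks among 1000 connections), and the two "detectors" are hypothetical prediction lists, not MINDS output.

```python
def evaluate(labels, predictions, positive="attack"):
    """Return (accuracy, precision, recall) for the rare positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up data: 1% of connections are attacks.
labels = ["attack"] * 10 + ["normal"] * 990

# Detector A never raises an alarm.
always_normal = ["normal"] * 1000
# Detector B catches 8 of the 10 attacks and raises 20 false alarms.
detector_b = ["attack"] * 8 + ["normal"] * 2 + ["attack"] * 20 + ["normal"] * 970

print(evaluate(labels, always_normal))
print(evaluate(labels, detector_b))
```

Detector A reaches 99% accuracy while finding no attacks at all; detector B has lower accuracy (97.8%) but 80% recall, which is why measures focused on the rare class matter here.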

MINDS - Anomaly Detection • Detect novel attacks/intrusions by identifying them as deviations from “normal”, i.e. anomalous behavior • Identify normal behavior • Construct useful set of features • Define similarity function • Use outlier detection algorithm • Nearest neighbor approach • Density based schemes • Unsupervised Support Vector Machines (SVM)
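As a minimal sketch of the nearest neighbor approach, each point can be scored by its distance to its k-th nearest neighbor, with larger scores meaning more anomalous. Euclidean distance is assumed as the similarity function, and the 2-D feature vectors are made-up; this is not the MINDS implementation.

```python
import math


def knn_anomaly_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Made-up feature vectors: four similar connections and one far from the rest.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
scores = knn_anomaly_scores(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print(scores)
print(most_anomalous)
```

This brute-force version compares every pair of points, so it only suits small samples; it is meant to show the scoring idea, not to scale.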

Experimental Evaluation • Publicly available data set • DARPA 1998 Intrusion Detection Evaluation Data Set, prepared and managed by MIT Lincoln Lab, includes a wide variety of intrusions simulated in a military network environment • Real network data from the University of Minnesota • Anomaly detection is applied 4 times a day, with a 10-minute time window • [Figure labels: network; net-flow data using CISCO routers; 10-minute cycle, 2 million connections; data preprocessing; MINDS anomaly detection; anomaly scores; association pattern analysis; open source signature-based network IDS, www.snort.org]

MINDS - Framework for Mining Associations • Build normal profile • Study changes in normal behavior • Create attack summary • Detect misuse behavior • Understand nature of the attack • Example knowledge base rules: R1: TCP, DstPort=1863 → Attack … R100: TCP, DstPort=80 → Normal • [Figure labels: Anomaly Detection System; ranked connections (attack, normal); Discriminating Association Pattern Generator; Knowledge Base (update); MINDS association analysis module]

Discovered Real-life Association Patterns • Rule 1: SrcIP=XXXX, DstPort=80, Protocol=TCP, Flag=SYN, NoPackets: 3, NoBytes: 120…180 (c1=256, c2=1) • Rule 2: SrcIP=XXXX, DstIP=YYYY, DstPort=80, Protocol=TCP, Flag=SYN, NoPackets: 3, NoBytes: 120…180 (c1=177, c2=0) • At first glance, Rule 1 appears to describe a Web scan • Rule 2 indicates an attack on a specific machine • Both rules together indicate that a scan is performed first, followed by an attack on a specific machine identified as vulnerable by the attacker

Discovered Real-life Association Patterns • DstIP=ZZZZ, DstPort=8888, Protocol=TCP (c1=369, c2=0) • DstIP=ZZZZ, DstPort=8888, Protocol=TCP, Flag=SYN (c1=291, c2=0) • This pattern indicates an anomalously high number of TCP connections on port 8888 involving machine ZZZZ • Follow-up analysis of connections covered by the pattern indicates that this could be a machine running a variation of the Kazaa file-sharing protocol • Having an unauthorized application increases the vulnerability of the system

Discovered Real-life Association Patterns… (ctd) • SrcIP=XXXX, DstPort=27374, Protocol=TCP, Flag=SYN, NoPackets=4, NoBytes=189…200 (c1=582, c2=2) • SrcIP=XXXX, DstPort=12345, NoPackets=4, NoBytes=189…200 (c1=580, c2=3) • SrcIP=YYYY, DstPort=27374, Protocol=TCP, Flag=SYN, NoPackets=3, NoBytes=144 (c1=694, c2=3) • …… • This pattern indicates a large number of scans on ports 27374 (which is a signature for the SubSeven Trojan) and 12345 (which is a signature for the NetBus Trojan) • Further analysis showed that no fewer than five machines were scanning for one or both of these ports in any time window
