

Data Mining & Intrusion Detection

Shan Bai

Instructor: Dr. Yingshu Li

CSC 8712, Spring 08

Outline
  • Intrusion Detection
  • Data Mining
  • Data Mining in Intrusion Detection
  • Reference

What is an intrusion?
  • An intrusion can be defined as “any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource”.

Figure: Incidents reported to the Computer Emergency Response Team/Coordination Center (CERT/CC), 1990 to 2002

Figure: Spread of the SQL Slammer worm 10 minutes after its deployment

Intrusion Examples
  • DoS
    • denial of service
  • R2L
    • unauthorized access from a remote machine, e.g., password guessing
  • U2R
    • unauthorized access to local superuser (root) privileges, e.g., various buffer overflow attacks
  • Probing
    • surveillance and other probing, e.g., port scanning
  • Trojan horse / worm
  • Address spoofing
    • a malicious user uses a fake IP address to send malicious packets to a target
  • Many others…
Intrusion Detection System (IDS)
  • Intrusion Detection System
    • A combination of software and hardware that attempts to perform intrusion detection and raises an alarm when a possible intrusion happens.
IDS Categories
  • Intrusion detection systems are split into two groups:
    • Anomaly detection systems
      • Identify malicious traffic based on deviations from established normal network behavior.
    • Misuse detection systems
      • Identify intrusions based on known patterns (signatures) of malicious activity.
Anomaly Detection

Figure labels: activity measures, probable intrusion

  • Baseline the normal traffic and then look for things that are out of the norm
  • Relatively high false positive rate: anomalies can just be new normal activities
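The baselining idea can be illustrated with a minimal sketch. It assumes a single activity measure (connections per minute) and a simple mean and standard deviation baseline with a three-sigma threshold; the measure, the sample values, and the function names are all hypothetical and are not taken from the slides.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize attack-free measurements as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a measurement lying more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical connections-per-minute counts observed during normal operation.
normal_traffic = [98, 102, 100, 97, 103, 101, 99, 100]
baseline = build_baseline(normal_traffic)

print(is_anomalous(104, baseline))  # False: two deviations from the mean
print(is_anomalous(450, baseline))  # True: far outside the baseline
```

A legitimate but new activity (for example, a nightly backup) would also be flagged by such a rule, which is the false positive problem noted above.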

Misuse Detection

Figure labels: intrusion patterns, pattern matching

Example: if (src_ip == dst_ip) then “land attack”

  • Look for known indicators: ICMP scans, port scans, and connection attempts; CPU, RAM, and I/O utilization; file system activity, modification of system files, and permission modifications
  • Can’t detect new attacks
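As a rough sketch of pattern matching against known signatures, the land attack rule from the example can be written as a predicate over a packet record. The dictionary-based packet format and the `SIGNATURES` table are assumptions made for illustration, not the format of any real IDS.

```python
# Each signature is a (name, predicate) pair; the packet layout is hypothetical.
SIGNATURES = [
    ("land attack", lambda pkt: pkt["src_ip"] == pkt["dst_ip"]),
]

def match_signatures(packet, signatures=SIGNATURES):
    """Return the names of all known attack patterns that the packet matches."""
    return [name for name, matches in signatures if matches(packet)]

suspicious = {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.5", "dst_port": 139}
ordinary = {"src_ip": "10.0.0.7", "dst_ip": "10.0.0.5", "dst_port": 80}

print(match_signatures(suspicious))  # ['land attack']
print(match_signatures(ordinary))    # []
```

A packet from an attack with no entry in the table produces an empty list, which is why this approach cannot detect new attacks.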

Goal of Intrusion Detection Systems (IDS)
  • To detect an intrusion as it happens and be able to respond to it.
  • False positives:
    • A false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion.
    • Too many false positives
      • Users will quit monitoring the IDS because of the noise.
  • False negatives:
    • A false negative is a situation where an intrusion is really happening, but the IDS doesn't catch it.
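The two error types can be made concrete with a small calculation. This sketch assumes each monitored event is recorded as a pair (alarm raised, actually an intrusion); the event list is invented for illustration.

```python
def alarm_error_rates(events):
    """events: list of (alarm_raised, is_intrusion) boolean pairs.

    Returns (false positive rate, false negative rate).
    """
    fp = sum(1 for alarm, intrusion in events if alarm and not intrusion)
    fn = sum(1 for alarm, intrusion in events if not alarm and intrusion)
    normal = sum(1 for _, intrusion in events if not intrusion)
    attacks = sum(1 for _, intrusion in events if intrusion)
    fp_rate = fp / normal if normal else 0.0
    fn_rate = fn / attacks if attacks else 0.0
    return fp_rate, fn_rate

# Hypothetical outcomes: four normal events (one raised an alarm)
# and two intrusions (one was missed).
events = [
    (True, True), (True, False), (False, False),
    (False, False), (False, True), (False, False),
]
fp_rate, fn_rate = alarm_error_rates(events)
print(fp_rate, fn_rate)  # 0.25 0.5
```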
Why do we need Data Mining?
  • Despite the enormous amount of data, particular events of interest are still quite rare: their frequency ranges from 0.1% to less than 10%
  • We are drowning in data, but starving for knowledge!
Data Mining vs. KDD
  • Knowledge Discovery in Databases (KDD): The whole process of finding useful information and patterns in data
  • Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process
  • Data mining is the core of the knowledge discovery process
KDD Process
  • Selection: Obtain data from various sources.
  • Preprocessing: Cleanse data.
  • Transformation: Convert to common format. Transform to new format.
  • Data Mining: Obtain desired results.
  • Interpretation/Evaluation: Present results to user in meaningful manner
Data Mining: A KDD Process
  • Data mining: core of knowledge discovery process

Figure labels (top to bottom): Pattern Evaluation, Data Mining, Task-relevant Data, Data Warehouse, Data Cleaning, Data Integration


Typical Data Mining Architecture

Figure labels (top to bottom): Graphical user interface, Pattern evaluation, Data mining engine, Database or data warehouse server, Data cleaning & data integration




Network intrusion detection

Number of intrusions on the network is typically a very small fraction of the total network traffic

Why Can Data Mining Help?
  • Learn from traffic data
    • Supervised learning: learn precise models from past intrusions
    • Unsupervised learning: identify suspicious activities
  • Maintain models on dynamic data
  • Correlation of suspicious events across network sites
    • Helps detect sophisticated attacks not identifiable by single site analyses
  • Analysis of long term data (months/years)
    • Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)
Intrusion Detection
  • Traditional intrusion detection system (IDS) tools (e.g., SNORT) are based on signatures of known attacks
  • Limitations
    • Signature database has to be manually revised for each new type of discovered intrusion
    • They cannot detect emerging cyber threats
    • Substantial latency in deployment of newly created signatures across the computer system
Data Mining for Intrusion Detection: Techniques and Applications
  • Frequent pattern mining
  • Classification
  • Clustering
  • Mining data streams
Frequent pattern mining
  • Patterns that occur frequently in a database
  • Mining Frequent patterns – finding regularities
  • Process of Mining Frequent patterns for intrusion detection
    • Phase I: mine a repository of normal frequent itemsets for attack-free data
    • Phase II: find frequent itemsets in the last n connections and compare the patterns to the normal profile
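The two phases can be sketched as follows. This is a simplified illustration, not the method of any specific system: connections are assumed to be sets of `attribute=value` strings, itemsets are counted by brute force up to size 2, and the sample data and support thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(connections, min_support, max_size=2):
    """Return the itemsets whose support over `connections` is at least min_support."""
    counts = Counter()
    for conn in connections:
        items = sorted(conn)
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    n = len(connections)
    return {itemset for itemset, c in counts.items() if c / n >= min_support}

# Phase I: build the normal profile from attack-free data (hypothetical records).
attack_free = [
    {"proto=tcp", "dst_port=80"},
    {"proto=tcp", "dst_port=80"},
    {"proto=tcp", "dst_port=443"},
    {"proto=udp", "dst_port=53"},
]
normal_profile = frequent_itemsets(attack_free, min_support=0.25)

# Phase II: mine the last n connections and keep what the profile does not explain.
recent = [
    {"proto=tcp", "dst_port=80"},
    {"proto=tcp", "dst_port=27374"},
    {"proto=tcp", "dst_port=27374"},
]
suspicious = frequent_itemsets(recent, min_support=0.5) - normal_profile
print(suspicious)
```

Only the itemsets involving port 27374 survive the comparison, because `proto=tcp` on its own is already part of the normal profile.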
Frequent pattern mining
  • Any subset of a frequent itemset must also be frequent (an anti-monotone property)
    • A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
    • If {beer, diaper, nuts} is frequent, {beer, diaper} must also be frequent
  • No superset of any infrequent itemset should be generated or tested
    • Many item combinations can be pruned
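The pruning rule is the core of Apriori-style level-wise mining. The following is a compact sketch under the assumption that support is given as an absolute count; the baskets are invented to match the beer/diaper example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: discard candidates that have any infrequent (k-1)-subset,
        # so supersets of infrequent itemsets are never counted.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "milk"},
    {"milk", "nuts"},
]
result = apriori(baskets, min_count=2)
print(result[frozenset({"beer", "diaper"})])  # 3
```

With these baskets no pair containing nuts or milk is frequent, so no three-item candidate is ever generated or tested.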


Sequential Pattern Analysis

  • Models sequence patterns
  • (Temporal) order is important in many situations
    • Time-series databases and sequence databases
    • Frequent patterns → (frequent) sequential patterns
  • Sequential patterns for intrusion detection
    • Capture the signatures for attacks in a series of packets
Sequential Pattern Mining

Given a set of sequences, find the complete set of frequent subsequences
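A brute-force sketch of this task is shown below. It enumerates every subsequence up to a fixed length and counts the sequences that contain it in order, with gaps allowed. This is only practical for tiny inputs and is not an efficient mining algorithm; the event names and the `max_len` cap are assumptions.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)

def frequent_subsequences(sequences, min_count, max_len=3):
    """Return {pattern: support} for patterns supported by at least min_count sequences."""
    candidates = set()
    for seq in sequences:
        for length in range(1, max_len + 1):
            candidates.update(combinations(seq, length))
    support = {
        cand: sum(1 for seq in sequences if is_subsequence(cand, seq))
        for cand in candidates
    }
    return {cand: n for cand, n in support.items() if n >= min_count}

# Hypothetical per-session event sequences.
sessions = [
    ["syn", "login_fail", "login_fail", "login_ok"],
    ["syn", "login_fail", "login_ok"],
    ["syn", "get", "fin"],
]
patterns = frequent_subsequences(sessions, min_count=2)
print(patterns[("syn", "login_fail", "login_ok")])  # 2
```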


Classification: A Two-Step Process

  • Model construction: describe a set of predetermined classes
    • Training dataset: tuples for model construction
      • Each tuple/sample belongs to a predefined class
    • Classification rules, decision trees, or math formulae
  • Model application: classify unseen objects
    • Estimate accuracy of the model using an independent test set
    • If the accuracy is acceptable, apply the model to classify data tuples with unknown class labels
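The two steps can be sketched with a deliberately simple model: one classification rule per value of a single attribute, predicting the majority class seen with that value. The attribute names, records, and labels are hypothetical, and a real system would use a stronger learner.

```python
from collections import Counter, defaultdict

def build_model(training, attribute):
    """Step 1 (model construction): learn one rule per value of `attribute`."""
    by_value = defaultdict(Counter)
    for features, label in training:
        by_value[features[attribute]][label] += 1
    rules = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
    default = Counter(label for _, label in training).most_common(1)[0][0]
    return attribute, rules, default

def classify(model, features):
    """Step 2 (model application): look up the rule, else use the majority class."""
    attribute, rules, default = model
    return rules.get(features[attribute], default)

def accuracy(model, test_set):
    """Estimate accuracy on an independent, labeled test set."""
    hits = sum(classify(model, f) == label for f, label in test_set)
    return hits / len(test_set)

training = [
    ({"service": "http", "flag": "SF"}, "normal"),
    ({"service": "http", "flag": "SF"}, "normal"),
    ({"service": "telnet", "flag": "S0"}, "intrusive"),
    ({"service": "telnet", "flag": "S0"}, "intrusive"),
    ({"service": "smtp", "flag": "SF"}, "normal"),
]
test_set = [
    ({"service": "http", "flag": "SF"}, "normal"),
    ({"service": "telnet", "flag": "S0"}, "intrusive"),
    ({"service": "ftp", "flag": "S0"}, "intrusive"),
    ({"service": "smtp", "flag": "SF"}, "normal"),
]
model = build_model(training, "service")
print(accuracy(model, test_set))  # 0.75
```

The unseen `ftp` service falls back to the majority class and is misclassified, which is the kind of error the independent test set is meant to expose before the model is applied to unlabeled tuples.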
Classification: Decision Tree
  • A node in the tree: a test of some attribute
  • A branch: a possible value of the attribute
  • Classification
    • Start at the root
    • Test the attribute
    • Move down the tree branch
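The traversal can be sketched with a nested dictionary standing in for a tree. The tree below is hand-written for illustration (it was not learned from data), and the attribute names and class labels are assumptions.

```python
# Internal node: {"attribute": name, "branches": {value: subtree}}.
# Leaf: a class label string.
TREE = {
    "attribute": "protocol",
    "branches": {
        "icmp": "probe",
        "tcp": {
            "attribute": "flag",
            "branches": {"SF": "normal", "S0": "dos"},
        },
    },
}

def classify(tree, record, default="unknown"):
    """Start at the root, test the node's attribute, and follow the matching branch."""
    node = tree
    while isinstance(node, dict):
        value = record.get(node["attribute"])
        node = node["branches"].get(value, default)
    return node

print(classify(TREE, {"protocol": "tcp", "flag": "S0"}))  # dos
print(classify(TREE, {"protocol": "udp"}))                # unknown
```

The loop repeats the test-and-descend step until it reaches a leaf, whose label is the predicted class.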

Neural classification: HIDE

  • “A hierarchical network intrusion detection system using statistical processing and neural network classification” by Zheng et al.
  • Five major components
    • Probes collect traffic data
    • Event preprocessor preprocesses traffic data and feeds the statistical model
    • Statistical processor maintains a model for normal activities and generates vectors for new events
    • Neural network classifies the vectors of new events
    • Post processor generates reports
What Is Clustering?
  • Group data into clusters
    • Similar to one another within the same cluster
    • Dissimilar to the objects in other clusters
    • Unsupervised learning: no predefined classes
What Is A Good Clustering?
  • High intra-class similarity and low inter-class similarity
    • Depending on the similarity measure
  • The ability to discover some or all of the hidden patterns
Clustering Approaches
  • Partitioning algorithms
    • Partition the objects into k clusters
    • Iteratively reallocate objects to improve the clustering
  • Hierarchy algorithms
    • Agglomerative: each object is a cluster, merge clusters to form larger ones
    • Divisive: all objects are in a cluster, split it up into smaller clusters
K-Means: Example
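The original slide's worked example did not survive in the transcript. As a stand-in, here is a minimal sketch of k-means as a partitioning algorithm on 2-D points, assuming Euclidean distance and caller-supplied starting centroids; the points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Iteratively reallocate 2-D points to the nearest centroid and recompute centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```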
Mining Data Streams for Intrusion Detection
  • Maintaining profiles of normal activities
    • The profiles of normal activities may drift
  • Identifying novel attacks
    • Identifying clusters and outliers in traffic data streams
  • Reduce the future alarm load by writing filtering rules that automatically discard well-understood false positives
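One way to maintain a profile that is allowed to drift is an exponentially weighted running mean and variance, where older observations gradually lose influence. The sketch below is an assumption about how such a profile could be kept, not a technique named in the slides; the class name, the `alpha`, `threshold`, and `warmup` parameters, and the stream values are all hypothetical.

```python
class DriftingProfile:
    """Running mean and variance of one traffic measure with exponential forgetting."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha = alpha          # weight given to each new observation
        self.threshold = threshold  # outlier cutoff in standard deviations
        self.warmup = warmup        # observations accepted before any flagging
        self.count = 0
        self.mean = None
        self.var = 0.0

    def observe(self, value):
        """Return True if `value` is an outlier; otherwise fold it into the profile."""
        self.count += 1
        if self.mean is None:
            self.mean = float(value)
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        if self.count > self.warmup and std > 0 and abs(deviation) > self.threshold * std:
            return True  # outliers are reported and not absorbed into the profile
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return False

profile = DriftingProfile()
flags = [profile.observe(v) for v in [100, 102, 98, 101, 99, 100, 300]]
print(flags[-1])  # True: the final value is far from the learned profile
```

Refusing to absorb outliers keeps an attack from polluting the profile, at the cost of adapting slowly when normal behavior shifts abruptly.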
Data Mining for Intrusion Detection
  • Misuse detection
    • Predictive models are built from labeled data sets (instances are labeled as “normal” or “intrusive”)
    • These models can be more sophisticated and precise than manually created signatures
    • Recent research, e.g., JAM (Java Agents for Metalearning)

JAM (Java Agents for Metalearning)
  • JAM (developed at Columbia University) uses data mining techniques to discover patterns of intrusions. It then applies a meta-learning classifier to learn the signatures of attacks.
  • The association rules algorithm determines relationships between fields in the audit trail records, and the frequent episodes algorithm models sequential patterns of audit events. Features are then extracted from the output of both algorithms and used to compute models of intrusion behavior.
  • The classifiers build the signatures of attacks. Thus, data mining in JAM builds a misuse detection model.
  • Classifiers in JAM are generated by running a rule learning program on training data of system usage. After training, the resulting classification rules are used to recognize anomalies and detect known intrusions.
  • The system has been tested with data from Sendmail-based attacks, and with network attacks using tcpdump data.
Data Mining for Intrusion Detection
  • Anomaly detection
    • Identifies anomalies as deviations from “normal” behavior
    • E.g. ADAM: Audit Data Analysis and Mining; MINDS – MINnesota INtrusion Detection System

ADAM: Audit Data Analysis and Mining

Detecting Intrusion by Data Mining

Combination of Association Rule and Classification Rule

  • First, ADAM collects known frequent datasets using an off-line algorithm
  • Second, ADAM runs an online algorithm
    • Finds the latest frequent connection records
    • Compares them with the known mined data
    • Discards those that seem to be normal
    • Forwards the suspicious ones to the classifier
    • The trained classifier then classifies the suspicious data as one of the following:
      • Known type of attack
      • Unknown type of attack
      • False alarm
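The online stage described above can be outlined as a filter followed by a classifier. This is only an illustrative sketch: the itemset representation is assumed, and `toy_classifier` is a hypothetical port lookup standing in for ADAM's trained classifier, which the slides do not describe in detail.

```python
def adam_online_stage(recent_itemsets, normal_profile, classifier):
    """Discard itemsets explained by the normal profile and label the rest."""
    suspicious = [s for s in recent_itemsets if s not in normal_profile]
    return {s: classifier(s) for s in suspicious}

# Hypothetical stand-in for the trained classifier: a lookup on destination port.
KNOWN_ATTACK_PORTS = {"dst_port=27374"}
KNOWN_BENIGN_PORTS = {"dst_port=8080"}

def toy_classifier(itemset):
    if itemset & KNOWN_ATTACK_PORTS:
        return "known type of attack"
    if itemset & KNOWN_BENIGN_PORTS:
        return "false alarm"
    return "unknown type of attack"

normal_profile = {frozenset({"proto=tcp", "dst_port=80"})}
recent_itemsets = [
    frozenset({"proto=tcp", "dst_port=80"}),
    frozenset({"proto=tcp", "dst_port=27374"}),
    frozenset({"proto=tcp", "dst_port=8080"}),
    frozenset({"proto=tcp", "dst_port=31337"}),
]
labels = adam_online_stage(recent_itemsets, normal_profile, toy_classifier)
for itemset, label in labels.items():
    print(sorted(itemset), label)
```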
ADAM: Audit Data Analysis and Mining
  • ADAM has two phases in its model
  • 1st Phase: Train the classifier
    • Offline process
    • Takes place only once
    • Before the main experiment
  • 2nd Phase: Using the trained classifier
    • Trained classifier is then used to detect anomalies
    • Online process
The MINDS Project
  • MINDS – MINnesota INtrusion Detection System
    • Learning from Rare Class – Building rare class prediction models
    • Anomaly/outlier detection
    • Summarization of attacks using association pattern analysis

Rules Discovered:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}
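Rules of this form are usually ranked by support and confidence. The sketch below computes both for the two rules shown; the transaction list is an assumed toy data set, since the transactions behind the slide's rules are not included in the transcript.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence) of the association rule lhs --> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence

# Hypothetical market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
milk_coke = rule_metrics(transactions, {"Milk"}, {"Coke"})
diaper_milk_beer = rule_metrics(transactions, {"Diaper", "Milk"}, {"Beer"})
print(milk_coke)  # (0.4, 0.5)
print(diaper_milk_beer)
```

In the intrusion detection setting the items would be connection attributes such as `DstPort=80` rather than grocery products.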

MINDS - Learning from Rare Class
  • Problem: Building models for rare network attacks (Mining needle in a haystack)
    • Standard data mining models are not suitable for rare classes
      • Models must be able to handle skewed class distributions
    • Learning from data streams - intrusions are sequences of events
MINDS - Anomaly Detection
  • Detect novel attacks/intrusions by identifying them as deviations from “normal”, i.e. anomalous behavior
    • Identify normal behavior
    • Construct useful set of features
    • Define similarity function
    • Use outlier detection algorithm
      • Nearest neighbor approach
      • Density based schemes
      • Unsupervised Support Vector Machines (SVM)
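The nearest neighbor approach can be sketched by scoring each connection with the distance to its k-th nearest neighbor. The two features (packets, bytes), the Euclidean similarity function, and the sample values are assumptions for illustration; this is not the scoring used by MINDS itself.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical (number of packets, number of bytes) per connection.
connections = [(3, 120), (3, 130), (4, 125), (3, 128), (40, 9000)]
scores = knn_outlier_scores(connections, k=2)
most_anomalous = scores.index(max(scores))
print(connections[most_anomalous])  # (40, 9000)
```

Because the features have very different scales, a practical similarity function would normalize them first; the sketch skips that step to stay short.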
Experimental Evaluation
  • Publicly available data set
    • DARPA 1998 Intrusion Detection Evaluation Data Set, prepared and managed by MIT Lincoln Lab, which includes a wide variety of intrusions simulated in a military network environment
  • Real network data from
    • University of Minnesota
  • Anomaly detection is applied
    • 4 times a day
    • 10-minute time window

Figure labels (MINDS system): net-flow data using Cisco routers; 10-minute cycle; 2 million connections; data preprocessing; MINDS anomaly detection; anomaly scores; association pattern analysis; open source signature-based network IDS

MINDS - Framework for Mining Associations

Figure labels (MINDS association analysis module): Anomaly Detection System; ranked connections; Discriminating Association Pattern Generator; Knowledge Base

  • Build normal profile
  • Study changes in normal behavior
  • Create attack summary
  • Detect misuse behavior
  • Understand nature of the attack

R1: TCP, DstPort=1863 → Attack

R100: TCP, DstPort=80 → Normal

Discovered Real-life Association Patterns

Rule 1: SrcIP=XXXX, DstPort=80, Protocol=TCP, Flag=SYN, NoPackets: 3, NoBytes:120…180 (c1=256, c2 = 1)

Rule 2: SrcIP=XXXX, DstIP=YYYY, DstPort=80, Protocol=TCP,Flag=SYN, NoPackets: 3, NoBytes: 120…180 (c1=177, c2 = 0)

  • At first glance, Rule 1 appears to describe a Web scan
  • Rule 2 indicates an attack on a specific machine
  • Both rules together indicate that a scan is performed first, followed by an attack on a specific machine identified as vulnerable by the attacker
Discovered Real-life Association Patterns

DstIP=ZZZZ, DstPort=8888, Protocol=TCP (c1=369, c2=0)

DstIP=ZZZZ, DstPort=8888, Protocol=TCP, Flag=SYN (c1=291, c2=0)

  • This pattern indicates an anomalously high number of TCP connections on port 8888 involving machine ZZZZ
  • Follow-up analysis of connections covered by the pattern indicates that this could be a machine running a variation of the Kazaa file-sharing protocol
  • Having an unauthorized application increases the vulnerability of the system

Discovered Real-life Association Patterns…(ctd)

SrcIP=XXXX, DstPort=27374, Protocol=TCP, Flag=SYN, NoPackets=4, NoBytes=189…200 (c1=582, c2=2)

SrcIP=XXXX, DstPort=12345, NoPackets=4, NoBytes=189…200 (c1=580, c2=3)

SrcIP=YYYY, DstPort=27374, Protocol=TCP, Flag=SYN, NoPackets=3, NoBytes=144 (c1=694, c2=3)


  • This pattern indicates a large number of scans on ports 27374 (which is a signature for the SubSeven worm) and 12345 (which is a signature for the NetBus worm)
  • Further analysis showed that no fewer than five machines were scanning for one or both of these ports in any time window
Discovered Real-life Association Patterns…(ctd)

DstPort=6667, Protocol=TCP (c1=254, c2=1)

  • This pattern indicates an unusually large number of connections on port 6667 detected by the anomaly detector
  • Port 6667 is where IRC (Internet Relay Chat) is typically run
  • Further analysis reveals that there are many small packets from/to various IRC servers around the world
  • Although IRC traffic is not unusual, the fact that it is flagged as anomalous is interesting
    • This might indicate that the IRC server has been taken down (by a DoS attack, for example) or that it is a rogue IRC server (it could be involved in some hacking activity)
Discovered Real-life Association Patterns…(ctd)

DstPort=1863, Protocol=TCP, Flag=0, NoPackets=1, NoBytes<139 (c1=498, c2=6)

DstPort=1863, Protocol=TCP, Flag=0 (c1=587, c2=6)

DstPort=1863, Protocol=TCP (c1=606, c2=8)

  • This pattern indicates a large number of anomalous TCP connections on port 1863
  • Further analysis reveals that the remote IP block is owned by Hotmail
  • Flag=0 is unusual for TCP traffic
MINDS Research
  • Defining normal behavior
  • Feature extraction
  • Similarity functions
  • Outlier detection
  • Result summarization
  • Detection of attacks originating from multiple sites

Figure labels (areas addressed): outsider attack (network intrusion); insider attack (policy violation); worm/virus detection after infection

MINDS: Conclusion
  • Data mining based algorithms are capable of detecting intrusions that cannot be detected by state-of-the-art signature based methods
    • SNORT has static knowledge manually updated by human analysts
    • MINDS anomaly detection algorithms are adaptive in nature
    • MINDS anomaly detection algorithms can also be effective in detecting anomalous behavior originating from a compromised or infected machine
IDS Using Both Misuse and Anomaly Detection: RIDS-100
  • RIDS (Rising Intrusion Detection System) is provided by Rising Tech, a leader in antivirus and content security software and services in China.
  • The company is a leading provider of client, gateway, and server security solutions for virus protection, firewall and intrusion detection technologies, and security services to enterprises and service providers around China.
  • RIDS makes use of both intrusion detection techniques: misuse detection and anomaly detection.
  • A distance-based outlier detection algorithm is used to detect deviant behavior in the collected network data.
  • For misuse detection, it has a very large set of collected data patterns that can be matched against scanned network data.
  • This large set of data patterns is scanned using a decision tree classification algorithm.
A cooperative anomaly and intrusion detection system (CAIDS)
  • Built with a network-based intrusion detection system (NIDS) and an anomaly detection system (ADS) operating interactively through a signature generator.
A cooperative anomaly and intrusion detection system (CAIDS)
  • A frequent episode rule (FER) is generated out of a collection of frequent episodes. The FER is defined over episode sequences with multiple connection events.
  • For example, we envision a window where we observe a 3-event sequence: E, D, and F. An FER is generated as: E → D, F
  • with confidence level freq(a ∪ b) / freq(a) = 0.8, where a represents the event E on the LHS and b corresponds to the two events D and F on the RHS of the rule.
  • If a occurs with 5% frequency and the joint event of a and b occurs with 4% frequency, there is a (0.04/0.05) = 80% chance that D and F will follow E in the same window.
A cooperative anomaly and intrusion detection system (CAIDS)
  • In practice, the event E could be an authentication service characterized by two attributes (service = authentication, flag = SF).
  • The events D and F may be two sequential smtp requests, each denoted by (service = smtp).
  • Thus we can derive an FER with a confidence level of c = 80% that two smtp services will follow the authentication service within a window of w = 2 sec. The three joint traffic events account for a support level of s = 10% of all the network connections being evaluated. This FER is formally stated as follows:
  • (service = authentication) → (service = smtp), (service = smtp) (0.8, 0.1, 2 sec)   (1)
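The confidence of such a rule can be estimated from a time-stamped event stream by counting how often the authentication event is followed by two smtp events within the window. The sketch below assumes a single-event LHS, events given as (timestamp in seconds, service) pairs sorted by time, and an invented stream; support is left out because it depends on how the evaluated connections are counted.

```python
def episode_confidence(events, lhs, rhs, window):
    """Return (confidence, matches) for the rule lhs -> rhs within `window` seconds.

    events: list of (timestamp_seconds, service) pairs sorted by time.
    lhs: a single service name; rhs: list of services that must follow in order.
    """
    starts = [i for i, (_, service) in enumerate(events) if service == lhs]
    matches = 0
    for i in starts:
        t0 = events[i][0]
        remaining = list(rhs)
        for t, service in events[i + 1:]:
            if t - t0 > window:
                break  # outside the time window
            if remaining and service == remaining[0]:
                remaining.pop(0)
        if not remaining:
            matches += 1
    confidence = matches / len(starts) if starts else 0.0
    return confidence, matches

# Hypothetical stream: five authentication events, four followed by two smtp
# events within 2 seconds.
events = [
    (0.0, "authentication"), (0.5, "smtp"), (1.0, "smtp"),
    (10.0, "authentication"), (10.4, "smtp"), (11.9, "smtp"),
    (20.0, "authentication"), (20.3, "smtp"), (23.0, "smtp"),
    (30.0, "authentication"), (30.1, "smtp"), (30.2, "http"), (30.9, "smtp"),
    (40.0, "authentication"), (41.0, "smtp"), (42.0, "smtp"),
]
confidence, matches = episode_confidence(events, "authentication", ["smtp", "smtp"], window=2.0)
print(confidence)  # 0.8
```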
A cooperative anomaly and intrusion detection system (CAIDS)
  • An association rule is aimed at finding interesting intra-relationships inside a single connection record.
  • In general, an FER is specified by the following expression:
  • L1, L2, …, Ln → R1, …, Rm (c, s, window)   (2)
  • Li (1 ≤ i ≤ n) and Rj (1 ≤ j ≤ m) are ordered traffic connection events.
  • We call L1, L2, …, Ln the LHS and R1, …, Rm the RHS of the episode rule.
A cooperative anomaly and intrusion detection system (CAIDS)

Figure: Architecture of the CAIDS simulator built with a 2,000-signature Snort and an anomaly detection subsystem (ADS) with 60 FERs after 2 weeks of rule training over the Lincoln Lab IDS evaluation dataset

  • In this report we have studied the basic concepts and some classic system models in this area, such as ADAM and MINDS.
  • We summarized these system models, their technologies, and their validation methods.
  • We hope to give an overview of current developments in this area and of how data mining is evolving into the field of network intrusion detection.
  • DARPA 1998 data set
    • A cleansed set in KDDCup’99
    • DARPA 1999 data set is also available
  • Daniel Barbara, Julia Couto, Sushil Jajodia, Leonard Popyack, Ningning Wu, “ADAM: Detecting Intrusions by Data Mining”, Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001
  • J. Zhang and M. Zulkernine, “A Hybrid Network Intrusion Detection Technique Using Random Forests”, Proceedings of the First International Conference on Availability, Reliability and Security, April 20-22, 2006
  • W. Lee et al., “A data mining framework for building intrusion detection models”, Information and System Security, Vol. 3, No. 4, 2000
  • L. Ertoz et al., “MINDS - Minnesota Intrusion Detection System”, Next Generation Data Mining, Chapter 3, 2004
  • C.-T. Lu, A.P. Boedihardjo, P. Manalwar, “Exploiting efficient data mining techniques to enhance intrusion detection systems”, IEEE International Conference on Information Reuse and Integration (IRI-2005), 15-17 Aug. 2005, pp. 512-517
  • Sal Stolfo, Andreas Prodromidis, Shelley Tselepis, Wenke Lee, Dave Fan, and Phil Chan (Honorable mention (runner-up) for Best Paper Award in Applied Research Category) In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD '97), Newport Beach, CA, August 1997