1. Data Mining for Analysis of Rare Events: A Case of Computer Security and Other Applications Jozef Zurada
Department of Computer Information Systems
College of Business
University of Louisville
Louisville, Kentucky
USA
email: jmzura01@louisville.edu
2. Outline Introduction to Knowledge Discovery in Databases and Data Mining
Data Mining Tools, Techniques, and Tasks
High-dimensional data
Feature and values reduction, and sampling
Rare Events
What are they?
What are the application domains exhibiting these characteristics?
What are the limitations of standard data mining techniques?
Major Techniques for Detecting Rare Events
Supervised (Classification) techniques - Predictive Modeling
Tree based approaches, Neural networks
Unsupervised Techniques
Anomaly/Outlier Detection, Clustering
Other Data Mining Techniques – Association Rules
Case Study: Intrusion Detection Systems
What are the general types/categories of cyber attacks
Data Mining architecture for Intrusion Detection Systems
Conclusion and Questions
3. What is KDD? Finding/extracting interesting information from data stored in large databases/data warehouses
Interesting
non-trivial
implicit
previously unknown (novel)
easily understood
rule length, number of conditions in a rule
potentially useful (actionable)
Information
patterns
rules
correlations
relationships hidden in data
descriptions of rare events
detection of outliers/anomalies/rare events
prediction of events
Interesting patterns represent knowledge
4. Measures of Pattern Interestingness Objective
Rule support
Represents the percentage of transactions from a transaction database that the given rule satisfies
Probability P(X∩Y), where X∩Y indicates that a transaction contains both X and Y
support(X⇒Y) = P(X∩Y) = (# of transactions containing both X and Y) / (total # of transactions)
Rule confidence
Assesses the degree of certainty of the detected association
Conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y
confidence(X⇒Y) = P(Y|X) = support(X∩Y) / support(X)
Subjective
based on user beliefs in the data
Each measure associated with a threshold controlled by the user
Rules that do not satisfy a confidence threshold of, say, 50% are considered uninteresting
reflect noise, exceptions, or minority cases
Objective measures are combined with subjective measures
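The two objective measures above can be computed directly from transaction counts. A minimal sketch, using a made-up toy transaction database (the item names are illustrative, not from the lecture):

```python
# Hypothetical toy transaction database; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(X, Y):
    """P(X∩Y): fraction of transactions containing both X and Y."""
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y):
    """P(Y|X): fraction of X-containing transactions that also contain Y."""
    with_x = sum(1 for t in transactions if X <= t)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / with_x

sup = support({"diapers"}, {"beer"})     # 3 of 5 transactions -> 0.6
conf = confidence({"diapers"}, {"beer"})  # 3 of 4 diapers transactions -> 0.75
```

With a 50% confidence threshold, the rule diapers⇒beer (confidence 0.75) would count as interesting in this toy data.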
5. Steps in the KDD Process Understanding the application domain
relevant prior knowledge and goals of application
Data cleaning, integration, and preprocessing (60% of effort)
Creating a target data set
data selection and transformation
feature and data reduction
selection of variables, sampling of rows
Applying the DM technique(s) - the core of KDD
choosing task: classification, prediction, clustering
choosing the algorithm
search for patterns of interest
Interpreting & evaluating mined patterns
Use of discovered knowledge
6. A KDD Process
7. A KDD Process These activities are iterative and interactive, and rely on user guidance
End-user has to accept/reject the results produced by the KDD system
8. KDD: Integration of Many Disciplines Database Technology
Statistics
Machine Learning & Artificial Intelligence
Information Science
High-Performance Computing
Visualization
Pattern Recognition
Neural Networks
Fuzzy Logic
Evolutionary Computing
Graph Theory
9. Data Mining Techniques Neural Networks
Decision Trees
Fuzzy Systems (Logic, Rules)
Genetic Algorithms
Association Rules
Memory-based Reasoning (k-Nearest Neighbor)
Deviation/Anomaly Detection
Allow one to
learn from data
understand something new
answer tough questions
locate a problem
Can be complemented by traditional statistical techniques, OLAP, and SQL queries
10. Unsupervised DM Techniques Use unsupervised learning
no target or class variable
groups input data records into classes based on self-similarities in the data
The goal is not specific
“Tell me something interesting about the data”
“What common characteristics/profiles do terrorists share?”
“What is the activity pattern of a typical network intruder?”
No constraints on a DM system
No indications of what the user expects and what kind of discovery could be of interest
Examples: clustering, finding association rules, deviation detection, neural networks
11. Supervised DM Techniques Use supervised learning
classification, prediction
target (dependent) variable has clearly defined label
Attempt to
predict a specific data value
weight, height, age
classify/categorize an item into a fixed set of known classes
(yes/no, friend/foe, healthy/bankrupt, legitimate/illegitimate)
Goal is specific
Ex. “Will this company go bankrupt?”
“Is this individual a friend or a foe (terrorist)?”
“Is this credit card transaction legitimate or fraudulent?”
“Is someone trying to access a computer network an intruder or not?”
12. Classification Task Deals with discrete outcomes “intruder/non-intruder”, “legitimate/fraudulent”, “friend/foe”
Learning a function that classifies a data item into one of several predefined classes
set of rules
mathematical equation
set of weights
Training set consists of pre-classified examples
Newly presented object is assigned a class
A network system administrator can use the classifier to decide whether a person accessing the network is an intruder or not
13. Clustering Task Unsupervised learning
Segmenting a heterogeneous population into a number of more homogeneous clusters or groups
No predefined classes which will be used for training
The records are grouped together based on self-similarity
It is up to you what meaning, if any, to attach to the resulting classes
It is often done as a prelude to some other form of DM (classification)
Often based on computing the distances between data points
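A minimal k-means sketch of the distance-based grouping described above. The 2-D points and k=2 are illustrative assumptions; real clustering runs on many features:

```python
import random

def kmeans(points, k=2, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean distance)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # recompute each center as the mean of its cluster
        for c in range(k):
            if clusters[c]:
                centers[c] = tuple(sum(vals) / len(clusters[c])
                                   for vals in zip(*clusters[c]))
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts)
# the two natural groups of three points each are recovered
```

It is up to the analyst to decide what meaning, if any, the two recovered groups carry.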
14. Optimization Task Finding one or a series of optimal solutions from among a very large number of possible solutions
Traditional mathematical techniques may break down because of billions of combinations
15. High-Dimensionality Data Data/dimensionality reduction
# of features
# of samples
# of values for the features
Gains of data reduction
Improved predictive/descriptive accuracy
Model better understood
Uses fewer rules, weights, variables
Fewer features
In the next round of data collection, irrelevant features can be discarded
16. Data Preparation Always done, regardless of the DM task and technique
Depends on
amounts of data
DM task (classification, clustering/segmentation)
types of values (numeric or categorical) for features/variables
behavior of data with respect to time
Normalization
data values scaled to a specific range: [0,1], z-scores
Reasons
features with larger values outweigh features with smaller values
clustering techniques based on computing the distance between data points
neural networks learn better
prevents saturation of neurons
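The two scalings mentioned above ([0,1] min-max scaling and z-scores) can be sketched as follows; the sample values are illustrative:

```python
def minmax(xs):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def zscore(xs):
    """Standardize values to zero mean and unit (population) variance."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

values = [10.0, 20.0, 30.0, 40.0]
scaled = minmax(values)        # [0.0, 1/3, 2/3, 1.0]
standardized = zscore(values)  # mean 0, variance 1
```

After either transformation, no single feature dominates a distance computation merely because it is measured on a larger scale.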
17. Data Preparation Data Smoothing/Rounding
Minor differences between the values of a feature are often unimportant
Binning
placing values in different intervals by consulting their neighbors
Transformation of features
Reduces the # of features
18. Data Preparation Outlier detection
Samples inconsistent with respect to the remaining data
Not an easy subject
Some applications focused on outlier detection; others are not
Ex. detecting fraudulent credit card transactions
1 out of 10,000 transactions is fraudulent.
In many classes of DM applications, we remove them
Careful with the automatic removal of outliers
Methods for outlier detection
Visualization for 2-D, 3-D or 4-D
Based on mean and variance of feature
Distance-based
multidimensional samples
calculate the distance between all samples in an n-dim dataset
outliers are those samples which do not have enough neighbors
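The "not enough neighbors" idea above can be sketched directly: a sample is flagged when fewer than a minimum number of other samples lie within a chosen radius. The radius and neighbor thresholds here are illustrative assumptions:

```python
def euclidean(a, b):
    """Euclidean distance between two n-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def distance_outliers(samples, r=2.0, min_neighbors=2):
    """Flag samples that have fewer than min_neighbors within radius r."""
    flagged = []
    for i, s in enumerate(samples):
        neighbors = sum(1 for j, t in enumerate(samples)
                        if i != j and euclidean(s, t) <= r)
        if neighbors < min_neighbors:
            flagged.append(i)
    return flagged

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
outliers = distance_outliers(points)  # only the isolated point at index 4
```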
19. Sampling Millions of cases; often 20,000 or so is enough
Sample has the same probability distribution as the population
Random sampling
with replacement
without replacement
Stratified sampling
Initial data set is split into non-overlapping subsets
sampling is performed on each stratum independently of the others
Incremental sampling
Increasingly larger random subsets to observe the trends in performances of the tool and to stop when no progress is made
How many samples?
No simple answer - enough
The # depends on
algorithms
# of classes the algorithm predicts
# of variables in a data set
reliability of the results
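Stratified sampling, as described above, can be sketched as follows. The record structure, `label` field, and sampling rate are illustrative assumptions:

```python
import random

def stratified_sample(records, rate, seed=0):
    """Split records into strata by label, then sample each stratum independently."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(rec["label"], []).append(rec)
    sample = []
    for label, rows in strata.items():
        k = max(1, round(len(rows) * rate))  # keep at least one record per stratum
        sample.extend(rng.sample(rows, k))
    return sample

data = [{"label": "normal"}] * 98 + [{"label": "intrusion"}] * 2
subset = stratified_sample(data, rate=0.1)
# 10 "normal" records and 1 "intrusion" record survive
```

Unlike plain random sampling, the rare "intrusion" stratum is guaranteed representation in the sample.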
20. Feature Reduction Hundreds of features
many irrelevant, correlated, redundant
Feature selection often a space search problem
Small # of features → can be searched exhaustively (all combinations)
20 features: 2^20 > 1,000,000 combinations
21. Feature Reduction Methods Independent examination of features based on the mean & variance
Test features separately – one feature at a time
Feature examined normally distributed
Given feature is independent of the others
Examines one feature at a time without taking into account the relationship to other features
Collective examination of features based on feature means and covariances
tests all features together
features have normally distributed values
impractical and computationally prohibitive
yields huge search space
22. Principal component analysis (PCA)
Very popular, well-established, frequently used
Complex in terms of calculations
Components that contribute the least to the variation in the data set are eliminated
Entropy measure
Called unsupervised feature selection
no output feature containing a class label
Removing an irrelevant feature from a set may not change the information content of the data set
Information content is measured by entropy
Features on numeric or categorical scale
Numeric - normalized Euclidean distance
Categorical - Hamming distance
23. Neural Networks Enable a system to acquire, store, and utilize experiential knowledge
Try to emulate biological neurological systems
Try to mimic/approximate the way the human brain functions and processes information
Used successfully for the following tasks
Classification
Clustering
Optimization
Implemented as mathematical models of the human brain
24. Characterized by their three properties:
Computational property
built of neurons
summation node and activation function
organized in layers
interconnected using weights
Architecture of the network
Feed-forward NN with error back-propagation
classification, prediction
Kohonen network
clustering (segmentation)
Learning property
supervised mode (with a teacher)
unsupervised mode (without a teacher)
Knowledge is encoded in the network’s weights
26. Decision Trees Useful for classification tasks
Learn from data, like neural networks
Operation based on the algorithms that
make the clusters at the node purer and purer by progressively reducing disorder (impurity) in the original data set
impurity is measured by entropy
find the optimum number of splits and determine where to partition the data to maximize the information gain
Nodes, branches and leaves indicate the variables, conditions, and outcomes, respectively
Most predictive variable placed at the top node of the tree
Model is represented in the form of explicit and understandable rule-like relationships among variables
Each rule represents a unique path from the root to each leaf
Not as effective as neural networks at detecting complex nonlinear relationships between variables
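The impurity reduction a split is chosen to maximize can be sketched with Shannon entropy and information gain. The "intruder"/"normal" labels and the split below are illustrative:

```python
import math

def entropy(labels):
    """Shannon entropy of a class-label distribution, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, splits):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = ["intruder"] * 5 + ["normal"] * 5   # maximally impure node
left   = ["intruder"] * 4 + ["normal"] * 1   # mostly intruders
right  = ["intruder"] * 1 + ["normal"] * 4   # mostly normal
gain = information_gain(parent, [left, right])  # about 0.278 bits
```

A tree-building algorithm evaluates candidate splits of each node and keeps the one with the largest gain, progressively making child nodes purer.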
28. Fuzzy Logic Enables one to build fuzzy systems
Knowledge is encoded in fuzzy sets and fuzzy rules
Fuzzy rules: enable one to reason or describe a process in terms of approximations
Fuzzy sets: sets without clearly defined boundaries
Can produce very accurate results
Fast response time
Knowledge about the fuzzy rules and fuzzy sets
elicited from domain experts
generated from the given data
neuro-fuzzy systems
29. Genetic Algorithms Solve problems (mainly optimization) by borrowing a technique from nature
Use Darwin’s three basic principles
Survival of the fittest (reproduction)
Cross-breeding (crossovers)
Mutation
to create approximate solutions for problems
selecting the fitness function and encoding the genomes is often difficult
Example
You work for a shipping firm and have to make shipments to 6 different cities. You have one vehicle, and your task is to minimize the distance traveled. The vehicle can visit each city only once and can start from any city.
31. Rare Events We are drowning in the massive amount of data that are being collected, while starving for knowledge at the same time
Despite the enormous amount of data, particular events of interest are still quite rare
Rare events are events that occur very infrequently, i.e., their frequency ranges from 0.01% to 10%
However, when they occur, their consequences can be quite dramatic often in a negative sense
32. Applications of Rare Cases Network intrusion detection
Number of intrusions on the network is typically a very small fraction of the total network traffic
Credit card fraud transaction
Millions of legitimate transactions are stored, while only a very small percentage is fraudulent
Medical diagnostics
When classifying the pixels in mammogram images, cancerous pixels represent only a very small fraction of the entire image
33. Applications of Rare Cases Web mining
< 3% of all people visiting Amazon.com make a purchase
Identifying passengers at airports (through biometrics) and screening their luggage
Only an extremely small number of passengers is suspected of hostile activities; the same applies to passengers’ luggage, which may contain explosives
Fraud detection
auto insurance: detecting people who stage accidents to collect on insurance
Profiling Individuals
finding clusters of “model” terrorists who share similar characteristics
Money laundering, Financial fraud, Churn analysis
34. Key Technical Challenges for Detecting Rare Events Large data size
High dimensionality
Temporal nature of the data
Skewed class distribution
Rare events are underrepresented in the data set – minority class
Data preprocessing
On-line analysis
35. Limitations of Standard Data Mining Schemes Many classic data mining issues and methods apply in the domain of rare cases
Limitations
Standard approaches for feature selection and construction, computing distances between samples, and sampling do not work well for rare case analysis
While most normal events are similar to each other, rare events are quite different from one another
Regular network traffic is fairly standard, while suspicious ones vary from the standard ones in many different ways
Metrics used to evaluate normal event detection methods
Overall classification accuracy is not appropriate for evaluating methods for rare event detection
In many applications data keeps arriving in real-time, and there is a need to detect rare events on the fly, with models built only on the events seen so far
36. Computer Security Broad and extremely important field
Generally encompasses two aspects
How computers can be used to secure the information contained within organizations
Detection and/or prevention of unauthorized access or attacks on computers, networks, operating system, data, and applications local to an organization
How computers can be used to detect hostile activity in a sensitive geographical area (such as in an airport)
Involves computer vision technology
Identifying patterns of activities that can suggest a friend or foe
37. Computer Security The ability of a computer system to protect information and system resources with respect to
Confidentiality: Prevention of unauthorized disclosure of information
Integrity: Prevention of unauthorized modification of information
Availability: Prevention of unauthorized withholding of information
Intrusion – Cyber attack that tries to bypass security mechanisms
Outsider – attack on the system from the Internet
Hackers, spies, kiddies
Stealing, spying, probing (to collect information about the host)
DoS attacks, viruses, worms
Insider (employee) – attempt to gain and misuse non-authorized privileges
38. Taxonomy of Computer Attacks Intrusions can be classified according to several categories
Attack type:
DoS, worms/trojan horses
Number of network connections involved in the attack
single connection cyber attacks
multiple connections cyber attacks
Source of the attack
multiple location vs. single location; coordinated/distributed attacks?
inside vs. outside
Target of the attack
Single or many different destinations
Environment (network, host, P2P (peer-to-peer), wireless networks, ..)
Less secure physical layer
No traffic concentration points for monitoring packets
Automation (manual, automated, semi-automated attack)
Need to analyze network data from several sites to detect these attacks
39. Prevention – Existing Security Mechanisms Security protocols and policies
IPSec – Security at the IP layer
Source authentication
Encryption
Secure Socket Layer (SSL)
Source authentication
Encryption
Host based protections
Regularly installing patches, defending accounts, integrity checks
Firewalls
Control flow of traffic between networks
Block traffic from Internet and to Internet
Monitor communication between networks and examine each packet to see whether it should be let through
All the above mechanisms are insufficient due to
Security holes, insider attacks, multiple levels of data confidentiality within an organization
Sophistication of cyber attacks, their severity, and increased intruders’ knowledge
Data mining can help
It is not a cure for all problems
40. Motivation - Data Mining for Intrusion Detection Increased interest in data mining based intrusion detection
Attacks for which it is difficult to build signatures
Attack stealthiness
Unforeseen/Unknown/Emerging attacks
Distributed/coordinated attacks
Data mining approaches for intrusion detection
Misuse detection
Supervised learning
Anomaly detection
Unsupervised learning
Summarization of attacks using association rules
41. Motivation - Data Mining for Intrusion Detection Data mining approaches for intrusion detection
Misuse detection
Supervised learning
Based on extensive knowledge of patterns associated with known attacks provided by human experts
Building predictive models from labeled data sets (instances are labeled as “normal” or “intrusive”) to identify known intrusions
Major advantages
High accuracy in detecting many kinds of known attacks
Produce models that can be easily understood
Major limitations
Cannot detect unknown and emerging attacks
The data has to be labeled
Signature database has to be manually revised for each new type of discovered attack
Major approaches: pattern (signature) matching, expert systems, neural networks, decision trees, logistic regression, memory-based reasoning
SNORT system
42. Motivation - Data Mining for Intrusion Detection Data mining approaches for intrusion detection
Anomaly detection
Unsupervised learning
Based on profiles that represent normal behavior of users, hosts, or networks, and detecting attacks as significant deviations from this profile
Major benefit - potentially able to recognize unforeseen attacks
Major limitation - possible high false alarm rate, since detected deviations do not necessarily represent actual attacks
Major approaches: statistical methods, expert systems, clustering, neural networks, outlier detection schemes, deviation/anomaly detection
Analyze each event to determine how similar (or dissimilar) it is to the majority
Success depends on the choice of similarity measures, dimension weighting
Summarization of attacks using association rules
43. IDS – Information Source Host-based IDS
base the decisions on information obtained from a single host (e.g. system log data, system calls data)
Network-based IDS
make decisions according to the information and data obtained by monitoring the traffic in the network to which the hosts are connected
Wireless network IDS
detect intrusions by analyzing traffic between mobile nodes
Application Logs
detect intrusions by analyzing, for example, database logs (database misuse) or web logs
IDS Sensor Alerts
analysis of low-level sensor alarms
Analysis of alarms generated by other IDSs
44. Data Sources in Network Intrusion Detection Network traffic data is usually collected using “network sniffers”
Tcpdump
08:02:15.471817 0:10:7b:38:46:33 0:10:7b:38:46:33 loopback 60:
0000 0100 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000
08:02:19.391039 172.16.112.100.3055 > 172.16.112.10.ntp: v1 client strat 0 poll 0 prec 0
08:02:19.391456 172.16.112.10.ntp > 172.16.112.100.3055: v1 server strat 5 poll 4 prec -16 (DF)
net-flow tools
Source and destination IP address, Source and destination ports, Type of service, Packet and byte counts, Start and end time, Input and output interface numbers, TCP flags, Routing information (next-hop address, source autonomous system (AS) number, destination AS number)
0624.12:4:39.344 0624.12:4:48.292 211.59.18.101 4350 160.94.179.138 1433 6 2 3 144
0624.9:1:10.667 0624.9:1:19.635 24.201.13.122 3535 160.94.179.151 1433 6 2 3 132
0624.12:4:40.572 0624.12:4:49.496 211.59.18.101 4362 160.94.179.150 1433 6 2 3 152
Collected data are in the form of network connections or network packets (a network connection may contain several packets)
45. Projects: Data Mining in Intrusion Detection MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection) – Columbia University, Georgia Tech, Florida Tech
ADAM (Audit Data Analysis and Mining) - George Mason University
MINDS (University of Minnesota)
Intelligent Intrusion Detection – IIDS (Mississippi State University)
Data Mining for Network Intrusion Detection (MITRE corporation)
Institute for Security Technology Studies (ISTS), Dartmouth College
Intrusion Detection Techniques (Arizona State University)
Agent based data mining system (Iowa State University)
IDDM (Intrusion Detection using Data Mining Techniques) – Department of Defense, Australia
46. Data Preprocessing for Data Mining in ID Converting the data from the monitored system (computer network, host machine, …) into data (features) that will be used in data mining models
For misuse detection, labeling data examples into normal or intrusive may require enormous time for many human experts
Building data mining models
Misuse detection models
Anomaly detection models
Analysis and summarization of results
49. Misuse Detection - Evaluation of Rare Class Problems – F-value
Accuracy is not a sufficient metric for evaluation
Accuracy = (TN+TP)/(TN+FP+FN+TP)
Ex. Network traffic data set with 99.99% of normal data and 0.01% of intrusions
Trivial classifier that labels everything with the normal class can achieve 99.99% accuracy!!!
Focus on both recall and precision
Recall (R) = TP/(TP+FN)
Precision (P) = TP/(TP+FP)
F-measure = 2*R*P/(R+P)
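The metrics above can be computed directly from confusion-matrix counts. The counts below are a made-up rare-class scenario, not data from the lecture:

```python
# Hypothetical counts: 10 true intrusions hidden in 10,000 records.
TP, FP, FN, TN = 8, 4, 2, 9986

accuracy  = (TN + TP) / (TN + FP + FN + TP)   # 0.9994, dominated by TN
recall    = TP / (TP + FN)                    # 0.8: fraction of intrusions caught
precision = TP / (TP + FP)                    # ~0.667: fraction of alarms that are real
f_measure = 2 * recall * precision / (recall + precision)  # ~0.727
```

Note how a trivial all-normal classifier would score 0.999 on accuracy but 0.0 on recall, precision, and F-measure, which is why the latter metrics are used for rare classes.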
50. Misuse Detection - Evaluation of Rare Class Problems – ROC
52. Misuse Detection - Manipulating Data Records Over-sampling the rare class
Duplicate the rare events until the data set contains as many minority examples as the majority class => balances the classes
Does not add information, but increases the effective misclassification cost of the minority class
SMOTE (Synthetic Minority Over-sampling TEchnique)
Synthetically generates minority class examples
When generating artificial minority class example, distinguish two types of features
Continuous
Nominal (Categorical) features
Down-sizing (undersampling) the majority class
Sample the data records from majority class
Randomly
“Near miss” examples
Examples far from minority class examples (far from decision boundaries)
Replace the original majority class records in the data set with the sampled ones
Usually results in a general loss of information and potentially overly general rules
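The two balancing strategies above (random over-sampling of the minority class and random under-sampling of the majority class) can be sketched as follows; the labels and counts are illustrative:

```python
import random

def oversample_minority(minority, majority, seed=0):
    """Duplicate randomly chosen minority records until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def undersample_majority(minority, majority, seed=0):
    """Randomly keep only as many majority records as there are minority records."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

rare = ["attack"] * 5
common = ["normal"] * 95
over_min, over_maj = oversample_minority(rare, common)      # 95 vs 95
under_min, under_maj = undersample_majority(rare, common)   # 5 vs 5
```

SMOTE differs from the over-sampling sketch above in that it interpolates new synthetic minority examples rather than duplicating existing ones.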
53. Unsupervised Techniques – Anomaly Detection Build models of “normal” behavior and detect anomalies as deviations from it
Possible high false alarm rate - previously unseen (yet legitimate) data records may be recognized as anomalies
Two types of techniques
with access to normal data
with NO access to normal data (not known what is “normal”)
54. Outlier Detection Schemes Outlier is defined as a data point which is very different from the rest of the data based on some measure
Detect novel attacks/intrusions by identifying them as deviations from “normal” behavior
Identify normal behavior
Construct useful set of features
Define similarity function
Use outlier detection algorithm
Statistics based approaches
Distance based approaches
Nearest neighbor approaches
Clustering based approaches
Density based schemes
55. Distance-based Outlier Detection Scheme k-Nearest Neighbor approach
For each data point d, compute the distance d_k to its k-th nearest neighbor
Sort all data points according to the distance d_k
Outliers are points that have the largest distance d_k and are therefore located in the more sparse neighborhoods
Usually data points whose distance d_k is in the top n% are identified as outliers
n – user parameter
Not suitable for datasets that have modes with varying density
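The k-th nearest-neighbor scheme above can be sketched directly: score each point by its distance to its k-th neighbor and rank points by that score. The 2-D points and k=2 are illustrative assumptions:

```python
def knn_outlier_scores(points, k=2):
    """Rank points by distance to their k-th nearest neighbor (largest first)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((ds[k - 1], i))   # distance d_k to the k-th nearest neighbor
    return sorted(scores, reverse=True)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
ranked = knn_outlier_scores(pts)
top_outlier = ranked[0][1]   # index of the most isolated point
```

In practice the top n% of the ranked list would be flagged, with n chosen by the analyst.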
56. Distance-based Outlier Detection Scheme
57. Model based outlier detection schemes Use a prediction model to learn the normal behavior
Every deviation from learned prediction model can be treated as anomaly or potential intrusion
Recent approaches:
Neural networks
Unsupervised Support Vector Machines (SVMs)
58. Neural networks for outlier detection Use a replicator 4-layer feed-forward neural network (RNN) with the same number of input and output nodes
The input variables are also the output variables, so that the RNN forms a compressed model of the data during training
A measure of outlyingness is the reconstruction error of individual data points
59. Conclusions Data mining analysis of rare events requires special attention
Many real world applications exhibit “needle-in the-haystack” type of problem
Current “state of the art” data mining techniques are still insufficient for efficiently handling rare events
Need for designing better and more accurate data mining models
60. References Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, IEEE Press/Wiley, 2003
Lazarevic, A., Srivastava, J., Kumar, V., Data Mining for Computer Security Applications, IEEE ICDM 2003 Tutorial
Kantardzic, M., and Zurada, J. (Eds.), Next Generations of Data Mining Applications, IEEE Press/Wiley, 2005
Tan, P., Steinbach, M., Kumar, V., Introduction to Data Mining, Addison Wesley, 2005
Zurada, J., Knowledge Discovery and Data Mining, Lecture Notes on Blackboard, Spring 2005
61. Thank you for attending this lecture! Questions/Discussion?