1. Data Mining for Analysis of Rare Events: A Case of Computer Security and Other Applications Jozef Zurada
Department of Computer Information Systems
College of Business
University of Louisville
Louisville, Kentucky
USA
email: jmzura01@louisville.edu
2. Outline Introduction to Knowledge Discovery in Databases and Data Mining
Data Mining Tools, Techniques, and Tasks
High-dimensional data
Feature and values reduction, and sampling
Rare Events
What are they?
What are the application domains exhibiting these characteristics?
What are the limitations of standard data mining techniques?
Major Techniques for Detecting Rare Events
Supervised (Classification) techniques - Predictive Modeling
Tree based approaches, Neural networks
Unsupervised Techniques
Anomaly/Outlier Detection, Clustering
Other Data Mining Techniques – Association Rules
Case Study: Intrusion Detection Systems
What are the general types/categories of cyber attacks
Data Mining architecture for Intrusion Detection Systems
Conclusion and Questions
3. What is KDD? Finding/extracting interesting information from data stored in large databases/data warehouses
Interesting
non-trivial
implicit
previously unknown (novel)
easily understood
rule length, number of conditions in a rule
potentially useful (actionable)
Information
patterns
rules
correlations
relationships hidden in data
descriptions of rare events
detection of outliers/anomalies/rare events
prediction of events
Interesting patterns represent knowledge
4. Measures of Pattern Interestingness Objective
Rule support
Represents the percentage of transactions from a transaction database that the given rule satisfies
Probability P(X∩Y), where X∩Y indicates that a transaction contains both X and Y
support(X⇒Y) = P(X∩Y) = (# of transactions containing both X and Y) / (total # of transactions)
Rule confidence
Assesses the degree of certainty of the detected association
Conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y
confidence(X⇒Y) = P(Y|X) = support(X∩Y) / support(X)
Subjective
based on user beliefs in the data
Each measure associated with a threshold controlled by the user
Rules that do not satisfy a confidence threshold of, say, 50% are considered uninteresting
reflect noise, exceptions, or minority cases
Objective measures are combined with subjective measures
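The two objective measures above can be computed directly from transaction counts. A minimal sketch, using a made-up toy transaction database (the item names are illustrative, not from the lecture):

```python
# Hypothetical toy transaction database; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(X, Y):
    """P(X∩Y): fraction of transactions containing both X and Y."""
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y):
    """P(Y|X): fraction of X-containing transactions that also contain Y."""
    with_x = sum(1 for t in transactions if X <= t)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / with_x

sup = support({"diapers"}, {"beer"})     # 3 of 5 transactions -> 0.6
conf = confidence({"diapers"}, {"beer"})  # 3 of 4 diapers transactions -> 0.75
```

With a 50% confidence threshold, the rule diapers⇒beer (confidence 0.75) would count as interesting in this toy data.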
5. Steps in the KDD Process Understanding the application domain
relevant prior knowledge and goals of application
Data cleaning, integration, and preprocessing (60% of effort)
Creating a target data set
data selection and transformation
feature and data reduction
selection of variables, sampling of rows
Applying the DM technique(s) - the core of KDD
choosing task: classification, prediction, clustering
choosing the algorithm
search for patterns of interest
Interpreting & evaluating mined patterns
Use of discovered knowledge
6. A KDD Process
7. A KDD Process These activities are iterative and interactive, and rely on user guidance
End-user has to accept/reject the results produced by the KDD system
8. KDD: Integration of Many Disciplines Database Technology
Statistics
Machine Learning & Artificial Intelligence
Information Science
High-Performance Computing
Visualization
Pattern Recognition
Neural Networks
Fuzzy Logic
Evolutionary Computing
Graph Theory
9. Data Mining Techniques Neural Networks
Decision Trees
Fuzzy Systems (Logic, Rules)
Genetic Algorithms
Association Rules
Memory-based Reasoning (k-Nearest Neighbor)
Deviation/Anomaly Detection
Allow one to
learn from data
understand something new
answer tough questions
locate a problem
Can be complemented by traditional statistical techniques, OLAP, and SQL queries
10. Unsupervised DM Techniques Use unsupervised learning
no target or class variable
groups input data records into classes based on self-similarities in the data
The goal is not specific
“Tell me something interesting about the data”
“What common characteristics/profiles do terrorists share?”
“What is the activity pattern of a typical network intruder?”
No constraints on a DM system
No indications of what the user expects and what kind of discovery could be of interest
Examples: clustering, finding association rules, deviation detection, neural networks
11. Supervised DM Techniques Use supervised learning
classification, prediction
target (dependent) variable has clearly defined label
Attempt to
predict a specific data value
weight, height, age
classify/categorize an item into a fixed set of known classes
(yes/no, friend/foe, healthy/bankrupt, legitimate/illegitimate)
Goal is specific
Ex. “Will this company go bankrupt?”
“Is this individual a friend or a foe (terrorist)?”
“Is this credit card transaction legitimate or fraudulent?”
“Is someone trying to access a computer network an intruder or not?”
12. Classification Task Deals with discrete outcomes “intruder/non-intruder”, “legitimate/fraudulent”, “friend/foe”
Learning a function that classifies a data item into one of several predefined classes
set of rules
mathematical equation
set of weights
Training set consists of pre-classified examples
Newly presented object is assigned a class
A network system administrator can use the classifier to decide whether a person accessing the network is an intruder or not
13. Clustering Task Unsupervised learning
Segmenting a heterogeneous population into a number of more homogeneous clusters or groups
No predefined classes which will be used for training
The records are grouped together based on self-similarity
It is up to you what meaning, if any, to attach to the resulting classes
It is often done as a prelude to some other form of DM (classification)
Often based on computing the distances between data points
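A minimal k-means sketch of the distance-based grouping described above. The 2-D points and k=2 are illustrative assumptions; real clustering runs on many features:

```python
import random

def kmeans(points, k=2, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean distance)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # recompute each center as the mean of its cluster
        for c in range(k):
            if clusters[c]:
                centers[c] = tuple(sum(vals) / len(clusters[c])
                                   for vals in zip(*clusters[c]))
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts)
# the two natural groups of three points each are recovered
```

It is up to the analyst to decide what meaning, if any, the two recovered groups carry.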
14. Optimization Task Finding one or a series of optimal solutions from among a very large number of possible solutions
Traditional mathematical techniques may break down because of billions of combinations
15. High-Dimensionality Data Data/dimensionality reduction
# of features
# of samples
# of values for the features
Gains of data reduction
Improved predictive/descriptive accuracy
Model better understood
Uses fewer rules, weights, variables
Fewer features
In the next round of data collection, irrelevant features can be discarded
16. Data Preparation Always done, regardless of the DM task and technique
Depends on
amounts of data
DM task (classification, clustering/segmentation)
types of values (numeric or categorical) for features/variables
behavior of data with respect to time
Normalization
data values scaled to a specific range: [0,1], z-scores
Reasons
features with larger values outweigh features with smaller values
clustering techniques based on computing the distance between data points
neural networks learn better
prevents saturation of neurons
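The two scalings mentioned above ([0,1] min-max scaling and z-scores) can be sketched as follows; the sample values are illustrative:

```python
def minmax(xs):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def zscore(xs):
    """Standardize values to zero mean and unit (population) variance."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

values = [10.0, 20.0, 30.0, 40.0]
scaled = minmax(values)        # [0.0, 1/3, 2/3, 1.0]
standardized = zscore(values)  # mean 0, variance 1
```

After either transformation, no single feature dominates a distance computation merely because it is measured on a larger scale.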
17. Data Preparation Data Smoothing/Rounding
Minor differences between the values of a feature are often unimportant
Binning
placing values in different intervals by consulting their neighbors
Transformation of features
Reduces the # of features
18. Data Preparation Outlier detection
Samples inconsistent with respect to the remaining data
Not an easy subject
Some applications focused on outlier detection; others are not
Ex. detecting fraudulent credit card transactions
1 out of 10,000 transactions is fraudulent.
In many classes of DM applications, we remove them
Careful with the automatic removal of outliers
Methods for outlier detection
Visualization for 2-D, 3-D or 4-D
Based on mean and variance of feature
Distance-based
multidimensional samples
calculate the distance between all samples in an n-dim dataset
outliers are those samples which do not have enough neighbors
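The "not enough neighbors" idea above can be sketched directly: a sample is flagged when fewer than a minimum number of other samples lie within a chosen radius. The radius and neighbor thresholds here are illustrative assumptions:

```python
def euclidean(a, b):
    """Euclidean distance between two n-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def distance_outliers(samples, r=2.0, min_neighbors=2):
    """Flag samples that have fewer than min_neighbors within radius r."""
    flagged = []
    for i, s in enumerate(samples):
        neighbors = sum(1 for j, t in enumerate(samples)
                        if i != j and euclidean(s, t) <= r)
        if neighbors < min_neighbors:
            flagged.append(i)
    return flagged

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
outliers = distance_outliers(points)  # only the isolated point at index 4
```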
19. Sampling Millions of cases; often 20,000 or so is enough
Sample has the same probability distribution as the population
Random sampling
with replacement
without replacement
Stratified sampling
Initial data set is split into non-overlapping subsets
sampling is performed on each stratum independently of the others
Incremental sampling
Increasingly larger random subsets to observe the trends in performances of the tool and to stop when no progress is made
How many samples?
No simple answer - enough
The # depends on
algorithms
# of classes the algorithm predicts
# of variables in a data set
reliability of the results
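Stratified sampling, as described above, can be sketched as follows. The record structure, `label` field, and sampling rate are illustrative assumptions:

```python
import random

def stratified_sample(records, rate, seed=0):
    """Split records into strata by label, then sample each stratum independently."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(rec["label"], []).append(rec)
    sample = []
    for label, rows in strata.items():
        k = max(1, round(len(rows) * rate))  # keep at least one record per stratum
        sample.extend(rng.sample(rows, k))
    return sample

data = [{"label": "normal"}] * 98 + [{"label": "intrusion"}] * 2
subset = stratified_sample(data, rate=0.1)
# 10 "normal" records and 1 "intrusion" record survive
```

Unlike plain random sampling, the rare "intrusion" stratum is guaranteed representation in the sample.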
20. Feature Reduction Hundreds of features
many irrelevant, correlated, redundant
Feature selection often a space search problem
Small # of features → can be searched exhaustively (all combinations)
20 features: 2^20 > 1,000,000 combinations
21. Feature Reduction Methods Independent examination of features based on the mean & variance
Test features separately – one feature at a time
Feature examined normally distributed
Given feature is independent of the others
Examines one feature at a time without taking into account the relationship to other features
Collective examination of features based on feature means and covariances
tests all features together
features have normally distributed values
impractical and computationally prohibitive
yields huge search space
22. Principal component analysis (PCA)
Very popular, well-established, frequently used
Complex in terms of calculations
Components that contribute the least to the variation in the data set are eliminated
Entropy measure
Called unsupervised feature selection
no output feature containing a class label
Removing an irrelevant feature from a set may not change the information content of the data set
Information content is measured by entropy
Features on numeric or categorical scale
Numeric - normalized Euclidean distance
Categorical - Hamming distance
23. Neural Networks Enable a system to acquire, store, and utilize experiential knowledge
Try to emulate biological neurological systems
Try to mimic/approximate the way the human brain functions and processes information
Used successfully for the following tasks
Classification
Clustering
Optimization
Implemented as mathematical models of the human brain
24. Characterized by their three properties:
Computational property
built of neurons
summation node and activation function
organized in layers
interconnected using weights
Architecture of the network
Feed-forward NN with error back-propagation
classification, prediction
Kohonen network
clustering (segmentation)
Learning property
supervised mode (with a teacher)
unsupervised mode (without a teacher)
Knowledge is encoded in the network’s weights
26. Decision Trees Useful for classification tasks
Learn from data, like neural networks
Operation based on the algorithms that
make the clusters at the node purer and purer by progressively reducing disorder (impurity) in the original data set
impurity is measured by entropy
find the optimum number of splits and determine where to partition the data to maximize the information gain
Nodes, branches and leaves indicate the variables, conditions, and outcomes, respectively
Most predictive variable placed at the top node of the tree
Model is represented in the form of explicit and understandable rule-like relationships among variables
Each rule represents a unique path from the root to each leaf
Not as effective as neural networks at detecting complex nonlinear relationships between variables
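The impurity reduction a split is chosen to maximize can be sketched with Shannon entropy and information gain. The "intruder"/"normal" labels and the split below are illustrative:

```python
import math

def entropy(labels):
    """Shannon entropy of a class-label distribution, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, splits):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = ["intruder"] * 5 + ["normal"] * 5   # maximally impure node
left   = ["intruder"] * 4 + ["normal"] * 1   # mostly intruders
right  = ["intruder"] * 1 + ["normal"] * 4   # mostly normal
gain = information_gain(parent, [left, right])  # about 0.278 bits
```

A tree-building algorithm evaluates candidate splits of each node and keeps the one with the largest gain, progressively making child nodes purer.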
28. Fuzzy Logic Enables one to build fuzzy systems
Knowledge is encoded in fuzzy sets and fuzzy rules
Fuzzy rules: enable one to reason or describe a process in terms of approximations
Fuzzy sets: sets without clearly defined boundaries
Can produce very accurate results
Fast response time
Knowledge about the fuzzy rules and fuzzy sets
elicited from domain experts
generated from the given data
neuro-fuzzy systems
29. Genetic Algorithms Solve problems (mainly optimization) by borrowing a technique from nature
Use Darwin’s three basic principles
Survival of the fittest (reproduction)
Cross-breeding (crossovers)
Mutation
to create approximate solutions for problems
selecting the fitness function and encoding the genomes is often difficult
Example
You work for a shipping firm and have to make shipments to 6 different cities. You have one vehicle, and your task is to minimize the distance traveled. The vehicle can visit each city only once and can start from any city.
31. Rare Events We are drowning in the massive amount of data that are being collected, while starving for knowledge at the same time
Despite the enormous amount of data, particular events of interest are still quite rare
Rare events are events that occur very infrequently, i.e., their frequency ranges from 0.01% to 10%
However, when they occur, their consequences can be quite dramatic often in a negative sense
32. Applications of Rare Cases Network intrusion detection
Number of intrusions on the network is typically a very small fraction of the total network traffic
Credit card fraud transaction
Millions of legitimate transactions are stored, while only a very small percentage is fraudulent
Medical diagnostics
When classifying the pixels in mammogram images, cancerous pixels represent only a very small fraction of the entire image
33. Applications of Rare Cases Web mining
< 3% of all people visiting Amazon.com make a purchase
Identifying passengers at airports (through biometrics) and screening their luggage
Only an extremely small number of passengers is suspected of hostile activities; the same applies to passengers’ luggage, which may contain explosives
Fraud detection
auto insurance: detecting people who stage accidents to collect on insurance
Profiling Individuals
finding clusters of “model” terrorists who share similar characteristics
Money laundering, Financial fraud, Churn analysis
34. Key Technical Challenges for Detecting Rare Events Large data size
High dimensionality
Temporal nature of the data
Skewed class distribution
Rare events are underrepresented in the data set – minority class
Data preprocessing
On-line analysis
35. Limitations of Standard Data Mining Schemes Many classic data mining issues and methods apply in the domain of rare cases
Limitations
Standard approaches for feature selection and construction, computing distances between samples, and sampling do not work well for rare case analysis
While most normal events are similar to each other, rare events are quite different from one another
Regular network traffic is fairly standard, while suspicious ones vary from the standard ones in many different ways
Metrics used to evaluate normal event detection methods
Overall classification accuracy is not appropriate for evaluating methods for rare event detection
In many applications data keeps arriving in real-time, and there is a need to detect rare events on the fly, with models built only on the events seen so far
36. Computer Security Broad and extremely important field
Generally encompasses two aspects
How computers can be used to secure the information contained within organizations
Detection and/or prevention of unauthorized access or attacks on computers, networks, operating system, data, and applications local to an organization
How computers can be used to detect hostile activity in a sensitive geographical area (such as in an airport)
Involves computer vision technology
Identifying patterns of activities that can suggest a friend or foe
37. Computer Security The ability of a computer system to protect information and system resources with respect to
Confidentiality: Prevention of unauthorized disclosure of information
Integrity: Prevention of unauthorized modification of information
Availability: Prevention of unauthorized withholding of information
Intrusion – Cyber attack that tries to bypass security mechanisms
Outsider – attack on the system from the Internet
Hackers, spies, kiddies
Stealing, spying, probing (to collect information about the host)
DoS attacks, viruses, worms
Insider (employee) – attempt to gain and misuse non-authorized privileges
38. Taxonomy of Computer Attacks Intrusions can be classified according to several categories
Attack type:
DoS, worms/trojan horses
Number of network connections involved in the attack
single connection cyber attacks
multiple connections cyber attacks
Source of the attack
multiple location vs. single location; coordinated/distributed attacks?
inside vs. outside
Target of the attack
Single or many different destinations
Environment (network, host, P2P (peer-to-peer), wireless networks, ..)
Less secure physical layer
No traffic concentration points for monitoring packets
Automation (manual, automated, semi-automated attack)
Need to analyze network data from several sites to detect these attacks
39. Prevention – Existing Security Mechanisms Security protocols and policies
IPSec – Security at the IP layer
Source authentication
Encryption
Secure Socket Layer (SSL)
Source authentication
Encryption
Host based protections
Regularly installing patches, defending accounts, integrity checks
Firewalls
Control flow of traffic between networks
Block traffic from Internet and to Internet
Monitor communication between networks and examine each packet to see whether it should be let through
All the above mechanisms are insufficient due to
Security holes, insider attacks, multiple levels of data confidentiality within an organization
Sophistication of cyber attacks, their severity, and increased intruders’ knowledge
Data mining can help
It is not a cure for all problems
40. Motivation - Data Mining for Intrusion Detection Increased interest in data mining based intrusion detection
Attacks for which it is difficult to build signatures
Attack stealthiness
Unforeseen/Unknown/Emerging attacks
Distributed/coordinated attacks
Data mining approaches for intrusion detection
Misuse detection
Supervised learning
Anomaly detection
Unsupervised learning
Summarization of attacks using association rules
41. Motivation - Data Mining for Intrusion Detection Data mining approaches for intrusion detection
Misuse detection
Supervised learning
Based on extensive knowledge of patterns associated with known attacks provided by human experts
Building predictive models from labeled data sets (instances are labeled as “normal” or “intrusive”) to identify known intrusions
Major advantages
High accuracy in detecting many kinds of known attacks
Produce models that can be easily understood
Major limitations
Cannot detect unknown and emerging attacks
The data has to be labeled
Signature database has to be manually revised for each new type of discovered attack
Major approaches: pattern (signature) matching, expert systems, neural networks, decision trees, logistic regression, memory-based reasoning
SNORT system
42. Motivation - Data Mining for Intrusion Detection Data mining approaches for intrusion detection
Anomaly detection
Unsupervised learning
Based on profiles that represent normal behavior of users, hosts, or networks, and detecting attacks as significant deviations from this profile
Major benefit - potentially able to recognize unforeseen attacks
Major limitation - possible high false alarm rate, since detected deviations do not necessarily represent actual attacks
Major approaches: statistical methods, expert systems, clustering, neural networks, outlier detection schemes, deviation/anomaly detection
Analyze each event to determine how similar (or dissimilar) it is to the majority
Success depends on the choice of similarity measures, dimension weighting
Summarization of attacks using association rules
43. IDS – Information Source Host-based IDS
base the decisions on information obtained from a single host (e.g. system log data, system calls data)
Network-based IDS
make decisions according to the information and data obtained by monitoring the traffic in the network to which the hosts are connected
Wireless network IDS
detect intrusions by analyzing traffic between mobile nodes
Application Logs
detect intrusions by analyzing, for example, database logs (database misuse) or web logs
IDS Sensor Alerts
analysis of low-level sensor alarms
Analysis of alarms generated by other IDSs
44. Data Sources in Network Intrusion Detection Network traffic data is usually collected using “network sniffers”
Tcpdump
08:02:15.471817 0:10:7b:38:46:33 0:10:7b:38:46:33 loopback 60:
0000 0100 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000
08:02:19.391039 172.16.112.100.3055 > 172.16.112.10.ntp: v1 client strat 0 poll 0 prec 0
08:02:19.391456 172.16.112.10.ntp > 172.16.112.100.3055: v1 server strat 5 poll 4 prec -16 (DF)
net-flow tools
Source and destination IP address, Source and destination ports, Type of service, Packet and byte counts, Start and end time, Input and output interface numbers, TCP flags, Routing information (next-hop address, source autonomous system (AS) number, destination AS number)
0624.12:4:39.344 0624.12:4:48.292 211.59.18.101 4350 160.94.179.138 1433 6 2 3 144
0624.9:1:10.667 0624.9:1:19.635 24.201.13.122 3535 160.94.179.151 1433 6 2 3 132
0624.12:4:40.572 0624.12:4:49.496 211.59.18.101 4362 160.94.179.150 1433 6 2 3 152
Collected data are in the form of network connections or network packets (a network connection may contain several packets)
45. Projects: Data Mining in Intrusion Detection MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection) – Columbia University, Georgia Tech, Florida Tech
ADAM (Audit Data Analysis and Mining) - George Mason University
MINDS (University of Minnesota)
Intelligent Intrusion Detection – IIDS (Mississippi State University)
Data Mining for Network Intrusion Detection (MITRE corporation)
Institute for Security Technology Studies (ISTS), Dartmouth College
Intrusion Detection Techniques (Arizona State University)
Agent based data mining system (Iowa State University)
IDDM (Intrusion Detection using Data Mining Techniques) – Department of Defense, Australia
46. Data Preprocessing for Data Mining in ID Converting the data from the monitored system (computer network, host machine, …) into data (features) that will be used in data mining models
For misuse detection, labeling data examples into normal or intrusive may require enormous time for many human experts
Building data mining models
Misuse detection models
Anomaly detection models
Analysis and summarization of results
49. Misuse Detection - Evaluation of Rare Class Problems – F-value
Accuracy is not a sufficient metric for evaluation
Accuracy = (TN+TP)/(TN+FP+FN+TP)
Ex. Network traffic data set with 99.99% of normal data and 0.01% of intrusions
Trivial classifier that labels everything with the normal class can achieve 99.99% accuracy!!!
Focus on both recall and precision
Recall (R) = TP/(TP+FN)
Precision (P) = TP/(TP+FP)
F-measure = 2*R*P/(R+P)
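The metrics above can be computed directly from confusion-matrix counts. The counts below are a made-up rare-class scenario, not data from the lecture:

```python
# Hypothetical counts: 10 true intrusions hidden in 10,000 records.
TP, FP, FN, TN = 8, 4, 2, 9986

accuracy  = (TN + TP) / (TN + FP + FN + TP)   # 0.9994, dominated by TN
recall    = TP / (TP + FN)                    # 0.8: fraction of intrusions caught
precision = TP / (TP + FP)                    # ~0.667: fraction of alarms that are real
f_measure = 2 * recall * precision / (recall + precision)  # ~0.727
```

Note how a trivial all-normal classifier would score 0.999 on accuracy but 0.0 on recall, precision, and F-measure, which is why the latter metrics are used for rare classes.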
50. Misuse Detection - Evaluation of Rare Class Problems – ROC
52. Misuse Detection - Manipulating Data Records Over-sampling the rare class
Duplicate the rare events until the data set contains as many minority examples as the majority class => balances the classes
Does not add information, but increases the effective misclassification cost of the minority class
SMOTE (Synthetic Minority Over-sampling TEchnique)
Synthetically generates minority class examples
When generating artificial minority class example, distinguish two types of features
Continuous
Nominal (Categorical) features
Down-sizing (undersampling) the majority class
Sample the data records from majority class
Randomly
“Near miss” examples
Examples far from minority class examples (far from decision boundaries)
Replace the original majority class records in the data set with the sampled ones
Usually results in a general loss of information and potentially overly general rules
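The two balancing strategies above (random over-sampling of the minority class and random under-sampling of the majority class) can be sketched as follows; the labels and counts are illustrative:

```python
import random

def oversample_minority(minority, majority, seed=0):
    """Duplicate randomly chosen minority records until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def undersample_majority(minority, majority, seed=0):
    """Randomly keep only as many majority records as there are minority records."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

rare = ["attack"] * 5
common = ["normal"] * 95
over_min, over_maj = oversample_minority(rare, common)      # 95 vs 95
under_min, under_maj = undersample_majority(rare, common)   # 5 vs 5
```

SMOTE differs from the over-sampling sketch above in that it interpolates new synthetic minority examples rather than duplicating existing ones.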
53. Unsupervised Techniques – Anomaly Detection Build models of “normal” behavior and detect anomalies as deviations from it
Possible high false alarm rate - previously unseen (yet legitimate) data records may be recognized as anomalies
Two types of techniques
with access to normal data
with NO access to normal data (not known what is “normal”)
54. Outlier Detection Schemes Outlier is defined as a data point which is very different from the rest of the data based on some measure
Detect novel attacks/intrusions by identifying them as deviations from “normal” behavior
Identify normal behavior
Construct useful set of features
Define similarity function
Use outlier detection algorithm
Statistics based approaches
Distance based approaches
Nearest neighbor approaches
Clustering based approaches
Density based schemes
55. Distance-based Outlier Detection Scheme k-Nearest Neighbor approach
For each data point d, compute the distance d_k to its k-th nearest neighbor
Sort all data points according to the distance d_k
Outliers are points that have the largest distance d_k and are therefore located in the more sparse neighborhoods
Usually data points whose distance d_k is in the top n% are identified as outliers
n – user parameter
Not suitable for datasets that have modes with varying density
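The k-th nearest-neighbor scheme above can be sketched directly: score each point by its distance to its k-th neighbor and rank points by that score. The 2-D points and k=2 are illustrative assumptions:

```python
def knn_outlier_scores(points, k=2):
    """Rank points by distance to their k-th nearest neighbor (largest first)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((ds[k - 1], i))   # distance d_k to the k-th nearest neighbor
    return sorted(scores, reverse=True)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
ranked = knn_outlier_scores(pts)
top_outlier = ranked[0][1]   # index of the most isolated point
```

In practice the top n% of the ranked list would be flagged, with n chosen by the analyst.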
56. Distance-based Outlier Detection Scheme
57. Model based outlier detection schemes Use a prediction model to learn the normal behavior
Every deviation from learned prediction model can be treated as anomaly or potential intrusion
Recent approaches:
Neural networks
Unsupervised Support Vector Machines (SVMs)
58. Neural networks for outlier detection Use a replicator 4-layer feed-forward neural network (RNN) with the same number of input and output nodes
The input variables are also the output variables, so that the RNN forms a compressed model of the data during training
A measure of outlyingness is the reconstruction error of individual data points
59. Conclusions Data mining analysis of rare events requires special attention
Many real world applications exhibit “needle-in the-haystack” type of problem
Current “state of the art” data mining techniques are still insufficient for efficiently handling rare events
Need for designing better and more accurate data mining models
60. References Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, IEEE Press/Wiley, 2003
Lazarevic, A., Srivastava, J., Kumar, V., Data Mining for Computer Security Applications, IEEE ICDM 2003 Tutorial
Kantardzic, M., and Zurada, J. (Eds.), Next Generations of Data Mining Applications, IEEE Press/Wiley, 2005
Tan, P., Steinbach, M., Kumar, V., Introduction to Data Mining, Addison Wesley, 2005
Zurada, J., Knowledge Discovery and Data Mining, Lecture Notes on Blackboard, Spring 2005
61. Thank you for attending this lecture! Questions/Discussion?