
A Generic Approach to Big Data Alarms Prioritization


Presentation Transcript


  1. A Generic Approach to Big Data Alarms Prioritization Ossi Askew, Darshit Mody, Ayushi Vyas, Tiffany Branker, Pedro Vasseur, Stephan Barabassi

  2. Introduction • The process of identifying and acting upon a possible data leak in a timely manner is a continuing challenge for most organizations. • The volume of data being ingested keeps growing every day as organizations continue to place this information in very large data repositories in order to mine insightful patterns that can be used for key decision-making tasks. • To safeguard this sensitive data from unauthorized access, companies have installed Data Leak Detection (DLD) applications that monitor access to internal data repositories. • An alarm record is created in the form of a security log record whenever any anomalous data-querying behavior is detected by a DLD engine. • The alarms need to be analyzed in a timely manner by security experts to determine whether they are malicious or benign.

  3. Current State of DLD Alarm Handling [Flow diagram: query functions and applications run against Big Data tables (Hadoop); the Data Leak Detection engine, driven by access rules, emits a continuous stream of alarms ("alarm, alarm, alarm…"); security analysts then perform manual review and decision-making on the unclassified alarms.]

  4. Previous work • The first data analysis technique used was an unsupervised data clustering model, followed by identification of the determining attributes of a “data-in-motion” test sample. • The team found that the target UID path of the uploaded datasets triggered the false positives. • The team also discovered a pattern in which any file upload that took more than a second was classified as a false alarm. • The second data analysis technique, a decision tree algorithm, proved more useful across multiple environments because it could consider various attributes of the data. • The team leveraged the previous work as much as possible by applying it to the “data at rest” condition, especially in the Big Data security logs.

  5. Primary Approach To improve the efficiency and effectiveness of the analysis effort associated with data leak detection alert logs, an approach was devised using traditional data mining methods, as well as Big Data analytics techniques, to classify and prioritize true and false positives. Machine learning methods can iteratively confirm the nature and priority of the alerts, and hence reduce the time and cost incurred in the manual process of investigating and acting upon malicious data access. For prediction of binomial results, such as true or false, a decision tree model was chosen, specifically the ID3 algorithm. After reviewing several data mining tools, RapidMiner was chosen to test this algorithm.
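The classify-then-prioritize idea above can be sketched in a few lines: once a model has labeled each alarm as a likely true or false positive, prioritization is simply ordering the review queue so predicted-true alarms come first. The field names and confidence scores below are illustrative, not the project's real schema (the actual work used RapidMiner rather than hand-written code).

```python
# Sketch: order an alarm queue so predicted true positives are
# reviewed first, highest model confidence first within each group.
# Field names ("predicted_true", "confidence") are hypothetical.

def prioritize(alarms):
    """Return alarms sorted with predicted true positives first,
    then by descending model confidence within each group."""
    return sorted(alarms,
                  key=lambda a: (not a["predicted_true"], -a["confidence"]))

queue = [
    {"id": 1, "predicted_true": False, "confidence": 0.97},
    {"id": 2, "predicted_true": True,  "confidence": 0.70},
    {"id": 3, "predicted_true": True,  "confidence": 0.95},
    {"id": 4, "predicted_true": False, "confidence": 0.60},
]

for alarm in prioritize(queue):
    print(alarm["id"])   # 3, 2, 1, 4
```

The security analysts then work the top of this queue first, which is the "reduce time and cost" effect the slide describes.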

  6. Primary Approach - continued ID3 (Iterative Dichotomiser 3) builds a decision tree from a fixed set of examples; the resulting tree is used to classify future samples. The examples in the given Example Set have several attributes, and every example belongs to a class (such as yes or no). The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node. The decision node is an attribute test, with each branch (to another decision tree) being a possible value of the attribute. ID3 uses a feature selection heuristic to decide which attribute goes into a decision node; the heuristic can be selected by a criterion parameter. * * From RapidMiner official documentation
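The heuristic the slide mentions is, in ID3's classic form, information gain: the attribute that most reduces the entropy of the class labels becomes the decision node. A minimal sketch, with a toy alarm set whose attribute names ("role", "component") are illustrative stand-ins for the project's real log fields:

```python
from collections import Counter
from math import log2

# Sketch of ID3's attribute-selection heuristic: pick the attribute
# with the highest information gain over the class labels.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target="malicious"):
    labels = [r[target] for r in records]
    gain = entropy(labels)
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

alarms = [
    {"role": "admin",   "component": "hdfs", "malicious": True},
    {"role": "admin",   "component": "hive", "malicious": True},
    {"role": "analyst", "component": "hdfs", "malicious": False},
    {"role": "analyst", "component": "hive", "malicious": False},
]

# "role" perfectly separates the classes here, so it wins the split.
best = max(["role", "component"], key=lambda a: information_gain(alarms, a))
print(best)  # role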

  7. Secondary Approach Provide an automated feedback mechanism connecting the Data Leak Detection rules engine, the decision tree training set, and the subject matter expert, in order to improve prediction accuracy through continued learning. The automated feedback should be executed by a programmed algorithm that compares the determining attributes and associated target value of each record in the training set to the corresponding record attributes and associated target in the DLD access rules engine.

  8. Desired State of Alarm Handling [Flow diagram: query functions and applications run against Big Data tables (Hadoop); the Data Leak Prevention rules engine, driven by access rules, emits the raw alarm stream ("alarm, alarm, alarm…"); a predicting and prioritizing algorithm classifies it ("true alarm, true alarm, false alarm, false alarm…"); the security analyst makes the final decision on the classified alarms and provides manual feedback (component j), while automated feedback (component i) flows back into the rules engine.]

  9. Results [Figure: the initial trained decision tree.]

  10. Results - continued The alarms generated by the DLD engine were categorized and prioritized as true and false positives. The RapidMiner tool attained accuracy of over 90 percent after re-selecting the key contributing attributes for the decision tree algorithm and filtering some input variables. After several selections and trials, the team discovered that the ID3 model improved its prediction capability by using the attribute selection of Alarm, Component Accessed, and Role, with an adjusted learning criterion of gain_ratio, a minimal size for split of 2, a minimal leaf size of 2 instead of 4, and a minimal gain increased gradually from 0.10 up to 0.90. This created a new learned tree structure with Role as the root node instead of Component Used.
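The gain_ratio criterion the tuning settled on normalizes information gain by an attribute's "split information," which penalizes attributes with many distinct values (a common failure mode of plain information gain). A minimal sketch on illustrative data; the record fields are hypothetical, not the project's real log schema:

```python
from collections import Counter
from math import log2

# Sketch of the gain_ratio criterion: information gain divided by
# the split information (entropy of the attribute's own values).

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(records, attribute, target="alarm_true"):
    labels = [r[target] for r in records]
    gain = entropy(labels)
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        gain -= len(subset) / len(records) * entropy(subset)
    split_info = entropy([r[attribute] for r in records])
    return gain / split_info if split_info else 0.0

records = [
    {"role": "admin",   "alarm_true": True},
    {"role": "admin",   "alarm_true": True},
    {"role": "analyst", "alarm_true": False},
    {"role": "analyst", "alarm_true": False},
]
print(gain_ratio(records, "role"))  # 1.0
```

With this criterion an attribute like Role can outrank a many-valued attribute even when their raw gains are similar, which is consistent with Role displacing Component Used as the root node.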

  11. Results - continued A prototype of component i, the automated feedback, was proposed for future testing.

  12. Conclusion/Summary The subject matter experts, the security analysts, will spend less time sorting and reviewing true alarms, since the false alarms have been identified for later, non-urgent review. It is proposed that the time and effort required to manually update the Access Rules violation decision table and the model’s training set will be minimized by developing a programmatic approach to generate component i that could efficiently replace component j. This approach can be generalized and applied to other types of DLDs.

  13. Future Work Design an efficient approach that can anonymize data security alarm records for scoring purposes. Continue enhancing the algorithm for iterative machine learning and re-training, to keep improving the confirmation and prioritization of anomalous querying of Big Data repositories, based on • increasing the accuracy of prediction to over 95% in the decision tree component of the algorithm • the programmatic comparison of existing rules and confirmed alarms using the formula described on the next slide

  14. Future Work - continued Logic for the execution of component i: If {TSIV1, TSIV2, …, TSIVn, TSTV} of x ≠ {ARIV1, ARIV2, …, ARIVn, ARTV} of y, then execute component i, where: TSIV is the value of a contributing Training Set Independent Variable; TSTV is the value of a Training Set Target Variable (e.g., True or False); ARIV is the value of the corresponding Access Rules Independent Variable; ARTV is the value of the corresponding Access Rules Target Variable (e.g., True or False); 1 to n indexes the distinct Independent Variable values in an instance of the Training Set; x is a record in the Training Set; and y is a record in the Access Rules table.
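The trigger condition above can be sketched directly: component i fires whenever a training-set record disagrees with its corresponding access-rules record on any independent variable or on the target value. The dictionary keys below are illustrative stand-ins for TSIV1…TSIVn/TSTV and ARIV1…ARIVn/ARTV.

```python
# Sketch of the component-i trigger: compare a training-set record x
# against the corresponding access-rules record y and fire the
# automated feedback on any mismatch. Key names are hypothetical.

def needs_feedback(training_record, access_rule_record, keys):
    """True when x and y differ on any independent variable or on
    the target value, i.e. component i should execute."""
    return any(training_record[k] != access_rule_record[k] for k in keys)

keys = ["role", "component_accessed", "target"]
x = {"role": "admin", "component_accessed": "hive", "target": True}
y = {"role": "admin", "component_accessed": "hive", "target": False}

print(needs_feedback(x, y, keys))  # True: the target values disagree
```

When the comparison returns True, the automated feedback would push the training-set determination back into the access rules engine, replacing the manual update of component j.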

  15. Q & A

  16. Thank you.
