User Modeling Through Symbolic Learning: The LUS Method and Initial Results

Guido Cervone

Ken Kaufman

Ryszard Michalski

Machine Learning and Inference Laboratory

School of Computational Sciences

George Mason University

Fairfax, VA, USA

{cervone, kaufman, michalski}@gmu.edu

http://www.mli.gmu.edu

Research Objectives

The main objectives of this research are:

(1) To develop a new methodology for user modeling, called LUS (Learning User Style)

(2) To test and evaluate LUS on datasets consisting of real user activity

(3) To implement an experimental computer intrusion detection system based on the LUS methodology

Main Features of LUS
  • User models are created automatically through a process of symbolic inductive learning from training data sets characterizing users’ interaction with computers
  • Models are in the form of symbolic descriptions based on attributional calculus, a representation system that combines elements of propositional logic, first-order predicate logic, and multiple-valued logic
  • Generated user models are easy to interpret by human experts, and can thus be modified or adjusted manually
  • Generated user models are evaluated automatically on testing data sets using an episode classifier
Terminology
  • An event is a description of an entity (e.g., a user activity) at a given time or during a given time period, represented by a vector of attribute values characterizing the user's interaction with the computer at that time.
  • A session is a sequence of events characterizing a user’s interaction with the computer from logon to logoff.
  • An episode is a sequence of user states extracted from a session that is used for training or testing/execution of user models; it may contain consecutive states, or selected states drawn from one or more sessions.
  • In the training phase (during which user models are learned), it is generally desirable to use long episodes, as this helps to generate more accurate and complete user models. In the testing (or execution) phase, it is desirable to be able to use short episodes, so that a legitimate or illegitimate user can be identified from as little information as possible.
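The terminology above can be illustrated with a small sketch. All names and structures here are purely illustrative, not part of the LUS implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

# An event: a vector of attribute values describing user activity at one time.
Event = Tuple[str, ...]  # e.g., ("compiler",) in the single-attribute case

@dataclass
class Session:
    """A sequence of events from logon to logoff for one user."""
    user: int
    events: List[Event]

def make_episode(session: Session, start: int, length: int) -> List[Event]:
    """An episode: here, a consecutive subsequence of a session's events,
    usable for training or testing user models."""
    return session.events[start:start + length]

s = Session(user=1, events=[("compiler",), ("print",), ("web",), ("print",)])
episode = make_episode(s, start=0, length=3)  # first three states
```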
Approach
  • System polls active processes every half-second and logs information on the processes and the users responsible for them
  • Data extracted from the logs takes the form of vectors of values of nominal, temporal and structured attributes
  • Initial experiments concentrated on one attribute, mode, a derived attribute based on the class of process that was running (e.g., compiler)
  • Data from successive records are combined into n-grams, e.g., <compiler, print, web, print>
  • Sets of n-grams comprising an episode are passed to the AQ20 learner
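The n-gram construction described above can be sketched as follows. This is a minimal illustration; the function name and example modes are assumptions:

```python
def extract_ngrams(modes, n=4):
    """Slide a window of length n over a session's mode sequence.

    modes: list of mode values (e.g., 'compiler', 'print', 'web')
    Returns the list of n-grams, each a tuple of n consecutive modes.
    """
    return [tuple(modes[i:i + n]) for i in range(len(modes) - n + 1)]

session_modes = ["compiler", "print", "web", "print", "mail"]
ngrams = extract_ngrams(session_modes, n=4)
# -> [('compiler', 'print', 'web', 'print'), ('print', 'web', 'print', 'mail')]
```

A sequence of k modes thus yields k - n + 1 overlapping n-grams, which together form the episode passed to the learner.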
AQ20 Algorithm Application
  • Each training n-gram is used as an example of the class representing the user whose activity it reflects.
  • To learn a user’s profile, AQ20 divides the n-grams into positive examples (examples representing the user whose profile is being learned) and negative examples (examples representing other users’ activities)
  • AQ20 searches for maximal conjunctive rules that cover positive examples, but not negative ones, and selects the best ones according to user-specified criteria
  • The rule: [User = 1] if [mode1 = compiler] and [mode2 = print] and [mode4 = print] will be returned in the form: [User = 1] <= <compiler, print, *, print>
  • Rules and conditions may be annotated with weights (e.g., p, n, u)
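The n-gram pattern form of a rule, with "*" matching any mode, can be matched against data n-grams as in this sketch. It is illustrative only; AQ20's actual rule representation and matching (including the p, n, u weights) are richer:

```python
def matches(pattern, ngram):
    """Check whether an n-gram satisfies an attributional pattern.

    pattern: a sequence of sets of allowed modes per position;
    None stands for '*' (any value), as in <compiler, print, *, print>.
    """
    return all(allowed is None or mode in allowed
               for allowed, mode in zip(pattern, ngram))

# [User = 1] <= <compiler, print, *, print>
rule_user1 = ({"compiler"}, {"print"}, None, {"print"})

matches(rule_user1, ("compiler", "print", "web", "print"))   # -> True
matches(rule_user1, ("compiler", "mail", "web", "print"))    # -> False
```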
EPICn: Episode Classification and User Identification by Matching Episodes with n-gram Patterns
  • EPICn matches episodes with n-gram-based patterns of different users’ behavior and computes a degree of match for each user
  • EPICn employs the ATEST program for matching individual events with patterns
  • The results from ATEST for each n-gram in the episode are aggregated to give overall episode scores for each class (profile)
  • EPICn allows flexible classification: all classes whose scores are both above the episode threshold and within the episode tolerance of the best achieved score are returned as classifications
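EPICn's threshold-tolerance classification can be sketched as follows. The aggregation of per-n-gram ATEST results into per-class scores is assumed already done, and the parameter values are illustrative:

```python
def classify_episode(scores, threshold=0.5, tolerance=0.1):
    """Return all classes whose aggregated episode score is above the
    episode threshold AND within the episode tolerance of the best score.

    scores: dict mapping user/class -> aggregated degree of match in [0, 1].
    """
    if not scores:
        return []
    best = max(scores.values())
    return sorted(u for u, s in scores.items()
                  if s >= threshold and best - s <= tolerance)

classify_episode({"user0": 0.82, "user1": 0.78, "user2": 0.40})
# -> ['user0', 'user1']  (user2 fails the threshold; user1 is within tolerance)
```

Returning a set of candidate classes, rather than only the single best match, is what allows the "could not be classified" outcome reported later: several users' scores may be too close to separate.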
Experiments
  • Two sets of preliminary experiments were performed for different training and testing data sizes.
    • Small: First 7 users (SD)
    • Large: All 23 users (LD)
  • Rules were learned with AQ19 and AQ20, using different control parameters (TF and PD modes, and LEFs: 3 different ones each for SD and LD)
  • EPICn was used to test the learned hypotheses.
Data Used in the Experiments
  • 24 users for a total of 4,808,024 4-grams.
  • Each user has a different number of sessions, each varying in length.
  • The data contains many repetitions.
  • This is by far the largest dataset AQ20 has been applied to.

Experiment 1

A Sample of Results from AQ20 (7 Users)

[user = 0]
<{explorer,web,office,sql,rundll32,system,time,install},
 {explorer,web,logon,rundll32,system,time,install},
 {explorer,web,office,logon,printing,rundll32,system,time,install},
 {web,office,rundll32,system,time,install,multimedia}>
: pd=171, nd=52, ud=27, pt=2721, nt=710, ut=160, qd=0.372459, qt=0.60304

[user = 1]
<{netscape,msie,telnet,explorer,web,acrobat,logon,system,welcome,help},
 {netscape,msie,telnet,explorer,web,acrobat,logon,rundll32,welcome,help},
 {netscape,msie,telnet,explorer,web,acrobat,logon,printing,welcome,dos,help},
 {netscape,msie,telnet,explorer,web,acrobat,logon,welcome,dos,help}>
: pd=260, nd=54, ud=28, pt=20713, nt=132, ut=2019, qd=0.610064, qt=0.986564

...................


Distribution of Positive and Negative Events in the Training Set for Each User (80% of the total data; the remaining 20% constituted the testing dataset)

Predictive Accuracy of User Models Generated Using PD Mode and LEF1 (MaxNewPositives,0; MinNumSelectors,0)
Predictive Accuracy of User Models Generated Using PD Mode and LEF2 (MaxQ,0; MaxNewPositives,0; MinNumSelectors,0)
Predictive Accuracy of User Models Generated Using PD Mode and LEF3 (MaxTotQ,0; MaxNewPositives,0; MinNumSelectors,0)
Sample Rules for User 0 (PD mode, LEF1)
# -- This learning took:
# -- System time 10.45
# -- User time 10
# -- Number of stars generated = 46
# -- Number of rules for this class = 42
# -- Average number of rules kept from each stars = 1
# -- Size of the training events in the target class:       345
# -- Size of the training events in the other class(es):    5236
# -- Size of the total training events in the target class: 3573
# -- Size of the total training in the other class(es):     616828

[User = 0]

  • <{mail,office,printing,rundll32,system,time,install}{web,rundll32,system,time,install}{explorer,web,mail,office,logon,rundll32,system,install,multimedia} {explorer,web,office,logon,sql,rundll32,system,help,install,multimedia}>        : pd=149,nd=20,ud=22,pt=2490,nt=75,ut=37,qd=0.377406,qt=0.676398
  • <{explorer,web,office,sql,rundll32,system,time,install,multimedia} {explorer,web,office,logon,sql,rundll32,system,time,install}    {web,office,logon,sql,printing,rundll32,system,time,install}{web,rundll32,system,time,install}>        : pd=136,nd=30,ud=8,pt=2481,nt=1148,ut=14,qd=0.318267,qt=0.473443
  • <{explorer,web,rundll32,system,multimedia}{explorer,system,time,install}{explorer,rundll32,system,time,install,multimedia}{explorer,rundll32,system,time,install,multimedia}>        : pd=107,nd=21,ud=32,pt=2453,nt=930,ut=474,qd=0.255909,qt=0.496713
Experiment 2
  • In this experiment hypotheses were generated to describe the behavior of all 24 users
  • The training set consisted of approximately 4 million 4-grams
  • The testing set consisted of approximately 1 million 4-grams
Description of Experiment 2
  • Experiments were performed using 20% and 100% of the training set (the training set itself comprised 80% of the available sessions)
  • Experiments were performed in PD and TF modes
  • Three different LEFs were used:
    • LEF1: (TF MODE) <MaxNewPositives,0; MinNumSelectors,0>
    • LEF2: (TF MODE) <MaxEstimatedPositives,0; MinEstimatedNegatives,0; MaxNewPositives,0; MinNumSelectors,0>
    • LEF3: (PD MODE) <MaxQ,0; MaxNewPositives,0; MinNumSelectors,0>
Experiment 2 (continued)
  • When combining all of a user’s testing data into a single long episode, out of the 24 users:
    • 20 users were classified correctly
    • 3 users could not be classified because the degrees of match of the best-scoring users were insufficiently separated
    • 1 user was classified incorrectly
Sample Rules for User 0
# -- This learning took:
# -- System time 767.15
# -- User time 768
# -- Number of stars generated = 57
# -- Number of rules for this class = 52
# -- Average number of rules kept from each stars = 1
# -- Size of the training events in the target class:       346
# -- Size of the training events in the other class(es):    71931
# -- Size of the total training events in the target class: 1826
# -- Size of the total training in the other class(es):     3750169

[user=0] <- <explorer,install,multimedia,system,time> <multimedia,system> <explorer,install,system> <explorer,install,multimedia,system>
  : pd=64,nd=31,ud=8,pt=916,nt=404,ut=11,qd=0.124322,qt=0.348035 # 18648
<- <explorer,install,office,rundll32,system,time> <multimedia,system> <install,multimedia,rundll32,system,time> <explorer,install,rundll32,system,time>
  : pd=68,nd=42,ud=9,pt=919,nt=73,ut=11,qd=0.121131,qt=0.466232 # 24747
<- <explorer,help,install,mail,multimedia,rundll32,system,time,web> <help,install,logon,mail,office,rundll32,system,time,web> <help,install,mail,office,printing,rundll32,system,time,web> <help,install,rundll32,system,time>
  : pd=140,nd=343,ud=41,pt=1316,nt=701,ut=66,qd=0.1159,qt=0.470102 # 5068
<- <install,office,printing,system> <install,rundll32,time> <install,multimedia,office,sql,system,web> <explorer,install,multimedia,rundll32,system,web>
  : pd=43,nd=4,ud=2,pt=397,nt=4,ut=2,qd=0.11365,qt=0.215245 # 7642
Best Rule for User 23
# -- This learning took:
# -- System time -39.9073
# -- User time 4256
# -- Number of stars generated = 658
# -- Number of rules for this class = 533
# -- Average number of rules kept from each stars = 1
# -- Size of the training events in the target class:       9712
# -- Size of the training events in the other class(es):    40602
# -- Size of the total training events in the target class: 1337548
# -- Size of the total training in the other class(es):     2063808

[user=23] <- <ControlPanel,activesync,id,mail,multimedia,netscape,network,spreadsheet,system,wordprocessing> <ControlPanel,activesync,explorer,id,logon,mail,msie,multimedia,netscape,network,printing,spreadsheet,web,wordprocessing> <ControlPanel,activesync,mail,multimedia,netscape,printing,spreadsheet,wordprocessing> <ControlPanel,activesync,mail,multimedia,netscape,spreadsheet,web,wordprocessing>
  : pd=4685,nd=878,ud=975,pt=1296647,nt=1166,ut=34254,qd=0.388046,qt=0.967985 # 3524022
Experiments with Smaller Test Episodes
  • In experiments with 150 session-sized testing episodes, performed with both traditional "best-match-only" classification and threshold-tolerance matching, identification accuracy was as follows:
    • Traditional ATEST (Rform) scoring, threshold-tolerance matching: 169 classifications, 75 correct, 84 incorrect
    • Traditional, best only matching: 71 (47.3%) correct
    • Simple scoring, threshold-tolerance matching: 165 classifications, 117 correct, 48 incorrect
    • Simple scoring, best-only matching: 112 (74.7%) correct
Prediction-Based Approach

In the prediction-based approach, events characterizing a user are pairs

<predecessor, successor>,

where:

predecessor is a sequence of lb states of the user (in the experiments, modes) that directly precede a given time instance t, and

successor is a sequence of lf states of the user (in the experiments, modes) that occur immediately after t.

Parameters lb and lf, called look-back and look-forward respectively, are determined experimentally.
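The construction of <predecessor, successor> events can be sketched as follows (the function name and example data are assumptions):

```python
def prediction_events(modes, lb=2, lf=1):
    """Build <predecessor, successor> pairs from a mode sequence.

    lb (look-back): length of the state sequence directly preceding time t.
    lf (look-forward): length of the state sequence immediately after t.
    Both parameters are determined experimentally.
    """
    pairs = []
    for t in range(lb, len(modes) - lf + 1):
        pairs.append((tuple(modes[t - lb:t]), tuple(modes[t:t + lf])))
    return pairs

prediction_events(["web", "mail", "compiler", "print"], lb=2, lf=1)
# -> [(('web', 'mail'), ('compiler',)), (('mail', 'compiler'), ('print',))]
```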

An Initial Small Experiment

Rules were learned using the decomposition model with look-back values of 1, 2, 3, and 4. The results provided by EPICp were as follows:

CONFUSION MATRIX

        Data-1   Data-2   Data-3
User 1:    374       86       66
User 2:    202      141      130
User 3:    176       97      557
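Read column-wise, each test data set is attributed to the user with the largest count; a small sketch checking this (reading the matrix entries as raw match counts is an assumption):

```python
# Confusion matrix from the slide: rows are users, columns are test data sets.
matrix = {
    "User 1": [374, 86, 66],
    "User 2": [202, 141, 130],
    "User 3": [176, 97, 557],
}

# For each test data set (column), pick the user with the highest count.
predictions = [max(matrix, key=lambda u: matrix[u][col]) for col in range(3)]
# -> ['User 1', 'User 2', 'User 3']: each data set maps to its own user.
```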

Topics for Further Research
  • Comparative study of the n-gram-based methodology for currently available datasets using different control parameters
  • Study performance degradation on reduced session size
  • Annotate process tables with window information
  • Testing the ability to identify unknown users
  • Development and implementation of a prediction-based approach using a dedicated sequential pattern discovery program (SPARCum)
  • Employment of multivariate representation, e.g., <mode, process name, time>
  • Improving the representational space through constructive induction
  • Handling drift and shift of user models
  • Coping with incremental growth and change in the user population
Conclusions
  • The LUS methodology uses symbolic learning to generate user signatures
  • Unlike traditional classifiers, EPICn classifies based on episodes rather than individual events
  • Initial experiments have been promising, but several real world situations have yet to be addressed in full
  • Multistrategy approaches may lead to further performance improvement