CISA

CISA. Continually Improving Stream Analysis Nancy McMillan Doug Mooney Dave Burgoon March 14, 2003. Agenda. Background and Overview Architecture Algorithms Results. MURALS: Multiple Use Real-time Analytics for Large Scale Data. Major information technology initiative


CISA

Continually Improving Stream Analysis

Nancy McMillan

Doug Mooney

Dave Burgoon

March 14, 2003


Agenda

  • Background and Overview

  • Architecture

  • Algorithms

  • Results


MURALS: Multiple Use Real-time Analytics for Large Scale Data

  • Major information technology initiative

    • Objective: Develop intellectual property addressing the challenges created by:

      • Data generation/collection at previously unimaginable rates

      • Growing expectation that real time decision-making is feasible and necessary for competitive advantage

      • Dramatic increase in the data to information ratio

      • Compelling need for balance between result precision and timeliness

  • Sponsored development of two technologies

    • InfoRes: Addresses IT issues associated with real-time querying of very large relational databases

    • CISA: Addresses IT issues associated with real-time analysis of high volume (varying arrival speed) stream data


Background: Our problem space

  • Many data sources supplying stream data

  • Stream data can be summarized by a set of features/summary statistics over some time window

  • Each data source needs to be continually classified or characterized

  • Classification/characterization of a single data source may depend on data from other data sources

  • Examples:

    • Computers connecting to a firewall

    • Sensor networks


Internet Security Example: Who is trying to inappropriately access a company’s network?

  • There are 19 firewalls recording connections in a log file

    • Date/Time

    • Source and Destination IP addresses

    • Protocol

    • Action (Accept/Drop/Decrypt/..)

    • Service

    • Rule

  • Inbound and outbound connections and warnings over a six day period in July 2002 were logged

    • but connections from site to site VPNs are not

    • only externally initiated connections are being analyzed

    • more data (6 days in September) were provided later


The Problem: The faster data arrives, the more processing power required for real-time analysis.

  • Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), each of which requires computational time

    • Systems designed for gushing data waste resources when data trickles.

    • Systems designed for slower data flow fail when data arrives too fast.

  • More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers

    • Analytics designed for gushing data don’t provide the best answer possible when data trickles.

    • Analytics designed for slower data flow don’t provide timely answers when data arrives too fast.

  • To what data arrival rate should the system be designed?


The CISA Answer: A precision-speed trade-off

  • When the data arrives more slowly than the system design rate, the best possible answer is provided

    • All data is considered.

    • Best analysis techniques are used.

  • As the data flows faster than the system design rate the accuracy and/or precision of the solution degrades smoothly.

  • System achieves precision-speed trade-off through:

    • Architecture

      • Answer not based on all current data

      • Requires feedback from algorithm so most important data is considered

    • Algorithms

      • Partial/approximate solutions provided


Architecture and Algorithm Overview: How CISA achieves precision-speed trade-off

  • Architecture

    • Assign analysis tasks to asynchronously operating objects

      • storage, characterization, decision-making, and visualization

    • Prioritize analysis tasks associated with each new piece of data

      • Data likely to impact analysis is analyzed sooner

  • Algorithm

    • Use incremental algorithms where possible

      • Update previous answer with new data rather than re-analyze all data

    • Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter algorithm

      • Partial/approximate solutions provided
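
The prioritization idea on this slide can be illustrated with a small sketch. It is not the CISA implementation (which the slides describe as Java objects communicating over JMS); it is a minimal single-process stand-in in Python. The class name `PrioritizedTaskQueue`, the `budget_seconds` parameter, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
import heapq
import itertools
import time


class PrioritizedTaskQueue:
    """Min-heap of analysis tasks; a lower priority number is analyzed sooner."""

    def __init__(self):
        self._heap = []
        # Unique counter breaks ties so equal-priority tasks stay in FIFO order
        # and the task callables themselves are never compared.
        self._counter = itertools.count()

    def put(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def run(self, budget_seconds):
        """Run tasks in priority order until the time budget is spent.

        Returns (results, number_of_tasks_left_unprocessed). When data arrives
        faster than it can be processed, the low-priority tasks are the ones
        left waiting, so the answer degrades gradually instead of failing.
        """
        deadline = time.monotonic() + budget_seconds
        results = []
        while self._heap and time.monotonic() < deadline:
            _, _, task = heapq.heappop(self._heap)
            results.append(task())
        return results, len(self._heap)


queue = PrioritizedTaskQueue()
queue.put(2, lambda: "update decision")
queue.put(0, lambda: "store data")
queue.put(1, lambda: "recalculate features")
done, waiting = queue.run(budget_seconds=5.0)
```

With a generous budget every task runs, in priority order. With a budget of zero nothing runs and all tasks remain queued, which is the extreme case of the trade-off.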


Agenda

  • Background and Overview

  • Architecture

  • Algorithms

  • Results


CISA Architectural Components: Diagram


Internet Security Example Architecture: Diagram

Diagram labels: Java, Access database, JMS object communication, SAS Analytics


Advantages / Issues: Related to rapid prototyping decisions

  • JMS

    • Advantages

      • Asynchronous

      • Prioritized Lists

      • Open Source / Off-the-shelf

      • Platform Independent

    • Issues

      • Slow – system resources, "thrashing", db, (network speeds)

      • JMS implementations vary slightly

  • Access

    • Advantages

      • Easy communication with Java

      • Easily and quickly developed data storage and feature calculation

    • Issues

      • Slow

      • Not available on many platforms


Agenda

  • Background and Overview

  • Architecture

  • Algorithms

  • Results


Candidate CISA Algorithms: A very broad group of statistical methods…

  • Feature characteristics

    • Relies on more than one feature

    • Some of the individual features take time to compute or measure

    • Meaningful nested "sub-algorithms" can be built on increasing sets of features

  • Data source characteristics

    • The algorithm can efficiently update its current solution when feature values for only a small group of source objects change

    • There is a natural method for prioritizing objects


Construction Methodologies: General

  • Feature Priority

    • Order features (statically)

    • Create series of nested models that use an increasing number of features

    • Develop a function to assign priorities based on feature order and current object classification

  • Data Source Priority

    • Order data sources (dynamically)

    • Assign priorities based on uncertainty of classification or cost of misclassification

    • Incremental algorithms are usually essential

  • Combinations of Both
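
As a rough illustration of data source priority, the sketch below ranks sources by how uncertain their current classification is, weighted by a misclassification cost. The scoring formula (cost times one minus the largest class probability), the field names `probs` and `cost`, and the example addresses are assumptions for illustration; the slides do not specify how priorities are computed.

```python
def source_priority(class_probabilities, misclassification_cost=1.0):
    """Score a data source; a higher score means it should be analyzed sooner.

    Uncertainty is taken as 1 minus the probability of the most likely class
    (an assumed measure, chosen for simplicity).
    """
    return misclassification_cost * (1.0 - max(class_probabilities.values()))


def order_sources(sources):
    """Return sources sorted from highest to lowest priority."""
    return sorted(
        sources,
        key=lambda s: source_priority(s["probs"], s.get("cost", 1.0)),
        reverse=True,
    )


# Hypothetical sources; addresses come from the documentation-only range.
sources = [
    {"ip": "192.0.2.1", "probs": {"normal": 0.9, "suspicious": 0.1}},
    {"ip": "192.0.2.2", "probs": {"normal": 0.5, "suspicious": 0.5}},
    {"ip": "192.0.2.3", "probs": {"normal": 0.8, "suspicious": 0.2}, "cost": 5.0},
]
ranked = [s["ip"] for s in order_sources(sources)]
```

Because the ordering is recomputed from the current classifications, it is dynamic in the sense used on the slide: a source moves down the list as its classification becomes more certain.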


Construction Methodologies: Examples

  • Feature Priority: Decompose an algorithm into subalgorithms that use subsets of features. Prioritize feature computation.

    • Example: Decision tree using X1,X2,… , Xn

    • Prioritize order of Xi computation based on tree structure

    • Use pruned trees to classify:

      {X1}, {X1,X2}, {X1, X2, X3}, …, {X1, X2, …, Xn}

  • Data Source Priority:

    • Example: Cluster analysis—All features needed

    • Objects with incomplete feature sets get higher priority

    • Objects with more uncertain classifications get higher priority
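
The nested feature sets {X1}, {X1, X2}, … can be sketched as a list of models ordered by how many features each one needs; an object is classified by the largest model whose features have already been computed. The function name, the two toy rules, and every threshold below are hypothetical and only stand in for pruned decision trees.

```python
def classify_with_available_features(nested_models, features):
    """Classify with the largest nested model whose features are all present.

    nested_models: list of (required_feature_names, classify_fn), ordered from
    the smallest feature set to the largest.
    Returns (label, level); level 0 with label None means no model could run.
    """
    best = None
    for level, (required, model) in enumerate(nested_models, start=1):
        if not all(name in features for name in required):
            break  # larger models need this feature too, so stop here
        best = (level, model)
    if best is None:
        return None, 0
    level, model = best
    return model(features), level


def level1(f):
    # Pruned tree using only X1 (assumed here to be a drop fraction).
    return "suspicious" if f["x1"] > 0.5 else "normal"


def level2(f):
    # Fuller tree using X1 and X2 (assumed here to be a count of services).
    if f["x1"] > 0.5 and f["x2"] > 10:
        return "port scan"
    return level1(f)


models = [(("x1",), level1), (("x1", "x2"), level2)]
early = classify_with_available_features(models, {"x1": 0.9})
later = classify_with_available_features(models, {"x1": 0.9, "x2": 40})
```

When only X1 is available the object still receives an answer from the smaller model; once X2 arrives the answer is refined, which is the sense in which the classification improves as feature computation catches up.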


Feature Priority Construction: Decision tree example


Agenda

  • Background and Overview

  • Architecture

  • Algorithms

  • Results


Internet Security Example: Who is trying to inappropriately access the company’s network?

  • There are 19 firewalls recording connections in a log file

    • Date/Time

    • Source and Destination IP addresses

    • Protocol

    • Action (Accept/Drop/Decrypt/..)

    • Service

    • Rule

  • Inbound and outbound connections and warnings over a six day period in July 2002 were logged

    • but connections from site to site VPNs are not

    • only externally initiated connections are being analyzed

    • more data (6 days in September) were provided later


External Network Connectors: Summary statistics/features

  • Quickly calculated features

    • % Drop

    • % Accept

    • Hits/Sec

    • # Hits

  • More time consuming features

    • # Different Services

    • Different Services/Hit

    • # Different IPs

    • Different IPs/Hit
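
A minimal sketch of how these features might be computed for one external source over a time window. The record keys (`action`, `service`, `dest_ip`), the `window_seconds` parameter, the reading of Hits/Sec as hits divided by window length, and the reading of "Different IPs" as distinct destination addresses are all assumptions; the slides list the feature names but not their exact definitions.

```python
def quick_features(records, window_seconds):
    """Counts and ratios that need only a single pass with simple counters."""
    hits = len(records)
    if hits == 0:
        return {"hits": 0, "pct_drop": 0.0, "pct_accept": 0.0, "hits_per_sec": 0.0}
    drops = sum(1 for r in records if r["action"] == "Drop")
    accepts = sum(1 for r in records if r["action"] == "Accept")
    return {
        "hits": hits,
        "pct_drop": 100.0 * drops / hits,
        "pct_accept": 100.0 * accepts / hits,
        "hits_per_sec": hits / window_seconds,
    }


def slow_features(records):
    """Distinct-value features, which require tracking sets of seen values."""
    hits = len(records)
    services = {r["service"] for r in records}
    dest_ips = {r["dest_ip"] for r in records}
    return {
        "different_services": len(services),
        "services_per_hit": len(services) / hits if hits else 0.0,
        "different_ips": len(dest_ips),
        "ips_per_hit": len(dest_ips) / hits if hits else 0.0,
    }


# Hypothetical log records for one source over a 2-second window.
records = [
    {"action": "Drop", "service": 22, "dest_ip": "192.0.2.1"},
    {"action": "Drop", "service": 23, "dest_ip": "192.0.2.2"},
    {"action": "Drop", "service": 22, "dest_ip": "192.0.2.3"},
    {"action": "Accept", "service": 80, "dest_ip": "192.0.2.1"},
]
quick = quick_features(records, window_seconds=2.0)
slow = slow_features(records)
```

Splitting the two groups into separate functions mirrors the slide: the quick features can be refreshed on every arrival, while the distinct-value features can be deferred when data arrives too fast.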


Dates: 7/21/02 - 7/27/02

  • Slow Port and IP Scans (N=3): High Services; High Number of IPs; High Number of Hits; Low Hits/Sec; Large Drop %

  • Suspicious (N=4636): Large Drop %; Medium IP/Hit; Low everything else

  • Fast IP Address Scans (N=10): Low Services; High Number of Hits; High IP/Hit; High Number of Hits/Sec; Large Drop %; Mostly Foreign; Represent 40% of External Connections

  • Normal (N=7828): High Accept %

  • Suspicious-Too Early to Tell (N=8055): Large Drop %; High IP/Hit; Few Hits

  • Port Scans (N=36): High Services; Large Drop %


External Network Connectors: Classifications

70%-80% of IPs stay in same group from day to day.


External Network Connectors: Rule-based, feature priority classification algorithm

Priority


Precision-Speed Trade-off: Expected results

Plot: % (0 to 100) versus connections per second. Series: correctly classified (same level algorithm), correctly classified (different level algorithm), consistently classified, inconsistently classified.


Precision-Speed Trade-off: Observed results


External Network Connectors: Dynamic, data source priority algorithm

  • Traditional cluster analysis (e.g., K-means) is time consuming on large datasets

  • Incremental clustering algorithm required for reasonable performance

  • Our approach:

    • After first cluster analysis, use centroid locations to seed the next analysis

    • Used the SAS procedure FASTCLUS for proof-of-concept purposes
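
The seeding approach can be sketched in plain Python. This is not the SAS FASTCLUS procedure used in the proof of concept; it is a small Lloyd-style k-means written for illustration, and the function name `kmeans`, its return values, and the toy data are assumptions.

```python
def kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm started from the given seed centroids.

    Returns (centroids, assignments, passes), where passes counts the
    assign-and-update rounds performed before the assignments stopped changing.
    """
    centroids = [tuple(s) for s in seeds]
    assignments = None
    for iteration in range(1, max_iter + 1):
        new_assignments = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            return centroids, assignments, iteration - 1
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments, max_iter


# First analysis: seeds are simply the first two points.
day1 = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids1, _, passes_first = kmeans(day1, seeds=day1[:2])

# Next analysis: new data has arrived; reuse the previous centroids as seeds.
day2 = day1 + [(1, 1), (11, 11)]
centroids2, _, passes_seeded = kmeans(day2, seeds=centroids1)
_, _, passes_unseeded = kmeans(day2, seeds=day2[:2])
```

On this toy data the seeded run settles in fewer passes than the run that starts from arbitrary points, because the previous centroids are already close to the new solution. Real data need not behave as cleanly.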


Dates: 8/11/02 - 8/17/02

  • Outlier: n = 1 (0.32% of connections); Extremely high services; China


Dates: 8/11/02 - 8/17/02

  • Cluster 0: n = 5207 (10.11% of connections); High Accept %; Mix Max Hits; Mix IP/Hit

  • Cluster 1: n = 2561 (17.16% of connections); High Drop %; Medium IP/Hit

  • Cluster 2: n = 7 (50.35% of connections); High Drop %; High Num Hits; High Num IPs; High Max Hits/Sec

  • Cluster 3: n = 180 (17.81% of connections); High Services and/or Max Hits/Sec; Mixed

  • Cluster 4: n = 4 (1.42% of connections); High Drop %; High Services; 94.5% of connections from Korea; 1 of 4 IPs from Korea; Average 23 sec between hits

  • Cluster 5: n = 5104 (2.82% of connections); High IP/Hit; High Drop %


External Network Connector Classifications: Dashboard report

Report columns: Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned, % of Sources, % Connections


External Network Connector Classifications: Outlier report

Report columns: Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned

