Semi-Automated Discovery of Application Session Structure. Jayanthkumar Kannan (Berkeley) , Jaeyeon Jung (Mazu Networks) , Vern Paxson (Berkeley) , Can Emre Koksal (EPFL) ACM Internet Measurement Conference 2006. Outline. Introduction Background Session Extraction Structure Abstraction

Semi-Automated Discovery of Application Session Structure

Jayanthkumar Kannan (Berkeley), Jaeyeon Jung (Mazu Networks), Vern Paxson (Berkeley), Can Emre Koksal (EPFL)

ACM Internet Measurement Conference 2006


Outline

  • Introduction

  • Background

  • Session Extraction

  • Structure Abstraction

  • Results

  • Conclusion & Comments

Speaker: Li-Ming Chen


Network traffic analysis

  • Previous work has extensively examined network behavior at the level of packets and connections.

    • Dynamics, self-similarity

    • Packet delays and losses

    • Connection characteristics at different sites

    • Transport behavior, structural analysis

    • Applications: traffic engineering, capacity planning, anomaly detection

  • What about session-level analysis?

Understanding traffic at a higher level - Session level analysis

  • By comparison, the structure of user-initiated sessions remains much less explored

    • Sessions – application sessions

    • Defined as a group of connections associated with a single network task (e.g., the response to a user event)

  • What could be considered an application session?

    • Applications with pre-specified forms (e.g., FTP sessions)

    • More types of sessions:

      • User behavior (e.g., Web surfing, sending e-mail)

      • Anomalies, misconfiguration

      • Malicious activities (e.g., botnets)

Results/Examples

  • FTP session

  • Or imagine:

  • Logging into a website and listening to music online

  • A botnet zombie receiving instructions from its master and proceeding to carry them out

Benefits of session level analysis

  • For the researchers

    • Aid with traffic characterization and monitoring

    • Provide a foundation for forming source models

      • Descriptions of network activity in terms of what a source is attempting to achieve using the network

    • Aid with anomaly detection

  • For the administrators

    • Track application use in their network at a higher level

    • Provide richer information for framing network policies

    • Anomaly detection

Problem & Goals

  • Mine a connection-level trace

  • Derive session descriptors (abstract descriptions of the session structure) for the different applications present in the trace

    • Without any a priori knowledge about the application

  • Deduce descriptors that provide qualitative structure for analysts

  • Express these descriptors as

    • Regular expressions

    • Deterministic finite automata (DFAs)

      • The expressions focus on the order, type, and directionality of the connections, but not on their inter-arrival timing
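
To make the idea of a regular-expression descriptor concrete, here is a minimal sketch. It assumes connection types are written as tokens such as ftp_out or data_in (protocol plus direction); the pattern itself is a hypothetical FTP-like descriptor chosen for illustration, not one taken from the paper.

```python
import re

# Hypothetical descriptor: one outgoing FTP control connection followed by
# zero or more data connections in either direction. Timing is ignored;
# only order, type, and direction matter.
FTP_DESCRIPTOR = re.compile(r"ftp_out( data_(in|out))*")


def matches(descriptor, session_type):
    """Return True if the whole sequence of connection types fits the descriptor."""
    return descriptor.fullmatch(" ".join(session_type)) is not None


if __name__ == "__main__":
    print(matches(FTP_DESCRIPTOR, ["ftp_out", "data_in", "data_in"]))
    print(matches(FTP_DESCRIPTOR, ["data_in", "ftp_out"]))
```

Joining tokens with spaces keeps the pattern readable; an equivalent DFA would have one state per position in the pattern.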

Approach (the concept)

[Diagram: connection-level traffic trace -> Session Extraction -> Structure Abstraction -> session descriptors]

  • Session Extraction

    • Reduces a stream of connections down to a stream of sessions

    • (Observation) Connections belonging to the same session tend to occur “close” to one another

    • Models the temporal characteristics of session arrivals

  • Structure Abstraction

    • Attempts to infer succinct session descriptors for each application

    • Simplifies the raw descriptions to a generalized form

    • Provides complexity-coverage curves to represent the trade-off between economy of expression and more detailed fidelity

Outline

  • Introduction

  • Background

  • Session Extraction

  • Structure Abstraction

  • Results

  • Conclusion & Comments

Dataset

  • Connection-level traces collected at the border of LBNL

    • 1-month trace, about 2700K connections per day

  • 1st half – used to develop and calibrate the model

  • 2nd half – used to apply the model and infer descriptors for about 40 different applications, including:

    • Content transfer (SMTP, FTP, HTTP)

    • Remote access (SSH, Telnet)

    • Database (OracleSQL, MySQL)

    • P2P (BitTorrent)

    • Mapping, authentication, remote desktop, etc.

  • How to evaluate? Against the protocol specification, or by human inspection

Terminology

  • Connection C:

    • Denoted by (proto, dir, remote-host, local-host, start-time, duration)

      • proto: destination port X

      • dir: incoming or outgoing connection

  • Type of a connection T(C):

    • Defined as (proto, dir)

  • Session S = (C1, C2, …, Cn)

    • A sequence of connections involving only a single local-host and a single remote-host

  • Application A(S):

    • Associated with a session S as T(C1)

  • A session S belongs to the session type ST(S) = (T1, T2, …, Tn)

    • For all i ≤ n, Ti = T(Ci)
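
The definitions above map directly onto a few small data types. The following is a sketch only: the class and function names are illustrative choices, and representing hosts as strings and times as floating-point seconds is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Connection:
    proto: int          # destination port
    direction: str      # "in" or "out"
    remote_host: str
    local_host: str
    start_time: float   # seconds (assumed unit)
    duration: float     # seconds (assumed unit)


ConnType = Tuple[int, str]


def conn_type(c: Connection) -> ConnType:
    """T(C) = (proto, dir)."""
    return (c.proto, c.direction)


def application(session: List[Connection]) -> ConnType:
    """A(S) = T(C1): the type of the first connection in the session."""
    return conn_type(session[0])


def session_type(session: List[Connection]) -> Tuple[ConnType, ...]:
    """ST(S) = (T(C1), ..., T(Cn))."""
    return tuple(conn_type(c) for c in session)


# Example: an outgoing FTP control connection followed by an incoming data connection.
example = [
    Connection(21, "out", "remote.example", "local.example", 0.0, 30.0),
    Connection(20, "in", "remote.example", "local.example", 2.0, 5.0),
]
```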

Types of Sessions

  • Singleton

    • A lone connection by itself

  • Homogeneous sessions

    • Sessions consisting of consecutive invocations of the same application protocol and all with the same directionality

      • -> same connection type

  • Mixed sessions

    • Sessions involving different connection types

  • Sessions involving multiple remote hosts: left as future work
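
The three categories can be told apart from the session type alone. A minimal sketch, assuming a session type is given as a sequence of hashable connection types (the function name is illustrative):

```python
def classify(session_type):
    """Label a session type as singleton, homogeneous, or mixed."""
    if len(session_type) == 0:
        raise ValueError("a session has at least one connection")
    if len(session_type) == 1:
        return "singleton"
    if len(set(session_type)) == 1:
        # Every connection has the same (proto, dir) type.
        return "homogeneous"
    return "mixed"
```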

Applications vs. Types of Sessions

  • Different applications vary widely in the prevalence they exhibit for each of these types of session structure

  • E.g.,

    • LDAP (mapping): 11% singleton, 88% homogeneous

    • SSH (remote access): 80% singleton, 18% homogeneous

    • GridFTP (content transfer): 58% singleton, 42% mixed

    • About half of the 40 applications involve more complex structure

Outline

  • Introduction

  • Background

  • Session Extraction

  • Structure Abstraction

  • Results

  • Conclusion & Comments

Session extraction

  • Problem:

    • Given a stream of connections,

    • Parse and reduce it into sessions (a stream of application-level sessions)

  • When observing a new connection Ci, the algorithm must decide:

    • (a) whether Ci is part of a current session, or

    • (b) whether Ci represents the beginning of a new session

  • Observation/Assumption:

    • The connections in a session are causally related

    • Such connections tend to occur “close” to each other


1. Extracting homogeneous sessions (the aggregation rule)

  • Consider connections less than a time Taggreg apart to be part of the same session [24]

  • For a new connection Ci and an existing active session Sj

    • Sj = (C1j, …, Cnj) and A(Sj) ≡ T(C1j) = T(Ci)

    • If Cnj arrived less than Taggreg before Ci, then Ci is considered part of Sj

  • What about connections that involve a different proto, or that are somewhat further apart?

[Timeline diagram: Sj = (C1j, C2j, …, Cnj), followed by Ci arriving within Taggreg of Cnj]
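
The aggregation rule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: connections are plain tuples sorted by start time, sessions are keyed by connection type plus the host pair, and the default of 100 seconds is the Taggreg value listed on the parameters slide.

```python
T_AGGREG = 100.0  # seconds


def aggregate(connections, t_aggreg=T_AGGREG):
    """Group connections into homogeneous sessions.

    connections: iterable of (start_time, conn_type, local_host, remote_host),
    sorted by start_time. Returns a list of sessions (lists of connections).
    """
    active = {}    # (conn_type, local_host, remote_host) -> most recent session
    sessions = []
    for conn in connections:
        start, ctype, local, remote = conn
        key = (ctype, local, remote)
        session = active.get(key)
        # Join the active session if its last connection arrived less than
        # t_aggreg before this one; otherwise start a new session.
        if session is not None and start - session[-1][0] < t_aggreg:
            session.append(conn)
        else:
            session = [conn]
            sessions.append(session)
            active[key] = session
    return sessions
```

For example, three outgoing SSH connections between the same hosts at 0 s, 50 s, and 300 s would yield two sessions: the first two connections together, and the third on its own.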

[24] C. Nuzman, I. Saniee, W. Sweldens, and A. Weiss, “A compound model for TCP connection arrivals for LAN and WAN applications,” Computer Networks, 2002.

2. Extracting mixed sessions

  • Attempt to assess possible causality

    • For a new connection Ci and an existing active session Sk

    • Sk = (C1k, …, Cmk) and A(Sk) ≡ T(C1k) ≠ T(Ci) (the types are not exactly the same)

    • Try to determine whether Ci is a “triggered” connection of C1k

    • Based on the observation that, if Ci is causally related to Sk, its arrival is likely to be “closer” to Sk than it would be if Ci were a normal (unrelated) connection

  • (Approach) Devise a statistical test that:

    • Identifies pairs of causally linked connections

    • Builds a base model of what is “normal”, and flags deviations

      • Using a null-hypothesis test

2. Extracting mixed sessions (causality detection algorithm)

  • On the arrival of a connection C of type T involving a local-host L

    • Let the sessions observed at L in the previous Ttrigger (500 seconds) be S1, S2, …, Sn

    • First check whether C can simply be aggregated into the most recent homogeneous session Si

    • Estimate the rate of connection arrivals at L for each session type within the past Trate (3600 seconds)

    • For 1 ≤ i ≤ n, compute P[Ti, T, xi], where xi is the interval between the arrival of Si and C

    • If P[Ti, T, xi] < α and C and Si involve the same remote-host, then add C to Si

      • Otherwise, C is considered to be the 1st connection of a new session Sn+1
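
The decision step of this algorithm might look like the sketch below. Several things here are assumptions made for illustration: connections and sessions are plain dictionaries, the probability P is supplied by the caller as a function, the homogeneous-aggregation check is left out, and when several sessions pass the test the one with the smallest P is chosen (the slides do not say how ties are resolved).

```python
def find_trigger(conn, sessions, p_value, alpha=0.1, t_trigger=500.0):
    """Return the session that `conn` should join, or None to start a new one.

    conn:     {"time": float, "type": str, "remote": str}
    sessions: list of {"start": float, "type": str, "remote": str}
    p_value:  callable (session_type, conn_type, x) -> probability under the
              null hypothesis that the arrivals are unrelated
    """
    best, best_p = None, alpha
    for s in sessions:
        x = conn["time"] - s["start"]
        # Only sessions seen within the last t_trigger seconds, and only
        # those involving the same remote host, are candidates.
        if x < 0 or x > t_trigger or s["remote"] != conn["remote"]:
            continue
        p = p_value(s["type"], conn["type"], x)
        if p < best_p:
            best, best_p = s, p
    return best
```

The default values for alpha and t_trigger are the ones given on the parameters slide (α = 0.1, Ttrigger = 500 sec).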


2. Extracting mixed sessions (causality detection algorithm) (cont’d)

  • (Empirical observation) The arrival process is often roughly stationary Poisson over hourly periods

  • Identify connections whose arrivals deviate from this model as triggered connections

    • Arrival process of unrelated (normal) connections = union of independent Poisson processes

    • Very close coincidental arrivals are rare

    • Therefore, arrivals that are close are likely related, i.e., part of the same session

  • P[T1, T2, x] is the probability that two sessions have an arrival within time x.

  • If P[..] < α, declare C1, C2 to be in the same session

[Timeline diagram: C1 (FTP, rate λ1) followed by C2 (HTTP, rate λ2) after inter-arrival time x, which might be longer than Taggreg]
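
The slides do not give the exact form of P[T1, T2, x], so the sketch below substitutes a deliberately simple stand-in: under the Poisson null hypothesis, the chance that an unrelated connection of a given type arrives within x seconds is 1 - exp(-λx), where λ is that type's arrival rate. The function names and the one-rate simplification are assumptions, not the paper's formula.

```python
import math


def estimate_rate(arrival_times, now, t_rate=3600.0):
    """Arrivals per second over the last t_rate seconds (Trate = 1 hr)."""
    recent = [t for t in arrival_times if now - t_rate <= t <= now]
    return len(recent) / t_rate


def coincidence_probability(rate, x):
    """P(at least one arrival of a Poisson process with this rate within x seconds)."""
    return 1.0 - math.exp(-rate * x)


if __name__ == "__main__":
    # A type seen 36 times in the last hour has a rate of 0.01 per second.
    rate = estimate_rate([100.0 * i for i in range(36)], now=3600.0)
    # A small value means a coincidence is unlikely, so the arrival is
    # more plausibly triggered by the earlier session.
    print(coincidence_probability(rate, 2.0))
```

With a low arrival rate, even a gap longer than Taggreg can give a probability below α, which is how this test can link connections that the aggregation rule alone would miss.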

Outline

  • Introduction

  • Background

  • Session Extraction

  • Structure Abstraction

  • Results

  • Conclusion & Comments

Structure abstraction

  • Derive succinct descriptions of application sessions based on the set of session types (ST) reported by Session Extraction

  • Use regular expressions & DFAs to represent an application session

    • Good balance between expressiveness and ease of generation

    • Further refine this representation by labeling state transitions with probabilities

      • Avoids false positives

Exact DFA vs. “Nature” DFA

  • Exact FTP DFA [figure]

    • (Naïve approach) Simply build a DFA that matches the list of all the observed sessions

    • More complex due to the fact that it has to completely capture several FTP sessions

  • “Nature” FTP DFA [figure]

    • A more traceable DFA for FTP

    • Benefits: simplicity, generalization, highlighting common behavior, minimizing false positives

Structure Abstraction Framework

[Diagram: connection-level traffic trace -> Session Extraction -> Structure Abstraction (4 steps) -> session descriptors]

  • Semi-automatic

  • Lack of ground truth

  • Step 1: Categorize sessions based on the server port of the 1st connection

  • Step 2: Construct the exact DFA E from the union of all observed session types (ST)

Step 3: Coverage Phase

  • Given exact DFA E

  • Aim to extract a set of DFAs that capture subsets of the observed session behavior

    • Trade off simplicity of expression (fewest states/edges) against coverage (capturing most types of behavior)

  • A greedy algorithm: DFA E -> DFA F1, F2, …, Fn

    • Feed every session instance in ST to E

    • Compute the hit count h(e) for every edge e

    • Next, compute the augmented hit count h’(e) = Σ h(e’), summed over all edges e’ reachable from e

    • Order edges by decreasing h’(e), denoted e1, e2, …

    • Construct each DFA Fi by taking the union of the edges e1, …, ei
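
The greedy steps above can be sketched as follows. To keep the example short, the exact DFA E is approximated by a prefix tree (each state is the prefix of connection types seen so far), and an edge is counted as reachable from itself; both are simplifying assumptions, and all names are illustrative.

```python
from collections import Counter


def coverage_order(session_counts):
    """Order prefix-tree edges by decreasing augmented hit count.

    session_counts: dict mapping a session type (tuple of connection types)
    to the number of observed instances. An edge is (source_prefix, label).
    """
    hits = Counter()
    for stype, n in session_counts.items():
        for i in range(len(stype)):
            hits[(stype[:i], stype[i])] += n   # h(e)

    def augmented(edge):
        target = edge[0] + (edge[1],)
        # h'(e): hits of e plus hits of every edge below e's target state.
        below = sum(h for (src, _), h in hits.items()
                    if src[:len(target)] == target)
        return hits[edge] + below

    return sorted(hits, key=lambda e: (-augmented(e), len(e[0]), e))


def coverage_dfas(ordered_edges):
    """F_i is the union of the first i edges in the ordering."""
    return [ordered_edges[:i] for i in range(1, len(ordered_edges) + 1)]
```

Because an edge deeper in the tree can never have a larger augmented count than the edge leading to it, each Fi built this way stays connected to the start state.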

Step 4: Generalization Phase

  • Transform F1, F2, … into a set of generalized DFAs G1, G2, …

  • 3 workable generalization rules:

    • Prefix Rule: STi in trace -> all prefixes of STi

    • Counting Rule: (aBc) & (aB^n c) in trace -> (aB+c)

    • Invert Direction Rule: STi in trace -> invert(STi)
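
The three rules can be illustrated on session types written as tuples of tokens such as ftp_out and data_in. This is a simplified sketch: the counting rule below only handles a repeated single token (|B| = 1), whereas the slides use |B| = 2, and the output patterns are descriptive strings rather than compiled regular expressions.

```python
from itertools import groupby


def prefixes(session_type):
    """Prefix Rule: every non-empty prefix of an observed session type."""
    return [session_type[:i] for i in range(1, len(session_type) + 1)]


def invert(session_type):
    """Invert Direction Rule: swap the direction of every connection."""
    flip = {"in": "out", "out": "in"}
    inverted = []
    for token in session_type:
        name, _, direction = token.rpartition("_")
        inverted.append(f"{name}_{flip[direction]}")
    return tuple(inverted)


def counting_rule(trace_types):
    """Counting Rule (simplified): if both aBc and aB...Bc occur, emit aB+c."""
    trace = set(trace_types)
    patterns = set()
    for st in trace:
        collapsed = tuple(key for key, _ in groupby(st))
        if collapsed != st and collapsed in trace:
            runs = [(key, len(list(group))) for key, group in groupby(st)]
            patterns.add(" ".join(f"({key})+" if n > 1 else key
                                  for key, n in runs))
    return patterns
```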

[Figure: example FTP DFA with edges labeled ftp_in, ftp_out, data_in, and data_out]

Refer to author’s slides


Outline

  • Introduction

  • Background

  • Session Extraction

  • Structure Abstraction

  • Results

  • Conclusion & Comments

  • Parameters:

    • Taggreg = 100 sec

    • Ttrigger = 500 sec

    • Trate = 1 hr

    • Threshold α = 0.1

    • Tservice ≥ 5

    • Counting rule |B| = 2

    • Only feed session types of length ≤ 10

FTP session structures (content transfer)

  • Coverage: the fraction of session types in ST accepted by Gi, weighted by the frequency with which the type occurs.

  • Gi may have more or fewer than i edges

[Figure: complexity-coverage curve for FTP, annotated at 2 edges (singleton) and at 4 edges (DFA shown)]
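
The frequency-weighted coverage described on this slide is a short computation. A minimal sketch, assuming the DFA is supplied as a predicate `accepts` and the trace as a mapping from session type to observed count (both names are illustrative):

```python
def coverage(accepts, type_counts):
    """Fraction of observed session instances whose type is accepted.

    accepts:     callable taking a session type and returning True/False
    type_counts: dict mapping session type -> number of occurrences
    """
    total = sum(type_counts.values())
    if total == 0:
        return 0.0
    accepted = sum(n for stype, n in type_counts.items() if accepts(stype))
    return accepted / total
```

Weighting by frequency means a DFA that accepts only the most common session types can still reach high coverage, which is why small DFAs sit high on the complexity-coverage curve.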

FTP session structure (cont’d)

  • DFA: 8 edges (but fewer actual edges)

    • Single data transfer in the opposite direction

  • DFA: 10 edges

    • HTTP connections can occur during FTP sessions

  • DFA: 18 edges

    • Coverage: 99%

Timbuktu session structures (remote desktop)

  • 2: Singleton > 90%

  • Others < 10%

DFA: 4 edges

Timbuktu session structures (cont’d)

DFA: 10 edges

HTTP session structure (content transfer)

  • DFA: 30 edges

    • (To save space, only sessions that begin with an outgoing HTTP connection are shown)

  • More complex; ~99% are singletons or aggregated sessions that reflect successive retrieval of multiple pages from the same server

Finding Attacks Using Anomaly Detection

  • One goal is to detect network attacks by finding sessions that deviate from established session structures.

    • Such deviations would reflect either unintended mis-configurations, scanning, or “phone home” connections associated with compromises.
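
Checking a session against an established descriptor reduces to running its session type through the DFA. The sketch below uses a small transition table for a hypothetical FTP-like descriptor; the states, tokens, and function names are illustrative assumptions rather than a descriptor from the paper.

```python
def accepts(transitions, accepting, session_type, start=0):
    """Run a session type through a DFA given as {(state, token): next_state}."""
    state = start
    for token in session_type:
        state = transitions.get((state, token))
        if state is None:      # no edge for this connection type
            return False
    return state in accepting


# Hypothetical descriptor: a control connection, then any number of data
# connections in either direction.
FTP_DFA = {
    (0, "ftp_out"): 1,
    (1, "data_in"): 1,
    (1, "data_out"): 1,
}


def is_anomalous(session_type):
    """Flag sessions that the established descriptor does not accept."""
    return not accepts(FTP_DFA, {1}, session_type)
```

A session such as (ftp_out, smtp_out) would be flagged, since the descriptor has no edge for an SMTP connection following the control connection.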

Outline

  • Introduction

  • Background

  • Session Extraction

  • Structure Abstraction

  • Results

  • Conclusion & Comments

Conclusion

  • Session extraction

    • A statistical technique to extract application sessions from a connection-level trace of network activity

  • Structure abstraction

    • A method to deduce descriptors that can be used by an analyst to capture the qualitative structure of such sessions.

  • The results show that the proposed method works well over many of the applications in the trace

  • The future work:

    • Evaluate/validate the proposed method over more applications

    • Extend the method to support single-to-multiple host sessions

    • Try to collate descriptors for closely-related protocols

Comments

  • This method statistically correlates connections by observing connection-level traffic traces

    • Might not be suitable for a complex environment

    • What if packet-level traces can be acquired?

  • Surprisingly, a particular application can manifest various session structures

  • The session structures in this paper will help to detect host-based anomalies

  • Single-to-multiple host sessions might be more helpful for observing/identifying worm-like activity
