Information Fusion
Ganesh Godavari
DDoS Data Set
• The DARPA DDoS data set (2000) is available from the MIT Lincoln Laboratory
• The data set spans approximately 3 hours
• The five phases of the attack scenario depicted [1]:
  1. IPsweep of the Air Force Base from a remote site
  2. Probe of live IPs to look for the sadmind daemon running on Solaris hosts
  3. Break-ins via the sadmind vulnerability, both successful and unsuccessful, on those hosts
  4. Installation of the trojan mstream DDoS software on three hosts at the AFB
  5. Launching the DDoS
Phase 1 Attack (DDoS DataSet)

Date        Time      Duration  SrcIP            Target IP        Analyzer        Service
03/07/2000  09:51:36  00:00:00  202.77.162.213   172.16.115.5     tcpdump_inside  icmp-E-R
03/07/2000  09:51:36  00:00:05  172.16.112.194   202.77.162.213   tcpdump_inside  icmp-E-Rp
03/07/2000  09:51:36  00:00:00  202.77.162.213   172.16.115.20    tcpdump_inside  icmp-E-R
03/07/2000  09:51:36  00:00:00  172.16.115.20    202.77.162.213   tcpdump_inside  icmp-E-Rp
03/07/2000  09:51:38  00:00:00  202.77.162.213   172.16.115.87    tcpdump_inside  icmp-E-R
03/07/2000  09:51:38  00:00:00  172.16.115.87    202.77.162.213   tcpdump_inside  icmp-E-Rp
03/07/2000  09:51:41  00:00:00  202.77.162.213   172.16.115.234   tcpdump_inside  icmp-E-R
03/07/2000  09:51:50  00:00:00  202.77.162.213   172.16.113.50    tcpdump_inside  icmp-E-R
03/07/2000  09:51:50  00:00:00  172.16.113.50    202.77.162.213   tcpdump_inside  icmp-E-Rp
03/07/2000  09:51:51  00:00:00  202.77.162.213   172.16.113.84    tcpdump_inside  icmp-E-R
03/07/2000  09:51:51  00:00:09  172.16.112.194   202.77.162.213   tcpdump_inside  icmp-E-Rp
03/07/2000  09:51:51  00:00:00  202.77.162.213   172.16.113.105   tcpdump_inside  icmp-E-R
03/07/2000  09:51:51  00:00:00  172.16.113.105   202.77.162.213   tcpdump_inside  icmp-E-Rp
03/07/2000  09:51:52  00:00:00  202.77.162.213   172.16.113.148   tcpdump_inside  icmp-E-R
:           :         :         :                :                :               :
03/07/2000  09:52:00  00:00:00  202.77.162.213   172.16.112.194   tcpdump_inside  icmp-E-R
03/07/2000  09:52:00  00:00:00  202.77.162.213   172.16.112.207   tcpdump_inside  icmp-E-R

icmp-E-R  => icmp-echo-request
icmp-E-Rp => icmp-echo-reply
Algorithm
Step 1: Go over the data file and build a vocabulary
  • Read all the unique fields in the data files
Step 2: Identify the frequent vocabulary in the data file
  • How does one determine frequency? How can one determine the threshold for frequency?
Step 3: Generate cluster candidates
  • Lines containing the same frequent words form a cluster
Step 4: Identify temporal relationships between the cluster candidates
  • The 24 relationships of the data
Step 5: Generate unique lines
  • Lines in the data file are grouped based on the candidate clusters
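A minimal Python sketch of Steps 1 and 2, assuming the data file is whitespace-delimited with one alert per line and that the frequency threshold is supplied as a parameter (the file name and threshold below are illustrative only, not from the data set):

    from collections import Counter

    def build_vocabulary(path):
        """Step 1: collect every (column, word) pair in the data file with its count."""
        vocab = Counter()
        with open(path) as f:
            for line in f:
                for col, word in enumerate(line.split()):
                    vocab[(col, word)] += 1
        return vocab

    def frequent_words(vocab, threshold):
        """Step 2: keep only the (column, word) pairs seen more often than the threshold."""
        return {pair for pair, count in vocab.items() if count > threshold}

    # Illustrative usage:
    # vocab = build_vocabulary("phase1_alerts.txt")
    # frequent = frequent_words(vocab, threshold=10)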
Need Suggestions
• Is it safe to assume that a threshold parameter is provided?
• Cluster candidate generation can produce too much data (the next slide shows how)
• The 24 relations cover everything; we need to identify which ones we are interested in
Cluster Candidate Generation
• The data set has 8 dimensions
• Frequent words (4-byte column number prepended to the word) with threshold > 10:
  • column 0004: 202.77.162.213     repeated 22 times
  • column 0001: 03/07/2000         repeated 33 times
  • column 0003: 00:00:00           repeated 31 times
  • column 0007: icmp-echo-request  repeated 22 times
  • column 0007: icmp-echo-reply    repeated 11 times
  • column 0006: tcpdump_inside     repeated 33 times
  • column 0005: 202.77.162.213     repeated 11 times
Candidate Generation Example
• Example:

  Line 1: 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.5   tcpdump_inside icmp-E-R
  Line 2: 03/07/2000 09:51:36 00:00:05 172.16.112.194 202.77.162.213 tcpdump_inside icmp-E-Rp
  Line 3: 03/07/2000 09:51:36 00:00:00 202.77.162.213 172.16.115.20  tcpdump_inside icmp-E-R
  Line 4: 03/07/2000 09:51:36 00:00:00 172.16.115.20  202.77.162.213 tcpdump_inside icmp-E-Rp

• The first field (the date) is common to all lines, so should those lines be considered a candidate cluster?

  for each frequent-word in frequent-word-list {
      while (read a line of data != EOF) {
          if (frequent-word in line)
              add line number to Cluster
      } // end of while
  } // end of for

  Cluster 1 = { line 1, line 2, line 3, line 4 }
  Cluster 2 = { line 1, line 3, line 4 }
  Cluster 3 = { line 1, line 3 }
  Cluster 4 = { line 2, line 4 }
  Cluster 5 = { line 1, line 2, line 3, line 4 }
  Cluster 6 = { line 1, line 3 }
  Cluster 7 = { line 2, line 4 }
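A runnable Python sketch of the pseudocode above (an interpretation, not the author's implementation); keying each cluster by its frequent (column, word) pair makes the overlap between clusters, and hence the data blow-up, easy to see:

    def candidate_clusters(lines, frequent):
        """One pass per frequent (column, word) pair, as in the pseudocode above."""
        clusters = {}
        for col, word in frequent:
            members = []
            for line_no, line in enumerate(lines, start=1):
                fields = line.split()
                if col < len(fields) and fields[col] == word:
                    members.append(line_no)
            if members:
                clusters[(col, word)] = members
        return clusters

On the four example lines this yields seven candidate clusters, several of which contain exactly the same line numbers (for example the date column and the analyzer column both select lines 1 through 4), which is the duplication problem noted on the previous slide.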
Another Approach?
• Reduction, but loss of information?

  char key
  while (read a line of data != EOF) {
      key = ""
      for each frequent-word in frequent-word-list {
          if (frequent-word in line)
              key = key + frequent-word
      } // end of for
      if (key not in Cluster)
          add line number to cluster
  } // end of while

• Cluster 1 = { line 1, line 3 }
• Cluster 2 = { line 2 }
• Cluster 3 = { line 4 }
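A Python sketch of this alternative (again an interpretation): all frequent words found in a line are composed into a single key, so every line falls into exactly one cluster, matching the three clusters listed above:

    def key_clusters(lines, frequent):
        """Group lines by the combination of frequent (column, word) pairs they contain."""
        clusters = {}
        for line_no, line in enumerate(lines, start=1):
            fields = line.split()
            key = tuple(pair for pair in sorted(frequent)
                        if pair[0] < len(fields) and fields[pair[0]] == pair[1])
            clusters.setdefault(key, []).append(line_no)
        return clusters

The reduction comes at a cost: lines that differ only in their infrequent fields (for example different target IPs) are merged into a single cluster, which is the loss of information mentioned above.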
Temporal Relations
• Unable to find a case that the 24 temporal relationships do not cover
• Need to identify the relationships that are needed for decision making
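As an illustration only (this is not the full set of 24 relationships referred to above), a small Python sketch of how a few common temporal relations between two cluster time spans could be tested; the interval endpoints are assumed to be comparable timestamps:

    def temporal_relation(a_start, a_end, b_start, b_end):
        """Classify a small, illustrative subset of relations between intervals A and B."""
        if a_end < b_start:
            return "A before B"
        if a_start == b_start and a_end == b_end:
            return "A equals B"
        if b_start < a_start and a_end < b_end:
            return "A during B"
        if a_start < b_start < a_end < b_end:
            return "A overlaps B"
        return "other relation"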
Work to be Done
• The algorithm and the coding are complete through Step 4
References
[1] MIT Lincoln Laboratory, http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html