Beyond Keyword Filtering for Message and Conversation Detection

David Skillicorn
School of Computing, Queen’s University
Math and CS, Royal Military College
skill@cs.queensu.ca

The problem: Pick out the most ‘interesting’ intercepted messages when conventional markers (sender/receivers etc.) are missing.

The solution: Look for correlated use of words that are used with the “wrong” frequency, caused by substitution to evade keyword filtering.

The technique: Use singular value decomposition and independent component analysis applied to noun frequency profiles; suspicious related messages appear as outliers. Messages with ordinary word frequencies and lone eccentrics do not show up, so the technique can be applied to large sets of messages to select the interesting few.

THE PROBLEM

Many governments collect and analyze message traffic (e.g. Echelon) – email, file traffic/web, cellphone traffic, radio. There are 3 levels of analysis:

1. Match the content of individual messages against a watch list of words that suggest the message is suspicious.
German Federal Intelligence Service: nuclear proliferation (2000 terms), arms trade (1000), terrorism (500), drugs (400), as of 2000 (certainly changed now).
Countermeasures: use a speech code (hard in real time) or use locutions (“the package is ready”).
Main benefit: Changes behavior of those who DON’T want their messages intercepted.

2. Look for sets of messages that are connected, that form a conversation, based on some of their properties: sender/receiver identities, time of transmission, specialized word use, etc. (Social Network Analysis)
Countermeasures: conceal the connections between the messages by making sure they share no obvious attributes:
* use temporary email addresses, stolen cell phones
* decouple by using intermediaries
* smear time factors, e.g. by using web sites
In general, hide in the background noise.

3. Look for sets of messages that are connected in more subtle ways because of correlation among their properties.
Workable countermeasures are hard to find because:
* conversations are about something, so that correlation in their content arises naturally
* sensitivity to watch list surveillance alters the way words are used
We hypothesize that related messages among a threat group in the context of watch list surveillance will be characterized by correlated word use, but that the words will be used with the “wrong” frequencies. Common words will be used as if they were uncommon; uncommon words will be used as if they were common.

THE DATA

The frequency of words in English (and many other languages) follows a Zipf distribution – frequent words are very frequent, and frequency drops off very quickly with rank. We restrict our attention to nouns. In English:
* Most common noun – time
* 3262nd most common noun – quantum
We assume that messages are reduced to a frequency histogram of their nouns (this can be done reliably with a tagger).

A message-frequency matrix has a row corresponding to each message, and a column corresponding to each noun. The (i, j)th entry is the frequency of noun j in message i. The matrix is very sparse. We generate artificial datasets using a Poisson distribution with mean f * 1/(j+1), where f models the base frequency and j is the column (noun) index. We add 10 extra rows representing the correlated threat messages, using a block of 6 columns of uniformly random 0s and 1s, added at columns 301 to 306.
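The generation procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the matrix sizes, the value of f and the seed are all hypothetical, j is taken to be the 1-based column index, and the threat rows are assumed to contain ordinary background counts plus the extra 0/1 block (the poster does not say which).

```python
import numpy as np

def make_message_frequency_matrix(n_messages=1000, n_nouns=400, f=3.0,
                                  n_threat=10, block_start=300, block_width=6,
                                  seed=0):
    """Return (A, threat_rows) for a synthetic message-frequency matrix.

    All default values are illustrative assumptions, not the poster's settings.
    """
    rng = np.random.default_rng(seed)
    # Column j (1-based) has Poisson mean f / (j + 1): a Zipf-like decay.
    j = np.arange(1, n_nouns + 1)
    means = f / (j + 1.0)
    # Background counts for ordinary and threat messages alike (assumption).
    A = rng.poisson(means, size=(n_messages + n_threat, n_nouns))
    # The last n_threat rows share a block of uniformly random 0/1 entries.
    # Columns 301..306 (1-based) are indices 300..305 (0-based).
    threat_rows = np.arange(n_messages, n_messages + n_threat)
    block = rng.integers(0, 2, size=(n_threat, block_width))
    A[n_messages:, block_start:block_start + block_width] += block
    return A, threat_rows
```

Because the Poisson means fall off quickly with the column index, almost all entries are zero, which matches the sparsity noted above.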

A message-rank matrix has a row corresponding to each message, and a column corresponding to the rank, in English, of the jth most frequent noun in the message. Message-rank matrices have many fewer columns, which makes them easier and faster to work with (e.g. Enron email dataset: 200,000+ ‘words’ but average number of nouns per message <200). Message-frequency matrices have been extensively studied in IR, but message-rank matrices not at all. Message-rank matrices are insensitive to countermeasures such as using words with almost the right frequency.
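A minimal sketch of how one row of a message-rank matrix might be built from a tagged message. The function name, the tie-breaking rule (lower English rank first), the zero padding and the choice to skip nouns missing from the rank table are all assumptions made for illustration. In the usage example, only the ranks of "time" (1) and "quantum" (3262) come from the text above; the rank for "package" is invented.

```python
from collections import Counter

def message_rank_row(nouns, english_rank, width):
    """Return the English ranks of a message's nouns, most frequent noun first.

    nouns        -- list of nouns extracted from one message
    english_rank -- dict mapping noun -> rank in English (1 = most common)
    width        -- number of columns; short rows are padded with 0
    """
    # Count only nouns we have a rank for (assumption: unknown nouns are skipped).
    counts = Counter(w for w in nouns if w in english_rank)
    # Column j holds the jth most frequent noun in the message;
    # ties are broken by English rank (assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], english_rank[w]))
    ranks = [english_rank[w] for w in ordered[:width]]
    return ranks + [0] * (width - len(ranks))

# Toy rank table: "package" has a made-up rank.
toy_rank = {"time": 1, "package": 900, "quantum": 3262}
row = message_rank_row(["package", "time", "package", "quantum"], toy_rank, width=4)
# row == [900, 1, 3262, 0]
```

The row width is bounded by the number of distinct nouns in the longest message rather than by the vocabulary size, which is why these matrices have far fewer columns.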

Figure: message-frequency matrix (rows: messages; columns: nouns).

Figure: message-rank matrix (rows: messages; columns: rank of jth noun in message).

THE TECHNIQUES

Matrix decompositions. The basic idea:
* Treat the dataset as a matrix, A, with n rows and m columns;
* Factor A into the product of two matrices, C and F:
A = C F
where C is n x r, F is r x m and r is smaller than m.
Think of F as a set of underlying ‘real’ somethings and C as a way of ‘mixing’ these somethings together to get the observed attribute values. Choosing r smaller than m forces the decomposition to represent the data more compactly.
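As one concrete instance of the factorization A = C F, the sketch below builds C and F from a truncated SVD using NumPy. The function name is hypothetical, and absorbing the singular values into C is a convention chosen here rather than something the poster specifies. When r is smaller than the rank of A, the product C F is only an approximation of A.

```python
import numpy as np

def truncated_svd_factors(A, r):
    """Factor A (n x m) as C (n x r) times F (r x m) via a truncated SVD."""
    A = np.asarray(A, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    C = U[:, :r] * s[:r]   # n x r: mixing coefficients, scaled by singular values
    F = Vt[:r, :]          # r x m: orthonormal underlying factors
    return C, F

# Usage: a 20 x 8 matrix of rank 3 is recovered exactly with r = 3.
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 8))
C_demo, F_demo = truncated_svd_factors(A_demo, r=3)
```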

Two matrix decompositions are useful:
* Singular value decomposition (SVD) – the rows of F are orthogonal axes such that the maximum possible variation in the data lies along the first axis, the maximum of what remains along the second, and so on. The rows of C are coordinates in this space.
* Independent component analysis (ICA) – the rows of F are statistically independent factors. The rows of C describe how to mix these factors to produce the original data. Strictly speaking, the rows of C are not coordinates, but we can plot them to get some idea of structure.
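The poster identifies outliers visually from plots of the first three dimensions. As a rough numerical stand-in, the sketch below scores each message by its distance from the median point in the first few SVD coordinates. The scoring rule and the function name are assumptions for illustration, not the author's procedure. An ICA variant could swap the SVD step for a library implementation such as scikit-learn's FastICA.

```python
import numpy as np

def svd_outlier_scores(A, dims=3):
    """Score each row of A by its distance from the bulk in SVD coordinates."""
    A = np.asarray(A, dtype=float)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    coords = U[:, :dims] * s[:dims]      # rows of C: one point per message
    centre = np.median(coords, axis=0)   # robust estimate of where the bulk sits
    return np.linalg.norm(coords - centre, axis=1)

# Toy check: 30 identical messages and one that uses different columns.
A_toy = np.zeros((31, 5))
A_toy[:30, 0] = 1.0
A_toy[30, 2:4] = 2.0
scores = svd_outlier_scores(A_toy)
most_unusual = int(scores.argmax())      # index of the odd message
```

Messages with the highest scores are the ones that would sit far from the main cloud in a three-dimensional plot.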

Figure: first 3 dimensions – SVD. The messages with correlated unusual word usage are marked with red circles.

Figure: first 3 dimensions – ICA.

(Fortunately) both unusual word use and correlated word use are necessary to make such messages detectable.

Figure: correlation with proper word frequencies (SVD). So ordinary conversations don’t show up as false positives!

Figure: correlation with proper word frequencies (ICA).

Figure: uncorrelated with unusual word frequencies (SVD). Conversations about unusual things don’t show up as false positives either!

Figure: uncorrelated with unusual word frequencies (ICA).

This trick permits a new level of sophistication in connecting related messages into conversations when the usual indicators are not available. It does exactly the right thing – ignoring conversations about ordinary topics, and conversations about unusual topics, but homing in on conversations about unusual topics using inappropriate words. Because the dataset is sparse, SVD takes time linear in the number of messages. The complexity of ICA is less clear but there are direct hardware implementations.

Message-rank matrices are useful because they defend against the countermeasure of rules like “use the word 5 ranks below the one you want to use”. Such rules are easy to apply with access to the internet, for example the site www.fabrica.it/wordcount/main.php. However, this isn’t so easy in real-time communication.
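The substitution rule mentioned above can be sketched as follows. The function name, the handling of nouns that are unknown or too close to the end of the list (left unchanged), and the toy word list are all assumptions for illustration.

```python
def shift_nouns(nouns, ranked_nouns, offset=5):
    """Replace each noun by the noun `offset` ranks further down the list.

    ranked_nouns -- nouns ordered from most to least common
    """
    rank_of = {w: i for i, w in enumerate(ranked_nouns)}
    shifted = []
    for w in nouns:
        i = rank_of.get(w)
        if i is None or i + offset >= len(ranked_nouns):
            shifted.append(w)  # unknown or end-of-list noun: keep as is
        else:
            shifted.append(ranked_nouns[i + offset])
    return shifted

# Toy list of placeholder nouns n0 (most common) .. n9 (least common).
toy_list = [f"n{k}" for k in range(10)]
disguised = shift_nouns(["n0", "n3", "n7", "zzz"], toy_list)
# disguised == ["n5", "n8", "n7", "zzz"]
```

Every substituted noun moves by the same number of ranks, so the rank profile of a disguised message is shifted in a consistent way.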

The SVD of a message-rank matrix has a fan shape. Points are labelled with the length of each message.

Same plot with messages labelled by the average rank of the nouns they contain. Length of message and average rank are correlated – partly because of opportunity, but it’s not clear that this is the whole story.

Replacing words with those, say, five positions down the list does not show up in the SVD of a message-frequency matrix (figure).

But it’s very clear in the SVD of a message-rank matrix (figure).

We have been applying these techniques to the Enron email dataset, which is a good surrogate for intercepted communications:
* about 500,000 emails
* about 1500 people
* partially known ‘command and control’ structure
Early results from several groups were presented at the Workshop on Link Analysis, Counterterrorism and Security: www.cs.queensu.ca/home/skill/siamworkshop.html
Also: New York Times Week in Review this weekend.
