Multivariate Event Detection

Manu Shukla, 3/23/2013

Basics
• Use fast subset scan (Neill '12, J. R. Stat. Soc.) to do multivariate event detection
• Multivariate event detection here means finding the keyword combinations in tweets that are most likely to signify an event (in this scenario, social unrest)
• Reduce the problem to filtering out combinations that have a low probability of forming clusters, using a score function F(S) that satisfies the Linear Time Subset Scanning (LTSS) property
• After filtering, find keyword combination clusters with the fast subset scan technique

Filtering
• The filtering follows two principles:
  • By location
  • By probability of forming clusters, based on F(S) and LTSS
• Use kd-tree and fp-tree data structures to aid in filtering

Theorems
• Two branch-and-bound algorithms are used:
  • Theorem 1: Given a spatial region $R$ and a set of itemsets $\{A_1,\dots,A_K\}$, in which some of the itemsets may overlap, for any superset $B \supseteq \{A_1,\dots,A_K\}$, we have the following upper-bound property:
    $$\mathrm{FS}\Big(\min\{R.A_i.\mathrm{LTSS.count}\}_{i=1}^{K},\ \max\big\{\min\{R.A_i.\mathrm{LTSS.count}\}_{i=1}^{K},\ \max\{R.A_i.\mathrm{LTSS.minbase}\}_{i=1}^{K}\big\}\Big) > R.\{B\}.\mathrm{LTSS.FS}$$
  • Theorem 2: Given a spatial region $R$ and a set of itemsets $\{A_1,\dots,A_K\}$, in which some of the itemsets may overlap, for any superset $B \supseteq \{A_1,\dots,A_K\}$, we have the following upper-bound property:
    $$\mathrm{FS}\Big(\min\{R.A_i.\mathrm{LTSS.count}\}_{i=1}^{K},\ \max\big\{\min\{R.A_i.\mathrm{LTSS.count}\}_{i=1}^{K},\ \max\{R.A_i.\mathrm{LTSS.minbase}\}_{i=1}^{K}\big\},\ C_{all}=\min\{R.A_i.\mathrm{LTSS.count}\}_{i=1}^{K},\ B_{all}\Big) > R.\{B\}.\mathrm{LTSS.FS}$$
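The two arguments that Theorem 1 passes to FS can be computed directly from per-itemset LTSS summaries. The sketch below is a literal transcription of those arguments as written above, not the author's implementation; the function names, the `score_fn` callable, and the `threshold` parameter are hypothetical and exist only for illustration.

```python
def superset_bound_args(ltss_counts, ltss_minbases):
    """Return the (count, baseline) pair that Theorem 1 feeds to FS.

    ltss_counts[i]   stands for R.A_i.LTSS.count
    ltss_minbases[i] stands for R.A_i.LTSS.minbase
    Both input lists are assumed to be non-empty and of equal length.
    """
    if not ltss_counts or len(ltss_counts) != len(ltss_minbases):
        raise ValueError("need one count and one minbase per itemset")
    count_arg = min(ltss_counts)
    base_arg = max(count_arg, max(ltss_minbases))
    return count_arg, base_arg


def can_prune(score_fn, ltss_counts, ltss_minbases, threshold):
    """Hypothetical pruning rule: skip every superset B when the bound
    on its score cannot exceed `threshold` (e.g. the best score so far).

    `score_fn(count, baseline)` is any two-argument score function
    supplied by the caller; it is an assumption of this sketch.
    """
    count_arg, base_arg = superset_bound_args(ltss_counts, ltss_minbases)
    return score_fn(count_arg, base_arg) <= threshold
```

Because the theorem bounds the score of every superset from above, a bound at or below the current threshold means none of those supersets needs to be scored.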

Details
• The score function F(S) is the Kulldorff statistic:
  $$F(S; C, B, C_{all}, B_{all}) = C \log\frac{C}{B} + (C_{all} - C)\log\frac{C_{all} - C}{B_{all} - B} - C_{all}\log\frac{C_{all}}{B_{all}}$$
• $C$ and $B$ are respectively the aggregate count $\sum c_i^t$ and aggregate baseline $\sum b_i^t$ in region $S$ for the given time interval
• $C_{all}$ and $B_{all}$ are the total aggregate count $\sum c_i^t$ and baseline $\sum b_i^t$ over all spatial locations $s_i$
• R.A.LTSS.count and R.A.LTSS.base are defined as the LTSS subset count and base in the region R
• R.A.LTSS.minbase = min{R.A.p.base | p ∈ R.A.LTSS}
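As a minimal sketch, the Kulldorff statistic above can be evaluated as follows. Two points are assumptions rather than part of the slides: the convention 0 · log(·) = 0, and the one-sided guard that returns 0 when the rate inside S is not higher than the rate outside it. The function names are illustrative.

```python
import math


def _x_log_ratio(x, num, den):
    """Return x * log(num / den), using the convention 0 * log(.) = 0."""
    if x == 0:
        return 0.0
    return x * math.log(num / den)


def kulldorff_score(c, b, c_all, b_all):
    """Evaluate F(S; C, B, C_all, B_all) for aggregate count c and baseline b."""
    if not (0 <= c <= c_all) or not (0 < b <= b_all):
        raise ValueError("expected 0 <= C <= C_all and 0 < B <= B_all")
    # Assumption (not in the slides): only regions whose rate C/B exceeds
    # the rate outside the region get a positive score.
    if b == b_all or c / b <= (c_all - c) / (b_all - b):
        return 0.0
    return (_x_log_ratio(c, c, b)
            + _x_log_ratio(c_all - c, c_all - c, b_all - b)
            - _x_log_ratio(c_all, c_all, b_all))
```

For example, `kulldorff_score(20, 10, 100, 100)` scores a region holding 20 of 100 events against a baseline of 10 out of 100.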

Steps
• Build candidate clusters of single keyword terms using any technique (e.g., graph partitioning)
• Filter single keyword terms spatially using the two theorems and a kd-tree
• Build an fp-tree of keyword combinations
• Filter the fp-tree using the two theorems
• Cluster using fast subset scan
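The final clustering step relies on the LTSS property: for a score function such as the Kulldorff statistic, the best subset of locations is always one of the top-k prefixes when locations are sorted by the priority count/baseline. The sketch below illustrates that idea on plain lists of per-location counts and baselines. It is not the author's pipeline; the function names are illustrative, baselines are assumed positive, and the one-sided guard in the score is an assumption. The brute-force helper is included only to check the result on tiny inputs.

```python
import math
from itertools import combinations


def kulldorff_score(c, b, c_all, b_all):
    """Kulldorff statistic; returns 0 for regions that are not elevated."""
    if b >= b_all or c == 0 or c / b <= (c_all - c) / (b_all - b):
        return 0.0
    outside = 0.0
    if c_all > c:
        outside = (c_all - c) * math.log((c_all - c) / (b_all - b))
    return c * math.log(c / b) + outside - c_all * math.log(c_all / b_all)


def fast_subset_scan(counts, baselines):
    """Score only the n top-k prefixes in priority order, not all 2^n subsets."""
    c_all, b_all = sum(counts), sum(baselines)
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c = b = 0
    for k, i in enumerate(order):
        c += counts[i]
        b += baselines[i]
        score = kulldorff_score(c, b, c_all, b_all)
        if score > best_score:
            best_score, best_subset = score, sorted(order[:k + 1])
    return best_score, best_subset


def brute_force_scan(counts, baselines):
    """Exhaustive check over every non-empty subset (tiny inputs only)."""
    c_all, b_all = sum(counts), sum(baselines)
    best = 0.0
    for size in range(1, len(counts) + 1):
        for subset in combinations(range(len(counts)), size):
            c = sum(counts[i] for i in subset)
            b = sum(baselines[i] for i in subset)
            best = max(best, kulldorff_score(c, b, c_all, b_all))
    return best
```

Sorting dominates the cost, so the scan takes O(n log n) time per keyword combination, which is what makes filtering candidates before this step worthwhile.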

Issues
• Scaling, since the number of keyword combinations grows exponentially (distributed?)
• Verifying the quality of clusters
