Rapid detection of significant spatial clusters

Junfie, Rayala, Bairi

Introduction • Distinguishing patterns that are significant from those that are likely to have occurred by chance • Detecting regions that are overdense according to some density measure • Performing statistical tests to determine whether those regions are significant.

Applications • Detecting clusters of disease cases, for purposes ranging from detection of bioterrorism to identification of environmental risks • National hospital, pharmacy • Mining astronomical data • Medical imaging

Previous work vs. present work • The goal of the present work is to detect the most significant rectangular region, as opposed to the most significant square region. • The naïve approach to finding the maximum density region takes O(N^4) time, which is computationally infeasible for large grids. • Multiresolution partitioning algorithm: divide the grid into overlapping regions, bound the maximum score of the subregions contained in each region, and prune regions which cannot contain the maximum density region.

There are other methods for finding dense clusters, such as CLIQUE, MAFIA and STING, but determining the statistical significance of a cluster is also important. • The present work differs from the other methods in three main ways • It determines the statistical significance of the cluster (whether it is a true overdensity or is likely to have occurred by chance). • It deals with non-uniform underlying populations. • It applies to a wide class of density measures D: it can optimize with respect to arbitrary non-monotonic density measures (e.g. Kulldorff's spatial scan statistic), whereas the other methods use monotonic density measures.

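Spatial scan statistic • Kulldorff's spatial scan statistic (D_K): a non-monotonic density measure. • Used for finding significant spatial clusters • Widely used by epidemiologists to detect clusters of disease cases, which are often indicative of an emerging outbreak • Assumes that counts c_ij are generated by an inhomogeneous Poisson process with mean q·p_ij, where q is the underlying disease rate and p_ij is the population of square (i, j) • D_K(S) = C log(C/P) + (C_tot - C) log((C_tot - C)/(P_tot - P)) - C_tot log(C_tot/P_tot) if C/P > C_tot/P_tot, and 0 otherwise, where C and P are the total count and population of region S, and C_tot and P_tot are the totals over the whole grid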
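As a rough illustration of the density measure on this slide, the sketch below evaluates the commonly stated log-likelihood-ratio form of Kulldorff's statistic for one region. The function name `kulldorff_score` and its argument names are assumptions made for this example; the score is taken to be zero whenever the rate inside the region does not exceed the overall rate.

```python
import math


def kulldorff_score(c, p, c_tot, p_tot):
    """Score a region with count c and population p (hypothetical helper).

    c_tot and p_tot are the totals over the whole grid. Assumed form:
    C log(C/P) + (Ctot-C) log((Ctot-C)/(Ptot-P)) - Ctot log(Ctot/Ptot)
    when C/P > Ctot/Ptot, and 0 otherwise.
    """
    # Empty regions, the whole grid, and regions that are not overdense score 0.
    if p <= 0 or p >= p_tot or c / p <= c_tot / p_tot:
        return 0.0

    def x_log_y(x, y):
        # Use the convention 0 * log(0) = 0.
        return 0.0 if x == 0 else x * math.log(y)

    inside = x_log_y(c, c / p)
    outside = x_log_y(c_tot - c, (c_tot - c) / (p_tot - p))
    return inside + outside - x_log_y(c_tot, c_tot / p_tot)


if __name__ == "__main__":
    # A region holding 10% of the population but 20% of the cases.
    print(kulldorff_score(20, 100, 100, 1000))
```

Because the score depends on both the count and the population of a region, a larger region does not necessarily score higher, which is why the measure is described as non-monotonic.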

Randomization testing • Once we have found the maximum density region (mdr) of grid G according to our density measure, we must still determine the statistical significance of this region. • To find the p-value, we run a large number R of random replications, where each replica has the same underlying populations p_ij as G but assumes a uniform disease rate q_rep for all squares.

For each replica G', we first generate all counts c_ij randomly from an inhomogeneous Poisson distribution with mean q_rep·p_ij, then compute the maximum region density (mrd) of G' and compare it to mrd(G). The number of replicas G' with mrd(G') >= mrd(G), divided by the total number of replications R, gives the p-value for our maximum density region. If this p-value is less than 0.05, we can conclude that the discovered region is statistically significant (unlikely to have occurred by chance) and is thus a "spatial overdensity".
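The randomization test described above can be sketched as follows. This is a minimal illustration, not the presenters' code: `randomization_p_value`, `poisson_sample` and `toy_max_density` are hypothetical names, q_rep is assumed to be the grid-wide rate C_tot/P_tot, and the toy density (highest single-square rate) merely stands in for a real mrd search. The Poisson sampler uses Knuth's multiplication method, which is only suitable for small means.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def randomization_p_value(counts, pops, max_density, n_replicas=1000, seed=0):
    """Fraction of replicas whose mrd is at least the observed mrd."""
    rng = random.Random(seed)
    # Assumption: the uniform replica rate is the overall rate of the grid.
    q_rep = sum(map(sum, counts)) / sum(map(sum, pops))
    observed = max_density(counts, pops)
    exceed = 0
    for _ in range(n_replicas):
        # Same populations as G, counts redrawn under a uniform rate.
        replica = [[poisson_sample(rng, q_rep * p) for p in row] for row in pops]
        if max_density(replica, pops) >= observed:
            exceed += 1
    return exceed / n_replicas


def toy_max_density(counts, pops):
    """Stand-in for an mrd search: the highest rate of any single square."""
    return max(c / p for c_row, p_row in zip(counts, pops)
               for c, p in zip(c_row, p_row))


if __name__ == "__main__":
    pops = [[100.0] * 6 for _ in range(6)]
    counts = [[1] * 6 for _ in range(6)]
    counts[2][3] = 30  # one strongly overdense square
    print(randomization_p_value(counts, pops, toy_max_density, n_replicas=200))
```

In practice `max_density` would be the full region search, which is why the cost of the search is multiplied by the number of replicas R.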

Finding maximum density: naïve approach • The simplest method of finding the maximum density region is to compute the density of every rectangular region, over all sizes k1 x k2. • We can compute the density of any region S in O(1). Since an N x N grid contains O(N^4) rectangular regions, computing the mdr is O(N^4). • Significance testing by randomization takes O(RN^4), where R is the number of replicas, typically 1000. • This is computationally infeasible: it takes nearly 45 days for a 256 x 256 grid.
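A minimal sketch of the naïve search, assuming cumulative (prefix) sums are what makes each region O(1) to score. The names `prefix_sums`, `region_total` and `naive_mdr` are hypothetical, and `density` is any caller-supplied function of a region's count and population.

```python
def prefix_sums(grid):
    """sums[i][j] holds the total of grid[0..i-1][0..j-1]."""
    n_rows, n_cols = len(grid), len(grid[0])
    sums = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(n_rows):
        for j in range(n_cols):
            sums[i + 1][j + 1] = (grid[i][j] + sums[i][j + 1]
                                  + sums[i + 1][j] - sums[i][j])
    return sums


def region_total(sums, top, left, height, width):
    """Total of a rectangle in O(1) using four prefix-sum lookups."""
    bottom, right = top + height, left + width
    return (sums[bottom][right] - sums[top][right]
            - sums[bottom][left] + sums[top][left])


def naive_mdr(counts, pops, density):
    """Score every rectangle; O(N^4) regions on an N x N grid."""
    c_sums, p_sums = prefix_sums(counts), prefix_sums(pops)
    n_rows, n_cols = len(counts), len(counts[0])
    best_score, best_region = float("-inf"), None
    for height in range(1, n_rows + 1):
        for width in range(1, n_cols + 1):
            for top in range(n_rows - height + 1):
                for left in range(n_cols - width + 1):
                    c = region_total(c_sums, top, left, height, width)
                    p = region_total(p_sums, top, left, height, width)
                    score = density(c, p)
                    if score > best_score:
                        best_score = score
                        best_region = (top, left, height, width)
    return best_score, best_region


if __name__ == "__main__":
    counts = [[1] * 4 for _ in range(4)]
    for i in (1, 2):
        for j in (1, 2):
            counts[i][j] = 10
    pops = [[100] * 4 for _ in range(4)]
    # Toy density: cases in excess of an expected rate of 3 per 100 people.
    print(naive_mdr(counts, pops, lambda c, p: c - 0.03 * p))
```

The four nested loops make the O(N^4) cost explicit; it is this loop nest that the pruning approach in the following slides tries to avoid.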

New approach • Since we only care about finding the maximum density region, we do not need to search over every single rectangular region: in particular, we do not need to search a set of regions if we can prove (based on other regions we have searched) that none of them can be the mdr. • These observations suggest a top-down, branch-and-bound approach: we maintain the current maximum score of the regions we have searched so far, calculate upper bounds on the scores of subregions contained in a given region, and prune regions which cannot contain the mdr.

Overlap-multires partitioning • A top-down approach to cluster detection • Search first at coarse resolutions (large regions), then at successively finer resolutions (small regions) as necessary • Exhaustive search over all regions is too expensive, so a different partitioning approach is taken. • Initial step towards partitioning: divide the rectangle first into right and left halves and then into top and bottom halves. But this would still be exhaustive, leaving O(N^4) shared regions at the top level of the tree.

The solution is to use an "overlap-multiresolution partitioning" in which we divide S into left, right, top and bottom children that each contain more than half the area of S. For this we use fractions f1, f2. • The region Sc common to all four children is the centre of S. The size of Sc is (2f1-1)k1 x (2f2-1)k2, and thus the centre has non-zero area. • Any subregion of S is either contained entirely in one of S1…S4 or contains the centre region Sc.
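The partitioning property on this slide can be checked by brute force on a small region. The sketch below is an illustration under assumptions: regions are written as (x, y, width, height) tuples, the helper names are hypothetical, and f1·k1 and f2·k2 are assumed to be whole numbers.

```python
from fractions import Fraction
from itertools import product


def overlap_children(x, y, k1, k2, f1, f2):
    """Left, right, top and bottom children of a k1 x k2 region at (x, y)."""
    w, h = int(f1 * k1), int(f2 * k2)
    return [(x, y, w, k2), (x + k1 - w, y, w, k2),
            (x, y, k1, h), (x, y + k2 - h, k1, h)]


def centre(x, y, k1, k2, f1, f2):
    """Region common to all four children: (2f1-1)k1 x (2f2-1)k2."""
    w, h = int(f1 * k1), int(f2 * k2)
    return (x + k1 - w, y + k2 - h, 2 * w - k1, 2 * h - k2)


def contains(outer, inner):
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (ox <= ix and oy <= iy
            and ix + iw <= ox + ow and iy + ih <= oy + oh)


def property_holds(k1, k2, f1, f2):
    """Check every subregion: inside some child, or containing the centre."""
    kids = overlap_children(0, 0, k1, k2, f1, f2)
    mid = centre(0, 0, k1, k2, f1, f2)
    for x, y in product(range(k1), range(k2)):
        for w, h in product(range(1, k1 - x + 1), range(1, k2 - y + 1)):
            sub = (x, y, w, h)
            in_child = any(contains(kid, sub) for kid in kids)
            if not (in_child or contains(sub, mid)):
                return False
    return True


if __name__ == "__main__":
    f = Fraction(3, 4)
    print(centre(0, 0, 8, 8, f, f))    # centre of an 8 x 8 region
    print(property_holds(8, 8, f, f))  # exhaustive check of the property
```

The check passes because a subregion that sticks out of all four children must extend past each child's inner edge, and those four inner edges are exactly the sides of the centre.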

Now we can search S by recursively searching S1…S4, then searching all of the regions contained in S which contain the centre Sc. This is the basic outline of the search procedure.

Selection of fractions f1, f2: • The set of regions S on which overlap-search(S) is called is denoted Φ. • Regions S ∈ Φ are called gridded regions. • Regions S ∉ Φ are called outer regions. • Assumption: grid G is square, and its size N is a power of 2. • f1 = 3/4 if k1 = 2^r, and f1 = 2/3 if k1 = 3 x 2^r, for some integer r (and likewise f2 for k2). • Subdividing regions on this basis forms a structure which we call an overlap-kd tree.

At the second level, note that even though grid G has four child regions, and each of its child regions has four children, G has only ten (not 16) distinct grandchildren, several of which are the child of multiple regions. • A nice property of the overlap-kd tree is that the total number of gridded regions Φ is O((N log N)^2) rather than O(N^4). That means that if we can prune all outer regions, we can find the mdr of an N x N grid in O((N log N)^2) time, or even less if we can prune some gridded regions as well.
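The counts quoted on this slide can be reproduced with a short enumeration. The sketch assumes the fraction rule f = 3/4 for a side of length 2^r and f = 2/3 for a side of length 3 x 2^r; it also assumes, purely for this example, that a side is no longer split once neither rule gives a whole-number child size. Regions are (x, y, width, height) tuples and all function names are hypothetical.

```python
def child_size(k):
    """Child side length for a side of length k, or None if not split."""
    if k >= 4 and k & (k - 1) == 0:          # k = 2^r, f = 3/4
        return 3 * k // 4
    if k % 3 == 0 and (k // 3) & (k // 3 - 1) == 0:  # k = 3 * 2^r, f = 2/3
        return 2 * k // 3
    return None  # assumption of this sketch: stop splitting this side


def children(region):
    """Left/right and top/bottom children of a gridded region."""
    x, y, w, h = region
    kids = []
    cw, ch = child_size(w), child_size(h)
    if cw is not None:
        kids += [(x, y, cw, h), (x + w - cw, y, cw, h)]
    if ch is not None:
        kids += [(x, y, w, ch), (x, y + h - ch, w, ch)]
    return kids


def grandchildren(region):
    """Distinct grandchildren; duplicates reached via two parents collapse."""
    return {g for kid in children(region) for g in children(kid)}


def gridded_regions(n):
    """All distinct regions reachable from the n x n root."""
    root = (0, 0, n, n)
    seen, stack = {root}, [root]
    while stack:
        for kid in children(stack.pop()):
            if kid not in seen:
                seen.add(kid)
                stack.append(kid)
    return seen


if __name__ == "__main__":
    n = 16
    all_rectangles = (n * (n + 1) // 2) ** 2
    print(len(grandchildren((0, 0, n, n))))      # distinct grandchildren
    print(len(gridded_regions(n)), all_rectangles)
```

Storing regions in a set is one simple way to avoid visiting a shared child twice; the lazy expansion described on the next slide achieves the same effect without keeping a set of visited regions.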

An issue in searching overlap-kd trees is to ensure that each gridded region is examined only once, rather than being called recursively by each parent, since a child region may have multiple parents. • Lazy expansion: rather than calling overlap-search(Si) on all four children of a region S, we selectively expand only certain children at each stage, in such a way that there is exactly one path from the root of the overlap-kd tree to any node of the tree. See Fig 2. • A child is expanded if it has no other parents, or if the parent node has the highest priority of all the child's parents.

Score Bounds • Used to prune regions during the multiresolution search procedure. • Given some region S, we compute an upper bound on the scores D(S') for regions S' ⊂ S. • Two upper bounds: a bound on the score of all subregions S' ⊂ S, and a bound on the score of the outer subregions of S.

Proposed method • Overlapping regions organised by the overlap-kd tree • Bounds each region and prunes regions which cannot contain the maximum density region • Significant (20-2000x) speedups on both real and simulated datasets, and O((N log N)^2) search when all outer regions can be pruned.

The algorithm • The basic structure is similar to the top-down "overlap-search" routine. • Uses best-first search, implemented with a pair of priority queues q1 and q2. • It has two stages: examining gridded regions, and then searching outer regions if necessary. • Tight bounds on subregions are calculated, and pruning is done whenever possible.

After the first stage, every gridded region has been examined or pruned, and the current mdr is the gridded region with the highest D(S). Queue q2 contains the subset of outer regions which have not been pruned yet.

The second stage is a series of "screens" that an outer region has to pass • Whether a parent region is taken off q2 • Whether the parent passes the quartering test • Whether the new parent region passes the halving test • Whether the new parent region still passes the halving test after area S1 is fixed.

Approximation • The approximate versions of the algorithm achieve greater speedups over the naïve approach, with accuracy still over 90% • Uses very conservative bounds on the D1 densities of S', S'-Sc, S-S' derived from the global minimum and maximum density values. • Instead of estimates that are not guaranteed to be bounds, an approximate lower bound is used, which guards against finding an incorrect region at a minimal loss of accuracy.

Approximation • Kulldorff's statistic assumes that there exists at least one disease cluster (S_dc) and that the disease rate q is uniform outside it. • If S_dc is contained entirely in the region S under consideration, then the disease rate is uniform in all the regions outside S. • Since the actual value of the parameter q is not known, we use a conservative empirical estimate: q = C_tot/P_tot-p_in • We omit several details in order to reduce the upper bound and reduce the likelihood of pruning a lower density region.

Problem with the assumption • We underestimate the maximum subregion score if the disease cluster S_dc is not contained entirely in S, since we are then calculating d_out based on a region with high density • In particular, there is a risk that the presence of one significant cluster may result in missing another cluster.

Experimental analysis • An artificial grid is generated from a set of parameters (N, k1, k2, μ, σ, q', q'') • The grid generator first creates an N x N grid and randomly selects a k1 x k2 "test region". • The population of each square is chosen randomly from a normal distribution with mean μ and standard deviation σ • Four kinds of test region (k1 x k2 x rate q) are used: extreme, large, small, and no disease cluster
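A grid generator of the kind described on this slide might look like the sketch below. Several points are assumptions rather than details from the slides: q' is read as the disease rate inside the test region and q'' as the rate outside it, counts are drawn from a Poisson distribution with mean rate x population, populations are clamped to at least 1, and the names `generate_grid` and `poisson_sample` are hypothetical. The sampler uses Knuth's method, so rate x population should stay small.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def generate_grid(n, k1, k2, mu, sigma, q_in, q_out, seed=0):
    """Return (pops, counts, test_region) for an n x n artificial grid."""
    rng = random.Random(seed)
    # Randomly place a k1 x k2 test region inside the grid.
    x0 = rng.randrange(n - k1 + 1)
    y0 = rng.randrange(n - k2 + 1)
    # Populations ~ Normal(mu, sigma); clamping at 1 is an assumption.
    pops = [[max(1.0, rng.gauss(mu, sigma)) for _ in range(n)]
            for _ in range(n)]
    counts = []
    for i in range(n):
        row = []
        for j in range(n):
            inside = x0 <= i < x0 + k1 and y0 <= j < y0 + k2
            rate = q_in if inside else q_out
            row.append(poisson_sample(rng, rate * pops[i][j]))
        counts.append(row)
    return pops, counts, (x0, y0, k1, k2)


if __name__ == "__main__":
    pops, counts, region = generate_grid(16, 4, 4, 1e4, 1e3, 0.002, 0.001)
    print(region, sum(map(sum, counts)))
```

Setting `q_in` equal to `q_out` gives a grid with no disease cluster, which corresponds to the fourth kind of test region listed on the slide.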

Results • Three different population distributions are used for testing: the "standard" distribution (μ = 10^4, σ = 10^3) and two types of "highly varying" populations • A 10 x 10 city region is chosen randomly. • The algorithm was counted as correct if it found the disease cluster S_dc or a region S with D(S) >= D(S_dc) • Results are reported separately for the original and replica grids, along with the number of regions scored and the bounds computed for each region

Results • Time = t_orig + R·t_rep (run time with R = 1000) • Tested on the different population distributions, so the average over all of them is reported. • Even in the worst case it performed fewer randomizations. • Two approximate variants of the algorithm were introduced, which rarely failed (density variance b = 2, 3)

Datasets • First test dataset: emergency department (ED) visits, with patient home location 0.05 deg. away • Locations were mapped onto three grid sizes • For each grid, we tested for spatial clustering of "recent" disease cases: the "count" of a square was the number of ED visits in that square in the last two months, and the "population" of a square was the total number of ED visits in that square. • Second dataset: over-the-counter cold and cough medication in the north-east region, based on zip codes

CONCLUSIONS AND FUTURE WORK • A fast multiresolution algorithm for detecting significant spatial overdensities is proposed, giving large speedups on both real and artificially generated datasets • Extend the overlap-kd tree and quartering to higher dimensions, where the number of nodes grows exponentially with dimension (e.g. studying increased brain activity through MRI scans) • Multivariate density functions • Normalized counter functions • More powerful statistical tests.

References • [4] M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6):1481–1496, 1997. • [5] M. Kulldorff. Spatial scan statistics: models, calculations, and applications. In J. Glaz and N. Balakrishnan, editors, Scan Statistics and Applications, pages 303–322. Birkhäuser, 1999.
