
Measuring Two-Event Structural Correlations on Graphs




  1. Measuring Two-Event Structural Correlations on Graphs Ziyu Guan, Nan Li, Xifeng Yan Department of Computer Science UC Santa Barbara

  2. Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work

  3. Intrusion [figure: attraction between "Ping Sweep" and "SMB Service Sweep" alerts]

  4. Product Sales • What is the relationship between the sales of two products in a social network? [figure: attraction vs. repulsion]

  5. A New Notion of Correlation • Two-Event Structural Correlation (TESC) • Defined on graph structures • Captures relationships between the distributions of two events on a graph • Events can be different things in different contexts: • Topics or products (social networks) • Viruses (contact networks) • Intrusion alerts (computer networks)

  6. It Is a Nontrivial Problem • Simply computing the average distance between occurrences of two events will not work • The average distance for a positively correlated pair could be longer than that for a negatively correlated pair • gScore cannot be adapted [Z. Guan et al., SIGMOD 2011] • Significance cannot be assessed by randomization!

  7. Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work

  8. How To Measure? • Positive correlation: the presence of event A tends to imply the presence of event B, and more A tends to attract more B • Negative correlation: the presence of one event is likely to imply the absence of the other; more A means less B • Our idea: employ reference nodes in the graph as observers to capture these characteristics quantitatively, avoiding randomization for significance testing

  9. Preliminaries • A graph G = (V, E) and an event set Q = {qi}. Given two events a and b in Q, Va and Vb are the sets of nodes having a and b, respectively. • Def. (Node h-hop neighborhood): given a node, the subgraph induced by the nodes within distance h from that node • Def. (Node set h-hop neighborhood): given a node set, the subgraph induced by the union of all nodes which are within distance h from at least one node in the set

  10. Measuring Concordance • Density function: the fraction of nodes possessing event a in reference node r's h-hop neighborhood • Concordance score for a pair of reference nodes: concordant if the density changes of the two events are consistent, discordant if they are inconsistent, and a tie otherwise
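As an illustration, the density function can be sketched in Python, assuming the graph is given as an adjacency dict (function and variable names are ours, not from the slides):

```python
from collections import deque

def event_density(adj, r, h, event_nodes):
    """Fraction of nodes in r's h-hop neighborhood possessing the event."""
    dist, queue = {r: 0}, deque([r])
    while queue:  # plain BFS truncated at depth h
        u = queue.popleft()
        if dist[u] < h:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    neighborhood = set(dist)  # nodes within h hops of r, including r itself
    return len(neighborhood & set(event_nodes)) / len(neighborhood)
```

For a path graph 0–1–2 with the event on node 2, `event_density(adj, 1, 1, {2})` gives 1/3, since node 1's 1-hop neighborhood is {0, 1, 2}.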

  11. Kendall’s Tau as The Measure Density of a Density of b • Kendall’s Tau rank correlation is used to compute the overall concordance among reference nodes with regard to density changes of the two events: • : the number of all reference nodes • lies in [-1,1]. A higher positive value means a stronger positive correlation. A lower negative value means a stronger negative correlation. 0 means no correlation.

  12. Significance Testing • The exact score is impractical to compute directly • Instead, uniformly choose a sample of n reference nodes and compute the score over this sample • It is proved that, under the null hypothesis, the distribution of the sample score tends to the normal distribution with mean 0 and a variance related to n • Thus, correlation significance is measured by a z-score
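A sketch of the z-score under the normal approximation, assuming the classical null variance of Kendall's Tau, 2(2n+5)/(9n(n-1)); the paper's exact variance term may differ, e.g. in how ties are treated:

```python
import math

def tau_z_score(tau, n):
    """z-score of a sample Tau computed over n reference nodes,
    using the classical null variance of Kendall's Tau."""
    variance = 2 * (2 * n + 5) / (9 * n * (n - 1))
    return tau / math.sqrt(variance)
```

Note that the null variance shrinks as n grows, so the same Tau value becomes more significant with a larger sample of reference nodes.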

  13. Reference Nodes • Reasons for choosing the h-hop neighborhood of the event nodes as the set of all reference nodes: • Nodes outside it ("out-of-sight" nodes) cannot reach any event node in h hops • Incorporating them could only increase the number of consistent pairs and increase the size of ties (decreasing variance in the null case), leading to unexpectedly high z-scores

  14. Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work

  15. Efficient Computation • The key problem is how to obtain a uniform sample of reference nodes from the full reference set when only the event nodes are given • We explore three algorithms for reference node sampling: Batch_BFS, importance sampling, and whole graph sampling

  16. Batch_BFS • Batch_BFS is just like an h-hop breadth-first search, but with the queue initialized with a set of nodes • Initializing the queue with all event nodes enumerates all reference nodes • Correctness can be easily verified by imagining that we start with a virtual node connected to all event nodes and then do an (h+1)-hop BFS
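The multi-source search can be sketched as follows (adjacency-dict graph; names are illustrative):

```python
from collections import deque

def batch_bfs(adj, event_nodes, h):
    """All nodes within h hops of at least one event node, i.e. the
    candidate reference nodes. Seeding the queue with every event node
    is equivalent to an (h+1)-hop BFS from a virtual node connected to
    all of them."""
    dist = {u: 0 for u in event_nodes}
    queue = deque(event_nodes)
    while queue:
        u = queue.popleft()
        if dist[u] < h:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return set(dist)
```

On a path graph 0–1–2–3–4 with events on nodes 0 and 4, `batch_bfs(adj, [0, 4], 1)` returns {0, 1, 3, 4}: node 2 is out of sight at h = 1.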

  17. Importance Sampling (1) • The sample size n is usually much smaller than the reference set. The idea is to sample nodes from it directly, avoiding its enumeration; the time cost then depends on n rather than on the size of the reference set • The basic operation is exploring the h-hop neighborhood of an event node • Difficulties: different nodes have different sizes of h-hop neighborhoods, and there could be many overlapping regions

  18. Importance Sampling (2) • Uniform sampling by rejection sampling: Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood Step 2: perform an h-hop BFS search to retrieve u's h-hop neighborhood Step 3: randomly sample a node r from u's h-hop neighborhood Step 4: do an h-hop BFS search from r to see how many event nodes it can reach (say, c event nodes) Step 5: with probability 1/c, accept r as a reference node; otherwise, get nothing from this run • Problem: heavy overlap leads to a high failure probability!
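The five steps above can be sketched as one sampling attempt (a sketch under our own naming; in practice the per-event-node neighborhoods would be precomputed once, not rebuilt on every attempt):

```python
import random
from collections import deque

def _h_hop(adj, s, h):
    """Nodes within h hops of s (plain BFS)."""
    dist, queue = {s: 0}, deque([s])
    while queue:
        x = queue.popleft()
        if dist[x] < h:
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
    return set(dist)

def rejection_sample_reference(adj, event_nodes, h, rng=random):
    """One attempt of the five-step rejection sampler: returns a
    uniformly drawn reference node, or None if the draw is rejected."""
    # Step 1: event node u, proportional to its h-hop neighborhood size.
    hoods = {u: _h_hop(adj, u, h) for u in event_nodes}
    u = rng.choices(list(hoods), weights=[len(s) for s in hoods.values()])[0]
    # Steps 2-3: sample r uniformly from u's h-hop neighborhood.
    r = rng.choice(sorted(hoods[u]))
    # Step 4: count the event nodes r can reach within h hops.
    c = sum(1 for v in _h_hop(adj, r, h) if v in set(event_nodes))
    # Step 5: accept with probability 1/c to cancel the overlap bias.
    return r if rng.random() < 1.0 / c else None
```

A node lying in the overlap of c event-node neighborhoods is drawn c times more often, so accepting it with probability 1/c restores uniformity, at the cost of wasted attempts when overlap is heavy.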

  19. Importance Sampling (3) • Follow the same sampling scheme, but do not reject any node, resulting in a nonuniform distribution over all reference nodes in which a node's probability is proportional to the number of event nodes it can reach within h hops • The goal is to design a proper estimator that leverages samples from this nonuniform distribution as a surrogate for uniform sampling • A consistent estimator combines the concordance scores with the number of times each node rj is sampled

  20. Importance Sampling (4) • The importance sampling procedure: Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood Step 2: perform an h-hop BFS search to retrieve u's h-hop neighborhood Step 3: randomly sample a node r from u's h-hop neighborhood Step 4: if r has been selected before, wr++; else add r to the sample set and set wr = 1
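A sketch of the no-rejection procedure (illustrative names; duplicates are never thrown away, they simply increment the node's weight w_r):

```python
import random
from collections import deque

def importance_sample(adj, event_nodes, h, n, rng=random):
    """n draws of the no-rejection importance sampler; returns a
    dict mapping each sampled reference node r to its weight w_r."""
    def h_hop(s):  # nodes within h hops of s
        dist, queue = {s: 0}, deque([s])
        while queue:
            x = queue.popleft()
            if dist[x] < h:
                for y in adj[x]:
                    if y not in dist:
                        dist[y] = dist[x] + 1
                        queue.append(y)
        return set(dist)

    hoods = {u: h_hop(u) for u in event_nodes}
    weights = {}
    for _ in range(n):
        # Step 1: event node u, proportional to its neighborhood size.
        u = rng.choices(list(hoods), weights=[len(s) for s in hoods.values()])[0]
        # Steps 2-3: uniform node r from u's h-hop neighborhood.
        r = rng.choice(sorted(hoods[u]))
        # Step 4: w_r++ if seen before, else w_r = 1.
        weights[r] = weights.get(r, 0) + 1
    return weights
```

Every draw is kept, so the cost is exactly n BFS-sized operations; the weights then feed the consistent estimator from the previous slide.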

  21. Whole Graph Sampling • When the set of all reference nodes is large enough relative to the whole graph, we simply sample nodes uniformly from the graph
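A sketch of this strategy, with an explicit membership test to discard the occasional non-reference draw (the `is_reference` predicate is our addition for illustration; it would check reachability of an event node within h hops):

```python
import random

def whole_graph_sample(nodes, is_reference, n, rng=random):
    """Draw uniform nodes from the whole graph until n reference nodes
    are collected. Efficient exactly when is_reference(v) holds for
    most nodes, so few draws are wasted."""
    sample = []
    while len(sample) < n:
        v = rng.choice(nodes)
        if is_reference(v):
            sample.append(v)
    return sample
```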

  22. Complexity Comparison • Space cost is the same for all three methods • Reference node sampling: • Batch_BFS: linear in the number of nodes and edges in the h-hop neighborhood of the event nodes • Importance sampling: the average cost of an h-hop BFS search per draw • Whole graph sampling: cost per accepted sample inversely proportional to the size of the h-hop neighborhood of the event nodes • Additional costs in common: • Event density computation • Z-score computation: not many sample reference nodes are needed, since the variance of t(a,b) is upper bounded

  23. Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work

  24. Experiments – Datasets • DBLP • Co-author network • Events: keywords in paper titles • 964,677 nodes, 3,547,014 edges, 0.19M keywords • Intrusion • Obtained from a log of intrusion alerts in a computer network • Events: intrusion alerts • 200,858 nodes, 703,020 edges, 545 alerts • Twitter • 20 million nodes and 0.16 billion edges (used for scalability tests)

  25. Experiments – Event Simulation (1) • Simulate positive and negative correlations (on the DBLP graph) • Generate events for three h levels: 1, 2, 3 • Positive: linked pairs, with Gaussian-distributed distances • Negative: every b is kept h+1 hops away from a • Noise: break the correlation structure by relocating a fraction of the nodes

  26. Experiments – Event Simulation (2) • Results for the positive case [figure: recall vs. noise level, for h = 1, 2, 3]

  27. Experiments – Real Events (DBLP) • Baseline: treating nodes as baskets [tables: highly positive pairs; highly negative pairs]

  28. Experiments – Real Events (Intrusion) [tables: highly positive pairs; highly negative pairs]

  29. Experiments – Scalability • Running time when increasing the number of event nodes, obtained on the Twitter graph [figure: running time for h = 1 and h = 3]

  30. Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work

  31. Discussions (based on constructive comments from Dr. Kaplan) • TESC as a correlation of local densities • Why a nonparametric statistic: no distribution assumption, no linearity assumption • Nonparametric statistics are less powerful because they use less information • Modeling nonlinear correlation of data [Kaplan et al., JSTSP, 2009] • Kendall correlation vs. Spearman correlation: both can be used • Kendall's Tau is chosen because it has an intuitive interpretation and facilitates importance sampling • Intra-correlation and inter-correlation?

  32. Future Work • Structure helps explain the distribution of events; conversely, events could also help explain structure • Examples: users who discuss very similar topics, or who buy very similar products

  33. Thank You!!! Questions?
