Using Structure Indices for Efficient Approximation of Network Properties

Matthew J. Rattigan, Marc Maier, and David Jensen, University of Massachusetts Amherst. Data Mining, November 27, 2006, Deborah Stoffer.


Presentation Transcript


  1. Using Structure Indices for Efficient Approximation of Network Properties
     Matthew J. Rattigan, Marc Maier, and David Jensen, University of Massachusetts Amherst
     Data Mining, November 27, 2006, Deborah Stoffer

  2. The Problem
     • Recent research works with very large networks
       • Millions of nodes
     • Calculating network statistics on very large networks can be difficult
       • Shortest paths
       • Betweenness centrality
         • The proportion of all shortest paths in the network that run through a given node
       • Closeness centrality
         • The average distance from the given node to every other node in the network
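The closeness definition above can be made concrete with a small sketch. It assumes an unweighted graph stored as an adjacency dictionary; the helper names and the toy graph are illustrative and do not come from the slides.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, node):
    """Average distance from node to every other reachable node."""
    dist = bfs_distances(adj, node)
    others = [d for v, d in dist.items() if v != node]
    return sum(others) / len(others)

# Hypothetical toy graph: the path a - b - c - d.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

One breadth-first search is needed per node, which is the cost that becomes prohibitive on graphs with millions of nodes.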

  3. The Problem
     • The most efficient known algorithms for calculating betweenness centrality and closeness centrality are O(ne + n^2 log n)
       • n = number of nodes
       • e = number of edges
     • Calculations for path finding can have even higher complexity
       • Require bidirectional breadth-first search

  4. The Problem
     • Example: Rexa citation graph
       • Papers in computer science and related fields
       • Largest connected component contains 165,000 nodes (papers) and 321,000 edges (citations)
       • Finding a path of length 15 requires the exploration of 65,000 nodes

  5. The Problem

  6. Network Structure Index (NSI)
     • Similar to the type of index commonly used to speed queries in modern database systems
     • Can be constructed once for a given graph and then used to speed the calculation of many measures on the graph
     • Two components of an NSI
       • A set of annotations on every node in the network that provide information about relative or absolute location
         • For G(V, E) the annotations define A: V → S, where S is an arbitrarily complex “annotation space”
       • A distance function that uses the annotations to define graph distance between pairs of nodes by mapping pairs of node annotations to a positive real number
         • D: S × S → R
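The two components can be pictured as a small data structure. This is only a sketch: the class name `StructureIndex`, its fields, and the toy annotations are assumptions made for illustration, not an interface defined in the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Hashable

@dataclass
class StructureIndex:
    """A hypothetical container for the two NSI components."""
    annotations: Dict[Hashable, Any]          # A: V -> S
    distance_fn: Callable[[Any, Any], float]  # D: S x S -> R

    def distance(self, s, t):
        """Estimate graph distance using only the two node annotations."""
        return self.distance_fn(self.annotations[s], self.annotations[t])

# Toy example: nodes on a line, annotated with an integer position.
toy_index = StructureIndex(
    annotations={"a": 0, "b": 1, "c": 2},
    distance_fn=lambda x, y: abs(x - y),
)
```

Calling `toy_index.distance("a", "c")` never touches the graph itself, which is what lets an index be built once and reused across many queries.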

  7. Types of Network Structure Indices
     • All Pairs Shortest Path (APSP)
     • Degree
     • Landmark
     • Global Network Positioning (GNP)
     • Zone
     • Distance to Zone (DTZ)

  8. All Pairs Shortest Path NSI
     • Node annotations
       • Consist of an n × n matrix (n = |V|) containing the optimal path distances between all pairs of nodes
     • Distance function
       • A simple lookup in the matrix
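A minimal sketch of the APSP index for an unweighted graph follows, assuming an adjacency-dictionary representation and one breadth-first search per node. The function names are illustrative.

```python
from collections import deque

def apsp_matrix(adj):
    """Build the n x n matrix of hop distances with one BFS per node."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    matrix = [[float("inf")] * n for _ in range(n)]
    for s in nodes:
        row = matrix[index[s]]
        row[index[s]] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if row[index[v]] == float("inf"):
                    row[index[v]] = row[index[u]] + 1
                    queue.append(v)
    return index, matrix

def apsp_distance(index, matrix, s, t):
    """The distance function is a plain matrix lookup."""
    return matrix[index[s]][index[t]]

# Hypothetical toy graph: a 4-cycle a - b - c - d - a.
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
square_index, square_matrix = apsp_matrix(square)
```

The lookup is exact and constant time, but the matrix needs n^2 entries, which is why this index serves as a reference point rather than a practical option for very large graphs.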

  9. Degree NSI
     • Node annotations
       • Annotate each node with its undirected degree within the graph
     • Distance function between source node s and target node t
       • D_Degree(s, t) = 2n - degree(s) - degree(t)
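The Degree NSI translates almost directly into code. The sketch below assumes a simple undirected graph stored as an adjacency dictionary with no duplicate neighbors; the star graph is a made-up example.

```python
def degree_annotations(adj):
    """Annotate each node with its undirected degree."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def degree_distance(annotations, s, t):
    """D_Degree(s, t) = 2n - degree(s) - degree(t), with n = |V|."""
    n = len(annotations)
    return 2 * n - annotations[s] - annotations[t]

# Hypothetical toy graph: a star with one hub and three leaves.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
star_degrees = degree_annotations(star)
```

Pairs that involve high-degree nodes get smaller values, so a search guided by this function is steered toward hubs.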

  10. Landmark NSI
      • Randomly designate a small number of nodes in the network to serve as navigational beacons
      • Node annotations
        • Annotate nodes in the graph by flooding out from each landmark and recording the graph distance to each node in the network
        • Gives a vector of graph distances for each node
      • Distance function
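The annotation step can be sketched as one breadth-first flood per landmark. The slide's distance function is not preserved in this transcript, so `landmark_upper_bound` below is an assumption: it uses the triangle-inequality bound through the best landmark, which may differ from the function the authors used. The sketch also assumes a connected graph.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Flood out from source, recording hop distance to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def landmark_annotations(adj, num_landmarks, seed=0):
    """Return the landmarks and a distance vector for every node."""
    rng = random.Random(seed)
    landmarks = rng.sample(sorted(adj), num_landmarks)
    floods = [bfs_distances(adj, l) for l in landmarks]
    # Assumes a connected graph, so every flood reaches every node.
    annotations = {v: [d[v] for d in floods] for v in adj}
    return landmarks, annotations

def landmark_upper_bound(vec_s, vec_t):
    """ASSUMED distance function: shortest detour through any landmark."""
    return min(a + b for a, b in zip(vec_s, vec_t))

# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
line_landmarks, line_vectors = landmark_annotations(line, num_landmarks=2)
```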

  11. Landmark NSI

  12. Global Network Positioning NSI
      • Node annotations
        • Use a nonlinear optimization algorithm to create a multidimensional coordinate system that encodes the location of each node within the network
      • Distance function
        • The Manhattan distance between node pairs

  13. Zone NSI
      • Node annotations
        • Each node is annotated with a d-dimensional vector of zone labels
      • Distance function

  14. Zone NSI Algorithm
      • For each of d dimensions
        • Randomly select k seed nodes, assign them zone labels 1 through k, and place them in the labeled set
        • Place all other nodes in the unlabeled set
        • While the unlabeled set is not empty
          • Randomly select a node l from the labeled set
          • Randomly select a node u from the unlabeled set that is a neighbor of l
          • Assign u to the same zone as l and move it to the labeled set
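The labeling loop above can be sketched as follows. Two details are assumptions rather than part of the slide: labeled nodes with no unlabeled neighbors are dropped from a `frontier` list so they are not drawn again (a rejection-sampling shortcut), and the loop stops early if the graph is disconnected and some nodes cannot be reached.

```python
import random

def zone_labels_one_dimension(adj, k, rng):
    """Grow k zones outward from random seeds until all nodes are labeled."""
    seeds = rng.sample(sorted(adj), k)
    zone = {s: i + 1 for i, s in enumerate(seeds)}  # zone labels 1..k
    frontier = list(seeds)  # labeled nodes that may have unlabeled neighbors
    while len(zone) < len(adj) and frontier:
        idx = rng.randrange(len(frontier))
        l = frontier[idx]
        candidates = [u for u in adj[l] if u not in zone]
        if not candidates:
            # l is exhausted: remove it in O(1) and draw again.
            frontier[idx] = frontier[-1]
            frontier.pop()
            continue
        u = rng.choice(candidates)
        zone[u] = zone[l]
        frontier.append(u)
    return zone

def zone_annotations(adj, k, d, seed=0):
    """Annotate each node with a d-dimensional vector of zone labels."""
    rng = random.Random(seed)
    dims = [zone_labels_one_dimension(adj, k, rng) for _ in range(d)]
    return {v: [dim[v] for dim in dims] for v in adj}

# Hypothetical toy graph: a ring of 12 nodes.
ring = {i: [(i - 1) % 12, (i + 1) % 12] for i in range(12)}
ring_zones = zone_annotations(ring, k=3, d=2, seed=1)
```

Each dimension uses fresh random seeds, so two nodes that share a zone in one dimension may be separated in another.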

  15. Zone NSI

  16. Distance to Zone (DTZ) NSI
      • Hybrid between Landmark and Zone NSIs
      • Node annotations
        • Divide the graph into zones and, for each node u and zone Z, calculate the distance from u to the closest node in Z
      • Distance function
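The DTZ annotation can be computed with one multi-source breadth-first search per zone. This sketch covers only the annotation step, since the slide's distance function is not preserved in the transcript. It assumes a single zoning (one dimension) supplied as a dictionary `zone_of`, a name introduced here for illustration.

```python
from collections import deque

def distance_to_zone(adj, zone_of):
    """For each node u and zone Z, record the distance to the closest node in Z."""
    annotations = {v: {} for v in adj}
    for z in sorted(set(zone_of.values())):
        # Start the search from every member of the zone at distance 0.
        dist = {v: 0 for v in adj if zone_of[v] == z}
        queue = deque(dist)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for v, dv in dist.items():
            annotations[v][z] = dv
    return annotations

# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4 - 5 split into two zones.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
chain_zones = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
chain_dtz = distance_to_zone(chain, chain_zones)
```

Starting each search from a whole zone keeps the cost at one traversal per zone instead of one per node.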

  17. Distance to Zone (DTZ) NSI

  18. Complexity of Different NSIs

  19. Search Performance
      • Optimality of the lengths of paths found
      • Path ratio
        • p_f is the length of the found paths
        • p_o is the length of the optimal paths
        • r is the number of randomly selected pairs of nodes in the graph
        • P = 1.0 indicates an NSI that finds optimal paths
        • P >> 1.0 indicates a poorly performing NSI

  20. Search Performance
      • Performance gain
      • Exploration ratio
        • e_f is the number of nodes explored by best-first search
        • e_b is the number of nodes explored by a bidirectional breadth-first search
        • r is the number of pairs of nodes in the graph
        • E values close to zero indicate good search performance
        • E values greater than 1.0 indicate poor search performance
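To show what is being counted, here is a sketch of best-first search guided by a distance estimate, with a count of explored nodes. Several parts are assumptions: the grid graph and the Manhattan estimate stand in for a real network and a real NSI distance function, and the baseline is a plain one-directional breadth-first search rather than the bidirectional search named on the slide. The aggregate formulas for P and E are not reproduced because they are not preserved in the transcript.

```python
import heapq
from collections import deque

def best_first_search(adj, source, target, estimate):
    """Greedy best-first search; returns (path, number of nodes explored)."""
    parent = {source: None}
    heap = [(estimate(source, target), source)]
    explored = 0
    while heap:
        _, u = heapq.heappop(heap)
        explored += 1
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1], explored
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                heapq.heappush(heap, (estimate(v, target), v))
    return None, explored

def bfs_explored(adj, source, target):
    """Nodes dequeued by a one-directional BFS before reaching target."""
    seen = {source}
    queue = deque([source])
    explored = 0
    while queue:
        u = queue.popleft()
        explored += 1
        if u == target:
            break
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return explored

def grid_graph(size):
    """Hypothetical test graph: a size x size lattice with 4-neighborhoods."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(x, y): [(x + dx, y + dy) for dx, dy in steps
                     if 0 <= x + dx < size and 0 <= y + dy < size]
            for x in range(size) for y in range(size)}

def manhattan(u, v):
    """Stand-in for an NSI distance function on grid coordinates."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

grid = grid_graph(5)
found_path, explored_best_first = best_first_search(grid, (0, 0), (4, 4), manhattan)
explored_bfs = bfs_explored(grid, (0, 0), (4, 4))
```

On this grid the estimate is exact, so the guided search walks straight to the target while the uninformed search visits every node. A less accurate estimate would explore more nodes and could return a longer path.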

  21. Search Performance
      • NSIs evaluated on synthetic graphs
        • Random
        • Rewired lattices
        • Forest Fire

  22. Search Performance

  23. Search Performance

  24. Search Performance

  25. Search Performance

  26. Constant Time Distance Estimation
      • Can sometimes use an NSI to directly estimate the graph distance between any two nodes
      • Can use the DTZ annotation distance to estimate actual graph distances
        • Annotate the graph as described for the DTZ NSI
        • Randomly sample p pairs of nodes in the graph and perform breadth-first search to obtain their exact graph distance
        • Use linear regression to obtain an equation for estimated distance
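The calibration procedure can be sketched with ordinary least squares on a single predictor. The `annotation_distance` argument is a placeholder for whatever NSI distance is being calibrated (the slide uses DTZ), and the single-variable linear form is an assumption about the regression.

```python
import random
from collections import deque

def bfs_distance(adj, s, t):
    """Exact hop distance from s to t (None if unreachable)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def training_data(adj, annotation_distance, p, seed=0):
    """Sample p node pairs; pair each annotation distance with the exact distance."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    xs, ys = [], []
    for _ in range(p):
        s, t = rng.sample(nodes, 2)
        xs.append(annotation_distance(s, t))
        ys.append(bfs_distance(adj, s, t))
    return xs, ys

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimate_distance(intercept, slope, annotation_value):
    """Constant-time estimate once the line has been fitted."""
    return intercept + slope * annotation_value
```

After fitting, each estimate costs one annotation lookup and one multiply-add, regardless of graph size. The fit requires sampled annotation distances that are not all identical.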

  27. Constant Time Distance Estimation

  28. Constant Time Distance Estimation

  29. Constant Time Distance Estimation
      • Simple distance can be used to produce a wide variety of attributes on nodes, which can be used by data mining algorithms that analyze graphs
      • Label nodes with their distance to a particular node in a graph
        • How close is each actor to Kevin Bacon?
      • Label nodes with the minimum or maximum distance to one of a set of designated nodes
        • How close is each actor to an Academy Award winner?

  30. Closeness Centrality
      • Measures the proximity of a given node in a network to every other node
      • Important to social network dynamics
      • Accurate estimates of closeness centrality are often impossible to calculate for large data sets
      • Using an NSI for path finding can estimate closeness centrality efficiently

  31. Closeness Centrality

  32. Closeness Centrality
      • A measure of centrality can be used to produce attributes on nodes that may be useful to knowledge discovery algorithms
      • Determine the closeness of every node to a collection of key nodes
        • Closeness to all winners of Academy Awards for best actor in the past 10 years
      • Constrain closeness calculations for members of clusters
        • Closeness rank of an actor within their movie industry
      • Weight closeness based on the attributes of the outlying nodes
        • Closeness to winners of Academy Awards weighted by how recent an award

  33. Betweenness Centrality
      • Measures the number of short paths on which a given node lies
      • Important to social network dynamics
      • Accurate estimates of betweenness centrality are often impossible to calculate for large data sets

  34. Betweenness Centrality
      • Can estimate betweenness using the paths identified through NSI navigation
        • Randomly sample pairs of nodes and discover the shortest path between them
        • Count the number of times each node in the graph appears on one of these paths to obtain a betweenness ranking
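The sampling procedure can be sketched as below. Two choices are assumptions: paths are found with plain breadth-first search as a stand-in for NSI-guided navigation, and only interior nodes of each path are counted, since the slide does not say whether endpoints are included.

```python
import random
from collections import Counter, deque

def bfs_path(adj, s, t):
    """Return one shortest path from s to t, or None if unreachable."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if t not in parent:
        return None
    path = []
    while t is not None:
        path.append(t)
        t = parent[t]
    return path[::-1]

def sampled_betweenness(adj, num_pairs, seed=0):
    """Count how often each node lies strictly inside a sampled path."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    counts = Counter({v: 0 for v in nodes})
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = bfs_path(adj, s, t)
        if path:
            counts.update(path[1:-1])  # interior nodes only (assumed)
    return counts

def betweenness_ranking(counts):
    """Nodes ordered from most to least frequently traversed."""
    return [v for v, _ in counts.most_common()]

# Hypothetical toy graph: a star whose hub lies on every leaf-to-leaf path.
star = {"hub": ["a", "b", "c", "d", "e"],
        "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"], "e": ["hub"]}
star_counts = sampled_betweenness(star, num_pairs=50)
```

The result is a ranking rather than an exact betweenness value, and its quality depends on how many pairs are sampled and how close the discovered paths are to optimal.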

  35. Betweenness Centrality

  36. Betweenness Centrality
      • A high betweenness score can indicate a bridge between two communities
        • An actor who has played in movies belonging to different movie industries
      • Betweenness centrality can be used to create features on nodes that are useful for data mining
      • Calculate betweenness centrality for particular groups of nodes
        • Actors that sit between winners of Academy Awards for best picture and the IMDb’s “Bottom 100”, the worst 100 movies as voted by users of the Internet Movie Database

  37. Conclusions
      • The Zone and DTZ NSIs allow efficient and accurate estimation of path lengths between arbitrary nodes in a network
      • Efficient calculation of network statistics allows a wider range of potential approaches to knowledge discovery
      • The space of potential NSIs has not been exhaustively researched
      • NSIs could have other applications
        • Finding connection subgraphs
        • Approximating neighborhood functions

  38. Questions?
