EDA with Graphs. Chris Volinsky Shannon Laboratory AT&T Labs-Research Workshop on Statistical Inference, Computing and Visualization for Graphs Stanford University August 2, 2003. Introduction. Some suggestions about looking at graphs Our way of analyzing graphs: COI


EDA with Graphs

Chris Volinsky

Shannon Laboratory

AT&T Labs-Research

Workshop on Statistical Inference, Computing and Visualization for Graphs

Stanford University

August 2, 2003

Introduction

  • Some suggestions about looking at graphs

  • Our way of analyzing graphs: COI

  • Two motivating examples

  • Challenges for the room

Main point – sometimes EDA is all you need!

Preaching to the choir…

  • Visualize, even when you can’t

    • Speech example

  • Learn a little graph theory, even if you don’t want to

    • Expand your toolbox with:

      • bridges

      • cutpoints

      • centroids

      • pseudo cliques

      • strongly connected components

      • etc.

  • Look at node and edge variables, even if they are not there

    • Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
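
As a rough illustration of two items from this toolbox, the sketch below computes in/out degree for every node and finds strongly connected components with Kosaraju's algorithm. It uses only the Python standard library; the function names and the edge-list input format are assumptions made for this example, not something from the talk.

```python
from collections import defaultdict


def degree_variables(edges):
    """Return {node: (in_degree, out_degree)} for directed (src, dst) edges."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    nodes = set(indeg) | set(outdeg)
    return {n: (indeg[n], outdeg[n]) for n in nodes}


def strongly_connected_components(edges):
    """Kosaraju's algorithm; returns a list of sets of nodes."""
    fwd, rev = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        fwd[src].append(dst)
        rev[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                stack.pop()
                order.append(node)

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        component, stack = set(), [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(degree_variables(edges))
print(strongly_connected_components(edges))
```

Variables such as in/out degree come for free from the edge list, which is the point of the last bullet: they can be used as node covariates even when no other node data exist.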

Our data

  • Huge! Hundreds of millions of nodes and edges, mostly connected

  • Modelling, or even EDA, on the entire graph may not be possible

  • COI – Communities of Interest are one way of analyzing these data

    • Storage - Break it down

    • Analysis – Build up from signatures

    • Updating - Through time via exponential smoothing

Storage - Break it down

  • Consider the atomic units of the graph, which we call a COI signature:

    • For every node in the graph, store

      • Top k numbers inbound

      • Top k numbers outbound

      • Weights on each edge

      • An overflow bin (aggregating all other numbers)

  • In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
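
One way to picture a COI signature is as a small per-node record. The sketch below is a minimal Python version under stated assumptions: the class name `COISignature`, its field names, the `(caller, callee, weight)` input format, and the choice to keep a separate overflow total for each direction are all hypothetical. The production system described later in the talk uses Hancock, not Python.

```python
from dataclasses import dataclass, field


@dataclass
class COISignature:
    """Hypothetical per-node record: top-k edges each way plus overflow totals."""
    k: int = 3
    inbound: dict = field(default_factory=dict)   # caller -> weight
    outbound: dict = field(default_factory=dict)  # callee -> weight
    overflow_in: float = 0.0                      # weight outside the inbound top k
    overflow_out: float = 0.0                     # weight outside the outbound top k


def _top_k(weights, k):
    """Keep the k heaviest entries; sum the rest into an overflow total."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    return kept, overflow


def build_signatures(calls, k=3):
    """calls: iterable of (caller, callee, weight) tuples."""
    out_w, in_w = {}, {}
    for caller, callee, w in calls:
        out_w.setdefault(caller, {})
        out_w[caller][callee] = out_w[caller].get(callee, 0.0) + w
        in_w.setdefault(callee, {})
        in_w[callee][caller] = in_w[callee].get(caller, 0.0) + w

    signatures = {}
    for node in set(out_w) | set(in_w):
        outbound, over_out = _top_k(out_w.get(node, {}), k)
        inbound, over_in = _top_k(in_w.get(node, {}), k)
        signatures[node] = COISignature(k, inbound, outbound, over_in, over_out)
    return signatures


calls = [("A", "B", 5.0), ("A", "C", 3.0), ("A", "D", 1.0), ("B", "A", 2.0)]
sigs = build_signatures(calls, k=2)
print(sigs["A"])
```

Storing the signatures in a dictionary keyed by node mirrors the idea of indexed storage: any one "little graph" can be fetched without touching the rest.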

Analysis – Build up from signatures

  • Fraud – we build signatures

    • When, how long, but not to whom

  • We use the COI signature to build a Community of Interest for everyone, and then use that for analysis

    • Example

  • Communities are everywhere (e.g. Amazon), but representing (and visualizing) as a graph gives a lot of insight.

Updating through time

  • Our graph is dynamic

    • 3M new/old numbers per week!

  • We use an exponentially weighted moving average as a way to smoothly update through time…
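
A minimal sketch of such an update, assuming the common form new = q × old + (1 − q) × today applied to each edge weight. The parameter name `q` is borrowed from the "Selecting q" slide later in the deck; the exact recursion, the default value, and the function name are assumptions for illustration only.

```python
def ewma_update(old, today, q=0.85):
    """Blend stored edge weights with today's activity.

    old, today: dicts mapping a neighbour to an edge weight.
    q: weight given to the past; (1 - q) goes to today's data.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    neighbours = set(old) | set(today)
    return {
        n: q * old.get(n, 0.0) + (1.0 - q) * today.get(n, 0.0)
        for n in neighbours
    }


stored = {"B": 10.0}
todays_calls = {"C": 20.0}
print(ewma_update(stored, todays_calls, q=0.5))
```

An edge that stops appearing is never deleted outright; its weight simply shrinks by a factor of q at every update, which is what lets old numbers fade out smoothly.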

Two motivating examples

  • Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling

  • Viral Marketing

  • Fraud

Viral Marketing plans

  • Viral Marketing – let your customers sell for you

  • COI was the perfect tool to throw at this: by capturing the local neighborhood of the enrollees, we can test the viral hypothesis

  • We can also track through time

  • What did we do?

    • For the enrollees, find the induced subgraph from their COI

    • Look at a control group
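
The two steps above can be sketched as follows. This is an illustration under assumptions: the COI of a node is approximated by its direct neighbours, and the comparison statistic (the share of group members linked to another member of the same group) is a hypothetical choice, not necessarily the one used in the study.

```python
def induced_subgraph(edges, seeds):
    """Nodes = seeds plus their direct neighbours; edges = all edges among them.

    edges: iterable of (src, dst, weight) tuples.
    """
    seeds = set(seeds)
    edges = list(edges)
    nodes = set(seeds)
    for src, dst, _ in edges:
        if src in seeds:
            nodes.add(dst)
        if dst in seeds:
            nodes.add(src)
    kept = [(s, d, w) for s, d, w in edges if s in nodes and d in nodes]
    return nodes, kept


def linked_fraction(edges, group):
    """Fraction of group members with at least one edge to another member."""
    group = set(group)
    linked = set()
    for src, dst, _ in edges:
        if src in group and dst in group and src != dst:
            linked.update((src, dst))
    return len(linked) / len(group) if group else 0.0


edges = [("e1", "e2", 3), ("e2", "x", 1), ("c1", "y", 2), ("c2", "z", 1), ("x", "y", 1)]
enrollees, control = ["e1", "e2"], ["c1", "c2"]
print(induced_subgraph(edges, enrollees))
print(linked_fraction(edges, enrollees), linked_fraction(edges, control))
```

If enrollees are linked to one another far more often than a comparable control group, that is descriptive evidence for the viral hypothesis, obtained without fitting a model.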

Cluster results…

Let’s look at some…

RDD: Repetitive Debtors Database

  • Lots of people can’t pay their bills, but they want phone service anyway.

Figure: connect pool (30 days)

RDD Process

  • A big matching problem…

  • Every day:

    • We get restricted TNs (telephone numbers): 4K / day

    • We get connected TNs: 40K / day

    • Look over a 30-day period (possibly 4B comparisons!)

    • Compare the COI graphs of the disconnected number and the new number…

    • We need a metric for graph distance

Matching Strategy

  • We use a combination of:

    • Intersection of the two COIs > 2 (to pare down)

    • Name/address overlap (to weed out)

    • Dollars owed (to prioritize)

    • Here’s where modeling could help… or maybe not
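
The three filters can be chained as in the sketch below. The record fields (`tn`, `coi`, `name`, `address`, `owed`) and the function name are hypothetical, and "weed out" is read here as "discard pairs that share neither a name nor an address", which is one interpretation of the slide rather than a confirmed rule. The nested loop is for clarity only; at billions of comparisons a real system would need indexed lookups.

```python
def match_candidates(restricted, connected, min_shared=3):
    """Pair disconnected accounts with new accounts that look like the same party.

    restricted: dicts with keys tn, coi (set), name, address, owed.
    connected:  dicts with keys tn, coi (set), name, address.
    min_shared=3 encodes the slide's "intersection > 2".
    """
    matches = []
    for old in restricted:
        for new in connected:
            shared = len(old["coi"] & new["coi"])
            if shared < min_shared:  # pare down
                continue
            same_name = old["name"].strip().lower() == new["name"].strip().lower()
            same_addr = old["address"].strip().lower() == new["address"].strip().lower()
            if not (same_name or same_addr):  # weed out (assumed reading)
                continue
            matches.append((old["tn"], new["tn"], shared, old["owed"]))
    # Prioritize: largest debt first, then largest COI overlap.
    matches.sort(key=lambda m: (-m[3], -m[2]))
    return matches


restricted = [
    {"tn": "r1", "coi": {"a", "b", "c", "d"}, "name": "J Smith",
     "address": "1 Elm St", "owed": 500.0},
]
connected = [
    {"tn": "n1", "coi": {"a", "b", "c", "z"}, "name": "John Smith",
     "address": "1 elm st"},
    {"tn": "n2", "coi": {"a", "b"}, "name": "J Smith", "address": "1 Elm St"},
]
print(match_candidates(restricted, connected))
```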

Wrap up

  • Viral Marketing

    • Used connected components of reduced data as ‘clusters’

    • Looked for ‘centers’ of clusters for retention

    • Visualized clusters for understanding

    • Used boundary to predict new customers

    • COI was the best predictive variable in a marketing study

  • Fraud

    • Attacked massive matching via simple measures of distance

    • Fraud reps use visualized clusters to work cases

    • We detected RDD with an 80% success rate

      Is this EDA?
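
To make "connected components of reduced data" and "boundary" concrete, here is a small standard-library sketch. It assumes that "reduced" means dropping edges below a weight threshold and that the boundary of a cluster is the set of outside nodes adjacent to it; both readings, and the function names, are assumptions for illustration.

```python
from collections import defaultdict, deque


def connected_components(edges, min_weight=0.0):
    """Undirected components after dropping edges lighter than min_weight."""
    adj = defaultdict(set)
    for src, dst, w in edges:
        if w >= min_weight:
            adj[src].add(dst)
            adj[dst].add(src)

    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from start
            node = queue.popleft()
            component.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def boundary(edges, members):
    """Nodes outside `members` that share an edge with a member."""
    members = set(members)
    outside = set()
    for src, dst, _ in edges:
        if src in members and dst not in members:
            outside.add(dst)
        if dst in members and src not in members:
            outside.add(src)
    return outside


edges = [("a", "b", 5), ("b", "c", 4), ("c", "d", 0.5), ("d", "e", 6)]
print(connected_components(edges, min_weight=1))
print(boundary(edges, {"a", "b", "c"}))
```

Boundary nodes are natural candidates when asking who might become a customer next, since they already communicate with members of the cluster.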


  • Viewing graphs through time

    • What if I don’t know what is coming next?

  • Graph distance metrics

    • What does “distance between graphs” mean?

  • Tools for looking at many graphs

    • What do union and intersection mean?

  • Modelling and EDA go hand in hand

    • Viral marketing models define network value; feed this into the graph to do EDA…

An answer for Duncan…

  • What do I want and who is going to do it?

    • Tools that combine:

      • Interactive capability

      • Graph operations

      • Statistical analysis

    • It’s happening

    • It’s great!!

    • It’s a little confusing

      This model works for me… do you agree?

  • What I want…

  • Powerful ways to do union/intersection

    • It is actually unclear what that means

  • Statistical measures of distance between graphs: what is the metric of interest, really?

  • Use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest)

  • Standard tools for finding graph-theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary

  • Ability to visualize the temporal component of graphs: is there a paradigm other than plotting the ubergraph?

  • Points to make

  • If each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons… SNA stuff is great, but it doesn’t really work!

  • Sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user.

  • Think graphs, and plot them! Even if you can’t plot the whole thing, plot some of it (do speech example)…

  • “Network value” might be important. This might not be the same as density: it may be a sunburst, which is not a high-density subgraph, or highest value; it may depend on time

  • Modelling can be great: find pseudo edges, use latent space models, etc…

  • Visualize, even when you can’t

    • There is always a way to subset or threshold, or something

    • Speech example

  • Learn some graph theory

    • Bridge nodes/edges

    • Density, definitions of cliques and pseudo cliques

    • DFS/BFS, minimum spanning trees…

    • Strongly connected components

  • Subset

Storing COI Signatures

  • COI signatures are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al.)

  • They are indexed by TN, so it is easy and fast to get the COI for large lists of TNs, and to use spiders for recursion.

  • E.g., cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!

Informative overlap score

  • Calculate the “informative overlap” score:


w_ao = weight of the edge from a to o

w_ob = weight of the edge from o to b

w_o = sum of the weights of the edges to o

d_ao, d_ob = graph distances from a and b to o
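
The quantities defined above can be combined in more than one way, and the formula itself is not given in this transcript. The sketch below assumes one plausible combination: each shared node o contributes (w_ao × w_ob / w_o) / (d_ao × d_ob), so that heavily used edges raise the score while very popular or distant shared nodes count for less. Treat the formula, the function name, and the input format as assumptions, not as the author's definition.

```python
def informative_overlap(shared_nodes):
    """Score the overlap between the COIs of two numbers a and b.

    shared_nodes: iterable of (w_ao, w_ob, w_o, d_ao, d_ob) tuples,
    one per node o that appears in both COIs.
    ASSUMED formula: sum over o of (w_ao * w_ob / w_o) / (d_ao * d_ob).
    """
    score = 0.0
    for w_ao, w_ob, w_o, d_ao, d_ob in shared_nodes:
        if w_o <= 0 or d_ao <= 0 or d_ob <= 0:
            continue  # skip degenerate entries rather than divide by zero
        score += (w_ao * w_ob / w_o) / (d_ao * d_ob)
    return score


# One shared neighbour that few others call, one that almost everyone calls.
rare = (2.0, 4.0, 8.0, 1, 1)
popular = (2.0, 4.0, 80.0, 1, 1)
print(informative_overlap([rare]), informative_overlap([popular]))
```

Under this assumed form, sharing a rarely called number is stronger evidence of a match than sharing a number that many people call, which is what makes the overlap "informative".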




Selecting q

Calls fade out over time.

The larger q is, the longer a call has non-negligible weight.
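
To see how q controls the fade-out, assume the update multiplies every stored weight by q at each step (new = q × old + (1 − q) × today). A single call's contribution then shrinks in proportion to q^n after n updates. The helper below, whose name and default threshold are hypothetical, returns the number of updates before that factor drops below a chosen fraction `eps`.

```python
import math


def periods_until_negligible(q, eps=0.01):
    """Smallest n with q**n <= eps, under the assumed geometric decay."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(math.log(eps) / math.log(q))


for q in (0.5, 0.9):
    print(q, periods_until_negligible(q))
```

With a 1% threshold, q = 0.5 forgets a call after 7 updates, while q = 0.9 keeps it relevant for 44, so the choice of q is effectively a choice of memory length.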