# EDA with Graphs

EDA with Graphs. Chris Volinsky, Shannon Laboratory, AT&T Labs-Research. Workshop on Statistical Inference, Computing and Visualization for Graphs, Stanford University, August 2, 2003. Introduction: some suggestions about looking at graphs; our way of analyzing graphs: COI.


### EDA with Graphs

Chris Volinsky

Shannon Laboratory

AT&T Labs-Research

Workshop on Statistical Inference, Computing and Visualization for Graphs

Stanford University

August 2, 2003

### Introduction
• Some suggestions about looking at graphs
• Our way of analyzing graphs: COI
• Two motivating examples
• Challenges for the room

Main point – sometimes EDA is all you need!

### Preaching to the choir…
• Visualize, even when you can’t
  • Speech example
• Learn a little graph theory, even if you don’t want to
  • bridges
  • cutpoints
  • centroids
  • pseudo cliques
  • strongly connected components
  • Etc.
• Look at node and edge variables, even if they are not there
  • Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
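
As an illustration of a variable induced by the graph itself, in-degree and out-degree can be read straight off an edge list. This is a minimal sketch, assuming directed edges given as hypothetical `(src, dst)` pairs; the function name `degree_variables` is invented for illustration and is not from the talk.

```python
from collections import Counter

def degree_variables(edges):
    """Return {node: (in_degree, out_degree)} for directed (src, dst) edges."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(in_deg) | set(out_deg)
    # Counter returns 0 for nodes that never appear on one side of an edge.
    return {node: (in_deg[node], out_deg[node]) for node in nodes}

# Hypothetical call records: who called whom.
calls = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
print(degree_variables(calls))
```

Centrality and boundary membership could be attached to nodes in the same way, as extra columns derived from the edges alone.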
### Our data
• Huge! Hundreds of millions of nodes and edges, mostly connected
• Modelling, or even EDA, on the entire graph may not be possible
• COI – Communities of Interest are one way of analyzing these data
  • Storage - Break it down
  • Analysis – Build up from signatures
  • Updating - Through time via exponential smoothing
### Storage - Break it down
• Consider the atomic unit of the graph, which we call a COI signature:
• For every node in the graph, store
  • Top k numbers inbound
  • Top k numbers outbound
  • Weights on each edge
  • Overflow bin
• In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
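
The signature described above can be sketched as a small per-node record. This is only an illustration under stated assumptions: the names `COISignature`, `top_k_with_overflow`, and `build_signature` are hypothetical, edges are taken to be `(src, dst, weight)` triples, and the overflow bin is assumed to hold the summed weight of every edge that falls outside the top k. The production store is Hancock, not Python.

```python
from dataclasses import dataclass, field

@dataclass
class COISignature:
    k: int
    inbound: dict = field(default_factory=dict)    # top-k callers -> weight
    outbound: dict = field(default_factory=dict)   # top-k callees -> weight
    overflow_in: float = 0.0                       # weight of dropped inbound edges
    overflow_out: float = 0.0                      # weight of dropped outbound edges

def top_k_with_overflow(weights, k):
    """Keep the k heaviest edges; sum the remainder into one overflow value."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    return kept, overflow

def build_signature(node, edges, k):
    """Build one node's signature from (src, dst, weight) triples."""
    inbound, outbound = {}, {}
    for src, dst, w in edges:
        if src == node:
            outbound[dst] = outbound.get(dst, 0.0) + w
        if dst == node:
            inbound[src] = inbound.get(src, 0.0) + w
    top_in, over_in = top_k_with_overflow(inbound, k)
    top_out, over_out = top_k_with_overflow(outbound, k)
    return COISignature(k, top_in, top_out, over_in, over_out)
```

Keeping one such record per node, keyed by the node's identifier, is what turns the huge graph into many little graphs that can be fetched independently.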
### Analysis – Build up from signatures
• Fraud – we build signatures
  • When, how long, but not to whom
• We use the COI signature to build a Community of Interest for everyone, and then use that for analysis
• Example
• Communities are everywhere (e.g. Amazon), but representing (and visualizing) them as a graph gives a lot of insight.
### Updating through time
• Our graph is dynamic
• 3M new/old numbers per week!
• We use an exponentially weighted moving average as a way to smoothly update through time…
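
An exponentially weighted moving average update of edge weights might look like the sketch below. The smoothing parameter `theta` and its default value are assumptions for illustration (the slide gives no value), and `ewma_update` is a hypothetical name. Each edge weight becomes `theta * old + (1 - theta) * new`, so edges with no recent activity decay smoothly instead of vanishing at once.

```python
def ewma_update(old, new, theta=0.85):
    """Blend the previous period's edge weights with this period's activity.

    old, new: dicts mapping a neighbor to an edge weight.
    theta:    weight given to history (assumed value, not from the talk).
    """
    keys = set(old) | set(new)
    return {
        key: theta * old.get(key, 0.0) + (1.0 - theta) * new.get(key, 0.0)
        for key in keys
    }

# Hypothetical example with theta = 0.5 to keep the arithmetic easy to follow.
last_week = {"b": 10.0, "c": 4.0}
this_week = {"b": 0.0, "d": 20.0}
print(ewma_update(last_week, this_week, theta=0.5))
```

After such an update the edges would presumably be re-ranked so that only the top k stay in the signature; that step is left out here.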
### Two motivating examples
• Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling
  • Viral Marketing
  • Fraud
### Viral Marketing plans
• Viral Marketing – let your customers sell for you
• COI was the perfect tool to throw at this… by capturing the local neighborhood of the enrollees, we can test the viral hypothesis
• We can also track through time
• What did we do?
  • For the enrollees, find the induced subgraph from their COI
  • Look at a control group
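
The induced-subgraph step can be sketched as follows, together with a connected-components pass that groups enrollees into clusters. This is a minimal sketch, assuming the COI edges are available as hypothetical `(u, v)` pairs and are treated as undirected when forming components; the function names are invented for illustration.

```python
from collections import defaultdict

def induced_subgraph(edges, nodes):
    """Keep only the edges whose two endpoints are both in `nodes`."""
    keep = set(nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]

def connected_components(nodes, edges):
    """Return the components as a list of sets, treating edges as undirected."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                      # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical data: "x" is not an enrollee, so the edge to it is dropped.
coi_edges = [("a", "b"), ("b", "c"), ("d", "e"), ("c", "x")]
enrollees = ["a", "b", "c", "d", "e", "f"]
clusters = connected_components(enrollees, induced_subgraph(coi_edges, enrollees))
print(clusters)
```

Running the same two steps on a control group of the same size gives a baseline for how much clustering to expect by chance.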
### Cluster results…

Let’s look at some…

### RDD: Repetitive Debtors Database
• Lots of people can’t pay their bills, but they want phone service anyway:

[Diagram labels: Connect pool (30 Days); restricts]

### RDD Process
• A big matching problem…
• Every day
  • We get restricted TNs (telephone numbers), 4K / day
  • We get connected TNs, 40K / day
• Look over a 30-day period (possible 4B comparisons!)
• Compare the COI graphs of the disconnected number and the new number…
• We need a metric for graph distance

[Diagram labels: TN-1, TN-2, TN-3, TN-4, TN-5; Connect; Restrict]

### Matching Strategy
• We use a combination of:
  • Intersection > 2 (to pare down)
  • Name/address overlap (to weed out)
  • \$\$ owed (to prioritize)
• Here’s where modeling could help…or maybe not
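
The combination above could be sketched roughly like this. Everything named here (`Account`, `rank_matches`, the field names) is hypothetical, and the COI is reduced to a plain set of numbers for simplicity. The slide does not say exactly how name/address overlap is used to weed pairs out, so the sketch only records it as a flag and leaves that decision to the caller.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Account:
    tn: str             # telephone number
    coi: frozenset      # numbers appearing in this TN's COI
    name: str
    address: str
    owed: float = 0.0   # dollars owed (restricted accounts only)

def rank_matches(restricted, connected, min_shared=3):
    """Pair restricted TNs with new connects, largest debt first."""
    matches = []
    for old in restricted:
        for new in connected:
            shared = len(old.coi & new.coi)
            if shared < min_shared:       # "intersection > 2" pares down the pairs
                continue
            matches.append({
                "restricted": old.tn,
                "connected": new.tn,
                "shared": shared,
                # Recorded only; how it weeds out pairs is not specified.
                "identity_overlap": old.name == new.name or old.address == new.address,
                "owed": old.owed,
            })
    # Dollars owed sets the order in which cases get worked.
    return sorted(matches, key=lambda m: m["owed"], reverse=True)
```

The nested loop is the naive all-pairs version; at 4K restricts and 40K connects per day, a real system would need to prune candidates before comparing.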
### Wrap up
• Viral Marketing
  • Used connected components of reduced data as ‘clusters’
  • Looked for ‘centers’ of clusters for retention
  • Visualized clusters for understanding
  • Used boundary to predict new customers
  • COI was the best predictive variable in a marketing study
• Fraud
  • Attacked massive matching via simple measures of distance
  • Fraud reps use visualized clusters to work cases
  • We detected RDD with an 80% success rate

Is this EDA?

### Challenges
• Viewing graphs through time
  • What if I don’t know what is coming next?
• Graph distance metrics
  • What does “distance between graphs” mean?
• Tools for looking at many graphs
  • What do union and intersection mean?
• Modelling and EDA go hand in hand
  • Viral marketing models define network value, feed this into graph to do EDA…
• What do I want and who is going to do it?
• Tools that combine:
  • Interactive capability
  • Graph operations
  • Statistical analysis
• It’s happening
• It’s great!!
• It’s a little confusing

This model works for me….do you agree?

### What I want…

• Powerful ways to do union/intersection
  • Unclear, actually, what that means
• Statistical measures of distance between graphs: what is the metric of interest, really?
• Use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest)
• Standard tools for finding graph-theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary
• Ability to visualize the temporal component of graphs: is there a paradigm other than plotting the ubergraph?

### Points to make

• If each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons… SNA stuff is great, but it doesn’t really work at that scale!
• Sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user.
• Think graphs, and plot them! Even if you can’t plot the whole thing, plot some of it (do speech example)…
• “Network value” might be important. This might not be the same as density: it may be a sunburst, which is not a high-density subgraph, or the highest value; it may depend on time
• Modelling can be great: find pseudo edges, use latent space models, etc.…

### Visualize, even when you can’t

• There is always a way to subset or threshold, or something
  • Speech example
• Learn some graph theory
  • Bridge nodes/edges
  • Density, definitions of cliques and pseudo cliques
  • DFS/BFS, minimum spanning trees…
  • Strongly connected components
• Subset
### Storing COI Signatures
• COI signatures are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al.)
• Signatures are indexed by TN, so it is easy and fast to get the COI for large lists of TNs, and to use spiders for recursion.
• e.g. cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!
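
The slides do not define a spider, so the sketch below assumes it means a breadth-first crawl outward from one TN, looking up each visited TN's signature in the index. A plain Python dict stands in for the Hancock store, and the names `spider`, `signatures`, and `depth` are hypothetical.

```python
from collections import deque

def spider(signatures, start, depth):
    """Return {tn: hops from start} for every TN within `depth` hops.

    signatures: dict mapping a TN to the TNs in its COI signature
                (a stand-in for an indexed lookup).
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        tn = queue.popleft()
        if hops[tn] == depth:             # do not expand past the depth limit
            continue
        for neighbor in signatures.get(tn, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[tn] + 1
                queue.append(neighbor)
    return hops

# Hypothetical index: "e" is three hops from "a", so a depth-2 crawl skips it.
index = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(spider(index, "a", depth=2))
```

Because each step is a single keyed lookup, the cost of a crawl depends on the size of the neighborhood reached, not on the size of the whole graph.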

[Diagram labels: nodes A, B, O, Z]

### Informative overlap score
• Calculate the “informative overlap” score:

Where:

$w_{ao}$ = weight of the edge from $a$ to $o$

$w_{ob}$ = weight of the edge from $o$ to $b$

$w_o$ = sum of the weights of the edges to $o$

$d_{ao}$, $d_{ob}$ are the graph distances from $a$ and $b$ to $o$


### Selecting q