EDA with Graphs. Chris Volinsky, Shannon Laboratory, AT&T Labs-Research. Workshop on Statistical Inference, Computing and Visualization for Graphs, Stanford University, August 2, 2003.


EDA with Graphs

Chris Volinsky

Shannon Laboratory

AT&T Labs-Research

Workshop on Statistical Inference, Computing and Visualization for Graphs

Stanford University

August 2, 2003

Introduction
  • Some suggestions about looking at graphs
  • Our way of analyzing graphs: COI
  • Two motivating examples
  • Challenges for the room

Main point – sometimes EDA is all you need!

Preaching to the choir…
  • Visualize, even when you can’t
    • Speech example
  • Learn a little graph theory, even if you don’t want to
    • Expand your toolbox with:
      • bridges
      • cutpoints
      • centroids
      • pseudo cliques
      • strongly connected components
      • Etc.
  • Look at node and edge variables, even if they are not there
    • Variables induced by the graph itself are often useful (in-out degree, centrality, boundary)
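As an illustration of variables induced by the graph itself, here is a minimal sketch that derives in-degree and out-degree for every node from a directed edge list. The edge list and the function name `degree_variables` are hypothetical, not taken from the talk.

```python
from collections import Counter

# Hypothetical directed edge list (source, destination); not data from the talk.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]


def degree_variables(edges):
    """Return {node: (in_degree, out_degree)} computed from a directed edge list."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    # Counter returns 0 for missing keys, so nodes with no in- or out-edges work too.
    return {n: (in_deg[n], out_deg[n]) for n in nodes}


degrees = degree_variables(edges)
```

Once attached to the nodes, such induced variables can be explored like any other column in a data set.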
Our data
  • Huge! Hundreds of millions of nodes and edges, mostly connected
  • Modelling, or even EDA, on the entire graph may not be possible
  • COI – Communities of Interest are one way of analyzing these data
    • Storage - Break it down
    • Analysis – Build up from signatures
    • Updating - Through time via exponential smoothing
Storage - Break it down
  • Consider the atomic units of the graph, which we call a COI signature:
    • For every node in the graph, store
      • Top k numbers inbound
      • Top k numbers outbound
      • Weights on each edge
      • overflow bin
  • In short, we are storing a huge graph as many little graphs, which are easily accessible (via indexed storage) for analysis.
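The signature described above could be sketched roughly as follows. This is only an illustration: the class name `COISignature`, the field names, and the choice to let the overflow bin hold the summed weight of every edge outside the top k are assumptions, since the slide does not say what the overflow bin contains.

```python
from dataclasses import dataclass, field


@dataclass
class COISignature:
    """Hypothetical per-node record: top-k inbound/outbound edges plus overflow."""
    node: str
    k: int
    inbound: dict = field(default_factory=dict)    # neighbor -> edge weight
    outbound: dict = field(default_factory=dict)   # neighbor -> edge weight
    overflow_in: float = 0.0    # assumed: total weight of inbound edges not kept
    overflow_out: float = 0.0   # assumed: total weight of outbound edges not kept


def top_k_with_overflow(weights, k):
    """Keep the k heaviest edges; sum the remaining weights into an overflow bin."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    return kept, overflow


def build_signature(node, in_weights, out_weights, k):
    inbound, over_in = top_k_with_overflow(in_weights, k)
    outbound, over_out = top_k_with_overflow(out_weights, k)
    return COISignature(node, k, inbound, outbound, over_in, over_out)


sig = build_signature("tn0", {"x": 5.0, "y": 1.0, "z": 3.0}, {"x": 2.0}, k=2)
```

Because each signature is bounded by k, a store keyed by node can hand back any one of these little graphs without touching the rest of the data.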
Analysis – Build up from signatures
  • Fraud – we build signatures
    • When, how long, but not to whom
  • We use the COI signature to build a Community of Interest for everyone, and then use that for analysis
    • Example
  • Communities are everywhere (e.g. Amazon), but representing (and visualizing) as a graph gives a lot of insight.
Updating through time
  • Our graph is dynamic
    • 3M new/old numbers per week!
  • We use an exponentially weighted moving average as a way to smoothly update through time…
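A minimal sketch of such an update, assuming the common form new = q * old + (1 - q) * current, where q is the smoothing parameter discussed on the "Selecting q" slide. The exact form used in the talk is not shown, and the function name `ewma_update` is hypothetical.

```python
def ewma_update(old, current, q):
    """Blend stored edge weights with this period's activity.

    Assumed form: q * old + (1 - q) * current, applied per edge.
    Edges missing from either side are treated as weight 0.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    keys = set(old) | set(current)
    return {k: q * old.get(k, 0.0) + (1.0 - q) * current.get(k, 0.0) for k in keys}


stored = {"b": 10.0, "c": 4.0}   # weights carried over from earlier periods
today = {"b": 2.0, "d": 8.0}     # activity observed in the current period
updated = ewma_update(stored, today, q=0.5)
```

Under this form an edge with no new activity (here "c") decays smoothly instead of vanishing, and a new edge (here "d") enters with reduced weight.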
Two motivating examples
  • Two examples where looking at local network behavior via COI helped answer the questions of interest, without modeling
  • Viral Marketing
  • Fraud
Viral Marketing plans
  • Viral Marketing – let your customers sell for you
  • COI was the perfect tool to throw at this…by capturing the local neighborhood of the enrollees, we can test the viral hypothesis
  • We can also track through time
  • What did we do?
    • For the enrollees, find the induced subgraph from their COI
    • Look at a control group
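The induced-subgraph step can be sketched as below, together with connected components, which the wrap-up slide says were used as "clusters". The toy signatures, edges, and function names are hypothetical, and edges are treated as undirected when forming components, which is an assumption.

```python
def coi_members(signatures, node):
    """The node plus the neighbors listed in its (hypothetical) COI signature."""
    return {node} | set(signatures.get(node, ()))


def induced_subgraph(edges, nodes):
    """Keep only the edges whose endpoints both lie in `nodes`."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


def connected_components(edges, nodes):
    """Treat edges as undirected and return the components as a list of sets."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                      # iterative depth-first search
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


signatures = {"e1": ["a", "b"], "e2": ["b", "c"], "e3": ["d"], "x": ["y"]}
edges = [("e1", "a"), ("e1", "b"), ("e2", "b"), ("e2", "c"),
         ("e3", "d"), ("x", "y"), ("a", "y")]
enrollees = ["e1", "e2", "e3"]

nodes = set().union(*(coi_members(signatures, e) for e in enrollees))
sub = induced_subgraph(edges, nodes)
clusters = connected_components(sub, nodes)
```

Running the same steps on a control group of non-enrollees gives a baseline against which the clustering of enrollees can be compared.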
Cluster results…

Let’s look at some…

RDD: Repetitive Debtors Database
  • Lots of people can’t pay their bills, but they want phone service anyway:
RDD Process

[Figure: diagram labeled “Connect pool (30 Days)”, “T”, and “restricts”]
  • A big matching problem….
  • Every day
    • we get restricted TNs, 4K / day
    • we get connected TNs 40K / day
    • Look over a 30 day period (possible 4B comparisons!)
    • Compare the COI graphs of the disconnected number and the new number…
    • We need a metric for graph distance
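To make the scale concrete, the sketch below works through one reading of the numbers above: each day's restricted TNs compared against a 30-day pool of connected TNs. It also shows one candidate distance between two COI neighbor sets, the Jaccard distance. The Jaccard distance is an assumption chosen for illustration; the slide does not say which metric was used.

```python
# One reading of the slide's figures (an assumption about how they combine).
restricts_per_day = 4_000
connects_per_day = 40_000
window_days = 30

connect_pool = connects_per_day * window_days       # connected TNs in the window
comparisons_per_day = restricts_per_day * connect_pool


def jaccard_distance(coi_a, coi_b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of COI members (illustrative metric)."""
    union = coi_a | coi_b
    if not union:
        return 0.0          # two empty neighborhoods are treated as identical
    return 1.0 - len(coi_a & coi_b) / len(union)


d = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})
```

Under this reading the daily comparison count is on the order of billions, in line with the figure quoted above, which is why a cheap set-based distance is attractive.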
Matching Strategy

[Figure: diagram of TN-1 to TN-5 with “Connect” and “Restrict” labels]
  • We use a combination of:
    • Intersection > 2 (to pare down)
    • Name/address overlap (to weed out)
    • $$ owed (to prioritize)
    • Here’s where modeling could help…or maybe not
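Two of these steps can be sketched as follows: filtering candidate pairs by COI intersection greater than 2, then ordering the survivors by amount owed. The name/address step is left out because the slide does not say how it is applied. All TNs, COI sets, and dollar amounts below are made up for illustration.

```python
def candidate_matches(restricted, connected, threshold=2):
    """Return (restricted_tn, connected_tn, overlap) where the COIs share
    more than `threshold` members."""
    out = []
    for r_tn, r_coi in restricted.items():
        for c_tn, c_coi in connected.items():
            overlap = len(r_coi & c_coi)
            if overlap > threshold:
                out.append((r_tn, c_tn, overlap))
    return out


# Hypothetical COI member sets keyed by TN.
restricted = {"TN-2": {"a", "b", "c", "d"}, "TN-7": {"p", "q", "r", "s"}}
connected = {"TN-1": {"a", "x"},
             "TN-4": {"a", "b", "c", "z"},
             "TN-5": {"p", "q", "r"}}
owed = {"TN-2": 350.0, "TN-7": 900.0}   # hypothetical dollars owed per restricted TN

matches = candidate_matches(restricted, connected)
# Work the largest debts first.
ranked = sorted(matches, key=lambda m: owed[m[0]], reverse=True)
```

The double loop is quadratic and only suitable for a toy example; at the volumes described, an index from COI member to TN would avoid scanning every pair.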
Wrap up
  • Viral Marketing
    • Used connected components of reduced data as ‘clusters’
    • Looked for ‘centers’ of clusters for retention
    • Visualized clusters for understanding
    • Used boundary to predict new customers
    • COI was the best predictive variable in a marketing study
  • Fraud
    • Attacked massive matching via simple measures of distance
    • Fraud reps use visualized clusters to work cases
    • We detected RDD with an 80% success rate

Is this EDA?

Challenges
  • Viewing graphs through time
    • What if I don’t know what is coming next?
  • Graph distance metrics
    • What does “distance between graphs” mean?
  • Tools for looking at many graphs
    • what do union and intersection mean?
  • Modelling and EDA go hand in hand
    • Viral marketing models define network value, feed this into graph to do EDA….
An answer for Duncan…
  • What do I want and who is going to do it?
    • Tools that combine:
      • Interactive capability
      • Graph operations
      • Statistical analysis
    • It’s happening
    • It’s great!!
    • It’s a little confusing

This model works for me….do you agree?


What I want….

  • powerful ways to do union/intersection
    • unclear actually what that means
  • statistical measures of distances between graphs, what is the metric of interest, really?
  • use variables on nodes and edges to easily define new graphs, and automatically point me towards the interesting ones (largest, densest)
  • standard tools for finding graph theoretic concepts like cliques, pseudo cliques, density, bridge edges, boundary
  • ability to visualize the temporal component of graphs – is there a paradigm other than plotting the ubergraph?

Points to make

  • If each TN is a graph, and we are looking for similar graphs, we could be doing millions or billions of these comparisons… SNA stuff is great, but it doesn’t really work!
  • Sometimes EDA is the answer: it is the best we can do, or perhaps it is sufficient for the user.
  • Think graphs – and plot it! Even if you can’t plot the whole thing, plot some of it – do speech example….
  • “Network value” might be important. This might not be the same as density: it may be a sunburst, which is not a high-density subgraph, or the highest value; it may depend on time.
  • Modelling can be great – find pseudo edges, use latent space models, etc…

Visualize, even when you can’t

    • There is always a way to subset or threshold, or something
    • Speech example
  • Learn some graph theory
    • Bridge nodes/edges
    • Density, definitions of cliques and pseudo cliques
    • DFS/BFS, minimum spanning trees….
    • Strongly connected components
  • Subset
Storing COI Signatures
  • COI sigs are stored in Hancock, a C-based domain-specific language designed for large amounts of signature-type data (Rogers, Fisher, et al.)
  • Signatures are indexed by TN, so it is easy and fast to get the COI for large lists of TNs, and to use spiders for recursion.
  • E.g. cycling over all TNs to learn something about our customer base takes minutes. We could never do this before!
Informative overlap score

[Figure: example graph with nodes A, B, O, and Z; labels w_ao, w_ob, and w_o]

  • Calculate the “informative overlap” score:

Where:

w_ao = weight of edge from a to o

w_ob = weight of edge from o to b

w_o = sum of the weights of edges to o

d_ao, d_ob are the graph distances from a and b to o

Selecting q

Calls fade out over time;

The larger q is, the longer the call has non-negligible weight
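A small sketch of this trade-off, assuming a call's weight is multiplied by q at every update (as in an update of the form q * old + (1 - q) * current). The function name and the 1% cutoff for "negligible" are illustrative assumptions, not values from the talk.

```python
def periods_until_negligible(q, eps=0.01):
    """Number of updates until a call's relative weight q**n drops below eps."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    n, weight = 0, 1.0
    while weight >= eps:
        weight *= q      # each update shrinks the old contribution by a factor q
        n += 1
    return n


fast_fade = periods_until_negligible(0.5)   # small q: calls are forgotten quickly
slow_fade = periods_until_negligible(0.9)   # large q: calls linger much longer
```

Under these assumptions, choosing q amounts to deciding how many update periods of history a signature should effectively remember.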
