
Topic 13 Network Models

Credits: C. Faloutsos and J. Leskovec (tutorial); E. Kolaczyk (notes). Social networks: a network is a collection of interconnected things, and the data consist of nodes and edges. The analysis of such data is also called “graph mining”.


Presentation Transcript


  1. Topic 13: Network Models. Credits: C. Faloutsos and J. Leskovec (tutorial), E. Kolaczyk (notes). Data Mining - Volinsky - 2011 - Columbia University

  2. Social Networks
     • Network: a collection of interconnected things
     • Network analysis is also called “graph mining”
     • Data consist of nodes and edges
     • Note: different from “graphical models” (graphical representations of dependence among random variables)
     • Edges represent:
        • Relationships between nodes
        • Behavior observed between nodes
        • High similarity between nodes
     • Edges are typically weighted
     • Both nodes and edges can have associated attributes
     • Networks can be directed or undirected
        • Directed: phone calls, emails
        • Undirected: collaboration, physical networks, friendship

  3. Examples

  4. Networks are everywhere!

  5. Layout
     • Layout matters!
     • Especially with directed graphs

  6. Facebook “Friend Wheel”

  7. LinkedIn: LinkedIn community from LinkedIn Labs

  8. Networks: A Matter of Scale

  9. Measurements on networks: Nodes and Edges
     • Node degree (node)
        • The degree of a node is the number of edges attached to it
        • If directed, in-degree (incoming edges) and out-degree (outgoing edges) are counted separately
     • Centrality (node):
        • How ‘central’ is a given node?
        • Degree centrality: how many edges does it have?
        • Betweenness centrality: how many shortest paths does it lie on?
        • Centrality = importance
     • Centrality (edge):
        • How central is an edge?
        • Similar ‘shortest path’ definition
        • Does removing it create more clusters?
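The degree definitions above can be illustrated with a short sketch. The function names and the toy phone-call edge list are made up for illustration and are not from the slides.

```python
from collections import Counter

def directed_degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of (source, target) edges."""
    in_deg, out_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return dict(in_deg), dict(out_deg)

def undirected_degrees(edges):
    """Return a degree dict, treating each (u, v) pair as an undirected edge."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

# Hypothetical call records: (caller, callee).
calls = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("dan", "ann")]
in_deg, out_deg = directed_degrees(calls)
deg = undirected_degrees(calls)
```

Nodes that never appear as a target (or source) are simply absent from the corresponding dict, so `in_deg.get(node, 0)` is the safe way to read a count.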

  10. Measurements on networks (graph)
     • Degree distribution
        • The distribution of node degrees characterizes the graph
        • Normal or highly skewed?
     • Density (graph):
        • How “dense” is the graph?
        • Given n nodes, how many edges are possible? (n(n-1)/2 if undirected)
        • Density = #edges / #possible edges
     • Clustering coefficient (graph):
        • How likely is it that your friends are friends with each other?
        • Count: how many triangles
     • Diameter (graph)
        • Longest shortest path between any pair of nodes
     • Shortest paths (graph)
        • Histogram of shortest-path lengths
     • Connectivity (graph)
        • Is the graph connected (every node reachable from every other)?
        • Connected components
        • For directed graphs: strongly connected components
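A minimal sketch of these graph-level measurements for a small undirected graph, using only the standard library. The helper names and the toy edge list are illustrative assumptions; the clustering coefficient is computed in its "global" form (closed triples divided by all connected triples), which is one of several common definitions.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def density(adj):
    """Number of edges divided by the n(n-1)/2 possible undirected edges."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2)

def global_clustering(adj):
    """Fraction of connected triples (two neighbors of one node) that form a triangle."""
    closed = triples = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_components(adj):
    seen, comps = set(), []
    for node in adj:
        if node not in seen:
            comp = set(bfs_distances(adj, node))
            seen |= comp
            comps.append(comp)
    return comps

def diameter(adj):
    """Longest shortest path among mutually reachable node pairs."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Toy graph: a triangle (1, 2, 3) with a tail 3-4-5.
adj = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)])
summary = (density(adj), global_clustering(adj), diameter(adj), len(connected_components(adj)))
```

Running one breadth-first search per node is fine for small graphs; massive networks usually need sampling or approximation instead.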

  11. Models on networks
     • Random (Erdős–Rényi)
        • Each possible edge occurs independently with probability p
        • Degree distribution is binomial, approximately Poisson for large sparse graphs
     • Exponential random graph (p*) models
        • Statistical model: extension of Erdős–Rényi
        • Defines a probability distribution over graphs in terms of graph properties
     • Preferential attachment
        • Generative model: each new node creates m links (m based on a Poisson distribution)
        • Links attach to existing nodes with probability proportional to each node's degree
        • “Rich get richer”
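The two generative models can be sketched as follows. This is an illustration under stated assumptions, not the construction used in the lecture: the function names are invented, m is held fixed for every new node (rather than drawn from a Poisson distribution), and the preferential-attachment graph is seeded with a complete graph on m + 1 nodes.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """G(n, p): keep each of the n(n-1)/2 possible edges independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def preferential_attachment(n, m, rng):
    """Grow to n nodes; each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    # Assumption: seed with a complete graph on m + 1 nodes so no degree is zero.
    edges = list(combinations(range(m + 1), 2))
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

def max_degree(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return max(deg.values())

rng = random.Random(0)
er_edges = erdos_renyi(200, 0.02, rng)
pa_edges = preferential_attachment(200, 2, rng)
```

Comparing `max_degree(er_edges)` with `max_degree(pa_edges)` is a quick way to see the “rich get richer” effect: the preferential-attachment graph tends to grow a few high-degree hubs.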

  12. Real-world networks
     • Degree distributions in real-world networks are heavily skewed to the right
        • The preferential attachment model reproduces this skew
        • Long tail of values above the mean
        • Large mean, small median, small diameter
     • Leads to a “power law”
        • Let k = degree and p_k = the number of nodes that have that degree
        • A plot of log p_k against log k should be approximately linear
     • Many real-world data sets follow a power law:
        • Online sales
        • Word length distributions
        • Number of friends on Facebook!
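The log-log linearity check can be sketched as a least-squares fit of log p_k against log k. The function name and the synthetic counts (constructed so that p_k = 1000 / k^2 exactly) are assumptions for illustration. A straight-line fit on log-log axes is only a rough diagnostic, not a rigorous test for a power law.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log p_k against log k.

    counts maps degree k to p_k; entries with k = 0 or p_k = 0 are skipped
    because their logarithm is undefined.
    """
    pts = [(math.log(k), math.log(c)) for k, c in counts.items() if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic degree counts following p_k = 1000 * k**-2.
synthetic = {1: 1000, 2: 250, 5: 40, 10: 10}
slope = loglog_slope(synthetic)
```

For these synthetic counts the fitted slope recovers the exponent, about -2. With real data, `counts` would typically come from something like `collections.Counter(degree_list)`.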

  13. More Power Law

  14. Erdős–Rényi vs. Power-law (from Leskovec & Faloutsos)

  15. Small World
     • Real-world data sets tend to have power-law distributions
     • They also tend to have a “small world” property
        • Everyone is reachable via a small number of edges
        • Small diameters
     • Stanley Milgram experiment, 1967
        • People were given a letter and asked to forward it to one friend
        • Source: random residents of Omaha
        • Target: a stockbroker in Boston
        • Completed chains averaged 6 hops
        • Hence, “six degrees of separation”

  16. Small World Networks
     • Watts and Strogatz [1998] introduced the small-world network model
     • Navigable Social Networks [Kleinberg 2000]
        • Showed how small-world networks are created
        • Put n people on a k-dimensional grid
        • Connect each to its immediate neighbors
        • Add one long-range link per person
        • Everyone will be connected via a short path
     • This is the way the real world works!!!
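The grid-plus-shortcuts construction can be sketched for k = 2. This is a simplified illustration: the function names are invented, and each long-range contact is drawn uniformly at random, whereas Kleinberg's model draws it with a probability that decays with grid distance.

```python
import random
from collections import deque

def grid_graph(side):
    """side x side lattice; each node links to its immediate grid neighbors."""
    adj = {(r, c): set() for r in range(side) for c in range(side)}
    for r, c in adj:
        for dr, dc in ((0, 1), (1, 0)):
            nbr = (r + dr, c + dc)
            if nbr in adj:
                adj[(r, c)].add(nbr)
                adj[nbr].add((r, c))
    return adj

def add_long_range_links(adj, rng):
    """Give every node one extra link to a random other node (uniform choice: an assumption)."""
    nodes = list(adj)
    for u in nodes:
        v = rng.choice(nodes)
        while v == u:
            v = rng.choice(nodes)
        adj[u].add(v)
        adj[v].add(u)

def mean_distance_from(adj, source):
    """Average hop count from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist.values()) / (len(dist) - 1)

lattice = grid_graph(20)
before = mean_distance_from(lattice, (0, 0))
add_long_range_links(lattice, random.Random(1))
after = mean_distance_from(lattice, (0, 0))
```

On the bare 20 x 20 lattice the average distance from a corner is about 19 hops; adding the shortcuts can only shorten paths, and in practice it shortens them substantially.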

  17. Small World Networks
     • Another look

  18. Sampling Networks
     • How do you sample from a massive network?
     • Simplest method: induced subgraph
        • Randomly sample nodes and keep the edges between them
        • Not so great! The randomly sampled nodes (yellow in the figure) don’t have the same graph properties as the full network
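A minimal sketch of induced-subgraph sampling, with invented names and a toy ring network. It also shows why the method distorts graph properties: every node in the full ring has degree 2, but the sampled subgraph keeps only the edges whose endpoints were both drawn.

```python
import random

def induced_subgraph(edges, sample_size, rng):
    """Sample nodes uniformly at random and keep only edges with both endpoints sampled."""
    nodes = sorted({node for edge in edges for node in edge})
    kept = set(rng.sample(nodes, sample_size))
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges

# Toy network: a ring of 100 nodes.
ring = [(i, (i + 1) % 100) for i in range(100)]
kept, sub_edges = induced_subgraph(ring, 20, random.Random(0))
mean_degree_sample = 2 * len(sub_edges) / len(kept)
```

Here `mean_degree_sample` falls below the full network's mean degree of 2, because most sampled nodes lose one or both of their neighbors.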

  19. Sampling Networks
     • Snowball sampling:
        • Pick a random sample of nodes, then follow their ‘tree’ of neighbors for a set number of ‘hops’
     • Still not perfect, but better
     • Other ideas abound, but there is little agreement
     • Great area for research!
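Snowball sampling can be sketched as a breadth-first expansion from random seed nodes. The function name, parameters, and ring example are illustrative assumptions.

```python
import random

def snowball_sample(adj, n_seeds, hops, rng):
    """Start from random seeds and add all neighbors, one hop at a time."""
    seeds = rng.sample(sorted(adj), n_seeds)
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Nodes adjacent to the current frontier that have not been sampled yet.
        frontier = {v for u in frontier for v in adj[u]} - sampled
        sampled |= frontier
    return sampled

# Toy network: a ring of 30 nodes stored as adjacency sets.
ring = {i: {(i - 1) % 30, (i + 1) % 30} for i in range(30)}
sample = snowball_sample(ring, n_seeds=1, hops=2, rng=random.Random(0))
```

With one seed and two hops on a ring, the sample is the seed plus two nodes on each side. Unlike induced-subgraph sampling, the local neighborhood structure around each seed is preserved.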

  20. Network Problems of Interest
     • Link prediction:
        • Can we use existing network data to infer links where none are observed?
        • Links in the future?
        • Missing data
     • Simple methods
        • Look for pairs with many common neighbors
     • Complex methods
        • Stochastic blockmodels
        • Similar to using SVD to ‘fill in’ a matrix
        • Agarwal and Pregibon ‘04
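The “common neighbors” heuristic can be sketched as follows: score every unlinked pair by the number of neighbors the two nodes share, and rank the pairs. The function name and toy graph are invented for illustration.

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Rank currently unlinked node pairs by how many neighbors they share."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy friendship graph with edges a-b, a-c, a-d, b-c, c-e, d-e.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a", "e"},
    "e": {"c", "d"},
}
ranked = common_neighbor_scores(adj)
```

In this toy graph the top candidates are a-e and c-d, each sharing two neighbors. Scoring all pairs is quadratic in the number of nodes, so large networks usually restrict candidates to nodes two hops apart.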

  21. Network Problems of Interest
     • Graph matching / similarity
        • Fraud (‘repetitive debtors’)
        • Citation de-noising
        • Need a metric to define the difference between graphs
     • Collective inference
        • What can you learn about someone from their network?
        • Fraud (‘guilt by association’)
        • Viral marketing
     • The following example is courtesy of Sofus Macskassy

  22-24. A Relational Neighbor Classifier (wvRN) [figure slides]

  25-30. Collective wvRN: classify all entities in the network simultaneously, because (if done well) inferences about neighbors can reduce statistical bias (cf. Jensen et al. KDD-04)
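The slides show wvRN only through figures, so the following is a hedged sketch rather than the lecture's exact procedure. It assumes the usual weighted-vote formulation: a node's probability of belonging to class 1 is the edge-weighted average of its neighbors' probabilities. Collective inference is approximated by repeating that update for all unlabeled nodes at once while labeled nodes stay fixed. The function name, the 0.5 starting prior, and the fixed iteration count are assumptions.

```python
def wvrn(adj, known, n_iter=50, prior=0.5):
    """Weighted-vote relational neighbor estimates for a binary label.

    adj:   node -> {neighbor: edge weight}
    known: node -> 0 or 1 for nodes whose label is observed
    Returns node -> estimated P(label = 1).
    """
    prob = {v: float(known[v]) if v in known else prior for v in adj}
    for _ in range(n_iter):
        updated = dict(prob)
        for v in adj:
            if v in known or not adj[v]:
                continue  # observed labels stay fixed; isolated nodes keep the prior
            total = sum(adj[v].values())
            updated[v] = sum(w * prob[u] for u, w in adj[v].items()) / total
        prob = updated  # synchronous update: all nodes move together
    return prob

# Toy chain a - x - y - b, where a is labeled 1 and b is labeled 0.
adj = {
    "a": {"x": 1.0},
    "x": {"a": 1.0, "y": 1.0},
    "y": {"x": 1.0, "b": 1.0},
    "b": {"y": 1.0},
}
prob = wvrn(adj, known={"a": 1, "b": 0})
```

On the chain, the estimates settle near 2/3 for x (closer to the positive node) and 1/3 for y, which is the “guilt by association” intuition in numeric form.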

  31. Network Problems of Interest
     • Diffusion
        • Information or virus diffusion
     • Community detection
        • Subgroups have a higher edge density within the subgroup than with the rest of the network
        • Can remove edges with high centrality to try to find communities
     • Understanding of social networks
        • Facebook
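Removing high-centrality edges to expose communities can be sketched as one splitting step in the style of the Girvan-Newman approach. That name is not used in the slides, and the helper names here are invented. Edge betweenness is computed with a breadth-first shortest-path count from every node, and the graph is assumed to be undirected with no self-loops.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths crossing each undirected edge (fractional when paths tie)."""
    score = {}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0
                    preds[v] = []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest nodes, pushing path credit onto edges.
        delta = {v: 0.0 for v in order}
        for v in reversed(order):
            for u in preds[v]:
                credit = sigma[u] / sigma[v] * (1.0 + delta[v])
                edge = frozenset((u, v))
                score[edge] = score.get(edge, 0.0) + credit
                delta[u] += credit
    # Each unordered node pair was counted once from each endpoint.
    return {edge: c / 2.0 for edge, c in score.items()}

def components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove the highest-betweenness edge repeatedly until a component breaks off."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the single bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities = split_once(graph)
```

Every shortest path between the two triangles crosses the bridge, so it has the highest betweenness and is removed first, leaving the two triangles as separate communities.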

  32. References
     • Leskovec / Faloutsos tutorial (mostly part 1)
     • Eric Kolaczyk, notes and book
     • Watts and Strogatz, “Collective dynamics of ‘small-world’ networks”, Nature 393, pp. 440-442
     • M. E. J. Newman, Networks (book)
     • Albert-László Barabási, Linked: How Everything Is Connected to Everything Else and What It Means
     • Enron data
     • Tools
        • Graphviz.org for visualization
        • igraph (R package)
