Graph mining techniques applied to blogs


Presentation Transcript


  1. Graph mining techniques applied to blogs Mary McGlohon Seminar on Social Media Analysis- Oct 2 2007

  2. Last week… Lots of methods for graph mining and link analysis.

  3. Last week… Lots of methods for graph mining and link analysis. This week… A few examples of these methods applied to blogs.

  4. Paper #1 • Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Patterns of Cascading Behavior in Large Blog Graphs, SDM 2007. • What temporal and topological features do we observe in a large network of blogs?

  5. Representing blogs as graphs [Diagram: blogosphere network with blogs slashdot, boingboing, MichelleMalkin, Dlisted]

  6. Representing blogs as graphs [Diagram: the blogosphere network collapsed into a blog network with weighted edges between slashdot, boingboing, MichelleMalkin, Dlisted]

  7. Representing blogs as graphs [Diagram: blogosphere network, blog network, and post network for slashdot, boingboing, MichelleMalkin, Dlisted]

  8. Extracting subgraphs: Cascades We gather cascades using the following procedure: Find all initiators (out-degree 0). [Diagram: post network with posts a-e]

  9. Extracting subgraphs: Cascades We gather cascades using the following procedure: Find all initiators (out-degree 0). Follow in-links. [Diagram: post network with posts a-e]

  10. Extracting subgraphs: Cascades We gather cascades using the following procedure: Find all initiators (out-degree 0). Follow in-links. This produces a directed acyclic graph. [Diagram: post network with posts a-e and the extracted cascades]
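The three-step procedure above can be sketched in code. This is an illustrative toy example, not the paper's implementation; the post names and links are hypothetical. Edges point from a post to the earlier posts it links to, so initiators are exactly the nodes with out-degree 0.

```python
from collections import defaultdict

# post -> posts it links to (out-links); hypothetical toy data
out_links = {
    "a": [],          # initiator: no out-links
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
    "e": [],          # isolated initiator
}

# invert to get in-links (who links to me)
in_links = defaultdict(list)
for post, targets in out_links.items():
    for t in targets:
        in_links[t].append(post)

def extract_cascade(initiator):
    """Follow in-links from an initiator; the reachable set forms a DAG."""
    cascade, frontier = {initiator}, [initiator]
    while frontier:
        nxt = []
        for post in frontier:
            for linker in in_links[post]:
                if linker not in cascade:
                    cascade.add(linker)
                    nxt.append(linker)
        frontier = nxt
    return cascade

# Step 1: find initiators; steps 2-3: follow in-links from each
initiators = [p for p, targets in out_links.items() if not targets]
cascades = {i: extract_cascade(i) for i in initiators}
# cascade of "a" contains a, b, c, d; "e" stays an isolated singleton
```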

  11. Dataset (Papers #1 and #2) Nielsen Buzzmetrics data gathered August-September 2005. Set of 44,362 blogs with cascades traced: 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links. [Plot: number of posts over time, 1-day bins]

  12. Temporal Observations Does blog traffic behave periodically? • Posts show a "weekend effect": less traffic on Saturday and Sunday.

  13. Temporal Observations How does post popularity change over time? [Plot: number of in-links (log) vs. days after post for Monday posts, comparing popularity on day 1 and day 40]

  14. Temporal Observations How does post popularity change over time? Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006]. [Plot: number of in-links (log) vs. days after post]

  15. Temporal Observations How does post popularity change over time? Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006]. The probability that a post written at time tp acquires a link at time tp + Δ is: p(tp + Δ) ∝ Δ^(-1.5). [Plot: number of in-links (log) vs. days after post]
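The exponent in a dropoff law like this is typically read off as the slope of a least-squares line in log-log space. A minimal sketch on synthetic counts generated from the ideal Δ^(-1.5) law (made-up numbers, not the paper's data):

```python
import math

# synthetic in-link counts following the ideal power-law dropoff
days = range(1, 31)
counts = [1000 * d ** -1.5 for d in days]

# least-squares slope in log-log space recovers the exponent
xs = [math.log(d) for d in days]
ys = [math.log(c) for c in counts]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
# slope ≈ -1.5 on this ideal data
```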

  16. Topological Observations What graph properties does the blog network exhibit?

  17. Topological Observations What graph properties does the blog network exhibit? • 44,356 nodes, 122,153 edges • Half of the blogs belong to the largest connected component.

  18. Topological Observations What power laws does the blog network exhibit? [Plots: count vs. number of blog in-links and out-links, both log-log] Both in- and out-degree follow a power law distribution: in-link exponent -1.7, out-degree exponent near -3. This suggests strong rich-get-richer phenomena.
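The quantity plotted in these slides is a degree histogram: for each degree value, the number of blogs with that degree. A sketch of how it is computed from a blog-to-blog edge list (the four edges below are hypothetical):

```python
from collections import Counter

# hypothetical (source, destination) blog-to-blog links
edges = [("dlisted", "michellemalkin"), ("boingboing", "slashdot"),
         ("dlisted", "slashdot"), ("michellemalkin", "slashdot")]

out_deg = Counter(src for src, _ in edges)   # out-degree per blog
in_deg = Counter(dst for _, dst in edges)    # in-degree per blog

# degree -> number of blogs with that degree (the y-axis of the plot)
in_hist = Counter(in_deg.values())
# here: one blog with in-degree 1, one with in-degree 3
```

On the real data, plotting `in_hist` on log-log axes and fitting a line gives the reported exponents.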

  19. Topological Observations What graph properties does the post network exhibit?

  20. Topological Observations What graph properties does the post network exhibit? Very sparsely connected: 98% of posts are isolated. In-links and out-links also follow power laws.

  21. Topological Observations How do we measure how information flows through the network? Common cascade shapes are extracted using algorithms in [Leskovec2006].

  22. Topological Observations How do we measure how information flows through the network? The number of edges increases linearly with cascade size, while the effective diameter increases logarithmically, suggesting tree-like structures. [Plots: number of edges vs. cascade size (# nodes); effective diameter vs. cascade size]

  23. More on cascades • Cascade sizes, including sizes of particular shapes (stars, chains), also follow power laws. • This paper also presents a model for influence propagation that generates cascades based on the SIS model of epidemiology. The topic of influence propagation has been reserved for a later date.

  24. Paper #2 Mary McGlohon, Jure Leskovec, Christos Faloutsos, Matthew Hurst, and Natalie Glance. Finding patterns in blog shapes and blog evolution, SDM 2007. Do different kinds of blogs exhibit different properties? What tools can we use to describe the behavior of a blog over time?

  25. Suppose we wanted to characterize a blog based on the properties of its posts. • Obtain a set of post features based on each post's role in a cascade. • Use PCA for dimensionality reduction.

  26. Post features There are several terms we use to describe cascades: In-link, out-link (the green node has one out-link; the yellow node has one in-link). Depth downwards/upwards (the pink node has an upward depth of 1 and a downward depth of 2). Conversation mass upwards/downwards (the pink node has upward CM 1, downward CM 3). [Diagram: example cascade with colored nodes]

  27. Dimensionality reduction Post features may be correlated, so some information may be unnecessary. Principal Component Analysis is a method of dimensionality reduction. Hypothetically, for each blog... [Plot: posts by depth upwards vs. conversation mass upwards]

  28. Dimensionality reduction Post features may be correlated, so some information may be unnecessary. Principal Component Analysis is a method of dimensionality reduction. Hypothetically, for each blog... [Plot: posts by depth upwards vs. conversation mass upwards, with principal axis drawn]

  29. Dimensionality reduction Post features may be correlated, so some information may be unnecessary. Principal Component Analysis is a method of dimensionality reduction. Hypothetically, for each blog... [Plot: posts projected onto the principal axis]

  30. Setting up the matrix One row per post (~2,400,000 posts, e.g. slashdot-p001, slashdot-p002, boingboing-p001, boingboing-p002); one column per feature: log(# in-links), log(# out-links), log(CM up), log(CM down), log(depth up), log(depth down). Run PCA on this matrix.
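A minimal sketch of this setup: posts as rows, the six log-scaled features as columns, then PCA via SVD on the centered matrix. The feature values below are made up for illustration; this is not the paper's code or data.

```python
import numpy as np

# columns: log in-links, log out-links, log CM up, log CM down,
#          log depth up, log depth down (values are hypothetical)
X = np.array([
    [4.5, 0.3, 2.2, 0.2, 1.2, 2.4],   # e.g. slashdot-p001
    [4.2, 0.6, 6.2, 1.1, 0.1, 0.6],   # e.g. boingboing-p001
    [0.1, 0.2, 0.3, 0.1, 0.0, 0.2],
    [3.9, 0.4, 2.0, 0.3, 1.0, 2.1],
])

Xc = X - X.mean(axis=0)               # center each feature column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T             # each post as a 2-D (PC1, PC2) point
```

Plotting `projected` per blog gives the kind of per-blog clusters shown on the results slides.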

  31. PostFeatures: Results • Observation: Posts within a blog tend to retain similar network characteristics. • PC1 ~ CM upward • PC2 ~ CM downward

  32. PostFeatures: Results • Observation: Posts within a blog tend to retain similar network characteristics. • PC1 ~ CM upward • PC2 ~ CM downward [Plot: posts projected onto PC1/PC2, with MichelleMalkin and Dlisted clusters highlighted]

  33. Suppose we want to cluster blogs based on content. What features do we use? Get set of features based on cascade shapes. Run PCA to reduce dimensionality.

  34. PCA on a sparse matrix • This time, each blog is one row (~44,000 blogs); each of the ~9,000 columns is a cascade type. • Entries are log(count+1). • Project onto 2 PCs. [Matrix sketch: rows slashdot, boingboing, …]

  35. CascadeType: Results Observation: Content of blogs and cascade behavior are often related. • Distinct clusters for "conservative" and "humorous" blogs (hand-labeling).

  36. CascadeType: Results Observation: Content of blogs and cascade behavior are often related. • Distinct clusters for "conservative" and "humorous" blogs (hand-labeling). [Plot: blogs projected onto 2 PCs with labeled clusters]

  37. What about time series data? How can we deal with it? Problem: time series data is nonuniform and difficult to analyze. [Plot: in-links over time]

  38. BlogTimeFractal: Definitions Fortunately, we find that behavior is often self-similar. The 80-20 law describes self-similarity: divide any sequence into two equal-length subsequences; 80% of the traffic falls in one and 20% in the other. Repeat recursively.
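The recursive 80-20 construction (the b-model) can be sketched as follows. For reproducibility this sketch always gives the biased fraction to the left half; a stochastic b-model would pick the biased half at random at each split.

```python
def b_model(total, levels, b=0.8):
    """Recursively split traffic: fraction b to one half, 1-b to the other."""
    if levels == 0:
        return [total]
    left = b_model(total * b, levels - 1, b)        # biased half
    right = b_model(total * (1 - b), levels - 1, b) # remaining traffic
    return left + right

traffic = b_model(1000, levels=3)   # 2**3 = 8 intervals, total preserved
# busiest interval gets 1000 * 0.8**3 = 512 units
```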

  39. Self-similarity The bias factor for the 80-20 law is b = 0.8. [Diagram: 80/20 split of a sequence]

  40. Self-similarity The bias factor for the 80-20 law is b = 0.8. Q: How do we estimate b?

  41. Self-similarity The bias factor for the 80-20 law is b = 0.8. Q: How do we estimate b? A: Entropy plots!

  42. BlogTimeFractal An entropy plot plots entropy vs. resolution. From the time series data, begin with resolution R = T/2 and record the entropy H_R.

  43. BlogTimeFractal An entropy plot plots entropy vs. resolution. From the time series data, begin with resolution R = T/2 and record the entropy H_R. Recursively take finer resolutions.

  44. BlogTimeFractal An entropy plot plots entropy vs. resolution. From the time series data, begin with resolution R = T/2 and record the entropy H_R. Recursively take finer resolutions.

  45. BlogTimeFractal: Definitions Entropy measures the non-uniformity of the histogram at a given resolution. We define the entropy of our sequence at a given resolution R as H_R = -Σt p(t) log2 p(t), where p(t) is the percentage of posts from a blog in interval t, R is the resolution, and 2^R is the number of intervals.
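The entropy-plot construction can be sketched directly from this definition: at each resolution R, split the series into 2^R equal intervals, take p(t) as each interval's share of the traffic, and compute H_R. The 8-interval series below is hypothetical.

```python
import math

def entropy_at_resolution(series, R):
    """H_R = -sum p(t) log2 p(t) over 2**R equal-length intervals."""
    k = 2 ** R
    size = len(series) // k
    bins = [sum(series[i * size:(i + 1) * size]) for i in range(k)]
    total = sum(bins)
    ps = [b / total for b in bins if b > 0]   # empty bins contribute 0
    return -sum(p * math.log2(p) for p in ps)

series = [5, 3, 1, 1, 8, 2, 0, 4]             # hypothetical post counts
plot = [(R, entropy_at_resolution(series, R)) for R in (1, 2, 3)]
# for self-similar traffic the (R, H_R) points fall on a line
```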

  46. BlogTimeFractal For a b-model (and other self-similar cases), the entropy plot is linear, and its slope s tells us the bias factor. Lemma: For traffic generated by a b-model, the bias factor b obeys the equation: s = -b log2(b) - (1-b) log2(1-b)
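Assuming the lemma, a measured slope s can be inverted numerically to recover b: the right-hand side is the binary entropy function, which decreases from 1 to 0 as b goes from 0.5 to 1, so bisection on that interval suffices. A sketch:

```python
import math

def binary_entropy(b):
    """The lemma's right-hand side: -b log2 b - (1-b) log2 (1-b)."""
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)

def bias_from_slope(s, lo=0.5, hi=1 - 1e-12):
    """Bisection: binary entropy is decreasing on [0.5, 1]."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if binary_entropy(mid) > s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# e.g. the slide's s = 0.85 gives b ≈ 0.72
b = bias_from_slope(0.85)
```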

  47. Entropy Plots Linear plot → self-similarity. [Plot: entropy vs. resolution]

  48. Entropy Plots Linear plot → self-similarity. Uniform: slope s = 1, bias = 0.5. Point mass: s = 0, bias = 1. [Plot: entropy vs. resolution]

  49. Entropy Plots Linear plot → self-similarity. Uniform: slope s = 1, bias = 0.5. Point mass: s = 0, bias = 1. MichelleMalkin in-links: s = 0.85; by Lemma 1, b = 0.72. [Plot: entropy vs. resolution]

  50. BlogTimeFractal: Results Observation: Most time series of interest are self-similar. Observation: The bias factor is approximately 0.7; that is, traffic is more bursty than uniform (a 70/30 law). Entropy plots for MichelleMalkin: in-links b = 0.72, conversation mass b = 0.76, number of posts b = 0.70.
