# On the Bursty Evolution of Blogspace

##### Presentation Transcript

1. On the Bursty Evolution of Blogspace • Ravi Kumar, Jasmine Novak, Prabhakar Raghavan and Andrew Tomkins • IBM Almaden Research Center; Verity Inc. • WWW 2003

2. Main contributions • Time graph and blog graph • Communities in Blogspace • Temporal bursts: from a sequence of documents to sets of blogs • Links between blogs that are topically and temporally focused • Blogspace evolution

3. Community Extraction in Blogspace • Communities are collections of pages that provide information on a similar topic or share a point of view. • Kleinberg (2000): co-citation, dense bipartite subgraph (signature) • Flake (2000): network flow

4. Bursts • Events: modeling bursts in a stream of events • Trade-off: reporting a large number of short spurious bursts vs. fragmenting long bursts into many smaller ones • E.g. email about an NSF grant (Kleinberg 2002) • Events are either relevant or irrelevant • Bursty: the fraction of relevant events alternates between periods when it is large and periods when it is small

5. Bursty communities of blogs • Activity on a given topic within a community, concentrated within a time interval • One member of blog poets posts a series of daily poems about other bloggers • A blogger, Dawn, hosts a poll to determine the funniest and sexiest blogger

6. Approach • Community Extraction • Burst Analysis

7. Time Graph • A set $V$ of nodes, where each node $v \in V$ has an associated interval $D(v)$ on the time axis (called the duration of $v$) • A set $E$ of edges, where each $e \in E$ is a triple $(u, v, t)$ such that $u$ and $v$ are nodes in $V$ and $t$ is a point in time in the interval $D(u) \cap D(v)$ • $G_t = (V_t, E_t)$: the prefix of the time graph at time $t$
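
The definition on slide 7 can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the class name `TimeGraph`, its methods, and the reading of the prefix at time t (nodes whose duration has started by t, edges stamped no later than t) are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TimeGraph:
    """Hypothetical container for a time graph (names are illustrative)."""
    durations: dict = field(default_factory=dict)  # node -> (start, end), i.e. D(v)
    edges: list = field(default_factory=list)      # (u, v, t) triples

    def add_node(self, v, start, end):
        if start > end:
            raise ValueError("duration must satisfy start <= end")
        self.durations[v] = (start, end)

    def add_edge(self, u, v, t):
        # t must lie in the intersection of D(u) and D(v).
        (su, eu), (sv, ev) = self.durations[u], self.durations[v]
        if not (max(su, sv) <= t <= min(eu, ev)):
            raise ValueError("edge time is outside D(u) ∩ D(v)")
        self.edges.append((u, v, t))

    def prefix(self, t):
        # Assumed reading of G_t: keep nodes whose duration has started
        # by time t and edges whose timestamp is at most t.
        g = TimeGraph()
        g.durations = {v: d for v, d in self.durations.items() if d[0] <= t}
        g.edges = [e for e in self.edges if e[2] <= t]
        return g
```

For example, with node `a` alive on (0, 10) and node `b` alive on (5, 20), an edge at time 7 is accepted, while an edge at time 3 is rejected because `b` does not exist yet.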

8. Community Extraction • Finding dense subgraphs is NP-hard, so a heuristic is used • 1. Preprocessing: remove all pages that have more than a certain number of in-links (too famous) • 2. Pruning: vertices of degree 1 and 2 are removed; vertices of degree 3 are checked for membership in a $K_3$ (triangle), and those that pass become seeds • 3. Expansion: find the vertex with the most links to the current community and add it if that number meets the threshold $t_k$
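
The three steps on slide 8 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `extract_communities`, the default `max_in_links`, the iterative removal of all vertices of degree 2 or less, the use of any surviving triangle as a seed, and the majority-style `threshold` function are all hypothetical choices made for this example.

```python
from collections import defaultdict


def extract_communities(edges, max_in_links=50,
                        threshold=lambda size: size // 2 + 1):
    """Greedy community extraction sketch over directed (u, v) edges."""
    # 1. Preprocessing: drop pages with too many distinct in-links.
    in_links = defaultdict(set)
    for u, v in edges:
        if u != v:
            in_links[v].add(u)
    famous = {v for v, src in in_links.items() if len(src) > max_in_links}

    # Work on the undirected version of what remains.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v and u not in famous and v not in famous:
            adj[u].add(v)
            adj[v].add(u)
    adj = dict(adj)

    # 2. Pruning (assumed iterative): remove vertices of degree <= 2.
    low = [v for v in adj if len(adj[v]) <= 2]
    while low:
        v = low.pop()
        if v not in adj:
            continue
        for w in adj.pop(v):
            adj[w].discard(v)
            if len(adj[w]) <= 2:
                low.append(w)

    # Seeds are triangles (K3); 3. Expansion grows each seed greedily.
    communities, assigned = [], set()
    for u in sorted(adj):
        if u in assigned:
            continue
        seed = None
        for v in sorted(adj[u]):
            if v in assigned:
                continue
            common = sorted(w for w in adj[u] & adj[v] if w not in assigned)
            if common:
                seed = {u, v, common[0]}
                break
        if seed is None:
            continue
        community = set(seed)
        while True:
            # Count links from each outside vertex into the community.
            counts = defaultdict(int)
            for m in community:
                for w in adj[m]:
                    if w not in community and w not in assigned:
                        counts[w] += 1
            if not counts:
                break
            best = max(sorted(counts), key=counts.get)
            if counts[best] < threshold(len(community)):
                break  # the best candidate falls short of t_k
            community.add(best)
        communities.append(community)
        assigned |= community
    return communities
```

The threshold is passed as a function of the current community size so that different choices of $t_k$ can be tried without touching the loop; the default used here is only a placeholder.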

9. Burst analysis • Treat the arrival of edges in the blog graph as an event stream • Run Kleinberg's algorithm to obtain the weight of every burst in a community C • Apply it to each extracted community in the graph
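
A two-state, batched burst detector in the spirit of Kleinberg's algorithm might look like the sketch below. It is an illustration under assumptions, not the code used in the study: the function name `find_bursts`, the parameters `s` and `gamma`, the transition cost `gamma * ln(n)`, and the mapping of data to inputs (per time batch, `relevant` counts new edges inside one community and `total` counts all new edges) are choices made for this example.

```python
import math


def _cost(p, relevant, total):
    # Negative log-likelihood of the batch under rate p. The binomial
    # coefficient is omitted because it is the same for both states.
    return -(relevant * math.log(p) + (total - relevant) * math.log(1.0 - p))


def find_bursts(relevant, total, s=2.0, gamma=1.0):
    """Return (start, end, weight) for each maximal run in the burst state."""
    n = len(relevant)
    if n == 0 or sum(total) == 0:
        return []
    p0 = sum(relevant) / sum(total)      # base rate of relevant events
    if not 0.0 < p0 < 1.0:
        return []
    p1 = min(s * p0, 0.9999)             # elevated rate in the burst state
    probs = (p0, p1)
    enter = gamma * math.log(n)          # assumed cost of entering a burst

    # Viterbi over two states; the sequence starts in the base state.
    cost = [0.0, math.inf]
    back = []
    for r, d in zip(relevant, total):
        step, choice = [], []
        for q in (0, 1):
            via0 = cost[0] + (enter if q == 1 else 0.0)
            via1 = cost[1]               # leaving or staying is free
            choice.append(0 if via0 <= via1 else 1)
            step.append(min(via0, via1) + _cost(probs[q], r, d))
        back.append(choice)
        cost = step

    # Trace back the cheapest state sequence.
    state = 0 if cost[0] <= cost[1] else 1
    states = [0] * n
    for t in range(n - 1, -1, -1):
        states[t] = state
        state = back[t][state]

    # Burst weight: total cost saved by using the burst state.
    bursts, t = [], 0
    while t < n:
        if states[t] == 1:
            start, weight = t, 0.0
            while t < n and states[t] == 1:
                weight += (_cost(p0, relevant[t], total[t])
                           - _cost(p1, relevant[t], total[t]))
                t += 1
            bursts.append((start, t - 1, weight))
        else:
            t += 1
    return bursts
```

Calling `find_bursts([1, 1, 1, 9, 9, 9, 1, 1, 1], [10] * 9)` reports a single burst covering batches 3 through 5. Raising `gamma` makes the detector more reluctant to open a burst, which is the knob that trades spurious short bursts against fragmented long ones.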

10. Data acquisition • From 7 blog sites: • http://www.blogger.com • http://www.memepool.com • http://www.globeofblogs.com • http://www.metafilter.com • http://blogs.salon.com • http://www.blogtree.com • Web_Logs subtree of Yahoo

11. Resulting blog graph • 750K links among 25K blogs • 22,299 nodes, 70,472 unique edges, 777,653 edges counting multiplicity, about 11 parallel edges per unique edge on average (777,653 / 70,472) • Generate the time graph

12. Results – Degree Distribution

13. Results - Connectivity • Strongly connected components
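
The figure for this slide did not survive in the transcript. As a rough illustration of the kind of measurement involved, the sketch below computes the fraction of nodes that lie in the largest strongly connected component of a directed graph, using Kosaraju's two-pass algorithm. The function name `largest_scc_fraction` and the idea of applying it to successive snapshots of the blog graph are assumptions made for this example.

```python
from collections import defaultdict


def largest_scc_fraction(nodes, edges):
    """Fraction of nodes in the largest SCC; every edge endpoint is
    assumed to appear in `nodes`."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    seen, order = set(), []
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order;
    # each sweep collects exactly one strongly connected component.
    assigned, best = set(), 0
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        size, todo = 0, [root]
        while todo:
            node = todo.pop()
            size += 1
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        best = max(best, size)
    return best / len(nodes) if nodes else 0.0
```

Evaluating this on the graph as it stood at a series of time points would give a curve of connectedness over time, which is one way to quantify the increase in connectedness that the conclusion describes.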

14. Results – Distribution of community sizes

15. Results – Community Evolution

16. Conclusion • Presents a detailed picture of a web publishing phenomenon • Around the end of 2001, Blogspace began a dramatic increase in connectedness and in local-scale community structure • Dramatic increase in bursty link-creation behavior • The tools are applicable to other evolving hyperlinked corpora.