On the Bursty Evolution of Blogspace Ravi Kumar, Jasmine Novak, Prabhakar Raghavan and Andrew Tomkins IBM Almaden Research Center, Verity Inc. WWW 2003
Main contributions • Time graph and blog graph • Communities in Blogspace • Temporal bursts: from a sequence of documents to sets of blogs • Linked blogs that are topically and temporally focused • Blogspace evolution
Community Extraction of Blogspace • Communities are collections of pages which provide information on a similar topic or share a point of view. • Kleinberg (2000), co-citation, dense bipartite subgraph (signature) • Flake (2000) network flow
Bursts • Events: bursts are modeled over a stream of events • Trade-off: reporting a large number of short spurious bursts vs. fragmenting long bursts into many smaller bursts • E.g. email: NSF grant (Kleinberg 2002) • Relevant events and irrelevant events • Bursty: the fraction of relevant events alternates between periods where it is large and periods where it is small
Bursty communities of blogs • A given topic within a community: within a time interval • One member of blog poets posts a series of daily poems about other bloggers • A blogger Dawn hosts a poll to determine the funniest and sexiest blogger
Approach • Community Extraction • Burst Analysis
Time Graph • A set V of nodes where each node v ∈ V has an associated interval D(v) on the time axis (called the duration of v) • A set E of edges where each e ∈ E is a triple (u, v, t), where u and v are nodes in V and t is a point in time in the interval D(u) ∩ D(v) • G_t = (V_t, E_t)
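The definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `TimeGraph`, its method names, and the reading of G_t as "everything that exists by time t" are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TimeGraph:
    """Nodes carry a duration D(v); edges are triples (u, v, t)."""
    durations: dict = field(default_factory=dict)  # node -> (start, end)
    edges: list = field(default_factory=list)      # list of (u, v, t)

    def add_node(self, v, start, end):
        if start > end:
            raise ValueError("duration must satisfy start <= end")
        self.durations[v] = (start, end)

    def add_edge(self, u, v, t):
        # t must lie in the intersection D(u) ∩ D(v).
        lo = max(self.durations[u][0], self.durations[v][0])
        hi = min(self.durations[u][1], self.durations[v][1])
        if not lo <= t <= hi:
            raise ValueError("edge time outside D(u) ∩ D(v)")
        self.edges.append((u, v, t))

    def prefix(self, t):
        """Assumed reading of G_t: nodes whose duration has begun by t,
        and edges whose timestamp is at most t."""
        nodes = {v for v, (start, _) in self.durations.items() if start <= t}
        edges = [(u, v, te) for (u, v, te) in self.edges if te <= t]
        return nodes, edges


g = TimeGraph()
g.add_node("a", 0, 10)
g.add_node("b", 5, 20)
g.add_edge("a", "b", 7)   # 7 lies in [5, 10], the overlap of both durations
nodes_6, edges_6 = g.prefix(6)
nodes_7, edges_7 = g.prefix(7)
```

Because an edge time must fall inside both endpoint durations, every edge in a prefix automatically has both endpoints in that prefix.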
Community Extraction • Finding the densest subgraphs is NP-hard, so a heuristic is used • 1. Preprocessing: remove all pages that have more than a certain number of in-links (too famous) • 2. Pruning: vertices of degree 1 or 2 are removed; vertices of degree 3 are checked for membership in a K3, and the resulting subgraphs serve as seeds • 3. Expansion: find the vertex with the most links to the current community and add it if the number of links meets the threshold t_k
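The expansion step can be sketched as a greedy loop. This is an illustration under stated assumptions, not the paper's code: the graph is treated as undirected and stored as a dict of neighbor sets, the function name `expand_community` is hypothetical, and the threshold t_k is modeled as a caller-supplied function of the current community size k. Preprocessing and pruning are not shown.

```python
def expand_community(adj, seed, threshold):
    """Greedily grow `seed` by adding the outside vertex with the most
    links into the community, while that count meets threshold(k)."""
    community = set(seed)
    while True:
        # Count, for each outside vertex, its links into the community.
        counts = {}
        for member in community:
            for nb in adj.get(member, ()):
                if nb not in community:
                    counts[nb] = counts.get(nb, 0) + 1
        if not counts:
            break
        # Sorting first makes tie-breaking deterministic (smallest name wins).
        best = max(sorted(counts), key=counts.get)
        if counts[best] < threshold(len(community)):
            break
        community.add(best)
    return community


# Toy graph: triangle a-b-c, d linked to a and b, e linked only to d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
result = expand_community(adj, {"a", "b", "c"}, threshold=lambda k: 2)
```

In the toy graph, d has two links into the seed triangle and is added; e then has only one link into the community, so expansion stops.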
Burst analysis • Arrival of edges in the blog graph as an event stream • Kleinberg algorithm, obtain the weight of every burst in C • Apply on each extracted community in the graph
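A simplified two-state burst detector in the spirit of Kleinberg's model can be sketched as follows. It is not the exact algorithm used in the paper: the function name `find_bursts`, the parameters `scale` and `gamma`, the fixed cost for entering the high state, and the per-step input format (relevant events out of total events) are all assumptions for illustration. The weight of a burst is taken to be the cost saved by being in the high state over the burst interval.

```python
import math


def find_bursts(relevant, total, scale=2.0, gamma=1.0):
    """Return [(start, end, weight)] for maximal runs in the high state."""
    R, D = sum(relevant), sum(total)
    if R == 0 or R == D:
        return []  # no variation to model
    # State 0 emits relevant events at the base rate, state 1 at a higher rate.
    p = [R / D, min(0.999, scale * R / D)]

    def cost(state, r, d):
        # Negative log-likelihood of r relevant events out of d; the binomial
        # coefficient is omitted because it is identical for both states.
        return -(r * math.log(p[state]) + (d - r) * math.log(1 - p[state]))

    # Viterbi over the two states, starting in the low state.
    prev, back = [0.0, math.inf], []
    for r, d in zip(relevant, total):
        cur, ptr = [], []
        for j in (0, 1):
            # Moving up to the high state costs gamma; moving down is free.
            options = [prev[i] + (gamma if j > i else 0.0) for i in (0, 1)]
            i_best = 0 if options[0] <= options[1] else 1
            cur.append(options[i_best] + cost(j, r, d))
            ptr.append(i_best)
        prev = cur
        back.append(ptr)

    # Backtrack to recover the cheapest state sequence.
    state = 0 if prev[0] <= prev[1] else 1
    states = []
    for ptr in reversed(back):
        states.append(state)
        state = ptr[state]
    states.reverse()

    # Collect maximal high-state intervals and their weights.
    bursts, start = [], None
    for t, s in enumerate(states + [0]):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            weight = sum(cost(0, relevant[k], total[k]) - cost(1, relevant[k], total[k])
                         for k in range(start, t))
            bursts.append((start, t - 1, weight))
            start = None
    return bursts


# Nine time steps with ten edge arrivals each; steps 3-5 are mostly relevant.
bursts = find_bursts([1, 1, 1, 9, 9, 9, 1, 1, 1], [10] * 9)
```

For a community C, `relevant` would count the edges arriving inside C at each time step and `total` all edges arriving in the graph; running the detector once per extracted community matches the last bullet above.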
Data acquisition • From 7 blog sites: • http://www.blogger.com • http://www.memepool.com • http://www.globeofblogs.com • http://www.metafilter.com • http://blogs.salon.com • http://www.blogtree.com • Web_Logs subtree of Yahoo
Resulting blog graph • 750K links among 25K blogs • 22,299 nodes, 70,472 unique edges, 777,653 edges counting multiplicity, an average of about 11 links per unique edge • Generate the time graph
Results - Connectivity • Strongly connected components
Conclusion • Present a detailed picture of a web publishing phenomenon • Around the end of 2001, Blogspace began a dramatic increase in connectedness, and in local-scale community structure • Dramatic increases of bursty link creation behavior • Tools are applicable to other evolving hyperlinked corpora.