Studying Blogspace

Presentation Transcript


  1. Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

  2. Etymology • From the OED new ed. (draft entry, Mar 2003) … blog intr. To write or maintain a weblog. Also: to read or browse through weblogs, esp. habitually. • web·log n. 2. A frequently updated web site consisting of personal observations, excerpts from other sources, etc., typically run by a single person, and usually with hyperlinks to other sites; an online journal or diary. • From WWW 2003 (Kumar, Novak, Raghavan, Tomkins) … blog·space n. The collection of weblogs; = blogosphere, blogsphere, blogistan, …

  3. Blogs 101 • Characteristics • Pages with reverse chronological sequences of dated entries • Usually contain a persistent sidebar containing profile (and other blogs read by the author – “blogroll”) • Usually maintained and published by one of the common variants of public-domain blog software • From Slashdot, 1999 “… a new, personal, and determinedly non-hostile evolution of the electric community. They are also the freshest example of how people use the Net to make their own, radically different new media”

  4. Look and feel • Quirky • Highly personal • Consumed by a small number of regular repeat visitors • Often updated multiple times each day • Highly interwoven into a network of small but active micro-communities • Eg: LiveJournal, Xanga, DeadJournal, Blogger, Memepool, …

  5. The blog era • Blogs began in 1996, but exploded in popularity in 1999 • Proliferation of authoring tools • Newsweek 2002 estimates ~500K • LiveJournal 2005 estimates ~3.5M • Annual Blogathon for charity • Bloggers update their blogs every 30 minutes for 24 hours • Sponsors pay … • Impact of blogs • “Miserable failure” on Google

  6. Structural study (Kumar, Novak, Raghavan, Tomkins, CACM 2004)

  7. LiveJournal blogspace • LiveJournal.com: popular blog site • 1.3M bloggers (Feb 2004) • 3.5M bloggers (Apr 2005) • Each blogger has a profile • Name, age, … • Geographic information (city, state, zip, …) • Friends and friend of • Interests/communities

  8. Eg, LiveJournal user “bill”

  9. LJ bloggers in US [map; legend bins: < 1K, < 5K, < 10K, < 25K, < 50K, ~ 100K]

  10. LJ bloggers world-wide [map; legend bins: < 1K, < 2K, < 5K, ~ 25K, ~ 50K, ~ 75K]

  11. Who are they? [table; columns: Age, %, Representative interests]

  12. Friendship graph • Directed • 80% mutual • Average degree ~ 14 • Power law degrees • Clustering coeff. ~ 0.2 • Most friendships explained by age, location, interest [figure; labels: Age, Location, Interest; values as extracted: 1%, 5%, 16%, 20%, 16%, 22%]
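
The statistics listed on this slide (share of mutual edges, average degree, clustering coefficient) can all be computed from a directed edge list. The following is a minimal sketch on a made-up toy graph, not the procedure used in the study. The function names are hypothetical, and computing the clustering coefficient on the undirected version of the graph is an assumption, since the slide does not say how direction was handled.

```python
from itertools import combinations


def mutual_fraction(edges):
    """Fraction of directed edges (u, v) whose reverse (v, u) is also present."""
    edge_set = set(edges)
    return sum((v, u) in edge_set for u, v in edge_set) / len(edge_set)


def average_out_degree(edges):
    """Number of distinct directed edges divided by number of distinct nodes."""
    edge_set = set(edges)
    nodes = {x for edge in edge_set for x in edge}
    return len(edge_set) / len(nodes)


def clustering_coefficient(edges):
    """Average local clustering coefficient, ignoring edge direction (assumption)."""
    nbrs = {}
    for u, v in edges:
        if u != v:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    total = 0.0
    for ns in nbrs.values():
        k = len(ns)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        linked_pairs = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
        total += 2 * linked_pairs / (k * (k - 1))
    return total / len(nbrs)


# Toy data, invented for illustration only.
toy_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("a", "c"), ("c", "d")]
print(mutual_fraction(toy_edges), average_out_degree(toy_edges),
      clustering_coefficient(toy_edges))
```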

  13. Evolutionary study(Kumar, Novak, Raghavan, Tomkins, WWW 2003)

  14. Blogs and evolution • Every blog contains a dated record of • Every word ever written to the blog • Every link ever added in the blog • Blogs are an increasingly important medium, but • Few systematic studies have been performed • Such study should take an evolutionary perspective [Brewington et al] [Bharat et al] [Fetterly et al] [Cho et al] • Tools for understanding evolution not fully understood

  15. Time graphs [figure: an underlying graph on vertices v1–v4 shown beside its time graph, in which the same vertices and edges are placed along a Jan–Aug time axis]

  16. Community evolution in blogs • What are the communities within the time graph? • Community definition, extraction • Graph-based methods (trawling) [Kumar Raghavan Rajagopalan Tomkins, WWW 99] • How active are these communities, and over what timeframe? • Burst analysis [Kleinberg, KDD 02]

  17. Community extraction • Community analysis based on graph structure • Idea: there are many subgraphs that would never occur in a random graph – if we find such subgraphs, there must be some reason • In blogspace, we enumerate dense subgraphs using a greedy heuristic

  18. Dense subgraph enumeration (heuristic) • Scan edges, find triangles • For each triangle, greedily grow its neighbor set • Growth is allowable based on a measure of connectivity to the current dense subgraph • Extracted “communities” are not unique
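
The triangle-then-grow heuristic on this slide might look roughly like the sketch below. The slide does not specify the connectivity measure, so the rule used here (admit a vertex only if it is adjacent to at least a fraction `threshold` of the current members) is an assumption. So are the function names and the choice to work on an undirected adjacency map.

```python
def find_triangles(adj):
    """Yield each triangle once as an ordered triple; adj maps node -> set of neighbours."""
    for u in adj:
        for v in adj[u]:
            if u < v:
                for w in adj[u] & adj[v]:
                    if v < w:
                        yield (u, v, w)


def grow(adj, seed, threshold=0.6):
    """Greedily add the best-connected neighbouring vertex while it meets the threshold."""
    community = set(seed)
    while True:
        candidates = set().union(*(adj[m] for m in community)) - community
        best, best_score = None, 0.0
        for c in sorted(candidates):
            # Assumed connectivity measure: share of current members adjacent to c.
            score = len(adj[c] & community) / len(community)
            if score > best_score:
                best, best_score = c, score
        if best is None or best_score < threshold:
            return frozenset(community)
        community.add(best)


def dense_subgraphs(adj, threshold=0.6):
    """Grow every triangle and collect the distinct resulting vertex sets."""
    return {grow(adj, triangle, threshold) for triangle in find_triangles(adj)}


def undirected_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Toy graph: a 4-clique {1,2,3,4} joined by one edge to a triangle {5,6,7}.
toy = undirected_adjacency(
    [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
)
print(dense_subgraphs(toy))
```

Different seeds can grow into overlapping or identical sets, which is one way to read the slide's remark that the extracted communities are not unique.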

  19. Bursts: Static to dynamic communities • Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity” • Model source as multi-state • Each state has certain emission properties • Traversal between states is controlled by a Markov model • Determine most likely underlying state sequence over time, given observable output
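
The idea of recovering the most likely hidden state sequence can be sketched with a two-state model and the Viterbi algorithm. This is a simplified illustration, not Kleinberg's exact formulation. The Poisson emission model, the rates, the switching probability, and the function names are all assumptions chosen for the example.

```python
import math


def poisson_logpmf(k, rate):
    """Log-probability of observing k events under a Poisson distribution."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def most_likely_states(counts, rates=(1.0, 10.0), switch_prob=0.01):
    """Viterbi decoding: state 0 is the low-rate state, state 1 the high-rate one."""
    n_states = len(rates)
    stay = math.log(1 - switch_prob)
    move = math.log(switch_prob / (n_states - 1))
    # Uniform prior over the initial state (assumption).
    score = [math.log(1 / n_states) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for j, r in enumerate(rates):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + (stay if i == j else move))
            transition = stay if best_i == j else move
            pointers.append(best_i)
            new_score.append(score[best_i] + transition + poisson_logpmf(k, r))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Invented per-period activity counts with one obvious burst in the middle.
counts = [0, 1, 0, 12, 9, 11, 1, 0]
print(most_likely_states(counts))
```

A low switching probability makes the decoder reluctant to change state, so isolated small fluctuations are not reported as bursts.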

  20. An example [figure: two-state automaton] • State 1: output rate very low • State 2: output rate very high • Transition probabilities between states 1 and 2: 0.01 and 0.005 • Sample stream over time: “I’ve been thinking about your idea with the asparagus…” / “Uh huh” / “I think I see…” / “Uh huh” / “Yeah, that’s what I’m saying…” / “So then I said ‘Hey, let’s give it a try’” / “And anyway she said maybe, okay?” • Most likely “hidden” sequence

  21. Some experiments • Crawled 24,109 blogs from popular sites (2003) • Extract archive links from blogs • Extract all dates on blog pages, and tag each word and link with a date • Simple heuristics to automatically extract time-stamps from entries (regular expressions, training, …) • Obtained dates for ~90% of edges

  22. Experiments (contd.) • The time graph • 22,299 nodes, 70,472 unique edges • 0.77M multiedges (average edge multiplicity = 11) • Consider graphs formed by prefixes from Jan 1, 1999 to some later month – generate 47 “prefix graphs” for analysis • Enumerate communities and analyze their burstiness
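
Building monthly prefix graphs from time-stamped edges can be sketched as follows. The edge representation `(source, destination, date)` and the function names are assumptions for illustration. Multi-edges are collapsed to unique edges here, which may differ from how the study treated them.

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of each month from first's month to last's month, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month == 13:
            year, month = year + 1, 1


def prefix_graphs(timed_edges, start, last_month):
    """Map each month to the set of unique edges dated from `start` through that month."""
    graphs = {}
    for cutoff in month_starts(start, last_month):
        graphs[cutoff] = {
            (u, v)
            for u, v, d in timed_edges
            if start <= d and (d.year, d.month) <= (cutoff.year, cutoff.month)
        }
    return graphs


# Invented edges for illustration.
timed_edges = [
    ("a", "b", date(1999, 1, 15)),
    ("b", "c", date(1999, 2, 3)),
    ("a", "b", date(1999, 3, 9)),   # repeated link: a multi-edge
    ("c", "a", date(1999, 3, 20)),
]
graphs = prefix_graphs(timed_edges, date(1999, 1, 1), date(1999, 3, 1))
for month, edges in graphs.items():
    print(month, sorted(edges))
```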

  23. SCC growth [plots] • Largest SCC as fraction of all nodes • 2nd and 3rd largest SCCs as fraction of all nodes

  24. Connectivity in Blogspace [plots] • Number of nodes participating in a community • Number of communities • Fraction of nodes participating in some community

  25. Burstiness of communities Number of communities in “high state” during each time period

  26. Are these results a fluke? • “Randomized Blogspace”: A distribution over time graphs that look very much like the time graph of Blogspace, but with some of the locality of the true graph removed • Vertices and edges arrive at the same times, each edge has the same source, but a randomly-chosen destination • If randomized blogspace behaves like blogspace, then the community structure is a fake
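
One way to generate such a randomized time graph is sketched below. Each edge keeps its source and time stamp and receives a new destination. Drawing the destination uniformly from all other vertices is an assumption, since the slide does not give the distribution. A more careful version might restrict the choice to vertices that have already arrived by that time.

```python
import random


def randomize_destinations(timed_edges, seed=0):
    """Keep each edge's source and time; replace its destination at random."""
    rng = random.Random(seed)
    nodes = sorted({x for u, v, _ in timed_edges for x in (u, v)})
    randomized = []
    for u, _, t in timed_edges:
        # Assumed rule: uniform over all vertices except the source (no self-loops).
        choices = [n for n in nodes if n != u]
        randomized.append((u, rng.choice(choices), t))
    return randomized


# Invented edges: (source, destination, arrival time).
edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("a", "d", 4)]
print(randomize_destinations(edges, seed=1))
```

Running the same community and SCC analyses on both graphs then gives a baseline for how much structure comes from edge timing and out-degrees alone.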

  27. SCC evolution [plots: Blogspace vs. Randomized Blogspace] • Randomized Blogspace forms an SCC much earlier

  28. Community evolution [plots: Blogspace vs. Randomized Blogspace] • Blogspace has many more communities

  29. Exogenous events [plot over time] • Number of blog pages that belong to a community • Number of blog communities • Number of communities identified automatically as exhibiting “bursty” behavior, a measure of cohesiveness of the blogspace • Annotation: Wired magazine publishes an article on weblogs that impacts the tech community • Annotation: Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption

  30. Some questions … • Modeling • Edge arrivals • “Interesting” events • Algorithms • Prediction • Information percolation • Search • o(t · T(n)) • Studies • Sociological • Effect on search and ranking

  31. Prediction via blogs (Gruhl, Guha, Kumar, Novak, Tomkins, 2005)

  32. Blogs as trend indicators • Can blogs be used to predict trends? • Data • Amazon sales rank of some books • Blog chatter in an index • Questions • How well do they correlate? • Can sales rank be predicted using blogs?

  33. The Lance Armstrong Performance Program • Query: Lance Armstrong OR Tour de France

  34. Vanity Fair

  35. Cross-correlation for Lance Armstrong
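
A lagged cross-correlation between blog mentions and a sales signal could be computed along these lines. This is a sketch with invented data, not the analysis behind the slide. The series names and the convention that a positive lag means mentions lead sales are assumptions, and a real sales-rank series would first need to be transformed so that larger values mean more sales.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def cross_correlation(mentions, sales, max_lag):
    """Correlation at each lag; lag k > 0 pairs mentions[t] with sales[t + k]."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = mentions[:len(mentions) - lag], sales[lag:]
        else:
            a, b = mentions[-lag:], sales[:len(sales) + lag]
        result[lag] = pearson(a, b)
    return result


# Invented daily series: the sales spike follows the mention spike by two days.
mentions = [0, 0, 5, 1, 0, 0, 0, 0]
sales = [0, 0, 0, 0, 5, 1, 0, 0]
cc = cross_correlation(mentions, sales, max_lag=3)
print(max(cc, key=cc.get), cc)
```

A peak at a positive lag suggests that blog chatter moves before sales do, which is the case where blogs would be useful as a predictor.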

  36. Simple inferences • How to formulate queries automatically • Depends on the object (book, movie, …) • Simple heuristics work well • Predicting sales motion is hard • Predicting spikes appears relatively easier • More to be done …

  37. Blogs and social networks (Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)

  38. Social networks • Blog friendship graph is a social network • Is there a simple model to describe this network? • Desiderata • Fit experimental observations • Exhibit “six-degrees of separation” • Theoretically tractable

  39. RBF: Rank-Based Friendship • Population network model • Each person has a geographic location • d(·, ·) measures geographic distance • rank_A(B) = #{ C : d(A, C) < d(A, B) } • Pr[A “befriends” B] ∝ 1/rank_A(B) • Independent of distance • Works with arbitrary population densities • Plus local links to neighbors
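
The rank definition and the 1/rank rule can be made concrete with a small sketch. It assumes points in the plane with Euclidean distance standing in for geographic distance, and the function names are hypothetical. Because the count ranges over everyone including A, the nearest other person gets rank 1 as long as locations are distinct.

```python
import math


def ranks_from(a, positions):
    """rank_a(b) = number of people c (including a itself) with d(a, c) < d(a, b)."""
    dist = {p: math.dist(positions[a], positions[p]) for p in positions}
    return {
        b: sum(1 for c in positions if dist[c] < dist[b])
        for b in positions
        if b != a
    }


def befriend_probabilities(a, positions):
    """Normalize weights 1/rank_a(b) into a probability distribution over b."""
    ranks = ranks_from(a, positions)
    # max(r, 1) guards against rank 0 when b shares a's location (an added assumption).
    weights = {b: 1.0 / max(r, 1) for b, r in ranks.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Invented locations on a line.
positions = {"A": (0, 0), "B": (1, 0), "C": (3, 0), "D": (7, 0)}
print(ranks_from("A", positions))
print(befriend_probabilities("A", positions))
```

Only the ordering of distances matters here, not their magnitudes, which is why the model adapts to both dense and sparse regions.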

  40. RBF: Preliminary results • Fits experimental LiveJournal friendship graph data (using geo data in the profile) • Greedy routing: routes messages from source to destination most of the time, using only geographic information • Theoretical analysis: can show that this model guarantees that geographic routing works
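
Greedy geographic routing, as referred to on this slide, forwards a message to whichever friend is geographically closest to the target. A minimal sketch follows, assuming planar coordinates and Euclidean distance. The function name, the data layout, and the rule of giving up when no friend is strictly closer are illustrative assumptions.

```python
import math


def greedy_route(source, target, positions, friends, max_hops=100):
    """Return the greedy path from source to target, or None if routing gets stuck."""
    path = [source]
    current = source
    while current != target and len(path) <= max_hops:
        here = math.dist(positions[current], positions[target])
        nxt = min(
            friends[current],
            key=lambda f: math.dist(positions[f], positions[target]),
            default=None,
        )
        # Stop if there is no friend strictly closer to the target than we are now.
        if nxt is None or math.dist(positions[nxt], positions[target]) >= here:
            return None
        path.append(nxt)
        current = nxt
    return path if current == target else None


# Invented example: five people on a line, with one long-range friendship 0 -> 3.
positions = {i: (i, 0) for i in range(5)}
friends = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_route(0, 4, positions, friends))
```

Each hop uses only the locations of the current person's friends and of the target, so no global knowledge of the graph is needed.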
