Blogosphere: Research Issues, Tools and Applications Huan Liu and Nitin Agarwal {Huan.Liu, Nitin.Agarwal.2}@asu.edu Computer Science and Engineering Arizona State University An updated version could be downloaded from www.public.asu.edu/~huanliu/KDD08BlogosphereTutorial.pdf or www.public.asu.edu/~nagarwa6/KDD08BlogosphereTutorial.pdf
Acknowledgments • We would like to express our sincere thanks to Magdiel Oliveras Galan, John J. Salerno, Shankar Subramanya, Sanjay Sundarajan, Lei Tang, Philip S. Yu, and Alan Zheng Zhao for collaboration, discussion, and valuable comments. • This work is, in part, sponsored by AFOSR and ONR grants in 2008. • This agreement covers the use of all slides of this tutorial. • You may use these slides freely for teaching if you send us an email stating the university name and class/course number in advance, and cite this tutorial. • If you wish to use these slides in any other ways, please contact (or email) us. The ppt version contains notes with additional information such as various sources in addition to References.
Outline Background: Web 2.0 and Social Networks Blogosphere: Definition, Types, and Comparison Blogosphere Research Issues Tools and APIs Data Collection Measures, Models, and Methods Performance, Evaluation, and Metrics Case Studies References
Characteristics of Web 2.0 • Rich Internet Applications • User generated contents • User enriched contents • User developed widgets • Collaborative environment: Participatory Web, Citizen journalism • Thus, it leverages the power of the Long Tail with user generated data as the driving force • More of a paradigm shift than a technology shift
Web 2.0 Services (examples) • Blogs • Blogspot • Wordpress • Wikis • Wikipedia • Wikiversity • Social Networking Sites • Facebook • Myspace • Orkut • Digital media sharing websites • Youtube • Flickr • Social Tagging • Del.icio.us • Others • Twitter • Yelp
Top 20 Most Visited Websites Internet traffic report by Alexa on July 29th 2008 40% of the top 20 websites are Web 2.0 sites
Social Networks • A social structure made of nodes (individuals or organizations) that are related to each other by various interdependencies like friendship, kinship, like, ... • Graphical representation • Nodes = members • Edges = relationships • Various realizations • Social bookmarking (Del.icio.us) • Friendship networks (facebook, myspace) • Blogosphere • Media Sharing (Flickr, Youtube) • Folksonomies
Some Related CFPs A Little Detour • ACM TKDD Special Issue on Social Computing http://www.public.asu.edu/~huanliu/acm-tkdd-sbp • Second International Conference on Social Computing, Behavioral Modeling, and Prediction (SBP09) http://www.public.asu.edu/~huanliu/sbp09 • SIAM International Conf on Data Mining (SDM) Sparks (Reno area), Nevada, April 30 - May 2, 2009. http://www.siam.org/meetings/sdm09
Definitions, Types, and Comparison BLOGOSPHERE
Blogging Phenomenon It’s growing fast as a new means for online communications and interactions A blogger could gain instant fame via his blogs A blogger may make a good living with her blogs Abundant, lucrative business opportunities A new political arena
“The site, chock full of advertising, is a moneymaking machine – so much so that Ms. Armstrong and her husband have both quit their regular jobs.” The reason? The advertisers are eager to influence her 850,000 readers. Arnold Kim, founder and senior editor of MacRumors.com. “The site places MacRumors No. 2 on a list of the ‘25 most valuable blogs,’ …” What is the potential value? “Two of the other tech-oriented blogs on its list, …, were sold earlier this year, reportedly for sums in excess of $25 million.” Source: The New York Times
Blogosphere Growth “36 million women participate in the blogosphere each week, and 15 million have their own blogs” – A Study by BlogHer Today Front Page NY Times The Year of the Political Blogger Has Arrived … both parties understand the need to have greater numbers of bloggers attend. … to bring down the walls of the convention … “In January 2004, there were about 1 million blogs on the Internet. As of mid-2006, the population of the ‘blogosphere’ was well past 50 million and climbing.” – Paul Gillin, The New Influencers, 2007
Understanding Blogosphere • Blogosphere • Blog sites • Bloggers • Blog posts • Reverse chronologically ordered entries • Blogroll • Permalinks • Trackback • Everyone can publish, but few are heard • Many interesting questions to address • How to build traffic • How to find niche online • How to increase influence • How to … • Fertile research domain
Types of Blogs • Individual vs. community • Single authored (Individual blog sites) • Multi authored (Community blog sites) • Regulated vs. anonymous
Blogosphere • Complex Social Networks • Vertices (Nodes): Bloggers/ Blog posts/Blog sites • Edges: Relationships/Links • In-Degree: Number of inlinks • Out-Degree: Number of outlinks
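The in-degree and out-degree definitions above can be illustrated with a minimal sketch. The link list and blog names are invented for illustration and do not come from the tutorial; the function name `degree_counts` is likewise hypothetical.

```python
from collections import Counter

def degree_counts(links):
    """Return (in_degree, out_degree) Counters for (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in links:
        out_degree[source] += 1   # source has one more outlink
        in_degree[target] += 1    # target has one more inlink
    return in_degree, out_degree

# Hypothetical hyperlinks between blog sites (source links to target).
links = [("blogA", "blogB"), ("blogC", "blogB"), ("blogB", "blogA")]
in_degree, out_degree = degree_counts(links)
print(in_degree["blogB"], out_degree["blogB"])
```

The same counting applies whether the vertices are bloggers, blog posts, or blog sites; only the meaning of a link changes.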
Social Networks Orkut, Facebook, LinkedIn, Classmates.com, etc. LiveJournal, MySpace, etc. TUAW, Blogger, Windows Live Spaces, etc. Friendship Networks vs. Blogosphere
Citation Networks vs. Blogosphere • Citation links • DBLP: strict notion of links. People cite what they refer to • Blogs: links are casual and often missing • Social networks • DBLP: inferred from co-authorship, citation networks • Blogs: people explicitly specify their social network or inferred from links, comments, etc. • Communities • DBLP: conference venues, journals, (relatively static) • Blogs: community blogs, inferred from blog roll (related blogs), topic taxonomy, blog-blog interaction, (very dynamic)
Understanding Blogosphere • Understand structures and properties of Blogosphere • Gain insights into the relationships between bloggers, readers, blog posts, comments, different blog sites in Blogosphere • Models help generate artificial data, tune the parameters to simulate special scenarios, and compare various studies and different algorithms • Study peculiarities in Blogosphere and infer latent patterns and structures that could explain certain phenomena like influence, diffusion, splogs, community discovery.
Modeling Web and Blogosphere • Some key differences between Web and Blogosphere • Models developed for the Web assume a dense graph structure due to a large number of interconnecting hyperlinks within webpages. This assumption does not hold for the Blogosphere, which is shown to have a very sparse hyperlink structure [Kritikopoulos et al. 2006]. • The level of interaction in terms of comments and replies to blog posts makes Blogosphere different from Web • The highly dynamic and “short-lived” nature of blog posts cannot be simulated by the web models. Web models do not consider dynamicity in the web pages • Web models assume webpages accumulate links over time. However, this is not true with Blogosphere • “Categories” and “tags” give blogs flexibility that conventional websites typically don’t have • Blogs use descriptive filenames in permalinks, unlike typical webpage filenames
Modeling Blogosphere • Preferential attachment • The probability that a new edge attaches to a node depends on the node's degree • “The rich get richer” • Power law distribution or scale free distribution • Hybrid model • Mixture of both preferential attachment model and random model • Give a lucky poor guy some chance to get rich • To address irreducibility (strong connectedness despite a few isolated subgraphs), the random walk on a graph model proposes a random jump with a fixed probability • Leskovec et al. 2007 studied temporal patterns • How often people create blog posts • Burstiness and popularity • How these posts are linked and what is the link density • Developed an SIS (susceptible-infected-susceptible) based model • Kumar et al. 2003 use blogrolls on the blog posts to construct a network of blog posts assuming that blogrolls contain similar blog posts
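The hybrid of preferential attachment and random attachment can be sketched as a small simulation. This is an illustrative sketch only: the function name `grow_hybrid_network`, the parameter `p_random`, the one-edge-per-new-node rule, and the starting graph are all assumptions, not details from the cited models.

```python
import random

def grow_hybrid_network(n_nodes, p_random=0.2, seed=0):
    """Grow a graph one node at a time; each new node adds one edge.

    With probability p_random the target is picked uniformly at random
    (random model); otherwise it is picked with probability proportional
    to its current degree (preferential attachment).
    Returns a dict mapping node -> degree.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}      # start from a single edge 0-1
    endpoints = [0, 1]         # each node appears once per incident edge
    for new_node in range(2, n_nodes):
        if rng.random() < p_random:
            target = rng.randrange(new_node)   # uniform over existing nodes
        else:
            target = rng.choice(endpoints)     # proportional to degree
        degree[target] += 1
        degree[new_node] = 1
        endpoints.extend([target, new_node])
    return degree

degree = grow_hybrid_network(2000, p_random=0.2)
print("largest degree:", max(degree.values()))
```

Sampling from the `endpoints` list is a common trick: a node with degree d appears d times, so a uniform draw from the list is a degree-proportional draw over nodes. Raising `p_random` gives low-degree nodes ("a lucky poor guy") more chances to receive links.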
Blog Clustering • Dynamic and automatic organization of the content • Convenient accessibility • Optimizing search engines by reducing search space • Search only the relevant cluster • Focused crawling • Summarization • Topic identification • Reduce information overload • 175,000 blog posts per day, i.e., 2 blog posts per second – Dec 2006 • Extraction and analysis of the trends
Blog Clustering (2) • Brooks and Montanez 2006 used tf-idf and picked the top 3 keywords for each blog post • Clustered blogs based on these keywords • Reported improved clustering as compared to that using tags • Li et al. 2007 assigned different weights to title, body, and comments of blog posts • Need to address high dimensionality and sparsity due to their keyword-based approach • Agarwal et al. 2008 proposed a collective-wisdom based approach • Generate a category relation graph based on user assignments • Compute similarity matrix from this graph
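The keyword-selection step (tf-idf, top 3 keywords per post) can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization, raw term frequency, and idf = log(N / df); the exact weighting and preprocessing used by Brooks and Montanez are not given in these slides. The sample posts and the function name `top_keywords` are invented.

```python
import math
from collections import Counter

def top_keywords(posts, k=3):
    """Return the k highest tf-idf terms for each post in a list of strings."""
    tokenized = [post.lower().split() for post in posts]
    n_posts = len(tokenized)
    # Document frequency: number of posts that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n_posts / doc_freq[t]) for t in tf}
        # Highest score first; ties broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

# Hypothetical blog posts.
posts = [
    "apple iphone rumor apple launch",
    "election campaign blog election debate",
    "apple campaign blog",
]
keywords = top_keywords(posts)
print(keywords)
```

Posts could then be grouped by shared keywords; the clustering step itself is not shown here.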
Blog Mining • Interactions between producers and consumers improved with blogs • Consumers not only speak their mind but also broadcast their opinions • Blogs are invaluable information sources • consumers’ beliefs and opinions, • initial reaction to a launch, • understand consumer language, • track trends and buzzwords, and • fine-tune information needs • Blog conversations leave behind trails of links, useful for understanding how information flows and how opinions are shaped and influenced • Tracking blogs also helps in gaining deeper insights
Blog Mining for Opinion • A prototype system called Pulse [Gamon et al. 2005] uses a Naive Bayes classifier trained on manually annotated sentences with positive/negative sentiments and iterates until all unlabeled data is adequately classified. • Another system presented in [Attardi and Simi 2006] improves blog retrieval by using opinionated words acquired from WordNet in the query proximity. • Some well-known opinion mining and sentiment analysis techniques [B. Liu 2006] could also be borrowed from the text mining domain because blogs are highly textual. • LingPipe (http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html) is another open source software which performs sentiment analysis on text corpora. • Subjective (opinion) vs. Objective (fact) sentences • Positive (favorable) vs. Negative (unfavorable) movie reviews
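A Naive Bayes sentiment classifier of the kind mentioned above can be sketched in a few lines. This is not the Pulse system: it covers only the supervised step (no iteration over unlabeled data), and it assumes whitespace tokenization and add-one smoothing. The training sentences and function names are invented for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_sentences):
    """Build word and label counts from a list of (text, label) pairs."""
    word_counts = {}           # label -> Counter of word occurrences
    label_counts = Counter()   # label -> number of training sentences
    vocabulary = set()
    for text, label in labeled_sentences:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return {"words": word_counts, "labels": label_counts, "vocab": vocabulary}

def classify(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_label in model["labels"].items():
        counts = model["words"][label]
        n_words = sum(counts.values())
        score = math.log(n_label / total)          # log prior
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (n_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical manually annotated sentences.
training = [
    ("great phone love it", "positive"),
    ("love the great screen", "positive"),
    ("terrible battery hate it", "negative"),
    ("hate the terrible support", "negative"),
]
model = train_naive_bayes(training)
print(classify(model, "love this great camera"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.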
Influence Market Movers: “word-of-mouth”, trust and reputation Sway opinions: Government policies, campaign Customer Support and Troubleshooting Market research surveys: “use-the-views” Representative articles: 18.6 new blog posts per sec Advertising
Blog Influence • Two types of influence • Influential blog sites and site networks [Gill 2004, Gruhl et al 2004, Java et al 2006] • Influential bloggers in a community [Agarwal et al. 2008] • Blogosphere vs. Friendship Networks • Implicit vs. Explicit links • Blog statistics vs. Centrality measures • “influencing” vs. “could influence” • Loosely vs. Strictly defined graph structures • Blog vs. Webpage Ranking • Blog sites too sparse for webpage ranking algorithms to work [Kritikopoulos et al 2006] • Webpage acquires authority over time, blog posts’ influence diminishes • Greedy approach works better than PageRank, HITS to maximize influence flow [Kempe et al 2003, Richardson & Domingos 2002]
Issue of Trust • Open standards and low barriers to publishing have created an overwhelming amount of collective wisdom • Yet it is more difficult for readers to discern whom to trust in some cases • Similar to WWW • Authoritative webpages e.g., HITS [Kleinberg et al. 1998], PageRank [Page et al. 1999] • Blogosphere allows the masses to create and edit content, compromising the sanctity of the original content • Some work exists for the social friendship network domain, but not many researchers have explored Blogosphere • Huge potential for trust study in Blogosphere domain
Trust • Trust propagation: $M_{i+1} = M_i \cdot C_i$, performed until convergence, where $M$ is the belief matrix and $C_i$ is the atomic propagation matrix, $C_i = M + M^{T} M + M^{T} + M M^{T}$ • Kale et al. 2007 transformed the problem of trust in blogosphere to the one in social friendship networks • Studied propagation of trust among different blog sites • Mined sentiments from a window of words around hyperlinks • Identified positive, negative, or neutral sentiments towards the linked blog site • Constructed a network of blog sites using hyperlinks • Used Gruhl et al. 2004 trust propagation algorithm • Some concerns • These blog sites have to be linked for trust propagation • Trust is computed between blog sites based on how much one blog agrees or disagrees with the other
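The propagation rule on this slide can be sketched with plain nested lists. Several points are assumptions made for illustration: the atomic propagation matrix is computed once from the initial belief matrix, all four terms have equal weight, and the loop runs a fixed number of steps because the unnormalized product is not guaranteed to converge. The helper names and the three-site example are invented.

```python
def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(*matrices):
    """Add any number of equally sized matrices element by element."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]

def atomic_propagation(m):
    """C = M + M^T M + M^T + M M^T, as written on the slide."""
    mt = transpose(m)
    return matadd(m, matmul(mt, m), mt, matmul(m, mt))

def propagate_trust(m, steps=1):
    """Apply M_{i+1} = M_i * C for a fixed number of steps (assumption)."""
    c = atomic_propagation(m)
    current = m
    for _ in range(steps):
        current = matmul(current, c)
    return current

# Hypothetical beliefs: site 0 trusts site 1, and site 1 trusts site 2.
m = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
result = propagate_trust(m, steps=1)
print(result)
```

After one step, the entry for site 0 toward site 2 becomes nonzero even though there is no direct link, which is the effect propagation is meant to capture. A practical version would normalize or threshold the matrix at each step.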
Community Extraction • Blogosphere doesn’t have an explicit notion of communities except community blogs • Discovering communities among individual blogs based on interaction • Different from blog clustering • Blog Clustering uses textual similarity • Community extraction taps interaction and link analysis
Community Extraction • Blogosphere doesn’t have an explicit notion of communities • Different from blog clustering • Researchers identify communities based on • Links: network of hyperlinks allows identification of virtual communities • Several studies on finding community of webpages like Kleinberg 1998 and Kumar et al. 1999 • While Kleinberg used authority and hubs idea to explore communities of webpages, Kumar et al. extended the idea of hubs and authorities and included co-citations as a way to extract all communities on the web and used graph theoretic algorithms to identify all instances of graph structures that reflect community characteristics. • Content: blogs with similar content or inspired by the same event form a virtual community • Kumar et al. 2003, Efimova and Hendrick 2005, Blanchard 2004
Community Extraction • Chin and Chignell 2006 proposed a model for finding communities taking the blogging behavior of bloggers into account • They aligned behavioral approaches through blog reader survey in studying blog community. • Blanchard and Marcus 2004 studied a multiple sport newsgroup “Virtual Settlement” and analyzed the possibility of emerging virtual communities • Newsgroups and discussion forums are similar in terms of interaction patterns to Blogosphere • More person-to-group interaction rather than person-to-person interaction
Spam blog (Splogs) Filtering One of the major rising concerns on Blogosphere Spammers make most of their money by getting viewers to click on ads that run adjacent to their nonsensical text Open standards and low barriers to publishing escalate the problem and make it harder to solve Besides degrading search quality, splogs consume network resources
Spam blog (Splogs) Filtering • One of the major rising concerns on Blogosphere • Open standards and low barriers to publishing escalate the problem and make it harder to solve • Besides degrading search quality, splogs consume network resources • Initial research applied web spam link detection approaches • Ntoulas et al. 2006 distinguish between normal web pages and spam webpages based on statistical properties like • number of words, average length of words, anchor text, title keyword frequency, tokenized URL • Gyongyi et al. 2004, Gyongyi et al. 2006 use PageRank to compute the spam score of a webpage • Kolari et al. 2006 consider each blog post as a static webpage and use both content and hyperlinks to classify a blog post as spam using an SVM based classifier
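Some of the statistical properties listed above are simple to compute. The sketch below extracts three of them (number of words, average word length, tokenized URL); anchor text and title keyword frequency are omitted. The tokenization rules, the function name `content_features`, and the sample page are assumptions for illustration, not the definitions used by Ntoulas et al.

```python
import re

def content_features(body, url):
    """Compute a few simple content statistics for one page."""
    words = re.findall(r"[A-Za-z0-9']+", body)
    n_words = len(words)
    avg_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    # Split the URL on every run of non-alphanumeric characters.
    url_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    return {
        "num_words": n_words,
        "avg_word_length": avg_len,
        "url_tokens": url_tokens,
    }

# Hypothetical spam-like page.
features = content_features(
    "cheap pills buy cheap pills now",
    "http://cheap-pills-online.example.com/buy",
)
print(features)
```

Feature dictionaries like this one would typically be fed to a classifier trained on labeled spam and non-spam pages.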
Spam blog (Splogs) Filtering • Some critical differences between web spam detection and splog detection • The content on blog sites is very dynamic as compared to that of web pages, so content based spam filters are ineffective • Moreover, spammers can copy the content from some regular blog posts to evade content based spam filters • Link based spam filters can easily be beaten by creating legitimate links • Lin et al. 2007, consider the temporal dynamics of blog posts and propose a self similarity based splog detection algorithm based on characteristic patterns found in splogs like, • Regularities or patterns in posting times of splogs, • Content similarity in splogs, and • Similar links in splogs.
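One of the patterns above, regularity in posting times, can be illustrated with a very simple statistic. This is a simplified stand-in and not the self-similarity algorithm of Lin et al. 2007: it only measures how evenly spaced a blog's posts are, using the coefficient of variation of the gaps between consecutive posts. The function name and the timestamps (in seconds) are invented.

```python
from statistics import mean, pstdev

def posting_regularity(timestamps):
    """Coefficient of variation of gaps between posts; lower means more regular.

    Returns None when there are fewer than two posts or all gaps are zero.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps or mean(gaps) == 0:
        return None
    return pstdev(gaps) / mean(gaps)

# Hypothetical posting times in seconds since the first post.
machine_like = [0, 3600, 7200, 10800, 14400]   # exactly one post per hour
human_like = [0, 500, 9000, 9500, 40000]       # irregular bursts

print(posting_regularity(machine_like), posting_regularity(human_like))
```

A score near zero suggests automated, clockwork posting; it would be only one signal among several, alongside content and link similarity.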
Opinion and Sentiment Analysis • BLEWS (http://research.microsoft.com/projects/blews/blews.aspx) • Using Blogs to Provide Context for News Articles • Political views: Liberal vs. Conservative • Emotional charge • SKEWZ (http://www.skewz.com/) • Reveals bias in news stories (articles) • Users rate the story on a scale from Liberal to Conservative • Readers vote