Topical Semantics of Twitter Links. Michael J. Welch, Uri Schonfeld. Yahoo! Inc., UCLA Computer Science Dept. WSDM '11. March 23, 2011. In-seok An, SNU Internet Database Lab. Outline: Introduction, Twitter, Modeling Twitter, Analysis of the Graph, Exploring Link Semantics
Topical Semantics of Twitter Links • Michael J. Welch, Uri Schonfeld • Yahoo! Inc., UCLA Computer Science Dept • WSDM '11 • March 23, 2011 • In-seok An, SNU Internet Database Lab
Outline • Introduction • Twitter • Modeling Twitter • Analysis of The Graph • Exploring Link Semantics • Experiments on Link Semantics • Conclusion
Introduction • Twitter • Microblogging site • Ranked 10th worldwide in total traffic • 28 million unique monthly visitors • Provider of information for breaking news events
Introduction • Simple graphical modeling for Web • Text-based pages connected by hyperlinks (directed edges) • Applied to Twitter, will fail to capture all that this information has to offer • Produces less-than-ideal results • A rich graphical model for Twitter • Multiple semantic edges • Follow, RT, Mention, List • Not all edges are created equal • In this paper • Web graph vs. Twitter graph • Follow link vs. Retweet link
Introduction: Twitter • Twitter • Blogging platform • Maximum of 140 characters • Micro-blogging platform • Multiple interfaces • Web, SMS, mobile application, instant messaging, etc.
Introduction: Twitter • Dual role • Reader • A user may choose to follow another user’s posts • Accessible via a private stream (timeline) • Sorted by their publication timestamp • Friends / followers • Writer • Posting messages • Retweeting messages • Replying to or mentioning other Twitter users
Introduction: Twitter • Mention • User is referred to by their username prefixed with the character “@” • Retweet • A user chooses to repeat another user’s post • New style retweet • Old style retweet
Introduction: Twitter • List • Added in late 2009 • Allows users to construct and organize a group of users referred to as a list • Helps a user focus on the posts of certain subsets of their friends • Two broad categories • Topical lists • Centered around the discussion of common interests or subjects • “politics” • Classification lists • Formed to group users who share a common trait • “Celebrities”, “professional athletes” • Lists generate meaningful manually-created categorizations of users
Outline • Introduction • Modeling Twitter • The Full Twitter Graph Model • Additional Twitter Information • The Simplified Twitter Graph • Analysis of The Graph • Exploring Link Semantics • Experiments on Link Semantics • Conclusion
Modeling Twitter • Web graph model • Nodes • Web pages • Edges • Hyperlinks connecting them • Enables the application of many graph analysis techniques • Inlink & outlink distributions • PageRank • N by N matrix M • The Web graph is commonly represented as a matrix • N is the number of pages on the Web
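The matrix view of the Web graph can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the uniform handling of dangling pages are all assumptions chosen for the example.

```python
def adjacency_matrix(pages, links):
    """Build the N by N matrix M where M[i][j] = 1 if page i links to page j."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    M = [[0] * n for _ in range(n)]
    for src, dst in links:
        M[index[src]][index[dst]] = 1
    return M


def pagerank(M, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over an adjacency matrix (assumed parameters)."""
    n = len(M)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            out_degree = sum(M[i])
            if out_degree == 0:
                # Dangling page: spread its rank uniformly (an assumption).
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if M[i][j]:
                        new_rank[j] += damping * rank[i] / out_degree
        rank = new_rank
    return rank


# Toy Web graph: a -> c, b -> c, c -> a
pages = ["a", "b", "c"]
links = [("a", "c"), ("b", "c"), ("c", "a")]
M = adjacency_matrix(pages, links)
ranks = pagerank(M)
```

Row sums of M give outlink counts and column sums give inlink counts, which is how the inlink and outlink distributions fall out of the same representation.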
Modeling Twitter: The Full Twitter Graph Model • The Twitter graph is inherently more complex • At least two different types of entities (nodes) • Users and Tweets • At least four types of relationships (edges) • Follows, Publish, Retweets and Mentions • Twitter Graph Edges • Follow edge • User a follows the posts of user b • Publish edge • Authorship of the post • Retweet edge • Post a is a retweet of post b • Mention edge • Post a mentions user b
Modeling Twitter: The Full Twitter Graph Model • Matrix representation of the Twitter graph • Identical to the Web graph • (|U| + |P|) by (|U| + |P|) matrix • |U| : the number of users • |P| : the number of posts • A non-zero value in entry (i, j) of the matrix • Represents an edge between node i and node j
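A small sketch of this matrix may help. The integer codes for the four edge types and the function name are hypothetical choices for illustration; the slides only say that a non-zero entry marks an edge.

```python
# Hypothetical integer codes for the four edge types (not from the paper).
FOLLOW, PUBLISH, RETWEET, MENTION = 1, 2, 3, 4


def build_full_graph(users, posts, edges):
    """Return (nodes, matrix) for a (|U| + |P|) by (|U| + |P|) Twitter graph.

    edges is a list of (source, target, edge_type) triples.
    """
    nodes = list(users) + list(posts)  # users first, then posts
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for src, dst, edge_type in edges:
        matrix[index[src]][index[dst]] = edge_type
    return nodes, matrix


users = ["u1", "u2"]
posts = ["p1", "p2"]
edges = [
    ("u1", "u2", FOLLOW),   # user u1 follows user u2
    ("u2", "p1", PUBLISH),  # u2 wrote p1
    ("u1", "p2", PUBLISH),  # u1 wrote p2
    ("p2", "p1", RETWEET),  # post p2 is a retweet of post p1
    ("p2", "u2", MENTION),  # post p2 mentions user u2
]
nodes, matrix = build_full_graph(users, posts, edges)
```

With the edge definitions given earlier, the edge type is already implied by the endpoint types (user to user is a follow, user to post a publish, post to post a retweet, post to user a mention), so one value per cell is enough in this sketch.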
Modeling Twitter: Additional Twitter Information • Time • Twitter includes timestamp information • When each post was written • When accounts were created • When a follow link was created • No explicit way to determine • Can be approximated with repeated crawling • Valuable for studying factors • Evolution of the graph • Charting popularity over time
Modeling Twitter: Additional Twitter Information • Hyperlinks • Standard hyperlinks embedded in the posts • Third node type • Web page • Uniquely identified by a URL • Difficulty modeling hyperlinks in Twitter • Common use of URL shortening services • TinyURL and bit.ly • Prevents making use of keywords or other interesting artifacts the URL may contain directly • Makes additional processing of the data necessary
Modeling Twitter: Additional Twitter Information • Post Content • Use the content of a post • To extract metadata • User name mention • Identification of retweets • Remaining textual content of a post • Determining the topics of interest to a user as well • Difficulties • Small size of the posts • Sparsity of data • Sparsity of tokens • Frequent use of nonstandard shorthand notation
Modeling Twitter: The Simplified Twitter Graph • Simplified Twitter Graph • Only includes user nodes • Still captures the most important information • From the original representation as it pertains to the users • The user-user follow links remain • As they are in the Full Twitter graph • Add retweet edges to the simplified Twitter Graph • If user a retweets user b at least one time • There is a retweet edge from user a to user b
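The collapse from post-level retweets to user-level retweet edges can be sketched as follows. The function name and the edge-list input format are assumptions for illustration; self-retweets are not treated specially because the slides do not say how they are handled.

```python
def simplify(follow_edges, publish_edges, post_retweet_edges):
    """Reduce the full graph to user nodes with follow and retweet edges.

    follow_edges:       (user_a, user_b) pairs, a follows b
    publish_edges:      (user, post) pairs, user authored post
    post_retweet_edges: (post_a, post_b) pairs, post_a is a retweet of post_b
    """
    author = {post: user for user, post in publish_edges}
    retweet_edges = set()
    for retweet_post, original_post in post_retweet_edges:
        # A set keeps one edge no matter how many times a retweets b.
        retweet_edges.add((author[retweet_post], author[original_post]))
    return set(follow_edges), retweet_edges


follow, retweet = simplify(
    follow_edges=[("u1", "u2")],
    publish_edges=[("u2", "p1"), ("u1", "p2"), ("u1", "p3")],
    post_retweet_edges=[("p2", "p1"), ("p3", "p1")],
)
```

Here u1 retweets u2 twice, yet the simplified graph holds a single retweet edge from u1 to u2.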
Outline • Introduction • Modeling Twitter • Analysis of The Graph • Link Distributions • Graph Formation • Exploring Link Semantics • Experiments on Link Semantics • Conclusion
Analysis of The Graph • Data specification • Collected between October 2009 and January 2010 • 1.1 million Twitter users • More than 273 million follow edges • 2.9 million retweet edges • Crawling method • Beginning with an initial seed set of the top 1000 users in twitterholic.com • Crawling in a BFS manner • Traversing the follow links in a forward direction
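The crawling method above can be sketched as a breadth-first traversal. This is an assumed outline only: `get_friends` is a hypothetical callable standing in for whatever the crawler used to fetch the accounts a user follows, and `max_users` is an illustrative cap.

```python
from collections import deque


def crawl_follow_graph(seeds, get_friends, max_users):
    """BFS over follow links in the forward direction, starting from seeds.

    get_friends(user) is a hypothetical callable returning the users
    that `user` follows; it is not a real Twitter API call.
    """
    visited = set(seeds)
    queue = deque(seeds)
    follow_edges = []
    while queue:
        user = queue.popleft()
        for friend in get_friends(user):
            follow_edges.append((user, friend))
            if friend not in visited and len(visited) < max_users:
                visited.add(friend)
                queue.append(friend)
    return visited, follow_edges


# Toy follow graph standing in for the live site.
toy = {"s": ["a", "b"], "a": ["c"], "b": [], "c": ["s"]}
visited, follow_edges = crawl_follow_graph(["s"], lambda u: toy.get(u, []), max_users=10)
```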
Analysis of The Graph: Link Distributions • Follow Edges • Power-law distribution • Two abnormal spikes in Outlink distribution • 20-friend • Twitter provides an initial set of 20 “recommended” users to follow • 2000-friend • The restrictions Twitter places on following more than 2000 users
Analysis of The Graph: Link Distributions • Retweet Edges • Retweet Inlink • Power-law distribution • Retweet Outlink • Does not follow power-law distribution • While the number of friends one has is generally power-law, the number of users one finds truly interesting does not appear to scale in a similar fashion
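Inlink and outlink distributions for either edge type can be tabulated from an edge list before plotting them on log-log axes. The sketch below is illustrative only; the function name is an assumption, and nodes with degree zero are not counted because they never appear in the edge list.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_hist, out_hist): maps from degree to number of nodes with that degree."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    in_hist = Counter(in_degree.values())
    out_hist = Counter(out_degree.values())
    return in_hist, out_hist


edges = [("a", "b"), ("a", "c"), ("b", "c")]
in_hist, out_hist = degree_distributions(edges)
```

On real follow data, spikes such as the 20-friend and 2000-friend ones would show up as unusually large values of `out_hist[20]` and `out_hist[2000]`.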
Analysis of The Graph: Link Distributions • Posting Frequency • 417,613 users who published at least one tweet • Most recent 200 posts per user • 58,000 users published only a single post during the month • A large number of users wrote more than 100 posts
Analysis of The Graph: Graph Formation • Readers and Writers • Three potential scenarios • A user acts primarily as a reader • Few or no posts • A user frequently retweets posts • Writes little to no original content • A user contributes significant new content • User’s reading and writing behavior • Each dot : unique user • X-axis : # of posts published by friends • Y-axis : # of posts published by user • Shade : originality • The lighter shades indicate less originality • Size : PageRank of each user (based on follow edges)
Analysis of The Graph: Graph Formation • General trend • For users who post very frequently • A larger fraction of their posts are actually retweets • Many users retweeted at least one post which they did not read from one of their friends • Despite the explicit friendship links available in the site structure, it is still not possible to know exactly what a user reads • Many websites are adding modules which display Twitter results
Outline • Introduction • Modeling Twitter • Analysis of The Graph • Exploring Link Semantics • Retweet vs. Follow based Ranking • Link Virality • Experiments on Link Semantics • Conclusion
Exploring Link Semantics • Web graph • A link from page a to page b • Endorsement of the quality of page b • And, to some extent, its relevance to page a • Twitter graph • Follow link • Endorsement of quality or interest • The actual semantics of the link • User a, acting as a reader, is interested in user b acting as a writer • Retweet link • Endorsement of quality • User is interested in the topic • User expects his readers to be interested in this post • Retweet edge signifies a connection from user a as a writer to user b as a writer
Exploring Link Semantics: Retweet vs. Follow based Ranking • PageRank based on two edges • Retweet-based • Simple power-law distribution • Follow-based • Two different segments with different power-law coefficients
Exploring Link Semantics: Retweet vs. Follow based Ranking • PageRank over Retweet links vs. Follow links • Follow links • Twitter recommended celebrities ( barackobama ) • Rich get richer phenomenon • Top ranker has lower rank in RT-based PageRank • Retweet links • Tweetmeme • Social bookmarking site • Top ranker has lower rank in Follow-based PageRank
Exploring Link Semantics: Retweet vs. Follow based Ranking • Follow-based • Public figures or celebrities • Retweet-based • News generating entities • aplusk is the only user who appears in the top 10 for both rankings • These ranks can be affected by spam or marketing techniques • ddlovatoRT simply retweets all posts mentioning Demi Lovato • Twitter’s research team estimates that less than 1% of Tweets are now spam
Exploring Link Semantics: Link Virality • Retweet Virality • Follow Virality • RoF(u) : the set of users from whom u has seen at least one post via a retweet • FoF(u) : the set of all users who are reachable by traversing exactly two directed follow edges • Fr(u) : the set of users whom user u follows • Retweet Virality is consistently higher than Follow Virality • Retweets demonstrate a stronger notion of importance or influence to users • Users are more likely to follow people they see retweeted than those who are merely “Friends of Friends”
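The virality formulas themselves are not reproduced on this slide. One plausible reading of the definitions, assumed here and not confirmed by the source, is that each virality is the fraction of the exposed set that u also follows: |RoF(u) ∩ Fr(u)| / |RoF(u)| for retweets and |FoF(u) ∩ Fr(u)| / |FoF(u)| for follows. The sketch below computes both under that assumption, using the simplified graph stored as dictionaries of sets.

```python
def link_virality(u, follows, retweets):
    """Return (retweet_virality, follow_virality) for user u.

    follows[a]  = set of users a follows (Fr(a))
    retweets[a] = set of users a has retweeted at least once
    The ratio form of both measures is an assumption, not the paper's stated formula.
    """
    fr = follows.get(u, set())
    fof, rof = set(), set()
    for friend in fr:
        fof |= follows.get(friend, set())   # two follow hops away from u
        rof |= retweets.get(friend, set())  # seen by u via a friend's retweet
    # Excluding u itself from both sets is a further assumption.
    fof.discard(u)
    rof.discard(u)
    follow_virality = len(fof & fr) / len(fof) if fof else 0.0
    retweet_virality = len(rof & fr) / len(rof) if rof else 0.0
    return retweet_virality, follow_virality


follows = {"u": {"f1", "f2", "x"}, "f1": {"x", "y", "z"}, "f2": {"y"}}
retweets = {"f1": {"x"}, "f2": {"x", "w"}}
retweet_v, follow_v = link_virality("u", follows, retweets)
```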
Outline • Introduction • Modeling Twitter • Analysis of The Graph • Exploring Link Semantics • Experiments on Link Semantics • Empirical Results • Topic Sensitive PageRank • Conclusion
Experiments on Link Semantics • Topical relevance • Follow links quickly diffuse into a broad range of topics • Retweet links remain more concentrated on the original topic • Data • 1.1 million users • 273 million follow edges • 2.9 million retweet edges
Experiments on Link Semantics: Empirical Results • Empirical evaluation • Starting from a seed set of users • Members of the same topical list • photography and design • Generate two sets of users • At least one seed member follows them • At least one seed member has retweeted one of their posts • Random sample of 25 users from each of these sets • Manually assessed them for topical relevance • Result • # of relevant users in the follow-generated samples were 4 and 5 • # of relevant users in the retweet-generated samples were 19 and 20
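The construction of the two candidate sets and the 25-user samples can be sketched as below. The function names, the exclusion of seed members from the candidate sets, and the fixed random seed are assumptions for illustration.

```python
import random


def candidate_sets(seeds, follow_edges, retweet_edges):
    """Users followed by, or retweeted by, at least one seed member."""
    seeds = set(seeds)
    # Dropping seed members from the candidates is an assumption.
    followed = {b for a, b in follow_edges if a in seeds and b not in seeds}
    retweeted = {b for a, b in retweet_edges if a in seeds and b not in seeds}
    return followed, retweeted


def sample_users(users, k=25, seed=0):
    """Draw up to k users uniformly at random, reproducibly."""
    rng = random.Random(seed)
    ordered = sorted(users)
    return rng.sample(ordered, min(k, len(ordered)))


followed, retweeted = candidate_sets(
    seeds=["s1", "s2"],
    follow_edges=[("s1", "x"), ("s2", "y"), ("z", "w"), ("s1", "s2")],
    retweet_edges=[("s1", "x"), ("z", "y")],
)
follow_sample = sample_users(followed)
```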
Experiments on Link Semantics: Topic Sensitive PageRank • PageRank • Recursive ranking formula • Page is as important as the pages pointing to it • Topic Sensitive PageRank (TSPR) [1] • Biased PageRank • Generate query-specific importance scores for pages at query time • We use topic sensitive PageRank to quantify the difference in topical relevance carried by follow and retweet links • [1] T. H. Haveliwala. Topic-sensitive PageRank. WWW 2002.
Experiments on Link Semantics: Topic Sensitive PageRank • Experiments • Beginning with a topical Twitter list • Compute topic sensitive PageRank for • Follow edges • Retweet edges • If the links carry the topicality well • The high-ranking users are likely to be topically relevant to the original seed topic • Evaluate the resulting highest ranked users for relevance to the original topic with a user survey
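The experiment can be sketched as a PageRank whose teleport step jumps only to the seed users of a topical list, run once over follow edges and once over retweet edges. This is a minimal illustration under assumed parameters (damping 0.85, fixed iteration count, dangling mass returned to the seeds); the function names and the toy graph are hypothetical.

```python
def topic_sensitive_pagerank(nodes, edges, seeds, damping=0.85, iterations=100):
    """PageRank biased toward a seed set; seeds must be a subset of nodes."""
    seeds = set(seeds)
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling user: return its rank to the seed set (an assumption).
                for s in seeds:
                    new_rank[s] += damping * rank[v] * teleport[s]
        rank = new_rank
    return rank


def top_non_seed(rank, seeds, k=30):
    """The k highest ranked users that are not seed members."""
    seeds = set(seeds)
    candidates = [v for v in rank if v not in seeds]
    return sorted(candidates, key=lambda v: rank[v], reverse=True)[:k]


nodes = ["s1", "s2", "photo", "celeb", "other"]
seeds = ["s1", "s2"]
follow_edges = [("s1", "celeb"), ("s2", "celeb"), ("s1", "photo"), ("celeb", "other")]
retweet_edges = [("s1", "photo"), ("s2", "photo"), ("photo", "s1")]

follow_rank = topic_sensitive_pagerank(nodes, follow_edges, seeds)
retweet_rank = topic_sensitive_pagerank(nodes, retweet_edges, seeds)
```

In this toy graph the follow-based ranking surfaces the widely followed `celeb` account first, while the retweet-based ranking surfaces `photo`, mirroring the kind of difference the experiment is designed to measure.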
Experiments on Link Semantics: Topic Sensitive PageRank • Experimental Setup • Collected 9 topical lists from listorious.com • 19 to 437 users per list • Average 155, median 49 • Seed users have an average of 14,284 followers • Compute personalized PageRank • Selected the 30 highest ranking non-seed users • Conduct a survey • Participants were shown a topic description and the 30 highest ranked users for either a follow-based or a retweet-based PageRank • Ordered randomly • Mixed with a random set of 10 of the seed users for that topic • Make a binary judgment of each user’s relevance • A total of 12 people participated in the survey • Each list was evaluated by at least 2 people
Experiments on Link Semantics: Topic Sensitive PageRank • Accuracy of the highly ranked users • Precision • The average relevancy of a set of users • Relevance • The fraction of users who were judged relevant by at least one survey taker • The set of users from U judged relevant in evaluation k of a particular list
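The two measures can be sketched directly from their verbal definitions. The exact formulas are not reproduced on the slide, so the averaging used for precision below is an assumed reading; each evaluation is modeled as the set of users from U that one survey taker judged relevant.

```python
def precision(users, evaluations):
    """Average, over evaluations, of the fraction of users judged relevant."""
    users = set(users)
    fractions = [len(users & set(judged)) / len(users) for judged in evaluations]
    return sum(fractions) / len(fractions)


def relevance(users, evaluations):
    """Fraction of users judged relevant by at least one survey taker."""
    users = set(users)
    judged_by_anyone = set().union(*evaluations) & users
    return len(judged_by_anyone) / len(users)


U = {"a", "b", "c", "d"}
evaluations = [{"a", "b"}, {"b", "c"}]  # two survey takers
p = precision(U, evaluations)
r = relevance(U, evaluations)
```

Relevance is never lower than precision under this reading, since a user counted by any single evaluation is also counted in the union.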
Experiments on Link Semantics: Topic Sensitive PageRank • Result • Precision can be improved by simply using retweet links instead of follow links • Precision of top ranked users improved by over 30%
Experiments on Link Semantics: Topic Sensitive PageRank • Cohesiveness of Seed • To verify the seed users • Include 10 randomly selected seed users for each evaluation • Result • Average Precision : 0.931 • Minimum of 0.838 • Maximum of 1.0 • The seed users represented their topics well • Our survey takers understood and agreed upon the topic definitions
Conclusion • We have described a detailed model of Twitter as a graph • Presented key statistics about the graph • Provided some initial insights as to how the graph forms • Highlighted important distinctions between edge types in the graph • Follow and retweet • The varying semantics and properties of these edges will have significant implications for graph algorithms such as PageRank • Retweet edges preserve topical relevance • Better than follow edges