
Tag-based Social Interest Discovery

Xin Li, Lei Guo, Eric Zhao. Yahoo! International Social Search.


Presentation Transcript


  1. Tag-based Social Interest Discovery • Xin Li, Lei Guo, Eric Zhao • Yahoo! International Social Search

  2. Internet Social Networks Are Emerging! • Internet social networks are self-organized by online users • Del.icio.us, Facebook, Flickr, MySpace, YouTube • Users are driven by their interests • Fetch and bookmark content • Create new content • Share content • Interest discovery is crucial to a social network • Discover the interests of users in different content • Locate users with similar interests • Link people with similar interests to form communities

  3. Important Features of Social Networks • Organize users and contents • Cluster users into communities • Categorize contents into interesting topics • Provide search functions • Given a topic, locate all matching contents and all users that are interested in the topic • Given a user, locate all his fetched/created contents and the topics of his interests • Given a user, locate all other users that have similar interests

  4. The Problem: Social Interest Discovery • Questions to answer • How to discover a user’s interests based on his fetched/created contents? • How to use individual users’ interests to find interesting topics shared by users? • How to use the topics to create interest-based user communities?

  5. Existing Solutions and Limitations • User-centric • Use the social network graph to discover users with common interests • Problem: online/offline user connections are hard to identify • Object-centric • Detect common interests based on the common objects fetched by users • Problem: discovered interests are object-based, non-descriptive, and implicit • Predefined categorization • Not flexible; cannot catch the most recent popular or hot user interests • Cannot reflect the various user interest groups, which may keep changing over time

  6. Our approach • Leverage user-generated tags • Compute frequent co-occurrences of tag patterns • Use the tag patterns as topics of interests • Cluster users and content around the topics to build communities

  7. Overview • Motivation and Problem • Analysis of tags in a social network • ISID system design • Evaluation • Conclusion

  8. Tags in Social Networks • User-generated labels for annotating content • Descriptive summaries that reflect human judgment • Metadata between users and content • Widely used in social networks • Del.icio.us: http://del.icio.us/help/tags • YouTube: http://www.google.com/support/youtube/bin/answer.py?hl=en&answer=55769 • Facebook: http://www.facebook.com/help.php?hq=tag

  9. del.icio.us Social Network • A pioneering social bookmarking system • http://del.icio.us/ • Our Data Set • Dump for a limited period of time • 4.3 M public, tagged bookmarks; 0.2 M users; 1.4 M bookmarked URLs

  10. URL Popularity Follows Power Law The distribution of URL bookmarking frequency. Most URLs are unpopular.

  11. User Activity Follows Heavy-tail The distribution of user bookmarking frequency. Most users are less active.

  12. Tags vs. Keywords

  13. Tag Vocabulary • [Figure panels: tag coverage for tf keywords; tag coverage for tf-idf keywords] • User tags missed ≤ 20% of tf keywords for ≥ 98% of documents and ≤ 10% of tf-idf keywords for ≥ 90% of documents • Tags covered the most important keywords • But the total number of unique tags is ~10x smaller than that of keywords

  14. Tag Convergence The total number of different tags users can use for a given document is limited no matter how popular the URL is.

  15. Tags Capture Concepts of Contents • Nearly 50% of all URLs have tag match ratio 1 • 70% of all URLs have a tag match ratio > 0.5 • Only 10% of the URLs have no matched tags

  16. From Tags to User Interests • Bookmarks reflect user interests • Tags summarize/describe bookmarked contents • Meta data between users and contents • Connect users and bookmarked contents • Frequently used tag patterns reflect user interests • The key is the co-occurrences of tags

  17. Overview • Motivation and Problem • Analysis of tags in a social network • ISID system design • Evaluation • Conclusion

  18. System Design • Find topics of interests • For a given set of tagged bookmarks, find all topics of interests, i.e., frequent co-occurrences of tags • Clustering • For each topic, find all the URLs and the users such that those users have labeled each of the URLs with all the tags in the topic. • Indexing • Import the topics and their user and URL clusters into an indexing system for application queries.

  19. ISID Architecture • [Figure: Data Source → (posts) → Topic Discovery → (topics, posts) → Clustering → (topics, clusters) → Indexing]

  20. Topic Discovery • Use association rule algorithms to discover co-occurring tag patterns • Originally invented for identifying items frequently bought together in supermarkets • E.g., bread and milk • Use a support number to define the frequency threshold • Efficient at finding frequent patterns in a large set of transactions for a given support number (threshold) • The rule-building part is not used • One more step: remove pattern A if A is a sub-pattern of some other pattern B and both A and B have the same support number • This removes duplicate clusters
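
The topic discovery step above can be illustrated with a minimal Python sketch. This is not the ISID implementation: it counts tag combinations by brute force instead of using an association rule algorithm such as Apriori, and the function names, the `max_size` cap, and the sample posts are all assumptions made for illustration. It shows the two ideas from the slide: a support threshold, and removal of a sub-pattern whose support equals that of a larger pattern.

```python
from collections import Counter
from itertools import combinations


def frequent_tag_patterns(posts, min_support, max_size=3):
    """Return {tag set: support} for tag sets found in >= min_support posts.

    Brute-force counting; max_size is a hypothetical cap that keeps the
    enumeration small. A real system would use a frequent-itemset miner.
    """
    counts = Counter()
    for tags in posts:
        unique = sorted(set(tags))
        for size in range(1, min(max_size, len(unique)) + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def drop_redundant_subpatterns(patterns):
    """Remove pattern A if a strict superset B has the same support."""
    return {
        a: support_a
        for a, support_a in patterns.items()
        if not any(a < b and support_a == support_b
                   for b, support_b in patterns.items())
    }


# Hypothetical posts: each entry is the tag list of one bookmark.
posts = [
    ["python", "programming", "tutorial"],
    ["python", "programming"],
    ["python", "programming", "tutorial"],
    ["music", "mp3"],
    ["music", "mp3"],
    ["python"],
]

topics = drop_redundant_subpatterns(frequent_tag_patterns(posts, min_support=2))
for pattern, support in sorted(topics.items(), key=lambda kv: -kv[1]):
    print(sorted(pattern), support)
```

In this sample, `{"programming"}` is dropped because `{"python", "programming"}` has the same support of 3, while `{"python"}` survives because its support of 4 is higher than that of any superset.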

  21. Clustering
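
As a rough sketch of the clustering step from the system design, the Python below assigns each post to every topic whose tags are all present on that post. It reads the rule per post (a user and a URL join a topic's cluster when that user tagged that URL with all of the topic's tags), which is a simplifying assumption; the function name, the tuple layout, and the sample data are hypothetical.

```python
def cluster_by_topic(posts, topics):
    """Group users and URLs around topics.

    posts:  iterable of (user, url, tags) tuples
    topics: iterable of frozensets of tags
    Returns {topic: {"users": set, "urls": set}}.
    """
    clusters = {topic: {"users": set(), "urls": set()} for topic in topics}
    for user, url, tags in posts:
        tag_set = set(tags)
        for topic, cluster in clusters.items():
            # The post belongs to the topic only if it carries every topic tag.
            if topic <= tag_set:
                cluster["users"].add(user)
                cluster["urls"].add(url)
    return clusters


# Hypothetical sample data.
posts = [
    ("alice", "http://example.com/a", ["python", "programming"]),
    ("bob", "http://example.com/a", ["python"]),
    ("carol", "http://example.com/b", ["python", "programming", "tutorial"]),
]
topics = [frozenset({"python", "programming"})]

clusters = cluster_by_topic(posts, topics)
print(clusters)
```

Here bob is left out of the cluster because his post lacks the `programming` tag, even though he bookmarked the same URL as alice.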

  22. Indexing • Find all URLs that contain a topic, i.e., are tagged with the same set of tags • Find all users interested in a topic • Find all topics containing a tag • Find all topics for a user • Find all topics for a URL • Combinations of the above
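
The queries listed above map naturally onto inverted indexes. The following Python sketch is one possible in-memory layout, not the indexing system used by ISID; the class name `TopicIndex`, its methods, and the sample clusters are assumptions for illustration.

```python
from collections import defaultdict


class TopicIndex:
    """In-memory inverted indexes over topic clusters (illustrative only)."""

    def __init__(self, clusters):
        # clusters: {frozenset of tags: {"users": set, "urls": set}}
        self.clusters = clusters
        self.by_tag = defaultdict(set)
        self.by_user = defaultdict(set)
        self.by_url = defaultdict(set)
        for topic, cluster in clusters.items():
            for tag in topic:
                self.by_tag[tag].add(topic)
            for user in cluster["users"]:
                self.by_user[user].add(topic)
            for url in cluster["urls"]:
                self.by_url[url].add(topic)

    def urls_for_topic(self, tags):
        cluster = self.clusters.get(frozenset(tags))
        return set(cluster["urls"]) if cluster else set()

    def users_for_topic(self, tags):
        cluster = self.clusters.get(frozenset(tags))
        return set(cluster["users"]) if cluster else set()

    def topics_for_tag(self, tag):
        return set(self.by_tag.get(tag, ()))

    def topics_for_user(self, user):
        return set(self.by_user.get(user, ()))

    def topics_for_url(self, url):
        return set(self.by_url.get(url, ()))


# Hypothetical clusters, shaped like the output of a clustering step.
clusters = {
    frozenset({"python", "programming"}): {
        "users": {"alice", "carol"},
        "urls": {"http://example.com/a"},
    },
    frozenset({"music", "mp3"}): {
        "users": {"alice"},
        "urls": {"http://example.com/m"},
    },
}
index = TopicIndex(clusters)

# A combined query: topics of alice that contain the tag "python".
python_topics_for_alice = index.topics_for_user("alice") & index.topics_for_tag("python")
print(python_topics_for_alice)
```

Combined queries reduce to set intersections over the individual lookups, which is why each method returns a plain set.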

  23. Overview • Motivation and Problem • Analysis of tags in a social network • ISID system design • Evaluation • Conclusion

  24. Content Similarity of Topic Clusters • Similarity of two documents • Inner product of tf-idf document vectors • Keyword-based vector • Tag-based vector (for comparison) • Intra-topic similarity • Average cosine similarity of every document pair • Inter-topic similarity • Similarity of two topics • Average similarity of one topic to all other topics
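
The similarity measures above can be sketched in plain Python. The tf-idf weighting used here (raw term count times `log(N / df)`) and the definition of the similarity of two topics as the average cosine over all cross-topic document pairs are assumptions, since the slide does not spell them out; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def intra_topic_similarity(vectors):
    """Average cosine similarity over every pair of documents in one topic."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def inter_topic_similarity(vectors_a, vectors_b):
    """Average cosine similarity over all cross-topic document pairs (assumed)."""
    pairs = [(a, b) for a in vectors_a for b in vectors_b]
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


# Hypothetical documents: the first two belong to one topic, the last two to another.
docs = [
    ["python", "code", "tutorial"],
    ["python", "code", "guide"],
    ["music", "mp3", "album"],
    ["music", "mp3", "band"],
]
vectors = tfidf_vectors(docs)
intra = intra_topic_similarity(vectors[:2])
inter = inter_topic_similarity(vectors[:2], vectors[2:])
print(intra, inter)
```

The same functions apply to either representation: pass keyword token lists for the keyword-based vectors or tag lists for the tag-based comparison.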

  25. Inter- and Intra-Topic Similarity • [Figure panels: tag-based (tf-idf); keyword-based (tf-idf)] • Intra-topic similarity is significantly higher than inter-topic similarity • Tag co-occurrence clusters similar content well • Tag-based similarity is quite close to keyword-based similarity

  26. Inter-topic Similarity • Similarity of two topics with different numbers of overlapping tags • [Figure panels: tag-based (tf-idf); keyword-based (tf-idf)] • Inter-topic similarity increases with the number of co-occurring tags • Tag co-occurrences capture similar content

  27. User Interest Coverage • 90% of users have ≥ 90% of their top 5 tags covered • 87% of users have ≥ 90% of their top 10 tags covered • 90% of users have ≥ 80% of their tags covered • The topics discovered by ISID capture the interests of users

  28. Human Reviews • Scores: 1 = Highly unrelated, 2 = Unrelated, 3 = Not sure, 4 = Related, 5 = Highly related • According to human judgment, ISID does group related URLs into clusters for each topic defined by user tags

  29. Cluster Properties • Cluster size follows a power law → user interests follow a power law • There exist really hot topics!

  30. Cluster Properties • Most topics have fewer than 6 tags • Beyond 6 tags, the number of clusters drops quickly

  31. Overview • Motivation and Problem • Data and Their Properties • ISID system • Evaluation • Conclusion

  32. Conclusion • Tags reflect human judgments on content • Co-occurring tags are effective at representing user interests • Reflect human understanding of different but similar web content • Consensus of judgments among users • ISID system • Topic discovery, Clustering, Indexing • Evaluation results are promising
