Community-Centric Information Exploration on Social Content Sites. Sihem Amer-Yahia and Cong Yu Yahoo! Research New York M3SN, Shanghai, China March 29 th , 2009. Acknowledgement (thanks to Facebook). Acknowledgement. Michael Benedikt, Oxford Alban Galland, INRIA Jian Huang, Penn State


Community-Centric Information Exploration on Social Content Sites

Sihem Amer-Yahia and Cong Yu

Yahoo! Research New York

M3SN, Shanghai, China

March 29th, 2009


Acknowledgement (thanks to Facebook)

Acknowledgement
  • Michael Benedikt, Oxford
  • Alban Galland, INRIA
  • Jian Huang, Penn State
  • Laks Lakshmanan, UBC
  • Julia Stoyanovich, Columbia
Social Content Sites (SCS)
  • Web destinations that let users:
    • Consume content-oriented information: videos, photos, news articles, etc.
    • Engage in social activities with their friends and people of similar interests
  • Two major driving factors:
    • Incorporating social activities improves the attractiveness of traditional content sites
      • The “similar traveler” feature significantly increases the time users spend on Yahoo! Travel
    • Incorporating content is critical to the value of social networking sites
      • A significant amount of user time is spent on browsing other people’s photos, notes, etc.
  • Privacy: important but not the focus here
Main Challenges in SCS
  • Social Content Management
    • Data must be maintained and analyzed at web-scale: the underlying social content graph can be enormous. (Edward Chang’s keynote later)
    • Information has to be integrated from various places.
  • Information Discovery
    • Current: pure content based search or simple popularity/rating based browsing
    • Future: effectively and efficiently leverage the underlying social graph in more sophisticated ways
  • Information Presentation
    • Current: a single ranked list based on relevance
    • Future: metrics other than relevance / groups and clusters / explanations

Information Discovery and Information Presentation together make up Information Exploration.

Talk Outline
  • Emerging Trend of Social Content Sites (SCS)
  • Information Exploration on SCS
    • Community Centric Model
    • Information Discovery: Improving Search Performance through Clustering
    • Information Presentation: Diversification via Explanation
  • SocialScope and Jelly
  • Further Challenges
  • Conclusion
Information Exploration on SCS
  • Major Paradigms
    • User-based: browsing the content of your friends or of other users you follow (Facebook, Twitter)
    • Search: content based keywords matching (most traditional content sites)
    • Recommendations: hotlists, tag-based hotlists (Amazon, many collaborative tagging sites)
    • Query: database-style querying with complex conditions (not many so far)

But which ones are more frequent?

Search vs. Recommendation
  • The majority of those searches are in fact recommendations in disguise!
    • Users are usually NOT looking for specific items
    • The majority of the sessions seek recommendations with geographical and topical constraints
  • Ranking the paradigms (for neurotic people who insist on orders):

“Recommendation” ~ User-based > Search > Query

Recommendation 2.0 (or 1.5)
  • Users demand a lot more from us
    • Be able to specify the topic: “what are the good destinations for a romantic trip in Asia?”
    • Be able to specify the community: “what’s hot among my tech-savvy friends?”
    • Be able to present the results in ways that can maximize the understanding of those results
  • But we do know a lot more about the users and the contents (through the users)
    • Richer user profiles
    • Friendships and relationships
    • Content activities (rating movies, tagging travel destinations, etc.)
    • Communication activities (IM, email, Twitter follows, Facebook wall messages, etc.)
Data Model: Social Content Graph

[Figure: example social content graph with people nodes (ids 101, 102, 103) for John, Amy and Joe, destination nodes (id 301, state: NY; id 302, state: MI, college: umich), edges labeled "friend" and "visit", and tag labels such as Yankees, NYC, Football, fun, club, AA]

A social content graph is a logical graph structure where the labeled nodes represent users and objects, and the labeled edges represent relations between users and items, as well as activities users perform on items or other users. Both nodes and edges can have structural attributes.
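The definition above can be illustrated with a minimal in-memory sketch. This is not the SocialScope implementation: the class names (`Node`, `Edge`, `SocialContentGraph`), the `neighbors` helper, and the sample attribute values are hypothetical and chosen only to mirror the labeled nodes, labeled edges, and attributes the slide describes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str                      # e.g. "people" or "destination"
    attrs: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str                      # e.g. "friend" (relation) or "visit" (activity)
    attrs: dict = field(default_factory=dict)


class SocialContentGraph:
    """Hypothetical container: labeled nodes and labeled edges, both with attributes."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, **attrs):
        self.nodes[node_id] = Node(node_id, label, attrs)

    def add_edge(self, source, target, label, **attrs):
        self.edges.append(Edge(source, target, label, attrs))

    def neighbors(self, node_id, label):
        """Ids reachable from node_id over outgoing edges with the given label."""
        return [e.target for e in self.edges
                if e.source == node_id and e.label == label]


# Illustrative data only; the attribute values are assumptions.
graph = SocialContentGraph()
graph.add_node(101, "people", name="John")
graph.add_node(102, "people", name="Amy")
graph.add_node(301, "destination", state="NY")
graph.add_edge(101, 102, "friend")
graph.add_edge(101, 301, "visit", tags=["fun"])
```

Keeping relations and activities in the same edge list reflects the slide's point that both are simply labeled links in one logical graph.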

A Community-Centric Model
  • Communities of users as the central concept
    • Community: a group of “similar” or “related” users
    • Recommendations are performed per community and results from each community are assembled together in the end
    • A user can belong to multiple communities:
      • A vector-based representation: user:<c1:w1, c2:w2, ...>
  • Challenge I: Community Generation
    • Topic based
    • Relationship based
    • Activity based
  • Challenge II: Community Selection
    • Given a user session, determine the appropriate set of communities to work with
Data in Yahoo! Travel
  • Users
    • Yahoo! users
  • Destinations
    • cities, attractions, etc.
  • Tagging activities:
    • Destination tags: editorial + custom
    • Self tags: mostly custom
  • Goal:
    • Given a user session (user + tags + region), recommend travel destinations to the user
  • Approach:
    • Discover topics for users, tags and destinations based on tagging actions
    • Construct communities based on topics
Topic Discovery through LDA
  • Joint work with Jian Huang
  • Intuition
    • Each destination is modeled as a document consisting of all the tags used on it
    • Tags are treated as words in the document
  • Apply Latent Dirichlet Allocation [BNJ03]
    • A probabilistic generative model for document topic discovery
    • Produces the following conditional probabilities
      • Prob(w|t): given a latent topic, the probability that the tag word belongs to that topic
      • Prob(t|d): given a document, the probability of the document being about a certain latent topic
  • Prob(t|u) is aggregated from Prob(t|d_i) over the destinations d_i that u tagged (the aggregation operator did not survive in this transcript)
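The step from Prob(t|d) to Prob(t|u) can be sketched as follows. The transcript does not preserve which aggregate the authors used, so this sketch assumes a plain average over the destinations the user tagged. The function name and the probability values are hypothetical, and the Prob(t|d) table stands in for the output of an LDA run.

```python
def user_topic_distribution(tagged_destinations, topic_given_destination):
    """Average Prob(t|d) over the destinations a user tagged.

    ASSUMPTION: averaging is one plausible aggregate, not necessarily the
    one used in the talk.
    """
    if not tagged_destinations:
        return {}
    totals = {}
    for destination in tagged_destinations:
        for topic, prob in topic_given_destination[destination].items():
            totals[topic] = totals.get(topic, 0.0) + prob
    count = len(tagged_destinations)
    return {topic: total / count for topic, total in totals.items()}


# Hypothetical Prob(t|d) values, as an LDA run might produce.
topic_given_destination = {
    "Paris": {"arts": 0.75, "romantic": 0.25},
    "Cancun": {"seaside": 0.5, "romantic": 0.5},
}
profile = user_topic_distribution(["Paris", "Cancun"], topic_given_destination)
```

Averaging keeps the result a probability distribution whenever each Prob(t|d) row sums to one, which is convenient when the profile is later compared against a threshold.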
Examples of Topic Discovery

Topic 0: Romantic/Luxury
Tokens: romantic, 4-star, shopping, spa, golf, luxury, honeymoon …
Top cities: Cancun, San Juan, Grand Canyon, Bangkok …

Topic 1: Seaside/Water related
Tokens: beach, scuba, summer, diving, fishing, snorkeling, island, lake …
Top cities: San Diego, Honolulu, Chicago, Miami, Seattle …

Topic 2: Arts/Historical/Cultural
Tokens: architecture, sightseeing, art, history, culture, cathedral, castle …
Top cities: Paris, London, Boston, Istanbul, Amsterdam, Rome, Hong Kong …

Topic 3: Nightlife, City-life
Tokens: nightlife, gambling, wine, drinking, exciting, casino …
Top cities: Las Vegas, New York City, Los Angeles, San Francisco …

[Photos: Paris, Cancun, San Diego, Las Vegas]

Community Generation
  • Identify the destinations that belong to the same topic
    • For each topic t, identify the destinations d that belong to t:

D(t) = { d | Prob(t|d) > threshold }

  • Generate topical communities
    • D(u) = { d | user u visited destination d }
    • Interests-based topical community

Cinterest(t) = { u | [expression over D(u) and D(t) not preserved] > threshold }

    • Expertise-based topical community

Cexpert(t) = { u | [expression over D(u) and D(t) not preserved] > threshold }
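A rough sketch of community generation follows. D(t) is computed as the slide defines it. The membership tests for the two communities are not preserved in the transcript, so the overlap ratios below are assumptions: interest is taken as the share of a user's visits that fall in D(t), and expertise as the share of D(t) the user has covered. All function names and data are hypothetical.

```python
def destinations_for_topic(topic, topic_given_destination, threshold):
    """D(t) = { d | Prob(t|d) > threshold }."""
    return {d for d, dist in topic_given_destination.items()
            if dist.get(topic, 0.0) > threshold}


def topical_communities(topic, visits, topic_given_destination,
                        dest_threshold, user_threshold):
    """Return (interest community, expert community) for one topic."""
    d_t = destinations_for_topic(topic, topic_given_destination, dest_threshold)
    interest, expert = set(), set()
    for user, d_u in visits.items():
        overlap = len(d_u & d_t)
        # ASSUMPTION: interest = fraction of the user's visits inside D(t).
        if d_u and overlap / len(d_u) > user_threshold:
            interest.add(user)
        # ASSUMPTION: expertise = fraction of D(t) the user has visited.
        if d_t and overlap / len(d_t) > user_threshold:
            expert.add(user)
    return interest, expert


# Hypothetical inputs.
topic_given_destination = {
    "Miami": {"seaside": 0.9},
    "Honolulu": {"seaside": 0.8},
    "San Diego": {"seaside": 0.7},
    "Paris": {"seaside": 0.1},
}
visits = {
    "amy": {"Miami"},
    "joe": {"Miami", "Honolulu", "San Diego", "Paris"},
    "john": {"Paris"},
}
interest, expert = topical_communities(
    "seaside", visits, topic_given_destination,
    dest_threshold=0.5, user_threshold=0.6)
```

Under these assumptions a user with a single seaside visit counts as interested but not as an expert, while a user who has covered all of D(t) is both.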

Community Selection
  • Each user session consists of three pieces of information
    • (user, tag, region)
    • Region is treated as a Boolean constraint
    • We combine and convert user and tag into a single topical vector

Prob(t|(u,w)) = w1·Prob(t|u) + w2·(Prob(w|t)·Prob(t)/Prob(w))

  • A community C(t) is selected if Prob(t|(u,w)) is greater than a threshold
  • The choice between Interests or Experts Communities is heuristically decided
    • User is an expert: leverage interests-based community
    • User is not an expert: leverage experts-based community
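The selection formula on this slide can be sketched as below. The second term is Bayes' rule applied to Prob(w|t). Computing Prob(w) by summing Prob(w|t)·Prob(t) over topics, the default weights of 0.5, and every name and number in the example are assumptions made for illustration.

```python
def session_topic_scores(topic_given_user, tag_given_topic, topic_prior, tag,
                         w1=0.5, w2=0.5):
    """Prob(t|(u,w)) = w1*Prob(t|u) + w2*Prob(w|t)*Prob(t)/Prob(w) for every topic t."""
    # ASSUMPTION: Prob(w) is obtained by total probability over the topics.
    tag_prob = sum(tag_given_topic[t].get(tag, 0.0) * topic_prior[t]
                   for t in topic_prior)
    scores = {}
    for t in topic_prior:
        if tag_prob > 0:
            topic_given_tag = tag_given_topic[t].get(tag, 0.0) * topic_prior[t] / tag_prob
        else:
            topic_given_tag = 0.0
        scores[t] = w1 * topic_given_user.get(t, 0.0) + w2 * topic_given_tag
    return scores


def select_communities(scores, threshold):
    """Topics whose community C(t) is selected for the session."""
    return {t for t, s in scores.items() if s > threshold}


# Hypothetical model parameters.
topic_prior = {"romantic": 0.5, "seaside": 0.5}
tag_given_topic = {"romantic": {"beach": 0.1}, "seaside": {"beach": 0.3}}
topic_given_user = {"romantic": 0.5, "seaside": 0.5}

scores = session_topic_scores(topic_given_user, tag_given_topic, topic_prior, "beach")
selected = select_communities(scores, threshold=0.5)
```

Here the user's own profile is neutral, so the tag "beach" is what tips the session toward the seaside community.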
Querying Items within Social Network
  • Problem Definition:
    • Given a user and a query, retrieve items relevant to the query based only on the user’s network
    • The score of the item depends on keyword and user
    • E.g., score(u,q,i) = cnt(v), where tag(v,i,q) and friend(u,v)

    • A common scenario on social tagging sites like del.icio.us
  • Naïve Approach
    • One inverted index for each (user,keyword)
    • Apply top-k threshold algorithms at query time
    • Optimal efficiency
    • But, high storage overhead
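The example score above, the number of the seeker's friends who tagged an item with the query keyword, can be sketched directly. This version scores every item exhaustively rather than running a top-k threshold algorithm over precomputed inverted lists, so it illustrates the semantics only. The function names and the sample data are hypothetical.

```python
import heapq


def network_score(user, keyword, item, friends, taggings):
    """score(u,q,i): number of friends v of u such that tag(v, i, q) holds."""
    return sum(1 for v in friends.get(user, ())
               if (v, item, keyword) in taggings)


def top_k(user, keyword, items, friends, taggings, k):
    """Exhaustive top-k by network score; items with score 0 are dropped."""
    scored = ((network_score(user, keyword, i, friends, taggings), i)
              for i in items)
    return heapq.nlargest(k, ((s, i) for s, i in scored if s > 0))


# Hypothetical data: taggings are (tagger, item, tag) triples.
friends = {"jane": {"ann", "chris"}}
taggings = {
    ("ann", "i1", "music"),
    ("chris", "i1", "music"),
    ("ann", "i2", "music"),
    ("sarah", "i3", "music"),   # sarah is not jane's friend, so i3 scores 0
}
result = top_k("jane", "music", ["i1", "i2", "i3"], friends, taggings, k=2)
```

Because the score depends on the seeker's own network, the same keyword yields a different ranking for each user, which is what motivates one inverted list per (user, keyword) pair in the naïve approach.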
Space Overhead of Per-User Indexing

[Figure: per-user inverted lists of (item, score) entries for users Jane and Ann on the tags "music" and "news"]

  • Conservative Example:
    • 100K users, 1M items, 1K tags
    • 20 tags/item from 5% of the taggers
    • 10 bytes per inverted list entry
    • 1 Terabyte index!
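One reading of these numbers that reproduces the 1 Terabyte figure is worked through below. It assumes that each item carries 20 tags, that each tag is applied by 5% of the 100K users (5,000 taggers), and that every (item, tag, tagger) combination costs one 10-byte entry. The slide does not spell out the derivation, so this interpretation is an assumption, and the 1K distinct tags do not enter the calculation.

```python
users = 100_000
items = 1_000_000
tags_per_item = 20
tagger_percent = 5          # "5% of the taggers", read as 5% of all users
bytes_per_entry = 10

taggers_per_item = users * tagger_percent // 100          # 5,000
entries = items * tags_per_item * taggers_per_item        # 10**11 entries
index_bytes = entries * bytes_per_entry                   # 10**12 bytes

print(f"{index_bytes / 1e12:.0f} TB")
```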
Clustering to the Rescue!
  • Joint work with Michael Benedikt, Laks Lakshmanan, Julia Stoyanovich
  • Clustered Approach
    • Group users based on shared behavior into communities
      • Shared items
      • Shared items with same tags
    • One inverted index for each (group,keyword)
    • Replace item score with score upper bound (among all users in the group)
    • Sacrifice a bit of query performance for indexing space efficiency
  • Community-based (seeker clustering):
    • Groups are keyword-independent
    • E.g., shared items, shared tags
  • Behavior-based (tagger clustering):
    • Groups are keyword-specific
    • E.g., Jane may be similar to John on “travel”, but very different from him on “sports”.
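The clustered approach can be sketched as a merge of per-user scores into one list per (group, keyword) pair that stores, for each item, the maximum score over the group's members. The function name, the input layout, and the sample scores are hypothetical. How users are assigned to clusters is taken as given here.

```python
def build_cluster_index(cluster_members, per_user_scores):
    """Build one inverted list per (cluster, keyword) holding score upper bounds.

    per_user_scores[user][keyword][item] is the exact per-user score.
    Returns index[(cluster, keyword)] = [(upper_bound, item), ...] sorted
    by decreasing upper bound.
    """
    bounds_by_key = {}
    for cluster, members in cluster_members.items():
        for user in members:
            for keyword, item_scores in per_user_scores.get(user, {}).items():
                bounds = bounds_by_key.setdefault((cluster, keyword), {})
                for item, score in item_scores.items():
                    # Keep the largest score any member of the cluster has.
                    bounds[item] = max(bounds.get(item, 0), score)
    return {key: sorted(((s, i) for i, s in bounds.items()), reverse=True)
            for key, bounds in bounds_by_key.items()}


# Hypothetical per-user scores.
cluster_members = {"c1": ["jane", "ann"]}
per_user_scores = {
    "jane": {"music": {"i1": 30, "i2": 10}},
    "ann": {"music": {"i1": 15, "i2": 27}},
}
index = build_cluster_index(cluster_members, per_user_scores)
```

Two per-user lists collapse into one, which is where the space saving comes from. The stored values are only upper bounds, so a query still has to compute exact scores for the candidate items it inspects, which is the query-time cost the slide mentions.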
User Clustering
  • Community-based
    • one user, one cluster
    • one inverted index per cluster
    • upper-bound score per item
  • Behavior-based
    • one user, many clusters
    • one inverted index per cluster
    • upper-bound score per item

[Figure: example users Sarah, Ann, Jane and Chris grouped into clusters]


Information Presentation
  • A broad definition:

Given the set of relevant items, identify the appropriate subset of items to be shown to the user(s) in the right organization and with the right amount of information.

  • Appropriate Subset:
    • Timeliness
    • Geographical closeness
    • Diversity
    • Etc.
  • Right Organization:
    • Ranking
    • Grouping
    • Faceted Navigation
  • Right Amount of Information
    • Explanation
A Case Study in Recommendation
  • Joint work with Laks Lakshmanan
  • While relevance is important to recommendation, other criteria are critical too:
    • Novelty: avoid returning results that users are likely to know already.
    • Serendipity: aim to return less relevant results that might give users a pleasant surprise.
    • Diversity: avoid returning results that are too similar to each other.
  • Recommendation Diversification:

From the pool of candidate items, identify a list of items that are dissimilar to each other while maintaining a high cumulative relevance, i.e., strike a good balance between relevance and diversity.

Existing Solutions for Diversification
  • Attribute-Based Diversification
    • Diversity semantics: pair-wise distance functions based on item attributes (e.g., movie attributes).
    • Combining with relevance:
      • Threshold either relevance or distance, maximize the other
      • Optimize an overall score as a weighted combination of relevance and distance
    • Algorithm
      • Perform traditional recommendation
      • Obtain the attributes of each candidate item and compute the pair-wise distance
      • Apply an ad-hoc method that depends on how diversity and relevance are combined
  • A Major Problem:
    • Lack of attributes suitable for estimating distance between pairs of items: e.g., URLs in del.icio.us, photos on Flickr, videos on Vimeo.
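The "weighted combination of relevance and distance" option can be illustrated with a simple greedy heuristic. This is a generic sketch in the spirit of maximal marginal relevance, not a method taken from the talk. The function names, the default `weight` of 0.5, and the movie data are assumptions.

```python
def marginal_gain(item, selected, relevance, distance, weight):
    """Weighted mix of relevance and distance to the closest already-selected item."""
    if not selected:
        return relevance[item]
    nearest = min(distance(item, s) for s in selected)
    return weight * relevance[item] + (1 - weight) * nearest


def greedy_diversify(candidates, relevance, distance, k, weight=0.5):
    """Pick up to k items, each time taking the candidate with the best marginal gain."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda i: marginal_gain(i, selected, relevance, distance, weight))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical movies with a single attribute (genre) used for distance.
relevance = {"m1": 0.9, "m2": 0.85, "m3": 0.6}
genre = {"m1": "action", "m2": "action", "m3": "drama"}


def genre_distance(a, b):
    # Attribute-based distance: 0 for the same genre, 1 otherwise.
    return 0.0 if genre[a] == genre[b] else 1.0


picked = greedy_diversify(["m1", "m2", "m3"], relevance, genre_distance, k=2)
```

The second pick is the less relevant drama rather than the second action movie. The sketch also shows the dependency the slide criticizes, since it only works when items have attributes such as genre from which a distance can be computed.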
Explanation-Based Diversification
  • Intuition
    • An explanation is the set of objects because of which a particular item is recommended to the user.
    • Two items that share similar explanations are likely to be similar to each other.
  • Explanation for Item-Based Strategies
  • Explanation for Collaborative Filtering Strategies (social!)
Explanation-Based Diversity
  • Pair-wise diversity distance between two recommended items
    • Standard similarity measures like Jaccard similarity and cosine similarity
    • E.g. (Distance based on Jaccard similarity)
  • Diversity for the set of recommended results (S)
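A small sketch of explanation-based distance follows. Each explanation is treated as a set, for example the users whose activity led to the recommendation, and the pair-wise distance is one minus the Jaccard similarity. The set-level formula is not preserved in the transcript, so the average pairwise distance used here is an assumption, as are the function names and sample data.

```python
from itertools import combinations


def jaccard_distance(expl_a, expl_b):
    """1 - |A ∩ B| / |A ∪ B| between two explanation sets."""
    union = expl_a | expl_b
    if not union:
        return 0.0
    return 1.0 - len(expl_a & expl_b) / len(union)


def set_diversity(items, explanations):
    """ASSUMPTION: diversity of a result set = average pairwise distance."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_distance(explanations[a], explanations[b])
                for a, b in pairs)
    return total / len(pairs)


# Hypothetical explanations: the users because of whom each item is recommended.
explanations = {
    "i1": {"ann", "chris"},
    "i2": {"ann", "chris"},
    "i3": {"sarah"},
}
diversity = set_diversity(["i1", "i2", "i3"], explanations)
```

No item attributes are involved. Items i1 and i2 are judged similar only because the same users explain them.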
Benefits and Practicality of Explanation-Based Diversification
  • Applicable to items without attributes or whose attributes are difficult to analyze
    • Common on social content sites
  • Explanations are by-products of many recommendation processes
    • They can be maintained with little overhead

See our poster on Monday for details

SocialScope Platform

[Architecture diagram]
  • Information Presentation: Social Content Grouper, Social Content Ranker, User Interface
  • Information Discovery: Social Content Analyzer, Social Query Evaluator (Query / Result)
  • Content Management: Activity Manager, Data Manager, Content Integrator, over the Social Content Graph and Activities
  • Other elements in the diagram: User, Social Content Admin, Facebook, Y! IM, Y! Sports, OpenSocial API
  • Important Information Flow
    • (1) Raw/derived social nodes and links
    • (2) Social content sub-graph
    • (3) Relevant social content sub-graph
    • (4) Relevant social nodes and links


A Graph-Based Logical Algebra Framework
  • Designed for information discovery on social content sites
  • Aims to provide a declarative way of specifying analysis and query tasks
    • Uniformity and flexibility
    • Opportunities for performance optimization
  • Basic operators:
    • Node Selection (σN), Link Selection (σL)
    • Composition, Semi-Join
    • Node Aggregation, Link Aggregation
    • Details in [CIDR09]
A Simple Search Task

[Figure: query plan in which "John's friends" and "People who visited Denver" are combined into "John's friends who visited Denver", from which "Their activities" are retrieved]

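The search task on this slide (John's friends who visited Denver, and their activities) can be approximated with a few set operations over a list of labeled links. This is a loose simplification, not the SocialScope algebra: link selection is a filter over (source, label, target) triples, the semi-join is reduced to a set intersection, and the function name and data are hypothetical.

```python
def select_links(links, label, source=None, target=None):
    """Link selection: keep links with the given label and optional endpoints."""
    return [(s, l, t) for (s, l, t) in links
            if l == label
            and (source is None or s == source)
            and (target is None or t == target)]


# Hypothetical links as (source, label, target) triples.
links = [
    ("John", "friend", "Amy"),
    ("John", "friend", "Joe"),
    ("Amy", "visit", "Denver"),
    ("Joe", "visit", "NYC"),
    ("Sarah", "visit", "Denver"),
    ("Amy", "tag", "Denver"),
]

# John's friends.
friends = {t for (_, _, t) in select_links(links, "friend", source="John")}
# People who visited Denver.
visitors = {s for (s, _, _) in select_links(links, "visit", target="Denver")}
# John's friends who visited Denver (semi-join simplified to an intersection).
friends_who_visited = friends & visitors
# Their activities: every non-friendship link that starts at one of them.
activities = [link for link in links
              if link[0] in friends_who_visited and link[1] != "friend"]
```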

Jelly: A Language Over Social Content Sites
  • Designed with a focus on community-centric information exploration applications
    • Most useful applications
  • A restricted implementation of the SocialScope algebra
    • Based on a nested relational model, instead of a full graph model
  • Built-in primitives for topic and community generation
    • Topic Generation, Community Extraction
    • Recommendation Generation
    • Group Generation, Explanation Generation
A Simple (Incomplete) Example
  • Topic Generation and Community Extraction

    generate topics for item into topics
    from tagging R
    using LDA (seed, th=0.8)
    seed R.item group R.tag weight-with count()

    generate communities into experts
    from topics T, tagging R
    where T.*.item = R.item
    using jaccard-similarity (seed, th=0.7)
    seed (R.user, T.topic) list R.item

  • Recommendation Generation

    generate recommendations into candidates
    given user u, query q
    from experts T
    where Selected (T.topic, u, q)
    using count-users (seed)
    seed T.*.item list T.user

  • Information Presentation

    generate explanations into results
    from candidates C, tagging T
    where C.item = T.item
    using identity (seed)
    seed C.item list T.user weight-with count()

Scalability Challenges
  • Social content graphs are large
    • 10’s of millions of users
    • millions (movies, destinations) or billions (web links) of items
    • Challenge 1: being able to analyze the graph efficiently
      • We need to design Map-Reduce frameworks that are suitable for those analyses
  • Activities are being generated at a very high rate
    • millions of status updates per day (Facebook, Twitter)
    • A similar volume of content activities (browsing, tagging, etc.)
    • Challenge 2: being able to incrementally incorporate new activities into the existing model
Semantic Challenges and Questions
  • Connections are multi-dimensional
    • Two main categories on the surface: explicit (friendship) vs. implicit (shared activities)
    • However: not all friendship links are the same and not all activities are the same
    • Challenge: for a given user on a given topic, identify the appropriate set of explicit and implicit links
      • Expert-based versus interest-based communities are one step toward that goal
  • Temporal, geographical, and other external influences
    • How does a community evolve as time goes on and as members move around?
  • Social graph interactions
    • How does one social graph (say Facebook) impact another (say MySpace)?
Conclusion
  • The emerging trend of social content sites presents many challenges
    • Social information integration
    • Information discovery
    • Information presentation
  • The notions of communities and topics are crucial for effective and efficient information exploration on those sites
    • Community discovery and analysis
    • Community-based information exploration
  • Lots more!
  • At Yahoo!, we are focusing on building systems that can address those challenges