Understanding Social Tagging Networks

Ralf Schenkel
Information Search in Social Networks (Informationssuche in sozialen Netzen)
Joint work with Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Parreira, Marc Spaniol, Gerhard Weikum

Social Tagging Networks
Common examples:
• Flickr (images)
• YouTube (videos)
• del.icio.us (bookmarks)
• LibraryThing (books)
• Discogs (CDs)
• CiteULike (papers)
• Facebook
• Myspace (media)
Definition: Social Tagging Network
Website where people
• publish + tag information
• review + rate information
• publish their interests
• maintain network of friends
• interact with friends
Perspektivenvorlesung

Some Statistics
Flickr (as of Nov 2008):
• 3+ billion photos, 3 million new photos per day
Facebook (as of Nov 2008):
• 10+ billion photos, 30+ million new photos per day
• 120 million active users
• 150,000 new users per day
Myspace (as of Apr 2007):
• 135 million users (would be the 6th largest country on Earth)
• 2+ billion images (150,000 req/s), millions added daily
• 25 million songs
• 60 TB of videos
StudiVZ.net (as of Nov 2008):
• 11 million users
• 300 million images, 1 million added daily
Huge volume of highly dynamic data

Showcase: librarything.com
[Screenshot labels: Tags Ratings Others Books]

librarything.com: Social Interaction
[Screenshot labels: Similar Users, Comments, Explicit Friends]

librarything.com: Tag Clouds

librarything.com: Search
Search results are independent of the querying user (and the social context)

librarything.com: Search
Search is automatically expanded with similar tags (synonyms)

librarything.com: Recommendations
Recommendations depend on user and tags (but not on social context)

librarything.com: Recommendations
Explanation for the recommendation

librarything.com: Explanations

Outline
• Search in Social Tagging Networks
  • Graph Model
  • Different Information Needs
• Effective Query Scoring
• Efficient Query Evaluation
• Summary & Further Challenges

Querying Social Tagging Networks
[Figure: network of users and their tagged items. Example tags: travel, norway, vldb, mexico, trip, icde, harrypotter, probability, data mining, foundations]

Information Need 1: Globally Popular
[Figure: the example network; a user issues the query "harry potter" and two candidate items compete]
• Most frequently tagged items are "best"
• Tags by all users are equally important

Information Need 2: Similar Users
[Figure: the example network; a user issues the query "travel" and two candidate items compete]
• Tags by users with similar tags/items ("brothers in spirit") are more important

Information Need 3: Trusted Friends
[Figure: the example network, extended by items tagged "probability" and "selling"; a user issues the query "probability" and two candidate items compete]
• Tags by closely related and well-known users are more important

Towards Social-Aware Social Search
Search results may depend on
• Global popularity of items
• Spiritual context of the querying user (users with similar books and/or tags)
• Social context of the querying user (known and trusted friends)

Outline
• Search in Social Tagging Networks
• Effective Query Scoring
  • Quantifying Friendship Strengths
  • User-specific Scoring Functions
  • Experimental Evaluation
• Efficient Query Evaluation
• Summary & Further Challenges

Notation
• U: set of users
• T: set of tags
• I: set of items
• tags(u): tags used by user u
• items(u): items tagged by user u
• items(t): items tagged with tag t by at least one user
• df(t): number of items tagged with tag t
• tf_u(i,t): number of times user u tagged item i with tag t
• tf(i,t): number of times item i was tagged with tag t
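The notation can be made concrete with a small sketch that derives all of these statistics from raw tagging events. The (user, item, tag) triple format and the function name `build_statistics` are assumptions made for illustration, not something taken from the slides.

```python
from collections import Counter, defaultdict


def build_statistics(taggings):
    """Derive the slide's notation from (user, item, tag) events (assumed input format)."""
    tf_u = Counter()                   # (u, i, t) -> times user u tagged item i with t
    tf = Counter()                     # (i, t)    -> times item i was tagged with t
    tags = defaultdict(set)            # u -> tags(u)
    items = defaultdict(set)           # u -> items(u)
    items_of_tag = defaultdict(set)    # t -> items(t)
    for u, i, t in taggings:
        tf_u[(u, i, t)] += 1
        tf[(i, t)] += 1
        tags[u].add(t)
        items[u].add(i)
        items_of_tag[t].add(i)
    # df(t): number of distinct items carrying tag t
    df = {t: len(tagged) for t, tagged in items_of_tag.items()}
    return tf_u, tf, tags, items, items_of_tag, df


# Hypothetical toy data
TAGGINGS = [
    ("u1", "book_a", "harrypotter"),
    ("u1", "book_a", "wizard"),
    ("u2", "book_a", "harrypotter"),
    ("u2", "book_b", "travel"),
    ("u3", "book_b", "travel"),
    ("u3", "book_c", "travel"),
]
tf_u, tf, tags, items, items_of_tag, df = build_statistics(TAGGINGS)
```

Note that tf(i,t) is simply the sum of tf_u(i,t) over all users, which is why both counters are updated in the same loop.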

Quantifying Friendship Strengths
• Global "friendship" strength
• Spiritual friendship strength
• Social friendship strength
• Integrated friendship strength

Spiritual Friendship Strength
[Figure: users u and u', linked by the overlap in interests of u and u']
• Several alternatives:
  • based on overlap of tag usage (example tags: harrypotter, wizard, deathlyhallows, philosopherstone)
  • based on overlap of tagged items
  • overlap of behavior (tagging, searching, rating, …)
• For all:
  • P_spirit(u,u) := 0
  • normalization [formula not preserved]
Reminder: tags(u) are the tags used by user u; items(u) are the items tagged by user u
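The slide's formulas did not survive, so the following is only a sketch of the tag-overlap alternative. It assumes Jaccard overlap of tag sets as the overlap measure and assumes that the normalization makes the strengths of all other users sum to 1; both choices, and the name `spiritual_strengths`, are illustrative assumptions.

```python
def spiritual_strengths(u, tags):
    """Tag-overlap strengths of user u towards all other users.

    Assumptions (not from the slides): Jaccard overlap of tag sets,
    normalized so that the strengths of all other users sum to 1.
    """
    raw = {}
    for v, tags_v in tags.items():
        if v == u:
            continue
        union = tags[u] | tags_v
        raw[v] = len(tags[u] & tags_v) / len(union) if union else 0.0
    total = sum(raw.values())
    strengths = {v: (w / total if total > 0 else 0.0) for v, w in raw.items()}
    strengths[u] = 0.0   # the slide sets P_spirit(u,u) := 0
    return strengths


# Hypothetical toy data
example_tags = {
    "u": {"harrypotter", "wizard"},
    "v": {"harrypotter", "wizard", "deathlyhallows"},
    "w": {"travel"},
}
example_strengths = spiritual_strengths("u", example_tags)
```

The same function works for the item-overlap alternative if items(u) is passed in place of tags(u).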

Graph-Based Friendship Strength
• P_social(u,u') is derived from the distance of u and u' in the user network
• set P_social(u,u) := 0
• normalization [formula not preserved]
[Figure: example user network with users u1 to u7 and a table of P_social(·,u') values for u' = u2, …, u7]
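A distance-based strength could be sketched as follows. The slide only states that the strength depends on the distance in the user network, so the inverse-distance weighting, the sum-to-1 normalization, and the name `social_strengths` are assumptions chosen for illustration.

```python
from collections import deque


def social_strengths(u, friends):
    """Distance-based strengths of user u towards reachable users.

    `friends` maps each user to an iterable of direct friends.
    Assumptions (not from the slides): weight 1/distance, normalized to sum to 1.
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:                       # breadth-first search gives hop distances
        x = queue.popleft()
        for y in friends.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    raw = {v: 1.0 / d for v, d in dist.items() if v != u}
    total = sum(raw.values())
    strengths = {v: w / total for v, w in raw.items()}
    strengths[u] = 0.0                 # the slide sets P_social(u,u) := 0
    return strengths


# Hypothetical toy network
example_network = {"u1": ["u2", "u3"], "u2": ["u1", "u4"], "u3": ["u1"], "u4": ["u2"]}
example_social = social_strengths("u1", example_network)
```

Users that are not reachable from u do not appear in the result, which corresponds to a strength of zero.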

Integrated Friendship Strength
P_int(u,u'): query-dependent mixture of
• spiritual friendship strength
• social friendship strength
• background model (global)
with mixture weights α, β (0 ≤ α, β ≤ 1; α + β ≤ 1)
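One plausible reading of this mixture is a weighted sum of the three components. The sketch below assumes a weight `alpha` for the social strength, a weight `beta` for the spiritual strength, the remaining weight for a uniform background model 1/|U|, and the constraint that both weights lie in [0,1] with a sum of at most 1. The exact formula on the slide is not preserved, so all of this is an assumption.

```python
def integrated_strength(users, p_social, p_spirit, alpha, beta):
    """Mixture of social, spiritual and global (background) strength.

    Assumed form (not confirmed by the slides):
        P_int(u,u') = alpha * P_social(u,u') + beta * P_spirit(u,u')
                      + (1 - alpha - beta) / |U|
    """
    if not (0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1):
        raise ValueError("need 0 <= alpha, beta <= 1 and alpha + beta <= 1")
    background = (1.0 - alpha - beta) / len(users)
    return {
        v: alpha * p_social.get(v, 0.0) + beta * p_spirit.get(v, 0.0) + background
        for v in users
    }


# Hypothetical toy data: v is the only social friend, w the only spiritual friend
example_mix = integrated_strength(
    ["u", "v", "w"], {"v": 1.0}, {"w": 1.0}, alpha=0.5, beta=0.25
)
```

With alpha = beta = 0 every user gets the same weight, which reproduces the "globally popular" information need; alpha = 1 uses only the social context.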

Excursion: Scoring in Text Retrieval
General scoring framework, combining for each query term t:
• Importance of t for item i (the more frequent, the better)
• Importance of t in the collection (the less frequent, the better)
Linear combination for query scores
Hand-tuned instance: Okapi BM25
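As an illustration of the framework, here is a sketch of one widely used BM25 variant. The non-negative idf form, the default parameters k1 = 1.2 and b = 0.75, and the function name `bm25_score` are choices made for this example; the slides do not show which variant was used.

```python
import math


def bm25_score(query, item_tf, item_len, avg_len, df, n_items, k1=1.2, b=0.75):
    """Okapi BM25 score of one item for a list of query terms.

    item_tf: term -> frequency within the item; df: term -> number of items with the term.
    """
    score = 0.0
    for t in query:
        tf = item_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        # Collection importance: rare terms get a larger weight.
        idf = math.log((n_items - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        # Item importance: grows with tf, saturates, and is length-normalized.
        norm = tf + k1 * (1.0 - b + b * item_len / avg_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


# Hypothetical collection statistics
example_df = {"harry": 2, "book": 90}
rare = bm25_score(["harry"], {"harry": 3}, 10, 10.0, example_df, 100)
frequent = bm25_score(["book"], {"book": 3}, 10, 10.0, example_df, 100)
```

In this toy setting `rare` exceeds `frequent`, because "harry" occurs in far fewer items than "book" while the within-item frequency is the same.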

Towards a User-specific Score
• Starting point: global friendship strength
• Convert into a user-specific social frequency
• Compute a user-specific social score [SIGIR 2008]
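The conversion step can be sketched as a friendship-weighted sum of per-user tag frequencies. The formula on the slide is not preserved; the form below, sf_u(i,t) as the sum over users u' of strength(u,u') times tf_u'(i,t), is an assumption, as is the name `social_frequency`.

```python
def social_frequency(strength, tf_u, item, tag):
    """User-specific social frequency sf_u(i,t).

    strength: u' -> friendship strength of the querying user towards u'
    tf_u:     (u', item, tag) -> number of times u' tagged item with tag
    Assumed form: sum over u' of strength[u'] * tf_u'(item, tag).
    """
    return sum(w * tf_u.get((v, item, tag), 0) for v, w in strength.items())


# Hypothetical toy data: v is a much closer friend than w
example_strength = {"v": 0.75, "w": 0.25}
example_tf_u = {
    ("v", "book_a", "harry"): 1,
    ("w", "book_a", "harry"): 1,
    ("w", "book_b", "harry"): 1,
}
sf_a = social_frequency(example_strength, example_tf_u, "book_a", "harry")
sf_b = social_frequency(example_strength, example_tf_u, "book_b", "harry")
```

The resulting value can then replace the plain term frequency in a text-retrieval scoring function, so an item tagged by close friends outranks one tagged only by distant users.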

Including Tag Expansion
Problem: Users use different tags for similar things → poor recall (missing relevant results)
Example: MPI, MPII, MPI-INF, MPI-CS, Max-Planck-Institut, D5, AG5, DB&IS, MMCI, UdS, Saarland University, …
Solution:
1. Define notion of similar tags
2. Expand queries with similar tags
3. Modify scoring function for expanded queries

Heuristics for finding similar tags
Specialization heuristics:
• Tag t2 is a specialization of t1 if t1 occurs (almost) whenever t2 occurs
• Example: t1=Europe, t2=Germany
Co-occurrence heuristics:
• Tags t1 and t2 are similar if they occur (almost) always together
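Both heuristics can be sketched with conditional co-occurrence frequencies over items(t). The threshold of 0.9 as a reading of "almost", and the helper names, are assumptions for illustration.

```python
def cond_prob(items_of_tag, t1, t2):
    """Fraction of the items tagged with t2 that are also tagged with t1."""
    base = items_of_tag.get(t2, set())
    if not base:
        return 0.0
    return len(base & items_of_tag.get(t1, set())) / len(base)


def is_specialization(items_of_tag, t1, t2, threshold=0.9):
    """t2 specializes t1 if t1 occurs (almost) whenever t2 occurs."""
    return cond_prob(items_of_tag, t1, t2) >= threshold


def are_cooccurring(items_of_tag, t1, t2, threshold=0.9):
    """t1 and t2 are similar if each occurs (almost) whenever the other does."""
    return (cond_prob(items_of_tag, t1, t2) >= threshold
            and cond_prob(items_of_tag, t2, t1) >= threshold)


# Hypothetical toy data: tag -> set of item ids
example_items_of_tag = {
    "Europe": {1, 2, 3, 4},
    "Germany": {1, 2},
    "Deutschland": {1, 2},
}
```

With this data, "Germany" is a specialization of "Europe" but not the other way around, while "Germany" and "Deutschland" co-occur and count as similar.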

Scoring Expanded Queries
Naive approach:
• For query tag t, add similar tags t' with sim(t,t') > δ to the query
• But: "transportation disaster" is expanded by "train car bus plane …", "international crime" by "mafia camorra yakuza …"
• Result quality drops due to topic drift
Better: auto-tuning incremental expansion
• For query tag t, consider only the expansion with the highest combined score per item
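The "highest combined score per item" rule can be sketched as a maximum over the expansions of each query tag. Combining similarity and score by multiplication, and the name `expanded_score`, are assumptions made for this example.

```python
def expanded_score(item_scores, similar, query):
    """Score of one item for an expanded query.

    item_scores: tag -> score of this item for that tag
    similar:     query tag -> {similar tag: sim(t, t')}
    For each query tag only the best expansion counts (assumed: sim * score),
    so many weak matches of the same concept cannot pile up.
    """
    total = 0.0
    for t in query:
        candidates = {t: 1.0, **similar.get(t, {})}   # the tag itself has similarity 1
        total += max(sim * item_scores.get(t2, 0.0) for t2, sim in candidates.items())
    return total


# Hypothetical toy data
example_similar = {"transportation": {"train": 0.8, "car": 0.7, "bus": 0.6}}
example_item = {"train": 0.5, "car": 0.5, "bus": 0.5, "disaster": 0.4}
example_total = expanded_score(example_item, example_similar, ["transportation", "disaster"])
```

A naive sum over all expansions would let "train", "car" and "bus" together dominate the "disaster" part of the query; taking the maximum keeps both query tags in balance.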

Experimental Evaluation: Effectiveness
Systematic evaluation of result quality is difficult
Three possible setups:
• Manual queries + human assessments
• Queries + assessments derived from external information (example: DMOZ categories)
• Automated assessments from the context of the user
  • Items tagged by friends
  • Items tagged in the future

Prototype [VLDB/SIGIR 2008 demo]

Preliminary User Study
LibraryThing user study: [Data Engineering Bulletin, June 2008]
• 6 LibraryThing users with reasonably large library and friend sets
• Overall 49 queries like "mystery magic", "wizard", "yakuza"
• Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friends
• Measured NDCG[10] for different settings of β (spiritual) and α (social)
• Result quality generally very high
• Combination of spiritual and social friends is best
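For readers unfamiliar with the measure, NDCG at rank 10 can be sketched as follows. This uses one common variant (gain equal to the relevance grade, discount log2(rank + 1)) and computes the ideal ranking from the same graded list; the study may have used a different variant.

```python
import math


def dcg(relevances, k=10):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg(relevances, k=10):
    """DCG divided by the DCG of the ideally reordered list (0.0 if nothing is relevant)."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded assessments of a result list, best grade = 2
example_ndcg = ndcg([2, 0, 1, 0, 0])
```

A value of 1.0 means the system returned the assessed items in the best possible order; relevant items placed lower in the list reduce the value.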

Outline
• Search in Social Tagging Networks
• Effective Query Scoring
• Efficient Query Evaluation
  • Threshold Algorithms
  • ContextMerge
  • Experimental Evaluation
• Summary & Further Challenges

Algorithmic Overview
• Input: query q={t1…tn} for user u, parameters α, β
• Output: k items with highest scores
• Goals:
  • Avoid computing all results
  • Minimize disk I/O and CPU load
  • Utilize precomputed information on disk
[Figure: a user plus the query "harry potter" yields a ranked result list]

Excursion: Threshold Algorithms for Text IR
Input:
• query q={t1…tn}
• lists L(tp) with pairs <i, score(i,tp)>, sorted by score(i,tp) in descending order
Output: k items with highest aggregated score
Family of Threshold Algorithms:
• scan lists in parallel
• maintain partial candidate results with score bounds
• terminate as soon as top-k results are stable
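A minimal sketch of one member of this family, the No Random Access (NRA) algorithm, is given below. It assumes non-negative scores, summation as the aggregation function, and round-robin scanning; the function name `nra_top_k` and the toy lists are made up for illustration.

```python
def nra_top_k(lists, k):
    """Top-k by summed score using sorted access only (NRA-style sketch).

    lists: one list per query term with (item, score) pairs, sorted by score descending.
    Returns (top-k as [(item, lower bound)], number of scan rounds).
    """
    m = len(lists)
    seen = {}                                   # item -> {list index: score read there}
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    max_depth = max((len(lst) for lst in lists), default=0)
    depth = 0
    worst = {}
    while depth < max_depth:
        for j, lst in enumerate(lists):         # one sorted access per list
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[j] = score
                high[j] = score                 # upper bound for unread entries of list j
            else:
                high[j] = 0.0                   # list exhausted
        depth += 1
        # Lower bound: scores read so far. Upper bound: add high[j] for missing lists.
        worst = {i: sum(p.values()) for i, p in seen.items()}
        best = {i: worst[i] + sum(high[j] for j in range(m) if j not in p)
                for i, p in seen.items()}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        rivals_beaten = all(best[i] <= min_k for i in seen if i not in top)
        if sum(high) <= min_k and rivals_beaten:
            break                               # neither seen nor unseen items can enter the top-k
    ranking = sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return ranking, depth


# Hypothetical index lists for a 2-term query
L1 = [("A", 0.9), ("G", 0.3), ("D", 0.3), ("B", 0.1), ("C", 0.05)]
L2 = [("D", 1.0), ("E", 0.3), ("A", 0.2), ("B", 0.1), ("C", 0.05)]
top, rounds = nra_top_k([L1, L2], k=1)
```

On these lists the scan stops after three of five rounds with D as the top-1 item. The returned values are lower bounds, which may be smaller than the true scores when the scan stops before an item has been seen in every list.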

Example: Top-1 for 2-term query (NRA)
[Animation: lists L1 and L2 are scanned step by step. Each candidate (A, D, G) and the unseen items ("?") carry a score interval [lower bound; upper bound]; min-k is the lower bound of the current top-1 item.]
• Start: A has score [0.9; 1.9], unseen items have [0.0; 1.9], min-k = 0.9
• Later: the top-1 item has lower bound 1.0 and unseen items have [0.0; 1.0]. No more new candidates are considered
• End: the top-1 item has score [1.3; 1.3], min-k = 1.3. The algorithm safely terminates

Can we reuse this here?
No, scores are specific to the querying user and the parameter setting!
[Figure: index lists for the tags "harry" and "travel", replicated for many users and parameter settings such as (α=0.2, β=0.5), (α=0.0, β=0.8), (α=1.0, β=0.0), (α=0.0, β=1.0) and (α=0.5, β=0.5), each with its own scores]
Number of lists to precompute would explode! (#tags × #users × parameter space)

Revisiting the Social Frequency
• sf_u(i,t) splits into a part that is independent of user u and a part that depends on user u
• Compute sf_u(i,t) on the fly from tf(i,t), the friends of u, and their tagged documents
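The split can be checked numerically under an assumed form of the friendship strength: a user-specific mixture weight per friend plus a uniform background share gamma/|U| for every user. Under that assumption the background share factors out as gamma/|U| times tf(i,t), which needs only the global list, while the rest needs only the friends of u. The function names and the toy data are illustrative.

```python
def sf_naive(users, p_mix, gamma, tf_u, item, tag):
    """Sum over ALL users with the full strength p_mix[u'] + gamma/|U| (assumed form)."""
    background = gamma / len(users)
    return sum((p_mix.get(v, 0.0) + background) * tf_u.get((v, item, tag), 0)
               for v in users)


def sf_decomposed(users, p_mix, gamma, tf, tf_u, item, tag):
    """Same value, computed from the global tf plus the friends of u only."""
    independent = gamma / len(users) * tf.get((item, tag), 0)      # independent of u
    specific = sum(w * tf_u.get((v, item, tag), 0) for v, w in p_mix.items())
    return independent + specific


# Hypothetical toy data: v and w are friends of the querying user u, x is a stranger
example_users = ["u", "v", "w", "x"]
example_p_mix = {"v": 0.3, "w": 0.2}      # user-specific weights, summing to 0.5
example_gamma = 0.5                        # remaining weight of the background model
example_tf_u = {("v", "b", "harry"): 1, ("w", "b", "harry"): 1, ("x", "b", "harry"): 1}
example_tf = {("b", "harry"): 3}
```

Both functions return 0.875 for item "b" and tag "harry" here. The decomposed form touches only the two friends instead of all four users, which is what makes on-the-fly computation feasible.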

Top-K in Social Networks: ContextMerge
Precomputed lists:
• ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t) in descending order (already exist in systems)
• USERITEMS(u',t): pairs <i, tf_u'(i,t)>, unsorted
• FRIENDS(u): pairs <u', F(u,u')>, sorted by F(u,u') in descending order
[Figure: example lists ITEMS(harry) with entries 47, 32, 26, …; USERITEMS(u', harry); FRIENDS(u) with entries 0.12, 0.10, 0.085, …]

ContextMerge
Adapted Threshold Algorithm for query u,t:
• Scan ITEMS(t) and FRIENDS(u) in parallel
• Pick the "best" list
  • If ITEMS(t): read next entry
  • If FRIENDS(u): read USERITEMS(u',t) for next friend u'
• Maintain candidates with bounds for min and max score, and current results
• Compute the min score bound and the max score bound of a candidate from the user-independent part of sf and the user-specific part of sf
[Animation: ITEMS(harry) with entries 47, 32, 26, … and FRIENDS(u) with entries 0.12, 0.10, 0.085, … are consumed step by step; the bounds shown include terms such as 0.12·|U| and 0.88·|U|]
