
Taming Information Overload



Presentation Transcript


  1. Taming Information Overload Dafna Shahaf, Khalid El-Arini, Carlos Guestrin, Yisong Yue

  2. Search Limitations • The standard search loop: Input → Output → Interaction (issue a new query)

  3. Our Approach • Input: phrase complex information needs • Output: structured, annotated output • Interaction: richer forms of interaction (beyond issuing a new query)

  4. Millions of blog posts published every day • Some stories become disproportionately popular • Hard to find information you care about

  5. Our goal: coverage • Turn down the noise in the blogosphere • Select a small set of posts that covers the most important stories (screenshot: blog posts from January 17, 2009)

  6. Our goal: coverage • Turn down the noise in the blogosphere • select a small set of posts that covers the most important stories

  7. Our goal: personalization • Tailor post selection to user tastes (figure: posts selected without personalization; the user objects, “But I like sports! I want articles like this”; posts after personalization based on Zidane’s feedback)

  8. TDN outline [El-Arini, Veda, Shahaf, G. ‘09] • Coverage: • Formalize notion of covering the blogosphere • Near-optimal solution for post selection • Evaluate on real blog data and compare against Google and Yahoo! • Personalization: • Learn a personalized coverage function • Algorithm for learning user preferences using limited feedback • Evaluate on real blog data

  9. Approach Overview • Pipeline: Blogosphere → Feature Extraction → Coverage Function → Post Selection, with Personalization feeding back into the coverage function

  10. Document Features • Low level: words, noun phrases, named entities • e.g., Obama, China, peanut butter • High level: e.g., topics • Topic = probability distribution over words (figure: example Inauguration and National Security topics)

  11. Coverage Function • cover_d(f) = amount by which post d covers feature f • cover_A(f) = amount by which the set A covers feature f • A post never covers a feature completely, so use a soft notion of coverage, e.g., the probability that at least one post in A covers feature f: cover_A(f) = 1 − Π_{d∈A} (1 − cover_d(f)) • Some features are more important than others, e.g., weigh by frequency
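A minimal sketch of this soft coverage, assuming the probabilistic-OR form implied by “probability that at least one post in A covers f” with posts treated independently; the dict representation and the name `cover_set` are illustrative, not the TDN implementation:

```python
# Soft coverage: probability that at least one post in A covers feature f,
# assuming each post d covers f independently with probability cover_d(f).
# Each post is a dict mapping feature -> coverage probability in [0, 1].

def cover_set(posts, feature):
    """cover_A(f) = 1 - prod_{d in A} (1 - cover_d(f))."""
    p_not_covered = 1.0
    for post in posts:
        p_not_covered *= 1.0 - post.get(feature, 0.0)
    return 1.0 - p_not_covered

# Example: two posts that each partially cover the feature "obama".
d1 = {"obama": 0.6, "china": 0.2}
d2 = {"obama": 0.5, "peanut butter": 0.9}
print(cover_set([d1], "obama"))      # 0.6
print(cover_set([d1, d2], "obama"))  # 1 - 0.4*0.5 = 0.8
```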

  12. Objective Function for Post Selection • Want to select the set of posts A shown to the user that maximizes F(A) = Σ_f w_f · cover_A(f), where w_f are weights on features and cover_A(f) is the probability that set A covers feature f • Maximizing F is NP-hard! • But F is submodular: Greedy algorithm → (1−1/e)-approximation; CELF algorithm → very fast, same guarantees
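A hedged sketch of greedy selection with lazy evaluations in the spirit of CELF: because F is submodular, a post’s marginal gain can only shrink as A grows, so stale gains kept in a priority queue remain valid upper bounds. The brute-force objective and helper names are illustrative:

```python
import heapq

def objective(posts, w):
    """F(A) = sum_f w_f * cover_A(f), with soft (probabilistic-OR) coverage."""
    total = 0.0
    for f, weight in w.items():
        p_not = 1.0
        for d in posts:
            p_not *= 1.0 - d.get(f, 0.0)
        total += weight * (1.0 - p_not)
    return total

def celf_select(candidates, w, k):
    """Lazy greedy: (1 - 1/e)-approximation for monotone submodular F."""
    selected, f_selected = [], 0.0
    # Max-heap (via negation) of stale marginal-gain upper bounds.
    heap = [(-objective([d], w), i) for i, d in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, i = heapq.heappop(heap)
        # Recompute this candidate's true marginal gain.
        gain = objective(selected + [candidates[i]], w) - f_selected
        # Lazy check: accept if it still beats the next-best stale bound.
        if not heap or gain >= -heap[0][0]:
            selected.append(candidates[i])
            f_selected += gain
        else:
            heapq.heappush(heap, (-gain, i))
    return selected
```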

  13. Approach Overview • Blogosphere → Feature Extraction → Coverage Function → Post Selection via submodular function optimization • Personalization

  14. User study: Are the stories important that day? (chart; higher is better) • Two variants: TDN+NE (named entities and common noun phrases as features) and TDN+LDA (LDA topics as features) • We do as well as Yahoo! and Google

  15. User study: Are the stories redundant with each other? (chart; lower is better) • Same variants: TDN+LDA and TDN+NE • Google performs poorly • We do as well as Yahoo!

  16. User study summary (charts: topicality, higher is better; redundancy, lower is better; bars for TDN+LDA and TDN+NE) • Google: good topicality, high redundancy • Yahoo!: performs well on both, but uses rich features (CTR, search trends, user voting, etc.) • TDN performs as well as Yahoo! using only post content

  17. TDN outline • Coverage: • Formalize notion of covering the blogosphere • Near-optimal solution for post selection • Evaluate on real blog data and compare against Google and Yahoo! • Personalization: • Learn a personalized coverage function • Algorithm for learning user preferences using limited feedback • Evaluate on real blog data

  18. Personalization • People have varied interests (e.g., Barack Obama vs. Britney Spears) • Our goal: learn a personalized coverage function using limited user feedback

  19. Personalization: learn your coverage function • An online learning of a submodular function problem • Algorithm with strong theoretical guarantees • (pipeline: the generic coverage function → post selection stage becomes Blogosphere → personalized coverage function → personalized post selection)

  20. Modeling User Preferences • π_f represents the user’s preference for feature f; the personalized weight combines π_f with the importance of the feature in the corpus • Want to learn the preference vector π over the features (figure: example π for a politico vs. a sports fan)

  21. Exploration vs Exploitation • Goal: want to recommend content that users like • Exploiting previous feedback from users • However: users only provide feedback on recommended content • Explore to collect feedback for new topics • Solution: algorithm to balance exploration vs exploitation • Linear Submodular Bandits Problem

  22. Balancing Exploration & Exploitation [Yue & Guestrin, 2011] • For each article slot, maximize a trade-off: estimated coverage gain + uncertainty of the estimate • (figure: mean estimate and uncertainty by topic; e.g., pick an article about Tennis)
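A sketch of this slot-by-slot trade-off in the spirit of linear submodular bandits: score each article by its estimated coverage gain plus an uncertainty bonus, then refit a regularized least-squares estimate of π from feedback. The names (`pick_slot`, `update`, `alpha`) and the Sherman-Morrison bookkeeping are illustrative assumptions, not the paper’s exact pseudocode:

```python
import numpy as np

def marginal_gain(article, selected):
    """Per-topic soft-coverage gain of adding `article` (a vector of topic
    coverage probabilities) to the already-selected articles."""
    if not selected:
        return article
    covered = 1.0 - np.prod([1.0 - s for s in selected], axis=0)
    return (1.0 - covered) * article

def pick_slot(articles, selected, w_hat, M_inv, alpha):
    """Fill one slot: argmax of estimated coverage gain + uncertainty bonus."""
    best, best_score = None, -np.inf
    for a in articles:
        g = marginal_gain(a, selected)
        score = float(w_hat @ g) + alpha * np.sqrt(float(g @ M_inv @ g))
        if score > best_score:
            best, best_score = a, score
    return best

def update(M_inv, b, g, reward):
    """Least-squares update after observing user feedback `reward` on an
    article whose coverage-gain vector was `g` (Sherman-Morrison rank-one
    update of the inverse covariance)."""
    Mg = M_inv @ g
    M_inv = M_inv - np.outer(Mg, Mg) / (1.0 + float(g @ Mg))
    b = b + reward * g
    return M_inv, b, M_inv @ b  # last value is the new estimate w_hat
```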

  23. Learning User Preferences (Approach 2) [Yue & Guestrin, 2011] • Each round: recommend via the explore-exploit trade-off, collect user feedback, then apply a least-squares update to the preference estimate • Theorem: the bandit algorithm achieves no-regret, i.e., (1−1/e)·avg(F(A*)) − avg(F(A)) → 0, so it learns a good approximation of the true π* • Rate: 1/sqrt(kT), recommending k documents over T rounds • (figure: learned π vs. true π* before any feedback, after 1 day, and after 2 days of personalization)
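Spelled out, the guarantee on this slide (my reconstruction in standard no-regret form; F_t is the round-t coverage objective, A_t the k documents recommended, A* the best fixed set under the true preferences π*; constants and log factors suppressed):

```latex
\[
\frac{1}{T}\sum_{t=1}^{T}\left[\left(1-\frac{1}{e}\right) F_t(A^{*}) - F_t(A_t)\right]
\;=\; O\!\left(\frac{1}{\sqrt{kT}}\right) \;\longrightarrow\; 0
\quad \text{as } T \to \infty .
\]
```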

  24. User Study: Online Learning for News Recommendation • 27 users in study • Submodular bandits wins more often than it ties or loses against each of three baselines: static weights, multiplicative updates (no exploration), and no-diversity bandits (chart: wins / ties / losses per baseline)

  25. Our Approach • Input: phrase complex information needs (what about more complex information needs?) • Output: structured, annotated output • Interaction: richer forms of interaction

  26. A Prescient Warning • “As long as the centuries… unfold, the number of books will grow continually… as convenient to search for a bit of truth concealed in nature as to find it hidden away in an immense multitude of bound volumes” – Denis Diderot, Encyclopédie (1755) • Today: ~10^7 papers in ~10^5 conferences/journals* • How do we cope? (* Thomson Reuters Web of Knowledge)

  27. Motivation (1) • Is there an approximation algorithm for the submodular covering problem that doesn’t require an integral-valued objective function? Any recent papers influenced by this?

  28. Motivation (2) • It’s 11:30pm Samoa Time. Your “Related Work” section is a bit sparse. Here are some papers we’ve cited so far. Anything else?

  29. Given a set of relevant query papers, what else should I read?

  30. An example (figure: the query set; a seminal/background paper cited by all query papers; a competing approach that cites all query papers) • However, unlikely to find papers directly connected to the entire query set • We need something more general…

  31. Select a set of papers A with maximum influence to/from the query set Q

  32. Modeling influence • Ideas flow from cited papers to citing papers

  33. Influence context • Why do I cite this paper? Because of specific ideas: generative model of text, variational inference, EM, … • We call these concepts

  34. Concept representation • Words, phrases or important technical terms • Proteins, genes, or other advanced features Our assumption: Influence always occurs in the context of concepts

  35. Influence by Word • Influence with respect to the word “bayesian” (figure: citation graph over Probabilistic Latent Semantic Indexing; Introduction to Variational Methods for Graphical Models; Modern Information Retrieval; Text Classification… Using EM; Latent Dirichlet Allocation)

  36. Influence by Word • Influence with respect to the word “simplex” (figure: the same citation graph, with different edges highlighted)

  37. Influence by Word • Weight on edge (u, v) in G_c represents the strength of direct influence from u to v: what is the chance that a random instance of concept c in v came about as a result of u? (figure: the same citation graph, with edge weights for the word “mixture”)

  38. Influence: Definition • Each edge (u, v) in G_c is active with probability given by its weight • Influence_c(x → y): the probability that there exists a path between x and y in G_c consisting entirely of active edges • Computing this exactly is #P-complete, but there is an efficient approximation using sampling or dynamic programming
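A minimal Monte Carlo sketch of the sampling approximation: in each sample, lazily flip each edge of G_c active with its weight and BFS from x; the fraction of samples that reach y estimates Influence_c(x → y). The adjacency-dict representation and toy graph are assumptions:

```python
import random
from collections import deque

def influence_mc(graph, x, y, num_samples=1000, rng=random.Random(0)):
    """Estimate Influence_c(x -> y): probability that a path of active
    edges exists from x to y, where edge (u, v) is active w.p. graph[u][v].

    graph: dict u -> {v: activation probability} over the concept graph G_c.
    """
    hits = 0
    for _ in range(num_samples):
        # Sample one random "active" subgraph lazily while BFS-ing from x.
        frontier, seen = deque([x]), {x}
        while frontier:
            u = frontier.popleft()
            if u == y:
                hits += 1
                break
            for v, p in graph.get(u, {}).items():
                if v not in seen and rng.random() < p:
                    seen.add(v)
                    frontier.append(v)
    return hits / num_samples

# Toy example: two parallel two-hop paths between hypothetical papers.
G_c = {"LDA": {"pLSI": 0.8, "VarInf": 0.5},
       "pLSI": {"IR book": 0.6},
       "VarInf": {"IR book": 0.4}}
print(influence_mc(G_c, "LDA", "IR book"))
```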

  39. Putting it all together • Maximize F_Q(A) s.t. |A| ≤ k (output k selected papers), where F_Q(A) sums, over query papers q in the query set and concepts c in the concept set, the prevalence of c in paper q times the probability of influence between q and at least one paper in A • Submodular optimization problem, solved efficiently and near-optimally
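The same objective in math form (a reconstruction from the slide text; I write prev(c, q) for the prevalence of concept c in query paper q, Inf_c(q, d) for the pairwise influence probability, and assume the usual soft-OR combination for “at least one paper in A”):

```latex
\[
F_Q(A) \;=\; \sum_{q \in Q} \sum_{c \in \mathcal{C}}
\mathrm{prev}(c, q)\,
\Bigl(1 \,-\, \prod_{d \in A}\bigl(1 - \mathrm{Inf}_c(q, d)\bigr)\Bigr),
\qquad
\max_{A \,:\, |A| \le k} F_Q(A).
\]
```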

  40. But should all users get the same results?

  41. Personalized trust • Different communities trust different researchers for a given concept (e.g., for “network”: Pearl, Kleinberg, or Hinton, depending on the community) • Goal: estimate personalized trust from limited user input

  42. Specifying trust preferences • Specifying trust should not be an onerous task • Assume given (nonexhaustive!) set of trusted papers B, e.g., • a BibTeX file of all the researcher’s previous citations • a short list of favorite conferences and journals • someone else’s citation history! a committee member? journal editor? someone in another field? a Turing Award winner?

  43. Computing trust • How much do I trust Jon Kleinberg with respect to the concept “network”? • An author is trusted if he/she influences the user’s trusted set B (figure: influence flowing from Kleinberg’s papers into B)
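One way to turn this definition into a score, reusing the `influence_mc` sketch above; aggregating an author’s papers and the trusted documents with a soft OR is my assumption, not necessarily the authors’ estimator:

```python
def trust(author_papers, trusted_set, graph, influence=influence_mc):
    """Trust in an author w.r.t. concept c: probability that at least one
    of the author's papers influences some paper in the trusted set B,
    treating the pairwise influence events as independent."""
    p_no_influence = 1.0
    for p in author_papers:
        for b in trusted_set:
            p_no_influence *= 1.0 - influence(graph, p, b)
    return 1.0 - p_no_influence
```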

  44. Personalized Objective • Same objective: probability of influence between q and at least one paper in A • Extra weight in the influence term: does the user trust at least one of the authors of d with respect to concept c?

  45. Personalized Recommendations (figure: general recommendations vs. personalized recommendations, side by side)

  46. User Study Evaluation • 16 PhD students in machine learning • For each participant: selected a recent publication of theirs (the study paper) for which we find related work • Two variants of our methodology (with and without trust) • Three state-of-the-art alternatives: • Relational Topic Model (generative model of text and links) [Chang, Blei ‘10] • Information Genealogy (uses only document text) [Shaparenko, Joachims ‘07] • Google Scholar (based on keywords provided by a coauthor) • Double-blind study where the participant was provided with the title/author/abstract of one paper at a time and asked several questions

  47. Usefulness (chart; higher is better) • On average, our approach provides more useful and more must-read papers than comparison techniques

  48. Trust (chart; higher is better) • On average, our approach provides more trustworthy papers than comparison techniques, especially when incorporating the participant’s trust preferences

  49. Novelty (chart) • On average, our approach provides more familiar papers than comparison techniques, especially when incorporating the participant’s trust preferences
