
Filtering Semi-Structured Documents Based on Faceted Feedback



Presentation Transcript


  1. Filtering Semi-Structured Documents Based on Faceted Feedback • Lanbo Zhang, Yi Zhang, Qianli Xing • Information Retrieval and Knowledge Management (IRKM) Lab, University of California, Santa Cruz

  2. Personalized Information Filtering • Identify user-desired documents from a document stream • Two families of filtering approaches • Collaborative Filtering (CF) • Content-Based Filtering (CBF) • Applications: news feeds, email spam filters, etc. • [Diagram: emails, news, blogs, … flow into the filtering system, which outputs the passed documents]

  3. Semi-Structured Documents • Increasingly prevalent over the Internet • Emails, news, movies, tweets, etc. • Plenty of metadata available

  4. Definitions • Facet: a metadata field • Date, Topic, Location, Director, Genre, etc. • Facet-Value Pair (FVP): a metadata field assigned a particular value • Topic: Royal wedding • Date: 04-29-2011 • Location: London, UK

  5. Motivation • Existing filtering approaches learn user interests from users’ relevance judgments of documents • Users may have prior knowledge of which facet-value pairs are relevant • English-only readers: “Language: English” • Social network analysts: “Company: Facebook”

  6. Can we exploit users’ prior knowledge on facet-value pairs for filtering?

  7. A New User Interaction Mechanism: Faceted Feedback • The filtering system presents FVP candidates (Lang: …, Topic: …, Date: …) • The user returns the relevant FVPs (Topic: …, Lang: …)

  8. Research Questions • Question 1 • How to select facet-value pair candidates? • Question 2 • How to learn user profiles based on faceted feedback?

  9. Q1: Possible Methods • Feature selection methods for text classification • E.g., Mutual Information, Chi-Square measure, etc. • Usually assume a large number of labeled documents is available • Query expansion methods for retrieval • E.g., TF-IDF score on pseudo-relevant documents • Assume no labeled documents are available

  10. FVP Selection: Our Approach • In a filtering task • A large number of unlabeled documents • Possibly a small number of labeled documents • We rank facet-value pairs by a score computed from pseudo-relevant (positively classified) documents and user-labeled relevant documents • Intuition: features that occur frequently among relevant docs but rarely in the whole corpus are very likely to be relevant
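The ranking formula on this slide is not preserved in the transcript. As a minimal sketch of the stated intuition only, the following assumes a TF-IDF-like score: the share of relevant documents containing an FVP, multiplied by a smoothed log inverse corpus frequency. The function name `rank_fvps`, the smoothing, and the score itself are illustrative assumptions, not the authors' formula.

```python
import math
from collections import Counter

def rank_fvps(relevant_docs, corpus_docs, top_k=5):
    """Rank facet-value pairs (FVPs) by an assumed TF-IDF-like score.

    Each document is a set of (facet, value) tuples. `relevant_docs` stands
    for the pseudo-relevant plus user-labeled relevant documents.
    """
    rel_counts = Counter(fvp for doc in relevant_docs for fvp in doc)
    corpus_counts = Counter(fvp for doc in corpus_docs for fvp in doc)
    n_corpus = len(corpus_docs)
    scores = {}
    for fvp, rel_freq in rel_counts.items():
        # Add-one smoothing keeps the ratio finite for FVPs missing from the corpus sample.
        idf = math.log((n_corpus + 1) / (corpus_counts[fvp] + 1))
        # Frequent among relevant docs, rare in the whole corpus -> high score.
        scores[fvp] = (rel_freq / len(relevant_docs)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = [
    {("Lang", "English"), ("Company", "Facebook")},
    {("Lang", "English"), ("Company", "Facebook")},
    {("Lang", "English")},
    {("Lang", "English")},
    {("Lang", "English")},
    {("Lang", "English")},
]
ranked = rank_fvps(relevant_docs=corpus[:2], corpus_docs=corpus)
```

In this toy corpus, "Company: Facebook" outranks "Lang: English" because the latter appears in every document and so carries no discriminating weight under the assumed score.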

  11. Research Questions • Question 1 • How to select facet-value pair candidates? • Question 2 • How to learn user profiles based on faceted feedback?

  12. Content-Based Filtering (CBF) • Treated as a binary text classification task • User profile: a feature vector that represents a user’s information needs (interests/preferences) • Given the user profile θ, a document is judged relevant or not according to: [equation relating the document label to the document vector and θ] • The core of CBF is learning the user profile!

  13. Q2: Possible Methods • Simple methods • Boolean strategy (AND, OR) • Feature selection • Pseudo-relevant document • Sophisticated methods • Bayesian logistic regression with an adjusted prior (Dayanik et al. 06) • Generalized Expectation Criteria (Druck et al. 08)

  14. Our Approach • The assumption • A user selects a feature because it is highly correlated with the document label (R/NR) • Generalized Constraint Model (GCM)

  15. Correlation Decomposition • Sufficiency • The probability of a document being relevant given that the feature has occurred: P(R+|f=1) • P(R+|f=1)=1 : sufficient features • E.g., “Company: Facebook” for social network analysts • Necessity • The probability of the feature having occurred given that a document is relevant: P(f=1|R+) • P(f=1|R+)=1 : necessary features • E.g., “Language: English” for English-only readers
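The two quantities defined on this slide can be computed directly when document labels are known. The sketch below does so on a toy collection; the function name and the data are illustrative assumptions.

```python
def sufficiency_and_necessity(docs, labels, feature):
    """Empirical P(R+ | f=1) and P(f=1 | R+) from labeled documents.

    docs: list of sets of features; labels: list of bools (True = relevant).
    Returns 0.0 for a quantity whose conditioning set is empty.
    """
    covered = [lab for doc, lab in zip(docs, labels) if feature in doc]
    relevant = [doc for doc, lab in zip(docs, labels) if lab]
    # Sufficiency: share of documents containing the feature that are relevant.
    suff = sum(covered) / len(covered) if covered else 0.0
    # Necessity: share of relevant documents that contain the feature.
    nec = sum(feature in doc for doc in relevant) / len(relevant) if relevant else 0.0
    return suff, nec

docs = [
    {"Company: Facebook", "Language: English"},
    {"Language: English"},
    {"Language: English"},
    {"Language: German"},
]
labels = [True, True, False, False]

fb = sufficiency_and_necessity(docs, labels, "Company: Facebook")
en = sufficiency_and_necessity(docs, labels, "Language: English")
```

Here "Company: Facebook" is sufficient but not necessary (every document carrying it is relevant, yet only half of the relevant documents carry it), while "Language: English" is necessary but not sufficient.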

  16. Examples: Highly-Correlated Features • [Venn diagram over the whole corpus: the set f1=1 lies inside R+, R+ lies inside the set f2=1, and the set f3=1 overlaps R+] • 1) f1 is a sufficient feature since P(R+|f1=1)=1 • 2) f2 is a necessary feature since P(f2=1|R+)=1 • 3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)

  17. Estimating Sufficiency • [Equation; its annotations: the document label, the user profile vector, the feature, the estimated label of document di, and the set of documents covered by feature f]

  18. Estimating Necessity • Derived with Bayes’ theorem from the feature sufficiency and the prior distribution
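The estimation equations on slides 17 and 18 are not preserved in the transcript. The following is a sketch under stated assumptions: relevance is predicted by a logistic model over binary feature vectors, sufficiency is the mean predicted relevance over the documents covered by the feature, and necessity follows from Bayes' theorem, P(f=1|R+) = P(R+|f=1) · P(f=1) / P(R+). All function names are hypothetical, and the logistic form is an assumption rather than the authors' stated model.

```python
import math

def predict_relevance(theta, x):
    """Assumed logistic model: P(R+ | x, theta) = sigmoid(theta . x)."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def estimate_sufficiency(theta, docs, f):
    """Mean predicted relevance over documents where feature index f is on.

    Assumes at least one document is covered by the feature.
    """
    covered = [x for x in docs if x[f] == 1]
    return sum(predict_relevance(theta, x) for x in covered) / len(covered)

def estimate_necessity(theta, docs, f):
    """Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+)."""
    p_f = sum(1 for x in docs if x[f] == 1) / len(docs)              # P(f=1)
    p_rel = sum(predict_relevance(theta, x) for x in docs) / len(docs)  # P(R+)
    return estimate_sufficiency(theta, docs, f) * p_f / p_rel

# Feature 0 occurs in every document, so its estimated necessity is 1.
docs = [[1, 0], [1, 1]]
theta = [0.3, -1.2]
nec_f0 = estimate_necessity(theta, docs, 0)
```

Because both estimates are functions of the profile vector, a constraint placed on a feature's sufficiency or necessity becomes a constraint on the profile itself, which is what makes them usable inside a learning objective.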

  19. Reference Distributions • Our assumption • A user selects a feature because it has high sufficiency and/or high necessity • Reference distributions: two Bernoulli distributions • The sufficiency/necessity of a user-selected feature should be close to the corresponding reference distribution • KL divergence as the similarity measure
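For two Bernoulli distributions the KL divergence has a closed form, KL(p ‖ q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)). The sketch below implements it; the reference parameter 0.9 and the direction of the divergence (reference first) are assumptions for illustration, since the slide gives neither.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), clipping both to (0, 1) to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

reference = 0.9  # hypothetical reference parameter for a "sufficient" feature
far = bernoulli_kl(reference, 0.5)   # estimated sufficiency far from the reference
near = bernoulli_kl(reference, 0.8)  # estimated sufficiency close to the reference
```

The penalty shrinks as the estimated sufficiency approaches the reference and is zero when the two coincide.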

  20. User Profile Learning • A unified loss function combines the two types of feedback, with terms for user-labeled documents, sufficient features, and necessary features • Ts, Tn: reference distributions
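The loss function itself is not preserved in the transcript. One form consistent with the three named terms is sketched below; the log-likelihood data term, the weights λ_s and λ_n, the sets F_s and F_n of user-selected features, the estimates S_θ(f) and N_θ(f), and the direction of the KL divergence are all assumptions, not the authors' stated objective.

```latex
\mathcal{L}(\theta) =
  - \sum_{(x_i, y_i) \in D_L} \log P(y_i \mid x_i, \theta)
  + \lambda_s \sum_{f \in F_s} \mathrm{KL}\!\left(T_s \,\|\, \mathrm{Bern}\big(S_\theta(f)\big)\right)
  + \lambda_n \sum_{f \in F_n} \mathrm{KL}\!\left(T_n \,\|\, \mathrm{Bern}\big(N_\theta(f)\big)\right)
```

Here D_L is the set of user-labeled documents, S_θ(f) and N_θ(f) are the estimated sufficiency and necessity of feature f under profile θ, and T_s and T_n are the reference distributions.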

  21. User Interaction Mechanisms • Two mechanisms • Mechanism 1: ask users to select the features they think are relevant • Mechanism 2: ask users to select separately the features they think are sufficient and the features they think are necessary

  22. Outline • Introduction • Faceted Feedback • Facet-Value Pair Candidate Selection • Learning from Faceted Feedback • Experiments • Settings • Results • Summary

  23. Data Sets • Two data sets from the TREC filtering track • TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs) • Metadata field: MeSH (Medical Subject Headings) • TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors • Metadata fields: Topic, Industry, Region • Each topic set is split into two equal-size subsets • One for parameter tuning, the other for testing

  24. Faceted Feedback Collection • Subjects recruited on Mechanical Turk • Five subjects per topic • Average performance is reported • For each topic, we show subjects • The topic description (information need) • A group of facet-value pair candidates

  25. Evaluation Metrics • Precision (macro) • Recall (macro) • T11U = 2 * Nrd - Nnd • Nrd: the number of relevant docs delivered • Nnd: the number of non-relevant docs delivered • T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU) • MinNU = -0.5 • MaxU: the maximum possible utility (T11U)
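The utility measures can be computed as follows. This sketch assumes the TREC 2002 filtering-track definition of the scaled utility, T11SU = (max(T11U / MaxU, MinNU) − MinNU) / (1 − MinNU), and takes MaxU to be twice the number of relevant documents for the topic, i.e. the utility of delivering every relevant document and nothing else. The function names are illustrative.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered

def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility in [0, 1]; assumes n_rel_total > 0."""
    max_u = 2 * n_rel_total  # assumed MaxU: all relevant docs delivered, no others
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    # Clip the normalized utility at MinNU, then rescale to [0, 1].
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)

score = t11su(n_rel_delivered=10, n_nonrel_delivered=4, n_rel_total=20)
```

Under this definition, a system that delivers nothing scores 1/3, so values below 1/3 indicate a profile that did worse than staying silent.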

  26. Outline • Introduction • Faceted Feedback • Facet-Value Pair Candidate Selection • Learning from Faceted Feedback • Experiments • Settings • Results • Summary

  27. Results 1: With/Without Faceted Feedback (FF) • [Plot; x-axis: # relevant docs initially known] • Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

  28. Results 2: Different Learning Algorithms • [Results comparison: existing approaches vs. our approach] • BOOL(A), BOOL(O): Boolean strategy (AND, OR) • FS: feature selection based on FF • Pseudo-D/Q: pseudo-relevant doc/query • Prior: logistic regression with Bayesian prior • GEC: generalized expectation criteria

  29. Outline • Introduction • Faceted Feedback • Facet-Value Pair Candidate Selection • Learning from Faceted Feedback • Experiments • Settings • Results • Summary

  30. Summary • Faceted feedback is useful for filtering, especially in cold-start scenarios • The Generalized Constraint Model (GCM) is a robust user profile learning algorithm • In future work, we will evaluate our methods on data sets where faceted features are more important • Movies, music, products, etc.

  31. Questions? Filtering Semi-Structured Documents Based on Faceted Feedback Lanbo Zhang, Yi Zhang, Qianli Xing Information Retrieval and Knowledge Management (IRKM) Lab University of California, Santa Cruz lanbo@soe.ucsc.edu yiz@soe.ucsc.edu xingqianli@gmail.com
