
Techniques for Collaboration in Text Filtering

Techniques for Collaboration in Text Filtering. Ian Soboroff Department of Computer Science and Electrical Engineering University of Maryland, Baltimore County ian@cs.umbc.edu. Overview. Text filtering and collaborative filtering Finding collaboration among content profiles


Presentation Transcript


  1. Techniques for Collaboration in Text Filtering Ian Soboroff Department of Computer Science and Electrical Engineering University of Maryland, Baltimore County ian@cs.umbc.edu

  2. Overview • Text filtering and collaborative filtering • Finding collaboration among content profiles • Experimental results • Ongoing work

  3. Information Filtering • Given • a stream of documents (news articles, movies) • a set of users (with stable and specific interests) • Recommend documents to users who will be interested in them • "Tell me when a jazz CD comes out that I'll like." • "Tell me when an earthquake is reported."

  4. Content Filtering • Construct profiles from example documents • vector of weights for terms in documents • can use known relevant and nonrelevant docs • can use external resources such as a home page, job description, or research papers • Match new documents against content profiles
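
The profile construction and matching steps on this slide can be sketched as follows. This is a minimal illustration, not the talk's implementation: it assumes log-scaled tf-idf weights and a Rocchio-style profile (mean of the relevant vectors minus a down-weighted mean of the nonrelevant ones). The function names, the `beta` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency for every term in a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def weight(doc, idf):
    """Log-scaled tf-idf vector for one tokenized document (assumed weighting)."""
    tf = Counter(doc)
    return {t: (1 + math.log(c)) * idf.get(t, 0.0) for t, c in tf.items()}


def build_profile(relevant, nonrelevant, idf, beta=0.5):
    """Rocchio-style profile: relevant centroid minus beta * nonrelevant centroid."""
    profile = Counter()
    for doc in relevant:
        for t, w in weight(doc, idf).items():
            profile[t] += w / len(relevant)
    for doc in nonrelevant:
        for t, w in weight(doc, idf).items():
            profile[t] -= beta * w / len(nonrelevant)
    return dict(profile)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training documents (hypothetical, already tokenized).
relevant = [["jazz", "bass", "mingus"], ["jazz", "piano", "ellington"]]
nonrelevant = [["earthquake", "turkey", "damage"]]
idf = idf_table(relevant + nonrelevant)
profile = build_profile(relevant, nonrelevant, idf)

# Match two new documents against the profile.
jazz_score = cosine(profile, weight(["jazz", "bass", "quartet"], idf))
quake_score = cosine(profile, weight(["earthquake", "damage", "report"], idf))
```

In this toy setup the jazz document scores above zero and the earthquake document below it, so a threshold at zero would route only the first one to the user.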

  5. Filtering in a Community • Many people will be watching the same stream • Some of them may have overlapping interests • earthquakes, mideast politics, building codes, Turkey • Charles Mingus, Duke Ellington, Kenny G • Want to take advantage of group effort

  6. "Pure" Collaborative Filtering • collect users' ratings for documents • thumbs up/down, or 1-5 scale • compute correlations among users • predict ratings for new/unseen items using existing ratings and correlation values
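
A minimal sketch of the correlate-then-predict loop described on this slide. It assumes Pearson correlation over co-rated items and a prediction equal to the user's mean rating plus the correlation-weighted deviations of other users; the slide does not say which correlation or prediction formula was intended, so both are assumptions. The user names and ratings are made up for illustration.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = [i for i in ratings_a if i in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(user, item, ratings):
    """Predict a rating as the user's mean plus weighted neighbor deviations."""
    own = ratings[user]
    own_mean = sum(own.values()) / len(own)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(own, theirs)
        other_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - other_mean)
        den += abs(w)
    return own_mean if den == 0 else own_mean + num / den


# Hypothetical ratings on a 1-5 scale; u1 has not rated item "d".
ratings = {
    "u1": {"a": 5, "b": 1, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 5, "d": 5},
    "u3": {"a": 1, "b": 5, "d": 1},
}
predicted = predict("u1", "d", ratings)
```

Here `u2` agrees with `u1` and liked item `d`, while `u3` disagrees with `u1` and disliked it, so both neighbors push the prediction above `u1`'s own mean.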

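  7. Pure CF Example • Ratings table with columns grouped into Comedies and Dramas ("?" marks an unknown rating) • Alice: 5, 7 • Bob: ?, 9, 7, ?, 2, 9 • Carmen: 4, 9, 7, 8, 1, 8 • Doug: ?, 9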

  8. Combining Content and Collaboration • Pure collaborative filtering • can recommend anything • must have ratings to give predictions • don't know much about documents or ratings • Adding content to collaboration • content filtering can recommend an unrated document • exploit common themes among content profiles

  9. One Approach to CBCF • Construct content profiles • Documents are vectors of weighted features • Build profiles from known relevant and nonrelevant documents • Collaborative step • Combine profile vectors into single matrix • Compute latent semantic index of profile collection • Route new documents in profiles' "LSI space"

  10. Latent Semantic Indexing • $M_{t \times d} = T_{t \times r} \, \Sigma_{r \times r} \, D^{T}_{r \times d}$, where $M$ has entries $w_{td}$ • Compute singular value decomposition of a content matrix • D, a representation of M in r dimensions • T, a matrix for transforming new documents • Σ gives relative importance of dimensions
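
Assuming the decomposition $M = T \Sigma D^{T}$ is truncated to its top $k$ dimensions, one common convention for placing a new document vector $q$ of term weights into the same space is the "fold-in" projection below. The slide only says that T transforms new documents, so the exact scaling by $\Sigma_k^{-1}$ is an assumption.

```latex
\hat{q} = \Sigma_k^{-1} T_k^{T} q,
\qquad
\operatorname{sim}(\hat{q}, d_j) =
  \frac{\hat{q} \cdot d_j}{\lVert \hat{q} \rVert \, \lVert d_j \rVert}
```

Here $d_j$ denotes the $j$-th row of $D_k$, so the similarity compares the projected document with the reduced representation of the $j$-th column of $M$.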

  11. Collaborating with LSI • LSI dimensions are ... • based on term co-occurrence patterns between documents (profiles) • ordered by their prominence in collection • LSI space built from profiles • highlights common patterns among profiles • "noisy" dimensions can be pruned • project new documents into a collaborative space for routing
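
The idea of routing in a space built from the profiles can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the author's code: pruning is taken to mean keeping the top `k` singular dimensions, new documents are folded in by dividing by the singular values, and the function names and the five-term toy vocabulary are hypothetical.

```python
import numpy as np


def build_lsi(profiles, k):
    """Truncated SVD of a term-by-profile matrix, keeping the top k dimensions."""
    T, s, Dt = np.linalg.svd(profiles, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T  # terms x k, k, profiles x k


def fold_in(doc, T_k, s_k):
    """Project a term-space document vector into the k-dimensional LSI space."""
    return (doc @ T_k) / s_k


def route(doc, T_k, s_k, D_k):
    """Cosine similarity between a folded-in document and every profile."""
    q = fold_in(doc, T_k, s_k)
    norms = np.linalg.norm(D_k, axis=1) * np.linalg.norm(q)
    return (D_k @ q) / np.where(norms == 0, 1.0, norms)


# Rows are terms (jazz, bass, piano, earthquake, turkey); columns are profiles.
profiles = np.array([
    [1.0, 1.0, 0.0],  # jazz
    [1.0, 0.0, 0.0],  # bass
    [0.0, 1.0, 0.0],  # piano
    [0.0, 0.0, 1.0],  # earthquake
    [0.0, 0.0, 1.0],  # turkey
])
T_k, s_k, D_k = build_lsi(profiles, k=2)

# A new document that mentions only "bass".
doc = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
scores = route(doc, T_k, s_k, D_k)
```

In raw term space the "bass" document shares no term with the second profile (jazz, piano), yet in the two-dimensional space both jazz profiles collapse onto a shared dimension and the document matches both of them, while the earthquake profile stays orthogonal.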

  12. Experiments with Cranfield • Cranfield, a standard (if small) IR collection • 1398 documents, 255 scored queries • Profiles: selected Cranfield queries • 26 queries with ≥ 15 relevant documents • 70% of a profile's relevant docs used in each profile • Results show improvement from using LSI of profiles • compared to using profiles alone • compared to using LSI of all of Cranfield

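  13. Results: Average Precision (values listed as Set 1, Set 2) • Content (log-tfidf), no k-value: 0.2894, 0.2705 • Content LSI (LSI of all of Cranfield) • k = 25: 0.2656, 0.1980 • k = 50: 0.3136, 0.2686 • k = 100: 0.3251, 0.3053 • k = 200 and k = 500: 0.3314, 0.3144, 0.3149, 0.3302 (assignment of these four values to cells is not recoverable from the transcript) • Collaborative LSI (LSI of profiles) • k = 8: 0.3136, 0.2583 • k = 15: 0.4151, 0.3745 • k = 18: 0.3600, 0.3615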
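
For readers unfamiliar with the metric, the sketch below computes average precision for a ranked list. The slide does not state which variant was used, so this assumes the standard non-interpolated form, averaged over profiles; the function names and document ids are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    scores = [average_precision(ranking, rel) for ranking, rel in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Relevant documents appear at ranks 2 and 4: (1/2 + 2/4) / 2 = 0.5
ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})
```

Relevant documents that never appear in the ranking still count in the denominator, so a profile is penalized for every relevant document it fails to retrieve.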

  14. Results: Precision-Recall

  15. Experiments with TREC • TREC-8 routing task • Profiles: 50 topics (351-400) • Test Documents: Financial Times 1993-4 • Training Documents: FT 92, LA Times 89-90, FBIS • Building profiles • short topic description • known relevant documents in training set • sample of non-relevant documents from training set

  16. Average Precision in TREC • Average precision... • with profiles alone = 0.4464 • with profile LSI = 0.3971 • LSI shows no improvement over original profiles • Some topics conceivably have common interests • "hydrogen energy"; "hydrogen fuel automobiles"; "hybrid fuel cars" • "clothing sweatshops"; "human smuggling" • But too little training overlap?

  17. Conclusions • LSI can improve filtering performance • but might not, if SVD can't find anything to work with • LSI of profiles is much cheaper to compute than LSI of a whole collection (or even a sample!)

  18. Current and Future Work • Looking at other collections • More TREC! • Reuters-21578 • Collaborative filtering collections... such as? • Looking at other techniques • Comparison to collaboration alone? • Other methods of combining content and collaboration
