Techniques for Collaboration in Text Filtering

Techniques for Collaboration in Text Filtering • Ian Soboroff • Department of Computer Science and Electrical Engineering • University of Maryland, Baltimore County • ian@cs.umbc.edu

Overview • Text filtering and collaborative filtering • Finding collaboration among content profiles • Experimental results • Ongoing work

Information Filtering • Given • a stream of documents (news articles, movies) • a set of users (with stable and specific interests) • Recommend documents to users who will be interested in them • "Tell me when a jazz CD comes out that I'll like." • "Tell me when an earthquake is reported."

Content Filtering • Construct profiles from example documents • vector of weights for terms in documents • can use known relevant and nonrelevant docs • can use external resources such as a home page, job description, or research papers • Match new documents against content profiles
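The profile idea on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: log-scaled tf-idf weights and a Rocchio-style profile (mean of relevant vectors minus a damped mean of nonrelevant ones). The weighting scheme, the `beta` parameter, the toy documents, and all function names are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized docs into {term: weight} dicts using log-tf * idf.
    For simplicity, idf is computed over this toy collection only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(c)) * math.log(1 + n / df[t])
                        for t, c in tf.items()})
    return vectors

def build_profile(relevant, nonrelevant, beta=0.5):
    """Assumed Rocchio-style profile: mean relevant minus beta * mean nonrelevant."""
    profile = Counter()
    for vec in relevant:
        for t, w in vec.items():
            profile[t] += w / len(relevant)
    for vec in nonrelevant:
        for t, w in vec.items():
            profile[t] -= beta * w / len(nonrelevant)
    return dict(profile)

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy stream: two relevant and one nonrelevant training
# document, followed by two new documents to match.
docs = [
    "earthquake damage reported in turkey".split(),
    "earthquake building codes revised".split(),
    "jazz album release by ellington".split(),
    "new earthquake strikes turkey coast".split(),
    "ellington jazz concert tonight".split(),
]
vecs = tfidf_vectors(docs)
profile = build_profile(relevant=vecs[:2], nonrelevant=vecs[2:3])
quake_score = cosine(profile, vecs[3])
jazz_score = cosine(profile, vecs[4])
```

A new document would be recommended when its score exceeds some threshold; how that threshold is chosen is left open here.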

Filtering in a Community • Many people will be watching the same stream • Some of them may have overlapping interests • earthquakes, mideast politics, building codes, Turkey • Charles Mingus, Duke Ellington, Kenny G • Want to take advantage of group effort

"Pure" Collaborative Filtering • collect users' ratings for documents • thumbs up/down, or 1-5 scale • compute correlations among users • predict ratings for new/unseen items using existing ratings and correlation values

Pure CF Example Comedies Dramas Alice 5 7 Bob ? 9 7 ? 2 9 Carmen 4 9 7 8 1 8 Doug ? 9
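The correlate-then-predict steps of pure collaborative filtering can be sketched as follows. This assumes Pearson correlation over co-rated items and a prediction formed from the user's mean rating plus correlation-weighted deviations of other users, which is one common formulation and not necessarily the one the slides intend. The ratings below are hypothetical and are not the values from the example table.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = (sqrt(sum((a[i] - ma) ** 2 for i in common))
           * sqrt(sum((b[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """User's mean rating plus correlation-weighted deviations of the
    other users who rated the item (an assumed prediction rule)."""
    mine = ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(mine, theirs)
        their_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - their_mean)
        den += abs(w)
    return my_mean + num / den if den else my_mean

# Hypothetical ratings on a 1-5 scale; "ann" has not rated d4.
ratings = {
    "ann": {"d1": 5, "d2": 1, "d3": 4},
    "ben": {"d1": 4, "d2": 2, "d3": 5, "d4": 5},
    "cat": {"d1": 1, "d2": 5, "d4": 2},
}
prediction = predict(ratings, "ann", "d4")
```

Here "ben" agrees with "ann" and liked d4, while "cat" disagrees with "ann" and disliked it, so both push the prediction above ann's mean rating.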

Combining Content and Collaboration • Pure collaborative filtering • can recommend anything • must have ratings to give predictions • don't know much about documents or ratings • Adding content to collaboration • content filtering can recommend an unrated document • exploit common themes among content profiles

One Approach to CBCF • Construct content profiles • Documents are vectors of weighted features • Build profiles from known relevant and nonrelevant documents • Collaborative step • Combine profile vectors into single matrix • Compute latent semantic index of profile collection • Route new documents in profiles' "LSI space"

Latent Semantic Indexing • Compute singular value decomposition of a content matrix $M = [w_{td}]$: $M_{t \times d} = T_{t \times r}\, \Sigma_{r \times r}\, D^{T}_{r \times d}$ • D, a representation of M in r dimensions • T, a matrix for transforming new documents • Σ gives relative importance of dimensions
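For reference, the truncation to the leading dimensions and the mapping of a new document can be written out as below. The symbols $k$, $q$, and $\hat{q}$ are notation introduced here for illustration. Projecting with $T_k^{T}$ alone is one common convention; some LSI formulations additionally scale by $\Sigma_k^{-1}$.

```latex
% Keep only the k largest singular values (k <= r):
M \approx M_k = T_k \, \Sigma_k \, D_k^{T}

% A new document q (a t-dimensional term vector) maps to k dimensions:
\hat{q} = T_k^{T} q

% The columns of M map to the same space, because T_k^{T} T = [ I_k \; 0 ]:
T_k^{T} M = \Sigma_k \, D_k^{T}
```

Under this convention, a new document and the columns of $M$ share the same $k$ coordinates and can be compared directly, for example by cosine similarity.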

Collaborating with LSI • LSI dimensions are ... • based on term co-occurrence patterns between documents (profiles) • ordered by their prominence in collection • LSI space built from profiles • highlights common patterns among profiles • "noisy" dimensions can be pruned • project new documents into a collaborative space for routing
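A minimal sketch of routing in a profile-built LSI space, assuming NumPy is available. The matrix of profiles, the choice of `k`, the cosine scoring, and the function names are illustrative assumptions. The slides do not specify how profiles and documents are compared once projected.

```python
import numpy as np

def profile_lsi(M, k):
    """Rank-k SVD of a terms x profiles matrix M.
    Returns T_k (terms x k) and the profiles' k-dimensional coordinates."""
    T, s, Dt = np.linalg.svd(M, full_matrices=False)  # s is sorted descending
    T_k = T[:, :k]
    profiles_k = s[:k, None] * Dt[:k, :]  # k x profiles, equal to T_k.T @ M
    return T_k, profiles_k

def route(doc, T_k, profiles_k):
    """Project a term-space document into LSI space and return one
    cosine score per profile."""
    doc_k = T_k.T @ doc
    norms = np.linalg.norm(profiles_k, axis=0) * np.linalg.norm(doc_k)
    norms[norms == 0] = 1.0  # avoid division by zero for empty vectors
    return (profiles_k.T @ doc_k) / norms

# Hypothetical term order:
# earthquake, turkey, building, jazz, ellington, mingus
M = np.array([
    [2, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)  # columns: two earthquake profiles, two jazz profiles

T_k, profiles_k = profile_lsi(M, k=2)
doc = np.array([0, 1, 1, 0, 0, 0], dtype=float)  # mentions turkey, building
scores = route(doc, T_k, profiles_k)
```

With two retained dimensions, the two earthquake-related profiles collapse onto a shared dimension, so a document that mentions only "turkey" and "building" scores highly against both of them and near zero against the jazz profiles.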

Experiments with Cranfield • Cranfield, a standard (if small) IR collection • 1398 documents, 255 scored queries • Profiles: selected Cranfield queries • 26 queries with ≥ 15 relevant documents • 70% of each query's relevant docs used to build its profile • Results show improvement from using LSI of profiles • compared to using profiles alone • compared to using LSI of all of Cranfield

Results: Average Precision

Method                                   k-value   Set 1    Set 2
Content (log-tfidf)                      -         0.2894   0.2705
Content LSI (LSI of all of Cranfield)    25        0.2656   0.1980
                                         50        0.3136   0.2686
                                         100       0.3251   0.3053
                                         200       0.3314   0.3144
                                         500       0.3149   0.3302
Collaborative LSI (LSI of profiles)      8         0.3136   0.2583
                                         15        0.4151   0.3745
                                         18        0.3600   0.3615
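For readers unfamiliar with the metric, here is a small sketch of how average precision is typically computed for one profile's ranked output. It assumes the standard noninterpolated definition; the slides do not state which variant was used. The document identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Noninterpolated average precision of one ranked list:
    the mean, over all relevant documents, of precision at each
    rank where a relevant document appears (0 for those never retrieved)."""
    relevant_ids = set(relevant_ids)
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs,
    one pair per profile."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant documents appear at ranks 1 and 4; a third is never retrieved.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2", "d9"})
# (1/1 + 2/4) / 3 = 0.5
```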

Results: Precision-Recall

Experiments with TREC • TREC-8 routing task • Profiles: 50 topics (351-400) • Test Documents: Financial Times 1993-4 • Training Documents: FT 92, LA Times 89-90, FBIS • Building profiles • short topic description • known relevant documents in training set • sample of non-relevant documents from training set

Average Precision in TREC • Average precision... • with profiles alone = 0.4464 • with profile LSI = 0.3971 • LSI shows no improvement over original profiles • Some topics conceivably have common interests • "hydrogen energy"; "hydrogen fuel automobiles"; "hybrid fuel cars" • "clothing sweatshops"; "human smuggling" • But too little training overlap?

Conclusions • LSI can improve filtering performance • but might not, if SVD can't find anything to work with • LSI of profiles is much cheaper to compute than LSI of a whole collection (or even a sample!)

Current and Future Work • Looking at other collections • More TREC! • Reuters-21578 • Collaborative filtering collections... such as? • Looking at other techniques • Comparison to collaboration alone? • Other methods of combining content and collaboration
