A Review of Information Filtering Part II: Collaborative Filtering



Chengxiang Zhai

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

Outline

- A Conceptual Framework for Collaborative Filtering (CF)
- Rating-based Methods (Breese et al. 98)
- Memory-based methods
- Model-based methods
- Preference-based Methods (Cohen et al. 99 & Freund et al. 98)
- Summary & Research Directions

What is Collaborative Filtering (CF)?

- Making filtering decisions for an individual user based on the judgments of other users
- Inferring an individual’s interests/preferences from those of other, similar users
- General idea
- Given a user u, find similar users {u1, …, um}
- Predict u’s preferences based on the preferences of u1, …, um

CF: Applications

- Recommender Systems: books, CDs, Videos, Movies, potentially anything!
- Can be combined with content-based filtering
- Example (commercial) systems
- GroupLens (Resnick et al. 94): usenet news rating
- Amazon: book recommendation
- Firefly (purchased by Microsoft?): music recommendation
- Alexa: web page recommendation

CF: Assumptions

- Users with a common interest will have similar preferences
- Users with similar preferences probably share the same interest
- Examples
- “interest is IR” => “read SIGIR papers”
- “read SIGIR papers” => “interest is IR”
- A sufficiently large number of user preferences is available

CF: Intuitions

- User similarity
- If Jamie liked the paper, I’ll like the paper
- ? If Jamie liked the movie, I’ll like the movie
- Suppose Jamie and I viewed similar movies in the past six months …
- Item similarity
- Since 90% of those who liked Star Wars also liked Independence Day, and, you liked Star Wars
- You may also like Independence Day

Collaborative Filtering vs. Content-based Filtering

- Basic filtering question: Will user U like item X?
- Two different ways of answering it
- Look at what U likes
- Look at who likes X
- Can be combined

=> characterize X => content-based filtering

=> characterize U => collaborative filtering

Rating-based vs. Preference-based

- Rating-based: User’s preferences are encoded using numerical ratings on items
- Complete ordering
- Absolute values can be meaningful
- But, values must be normalized to combine
- Preferences: User’s preferences are represented by partial ordering of items
- Partial ordering
- Easier to exploit implicit preferences

A Formal Framework for Rating

[Slide figure: an m × n rating matrix with users U = {u1, …, um} as rows and objects O = {o1, …, on} as columns. Some cells hold known ratings (e.g., 3, 1.5, 2); the rest are the unknowns Xij = f(ui, oj) = ? to be predicted.]

The task

- Assume f values are known for some (u,o) pairs
- Predict f values for the other (u,o) pairs
- Essentially function approximation, like other learning problems
- The unknown function: f: U × O → R

Where are the intuitions?

- Similar users have similar preferences
- If u ≈ u’, then for all o’s, f(u,o) ≈ f(u’,o)
- Similar objects have similar user preferences
- If o ≈ o’, then for all u’s, f(u,o) ≈ f(u,o’)
- In general, f is “locally constant”
- If u ≈ u’ and o ≈ o’, then f(u,o) ≈ f(u’,o’)
- “Local smoothness” makes it possible to predict unknown values by interpolation or extrapolation
- What does “local” mean?

Two Groups of Approaches

- Memory-based approaches
- f(u,o) = g(u)(o) ≈ g(u’)(o) if u ≈ u’
- Find “neighbors” of u and combine their g(u’)(o)’s
- Model-based approaches
- Assume structures/models: object clusters, user clusters, f’ defined on clusters
- f(u,o) = f’(cu, co)
- Estimation & probabilistic inference

Memory-based Approaches (Breese et al. 98)

- General ideas:
- Xij: rating of object j by user i
- ni: average rating of all objects by user i
- Normalized ratings: Vij = Xij − ni
- Memory-based prediction: X̂aj = na + Σi w(a,i) · Vij / Σi |w(a,i)|
- Specific approaches differ in w(a,i), the distance/similarity between users a and i

User Similarity Measures

- Pearson correlation coefficient (sum over commonly rated items): w(a,i) = Σj (Xaj − na)(Xij − ni) / √(Σj (Xaj − na)² · Σj (Xij − ni)²)
- Cosine measure: w(a,i) = Σj Xaj Xij / (‖Xa‖ ‖Xi‖)
- Many other possibilities!
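As a concrete illustration, the prediction rule and the Pearson weight above can be sketched in Python. The dictionary-of-dictionaries rating store and the toy ratings are illustrative assumptions, not from the slides:

```python
import math

def mean_rating(ratings, u):
    """n_u: the average rating of all objects rated by user u."""
    vals = list(ratings[u].values())
    return sum(vals) / len(vals)

def pearson(ratings, a, i):
    """Pearson correlation w(a, i), summed over commonly rated items."""
    common = set(ratings[a]) & set(ratings[i])
    if not common:
        return 0.0
    na, ni = mean_rating(ratings, a), mean_rating(ratings, i)
    num = sum((ratings[a][o] - na) * (ratings[i][o] - ni) for o in common)
    da = math.sqrt(sum((ratings[a][o] - na) ** 2 for o in common))
    di = math.sqrt(sum((ratings[i][o] - ni) ** 2 for o in common))
    return num / (da * di) if da and di else 0.0

def predict(ratings, a, o):
    """Memory-based prediction: n_a plus a weighted combination of
    the other users' normalized ratings V_io = X_io - n_i."""
    na = mean_rating(ratings, a)
    num = den = 0.0
    for i in ratings:
        if i == a or o not in ratings[i]:
            continue
        w = pearson(ratings, a, i)
        num += w * (ratings[i][o] - mean_rating(ratings, i))
        den += abs(w)
    return na + (num / den if den else 0.0)

ratings = {
    "u1": {"o1": 3, "o2": 1, "o3": 2},
    "u2": {"o1": 3, "o2": 1, "o3": 2, "o4": 3},
    "u3": {"o1": 1, "o2": 3, "o3": 3, "o4": 1},
}
print(predict(ratings, "u1", "o4"))
```

On this toy data u2 rates like u1 and liked o4, while u3 rates oppositely and disliked it, so both neighbors push the prediction above u1's own average of 2.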

Improving User Similarity Measures (Breese et al. 98)

- Dealing with missing values: default ratings
- Inverse User Frequency (IUF): similar to IDF
- Case Amplification: use w(a,i)^p, e.g., p = 2.5
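The last two refinements are simple transforms on the weights; a minimal sketch (function names are illustrative):

```python
import math

def case_amplify(w, p=2.5):
    """Case amplification: raise |w(a,i)| to the power p, keeping the
    sign, so strong correlations are favored over weak ones."""
    return math.copysign(abs(w) ** p, w)

def inverse_user_frequency(n_users, n_raters):
    """IUF (IDF-like): down-weight items that many users have rated."""
    return math.log(n_users / n_raters)

print(case_amplify(0.9), case_amplify(-0.3))
print(inverse_user_frequency(1000, 10))
```

With p = 2.5 a weight of 0.9 shrinks only mildly while 0.3 is pushed close to zero, which is exactly the intended amplification of confident neighbors.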

Model-based Approaches (Breese et al. 98)

- General ideas
- Assume the data/ratings are explained by a probabilistic model with parameter θ
- Estimate/learn the model parameter θ from the data
- Predict an unknown rating using E[xk+1 | x1, …, xk], computed under the estimated model
- Specific methods differ in the model used and how it is estimated

Probabilistic Clustering

- Clustering users based on their ratings
- Assume ratings are observations of a multinomial mixture model with parameters p(C), p(xi|C)
- Model estimated using standard EM
- Predict ratings using E[xk+1 | x1, …, xk]
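A toy sketch of this clustering approach, assuming fully observed binary rating vectors and plain EM; the data set and all names are illustrative, and a real system would also have to handle missing ratings:

```python
import math
import random

def em_cluster(X, n_clusters, n_values, iters=50, seed=0):
    """EM for a multinomial mixture: estimates p(C) and p(x_j | C)
    from users' rating vectors X (ratings coded as 0..n_values-1)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    pc = [1.0 / n_clusters] * n_clusters
    # random (normalized) init of the per-cluster rating distributions
    px = [[[rng.random() + 0.5 for _ in range(n_values)]
           for _ in range(m)] for _ in range(n_clusters)]
    for c in range(n_clusters):
        for j in range(m):
            s = sum(px[c][j])
            px[c][j] = [v / s for v in px[c][j]]
    for _ in range(iters):
        # E-step: responsibility p(C = c | x) for each user
        resp = []
        for x in X:
            lik = [pc[c] * math.prod(px[c][j][x[j]] for j in range(m))
                   for c in range(n_clusters)]
            z = sum(lik)
            resp.append([l / z for l in lik])
        # M-step: re-estimate priors and rating distributions,
        # with +1 smoothing so no probability collapses to zero
        for c in range(n_clusters):
            nc = sum(r[c] for r in resp)
            pc[c] = nc / n
            for j in range(m):
                for v in range(n_values):
                    cnt = sum(r[c] for r, x in zip(resp, X) if x[j] == v)
                    px[c][j][v] = (cnt + 1.0) / (nc + n_values)
    return pc, px

def expected_rating(pc, px, observed, j):
    """E[x_j | observed ratings]: average over the cluster posterior."""
    lik = [pc[c] * math.prod(px[c][k][v] for k, v in observed.items())
           for c in range(len(pc))]
    z = sum(lik)
    post = [l / z for l in lik]
    n_values = len(px[0][j])
    return sum(post[c] * sum(v * px[c][j][v] for v in range(n_values))
               for c in range(len(pc)))

# two obvious user groups: items coded 1 = liked, 0 = disliked
X = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
pc, px = em_cluster(X, n_clusters=2, n_values=2)
print(expected_rating(pc, px, {0: 1, 1: 1}, 2))
```

A user who liked items 0 and 1 falls into the first group, whose members dislike item 2, so the expected rating comes out well below 0.5.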

Bayesian Network

- Use BN to capture object/item dependency
- Each item/object is a node
- (Dependency) structure is learned from all data
- Model parameters: p(xk+1 |pa(xk+1)) where pa(xk+1) is the parents/predictors of xk+1 (represented as a decision tree)
- Predict ratings using E[xk+1 | x1, …, xk]

Three-way Aspect Model(Popescul et al. 2001)

- CF + content-based
- Generative model
- (u,d,w) as observations
- z as hidden variable
- Standard EM
- Essentially clustering the joint data
- Evaluation on ResearchIndex data
- Found it’s better to treat (u,w) as observations

Evaluation Criteria (Breese et al. 98)

- Rating accuracy
- Average absolute deviation over Pa, the set of items predicted
- Ranking accuracy
- Expected utility with an exponentially decaying viewing probability
- The “half-life” = the rank at which the viewing probability = 0.5
- d = neutral rating

Results

- BN & CR+ are generally better than VSIM & BC
- BN is best with more training data
- VSIM is better with little training data
- Inverse User Frequency is effective
- Case amplification is mostly effective

Summary of Rating-based Methods

- Effectiveness
- Both memory-based and model-based methods can be effective
- The correlation method appears to be robust
- Bayesian network works well with plenty of training data, but not very well with little training data
- The cosine similarity method works well with little training data

Summary of Rating-based Methods (cont.)

- Efficiency
- Memory-based methods are slower than model-based methods at prediction time
- Learning can be extremely slow for model-based methods

Preference-based Methods(Cohen et al. 99, Freund et al. 98)

- Motivation
- Explicit ratings are not always available, but implicit orderings/preferences might be available
- Only relative ratings are meaningful, even when ratings are available
- Combining preferences has other applications, e.g.,
- Merging results from different search engines

A Formal Model of Preferences

- Instances: O = {o1, …, on}
- Ranking function: R: (U ×) O × O → [0,1]
- R(u,v) = 1 means u is strongly preferred to v
- R(u,v) = 0 means v is strongly preferred to u
- R(u,v) = 0.5 means no preference
- Feedback: F = {(u,v)}, meaning u is preferred to v
- Minimize the loss: Loss(R, F) = (1/|F|) Σ(u,v)∈F (1 − R(u,v))


The Hypothesis Space H

- Without constraints on H, the loss is minimized by any R that agrees with F
- Appropriate constraints are needed for collaborative filtering
- Compare this with

The Hedge Algorithm for Combining Preferences

- Iterative updating of w1, w2, …, wn
- Initialization: wi uniform
- Updating: wi ← wi · β^Li, with β ∈ [0,1]
- Li = 0 => weight stays the same
- Li large => weight is decreased
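The update rule is essentially one line; a minimal sketch with β = 0.5 (both β and the toy losses here are arbitrary choices for illustration):

```python
def hedge_update(weights, losses, beta=0.5):
    """One Hedge round: w_i <- w_i * beta**L_i, then renormalize.
    beta lies in (0, 1]; a loss of 0 leaves the weight unchanged,
    a large loss shrinks it."""
    new = [w * beta ** l for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]

w = [1 / 3, 1 / 3, 1 / 3]             # uniform initialization
w = hedge_update(w, [0.0, 0.5, 1.0])  # expert 0 suffered no loss this round
print(w)
```

After the update the lossless expert holds the largest weight share, and repeated rounds compound this, which is what drives the guarantee against the best single expert.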

Some Theoretical Results

- The cumulative loss of RA will not be much worse than that of the best ranking expert/feature
- Preferences RA => ordering ρ => L(ρ, F) ≤ DISAGREE(ρ, RA)/|F| + L(RA, F)
- Need to find the ordering ρ that minimizes DISAGREE(ρ, RA)
- General case: NP-complete

A Greedy Ordering Algorithm

- Use a weighted graph to represent the preferences R
- For each node, compute the potential value, i.e., outgoing_weights − incoming_weights
- Rank the node with the highest potential value above all others
- Remove this node and its edges; repeat
- Guarantees at least half of the optimal agreement
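The four steps above can be sketched directly (the edge-dictionary representation is an illustrative choice):

```python
def greedy_order(nodes, edges):
    """Greedy ordering: repeatedly rank the node whose potential
    (total outgoing weight minus total incoming weight) is largest.
    edges maps (u, v) -> weight, meaning u is preferred to v."""
    remaining = set(nodes)
    active = dict(edges)
    order = []
    while remaining:
        pot = {v: 0.0 for v in remaining}
        for (u, v), w in active.items():
            pot[u] += w
            pot[v] -= w
        best = max(remaining, key=lambda v: pot[v])
        order.append(best)
        remaining.remove(best)
        active = {(u, v): w for (u, v), w in active.items()
                  if best not in (u, v)}
    return order

# a beats b and c strongly; b beats c weakly
edges = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 0.5}
print(greedy_order(["a", "b", "c"], edges))  # ['a', 'b', 'c']
```

Because it only needs potentials, the algorithm also returns a sensible order when the preferences contain cycles, where finding the exact optimum is NP-complete.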

Improvement

- Identify all the strongly connected components
- Rank the components consistently with the edges between them
- Rank the nodes within a component using the basic greedy algorithm

Evaluation of Ordering Algorithms

- Measure: “weight coverage”
- Datasets = randomly generated small graphs
- Observations
- The basic greedy algorithm works better than a random permutation baseline
- Improved version is generally better, but the improvement is insignificant for large graphs

Metasearch Experiments

- Task: known-item search
- Search for an ML researcher’s homepage
- Search for a university homepage
- Search expert = a variant of the query
- Learn to merge the results of all search experts
- Feedback
- Complete: the known item is preferred to all others
- Click data: the known item is preferred to all items ranked above it
- Leave-one-out testing

Metasearch Results

- Measures: compare the combined preferences with each individual ranking function
- Sign test: which system tends to rank the known relevant item higher
- Number of queries with the known relevant item ranked above rank k
- Average rank of the known relevant item
- The learned system beats each individual expert on all measures (not surprising; why?)

Direct Learning of an Ordering Function

- Each expert is treated as a ranking feature fi: O → R ∪ {⊥} (⊥ = unranked, allowing partial rankings)
- Given preference feedback Φ: X × X → R
- Goal: learn H that minimizes the ranking loss
- D(x0,x1): a distribution over X × X (actually a uniform distribution over the pairs ordered by the feedback); D(x0,x1) = c · max{0, Φ(x0,x1)}

The RankBoost Algorithm

- Iterative updating of D(x0,x1)
- Initialization: D1 = D
- For t = 1, …, T:
- Train a weak learner using Dt
- Get weak hypothesis ht: X → R
- Choose αt > 0
- Update: Dt+1(x0,x1) = Dt(x0,x1) · exp(αt(ht(x0) − ht(x1))) / Zt
- Final hypothesis: H(x) = Σt αt ht(x)
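A compact sketch of the loop, using 0/1 threshold weak hypotheses; the toy feature, the feedback pairs, and the exhaustive threshold search are illustrative assumptions, not the paper's experimental setup:

```python
import math

def rankboost(X, features, pairs, T=10):
    """Minimal RankBoost. X: items; features: ranking features f: X -> R;
    pairs: list of (x0, x1) meaning x1 should be ranked above x0.
    Weak hypotheses are thresholds h(x) = 1 if f(x) > theta else 0."""
    D = {p: 1.0 / len(pairs) for p in pairs}
    alphas, hyps = [], []
    for _ in range(T):
        # pick the (feature, threshold) maximizing |r|,
        # where r = sum over pairs of D(x0,x1) * (h(x1) - h(x0))
        best_r, best_h = 0.0, None
        for f in features:
            for theta in sorted({f(x) for x in X}):
                h = lambda x, f=f, t=theta: 1.0 if f(x) > t else 0.0
                r = sum(D[p] * (h(p[1]) - h(p[0])) for p in pairs)
                if abs(r) > abs(best_r):
                    best_r, best_h = r, h
        if best_h is None:
            break
        r = max(min(best_r, 0.999), -0.999)  # clamp to keep alpha finite
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        # reweight: pairs the hypothesis mis-orders gain weight
        for (x0, x1) in pairs:
            D[(x0, x1)] *= math.exp(alpha * (best_h(x0) - best_h(x1)))
        z = sum(D.values())
        for p in D:
            D[p] /= z
        alphas.append(alpha)
        hyps.append(best_h)
    return lambda x: sum(a * h(x) for a, h in zip(alphas, hyps))

X = [0, 1, 2, 3]
pairs = [(0, 1), (1, 2), (0, 2), (2, 3)]  # larger items are preferred
H = rankboost(X, [lambda x: float(x)], pairs)
print([H(x) for x in X])
```

The final H(x) = Σt αt ht(x) orders the toy items exactly as the feedback pairs demand.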

How to Choose αt and Design ht?

- The ranking loss is bounded by Πt Zt
- Thus, choose the αt that minimizes Zt
- Three approaches:
- Numerical search
- Special case: h is either 0 or 1
- Approximate Z, then find an analytic solution

Efficient RankBoost for Bipartite Feedback

- Bipartite feedback: every feedback pair prefers an item in X1 to an item in X0, so the task is essentially binary classification
- Complexity at each round: reduced from O(|X0|·|X1|) to O(|X0| + |X1|)

Evaluation of RankBoost

- Meta-search: same setup as in (Cohen et al. 99)
- Perfect feedback
- 4-fold cross-validation

Summary

- CF is “easy”
- The user’s expectation is low
- Any recommendation is better than none
- Making it practically useful
- CF is “hard”
- Data sparseness
- Scalability
- Domain-dependent

Summary (cont.)

- CF as a Learning Task
- Rating-based formulation
- Learn f: U × O → R
- Algorithms
- Instance-based/memory-based (k-nearest neighbors)
- Model-based (probabilistic clustering)
- Preference-based formulation
- Learn PREF: U × O × O → R
- Algorithms
- General preference combination (Hedge), greedy ordering
- Efficient restricted preference combination (RankBoost)

Summary (cont.)

- Evaluation
- Rating-based methods
- Simple methods seem to be reasonably effective
- Advantage of sophisticated methods seems to be limited
- Preference-based methods
- More effective than rating-based methods according to one evaluation
- Evaluation on meta-search is weak

Research Directions

- Exploiting complete information
- CF + content-based filtering + domain knowledge + user model …
- More “localized” kernels for instance-based methods
- Predicting movies needs different “neighbor users” than predicting books
- Suggestion: use items similar to the target item as features when finding neighbors

Research Directions (cont.)

- Modeling time
- There might be sequential patterns on the items a user purchased (e.g., bread machine -> bread machine mix)
- Probabilistic model of preferences
- Making the preference function a probability function, e.g., P(A>B|U)
- Clustering items and users
- Minimizing preference disagreements

References

- Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). "Learning to Order Things". Journal of AI Research, Volume 10, pages 243-270.
- Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (1999). "An Efficient Boosting Algorithm for Combining Preferences". Machine Learning Journal.
- Breese, J. S., Heckerman, D., and Kadie, C. (1998). "Empirical Analysis of Predictive Algorithms for Collaborative Filtering". In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52.
- Popescul, A. and Ungar, L. H. (2001). "Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments". UAI 2001.
- Good, N., Schafer, J. B., Konstan, J., Borchers, A., Sarwar, B., Herlocker, J., and Riedl, J. (1999). "Combining Collaborative Filtering with Personal Agents for Better Recommendations". Proceedings of AAAI-99, pp. 439-446.
