This paper investigates iterative scaling to improve the precision of inter-document similarity measurement, addressing a weakness of Singular Value Decomposition (SVD) in the presence of outlier documents. The algorithm creates basis vectors by repeatedly rescaling residual document vectors, and uses a probabilistic model to choose the number of dimensions, reducing dimensionality while preserving document information. Experimental results show a significant increase in precision over baseline algorithms.
Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement Rie Kubota Ando. Latent semantic space: iterative scaling improves precision of inter-document similarity measurement. In the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000. Presenter: 游斯涵
Introduction • Some studies apply modified or generalized SVD to improve the precision of similarities: • SDD (semi-discrete decomposition), by T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI. • R-SVD (Riemannian SVD), by E. P. Jiang and M. W. Berry: allows user feedback to be integrated into LSI models. • Theoretical treatments of LSI: MDS (multidimensional scaling), Bayesian regression models, probabilistic models.
Introduction • The problem with SVD: • The topics underlying outlier documents tend to be lost as we choose a lower number of dimensions. • The noise removed by dimensional reduction comes from two sources: • outlier documents • minor terms • The position of this paper: • Do not treat outlier documents as "noise"; assume all documents are equally important. • Try to eliminate the noise from minor terms without eliminating the influence of the outlier documents. • Outlier documents: documents very different from the other documents.
Comparison with SVD • Same: • Both try to find a smaller set of basis vectors for a reduced space. • Different: • The new algorithm scales the length of each residual vector. • It treats documents and terms in a nonsymmetrical way.
Algorithm: basis vector creation • Input: m*n term-document matrix D (rows: terms, columns: documents), scaling factor q • Output: basis vectors b_1, b_2, ...
R_1 = D
For ( i = 1; until reaching some criterion; i = i + 1 )
    scale each residual document vector r_j of R_i by its length to the power q: r_j <- |r_j|^q * r_j
    b_i = the first unit eigenvector of the m*m matrix R_i R_i^T
    R_{i+1} = R_i - b_i b_i^T R_i (remove the direction just captured from the residuals)
End for
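To make the loop concrete, here is a minimal numpy sketch under my reading of the slide; the function name make_basis, the fixed iteration count, and the exact placement of the |r_j|^q scaling are assumptions, not taken from the paper.

```python
import numpy as np

def make_basis(D, q, n_basis):
    """Iteratively extract basis vectors from a term-document matrix D (m x n).

    Assumption: each residual document vector r_j is rescaled by |r_j|**q
    before the dominant eigenvector of the rescaled R R^T is taken.
    """
    R = D.astype(float).copy()
    basis = []
    for _ in range(n_basis):                  # stand-in for "until reaching some criterion"
        lengths = np.linalg.norm(R, axis=0)   # |r_j| for each document column
        R_scaled = R * (lengths ** q)         # scale column j by |r_j|^q
        _, vecs = np.linalg.eigh(R_scaled @ R_scaled.T)
        b = vecs[:, -1]                       # first unit eigenvector (largest eigenvalue)
        basis.append(b)
        R = R - np.outer(b, b @ R)            # remove the captured direction from residuals
    return np.column_stack(basis)             # m x n_basis matrix B
```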
Algorithm: basis vector creation • [Slide figure: matrix shapes. D and each residual R_i are m*n (m terms by n documents); R_i R_i^T is m*m; each basis vector b_i has length m.]
Algorithm: document vector creation • Dimension reduction: project the documents onto the first k basis vectors, D_hat = B_k^T D, where B_k = [b_1 ... b_k] is m*k and D_hat is k*n. • There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions); a sketch of the projection follows.
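A one-line sketch of the projection, continuing the hypothetical make_basis above (reduce_documents is an illustrative name):

```python
def reduce_documents(B, D, k):
    """Project documents onto the first k basis vectors: returns the k x n matrix B_k^T D."""
    return B[:, :k].T @ D
```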
Example • [Slide figures: a small worked example of finding the first unit eigenvector of R R^T; the numbers did not survive extraction. An illustrative stand-in follows.]
Probabilistic model • The document vectors in the reduced k-dimensional space are assumed to follow a Gaussian distribution, i.e., a multivariate normal (MVN) distribution. • This model is used to select the number of dimensions k.
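For reference, the textbook MVN density over a k-dimensional reduced document vector x, with mean mu and covariance Sigma (the slide's own formula did not survive extraction):

```latex
p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}
       \exp\!\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)
```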
Probabilistic model • The log-likelihood for the document vectors reduced to dimension k is computed following Ding's probabilistic model of LSI. • The dimension k is chosen to maximize this log-likelihood. • The model's normalization term can be neglected because it changes only slowly with k.
Parameters • q (scaling factor): set from 1 to 10, in increments of 1. • k (number of dimensions): selected by maximizing the log-likelihood, as shown in the sketch below.
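A hedged sketch of this parameter sweep, reusing the hypothetical make_basis and reduce_documents from earlier; log_likelihood stands in for Ding's model, whose exact form the slide lost:

```python
def select_k_per_q(D, log_likelihood, k_max):
    """For each scaling factor q in 1..10, pick the k maximizing the log-likelihood."""
    chosen = {}
    for q in range(1, 11):                        # q = 1, 2, ..., 10
        B = make_basis(D, q, k_max)
        chosen[q] = max(range(1, k_max + 1),
                        key=lambda k: log_likelihood(reduce_documents(B, D, k)))
    return chosen                                 # {q: best k}
```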
Experiment • Test data: TREC collections, 20 topics; the total number of documents is 684. • The documents are split into two disjoint pools: pool 1 (training data) and pool 2 (test data), each containing 15 document sets. • Each set contains between 31 and 126 documents and between 6 and 20 topics.
Baseline algorithms • Three algorithms are compared: • SVD, taking the left singular vectors as the basis vectors • The raw term-document matrix without any basis conversion (term frequency) • The algorithm proposed in this paper
Evaluation • Assumption: similarity should be higher for any document pair relevant to the same topic (an intra-topic pair) than for other pairs.
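One natural reading of this assumption as a measurement, sketched below; the paper's exact precision measure may differ, and pair_precision and top_fraction are my names:

```python
import numpy as np
from itertools import combinations

def pair_precision(vectors, topics, top_fraction=0.1):
    """Fraction of intra-topic pairs among the most similar document pairs.

    vectors: k x n matrix of reduced document vectors; topics: n topic labels.
    """
    n = vectors.shape[1]
    unit = vectors / np.linalg.norm(vectors, axis=0)   # cosine via unit columns
    pairs = sorted(((unit[:, i] @ unit[:, j], topics[i] == topics[j])
                    for i, j in combinations(range(n), 2)), reverse=True)
    top = pairs[: max(1, int(top_fraction * len(pairs)))]
    return sum(intra for _, intra in top) / len(top)
```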
Evaluation • Preservation rate (of document vector length): the proportion of a document vector's length that survives dimension reduction. • Reduction rate (the larger, the better): 1 - preservation rate. • Dimensional reduction rate (the larger, the better): 1 - (# of dimensions / max # of dimensions).
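A minimal sketch of these rates, assuming the preservation rate is the average per-document length ratio (the slide does not show how the per-document values are aggregated):

```python
import numpy as np

def preservation_rate(D, D_k):
    """Average ratio of reduced (B_k^T D) to original document vector length."""
    return float(np.mean(np.linalg.norm(D_k, axis=0) / np.linalg.norm(D, axis=0)))

def reduction_rate(D, D_k):
    return 1.0 - preservation_rate(D, D_k)   # larger is better

def dimensional_reduction_rate(k, k_max):
    return 1.0 - k / k_max                   # larger is better
```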
Dimension selection • Log-likelihood method: choose the dimension that maximizes the log-likelihood. • Training-based method: choose the dimension that makes the preservation rate closest to the average preservation rate observed on the training data. • Random guess-based method: choose a dimension at random, as a baseline.
Result • [Slide chart] The proposed algorithm's precision of similarity measurement is 17.8% higher than the baselines'.
Result • Dimensional reduction rate: 43% higher than SVD on average. • Reduction rate: this algorithm shows a 35.8% higher reduction rate than SVD.
Conclusion • The proposed algorithm achieved higher precision of similarity measurement (up 17.8%) with a higher dimensional reduction rate (43% higher) than the baseline algorithms. • Future work: the scaling factor q could be made dynamic to further improve performance.