
Information Retrieval through Various Approximate Matrix Decompositions


Presentation Transcript


  1. Information Retrieval through Various Approximate Matrix Decompositions
  Kathryn Linehan
  Advisor: Dr. Dianne O'Leary

  2. Information Retrieval
  • Extracting information from databases
  • We need an efficient way of searching large amounts of data
  • Example: web search engine

  3. Querying a Document Database
  • We want to return documents that are relevant to the entered search terms
  • Given data:
    • Term-document matrix A; entry (i, j): importance of term i in document j
    • Query vector q; entry (i): importance of term i in the query

  4. Term-Document Matrix
  • Entry (i, j): weight of term i in document j
  • Example (taken from [5]): terms Mark, Twain, Samuel, Clemens, Purple, Fairy; documents 1-4 (the matrix entries are shown as a figure in the original slides)

  5. Query Vector
  • Entry (i): weight of term i in the query
  • Example (taken from [5]): a search for "Mark Twain" puts nonzero weight on the terms Mark and Twain and zero weight on Samuel, Clemens, Purple, and Fairy

  6. Document Scoring
  • Score each document by matching the query vector against the corresponding column of the term-document matrix (the scores for documents 1-4 are shown as a figure in the original slides; example taken from [5])
  • Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not (see the sketch below)
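
As a concrete illustration of the scoring step, here is a minimal numpy sketch. The term and document labels come from the example in [5], but the weights below are made up for illustration, since the original entries are only shown as a figure.

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms (Mark, Twain, Samuel,
# Clemens, Purple, Fairy), columns = documents 1-4. Weights are invented
# for illustration only.
A = np.array([
    [1, 0, 1, 0],   # Mark
    [1, 0, 1, 0],   # Twain
    [0, 1, 0, 0],   # Samuel
    [0, 1, 0, 0],   # Clemens
    [0, 0, 0, 1],   # Purple
    [0, 0, 0, 1],   # Fairy
], dtype=float)

# Query vector for the search "Mark Twain".
q = np.array([1, 1, 0, 0, 0, 0], dtype=float)

# Score each document by the inner product of the query with its column.
scores = q @ A
print(scores)  # [2. 0. 2. 0.]: Docs 1 and 3 are returned, Doc 2 is missed
```

With exact term matching, Doc 2 scores zero even though "Samuel Clemens" and "Mark Twain" name the same person; this is the gap that the approximations on the following slides try to close.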

  7. Can we do better if we replace the matrix by an approximation?
  • Singular Value Decomposition (SVD) (a short sketch follows below)
  • Nonnegative Matrix Factorization (NMF)
  • CUR Decomposition
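
For reference, the SVD option replaces A by its best rank-k approximation A_k = U_k S_k V_k^T. This is a minimal numpy sketch, not the implementation used in the project.

```python
import numpy as np

def svd_rank_k(A, k):
    """Best rank-k approximation of A in the Frobenius norm:
    A_k = U_k @ diag(s_k) @ Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```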

  8. Nonnegative Matrix Factorization (NMF)
  • A ≈ WH, where A is m x n, W is m x k, and H is k x n
  • W and H are nonnegative
  • Storage: k(m + n) entries

  9. NMF
  • Multiplicative update algorithm of Lee and Seung, as given in [1]
  • Find W, H >= 0 to minimize ||A - WH||_F^2
  • Random initialization for W, H
  • Gradient descent method
  • Slow due to matrix multiplications in each iteration (see the sketch below)
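
The Lee-Seung multiplicative updates fit in a few lines of numpy. This is a minimal sketch assuming the Frobenius-norm objective above; the fixed iteration count and the small eps guarding against division by zero are illustrative choices, not taken from the slides.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||A - W @ H||_F^2, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))          # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H; stays nonnegative
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W; stays nonnegative
    return W, H
```

Each iteration costs several large matrix multiplications, which is what makes the method slow.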

  10. NMF Validation A: 5 x 3 random dense matrix. Average over 5 runs. B: 500 x 200 random sparse matrix. Average over 5 runs.

  11. NMF Validation B: 500 x 200 random sparse matrix. Rank(NMF) = 80.

  12. CUR Decomposition
  • A ≈ CUR, where A is m x n, C is m x c, U is c x r, and R is r x n
  • C (R) holds c (r) sampled and rescaled columns (rows) of A
  • U is computed using C and R
  • k is a rank parameter
  • Storage: (nz(C) + cr + nz(R)) entries

  13. CUR Implementations
  • CUR algorithm in [3] by Drineas, Kannan, and Mahoney
    • Linear time algorithm
  • Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
  • Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
  • Other modifications: our ideas
  • Deterministic CUR code by G. W. Stewart [2]

  14. Sampling
  • Column (row) norm sampling [3]
    • Prob(col j) = ||A(:, j)||^2 / ||A||_F^2 (similar for row i)
  • Subspace sampling [4]
    • Uses the rank-k SVD of A for column probabilities: Prob(col j) = ||V_k(j, :)||^2 / k
    • Uses the "economy size" SVD of C for row probabilities: Prob(row i) = ||U_C(i, :)||^2 / c
  • Sampling without replacement
  (a sketch of column-norm sampling follows below)
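
A minimal sketch of the column-norm sampling of [3], assuming sampling with replacement and the usual rescaling of each sampled column by 1/sqrt(c * p_j); rows are handled the same way via A.T.

```python
import numpy as np

def sample_columns(A, c, seed=0):
    """Column-norm sampling as in [3]: Prob(col j) = ||A[:, j]||^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    p = np.sum(A**2, axis=0) / np.sum(A**2)      # column probabilities
    idx = rng.choice(A.shape[1], size=c, p=p)    # sample with replacement
    C = A[:, idx] / np.sqrt(c * p[idx])          # rescale sampled columns
    return C, idx
```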

  15. Computation of U
  • Linear U [3]: approximately solves min_U ||A - CUR||_F
  • Optimal U: solves min_U ||A - CUR||_F exactly, giving U = C^+ A R^+ (see the sketch below)
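
The optimal U has a closed form in terms of pseudoinverses, since minimizing ||A - CUR||_F over U is a linear least squares problem:

```python
import numpy as np

def optimal_U(A, C, R):
    """Minimizer of ||A - C @ U @ R||_F over U: U = pinv(C) @ A @ pinv(R)."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
```

Note that the optimal U touches all of A, which is one reason [3] uses a cheaper, approximate construction instead.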

  16. Deterministic CUR
  • Code by G. W. Stewart [2]
  • Uses an RRQR algorithm that does not store Q
    • We only need the permutation vector, which gives us the columns (rows) for C (R)
  • Uses an optimal U

  17. Compact Matrix Decomposition (CMD) Improvement
  • Remove repeated columns (rows) in C (R) (a sketch of the column case follows below)
  • Decreases storage while still achieving the same relative error [6]
  A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
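
One way to realize this deduplication for columns is sketched below: keep a single copy of each distinct column and scale it by the square root of its multiplicity, which preserves C @ C.T. Whether this matches the exact bookkeeping of [6] is an assumption here, and np.unique reorders the columns.

```python
import numpy as np

def dedup_columns(C):
    """Remove repeated columns; scale survivors by sqrt(multiplicity)
    so that the result Cd satisfies Cd @ Cd.T == C @ C.T."""
    cols, counts = np.unique(C.T, axis=0, return_counts=True)  # unique columns
    return (cols * np.sqrt(counts)[:, None]).T
```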

  18. CUR: Sampling with Replacement Validation A: 5 x 3 random dense matrix. Average over 5 runs. Legend: Sampling, U

  19. Sampling without Replacement: Scaling vs. No Scaling
  • Invert the scaling factor applied to the sampled columns (rows) of C (R)

  20. CUR: Sampling without Replacement Validation A: 5 x 3 random dense matrix. Average over 5 runs. B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling

  21. CUR Comparison B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling

  22. Judging Success: Precision and Recall
  • Measures of performance for document retrieval
  • We report average precision and recall, where the average is taken over all queries in the data set
  • Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant
  • Precision: RetRel / Retrieved
  • Recall: RetRel / Relevant
  (a sketch follows below)
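
A minimal sketch of both measures for a single query; averaging over all queries in the data set is then a simple loop.

```python
def precision_recall(retrieved, relevant):
    """Precision = RetRel / Retrieved, Recall = RetRel / Relevant."""
    retrel = len(set(retrieved) & set(relevant))
    precision = retrel / len(retrieved) if retrieved else 0.0
    recall = retrel / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 documents retrieved, 4 relevant overall, 2 retrieved-and-relevant.
print(precision_recall([1, 2, 3], [2, 3, 5, 8]))  # (0.666..., 0.5)
```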

  23. LSI Results Term-document matrix size: 5831 x 1033. All matrix approximations are rank 100 approximations (CUR: r = c = k). Average query time is less than 10^-3 seconds for all matrix approximations.

  24. LSI Results Term-document matrix size: 5831 x 1033. All matrix approximations are rank 100 approximations (CUR: r = c = k).

  25. Matrix Approximation Results

                  Rel. Error (F-norm)   Storage (nz)   Runtime (sec)
  SVD                    0.8203             686500         22.5664
  NMF                    0.8409             686400         23.0210
  CUR: cn,lin            1.4151              17242          0.1741
  CUR: cn,opt            0.9724              16358          0.2808
  CUR: sub,lin           1.2093              16175         48.7651
  CUR: sub,opt           0.9615              16108         49.0830
  CUR: w/oR,no           0.9931              17932          0.3466
  CUR: w/oR,yes          0.9957              17220          0.2734
  CUR: GWS               0.9437              25020          2.2857
  LTM                      --                52003            --

  26. Conclusions
  • We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD
  • We can achieve LSI results that are almost as good with cheaper approximations
    • Less storage
    • Less computation time

  27. Completed Project Goals
  • Code/validate NMF and CUR
  • Analyze relative error, runtime, and storage of NMF and CUR
  • Improve the CUR algorithm of [3]
  • Analyze the use of NMF and CUR in LSI

  28. References
  [1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
  [2] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-34 CMSC TR-4591, University of Maryland, May 2004.
  [3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
  [4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
  [5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
  [6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.
