
Homework




  1. Homework
     • Define a loss function that compares two matrices (say, mean squared error)
     • b = svd(bellcore)
     • b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
     • b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])
     • More generally, for all possible r
       • Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])
       • Compute the loss between bellcore and b.r as a function of r
       • Plot the loss as a function of r
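One possible way to set up the homework is sketched below; it is an illustration under stated assumptions, not the instructor's solution. The helper names `mse` and `rank_r` are hypothetical, and the matrix is rebuilt with `matrix()` so the block runs on its own. Note one R pitfall for r = 1: `diag(x)` with a single number builds an identity matrix rather than a 1 x 1 matrix, so the sketch uses `diag(..., nrow = r)` and `drop = FALSE`.

```r
# Term-by-document counts from the Bellcore example (columns are documents).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  nrow = 12, dimnames = list(terms, docs))

# Hypothetical loss: mean squared error over all matrix entries.
mse <- function(a, b) mean((a - b)^2)

# Hypothetical helper: rank-r reconstruction from an svd() result.
# drop = FALSE and nrow = r keep the r = 1 case a proper matrix product.
rank_r <- function(s, r) {
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], nrow = r) %*%
    t(s$v[, 1:r, drop = FALSE])
}

b <- svd(bellcore)
r_values <- seq_along(b$d)
loss <- sapply(r_values, function(r) mse(bellcore, rank_r(b, r)))

plot(r_values, loss, type = "b", xlab = "r", ylab = "mean squared error")
```

With this loss, the squared error at rank r equals the sum of the discarded squared singular values divided by the number of entries, so the curve can only go down as r grows.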

  2. IR Models
     • Keywords (and Boolean combinations thereof)
     • Vector-Space "Model" (Salton, chap 10.1)
       • Represent the query and the documents as V-dimensional vectors
       • Sort vectors by
     • Probabilistic Retrieval Model (Salton, chap 10.3)
       • Sort documents by

  3. Information Retrieval and Web Search
     Alternative IR models
     Instructor: Rada Mihalcea
     Some of the slides were adopted from a course taught at Cornell University by William Y. Arms

  4. Latent Semantic Indexing
     Objective: Replace indexes that use sets of index terms by indexes that use concepts.
     Approach: Map the term vector space into a lower dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

  5. Deficiencies with Conventional Automatic Indexing
     • Synonymy: Various words and phrases refer to the same concept (lowers recall).
     • Polysemy: Individual words have more than one meaning (lowers precision).
     • Independence: No significance is given to two terms that frequently appear together.
     Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).

  6. Bellcore’s Example
     http://en.wikipedia.org/wiki/Latent_semantic_analysis
     • c1: Human machine interface for Lab ABC computer applications
     • c2: A survey of user opinion of computer system response time
     • c3: The EPS user interface management system
     • c4: System and human system engineering testing of EPS
     • c5: Relation of user-perceived response time to error measurement
     • m1: The generation of random, binary, unordered trees
     • m2: The intersection graph of paths in trees
     • m3: Graph minors IV: Widths of trees and well-quasi-ordering
     • m4: Graph minors: A survey

  7. Term by Document Matrix

  8. "bellcore" <- structure(.Data = c(
         1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
         0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
         1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),
       .Dim = c(12, 9),
       .Dimnames = list(
         c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors"),
         c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")))
     help(dump)
     help(source)

  9. Query Expansion
     Query: Find documents relevant to human computer interaction
     Simple Term Matching: Matches c1, c2, and c4; misses c3 and c5
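The simple term matching described on this slide can be reproduced with a few lines of R. This is a minimal sketch: the variable names (`query_terms`, `in_vocab`, `hits`) are illustrative, the matrix is rebuilt inline so the block is self-contained, and "interaction" is simply dropped because it is not one of the twelve index terms.

```r
# Term-by-document counts from the Bellcore example (columns are documents).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  nrow = 12, dimnames = list(terms, docs))

query_terms <- c("human", "computer", "interaction")

# Keep only the query words that are index terms.
in_vocab <- intersect(query_terms, rownames(bellcore))

# A document matches if it contains at least one query term.
hits <- colSums(bellcore[in_vocab, , drop = FALSE]) > 0
names(which(hits))   # "c1" "c2" "c4"
```

Documents c3 and c5 share no index term with the query, which is why plain matching misses them even though they are about the same topic.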

  10. Large Correlations

  11. Correlations: Too Large to Ignore

  12. How to compute correlations

          round(100 * cor(bellcore))
              c1  c2  c3  c4  c5  m1  m2  m3  m4
          c1 100 -19   0   0 -33 -17 -26 -33 -33
          c2 -19 100   0   0  58 -30 -45 -58 -19
          c3   0   0 100  47   0 -21 -32 -41 -41
          c4   0   0  47 100 -31 -16 -24 -31 -31
          c5 -33  58   0 -31 100 -17 -26 -33 -33
          m1 -17 -30 -21 -16 -17 100  67  52 -17
          m2 -26 -45 -32 -24 -26  67 100  77  26
          m3 -33 -58 -41 -31 -33  52  77 100  56
          m4 -33 -19 -41 -31 -33 -17  26  56 100

          round(100 * cor(t(bellcore)))
                    human interface computer user system response time  EPS survey trees graph minors
          human       100        36       36  -38     43      -29  -29   36    -29   -38   -38    -29
          interface    36       100       36   19      4      -29  -29   36    -29   -38   -38    -29
          computer     36        36      100   19      4       36   36  -29     36   -38   -38    -29
          user        -38        19       19  100     23       76   76   19     19   -50   -50    -38
          system       43         4        4   23    100        4    4   82      4   -46   -46    -35
          response    -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
          time        -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
          EPS          36        36      -29   19     82      -29  -29  100    -29   -38   -38    -29
          survey      -29       -29       36   19      4       36   36  -29    100   -38    19     36
          trees       -38       -38      -38  -50    -46      -38  -38  -38    -38   100    50     19
          graph       -38       -38      -38  -50    -46      -38  -38  -38     19    50   100     76
          minors      -29       -29      -29  -38    -35      -29  -29  -29     36    19    76    100

  13. plot(hclust(as.dist(-cor(t(bellcore)))))

  14. plot(hclust(as.dist(-cor(bellcore))))

  15. Correcting for Large Correlations

  16. Thesaurus

  17. Term by Doc Matrix: Before & After Thesaurus

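  18. Singular Value Decomposition (SVD)
      $X = U D V^T$, where $X$ is $t \times d$, $U$ is $t \times m$, $D$ is $m \times m$, and $V^T$ is $m \times d$
      • $m$ is the rank of $X$, with $m \le \min(t, d)$
      • $D$ is diagonal
      • The diagonal entries of $D^2$ are the nonzero eigenvalues of $X X^T$ and $X^T X$ (sorted in descending order)
      • $U^T U = I$ and $V^T V = I$ (the columns of $U$ and of $V$ are orthonormal)
      • Columns of $U$ are eigenvectors of $X X^T$
      • Columns of $V$ are eigenvectors of $X^T X$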

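  19. • $m$ is the rank of $X$, with $m \le \min(t, d)$
      • $D$ is diagonal
      • The diagonal entries of $D^2$ are the nonzero eigenvalues of $X X^T$ and $X^T X$ (sorted in descending order)
      • $U^T U = I$ and $V^T V = I$ (the columns of $U$ and of $V$ are orthonormal)
      • Columns of $U$ are eigenvectors of $X X^T$
      • Columns of $V$ are eigenvectors of $X^T X$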
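These properties can be checked numerically on the Bellcore matrix. The sketch below is only a sanity check, with the matrix rebuilt inline so that it runs on its own. One caveat worth stating: R's `svd()` returns the thin decomposition (here U is 12 x 9), so orthonormality shows up as `t(U) %*% U = I`, while `U %*% t(U)` is a 12 x 12 projection and not the identity.

```r
# Term-by-document counts from the Bellcore example (columns are documents).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  nrow = 12, dimnames = list(terms, docs))

b <- svd(bellcore)
dim(b$u)      # 12 x 9: one row per term
length(b$d)   # 9 singular values, in descending order
dim(b$v)      # 9 x 9: one row per document

# Columns of U and V are orthonormal: both deviations should be near zero.
u_dev <- max(abs(crossprod(b$u) - diag(9)))
v_dev <- max(abs(crossprod(b$v) - diag(9)))

# Squared singular values match the eigenvalues of X^T X.
ev <- eigen(crossprod(bellcore), symmetric = TRUE)$values
all.equal(b$d^2, ev)

# U %*% t(U) is 12 x 12 and is not the identity for the thin SVD.
uut_dev <- max(abs(tcrossprod(b$u) - diag(12)))
```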

  20. Dimensionality Reduction
      $\hat{X} = U D V^T$, where $\hat{X}$ is $t \times d$, $U$ is $t \times k$, $D$ is $k \times k$, and $V^T$ is $k \times d$
      k is the number of latent concepts (typically 300 to 500)

  21. Dimension Reduction in R

          b = svd(bellcore)
          b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
          dimnames(b2) = dimnames(bellcore)
          par(mfrow=c(2,2))
          plot(hclust(as.dist(-cor(bellcore))))
          plot(hclust(as.dist(-cor(t(bellcore)))))
          plot(hclust(as.dist(-cor(b2))))
          plot(hclust(as.dist(-cor(t(b2)))))

  22. SVD
      $B B^T = U D^2 U^T$
      $B^T B = V D^2 V^T$
      Diagram labels: Doc, Term, Latent
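The two identities on this slide can be confirmed directly in R, taking B to be the bellcore matrix. This is a small illustrative check, with the matrix rebuilt inline and the names `term_term` and `doc_doc` chosen here for readability.

```r
# Term-by-document counts from the Bellcore example (columns are documents).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  nrow = 12, dimnames = list(terms, docs))

b <- svd(bellcore)
D2 <- diag(b$d^2)

term_term <- bellcore %*% t(bellcore)   # 12 x 12, term-by-term products
doc_doc   <- t(bellcore) %*% bellcore   # 9 x 9, document-by-document products

# Both comparisons should report TRUE up to floating-point tolerance.
all.equal(term_term, b$u %*% D2 %*% t(b$u), check.attributes = FALSE)
all.equal(doc_doc,   b$v %*% D2 %*% t(b$v), check.attributes = FALSE)
```

The term side and the document side share the same D, which is what lets terms and documents be placed in one latent space.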

  23. Dimension Reduction → Block Structure

          round(100*cor(bellcore))
              c1  c2  c3  c4  c5  m1  m2  m3  m4
          c1 100 -19   0   0 -33 -17 -26 -33 -33
          c2 -19 100   0   0  58 -30 -45 -58 -19
          c3   0   0 100  47   0 -21 -32 -41 -41
          c4   0   0  47 100 -31 -16 -24 -31 -31
          c5 -33  58   0 -31 100 -17 -26 -33 -33
          m1 -17 -30 -21 -16 -17 100  67  52 -17
          m2 -26 -45 -32 -24 -26  67 100  77  26
          m3 -33 -58 -41 -31 -33  52  77 100  56
          m4 -33 -19 -41 -31 -33 -17  26  56 100

          round(100*cor(b2))
              c1  c2  c3  c4  c5  m1  m2  m3  m4
          c1 100  91 100 100  84 -86 -85 -85 -81
          c2  91 100  91  88  99 -57 -56 -56 -50
          c3 100  91 100 100  84 -86 -85 -85 -81
          c4 100  88 100 100  81 -89 -88 -88 -84
          c5  84  99  84  81 100 -44 -44 -43 -37
          m1 -86 -57 -86 -89 -44 100 100 100 100
          m2 -85 -56 -85 -88 -44 100 100 100 100
          m3 -85 -56 -85 -88 -43 100 100 100 100
          m4 -81 -50 -81 -84 -37 100 100 100 100
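The block structure in the document correlations can be summarized with three averages: correlations within the c documents, within the m documents, and across the two groups. The sketch below is one illustrative way to do that; `pairs_mean` and `block_summary` are hypothetical helpers, and the matrix is rebuilt inline so the block runs on its own.

```r
# Term-by-document counts from the Bellcore example (columns are documents).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  nrow = 12, dimnames = list(terms, docs))

# Rank-2 reconstruction, as on the "Dimension Reduction in R" slide.
b <- svd(bellcore)
b2 <- b$u[, 1:2] %*% diag(b$d[1:2]) %*% t(b$v[, 1:2])
dimnames(b2) <- dimnames(bellcore)

c_docs <- paste0("c", 1:5)
m_docs <- paste0("m", 1:4)

# Mean correlation over distinct pairs inside one group of documents.
pairs_mean <- function(r, ids) {
  sub <- r[ids, ids]
  mean(sub[upper.tri(sub)])
}

# Within-group and cross-group mean document correlations for a matrix x.
block_summary <- function(x) {
  r <- cor(x)
  c(within_c = pairs_mean(r, c_docs),
    within_m = pairs_mean(r, m_docs),
    cross    = mean(r[c_docs, m_docs]))
}

summary_table <- rbind(original = block_summary(bellcore),
                       rank2    = block_summary(b2))
round(summary_table, 2)
```

After the rank-2 reduction the within-group averages rise sharply and the cross-group average becomes strongly negative, which is the block pattern visible in the tables.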

  24. Dimension Reduction → Block Structure

          round(100*cor(t(bellcore)))
                    human interface computer user system response time  EPS survey trees graph minors
          human       100        36       36  -38     43      -29  -29   36    -29   -38   -38    -29
          interface    36       100       36   19      4      -29  -29   36    -29   -38   -38    -29
          computer     36        36      100   19      4       36   36  -29     36   -38   -38    -29
          user        -38        19       19  100     23       76   76   19     19   -50   -50    -38
          system       43         4        4   23    100        4    4   82      4   -46   -46    -35
          response    -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
          time        -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
          EPS          36        36      -29   19     82      -29  -29  100    -29   -38   -38    -29
          survey      -29       -29       36   19      4       36   36  -29    100   -38    19     36
          trees       -38       -38      -38  -50    -46      -38  -38  -38    -38   100    50     19
          graph       -38       -38      -38  -50    -46      -38  -38  -38     19    50   100     76
          minors      -29       -29      -29  -38    -35      -29  -29  -29     36    19    76    100

          round(100*cor(t(b2)))
                    human interface computer user system response time  EPS survey trees graph minors
          human       100       100       93   94     99       82   82  100    -12   -85   -84    -83
          interface   100       100       95   96    100       85   85  100     -7   -82   -80    -80
          computer     93        95      100  100     96       98   98   93     26   -59   -57    -56
          user         94        96      100  100     97       97   97   94     23   -62   -60    -59
          system       99       100       96   97    100       88   88  100     -2   -79   -78    -77
          response     82        85       98   97     88      100  100   83     46   -40   -38    -37
          time         82        85       98   97     88      100  100   83     46   -40   -38    -37
          EPS         100       100       93   94    100       83   83  100    -11   -84   -83    -82
          survey      -12        -7       26   23     -2       46   46  -11    100    63    65     66
          trees       -85       -82      -59  -62    -79      -40  -40  -84     63   100   100    100
          graph       -84       -80      -57  -60    -78      -38  -38  -83     65   100   100    100
          minors      -83       -80      -56  -59    -77      -37  -37  -82     66   100   100    100

  25. The term vector space
      The space has as many dimensions as there are terms in the word list.
      Figure labels: term axes t1, t2, t3; documents d1, d2

  26. Latent concept vector space
      Figure legend: • term, document, query; --- cosine > 0.9
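To make the latent-space picture concrete, the sketch below places the query "human computer interaction" in a 2-dimensional latent space and ranks documents by cosine. Several choices here are assumptions rather than something the slides specify: k = 2, document coordinates taken as the rows of V, and the query folded in as D^-1 U^T q (the convention from the original LSI literature). Other scalings of the coordinates are also in use and give somewhat different cosines.

```r
# Term-by-document counts from the Bellcore example (columns are documents).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),  # m4
  nrow = 12, dimnames = list(terms, docs))

k <- 2                                   # assumed number of latent concepts
b <- svd(bellcore)
U <- b$u[, 1:k]
D <- b$d[1:k]
V <- b$v[, 1:k]
rownames(V) <- colnames(bellcore)        # one latent-space row per document

# Query vector over the index terms ("interaction" is not an index term).
q <- as.numeric(rownames(bellcore) %in% c("human", "computer"))

# Assumed fold-in convention: q_hat = D^-1 U^T q.
q_hat <- as.vector(crossprod(U, q)) / D

cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
sims <- apply(V, 1, cosine, y = q_hat)
round(sort(sims, decreasing = TRUE), 2)
```

Under these assumptions the c documents, including c3 and c5 that share no term with the query, should score well above the m documents.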
