This outline delves into the concepts of Latent Semantic Indexing (LSI) and Singular Value Decomposition (SVD) for effective text analysis and information retrieval. It discusses the methodology for mapping documents and terms to latent concepts, facilitating intelligent information filtering based on user-specified interests. By breaking down text into manageable concepts, LSI aids in document retrieval even in the absence of exact keywords. The structure includes motivations for SVD, its properties, and case studies demonstrating its applications in text compression and dimension reduction.
Multimedia Databases: LSI and SVD
Text - Detailed outline
• text
• problem
• full text scanning
• inversion
• signature files
• clustering
• information filtering and LSI
Information Filtering + LSI
• [Foltz+,'92] Goal:
  • users specify interests (= keywords)
  • system alerts them on suitable news documents
• Major contribution: LSI = Latent Semantic Indexing
  • latent ('hidden') concepts
Information Filtering + LSI
Main idea:
• map each document into some 'concepts'
• map each term into some 'concepts'
'Concept': a set of terms, with weights, e.g.
• "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept
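As a minimal illustration (not from the original deck; the names and numbers below are hypothetical), a 'concept' can be stored as a weighted term set and a document scored against it:

```python
# Hypothetical sketch: a 'concept' as a set of terms with weights.
dbms_concept = {"data": 0.8, "system": 0.5, "retrieval": 0.6}

# A document's affinity to the concept: weighted sum of its term counts.
doc_term_counts = {"data": 3, "retrieval": 1, "brain": 2}
score = sum(w * doc_term_counts.get(term, 0) for term, w in dbms_concept.items())
print(score)  # 3 * 0.8 + 0 * 0.5 + 1 * 0.6 = 3.0
```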
Information Filtering + LSI Pictorially: term-document matrix (BEFORE)
Information Filtering + LSI
Pictorially: concept-document matrix ... and concept-term matrix [figures]
Information Filtering + LSI
Q: How to search, e.g., for 'system'?
Information Filtering + LSI
A: find the corresponding concept(s), and then the corresponding documents
Information Filtering + LSI
Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON'T contain the term 'system', but do contain almost everything else ('data', 'retrieval')
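A minimal sketch of this concept-space retrieval, assuming a tiny made-up term-document matrix and using numpy; none of the numbers come from the original figures:

```python
import numpy as np

# Hypothetical term-document matrix A: rows = terms, columns = documents.
terms = ["data", "system", "retrieval"]
A = np.array([[1.0, 2.0, 0.0],
              [1.0, 2.0, 0.0],
              [1.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1  # keep only the strongest concept

# Map the one-term query 'system' into concept space: U_k^T q.
q = np.array([0.0, 1.0, 0.0])
q_concepts = U[:, :k].T @ q

# Documents in concept space: rows of V_k Λ_k.
doc_concepts = Vt[:k, :].T * s[:k]

# Rank documents by similarity to the query in concept space.
scores = doc_concepts @ q_concepts
print(scores)  # doc 3 lacks 'system' yet scores > 0, via co-occurring terms
```

Note that keeping all r concepts would reduce the scores to exact keyword matching; the thesaurus effect comes from truncating to k < r concepts.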
SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
• Complexity
• Case studies
• Additional properties
SVD - Motivation
• problem #1: text - LSI: find 'concepts'
• problem #2: compression / dimensionality reduction
Problem - specs
• ~10^6 rows; ~10^3 columns; no updates
• random access to any cell(s); small error: OK
SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
• Complexity
• Case studies
• Additional properties
SVD - Definition
A[n × m] = U[n × r] Λ[r × r] (V[m × r])^T
• A: n × m matrix (e.g., n documents, m terms)
• U: n × r matrix (n documents, r concepts)
• Λ: r × r diagonal matrix (strength of each 'concept') (r: rank of the matrix)
• V: m × r matrix (m terms, r concepts)
SVD - Definition
• A = U Λ V^T - example: [figure: a numerical instance of the decomposition]
SVD - Properties
THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where
• U, Λ, V: unique (*)
• U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other)
  • U^T U = I; V^T V = I (I: identity matrix)
• Λ: diagonal entries (the singular values) are positive and sorted in decreasing order
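These properties are easy to verify numerically; a sketch with numpy on an arbitrary small full-rank matrix:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = len(s)

# Column-orthonormality: U^T U = I and V^T V = I.
assert np.allclose(U.T @ U, np.eye(r))
assert np.allclose(Vt @ Vt.T, np.eye(r))

# Reconstruction: A = U Λ V^T, singular values sorted in decreasing order.
assert np.allclose(U @ np.diag(s) @ Vt, A)
print(s)  # positive, decreasing
```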
SVD - Example
• A = U Λ V^T - example: a term-document matrix over the terms data, inf., retrieval, brain, lung; one group of documents is about CS, the other about MD [figure: the numerical matrices]
• the two concepts that emerge are a CS-concept and an MD-concept
• U is the doc-to-concept similarity matrix
• the diagonal of Λ gives the 'strength' of each concept (e.g., of the CS-concept)
• V is the term-to-concept similarity matrix
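The actual numbers in the slide's figure are lost in this transcript; the sketch below uses a made-up matrix with the same block structure (CS documents over data/inf./retrieval, MD documents over brain/lung) to show the roles of U, Λ, and V:

```python
import numpy as np

# Hypothetical document-term matrix: rows = documents (CS first, then MD),
# columns = terms [data, inf., retrieval, brain, lung].
A = np.array([[1, 1, 1, 0, 0],   # CS documents
              [2, 2, 2, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],   # MD documents
              [0, 0, 0, 3, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
np.set_printoptions(precision=2, suppress=True)

print(U[:, :2])   # doc-to-concept: CS docs load on concept 1, MD docs on concept 2
print(s[:2])      # 'strength' of the CS-concept and the MD-concept
print(Vt[:2, :])  # term-to-concept: data/inf./retrieval vs. brain/lung
```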
SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
• Complexity
• Case studies
• Additional properties
SVD - Interpretation #1
'documents', 'terms' and 'concepts':
• U: document-to-concept similarity matrix
• V: term-to-concept similarity matrix
• Λ: its diagonal elements give the 'strength' of each concept
SVD - Interpretation #2
• best axis to project on ('best' = minimum sum of squares of projection errors, i.e., minimum RMS error)
• SVD gives this best axis, v1 [figure: 2-d point cloud with the axis v1]
SVD - Interpretation #2
• A = U Λ V^T - example: v1 is the best projection axis [figure]
• the corresponding singular value captures the variance ('spread') of the points on the v1 axis
• U Λ gives the coordinates of the points on the projection axes
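A sketch of this reading of U Λ, on made-up 2-d points:

```python
import numpy as np

# Made-up 2-d points, one row per point.
X = np.array([[1.0, 1.1],
              [2.0, 1.9],
              [3.0, 3.2],
              [4.0, 3.9]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of V^T are the projection axes; v1 minimizes the sum of
# squared projection errors.
v1 = Vt[0]

# U Λ gives the coordinates of the points on those axes.
coords = U * s          # identical to X @ Vt.T
print(coords[:, 0])     # coordinates of the points along v1
```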
SVD - Interpretation #2
• More details - Q: how exactly is dimensionality reduction done?
• A: set the smallest singular values to zero
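A sketch of the truncation, reusing the block-structured example matrix from above:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Zero out all but the k largest singular values, then reconstruct.
k = 1
s_trunc = s.copy()
s_trunc[k:] = 0.0
A_k = U @ np.diag(s_trunc) @ Vt

# The reconstruction error equals the 'energy' of the dropped values.
err = np.linalg.norm(A - A_k)           # Frobenius norm of the residual
print(err, np.sqrt(np.sum(s[k:]**2)))   # the two numbers agree
```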
SVD - Interpretation #2
[figures: the product U Λ V^T with progressively more of the smallest singular values zeroed out, giving successively lower-rank approximations of A]
SVD - Interpretation #2
Equivalent: 'spectral decomposition' of the matrix:
A = λ1 u1 v1^T + λ2 u2 v2^T + ... (r terms)
where each ui is an n × 1 column vector and each vi^T is a 1 × m row vector.
Approximation / dimensionality reduction: keep only the first few terms (Q: how many?), assuming λ1 >= λ2 >= ...
A heuristic [Fukunaga]: keep 80-90% of the 'energy' (= sum of squares of the λi's)
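A sketch of the energy heuristic on the same example matrix:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3]], dtype=float)

_, s, _ = np.linalg.svd(A, full_matrices=False)

# Smallest k whose leading singular values retain >= 90% of the energy
# (energy = sum of squares of the singular values).
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.90)) + 1
print(k, energy)  # k = 2: the CS-concept alone holds only ~78% of the energy
```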
SVD - Interpretation #3
• finds the non-zero 'blobs' in a data matrix [figure]