
Latent Semantic Indexing





Journal Article Comparison

Al Funk

CS 5604 / Information Retrieval

- Uses similarities between concepts to map documents and determine their proximity in concept space
- “Singular Value Decomposition” (SVD) – a popular statistical method for generating the concept space via dimensionality reduction
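The SVD-based mapping can be sketched in a few lines of numpy. The term-document counts below are invented purely for illustration:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0 and 1 are about cars and overlap only on "engine";
# doc 2 is about cooking. Counts are invented for illustration.
A = np.array([
    [2, 0, 0],   # car
    [0, 2, 0],   # auto
    [1, 1, 0],   # engine
    [0, 0, 2],   # recipe
    [0, 0, 1],   # oven
], dtype=float)

# A rank-k SVD gives the k-dimensional concept space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # each row: a document in concept space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two car documents share only the bridging term "engine", yet they
# land practically on top of each other in concept space, while the
# cooking document stays far away.
sim_car_docs = cos(docs_k[0], docs_k[1])
sim_car_vs_cooking = cos(docs_k[0], docs_k[2])
```

This is the sense in which proximity is measured in concept space rather than by shared keywords.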

- The mapping results from SVD’s spatial analysis of a collection of documents; no human intervention is required to generate it

- Increased relevance of retrieved results, as concepts are recognized rather than bare keywords
- Larger result sets, since texts that do not contain the specific query keywords can still be retrieved
- LSI recognizes that keywords are related, so documents using synonymous terms still match

- Minimal human intervention to generate mappings

- Storage requirements for indexes
- Computation time
In essence, the high dimensionality of document representations can make searching resource-intensive. LSI can reduce these costs, but it also incurs some of its own.

Q: Is there a way to maintain the benefits of LSI and reduce resource requirements?

- Many journal articles focus on mitigating the resource intensiveness of LSI by reducing dimensionality. Two approaches:
- Article 1: Use “random projection” to lower the dimensionality of the concept space while aiming to prevent erosion of vector relationships
- Article 2: Replace SVD with “Semidiscrete Matrix Decomposition,” creating an approximation that reduces dimensionality but still retains the bulk of the relationships

- Traditional methods of dimensionality reduction have focused on means of analyzing datasets that maximize benefit and minimize loss of variation. Two such methods are:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)

- SVD is the primary choice for document retrieval because it performs well with sparse matrices.
- PCA and SVD are both computationally expensive, particularly for large datasets.
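The two methods are in fact closely linked: PCA can be computed via an SVD of the centered data matrix. A small numpy sketch on random, purely illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))   # 100 observations, 10 features

# PCA via SVD of the centered data: the right singular vectors are the
# principal axes, and squared singular values give explained variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = s**2 / (len(X) - 1)

# Cross-check against the eigendecomposition of the covariance matrix
# (sorted descending to match the singular-value ordering).
cov_eigs = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]
```

Both routes produce the same variances, which is why their computational cost profiles are so similar.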

- Random Projection (RP) attempts to solve these problems by creating a random matrix and using it to project the document observation vectors onto a lower dimensional space.
- Random projection can be used before SVD, enabling the expensive algorithm to operate on a matrix of lower dimension.
- Bingham and Mannila’s results indicate that RP has an acceptable impact on the data while significantly reducing required computation.
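The random projection step itself is simple; the sketch below uses invented sizes (the articles’ experiments differ) to show that pairwise distances survive the projection roughly intact:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical collection: 1000-term vocabulary, 20 documents.
# (Sizes are invented for illustration.)
d, n, k = 1000, 20, 200
X = rng.standard_normal((d, n))

# Random projection: multiply by a k x d Gaussian matrix, scaled by
# 1/sqrt(k) so squared distances are preserved in expectation.
R = rng.standard_normal((k, d)) / np.sqrt(k)
Y = R @ X   # documents now live in k dimensions; SVD on Y is far cheaper

# Pairwise distance between two documents, before and after projection.
orig = np.linalg.norm(X[:, 0] - X[:, 1])
proj = np.linalg.norm(Y[:, 0] - Y[:, 1])
ratio = proj / orig   # close to 1 when the projection preserves geometry
```

Running the expensive SVD on the k x n matrix `Y` instead of the d x n matrix `X` is where the savings come from.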

- Kolda and O’Leary propose to replace the expensive SVD algorithm with “Semidiscrete Matrix Decomposition”
- Lower computation time
- Lower storage requirements

- Claim that methodology is as accurate as SVD but less resource intensive

- Defined as the “closest rank-k matrix to the term-document matrix in the Frobenius measure”.
- Essentially, it creates a lower-rank matrix that best approximates the original m x n term-document matrix.
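For context, the “closest rank-k matrix in the Frobenius measure” is, by the Eckart–Young theorem, exactly what a truncated SVD produces. The sketch below (random data, illustrative only) confirms that this approximation error shrinks as k grows:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 30
A = rng.standard_normal((m, n))   # stand-in for an m x n term-document matrix

def rank_k_approx(A, k):
    """Best rank-k approximation in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Keeping more singular values always lowers the Frobenius error.
err_5 = np.linalg.norm(A - rank_k_approx(A, 5), "fro")
err_20 = np.linalg.norm(A - rank_k_approx(A, 20), "fro")
```

SDD pursues the same target with a cheaper, more compact factorization.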

- SDD is an alternative LSI decomposition that achieves the same goals as SVD
- Like SVD, SDD creates a lower-order matrix, but it restricts the entries of its factor vectors to –1, 0, or 1

- As a result of the restriction to these values, SDD is computationally more expensive up front

- Despite higher up-front processing times, updates to the matrix can be made rapidly to accommodate changing collections
- Searching is more efficient (taking as little as half the time)
- Storage requirements are lower, as SDD can store each matrix value in 2 bits (rather than multiple bytes for a floating-point value)
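The 2-bit storage claim is easy to see: three possible values need only two bits each. A toy packing scheme in Python (the articles’ actual encoding may differ):

```python
# SDD factor entries take only the values -1, 0, 1, so each fits in 2 bits
# instead of the 4-8 bytes a floating-point value needs.
# Toy scheme: map -1 -> 0, 0 -> 1, 1 -> 2 and pack four codes per byte.

def pack(values):
    codes = [v + 1 for v in values]          # -1,0,1 -> 0,1,2
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= c << (2 * j)                # four 2-bit slots per byte
        out.append(b)
    return bytes(out)

def unpack(data, n):
    vals = []
    for b in data:
        for j in range(4):
            vals.append(((b >> (2 * j)) & 0b11) - 1)
    return vals[:n]

vec = [1, -1, 0, 0, 1, 1, -1]
packed = pack(vec)   # 7 values fit in 2 bytes instead of 7 floats
```

At scale this is where SDD’s order-of-magnitude storage savings over floating-point SVD factors come from.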

- Both articles demonstrate a quantifiable performance improvement over traditional LSI techniques
- Techniques could potentially be used together, as both tackle the related issues of performance and dimensionality reduction

- Kolda and O’Leary (Semidiscrete Matrix Decomposition): http://doi.acm.org/10.1145/291128.291131
- Bingham and Mannila (Random Projection): http://doi.acm.org/10.1145/502512.502546