
Self Organization of a massive document collection


Presentation Transcript


  1. Self Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, Vol. 11, No. 3, May 2000. Teuvo Kohonen, Helsinki University of Technology, Espoo, Finland. Presented by Chu Huei-Ming, 2004/12/01

  2. Reference • Dmitri G. Roussinov and Hsinchun Chen, “A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation,” Department of MIS, Karl Eller Graduate School of Management, University of Arizona. http://dlist.sir.arizona.edu/archive/00000460/01/A_Scalable-98.htm

  3. Outline • Introduction • The Self-Organizing Map • Statistical Models of Documents • Rapid Construction of Large Document Maps • Conclusion

  4. Introduction • From simple search to browsing of self-organized data collections • Document collections can be automatically organized in some meaningful way • The context of the other related documents residing nearby then helps in understanding the true meaning of each document

  5. Introduction • The SOM can first be computed using any representative subset of the old input data • New input items can then be mapped straight onto the most similar models without re-computation of the whole mapping
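A minimal sketch of that idea (not code from the paper; the array names and shapes are illustrative assumptions): once the model vectors are trained, a new document is placed on the map by a plain nearest-model search, with no retraining.

```python
import numpy as np

def map_new_document(doc_vec, models):
    """Place a new document on a trained SOM without retraining.

    doc_vec: (n_features,) vector describing the new document
    models:  (n_nodes, n_features) trained SOM model vectors
    Returns the index of the most similar model vector.
    """
    # Euclidean distance from the document to every model vector
    dists = np.linalg.norm(models - doc_vec, axis=1)
    return int(np.argmin(dists))
```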

  6. The Self-Organizing Map • Unsupervised-learning neural-network method • Similarity graph of input data • Arranged as a regular, usually two-dimensional grid

  7. SOM algorithm

  8. SOM algorithm [Figure: a two-dimensional map of model vectors M1, M2, …, MI, connected through weights Wi,j to the components N1, N2, …, NJ of the input data vector]

  9. SOM algorithm 1. Initialize input nodes, output nodes, and connection weights • input vector: the top (most frequently occurring) N terms • output nodes: create a two-dimensional map (grid) of M output nodes • initialize the weights wij from the N input nodes to the M output nodes to small random values

  10. SOM algorithm 2. Present each document in order • Describe each document as an input vector of N coordinates 3. Compute the distance dj to all output nodes: dj = Σi (xi(t) − wij(t))²

  11. SOM algorithm 4. Select the winning node j* and update the weights to node j* and its neighbors • Update the weights to node j* and its neighbors to reduce the distances between them and the input vector x(t): wij(t+1) = wij(t) + η(t)·(xi(t) − wij(t)) • η(t) is an error-adjusting coefficient (0 < η(t) < 1) that decreases over time

  12. SOM algorithm • Recursive regression process: mi(t+1) = mi(t) + hc(x),i(t)·[x(t) − mi(t)] • t is the index of the regression step • x(t) is the presentation of a sample of the input x • hc(x),i(t) is the neighborhood function • mc(x)(t) is the model (“winner”) that matches best with x(t)

  13. SOM algorithm • A typical choice is the Gaussian neighborhood hc(x),i(t) = α(t)·exp(−‖ri − rc(x)‖² / 2σ²(t)) • α(t) is the learning-rate factor, which decreases monotonically with the regression steps • σ(t) corresponds to the width of the neighborhood function, which also decreases monotonically with the regression steps
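To make slides 9–13 concrete, here is a minimal sketch of one regression step of the incremental SOM with a Gaussian neighborhood. The linear decay schedules, grid layout, and all names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def som_step(x, models, grid, t, n_steps, lr0=0.5, sigma0=3.0):
    """One incremental SOM update: find the winner for input x,
    then pull the winner and its grid neighbors toward x.

    x:      (n_features,) input vector x(t)
    models: (n_nodes, n_features) model vectors m_i(t), updated in place
    grid:   (n_nodes, 2) 2-D grid coordinates r_i of each node
    """
    # Learning rate alpha(t) and neighborhood width sigma(t),
    # both decreasing monotonically with the regression step t
    alpha = lr0 * (1.0 - t / n_steps)
    sigma = max(sigma0 * (1.0 - t / n_steps), 0.5)

    # Winner c = argmin_i ||x - m_i||
    c = int(np.argmin(np.linalg.norm(models - x, axis=1)))

    # Gaussian neighborhood h_{c,i}(t) centered on the winner's grid position
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))

    # m_i(t+1) = m_i(t) + h_{c,i}(t) * (x(t) - m_i(t))
    models += h[:, None] * (x - models)
    return c
```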

  14. Statistical Models of Documents • The primitive vector-space model • Latent semantic indexing (LSI) • Randomly projected histograms • Histograms on the word-category map
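As a sketch of the random-projection idea from the list above: a high-dimensional word histogram is multiplied by a fixed random matrix to obtain a much shorter document vector that approximately preserves similarities. The dimensions and the normalization step here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_vocab, n_proj = 50_000, 300           # vocabulary size, projected dimension
R = rng.normal(size=(n_proj, n_vocab))  # fixed random projection matrix

def project_histogram(word_hist):
    """Reduce a word-frequency histogram to n_proj dimensions.
    Random projection approximately preserves inner products,
    so document similarities survive the dimensionality reduction."""
    x = R @ word_hist
    return x / (np.linalg.norm(x) + 1e-12)  # normalize for cosine similarity
```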

  15. Rapid Construction of Large Document Maps • Fast distance computation • Estimation of larger maps based on carefully constructed smaller ones • Rapid fine-tuning of the large maps

  16. Estimation of larger maps • Estimate good initial values for the model vectors of a very large map • Based on the asymptotic values of the model vectors of a much smaller map • The superscript d refers to the “dense” lattice • The superscript s refers to the “sparse” lattice • If the three closest “sparse” model vectors do not lie on the same straight line, each “dense” model vector can be expressed as a linear combination of them

  17. Estimation of larger maps Fig. 3. Illustration of the relation of the model vectors in a sparse (s, solid lines) and dense (d, dashed lines) grid. Here a dense model vector md shall be interpolated in terms of the three closest “sparse” models
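A sketch of this interpolation (the function and variable names are assumptions): solve for the coefficients that express the dense node's grid position as a combination of the three sparse nodes' grid positions, then apply the same coefficients to their model vectors.

```python
import numpy as np

def interpolate_dense_model(r_dense, r_sparse3, m_sparse3):
    """Estimate a dense-map model vector from the three closest
    sparse-map models (cf. Fig. 3).

    r_dense:   (2,) grid position of the dense node
    r_sparse3: (3, 2) grid positions of the three closest sparse nodes
               (must not lie on the same straight line)
    m_sparse3: (3, n_features) their model vectors
    """
    # Solve a*r1 + b*r2 + c*r3 = r_dense with a + b + c = 1
    A = np.vstack([r_sparse3.T, np.ones(3)])  # 3x3 system
    rhs = np.append(r_dense, 1.0)
    coeffs = np.linalg.solve(A, rhs)          # barycentric coefficients
    # Apply the same linear combination to the model vectors
    return coeffs @ m_sparse3
```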

  18. Rapid fine-tuning of the large maps • Addressing old winners • In the middle of the training process • The same training input is used again • The new winner is found at or in the vicinity of the old one Fig. 4. Finding the new winner in the vicinity of the old one, whereby the old winner is directly located by a pointer. The pointer is then updated
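A sketch of this shortcut (the grid layout and search radius are illustrative assumptions): keep a pointer from each training input to its previous winner, and on the next pass search for the new winner only in the grid vicinity of that old winner.

```python
import numpy as np

def find_winner_near(x, models, grid, old_winner, radius=2):
    """Search for the new winner only in the grid vicinity of the
    old winner, then use the result as the updated pointer (cf. Fig. 4)."""
    # Candidate nodes: those within `radius` grid steps of the old winner
    near = np.max(np.abs(grid - grid[old_winner]), axis=1) <= radius
    idx = np.where(near)[0]
    dists = np.linalg.norm(models[idx] - x, axis=1)
    return int(idx[np.argmin(dists)])  # becomes the new stored pointer
```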

  19. Performance evaluation of the new methods • Average quantization error • The average distance of each input from the closest model vector • Classification accuracy • The separability of different classes of patents on the resulting map
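A sketch of the first measure (names are assumptions): the average quantization error is the mean distance of each input vector from its closest model vector.

```python
import numpy as np

def avg_quantization_error(X, models):
    """Mean distance from each input to its nearest model vector.

    X:      (n_inputs, n_features) input vectors
    models: (n_nodes, n_features) SOM model vectors
    """
    # (n_inputs, n_nodes) pairwise distances, then the minimum over nodes
    d = np.linalg.norm(X[:, None, :] - models[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```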

  20. Conclusion • The resulting similarity graph is suitable for interactive data-mining and exploration tasks • A new method of forming statistical models of documents • Several new fast computing methods (“shortcuts”)
