
Document Organization using Self-Organizing Feature Maps (WEBSOFM)


Presentation Transcript


  1. Document Organization using Self-Organizing Feature Maps (WEBSOFM). Apostolos Georgakis, Artificial Intelligence and Information Analysis Lab, Department of Informatics, Aristotle University of Thessaloniki.

  2. Self-Organizing Maps Algorithm
SOMs are neural networks with:
• a two-layer structure
• a feed-forward topology
They:
• form a non-linear projection from an arbitrary data manifold onto a low-dimensional discrete map
• perform competitive, unsupervised training
• map the probability density function (pdf) of the input space onto a 2D or 3D lattice
The topology of the lattice can be either hexagonal or orthogonal.

  3. [Figure: input feature vectors $\mathbf{x}_i$ feeding a lattice of neurons with weight vectors $\mathbf{m}_j$]
SOMs belong to the class of vector quantizers (k-means, VQ, LVQ, etc.). Let $\mathbf{x}_i(t) = [x_{i1}(t), x_{i2}(t), \ldots, x_{im}(t)]^T$ denote an $m \times 1$ feature vector and let $\mathbf{m}_j(t)$ denote the weight vector of the $j$-th neuron.

  4. Index of the winning neuron: $c = \arg\min_j \|\mathbf{x}_i(t) - \mathbf{m}_j(t)\|$
Updating function: $\mathbf{m}_j(t+1) = \mathbf{m}_j(t) + h_{cj}(t)\,[\mathbf{x}_i(t) - \mathbf{m}_j(t)]$ for $j \in N_c$, and $\mathbf{m}_j(t+1) = \mathbf{m}_j(t)$ otherwise,
where $N_c$ is the neighbourhood centered on the winner, which is modified toward $\mathbf{x}_i(t)$.

  5. The neighbourhood function is a Gaussian kernel, $h_{cj}(t) = \alpha(t)\exp\!\left(-\frac{\|\mathbf{r}_c - \mathbf{r}_j\|^2}{2\sigma^2(t)}\right)$, where $\alpha(t)$ is the learning rate and $\sigma(t)$ denotes the diameter of the updating kernel. $\mathbf{r}_c$ and $\mathbf{r}_j$ are the locations on the lattice of the winning neuron and the neuron being updated, respectively.
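As an illustration of slides 2 through 5, here is a minimal NumPy sketch of a SOM training loop with an orthogonal lattice and a Gaussian neighbourhood. The lattice size, decay schedules, and random initialization are illustrative assumptions, not settings taken from the WEBSOFM system:

```python
import numpy as np

def train_som(X, rows=10, cols=10, epochs=20, alpha0=0.5, sigma0=5.0):
    """One possible SOM training loop: winner search plus Gaussian update."""
    n, dim = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(rows * cols, dim))           # weight vectors m_j
    # lattice coordinates r_j of each neuron (orthogonal topology)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    T = epochs * n
    t = 0
    for _ in range(epochs):
        for x in X[rng.permutation(n)]:
            frac = t / T
            alpha = alpha0 * (1.0 - frac)             # decaying learning rate a(t)
            sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking kernel width sigma(t)
            c = np.argmin(np.linalg.norm(W - x, axis=1))   # winning neuron index
            d2 = np.sum((grid - grid[c]) ** 2, axis=1)     # ||r_c - r_j||^2 on the lattice
            h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))   # neighbourhood function h_cj(t)
            W += h[:, None] * (x - W)                      # m_j(t+1) = m_j(t) + h_cj(t)(x - m_j)
            t += 1
    return W
```

The linear decay of $\alpha(t)$ and $\sigma(t)$ is one common choice; any schedules that shrink the kernel over time would fit the update rule above.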

  6. Information Retrieval of Domain-Specific Material
Information retrieval (IR) consists of:
• Organization: representation and storage of the available data
• Retrieval: exploration of the organized data
Prior to retrieval, the data repository (corpus) has to be organized according to the retrieval method to be applied. Without a retrieval-oriented organization method, retrieving even a fraction of the available documents becomes onerous.

  7. Information Retrieval of Domain-Specific Material
Scenario:
• A user is interested in locating documents or web pages related to a certain topic.
• The user queries the IR system in natural language or supplies available sources (documents and/or web pages).

  8. Information Retrieval
The above problem can be dealt with by one of the following:
• Boolean model (index terms)
• Vector model (vectors in a t-dimensional space)
• Probabilistic model (probability theory)

  9. Boolean Model
Documents are represented by index terms or keywords.
Indexing techniques:
• Suffix arrays (a text suffix is a string that runs from a text position to the end of the text)
• Signature files (word-oriented index structures based on hashing)
• Inverted files, consisting of a vocabulary file and an occurrences file
Boolean operators are used in the retrieval.
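A minimal sketch of an inverted file with Boolean retrieval, assuming a toy corpus and whitespace tokenization; the helper names are hypothetical, not from the presentation:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Vocabulary maps each term to its occurrences (a posting list of doc ids)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """AND query: documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = ["prague czech restaurant", "czech beer guide", "prague tourist service"]
idx = build_inverted_file(docs)
print(boolean_and(idx, "prague", "czech"))   # {0}
```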

  10. Drawbacks of the Boolean Model
• It does not support significance weighting of the query terms.
• Users are not always familiar with the Boolean operators.
• No sufficient mechanism exists for ranking the retrieved documents.
• It is capable of handling only keyword-based queries.
Alternatives:
• Fuzzy Boolean operators.

  11. Vector Space Model
• The words and the documents are represented by vectors.
Advantages:
• Supports term weighting.
• The retrieved documents can be sorted according to their similarity, computed with the help of the various vector norms.
• It is a user-friendly model.
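A small sketch of the vector space model using raw term frequencies and cosine similarity. A real system would add tf-idf weighting; the vocabulary and documents here are toy assumptions:

```python
import math
from collections import Counter

def tf_vector(text, vocab):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity, i.e. normalized inner product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = ["prague", "czech", "restaurant", "beer"]
doc = tf_vector("prague czech restaurant prague", vocab)
query = tf_vector("czech restaurant", vocab)
print(cosine(doc, query))   # documents can be ranked by this score
```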

  12. Preprocessing
Text processing:
• HTML cleaning
• Plain-text cleaning (email addresses, URLs, word separators, numbers, etc.)
• Stemming: clustering of words by elimination of their suffixes
Feature vector extraction:
• Calculation of the contextual statistics
• Formation of the feature vectors (vector space model)
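A rough sketch of this pipeline. The regular expressions and the tiny suffix list are stand-ins for a full HTML cleaner and Porter's stemming algorithm, chosen only to make the steps concrete:

```python
import re

SUFFIXES = ("ational", "ization", "fulness", "ing", "ed", "es", "s")  # crude subset

def clean(html):
    """Strip HTML tags, then emails, URLs, and numbers; keep lowercase words."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\S+@\S+|https?://\S+|\d+", " ", text)
    return re.sub(r"[^a-zA-Z ]+", " ", text).lower().split()

def stem(word):
    """Toy suffix stripper; a real system would use Porter's algorithm."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

tokens = [stem(w) for w in clean("<p>Famous Czech restaurants in Prague!</p>")]
print(tokens)   # e.g. ['famou', 'czech', 'restaurant', 'in', 'prague']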

  13. Language Modeling (Contextual Statistics)
For each word, consider both its preceding and its following words when collecting the contextual statistics.
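One plausible way to realize these contextual statistics, following the WEBSOM-style averaged-context idea: each word gets a random code vector, and the codes of its observed left and right neighbours are averaged around it. The dimension and random seed below are arbitrary assumptions:

```python
import numpy as np
from collections import defaultdict

def context_vectors(tokens, dim=90, seed=0):
    """Average the random codes of the words preceding and following each word."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    code = {w: rng.normal(size=dim) for w in vocab}   # random code per word
    before, after = defaultdict(list), defaultdict(list)
    for i in range(1, len(tokens) - 1):
        before[tokens[i]].append(code[tokens[i - 1]])
        after[tokens[i]].append(code[tokens[i + 1]])
    feats = {}
    for w in vocab:
        pre = np.mean(before[w], axis=0) if before[w] else np.zeros(dim)
        post = np.mean(after[w], axis=0) if after[w] else np.zeros(dim)
        feats[w] = np.concatenate([pre, code[w], post])   # 3*dim feature vector
    return feats
```

Note that stacking the left context, the word code, and the right context gives the 3N_w-dimensional vectors whose size is quoted on the dimensionality-reduction slide below.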

  14. [Figure: the word category map for a 27×27 network]

  15. [Figure]

  16. Example: the k-th document after stemming
Original file: "A Prague Guide - Czech Restaurants - Prag - Praha - From Andel 3W Tourist Service English.htm"
Stemmed text: PRAGUE guid CZECH restaur PRAG PRAHA ANDEL tourist servic ENGLISH CZECH restaur HOSTINEC KALICHA NEBOZIZEK NOVOMESTSKY PIVOVAR POD KRIDLEM FLEKU KRKAVCU MALTEZSKYCH RYTIRU HOSTINEC KALICHA NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . folklor musican realli get live mood drink glass beer gobbl traddit CZECH GULAS DUMPLINGS pick histori . reserv recommend . back list NEBOZIZEK NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . ….
The statistics of the stemmed words are combined together and smoothing is applied.

  17. [Figure: document map]

  18. [Figure: document map]

  19. [Figure: mean square error over training time] The solid vertical line denotes the time instant when the simulated annealing process was used to overcome local minima in the clustering procedure.

  20. Dimensionality Reduction
The dimensionality of the feature vectors is exceptionally high; for the HyperGeo corpus (393 web pages, 3084 stems) the dimension is 3084 × 3 = 9252.
Dimensionality reduction techniques:
• PCA, ICA, LDA
• Random projection
• Component elimination

  21. Random Projection
In random projection we compute an m × 3N_w matrix R with the following properties:
• The components in each column are chosen to be independent, identically distributed Gaussian variables with zero mean and unit variance.
• Each column is normalized to unit norm.
Multiplying a 3N_w-dimensional feature vector by R projects it onto an m-dimensional subspace.
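A direct sketch of this construction; the reduced dimension m = 300 is an arbitrary choice for illustration:

```python
import numpy as np

def random_projection_matrix(m, d, seed=0):
    """m x d Gaussian matrix with unit-norm columns, as described above."""
    rng = np.random.default_rng(seed)
    R = rng.normal(loc=0.0, scale=1.0, size=(m, d))   # i.i.d. N(0, 1) entries
    R /= np.linalg.norm(R, axis=0, keepdims=True)     # normalize each column
    return R

# project a 9252-dimensional feature vector down to m = 300 dimensions
R = random_projection_matrix(300, 9252)
x = np.random.default_rng(1).normal(size=9252)
x_low = R @ x
print(x_low.shape)   # (300,)
```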

  22. Component Elimination
To overcome the computational complexity of locating the winner, the components of the feature vectors are ordered according to a per-component statistic $u_j$ computed over all $k$ feature vectors.

  23. Rearrange the components in each feature vector using $u_j$ so that the components with the strongest values appear first. The Euclidean distance then splits into a sum over the leading $d'$ components and a sum over the remaining ones:
$\|\mathbf{x} - \mathbf{m}_j\|^2 = \sum_{l=1}^{d'} (x_l - m_{jl})^2 + \sum_{l=d'+1}^{d} (x_l - m_{jl})^2$
By omitting the second sum we can still get an estimate of the winning neuron.
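A sketch of the truncated winner search. Since the exact ordering statistic u_j is not reproduced here, the per-component variance across the weight vectors is used as a hypothetical stand-in for the ranking:

```python
import numpy as np

def truncated_winner(x, W, order, keep):
    """Estimate the winner using only the `keep` strongest components.

    `order` holds component indices sorted by decreasing importance (u_j),
    so omitting the tail sum approximates the full Euclidean distance.
    """
    idx = order[:keep]
    d2 = np.sum((W[:, idx] - x[idx]) ** 2, axis=1)   # first sum only
    return int(np.argmin(d2))

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 9252))                # neuron weight vectors
x = rng.normal(size=9252)                       # input feature vector
order = np.argsort(-np.var(W, axis=0))          # stand-in ranking for u_j
full = int(np.argmin(np.sum((W - x) ** 2, axis=1)))
print(truncated_winner(x, W, order, keep=500), full)
```

The fraction of inputs for which the truncated estimate agrees with the full search is exactly what the next slide's figure measures.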

  24. [Figure: percentage of correctly identified winners with respect to the dimensionality difference between the original space and the "truncated" sub-space]

  25. References
• R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (ACM Press, 1999).
• M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval (SIAM, 1999).
• G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, 1983).
• S. Kaski, K. Lagus, T. Honkela, and T. Kohonen, Statistical aspects of the WEBSOM system in organizing document collections, Computing Science and Statistics, 29, 1998, 281-290.
• T. Kohonen, Self-Organizing Maps (Springer-Verlag, 1997).
• T. Kohonen, Self-organization of very large document collections: State of the art, Proc. of the 8th Int. Conf. on Artificial Neural Networks, 1, Springer, 1998, 65-74.
• H. Ritter and T. Kohonen, Self-organizing semantic maps, Biological Cybernetics, 61, 1989, 241-254.

  26. References (continued)
• S. Haykin, Neural Networks: A Comprehensive Foundation (Prentice-Hall, 1999).
• T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, Self-organization of a massive document collection, IEEE Trans. on Neural Networks, 11:3, 2000, 574-585.
• S. Kaski, Data exploration using self-organizing maps, PhD thesis, Helsinki University of Technology, 1997.
• C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
• W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (Prentice-Hall, 1992).
• M. F. Porter, An algorithm for suffix stripping, Program, 14, 1980, 130-137.
• C. Becchetti and L. P. Ricotti, Speech Recognition: Theory and C++ Implementation (Wiley, 1998).
• M. P. Oakes, Statistics for Corpus Linguistics (Edinburgh University Press, 1998).
• S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering, Proc. of the 8th Int. Conf. on Neural Networks, IEEE, 1, 1998, 413-418.
• S. Kaski, Fast winner search for SOM-based monitoring and retrieval of high-dimensional data, Proc. of the 9th Int. Conf. on Artificial Neural Networks, Edinburgh, U.K., September 1999.
