
Document Organization using Self-Organizing Feature Maps (WEBSOFM)


Presentation Transcript


  1. Document Organization using Self-Organizing Feature Maps (WEBSOFM). Apostolos Georgakis, Artificial Intelligence and Information Analysis Lab, Department of Informatics, Aristotle University of Thessaloniki.

  2. Self-Organizing Maps Algorithm
SOMs are neural networks with:
• a two-layer structure
• a feed-forward topology
They:
• form a non-linear projection from an arbitrary data manifold onto a low-dimensional discrete map
• perform competitive, unsupervised training
• map the probability density function (pdf) of the input space onto a 2D or 3D lattice
The topology of the lattice can be either hexagonal or orthogonal.

  3. [Figure: input feature vectors $\mathbf{x}_i$ feeding a lattice of neurons with weight vectors $\mathbf{m}_j$]
SOMs belong to the class of vector quantizers (k-means, VQ, LVQ, etc.). Let $\mathbf{x}_i(t) = [x_{i1}(t), x_{i2}(t), \ldots, x_{im}(t)]^T$ denote an $m \times 1$ feature vector and let $\mathbf{m}_j(t)$ denote the weight vector of the $j$-th neuron.

  4. Index of the winning neuron: $c = \arg\min_j \|\mathbf{x}_i(t) - \mathbf{m}_j(t)\|$
Updating function: $\mathbf{m}_j(t+1) = \mathbf{m}_j(t) + h_{cj}(t)\,[\mathbf{x}_i(t) - \mathbf{m}_j(t)]$ for $j \in N_c$, and $\mathbf{m}_j(t+1) = \mathbf{m}_j(t)$ otherwise,
where $N_c$ is the neighbourhood centered on the winner, which is modified toward $\mathbf{x}_i(t)$.

  5. The neighbourhood function is a Gaussian kernel, $h_{cj}(t) = \alpha(t)\exp\!\left(-\frac{\|\mathbf{r}_c - \mathbf{r}_j\|^2}{2\sigma^2(t)}\right)$, where $\alpha(t)$ is the learning rate and $\sigma(t)$ denotes the diameter of the updating kernel. $\mathbf{r}_c$ and $\mathbf{r}_j$ are the locations on the lattice of the winning neuron and the neuron being updated, respectively.
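As an illustration of slides 2 through 5, here is a minimal NumPy sketch of a SOM training loop with an orthogonal lattice and a Gaussian neighbourhood. The lattice size, decay schedules, and random initialization are illustrative assumptions, not settings taken from the WEBSOFM system:

```python
import numpy as np

def train_som(X, rows=10, cols=10, epochs=20, alpha0=0.5, sigma0=5.0):
    """One possible SOM training loop: winner search plus Gaussian update."""
    n, dim = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(rows * cols, dim))           # weight vectors m_j
    # lattice coordinates r_j of each neuron (orthogonal topology)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    T = epochs * n
    t = 0
    for _ in range(epochs):
        for x in X[rng.permutation(n)]:
            frac = t / T
            alpha = alpha0 * (1.0 - frac)             # decaying learning rate a(t)
            sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking kernel width sigma(t)
            c = np.argmin(np.linalg.norm(W - x, axis=1))   # winning neuron index
            d2 = np.sum((grid - grid[c]) ** 2, axis=1)     # ||r_c - r_j||^2 on the lattice
            h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))   # neighbourhood function h_cj(t)
            W += h[:, None] * (x - W)                      # m_j(t+1) = m_j(t) + h_cj(t)(x - m_j)
            t += 1
    return W
```

The linear decay of $\alpha(t)$ and $\sigma(t)$ is one common choice; any schedules that shrink the kernel over time would fit the update rule above.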

  6. Information Retrieval of Domain-Specific Material
Information retrieval (IR) consists of:
• Organization: representation and storage of the available data
• Retrieval: exploration of the organized data
Prior to retrieval, the data repository (corpus) has to be organized according to the retrieval method to be applied. Without a retrieval-oriented organization method, retrieving even a fraction of the available documents becomes onerous.

  7. Information Retrieval of Domain-Specific Material
Scenario:
• A user is interested in locating documents or web pages related to a certain topic.
• The user queries the IR system in natural language or supplies available sources (documents and/or web pages).

  8. Information Retrieval
The above problem can be dealt with by one of the following:
• Boolean model (index terms)
• Vector model (vectors in a t-dimensional space)
• Probabilistic model (probability theory)

  9. Boolean Model
Documents are represented by index terms or keywords.
Indexing techniques:
• Suffix arrays (a text suffix is a string that runs from a text position to the end of the text)
• Signature files (word-oriented index structures based on hashing)
• Inverted files, consisting of a vocabulary file and an occurrences file
Boolean operators are used in the retrieval.
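A minimal sketch of an inverted file with Boolean retrieval, assuming a toy corpus and whitespace tokenization; the helper names are hypothetical, not from the presentation:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Vocabulary maps each term to its occurrences (a posting list of doc ids)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """AND query: documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = ["prague czech restaurant", "czech beer guide", "prague tourist service"]
idx = build_inverted_file(docs)
print(boolean_and(idx, "prague", "czech"))   # {0}
```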

  10. Drawbacks of the Boolean Model
• It does not support significance weighting of the query terms.
• Users are not always familiar with the Boolean operators.
• No sufficient mechanism exists for ranking the retrieved documents.
• It is capable of handling only keyword-based queries.
Alternatives:
• Fuzzy Boolean operators.

  11. Vector Space Model
• The words and the documents are represented by vectors.
Advantages:
• Supports term weighting.
• The retrieved documents can be sorted according to their similarity, computed with the help of the various vector norms.
• It is a user-friendly model.
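A small sketch of the vector space model using raw term frequencies and cosine similarity. A real system would add tf-idf weighting; the vocabulary and documents here are toy assumptions:

```python
import math
from collections import Counter

def tf_vector(text, vocab):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity, i.e. normalized inner product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = ["prague", "czech", "restaurant", "beer"]
doc = tf_vector("prague czech restaurant prague", vocab)
query = tf_vector("czech restaurant", vocab)
print(cosine(doc, query))   # documents can be ranked by this score
```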

  12. Preprocessing
Text processing:
• HTML cleaning
• Plain-text cleaning (email addresses, URLs, word separators, numbers, etc.)
• Stemming: clustering of words by elimination of their suffixes
Feature vector extraction:
• Calculation of the contextual statistics
• Formation of the feature vectors (vector space model)
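A rough sketch of this pipeline. The regular expressions and the tiny suffix list are stand-ins for a full HTML cleaner and Porter's stemming algorithm, chosen only to make the steps concrete:

```python
import re

SUFFIXES = ("ational", "ization", "fulness", "ing", "ed", "es", "s")  # crude subset

def clean(html):
    """Strip HTML tags, then emails, URLs, and numbers; keep lowercase words."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\S+@\S+|https?://\S+|\d+", " ", text)
    return re.sub(r"[^a-zA-Z ]+", " ", text).lower().split()

def stem(word):
    """Toy suffix stripper; a real system would use Porter's algorithm."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

tokens = [stem(w) for w in clean("<p>Famous Czech restaurants in Prague!</p>")]
print(tokens)   # e.g. ['famou', 'czech', 'restaurant', 'in', 'prague']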

  13. Language Modeling (Contextual Statistics)
For each word, consider both its preceding and its following words when collecting the contextual statistics.
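One plausible way to realize these contextual statistics, following the WEBSOM-style averaged-context idea: each word gets a random code vector, and the codes of its observed left and right neighbours are averaged around it. The dimension and random seed below are arbitrary assumptions:

```python
import numpy as np
from collections import defaultdict

def context_vectors(tokens, dim=90, seed=0):
    """Average the random codes of the words preceding and following each word."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    code = {w: rng.normal(size=dim) for w in vocab}   # random code per word
    before, after = defaultdict(list), defaultdict(list)
    for i in range(1, len(tokens) - 1):
        before[tokens[i]].append(code[tokens[i - 1]])
        after[tokens[i]].append(code[tokens[i + 1]])
    feats = {}
    for w in vocab:
        pre = np.mean(before[w], axis=0) if before[w] else np.zeros(dim)
        post = np.mean(after[w], axis=0) if after[w] else np.zeros(dim)
        feats[w] = np.concatenate([pre, code[w], post])   # 3*dim feature vector
    return feats
```

Note that stacking the left context, the word code, and the right context gives the 3N_w-dimensional vectors whose size is quoted on the dimensionality-reduction slide below.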

  14. [Figure: the word category map for a 27×27 network]

  15. [Figure]

  16. Example: the k-th document after stemming
Original file: "A Prague Guide - Czech Restaurants - Prag - Praha - From Andel 3W Tourist Service English.htm"
Stemmed text: PRAGUE guid CZECH restaur PRAG PRAHA ANDEL tourist servic ENGLISH CZECH restaur HOSTINEC KALICHA NEBOZIZEK NOVOMESTSKY PIVOVAR POD KRIDLEM FLEKU KRKAVCU MALTEZSKYCH RYTIRU HOSTINEC KALICHA NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . folklor musican realli get live mood drink glass beer gobbl traddit CZECH GULAS DUMPLINGS pick histori . reserv recommend . back list NEBOZIZEK NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . ….
The statistics of the stemmed words are combined together and smoothing is applied.

  17. [Figure: document map]

  18. [Figure: document map]

  19. [Figure: mean square error over training time] The solid vertical line denotes the time instant when the simulated annealing process was used to overcome local minima in the clustering procedure.

  20. Dimensionality Reduction
The dimensionality of the feature vectors is exceptionally high; for the HyperGeo corpus (393 web pages, 3084 stems) the dimension is 3084 × 3 = 9252.
Dimensionality reduction techniques:
• PCA, ICA, LDA
• Random projection
• Component elimination

  21. Random Projection
In random projection we compute an m × 3N_w matrix R with the following properties:
• The components in each column are chosen to be independent, identically distributed Gaussian variables with zero mean and unit variance.
• Each column is normalized to unit norm.
Multiplying a 3N_w-dimensional feature vector by R projects it onto an m-dimensional subspace.
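A direct sketch of this construction; the reduced dimension m = 300 is an arbitrary choice for illustration:

```python
import numpy as np

def random_projection_matrix(m, d, seed=0):
    """m x d Gaussian matrix with unit-norm columns, as described above."""
    rng = np.random.default_rng(seed)
    R = rng.normal(loc=0.0, scale=1.0, size=(m, d))   # i.i.d. N(0, 1) entries
    R /= np.linalg.norm(R, axis=0, keepdims=True)     # normalize each column
    return R

# project a 9252-dimensional feature vector down to m = 300 dimensions
R = random_projection_matrix(300, 9252)
x = np.random.default_rng(1).normal(size=9252)
x_low = R @ x
print(x_low.shape)   # (300,)
```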

  22. Component Elimination
To overcome the computational complexity of locating the winner, the components of the feature vectors are ordered according to a per-component statistic $u_j$ computed over all $k$ feature vectors.

  23. Rearrange the components in each feature vector using $u_j$ so that the components with the strongest values appear first. The Euclidean distance then splits into a sum over the leading $d'$ components and a sum over the remaining ones:
$\|\mathbf{x} - \mathbf{m}_j\|^2 = \sum_{l=1}^{d'} (x_l - m_{jl})^2 + \sum_{l=d'+1}^{d} (x_l - m_{jl})^2$
By omitting the second sum we can still get an estimate of the winning neuron.
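A sketch of the truncated winner search. Since the exact ordering statistic u_j is not reproduced here, the per-component variance across the weight vectors is used as a hypothetical stand-in for the ranking:

```python
import numpy as np

def truncated_winner(x, W, order, keep):
    """Estimate the winner using only the `keep` strongest components.

    `order` holds component indices sorted by decreasing importance (u_j),
    so omitting the tail sum approximates the full Euclidean distance.
    """
    idx = order[:keep]
    d2 = np.sum((W[:, idx] - x[idx]) ** 2, axis=1)   # first sum only
    return int(np.argmin(d2))

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 9252))                # neuron weight vectors
x = rng.normal(size=9252)                       # input feature vector
order = np.argsort(-np.var(W, axis=0))          # stand-in ranking for u_j
full = int(np.argmin(np.sum((W - x) ** 2, axis=1)))
print(truncated_winner(x, W, order, keep=500), full)
```

The fraction of inputs for which the truncated estimate agrees with the full search is exactly what the next slide's figure measures.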

  24. [Figure: percentage of correctly identified winners with respect to the dimensionality difference between the original space and the "truncated" sub-space]

  25. References
• R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (ACM Press, 1999).
• M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval (SIAM, 1999).
• G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, 1983).
• S. Kaski, K. Lagus, T. Honkela, and T. Kohonen, Statistical aspects of the WEBSOM system in organizing document collections, Computing Science and Statistics, 29, 1998, 281-290.
• T. Kohonen, Self-Organizing Maps (Springer-Verlag, 1997).
• T. Kohonen, Self-organization of very large document collections: State of the art, Proc. of the 8th Int. Conf. on Artificial Neural Networks, 1, Springer, 1998, 65-74.
• H. Ritter and T. Kohonen, Self-organizing semantic maps, Biological Cybernetics, 61, 1989, 241-254.

  26. References (continued)
• S. Haykin, Neural Networks: A Comprehensive Foundation (Prentice-Hall, 1999).
• T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, Self-organization of a massive document collection, IEEE Trans. on Neural Networks, 11:3, 2000, 574-585.
• S. Kaski, Data exploration using self-organizing maps, PhD thesis, Helsinki University of Technology, 1997.
• C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
• W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (Prentice-Hall, 1992).
• M. F. Porter, An algorithm for suffix stripping, Program, 14, 1980, 130-137.
• C. Becchetti and L. P. Ricotti, Speech Recognition: Theory and C++ Implementation (Wiley, 1998).
• M. P. Oakes, Statistics for Corpus Linguistics (Edinburgh University Press, 1998).
• S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering, Proc. of the 8th Int. Conf. on Neural Networks, IEEE, 1, 1998, 413-418.
• S. Kaski, Fast winner search for SOM-based monitoring and retrieval of high-dimensional data, Proc. of the 9th Int. Conf. on Artificial Neural Networks, Edinburgh, U.K., September 1999.
