A Comparison of SOM Based Document Categorization Systems

Presentation Transcript


  1. A Comparison of SOM Based Document Categorization Systems Advisor: Dr. Hsu Graduate: Kuo-min Wang Authors: X. Luo, A. Nur Zincir-Heywood 2003 IEEE.

  2. Outline • Motivation • Objective • Architecture Overview • Performance Evaluation • Conclusions • Personal Opinion

  3. Motivation • Document categorization systems can address two problems • Information overload • Describes the constant influx of new information, which causes users to be overwhelmed by the subject and system knowledge required to access this information • Vocabulary differences • Automatic selection and weighting of keywords in text documents may well bias the nature of the clusters found at a later stage.

  4. Objective (Figure: a document represented as a vector of terms word1, word2, …, wordn) • This paper describes the development and evaluation of two unsupervised learning mechanisms for solving the automatic document categorization problem. • Vector space model • Code-books model

  5. Introduction • A common approach among existing systems is to cluster documents based upon their word distributions, while word clustering is determined by document co-occurrence. • Vector Space Model (VSM) • The frequency of occurrence of each word in each document is recorded • Generally weighted using the Term Frequency (TF) multiplied by the Inverse Document Frequency (IDF)
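A minimal sketch of the TF x IDF weighting described on this slide, assuming documents are already tokenized into word lists (function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of tokenized documents (lists of words).
    Returns one {word: tf * idf} dictionary per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors

docs = [["som", "document", "clustering"],
        ["document", "categorization", "som", "som"]]
print(tfidf_vectors(docs))
```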

  6. Introduction (cont.) • First clustering system • is built based on the VSM • and makes use of the topological ordering property of SOMs. • Second clustering system • makes use of the SOM based architecture as an encoder for data representation • by finding a smaller set of prototypes from a large input space – without using the typical information retrieval pre-processing • considers the relationships between characters, then words, and finally word co-occurrences

  7. Document Clustering With Self-Organizing Maps (cont.) • First step is the identification of an encoding of the original information such that pertinent features may be decoded most efficiently. • SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes. (Figure: encoder-decoder view of the SOM: input vector x plus noise v, encoder c(x) selecting the BMU, decoder x'(x) producing the reconstruction vector x' from the weight vector wj)
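As a rough illustration of this encoder view, the SOM maps an input vector to its best matching unit (BMU), and the BMU's weight vector serves as the reconstruction. A minimal sketch, assuming a NumPy weight matrix (all names here are illustrative):

```python
import numpy as np

def encode(x, weights):
    """Encoder c(x): index of the best matching unit (smallest Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def decode(bmu, weights):
    """Decoder x'(x): the BMU's weight vector acts as the reconstruction of x."""
    return weights[bmu]

weights = np.random.rand(16, 5)   # 16 neurons as prototypes of a 5-dimensional input space
x = np.random.rand(5)             # input vector
x_reconstructed = decode(encode(x, weights), weights)
```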

  8. Architecture Overview (Figure: processing pipeline: data collection, data pre-processing, data reduction, pattern discovery) • There are two main parts to the vector space model • Parsing • 1) Converts documents into a succession of words • 2) Using a basic Stop-List of common English words • 3) A stemming algorithm is then applied to the remaining words, e.g. story and stories, or First and first • Indexing • 1) Each document is represented in the Vector Space Model • 2) The frequency of occurrence of each word in each document is recorded • 3) TF multiplied by IDF is used to generate the weighted value
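A hedged sketch of the parsing step (tokenization, stop-list filtering, stemming). The stop list and the suffix-stripping rule below are only placeholders standing in for a real stop list and a real stemmer such as Porter's:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # tiny sample stop-list

def stem(word):
    # Placeholder for a real stemming algorithm (e.g. Porter):
    # collapses simple plural forms so that "stories" and "story" share a stem.
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def parse(document):
    """Convert a document into a succession of stemmed, lower-cased content words."""
    words = re.findall(r"[a-z]+", document.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

print(parse("First the stories, then the story."))
```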

  9. Architecture-1: Emphasizing Density Matching Property • Data pre-processing (Figure): original document, tags and non-textual data are removed, stop words are removed, words are stemmed, words occurring less than 5 times are removed, TF/IDF is used to weight each word • Data Reduction • A random mapping is used to reduce the data dimensions: x' = Rx • x − the original data vector, where x ∈ R^N • R − a matrix of random values where the Euclidean length of each column has been normalized to unity • x' − the reduced-dimensional (quantized) vector, where x' ∈ R^d
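A small sketch of the random mapping x' = Rx, assuming NumPy; the dimensions chosen below are illustrative, not taken from the paper:

```python
import numpy as np

def random_projection_matrix(original_dim, reduced_dim, seed=0):
    """Random matrix R whose columns are normalized to unit Euclidean length."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(reduced_dim, original_dim))
    R /= np.linalg.norm(R, axis=0)  # normalize each column to length one
    return R

N, d = 5000, 200                  # original and reduced dimensionality (illustrative)
R = random_projection_matrix(N, d)
x = np.random.rand(N)             # original weighted document vector, x in R^N
x_reduced = R @ x                 # x' = Rx, now in R^d
```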

  10. Architecture-1: Emphasizing Density Matching Property (cont.) (Figure: data pre-processing followed by data reduction) • The problem of crowded neurons may be solved by a divide-and-conquer method

  11. Architecture-2: Emphasizing Encoding-Decoding Property • The core of the approach is to automate the identification of typical category characteristics. • A document is summarized by its words and their frequencies (TF) in descending order • An SOM is used to identify a suitable character encoding, then a word encoding, and finally a word co-occurrence encoding • Data pre-processing (Figure): original document, tags and non-textual data are removed, stop words are removed, words are stemmed, word frequencies are formed, words occurring less than 5 times are removed

  12. Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Input for the first-level SOMs • Employs a three-level hierarchical SOM architecture: characters, words, and word co-occurrences • Characters are represented by their ASCII code • The relationships between characters are represented by a character's position, or time index, in a word • Example "news": n1, e2, w3, s4; codes: n→14, e→5, w→23, s→19 • Pre-processing process • Convert the word's characters to numerical values • Find the time indices of the characters • Linearly normalize the indices, so that the first character is one, the second is two, and so on
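A hedged sketch of this first-level encoding: each character of a word becomes a (character code, time index) pair, with the index starting at one. Using ord() for the character code is an assumption; the slide's own example uses smaller code values:

```python
def word_to_first_level_inputs(word):
    """Encode a word as (character code, time index) pairs for the first-level SOM."""
    return [(ord(ch), idx) for idx, ch in enumerate(word.lower(), start=1)]

print(word_to_first_level_inputs("news"))
# [(110, 1), (101, 2), (119, 3), (115, 4)]
```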

  13. Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Input for the second-level SOMs • For each word, k, of each document that is input to the first-level SOM • Form a vector of size equal to the number of neurons (r) in the first-level SOM • For each character of k • Observe which neurons n1, n2, …, nr are affected the most (the first 3 BMUs) • Increment the entries in the vector corresponding to the first 3 BMUs by 1/j, 1 ≤ j ≤ 3
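A sketch of how such a second-level input vector might be built from the first-level SOM, following the 1/j increments for the first three BMUs; the function and argument names are illustrative assumptions:

```python
import numpy as np

def word_vector(word_char_inputs, first_level_weights):
    """Build a second-level input vector of size r (number of first-level neurons).

    word_char_inputs: one input vector per character of the word.
    For each character, the three best matching units are found and the
    corresponding vector entries are incremented by 1/j, j = 1, 2, 3.
    """
    r = first_level_weights.shape[0]
    vec = np.zeros(r)
    for x in word_char_inputs:
        dists = np.linalg.norm(first_level_weights - x, axis=1)
        for j, bmu in enumerate(np.argsort(dists)[:3], start=1):
            vec[bmu] += 1.0 / j
    return vec
```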

  14. Architecture-2 :Emphasizing Encoding-Decoding Property (cont.) • Input for the Third-level SOMs • The third-level input vectors are built using BMUs resulting from word vectors passed through the second-level SOMs.

  15. Architecture-2: Emphasizing Encoding-Decoding Property (cont.) • Training the SOMs • Initialization: • Choose random values for the initial weight vectors wj(0), j = 1, 2, …, l, where l is the number of neurons in the map • Sampling: • Draw a sample x from the input space with uniform probability • Similarity matching: • Find the best matching neuron i(x) using the Euclidean criterion • Updating: • Adjust the weight vectors of all neurons by using the update formula
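A compact sketch of the training loop on this slide (random initialization, uniform sampling, Euclidean BMU search, weight update). The Gaussian neighbourhood and exponentially decaying rates are common choices and an assumption here, since the slide does not give the update formula:

```python
import numpy as np

def train_som(data, n_neurons=16, n_steps=1000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initialization: random initial weight vectors w_j(0)
    weights = rng.random((n_neurons, data.shape[1]))
    positions = np.arange(n_neurons)            # 1-D map topology for simplicity
    for t in range(n_steps):
        # 2) Sampling: draw x uniformly from the input data
        x = data[rng.integers(len(data))]
        # 3) Similarity matching: BMU i(x) by the Euclidean criterion
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # 4) Updating: move all neurons toward x, scaled by a neighbourhood function
        lr = lr0 * np.exp(-t / n_steps)
        sigma = sigma0 * np.exp(-t / n_steps)
        h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

weights = train_som(np.random.rand(200, 5))
```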

  16. A Performance Evaluation • The performance measurement used is based on • A - the set of correct class labels (answer key) • B - the baseline clustering, where each document is its own cluster • C - the set of clusters (the 'winning' clustering results) • dist(C, A) - the number of operations required to transform C into A • dist(B, A) - the number of operations required to transform B into A
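The slide lists the ingredients of the measure but not how they are combined; one common way to combine them (an assumption here, not stated on the slide) is a normalized editing distance, where higher is better:

```python
def clustering_quality(dist_C_A, dist_B_A):
    """Assumed form: 1 - dist(C, A) / dist(B, A).
    Equals 1 when C already matches the answer key A, and 0 when C is no
    better than the one-document-per-cluster baseline B."""
    return 1.0 - dist_C_A / dist_B_A
```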

  17. Conclusion • The first architecture emphasizes a layered approach to lower the computational cost of training the map and employs a random mapping to decrease the dimension of the input space. • The second architecture is based on a new idea where the SOM acts as an encoder to represent a large input space by finding a smaller set of prototypes • Future work • Develop a classifier, which will work in conjunction with these clustering systems • Apply the technique to a wider cross-section of benchmark data sets

  18. Personal Opinions • Advantages • Uses the random mapping method to reduce dimensions. • Drawback • The architecture description is not clear. • Limit • …
