1 / 34

Document Classification with Self-Organizing Maps (SOM)

This assignment involves creating a document collection, generating a document-term matrix, and performing document classification using Self-Organizing Maps (SOM). Results will be interpreted and presented in cis392dc.html.

garyhunter
Download Presentation

Document Classification with Self-Organizing Maps (SOM)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NJIT CIS 392 Text Processing, Retrieval, and MiningSpring 2003 Materials: Sullivan Ch 8 An Introduction to Neural Networks, Ch 1, by Kevin Gurney (optional) http://www.shef.ac.uk/psychology/gurney/notes/ Assign#2

  2. Document Space • Visualizing objects of high dimensional datasets is difficult. Assign#2

  3. A Complex Document Space NJIT Doc2 CIS634 Wu Doc1 Doc5 Doc3 Doc4 Information Systems Dept Assign#2

  4. Overview of Neural Networks Algorithms • A neural network is an interconnected assembly of simple processing elements, units or nodes, …. The processing ability of the network is stored in the inter unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns. --Kevin Gurney, 1997, An Intro to Neural Networks Assign#2

  5. Self-Organizing Maps (SOM) • It is a neural networks (NN) algorithm developed by Dr. Teuvo Kohonen. • It’s a self-organized and non-supervised NN technique. • The SOM defines a mapping from high dimensional input data space onto a regular two-dimensional array of neurons.(-- by Nenet Team, http://koti.mbnet.fi/~phodju/nenet/SelfOrganizingMap/Theory.html) Assign#2

  6. SOM • Initially, every neuron on the feature map is assigned a vector of weights, called a reference vector, which is an n-dimensional vector. • The reference vectors together form a codebook. • Neurons are connected to each other by a neighborhood relation, either rectangular or hexagonal. Assign#2

  7. SOM Learning/Training Process • One sample vector (V) is randomly selected from input vectors. • During the training process, V is compared to the codebook vectors, according to Euclidian metric. The one closest to V is chosen as best matching unit (BMU). • After BMU is found, the codebook vectors are updated. The BMU and its neighbors are moved closer to V. • The above steps together form one training process. The process continues until the pre-specified number of training steps is reached. Assign#2

  8. SOM This diagram is adapted from: http://koti.mbnet.fi/~phodju/nenet/SelfOrganizingMap/Theory.html Assign#2

  9. SOM’s Applications in Concept Classification • AI Lab: http://ai.bpa.arizona.edu • Web SOM: http://websom.hut.fi/websom/ • Results produced by these research centers are enhanced with hypertext technology to allow “searching by browsing.” Assign#2

  10. SOM Classification Results from AI Lab Assign#2

  11. Information Provided on SOM • The information provided on a SOM feature map for term classification includes: • The labels of areas on the concept map refer to major concepts in the document collection. • The size of an area implicates the relative term frequency. The larger the area is, the more frequent the term is. • The adjacent areas indicate terms that frequently co-occur in the document set. Assign#2

  12. D-T and T-D Matrixes for Text Classifications/Clustering • For term/concept classification, you need T-D matrixes, in which documents are properties. • For document classification, you need D-T matrixes, in which terms are properties. Assign#2

  13. Assignment 2 Overview • Create a document collection based on your RE. Please remember to remove duplicates. • Create a vocabulary file with default lexical options for your own document collection. • Create a document-term matrix • Perform document classification with SOM. • Interpret results. • Present results in cis392dc.html. Assign#2

  14. Creating document collections • Create a directory named “model4” under ~yourusername/public_html/cis392 directory. • Create a directory named “mycollection” under ~yourusername/public_html/cis392 directory. • Create group1 and group2 sub-directories under “mycollection.” Assign#2

  15. Creating document collections • Copy top 25 retrieved documents from model1 of your RE1 into mycollection/group1. • Copy top 25 retrieved documents from model2 of your RE1 into mycollection/group2. (Remove duplicates: Remember to check each document number to see if it is already in group1. If yes, do not copy it.) Assign#2

  16. Creating document collections • Copying retrieved documents into mycollection/group1: • Make sure you are in ~yourusername/public_html/cis392/mycollection/group1 • At the system prompt, type in: cp filename . • . (the dot sign) means the current directory • ex: cp /afs/cad/u/w/u/wu/cis392/tc/lisa/text/group0/doc_306 . • Repeat the same process for group2, which contains retrieved documents from model2 of RE1 (remember to remove duplicates). Assign#2

  17. Creating document collections After you have created the document collection, execute these 2 commands: • more ~yourusername/public_html/cis392/ mycollection/group1/* > ~yourusername/public_html/ cis392/model4/group1.txt • more ~yourusername/public_html/cis392/ mycollection/group2/* > ~yourusername/public_html/ cis392/model4/group2.txt Assign#2

  18. Using Rainbow to Create Vocabularies • Remember the test collection now is in ~yourusername/public_html/cis392/mycollection/* • Go to BOW directory, at the system prompt, type in: ./rainbow -d ~yourusername/public_html /cis392/model4 --index ~yourusername/ public_html/cis392/mycollection/* Assign#2

  19. Printing the D-T Matrix • Only the top 5 terms (based on info-grain) from the vocabulary lists are selected. • Type in the following at the system prompt: ./rainbow -d ~yourusername/public_html /cis392/model4 --prune-vocab-by-infogain=5 --print-matrix=abe > ~yourusername/ public_html/cis392/model4/matrix • Check BOW web site to see what “abe” means. Assign#2

  20. Cleaning the matrix • ftp the matrix file to your PC. • Open it with Excel, select “delimited,” and select “space” as delimiters. • Delete 2nd column (the class name). • Move the document number (1st column) to the last column. • Delete paths on the last column (before the actual document numbers). • Only document numbers and frequency counts are left. Assign#2

  21. Cleaning the matrix • Insert an empty row before the first row. Type in 5 (for 5 properties) in the very first cell. • Save the file in “Text” (Tab delimited) format, the file name is matrix.txt • Upload this file back to model4 directory. • ****However, Nenet uses *.dat for input matrix files. You will have to specify the file type as “all files,” when opening data file in Nenet. Assign#2

  22. The final matrix (to be saved in Text (Tab delimited) format) Assign#2

  23. SOM toolkit for Windows -- Nenet • The trial version is available at: http://koti.mbnet.fi/~phodju/nenet/Nenet/General.html • Trial version has limited capability: up to 8 properties, 6x6 dimensions (36 neurons). However, if the matrix has 8 properties, Nenet seems to have trouble with it. So, please limit your raw data (matrix and matrix.txt files) to exact 5 properties. Assign#2

  24. Download and Install Nenet • Create a file folder named temp. Download all three zip files to temp and uncompress them with WinZip. Install the software by clicking on setup.exe. • If your PC doesn’t have WinZip, download it here: http://www.winzip.com/ Assign#2

  25. Nenet Demo & Dataset • Interactive demo: http://koti.mbnet.fi/~phodju/nenet/Nenet/InteractiveDemo.html • For your Assignment #2, the initial dataset, training dataset, and test dataset are the same, that is “matrix.txt”. • Nenet uses *.dat for input matrix files. You will have to specify the file type as “all files,” when opening data file in Nenet. • Remember to select “Use Automatic Labeling” at the testing stage. (or your map will not have document numbers as labels!!) Assign#2

  26. Assign#2

  27. Assign#2

  28. View Results on the Feature Map • After training and testing, Nenet presents the results on a map similar to next slide. • Click on “view”  “labels and vectors,” Nenet will show you to a map with labels. • Click on any neuron on the map that has document numbers on it, you will see a list of document numbers associated with that neuron. Assign#2

  29. Assign#2

  30. Assign#2

  31. How are Doc# mapped to the neuron? • When labeling, each document vector is compared to the final vector of weights of each neuron. • The best matching neuron determines where the document# will be located on the map. Assign#2

  32. Copy Map to Clipboard Assign#2

  33. Save this map in matrix.jpg or matrix.gif file, and upload the map to model4 directory Assign#2

  34. Your tasks • Use the D-T matrix (matrix.txt) created earlier for document clustering with Nenet. • Follow the instructions on the interactive demo. • Save the final results in matrix.cod file. • Upload the matrix.txt, matrix.cod, and matrix.jpg to model4 directory. • Create RE2 page, format: http://www-ec.njit.edu/~wu/cis392/cis392dc.html Assign#2

More Related