
Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Daniel X. Pape

Community Architectures for Network Information Systems

[email protected]

www.canis.uiuc.edu

CSNA’98

6/18/98


Overview

  • Self-Organizing Map (SOM) Algorithm

  • U-Matrix Algorithm for SOM Visualization

  • SOM Navigation Application

  • Document Representation and Collection Examples

  • Problems and Optimizations

  • Future Work


Basic SOM Algorithm

  • Input

    • Number (n) of Feature Vectors (x)

    • Format:

      vector name: a, b, c, d

    • Examples:

      1: 0.1, 0.2, 0.3, 0.4

      2: 0.2, 0.3, 0.3, 0.2


Basic SOM Algorithm

  • Output

    • Neural network map of (M) nodes

    • Each node has an associated Weight Vector (m) of the same dimensionality as the input feature vectors

    • Examples:

      m1: 0.1, 0.2, 0.3, 0.4

      m2: 0.2, 0.3, 0.3, 0.2


Basic SOM Algorithm

  • Output (cont.)

    • Nodes are laid out in a two-dimensional grid


Basic SOM Algorithm

  • Other Parameters

    • Number of timesteps (T)

    • Learning Rate (eta)


Basic SOM Algorithm

SOM() {
  foreach timestep t {
    foreach feature vector x {
      wnode = find_winning_node(x)
      update_local_neighborhood(wnode, x)
    }
  }
}

find_winning_node(x) {
  foreach node n {
    compute distance between the weight vector m of node n and x
  }
  return the node with the smallest distance
}

update_local_neighborhood(wnode, x) {
  foreach node n in the neighborhood of wnode {
    m = m + eta * (x - m)
  }
}
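
As a concrete reference, here is a minimal runnable sketch of this loop in Python with NumPy. The pseudocode above leaves the grid shape, the neighborhood function, and its decay unspecified, so the Gaussian neighborhood and shrinking radius below are illustrative assumptions.

import numpy as np

def som(X, rows=10, cols=10, T=20, eta=0.1, sigma0=3.0):
    # Train a SOM on n feature vectors X of dimension d.
    n, d = X.shape
    M = np.random.rand(rows * cols, d)        # one weight vector per node
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(T):
        sigma = sigma0 * (1.0 - t / T) + 0.5  # shrink the neighborhood over time
        for x in X:
            w = np.argmin(np.linalg.norm(M - x, axis=1))   # find_winning_node
            g = np.exp(-((grid - grid[w]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            M += eta * g[:, None] * (x - M)   # update_local_neighborhood
    return M

With the two example feature vectors from the input slide this runs as som(np.array([[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.3, 0.2]])).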


U-Matrix Visualization

  • Provides a simple way to visualize cluster boundaries on the map

  • Simple algorithm:

    • for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors

  • The average distance measures how similar a node is to its neighbors: large values mark cluster boundaries (the computation is sketched below)
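
A sketch of that computation in Python, assuming the trained weight vectors are stored row-major for a rows x cols grid and that "immediate neighbors" means the four grid neighbors (the exact neighborhood used in the talk is not specified):

import numpy as np

def u_matrix(M, rows, cols):
    # Average distance from each node's weight vector to those of
    # its (assumed 4-connected) grid neighbors.
    W = M.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = [np.linalg.norm(W[r, c] - W[nr, nc])
                     for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= nr < rows and 0 <= nc < cols]
            U[r, c] = np.mean(dists)
    return U  # high values mark boundaries, low values lie inside clusters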


U-Matrix Visualization

  • Interpretation

    • one can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain

    • the result is a landscape that represents the document space: the valleys (dark areas) are the clusters of data, and the mountains (light areas) are the boundaries between the clusters


U-Matrix Visualization

  • Example:

    • a dataset of random three-dimensional points, arranged in four obvious clusters


U-Matrix Visualization

Four (color-coded) clusters of three-dimensional points


U-Matrix Visualization

Oblique projection of a terrain derived from the U-Matrix


U-Matrix Visualization

Terrain for a real document collection


Current Labeling Procedure

  • Feature vectors are encoded as 0’s and 1’s

  • Weight vectors have real values from 0 to 1

  • Sort weight vector dimensions by element value

    • dimension with greatest value is “best” noun phrase for that node

  • Aggregate nodes with the same “best” noun phrase into groups (this procedure is sketched below)
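
A sketch of this procedure in Python, assuming the trained weight matrix M from the SOM sketch earlier and a list phrases giving the noun phrase behind each vector dimension (both names are illustrative):

import numpy as np
from collections import defaultdict

def label_nodes(M, phrases):
    # The dimension with the greatest weight gives the node's "best"
    # noun phrase; nodes sharing a best phrase are grouped together.
    groups = defaultdict(list)
    for node, m in enumerate(M):
        groups[phrases[int(np.argmax(m))]].append(node)
    return groups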


U-Matrix Navigation

  • 3D Space-Flight

  • Hierarchical Navigation


Document Data

  • Noun phrases extracted

  • Set of unique noun phrases computed

    • each noun phrase becomes a dimension of the data set

  • Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase


Document Data

  • Example:

    • 10 total noun phrases:

      alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death

    • each element of the feature vector will be a 1 or a 0:

      • 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0

      • 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
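
A sketch of this encoding in Python, using the ten noun phrases above; the phrase extraction step is assumed to have already happened:

phrases = ["alexander", "king", "macedonians", "darius", "philip",
           "horse", "soldiers", "battle", "army", "death"]

def encode(doc_phrases):
    # 1 if the noun phrase occurs in the document, 0 otherwise
    present = set(doc_phrases)
    return [1 if p in present else 0 for p in phrases]

print(encode(["alexander", "king", "philip", "horse"]))
# -> [1, 1, 0, 0, 1, 1, 0, 0, 0, 0], matching feature vector 1 above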


Document Collection Examples


Problems

  • As document sets grow larger, the feature vectors get longer and consume more memory

  • Execution time grows unacceptably long


Solutions?

  • Need algorithm refinements for sparse feature vectors

  • Need a faster way to do the find_winning_node() computation

  • Need a better way to do the update_local_neighborhood() computation


Sparse Vector Optimization

  • Intelligent support for sparse feature vectors

    • saves on memory usage

    • greatly improves the speed of the weight vector update computation (one realization is sketched below)
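
One plausible way to realize this, sketched below: keep each binary feature vector as a set of indices, and factor each weight vector into a scalar times a dense array, so the decay half of m + eta(x - m) costs O(1) and the additive half touches only the 1-bits. This factoring is an assumption, not necessarily the talk's implementation.

import numpy as np

class ScaledWeight:
    # Stores m as s * v. The update m <- (1 - eta) * m + eta * x then
    # becomes one scalar multiply plus work on the nonzeros of x alone.
    def __init__(self, d):
        self.s = 1.0
        self.v = np.random.rand(d)

    def update(self, nonzeros, eta):
        self.s *= (1.0 - eta)        # decays every dimension in O(1)
        for i in nonzeros:           # x[i] == 1 only at these indices
            self.v[i] += eta / self.s
        # in practice, renormalize occasionally before s underflows

    def dense(self):
        return self.s * self.v

w = ScaledWeight(10)
w.update([0, 1, 4, 5], eta=0.1)      # a document as sparse phrase indices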


Faster find_winning_node()

  • SOM weight vectors become partially ordered very quickly


Faster find_winning_node()

U-Matrix Visualization of an Initial, Unordered SOM


Faster find_winning_node()

Partially Ordered SOM after 5 timesteps


Faster find_winning_node()

  • Don’t do a global search for the winner

  • Start search from last known winner position

  • Pro:

    • usually finds a new winner very quickly

  • Con:

    • this new search for a winner can sometimes get stuck in a local minimum (see the sketch below)
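
A sketch of this local search as a hill-climbing walk on the grid; the talk does not specify the exact walk, so the four-neighbor step rule here is an assumption:

import numpy as np

def find_winner_local(M, rows, cols, x, start):
    # Walk from the last known winner to any neighbor closer to x,
    # stopping when no neighbor improves. This is the step that can
    # get stuck in a local minimum on a poorly ordered map.
    def dist(n):
        return np.linalg.norm(M[n] - x)
    current, improved = start, True
    while improved:
        improved = False
        r, c = divmod(current, cols)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and dist(nr * cols + nc) < dist(current):
                current, improved = nr * cols + nc, True
    return current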


Better Neighborhood Update

  • Nodes get told to “update” quite often

  • A node’s weight vector is read (“made public”) only during a find_winning_node() search

  • With local find_winning_node() search, a lazy neighborhood weight vector update can be performed


Better Neighborhood Update

  • Cache update requests

    • each node will store the winning node and feature vector for each update request

  • The node performs the update computations called for by the stored update requests only when asked for its weight vector

  • The number of requests can be reduced further by averaging the feature vectors in the cache (a sketch follows)
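
A sketch of such a lazily updating node; the request format and flush rule are assumptions consistent with the bullets above:

import numpy as np

class LazyNode:
    # Caches update requests and applies them only when the weight
    # vector is actually read, e.g. during a find_winning_node() search.
    def __init__(self, d):
        self.m = np.random.rand(d)
        self.pending = []                  # cached (eta_eff, x) requests

    def request_update(self, x, eta_eff):
        self.pending.append((eta_eff, x))  # no computation yet

    def weight(self):
        # averaging the cached x's first would cut the per-read work,
        # as the last bullet suggests
        for eta_eff, x in self.pending:    # flush the cache on demand
            self.m += eta_eff * (x - self.m)
        self.pending.clear()
        return self.m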


New Execution Times


Future Work

  • Parallelization

  • Label Problem


Label Problem

  • The current procedure is not very good

  • Cluster boundaries

  • Term selection


Cluster Boundaries

  • Image processing

  • Geometric


Cluster Boundaries

  • Image processing example


Term Selection

  • Too many unique noun phrases

    • Too many dimensions in the feature vector data

  • Cut the vocabulary off at the “knee” of the frequency curve (one heuristic is sketched below)
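
The talk does not say how the knee is located; a common heuristic, sketched below, keeps the phrases ranked before the point of the sorted frequency curve farthest from the straight line joining its endpoints:

import numpy as np

def knee_cutoff(freqs):
    # Sort frequencies in descending order and return the rank of the
    # point farthest from the chord between the curve's two endpoints.
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]
    n = len(f)
    x = np.arange(n)
    x0, y0, x1, y1 = 0.0, f[0], float(n - 1), f[-1]
    d = np.abs((y1 - y0) * x - (x1 - x0) * f + x1 * y0 - y1 * x0)
    return int(np.argmax(d))

# keep only the noun phrases whose frequency rank is below the cutoff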

