
Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Daniel X. Pape

Community Architectures for Network Information Systems

[email protected]

www.canis.uiuc.edu

CSNA’98

6/18/98

Overview
  • Self-Organizing Map (SOM) Algorithm
  • U-Matrix Algorithm for SOM Visualization
  • SOM Navigation Application
  • Document Representation and Collection Examples
  • Problems and Optimizations
  • Future Work
Basic SOM Algorithm
  • Input
    • Number (n) of Feature Vectors (x)
    • Format: vector name: a, b, c, d
    • Examples:
      1: 0.1, 0.2, 0.3, 0.4
      2: 0.2, 0.3, 0.3, 0.2

Basic SOM Algorithm
  • Output
    • Neural-network map of M nodes
    • Each node has an associated weight vector (m) of the same dimensionality as the input feature vectors
    • Examples:
      m1: 0.1, 0.2, 0.3, 0.4
      m2: 0.2, 0.3, 0.3, 0.2

Basic SOM Algorithm
  • Output (cont.)
    • Nodes laid out in a grid:
Basic SOM Algorithm
  • Other Parameters
    • Number of timesteps (T)
    • Learning Rate (eta)
Basic SOM Algorithm

SOM() {
    foreach timestep t {
        foreach feature vector x {
            wnode = find_winning_node(x)
            update_local_neighborhood(wnode, x)
        }
    }
}

find_winning_node(x) {
    foreach node n {
        compute distance of n's weight vector m to x
    }
    return node with the smallest distance
}

update_local_neighborhood(wnode, x) {
    foreach node n in the neighborhood of wnode {
        m = m + eta [x - m]
    }
}
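The loop above can be sketched in NumPy (a minimal illustration; the grid size, learning rate, shrinking neighborhood radius, and random initialization are assumptions, not parameters from the presentation):

```python
import numpy as np

def train_som(X, rows=4, cols=4, T=20, eta=0.5, seed=0):
    """Minimal SOM: a rows x cols grid of nodes, each with a weight
    vector of the same dimensionality as the input feature vectors."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows * cols, X.shape[1]))        # weight vectors m
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(T):
        # neighborhood radius shrinks over the timesteps (an assumption)
        radius = max(rows, cols) / 2 * (1 - t / T)
        for x in X:
            # find_winning_node: node whose weight vector is closest to x
            win = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            # update_local_neighborhood: m = m + eta (x - m) near the winner
            hood = np.abs(coords - coords[win]).sum(axis=1) <= radius
            W[hood] += eta * (x - W[hood])
    return W

# the two example feature vectors from the slide
X = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.2, 0.3, 0.3, 0.2]])
W = train_som(X)
```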

U-Matrix Visualization
  • Provides a simple way to visualize cluster boundaries on the map
  • Simple algorithm:
    • for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors
  • The average distance is a measure of how similar a node is to its neighbors
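The algorithm above is a short loop over the node grid; a minimal sketch (4-connected neighbors and Euclidean distance are assumptions):

```python
import numpy as np

def u_matrix(W, rows, cols):
    """For each node, average the distances between its weight vector
    and those of its immediate (4-connected) grid neighbors."""
    G = W.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            d = [np.linalg.norm(G[r, c] - G[r + dr, c + dc])
                 for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
            U[r, c] = np.mean(d)
    return U

# identical weight vectors everywhere -> no cluster boundaries at all
U = u_matrix(np.ones((6, 3)), rows=2, cols=3)
```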
U-Matrix Visualization
  • Interpretation
    • one can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain
    • the resulting landscape represents the document space: the valleys (dark areas) are the clusters of data, and the mountains (light areas) are the boundaries between the clusters
U-Matrix Visualization
  • Example:
    • dataset of random three-dimensional points, arranged in four obvious clusters
U-Matrix Visualization

Four (color-coded) clusters of three-dimensional points

U-Matrix Visualization

Oblique projection of a terrain derived from the U-Matrix

U-Matrix Visualization

Terrain for a real document collection

Current Labeling Procedure
  • Feature vectors are encoded as 0’s and 1’s
  • Weight vectors have real values from 0 to 1
  • Sort each weight vector's dimensions by element value
    • the dimension with the greatest value gives the “best” noun phrase for that node
  • Aggregate nodes with the same “best” noun phrase into groups
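The labeling steps can be sketched as follows (the noun phrases are borrowed from the slide's later example; the weight values are invented for illustration):

```python
phrases = ["alexander", "king", "macedonians", "darius", "philip",
           "horse", "soldiers", "battle", "army", "death"]

def best_phrase(weight_vector):
    """The dimension with the greatest value is the node's 'best' phrase."""
    best_dim = max(range(len(weight_vector)), key=lambda i: weight_vector[i])
    return phrases[best_dim]

def group_nodes(weights):
    """Aggregate nodes sharing the same 'best' phrase into labeled groups."""
    groups = {}
    for node_id, w in enumerate(weights):
        groups.setdefault(best_phrase(w), []).append(node_id)
    return groups

groups = group_nodes([[0.9, 0.1, 0.0, 0.2, 0.1, 0.3, 0.0, 0.1, 0.0, 0.0],
                      [0.8, 0.2, 0.1, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.0],
                      [0.1, 0.2, 0.1, 0.7, 0.0, 0.1, 0.0, 0.3, 0.2, 0.1]])
# nodes 0 and 1 group under 'alexander'; node 2's best phrase is 'darius'
```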
U-Matrix Navigation
  • 3D Space-Flight
  • Hierarchical Navigation
Document Data
  • Noun phrases extracted
  • Set of unique noun phrases computed
    • each noun phrase becomes a dimension of the data set
  • Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase
Document Data
  • Example:
    • 10 total noun phrases:

alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death

    • each element of the feature vector will be a 1 or a 0:
      • 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
      • 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
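Using the slide's ten noun phrases, the encoding can be sketched as (which phrases appear in which document is chosen here to reproduce the slide's two example vectors):

```python
phrases = ["alexander", "king", "macedonians", "darius", "philip",
           "horse", "soldiers", "battle", "army", "death"]

def to_feature_vector(doc_phrases):
    """1 if the noun phrase occurs in the document, else 0."""
    present = set(doc_phrases)
    return [1 if p in present else 0 for p in phrases]

v1 = to_feature_vector(["alexander", "king", "philip", "horse"])
v2 = to_feature_vector(["king", "darius", "soldiers", "battle",
                        "army", "death"])
# v1 and v2 match the slide's example vectors 1 and 2
```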
Problems
  • As document sets get larger, the feature vectors get longer, use more memory, etc.
  • Execution time grows to unrealistic lengths
Solutions?
  • Need algorithm refinements for sparse feature vectors
  • Need a faster way to do the find_winning_node() computation
  • Need a better way to do the update_local_neighborhood() computation
Sparse Vector Optimization
  • Intelligent support for sparse feature vectors
    • saves on memory usage
    • greatly improves speed of the weight vector update computation
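One way sparsity helps the distance computation (an illustration of the idea, not necessarily the presentation's exact optimization): for a binary vector x stored only as its nonzero indices, ||m - x||^2 = ||m||^2 - 2 * sum(m[i] for i in x) + |x|, so with ||m||^2 precomputed only |x| entries of m are touched per node.

```python
import numpy as np

def sq_dist_dense(m, x_dense):
    """Baseline: full squared Euclidean distance over every dimension."""
    return float(((m - x_dense) ** 2).sum())

def sq_dist_sparse(m, x_idx, m_sq):
    """Same distance using only the binary vector's nonzero indices:
    ||m - x||^2 = ||m||^2 - 2 * sum(m[i] for i in x) + |x|."""
    return m_sq - 2.0 * float(m[x_idx].sum()) + len(x_idx)

m = np.array([0.1, 0.2, 0.3, 0.4])       # a node's weight vector
x_idx = np.array([0, 1])                 # binary vector 1, 1, 0, 0
m_sq = float((m ** 2).sum())             # precomputed once per node

d_sparse = sq_dist_sparse(m, x_idx, m_sq)
d_dense = sq_dist_dense(m, np.array([1.0, 1.0, 0.0, 0.0]))
```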
Faster find_winning_node()
  • SOM weight vectors become partially ordered very quickly
Faster find_winning_node()

U-Matrix Visualization of an Initial, Unordered SOM

Faster find_winning_node()

Partially Ordered SOM after 5 timesteps

Faster find_winning_node()
  • Don’t do a global search for the winner
  • Start search from last known winner position
  • Pro:
    • usually finds a new winner very quickly
  • Con:
    • this new search for a winner can sometimes get stuck in a local minimum
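The local search can be sketched as a hill descent over the grid (a minimal illustration; the move-to-best-neighbor policy is an assumption):

```python
def neighbors(idx, rows, cols):
    """4-connected grid neighbors of a node index."""
    r, c = divmod(idx, cols)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr * cols + nc

def local_find_winner(W, x, start, rows, cols):
    """Start from the last known winner and step to the closest grid
    neighbor until none improves; may get stuck in a local minimum."""
    def d(i):
        return sum((wi - xi) ** 2 for wi, xi in zip(W[i], x))
    current = start
    while True:
        best = min(neighbors(current, rows, cols), key=d)
        if d(best) < d(current):
            current = best
        else:
            return current

W = [[0.0], [0.3], [0.6], [0.9]]    # 1x4 grid of 1-D weight vectors
win = local_find_winner(W, [0.9], start=0, rows=1, cols=4)
```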
Better Neighborhood Update
  • Nodes get told to “update” quite often
  • A node’s weight vector needs to be read only during a find_winning_node() search
  • With local find_winning_node() search, a lazy neighborhood weight vector update can be performed
Better Neighborhood Update
  • Cache update requests
    • each node will store the winning node and feature vector for each update request
  • The node performs the update computations called for by the stored update requests only when asked for its weight vector
  • Possible reduction of number of requests by averaging the feature vectors in the cache
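The lazy scheme above can be sketched as follows (a minimal illustration; the class and method names are hypothetical, and the averaging reduction is omitted):

```python
class LazyNode:
    """Cache update requests; apply them only when the weight vector
    is actually read, i.e. during a winner search."""

    def __init__(self, weights, eta=0.5):
        self._m = list(weights)
        self._eta = eta
        self._pending = []          # stored feature vectors to apply later

    def update(self, x):
        self._pending.append(x)     # defer the work

    @property
    def m(self):
        # replay the cached updates, then expose the weight vector
        for x in self._pending:
            self._m = [mi + self._eta * (xi - mi)
                       for mi, xi in zip(self._m, x)]
        self._pending.clear()
        return self._m

node = LazyNode([0.0, 0.0], eta=0.5)
node.update([1.0, 0.0])             # cached, no computation yet
node.update([1.0, 0.0])
m = node.m                          # both updates are replayed here
```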
Future Work
  • Parallelization
  • Label Problem
Label Problem
  • The current procedure is not very good
  • Cluster boundaries
  • Term selection
Cluster Boundaries
  • Image processing
  • Geometric
Cluster Boundaries
  • Image processing example:
Term Selection
  • Too many unique noun phrases
    • Too many dimensions in the feature vector data
  • “Knee” of frequency curve