1 / 14

A Nonlinear Mapping for Data Structure Analysis

A Nonlinear Mapping for Data Structure Analysis. John W. Sammon, Jr., IEEE Transaction on Computers, Vol. C-18, No. 5, 1969, pp. 401-409. Presenter : Wei-Shen Tai Advisor : Professor Chung-Chian Hsu 200 7 / 4/4. Outline. Introduction Nonlinear mapping Some computer results

jsherron
Download Presentation

A Nonlinear Mapping for Data Structure Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Nonlinear Mapping for Data Structure Analysis John W. Sammon, Jr., IEEE Transaction on Computers, Vol. C-18, No. 5, 1969, pp. 401-409. Presenter : Wei-Shen Tai Advisor : Professor Chung-Chian Hsu 2007/4/4

  2. Outline • Introduction • Nonlinear mapping • Some computer results • Relationship of NLM to other structure analysis algorithm • Limitations and extensions • Comments • MDS

  3. Motivation • Data structure visualization • Provide a highly effective visualization method in the analysis of multivariate data. • Data structure refers to geometric relationships among subsets of the data vectors in the L-space.

  4. Objective • Nonlinear mapping algorithm (NLM) • Based upon a point mapping of the NL-dimensional vectors from the L-space to a lower dimensional space such that the inherent structure of the data is approximately preserved under the mapping.

  5. Nonlinear mapping • N vectors in an L-space designated Xi, i= 1, …, N and corresponding to these we define N vectors in a d-space (d = 2 or 3) designated Yi, i=l, …, N. • Let the distance between the vectors Xi and Xj in the L-space be defined by dij*=dist [Xi, Xj] and the distance between the corresponding vectors-Yi and Yj in the d-space be defined by dij= dist [Yi, Yj]. • A steepest descent procedure to search for a minimum of the error

  6. Computer results

  7. 19-dimensional Gaussian simplex distribution Fig 7. result of principle eigenvector plots Fig 6. result of NLM

  8. Experiments in document classification • A document classification space • Every document in the library was represented as a 17-dimensional vector. All of them are described a mapping of 1125 preselected words and phrases into the C-space. • Query 1 ~ 5 and their related documents are shown, respectively. • Documents considered relevant to a given request were clustered. • Documents tend to be uniformly distributed throughout the space. • Clusters 2 and 3 tend to overlap, yet they are well-separated from clusters 4 and 5. In general, the intercluster relationships seem consistent with their respective subject relationships.

  9. Relationship to other related algorithm • Multidimensional Scaling • Find a configuration of points in a t-space such that the resultant inter-point distances preserve a monotonic relationship to a given set of inter-element similarities (or dissimilarities). • Deficiencies • Resulting cluster configuration is highly dependent upon a set of control parameters which must be fixed by the user. • Particularly sensitive to hyper-spherical structure and are inefficient in detecting more complex relationships in the data. • Do not exist really good ways for evaluating a resultant cluster configuration. • When two clusters are close, the vectors between tend to form a bridge and cause spurious mergers.

  10. Nonlinear mapping vantage • A highly promising structure analysis algorithm • None control parameters require a priori knowledge. • Highly efficient in identifying complex data structures. • Easy to detect and identify data structure. • Dealing extraneous data and spurious mergers. • Simple and efficient.

  11. Limitation and extension • Limitations • Reliability of the scatter diagram in displaying extremely complex high-dimensional structure. • Minimum mapping error is too large (E>>0.1) and the 2-dimensional scatter plot fails to portray the true structure. • Number of vectors that it can handle. • Limited at present to N< 250 vectors. • When N> 250, we suggest using a data compression technique to reduce the data set to less than 250 vectors. • Extension • On-Line Pattern Analysis and Recognition System (OLPARS)

  12. Comments • Advantage • A visualization method for hyper-space data. • The distance of data space can be preserved and interpreted in geometric relationship in the low-dimension map. • Drawback • Easy to learn and hard to compute. • The computational cost seems quite high. • Application • Data structure visualization related applications.

  13. MDS • A very simple example, using mileage distances between cities. • Start with a map, which illustrates the relative geographic locations of a set of American cities. • The map is a geometric model in which cities are represented as points in two-dimensional space. The distances between the points are proportional to the geographic proximities of the cities. • Using the map/model it is easy to construct a square matrix containing the distances between any pair of cities. • The matrix, itself, is analogous to the mileage chart that is often included with road maps.

  14. MDS algorithm • MDS uses the matrix of distances (i.e., the “mileage chart”) as input data. • The output from MDS consists of two parts: • A model showing the cities as points in space, with the distances between the points proportional to the entries in the input data matrix (i.e., a map). • A goodness-of-fit measure showing how closely the geometric point configuration corresponds to the data values from the input data matrix.

More Related