
Document Maps

Document Maps. Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw.


Presentation Transcript


  1. Document Maps. Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw. Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"

  2. Agenda • Motivation • What is a document map • Map creation • Clustering • Experimental results • Future directions

  3. Motivation • The Web and intranets are becoming increasingly content-rich; simple ranked lists, or even hierarchies of results, no longer seem adequate • A good way of presenting massive document sets in an understandable form will be crucial in the near future

  4. Document map • Many attempts have been made to visualize sets of documents not just as a list, but rather in two dimensions • A document map is a mapping of a set of documents onto a 2-D plane that represents their inter-relationships

  5. Linear relationship presentation (Internet Cartographer)

  6. A relationship • A link between hypertext documents • Citation in the bibliography • Content similarity

  7. A tree of relations with central subject (Inxight – Tree Studio)

  8. Self-organizing map (WebSOM): dissimilarity of groups of documents

  9. Document frequency in clusters

  10. A meta search engine map

  11. Our approach – multiple representations (BEATCA)

  12. Map visualizations in 3D (BEATCA)

  13. Future research – hypergeometric representation (Fish-Eye Effect)

  14. The preparation of documents is done by an indexer, which turns each document into a vector-space model representation • The indexer also identifies frequent phrases in the document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded • The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • ‘The best’ map (with respect to some similarity measure) is used by the query processor in response to the user’s query

  15. How are the maps created • A modified WebSOM method is used: • compact reference vector representation • broad-topic initialization method • joint winner search method • multi-level (hierarchical) maps • multi-phase document clustering: • initial grouping to identify major topics • initial document grouping • WebSOM on document groups • fuzzy cell cluster extraction and labelling

  16. Document model in search engines • In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains. [Figure: two example documents, "My dog likes this food" and "When walking, I take some food", plotted in a space with axes dog, food and walk]

  17. Document model in search engines • The relevance of a document to a query, or to another document, is measured as the cosine of the angle between the two vectors. [Figure: the query "walk" plotted in a space with axes dog, food and walk]
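The cosine measure from slide 17 can be sketched in a few lines. This is a minimal illustration, assuming raw term counts as vector weights and hand-reduced term vectors for the two example documents (the slides do not say which weighting or stemming the authors use); the function name `cosine` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)

# Assumed reduction of the slide's example documents to the three axis terms.
doc_dog = {"dog": 1.0, "food": 1.0}    # "My dog likes this food"
doc_walk = {"food": 1.0, "walk": 1.0}  # "When walking, I take some food"
query = {"walk": 1.0}

for name, doc in (("doc_dog", doc_dog), ("doc_walk", doc_walk)):
    print(name, round(cosine(query, doc), 3))
```

Under these assumptions the query shares no term with the first document, so its cosine is zero, while the second document lies at 45 degrees to the query.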

  18. Reference vector representation • Vectors are sparse by nature • During the learning process they become even sparser • Represented as balanced red-black trees • A tolerance threshold is imposed • Terms (dimensions) below the threshold are removed • Significant complexity reduction without negative quality impact

  19. Topic-sensitive initialization • Inter-topic similarities are important both for map learning and for visualization/cluster extraction • Simple approach: • Use LSI to select K main broad topics • Select K map cells (evenly spread over the map) as the fixpoints for individual topics • Initialize selected fixpoints with broad topics • Initialize remaining cells with "in-between values"
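The thresholding idea from slide 18 can be illustrated as follows. This is only a sketch: a Python `dict` stands in for the red-black tree named on the slide, the update rule is the standard SOM step toward the document, and `update_reference`, `alpha` and `tolerance` are hypothetical names.

```python
def update_reference(ref, doc, alpha, tolerance):
    """Move sparse reference vector `ref` toward `doc`, then prune small terms.

    ref, doc: dict mapping term -> weight (absent terms count as 0).
    alpha: learning rate in [0, 1].
    tolerance: terms whose absolute weight falls below this are dropped.
    """
    updated = {}
    for term in set(ref) | set(doc):
        old = ref.get(term, 0.0)
        value = old + alpha * (doc.get(term, 0.0) - old)
        if abs(value) >= tolerance:  # keep only significant dimensions
            updated[term] = value
    return updated

ref = {"dog": 1.0, "food": 0.02}
doc = {"dog": 1.0, "walk": 1.0}
print(update_reference(ref, doc, alpha=0.5, tolerance=0.05))
```

Here the weight of "food" decays below the tolerance and the dimension is dropped, which is how the vectors get sparser as learning proceeds.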
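One way to read the "in-between values" step is as an interpolation between the fixpoints. The sketch below assumes inverse-squared-distance weighting on the grid, which the slides do not specify, and takes the K topic vectors as given (the LSI step is not shown); `init_map` is a hypothetical name.

```python
def init_map(rows, cols, fixpoints):
    """Initialize a rows x cols map from topic fixpoints.

    fixpoints: dict mapping (row, col) -> topic vector (dict: term -> weight).
    Non-fixpoint cells get an inverse-squared-distance weighted average
    of all topic vectors (an assumed interpolation scheme).
    """
    grid = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in fixpoints:
                grid[(r, c)] = dict(fixpoints[(r, c)])
                continue
            cell, total = {}, 0.0
            for (fr, fc), topic in fixpoints.items():
                w = 1.0 / ((r - fr) ** 2 + (c - fc) ** 2)
                total += w
                for term, value in topic.items():
                    cell[term] = cell.get(term, 0.0) + w * value
            grid[(r, c)] = {t: v / total for t, v in cell.items()}
    return grid

grid = init_map(1, 3, {(0, 0): {"sport": 1.0}, (0, 2): {"music": 1.0}})
print(grid[(0, 1)])
```

The middle cell ends up as an equal blend of the two topics, so neighboring cells start out similar, which is the property the slide asks for.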

  20. Clustering document vectors • Important difference from general clustering: not only should each cluster contain similar documents, but neighboring clusters should also be similar. [Figure: the document space mapped onto the 2D map, with labels r, x and m; a thick arrow marks a strong change of position]

  21. Joint winner search • Global winner search: accurate but slow • Local winner search: faster but can be inaccurate during rapid changes • Start with a single phase of global search • Document movements become smoother during the learning process: usually local search is enough • Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
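The combination of the two search modes might look roughly like this. The slides do not state how a "sudden move" is detected, so this sketch assumes a similarity threshold (`min_similarity`) as the fallback trigger; all function names and the neighbourhood radius are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def global_winner(cells, doc):
    """Scan every cell: accurate but slow."""
    return max(cells, key=lambda pos: cosine(cells[pos], doc))

def local_winner(cells, doc, previous, radius=1):
    """Scan only cells within `radius` of the document's previous winner."""
    pr, pc = previous
    near = [p for p in cells if abs(p[0] - pr) <= radius and abs(p[1] - pc) <= radius]
    return max(near, key=lambda pos: cosine(cells[pos], doc))

def joint_winner(cells, doc, previous, min_similarity=0.5):
    """Local search first; fall back to global search on a poor local match."""
    if previous is None:  # first phase: no previous position known yet
        return global_winner(cells, doc)
    best = local_winner(cells, doc, previous)
    if cosine(cells[best], doc) < min_similarity:  # assumed "sudden move" test
        return global_winner(cells, doc)
    return best

cells = {(0, 0): {"dog": 1.0}, (0, 1): {"food": 1.0},
         (0, 2): {"food": 1.0, "walk": 1.0}, (0, 3): {"walk": 1.0}}
print(joint_winner(cells, {"walk": 1.0}, previous=(0, 0)))
```

In the usage line the document's old neighbourhood contains no good match, so the search falls back to the global scan.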

  22. Hierarchical maps • Bottom-up approach • Feasible (with joint winner search method) • Start with most detailed map • Compute weighted centroids of map areas • Use them as seeds for coarser map • Top-down approach is possible but requires fixpoints
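The bottom-up step can be sketched as collapsing blocks of detailed cells into one coarser cell. The sketch assumes square blocks and weights each cell by the number of documents assigned to it; the slides only say "weighted centroids of map areas", so both choices and the name `coarsen` are assumptions.

```python
def coarsen(fine, counts, block=2):
    """Seed a coarser map from a detailed one.

    fine:   dict (row, col) -> reference vector (dict: term -> weight)
    counts: dict (row, col) -> number of documents assigned to that cell
    Each block x block area becomes one coarse cell holding the
    document-count-weighted centroid of its reference vectors.
    """
    sums, weights = {}, {}
    for (r, c), vec in fine.items():
        key = (r // block, c // block)
        acc = sums.setdefault(key, {})
        w = counts.get((r, c), 0)
        if w == 0:
            continue  # empty cells do not influence the centroid
        weights[key] = weights.get(key, 0) + w
        for term, value in vec.items():
            acc[term] = acc.get(term, 0.0) + w * value
    coarse = {}
    for key, acc in sums.items():
        total = weights.get(key, 0)
        coarse[key] = {t: v / total for t, v in acc.items()} if total else {}
    return coarse

fine = {(0, 0): {"dog": 1.0}, (0, 1): {"food": 1.0},
        (1, 0): {"dog": 1.0}, (1, 1): {"walk": 1.0}}
counts = {(0, 0): 3, (0, 1): 1}
print(coarsen(fine, counts))
```

The resulting vectors would then serve as the initial reference vectors of the coarser map before it is trained further.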

  23. Clustering document groups • Numerous methods exist, but none of them is directly applicable: • Extremely fuzzy structure of topical groups in SOM cells • Necessity of taking into account similarity measures both in the original document space and in the map space • Outlier-handling problem during cluster formation • No a priori estimation of the number of topical groups • Fuzzy C-Means on the lattice of map cells is applied • Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering • Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy

  24. Experiments with map convergence • We examined the convergence of the maps to a stable state depending on: • type of alpha function (search radius reduction) • type of winner search method • type of initialization method

  25. Convergence – alpha functions (linear versus reciprocal)
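The slides name the two alpha schedules but not their formulas. As a hedged illustration, the forms below are the ones commonly used for SOM training, not necessarily those in the experiments; `steepness` is a hypothetical parameter.

```python
def linear_alpha(t, total, start=1.0):
    """Decay linearly from `start` at step 0 to 0 at step `total`."""
    return start * (1.0 - t / total)

def reciprocal_alpha(t, total, start=1.0, steepness=10.0):
    """Decay as start / (1 + steepness * t / total): fast early, slow late."""
    return start / (1.0 + steepness * t / total)

for t in (0, 25, 50, 75, 100):
    print(t, round(linear_alpha(t, 100), 3), round(reciprocal_alpha(t, 100), 3))
```

With these forms the reciprocal schedule drops quickly at first and never reaches zero, whereas the linear one shrinks at a constant rate.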

  26. Convergence – winner search (joint versus local)

  27. Experiments with execution time • The impact of the following factors on the speed of map creation was investigated: • Map size (total number of cells) • Optimization methods: • dictionary optimization • reference vector representation • Map quality assessment: • Compare with an ‘ideal’ map (e.g. one built without optimizations) • Identical initialization and learning parameters • Compute the sum of squared distances between the locations of each document on both maps
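The quality measure in the last bullet of slide 27 is simple to state in code. This sketch assumes each map is given as a dictionary from a document identifier to its (row, column) cell, and that both maps contain the same documents; `map_distance` is a hypothetical name.

```python
def map_distance(ideal, optimized):
    """Sum of squared grid distances between each document's two locations.

    ideal, optimized: dict mapping doc_id -> (row, col) on the respective map.
    A value of 0 means every document landed in the same cell on both maps.
    """
    return sum(
        (ideal[d][0] - optimized[d][0]) ** 2 + (ideal[d][1] - optimized[d][1]) ** 2
        for d in ideal
    )

ideal = {"a": (0, 0), "b": (2, 3)}
optimized = {"a": (0, 1), "b": (4, 3)}
print(map_distance(ideal, optimized))
```

Because both maps start from identical initialization and learning parameters, a small value suggests that the optimization changed the result very little.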

  28. Execution time - map size

  29. Execution time - optimizations

  30. Future research • Maps for joint term-citation model, taking into account between-group link flow direction • Fully distributed map creation • Adaptive document retrieval and clustering: • Bayesian network based relevance measure • Survival models for document update rate estimation • Dead link propagation methods for page freshness estimation • We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects

  31. Future research • Bayesian networks will be applied in particular to: • measure relevance and classify documents • accelerate document clustering processes • construct a thesaurus supporting query enrichment • keyword extraction • between-topic dependencies estimation

  32. Thank you! Any questions?
