
An Interface for navigating clustered document sets returned by queries





  1. An Interface for navigating clustered document sets returned by queries. Robert Allen, Pascal Obry & Michael Littman. Presented by: Anna Hunecke

  2. Overview • Information Retrieval • The Interface • The procedure • Results • Discussion

  3. Information Retrieval (IR) • find the information in a database that is relevant to a user query • the results can be documents, text passages, images, ... • for IR, a mathematical model of the database is created • example: web search engines

  4. Problems of IR • Synonymy: a concept can be described by several different words • Polysemy: one word can have more than one meaning • -> Latent Semantic Indexing (LSI) is intended to address both problems

  5. Idea of the Interface • create an interface that answers a user query by returning the most relevant articles from a corpus • the results are presented as a dendrogram rather than as a flat ranked list of similar articles • the dendrogram is built via hierarchical clustering

  6. Dendrogram • a tree in which the similarity between leaves is shown by the height of the connection that joins them • example: [the slide's example figure is not reproduced in this transcript; see the sketch below]
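Since the example figure cannot be recovered, here is a minimal Python stand-in (assuming SciPy and Matplotlib, with five hypothetical 2-D points in place of document vectors) that builds and draws a dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Five hypothetical 2-D points standing in for document vectors.
points = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9], [5.0, 0.5]])
labels = ["doc A", "doc B", "doc C", "doc D", "doc E"]

# linkage() builds the merge tree; dendrogram() draws it. The height at
# which two branches join reflects how dissimilar the joined groups are.
Z = linkage(points, method="ward")
dendrogram(Z, labels=labels)
plt.ylabel("merge height (dissimilarity)")
plt.show()
```

In the resulting plot, doc A/doc B and doc C/doc D join low (similar), while doc E joins the rest only near the top (dissimilar).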

  7. The Interface • [screenshot not reproduced; its labelled components:] user query field, dendrogram, subtree view, text window, document lists, lists of proximal documents

  8. The Corpus • 25,629 articles from the Academic American Encyclopedia • articles were preprocessed using latent semantic indexing (LSI) • cross-reference links were not used

  9. How does the retrieval work? • the user query is processed using LSI • the 400 highest-ranked articles (the top 1.6% of the corpus) form the return set • these are clustered using Ward’s algorithm (see the sketch below)
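The paper gives no code for this step; the following is a minimal sketch of the return-set selection, assuming the documents and the query already live in the LSI vector space (the function name and parameters are hypothetical):

```python
import numpy as np

def return_set(doc_vecs, query_vec, k=400):
    """Pick the k documents most similar to the query (hypothetical helper).

    doc_vecs:  (n_docs, dim) array of LSI document vectors
    query_vec: (dim,) LSI vector of the query
    """
    # Cosine similarity = dot product of unit-normalized vectors.
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = docs @ q
    # argsort is ascending; take the last k and reverse for best-first order.
    return np.argsort(sims)[-k:][::-1]
```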

  10. Latent Semantic Analysis (LSA) • idea: determine the similarity of meaning of words and passages in large corpora • applications: modelling human conceptual knowledge (synonym tests, calculating text coherence, ...) and information retrieval (-> LSI)

  11. LSI • documents are converted into a term-by-document matrix A (the slide’s matrix figure is not reproduced) • the cell values are transformed so that they express each word’s importance and the information it carries • this is done using an entropy measure (one common scheme is sketched below)
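The slide names only "the entropy measure" without a formula. One standard log-entropy weighting from the LSI literature, offered here as an assumption about what is meant: a_ij = g_i * log(tf_ij + 1), where g_i = 1 + sum_j (p_ij log p_ij) / log(n_docs) and p_ij = tf_ij / gf_i. A sketch:

```python
import numpy as np

def log_entropy(counts):
    """Log-entropy weighting of a term-by-document count matrix.

    counts: (n_words, n_docs), counts[i, j] = frequency of word i in doc j;
    assumes every word occurs at least once. This is one standard LSI
    weighting; the paper's exact variant may differ.
    """
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)       # global frequency per word
    p = np.where(counts > 0, counts / gf, 1.0)   # p_ij (dummy 1 where tf = 0)
    # Global weight: ~0 for words spread evenly over all documents,
    # ~1 for words concentrated in few documents (more informative).
    entropy = np.where(counts > 0, p * np.log(p), 0.0).sum(axis=1, keepdims=True)
    g = 1.0 + entropy / np.log(n_docs)
    # Local weight: dampened log of the raw count.
    return g * np.log(counts + 1.0)
```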

  12. LSI • in the next step the matrix is factored using singular value decomposition (SVD), turning each document into a vector • this compresses the large matrix into (comparatively) small vectors (see the sketch below)
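A minimal sketch of this compression step using NumPy's SVD; the dimensionality k=100 is illustrative, not a value reported in the paper:

```python
import numpy as np

def lsi_spaces(A, k=100):
    """Truncated SVD of the weighted term-by-document matrix A.

    Returns (U_k, s_k, doc_vecs): the word space, the top singular values,
    and k-dimensional document vectors (one row per document).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Document j is represented by column j of diag(s_k) @ Vt_k.
    doc_vecs = (s_k[:, None] * Vt_k).T
    return U_k, s_k, doc_vecs
```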

  13. LSI • the documents are now modelled as vectors • the query is also modelled as a vector • similarity between vectors reflects which documents are related • the similarity measure is the cosine between vectors (see the sketch below)
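The slide does not show how the query vector is built. The textbook LSI "folding-in" construction, paired with the cosine measure, is sketched here as an assumption (it reuses U_k and s_k from the SVD sketch above):

```python
import numpy as np

def fold_in_query(q_counts, U_k, s_k):
    """Map a raw query term-count vector into the k-dim LSI space.

    Standard folding-in: q_hat = q @ U_k / s_k. In practice the query
    counts would get the same local weighting as the matrix cells.
    """
    return (q_counts @ U_k) / s_k

def cosine(a, b):
    """Cosine of the angle between two vectors; 1.0 = same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```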

  14. Ward’s algorithm (Ward 1963) • Start: 400 sets, each containing one document vector • Repeat: find the two most similar sets according to the similarity measure and unite them • the similarity measure is the cosine between vectors; for sets with more than one element, the cosine of the average vectors is used • Stop: when all documents are in one set (see the sketch after this slide)
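Note that the slide describes a cosine-of-average-vectors merge criterion, whereas textbook Ward's method minimizes within-cluster variance. The sketch below implements the slide's description, naively at cubic cost (which echoes the computational-cost criticism on slide 16):

```python
import numpy as np

def agglomerate(doc_vecs):
    """Bottom-up clustering per the slide: repeatedly merge the two
    clusters whose average vectors have the highest cosine similarity.

    doc_vecs: (n_docs, dim) array. Returns the merge steps, i.e. the
    edges of the dendrogram, as (cluster_id_a, cluster_id_b, similarity).
    """
    clusters = {i: [i] for i in range(len(doc_vecs))}
    centroids = {i: doc_vecs[i].astype(float) for i in range(len(doc_vecs))}
    merges, next_id = [], len(doc_vecs)
    while len(clusters) > 1:
        ids = list(clusters)
        best, best_sim = None, -np.inf
        # Scan all pairs for the most similar pair of average vectors.
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                u, v = centroids[ids[a]], centroids[ids[b]]
                sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
                if sim > best_sim:
                    best_sim, best = sim, (ids[a], ids[b])
        i, j = best
        merges.append((i, j, best_sim))
        members = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = members
        centroids[next_id] = doc_vecs[members].mean(axis=0)
        del centroids[i], centroids[j]
        next_id += 1
    return merges
```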

  15. Does the clustering improve the result from LSI? • informal results only: • sometimes clustering does not improve retrieval, either because LSI returns no relevant articles to cluster or because LSI already returns good results on its own • clustering improves results if the query consists of several parts or includes ambiguous terms

  16. Points of criticism • the user needs about 20 minutes to get used to the interface • the computational cost of the clustering is very high • the paper does not state clearly which features the clustering is based on • the paper does not give a proper evaluation

  17. Limitations • for effective search, the corpus must have entries with clear titles or descriptions • cluster descriptions are only a list of the documents they contain • the usefulness of clustering depends strongly on the corpus and the query

  18. Discussion • alternatives to hierarchical clustering? • alternatives to the average-vector similarity? • analysing the query in order to decide whether clustering should be applied at all? • other applications?
