
Cluster Language Models






Presentation Transcript


  1. Cluster Language Models By Roman Brauchle

  2. Structure

  3. Cluster-based retrieval vs. Document-based retrieval • Document-based retrieval • Information retrieval (IR) system matches the query against a collection of documents • Returns a ranked list of documents to the user • Cluster-based retrieval • Similar documents match the same information needs • Groups documents into clusters → document clustering • Returns a list of documents based on the clusters they came from

  4. Qualities of clustering methods (1) • Static clustering • Used in most early attempts • A hierarchical clustering method usually used for small document collections • Different algorithms for matching the query against the clusters • Probably more precise than document-based retrieval, but less effective

  5. Different ways of document clustering • Static clustering • Clustering is independent of an existing query • Match the query against clusters, e.g. their centroids, instead of single documents • Query-specific clustering • Documents to be clustered come from a traditional document-retrieval result for a query

  6. Qualities of clustering methods (2) • Query-specific clustering • Main goal is to improve the ranking of relevant documents • The number of top-ranked documents does not have a significant impact on clustering effectiveness • The approaches of static clustering and query-specific clustering are comparable • Query-specific clustering outperforms static clustering

  7. Purpose • Improving efficiency of traditional IR systems • Supposed to improve effectiveness as well • Categorize or classify documents • Important tool for Web search engines, document browsing, distributed retrieval etc.

  8. Recent developments • No conclusive findings about improvement of retrieval results on collections of realistic size • New ways of thinking about retrieval process using recent developments in statistical language modelling • Leads to re-examination of cluster-based retrieval

  9. Different approaches (1) • Traditional approach • Used to identify a subset of documents likely to be relevant to a query (static clustering) • At retrieval time, only those documents are matched against the query • Most common approach

  10. Different approaches (2) • Ranking clusters • Retrieve a list of clusters in their entirety in response to a query (static or query-specific clustering) • Rank clusters based on their similarity to the query • Any document in a cluster considered more relevant is ranked higher than any document from a cluster ranked lower in the list

  11. Different approaches (3) • Smoothing clusters • Using clusters as a form of document smoothing (also static or query specific clustering) • By grouping documents into clusters, differences between representations of individual documents are smoothed out

  12. Cluster-based Language Models • Document clustering is used to organize collections into clusters around their topics • Language models are estimated for those clusters and used to represent their topics • Employed in Topic Detection and Tracking (TDT)

  13. CQL model (1) • The basic approach for using language models in IR is to model the query generation process • General idea • Statistical language models are probability distributions over all possible sentences or other linguistic units in a language • Build a language model D for each document in the collection and rank the documents according to how likely the query Q could have been generated from each of these document models, i.e. P(Q|D) • Also called the query-likelihood (QL) retrieval model

  14. CQL model (2) • Estimation of P(Q|D) • Most commonly the query is treated as a sequence of independent terms, so the probability can be written as P(Q|D) = ∏i P(qi | D), where qi is the i-th term in the query and P(qi | D) is specified by the document language model
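To make the estimation concrete, here is a minimal sketch of query-likelihood scoring in Python; the maximum-likelihood document models, the Jelinek-Mercer smoothing weight lam, and the toy documents are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram language model for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query_tokens, doc_lm, coll_lm, lam=0.8):
    """log P(Q|D) = sum_i log P(q_i|D); the document model is smoothed
    with the collection model so unseen query terms do not zero the score."""
    score = 0.0
    for q in query_tokens:
        p = lam * doc_lm.get(q, 0.0) + (1 - lam) * coll_lm.get(q, 1e-9)
        score += math.log(p)
    return score

# Toy example: rank two documents by how likely they generate the query.
docs = {"d1": "cluster language model retrieval".split(),
        "d2": "document ranking with language models".split()}
coll_lm = unigram_lm([t for toks in docs.values() for t in toks])
query = "cluster retrieval".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_likelihood(query, unigram_lm(docs[d]), coll_lm))
print(ranking)  # ['d1', 'd2'] here, since d1 contains both query terms
```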

  15. CQL model (3) • Similar approach for cluster-based retrieval • Build language models for clusters and retrieve / rank clusters based on the likelihood of generating the query, i.e. P(Q | Cluster) = ∏i P(qi | Cluster), where P(qi | Cluster) is specified by the cluster language model
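Building on the previous sketch, the cluster variant only swaps in a different generator: a cluster language model estimated, for instance, from the pooled text of the cluster's member documents (the pooling choice is an assumption made here for illustration).

```python
def cluster_lm(member_ids, docs):
    """Estimate a cluster language model from the pooled text of its members."""
    pooled = [t for d in member_ids for t in docs[d]]
    return unigram_lm(pooled)

# Rank clusters by P(Q | Cluster) with the same query_likelihood scorer.
clusters = {"c1": ["d1"], "c2": ["d2"]}
cluster_ranking = sorted(
    clusters, reverse=True,
    key=lambda c: query_likelihood(query, cluster_lm(clusters[c], docs), coll_lm))
```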

  16. CBDM model • A second model that smooths the representations of individual documents • The CBDM model can also be viewed as a mixture of three sources • The document • The cluster/topic the document belongs to • The collection
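A hedged sketch of the three-source mixture behind the CBDM idea; the interpolation weights are illustrative placeholders, since the slides do not specify their values.

```python
def cbdm_prob(term, doc_lm, topic_lm, coll_lm,
              w_doc=0.6, w_topic=0.3, w_coll=0.1):
    """P(w|D) estimated as a mixture of the document model, the model of the
    cluster/topic the document belongs to, and the collection model."""
    return (w_doc * doc_lm.get(term, 0.0)
            + w_topic * topic_lm.get(term, 0.0)
            + w_coll * coll_lm.get(term, 1e-9))
```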

  17. Clustering algorithms (1) • Cluster-based retrieval requires that documents first be organized into clusters • The procedure consists of the following steps • Establish a pairwise measure of document similarity, for example the cosine measure • Build a specific number of clusters by partitioning the document collection, depending on the clustering mode
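As one possible instantiation of the first step, pairwise cosine similarity over raw term-frequency vectors can be computed as follows (a minimal sketch; the papers may use a different term weighting).

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two documents' term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```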

  18. Clustering algorithms (2) • In the case of static clustering • Three-pass K-means algorithm • Linear in the total number of documents N • In the case of query-specific clustering • Hierarchical agglomerative clustering algorithms such as single and complete linkage, group average, centroid and Ward's algorithms
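For the static-clustering mode, a partition can be produced with an off-the-shelf K-means over tf-idf vectors; scikit-learn is used here purely for illustration and is not the three-pass variant mentioned in the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["cluster based retrieval with language models",
         "document ranking by query likelihood",
         "topic detection and tracking with clusters",
         "smoothing document models with collection statistics"]

X = TfidfVectorizer().fit_transform(texts)                 # documents -> tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # partition into 2 clusters
print(labels)                                              # cluster id per document
```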

  19. Application examples (1) • Cluster-based retrieval by ranking clusters • Compares the CQL model with document-based retrieval • The experimental procedure consists of three steps • Document-based retrieval using the QL model • Query-specific clustering using one of the clustering algorithms mentioned above • Ranking the clusters built from the top-ranking documents using the CQL model • Results • Cluster-based retrieval is as effective as document-based retrieval or performs slightly better

  20. Application examples (2) • Cluster-based retrieval by smoothing documents with clusters • Evaluates the CBDM model in the context of QL retrieval and the relevance model (RM) for both static and query-specific clustering • Also compares to traditional document-based retrieval • Consists of the following steps • Apply the K-means algorithm to group documents into clusters • Perform cluster-based retrieval using query likelihood with the CBDM as the document model

  21. Application examples (3) • Results • Cluster-based retrieval performs significantly better than document-based retrieval in case of static clustering • In case of query-specific clustering it performs as well or slightly better than document-based retrieval

  22. Alternative approaches to language models • The previous application examples demonstrated the effectiveness of probabilistic language models • Language models typically use only individual document features • Other sources of information should be exploited, such as the similarity structure of the document collection, i.e. the corpus

  23. Advantages • Offline clustering is by definition query independent and may be based on factors irrelevant to the query • Retrieving documents that do not contain a certain query term but are still relevant • Cluster statistics may overgeneralize with respect to specific member documents

  24. Alternative retrieval framework • Incorporates both individual document information and corpus structure information • Uses precomputed overlapping clusters • The choice of which clusters to incorporate can depend on the query, even though the clusters themselves are computed query-independently

  25. Structure representation using overlapping clusters • Overlapping clusters represent corpus similarity structures • Clusters can be thought of as approximations of the true facets of the corpus the user might be interested in • Overlapping clusters form a better model of similarity structure than partitioning the corpus

  26. General approach • To assign a ranking to the documents, the set of clusters is created in advance • Execution of the following fairly general algorithm, given a query q and N documents to retrieve • Additional notation: the document language model pd assigns probabilities to text strings based on the document d; the cluster language model pc assigns probabilities to text strings based on the entire cluster c

  27. General algorithm
  Offline: create the clusters
  Online, given query q:
  • For each document d: choose Facets(d), a query-dependent subset of the clusters containing d
  • Score d by a weighted combination of pd(q) and pc(q) for all clusters c in Facets(d)
  • Set TopDocs(N) to the rank-ordered list of the N top-scoring documents
  • Optional: re-rank each d in TopDocs(N) by pd(q)
  • Return TopDocs(N)
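A rough Python sketch of the online phase; p_d and p_c stand for the document- and cluster-based query likelihoods, facets selects the query-dependent subset of clusters, and all names and the weighting scheme are illustrative assumptions.

```python
def rank_documents(query, docs, clusters, p_d, p_c, facets, N, w=0.5, rerank=False):
    """Score each document by a weighted combination of its own query likelihood
    and the query likelihoods of the clusters chosen as its facets."""
    scores = {}
    for d in docs:
        facet_clusters = facets(d, query, clusters)   # query-dependent subset of clusters containing d
        cluster_part = sum(p_c(query, c) for c in facet_clusters)
        scores[d] = w * p_d(query, d) + (1 - w) * cluster_part
    top = sorted(scores, key=scores.get, reverse=True)[:N]
    if rerank:                                        # optional re-ranking of TopDocs(N) by p_d alone
        top = sorted(top, key=lambda d: p_d(query, d), reverse=True)
    return top
```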

  28. Retrieval algorithms – Cluster formation and selection • Clusters consist of each document and its k−1 nearest neighbours • k is a free parameter • Clusters with different basis documents may contain the same set of documents • Inter-document distance is computed in the way mentioned above
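A minimal sketch of this cluster-formation step, assuming a similarity function such as the cosine measure shown earlier; the names are illustrative.

```python
def nearest_neighbour_clusters(docs, similarity, k=5):
    """One overlapping cluster per basis document: the document itself plus its
    k-1 most similar neighbours; different bases may yield the same member set."""
    clusters = {}
    for d in docs:
        neighbours = sorted((o for o in docs if o != d),
                            key=lambda o: similarity(docs[d], docs[o]),
                            reverse=True)
        clusters[d] = [d] + neighbours[:k - 1]
    return clusters
```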

  29. Retrieval algorithms – retrieval-time actions (1) • Baseline-methods • use some sort of QL for ranking documents • no cluster information is needed • Selection methods • use either only the basis documents (basis-select) or all documents (set-select) in the retrieved top clusters • Use some sort of QL as ranking method

  30. Retrieval algorithms – retrieval-time actions (2) • Aspect-x methods • make more explicit use of clusters as a smoothing mechanism and employ the probability of text strings under the entire top clusters • re-ranking is applied using some sort of QL • Hybrid algorithms • interpolate between selection and aspect-x methods to combine the advantages of both • a parameter lambda controls the weight given to each method • no re-ranking step
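The hybrid idea can be sketched as a simple lambda interpolation between a selection-style query-likelihood score and an aspect-style cluster score; the function names are assumptions, not the papers' notation.

```python
def hybrid_score(query, doc, ql_score, aspect_score, lam=0.5):
    """Interpolate a document query-likelihood score with a cluster-based
    aspect score; lam controls the weight given to each method."""
    return lam * ql_score(query, doc) + (1 - lam) * aspect_score(query, doc)
```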

  31. Application examples • Cluster-based retrieval using overlapping clusters and the retrieval-time actions above • Introduces a base language model with an additional parameter that controls the degree to which document statistics are altered by overall corpus statistics • The base language model serves as the traditional baseline for comparison • Results • Cluster-based retrieval is always competitive, especially the aspect-x and interpolation algorithms

  32. Conclusions • Cluster-based retrieval is a good alternative to traditional document-based retrieval • Performs as well as or slightly better than document-based retrieval using query-specific clustering • Performs significantly better in the case of static clustering • Using corpus structure as well as document information • Gives an additional performance boost in most cases

  33. Sources • Oren Kurland and Lillian Lee, “Corpus Structure, Language Models, and Ad Hoc Information Retrieval” • Xiaoyong Liu and W. Bruce Croft, “Cluster-based Retrieval Using Language Models”
