
Cluster Language Models






Presentation Transcript


  1. Cluster Language Models By Roman Brauchle

  2. Structure

  3. Cluster-based retrieval vs. Document-based retrieval • Document-based retrieval • Information retrieval (IR) system matches the query against a collection of documents • Returns a ranked list of documents to the user • Cluster-based retrieval • Similar documents match the same information needs • Groups documents into clusters → document clustering • Returns a list of documents based on the clusters they came from

  4. Qualities of clustering methods (1) • Static clustering • Used in most early attempts • A hierarchical clustering method usually used for small document collections • Different algorithms for matching the query against the clusters • Probably more precise than document-based retrieval, but less effective

  5. Different ways of document clustering • Static clustering • Clustering is independent of an existing query • Match the query against clusters, e.g. their centroids, instead of single documents • Query-specific clustering • Documents to be clustered come from a traditional document-retrieval result for a query

  6. Qualities of clustering methods (2) • Query-specific clustering • Main goal is to improve the ranking of relevant documents • The number of top-ranked documents does not have a significant impact on clustering effectiveness • The approaches of static clustering and query-specific clustering are comparable • Query-specific clustering outperforms static clustering

  7. Purpose • Improving efficiency of traditional IR systems • Supposed to improve effectiveness as well • Categorize or classify documents • Important tool for Web search engines, document browsing, distributed retrieval etc.

  8. Recent developments • No conclusive findings about improvement of retrieval results on collections of realistic size • New ways of thinking about retrieval process using recent developments in statistical language modelling • Leads to re-examination of cluster-based retrieval

  9. Different approaches (1) • Traditional approach • Used to identify a subset of documents likely to be relevant to a query (static clustering) • At retrieval time, only those documents are matched against the query • Most common approach

  10. Different approaches (2) • Ranking clusters • Retrieve a list of clusters in their entirety in response to a query (static or query-specific clustering) • Rank clusters based on their similarity to the query • Any document in a cluster considered more relevant is ranked higher than any document from a cluster ranked lower in the list

  11. Different approaches (3) • Smoothing clusters • Using clusters as a form of document smoothing (also static or query specific clustering) • By grouping documents into clusters, differences between representations of individual documents are smoothed out

  12. Cluster-based Language Models • Document clustering is used to organize collections into clusters around their topics • Language models are estimated for those clusters and used to represent their topics • Employed in Topic Detection and Tracking (TDT)

  13. CQL model (1) • The basic approach for using language models in IR is to model the query generation process • General idea • Statistical language models are probability distributions over all possible sentences or other linguistic units in a language • Build a language model D for each document in the collection and rank the documents according to how likely the query Q could have been generated from each of these document models, i.e. P(Q|D) • Also called the query-likelihood (QL) retrieval model

  14. CQL model (2) • Estimation of P(Q|D) • Most commonly the query is treated as a sequence of independent terms, so the probability can be written as P(Q|D) = ∏i P(qi | D), where qi is the i-th term in the query and P(qi | D) is specified by the document language model
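To make the estimation concrete, here is a minimal sketch of query-likelihood scoring in Python; the maximum-likelihood document models, the Jelinek-Mercer smoothing weight lam, and the toy documents are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram language model for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query_tokens, doc_lm, coll_lm, lam=0.8):
    """log P(Q|D) = sum_i log P(q_i|D); the document model is smoothed
    with the collection model so unseen query terms do not zero the score."""
    score = 0.0
    for q in query_tokens:
        p = lam * doc_lm.get(q, 0.0) + (1 - lam) * coll_lm.get(q, 1e-9)
        score += math.log(p)
    return score

# Toy example: rank two documents by how likely they generate the query.
docs = {"d1": "cluster language model retrieval".split(),
        "d2": "document ranking with language models".split()}
coll_lm = unigram_lm([t for toks in docs.values() for t in toks])
query = "cluster retrieval".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_likelihood(query, unigram_lm(docs[d]), coll_lm))
print(ranking)  # ['d1', 'd2'] here, since d1 contains both query terms
```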

  15. CQL model (3) • Similar approach for cluster-based retrieval • Build language models for clusters and retrieve / rank clusters based on the likelihood of generating the query, i.e. P(Q | Cluster) = ∏i P(qi | Cluster), where P(qi | Cluster) is specified by the cluster language model
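Building on the previous sketch, the cluster variant only swaps in a different generator: a cluster language model estimated, for instance, from the pooled text of the cluster's member documents (the pooling choice is an assumption made here for illustration).

```python
def cluster_lm(member_ids, docs):
    """Estimate a cluster language model from the pooled text of its members."""
    pooled = [t for d in member_ids for t in docs[d]]
    return unigram_lm(pooled)

# Rank clusters by P(Q | Cluster) with the same query_likelihood scorer.
clusters = {"c1": ["d1"], "c2": ["d2"]}
cluster_ranking = sorted(
    clusters, reverse=True,
    key=lambda c: query_likelihood(query, cluster_lm(clusters[c], docs), coll_lm))
```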

  16. CBDM model • A second model that smooths the representations of individual documents • The CBDM model can also be viewed as a mixture of three sources • The document • The cluster/topic the document belongs to • The collection
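A hedged sketch of the three-source mixture behind the CBDM idea; the interpolation weights are illustrative placeholders, since the slides do not specify their values.

```python
def cbdm_prob(term, doc_lm, topic_lm, coll_lm,
              w_doc=0.6, w_topic=0.3, w_coll=0.1):
    """P(w|D) estimated as a mixture of the document model, the model of the
    cluster/topic the document belongs to, and the collection model."""
    return (w_doc * doc_lm.get(term, 0.0)
            + w_topic * topic_lm.get(term, 0.0)
            + w_coll * coll_lm.get(term, 1e-9))
```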

  17. Clustering algorithms (1) • Cluster-based retrieval requires that documents first be organized into clusters • The procedure consists of the following steps • Establish a pairwise measure of document similarity, for example the cosine measure • Build a specific number of clusters by partitioning the document collection, depending on the clustering mode
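As one possible instantiation of the first step, pairwise cosine similarity over raw term-frequency vectors can be computed as follows (a minimal sketch; the papers may use a different term weighting).

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two documents' term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```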

  18. Clustering algorithms (2) • In the case of static clustering • Three-pass K-means algorithm • Linear in the total number of documents N • In the case of query-specific clustering • Hierarchical agglomerative clustering algorithms such as single and complete linkage, group average, centroid and Ward's algorithms
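For the static-clustering mode, a partition can be produced with an off-the-shelf K-means over tf-idf vectors; scikit-learn is used here purely for illustration and is not the three-pass variant mentioned in the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["cluster based retrieval with language models",
         "document ranking by query likelihood",
         "topic detection and tracking with clusters",
         "smoothing document models with collection statistics"]

X = TfidfVectorizer().fit_transform(texts)                 # documents -> tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # partition into 2 clusters
print(labels)                                              # cluster id per document
```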

  19. Application examples (1) • Cluster-based retrieval by ranking clusters • Compares the CQL model with document-based retrieval • The experimental procedure consists of three steps • Document-based retrieval using the QL model • Query-specific clustering using one of the clustering algorithms mentioned above • Ranking the clusters built from the top-ranking documents using the CQL model • Results • Cluster-based retrieval is as effective as document-based retrieval or performs slightly better

  20. Application examples (2) • Cluster-based retrieval by smoothing documents with clusters • Evaluates the CBDM model in the context of QL retrieval and the relevance model (RM) for both static and query-specific clustering • Also compares to traditional document-based retrieval • Consists of the following steps • Apply the K-means algorithm to group documents into clusters • Perform cluster-based retrieval using query likelihood with the CBDM as the document model

  21. Application examples (3) • Results • Cluster-based retrieval performs significantly better than document-based retrieval in case of static clustering • In case of query-specific clustering it performs as well or slightly better than document-based retrieval

  22. Alternative approaches to language models • The previous application examples demonstrated the effectiveness of probabilistic language models • Language models typically use only individual document features • Other sources of information should be exploited, such as the similarity structure of the document collection, i.e. the corpus

  23. Advantages • Offline clustering is by definition query independent and may be based on factors irrelevant to the query • Retrieving documents that do not contain a certain query term but are still relevant • Cluster statistics may overgeneralize with respect to specific member documents

  24. Alternative retrieval framework • Incorporates both individual document information and corpus structure information • Uses precomputed overlapping clusters • The choice of which clusters to incorporate can depend on the query, even though the clusters themselves are computed query-independently

  25. Structure representation using overlapping clusters • Overlapping clusters represent corpus similarity structures • Clusters can be thought of as approximations of the true facets of the corpus the user might be interested in • Overlapping clusters form a better model of similarity structure than partitioning the corpus

  26. General approach • To assign a ranking to the documents, the set of clusters is created in advance • Execution of the following fairly general algorithm, given a query q and N documents to retrieve • Additional notation: the document language model pd assigns probabilities to text strings based on the document d; the cluster language model pc assigns probabilities to text strings based on the entire cluster c

  27. General algorithm
  Offline: create the clusters
  Online, given query q:
  • For each document d: choose Facets(d), a query-dependent subset of the clusters containing d
  • Score d by a weighted combination of pd(q) and pc(q) for all clusters c in Facets(d)
  • Set TopDocs(N) to the rank-ordered list of the N top-scoring documents
  • Optional: re-rank each d in TopDocs(N) by pd(q)
  • Return TopDocs(N)
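A rough Python sketch of the online phase; p_d and p_c stand for the document- and cluster-based query likelihoods, facets selects the query-dependent subset of clusters, and all names and the weighting scheme are illustrative assumptions.

```python
def rank_documents(query, docs, clusters, p_d, p_c, facets, N, w=0.5, rerank=False):
    """Score each document by a weighted combination of its own query likelihood
    and the query likelihoods of the clusters chosen as its facets."""
    scores = {}
    for d in docs:
        facet_clusters = facets(d, query, clusters)   # query-dependent subset of clusters containing d
        cluster_part = sum(p_c(query, c) for c in facet_clusters)
        scores[d] = w * p_d(query, d) + (1 - w) * cluster_part
    top = sorted(scores, key=scores.get, reverse=True)[:N]
    if rerank:                                        # optional re-ranking of TopDocs(N) by p_d alone
        top = sorted(top, key=lambda d: p_d(query, d), reverse=True)
    return top
```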

  28. Retrieval algorithms – Cluster formation and selection • Clusters consist of each document and its k−1 nearest neighbours • k is a free parameter • Clusters with different basis documents may contain the same set of documents • Inter-document distance is computed in the way mentioned above
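A minimal sketch of this cluster-formation step, assuming a similarity function such as the cosine measure shown earlier; the names are illustrative.

```python
def nearest_neighbour_clusters(docs, similarity, k=5):
    """One overlapping cluster per basis document: the document itself plus its
    k-1 most similar neighbours; different bases may yield the same member set."""
    clusters = {}
    for d in docs:
        neighbours = sorted((o for o in docs if o != d),
                            key=lambda o: similarity(docs[d], docs[o]),
                            reverse=True)
        clusters[d] = [d] + neighbours[:k - 1]
    return clusters
```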

  29. Retrieval algorithms – retrieval-time actions (1) • Baseline-methods • use some sort of QL for ranking documents • no cluster information is needed • Selection methods • use either only the basis documents (basis-select) or all documents (set-select) in the retrieved top clusters • Use some sort of QL as ranking method

  30. Retrieval algorithms – retrieval-time actions (2) • Aspect-x methods • make more explicit use of clusters as a smoothing mechanism and employ the probability of text strings under the entire top clusters • re-ranking is applied using some sort of QL • Hybrid algorithms • interpolate between selection and aspect-x methods to combine the advantages of both • a parameter lambda controls the weight given to each method • no re-ranking step
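The hybrid idea can be sketched as a simple lambda interpolation between a selection-style query-likelihood score and an aspect-style cluster score; the function names are assumptions, not the papers' notation.

```python
def hybrid_score(query, doc, ql_score, aspect_score, lam=0.5):
    """Interpolate a document query-likelihood score with a cluster-based
    aspect score; lam controls the weight given to each method."""
    return lam * ql_score(query, doc) + (1 - lam) * aspect_score(query, doc)
```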

  31. Application examples • Cluster-based retrieval using overlapping clusters and the retrieval-time actions above • Introduces a base language model with an additional parameter that controls the degree to which document statistics are altered by overall corpus statistics • The base language model serves as the traditional baseline for comparison • Results • Cluster-based retrieval is always competitive, especially the aspect-x and interpolation algorithms

  32. Conclusions • Cluster-based retrieval is a good alternative to traditional document-based retrieval • Performs as well as or slightly better than document-based retrieval using query-specific clustering • Performs significantly better in the case of static clustering • Using corpus structure as well as document information • Gives an additional performance boost in most cases

  33. Sources • Oren Kurland and Lillian Lee, “Corpus Structure, Language Models, and Ad Hoc Information Retrieval” • Xiaoyong Liu and W. Bruce Croft, “Cluster-based Retrieval Using Language Models”
