Clustering Personalized Web Search Results Xuehua Shen and Hong Cheng
Introduction • Search engine’s objectives • Rank most relevant search results at top • Effectiveness • PageRank / HITS • Group and present different categories of search results • Global view • Clustering
Clustering Personalized Search Results • Study the clustering problem in the UCAIR framework • Personalized search ranks or reranks the search results based on implicit user feedback • This raises interesting problems • Efficient and effective clustering/presentation • Dynamically updating the clustering results based on personalization
Goal • Effective • Cluster user search results into meaningful groups • Present in a clear format • Provide users with main themes of search results • Efficient • Implement efficient clustering algorithms • Dynamic • Dynamically maintain the clustering results based on personalized ranking and reranking
Progress • Implemented two clustering algorithms • K-Medoids • Hierarchical clustering • Presentation • Replace Google ads with clustering results • Present ranked results together with clustering results • Two presentation strategies • Most centrally located document in each cluster • Most frequent terms in each cluster
Partial Results • K-Medoids • Select the most centrally located document in each cluster as its center (medoid) • Present the medoid documents as each cluster’s representative • Efficiency is poor • Other processing time: 490 + 100 + 1562 = 2152 ms • Cluster search results time: 2844 ms
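The K-Medoids step above can be sketched as follows. This is a minimal illustration, not the UCAIR implementation: the whitespace tokenization, the cosine distance over raw term counts, the "first k documents" initialization, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def total_cost(dist, medoids):
    """Sum of each document's distance to its nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def k_medoids(docs, k):
    """Swap-based K-Medoids; returns (medoid indices, nearest-medoid label per document)."""
    vectors = [Counter(d.lower().split()) for d in docs]  # assumed tokenization
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))  # assumed naive initialization: the first k documents
    best = total_cost(dist, medoids)
    improved = True
    while improved:  # repeat until no swap lowers the total cost
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try replacing one medoid with a non-medoid document.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels
```

Each returned medoid index points at a real document, so its title or snippet can be shown directly as the cluster representative.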
Partial Results (II) • Hierarchical clustering • Merge similar documents in a pair-wise manner • Use weighted average term vectors to represent cluster centers • Present centroid term vectors as virtual documents (output Top-K terms) • Efficiency better than K-Medoids • Other processing time: 200 + 110 + 831 = 1141 ms • Cluster search results time: 661 ms
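A minimal sketch of the hierarchical approach described above, under stated assumptions: cosine similarity between centroids as the merge criterion, cluster size as the averaging weight, whitespace tokenization, and hypothetical function names. The slides do not specify these details.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def merge_centroids(c1, n1, c2, n2):
    """Size-weighted average of two centroid term vectors."""
    total = n1 + n2
    return {t: (c1.get(t, 0.0) * n1 + c2.get(t, 0.0) * n2) / total
            for t in set(c1) | set(c2)}

def hierarchical_cluster(docs, k):
    """Merge the most similar pair of clusters until k clusters remain.

    Returns a list of (member document indices, centroid term vector)."""
    k = max(k, 1)
    clusters = [([i], dict(Counter(d.lower().split()))) for i, d in enumerate(docs)]
    while len(clusters) > k:
        best_pair, best_sim = None, -1.0
        for i in range(len(clusters)):  # pair-wise scan over current clusters
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        (m1, c1), (m2, c2) = clusters[i], clusters[j]
        merged = (m1 + m2, merge_centroids(c1, len(m1), c2, len(m2)))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

def top_terms(centroid, top_k=3):
    """Top-K terms of a centroid: the 'virtual document' shown to the user."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_k]]
```

Calling `top_terms` on each returned centroid yields the term list presented for that cluster.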
Efficiency Analysis • K-Medoids • O(k(n-k)^2) for each iteration, where n is # of documents, k is # of clusters • Need multiple iterations for convergence • Hierarchical clustering • O(n^2) for each iteration • Need n-k iterations
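Multiplying the per-iteration costs above by the number of iterations gives rough totals. The symbol $t$ (the number of K-Medoids iterations until convergence) is an assumption introduced here; the slides only say "multiple iterations".

```latex
\begin{align*}
T_{\text{K-Medoids}}   &= O\!\left(t \cdot k\,(n-k)^2\right) \\
T_{\text{hierarchical}} &= (n-k) \cdot O\!\left(n^2\right) = O\!\left((n-k)\,n^2\right)
\end{align*}
```

These are asymptotic bounds only; constant factors and the actual value of $t$ decide which algorithm is faster at a given $n$.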
Lessons Learned • Clustering takes longer as more search results accumulate (when we click “Next”) • Top-K frequent terms in each cluster sometimes do not make sense • Combine additional information besides term frequency • Re-cluster each time search results are reranked • Incremental update of clustering results is desired!
Remaining • Implementation • KMeans • MMR • Frequent word sets • Effective presentation study • Based on user feedback • Literature survey • Dynamic maintenance of clustering based on search result ranking and reranking • Drill down in a particular cluster • Update overall clustering organization
Feedback • Which way of presenting clustering results is more meaningful? • Based on central documents • Based on term vectors • More options? • Any other clustering algorithms to achieve effectiveness and efficiency? • Any other presentation strategy besides “rank list + cluster center”?