Clustering User Queries of a Search Engine

Clustering User Queries of a Search Engine • Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang

Outline • Principles • Clustering Algorithm • Similarity calculation • Similarity Based on Query Contents • Similarity Based on Keywords or Phrases • Similarity Based on String Matching • Similarity Based on User Feedback • Combination of Multiple Measures • Our works

Principles • Principle 1: using query contents • Group together queries of similar composition • The longer the queries, the more reliable principle 1 is • In practice, however, queries are short • Principle 2: using document clicks • If two queries lead to the selection of the same document, then they are similar • Document clicks reflect users' judgments

Clustering Algorithm • DBSCAN: a density-based clustering method • Incremental DBSCAN • Key problem: the similarity function • Query session: session := query text [clicked document]*
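To make the density-based idea concrete, here is a minimal sketch of plain (batch, not incremental) DBSCAN over query sessions. The names `dbscan` and `click_similarity`, the thresholds `eps` and `min_pts`, the document ids, and the toy click sets are all assumptions for illustration and do not come from the slides. The placeholder similarity is a simple Jaccard overlap of clicked documents; any query similarity function could be passed in instead.

```python
from typing import Callable, List, Optional, Sequence

NOISE = -1  # label for sessions that belong to no cluster


def dbscan(items: Sequence, similarity: Callable, eps: float, min_pts: int) -> List[int]:
    """Minimal DBSCAN where distance is taken as 1 - similarity (an assumption)."""
    n = len(items)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        return [j for j in range(n) if 1.0 - similarity(items[i], items[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


def click_similarity(a, b) -> float:
    """Hypothetical placeholder: Jaccard overlap of the clicked-document sets."""
    union = a[1] | b[1]
    return len(a[1] & b[1]) / len(union) if union else 0.0


# Each session is (query text, set of clicked document ids); the ids are made up.
sessions = [
    ("atomic bomb", {"d1", "d2"}),
    ("Hiroshima", {"d1", "d2", "d3"}),
    ("Manhattan Project", {"d1", "d3"}),
    ("history of China", {"d7"}),
    ("China history", {"d7", "d8"}),
    ("silk", {"d9"}),
]
labels = dbscan(sessions, click_similarity, eps=0.7, min_pts=2)
```

With these toy inputs the first three sessions fall into one cluster, the two China queries into another, and the last session is left as noise. The result depends entirely on the chosen `eps`, `min_pts`, and similarity function, which is why the slide singles out the similarity function as the key problem.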

Similarity calculation • Similarity Based on Keywords or Phrases • Keyword-based similarity function: similarity_keyword(p, q) = KN(p, q) / Max(kn(p), kn(q)), where kn(.) is the number of keywords in a query and KN(p, q) is the number of common keywords in two queries • Term weights: w(ki(p)) is the weight of the i-th common keyword in query p, with w(ki(p)) = tf*idf, and kn(.) becomes the sum of the keyword weights in a query • Phrases: a multi-word phrase is treated as a single term • Example: "history of China" vs "history of the United States" • Keywords: 33% • Phrases: 50%
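The example figures can be reproduced with a short sketch, assuming the similarity is the number of common terms divided by the larger of the two term counts (this reading is consistent with the 33% and 50% values on the slide). The stopword list, the phrase list, and the helper names `extract_terms` and `keyword_similarity` are hypothetical.

```python
STOPWORDS = {"of", "the", "a", "an"}  # hypothetical stopword list


def extract_terms(query, phrases=()):
    """Return the set of terms in a query; known phrases count as single terms."""
    text = query.lower()
    terms = set()
    for phrase in phrases:
        if phrase in text:  # naive substring match, good enough for a sketch
            terms.add(phrase)
            text = text.replace(phrase, " ")
    terms.update(word for word in text.split() if word not in STOPWORDS)
    return terms


def keyword_similarity(p_terms, q_terms):
    """Common terms divided by the larger term count (assumed formula)."""
    if not p_terms or not q_terms:
        return 0.0
    return len(p_terms & q_terms) / max(len(p_terms), len(q_terms))


p, q = "history of China", "history of the United States"

# Keywords only: {history, china} vs {history, united, states}
by_keywords = keyword_similarity(extract_terms(p), extract_terms(q))

# With phrase recognition: {history, china} vs {history, united states}
known_phrases = ("united states",)
by_phrases = keyword_similarity(
    extract_terms(p, known_phrases), extract_terms(q, known_phrases)
)
```

Here `by_keywords` comes out at one third and `by_phrases` at one half: recognizing "United States" as one term shrinks the larger query, so the single shared term "history" carries more weight.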

Similarity calculation • Similarity Based on String Matching • Edit-distance-based similarity function: similarity_edit(p, q) = 1 - Edit_distance(p, q) / Max(wn(p), wn(q)), where wn(.) is the number of words in a query • Useful for long and complete questions in natural language • Query 1: Where does silk come? • Query 2: Where does lead come from? • Query 3: Where does dew comes from? • Can incorporate a dictionary of synonyms
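One way to realize string matching on whole questions is a word-level edit distance, normalized by the length of the longer query. The sketch below assumes that reading; the helper names `words`, `edit_distance`, and `edit_similarity` are hypothetical, and no synonym dictionary is applied.

```python
import string


def words(query):
    """Lowercase the query, drop punctuation, and split into words."""
    stripped = query.lower().translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


def edit_distance(a, b):
    """Levenshtein distance over word lists (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(p, q):
    """1 minus the word-level edit distance over the longer query length."""
    wp, wq = words(p), words(q)
    longest = max(len(wp), len(wq))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(wp, wq) / longest


q1 = "Where does silk come?"
q2 = "Where does lead come from?"
q3 = "Where does dew comes from?"

sim_12 = edit_similarity(q1, q2)  # two edits over five words
sim_23 = edit_similarity(q2, q3)  # two edits over five words
sim_13 = edit_similarity(q1, q3)  # three edits over five words
```

The three questions score as fairly similar because they share the same frame, even though silk, lead, and dew are unrelated topics. A string-based measure rewards shared wording, not shared meaning, so it is best used together with other evidence.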

Similarity Based on User Feedback • D_C(q) denotes the set of documents that users clicked on for query q • The similarity between queries qi and qj is determined by the overlap D_C(qi) ∩ D_C(qj)

Similarity Based on User Feedback • Similarity Through Single Documents: similarity_single_doc(p, q) = RD(p, q) / Max(rd(p), rd(q)), where rd(.) is the number of clicked documents for a query and RD(p, q) is the number of document clicks in common • Example: the following queries all correspond to the document “ID: 761588871, Title: Atomic Bomb” • Query 1: atomic bomb • Query 2: Nagasaki • Query 3: Nuclear bombs • Query 4: Manhattan Project • Query 5: Hiroshima
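A small sketch of the click-overlap idea, assuming the score is the number of clicked documents in common divided by the larger click count, and counting each distinct document once. The function name and the second document id (`doc_b`) are hypothetical; only the id 761588871 comes from the slide.

```python
def single_doc_similarity(clicks_p, clicks_q):
    """Shared clicked documents divided by the larger click count (assumed formula)."""
    if not clicks_p or not clicks_q:
        return 0.0
    return len(clicks_p & clicks_q) / max(len(clicks_p), len(clicks_q))


# Clicked-document sets per query; "doc_b" is a made-up id for illustration.
clicks = {
    "atomic bomb": {"761588871"},
    "Nagasaki": {"761588871"},
    "Hiroshima": {"761588871", "doc_b"},
    "history of China": {"doc_c"},
}

same_doc = single_doc_similarity(clicks["atomic bomb"], clicks["Nagasaki"])
partial = single_doc_similarity(clicks["atomic bomb"], clicks["Hiroshima"])
unrelated = single_doc_similarity(clicks["atomic bomb"], clicks["history of China"])
```

"atomic bomb" and "Nagasaki" share no keywords, yet they score 1.0 here because both lead to the same clicked document. That case is exactly what the keyword-based measure cannot capture.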

Similarity Based on User Feedback • Similarity Through Document Hierarchy • F(di, dj) denotes the lowest common parent node of documents di and dj, L(x) the level of node x, and L_Total the total number of levels in the hierarchy • The hierarchy-based similarity between two documents is defined as: s(di, dj) = (L(F(di, dj)) - 1) / (L_Total - 1)
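The following sketch shows how such a hierarchy score could be computed. It assumes the document-level similarity is (L(F(di, dj)) - 1) / (L_Total - 1), with the root at level 1 and each document treated as its own lowest common parent when compared with itself. The toy hierarchy, node names, and function names are hypothetical.

```python
# Hypothetical hierarchy: child -> parent. Documents are the leaves.
PARENT = {
    "root": None,
    "science": "root",
    "history": "root",
    "physics": "science",
    "chemistry": "science",
    "world_history": "history",
    "doc_a": "physics",
    "doc_b": "physics",
    "doc_c": "chemistry",
    "doc_d": "world_history",
}


def path_to_root(node):
    """Return the node followed by its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def level(node):
    """Level of a node, with the root at level 1."""
    return len(path_to_root(node))


def lowest_common_parent(di, dj):
    """F(di, dj): the deepest node that lies on both paths to the root."""
    ancestors_of_dj = set(path_to_root(dj))
    for node in path_to_root(di):
        if node in ancestors_of_dj:
            return node
    raise ValueError("nodes do not share a root")


L_TOTAL = max(level(node) for node in PARENT)  # 4 levels in this toy hierarchy


def doc_similarity(di, dj):
    """Assumed formula: (L(F(di, dj)) - 1) / (L_Total - 1)."""
    return (level(lowest_common_parent(di, dj)) - 1) / (L_TOTAL - 1)


same_category = doc_similarity("doc_a", "doc_b")  # share "physics"
same_branch = doc_similarity("doc_a", "doc_c")    # share "science"
root_only = doc_similarity("doc_a", "doc_d")      # share only "root"
```

Under these assumptions, documents that only meet at the root score 0, and the score grows as the common parent sits deeper in the hierarchy. This lets two queries count as related even when their clicked documents differ but belong to the same category.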

Combination of Multiple Measures • Keyword-based measure: Cluster 1: Query 1; Cluster 2: Query 2; Cluster 3: Query 3 and Query 4 • Measure based on individual documents: Cluster 1: Query 1 and Query 2; Cluster 2: Query 3; Cluster 3: Query 4 • Measure based on document hierarchy: Cluster 1: Query 1, Query 2, and Query 4; Cluster 2: Query 3 • Similarity content + similarity concept: Cluster 1: Query 1 and Query 2; Cluster 2: Query 3 and Query 4
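One simple way to combine two measures is a weighted sum. The sketch below assumes that form; the function name, the weights `alpha` and `beta`, and the example scores are hypothetical and would need tuning on real query logs.

```python
def combined_similarity(content_sim, feedback_sim, alpha=0.5, beta=0.5):
    """Weighted sum of a content-based and a feedback-based score (assumed form)."""
    if alpha < 0 or beta < 0:
        raise ValueError("weights must be non-negative")
    return alpha * content_sim + beta * feedback_sim


# Two queries with no shared keywords but identical clicked documents.
no_words_same_clicks = combined_similarity(0.0, 1.0)

# Two queries with similar wording but different clicked documents,
# with content weighted less than feedback.
same_words_no_clicks = combined_similarity(0.5, 0.0, alpha=0.25, beta=0.75)
```

With a combined score, a pair of queries can pass the clustering threshold on the strength of either kind of evidence. That is how the combined measure can group queries that each individual measure keeps apart.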

Question
