Clustering User Queries of a Search Engine Ji-Rong Wen Jian-Yun Nie Hong-Jiang Zhang
Outline • Principles • Clustering Algorithm • Similarity calculation • Similarity Based on Query Contents • Similarity Based on Keywords or Phrases • Similarity Based on String Matching • Similarity Based on User Feedback • Combination of Multiple Measures • Our works
Principles • Principle 1: using query contents • Group together queries of similar composition • The longer the queries, the more reliable principle 1 is • In practice, queries are short • Principle 2: using document clicks • If two queries lead to the selection of the same document, then they are similar • Relies on users' judgments
Clustering Algorithm • DBSCAN • density-based clustering method • Incremental DBSCAN • Key Problem: • similarity function • session := query text [clicked document]*
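The slide names DBSCAN and the session structure but gives no details, so the following is only a minimal sketch of a density-based pass over sessions with a pluggable similarity function. The `Session` type, the `click_overlap` measure, the thresholds `min_sim` and `min_pts`, and the document ids are all assumptions made for illustration; the incremental variant is not shown.

```python
from typing import Callable, FrozenSet, List, NamedTuple, Optional, Sequence

NOISE = -1  # label for sessions that end up in no cluster


class Session(NamedTuple):
    """session := query text [clicked document]*"""
    query: str
    clicks: FrozenSet[str]


def click_overlap(a: Session, b: Session) -> float:
    # Hypothetical stand-in similarity: shared clicks over the larger click set.
    if not a.clicks or not b.clicks:
        return 0.0
    return len(a.clicks & b.clicks) / max(len(a.clicks), len(b.clicks))


def dbscan(items: Sequence[Session],
           similarity: Callable[[Session, Session], float],
           min_sim: float, min_pts: int) -> List[int]:
    """Return one cluster label per item (NOISE for unclustered items)."""
    n = len(items)
    labels: List[Optional[int]] = [None] * n

    def neighbours(i: int) -> List[int]:
        # An item always counts as its own neighbour.
        return [j for j in range(n)
                if j == i or similarity(items[i], items[j]) >= min_sim]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:      # not a core point (may be claimed later)
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:    # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reachable)
    return [NOISE if label is None else label for label in labels]


sessions = [
    Session("atomic bomb", frozenset({"d1"})),
    Session("Nagasaki", frozenset({"d1"})),
    Session("Hiroshima", frozenset({"d1", "d2"})),
    Session("where does silk come from", frozenset({"d9"})),
]
labels = dbscan(sessions, click_overlap, min_sim=0.5, min_pts=2)
```

Keeping the similarity function as a parameter mirrors the slide's point that the similarity function is the key problem: the clustering loop stays the same while the measure is swapped.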
Similarity calculation • Similarity Based on Keywords or Phrases • keyword-based similarity function: kn(.) is the number of keywords in a query, KN(p, q) is the number of common keywords in two queries • Term weight: w(ki(p)) is the weight of the i-th common keyword in query p and kn(.) • w(ki(p)) = tf*idf • Phrases • Example: "history of China" vs "history of the United States" • Keywords: 33% similarity • Phrases: 50% similarity
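The formula itself did not survive in the slide text. A minimal sketch, assuming the similarity is KN(p, q) / max(kn(p), kn(q)), reproduces the 33% and 50% figures of the example. The function name, the pre-tokenized input, and the removal of the stop words "of" and "the" are assumptions.

```python
from typing import Iterable


def keyword_similarity(p: Iterable[str], q: Iterable[str]) -> float:
    """Assumed form: KN(p, q) / max(kn(p), kn(q)) over distinct terms."""
    p_terms, q_terms = set(p), set(q)
    if not p_terms or not q_terms:
        return 0.0
    return len(p_terms & q_terms) / max(len(p_terms), len(q_terms))


# Single keywords: one common term out of at most three.
by_keywords = keyword_similarity(["history", "china"],
                                 ["history", "united", "states"])

# Phrases: "united states" counts as one unit, so one common term out of two.
by_phrases = keyword_similarity(["history", "china"],
                                ["history", "united states"])
```

Treating "United States" as a single phrase shrinks the larger query from three units to two, which is why the phrase-based score is higher.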
Similarity calculation • Similarity Based on String Matching • string-matching similarity function: • useful for long and complete questions in natural language • Query 1: Where does silk come? • Query 2: Where does lead come from? • Query 3: Where does dew comes from? • can incorporate a dictionary of synonyms
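The slide does not show the string-matching formula. One common choice, assumed here, is a word-level edit distance normalized by the length of the longer query: 1 - distance / max(word count). The tokenizer and function names are hypothetical, and the synonym dictionary is not modeled.

```python
import re
from typing import List, Sequence


def tokenize(query: str) -> List[str]:
    # Lowercase and keep only word characters; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", query.lower())


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def string_similarity(p: str, q: str) -> float:
    """Assumed form: 1 - edit_distance / max(word counts)."""
    a, b = tokenize(p), tokenize(q)
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


q1 = "Where does silk come?"
q2 = "Where does lead come from?"
q3 = "Where does dew comes from?"
sim_12 = string_similarity(q1, q2)
sim_23 = string_similarity(q2, q3)
```

Unlike the keyword measure, this one is sensitive to word order, which is what makes it suited to full natural-language questions that share a template.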
Similarity Based on User Feedback • The document set D_C(.) users clicked on for queries qi and qj may be seen as follows: • similarity between queries qi and qj is determined by D_C(qi) ∩ D_C(qj)
Similarity Based on User Feedback • Similarity Through Single Documents • rd(.) is the number of clicked documents for a query, RD(p, q) is the number of document clicks in common • Example • The following queries all correspond to the document "ID: 761588871, Title: Atomic Bomb" • Query 1: atomic bomb • Query 2: Nagasaki • Query 3: Nuclear bombs • Query 4: Manhattan Project • Query 5: Hiroshima
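The formula is missing from the slide text. By analogy with the keyword measure, a plausible form is RD(p, q) / max(rd(p), rd(q)); the sketch below assumes exactly that. Only the document id 761588871 comes from the slide; the second id and the function name are hypothetical.

```python
from typing import Dict, Set


def single_doc_similarity(clicks_p: Set[str], clicks_q: Set[str]) -> float:
    """Assumed form: RD(p, q) / max(rd(p), rd(q))."""
    if not clicks_p or not clicks_q:
        return 0.0
    return len(clicks_p & clicks_q) / max(len(clicks_p), len(clicks_q))


ATOMIC_BOMB = "761588871"          # document id quoted on the slide
OTHER_DOC = "hypothetical-doc-id"  # made up to show a partial overlap

clicked: Dict[str, Set[str]] = {
    "atomic bomb": {ATOMIC_BOMB},
    "Nagasaki": {ATOMIC_BOMB},
    "Manhattan Project": {ATOMIC_BOMB, OTHER_DOC},
}

same_doc = single_doc_similarity(clicked["atomic bomb"], clicked["Nagasaki"])
partial = single_doc_similarity(clicked["atomic bomb"],
                                clicked["Manhattan Project"])
```

Note that "atomic bomb" and "Nagasaki" share no keyword, yet the click-based measure places them together because users selected the same document for both.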
Similarity Based on User Feedback • Similarity Through Document Hierarchy • F(di, dj) denotes the lowest common parent node for documents di and dj, L(x) the level of node x, L_Total the total number of levels in the hierarchy • The hierarchy-based similarity is defined as follows:
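The definition is not preserved in the slide text. The sketch below assumes a document-level score s(di, dj) = (L(F(di, dj)) - 1) / (L_Total - 1), with the root at level 1, and lifts it to queries by averaging each clicked document's best match in the other query, in both directions. Both formulas, the path representation, and the category names are assumptions for illustration.

```python
from typing import Sequence, Tuple

Path = Tuple[str, ...]  # nodes from the root down to the document itself


def doc_similarity(di: Path, dj: Path, total_levels: int) -> float:
    """Assumed form: (L(F(di, dj)) - 1) / (L_Total - 1), root at level 1."""
    common = 0
    for a, b in zip(di, dj):
        if a != b:
            break
        common += 1  # depth of the lowest common node F(di, dj)
    return (common - 1) / (total_levels - 1)


def hierarchy_similarity(p_docs: Sequence[Path], q_docs: Sequence[Path],
                         total_levels: int) -> float:
    """Assumed form: mean best-match score, averaged over both directions."""
    if not p_docs or not q_docs:
        return 0.0
    p_side = sum(max(doc_similarity(d, e, total_levels) for e in q_docs)
                 for d in p_docs) / len(p_docs)
    q_side = sum(max(doc_similarity(e, d, total_levels) for d in p_docs)
                 for e in q_docs) / len(q_docs)
    return 0.5 * (p_side + q_side)


# Hypothetical four-level hierarchy: root / area / topic / document.
bomb = ("root", "history", "wwii", "atomic_bomb")
hiroshima = ("root", "history", "wwii", "hiroshima")
fission = ("root", "science", "physics", "fission")

siblings = doc_similarity(bomb, hiroshima, total_levels=4)
mixed = hierarchy_similarity([bomb], [hiroshima, fission], total_levels=4)
```

Under these assumptions, two documents that share only the root score 0, and a document compared with itself scores 1, so queries can be related even when they never led to the same clicked document.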
Combination of Multiple Measures • keyword-based measure • Cluster 1: Query 1 • Cluster 2: Query 2 • Cluster 3: Query 3 and Query 4 • based on individual documents • Cluster 1: Query 1 and Query 2 • Cluster 2: Query 3 • Cluster 3: Query 4 • measure on document hierarchy • Cluster 1: Query 1, Query 2, and Query 4 • Cluster 2: Query 3 • Similarity content + similarity concept • Cluster 1: Query 1 and Query 2 • Cluster 2: Query 3 and Query 4
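The slide shows that combining measures changes the clusters but does not say how the measures are combined. A simple option, assumed here, is a weighted sum of a content-based score and a feedback-based score. The weights `alpha` and `beta`, the threshold, and the example scores are hypothetical values, not results from the slides.

```python
def combined_similarity(sim_content: float, sim_feedback: float,
                        alpha: float = 0.5, beta: float = 0.5) -> float:
    """Assumed form: alpha * content similarity + beta * feedback similarity."""
    return alpha * sim_content + beta * sim_feedback


# Hypothetical pair: no shared keywords, but identical clicked documents.
score = combined_similarity(sim_content=0.0, sim_feedback=1.0)

# With a hypothetical clustering threshold of 0.5, the pair is still grouped,
# although the content measure alone would have kept the queries apart.
grouped = score >= 0.5
```

Raising `alpha` relative to `beta` moves the result toward the keyword-based clusters, and lowering it moves the result toward the click-based ones.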