Clustering User Queries of a Search Engine

Clustering User Queries of a Search Engine. Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang. Outline: Principles • Clustering Algorithm • Similarity Calculation • Similarity Based on Query Contents • Similarity Based on Keywords or Phrases • Similarity Based on String Matching


Presentation Transcript


  1. Clustering User Queries of a Search Engine • Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang

  2. Outline • Principles • Clustering Algorithm • Similarity calculation • Similarity Based on Query Contents • Similarity Based on Keywords or Phrases • Similarity Based on String Matching • Similarity Based on User Feedback • Combination of Multiple Measures • Our works

  3. Principles • Principle 1, using query contents: group together queries of similar composition • The longer the queries, the more reliable Principle 1 is, but queries are usually short • Principle 2, using document clicks: if two queries lead to the selection of the same document, then they are similar • Clicks reflect users' judgments

  4. Clustering Algorithm • DBSCAN: a density-based clustering method • Incremental DBSCAN • Key problem: the similarity function • Query session: session := query text [clicked document]*
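The session definition and the clustering step above can be sketched as follows. This is a minimal, non-incremental DBSCAN written from the textbook description, not the presenters' implementation. The names `Session`, `dbscan`, `toy_similarity`, `eps` and `min_pts`, and the document IDs, are assumptions made for illustration. The toy similarity simply checks whether two sessions share a clicked document.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Session:
    """One query session: the query text plus zero or more clicked documents."""
    query: str
    clicked_docs: List[str] = field(default_factory=list)


def dbscan(items: Sequence, similarity: Callable, eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN driven by a similarity function in [0, 1].

    Two items are neighbours when 1 - similarity <= eps.
    Returns one label per item: a cluster id (0, 1, ...) or -1 for noise.
    """
    n = len(items)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbours(i: int) -> List[int]:
        return [j for j in range(n)
                if 1.0 - similarity(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
            if labels[j] is not None:
                continue              # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def toy_similarity(a: Session, b: Session) -> float:
    """Hypothetical measure: 1.0 if the sessions share a clicked document."""
    return 1.0 if set(a.clicked_docs) & set(b.clicked_docs) else 0.0


sessions = [
    Session("atomic bomb", ["d1"]),
    Session("Nagasaki", ["d1"]),
    Session("Hiroshima", ["d1", "d2"]),
    Session("silk", ["d9"]),
]
labels = dbscan(sessions, toy_similarity, eps=0.5, min_pts=2)
```

Any of the similarity measures discussed in the later slides could be passed in place of `toy_similarity`, which is why the similarity function is the key design decision.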

  5. Similarity calculation • Similarity Based on Keywords or Phrases • Keyword-based similarity function: kn(.) is the number of keywords in a query, and KN(p, q) is the number of keywords the two queries have in common • Term weights: w(ki(p)) is the weight of the i-th common keyword in query p, with w(ki(p)) = tf*idf • Phrases: "history of China" vs. "history of the United States" • Keyword similarity: 33% • Phrase similarity: 50%
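The 33% and 50% figures above are consistent with dividing the number of common keywords KN(p, q) by the larger of kn(p) and kn(q). The sketch below assumes that normalization, since the slide's formula did not survive in the transcript. It also assumes that stop words ("of", "the") are removed and that each keyword counts once per query. The function name `keyword_similarity` is hypothetical.

```python
from typing import Iterable


def keyword_similarity(p_terms: Iterable[str], q_terms: Iterable[str]) -> float:
    """Assumed form: KN(p, q) / max(kn(p), kn(q)) over distinct terms."""
    p, q = set(p_terms), set(q_terms)
    if not p or not q:
        return 0.0  # no keywords, so no evidence of similarity
    return len(p & q) / max(len(p), len(q))


# Single keywords: {history, china} vs {history, united, states}, 1 of 3 shared.
by_keywords = keyword_similarity(["history", "china"],
                                 ["history", "united", "states"])

# Phrases: "united states" is one unit, so 1 of 2 units is shared.
by_phrases = keyword_similarity(["history", "china"],
                                ["history", "united states"])
```

Treating "United States" as a single phrase shrinks the longer query from three units to two, which raises the score from one third to one half.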

  6. Similarity calculation • Similarity Based on String Matching • String-matching similarity function based on edit distance • Useful for long and complete questions in natural language • Query 1: Where does silk come? • Query 2: Where does lead come from? • Query 3: Where does dew comes from? • Can incorporate a dictionary of synonyms
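One way to apply string matching to the three example queries is a word-level edit distance normalized by the length of the longer query. The slide's own formula is not preserved, so the normalization, the tokenizer, and the names `tokenize`, `edit_distance` and `string_similarity` are assumptions in this sketch.

```python
import re
from typing import List, Sequence


def tokenize(query: str) -> List[str]:
    """Lower-case the query and keep alphanumeric word tokens only."""
    return re.findall(r"[a-z0-9]+", query.lower())


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over word sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def string_similarity(p: str, q: str) -> float:
    """Assumed form: 1 - edit_distance / length of the longer query in words."""
    a, b = tokenize(p), tokenize(q)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty queries carry no evidence
    return 1.0 - edit_distance(a, b) / longest


q1 = "Where does silk come?"
q2 = "Where does lead come from?"
q3 = "Where does dew comes from?"
sim_12 = string_similarity(q1, q2)  # 2 edits over 5 words
sim_23 = string_similarity(q2, q3)  # 2 edits over 5 words
sim_13 = string_similarity(q1, q3)  # 3 edits over 5 words
```

The three queries score as fairly similar on form alone even though silk, lead and dew are unrelated topics, which illustrates why a content-based measure on its own can mislead.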

  7. Similarity Based on User Feedback • D_C(q) denotes the set of documents that users clicked on for query q • The similarity between queries qi and qj is determined by the overlap D_C(qi) ∩ D_C(qj)

  8. Similarity Based on User Feedback • Similarity Through Single Documents: rd(.) is the number of clicked documents for a query, and RD(p, q) is the number of clicked documents the two queries have in common • Example: the following queries all correspond to the document "ID: 761588871, Title: Atomic Bomb" • Query 1: atomic bomb • Query 2: Nagasaki • Query 3: Nuclear bombs • Query 4: Manhattan Project • Query 5: Hiroshima
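A click log can be turned into per-query document sets and compared as sketched below. The normalization RD(p, q) / max(rd(p), rd(q)) is assumed by analogy with the keyword measure, because the slide's formula is missing. The log format, the helper names, and every document ID except 761588871 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def clicked_sets(log: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (query, clicked document ID) pairs into one set per query."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for query, doc_id in log:
        result[query].add(doc_id)
    return dict(result)


def click_similarity(docs_p: Set[str], docs_q: Set[str]) -> float:
    """Assumed form: RD(p, q) / max(rd(p), rd(q))."""
    if not docs_p or not docs_q:
        return 0.0  # a query without clicks gives no evidence
    return len(docs_p & docs_q) / max(len(docs_p), len(docs_q))


# Hypothetical log: only 761588871 ("Atomic Bomb") comes from the slide.
log = [
    ("atomic bomb", "761588871"),
    ("Nagasaki", "761588871"),
    ("Hiroshima", "761588871"),
    ("Hiroshima", "doc-b"),
    ("silk", "doc-c"),
]
clicks = clicked_sets(log)
sim_bomb_nagasaki = click_similarity(clicks["atomic bomb"], clicks["Nagasaki"])
sim_bomb_hiroshima = click_similarity(clicks["atomic bomb"], clicks["Hiroshima"])
sim_bomb_silk = click_similarity(clicks["atomic bomb"], clicks["silk"])
```

"atomic bomb" and "Nagasaki" share no keyword, yet the shared click makes them fully similar under this measure.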

  9. Similarity Based on User Feedback • Similarity Through Document Hierarchy: F(di, dj) denotes the lowest common parent node of documents di and dj, L(x) the level of node x, and L_Total the total number of levels in the hierarchy • The hierarchy-based similarity is defined in terms of these quantities
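A sketch of how F, L and L_Total might combine for two documents: the deeper their lowest common parent sits in the hierarchy, the more similar they are. The form (L(F(di, dj)) - 1) / (L_Total - 1) used here is an assumption, as the slide's formula is not preserved. The toy hierarchy, the convention that the root is level 1, and all node names are hypothetical.

```python
from typing import Dict, List


def ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def level(node: str, parent: Dict[str, str]) -> int:
    """L(x): the root is level 1, its children level 2, and so on."""
    return len(ancestors(node, parent))


def lowest_common_parent(a: str, b: str, parent: Dict[str, str]) -> str:
    """F(a, b): the deepest node on both root paths (a itself when a == b)."""
    on_path_a = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in on_path_a:
            return node
    raise ValueError("nodes do not share a root")


def hierarchy_similarity(a: str, b: str, parent: Dict[str, str],
                         total_levels: int) -> float:
    """Assumed form: (L(F(a, b)) - 1) / (L_Total - 1)."""
    if total_levels < 2:
        raise ValueError("need at least two levels")
    common = lowest_common_parent(a, b, parent)
    return (level(common, parent) - 1) / (total_levels - 1)


# Hypothetical 4-level hierarchy, child -> parent; documents sit at level 4.
parent = {
    "history": "root", "science": "root",
    "wwii": "history", "physics": "science",
    "doc-atomic-bomb": "wwii", "doc-hiroshima": "wwii",
    "doc-fission": "physics",
}
sim_same_topic = hierarchy_similarity("doc-atomic-bomb", "doc-hiroshima", parent, 4)
sim_far_apart = hierarchy_similarity("doc-atomic-bomb", "doc-fission", parent, 4)
sim_identical = hierarchy_similarity("doc-atomic-bomb", "doc-atomic-bomb", parent, 4)
```

Unlike the single-document measure, two queries can be similar here without sharing any clicked document, as long as their clicked documents fall under the same branch.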

  10. Combination of Multiple Measures

  11. Combination of Multiple Measures • Keyword-based measure • Cluster 1: Query 1 • Cluster 2: Query 2 • Cluster 3: Query 3 and Query 4 • Measure based on individual documents • Cluster 1: Query 1 and Query 2 • Cluster 2: Query 3 • Cluster 3: Query 4 • Measure based on document hierarchy • Cluster 1: Query 1, Query 2, and Query 4 • Cluster 2: Query 3 • Content similarity + concept similarity • Cluster 1: Query 1 and Query 2 • Cluster 2: Query 3 and Query 4
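One simple way to combine a content-based score with a feedback-based score is a weighted sum, sketched below. The linear form, the weights `alpha` and `beta`, and the example scores are assumptions for illustration. The transcript does not preserve the combination formula or any weight values.

```python
def combined_similarity(sim_content: float, sim_feedback: float,
                        alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted sum of two similarity scores, each assumed to lie in [0, 1]."""
    if alpha < 0 or beta < 0:
        raise ValueError("weights must be non-negative")
    return alpha * sim_content + beta * sim_feedback


# Hypothetical scores: the queries share few keywords but many clicks.
equal_weights = combined_similarity(0.2, 0.8)
feedback_heavy = combined_similarity(0.2, 0.8, alpha=0.25, beta=0.75)
```

With weights that sum to 1 the combined score stays in [0, 1], so the same clustering threshold can be reused while the balance between the two measures is tuned.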

  12. Question
