Query Suggestion

Query Suggestion. Naama Kraus. Slides are based on the papers: Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering; Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications.

  1. Query Suggestion
     Naama Kraus
     Slides are based on the papers:
     • Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering
     • Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications

  2. The Problem
     • A user's query is an imperfect description of the user's information need
     • Examples:
       • Ambiguous queries: jaguar
       • General queries: haifa
       • Terminology differences (synonyms) between user and corpus: stars - planets

  3. Query Suggestions
     • Assist the user in phrasing her information need
     • Example suggestions for the query jaguar:
       • Jaguar car
       • Jaguar xf
       • Jaguar animal
       • Jaguar cat

  4. Example: Google Related Searches

  5. Query suggestion algorithms
     • Query suggestions are extracted from the query log
     • Other methods use different data sources, such as a corpus; they are not covered today
     • Topic (cluster) based: identify groups of similar queries
     • Sequence based: mine and analyze the query log for likely query sequences

  6. Improving Search Engines by Query Clustering - Baeza-Yates et al.
     • Algorithm outline
     • Offline:
       • Represent queries as term-weighted vectors
       • Cluster queries
       • Rank queries in each cluster
     • Online:
       • Given the user's query q
       • Find the cluster C containing q
       • Suggest the top k queries in cluster C, based on their rank and their similarity to q

  7. Query Model
     • Given a query q
     • Let U be the set of URLs clicked for q (over all users and sessions)
       • This information is extracted from the query log
     • q's term-weighted vector has a nonzero entry for every term that appears in some URL in U
     • Terms are weighted according to:
       • Term frequency and URL popularity
       • Formula on the next slide

  8. Query Model (2)
     • Pop(u,q): the number of clicks on u for the query q
     • Note: the paper proposes a refinement of Pop(u,q) that is not biased by the search engine's ranking
     • Query similarity is computed by some measure, e.g. cosine similarity
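The query model of slides 7 and 8 can be illustrated with a minimal Python sketch. The exact weighting formula is not reproduced in this transcript, so the weight used here (click count times term frequency, normalized by the largest term frequency in the URL) is an assumption chosen to match the description "term frequency and URL popularity". The names `query_vector`, `cosine`, `clicks` and `url_terms` are hypothetical.

```python
import math
from collections import Counter

def query_vector(clicks, url_terms):
    """Build a term-weighted vector for one query.

    clicks:    {url: number of clicks on that url for the query}, i.e. Pop(u,q)
    url_terms: {url: list of terms appearing in that url's document}
    The weighting below is an assumed form, not the slide's exact formula.
    """
    vec = Counter()
    for url, pop in clicks.items():
        tf = Counter(url_terms[url])
        if not tf:
            continue
        max_tf = max(tf.values())
        for term, count in tf.items():
            # Popularity of the URL times the normalized term frequency.
            vec[term] += pop * count / max_tf
    return dict(vec)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two queries whose clicked URLs share many terms get a similarity close to 1, even if the query strings themselves have no word in common.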

  9. Query Support
     • The fraction of the documents returned by the query that captured the attention of users (clicked documents)
     • Denotes how 'good' a query is
     • A 'global score'
     • Queries within a cluster are ranked according to their similarity to q as well as their support
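A small sketch of query support and in-cluster ranking follows. The slides do not say how similarity and support are combined into one ranking, so multiplying them is purely an assumption made for illustration. The names `support` and `rank_cluster` are hypothetical.

```python
def support(returned_docs, clicked_docs):
    """Fraction of the documents returned for a query that users clicked."""
    returned = set(returned_docs)
    if not returned:
        return 0.0
    return len(returned & set(clicked_docs)) / len(returned)

def rank_cluster(candidates, k):
    """Return the top k queries of a cluster.

    candidates: list of (query, similarity_to_q, support) tuples.
    Assumption: the combined score is similarity * support.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return [query for query, _, _ in ranked[:k]]
```

With this combination, a query that is very similar to q but rarely leads to clicks is ranked below a slightly less similar query with high support.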

  10. Query Flow Graph: Boldi et al.
     • Main idea:
       • Aggregate the (massive) raw data in the query log
         • Many queries of many users
       • Model user query behavior
       • Use sophisticated techniques to infer query relatedness

  11. Query Flow Graph Model
     • G = (V, E, w) is a directed graph where:
     • V is the set of nodes, representing the set Q of distinct queries
       • Queries are extracted from the query log
     • E is a set of directed edges
       • Two queries q, q' are connected by an edge (q, q') if q' follows q in at least one session

  12. QFG Illustration
     • Nodes are queries
     • Edges connect queries
     [Figure: a graph over nodes q0 to q5; example query labels include "apple ipod" and "apple store"]

  13. Weighting Function
     • w : E → (0, 1] is a weighting function that assigns a weight to every edge (q, q')
     • For each edge (q, q'), the weight is the probability that q' follows q in the same session
     • Extracted from the observed query log sessions
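The graph model and weighting function of slides 11 and 13 can be sketched as follows. The slides only say that the weights are probabilities extracted from observed sessions; estimating them as relative frequencies of consecutive query pairs is an assumption of this sketch. The name `build_qfg` and the session format (a list of query strings in submission order) are hypothetical.

```python
from collections import defaultdict

def build_qfg(sessions):
    """Build a weighted query flow graph as {q: {q_next: weight}}.

    sessions: list of sessions, each a list of queries in submission order.
    Assumption: w(q, q') = count(q followed by q') / count(q followed by anything).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for q, q_next in zip(session, session[1:]):
            if q != q_next:  # ignore immediate repetitions of the same query
                counts[q][q_next] += 1
    graph = {}
    for q, followers in counts.items():
        total = sum(followers.values())
        graph[q] = {q_next: c / total for q_next, c in followers.items()}
    return graph
```

Every weight lies in (0, 1], and the weights on the outgoing edges of a node sum to 1, which is what the random walk on the graph relies on.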

  14. Illustration
     [Figure: the query flow graph over nodes q0 to q5 with a weight on every edge; the weights shown are 0.1, 0.2, 0.25, 0.35, 0.5, 0.55, 0.8 and 1.0]

  15. Random walk on the QFG
     • A random surfer executes a random walk on the graph as follows:
       • Start at some node
       • Move along an edge with probability d
         • Choose the edge according to its probability (weight)
       • Or teleport to a random node with probability 1-d
         • Choose the target node uniformly at random
     • The stationary distribution
       • The probability of being at node q in the limit, after infinitely many steps
     • Random walk score vector: the absolute scores of the queries
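The walk described above can be approximated by power iteration. This is a minimal sketch under stated assumptions: the graph is a dict `{q: {q_next: weight}}` whose outgoing weights sum to 1, nodes without outgoing edges always teleport (the slides do not cover that case), and the default `d=0.85` and the iteration count are arbitrary choices. The name `random_walk_scores` is hypothetical.

```python
def random_walk_scores(graph, d=0.85, iterations=100):
    """Approximate the stationary distribution of a walk with uniform teleport."""
    nodes = set(graph)
    for out in graph.values():
        nodes.update(out)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {q: 1.0 / n for q in nodes}
    for _ in range(iterations):
        nxt = {q: 0.0 for q in nodes}
        teleport_mass = 0.0
        for q in nodes:
            out = graph.get(q, {})
            if out:
                # Follow an edge with probability d, chosen by its weight.
                for q_next, w in out.items():
                    nxt[q_next] += d * score[q] * w
                teleport_mass += (1 - d) * score[q]
            else:
                # Assumption: a node with no outgoing edge always teleports.
                teleport_mass += score[q]
        # Teleporting picks the target node uniformly at random.
        for q in nodes:
            nxt[q] += teleport_mass / n
        score = nxt
    return score
```

The returned values sum to 1 and play the role of the absolute query scores.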

  16. Random Walk Relative to a Node
     • Random walk with restart to a single node:
       • Start at node q
       • Instead of teleporting to any node, always teleport to q
     • The score of node q' for this random walk measures the relatedness of q' to q
       • The probability of getting from q to q' in the limit
     • A node's relative score can be normalized by its absolute score; somewhat similar to tf×idf, this avoids suggesting highly popular queries that are not related to q
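A random walk with restart differs from the uniform-teleport walk only in where the teleport lands. The sketch below assumes the same `{q: {q_next: weight}}` graph format with outgoing weights summing to 1, sends nodes without outgoing edges back to the start node, and normalizes by plain division; the slides do not fix the normalization, so that choice is an assumption. The names `relative_scores` and `normalize` are hypothetical.

```python
def relative_scores(graph, start, d=0.85, iterations=100):
    """Approximate the stationary distribution of a walk that restarts at `start`."""
    nodes = set(graph) | {start}
    for out in graph.values():
        nodes.update(out)
    score = {q: 0.0 for q in nodes}
    score[start] = 1.0
    for _ in range(iterations):
        nxt = {q: 0.0 for q in nodes}
        restart_mass = 0.0
        for q in nodes:
            out = graph.get(q, {})
            if out:
                for q_next, w in out.items():
                    nxt[q_next] += d * score[q] * w
                restart_mass += (1 - d) * score[q]
            else:
                # Assumption: a node with no outgoing edge always restarts.
                restart_mass += score[q]
        # All teleport mass goes back to the start node.
        nxt[start] += restart_mass
        score = nxt
    return score

def normalize(relative, absolute):
    """Divide each relative score by the node's absolute score (assumed form)."""
    return {q: r / absolute[q] for q, r in relative.items() if absolute.get(q, 0) > 0}
```

Nodes that cannot be reached from the start node keep a score of 0, and dividing by the absolute score demotes queries that are popular everywhere rather than specifically related to q.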

  17. The Full Picture
     • Off-line stage
       • For each node q in the graph:
         • Compute the stationary distribution vector of q
           • A random walk score relative to q
         • Store suggestions for q; alternatives:
           • the top k scored nodes
           • the nodes having a score above some threshold
     • On-line stage
       • User submits query q
       • Suggest the queries stored for q
         • The queries most related to q
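The two stages above can be sketched as a precomputed lookup table. The sketch assumes the relative scores have already been computed and are given as `{q: {q_other: score}}`; the function names and this input format are hypothetical, and both storage alternatives from the slide (top k and threshold) are shown.

```python
import heapq

def top_k(scores, q, k):
    """The k highest-scored queries for q, excluding q itself."""
    candidates = ((s, other) for other, s in scores.items() if other != q)
    return [other for _, other in heapq.nlargest(k, candidates)]

def above_threshold(scores, q, threshold):
    """All queries other than q whose score exceeds the threshold, best first."""
    kept = [(s, other) for other, s in scores.items() if other != q and s > threshold]
    return [other for _, other in sorted(kept, reverse=True)]

def build_suggestion_table(relative_scores_by_query, k):
    """Off-line stage: store the top k suggestions for every query."""
    return {q: top_k(scores, q, k) for q, scores in relative_scores_by_query.items()}

def suggest(table, q):
    """On-line stage: a lookup; queries never seen in the log get no suggestions."""
    return table.get(q, [])
```

Keeping all the expensive computation off-line means the on-line stage is a single dictionary lookup per submitted query.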
