Query Log Analysis

Query Log Analysis • Naama Kraus • Slides are based on the papers: • Andrei Broder, A Taxonomy of Web Search • Ricardo Baeza-Yates, Graphs from Search Engine Queries • Hassan, Jones, Klinkner, Beyond DCG: User Behavior as a Predictor of a Successful Search

A Taxonomy of Web Searches • [Andrei Broder] classifies web queries according to their intent: • Navigational - reach a particular site • Example: cnn, Oracle • Informational - acquire some information • Example: the history of haifa, information retrieval • Transactional - perform some web-mediated activity; further interaction is expected • E.g. shopping, downloading files, accessing databases • Example: new balance shoes, Israel flights

Query Log • A search engine query log records users’ searches • A typical record contains: • Anonymous user id u • Search query q • Returned documents V • Clicked documents C • Timestamp t

Query Log Example • 1234, apple, 12:04 • 1234, apple ipod, 12:05 • 1234, ynet, 12:13 • 145, google, 12:20 • 145, eBay, 12:56 • 32, ynet news, 12:59 • 145, Solaris systen, 13:01 • 145, Solaris system, 13:05 • …

Session • A sequence of searches by one particular user u within a specific time limit • S = < <u, q1, t1>, …, <u, qk, tk> > • t1 < … < tk (=> ordered sequence) • ti+1 – ti < t0 (=> t0 is a timeout threshold) • Note 1: a session may contain unrelated queries • Note 2: identifying sessions is easy

Session Example • Timeout threshold = 30 minutes • Session 1: • 1234, apple, 12:04 • 1234, apple ipod, 12:05 • 1234, ynet, 12:13 • 1234, apple store, 12:20 • Session 2 (starts after a 36-minute gap, which exceeds the threshold): • 1234, cnn news, 12:56 • 1234, cnn webcast, 12:59 • 1234, apple apps, 13:01
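The session definition can be sketched in a few lines of Python, using the records from the example above. This is a minimal sketch, not the method of any particular search engine: the names Record, split_sessions and at are hypothetical, and a fixed calendar date is assumed because the slides give only a time of day.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby


@dataclass(frozen=True)
class Record:
    user: str
    query: str
    time: datetime


def split_sessions(records, timeout=timedelta(minutes=30)):
    """Group records per user, then cut whenever the gap reaches the timeout."""
    sessions = []
    ordered = sorted(records, key=lambda r: (r.user, r.time))
    for _, user_records in groupby(ordered, key=lambda r: r.user):
        current = []
        for rec in user_records:
            # A gap of `timeout` or more violates ti+1 - ti < t0, so start a new session.
            if current and rec.time - current[-1].time >= timeout:
                sessions.append(current)
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions


def at(hhmm):
    # Hypothetical helper: assumes an arbitrary fixed date for every record.
    return datetime.strptime("2000-01-01 " + hhmm, "%Y-%m-%d %H:%M")


log = [
    Record("1234", "apple", at("12:04")),
    Record("1234", "apple ipod", at("12:05")),
    Record("1234", "ynet", at("12:13")),
    Record("1234", "apple store", at("12:20")),
    Record("1234", "cnn news", at("12:56")),
    Record("1234", "cnn webcast", at("12:59")),
    Record("1234", "apple apps", at("13:01")),
]

sessions = split_sessions(log)
session_queries = [[r.query for r in s] for s in sessions]
```

With the 30-minute threshold, the only cut falls between 12:20 and 12:56, so session_queries holds two lists. Note that "apple apps" ends up in the second session even though it is topically closer to the first, which is exactly why sessions may contain unrelated queries.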

Query Chain • A sequence of queries by a particular user that share a similar information need • Also known as a mission or logical session • Example: • haifa maps • haifa travel • attractions in haifa • Note 1: a chain contains related queries only • Note 2: identifying chains is difficult

Query Chain Example • 1234, apple, 12:04 • 1234, apple ipod, 12:05 • 1234, ynet, 12:13 • 1234, apple store, 12:20 • 1234, cnn news, 12:56 • 1234, cnn webcast, 12:59 • 1234, apple apps, 13:01 • (The slide groups these queries into two chains, chain1 and chain2)

Click Graph • Bipartite graph • Nodes on the left side are unique queries • Nodes on the right side are unique URLs • There is an edge between q and u if the log contains a click on u for query q • Edges may be weighted according to the number of clicks • This graph is used by numerous algorithms for various purposes • E.g., query and URL clustering, query recommendations …
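A minimal sketch of building such a click graph, assuming the log has already been reduced to (query, clicked URL) pairs. The data and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical (query, clicked URL) pairs; not taken from a real log.
clicks = [
    ("paris hotels", "hotels.example/paris"),
    ("cheap paris hotels", "hotels.example/paris"),
    ("paris hotels", "hotels.example/paris"),
    ("paris attractions", "travel.example/paris"),
]


def build_click_graph(click_pairs):
    """Return {(query, url): click count}, i.e. the weighted edges of the bipartite graph."""
    return Counter(click_pairs)


def queries_for_url(graph, url):
    """Left-side (query) neighbours of a right-side (URL) node."""
    return sorted({q for (q, u) in graph if u == url})


graph = build_click_graph(clicks)
shared = queries_for_url(graph, "hotels.example/paris")
```

Queries that share a URL neighbour, such as the two hotel queries here, are natural candidates for the clustering and recommendation uses mentioned above.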

Query Graphs • Each unique query is a node in the graph • Next slides: connection types between queries (edges) • Proposed by [Ricardo Baeza-Yates]

Query Graphs – Word Graph • An edge between two nodes exists if the queries share common terms • Possible node weight: number of occurrences in the log • Possible edge weight: Jaccard distance • Example nodes (figure): paris hotels, london attractions, paris attractions, cheap paris hotels
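A minimal sketch of the word graph over the four example queries. It assumes whitespace tokenization and non-empty queries, and takes Jaccard distance to be 1 minus the Jaccard similarity of the two term sets; the function names are hypothetical.

```python
from itertools import combinations


def jaccard_similarity(q1, q2):
    """|shared terms| / |all terms| for two non-empty queries."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)


def build_word_graph(queries):
    """Return {(q1, q2): Jaccard distance} for every pair sharing at least one term."""
    edges = {}
    for q1, q2 in combinations(queries, 2):
        sim = jaccard_similarity(q1, q2)
        if sim > 0:  # an edge exists only if the queries share a term
            edges[(q1, q2)] = 1.0 - sim
    return edges


queries = [
    "paris hotels",
    "london attractions",
    "paris attractions",
    "cheap paris hotels",
]
word_graph = build_word_graph(queries)
```

Here "london attractions" is connected only to "paris attractions", and the closest pair is "paris hotels" and "cheap paris hotels", which share two of three distinct terms.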

Query Graphs – Session Graph • The weight of a node q is the number of sessions that contain q (usually equal to the number of occurrences of q) • A directed edge goes from q1 to q2 if q1 occurred before q2 in the same session • The edge weight is the number of such occurrences • Example nodes (figure): paris hotels, paris attractions, london attractions, cheap paris hotels
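A minimal sketch of the session graph, assuming sessions are already available as ordered lists of queries. It reads "occurred before" as any earlier position in the session, not only the immediately preceding query; that reading, the function name and the sample sessions are all assumptions.

```python
from collections import Counter
from itertools import combinations


def build_session_graph(sessions):
    """Return (node_weight, edge_weight) counters for the session graph."""
    node_weight = Counter()
    edge_weight = Counter()
    for session in sessions:
        # Node weight: number of sessions that contain the query.
        for q in set(session):
            node_weight[q] += 1
        # combinations() preserves input order, so q1 always precedes q2.
        for q1, q2 in combinations(session, 2):
            if q1 != q2:
                edge_weight[(q1, q2)] += 1
    return node_weight, edge_weight


# Hypothetical sessions, each an ordered list of queries.
sessions = [
    ["paris hotels", "cheap paris hotels", "paris attractions"],
    ["paris hotels", "paris attractions"],
    ["london attractions"],
]
node_weight, edge_weight = build_session_graph(sessions)
```

Because the edges are directed, ("paris hotels", "paris attractions") gets weight 2 while the reverse pair gets no edge at all.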

Query Graphs – URL Cover Graph • An edge exists between q1 and q2 if they share clicked URLs • Node weight = number of occurrences • The edge weight is the number of common clicks • Example nodes (figure): paris hotels, paris attractions, london attractions, cheap paris hotels
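A minimal sketch of the URL cover graph built from (query, clicked URL) pairs. It interprets "number of common clicks" as the number of distinct URLs clicked for both queries, which is an assumption; the data and names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_url_cover_graph(clicks):
    """Return {(q1, q2): number of distinct clicked URLs the two queries share}."""
    urls_by_query = defaultdict(set)
    for query, url in clicks:
        urls_by_query[query].add(url)
    edges = {}
    for q1, q2 in combinations(sorted(urls_by_query), 2):
        shared = urls_by_query[q1] & urls_by_query[q2]
        if shared:  # no shared clicked URL means no edge
            edges[(q1, q2)] = len(shared)
    return edges


# Hypothetical click data; not taken from a real log.
clicks = [
    ("paris hotels", "hotels.example/paris"),
    ("paris hotels", "deals.example/paris"),
    ("cheap paris hotels", "hotels.example/paris"),
    ("cheap paris hotels", "deals.example/paris"),
    ("paris attractions", "travel.example/paris"),
    ("london attractions", "travel.example/london"),
]
cover_graph = build_url_cover_graph(clicks)
```

Unlike the word graph, two queries with no terms in common can still be linked here, as long as users clicked the same URL for both.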

Query Graph – URL Link Graph • An edge exists between q1 and q2 if there is at least one link between a clicked URL of q1 and a clicked URL of q2 • Node weight = number of occurrences • The edge weight is the number of such links • Example nodes (figure): paris hotels, paris attractions, london attractions, cheap paris hotels

Query Graph – URL Terms Graph • Represent a clicked URL by a set of terms (whole page, snippet, anchors, title, a combination …) • Weight terms by their frequencies • Node weight = number of occurrences • There is an edge between q1 and q2 if at least one clicked URL of q1 and one clicked URL of q2 have at least m terms in common • The edge weight is the sum of the frequencies of the common terms • Example nodes (figure): paris hotels, paris attractions, london attractions, cheap paris hotels

User Behavior as a Predictor of a Successful Search • Goal: given a sequence of user actions within a specific logical session, predict whether the search goal ended successfully or not • Success: the user is satisfied with the results • Failure: the user is unsatisfied • Method: • Analyze the query log and learn success/failure patterns • Use the learned models for prediction • Proposed by [Hassan, Jones and Klinkner]

Data • A rich query log of queries and user actions: • Query (Q) • Search Click (SR) • Sponsored Search Click (AD) • Related Search Click (RL), e.g. query recommendations • Spelling Suggestion Click (SP) • Shortcut Click (SC), e.g. image, video, news … • Any Other Click (OTH), e.g. browser tab

Data Labeling • Random sample of user sessions • Human editors labeled the data: • Detected logical sessions • Judged success/failure with one of five labels: definitely successful, probably successful, unsure, probably unsuccessful, definitely unsuccessful

Markov Models • Partition the training data into two splits: • successful goals • unsuccessful goals • For each group, construct a Markov model derived from the observed action sequences • Each model describes user behavior for a successful or an unsuccessful search goal • Each action type is a state • Weight a transition from one state to another according to its probability as observed in the data (MLE)

Transition Weighting - MLE
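A common way to write the maximum-likelihood estimate for a first-order Markov model, consistent with the description on the previous slide. The count notation N is an assumption here and is not taken from the slides.

```latex
\[
\Pr(S_i \to S_j) \;=\; \frac{N(S_i, S_j)}{N(S_i)}
\]
% N(S_i, S_j): number of observed transitions from state S_i to state S_j
% N(S_i):      total number of observed transitions out of state S_i
```

For example, if Q is followed by SR in 3 of the 5 transitions observed out of Q, the estimated weight of the Q to SR edge is 3/5 = 0.6.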

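Illustration • [Figure: example Markov model with states START, Q, SR, AD, RL and END; edges are labeled with transition probabilities (0.3, 0.1, 0.6, 1, 0.4, 0.1, 0.5, 1, 1); the assignment of values to edges is not recoverable here]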
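A minimal sketch of estimating such a model from labeled action sequences by counting transitions. The function name, the explicit START and END padding, and the training sequences are assumptions made for illustration; they are not data from the paper.

```python
from collections import Counter, defaultdict


def fit_markov_model(sequences):
    """MLE transition probabilities: count(a -> b) / count(a -> anything)."""
    pair_counts = Counter()
    out_counts = Counter()
    for seq in sequences:
        # Pad each sequence so entering and leaving the goal are transitions too.
        path = ["START", *seq, "END"]
        for a, b in zip(path, path[1:]):
            pair_counts[(a, b)] += 1
            out_counts[a] += 1
    model = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        model[a][b] = n / out_counts[a]
    return dict(model)


# Hypothetical action sequences for goals labeled successful.
successful = [
    ["Q", "SR"],
    ["Q", "SR"],
    ["Q", "Q", "SR"],
    ["Q", "AD"],
]
model_s = fit_markov_model(successful)
```

In this toy data there are five transitions out of Q, three of them to SR, so model_s["Q"]["SR"] is 0.6. The same function would be run on the unsuccessful split to obtain the second model.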

Prediction (1) • Given a user’s action sequence, predict whether it is successful or not • We’ve learned two models, Ms and Mf, of successful and unsuccessful patterns • Compute the probability that a given sequence S={S1,…,Sn} was generated from Ms, and likewise for Mf • Predict success or failure by comparing the log-likelihoods under the two models • Formulas in next slide
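A minimal sketch of this prediction step, assuming each model is a nested dictionary of transition probabilities. The transition tables, the START and END padding, and the floor value used for unseen transitions are all assumptions; the paper may handle these details differently.

```python
import math


def log_likelihood(model, actions, floor=1e-6):
    """Sum of log transition probabilities along START -> actions -> END."""
    path = ["START", *actions, "END"]
    total = 0.0
    for a, b in zip(path, path[1:]):
        # `floor` is an assumed smoothing constant for transitions never seen in training.
        total += math.log(model.get(a, {}).get(b, floor))
    return total


def predict_success(model_s, model_f, actions):
    """True if the sequence is more likely under the success model."""
    return log_likelihood(model_s, actions) > log_likelihood(model_f, actions)


# Hypothetical transition tables for the success (Ms) and failure (Mf) models.
model_s = {
    "START": {"Q": 1.0},
    "Q": {"SR": 0.7, "Q": 0.3},
    "SR": {"END": 0.8, "Q": 0.2},
}
model_f = {
    "START": {"Q": 1.0},
    "Q": {"SR": 0.3, "Q": 0.6, "END": 0.1},
    "SR": {"END": 0.3, "Q": 0.7},
}

clicked_and_left = predict_success(model_s, model_f, ["Q", "SR"])
kept_reformulating = predict_success(model_s, model_f, ["Q", "Q", "Q"])
```

With these tables, a query followed by a search click is classified as successful, while three queries in a row with no click is classified as unsuccessful. Working in log space avoids numerical underflow on long sequences.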

Prediction (2) Formulas taken from the paper
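A sketch of the log-likelihood computation in a form consistent with the previous slide. The threshold symbol tau and the use of a difference of log-likelihoods are assumptions; the exact formulas in the paper may differ.

```latex
\[
LL_{M}(S) \;=\; \sum_{i=2}^{n} \log \Pr\nolimits_{M}(S_{i-1} \to S_i)
\]
\[
\text{predict success if}\quad LL_{M_s}(S) - LL_{M_f}(S) > \tau,
\qquad \text{otherwise predict failure}
\]
% S = {S_1, ..., S_n}: the observed action sequence
% M_s, M_f: the Markov models learned from successful and unsuccessful goals
% tau: assumed decision threshold; tau = 0 picks the more likely model
```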
