Query Recommendation Xiaofei Zhu ([email protected]) L3S Research Center, Leibniz Universität Hannover

Problems with the original query:

- Short (1-2 words)
- Ambiguous (e.g., Java)
- Lack of domain knowledge

Query recommendation:

- It aims to provide users with alternative queries that represent their information needs more clearly, so that the search engine returns better results.

- How to do query recommendation?
- Find alternative queries with similar search intent.
- How does it differ from document or image recommendation?

- Query log
- A query log records information about the search actions of the users of a search engine.

- A typical query log is a set of records <q_i, u_i, t_i, V_i, C_i>
- q_i: the submitted query
- u_i: an anonymized identifier for the user who submitted the query
- t_i: the timestamp, i.e., the time at which the query was submitted for search
- V_i: the set of results returned for the query
- C_i: the set of documents clicked by the user
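The record structure above can be sketched as a small data class. This is only an illustration: the class name, the field names, and the tab-separated input layout with `|`-separated URL lists are assumptions made for the example, not the format of any particular log.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class QueryLogRecord:
    """One record <q_i, u_i, t_i, V_i, C_i> of a query log (hypothetical layout)."""
    query: str                                    # q_i
    user_id: str                                  # u_i
    timestamp: datetime                           # t_i
    results: list = field(default_factory=list)   # V_i
    clicks: list = field(default_factory=list)    # C_i


def parse_record(line: str) -> QueryLogRecord:
    """Parse 'user<TAB>query<TAB>time<TAB>results<TAB>clicks' (assumed format)."""
    user_id, query, time_str, results, clicks = line.rstrip("\n").split("\t")
    return QueryLogRecord(
        query=query,
        user_id=user_id,
        timestamp=datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S"),
        results=[u for u in results.split("|") if u],
        clicks=[u for u in clicks.split("|") if u],
    )


record = parse_record("u1\tjava\t2006-03-24 19:35:31\ta.com|b.com\ta.com")
```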

| AnonID | Query | QueryTime | ItemRank | ClickURL |
|---|---|---|---|---|
| 7051923 | motorola text messages | 2006-03-24 19:35:31 | 1 | http://www.telusmobility.com |
| 7051923 | motorola text messages | 2006-03-24 19:35:31 | 4 | http://support.t-mobile.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 2 | http://www.phonescoop.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 3 | http://www.1800mobiles.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 5 | http://cgi.ebay.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 7 | http://phonearena.com |
| 7051923 | spike muscle car | 2006-03-25 12:57:43 | 2 | http://www.classicauto-sales.com |
| 7051923 | spike muscle car | 2006-03-25 12:57:43 | 5 | http://sev.prnewswire.com |
| 7051923 | spike muscle car | 2006-03-25 13:00:22 | | |
| 7051923 | usps | 2006-03-25 14:23:21 | 1 | http://www.usps.com |
| 7051923 | vc2 auctions | 2006-03-25 14:31:41 | | |
| 7051923 | auctions for 1 | 2006-03-25 14:33:47 | | |

| Time | Query | QueryID | SessionID | ResultCount |
|---|---|---|---|---|
| 2006-05-01 00:00:01 | defination Gravitational | 46c13f0705f6436b | 19ab975e898d46d1 | 11 |
| 2006-05-01 00:00:01 | kimclement | a3d2cae45e2b4c5b | 1b748d1afa9b4828 | 10 |
| 2006-05-01 00:00:01 | scientology crazy beliefs | 418324ef33d14ed2 | 10f477402db84c9a | 10 |
| 2006-05-01 00:00:01 | www.joj.sk | 489238bdf8834d68 | 16271eb6bf174c5c | 9 |
| 2006-05-01 00:00:04 | www.selectcareers.com | f92efd8044904ac4 | 193f9f8442d44c48 | 0 |
| 2006-05-01 00:00:08 | What is May Day? | 37afe7af832649d2 | 21f6a0dfea4348ac | 14 |
| 2006-05-01 00:00:10 | vikings draft choices suck | b0519e4528d84b44 | 196b0bb2f1d643f2 | 10 |
| 2006-05-01 00:00:10 | wwwcrownawards.com | 9eda4716dfb045e2 | 04e3a26067a84748 | 0 |
| 2006-05-01 00:00:15 | Australian miners | ba6d190cc4cd4fd3 | 136fd5e571d24886 | 10 |

| QueryID | Query | Time | URL | Position |
|---|---|---|---|---|
| 0000003a718649f2 | schwab | 2006-05-11 08:07:35 | http://www.schwab.com/ | 1 |
| 0000006d43b549c1 | us geography | 2006-05-04 14:23:00 | http://www.enchantedlearning.com/usa/ | 3 |
| 0000006d43b549c1 | us geography | 2006-05-04 14:23:03 | http://www.sheppardsoftware.comState15s_500.html | 4 |
| 0000016aa52e4fbc | wwf | 2006-05-21 09:25:34 | http://www.panda.org/ | 2 |
| 000002aa6e27443f | biggercity | 2006-05-07 13:30:45 | http://www.biggercity.com/chat/ | 1 |
| 1000005aac1f6423f | studios | 2006-05-09 14:21:29 | http://www.shawneestudios.com/contact_us.php | 1 |
| 1000008d8afaa459a | www.nfl.com | 2006-05-28 18:22:39 | http://www.nfl.com/teams/NYJ.html | 7 |
| 7000009c2848e4a68 | north hills school district | 2006-05-04 12:29:12 | http://www.nhsd.net/ | 1 |

Click-through data

- Click-through data records the documents a user clicks after submitting a query to the search engine.

If a user clicks a document after she issues a query, then the clicked document is more or less relevant to the submitted query; thus the query can be represented by its clicked documents.

Basic assumption: if two queries share many clicked documents, then they have similar search intent.

- Query feature representation
- Query-URL graph

[Beeferman, KDD’00] [Mei, CIKM’08]
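The co-click assumption can be illustrated with a minimal sketch that represents each query by its set of clicked URLs and compares two queries by Jaccard overlap. Jaccard similarity is a choice made for this example, not necessarily the measure used in the cited papers, and the tiny click log is made up.

```python
from collections import defaultdict


def clicked_urls_by_query(click_log):
    """Map each query to the set of URLs clicked for it."""
    urls = defaultdict(set)
    for query, url in click_log:
        urls[query].add(url)
    return urls


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical (query, clicked URL) pairs.
click_log = [
    ("aa", "www.aa.com"),
    ("aa", "www.theaa.com"),
    ("american airline", "www.aa.com"),
    ("mexicana", "en.wikipedia.org/wiki/Mexicana"),
]
urls = clicked_urls_by_query(click_log)
sim_close = jaccard(urls["aa"], urls["american airline"])
sim_far = jaccard(urls["aa"], urls["mexicana"])
```

Queries that share clicked documents get a positive score, while queries with disjoint click sets get zero.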

Query Session

- Query session: a sequence of related queries submitted by a single user within a time interval for a specific search task.

Basic assumption: queries submitted consecutively by the same user within a short time interval share similar search intent. If two queries frequently co-occur in the same sessions, then they are relevant to each other.

- Association rules
- Query graph

[Fonseca, LA-WEB’03] [Zhang, WWW’06] [Boldi, CIKM’08, WSCD’09]
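A rough sketch of the session assumption: split each user's queries into sessions by a time gap, then count how often two queries appear in the same session. The 30-minute gap and the function names are assumptions for illustration; the cited methods define sessions and association scores in their own ways. The sample records reuse rows from the first log excerpt above.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations


def split_sessions(records, gap=timedelta(minutes=30)):
    """Group (user_id, timestamp, query) records into per-user sessions."""
    sessions = []
    last_seen = {}  # user_id -> (last timestamp, current session list)
    for user, ts, query in sorted(records):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > gap:
            session = []          # gap exceeded: start a new session
            sessions.append(session)
        else:
            session = prev[1]
        session.append(query)
        last_seen[user] = (ts, session)
    return sessions


def cooccurrence_counts(sessions):
    """Count unordered query pairs that occur in the same session."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def t(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")


records = [
    ("7051923", t("2006-03-24 19:35:31"), "motorola text messages"),
    ("7051923", t("2006-03-24 19:38:40"), "motorola t730 text messages"),
    ("7051923", t("2006-03-25 12:57:43"), "spike muscle car"),
    ("7051923", t("2006-03-25 14:23:21"), "usps"),
]
sessions = split_sessions(records)
counts = cooccurrence_counts(sessions)
```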

- Query Suggestion Using Hitting Time (CIKM’08)
- Click-through Data
- Query-URL Bipartite Graph

- Query Suggestions Using Query-Flow Graphs (WSCD’09)
- Session Data
- Query-Flow Graph

- Query-URL Bipartite Graph
- Edges exist only between V1 and V2
- No edges inside V1 or inside V2
- Edges are weighted
- e.g., V1 = queries; V2 = URLs

- Transition Probabilities

- Random Walk and Hitting Time
- Hitting time: how long does it take to hit node a in a random walk starting at node b?

- Start at node 1
- Pick a neighbor i according to the transition probabilities (uniformly at random if the graph is unweighted)
- Move to i
- Continue

If the random walk hits a node quickly, then that node is close to the start node. The number of steps this takes is the hitting time.

- Construct a (kNN) subgraph from the query log data (with a predefined number of queries/URLs)
- Compute the transition probabilities p(i → j)
- Compute the hitting time h_i^A
- Rank candidate queries using h_i^A
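The steps above can be sketched as follows. This is a simplified illustration rather than the exact procedure of the CIKM’08 paper: it assumes that p(i → j) is the edge weight normalized over the outgoing edges of i, that the hitting time is approximated by iterating h_i = 1 + Σ_j p(i → j) h_j with h_A = 0, and that candidates are ranked by ascending hitting time. The five-node path graph (query A, URL, query i, URL, query k) is made up.

```python
import numpy as np


def transition_matrix(weights):
    """Row-normalize a non-negative weight matrix into p(i -> j)."""
    w = np.asarray(weights, dtype=float)
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)


def hitting_times(P, target, iterations=1000):
    """Approximate the expected number of steps from each node to `target`."""
    h = np.zeros(P.shape[0])
    for _ in range(iterations):
        h = 1.0 + P @ h      # one more step, then continue from the neighbor
        h[target] = 0.0      # the walk stops once it reaches the target
    return h


# Hypothetical bipartite path: A(0) - url(1) - i(2) - url(3) - k(4)
W = np.zeros((5, 5))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    W[a, b] = W[b, a] = 1.0

P = transition_matrix(W)
h = hitting_times(P, target=0)
candidate_queries = [2, 4]
ranking = sorted(candidate_queries, key=lambda i: h[i])
```

Here query i, which shares a clicked URL with A, gets a smaller hitting time than query k and is ranked first.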

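Example (query = ‘aa’): a query-URL subgraph in which query nodes such as ‘aa’, ‘american airline’, and ‘mexiana’ are linked to URL nodes such as www.aa.com, www.theaa.com/travelwatch/planner_main.jsp, and en.wikipedia.org/wiki/Mexicana, with click counts (e.g., 300, 15) as edge weights.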

- Session Data
- Definition: the sequence of queries issued by one particular user within a specific time limit.
- Similarity values are defined both for two consecutive queries and for queries that are not neighbors in the same session.
- The model works by accumulating many query sessions and adding up the similarity values of identical query pairs.

Z. Zhang and O. Nasraoui. Mining search engine query logs for query recommendation. In WWW, pages 1039–1040, 2006.

P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, S. Vigna: “The query-flow graph: model and applications”. CIKM 2008.

- The key aspect of the construction of the query-flow graph is the definition of the weighting function w.
- For example, the weight of an edge can represent the number of times the transition was observed within the same search session.

- The query recommendation methods are based on the probability of being at a certain node after performing a random walk over a query graph.
- Random walk with restart
- A random surfer starts at the initial query q.
- At each step:
- with probability α, the surfer follows one of the outlinks from the current node;
- with probability 1 - α, the surfer jumps back to q.

- Notation:
- M: the transition matrix of the resulting Markov chain
- P: the row-normalized weight matrix of the query-flow graph
- e_j: the vector whose j-th entry is 1 and whose other entries are zero
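From the description above, the chain can be written as M = αP + (1 - α)·1·e_j^T, which is a reconstruction from the stated definitions rather than a formula quoted from the slides. A minimal sketch of the resulting iteration follows; the value α = 0.85, the iteration count, and the three-node matrix are assumptions for illustration, and every node is assumed to have at least one outlink.

```python
import numpy as np


def random_walk_with_restart(P, start, alpha=0.85, iterations=200):
    """Distribution of a walker that restarts at `start` with probability 1 - alpha."""
    n = P.shape[0]
    restart = np.zeros(n)
    restart[start] = 1.0
    x = restart.copy()
    for _ in range(iterations):
        # x M = alpha * x P + (1 - alpha) * e_start, because x sums to 1
        x = alpha * (x @ P) + (1 - alpha) * restart
    return x


# Hypothetical row-normalized query-flow weights for three queries.
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
scores = random_walk_with_restart(P, start=0)
recommended = [i for i in np.argsort(-scores) if i != 0]
```

Candidate queries are then ordered by their score, skipping the initial query itself.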

- Random walks on graphs correspond to Markov chains
- The set of states S is the set of nodes of the graph G
- The transition probability matrix gives the probability of following an edge from one node to another

[Figure: an example graph with its adjacency matrix A and transition matrix P, and the position of the random walker at t = 0, 1, 2, 3]

$x_t(i)$ = probability that the surfer is on node i at time t

$x_{t+1}(i) = \sum_j x_t(j)\, P(j, i)$, i.e., the probability of being at node j times Pr(j → i), summed over all j

$x_{t+1} = x_t P = x_{t-1} P^2 = x_{t-2} P^3 = \dots = x_0 P^{t+1}$

What happens when the surfer keeps walking for a long time?

- Stationary Distribution
- Intuitively
- the stationary distribution at a node is related to the amount of time a random walker spends visiting that node.

- Mathematically
- Remember that we can write the probability distribution over nodes as $x_{t+1} = x_t P$.
- For the stationary distribution $v_0$ we have $v_0 = v_0 P$.

$v_0$ is a left eigenvector of the transition matrix P, with eigenvalue 1.
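As a small numerical check of the two views above, the sketch below computes the stationary distribution of a made-up three-node chain both by repeated multiplication and as the left eigenvector for eigenvalue 1. The matrix and the number of steps are illustrative assumptions.

```python
import numpy as np

# Hypothetical irreducible, aperiodic chain on three nodes.
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])


def power_iteration(P, steps=200):
    """Walk for a long time: x_{t+1} = x_t P, starting from the uniform distribution."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(steps):
        x = x @ P
    return x


def stationary_from_eigenvector(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1."""
    values, vectors = np.linalg.eig(P.T)   # right eigenvectors of P^T are left eigenvectors of P
    k = np.argmin(np.abs(values - 1.0))
    v = np.real(vectors[:, k])
    return v / v.sum()


v_power = power_iteration(P)
v_eig = stationary_from_eigenvector(P)
```

Both routes give the same vector, (4/9, 3/9, 2/9) for this matrix.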

- Does a stationary distribution always exist? Is it unique?
- Yes, if the graph is “well-behaved”, i.e., P is ergodic

- P is ergodic if it is:
- irreducible
- aperiodic

Irreducible: there is a path from every node to every other node.

Aperiodic: state i is periodic with period k > 1 if all paths from i back to i have a length that is a multiple of k. Otherwise, it is aperiodic.

[Figure: examples of an irreducible graph, a graph that is not irreducible, an aperiodic graph, and a graph with periodicity 3]

- If a Markov chain P is irreducible and aperiodic, then the largest eigenvalue of the transition matrix is equal to 1 and all the other eigenvalues are strictly less than 1 in absolute value.
- Let the eigenvalues of P be $\{\sigma_i \mid i = 0, \dots, n-1\}$, in non-increasing order of $|\sigma_i|$.
- $\sigma_0 = 1 > |\sigma_1| \ge |\sigma_2| \ge \dots \ge |\sigma_{n-1}|$

Relevance

- In query recommendation, providing only “relevant” recommendations is far from enough to satisfy users’ information needs.

Original query: Apple

- apple ipad 3
- apple tree
- apple iphone 4s
- apple seed
- apple computer

The queries we recommend should cover multiple potential search intents of users and minimize the risk that users will not be satisfied.

- Diversifying Query Suggestion Results [Hao Ma, AAAI’10]
- Query-URL graph
- Hitting time

- A Unified Framework for Recommending Diverse and Relevant Queries[Xiaofei Zhu, WWW’11]
- Manifold
- Manifold Ranking with Stop Points

Figure 1: Example of a bipartite graph (extracted from the clickthrough data)

- Initial Transition Probability
- The initial transition probability from node i to node j is the click frequency between node i and node j divided by a normalization term.
- The normalization term is the total number of times that the query node i has been issued in the dataset.

- Random Jump
- In addition to the transition probability, there are random relations among different queries.
- The model adds a uniform random relation among different queries.
- A parameter gives the probability of taking a “random jump”, i.e., of transiting among different queries.
- Without any prior knowledge, the random jump is set using d, a uniform stochastic distribution vector.

- Random Walk on the Query-URL graph
- With the transition probability matrix P defined, a random walk can be performed on the query-URL graph.
- The probability of transition from node i to node j after a t-step random walk is $P^t(i, j) = [P^t]_{ij}$.

Explanation:

1) The random walk sums the probabilities of all paths of length t between the two nodes. If there are many paths, the transition probability will be high.

2) The larger the transition probability $P^t(i, j)$ is, the more similar node j is to node i.

- First suggestion: after performing a t-step random walk, the query with the largest transition probability from node q is recommended as the first suggested query.

- Parameter t
- determines the resolution of the Markov random walk
- Large t: the random walk depends more on the graph structure
- Small t: preserves information about the starting node

- Employ the hitting time to rank and diversify the rest of the queries.
- Hitting time
- Let S be a subset of the vertex set V. The expected hitting time h(i|S) is the expected number of steps before a random walk starting at node i first visits the set S.
- N(i) denotes the neighbors of node i.

- Property
- nodes strongly connected to s1 will have many fewer visits by the random walk
- nodes far away from s1 still allow the random walk to move among them and thus receive more visits

- The second suggestion node
- select the second suggestion node s2 ∈ Q with the largest expected hitting time to the subset S containing the two nodes q and s1.
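The selection procedure described above can be sketched as a greedy loop. This is an illustrative approximation rather than the AAAI’10 implementation: it omits the random jump, assumes the standard recurrence h(i|S) = 0 for i ∈ S and h(i|S) = 1 + Σ_{j ∈ N(i)} p(i → j) h(j|S) otherwise, and runs on a made-up five-node graph of query nodes only (the function names and weights are hypothetical).

```python
import numpy as np


def hitting_time_to_set(P, targets, iterations=1000):
    """Approximate expected hitting time h(i|S) from every node to the set S."""
    h = np.zeros(P.shape[0])
    idx = list(targets)
    for _ in range(iterations):
        h = 1.0 + P @ h
        h[idx] = 0.0
    return h


def diversified_suggestions(P, q, candidates, k, t=3):
    """First pick: largest t-step probability from q; later picks: largest h(i|S)."""
    Pt = np.linalg.matrix_power(P, t)
    remaining = [c for c in candidates if c != q]
    first = max(remaining, key=lambda c: Pt[q, c])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        h = hitting_time_to_set(P, [q] + selected)
        nxt = max(remaining, key=lambda c: h[c])
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Hypothetical symmetric weights: 0 is the input query q.
W = np.array([[0, 4, 2, 1, 0],
              [4, 0, 4, 0, 0],
              [2, 4, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
P = W / W.sum(axis=1, keepdims=True)
suggestions = diversified_suggestions(P, q=0, candidates=[1, 2, 3, 4], k=2)
```

Node 1 is chosen first because it is most strongly connected to q; node 4 is chosen next because it is farthest from both q and node 1, whereas node 2 sits right next to them.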

Query Recommendation

- Manifold ranking
- Import stop points
- A novel unified framework: manifold ranking with stop points, addressing both relevance and diversity

[Figure: queries query1, query2, …, queryn and their affinity matrix W; Steps 1-3 of the algorithm with equations (1)-(4), where W is the affinity matrix and D is a diagonal matrix]
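For orientation, here is a sketch of plain manifold ranking over an affinity matrix W with degree matrix D. It is not the stop-point variant of the WWW’11 paper, whose equations are not reproduced in this text; it uses the commonly used update f ← αSf + (1 - α)y with S = D^{-1/2} W D^{-1/2}. The value of α and the three-query affinity matrix are assumptions.

```python
import numpy as np


def normalized_affinity(W):
    """S = D^{-1/2} W D^{-1/2}, where D is the diagonal degree matrix of W."""
    d = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]


def manifold_ranking(W, y, alpha=0.9, iterations=500):
    """Iterate f <- alpha * S f + (1 - alpha) * y until (approximately) converged."""
    S = normalized_affinity(W)
    f = y.astype(float).copy()
    for _ in range(iterations):
        f = alpha * (S @ f) + (1 - alpha) * y
    return f


# Hypothetical affinities among three queries; query 0 is the input query.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
y = np.array([1.0, 0.0, 0.0])
scores = manifold_ranking(W, y)
```

The scores spread from the input query along the affinity structure, so queries closer to it on the manifold rank higher.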

- Automatic Evaluation
- Open Directory Project (ODP) <-> Relevance
- Given two queries q and q’, compare their ODP categories c(q) and c(q’):

c(q): ‘Arts/Television/News’

c(q’): ‘Arts/Television/Stations/North America/United States’

l(c, c’): their longest common prefix, e.g., ‘Arts/Television’

max{|c|, |c’|}: the length of the longer of the categories c and c’, e.g., 5

- Automatic Evaluation
- Commercial search engine (i.e., Google) <-> Diversity
- Given two queries q and q’:

o(q, q’) is the number of overlapping URLs among the top k search results of queries q and q’.
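The two automatic measures can be sketched as below. The formulas are assumptions inferred from the definitions above, since the slide equations are not preserved: relevance is taken as the length of the longest common category prefix divided by the length of the longer category, and diversity as one minus the fraction of overlapping URLs in the top k results.

```python
def odp_relevance(c1: str, c2: str) -> float:
    """|l(c, c')| / max(|c|, |c'|) over '/'-separated ODP category paths."""
    p1, p2 = c1.split("/"), c2.split("/")
    common = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        common += 1
    return common / max(len(p1), len(p2))


def result_diversity(top1, top2, k: int) -> float:
    """1 - o(q, q') / k, where o counts URLs shared by the two top-k lists."""
    overlap = len(set(top1[:k]) & set(top2[:k]))
    return 1.0 - overlap / k


rel = odp_relevance(
    "Arts/Television/News",
    "Arts/Television/Stations/North America/United States",
)
# Hypothetical top-4 result lists for two queries.
div = result_diversity(["a", "b", "c", "d"], ["c", "x", "y", "z"], k=4)
```

For the category example above, the common prefix has length 2 and the longer category has length 5, giving a relevance of 2/5.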

- Evaluation metrics
- Q-measure
- β: parameter that controls the tradeoff between relevance and diversity

Figure: Average Q-measure of query recommendation over different recommendation sizes under 5 approaches.

- Manual Evaluation
- Recommendation pool
- 3 human judges
- Labeling tool

- Evaluation Metrics
- α-nDCG (α-normalized Discounted Cumulative Gain)
- Intent-Coverage

Table 2: Performance of recommendation results over a sample of queries under five different approaches.

- Existing work focuses on recommending queries that are relevant to users’ initial queries, based on:
- Common query terms (Wen J. et al., WWW 2001)
- Same clicked documents (Mei Q. et al., CIKM 2008)
- Co-occurrence in the same search sessions (Zhang Z. et al., WWW 2006)

Is recommending only relevant queries enough to find useful search results?

Example: for the initial query ‘iphone sell time’, a relevant recommendation is ‘iphone start sell’, whereas a high-utility recommendation is ‘iphone initial release’.

Recommend High Utility Query

- More Than Relevance: High Utility Query Recommendation By Mining Users’ Search Behaviors[X.F. Zhu, CIKM’12]
- Probabilistic Graphical Model (Query Utility Model)

- Recommending High Utility Query via Session-Flow Graph [X.F. Zhu, ECIR’13]
- Session-Flow Graph
- Two-phase model based on absorbing random walk

[Figure: example reformulations with bad perceived utility and bad posterior utility; red marks relevant results, √ marks attractive results]

- R_i: whether there is a reformulation at position i
- C_i: whether the user clicks on some of the search results of the reformulation at position i
- A_i: whether the user is attracted by the search results of the reformulation at position i
- S_i: whether the user’s information needs have been satisfied at position i

- Maximum Likelihood Estimation
- Log likelihood function
- Maximize the log likelihood function, with a Lagrange multiplier and a regularization term
- Optimization condition
- Solved with Newton-Raphson

- Dataset
- Our experiments are based on a publicly available query log, the UFindIt log data. There are 40 search tasks in total, represented by 40 test queries.

- Metrics
- QRR (Query Relevant Ratio): measures the probability that a user finds relevant results when she uses query q for her search task.
- MRD (Mean Relevant Document): measures the average number of relevant results a user finds when she uses query q for her search task.

Query-Flow Graph (QF): builds a query-flow graph from collective search sessions and performs a random walk on this graph for query recommendation [CIKM’08].

Click-through Graph (CT): builds a query-URL bipartite graph and employs the hitting time as a measure to select queries for recommendation [CIKM’08].

Adjacency (ADJ): given a test query q, the most frequent queries adjacent to q in the same session are recommended to users [WWW’06].

Co-occurrence (CO): given a test query q, the most frequent queries co-occurring with q in the same session are selected as recommendations [WSDM’10].

Query Utility Model (QUM): the utility of a query is the expected information gain users obtain from its search results with respect to their original information needs; it is the product of two component utilities.

The two component utilities of the QUM method (perceived utility and posterior utility) are also evaluated separately, as the Perceived Utility method (PCU) and the Posterior Utility method (PTU).

Figure: Impact of parameter μ on the performance of QUM

- QUM cannot make full use of the click-through information.
- It only considers whether the search results of a reformulated query have any clicked documents, and does not take the individual clicked documents into consideration.

- A new method is therefore needed that captures these specific clicked documents when modeling query utility.

Two-phase model based on Absorbing Random Walk (TARW)

- Session-Flow Graph = Query-Flow Graph + Document Nodes (reformulation behaviors + click behaviors)
- Absorbing Random Walk = Random Walk + Absorbing States

Query sessions (queries only):

q → q1 → q3
q → q3 → q4
q → q4
…

Query-Flow Graph: Boldi et al. (CIKM 2008)

Query sessions with clicked documents:

q → q1:u1:u2 → q3:u3
q → q3 → q4:u4:u5
q → q4:u6
…

Session-Flow Graph: expands the query-flow graph with document nodes and failure nodes

- Definition: the session-flow graph is specified by its nodes, its edges, and its adjacency matrix.

Two-phase Model Based on Absorbing Random Walk

- Forward utility propagation: the utility score is transferred from the original query node to the reformulation nodes, and is finally absorbed by the document nodes and the failure node.
- Backward utility propagation: the utility score is transferred back from the document nodes to the reformulation nodes.

Recommendation: queries with the highest utilities.

- Assign transition probabilities to the different types of nodes (reformulation, document, failure):
- Reformulation node: α1
- Document node: α2
- Failure node: α3
- α1 + α2 + α3 = 1

- Previous work (Sadikov, WWW 2010): uses one shared transition probability setting (α1, α2, α3) for the different types of nodes.

- Our work: assigns transition probabilities based on the characteristics of each candidate query.
- A prior transition probability and the observed transition probability yield a posterior transition probability.
- The posterior is defined separately for reformulation nodes, document nodes, and the failure node.

- In the forward utility propagation, the corresponding transition matrix is:

$$P = \begin{pmatrix} P_Q & P_D & P_S \\ 0 & I_D & 0 \\ 0 & 0 & I_S \end{pmatrix}$$

$P_Q$: n × n transition matrix on query nodes

$P_D$: n × m matrix of transitions from query nodes to document nodes

$P_S$: n × 1 matrix of transitions from query nodes to the failure node

$I_D$, $I_S$: identity matrices, denoting that document nodes and the failure node are absorbing states

The chain is reducible (no unique stationary distribution).

- Computing the absorbing distribution iteratively:

$P^t[i, j]$ represents the probability of moving from node i to node j after a t-step walk.

We only have to compute the probabilities from query nodes to document nodes: $O(t n^3 + n^2 m)$.

In the recommendation scenario, only the probabilities from the original query to the documents are needed, i.e., only the matrix row of the original query has to be computed: $O(t n^2 + n m)$.
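The single-row computation can be sketched as follows: propagate only the row vector of the original query through P_Q for t steps, accumulate the visit mass on query nodes, and multiply once by P_D and P_S. The function name, the step count, and the small matrices are assumptions made for the example.

```python
import numpy as np


def absorbing_distribution(P_Q, P_D, P_S, start, steps=100):
    """Probability of being absorbed at each document / the failure node within `steps` steps."""
    n = P_Q.shape[0]
    x = np.zeros(n)
    x[start] = 1.0            # the walk starts at the original query
    visits = np.zeros(n)
    for _ in range(steps):    # O(t * n^2): one vector-matrix product per step
        visits += x
        x = x @ P_Q
    # O(n * m): absorb the accumulated query mass into documents and failure
    return visits @ P_D, visits @ P_S


# Hypothetical graph: 2 query nodes, 2 document nodes, 1 failure node.
P_Q = np.array([[0.0, 0.5],
                [0.0, 0.0]])
P_D = np.array([[0.3, 0.0],
                [0.0, 0.6]])
P_S = np.array([[0.2],
                [0.4]])
doc_probs, fail_prob = absorbing_distribution(P_Q, P_D, P_S, start=0)
```

In this toy graph all probability mass is absorbed after two steps, so the document and failure probabilities sum to 1.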

- Overall Evaluation Results
- The TARW method performs significantly better than all the baseline recommendation methods (p-value <= 0.05).

- Baseline methods:
- Document Frequency Based Method (DF)
- the click frequency of a document reflects users’ preference for that document when they search with the original query

- Session Document Frequency Based Method (SDF)
- clicked documents within the same search session convey similar search intent

- Markov-model Based Method (MM)
- based on the document distribution learned for the original query by a Markov-model based method

- Metrics:
- Precision at position k (P@k)
- Normalized Discounted Cumulative Gain (NDCG)
- Mean Average Precision (MAP)

TARW improves over MM by:

- using an adaptive transition probability setting for the different types of nodes
- modeling users’ behavior of giving up their search tasks by introducing failure nodes

- Query recommendation techniques
- High-relevance query recommendation
- High-diversity query recommendation
- High-utility query recommendation