Query Recommendation

Xiaofei Zhu ([email protected])

L3S Research Center, Leibniz Universität Hannover

Query Recommendation

- It aims to provide users with alternative queries that represent their information needs more clearly, so that the search engine returns better results.

Query Recommendation

- How to do query recommendation?
- Find alternative queries with similar search intent.
- How does it differ from document or image recommendation?

Query log

- Query log.
- A query log records information about the search actions of the users of a search engine.

- A typical query log is a set of records <q_i, u_i, t_i, V_i, C_i>
- q_i: the submitted query
- u_i: an anonymized identifier for the user who submitted the query
- t_i: the timestamp at which the query was submitted for search
- V_i: the set of results returned for the query
- C_i: the set of documents clicked by the user
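The record structure above can be sketched as a small data type. This is only an illustration: the class name `QueryLogRecord` and its field names are assumptions, and the sample values are taken from the AOL excerpt shown on the next slide, which does not list the returned results.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryLogRecord:
    """One query-log record <q_i, u_i, t_i, V_i, C_i> (field names are hypothetical)."""
    query: str           # q_i: the submitted query
    user_id: str         # u_i: anonymized user identifier
    timestamp: datetime  # t_i: submission time
    results: tuple = ()  # V_i: returned results (often absent from public logs)
    clicks: tuple = ()   # C_i: clicked documents


record = QueryLogRecord(
    query="motorola text messages",
    user_id="7051923",
    timestamp=datetime(2006, 3, 24, 19, 35, 31),
    clicks=("http://www.telusmobility.com", "http://support.t-mobile.com"),
)
print(record.query, len(record.clicks))
```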

Example of query log (AOL, 2006)

| AnonID | Query | QueryTime | ItemRank | ClickURL |
|---|---|---|---|---|
| 7051923 | motorola text messages | 2006-03-24 19:35:31 | 1 | http://www.telusmobility.com |
| 7051923 | motorola text messages | 2006-03-24 19:35:31 | 4 | http://support.t-mobile.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 2 | http://www.phonescoop.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 3 | http://www.1800mobiles.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 5 | http://cgi.ebay.com |
| 7051923 | motorola t730 text messages | 2006-03-24 19:38:40 | 7 | http://phonearena.com |
| 7051923 | spike muscle car | 2006-03-25 12:57:43 | 2 | http://www.classicauto-sales.com |
| 7051923 | spike muscle car | 2006-03-25 12:57:43 | 5 | http://sev.prnewswire.com |
| 7051923 | spike muscle car | 2006-03-25 13:00:22 | | |
| 7051923 | usps | 2006-03-25 14:23:21 | 1 | http://www.usps.com |
| 7051923 | vc2 auctions | 2006-03-25 14:31:41 | | |
| 7051923 | auctions for 1 | 2006-03-25 14:33:47 | | |

Microsoft 2006 RFP dataset

| Time | Query | QueryID | SessionID | ResultCount |
|---|---|---|---|---|
| 2006-05-01 00:00:01 | defination Gravitational | 46c13f0705f6436b | 19ab975e898d46d1 | 11 |
| 2006-05-01 00:00:01 | kimclement | a3d2cae45e2b4c5b | 1b748d1afa9b4828 | 10 |
| 2006-05-01 00:00:01 | scientology crazy beliefs | 418324ef33d14ed2 | 10f477402db84c9a | 10 |
| 2006-05-01 00:00:01 | www.joj.sk | 489238bdf8834d68 | 16271eb6bf174c5c | 9 |
| 2006-05-01 00:00:04 | www.selectcareers.com | f92efd8044904ac4 | 193f9f8442d44c48 | 0 |
| 2006-05-01 00:00:08 | What is May Day? | 37afe7af832649d2 | 21f6a0dfea4348ac | 14 |
| 2006-05-01 00:00:10 | vikings draft choices suck | b0519e4528d84b44 | 196b0bb2f1d643f2 | 10 |
| 2006-05-01 00:00:10 | wwwcrownawards.com | 9eda4716dfb045e2 | 04e3a26067a84748 | 0 |
| 2006-05-01 00:00:15 | Australian miners | ba6d190cc4cd4fd3 | 136fd5e571d24886 | 10 |

| QueryID | Query | Time | URL | Position |
|---|---|---|---|---|
| 0000003a718649f2 | schwab | 2006-05-11 08:07:35 | http://www.schwab.com/ | 1 |
| 0000006d43b549c1 | us geography | 2006-05-04 14:23:00 | http://www.enchantedlearning.com/usa/ | 3 |
| 0000006d43b549c1 | us geography | 2006-05-04 14:23:03 | http://www.sheppardsoftware.comState15s_500.html | 4 |
| 0000016aa52e4fbc | wwf | 2006-05-21 09:25:34 | http://www.panda.org/ | 2 |
| 000002aa6e27443f | biggercity | 2006-05-07 13:30:45 | http://www.biggercity.com/chat/ | 1 |
| 1000005aac1f6423f | studios | 2006-05-09 14:21:29 | http://www.shawneestudios.com/contact_us.php | 1 |
| 1000008d8afaa459a | www.nfl.com | 2006-05-28 18:22:39 | http://www.nfl.com/teams/NYJ.html | 7 |
| 7000009c2848e4a68 | north hills school district | 2006-05-04 12:29:12 | http://www.nhsd.net/ | 1 |

How to use query log for query recommendation?

Click-through data

If a user clicks a document after issuing a query, then the clicked document is more or less relevant to the submitted query, so the query can be represented by its clicked documents.

- Click-through data records the documents a user clicks after submitting a query to the search engine.

Basic Assumption

[Mei, CIKM’08]

[Beeferman, KDD’00]

Query Feature Representation

If two queries lead to clicks on many of the same documents, then they have similar search intent.

Query-URL Graph
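The co-click assumption can be illustrated with a small sketch. The slides do not name a specific similarity measure, so Jaccard overlap of clicked-URL sets is used here as an assumption; the function names are hypothetical, and one click pair in the sample data is invented to make the overlap non-empty.

```python
from collections import defaultdict


def clicked_urls(click_log):
    """Map each query to the set of URLs clicked for it."""
    urls = defaultdict(set)
    for query, url in click_log:
        urls[query].add(url)
    return dict(urls)


def coclick_similarity(urls, q1, q2):
    """Jaccard overlap of clicked-URL sets (an assumed measure, not the slides')."""
    a, b = urls.get(q1, set()), urls.get(q2, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


click_log = [
    ("motorola text messages", "http://www.telusmobility.com"),
    ("motorola text messages", "http://support.t-mobile.com"),
    ("motorola t730 text messages", "http://www.phonescoop.com"),
    # Hypothetical click, added so the two queries share one document.
    ("motorola t730 text messages", "http://support.t-mobile.com"),
    ("usps", "http://www.usps.com"),
]
urls = clicked_urls(click_log)
related = coclick_similarity(urls, "motorola text messages", "motorola t730 text messages")
unrelated = coclick_similarity(urls, "motorola text messages", "usps")
print(related, unrelated)
```

Queries with no shared clicks get a similarity of zero, which is why purely click-based methods need enough click data per query to be useful.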

How to use query log for query recommendation?

Query Session

If two queries frequently co-occur in the same sessions, then they are relevant to each other.

- Query session: a single user submits a sequence of related queries in a time interval for a specific search task.

[Fonseca, LA-WEB’03]

Basic Assumption

[Zhang, WWW’06]

[Boldi, CIKM’08, WSCD’09]

Association Rules

Queries submitted consecutively by the same user within a short time interval share similar search intent.

Query Graph
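A minimal sketch of the session assumption follows: it splits one user's log into sessions and counts query pairs that co-occur in a session. The 30-minute gap threshold and the function names are assumptions, not something the slides specify; the events are the AOL rows shown earlier.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations


def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Group (user, query, timestamp) events into per-user sessions by time gap."""
    sessions = []
    last = {}  # user -> (timestamp of last event, current session list)
    for user, query, ts in sorted(events, key=lambda e: (e[0], e[2])):
        prev = last.get(user)
        if prev is None or ts - prev[0] > max_gap:
            current = []
            sessions.append(current)
        else:
            current = prev[1]
        if not current or current[-1] != query:  # collapse immediate repeats
            current.append(query)
        last[user] = (ts, current)
    return sessions


def cooccurrence_counts(sessions):
    """Count how many sessions contain each unordered query pair."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


events = [
    ("7051923", "motorola text messages", datetime(2006, 3, 24, 19, 35, 31)),
    ("7051923", "motorola t730 text messages", datetime(2006, 3, 24, 19, 38, 40)),
    ("7051923", "spike muscle car", datetime(2006, 3, 25, 12, 57, 43)),
    ("7051923", "spike muscle car", datetime(2006, 3, 25, 13, 0, 22)),
    ("7051923", "usps", datetime(2006, 3, 25, 14, 23, 21)),
    ("7051923", "vc2 auctions", datetime(2006, 3, 25, 14, 31, 41)),
    ("7051923", "auctions for 1", datetime(2006, 3, 25, 14, 33, 47)),
]
sessions = split_sessions(events)
pairs = cooccurrence_counts(sessions)
print(sessions)
```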

High Relevant Query Recommendation

- Query Suggestion Using Hitting Time (CIKM’08)
- Click-through Data
- Query-URL Bipartite Graph

- Query Suggestions Using Query-Flow Graphs (WSCD’09)
- Session Data
- Query-Flow Graph

Query Suggestion Using Hitting Time (CIKM’08)

- Query-URL Bipartite Graph
- Edges between V1 and V2
- No edge inside V1 or V2
- Edges are weighted
- e.g., V1 = query; V2 = Url

- Transition Probabilities

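[Figure: weighted bipartite graph with node sets V1 and V2, example nodes i and j, and a target node A; edges carry integer weights]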
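As a rough illustration of transition probabilities on a weighted bipartite graph, the sketch below normalizes each node's outgoing edge weights so that they sum to one. The click counts and function names are hypothetical.

```python
def transition_probabilities(weights):
    """Row-normalize edge weights: p(i -> j) = w(i, j) / sum_k w(i, k)."""
    probabilities = {}
    for node, neighbors in weights.items():
        total = sum(neighbors.values())
        probabilities[node] = {n: w / total for n, w in neighbors.items()}
    return probabilities


def reverse_edges(weights):
    """Flip a V1 -> V2 weight map into a V2 -> V1 weight map."""
    reversed_weights = {}
    for node, neighbors in weights.items():
        for neighbor, w in neighbors.items():
            reversed_weights.setdefault(neighbor, {})[node] = w
    return reversed_weights


# Hypothetical click counts between queries (V1) and URLs (V2).
query_to_url = {
    "aa": {"www.aa.com": 3, "www.theaa.com": 1},
    "american airline": {"www.aa.com": 1},
}
p_query_url = transition_probabilities(query_to_url)
p_url_query = transition_probabilities(reverse_edges(query_to_url))
print(p_query_url["aa"], p_url_query["www.aa.com"])
```

Because the graph is bipartite, a walk alternates between the two maps: a query moves to a URL, and a URL moves back to a query.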

Query Suggestion Using Hitting Time (CIKM’08): Random Walk and Hitting Time

- Hitting time: how long does it take to hit node a in a random walk starting at node b?
- Start at node 1.
- Pick a neighbor i according to the transition probability (uniformly at random if the edges are unweighted).
- Move to i.
- Continue.

If the random walk hits a node quickly, then it is close to the start node. This is what the hitting time measures.

[Figure: random walk over a graph with nodes 1, 3, 4, and 5, shown at steps t = 1 and t = 2]

Generate Query Suggestion

- Construct a (kNN) subgraph from the query log data (of a predefined number of queries/urls)
- Compute transition probabilities p(i → j)
- Compute hitting time h_i^A
- Rank candidate queries using h_i^A
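The steps above can be sketched as follows, assuming query-to-query transitions are obtained by a two-step walk through shared URLs and the hitting time is truncated after a fixed number of iterations. The kNN subgraph construction is omitted, and the click counts, the query "the aa", and all function names are hypothetical.

```python
def query_transition(clicks):
    """p(i -> j) over queries via the two-step walk query -> URL -> query."""
    url_totals = {}
    for urls in clicks.values():
        for url, w in urls.items():
            url_totals[url] = url_totals.get(url, 0) + w
    p = {}
    for i, urls in clicks.items():
        d_i = sum(urls.values())
        row = {}
        for url, w_iu in urls.items():
            for j, other in clicks.items():
                if url in other:
                    step = (w_iu / d_i) * (other[url] / url_totals[url])
                    row[j] = row.get(j, 0.0) + step
        p[i] = row
    return p


def hitting_times(p, target, steps=50):
    """Truncated expected hitting time to `target`: h_i = 1 + sum_j p_ij * h_j."""
    h = {i: 0.0 for i in p}
    for _ in range(steps):
        h = {
            i: 0.0 if i == target
            else 1.0 + sum(pij * h[j] for j, pij in p[i].items())
            for i in p
        }
    return h


# Hypothetical click counts (query -> URL -> number of clicks).
clicks = {
    "aa": {"www.aa.com": 3, "www.theaa.com": 1},
    "american airline": {"www.aa.com": 2},
    "the aa": {"www.theaa.com": 1},
}
p = query_transition(clicks)
h = hitting_times(p, target="aa")
ranking = sorted((q for q in h if q != "aa"), key=lambda q: h[q])
print(ranking)
```

A smaller hitting time means the walk reaches the target query sooner, so candidates are listed in ascending order of h.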

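[Figure: example query-URL bipartite graph. Query nodes include 'aa', 'mexiana', and 'american airline'; URL nodes include www.aa.com, www.theaa.com/travelwatch/planner_main.jsp, and en.wikipedia.org/wiki/Mexicana; edges carry click counts such as 300 and 15.]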

Result: Query Suggestion

Query = ‘aa’

High Relevant Query Recommendation

- Query Suggestion Using Hitting Time (CIKM’08)
- Click-through Data
- Query-URL Bipartite Graph

- Query Suggestions Using Query-Flow Graphs (WSCD’09)
- Session Data
- Query-Flow Graph

Query Suggestions Using Query-Flow Graphs (WSCD’09)

- Session Data
- Definition: the sequence of queries issued by one particular user within a specific time limit.

Query Graph

- Similarity is defined for two cases: two consecutive queries, and queries that are not neighbors in the same session.
- The model works by accumulating many query sessions and adding up the similarity values of each query pair across all the sessions in which it occurs.

Z. Zhang and O. Nasraoui. Mining search engine query logs for query recommendation. In WWW, pages 1039–1040, 2006.

Query-Flow Graph

P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, S. Vigna: “The query-flow graph: model and applications”. CIKM 2008.

Build Query-flow Graph

- The key aspect of the construction of the query-flow graph is to define the weighting function w.

w(q, q') can represent the number of times the transition from q to q' was observed in the same search session.
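A minimal sketch of this count-based weighting follows, assuming sessions are already available as ordered lists of queries. The session data and function names are hypothetical, and the published query-flow graph also considers other weighting schemes not shown here.

```python
from collections import Counter


def build_query_flow_graph(sessions):
    """Count observed transitions q -> q' between consecutive queries in a session."""
    counts = Counter()
    for session in sessions:
        for q, q_next in zip(session, session[1:]):
            counts[(q, q_next)] += 1
    return counts


def row_normalize(counts):
    """Turn transition counts into probabilities P(q, q') that sum to 1 per source q."""
    totals = Counter()
    for (q, _), c in counts.items():
        totals[q] += c
    return {(q, q_next): c / totals[q] for (q, q_next), c in counts.items()}


# Hypothetical sessions over abstract queries.
sessions = [["q", "q1", "q3"], ["q", "q3", "q4"], ["q", "q4"]]
counts = build_query_flow_graph(sessions)
P = row_normalize(counts)
print(P[("q", "q1")], P[("q3", "q4")])
```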

Query Recommendation

- The query recommendation methods are based on the probability of being at a certain node after performing a random walk over a query graph.
- Random Walk with restart
- a random surfer starts at the initial query q
- at each step
- with probability α, follows one of the outlinks from the current node
- with probability 1 - α, jumps back to q

In matrix form, the random walk with restart from the j-th query is a Markov chain with transition matrix $M = \alpha P + (1 - \alpha)\,\mathbf{1} e_j^{T}$, where:

- M: the transition matrix of the Markov chain
- P: the row-normalized weight matrix of the query-flow graph
- e_j: the vector whose j-th entry is 1 and whose other entries are zero
- $\mathbf{1}$: the all-ones column vector
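The random walk with restart can be sketched with a simple power iteration. This is an illustration under assumptions: the graph is hypothetical, α = 0.85 is an arbitrary choice, and nodes without outlinks are handled by sending all of their probability mass back to the start query, which the slides do not specify.

```python
def random_walk_with_restart(P, start, alpha=0.85, iterations=100):
    """Approximate the long-run distribution of a walk that restarts at `start`."""
    nodes = list(P)
    x = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for i in nodes:
            if not P[i]:  # assumption: a dangling node always jumps back to start
                nxt[start] += x[i]
                continue
            for j, p_ij in P[i].items():
                nxt[j] += alpha * x[i] * p_ij  # follow an outlink
            nxt[start] += (1 - alpha) * x[i]   # jump back to the start query
        x = nxt
    return x


# Hypothetical row-normalized query-flow graph; every node appears as a key.
P = {
    "q": {"q1": 1 / 3, "q3": 1 / 3, "q4": 1 / 3},
    "q1": {"q3": 1.0},
    "q3": {"q4": 1.0},
    "q4": {},
}
scores = random_walk_with_restart(P, start="q")
recommendations = sorted((n for n in scores if n != "q"), key=lambda n: -scores[n])
print(recommendations)
```

Queries are recommended in descending order of their score, excluding the start query itself.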

Random walks

- Random walks on graphs correspond to Markov Chains
- The set of states S is the set of nodes of the graph G
- The transition probability matrix is the probability that we follow an edge from one node to another

Probability Distributions

$x_t(i)$ = probability that the surfer is on node i at time t

$x_{t+1}(i) = \sum_j (\text{probability of being at node } j) \cdot \Pr(j \to i) = \sum_j x_t(j)\, P(j, i)$

$x_{t+1} = x_t P = x_{t-1} P^2 = x_{t-2} P^3 = \dots = x_0 P^{t+1}$

What happens when the surfer keeps walking for a long time?

- Stationary Distribution
- Intuitively: the stationary distribution at a node is related to the amount of time a random walker spends visiting that node.
- Mathematically: the probability distribution evolves as $x_{t+1} = x_t P$. For the stationary distribution $v_0$ we have $v_0 = v_0 P$.

$v_0$ is a left eigenvector of the transition matrix P, with eigenvalue 1.
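To make the fixed-point condition concrete, the sketch below repeatedly applies x ← xP to a small hypothetical chain and checks that the result satisfies v = vP. The matrix and function names are illustrative only.

```python
def step(x, P):
    """One step of the chain: returns the row vector x P."""
    n = len(P)
    return [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]


def stationary_distribution(P, iterations=200):
    """Power iteration from the uniform distribution."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = step(x, P)
    return x


# Hypothetical 3-state chain; it is irreducible and aperiodic.
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
v = stationary_distribution(P)
residual = max(abs(a - b) for a, b in zip(v, step(v, P)))
print(v, residual)
```

For this chain the iteration settles on (0.25, 0.5, 0.25), which can be verified by hand from v = vP.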

Interesting questions

- Does a stationary distribution always exist? Is it unique?
- Yes, if the graph is “well-behaved”, i.e., P is ergodic

- P is ergodic if it is both irreducible and aperiodic.

Irreducible: There is a path from every node to every other node.

Aperiodic: state i is periodic with period k > 1 if every path from i back to i has a length that is a multiple of k. Otherwise, it is aperiodic.

[Figure: example graphs illustrating an irreducible chain, a chain that is not irreducible, an aperiodic chain, and a chain with period 3]

- If a Markov chain P is irreducible and aperiodic, then the largest eigenvalue of the transition matrix is equal to 1 and all the other eigenvalues are strictly less than 1 in absolute value.
- Let the eigenvalues of P be $\{\sigma_i \mid i = 0, \dots, n-1\}$ in non-increasing order of $|\sigma_i|$.
- $\sigma_0 = 1 > |\sigma_1| \ge |\sigma_2| \ge \dots \ge |\sigma_{n-1}|$

Why Diversity Query Recommendation

Relevance

- In query recommendation, providing only “relevant” recommendations falls far short of satisfying users’ information needs.

Original query: Apple

- apple ipad 3
- apple tree
- apple iphone 4s
- apple seed
- apple computer

The queries we recommend should cover multiple potential search intents of users and minimize the risk that users will not be satisfied.

High Diversity Query Recommendation

- Diversifying Query Suggestion Results [Hao Ma, AAAI’10]
- Query-URL graph
- Hitting time

- A Unified Framework for Recommending Diverse and Relevant Queries [Xiaofei Zhu, WWW’11]
- Manifold
- Manifold Ranking with Stop Points

Determining the First Suggested Query

- Initial Transition Probability: $p_{ij} = w(i, j) / d_i$
- $w(i, j)$: the click frequency between node i and node j
- $d_i$: normalization term, the total number of times that the query node i has been issued in the dataset
- $p_{ij}$: initial transition probability from node i to node j

Determining the First Suggested Query

- Random Jump
- In addition to the transition probability, there are random relations among different queries.
- The model adds a uniform random relation among different queries.
- A jump parameter gives the probability of taking a “random jump”, i.e., transiting among different queries.
- Without any prior knowledge, the jump distribution is set to d, where d is a uniform stochastic distribution vector.

Determining the First Suggested Query

- Random Walk on the Query-URL graph
- With the transition probability matrix P defined, we can perform the random walk on the query-URL graph.
- The probability of transition from node i to node j after a t-step random walk is $P_t(i, j) = [P^t]_{ij}$.

Explanation:

1) The random walk sums the probabilities of all paths of length t between the two nodes. If there are many paths, the transition probability will be high.

2) The larger the transition probability $P_t(i, j)$ is, the more similar node j is to node i.

Determining the First Suggested Query

- After performing a t-step random walk, the query with the largest transition probability from node q is recommended as the first suggested query.
- Parameter t determines the resolution of the Markov random walk.
- Large t: the random walk depends more on the graph structure.
- Small t: the random walk preserves information about the starting node.

Ranking the Rest Queries

- Employ the hitting time to rank and diversify the rest of the queries.
- Hitting time: let S be a subset of the vertex set V. The expected hitting time h(i|S) is the expected number of steps before a random walk starting at node i first visits the set S.
- For i in S, h(i|S) = 0; otherwise $h(i|S) = 1 + \sum_{j \in N(i)} p_{ij}\, h(j|S)$, where N(i) denotes the neighbors of node i.

Ranking the Rest Queries

- Property
- Nodes strongly connected to s1 receive far fewer visits from the random walk, because walks that reach s1 stop there.
- Nodes far away from s1 still allow the random walk to move among them and thus receive more visits.

- The second suggestion node
- select the second suggestion node s2 ∈ Q with the largest expected hitting time to the subset S containing two nodes q and s1.
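The selection procedure can be sketched as a greedy loop: pick the first suggestion by t-step transition probability, then repeatedly add the candidate with the largest hitting time to the set of already chosen nodes. The transition probabilities below are hypothetical, the hitting time is truncated after a fixed number of iterations, and the function names are assumptions.

```python
def t_step_probabilities(p, start, t):
    """Distribution over nodes after a t-step walk from `start`."""
    x = {start: 1.0}
    for _ in range(t):
        nxt = {}
        for i, x_i in x.items():
            for j, p_ij in p[i].items():
                nxt[j] = nxt.get(j, 0.0) + x_i * p_ij
        x = nxt
    return x


def hitting_time_to_set(p, targets, steps=200):
    """Truncated expected hitting time h(i|S) for every node i."""
    h = {i: 0.0 for i in p}
    for _ in range(steps):
        h = {
            i: 0.0 if i in targets
            else 1.0 + sum(p_ij * h[j] for j, p_ij in p[i].items())
            for i in p
        }
    return h


def diversified_suggestions(p, query, k, t=1):
    probs = t_step_probabilities(p, query, t)
    first = max((n for n in p if n != query), key=lambda n: probs.get(n, 0.0))
    selected = [first]
    while len(selected) < k:
        targets = {query, *selected}
        candidates = [n for n in p if n not in targets]
        if not candidates:
            break
        h = hitting_time_to_set(p, targets)
        selected.append(max(candidates, key=lambda n: h[n]))
    return selected


# Hypothetical query-to-query transition probabilities (each row sums to 1).
p = {
    "apple": {"apple ipad": 0.5, "apple iphone": 0.3, "apple tree": 0.2},
    "apple ipad": {"apple": 0.3, "apple iphone": 0.7},
    "apple iphone": {"apple": 0.3, "apple ipad": 0.7},
    "apple tree": {"apple": 0.2, "apple tree": 0.8},
}
suggestions = diversified_suggestions(p, "apple", k=3)
print(suggestions)
```

In this toy graph, 'apple iphone' sits right next to the first pick 'apple ipad', so 'apple tree' is selected second even though it is less strongly linked to the original query.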

High Diversity Query Recommendation

- Diversifying Query Suggestion Results [Hao Ma, AAAI’10]
- Query-URL graph
- Hitting time

- A Unified Framework for Recommending Diverse and Relevant Queries [Xiaofei Zhu, WWW’11]
- Manifold
- Manifold Ranking with Stop Points

[Figure: a novel unified framework. Manifold ranking is combined with imported stop points to give manifold ranking with stop points, which addresses both relevance and diversity.]

Traditional manifold ranking process

Step 1:

Step 2:

Step 3:

W: affinity matrix; D: diagonal matrix
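A minimal sketch of the widely used iterative form of manifold ranking follows. It assumes the symmetric normalization S = D^{-1/2} W D^{-1/2} and the update f ← αSf + (1 − α)y, which may differ in detail from the three steps on the slide. The affinity matrix, α, and the function name are hypothetical, and stop points are not modeled.

```python
import math


def manifold_ranking(W, y, alpha=0.5, iterations=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y with S = D^{-1/2} W D^{-1/2}."""
    n = len(W)
    degree = [sum(row) for row in W]  # diagonal of D
    S = [
        [W[i][j] / math.sqrt(degree[i] * degree[j]) if W[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    f = list(y)
    for _ in range(iterations):
        f = [
            alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f


# Hypothetical chain of four queries: 0 - 1 - 2 - 3, with the input query at node 0.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
y = [1.0, 0.0, 0.0, 0.0]
scores = manifold_ranking(W, y)
print(scores)
```

Scores fall off with distance from the input query along the chain, which is the relevance signal that manifold ranking provides.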

Results: Query recommendation (‘abc’, ‘yamaha’)

Evaluation Metrics

- Automatic Evaluation
- Open Directory Project (ODP) <-> Relevance
- Given two queries q and q', with ODP categories c = c(q) and c' = c(q'):

c(q): ‘Arts/Television/News’

c(q’): ‘Arts/Television/Stations/North America/United States’

l(c, c’): their longest common prefix, e.g., ‘Arts/Television’

max(|c|, |c’|): the length of the longer of the two categories c and c’, e.g., 5
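As a worked example, the sketch below scores relevance as the length of the longest common category prefix divided by the length of the longer category. The transcript does not preserve the formula itself, so this ratio is an assumption based on the quantities defined above; the function name is hypothetical.

```python
def odp_relevance(category_a, category_b):
    """|longest common prefix| / max(|c|, |c'|), counting path segments."""
    a, b = category_a.split("/"), category_b.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


c_q = "Arts/Television/News"
c_q2 = "Arts/Television/Stations/North America/United States"
relevance = odp_relevance(c_q, c_q2)
print(relevance)
```

Here the common prefix 'Arts/Television' has length 2 and the longer category has length 5, so the score is 2/5.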

Evaluation Metrics

- Automatic Evaluation
- Commercial search engine (i.e., Google) <-> Diversity
- Given two queries q and q', o(q, q') is the number of overlapping URLs among the top k search results of q and q'.
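A small sketch of an overlap-based diversity score follows. The transcript does not preserve the exact formula, so 1 − o(q, q')/k is used here as an assumption; the result lists are invented placeholders and the function name is hypothetical.

```python
def diversity(results_a, results_b, k=10):
    """Assumed score: 1 - (number of shared URLs in the two top-k lists) / k."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    return 1.0 - len(top_a & top_b) / k


# Hypothetical top-4 result lists for two queries.
results_q = ["url1", "url2", "url3", "url4"]
results_q2 = ["url4", "url5", "url6", "url7"]
score = diversity(results_q, results_q2, k=4)
print(score)
```

Two queries that return the same top-k results score 0, and two queries with no shared results score 1.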

Evaluation Metrics

- Automatic Evaluation
- Open Directory Project(ODP) <-> Relevance
- Commercial search engine (i.e., Google) <-> Diversity

- Evaluation metrics
- Q-measure

β - parameter to control the tradeoff between relevance and diversity

Experiments

- Average Q-measure of Query Recommendation over Different Recommendation Size under 5 Approaches.

[Figure: average Q-measure for the five approaches, with the proposed method highlighted]

Experiments

- Manual Evaluation
- Recommendation pool
- 3 human judges
- Label tool

Experiments

Table 2: Performance of recommendation results over a sample of queries under five different approaches.

Why High Utility Query Recommendation

- Existing work focuses on recommending queries that are relevant to users' initial queries.
- Common Query Terms (Wen J. et al., WWW 2001)
- Same Clicked Documents (Mei Q. et al., CIKM 2008)
- Co-Occurring in Same Search Sessions (Zhang Z. et al., WWW 2006)

Is recommending only relevant queries enough for finding useful search results?

[Figure: at the query level, an initial query linked to recommended queries query 1, query 2, and query 3]

Why High Utility Query Recommendation

iphone sell time

‘iphone start sell’

Recommend High Utility Query

‘iphone initial release’

High Utility Query Recommendation

- More Than Relevance: High Utility Query Recommendation By Mining Users’ Search Behaviors [X.F. Zhu, CIKM’12]
- Probabilistic Graphical Model (Query Utility Model)

- Recommending High Utility Query via Session-Flow Graph [X.F. Zhu, ECIR’13]
- Session-Flow Graph
- Two-phase model based on absorbing random walk

A Typical Search Session

[Figure: a typical search session, with annotations marking bad perceived utility and bad posterior utility. Legend: red = relevant; √ = attractiveness]

Probabilistic Graphical Model

- R_i: whether there is a reformulation at position i
- C_i: whether the user clicks on some of the search results of the reformulation at position i
- A_i: whether the user is attracted by the search results of the reformulation at position i
- S_i: whether the user’s information needs have been satisfied at position i

Parameter Estimation

- Log Likelihood Function
- Optimization Condition
- Newton-Raphson

Experimental Results

- Dataset
- Our experiments are based on a publicly available query log, namely the UFindIt log data. There are 40 search tasks in total, represented by 40 test queries.

Experimental Results

- Metric
- QRR (Query Relevant Ratio)

Measuring the probability that a user finds relevant results when she uses query q for her search task

- MRD (Mean Relevant Document)

Measuring the average number of relevant results a user finds when she uses query q for her search task.

Experimental Results

[Figure: results plot comparing the methods PTU, CT, QUM, CO, PCU, QF, and ADJ]

- Query-Flow Graph (QF): builds a query-flow graph from collective search sessions and performs a random walk on this graph for query recommendation [CIKM'08].
- Click-through Graph (CT): builds a query-URL bipartite graph and uses the hitting time as a measure to select queries for recommendation [CIKM'08].
- Adjacency (ADJ): given a test query q, the most frequent queries adjacent to q in the same session are recommended [WWW'06].
- Co-occurrence (CO): given a test query q, the most frequent queries co-occurring with q in the same session are selected as recommendations [WSDM'10].
- Query Utility Model (QUM): ranks queries by the expected information gain users obtain from the search results of a query with respect to their original information needs; this gain is the product of the two component utilities.
- Perceived Utility (PCU) and Posterior Utility (PTU): the two component utilities of the QUM method, each used on its own.

Experiments

Impact of parameter μ to the performance of QUM

Limitation of QUM method

- It cannot make full use of the click-through information.
- It only considers whether the search results of a reformulated query have some clicked documents or not, but does not take each individual clicked document into consideration.

- It is necessary to propose a novel method that further captures these specific clicked documents when modeling query utility.

Framework of Our Approach

Two-phase model based on Absorbing Random Walk (TARW)

[Figure: framework overview. The session-flow graph combines the query-flow graph (reformulation behaviors) with document nodes (click behaviors); the absorbing random walk combines a random walk with absorbing states.]

Session Flow Graph

query session

q → q1→ q3

q → q3→ q4

q → q4

Query-Flow Graph: Boldi et al. (CIKM 2008)

Session Flow Graph

query session

q → q1:u1:u2→ q3:u3

q → q3→ q4:u4:u5

q → q4:u6

Session Flow Graph: expands query-flow graph (document nodes + failure nodes)

Two-phase Model Based on Absorbing Random Walk (TARW)

- Forward Utility Propagation: the utility score is transferred from the original query node to the reformulation nodes, and is finally absorbed by document nodes and failure nodes.
- Backward Utility Propagation: the utility score is transferred back from the document nodes to the reformulation nodes.

Recommendation: queries with the highest utilities.

Forward Utility Propagation

- Assign transition probability to different types of nodes (reformulation, document, failure):
- Reformulation node: α1
- Document node: α2
- Failure node: α3
- α1 + α2 + α3 = 1

Parameter Setting:

- Previous work (Sadikov, WWW2010): all queries share the same setting (α1, α2, α3) for transitions to reformulation nodes (α1), document nodes (α2), and failure nodes (α3).
- Our work: assign transition probabilities based on the characteristics of each candidate query, combining a prior transition probability with the observed transition probability to obtain a posterior transition probability.

Computing the Distribution

- In the forward utility propagation, the corresponding transition matrix is:

$$P = \begin{pmatrix} P_Q & P_D & P_S \\ 0 & I_D & 0 \\ 0 & 0 & I_S \end{pmatrix}$$

- $P_Q$: n × n transition matrix on query nodes
- $P_D$: n × m matrix of transitions from query nodes to document nodes
- $P_S$: n × 1 matrix of transitions from query nodes to the failure node
- $I_D$, $I_S$: identity matrices, denoting that document nodes and failure nodes are absorbing states

The chain is reducible, so it has no unique stationary distribution.

Computing the Distribution

- The absorbing distribution is computed iteratively.
- $P^t[i, j]$ represents the probability of moving from node i to node j after a t-step walk.
- We only have to compute the probabilities from queries to documents, which costs $O(t n^3 + n^2 m)$.
- In the recommendation scenario, only the probabilities from the original query to the documents are needed, i.e., only the matrix row of the original query, which costs $O(t n^2 + n m)$.
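The row-only computation can be sketched as follows: propagate a single row vector through the query-to-query block for t steps, accumulate the expected visits to each query node, and apply the absorbing blocks once at the end. The matrices below are hypothetical, the function name is an assumption, and only the forward phase is shown.

```python
def absorption_from_query(PQ, PD, PS, start, steps=50):
    """Absorption probabilities for a walk starting at query node `start`.

    PQ: n x n query-to-query block, PD: n x m query-to-document block,
    PS: length-n list of query-to-failure probabilities.
    """
    n, m = len(PQ), len(PD[0])
    x = [0.0] * n
    x[start] = 1.0
    visits = [0.0] * n  # expected number of visits to each query node
    for _ in range(steps):  # O(n^2) per step, O(t n^2) overall
        visits = [v + x_i for v, x_i in zip(visits, x)]
        x = [sum(x[i] * PQ[i][j] for i in range(n)) for j in range(n)]
    # One pass over the absorbing blocks: O(n m)
    docs = [sum(visits[i] * PD[i][d] for i in range(n)) for d in range(m)]
    failure = sum(visits[i] * PS[i] for i in range(n))
    return docs, failure


# Hypothetical graph: two query nodes (original query, one reformulation), two documents.
PQ = [[0.0, 0.5], [0.0, 0.0]]
PD = [[0.3, 0.0], [0.6, 0.2]]
PS = [0.2, 0.2]
docs, failure = absorption_from_query(PQ, PD, PS, start=0)
print(docs, failure)
```

The returned values sum to one when the walk is fully absorbed within the given number of steps, as it is in this small example.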

Experimental Results

- Overall Evaluation Results

The TARW method performs significantly better than all the baseline recommendation methods (p-value <= 0.05).

Evaluation of Document Utility

- Baseline methods:
- Document Frequency Based Method (DF): the click frequency of a document reflects users' preference for that document when they search with the original query.
- Session Document Frequency Based Method (SDF): clicked documents within the same search session convey similar search intent.
- Markov-model Based Method (MM): based on the document distribution learned for the original query by a Markov-model based method.

Evaluation of Document Utility

- Metrics:
- Precision at position k (P@k)
- Normalized Discounted Cumulative Gain(NDCG)
- Mean Average Precision (MAP)

Evaluation of Document Utility

TARW improves over MM by:

- using an adaptive transition probability setting for different types of nodes
- modeling users' behavior of giving up their search tasks by introducing failure nodes

Summary

- query recommendation techniques
- High Relevant Query Recommendation
- High Diversity Query Recommendation
- High Utility Query Recommendation
