Entropy biased models for query representation on the click graph
Entropy-biased Models for Query Representation on the Click Graph. Hongbo Deng, Irwin King and Michael R. Lyu, Department of Computer Science and Engineering, The Chinese University of Hong Kong, July 21st, 2009.


Entropy-biased Models for Query Representation on the Click Graph

Hongbo Deng, Irwin King and Michael R. Lyu

Department of Computer Science and Engineering

The Chinese University of Hong Kong

July 21st, 2009


Introduction

Query log analysis improves a search engine’s capabilities:

  • Query suggestion

  • Query classification

  • Targeted advertising

  • Ranking


Introduction

  • Click graph – an important technique

    • A bipartite graph between queries and URLs

    • Edges connect a query with the URLs

    • Capture some semantic relations, e.g., “map” and “travel”
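A minimal sketch of how such a bipartite graph might be assembled from log records. The record layout (user, query, clicked URL) and the sample rows are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, query, clicked_url).
log = [
    ("u1", "map", "maps.example.com"),
    ("u2", "map", "maps.example.com"),
    ("u2", "map", "travel.example.com"),
    ("u3", "travel", "travel.example.com"),
]

def build_click_graph(records):
    """Return {query: {url: click_count}}, one entry per query-URL edge."""
    graph = defaultdict(lambda: defaultdict(int))
    for _user, query, url in records:
        graph[query][url] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {q: dict(urls) for q, urls in graph.items()}

click_graph = build_click_graph(log)
```

Here "map" and "travel" share an edge to the same URL, which is the kind of relation the click graph is meant to capture.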

How to utilize and model the click graph to represent queries?

Traditional model based on the raw click frequency (CF)

  • Robustness: some queries with skewed click counts may dominate the click graph

  • Spam: Raw CF can be easily manipulated

Propose an entropy-biased framework


Motivation

[Figure: click graph example contrasting a general URL with a specific URL]

Is a single click on different URLs equally important?

  • Basic idea

    • Various query-URL pairs should be treated differently

  • Intuition

    • Common clicks on less frequent but more specific URLs are of greater value than common clicks on frequent and general URLs


Outline

  • Introduction

  • Related Work

  • Methodology

    • Preliminaries

    • Click Frequency Model

    • Entropy-biased Model

  • Experiments

  • Conclusion


Related Work

  • Using click graph

    • Query clustering (Beeferman and Berger, KDD’00; Wen et al., WWW’01)

    • Random walks for relevance ranking in image search (Craswell and Szummer, SIGIR’07)

    • Query suggestion by computing the hitting time on a click graph (Mei et al., CIKM’08)

    • Query classification from regularized click graph (Li et al., SIGIR’08)

These methods are built on the click graph, while our objective is to investigate a better model to utilize and represent the click graph.

  • Modeling the representation

    • Use the content of clicked Web pages to define a term-weight vector model for a query (Baeza-Yates et al., 2004)

    • Represent a query as a vector of documents (URLs) without considering the content information (Baeza-Yates and Tiberi, KDD’07)

    • Propose the query-set document model to represent documents by mining frequent query patterns rather than the content information of the documents (Poblete et al., WWW’08)

These existing methods do not distinguish the variation among different query-URL pairs.

  • For personalization

    • Explore click entropy to measure the variability in click results (Dou et al., WWW’07)

    • Propose result entropy to capture how often results change (Teevan et al., SIGIR’08)

These methods focus on personalization for different queries, while our entropy-biased models focus on the weighting scheme for various query-URL pairs.


Preliminaries

Query instance:

Query:

URL:

User:


Traditional Click Frequency Model

  • Edges of click graph:

    • Weighted by the raw click frequency (CF)

  • Transition probability

    • Normalize CF

From query to URL:

From URL to query:

Based on these transition probabilities, each query and each URL can be represented by its vector of transition probabilities.
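The slide's formulas did not survive in the transcript. The sketch below assumes the usual reading of "Normalize CF": divide each edge's click count by the total clicks of its query (query to URL) or of its URL (URL to query). The counts and the names `cf`, `q_to_d` and `d_to_q` are hypothetical.

```python
def transition_probabilities(cf):
    """cf: {query: {url: clicks}} -> (P(url | query), P(query | url))."""
    q_to_d = {}
    url_totals = {}
    for query, urls in cf.items():
        query_total = sum(urls.values())
        # Row-normalize: each query's outgoing probabilities sum to 1.
        q_to_d[query] = {url: c / query_total for url, c in urls.items()}
        for url, c in urls.items():
            url_totals[url] = url_totals.get(url, 0) + c
    d_to_q = {}
    for query, urls in cf.items():
        for url, c in urls.items():
            # Column-normalize: each URL's outgoing probabilities sum to 1.
            d_to_q.setdefault(url, {})[query] = c / url_totals[url]
    return q_to_d, d_to_q

# Hypothetical raw click frequencies.
cf = {"map": {"d1": 3, "d2": 1}, "travel": {"d2": 1}}
q_to_d, d_to_q = transition_probabilities(cf)
```

The row `q_to_d["map"]` is the vector that represents the query "map"; `d_to_q["d2"]` plays the same role for URL d2.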


Traditional Click Frequency Model

  • Measure the similarity between queries

    • The most similar query under CF

      • q2 (“map”) → q1 (“Yahoo”)

    • More reasonable

      • q2 (“map”) → q3 (“travel”)

Cosine similarity: sim(qi, qj) = (qi · qj) / (‖qi‖ ‖qj‖)

The CF model only considers the raw click frequency and treats all query-URL pairs equally, even when some URLs are heavily clicked.
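The effect described on this slide can be reproduced with a toy example. The click counts below are invented for illustration (the figure's numbers are not in the transcript); d1 stands for a heavily clicked general URL and d2 for a more specific one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical raw click frequencies per query.
yahoo = {"d1": 100}
map_q = {"d1": 10, "d2": 5}
travel = {"d2": 5, "d3": 1}

sim_map_yahoo = cosine(map_q, yahoo)
sim_map_travel = cosine(map_q, travel)
```

With raw click frequency, the shared heavily clicked URL d1 dominates, so "map" comes out closer to "Yahoo" than to "travel".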


Methodology

  • Traditional click frequency model

  • Entropy-biased models


Entropy-biased Model

  • A more general and highly ranked URL

    • Connects with more queries

    • Increases ambiguity and uncertainty

  • The entropy of a URL:

    • Suppose

    • Tends to grow with n(dj), the number of queries connected to URL dj

It would be more reasonable to weight these two edges differently because of the variation of the connected URLs.
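The entropy argument can be written out as a short derivation. This is a sketch under assumed notation: P(q_i | d_j) is the URL-to-query transition probability and n(d_j) is the number of distinct queries connected to d_j. The uniform-weight case is an assumption chosen for illustration, since the slide's own formulas are missing from the transcript.

```latex
% Shannon entropy of URL d_j over the queries that link to it
E(d_j) = -\sum_{i:\; q_i \leftrightarrow d_j} P(q_i \mid d_j)\,\log P(q_i \mid d_j)

% Assumption: all edges into d_j carry equal weight
P(q_i \mid d_j) = \frac{1}{n(d_j)}
\quad\Longrightarrow\quad
E(d_j) = -\,n(d_j)\cdot\frac{1}{n(d_j)}\,\log\frac{1}{n(d_j)} = \log n(d_j)
```

Under that assumption the entropy grows with the number of connected queries, which is why a URL clicked from many different queries carries less information about any one of them.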


Entropy and Discriminative Ability

  • As entropy increases, discriminative ability decreases

    • The two are inversely related

    • A URL with a high query frequency is less discriminative overall

  • Inverse query frequency (IQF)

    • Measures the discriminative ability of the URL

    • Benefits

      • Constrains the influence of some heavily clicked URLs

      • Balances the inherent click bias toward highly ranked URLs

      • Can be combined with other factors to tune the model
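A sketch of one way to compute an inverse query frequency. The form log(|Q| / n(d)) is an assumption made by analogy with IDF, because the slide's definition is not in the transcript; the click graph below is hypothetical.

```python
import math

def inverse_query_frequency(click_graph):
    """click_graph: {query: {url: clicks}} -> {url: log(|Q| / n(url))}."""
    num_queries = len(click_graph)
    query_freq = {}  # n(d): number of distinct queries linked to URL d
    for urls in click_graph.values():
        for url in urls:
            query_freq[url] = query_freq.get(url, 0) + 1
    return {url: math.log(num_queries / n) for url, n in query_freq.items()}

# Hypothetical click graph: d1 is a general URL reached from four queries.
click_graph = {
    "yahoo": {"d1": 100},
    "mail": {"d1": 30, "d4": 10},
    "news": {"d1": 20, "d5": 10},
    "map": {"d1": 10, "d2": 5},
    "travel": {"d2": 5, "d3": 1},
}
iqf = inverse_query_frequency(click_graph)
```

The general URL d1 receives the smallest IQF, while URLs reached from a single query receive the largest; a URL connected to every query would receive zero.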


CF-IQF Model

  • Incorporate the IQF with the click frequency

A high click frequency

A low query frequency

“A” is weighted higher than “B”


CF-IQF Model

  • Transition probability

    • The most similar query under CF: q2 (“map”) → q1 (“Yahoo”)

    • The most similar query under CF-IQF: q2 (“map”) → q3 (“travel”)
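The change in the most similar query can be illustrated with a small sketch. It assumes the edge weight is cf × iqf with iqf(d) = log(|Q| / n(d)), and that transition probabilities are obtained by normalizing those weights per query; both are assumptions, as is the toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def iqf(graph):
    """Assumed form: iqf(d) = log(|Q| / n(d))."""
    n = {}
    for urls in graph.values():
        for d in urls:
            n[d] = n.get(d, 0) + 1
    return {d: math.log(len(graph) / k) for d, k in n.items()}

def weight_edges(graph, use_iqf):
    """Return {query: {url: P(url | query)}} under CF or CF-IQF weighting."""
    url_iqf = iqf(graph)
    weighted = {}
    for q, urls in graph.items():
        w = {d: c * (url_iqf[d] if use_iqf else 1.0) for d, c in urls.items()}
        total = sum(w.values()) or 1.0  # guard against an all-zero row
        weighted[q] = {d: x / total for d, x in w.items()}
    return weighted

def most_similar(query, vectors):
    others = (q for q in vectors if q != query)
    return max(others, key=lambda q: cosine(vectors[query], vectors[q]))

# Hypothetical click graph: d1 is general and heavily clicked, d2 is specific.
click_graph = {
    "yahoo": {"d1": 100},
    "mail": {"d1": 30, "d4": 10},
    "news": {"d1": 20, "d5": 10},
    "map": {"d1": 10, "d2": 5},
    "travel": {"d2": 5, "d3": 1},
}
best_cf = most_similar("map", weight_edges(click_graph, use_iqf=False))
best_cfiqf = most_similar("map", weight_edges(click_graph, use_iqf=True))
```

On this toy graph the raw CF weighting pairs "map" with "yahoo", while down-weighting the general URL d1 pairs it with "travel".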


UF Model and UF-IQF Model

  • Drawback of CF model

    • Prone to spam from malicious clicks (e.g., a single user clicking a certain URL thousands of times)

  • UF model

    • Weights edges by user frequency instead of click frequency

    • Improves resistance against malicious clicks

  • UF-IQF model
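A sketch contrasting click frequency with user frequency on the same edges. It assumes user frequency means the number of distinct users who clicked a given query-URL pair; the log records are hypothetical.

```python
from collections import defaultdict

def click_and_user_frequency(records):
    """records: iterable of (user, query, url) -> (cf, uf) keyed by (query, url)."""
    cf = defaultdict(int)
    users = defaultdict(set)
    for user, query, url in records:
        cf[(query, url)] += 1           # every click counts
        users[(query, url)].add(user)   # each user counts once per edge
    uf = {edge: len(seen) for edge, seen in users.items()}
    return dict(cf), uf

# Hypothetical log: one user clicks the same URL 1000 times.
records = [("spammer", "cheap pills", "spam.example.com")] * 1000
records += [
    ("u1", "cheap pills", "pharmacy.example.org"),
    ("u2", "cheap pills", "pharmacy.example.org"),
]
cf, uf = click_and_user_frequency(records)
```

Under CF the spammed edge outweighs the legitimate one by 1000 to 2; under UF it counts once. A UF-IQF weight would presumably multiply the user frequency by the URL's IQF in the same way CF-IQF does, though the slide's formula is not in the transcript.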


Connection with TF-IDF

  • TF-IDF has been extensively and successfully used in the vector space model for text retrieval

  • Several researchers have tried to interpret IDF based on binary independence retrieval (BIR), Poisson models, information entropy and language modeling (LM)

  • TF-IDF has not previously been applied to bipartite graphs, and the IQF is new. The CF-IQF is a simplified version of the entropy-biased model

  • The entropy-biased model is employed to identify the edge weighting of the click graph, which can be applied to other bipartite graphs


Mining Query Log on Click Graph

[Diagram labels: Models; Query-to-query similarity; Query clustering; Query suggestion]


Similarity Measurement

  • Cosine similarity

  • Jaccard coefficient

  • The similarity results are reported and analyzed
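For reference, a minimal sketch of the Jaccard coefficient, assuming it is computed over the sets of URLs clicked for each query (the slide's formula is missing, and weighted variants also exist). The URL identifiers are hypothetical.

```python
def jaccard(urls_a, urls_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over two collections of URLs."""
    a, b = set(urls_a), set(urls_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical clicked-URL sets for two queries.
sim = jaccard({"d1", "d2"}, {"d2", "d3"})
```

Unlike cosine similarity, this set-based form ignores edge weights and only looks at which URLs two queries share.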


Graph-based Random Walk

  • Query-to-query graph

    • The transition probability from qi to qj

  • The personalized PageRank
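A sketch of the two steps, under stated assumptions: the query-to-query transition is taken to be a two-step walk through shared URLs, P(qj | qi) = Σk P(dk | qi) · P(qj | dk), and personalized PageRank restarts at a seed query with probability 1 − alpha. The damping value, iteration count and toy probabilities are hypothetical.

```python
def query_to_query(q_to_d, d_to_q):
    """Two-step walk: P(qj | qi) = sum_k P(dk | qi) * P(qj | dk)."""
    p = {}
    for qi, urls in q_to_d.items():
        row = {}
        for d, p_d in urls.items():
            for qj, p_q in d_to_q[d].items():
                row[qj] = row.get(qj, 0.0) + p_d * p_q
        p[qi] = row
    return p

def personalized_pagerank(p, seed, alpha=0.85, iters=50):
    """Power iteration with restart to the seed query (alpha is an assumption)."""
    rank = {q: 1.0 if q == seed else 0.0 for q in p}
    for _ in range(iters):
        nxt = {q: (1 - alpha) * (1.0 if q == seed else 0.0) for q in p}
        for qi, row in p.items():
            for qj, w in row.items():
                nxt[qj] += alpha * rank[qi] * w
        rank = nxt
    return rank

# Hypothetical transition probabilities for three queries and two URLs.
q_to_d = {"a": {"d1": 0.5, "d2": 0.5}, "b": {"d1": 1.0}, "c": {"d2": 1.0}}
d_to_q = {"d1": {"a": 0.5, "b": 0.5}, "d2": {"a": 0.5, "c": 0.5}}

p = query_to_query(q_to_d, d_to_q)
rank = personalized_pagerank(p, seed="b")
```

Ranking the other queries by their score for a given seed query yields candidate suggestions; swapping CF weights for CF-IQF weights changes only the inputs `q_to_d` and `d_to_q`.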


Experimental Evaluation

  • Data collection

    • AOL query log data

  • Cleaning the data

    • Removing the queries that appear fewer than 2 times

    • Combining the near-duplicate queries

    • 883,913 queries and 967,174 URLs

    • 4,900,387 edges


Distributions


Evaluation: ODP Similarity

  • A simple measure of similarity among queries using ODP categories (query → category)

    • Definition:

    • Example:

      • Q1: “United States” → “Regional > North America > United States”

      • Q2: “National Parks” → “Regional > North America > United States > Travel and Tourism > National Parks and Monuments”

      • Similarity: 3/5

  • Precision at rank n (P@n):

  • 300 distinct queries
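The definition itself is missing from the transcript. The sketch below assumes the similarity is the length of the longest common prefix of the two category paths divided by the length of the longer path, which reproduces the 3/5 in the example. The `precision_at_n` helper, which averages that similarity over the top n results, is likewise an assumption.

```python
def odp_similarity(category_a, category_b):
    """Longest common prefix length over the longer path length (assumed form)."""
    a = [part.strip() for part in category_a.split(">")]
    b = [part.strip() for part in category_b.split(">")]
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

def precision_at_n(query_category, ranked_categories, n):
    """Assumed P@n: mean ODP similarity between the query and its top n results."""
    if n <= 0:
        return 0.0
    top = ranked_categories[:n]
    return sum(odp_similarity(query_category, c) for c in top) / n

q1 = "Regional > North America > United States"
q2 = ("Regional > North America > United States > "
      "Travel and Tourism > National Parks and Monuments")
similarity = odp_similarity(q1, q2)
```

The two paths share their first three levels and the longer one has five, so `similarity` matches the 3/5 shown on the slide.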


Experimental Results

Results:

  • Query similarity analysis

1. CF-IQF is better than CF

UF-IQF > UF

The results support the intuition behind the entropy-biased framework: different query-URL pairs should be treated differently.

2. UF is better than CF

UF-IQF > CF-IQF

The results indicate that the user frequency associated with a query-URL pair is more robust than the click frequency for modeling the click graph.


Experimental Results

  • Query similarity analysis

3. TF-IDF is better than TF

The improvements of CF-IQF over CF and UF-IQF over UF models are consistent with the improvement of TF-IDF over TF model.

The reason: both approaches identify and tune the importance of an individual item, whether a term or a query-URL edge.


Experimental Results

  • Query similarity analysis

4. Jaccard coefficient

The improvements are consistent with those observed under cosine similarity.


Experimental Results

  • Query similarity analysis

5. UF-IQF achieves the best performance in most cases.

6. CF and UF models > TF

CF-IQF, UF-IQF > TF-IDF

The click graph captures more semantic relations between queries than the query terms do.

Considering entropy-biased models for the click graph is both essential and promising.


Experimental Results

  • Random Walk Evaluation

Results:

1. As n increases, both models improve their performance.

2. The CF-IQF model always performs better than the CF model.


Experimental Results

  • Random Walk Evaluation

In general, the results generated by the CF and the CF-IQF models are similar and mostly semantically related to the original query, such as “American airline”.

Another important observation is that the CF-IQF model promotes more relevant queries as suggestions and demotes some irrelevant ones.


Conclusions

  • Introduce the inverse query frequency (IQF) to measure the discriminative ability of a URL

  • Identify a new signal, user frequency, that reduces the effect of malicious clicks

  • Propose the entropy-biased models to combine the IQF with the CF as well as UF for click graphs

  • Experimental results show that the improvements of our proposed models are consistent and promising


Q&A

Thanks!

