

A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets

Timothy D. Heilman

Mehran Sahami


Introduction

  • Wish to determine how similar two short text snippets are.

  • High degree of semantic similarity, despite sharing no terms

    • United Nations Secretary General vs Kofi Annan

    • AI vs Artificial Intelligence

  • Share terms, but differ in meaning

    • graphical models vs graphical interface

Related Work

  • Query expansion techniques

  • Other means of determining query similarity

  • Set overlap (intersection)

  • SVM for text classification

    • Latent Semantic Kernels (LSK)

    • Semantic Proximity Matrix

  • Cross-lingual techniques

A New Similarity Function

  • Let x represent a short text snippet (query) to a search engine S

  • Let R(x) be the set of n retrieved documents $d_1, d_2, \ldots, d_n$

  • Compute the TFIDF term vector $v_i$ for each document $d_i$ in R(x)

  • Truncate each vector $v_i$ to include its m highest weighted terms
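
The truncation step can be sketched in a few lines of Python. This is only an illustration, assuming each TFIDF vector is stored as a sparse dictionary from term to weight; the function name `truncate_top_m` and the sample weights are hypothetical and do not come from the slides.

```python
import heapq


def truncate_top_m(vector, m):
    """Keep only the m highest-weighted terms of a sparse term vector.

    `vector` maps term -> weight (assumed representation, not from the slides).
    """
    # nlargest returns the m (term, weight) pairs with the largest weights,
    # ordered from highest to lowest weight.
    top = heapq.nlargest(m, vector.items(), key=lambda kv: kv[1])
    return dict(top)


# Hypothetical weights, for illustration only.
doc_vector = {"kernel": 2.1, "web": 0.4, "snippet": 1.7, "the": 0.05}
print(truncate_top_m(doc_vector, 2))  # {'kernel': 2.1, 'snippet': 1.7}
```

Truncating keeps each vector small, so the later centroid and dot-product steps only touch a bounded number of terms per document.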

Normalize

  • Let C(x) be the centroid of the L2 normalized term vectors

  • Let QE(x) be the L2 normalization of the centroid C(x)
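
A minimal Python sketch of the two normalization steps follows. It assumes sparse dictionary vectors, and it assumes that the kernel K(x, y) is the inner product of the two query expansions QE(x) and QE(y); that definition is not stated on the surviving slides. All function names and sample vectors are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length (empty dict if the norm is zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in vec.items()}


def centroid(vectors):
    """C(x): the average of the L2-normalized document vectors."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for t, w in l2_normalize(vec).items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def query_expansion(vectors):
    """QE(x): the L2 normalization of the centroid C(x)."""
    return l2_normalize(centroid(vectors))


def kernel(qe_x, qe_y):
    """Assumed kernel: dot product of two query expansions."""
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())


# Hypothetical truncated document vectors for two snippets.
qe_a = query_expansion([{"a": 3.0, "b": 4.0}, {"a": 1.0}])
qe_b = query_expansion([{"c": 2.0}])
print(round(kernel(qe_a, qe_a), 6))  # 1.0, since QE vectors have unit length
print(kernel(qe_a, qe_b))            # 0.0, no terms in common
```

Because both expansions have unit length, the score under this assumption is a cosine similarity and stays between 0 and 1 for non-negative weights.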

Initial Results with Kernel

  • Three genres of text snippet matching

    • Acronyms

    • Individuals and their positions

    • Multi-faceted terms

Related Query Suggestion

  • Compute the kernel function K(u, q) for each query q in the repository Q

  • u is any newly issued user query

  • A repository Q of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine

Algorithm

  • Given: user query u and list of matched queries from repository

  • Output: list Z of queries to suggest

  • Initialize suggestion list Z to empty

  • Sort kernel scores K(u, q) in descending order to produce an ordered list of corresponding queries

  • MAX is set to the maximum number of suggestions

Post-Filter

|q| denotes the number of terms in query q
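
The suggestion loop with a post-filter might look like the sketch below. The filter formula did not survive on the slide, so the rule used here is an assumption: a candidate q is accepted only if |q| - |q ∩ z| > 0.5 |z| for the original query and for every suggestion z already accepted. Queries are split on whitespace and treated as sets of terms, which is also an assumption. The function name, queries, and scores are hypothetical.

```python
def suggest(u, scored, max_suggestions):
    """Pick up to max_suggestions queries for user query u.

    `scored` is a list of (query, kernel_score) pairs (assumed input format).
    """
    def terms(q):
        # Assumption: a query's terms are its distinct lowercase tokens.
        return set(q.lower().split())

    def distinct_enough(q, z):
        # Assumed post-filter rule: q must contribute enough terms not in z.
        tq, tz = terms(q), terms(z)
        return len(tq) - len(tq & tz) > 0.5 * len(tz)

    suggestions = []
    # Sort by kernel score, highest first.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    for q, _score in ranked:
        if len(suggestions) >= max_suggestions:
            break
        if all(distinct_enough(q, z) for z in suggestions + [u]):
            suggestions.append(q)
    return suggestions


# Hypothetical queries and scores, for illustration only.
candidates = [
    ("cheap flights", 0.99),
    ("cheap airline tickets", 0.80),
    ("cheap airline tickets online", 0.70),
    ("hotel deals", 0.30),
]
print(suggest("cheap flights", candidates, 2))
# ['cheap airline tickets', 'hotel deals']
```

The filter drops near-duplicates such as the original query itself and minor extensions of an already accepted suggestion, so the final list stays varied.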

Evaluation of Query Suggestion System

  • 1: suggestion is totally off topic.

  • 2: suggestion is not as good as original query.

  • 3: suggestion is basically same as original query.

  • 4: suggestion is potentially better than original query.

  • 5: suggestion is fantastic; should suggest this query, since it might help a user find what they're looking for if they issued it instead of the original query.

Average ratings at various kernel thresholds

Application in QA

  • K("Who shot Abraham Lincoln", "John Wilkes Booth") = 0.730

  • K("Who shot Abraham Lincoln", "Abraham Lincoln") = 0.597

Conclusion

  • A new kernel function for measuring the semantic similarity between pairs of short text snippets

  • Future work: improving the generation of query expansions, with the goal of improving the match score for the kernel function


Term Weighting Scheme

  • The weight $w_{i,j}$ associated with the term $t_i$ in document $d_j$ is defined to be: $w_{i,j} = tf_{i,j} \times \log(N / df_i)$

  • Where $tf_{i,j}$ is the frequency of $t_i$ in $d_j$

  • N is the total number of documents, and $df_i$ is the total number of documents that contain $t_i$
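
As a rough sketch, the weighting scheme can be computed as follows in Python, assuming the standard form weight = term frequency × log(N / document frequency) with a natural logarithm. The sketch takes N to be the size of the small document list passed in, which is a simplification for illustration; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    `docs` is a list of token lists (assumed input format).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


# Toy documents, for illustration only.
docs = [["web", "kernel", "kernel"], ["web", "search"]]
vecs = tfidf_vectors(docs)
print(vecs[0]["web"])     # 0.0, because "web" occurs in every document
print(vecs[0]["kernel"])  # 2 * log(2), roughly 1.386
```

A term that appears in every document gets weight zero, which is why common words contribute nothing to the resulting vectors.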


Lp Norm

  • Given by: $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$

  • Most common cases

    • p = 1: the L1 norm, also called the Manhattan norm

    • p = 2: the L2 norm, also called the Euclidean norm

    • p = ∞: the L∞ norm, also called the infinity norm or the Chebyshev norm
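
The three cases can be checked with a small Python sketch, assuming the standard definition of the Lp norm as the p-th root of the sum of absolute values raised to the power p, with the p = ∞ case taken as the largest absolute component. The function name `lp_norm` is hypothetical.

```python
import math


def lp_norm(x, p):
    """Lp norm of a sequence of numbers, for p >= 1 or p = math.inf."""
    if p == math.inf:
        # Infinity (Chebyshev) norm: the largest absolute component.
        return max((abs(v) for v in x), default=0.0)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


v = [3, -4]
print(lp_norm(v, 1))         # 7.0
print(lp_norm(v, 2))         # 5.0
print(lp_norm(v, math.inf))  # 4
```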