Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
  • Topic
    • Semantic similarity measures between two words
  • Why interesting?
    • In information retrieval
      • Query expansion
      • Automatic annotation of Web pages
      • Community mining
    • In natural language processing
      • Word-sense disambiguation
      • Synonym extraction
      • Language modeling

WWW 2007 Paper Presentation, Zheshen Wang, May 8th, 2007

Solution proposed

By using the information available on the Web

  • Page Counts + Text Snippets
  • SVM for an optimal combination
  • Page Counts
    • Co-occurrence measures: Jaccard, Overlap (Simpson), Dice, PMI (see the first sketch below)
    • Modification: suppress random co-occurrences
      • Score = 0 if H(P∩Q) < c, where H(x) is the page count for the query x
  • Text Snippets (context- and statistics-based): frequencies of the top 200 patterns
    • Lexico-syntactic pattern extraction
      • e.g. “Toyota and Nissan are two major Japanese car manufacturers.”
      • If a pattern appears far more often in snippets for synonymous word pairs than in snippets for non-synonymous pairs, it is a reliable indicator of synonymy.

  • Combination
    • 204-D feature vector F = [200 pattern frequencies, 4 co-occurrence measures]
    • Two-class SVM (see the second sketch below)
      • synonymous word pairs (positive), non-synonymous word pairs (negative)
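For concreteness, the page-count measures above can be sketched as follows. This is a minimal sketch, not the authors' code: the index size N, the threshold c, and the example counts are all assumptions, and the page counts H(P), H(Q), H(P AND Q) are passed in directly.

```python
import math

N = 10**10  # assumed number of documents indexed by the search engine
C = 5       # assumed co-occurrence threshold c; below it, scores are set to 0

def web_jaccard(hp, hq, hpq):
    return 0.0 if hpq < C else hpq / (hp + hq - hpq)

def web_overlap(hp, hq, hpq):  # Simpson coefficient
    return 0.0 if hpq < C else hpq / min(hp, hq)

def web_dice(hp, hq, hpq):
    return 0.0 if hpq < C else 2 * hpq / (hp + hq)

def web_pmi(hp, hq, hpq):
    if hpq < C:
        return 0.0
    return math.log2((hpq / N) / ((hp / N) * (hq / N)))

# Made-up page counts standing in for H(P), H(Q), and H(P AND Q):
hp, hq, hpq = 1_200_000, 3_500_000, 450_000
print(web_jaccard(hp, hq, hpq), web_overlap(hp, hq, hpq),
      web_dice(hp, hq, hpq), web_pmi(hp, hq, hpq))
```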

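The combination step can be sketched in the same spirit. This assumes the top-200 patterns have already been selected and their frequencies counted for each word pair; scikit-learn's SVC stands in for the paper's two-class SVM, and the training matrix here is a random stand-in rather than real labeled word pairs.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs for one word pair: 200 pattern frequencies plus the
# 4 page-count measures (Jaccard, Overlap, Dice, PMI).
pattern_freqs = [0.0] * 200
cooccurrence_scores = [0.12, 0.31, 0.20, 1.8]
feature = np.array(pattern_freqs + cooccurrence_scores)  # 204-D vector F

# Toy training set: label 1 = synonymous pair, 0 = non-synonymous pair.
rng = np.random.default_rng(0)
X_train = rng.random((40, 204))          # stand-in feature vectors
y_train = np.array([1] * 20 + [0] * 20)

clf = SVC(probability=True)              # two-class SVM
clf.fit(X_train, y_train)

# The probability of the "synonymous" class can be read off as a similarity score.
print(clf.predict_proba(feature.reshape(1, -1))[0][1])
```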

My criticisms of the solution
  • Statistics- and context-based pattern selection is not reliable (no ontology or syntax templates)
    • Sparse distribution
    • Noise (meaningless patterns)
    • Correlations among patterns (e.g. “X and Y”, “X and Y are”, “X and Y are two”)
    • Missing meaningful patterns due to the limited n-gram range (n = 2, 3, 4, 5), e.g. when X and Y are far apart, as in “Rose is a very popular flower in the US.” (illustrated in the sketch after this list)

  • Feature vector F = [200 pattern frequencies, 4 co-occurrence measures]
  • Error prone for uncommon words
    • e.g. rarely used professional terms
    • The base set from the Web is too small to be reliable.
    • As in the case of CBioC, user voting would be better.
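To make the limited-range criticism concrete, here is a small self-contained illustration. It is my own sketch, under the assumption that candidate patterns are the n-grams with n = 2..5 that contain both marker words: the Toyota/Nissan snippet yields exactly the correlated patterns listed above, while the Rose/flower snippet yields nothing, because the shortest n-gram containing both words has six tokens.

```python
import re

def candidate_patterns(snippet, x, y, max_n=5):
    """Return n-grams (2 <= n <= max_n) that contain both marker words X and Y."""
    tokens = re.findall(r"\w+", snippet.lower())
    tokens = ["X" if t == x else "Y" if t == y else t for t in tokens]
    patterns = set()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if "X" in gram and "Y" in gram:
                patterns.add(" ".join(gram))
    return patterns

print(candidate_patterns("Toyota and Nissan are two major Japanese car manufacturers.",
                         "toyota", "nissan"))  # {"X and Y", "X and Y are", "X and Y are two"}
print(candidate_patterns("Rose is a very popular flower in the US.",
                         "rose", "flower"))    # set(): X and Y are too far apart for n <= 5
```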

How is it related to our course?
  • Web-based information extraction (Knowledge Extraction)
    • Extract base-level knowledge (“facts”) directly from the Web
    • Page counts (hits), e.g. KnowItAll
    • Inevitable drawback: error-prone for words that are uncommon on the Web, e.g. CBioC
  • Making use of the Collective Unconscious (Big Idea 3)
    • Analyzing term co-occurrences to capture semantic information
  • Co-occurrence measures
    • Similarity measure in terms of co-occurrence
    • Jaccard, Overlap (Simpson), PMI…
  • Making use of context based on statistics
    • Patterns from context rather than from an ontology (“SemTag & Seeker”).
    • Patterns decided by statistics rather than by templates from a syntax tree (generic extraction patterns, Hearst ’92).
    • n-grams around a word, somewhat like the 20-word window of spot(l,c) in “SemTag & Seeker”.

Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
