distance functions and ie 4
Download
Skip this Video
Download Presentation
Distance functions and IE – 4?

Loading in 2 Seconds...

play fullscreen
1 / 52

Distance functions and IE – 4? - PowerPoint PPT Presentation


  • 64 Views
  • Uploaded on

Distance functions and IE – 4?. William W. Cohen CALD. Announcements. Current statistics: days with unscheduled student talks: 6 students with unscheduled student talks: 4 Projects are due: 4/28 (last day of class) Additional requirement: draft (for comments) no later than 4/21.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Distance functions and IE – 4?' - tejana


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
announcements
Announcements
  • Current statistics:
    • days with unscheduled student talks: 6
    • students with unscheduled student talks: 4
    • Projects are due: 4/28 (last day of class)
    • Additional requirement: draft (for comments) no later than 4/21
string distance metrics so far
String distance metrics so far...
  • Term-based (e.g. TF/IDF as in WHIRL)
    • Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
    • Usually weight words to account for “importance”
    • Fast comparison: O(n log n) for |s|+|t|=n
  • Edit-distance metrics
    • Distance is shortest sequence of edit commands that transform s to t.
    • No notion of word importance
    • More expensive: O(n2)
  • Other metrics
    • Jaro metric & variants
    • Monge-Elkan’s recursive string matching
    • etc?
  • Which metrics work best, for which problems?
string distance metrics so far1
String distance metrics so far...
  • Term-based (e.g. TF/IDF as in WHIRL)
    • Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
    • Usually weight words to account for “importance”
    • Fast comparison: O(n log n) for |s|+|t|=n
  • Edit-distance metrics
    • Distance is shortest sequence of edit commands that transform s to t.
    • No notion of word importance
    • More expensive: O(n2)
  • Other metrics
    • Jaro metric & variants
    • Monge-Elkan’s recursive string matching
    • etc?
  • Which metrics work best, for which problems?
so which metric should you use
So which metric should you use?

SecondString (Cohen, Ravikumar, Fienberg):

  • Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
  • Tools for evaluating performance on test data
  • Exploratory tool for adding, testing, combining string distances
    • e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
  • URL – http://secondstring.sourceforge.net
  • Distribution also includes several sample matching problems.
secondstring distance functions
SecondString distance functions
  • Edit-distance like:
    • Levenshtein – unit costs
    • untuned Smith-Waterman
    • Monge-Elkan (tuned Smith-Waterman)
    • Jaro and Jaro-Winkler
results edit distances
Results - Edit Distances

Monge-Elkan is the best on average....

secondstring distance functions1
SecondString distance functions
  • Term-based, for sets of terms S and T:
    • TFIDF distance
    • Jaccard distance:
    • Language models: construct PS and PT anduse
secondstring distance functions2
SecondString distance functions
  • Term-based, for sets of terms S and T:
    • TFIDF distance
    • Jaccard distance
    • Jensen-Shannon distance
      • smoothing toward union of S,T reduces cost of disagreeing on common terms
      • unsmoothed PS, Dirichlet smoothing, Jelenik-Mercer
    • “Simplified Fellegi-Sunter”
secondstring distance functions3
SecondString distance functions
  • Hybrid term-based & edit-distance based:
    • Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
    • SoftTFIDF
      • Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
      • Downweight close tokens slightly
other results with secondstring
Other results with SecondString
  • Distance functions over structured data records (first name, last name, street, house number)
  • Learning to combine distance functions
  • Unsupervised/semi-supervised training for distance functions over structured data
experiments
Experiments
  • Hand-tagged 50 abstracts for gene/protein entities (pre-selected to be about human genes)
  • Collected dictionary of 40,000+ protein names from on-line sources
    • not complete
    • example matching is not sufficient
  • Approach: use hand-coded heuristics to propose likely generalizations of existing dictionary entries.
    • not hand-coded or off-the-shelf similarity metrics
basic idea behind the algorithm
Basic idea behind the algorithm

original dictionary

carefully-tuned heuristics (aka hacks)

similar (but not identical process) applied to word n-grams from text to do IE: extract if n-gram -> CD

example canonicalizing short names different procedure for full names and one word names
Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)
example canonicalizing short names different procedure for full names and one word names1
Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)

NF<n>

Nf<n>

NF => CD

(from <x><n>)

NF-25 in OD

NF in CD?

(<x><g><l>)

NF<n>

Recognize:

“... NF-kappa B...”

NF<g><l>

results
Results
  • Why is precision less than 100%?
  • When should you use “similarity by normalization”?
  • Could a simpler algorithm do as well?
  • Is there overfitting? (50 abstracts, <750 proteins)
background
Background
  • Common task in proteomics/genomics:
    • look for (soft) matches to a query sequence in a large “database” of sequences.
    • want to find subsequences (genes) that are highly similar (and hence probably related)
    • want to ignore “accidental” matches
    • possible technique is Smith-Waterman (local alignment)
      • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
background1
Background
  • Common task in proteomics/genomics:
    • look for (soft) matches to a query sequence in a large “database” of sequences.
    • want to find subsequences (genes) that are highly similar (and hence probably related)
    • want to ignore “accidental” matches
    • possible technique is Smith-Waterman (local alignment)
      • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
smith waterman distance

c o h e n d o r f

m 0 0 0 0 0 0 0 0 0

c 1 0 0 0 0 0 0 0 0

c 0 0 0 0 0 0 0 0 0

o 0 2 1 0 0 0 2 1 0

h 0 1 4 3 2 1 1 1 0

n 0 0 3 3 5 4 3 2 1

s 0 0 2 2 4 4 3 2 1

k 0 0 1 1 3 3 3 2 1

i 0 0 0 0 2 2 2 2 1

dist=5

Smith-Waterman distance
background2
Background
  • Common task in proteomics/genomics:
    • look for (soft) matches to a query sequence in a large “database” of sequences.
    • possible technique is Smith-Waterman (local alignment)
      • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
      • based on substitutability theory for amino acids
    • doesn’t scale well
      • BLAST and FASTA: fast approximate S-W
blast fasta ideas
BLAST/FASTA ideas
  • Find all char n-grams (“words”) in the query string.
  • FASTA:
    • Use inverted indices to find out where these words appear in the DB sequence
    • Use S-W only near DB sections that contain some of these words
blast fasta ideas1
BLAST/FASTA ideas
  • Find all char n-grams (“words”) in the query string.
  • BLAST:
    • Generate variations of these words by looking for changes that would lead to strong similarities
    • Discard “low IDF” words (where accidental matches are likely)
    • Use expanded set of n-grams to focus search
slide44

query string

words and expansions

blast fasta ideas2
BLAST/FASTA ideas
  • Find all char n-grams (“words”) in the query string.
  • BLAST:
    • Generate variations of these words by looking for changes that would lead to strong similarities
    • Discard “low IDF” words (where accidental matches are likely)
    • Use expanded set of n-grams to focus search
  • The BLAST program:
    • Widely used,
    • Fast implementation,
    • Supports asking multiple queries against a database at once...
    • Can one use it find soft matches of protein names (from a dictionary) in text?
basic idea
Protein database

Query strings

Proposed alignment (query->database)

Query algorithm: BLAST

Biomedical paper

Protein name dictionary

Extracted protein name (dict. entry->text)

IE system: dictionaries+BLAST (optimized for this problem)

Basic idea:
2 optimizing blast
2) Optimizing blast
  • Split protein-name database into several parts (for short, medium-length, long protein names)
  • Require space chars before and after “short” protein names.
  • Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
    • With what data?
  • Evaluate on one review article, 1162 protein names
    • inter-annotator agreement not great (70-85%)
results2
Results

Overall: precision 71.1%, recall 78.8% (opt)

ad