# Distance functions and IE – 4?

Distance functions and IE – 4? William W. Cohen, CALD. Announcements. Current statistics: days with unscheduled student talks: 6; students with unscheduled student talks: 4. Projects are due 4/28 (last day of class). Additional requirement: a draft (for comments) no later than 4/21.

## Presentation Transcript

### Distance functions and IE – 4?

William W. Cohen

CALD

Announcements
• Current statistics:
• days with unscheduled student talks: 6
• students with unscheduled student talks: 4
• Projects are due: 4/28 (last day of class)
String distance metrics so far...
• Term-based (e.g. TF/IDF as in WHIRL)
• Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
• Usually weight words to account for “importance”
• Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
• Distance is shortest sequence of edit commands that transform s to t.
• No notion of word importance
• More expensive: O(n²)
• Other metrics
• Jaro metric & variants
• Monge-Elkan’s recursive string matching
• etc?
• Which metrics work best, for which problems?
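
To make the cost comparison concrete, here is a minimal Java sketch of unit-cost Levenshtein edit distance: the dynamic program fills an (|s|+1)×(|t|+1) table, which is where the O(n²) expense above comes from. This is an illustration, not SecondString's implementation.

```java
// Minimal sketch: unit-cost Levenshtein edit distance via dynamic programming.
// Illustrates the O(n^2) cost noted above.
public class Levenshtein {
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;   // deletions only
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;   // insertions only
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int sub = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // delete
                                            d[i][j - 1] + 1),     // insert
                                   d[i - 1][j - 1] + sub);        // substitute
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("cohen", "cohn"));  // prints 1
    }
}
```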
So which metric should you use?

SecondString (Cohen, Ravikumar, Fienberg):

• Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
• Tools for evaluating performance on test data
• Exploratory tool for adding, testing, combining string distances
• e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range [0,1]
• URL – http://secondstring.sourceforge.net
• Distribution also includes several sample matching problems.
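
A sketch of what using the toolkit might look like; the package and class names (com.wcohen.ss.Levenstein, com.wcohen.ss.JaroWinkler) and the score(String, String) convenience method are assumptions about the distribution's API and should be checked against its documentation.

```java
// Hedged sketch of using the SecondString toolkit. Class and method names
// are assumptions based on the distribution and may differ across versions.
import com.wcohen.ss.JaroWinkler;
import com.wcohen.ss.Levenstein;

public class SecondStringDemo {
    public static void main(String[] args) {
        // higher Levenstein score = more similar (negated edit distance)
        System.out.println(new Levenstein().score("Cohen", "Cohn"));
        // JaroWinkler score ranges over [0,1]
        System.out.println(new JaroWinkler().score("Cohen", "Cohn"));
    }
}
```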
SecondString distance functions
• Edit-distance like:
• Levenshtein – unit costs
• untuned Smith-Waterman
• Monge-Elkan (tuned Smith-Waterman)
• Jaro and Jaro-Winkler
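
As an illustration of the last family, here is a from-scratch sketch of the Jaro similarity (matching characters within a sliding window, discounted by transpositions) and Winkler's prefix-boosted variant; the constants (4-char prefix cap, 0.1 scaling) are the commonly cited defaults, not necessarily SecondString's tuning.

```java
// Jaro and Jaro-Winkler similarities (higher = more similar).
public class Jaro {
    static double jaro(String s, String t) {
        if (s.isEmpty() || t.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sUsed = new boolean[s.length()], tUsed = new boolean[t.length()];
        int matches = 0;
        for (int i = 0; i < s.length(); i++) {          // count windowed matches
            int lo = Math.max(0, i - window), hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tUsed[j] && s.charAt(i) == t.charAt(j)) {
                    sUsed[i] = tUsed[j] = true; matches++; break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int mismatched = 0, k = 0;                      // count transpositions
        for (int i = 0; i < s.length(); i++) {
            if (!sUsed[i]) continue;
            while (!tUsed[k]) k++;
            if (s.charAt(i) != t.charAt(k)) mismatched++;
            k++;
        }
        double m = matches;
        return (m / s.length() + m / t.length() + (m - mismatched / 2.0) / m) / 3.0;
    }

    // Winkler's variant: boost strings sharing a common prefix (up to 4 chars).
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int prefix = 0;
        while (prefix < Math.min(4, Math.min(s.length(), t.length()))
               && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```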
Results - Edit Distances

Monge-Elkan is the best on average....

SecondString distance functions
• Term-based, for sets of terms S and T:
• TFIDF distance
• Jaccard distance
• Jensen-Shannon distance between language models P_S and P_T: JS(P_S, P_T) = ½ KL(P_S ‖ Q) + ½ KL(P_T ‖ Q), where Q = ½ (P_S + P_T)
• smoothing toward the union of S and T reduces the cost of disagreeing on common terms
• variants: unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer smoothing
• “Simplified Fellegi-Sunter”
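
A small sketch of two of the term-based measures above: Jaccard over token sets, and Jensen-Shannon divergence between (here unsmoothed) unigram models P_S and P_T.

```java
import java.util.*;

// Term-based measures: Jaccard over token sets, Jensen-Shannon over
// unigram distributions represented as word -> probability maps.
public class TermBased {
    static double jaccard(Set<String> s, Set<String> t) {
        Set<String> inter = new HashSet<>(s); inter.retainAll(t);
        Set<String> union = new HashSet<>(s); union.addAll(t);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // JS(P_S,P_T) = 0.5*KL(P_S||Q) + 0.5*KL(P_T||Q), with Q = 0.5*(P_S+P_T)
    static double jensenShannon(Map<String, Double> ps, Map<String, Double> pt) {
        Set<String> vocab = new HashSet<>(ps.keySet()); vocab.addAll(pt.keySet());
        double js = 0.0;
        for (String w : vocab) {
            double p = ps.getOrDefault(w, 0.0), q = pt.getOrDefault(w, 0.0);
            double m = 0.5 * (p + q);
            if (p > 0) js += 0.5 * p * Math.log(p / m);
            if (q > 0) js += 0.5 * q * Math.log(q / m);
        }
        return js;
    }
}
```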
SecondString distance functions
• Hybrid term-based & edit-distance based:
• Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
• SoftTFIDF
• Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
• Downweight close tokens slightly
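
A simplified sketch of the SoftTFIDF idea: each token in S is matched to its closest token in T under a secondary similarity (e.g. Jaro-Winkler), and pairs scoring above a threshold contribute their TF-IDF weights, downweighted by that similarity. The normalized TF-IDF weight maps are assumed to be precomputed.

```java
import java.util.*;

// Simplified SoftTFIDF: like TFIDF, but a token in S may match any token in T
// that is "close to" it under a secondary similarity, not only exact matches.
public class SoftTFIDF {
    interface Sim { double score(String a, String b); }

    static double score(Map<String, Double> wS, Map<String, Double> wT,
                        Sim sim, double threshold) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : wS.entrySet()) {
            String best = null; double bestSim = 0.0;
            for (String v : wT.keySet()) {             // closest token in T
                double d = sim.score(e.getKey(), v);
                if (d > bestSim) { bestSim = d; best = v; }
            }
            if (best != null && bestSim >= threshold)  // "close to" something in T
                total += e.getValue() * wT.get(best) * bestSim;  // downweighted
        }
        return total;
    }
}
```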
Other results with SecondString
• Distance functions over structured data records (first name, last name, street, house number)
• Learning to combine distance functions
• Unsupervised/semi-supervised training for distance functions over structured data

### Combining Information Extraction and Similarity Computations

1) Bunescu et al

2) Krauthammer et al

Experiments
• Hand-tagged 50 abstracts for gene/protein entities (pre-selected to be about human genes)
• Collected dictionary of 40,000+ protein names from on-line sources
• the dictionary is not complete
• exact matching against it is not sufficient
• Approach: use hand-coded heuristics to propose likely generalizations of existing dictionary entries
• rather than hand-coded or off-the-shelf similarity metrics
Basic idea behind the algorithm
• carefully-tuned heuristics (aka hacks) map entries of the original dictionary (OD) to canonical forms, building a canonical dictionary (CD)
• a similar (but not identical) process is applied to word n-grams from the text to do IE: extract an n-gram if its canonical form is in CD

Example: canonicalizing “short names” (a different procedure is used for “full names” and “one-word” names)
• dictionary entry “NF-25” in OD matches the pattern <x><n> and canonicalizes to NF<n> (case variants such as Nf<n> are folded together); the short name NF is entered into CD
• to recognize “... NF-kappa B ...” in text, the n-gram matches <x><g><l> and canonicalizes to NF<g><l>; the system then checks: is NF in CD?
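
A hypothetical sketch of the canonicalization step (the paper's real heuristics are more elaborate): tokens are rewritten as category tags, so “NF-25” becomes NF<n> and “NF-kappa B” becomes NF<g><l>. The token patterns and the abbreviated greek-letter list are illustrative assumptions.

```java
import java.util.*;

// Hypothetical canonicalization sketch: map each token of a name to a
// category tag (<n> number, <g> greek letter, <l> single letter).
public class Canonicalize {
    static final Set<String> GREEK = new HashSet<>(Arrays.asList(
        "alpha", "beta", "gamma", "kappa", "lambda"));  // abbreviated list

    static String canonical(String name) {
        StringBuilder out = new StringBuilder();
        for (String tok : name.split("[-\\s]+")) {
            if (tok.matches("\\d+")) out.append("<n>");                  // number
            else if (GREEK.contains(tok.toLowerCase())) out.append("<g>"); // greek
            else if (tok.matches("[A-Za-z]")) out.append("<l>");         // one letter
            else out.append(tok);                                        // keep token
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonical("NF-25"));       // NF<n>
        System.out.println(canonical("NF-kappa B"));  // NF<g><l>
    }
}
```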

Results
• Why is precision less than 100%?
• When should you use “similarity by normalization”?
• Could a simpler algorithm do as well?
• Is there overfitting? (50 abstracts, <750 proteins)

### Combining Information Extraction and Similarity Computations

1) Bunescu et al

2) Krauthammer et al

Background
• look for (soft) matches to a query sequence in a large “database” of sequences.
• want to find subsequences (genes) that are highly similar (and hence probably related)
• want to ignore “accidental” matches
• possible technique is Smith-Waterman (local alignment)
• want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

Smith-Waterman distance

|   | c | o | h | e | n | d | o | r | f |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| m | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| c | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| o | 0 | 2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 |
| h | 0 | 1 | 4 | 3 | 2 | 1 | 1 | 1 | 0 |
| n | 0 | 0 | 3 | 3 | 5 | 4 | 3 | 2 | 1 |
| s | 0 | 0 | 2 | 2 | 4 | 4 | 3 | 2 | 1 |
| k | 0 | 0 | 1 | 1 | 3 | 3 | 3 | 2 | 1 |
| i | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 1 |

dist = 5 (the maximum cell value in the matrix)
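
A from-scratch sketch of the Smith-Waterman recurrence behind the matrix above, assuming a simple +1 match / -1 mismatch / -1 gap scoring (the slide's exact costs may differ); the local-alignment score is the maximum cell, and no cell is allowed to drop below zero.

```java
// Smith-Waterman local alignment over characters (score only, no traceback).
public class SmithWaterman {
    static int score(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        int best = 0;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int match = s.charAt(i - 1) == t.charAt(j - 1) ? 1 : -1;
                d[i][j] = Math.max(0,                                      // floor at 0
                          Math.max(d[i - 1][j - 1] + match,                // align chars
                          Math.max(d[i - 1][j] - 1,                        // gap in t
                                   d[i][j - 1] - 1)));                     // gap in s
                best = Math.max(best, d[i][j]);  // local score: max anywhere
            }
        }
        return best;
    }
}
```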
Background (continued)
• the char-char “reward” is based on substitutability theory for amino acids
• Smith-Waterman doesn’t scale well to large sequence databases
• BLAST and FASTA: fast approximate S-W
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• FASTA:
• Use inverted indices to find out where these words appear in the DB sequence
• Use S-W only near DB sections that contain some of these words
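
A minimal sketch of the FASTA seeding step: build an inverted index from every char n-gram of the database sequence to its positions, then look up the query's n-grams to find regions worth running the expensive Smith-Waterman comparison on.

```java
import java.util.*;

// FASTA-style seeding: inverted index from char n-grams ("words") to positions.
public class NGramIndex {
    static Map<String, List<Integer>> index(String db, int n) {
        Map<String, List<Integer>> idx = new HashMap<>();
        for (int i = 0; i + n <= db.length(); i++)
            idx.computeIfAbsent(db.substring(i, i + n), k -> new ArrayList<>()).add(i);
        return idx;
    }

    static Set<Integer> candidatePositions(String query,
                                           Map<String, List<Integer>> idx, int n) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i + n <= query.length(); i++) {
            List<Integer> pos = idx.get(query.substring(i, i + n));
            if (pos != null) hits.addAll(pos);   // run S-W only near these offsets
        }
        return hits;
    }
}
```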
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• BLAST:
• Generate variations of these words by looking for changes that would lead to strong similarities
• Discard “low IDF” words (where accidental matches are likely)
• Use expanded set of n-grams to focus search

[Figure: a query string with its char n-gram “words” and their expansions]
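
A simplified sketch of BLAST's word expansion: for each query n-gram, generate single-substitution variants and keep those whose total score stays above a threshold. The subScore function here is a stand-in for a real substitution matrix such as BLOSUM.

```java
import java.util.*;

// BLAST-style word expansion: variants of an n-gram that still score highly.
public class WordExpansion {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";  // amino acids

    // Stand-in pairwise score; a real implementation would consult BLOSUM.
    static int subScore(char a, char b) { return a == b ? 2 : -1; }

    static Set<String> expand(String word, int minScore) {
        Set<String> variants = new HashSet<>();
        for (int i = 0; i < word.length(); i++) {         // substitute position i
            for (char c : ALPHABET.toCharArray()) {
                int score = 0;
                for (int j = 0; j < word.length(); j++)
                    score += subScore(word.charAt(j), j == i ? c : word.charAt(j));
                if (score >= minScore)                    // keep strong variants
                    variants.add(word.substring(0, i) + c + word.substring(i + 1));
            }
        }
        return variants;
    }
}
```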

BLAST/FASTA ideas (continued)
• The BLAST program:
• widely used,
• fast implementation,
• supports asking multiple queries against a database at once...
• Can one use it to find soft matches of protein names (from a dictionary) in text?
Basic idea: run BLAST with the protein-name dictionary as the queries and the paper's text as the database:

| Sequence search with BLAST | IE with BLAST |
| --- | --- |
| Protein database | Biomedical paper |
| Query strings | Protein name dictionary |
| Proposed alignment (query -> database) | Extracted protein name (dict. entry -> text) |
| Query algorithm: BLAST | IE system: dictionaries + BLAST (optimized for this problem) |
2) Optimizing BLAST
• Split protein-name database into several parts (for short, medium-length, long protein names)
• Require space chars before and after “short” protein names.
• Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
• With what data?
• Evaluate on one review article, 1162 protein names
• inter-annotator agreement not great (70-85%)
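
A hypothetical sketch of the dictionary-splitting step described above; the length thresholds and the space-padding of short names are illustrative guesses, not the paper's actual settings.

```java
import java.util.*;

// Split the protein-name dictionary into short/medium/long sub-databases so
// each part can get its own BLAST parameter settings. Thresholds are
// assumptions, not Krauthammer et al.'s exact values.
public class SplitDictionary {
    static Map<String, List<String>> split(List<String> names) {
        Map<String, List<String>> parts = new HashMap<>();
        for (String name : names) {
            String bucket = name.length() <= 4 ? "short"
                          : name.length() <= 12 ? "medium" : "long";
            // require space chars before and after "short" protein names
            parts.computeIfAbsent(bucket, k -> new ArrayList<>())
                 .add(bucket.equals("short") ? " " + name + " " : name);
        }
        return parts;
    }
}
```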
Results

Overall: precision 71.1%, recall 78.8% (opt)