
Distance functions and IE – 5


Presentation Transcript


  1. Distance functions and IE – 5 William W. Cohen CALD

  2. Announcements • Current statistics: • days with unscheduled student talks: 5 • students with unscheduled student talks: 3 • Projects are due: 4/28 (last day of class) • Additional requirement: draft (for comments) no later than 4/21

  3. String distance metrics so far... • Term-based (e.g. TF/IDF as in WHIRL) • Distance depends on set of words contained in both s and t – so sensitive to spelling errors. • Usually weight words to account for “importance” • Fast comparison: O(n log n) for |s|+|t|=n • Edit-distance metrics • Distance is shortest sequence of edit commands that transform s to t. • No notion of word importance • More expensive: O(n²) • Other metrics • Jaro metric & variants • Monge-Elkan’s recursive string matching • etc? • Which metrics work best, for which problems?
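A minimal sketch of the O(n²) edit-distance computation mentioned above, using unit-cost insert/delete/substitute (Levenshtein) edits; function and variable names are illustrative, not from the slides:

```python
# A minimal sketch of the O(n^2) edit-distance computation: unit-cost
# insert/delete/substitute (Levenshtein). Names are illustrative.
def edit_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    prev = list(range(n + 1))                 # distances for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete from s
                          curr[j - 1] + 1,    # insert into s
                          prev[j - 1] + cost) # substitute or match
        prev = curr
    return prev[n]

print(edit_distance("william", "willaim"))    # 2 (a transposition costs two edits)
```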

  4. Results - Overall

  5. Combining Information Extraction and Similarity Computations Krauthammer et al.

  6. Background • Common task in proteomics/genomics: • look for (soft) matches to a query sequence in a large “database” of sequences. • want to find subsequences (genes) that are highly similar (and hence probably related) • want to ignore “accidental” matches • possible technique is Smith-Waterman (local alignment) • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

  7. Background • Common task in proteomics/genomics: • look for (soft) matches to a query sequence in a large “database” of sequences. • want to find subsequences (genes) that are highly similar (and hence probably related) • want to ignore “accidental” matches • possible technique is Smith-Waterman (local alignment) • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

  8. Smith-Waterman distance ("mccohnski" vs. "cohendorf"):

         c  o  h  e  n  d  o  r  f
      m  0  0  0  0  0  0  0  0  0
      c  1  0  0  0  0  0  0  0  0
      c  0  0  0  0  0  0  0  0  0
      o  0  2  1  0  0  0  2  1  0
      h  0  1  4  3  2  1  1  1  0
      n  0  0  3  3  5  4  3  2  1
      s  0  0  2  2  4  4  3  2  1
      k  0  0  1  1  3  3  3  2  1
      i  0  0  0  0  2  2  2  2  1

      dist=5
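For reference, a compact Smith-Waterman sketch in Python; the match/mismatch/gap scores below are illustrative defaults and may not be the exact rewards behind the matrix on the slide:

```python
# A compact Smith-Waterman (local alignment) sketch with illustrative scores.
def smith_waterman(s: str, t: str, match=2, mismatch=-1, gap=-1) -> int:
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(0,                   # local alignment: scores never go negative
                          diag,                # align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,   # gap in t
                          H[i][j - 1] + gap)   # gap in s
            best = max(best, H[i][j])
    return best                                # the highest "peak" in the matrix

print(smith_waterman("mccohnski", "cohendorf"))
```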

  9. In general “peaks” in the matrix scores indicate highly similar substrings.

  10. Background • Common task in proteomics/genomics: • look for (soft) matches to a query sequence in a large “database” of sequences. • possible technique is Smith-Waterman (local alignment) • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance • based on substitutability theory/stats for amino acids • doesn’t scale well • BLAST and FASTA: fast approximate S-W

  11. BLAST/FASTA ideas • Find all char n-grams (“words”) in the query string. • FASTA: • Use inverted indices to find out where these words appear in the DB sequence • Use S-W only near DB sections that contain some of these words
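A rough sketch of the FASTA-style filtering idea under simple assumptions: build an inverted index from character n-grams to database positions, then restrict the expensive Smith-Waterman pass to regions that share n-grams with the query (the toy database and names are illustrative):

```python
# FASTA-style filtering sketch: index the database by character n-grams,
# then align only near positions that share n-grams with the query.
from collections import defaultdict

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_index(db, n=3):
    index = defaultdict(list)              # n-gram -> positions in the database
    for i, g in enumerate(ngrams(db, n)):
        index[g].append(i)
    return index

def candidate_positions(query, index, n=3):
    hits = set()
    for g in ngrams(query, n):
        hits.update(index.get(g, []))      # positions sharing an n-gram with the query
    return sorted(hits)                    # run Smith-Waterman only around these

idx = build_index("aaamccohnskiaaa")
print(candidate_positions("cohen", idx))   # only one region needs a full alignment
```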

  12. BLAST/FASTA ideas • Find all char n-grams (“words”) in the query string. • BLAST: • Generate variations of these words by looking for changes that would lead to strong similarities • Discard “low IDF” words (where accidental matches are likely) • Use expanded set of n-grams to focus search

  13. query string words and expansions

  14. BLAST/FASTA ideas • Find all char n-grams (“words”) in the query string. • BLAST: • Generate variations of these words by looking for changes that would lead to strong similarities • Discard “low IDF” words (where accidental matches are likely) • Use expanded set of n-grams to focus search • The BLAST program: • Widely used • Fast implementation • Supports asking multiple queries against a database at once... • Can one use it to find soft matches of protein names (from a dictionary) in text?
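A toy sketch of the BLAST-style word-expansion step: for each query n-gram, generate single-character variants and keep those whose score against the original word stays above a threshold. The alphabet, position-wise scoring, and threshold below are illustrative stand-ins for the substitution-matrix scores BLAST actually uses:

```python
# Toy BLAST-style word expansion with an illustrative alphabet and scoring.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def word_score(w1, w2):
    # position-wise score: +2 for a matching character, -1 otherwise
    return sum(2 if a == b else -1 for a, b in zip(w1, w2))

def expand_word(word, threshold):
    neighborhood = {word}
    for i in range(len(word)):
        for c in ALPHABET:
            variant = word[:i] + c + word[i + 1:]
            if word_score(word, variant) >= threshold:
                neighborhood.add(variant)
    return neighborhood

# with threshold 3, every 3-gram within one substitution of "coh" is kept
print(len(expand_word("coh", threshold=3)))   # 76 variants (3*25 + the word itself)
```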

  15. Basic idea: BLAST as the query algorithm — query strings are matched against a protein database to propose alignments (query -> database); analogously, the IE system (dictionaries + BLAST, optimized for this problem) matches protein-name dictionary entries against a biomedical paper to produce extracted protein names (dict. entry -> text).

  16. 1) Mapping text to DNA sequences (Q: what sort of char similarity is this?)

  17. 2) Optimizing BLAST • Split protein-name database into several parts (for short, medium-length, and long protein names) • Scoring depends on length of matched string • Require space chars before and after “short” protein names. • Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase • With what data? • Evaluate on one review article, 1162 protein names • inter-annotator agreement not great (70-85%)

  18. 2) Optimizing BLAST

  19. 2) Optimizing BLAST

  20. Results

  21. Results Overall: precision 71.1%, recall 78.8% (optimized)

  22. IE with Dictionaries Cohen & Sarawagi

  23. Finding names you know about • Problem: given a dictionary of names, find them in email text • Important task beyond email (biology, link analysis, ...) • Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C), misspellings (Willaim Chen), polysemous words (June, Bill), etc. • In informal text it sometimes works very poorly • Problem is similar to the record linkage (aka data cleaning, de-duping, merge-purge, ...) problem of finding duplicate database records in heterogeneous databases.

  24. Finding names you know about • Problem: given a dictionary of names, find them in email text • Exact match is unlikely to work well for informal text. • Problem is similar to record linkage • Hard to combine state-of-the-art similarity metrics (as used in record linkage) with a state-of-the-art NER system, due to representational mismatch: • Opening up the box, modern NER systems don’t really know anything about names....

  25. IE as Sequential Word Classification • A trained IE system models the relative probability of labeled sequences of words, with states such as “person name”, “location name”, and “background”. • To classify, find the most likely state sequence for the given words: Yesterday Pedro Domingos spoke this example sentence. • Any words said to be generated by the designated “person name” state are extracted as a person name: Person name: Pedro Domingos
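A minimal sketch of the “most likely state sequence” decoding this slide describes: plain Viterbi over a small label set. The start/transition/emission scores below are hand-set toys, not the output of any trained system:

```python
# Plain Viterbi decoding over a toy label set with hand-set scores.
LABELS = ["person", "location", "background"]

def viterbi(words, start, trans, emit):
    V = [{y: start[y] + emit(y, words[0]) for y in LABELS}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for y in LABELS:
            best_prev = max(LABELS, key=lambda yp: V[-1][yp] + trans[yp][y])
            scores[y] = V[-1][best_prev] + trans[best_prev][y] + emit(y, w)
            ptrs[y] = best_prev
        V.append(scores)
        back.append(ptrs)
    y = max(LABELS, key=lambda l: V[-1][l])    # best final label
    path = [y]
    for ptrs in reversed(back):                # follow back-pointers
        y = ptrs[y]
        path.append(y)
    return list(reversed(path))

def emit(y, w):
    if w in {"Pedro", "Domingos"}:             # tiny "gazetteer" as a toy emission score
        return 1.0 if y == "person" else -1.0
    return 0.5 if y == "background" else -0.5

start = {y: 0.0 for y in LABELS}
trans = {yp: {y: 0.5 if yp == y else 0.0 for y in LABELS} for yp in LABELS}
print(viterbi("Yesterday Pedro Domingos spoke here".split(), start, trans, emit))
# -> ['background', 'person', 'person', 'background', 'background']
```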

  26. IE as Sequential Word Classification • Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but do not explicitly represent the names extracted. • Example word features (computed at positions t-1, t, t+1): identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are “and Associates”. [Figure: feature templates attached to words w_{t-1}, w_t, w_{t+1}, e.g. w_t is “Wisniewski”, part of noun phrase, ends in “-ski”.]
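A small sketch of the kind of per-word feature extractor the list above suggests; the gazetteer is invented, and features that need extra tooling (noun-phrase chunking, WordNet, layout/formatting) are omitted:

```python
# Per-word feature extractor sketch; the city list is illustrative.
CITY_NAMES = {"pittsburgh", "boston"}

def word_features(words, t):
    w = words[t]
    feats = {
        "identity=" + w.lower(): 1.0,
        "ends_in_ski": float(w.lower().endswith("ski")),
        "is_capitalized": float(w[:1].isupper()),
        "in_city_list": float(w.lower() in CITY_NAMES),
    }
    if words[t + 1:t + 3] == ["and", "Associates"]:
        feats["next_two_are_and_Associates"] = 1.0
    return feats

print(word_features("Yesterday Pedro Wisniewski spoke".split(), 2))
```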

  27. Semi-Markov models for IE (with Sunita Sarawagi, IIT Bombay) • Train on sequences of labeled segments, not labeled words: S = (start, end, label) • Build a probability model of segment sequences, not word sequences • Define features f of segments, e.g. f(S) = words x_t...x_u, length, previous words, case information, ..., distance to known name • (Approximately) optimize feature weights on training data to maximize the score of the correct segmentations
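A compact sketch of semi-Markov (segment-level) Viterbi decoding under these definitions: the recurrence chooses a whole segment (start, end, label) at each step, so features can inspect the full candidate segment (its words, its length, its distance to a known name, ...). The scoring function and dictionary below are toys, not the paper's actual feature set:

```python
# Semi-Markov Viterbi: decode over segments (start, end, label) rather than words.
def semi_markov_viterbi(words, labels, score, max_len=4):
    n = len(words)
    best = [0.0] + [float("-inf")] * n        # best[i]: best score over segmentations of words[:i]
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            j = i - length                    # candidate segment words[j:i]
            for y in labels:
                s = best[j] + score(words, j, i, y)
                if s > best[i]:
                    best[i], back[i] = s, (j, y)
    segments, i = [], n                       # recover (start, end, label) triples
    while i > 0:
        j, y = back[i]
        segments.append((j, i, y))
        i = j
    return list(reversed(segments))

KNOWN_NAMES = {"pedro domingos"}              # toy dictionary for a segment-level feature
def score(words, j, i, y):
    seg = " ".join(words[j:i]).lower()
    if y == "person":
        return 3.0 if seg in KNOWN_NAMES else -1.0
    return 0.5 if i - j == 1 else -0.5        # background prefers single-word segments

print(semi_markov_viterbi("Yesterday Pedro Domingos spoke here".split(),
                          ["person", "background"], score))
# -> [(0, 1, 'background'), (1, 3, 'person'), (3, 4, 'background'), (4, 5, 'background')]
```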

  28. Details: Semi-Markov model

  29. Details: Semi-Markov model

  30. Conditional Semi-Markov models: CMM vs. CSMM

  31. A training algorithm for CSMMs (1) • Review: Collins’ perceptron training algorithm (compares the correct tags with the Viterbi tags)
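A bare-bones sketch of the Collins-style structured perceptron update reviewed here, assuming a hypothetical decode(words, weights) Viterbi decoder (e.g. like the sketch after slide 25) and a features(words, tags) map from a full tagging to feature counts:

```python
# Structured perceptron sketch: reward the correct tagging, penalize the
# current Viterbi prediction. decode() and features() are assumed helpers.
def perceptron_train(data, features, decode, epochs=5):
    weights = {}                                   # feature name -> weight
    for _ in range(epochs):
        for words, correct_tags in data:
            predicted = decode(words, weights)     # current best (Viterbi) tags
            if predicted != correct_tags:
                # reward features of the correct tagging ...
                for f, v in features(words, correct_tags).items():
                    weights[f] = weights.get(f, 0.0) + v
                # ... and penalize features of the predicted (wrong) tagging
                for f, v in features(words, predicted).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights
```

The voted/averaged refinement on the following slides additionally averages the weight vectors seen during training, and the semi-Markov variant applies the same update while decoding over segment sequences rather than word-level tags.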

  32. A training algorithm for CSMMs (2) • Variant of Collins’ perceptron training algorithm: a voted perceptron learner for T_TRANS, with a Viterbi-like search

  33. A training algorithm for CSMMs (3) • Variant of Collins’ perceptron training algorithm: a voted perceptron learner for T_TRANS, with a Viterbi-like search

  34. A training algorithm for CSMMs (3) • Variant of Collins’ perceptron training algorithm: a voted perceptron learner for T_SEGTRANS, with a Viterbi-like search

  35. Sample CSMM features

  36. Experimental results • Baseline algorithms: • HMM-VP/1: tags are “in entity”, “other” • HMM-VP/4: tags are “begin entity”, “end entity”, “continue entity”, “unique”, “other” • SMM-VP: all features f(w) have versions for “f(w) true for some w in segment that is first (last, any) word of segment” • dictionaries: like Borthwick • HMM-VP/1: f_D(w) = “word w is in D” • HMM-VP/4: f_D,begin(w) = “word w begins entity in D”, etc. • Dictionary lookup
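A tiny sketch of the Borthwick-style dictionary features listed above; the dictionary and tokenization are illustrative:

```python
# Dictionary features: does this word occur in (or begin) some entry of D?
D = ["William Cohen", "Pedro Domingos", "Sunita Sarawagi"]   # illustrative dictionary
ENTRY_WORDS = {w.lower() for entry in D for w in entry.split()}
BEGIN_WORDS = {entry.split()[0].lower() for entry in D}

def dictionary_features(word):
    w = word.lower()
    return {
        "f_D": float(w in ENTRY_WORDS),         # "word w is in D"
        "f_D_begin": float(w in BEGIN_WORDS),   # "word w begins an entity in D"
    }

print(dictionary_features("Pedro"))    # {'f_D': 1.0, 'f_D_begin': 1.0}
print(dictionary_features("Cohen"))    # {'f_D': 1.0, 'f_D_begin': 0.0}
```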

  37. Datasets used Used small training sets (10% of available) in experiments.

  38. Results

  39. Results: varying history

  40. Results: changing the dictionary

  41. Results: vs CRF

  42. Results: vs CRF
