
Distance functions and IE - 3


jsparks

Presentation Transcript


  1. Distance functions and IE - 3 William W. Cohen CALD

  2. Announcements
     • No meeting this Wed, March 24
     • Thurs, March 25: talk from Carlos Guestrin on max-margin Markov nets
         • Newell-Simon Hall 1507 at 9:30am
         • no wait! Make that Wean Hall 4625
     • Writeups:
         • today: “distance metrics for text” (three papers)

  3. Record linkage: definition
     • Record linkage: determine if pairs of data records describe the same entity
         • I.e., find record pairs that are co-referent
         • Entities: usually people (or organizations or…)
         • Data records: names, addresses, job titles, birth dates, …
     • Main applications:
         • Joining two heterogeneous relations
         • Removing duplicates from a single relation
         • Storing results of information extraction in a database, or answering queries that involve information extracted from different places
     • Key step: measuring similarity of two strings
         • TFIDF metric (WHIRL)
         • Edit distance (Monge-Elkan)

  4. The data integration problem

  5. Levenshtein distance - example
     • distance(“William Cohen”, “Willliam Cohon”)
     [Figure: alignment of s and t, with rows s, t, op, and cost and the gap marked; cell values not preserved in the transcript]

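  6. Computing Levenshtein distance

     \[
     D(i,j) = \min
     \begin{cases}
       D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
       D(i-1,j) + 1 & \text{// insert} \\
       D(i,j-1) + 1 & \text{// delete}
     \end{cases}
     \]

     The final entry of the table gives the distance: D(|s|,|t|) = D(s,t).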
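The recurrence on this slide can be sketched directly as a table-filling loop. This is a minimal illustration, not code from the lecture: it assumes unit costs (d is 0 for a copy and 1 for a substitution) and the usual boundary conditions D(i,0) = i and D(0,j) = j, neither of which is spelled out on the slide. The function name `levenshtein` is hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance with unit costs (assumed), via the D(i,j) table."""
    n, m = len(s), len(t)
    # D[i][j] = distance between the first i chars of s and the first j chars of t.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # assumed boundary: i edits against the empty prefix of t
    for j in range(m + 1):
        D[0][j] = j  # assumed boundary: j edits against the empty prefix of s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # copy costs 0, substitution costs 1
            D[i][j] = min(
                D[i - 1][j - 1] + d,  # subst/copy
                D[i - 1][j] + 1,      # insert (slide's label)
                D[i][j - 1] + 1,      # delete (slide's label)
            )
    return D[n][m]


if __name__ == "__main__":
    print(levenshtein("William Cohen", "Willliam Cohon"))
```

For the example pair from the previous slide, one extra "l" and one e/o substitution account for the two edits.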

  7. Smith-Waterman distance

            c  o  h  e  n  d  o  r  f
         m  0  0  0  0  0  0  0  0  0
         c  1  0  0  0  0  0  0  0  0
         c  0  0  0  0  0  0  0  0  0
         o  0  2  1  0  0  0  2  1  0
         h  0  1  4  3  2  1  1  1  0
         n  0  0  3  3  5  4  3  2  1
         s  0  0  2  2  4  4  3  2  1
         k  0  0  1  1  3  3  3  2  1
         i  0  0  0  0  2  2  2  2  1

     dist=5

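  8. Affine gap distances - 3

     Without affine gaps:

     \[
     D(i,j) = \max
     \begin{cases}
       D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
       D(i-1,j) - 1 & \text{// insert} \\
       D(i,j-1) - 1 & \text{// delete}
     \end{cases}
     \]

     With affine gaps:

     \[
     D(i,j) = \max
     \begin{cases}
       D(i-1,j-1) + d(s_i,t_j) \\
       IS(i-1,j-1) + d(s_i,t_j) \\
       IT(i-1,j-1) + d(s_i,t_j)
     \end{cases}
     \]

     \[
     IS(i,j) = \max
     \begin{cases}
       D(i-1,j) - A \\
       IS(i-1,j) - B
     \end{cases}
     \qquad \text{best score in which } s_i \text{ is aligned with a ‘gap’}
     \]

     \[
     IT(i,j) = \max
     \begin{cases}
       D(i,j-1) - A \\
       IT(i,j-1) - B
     \end{cases}
     \qquad \text{best score in which } t_j \text{ is aligned with a ‘gap’}
     \]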
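The three affine-gap tables D, IS, and IT on this slide can be sketched as follows. Several choices here are assumptions rather than anything the slide states: the scores (match = 2, mismatch = -1), the gap costs (A = 2 to open, B = 1 to extend), the boundary values, and the Smith-Waterman-style floor of 0 that makes this a local alignment score. The function name `affine_gap_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Best local alignment score with affine gaps (A = gap_open, B = gap_extend)."""
    n, m = len(s), len(t)
    # Assumed boundaries: D is 0 on the edges, gap tables start as "impossible".
    D = [[0] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            # Best score in which s_i is aligned with a gap.
            IS[i][j] = max(D[i - 1][j] - gap_open, IS[i - 1][j] - gap_extend)
            # Best score in which t_j is aligned with a gap.
            IT[i][j] = max(D[i][j - 1] - gap_open, IT[i][j - 1] - gap_extend)
            # Best score in which s_i is aligned with t_j; the 0 is the assumed
            # local-alignment floor, not a term from the slide.
            D[i][j] = max(
                0,
                D[i - 1][j - 1] + d,
                IS[i - 1][j - 1] + d,
                IT[i - 1][j - 1] + d,
            )
            best = max(best, D[i][j])
    return best


if __name__ == "__main__":
    print(affine_gap_score("abcd", "abxxcd"))
```

With these assumed costs, a gap of length k costs A + (k - 1)·B, so one long gap is cheaper than several short ones, which is the point of the affine model.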

  9. Record linkage: definition (recap of slide 3)

  10. Inference in WHIRL
     • Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.
     • Constrain X~Y: if X is bound to a and Y is unbound,
         • find DB column C to which Y should be bound
         • pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
     • Keep track of t’s used previously, and don’t allow Y to contain one.

  11. String distance metrics so far...
     • Term-based (e.g. TF/IDF as in WHIRL)
         • Distance depends on the set of words contained in both s and t, so it is sensitive to spelling errors.
         • Usually weight words to account for “importance”
         • Fast comparison: O(n log n) for |s|+|t|=n
     • Edit-distance metrics
         • Distance is the cost of the shortest sequence of edit commands that transforms s into t.
         • No notion of word importance
         • More expensive: O(n^2)
     • Other metrics
         • Jaro metric & variants
         • Monge-Elkan’s recursive string matching
         • etc?
     • Which metrics work best, for which problems?
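To make the term-based case concrete, here is a small sketch of TF-IDF cosine similarity over whitespace tokens. It is an illustration under stated assumptions, not WHIRL's implementation: the weighting log(1 + tf) · log(N / df), the tokenizer, and the names `tfidf_vectors` and `cosine` are all choices made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting scheme: log(1 + tf) * log(N / df).
        vec = {w: math.log(1 + tf[w]) * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors.append({w: x / norm for w, x in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(x * v[w] for w, x in u.items() if w in v)


if __name__ == "__main__":
    names = ["William Cohen", "William W. Cohen", "Carlos Guestrin", "Willliam Cohon"]
    vecs = tfidf_vectors(names)
    print(cosine(vecs[0], vecs[1]))  # shares two tokens
    print(cosine(vecs[0], vecs[3]))  # misspelled tokens share nothing
```

The last comparison shows the sensitivity to spelling errors noted above: the misspelled name has no token in common with the original, so its similarity is zero even though the edit distance is small.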

  12. Jaro metric
     • Jaro metric is (apparently) tuned for personal names:
         • Given (s,t), define c to be common in s,t if s_i = c, t_j = c, and |i-j| < min(|s|,|t|)/2.
         • Define c,d to be a transposition if c,d are common and c,d appear in different orders in s and t.
         • Jaro(s,t) = average of #common/|s|, #common/|t|, and (#common - 0.5·#transpositions)/#common
     • Variant: weight errors early in string more heavily
     • Fast to compute
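The definition above can be sketched in a few lines. This is a hedged illustration: it uses the slide's matching window |i-j| < min(|s|,|t|)/2, a greedy left-to-right assignment of common characters (an assumption; the slide does not say how ties are resolved), and the usual form of the third term, (#common - 0.5·#transpositions)/#common. The function name `jaro` is hypothetical.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1], following the slide's window definition."""
    if not s or not t:
        return 0.0
    window = min(len(s), len(t)) / 2
    used = [False] * len(t)  # which positions of t are already matched
    s_common = []
    for i, c in enumerate(s):
        for j, d in enumerate(t):
            # Greedy choice (assumed): take the first unused match inside the window.
            if not used[j] and c == d and abs(i - j) < window:
                used[j] = True
                s_common.append(c)
                break
    if not s_common:
        return 0.0
    t_common = [d for j, d in enumerate(t) if used[j]]
    common = len(s_common)
    # Common characters that appear in a different order in s and t.
    transpositions = sum(a != b for a, b in zip(s_common, t_common))
    return (
        common / len(s)
        + common / len(t)
        + (common - 0.5 * transpositions) / common
    ) / 3


if __name__ == "__main__":
    print(jaro("MARTHA", "MARHTA"))
```

For "MARTHA" and "MARHTA", all six characters are common and T/H are out of order, so the three terms are 1, 1, and 5/6.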

  13. Jaro metric

  14. Winkler-Jaro metric

  15. String distance metrics so far... (recap of slide 11)

  16. So which metric should you use?
     SecondString (Cohen, Ravikumar, Fienberg):
     • Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
     • Tools for evaluating performance on test data
     • Exploratory tool for adding, testing, combining string distances
         • e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
     • URL: http://secondstring.sourceforge.net
     • Distribution also includes several sample matching problems.
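The idea of a generic "Winkler rescorer" can be illustrated with a small wrapper. This is a Python sketch of the concept only, not SecondString's Java API: it assumes the commonly cited Winkler adjustment, score + L·p·(1 - score), with L the length of the common prefix capped at 4 and p = 0.1. The name `winkler_rescorer` and its parameters are hypothetical.

```python
def winkler_rescorer(similarity, prefix_scale=0.1, max_prefix=4):
    """Wrap any similarity with range [0, 1] so shared prefixes raise the score."""

    def rescored(s: str, t: str) -> float:
        base = similarity(s, t)
        # Length of the common prefix, capped at max_prefix.
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == max_prefix:
                break
            prefix += 1
        # Move the score toward 1 in proportion to the shared prefix length.
        return base + prefix * prefix_scale * (1.0 - base)

    return rescored


if __name__ == "__main__":
    # A deliberately crude base similarity, used only to show the rescaling.
    def same_length(s, t):
        return 1.0 if len(s) == len(t) else 0.5

    rescored = winkler_rescorer(same_length)
    print(rescored("cohen", "cohn"))
```

Because the adjustment only interpolates between the base score and 1, the wrapped function still returns values in [0,1] for any base similarity with that range.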

  17. SecondString distance functions
     • Edit-distance like:
         • Levenshtein (unit costs)
         • untuned Smith-Waterman
         • Monge-Elkan (tuned Smith-Waterman)
         • Jaro and Jaro-Winkler
         • Less ad hoc Jaro variants
     • Term-based
         • TFIDF
         • Jaccard distance

  18. SecondString distance functions
     • Edit-distance like:
         • Levenshtein (unit costs)
         • untuned Smith-Waterman
         • Monge-Elkan (tuned Smith-Waterman)
         • Jaro and Jaro-Winkler

  19. Results - Edit Distances
     • Monge-Elkan is the best on average...

  20. Edit distances

  21. SecondString distance functions
     • Term-based, for sets of terms S and T:
         • TFIDF distance
         • Jaccard distance:
         • Language models: construct P_S and P_T and use
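The Jaccard formula did not survive in the transcript. As a hedged illustration, the measure is commonly defined over token sets as |S ∩ T| / |S ∪ T|, and the sketch below assumes that definition along with a lowercase whitespace tokenizer. The function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(s: str, t: str) -> float:
    """|S ∩ T| / |S ∪ T| over lowercase whitespace tokens (assumed tokenizer)."""
    S, T = set(s.lower().split()), set(t.lower().split())
    if not S and not T:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(S & T) / len(S | T)


if __name__ == "__main__":
    print(jaccard_similarity("William W. Cohen", "William Cohen"))
```

A distance in the strict sense can be obtained as 1 minus this value.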

  22. SecondString distance functions
     • Term-based, for sets of terms S and T:
         • TFIDF distance
         • Jaccard distance
         • Jensen-Shannon distance
             • smoothing toward union of S,T reduces cost of disagreeing on common terms
             • unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer
         • “Simplified Fellegi-Sunter”
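A minimal sketch of the Jensen-Shannon comparison, assuming the unsmoothed case only: P_S and P_T are relative token frequencies, and the score is the average KL divergence of each from their midpoint, in bits. The Dirichlet and Jelinek-Mercer smoothing mentioned above are not implemented here, and the names `token_distribution` and `jensen_shannon` are hypothetical.

```python
import math
from collections import Counter


def token_distribution(text: str) -> dict:
    """Unsmoothed relative frequency of each lowercase whitespace token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jensen_shannon(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    # Midpoint distribution M = (P + Q) / 2 over the union of tokens.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}

    def kl(a: dict) -> float:
        # KL(a || m); every token with a[w] > 0 also has m[w] > 0.
        return sum(x * math.log2(x / m[w]) for w, x in a.items() if x > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    p = token_distribution("William W. Cohen")
    q = token_distribution("William Cohen")
    print(jensen_shannon(p, q))
```

Without smoothing, two strings with no tokens in common always score the maximum of 1 bit, which is the situation that smoothing toward the union of S and T is meant to soften.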

  23. Results – Token Distances

  24. SecondString distance functions
     • Hybrid term-based & edit-distance based:
         • Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
         • SoftTFIDF
             • Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
             • Downweight close tokens slightly
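The token-level recursive matching idea can be sketched as: for each token in S, take its best match among the tokens of T under some inner similarity, then average. The inner similarity here is Python's `difflib.SequenceMatcher.ratio()`, which is only a stand-in chosen to keep the example self-contained; it is not the tuned Smith-Waterman score that Monge-Elkan use. The function name `recursive_match` is hypothetical, and the result is not symmetric in S and T.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Stand-in inner similarity in [0, 1] (assumption, not Monge-Elkan's scorer)."""
    return SequenceMatcher(None, a, b).ratio()


def recursive_match(s: str, t: str) -> float:
    """Average, over tokens of s, of the best similarity to any token of t."""
    s_tokens, t_tokens = s.lower().split(), t.lower().split()
    if not s_tokens or not t_tokens:
        return 0.0
    return sum(
        max(token_similarity(a, b) for b in t_tokens) for a in s_tokens
    ) / len(s_tokens)


if __name__ == "__main__":
    print(recursive_match("William Cohen", "Cohen William"))
    print(recursive_match("William Cohen", "Willliam Cohon"))
    print(recursive_match("William Cohen", "Carlos Guestrin"))
```

Unlike plain TFIDF, this scheme tolerates both token reordering and small spelling errors, because each token is compared with an edit-style measure rather than by exact match.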

  25. Results – Hybrid distances

  26. Results - Overall

  27. Prospective test on two clustering tasks

  28. An anomalous dataset

  29. An anomalous dataset: census

  30. An anomalous dataset: census. Why?
