TAXAMATCH: Overview for Nomina IV Workshop

Tony Rees, CSIRO Marine and Atmospheric Research, Australia. May 2009

Fuzzy matching basics – for taxon scientific names
• Exact match: name A = name B, or name A ≠ name B
• Fuzzy match: name A ≅ name B (i.e., similar to)
• Q1: How similar is “similar”? This is a threshold question: we don’t want to exclude any “true” hits (i.e., produce false negatives). Also, what is a suitable measurement method, or methods? (Different methods give different results.)
• Q2: Are all equal computed similarities equally acceptable? (True and false positives may have the same computed similarity.)
  • additional “accept / reject” rules may help here
• Refinement: authority similarity is also potentially of value
  • name A {= or ≅} name B, authority A {= .. ≅ .. ≠} authority B (a continuum here)
  • again, measured similarity is affected by the choice / design of metric (the best solution for authorities is possibly different from the best solution for scientific names)
• Additional wrinkles: abbreviated author names, diacritics, “house styles” (e.g. punctuation and combining terms), initials included or omitted, and more
• Probably useful for common names as well (won’t go there today…)
Tony Rees, CSIRO: TAXAMATCH – May 2009

Fuzzy matching basics – for taxon names – cont’d
• Phonetic (“sounds like”) matching:
  • rapid and robust for simple errors, but many errors are non-phonetic, e.g.:
    • inserted / deleted / wrong / transposed characters, syllables, or words / dates (in authorities)
    • OCR or transmission / data translation errors
  • however, few or no false hits (with a well designed algorithm, i.e. not SOUNDEX!)
• Similarity / distance metrics, incl. Edit Distance (e.g. Levenshtein Distance), n-grams, and more:
  • trap non-phonetic errors, but give more false hits as similarity thresholds are widened (to catch all errors); also SLOW (real-time computational load)
  • wide spectrum of performance in practice: not all algorithms are good for all error types in real-world taxonomic data
• A combination of the two is potentially beneficial:
  • phonetic matches are “high grade” matches and pass straight through without further testing
  • phonetic matches can also be used in pre-filters etc. (to catch candidate names that would otherwise be excluded)

Phonetic matching
“Phonetic key” approach:
• Transform the input and all target terms using the same rules to produce a “key”; if the keys match, a near match is reported (binary yes / no)
• Examples: SOUNDEX, Phonex, Phonix, Metaphone, Double Metaphone, Rees 2001 Near Match (the latter was the author’s first attempt, addressing specific characteristics of taxon scientific names; used in CAAB, WoRMS, OBIS…)
• “Rees 2007 Near Match” (a.k.a. “near match” for TAXAMATCH purposes) is an improved “Rees 2001 Near Match”: it handles silent leading characters and variant epithet endings (equivalent to stemming), and includes some parsing and normalization (e.g. stripping “cf.”, “aff.”, etc.)
• A useful building block for TAXAMATCH, especially since comparisons can be very fast (if all target terms are transformed in advance, the comparison basically becomes an exact match operation)
[Figure: “Rees 2007 near match” phonetic keys (example)]
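The “phonetic key” pattern above can be sketched in a few lines of Python. This is only an illustration of the precompute-then-exact-match idea: the function names and the handful of substitution rules in `toy_phonetic_key` are hypothetical and are not the Rees 2001 or Rees 2007 rules, which are not given on the slide.

```python
import re
from collections import defaultdict

def toy_phonetic_key(name: str) -> str:
    """Reduce a name to a crude 'sounds like' key (hypothetical toy rules)."""
    key = name.lower()
    # A few illustrative substitutions; a real algorithm has many more rules.
    for old, new in (("ae", "e"), ("oe", "e"), ("ph", "f"), ("y", "i")):
        key = key.replace(old, new)
    # Collapse runs of a repeated character, e.g. "mm" -> "m".
    return re.sub(r"(.)\1+", r"\1", key)

def build_index(targets):
    """Transform all target names once, in advance, into key -> names."""
    index = defaultdict(list)
    for target in targets:
        index[toy_phonetic_key(target)].append(target)
    return index

def near_matches(index, query):
    """A near match is then just an exact lookup on the query's key."""
    return index.get(toy_phonetic_key(query), [])

index = build_index(["Penaeus", "Euglena", "Homo"])
print(near_matches(index, "Peneus"))
```

Because the keys for the reference names are stored up front, each query costs one key transformation plus one dictionary (or indexed column) lookup, which is why this stage is so much faster than an edit distance test.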

Non-Phonetic matching
• Edit Distance measures
  • Levenshtein Distance: count of inserted, deleted, and substituted chars (0 = identical, 1..x = increasing dissimilarity)
  • Damerau-Levenshtein Distance (DLD): the same, but also allows transposed chars at cost = 1 (e.g. aslo / also)
  • Modified Damerau-Levenshtein Distance (MDLD): new for TAXAMATCH; as DLD, but allows transposed character blocks at cost = block length (e.g. Panulirus vs. Palinurus, ED = 2 not 4)
  • NB this is an absolute measure, independent of word length (may need to adjust for that with rules or calculated proportionality); it also ignores position in the word (e.g. near the beginning vs. later)
• Other similarity measures
  • n-grams (a.k.a. q-grams): proportion of common substrings of length n (or q), e.g. bigrams, trigrams. On test, these were more suitable for authorities than for scientific names (too many false positives with the latter), so TAXAMATCH uses them for authorities
  • Many others, e.g. Bag Distance, Smith-Waterman Distance, Longest Common Substring, Positional n-grams, skip grams, compression, Jaro, Jaro-Winkler, sorted Winkler, permuted Winkler, EDITEX, Syllable Alignment Distance… Most were developed for person names; good summary / group test in Christen, 2006
• “No single best technique” (for person names) [Christen]; MDLD is expected to outperform most / all, esp. with transposed chars / syllables and short words
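The three edit distances above differ only in which transpositions they allow, which a single dynamic-programming routine can show. The Python below is a minimal sketch, not the reference implementation: the function name `mdld` and the `max_block` parameter (and its default of 3) are assumptions made for illustration. Setting `max_block=0` gives plain Levenshtein Distance, and `max_block=1` gives the adjacent-transposition (restricted Damerau-Levenshtein) variant.

```python
def mdld(a: str, b: str, max_block: int = 3) -> int:
    """Edit distance allowing swaps of adjacent blocks of up to max_block chars.

    A swapped pair of blocks of length k costs k (assumed default limit: 3).
    """
    la, lb = len(a), len(b)
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(d[i - 1][j] + 1,         # deletion
                       d[i][j - 1] + 1,         # insertion
                       d[i - 1][j - 1] + cost)  # match / substitution
            # Block transposition: the last 2k chars of each prefix are the
            # same two k-char blocks, in opposite order.
            for k in range(1, max_block + 1):
                if i >= 2 * k and j >= 2 * k:
                    if (a[i - 2 * k:i - k] == b[j - k:j]
                            and a[i - k:i] == b[j - 2 * k:j - k]):
                        best = min(best, d[i - 2 * k][j - 2 * k] + k)
            d[i][j] = best
    return d[la][lb]

print(mdld("Panulirus", "Palinurus"))               # block swap "nu" <-> "li"
print(mdld("Panulirus", "Palinurus", max_block=0))  # plain Levenshtein
```

With these inputs the first call returns 2 and the second returns 4, matching the Panulirus / Palinurus example on the slide.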

TAXAMATCH: a combination (“belt and braces”) approach, developed 2007-8
• TAXAMATCH utilizes:
  • “Rees 2007 near match” algorithm for phonetic matches, plus normalization / stemming
  • (Newly developed) MDLD test plus rule-based filtering for non-phonetic matches
  • Abbreviation expansion, normalization, and a blend of bigram + trigram similarity measures for authority comparisons
• Current reference implementation (at CSIRO, Australia) accepts:
  • Genus only (e.g. “Hombo”)
  • Genus + species epithet (e.g. “Hombo sapient”)
  • Genus + species epithet + authority (e.g. “Hombo sapient L.”, or “Hombo sapient Linnaeus”, or “Hombo sapient Linnaeus, 1758”)
• Includes some normalizing: strips subgenera if supplied in brackets, and also some qualifier terms, e.g. cf., aff., etc.
• Returns “candidate near matches” to the supplied input term, against target names in a reference database or local list, plus computed authority similarity (where an authority is supplied)
• Reference DB for the author’s implementation is “IRMNG” (Interim Register of Marine and Nonmarine Genera), held at CSIRO
  • IRMNG currently contains ~1.4m genus + species combinations, plus an additional ~300,000 genus names without associated species
• Goal is >99% recall, 80%+ precision, and high efficiency (= rapid execution)
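The accepted input forms and the normalizing step can be illustrated with a small parser. This is a hedged sketch only: `ParsedName`, `parse_name`, and the exact qualifier list are hypothetical, and the rule used to tell a species epithet from an authority (the epithet starts with a lowercase letter) is a simplifying assumption rather than the reference implementation’s logic.

```python
import re
from typing import NamedTuple, Optional

class ParsedName(NamedTuple):
    genus: str
    species: Optional[str]
    authority: Optional[str]

# Qualifier terms to drop (assumed list, based on the "cf., aff., etc." above).
QUALIFIERS = {"cf.", "cf", "aff.", "aff"}

def parse_name(raw: str) -> ParsedName:
    """Split an input term into genus, species epithet and authority."""
    text = raw.strip()
    # Strip a bracketed subgenus directly after the genus, e.g. "Penaeus (Fenneropenaeus)".
    text = re.sub(r"^(\S+)\s+\([A-Z][a-z]+\)", r"\1", text)
    tokens = [t for t in text.split() if t.lower() not in QUALIFIERS]
    if not tokens:
        raise ValueError("empty name")
    genus, rest = tokens[0], tokens[1:]
    species = None
    # Assumption: a species epithet begins with a lowercase letter.
    if rest and rest[0][:1].islower():
        species, rest = rest[0], rest[1:]
    authority = " ".join(rest) if rest else None
    return ParsedName(genus, species, authority)

print(parse_name("Hombo sapient Linnaeus, 1758"))
print(parse_name("Penaeus (Fenneropenaeus) merguiensis"))
```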

Goal #1: Max. possible Recall (= retrieval of true near matches)
• ~40-50% retrieved by the “Rees 2007 near match” phonetic algorithm, e.g. “Peneus” vs. “Penaeus”, “Caelorinchis” vs. “Coelorhynchus”, “Uglena” vs. “Euglena”
  • also retrieves variant species epithet gender endings, e.g. radiatus / radiata / radiatum etc.
  • no or very few false hits (especially when the species epithet is included)
• Of the remainder:
  • ~80% retrieved by the MDLD algorithm at threshold ED 1 (with a few false hits, e.g. macrocarpa vs. microcarpa)
  • a further 15%-18% at threshold ED 2 (with some false hits)
  • the remaining 1-2% at threshold ED 3-4 (with very many false hits)
• NB: there are a LOT of ED 3-4 matches, and we don’t want them all! Filtering is definitely required (see next slide)

Goal #2: High precision (= suppression of false hits)
Post-filters (as heuristic rules) are applied to the results of both genus- and species-level tests
• Genus post-filter:
  • length filtering: the shortest term must have at least 51% “good” chars, i.e. it must be at least 3 chars long for ED 1, 5 chars long for ED 2, 7 chars long for ED 3… OR be a phonetic match
  • pattern matching: ED 2+ requires a match on the leading character (ED 1 does not), OR a phonetic match
  • Genus ED cannot exceed 3 (except for phonetic matches)
• Species post-filter:
  • length filtering: the shortest term must have at least 50% “good” chars, i.e. it must be at least 2 chars long for ED 1, 4 chars long for ED 2, 6 chars long for ED 3… OR be a phonetic match
  • ED 3: require a match on the leading character, OR a phonetic match
  • ED 4: require a match on the leading 3 characters, OR a phonetic match
  • Combined genus and species ED cannot exceed 4 (except for phonetic matches)
• Final result shaping / dynamic filter stage (can be switched out if desired):
  • always return phonetic matches plus (overall) ED 0, 1 and 2 matches
  • only return ED 3 matches when there are no ED 0, 1, 2, etc. matches
  • only return ED 4 matches when there are no ED 0, 1, 2, 3, etc. matches
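The post-filter rules above translate fairly directly into code. The following Python is a sketch under stated assumptions, not the reference implementation: all function and class names are hypothetical, the minimum lengths are written as 2 × ED + 1 (genus) and 2 × ED (species) to reproduce the 3 / 5 / 7 and 2 / 4 / 6 figures, leading-character comparisons are assumed to be case-insensitive, and in result shaping the “best” ED is taken over all candidates, phonetic or not.

```python
from typing import List, NamedTuple

def passes_genus_postfilter(inp: str, target: str, ed: int, phonetic: bool) -> bool:
    if phonetic:
        return True                      # phonetic matches bypass every rule
    if ed > 3:
        return False                     # genus ED cannot exceed 3
    shortest = min(len(inp), len(target))
    if ed >= 1 and shortest < 2 * ed + 1:
        return False                     # 3, 5, 7 chars for ED 1, 2, 3
    if ed >= 2 and inp[:1].lower() != target[:1].lower():
        return False                     # ED 2+: leading character must match
    return True

def passes_species_postfilter(inp: str, target: str, ed: int,
                              genus_ed: int, phonetic: bool) -> bool:
    if phonetic:
        return True
    if genus_ed + ed > 4:
        return False                     # combined genus + species ED limit
    if min(len(inp), len(target)) < 2 * ed:
        return False                     # 2, 4, 6 chars for ED 1, 2, 3
    if ed == 3 and inp[:1].lower() != target[:1].lower():
        return False                     # ED 3: leading character must match
    if ed == 4 and inp[:3].lower() != target[:3].lower():
        return False                     # ED 4: leading 3 characters must match
    return True

class Candidate(NamedTuple):
    name: str
    ed: int          # overall (genus + species) edit distance
    phonetic: bool

def shape_results(cands: List[Candidate]) -> List[Candidate]:
    """Keep phonetic and ED 0-2 hits; ED 3 / ED 4 only if nothing closer exists."""
    best = min((c.ed for c in cands), default=0)
    limit = max(2, best)
    return [c for c in cands if c.phonetic or c.ed <= limit]

print(passes_genus_postfilter("Panulirus", "Palinurus", 2, False))
print(shape_results([Candidate("A b", 1, False), Candidate("C d", 3, False)]))
```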

Goal #3: High efficiency (= shortest possible process time)
Avoid testing all target names! (An MDLD test for e.g. 20 input chars takes around 1.25 ms per name-name comparison; a test against 1.4m names would take 1,750 secs = approx. 29 mins.)
• 1: Test short strings, i.e. genus and species epithet separately (e.g. 10 chars: 0.3 ms/test)
• 2: Employ a genus pre-filter (more heuristics…)
  • (a) Test phonetic matches (rapid to detect, using pre-calculated values), PLUS
  • (b) Test other genus names with length +/- 2 chars of the input genus name AND between 1 and 3 leading / trailing chars in common (depending on word length), PLUS
  • (c) Test other genus names sharing the same species epithet (as a phonetic match), under the same conditions as (b) above except that length +/- 3 chars is accepted
• 3: Employ a species pre-filter (more heuristics…)
  • (a) Test species only of genera that pass all preceding tests, and also:
  • (b) Test only species epithets with length +/- 4 chars from the input term (except for phonetic matches)
• Result (in the author’s system): typically test only e.g. 1,000-5,000 genus names (of 400,000+, i.e. ~99% saving) and 100-200 species epithets (of 1.4m). This takes ~1-2 secs per input name tested (genus alone, or genus + species, makes little difference). Could optimise further by testing homonyms once only if desired
• “TAXAMATCH Rapid” variant also available: tests only phonetic match genera, plus genera of phonetic match species epithets satisfying other criteria (but will miss some non-phonetic hits where the error is in both genus + species, or where only the genus is supplied)
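Rule 2(b) of the genus pre-filter can be sketched as a cheap screen that runs before any edit distance test. The slide does not say how the number of shared characters depends on word length, nor whether the leading and trailing conditions are alternatives, so the thresholds in `required_shared_chars` and the “leading OR trailing” reading are purely hypothetical; rules 2(a) and 2(c), which rely on phonetic keys, are omitted here.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def required_shared_chars(length: int) -> int:
    """Hypothetical thresholds: 1 to 3 shared chars, growing with word length."""
    if length < 5:
        return 1
    if length < 9:
        return 2
    return 3

def genus_prefilter(input_genus, targets, max_len_diff=2):
    """Return only the target genera worth passing to the (slow) MDLD test."""
    g = input_genus.lower()
    need = required_shared_chars(len(g))
    keep = []
    for target in targets:
        t = target.lower()
        if abs(len(t) - len(g)) > max_len_diff:
            continue                                  # length must be within +/- 2
        lead = common_prefix_len(g, t)
        trail = common_prefix_len(g[::-1], t[::-1])   # shared trailing chars
        if lead >= need or trail >= need:             # assumed: either end suffices
            keep.append(target)
    return keep

print(genus_prefilter("Hombo", ["Homo", "Hominidae", "Pongo", "Bombus"]))
```

Both checks are simple length and substring comparisons, so they can be evaluated against pre-computed, indexed columns far faster than running the MDLD test on every reference name.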

Additional authority comparison stage
• Expand abbreviated authorities, where abbreviations are held in an accessory table (e.g. L. -> Linnaeus, Born. -> Bornet, etc.)
• Standardize case (e.g. de Saedeleer = De Saedeleer, etc.)
• Tolerate commas before dates (Linnaeus 1758 = Linnaeus, 1758)
• Tolerate spaces after initials + full stop (F. J. R. Taylor = F.J.R. Taylor)
• Accept “&”, “and”, “et” as equivalent (except “et al.”)
• Semi-tolerate discrepancies in diacritical marks (Lacepede vs. Lacepède vs. Lacépède, etc.): weighted 50% less than other character differences
• Semi-tolerate transposed / missing words (Smith & Jones vs. Jones & Smith is considered quite similar, whereas it would have a large Edit Distance)
• The actual test is a blend of two n-gram calculations (67% bigram, 33% trigram) after normalisation etc., as above; results are returned on a 0-1 scale where 0 = no similarity, 1 = identical (after relevant normalization)
• Examples:
  • “L., 1758” vs. “Linnaeus”: similarity = 0.76
  • “Smith et Jones” vs. “Jones et Smith”: similarity = 0.91
  • “F.J.R. Taylor” vs. “F. J. R. Taylor”: similarity = 1.0
  • “F.J.R. Taylor” vs. “Taylor”: similarity = 0.59
• Possibly some scope for further development here, but only if needed (in practice, the cutoff for likely non-equivalent authorities is fuzzy, around similarity 0.3-0.4, with some exceptions)
• NB in the current implementation, authority similarity is returned as supplementary info where available: useful for human feedback, but not used for ranking (because not foolproof)
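A blended bigram + trigram comparison of this kind can be sketched as follows. The 67% / 33% weights come from the slide; everything else is an assumption: the function names are hypothetical, the n-gram overlap is computed as a Dice coefficient (the slide does not name the formula), and abbreviation expansion and diacritic weighting are left out. The sketch is therefore not expected to reproduce the example scores quoted above.

```python
import re
from collections import Counter

def normalise_authority(auth: str) -> str:
    """Apply a few of the tolerances listed above (simplified)."""
    s = auth.lower()                                  # standardize case
    s = re.sub(r"\b(and|et)\b(?!\s+al)", "&", s)      # "and"/"et" -> "&", not "et al."
    s = s.replace(",", " ")                           # tolerate commas before dates
    s = re.sub(r"\.\s+", ".", s)                      # tolerate spaces after initials
    return re.sub(r"\s+", " ", s).strip()

def ngrams(s: str, n: int) -> Counter:
    """Multiset of all substrings of length n."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def dice(a: str, b: str, n: int) -> float:
    """Dice coefficient on n-gram multisets (assumed overlap formula)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:                                    # both strings shorter than n
        return 1.0 if a == b else 0.0
    shared = sum((ga & gb).values())
    return 2 * shared / total

def authority_similarity(a: str, b: str) -> float:
    """Blend: 67% bigram + 33% trigram similarity, on a 0-1 scale."""
    na, nb = normalise_authority(a), normalise_authority(b)
    return 0.67 * dice(na, nb, 2) + 0.33 * dice(na, nb, 3)

print(authority_similarity("F.J.R. Taylor", "F. J. R. Taylor"))
print(authority_similarity("Smith et Jones", "Jones et Smith"))
```

Because n-grams ignore where in the string a fragment occurs, swapped author names still share most of their bigrams and trigrams, which is the behaviour the slide contrasts with Edit Distance.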

[Figure: Putting it all together (TAXAMATCH block diagram)]

Example real world performance (recall & precision)
(A) Where true near matches exist in the reference DB, 667 misspelled input names tested:
• Result shaping on: finds 665 of 667 “true” near matches (99.7% recall), 14 false hits (2%)
• Result shaping off: finds 667 of 667 “true” near matches (100% recall), 32 false hits (5%)
(Compare: with phonetic match alone, <50% of true hits are retrieved; without TAXAMATCH heuristics, false hits would be very high and query time impossibly slow.)
(B) Where true near matches do not exist in the reference DB, 100 “not known” input names (genus held in most cases):
• Result shaping on: 18 false hits (18%)
• Result shaping off: 24 false hits (24%)
(Not obvious yet why this is higher than the 5% in the previous example, but it may be in part because a higher % of insect and plant names is included, and the latter are quite speciose.)
Take home messages:
• Maybe leave result shaping on for user web queries, off for serious deduplication work (need to manually inspect candidate near matches anyway)
• A larger / more complete reference DB increases the chance of both “true” and “false” near matches; however, the true ones should mask the false ones when result shaping is on

Example real world performance (efficiency)
• User web queries take around 1-2 secs on average per input name in the present CMAR reference system (includes time for subsidiary database queries, e.g. get taxonomic hierarchies and synonyms for all returned names)
• Possibilities for increased efficiency (e.g. for large scale deduplication runs such as 1.4m x 1.4m names):
  • (a) ignore genera with no species (if interested only in species level comparisons)
  • (b) only test e.g. insects vs. insects, mammals vs. mammals, etc. (should save min. 50% of queries), i.e. reduce the size of the reference DB
  • (c) write directly to a file or database table rather than generating web query results
  • (d) employ the “TAXAMATCH Rapid” variant: require at least one of genus or species epithet to be a phonetic match (saves e.g. 90%+ of queries, but may give a misleading result in a few cases, or in many where no species epithet is supplied)
  • (e) use faster / non-shared hardware, parallelise queries, other system level optimisation (not yet investigated)
• In the author’s case study, application of (a) through (c) above, plus use of a slightly faster / dedicated machine, reduced typical query time from >1 sec to ~0.3 sec per input name tested
• A test of 1.4m names against each other took 420,000 secs = 7,000 min = 116.7 hours (4.86 days)

Where TAXAMATCH might fit (in a “names recognition” workflow):
[Workflow diagram, recovered labels]
• Raw Text, including species names
• Names recognition tool (EOL diagram)
• Candidate names set: presumed genus + species (or genus + species + auth) in present example; could extend match options as needed
• Exact match comparison against Known names, giving four outcomes:
  • Genus + species both known
  • Genus known, species not known (new name, or misspelled known species name?)
  • Genus not known, species known (new name, or misspelled known genus name?)
  • Genus and species not known (new name, misspelled genus + species names, or bad data?)
• Manual review required (detection of misspellings is very time consuming, and some are likely to be missed)

Where TAXAMATCH might fit (in a “names recognition” workflow):
With TAXAMATCH or equivalent:
• Improved and more rapid detection of candidate misspellings (manual review is still required, but the process is much faster and semi-automated, e.g. phonetic matches can mostly go straight through)
• Manual review is required to sort the remaining candidate near matches presented into “true” / “false”
• Names not matched are either “new, correctly spelled”, “new, misspelled” or “bad data”; again, manual intervention is needed to sort these (but it is a smaller set than the original “no match” set)
• Could conceivably introduce additional tests on “no match” names, e.g. look for common names, transposed or truncated text, families used as genus, subgenus used as species, etc.

[Figure: Real-world deduplication example (Sep 2008)]

[Figure: Example TAXAMATCH operation – species level]

[Figure: or can enter fewer than 1,200 species names (via web)]

[Figure: Example TAXAMATCH operation – species level]

TAXAMATCH implementation requirements
• Availability of an appropriate reference DB, plus a decision on what is “appropriate”: e.g. domain, region or taxon specific, or “all names”? (Also probably need to either exclude or flag misspellings in the reference DB)
• Installation of relevant functions including near match, MDLD, normalize, etc., plus installation of the TAXAMATCH routine itself (also the author abbreviations lookup table and some temporary results tables)
• Addition of pre-computed, indexed columns in the reference DB, holding the “search” (= normalized) and “near” (= phonetic) match version of each name, plus name length, for both genus and species epithet
• Appropriate server hardware: e.g. not too slow; it may also be desirable to quarantine it from running other concurrent tasks if these impact on relevant search times
• Advice from relevant database experts on optimising tables, indexes, tablespaces, memory allocations etc. (can make a big difference to performance in practice)
• Currently implemented in Oracle® PL/SQL; PHP support possibly soon, others may follow

Aspects of TAXAMATCH not yet developed…
• Match on genus authorities (trivial, just not implemented as yet)
• Match on subgenera, infraspecies elements, hybrids, etc. (+ requisite parsing)
• Test for split or concatenated names, e.g. Ho mo sapiens, Homosapiens (or Ho mosapiens?)
• Test for incorrect content, e.g. common name entered in sci. name fields (e.g. “Sand crabs”), family name in genus field (e.g. “Dorippidae sp. 1”), etc.; also match at different ranks, e.g. Homo sapiens neanderthalensis vs. Homo neanderthalensis, Penaeus (Fenneropenaeus) merguiensis vs. Fenneropenaeus merguiensis, etc.
• Match on species epithet only (maybe plus taxonomic group too where known), e.g. to flag possible genus reallocations (trivial to implement if desired)
• Better handling of diacritics, e.g. Isoëtes vs. Isoetes, etc. (minor fix required)
• Defining standard requests / responses for web services
• Port to other languages, e.g. PHP, Java (see M. Giddens’ work, this meeting)
• Further population of the reference DB (IRMNG in these examples) and authority abbreviation tables: completeness matters (true hits mask false ones, at least when result shaping is on)
• Any further fine tuning of filters, etc., if it can further reduce false positives (without affecting recall)

Thank you

CSIRO Marine and Atmospheric Research, Hobart, Tasmania, Australia
Tony Rees, Manager, Divisional Data Centre
Phone: +61 3 6232 5318
Email: Tony.Rees@csiro.au
Web: www.cmar.csiro.au/datacentre/

Contact Us
Phone: 1300 363 400 or +61 3 9545 2176
Email: Enquiries@csiro.au
Web: www.csiro.au
