
FAT – Finding All Taxa (in Text Documents)

Universität Karlsruhe (TH) Research University – founded 1825. Guido Sautter, Donat Agosti, Klemens Böhm.


Presentation Transcript


  1. FAT – Finding All Taxa (in Text Documents)
     Guido Sautter, Donat Agosti, Klemens Böhm
     Universität Karlsruhe (TH) Research University – founded 1825

  2. FAT – Basic Principle
     • Generate taxon name candidates
     • Find out which candidates actually are taxon names
     • Divide the text into
        - Sure positives
        - Sure negatives
        - Candidates
     • Use sure positives and negatives to deal with candidates

  3. FAT – Detail Overview
     • Find all parts of text that might be taxon names using
        - Morphological structure (in form of regular expressions)
        - Known taxon names (as positive gazetteer lists)
     • Successively rule candidates to be taxa or not using
        - Morphological structure (in form of regular expressions)
        - Known taxon names (as positive gazetteer lists)
        - Textual hints (name labels, e.g. “sp. nov.”)
        - Ruled-out words (as negative gazetteer lists)
        - Common dictionaries (as negative gazetteer lists)
        - Document internal contradictions
        - User feedback (as last instance)

  4. FAT – Basic Benefits / Deficits
     • Benefits
        - All available knowledge is used
        - Newly added knowledge is used as early as possible
        - Can learn new taxa through use of structure
        - User can avoid errors through feedback at little effort
     • Deficits
        - Regular expression patterns somewhat inflexible regarding
           - Automated adaptation to different document styles
           - Language-dependent capitalization schemes (e.g. in German)
        - Gazetteer lists somewhat susceptible to
           - Misspellings / OCR errors
           - Unseen languages

  5. Morphological Rules
     • Exploit (Linnaean / ICZN) rules of nomenclature
     • Challenges:
        - Different schemes of in-taxon-name punctuation
        - Embedded author names (differing styles, strange names)
     • Implementation:
        - Editor for basic building blocks, including
           - line-broken and indented layout
           - syntax check and test facilities
        - Actual expressions assembled dynamically at runtime
        - (Almost) all parts maintainable in one place
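The idea of assembling the actual expressions at runtime from a few maintained building blocks can be sketched as follows. This is a minimal illustration in Python, not the FAT implementation: the block names, the patterns themselves, and the `assemble` helper are all assumptions, since the slides do not show the real expressions.

```python
import re

# Hypothetical building blocks; the real FAT patterns are not shown in the slides.
BLOCKS = {
    "genus": r"[A-Z][a-z]+",
    "subgenus": r"\([A-Z][a-z]+\)",
    "species": r"[a-z]{3,}",
    "rank": r"(?:subsp|var|st)\.",
}

def assemble(blocks):
    """Build one taxon-name pattern from the individual building blocks."""
    template = (
        r"\b{genus}(?:\s+{subgenus})?\s+{species}"
        r"(?:\s+{rank}\s+{species})?\b"
    )
    return re.compile(template.format(**blocks))

pattern = assemble(BLOCKS)
```

Because every block lives in one dictionary, changing for example the `species` block changes every assembled expression that uses it, which is one way to read "(almost) all parts maintainable in one place".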

  6. Gazetteer Lists
     • Storage for known taxon names / epithets / authors
     • Challenges:
        - Huge amount of data (main memory footprint)
        - Misspellings (source text or OCR)
     • Implementation:
        - Editor for lists, including
           - import / export
           - add / intersect / subtract functions
        - Centralized access point: each list is loaded and stored only once
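A centralized access point that loads each list only once, together with the add / intersect / subtract functions, might look like the sketch below. The list names, the sample entries, and the in-memory `_RAW_LISTS` stand-in for list files are assumptions made for illustration only.

```python
from functools import lru_cache

# Hypothetical stand-in for gazetteer files on disk.
_RAW_LISTS = {
    "species": ["maculatus", "hova", "radamae"],
    "imported": ["Hova", "longispinosa"],
    "ruled_out": ["hova"],
}

@lru_cache(maxsize=None)
def get_list(name):
    """Central access point: each list is built once and then shared."""
    return frozenset(entry.lower() for entry in _RAW_LISTS[name])

def add(a, b):
    return a | b

def intersect(a, b):
    return a & b

def subtract(a, b):
    return a - b
```

Caching the immutable `frozenset` means repeated lookups return the same object, so a large list occupies main memory only once.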

  7. Running FAT (Overview)
     [Diagram: the Document Text passes through the steps Recall Rules, Dictionary Filter, Lexicon Rules, Label Rules, Precise Rules, Known Data Rules, Author Name Rules, Negative Rules, Data Rules, Dynamic Lexicon Rules, and User Feedback. Along the way the text is sorted into Taxon Names, Candidates, and Not Taxon Names.]
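The overall flow can be pictured as a chain of steps that each move phrases between three sets. The sketch below is only a schematic under that assumption: the `State` class and the two toy steps are hypothetical and much simpler than the rules named on the slides.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    matches: set = field(default_factory=set)      # taxon names
    candidates: set = field(default_factory=set)   # still undecided
    negatives: set = field(default_factory=set)    # not taxon names

def promote_known(state, known=frozenset({"Camponotus maculatus"})):
    """Toy step: candidates found in a known-names list become matches."""
    hits = state.candidates & known
    return State(state.matches | hits, state.candidates - hits, state.negatives)

def reject_lowercase_start(state):
    """Toy step: candidates not starting with a capital become negatives."""
    bad = {c for c in state.candidates if not c[:1].isupper()}
    return State(state.matches, state.candidates - bad, state.negatives | bad)

def run_pipeline(state, steps):
    """Apply the steps in order; each step sees the result of the previous one."""
    for step in steps:
        state = step(state)
    return state

start = State(candidates={"Camponotus maculatus", "the nest", "Pheidole longispinosa"})
end = run_pipeline(start, [promote_known, reject_lowercase_start])
```

Whatever remains in `end.candidates` after the last automatic step is what a final user-feedback step would have to decide.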

  8. Recall Rules
     • Create candidates:
        - morphological structure
        - filter out matches that contain stop words
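As a rough sketch of this step, a deliberately loose pattern can collect everything shaped like a binomial, and a stop-word check can discard the obvious misses. The pattern and the stop-word set below are assumptions for illustration; the actual recall rules are not given in the slides.

```python
import re

# Loose, high-recall pattern (an assumption): capitalized word + lowercase word.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+\s+[a-z]{3,}\b")
STOP_WORDS = {"the", "and", "with", "from", "this", "was", "are"}

def recall_candidates(text):
    """Return all pattern matches that contain no stop word."""
    found = []
    for m in CANDIDATE.finditer(text):
        words = m.group(0).lower().split()
        if not any(w in STOP_WORDS for w in words):
            found.append(m.group(0))
    return found

text = "This was collected near Camponotus maculatus nests. Workers forage at night."
candidates = recall_candidates(text)
```

Here `candidates` still contains the false positive "Workers forage": the step aims at recall, and later steps are expected to remove such phrases.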

  9. Dictionary Filter
     • Filter candidates:
        - gazetteer based
        - filter out candidates with common language words in epithet positions (+ stemming for English)
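A minimal sketch of such a filter follows. The tiny word list and the `crude_stem` suffix stripper are stand-ins chosen for illustration; the slides do not say which dictionary or stemmer FAT uses.

```python
# Hypothetical common-language dictionary (negative gazetteer list).
COMMON_WORDS = {"forage", "collect", "nest", "worker"}

def crude_stem(word):
    """Very rough English suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def is_common(word):
    w = word.lower()
    return w in COMMON_WORDS or crude_stem(w) in COMMON_WORDS

def dictionary_filter(candidates):
    """Split candidates by whether an epithet position holds a common word."""
    kept, rejected = [], []
    for cand in candidates:
        epithets = cand.split()[1:]  # everything after the genus position
        if any(is_common(e) for e in epithets):
            rejected.append(cand)
        else:
            kept.append(cand)
    return kept, rejected

kept, rejected = dictionary_filter(
    ["Camponotus maculatus", "Workers forage", "Ants nests"]
)
```

Only epithet positions are checked, so a capitalized common word in the first position does not by itself reject a candidate.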

  10. Lexicon Rules
     • Exploit known epithets:
        - candidates → matches
        - create further candidates

  11. Label Rules
     • Exploit taxon name labels:
        - labeled candidates → matches
        - e.g. “Genus species, sp. nov.”
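One way to picture a label rule: a candidate that is directly followed by a label such as "sp. nov." is promoted to a match. The label pattern and the helper name below are assumptions; the slides only give "sp. nov." as an example label.

```python
import re

# Hypothetical label pattern covering forms like ", sp. nov." or " gen. n."
LABEL = re.compile(r"\s*,?\s*(?:sp|gen|subsp|var)\.\s*(?:nov|n)\.")

def apply_label_rules(text, candidates):
    """Return the candidates that are followed by a taxon name label."""
    promoted = []
    for cand in candidates:
        for m in re.finditer(re.escape(cand), text):
            # Pattern.match with a start position anchors the label
            # immediately after this occurrence of the candidate.
            if LABEL.match(text, m.end()):
                promoted.append(cand)
                break
    return promoted

text = "Camponotus hova, sp. nov. differs from Camponotus maculatus in size."
promoted = apply_label_rules(text, ["Camponotus hova", "Camponotus maculatus"])
```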

  12. Precise Rules
     • Exploit morphology:
        - candidates with distinctive structure → matches
        - e.g. “Genus species st. race”

  13. Known Data Rules
     • Exploit prior runs:
        - Extract epithets from candidates
        - Candidates with a known epithet combination → matches

  14. Author Name Rules
     • Exclude candidates with author names in genus or subgenus position

  15. Negative Rules
     • Exclude candidates with words from negatives (all text excluded so far) in epithet positions

  16. Data Rules
     • Candidates with known epithets in last position → matches

  17. Dynamic Lexicon Rules
     • Exploit matches & negatives:
        - Works like a combination of the preceding lexicon-based rules
        - But uses the matches and negatives from the current document
        - Compute transitive hull
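Computing the transitive hull can be read as repeating the lexicon-style rules until nothing changes, because every newly decided candidate adds words that may decide further candidates. The sketch below assumes two simplified rules (exclude on a negative word in an epithet position, promote on a known epithet in last position) and an arbitrary precedence between them; the real rule set and order are not specified in the slides.

```python
def dynamic_lexicon(matches, negatives, candidates):
    """Iterate document-internal lexicon rules until a fixpoint is reached."""
    matches, negatives, open_ = set(matches), set(negatives), set(candidates)
    changed = True
    while changed:
        changed = False
        # Epithets (all words after the first) of names accepted so far.
        known = {w.lower() for m in matches for w in m.split()[1:]}
        # All words of phrases excluded so far.
        bad = {w.lower() for n in negatives for w in n.split()}
        for cand in sorted(open_):
            epithets = [w.lower() for w in cand.split()[1:]]
            if any(e in bad for e in epithets):
                open_.discard(cand)
                negatives.add(cand)
                changed = True
            elif epithets and epithets[-1] in known:
                open_.discard(cand)
                matches.add(cand)
                changed = True
    return matches, negatives, open_

m, n, c = dynamic_lexicon(
    matches={"Camponotus maculatus"},
    negatives={"Workers forage"},
    candidates={
        "Camponotus radamae maculatus",
        "Camponotus radamae",
        "Ants forage",
        "Pheidole longispinosa",
    },
)
```

In this run "Camponotus radamae" is only promoted in the second pass, after "Camponotus radamae maculatus" has made "radamae" a known epithet; that chained effect is what the fixpoint loop captures.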

  18. User Feedback
     • Ask user to decide on remaining candidates (displaying some context)
     • Optional step, can be omitted

  19. Questions?
     Universität Karlsruhe (TH) Research University – founded 1825
     • Browse Madagascar Corpus at http://plazi.org/GgSRS/search
     • Download GoldenGATE from http://idaho.ipd.uka.de/GoldenGATE/
