
WP 10 Multilingual Access



Presentation Transcript


  1. WP 10 Multilingual Access. Philipp Daumke, Stefan Schulz

  2. Multilingual Access - Rationale
     [Chart categories: English as a Foreign Language; English as a Second Language; English as a First Language; No English Language Skills]
     • < 70 % of the world's scientists read in English
     • 80 % of the world's electronically stored information is in English
     • 90 % of the articles in Medline are in English (2000)
     Sources: The British Council, 2005; Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008

  3. Non-native speakers
     English as a Foreign Language; English as a Second Language
     • Broad range of command of English
     • Reading skills > writing skills
     • Reduced active vocabulary
     Difficulty in formulating precise queries

  4-6. Cross-language document retrieval example
     German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
     English rendering: “Correlation of high blood pressure and lesion of the white substance”

  7. BootStrep WP 10 - Multilingual access
     • Objectives:
       • To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology
       • We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon
     • Query languages: French, German, English, (Italian)
     • Output language: English
     • Method: Subword-based semantic indexing
     • Resources:
       • MorphoSaurus multilingual subword lexicon & thesaurus
       • MorphoSaurus Semantic Indexer

  8. Technique: Morphosemantic Indexing
     • Subword-based, multilingual semantic indexing for document retrieval
     • Subwords are atomic, conceptual or linguistic units:
       • Stems: stomach, gastr, diaphys
       • Prefixes: anti-, bi-, hyper-
       • Suffixes: -ary, -ion, -itis
       • Infixes: -o-, -s-
     • Equivalence classes contain synonymous subwords and their translations:
       • #derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
       • #inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, ... }
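The equivalence classes on slide 8 can be pictured as a lookup from language-specific subwords to a shared class identifier. The following is a minimal sketch in Python, assuming a toy table that holds only the two classes listed on the slide. The names `EQUIVALENCE_CLASSES`, `build_subword_index` and `class_of` are hypothetical and are not part of MorphoSaurus.

```python
# Toy equivalence classes copied from slide 8; the real thesaurus is far larger.
EQUIVALENCE_CLASSES = {
    "#derma": ["derm", "cutis", "skin", "haut", "kutis", "pele", "piel"],
    "#inflamm": ["inflamm", "-itic", "-itis", "-phlog", "entzuend",
                 "-itisch", "inflam", "flog"],
}


def build_subword_index(classes):
    """Invert class -> subwords into subword -> class (hypothetical helper)."""
    index = {}
    for class_id, subwords in classes.items():
        for subword in subwords:
            # Strip affix hyphens and lowercase so "-itis" and "itis" share a key.
            index[subword.strip("-").lower()] = class_id
    return index


SUBWORD_INDEX = build_subword_index(EQUIVALENCE_CLASSES)


def class_of(subword):
    """Return the equivalence class of a subword, or None if it is unknown."""
    return SUBWORD_INDEX.get(subword.strip("-").lower())


# German "haut", Spanish "piel" and English "skin" resolve to the same class.
for candidate in ("haut", "piel", "skin", "-itis"):
    print(candidate, class_of(candidate))
```

Because every language maps into the same identifiers, the lookup itself carries the translation step; no separate bilingual dictionary is consulted in this sketch.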

  9. Subword Thesaurus Structure
     [Diagram: subwords grouped into equivalence classes]
     • HEART: heart, herz, corazon, card
     • MUSCLE: muscle, myo, muscul, muskel
     • INFLAMM: -itis, inflam, entzünd, inflamm
     • Thesaurus: ~21,000 equivalence classes (MIDs)
     • Lexicon entries:
       • English: ~23,000
       • German: ~24,000
       • Portuguese: ~15,000
       • Spanish: ~11,000
       • French: ~8,000
       • Swedish: ~10,000
       • Italian: ~4,000
     Segmentation and indexation:
     • Myo|kard|itis: #muscle #heart #inflamm
     • Herz|muskel|entzünd|ung: #heart #muscle #inflamm
     • Inflamm|ation of the heart muscle: #inflamm #heart #muscle
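The segmentation and indexation examples on slide 9 can be reproduced with a small sketch. It assumes a greedy longest-match segmenter over a toy lexicon; the slides do not say how the MorphoSaurus segmenter actually works, so `LEXICON`, `STOPWORDS`, `segment` and `index_text` are illustrative names only.

```python
# Toy lexicon: subword -> equivalence class. None marks an affix that carries
# no class of its own. Entries come from the slide's examples only.
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "kard": "#heart", "card": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "entzünd": "#inflamm", "inflamm": "#inflamm",
    "ung": None, "ation": None,
}
STOPWORDS = {"of", "the"}  # assumed stopword list for the English example


def segment(word):
    """Greedy longest-match segmentation (an assumption, not the real method).

    Returns the list of subwords, or None if the word cannot be fully covered.
    """
    word = word.lower()
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in LEXICON:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no lexicon entry starts at this position
    return parts


def index_text(text):
    """Map a text to its sequence of equivalence-class identifiers."""
    mids = []
    for token in text.split():
        if token.lower() in STOPWORDS:
            continue
        for part in segment(token) or []:
            mid = LEXICON[part]
            if mid is not None:
                mids.append(mid)
    return mids


for text in ("Myokarditis", "Herzmuskelentzündung",
             "Inflammation of the heart muscle"):
    print(text, "->", index_text(text))
```

The three inputs yield the same set of identifiers in different orders, which is the property the slide illustrates. A greedy matcher without backtracking can fail on words where the longest prefix blocks a valid split, so a production segmenter would need a more careful search.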

  10-13. Indexing Pipeline

  14. Subword-based document transformation: Morphosemantic indexer

  15. Subword-Based Search
     Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
     Index terms: #correl #hyper #tens #lesion #whit #matter

  16. Subword-based query transformation
     Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
     Index terms: #correl #hyper #tens #lesion #whit #matter
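Once queries and documents are both expressed as class identifiers, retrieval reduces to comparing identifier sets, regardless of the source language. The sketch below assumes a hand-built word table (`WORD_TO_MIDS`) in place of real subword segmentation and uses Jaccard overlap as the score. The slides do not state which ranking function the search engine uses, so the scoring is purely an assumption, and the sample documents are invented for illustration.

```python
# Hypothetical word -> identifier table built from the identifiers on the slide.
# A real indexer would derive these by subword segmentation, not whole words.
WORD_TO_MIDS = {
    "korrelation": ["#correl"], "correlation": ["#correl"],
    "hypertonie": ["#hyper", "#tens"], "hypertension": ["#hyper", "#tens"],
    "läsion": ["#lesion"], "lesions": ["#lesion"],
    "weißen": ["#whit"], "white": ["#whit"],
    "substanz": ["#matter"], "matter": ["#matter"],
}


def to_mids(text):
    """Transform a query or document into a set of class identifiers."""
    mids = set()
    for token in text.lower().split():
        mids.update(WORD_TO_MIDS.get(token, []))  # unknown words are dropped
    return mids


def rank(query, documents):
    """Rank document ids by Jaccard overlap with the query (assumed scoring)."""
    query_mids = to_mids(query)
    scored = []
    for doc_id, text in documents.items():
        doc_mids = to_mids(text)
        union = query_mids | doc_mids
        score = len(query_mids & doc_mids) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Invented sample documents, not taken from the BootStrep collection.
documents = {
    "d1": "Correlation of hypertension and white matter lesions",
    "d2": "Gene regulation in Escherichia coli",
}
query = "Korrelation von Hypertonie und Läsion der Weißen Substanz"
print(rank(query, documents))
```

The German query and the English document share no surface words, yet they map to identical identifier sets, which is why the English document is retrieved.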

  17. Adapting Morphosemantic Indexing to BootStrep
     • BootStrep terminology mostly disjoint from existing clinical terminology
     • Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)
     • BootStrep terms for multilingual access:
       • Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
       • Medline subcorpus (about E. coli gene regulation)

  18. Ongoing/Completed Tasks
     • Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
     • Multilingual Terminology Browser:
       • 2268 GO terms + translations
       • 6925 InterPro terms + translations
       • 2082 IntAct terms + translations
       • URL: http://www.medinf.uni-freiburg.de/demo/BootStrepBrowser/
     • Multilingual Search Engine:
       • Document collection: BootStrep-Medline subset
       • Languages: English, German, French
       • Query modes: Author, Title, Title + keywords, All

  19. Terminology Browser Search Results Navigation Further Information

  20. Terminology Browser

  21. Multilingual Search Engine

  22. To do: Tools and Resources
     • BootStrep-Browser:
       • Integration of Species
       • Integration of the Gene Regulation Ontology
     • Multilingual Search Engine:
       • Multilingual treatment of acronyms
       • Inclusion of species synonym list
       • Dealing with mixed queries (German-English, English-French)
       • Integration with the fact store
     • Continue lexicon population:
       • Italian terms?

  23. To do: Evaluation
     • Creation of a gold standard:
       • Typical English queries
       • Find all relevant documents in the E. coli subset
     • CLIR experiments:
       • Translate queries to French and German
       • Compare mean average precision
     • Reuse of already existing routines on standard benchmarks (OHSUMED, ImageCLEF)
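The planned comparison of mean average precision across query languages can be sketched as follows. This uses the standard definition of average precision (the mean of the precision values at each rank where a relevant document appears, divided by the number of relevant documents). The function names and the sample rankings are hypothetical and do not come from the BootStrep evaluation.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a gold standard."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Invented rankings for two queries, purely to exercise the functions.
english_runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(round(mean_average_precision(english_runs), 4))
```

Running the same gold-standard queries in French and German and computing this value per language would give directly comparable figures, since the relevance judgements stay fixed and only the query language changes.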

  24. ImageCLEFMed Benchmark
     • Baseline: monolingual
       • Stemmed English queries
       • Stemmed English texts
     • Query translation
       • Google translator
       • Multilingual dictionary compiled from UMLS
     • Morphosemantic Indexing
       • Interlingual representation of user queries and documents
     • Morphosemantic Indexing
       • incorporating disambiguation module
     [Chart: Top 20 Average Precision, Percent of Baseline; categories: English, German, Portuguese, Spanish, French, Swedish, Average]
