
SemanticMining WP20 meeting

SemanticMining WP20 meeting. Freiburg, March 29 – 30, 2004.

muriel

Presentation Transcript


  1. SemanticMining WP20 meeting. Freiburg, March 29 – 30, 2004

  2. Agenda
     March 29
     12:30 - 13:30 Lunch
     13:30 Welcome, discussion of agenda
     14:00 - 14:35 Linköping presentation
     14:35 - 15:10 Brighton presentation
     15:10 - 15:45 Göteborg presentation
     15:45 - 16:30 Coffee break
     16:30 - 17:05 Stockholm presentation
     17:05 - 17:40 Geneva presentation
     17:40 - 18:25 Paris presentation
     18:25 - 19:00 Freiburg presentation
     20:00 Dinner
     March 30
     9:00 - 10:30 Discussion of the description of WP 20
     10:30 - 10:45 Coffee break
     11:00 - 12:45 Workplan for WP20: discussion and elaboration of deliverables
     13:00 - 14:00 Lunch

  3. Multi-lingual Medical Dictionary: Description of Work (I) The lack of a large-scale multi-lingual medical dictionary hampers the integration of European research activities in the medical field and, more seriously, the development of multi-lingual information retrieval services. One language technology relevant to this problem is corpus-based machine translation. The aim of this project is to develop techniques and systems for generating lexical data from parallel corpora, and to develop and apply methods for evaluating machine translation systems. Parallel corpora exist, e.g., as translations from English into other European languages of the official WHO classifications and some other terminology systems. Several of the NoE partners have extensive experience in multilingual lexical resources and computational lexicography, while others have an interest in applying such tools, e.g., for semi-automated translation, semi-automated coding and indexing, and advanced systems for information retrieval.
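To make "lexical data generation from parallel corpora" concrete, here is a minimal sketch that scores candidate word translations by co-occurrence across aligned term pairs, using the Dice coefficient. The slides do not say which technique the project uses, so Dice scoring is a stand-in. The function names and the toy English/German pairs are invented for illustration.

```python
from collections import Counter
from itertools import product

def dice_scores(pairs):
    """Score (source word, target word) pairs by Dice co-occurrence over aligned segments."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_tokens = set(src.lower().split())
        t_tokens = set(tgt.lower().split())
        src_count.update(s_tokens)
        tgt_count.update(t_tokens)
        joint.update(product(s_tokens, t_tokens))  # every cross-language word pair
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_translation(word, scores):
    """Return the target word with the highest Dice score for `word`, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

# Invented toy corpus (pre-tokenized, not taken from any WHO classification).
toy_pairs = [
    ("liver disease", "leber krankheit"),
    ("liver cancer", "leber krebs"),
    ("kidney disease", "nieren krankheit"),
    ("kidney cancer", "nieren krebs"),
]
scores = dice_scores(toy_pairs)
print(best_translation("liver", scores))
```

A real system would need proper tokenization, frequency thresholds, and handling of multi-word terms. The sketch shows only the core counting step.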

  4. Multi-lingual Medical Dictionary: Description of Work (II)
     Tasks
     • 20.1 Facilitating short study visits of members of each other's groups
     • 20.2 Sharing and exchange of methods, materials and collaboration on work in progress
     • 20.3 Proposal for a common data structure for a multi-lingual medical dictionary
     • 20.4 Generation of a multi-lingual medical lexicon in English, German, French, Portuguese, Italian, Spanish and Swedish, in a range of 4,000-40,000 entries per language
     Deliverables
     • D20.1 Report Multi-lingual Medical Dictionary m11
     • D20.2 Report Multi-lingual Medical Dictionary m17

  5. Topics for Discussion
     • Lexeme features (morphology, syntax, semantics)
     • Application context (IR, NLG, …)
     • Linguistic framework (grammar theory)
     • Languages covered
     • Domain (sublanguages, general language)
     • Size of the lexicon
     • Implementation framework (sources, exchange templates, …)
     • Interfaces to terminological resources (UMLS, WordNet)
     • Methods for lexical acquisition (manual, semi-automatic)

  6. MorphoSaurus: Subword Lexicon & Thesaurus
     Freiburg University Hospital, Department of Medical Informatics
     Freiburg University, Computational Linguistics Lab

  7. Motivation – Intra- and Crosslingual Indexing for Information Retrieval
     Requirements:
     • Elimination of inflectional and derivational variation: {nucleus, nuclei}, {diagnosis, diagnoses, diagnostic}, {foot, feet}, {Lymphozyten, lymphozytär}
     • Decomposition of compound terms: procto|sigmoid|o|scop|ie, para|sympath|ectomy, Rechts|herz|insuffizienz, psic|o|s|somát|ic|o
     • Resolution of synonyms and spelling variants: {oesophagus, esophagus}, {leuko, leuco}, {cutis, skin}, {hemorrhage, bleeding}, {ascorbic, Vitamin C}, {ancylostoma, hookworm}
     • Mapping of interlingual synonyms: {blood, blut, sangue}, {liver, hepat..., fígado}, {kidney, nephr.., nefr.., nier.., ren, rim}

  8. What is a subword?
     • An atomic linguistic sense unit:
       • Morphemes: nephr, anti, thyr, scler, hepat, cardi
       • Morpheme aggregates: diaphys, ascorb, anabol, diagnost
       • Words: amyloid, bone, fever, liver
       • Exceptionally, noun groups: vitamin c, …
     • Subwords keep the growth rate of lexical resources sublinear

  9. Subword Delimitation Criteria
     • Semantic (compositionality): Hyper | cholesterol | emia
     • Lexical (enabling synonym matching): schleimhaut = mucosa (schleim | haut)
     • Data-driven (avoiding ambiguities and false segmentation), e.g. relationship, schwangerschaft (relation|ship, schwanger|schaft)
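The delimitation criteria can be illustrated with a dictionary-based segmenter that prefers the longest known subword at each position and backtracks when a choice leads to a dead end. This is a minimal sketch, not the MorphoSaurus parser. The toy lexicon is hypothetical, and the sketch assumes that words such as "relationship" and "schleimhaut" are stored whole so that they are never split further.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

# Hypothetical toy lexicon; a real subword lexicon holds tens of thousands of entries.
LEXICON = frozenset({
    "hyper", "cholesterol", "emia",
    "schleimhaut", "schleim", "haut",
    "relationship", "relation", "ship",
})

def segment(word: str, lexicon=LEXICON) -> Optional[List[str]]:
    """Split `word` into lexicon subwords, longest match first, with backtracking.

    Returns None when no complete segmentation exists.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def split_from(start: int) -> Optional[Tuple[str, ...]]:
        if start == len(word):
            return ()
        for end in range(len(word), start, -1):  # try the longest candidate first
            piece = word[start:end]
            if piece in lexicon:
                rest = split_from(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = split_from(0)
    return list(result) if result is not None else None

print(segment("Hypercholesterolemia"))
print(segment("relationship"))
```

Because the whole word is tried before any shorter prefix, an entry stored as a single subword takes priority over a split such as relation|ship.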

  10. The MorphoSaurus system
      • Extracts semantically relevant subwords from medical texts in different languages
      • Transforms IR-relevant content into concept-like semantic identifiers (MID = MorphoSaurus identifiers)

  11. Example
      Original:
        High TSH values suggest the diagnosis of primary hypo-thyroidism ...
        Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...

  12. Example
      Original:
        High TSH values suggest the diagnosis of primary hypo-thyroidism ...
        Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...
      Orthographic Normalization (Orthographic Rules):
        high tsh values suggest the diagnosis of primary hypo-thyroidism ...
        erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose ...
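The orthographic normalization step can be sketched as lowercasing followed by character transliteration. The slide shows only the effect on "ö" and "ä". The rule table below, including the entries for "ü" and "ß", is an assumption, and the function name is illustrative.

```python
# Assumed rule table: only the "ö" and "ä" mappings are visible in the slide example.
ORTHOGRAPHIC_RULES = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def normalize_orthography(text: str) -> str:
    """Lowercase the text, then transliterate characters listed in the rule table."""
    lowered = text.lower()
    return "".join(ORTHOGRAPHIC_RULES.get(ch, ch) for ch in lowered)

print(normalize_orthography(
    "Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose"
))
```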

  13. Example
      Original:
        High TSH values suggest the diagnosis of primary hypo-thyroidism ...
        Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...
      Orthographic Normalization (Orthographic Rules):
        high tsh values suggest the diagnosis of primary hypo-thyroidism ...
        erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose ...
      Morphosyntactic Parser (Lexicon):
        high tsh value s suggest the diagnos is of primar y hypo thyroid ism
        er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose

  14. Example
      Original:
        High TSH values suggest the diagnosis of primary hypo-thyroidism ...
        Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...
      Orthographic Normalization (Orthographic Rules):
        high tsh values suggest the diagnosis of primary hypo-thyroidism ...
        erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose ...
      Morphosyntactic Parser (Lexicon):
        high tsh value s suggest the diagnos is of primar y hypo thyroid ism
        er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose
      Semantic Normalization (Thesaurus), MID-Representation:
        upiiiij tsh valueiiqrij suggestiipzzr diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
        upiiiij tsh valueiiqrij permitiji diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw

  15. Example
      Original:
        High TSH values suggest the diagnosis of primary hypo-thyroidism ...
        Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...
      Orthographic Normalization (Orthographic Rules):
        high tsh values suggest the diagnosis of primary hypo-thyroidism ...
        erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose ...
      Morphosyntactic Parser (Lexicon):
        high tsh value s suggest the diagnos is of primar y hypo thyroid ism
        er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose
      Semantic Normalization (Thesaurus), MID-Representation:
        upiiij tsh valueiiqrij suggestiipzzr diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
        upiiij tsh valueiiqrij permitiji diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
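The last step of the example, semantic normalization, can be sketched as a per-language lookup that maps each subword to a language-independent identifier and drops subwords without an entry (affixes and function words). The identifiers below, such as "#up", are readable placeholders rather than real MIDs, and the thesaurus contents are reconstructed from this one example only.

```python
# Hypothetical thesaurus: placeholder identifiers, covering only the slide example.
THESAURUS = {
    "en": {"high": "#up", "tsh": "tsh", "value": "#value", "suggest": "#suggest",
           "diagnos": "#diagnost", "primar": "#primar", "hypo": "#small",
           "thyroid": "#thyre"},
    "de": {"hoeh": "#up", "tsh": "tsh", "wert": "#value", "erlaub": "#permit",
           "diagnos": "#diagnost", "primaer": "#primar", "hypo": "#small",
           "thyre": "#thyre"},
}

def to_mids(subwords, lang):
    """Map subwords to identifiers; subwords without an entry are dropped."""
    table = THESAURUS[lang]
    return [table[s] for s in subwords if s in table]

english = "high tsh value s suggest the diagnos is of primar y hypo thyroid ism".split()
german = ("er hoeh te tsh wert e erlaub en die diagnos e einer "
          "primaer en hypo thyre ose").split()

en_mids = to_mids(english, "en")
de_mids = to_mids(german, "de")
shared = [m for m in en_mids if m in de_mids]
print(en_mids)
print(de_mids)
print(shared)
```

As in the slide, the two sentences end up with the same identifiers except for the verb (suggest versus erlauben), which is what makes cross-language matching possible.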

  16. MorphoSaurus Thesaurus Features
      • Only two semantic relations:
        • Syntagmatic expansion: nephrotomiiqwjja = nephriikwjza + tomyiiqjqqa (to avoid known mis-segmentations, e.g. nephr + oto + mie)
        • Ambiguous readings: seitiiyqyqa = lateraliijwira OR pagerijjrja
      • Transforms IR-relevant content into concept-like semantic identifiers (MID = MorphoSaurus identifiers)
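The two thesaurus relations can be modelled as two lookup tables: one that expands an identifier into a fixed sequence, and one that lists alternative readings. The sketch below enumerates every reading of an identifier sequence. The identifiers are readable placeholders, not real MIDs, and the function names are illustrative.

```python
from itertools import product

# Placeholder identifiers standing in for the MIDs on the slide.
EXPANSIONS = {"#nephrotomy": ["#nephr", "#tomy"]}   # syntagmatic expansion
AMBIGUOUS = {"#seit": ["#lateral", "#page"]}        # ambiguous readings (OR)

def resolve(mid):
    """Return the alternative identifier sequences that `mid` stands for."""
    if mid in EXPANSIONS:
        return [list(EXPANSIONS[mid])]
    if mid in AMBIGUOUS:
        return [[alt] for alt in AMBIGUOUS[mid]]
    return [[mid]]

def readings(mids):
    """Enumerate all readings of a sequence by combining the alternatives per position."""
    return [
        [m for part in combo for m in part]
        for combo in product(*(resolve(mid) for mid in mids))
    ]

print(readings(["#nephrotomy"]))
print(readings(["#seit", "#nephrotomy"]))
```

An indexer could keep all readings (favouring recall) or pick one from context. The slide does not say which strategy is used.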

  17. Morphoedit Lexicon Editor

  18. State of the Project
      • Domain: clinical language and, partly, lay expressions
      • Validated entries: 21,397 English, 22,053 German, 15,029 Portuguese
      • Automatically generated entries: 8,992 Spanish subwords from Portuguese subwords

  19. CLIR Experiments (OHSUMED)
      • Manual translation of 106 English queries to German and Portuguese by medical experts
      • Baseline (QTR): machine translation / bilingual dictionaries
        • Google Translator to re-translate German/Portuguese queries to English
        • Additional search in a bilingual lexeme dictionary derived from the UMLS Metathesaurus
        • Stemming with the Porter stemming algorithm / stop word elimination
      • MorphoSaurus (MSI): normalization of queries and documents
      • Boolean search engine: frequency and adjacency measure
      • Results German: QTR 68%, MSI 93%
      • Results Portuguese: QTR 54%, MSI 62% (RIAO'04)

  20. Multilingual MeSH Mapping
      • Morpho-semantic normalization of 35,000 English Medline abstracts with manual MeSH annotations
      • Statistical learning of indexing patterns
      • Using indexing patterns for mapping of normalized English/German/Portuguese texts
      • Results: agreement with gold standard (agreement with human indexers)
        • English: 33% (68%)
        • German: 30% (62%)
        • Portuguese: 27% (56%) (RIAO'04)
