Information Extraction, Service Discovery and Semantic Services in HealthGrid Applications


Presentation Transcript


  1. Information Extraction, Service Discovery and Semantic Services in HealthGrid Applications Martin Hofmann Department of Bioinformatics

  2. Challenges in HealthGrids • Information explosion in the Life Sciences • Highly parallel experimental procedures (e.g. 30,000 genes on one microarray) • Dominance of descriptive information, mostly in poorly structured text • Insufficient representation of knowledge in databases / knowledge bases • Insufficient integration of biomedical data

  3. Data Integration in HealthGRIDs • Excursus on the SIMDAT Project • Pharma Activity • LION bioscience Ltd. • NEC Research Laboratories (CCRLE) • Fraunhofer SCAI Dept. of Bioinformatics • Free University of Brussels / EMBnet Node Belgium • GlaxoSmithKline (GSK) • University of Karlsruhe

  4. SRS Screenshot

  5. SRS Servers Within a Multi-Site Organisation • Central Server: low maintenance, no independence, no fault tolerance • Cooperative Servers: maximum independence, data exchange necessary, high maintenance • Federated Servers: max. resource sharing, max. cooperation, technologically more difficult [Diagram legend: site with SRS server; site without SRS server]

  6. Characteristics of SRS Federation • One server knows all resources shared among servers: databanks; tools; canned queries and data views; users (access rights, user sessions, personalization) • Fail-safe mechanism if a local databank or tool fails • Servers in an SRS federation can automatically synchronize and exchange: meta information; indices (optional); flat files (optional) • An SRS federation behaves like a single SRS server • Redundancy can be configured to improve performance and fail safety

  7. Technical Challenges and Opportunities • Collaboration of servers to: follow and execute cross-database queries; build composite objects and reports with information from several databases • Optimization: find optimal paths for database linking (cross-database queries); use 'clever' algorithms to minimize the number of transactions between servers when building composite objects and reports
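
The "optimal paths for database linking" idea can be illustrated with a minimal sketch. It assumes that "optimal" means the fewest link hops between databanks, which the slide does not state, and the link graph below is hypothetical rather than taken from SRS.

```python
from collections import deque

# Hypothetical link graph: each databank lists the databanks it cross-references.
LINKS = {
    "EMBL": ["UniProt"],
    "UniProt": ["PDB", "InterPro"],
    "InterPro": ["PDB"],
    "PDB": [],
}

def shortest_link_path(links, source, target):
    """Breadth-first search for the chain of links with the fewest hops."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of links connects the two databanks

print(shortest_link_path(LINKS, "EMBL", "PDB"))
```

A real federation would probably weight each hop, for example by index size or by whether the hop crosses a server boundary; breadth-first search is only the unweighted special case.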

  8. Information Extraction for HealthGRIDs • Towards a "Semantic Hub" Covering the Biomedical Name Space

  9. Growth of life science data (here: entries in EMBL) exceeds the growth rate of compute (CPU) power as described by Moore's law • An update every second [Chart: EMBL megabases vs. Moore's Law] Schema taken (with permission) from Graham Cameron, Deputy Director of the European Bioinformatics Institute (EBI)

  10. Growth rate of Medline • Updates: Since 2002, between 1,500-3,500 completed references are added each day Tuesday through Saturday; over 571,000 total added during 2004. Source: http://www.nlm.nih.gov/pubs/factsheets/medline.html

  11. Effort for Information (better: knowledge) Retrieval as Defined by PubMed Searches Source: http://www.nlm.nih.gov/pubs/factsheets/medline.html

  12. A Basic Observation • The more complex a subject, the more likely you will find it only in unstructured text

  13. One Possible Solution: Information Extraction • Computer-Aided, Automated Information Extraction

  14. Some Specific Problems of Information Extraction from Life Science Publications • Multiple names for one gene • Ambiguous names in databases • Common word names • Multi-word terms • Spelling variants • Permutations • Nested names [Examples shown on the slide: WAS, STEP, iCE, StAR; Interleukin 1 alpha, Tumor necrosis factor beta; p21, EPO, large T antigen; TNF receptor 1; collagen, type I, alpha receptor; Neuronectin, GMEM, tenascin, HXB, cytotactin, hexabrachion; F12A; Collagen, type I, alpha 1; Collagen alpha 1(I) chain; Alpha 1 collagen; Alpha-1 type I collagen; COL1A1]

  15. Protein and Gene Name Recognition • Semi-automated generation of biomedical dictionaries. Example: Human Protein Dictionary with ~20,000 objects and ~160,000 synonyms. • Mapping tables allow linkage of extracted protein and gene objects to experimental data (e.g. gene expression data, Gene Ontology, …). • Scoring algorithm for multi-word term disambiguation based on token classes (Hanisch et al., 2003)* • Fast, approximate matching algorithm for rapid, distributed entity recognition in scientific text. • Fraunhofer SCAI ProMiner and the Biological Entity Recognition Module search ~13,000,000 MEDLINE abstracts for all human protein and gene names overnight on an 8-CPU parallel computer** * Hanisch D, Fluck J, Mevissen HT, Zimmer R. Pac Symp Biocomput. 2003;:403-14. ** If a 36-node SUN cluster is used, it takes about 90 minutes.
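
To make dictionary-based name recognition concrete, here is a minimal sketch. It is not the ProMiner algorithm: it only normalises case and punctuation and then looks up the longest matching token sequence, with no scoring or token classes. The dictionary entries and the identifiers `COL1A1` and `TNC` are chosen for illustration.

```python
import re

# Hypothetical mini-dictionary: object identifier -> known synonyms.
DICTIONARY = {
    "COL1A1": ["COL1A1", "collagen alpha 1(I) chain", "alpha-1 type I collagen"],
    "TNC": ["tenascin", "cytotactin", "hexabrachion"],
}

def tokens(text):
    """Lowercase and split into letter runs and digit runs, so that
    'Alpha-1' and 'alpha 1' produce the same token sequence."""
    return tuple(re.findall(r"[a-z]+|[0-9]+", text.lower()))

def build_index(dictionary):
    """Map every normalised synonym to its object identifier."""
    return {tokens(s): obj for obj, syns in dictionary.items() for s in syns}

def find_entities(text, index):
    """Scan left to right, always taking the longest dictionary match."""
    words = tokens(text)
    longest = max(len(key) for key in index)
    hits, i = [], 0
    while i < len(words):
        for n in range(min(longest, len(words) - i), 0, -1):
            obj = index.get(words[i:i + n])
            if obj:
                hits.append(obj)
                i += n
                break
        else:
            i += 1  # no synonym starts at this token
    return hits

index = build_index(DICTIONARY)
print(find_entities("Mutations in alpha-1 type I collagen and Tenascin were found.", index))
```

Exact lookup on normalised tokens handles spelling variants such as hyphenation, but not permutations or common-word names; those are the cases the scoring step on the slide addresses.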

  16. Critical Evaluation of Information Extraction Approaches in Molecular Biology and Genome Research: BioCreative

  17. mouse yeast fly Overview on Results of the BioCreative Competition

  18. Example for the Application of Text Mining

  19. Gene – Disease Associations: Osteoarthritis • Relationship Between a Specific Disease and Protein Names • used co-occurrence of disease terms (MeSH) and genes • used a statistical measure to determine significant associations • Protein-protein interaction networks representing the top-scoring 70 proteins associated with osteoarthritis (red: significant associations; white: no significant association)
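
The slide does not name the statistical measure. As one possible choice, the sketch below uses a one-sided hypergeometric test: given how many abstracts mention the disease and how many mention the gene, how likely is it to see at least the observed number of joint mentions by chance? The function name and the counts are illustrative assumptions.

```python
from math import comb

def cooccurrence_p_value(total, with_disease, with_gene, with_both):
    """P(X >= with_both) when with_gene abstracts are drawn at random from
    total abstracts, of which with_disease carry the disease term."""
    upper = min(with_disease, with_gene)
    numerator = sum(
        comb(with_disease, i) * comb(total - with_disease, with_gene - i)
        for i in range(with_both, upper + 1)
    )
    return numerator / comb(total, with_gene)

# Toy counts: 10 abstracts, 5 with the disease term, 5 with the gene, all 5 shared.
print(cooccurrence_p_value(10, 5, 5, 5))
```

A small p-value marks the pair as a candidate association. With many genes tested against one disease, a multiple-testing correction would be needed before calling any of them significant.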

  20. Osteoarthritis Sub-Network • Zooming into a protein-protein interaction sub-network and relationships between proteins involved in osteoarthritis

  21. Use of a Concept-Based Semantic Hub in HealthGRIDs • Using the named-entity recognition machinery for distributed indexing of databases and documents • Information extraction from distributed documents • Large-scale information extraction NOT limited to MEDLINE • Semantic mediation through populated ontologies

  22. Reconstruction of Pharmaceutical Information Recognition and Reconstruction of Chemical Structures

  23. Information on Chemical Structures in Scientific Text

  24. Aim: Multimodal Extraction and Reconstruction of Chemical Structure Information from Patents and other Scientific Text • Image Analysis / Structure Reconstruction • Text Analysis / Entity Recognition • Reconstruction of Published ChemSpace including PatentSpace [Figure labels: -CH3, -CH2-CH3, -CH2-CNHS, -COOH]

  25. Reconstruction of Chemical Structure Information from Images: Design of the "chemOCR" System

  26. Structure Reconstruction Workflow [diagram components] • chemical cartridge • SVG converter • line filter module • approx. graph matcher • molecular graph converter • BMP • PDF • filter rules • common fragments module • chemical rules module • molecule database • machine learning tool • manual curation tool

  27. Character Recognition and Resolution of Superatoms:

  28. Correction of conversion errors (BMP to SVG) • Identifying disconnected bonds: Relative Neighbourhood Graph (RNG)
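
For readers unfamiliar with the Relative Neighbourhood Graph: two points p and q are joined if no third point r is closer to both of them than they are to each other. The sketch below computes it by brute force over 2D points. How chemOCR applies the graph to line endpoints is not described in the transcript, so the input here is just a list of coordinates.

```python
from itertools import combinations
from math import dist

def relative_neighbourhood_graph(points):
    """Return index pairs (p, q) such that no third point r satisfies
    max(d(p, r), d(q, r)) < d(p, q). Brute force, O(n^3)."""
    edges = []
    for p, q in combinations(range(len(points)), 2):
        d_pq = dist(points[p], points[q])
        blocked = any(
            max(dist(points[p], points[r]), dist(points[q], points[r])) < d_pq
            for r in range(len(points))
            if r not in (p, q)
        )
        if not blocked:
            edges.append((p, q))
    return edges

# Three collinear points: the two outer points are not relative neighbours.
print(relative_neighbourhood_graph([(0, 0), (1, 0), (2, 0)]))
```

One plausible use, stated here as an assumption, is to treat RNG edges between nearby line endpoints as candidates for bonds that the vectorisation step broke apart.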

  29. Most common fragment patterns:

  30. Graph Matching Example • Input graph • Subgraph isomorphisms found • Fragments used for the reconstruction • Decomposition network
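
Finding a known fragment inside an input graph is a subgraph matching problem. The sketch below is a plain backtracking search that returns every injective, edge-preserving mapping of a pattern graph into a target graph. It ignores atom and bond labels and is not the approximate matcher or the decomposition network named on the slides; the graphs are hypothetical.

```python
def subgraph_matches(pattern, target):
    """pattern, target: dict mapping node -> set of neighbours (undirected).
    Yields dicts mapping each pattern node to a distinct target node such
    that every pattern edge is also an edge in the target."""
    p_nodes = list(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            yield dict(mapping)
            return
        node = p_nodes[len(mapping)]
        for cand in target:
            if cand in mapping.values():
                continue  # keep the mapping injective
            # Already-mapped neighbours of node must be neighbours of cand.
            if all(mapping[nb] in target[cand]
                   for nb in pattern[node] if nb in mapping):
                mapping[node] = cand
                yield from extend(mapping)
                del mapping[node]

    yield from extend({})

edge = {"a": {"b"}, "b": {"a"}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(list(subgraph_matches(edge, path)))
```

Each orientation of a match is reported separately, so a single-edge pattern matches a three-node path four times. A chemistry-aware matcher would also compare element and bond types before accepting a candidate.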

  31. chemOCR - Prototype: SVG graphics Line filtering and matching Graph editor and file conversion

  32. First Results:

  33. Grid Service Annotation and Discovery A Universal, Easy-to-use Tool for Grid Service and Data Annotation [a first result of our work in the SIMDAT project]

  34. Top level classes • Scientific Domain • Scientific Theory • Method • Tool • Workflow • Experiment • Data • Repository Methodology

  35. Domains

  36. method Scientific Methodology

  37. Data

  38. From Domain to Data

  39. Grid Service Annotation • Just like ontologies, semantic annotations built on those ontologies have to be stored centrally • Annotations should ideally not disturb the annotated entities (grid services, data, ...) • => Non-destructive annotation: store the annotations safely in a third place • Annotations as Subject – Predicate – Object (S P O), e.g. "Service 'xyz' provides BLASTX search" [Diagram labels: Ontology Classes; Grid Services]
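
Subject – Predicate – Object annotations kept apart from the annotated services can be sketched as a small in-memory triple store. The class name, method names, and identifiers such as `service:xyz` are hypothetical and do not come from the SIMDAT tool.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

class AnnotationStore:
    """Holds annotations in one central place; the annotated services
    and data sets themselves are never modified."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def find(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if subject in (None, t.subject)
            and predicate in (None, t.predicate)
            and obj in (None, t.object)
        )

store = AnnotationStore()
store.add("service:xyz", "provides", "BLASTX search")
store.add("service:abc", "provides", "BLASTN search")
print(store.find(obj="BLASTX search"))
```

Service discovery then reduces to a pattern query over the triples. In practice the predicates and objects would be drawn from the ontology classes rather than free-text strings.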

  40. Acknowledgement • Chemical Structure Reconstruction • Le Thuy Bui Thi • Marc Zimmermann • Tanja Fey • Grid Service Discovery and Annotation • Kai Kumpf • Extraction of Biological Information • Juliane Fluck • Theo Mevissen • Hartwig Deneke • Daniel Hanisch • Prof. Ralf Zimmer* • Florian Sohler* • Katrin Fundel*
