Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature

Hammad Afzal, Robert Stevens, Goran Nenadic. School of Computer Science, University of Manchester. G.Nenadic@manchester.ac.uk

Presentation Transcript


  1. Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature • Hammad Afzal, Robert Stevens, Goran Nenadic • School of Computer Science, University of Manchester • G.Nenadic@manchester.ac.uk

  2. Motivation • A number of bioinformatics tools and resources are available for service use and composition • guesstimate is 3000+ Web Services publicly available • how to find a service, what is out there to use? • provenance? • Semantic annotation of bioinformatics services • annotate functional capabilities • e.g. Taverna, myGrid, myExperiment, EBI, BioMOBY • Not only services and tools • databases, repositories, corpora

  3. Motivation • Manual curation • e.g. myGrid, BioCatalogue etc. • e.g. Taverna/Feta: only ~15-20% functionally described • backlog – and the number of services is growing • Annotations combine • textual descriptions • ontological mappings

  4. Example text • multiple local align. • Soaplab ontological descriptions

  5. BioCatalogue • Single registration point for Web Service providers • Single search site for scientists and developers • Place where the community can find contacts and meet the experts and maintainers of these services • Community-sourced annotation, expert oversight • Mixed annotations: free text, tags, controlled vocabularies, community ontologies

  6. BioCatalogue • Beta version at http://beta.biocatalogue.org/ • Launch June 2009 at ISMB

  7. Our approach • Collect service semantic descriptions by extracting and integrating information from text resources • full text bioinformatics journal publications • Approach: • identify descriptors that are used for service and resource annotations • locate them in text • infer the annotations • textual evidence and mappings to an ontology

  8. The rest of the talk • Methodology • mining bioinformatics terminology • extraction of service description profiles • Experiments and results • semi-automated curation • What next?

  9. Methodology (pipeline diagram) • Corpus • Information Retrieval • Domain Ontology (e.g. myGrid) • Identifying Topic Related Terms • Sentence Filtering • Text Mining Engine (Information Extraction) • Semantic Description of Services • Semantic Network of Services • Service Discovery

  10. Bioinformatics terminology • Learn bioinformatics terms from literature: 1) get a corpus 2) get all terms 3) get seed examples 4) find relevant ones using term profiling and comparison to seed examples

  11. Bioinformatics terminology • Use seed terms to bootstrap • e.g. known descriptors used in existing service descriptions, either in literature or service repositories • 250 terms identified, manual pruning after automatic term recognition • examples of lexical constituents and textual behaviour (pragmatics) • lexical profiling • contextual profiling

  12. Bioinformatics terminology • Lexical profiling • what is in the name • Contextual profiling • characterise sentences in which terms appear (nouns, verbs and context-patterns) • Comparing candidate term profiles to • average seed term • best-match
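
The comparison of candidate term profiles to seed profiles can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not name a similarity measure, so cosine similarity over bag-of-feature counts is an assumption, as are the function names and the toy feature counts.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count profiles."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_profile(profiles: list) -> Counter:
    """Feature-wise mean of the seed profiles (the 'average seed term')."""
    total = Counter()
    for p in profiles:
        total.update(p)
    return Counter({f: c / len(profiles) for f, c in total.items()})


def score_candidate(candidate: Counter, seeds: list) -> dict:
    """Score a candidate against the average seed and against its best-matching seed."""
    return {
        "average_seed": cosine(candidate, average_profile(seeds)),
        "best_match": max(cosine(candidate, s) for s in seeds),
    }


# Toy contextual profiles with hypothetical feature counts (assumed, not from the talk).
seeds = [
    Counter({"verb:align": 2, "noun:sequence": 3}),
    Counter({"verb:retrieve": 1, "noun:database": 2}),
]
candidate = Counter({"verb:align": 1, "noun:sequence": 1})
scores = score_candidate(candidate, seeds)
```

Ranking candidates by either score and keeping the top of the list would give the kind of shortlist that domain experts could then review.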

  13. Bioinformatics terminology • Two domain experts evaluated the top 300 terms

  14. Semantic classes – myGrid • Informatics concepts • general concepts of data, data structures, databases, metadata • Bioinformatics concepts • domain-specific data sources and algorithms for searching and analysing data • e.g. Smith-Waterman algorithm

  15. Semantic classes – myGrid • Molecular biology concepts • higher level concepts used to describe bioinformatics data types, used as inputs and outputs in services • e.g. protein sequence, nucleic acid sequence • Task concepts • generic tasks a service operation can perform • e.g. retrieving, displaying, aligning

  16. Semantic classes • Engineered from the myGrid bioinformatics sub-ontology

  17. Semantic classes and instances

  18. Semantic classes and instances

  19. Service mentions • Named-entity recognition (NER) task • Recognition of service mentions using • terminological (semantic) heads of automatically recognised terms • Apollo2Go Web Service is an Application • BIND database is a Data source • assign the corresponding semantic class • Hearst patterns (co-ordinations, appositions, enumerations, etc.)
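
Head-based class assignment might look like the sketch below. It assumes, as a simplification, that the semantic head is the rightmost word or words of the term; the lexicon contains only the two mappings shown on the slide, and the function and variable names are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical head-to-class lexicon; the two entries mirror the slide's examples.
HEAD_CLASSES = {
    "web service": "Application",
    "database": "Data source",
}


def classify_mention(term: str) -> Optional[Tuple[str, str]]:
    """Return (service name, semantic class) if the term ends in a known head."""
    lowered = term.lower()
    # Try longer heads first so a multi-word head wins over a shorter overlapping one.
    for head in sorted(HEAD_CLASSES, key=len, reverse=True):
        if lowered.endswith(" " + head):
            name = term[: len(term) - len(head)].strip()
            return name, HEAD_CLASSES[head]
    return None


apollo = classify_mention("Apollo2Go Web Service")
bind = classify_mention("BIND database")
other = classify_mention("protein sequence")
```

Terms whose head is not in the lexicon return `None`; in a fuller pipeline they would be left for the pattern-based step.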

  20. Semantic descriptors • Recognition of phrases depicting semantic roles • used to describe services • Flexible dictionary look-up • terms from myGrid ontology • terms/noun phrases from existing descriptions of bioinformatics resources (collected from Taverna and other Web service providers).
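
One way to read "flexible dictionary look-up" is matching after light normalisation. The sketch below assumes that flexibility means case folding, treating hyphens as spaces, and stripping a simple plural "s"; the slides do not specify the matching rules, so these choices and all names are illustrative.

```python
import re


def normalise(phrase: str) -> str:
    """Lowercase, split on hyphens/whitespace, and strip a simple trailing plural 's'."""
    tokens = re.split(r"[\s\-]+", phrase.lower().strip())
    stripped = [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]
    return " ".join(t for t in stripped if t)


def build_index(descriptors):
    """Map each normalised descriptor back to its dictionary form."""
    return {normalise(d): d for d in descriptors}


def find_descriptors(sentence, index, max_len=4):
    """Return dictionary descriptors matching any word n-gram of the sentence."""
    words = re.findall(r"[A-Za-z0-9\-]+", sentence)
    found = []
    for n in range(max_len, 0, -1):  # longest n-grams first
        for i in range(len(words) - n + 1):
            key = normalise(" ".join(words[i:i + n]))
            if key in index and index[key] not in found:
                found.append(index[key])
    return found


# Dictionary entries taken from examples in the slides; the sentence is invented.
index = build_index(["protein sequence", "nucleic acid sequence"])
hits = find_descriptors(
    "The tool aligns Protein-sequences and nucleic acid sequences.", index
)
```

Because the dictionary and the text pass through the same `normalise` function, crude stemming errors cancel out on both sides.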

  21. Mining service descriptions

  22. Extraction/functional rules • Predicate-driven rules: each verb associated with the type of “information content” it provides

  23. Extraction/functional rules • Manually designed predicate-driven rules: Subject (Arg) – Verb (Predicate) – Object (Arg) • Applied on dependency-parsed sentences • Stanford parser • no phrase structures • complex sentences • information in sub-clause
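
A predicate-driven rule over Subject – Verb – Object triples could be sketched as below. The sketch assumes the triples have already been produced by a dependency parser (no parser is called here), and the rule table, type names, and example triples are hypothetical rather than the rules used in the talk.

```python
from dataclasses import dataclass

# Hypothetical rule table: verb lemma -> type of information its object provides.
PREDICATE_RULES = {
    "align": "task",
    "retrieve": "task",
    "use": "input",
    "produce": "output",
}


@dataclass(frozen=True)
class Triple:
    subject: str
    verb: str  # lemmatised predicate
    obj: str


def apply_rules(triples, service_names):
    """Keep triples whose subject is a known service and whose verb has a rule."""
    facts = []
    for t in triples:
        info_type = PREDICATE_RULES.get(t.verb)
        if info_type and t.subject in service_names:
            facts.append((t.subject, info_type, t.obj))
    return facts


# Invented triples for illustration only.
triples = [
    Triple("Kalign", "align", "protein sequences"),
    Triple("Kalign", "compare", "MSA programs"),  # no rule for this verb
    Triple("authors", "use", "test sets"),        # subject is not a service
]
facts = apply_rules(triples, {"Kalign"})
```

Keeping the rules in a plain table makes it straightforward to add a predicate without touching the matching logic.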

  24. Extraction/functional rules • Phrase structures identified and integrated with the dependency parse • Predicate-dependent rules applied to extract specific ‘content’ and profile the services • Profiles collated for all mentions • service name variation

  25. Semantic service profiles • For a given service, collection of • descriptors, including parameters • links to other related instances • related myGrid ontology semantic labels • “informative” sentences
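
A service profile with the four parts listed above, collated across name variants, might be represented as in this sketch. The field names, the `canonical` normalisation (dropping separators and lowercasing), and the mention data are assumptions for illustration; the slides do not describe how name variation is resolved.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ServiceProfile:
    """Aggregated description of one service, following the four parts on the slide."""
    name: str
    descriptors: set = field(default_factory=set)
    related_instances: set = field(default_factory=set)
    ontology_labels: set = field(default_factory=set)
    sentences: list = field(default_factory=list)


def canonical(name: str) -> str:
    """Hypothetical normalisation for name variants: drop separators, lowercase."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def collate(mentions: list) -> dict:
    """Merge per-mention extractions into one profile per canonical service name."""
    profiles = {}
    for m in mentions:
        key = canonical(m["name"])
        profile = profiles.setdefault(key, ServiceProfile(name=m["name"]))
        profile.descriptors.update(m.get("descriptors", []))
        profile.related_instances.update(m.get("related", []))
        profile.ontology_labels.update(m.get("labels", []))
        if m.get("sentence"):
            profile.sentences.append(m["sentence"])
    return profiles


# Toy mentions; descriptors and labels are invented for the example.
mentions = [
    {"name": "GeneClass", "descriptors": ["motif", "expression data"],
     "sentence": "We extend the original GeneClass algorithm to use all target "
                 "genes for which both motif and expression data is available."},
    {"name": "Gene-Class", "descriptors": ["motif"], "labels": ["algorithm"]},
]
profiles = collate(mentions)
```

Both mentions end up in a single profile keyed by `geneclass`, with the first surface form kept as the display name.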

  26. Example – GeneClass • Descriptors

  27. Example – GeneClass • Functions, parameters

  28. Example – GeneClass • Sentences • “We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available.” • “In order to study different aspects of target gene regulation we use different sets of motifs and parents with the GeneClass algorithm.” • “The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs; representing known or putative regulatory element sequence patterns and a candidate set of regulators or parents.”

  29. Experiments • 2120 BMC Bioinformatics articles • full-text articles before March 2008 • Service descriptors dictionary • 471 descriptors from myGrid/Feta • 450 descriptors collected from other bioinformatics service/tools providers • 108 predicates used

  30. Experiments • Number of candidate resources

  31. Experiments • Number of descriptions collected using rules

  32. Evaluation of semantic profiles • Evaluated for their capability to be used for semantic description of a given bioinformatics resource • irrelevant • partially useful • useful • Example sentences: • HeatMapper: “The HeatMapper tool has already proven to be very useful in several studies.” • Kalign: “To compare Kalign to other MSA programs, the following test sets were used.” • Cognitor: “To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program.”

  33. Evaluation of semantic profiles • Two experiments: • 5 well-known resources with descriptions already available • excellent rating for sentences • average rating for semantic descriptors • predicate functions • 5 new, unknown resources • excellent rating for sentences • average rating for semantic descriptors • predicate functions

  34. What next? • Good recall, poor precision • context needs a better model • Mining parameter values • sub-language of parameters • Candidate service/resource mentions • an entity whose profile looks like a service • comparison of semantic profiles • network of services [ISMB 2009] • Do we have good service ontologies? • http://gnode1.mib.man.ac.uk/bioinf

  35. Conclusion • Literature mining approach to service description and annotation • Aims • reduce curation efforts • provide semantic synopses of services for the Semantic Web • Potential of text mining • integration with other annotation approaches • extracting the entire service context is still challenging

  36. Acknowledgements • gnTEAM (text extraction, analytics, mining): H. Yang, I. Spasic, H. Afzal, A. Gledson, J. Eales, M. Greenwood, F. Sarafraz • myGrid team: Franck Tanoh • BBSRC • “Mining term associations from literature to support knowledge discovery in biology” (2005-2008) • “pubmed2ensembl” (2009-2010) • “BioCatalogue” (2008-2011)

  37. Announcement • Journal of BioMedical Semantics • published by BioMed Central • launched at ISMB 2009 • Topics include • Infrastructure for biomedical semantics • semantic resources and repositories • meta-data management and resource description • knowledge representation and semantic frameworks • Biomedical Semantic Web • life-long management of semantic resources • Semantic mining, annotation and analysis
