We live in a data-rich world in which biological data such as sequences, microarrays, and proteomics are abundant. This presentation discusses using Linked Data and RDF (Resource Description Framework) to create new connections among biological entities and concepts, and so improve discovery in biology. It shows how SPARQL can query biological data such as gene functions and pathways, which makes it possible to integrate multi-omics data for analysis. The approach aims to expand knowledge and drive innovation in the life sciences, which may in turn drive adoption of the Semantic Web much as high-energy physics drove the early Web.
Improving Discovery in Biology through Linked Data
Helena F. Deus
Data, data everywhere
• Sequences
• Microarrays
• Electrophoresis
• Crystallography
• In vitro experiments
Sources: http://www.lbl.gov/publicinfo/newscenter/pr/2008/PBD-microarray.html; http://www.biologyreference.com/Dn-Ep/Electrophoresis.html; http://biology.kenyon.edu/courses/biol114/Chap08/Chapter_08a.html
Ingredients for Linked Data
• Use the Resource Description Framework (RDF) to create relationships between named things
• Discover new links by reusing ontologies and vocabularies
• Name things and concepts using URIs (Uniform Resource Identifiers)
Diagram labels: http://uniprot.org/EGFR; label EGFR; genomicLocation 7p12.1; sameAs http://geneontology.org/EGFR; westernBlot rdfs:subClassOf image; rdf:type
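The idea of naming things with URIs and relating them through RDF can be sketched with plain Python tuples. This is a minimal illustration, not a real RDF library: the triples are one possible reading of the slide's flattened diagram, the bare predicate names are kept as they appear there, and `describe` is a hypothetical helper introduced only for this example.

```python
from collections import defaultdict

# Each RDF statement is a (subject, predicate, object) triple.
# Assumption: these triples are one reading of the slide's diagram,
# not an actual dataset.
TRIPLES = [
    ("http://uniprot.org/EGFR", "label", "EGFR"),
    ("http://uniprot.org/EGFR", "genomicLocation", "7p12.1"),
    ("http://uniprot.org/EGFR", "sameAs", "http://geneontology.org/EGFR"),
    ("westernBlot", "rdfs:subClassOf", "image"),
]


def describe(subject, triples):
    """Collect every predicate/object pair stated about one subject."""
    properties = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            properties[p].append(o)
    return dict(properties)


egfr = describe("http://uniprot.org/EGFR", TRIPLES)
for predicate, objects in egfr.items():
    print(predicate, "->", ", ".join(objects))
```

Because every statement has the same three-part shape, new facts from other sources can be appended to the list without changing any schema.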
Ingredients for Linked Data
• SPARQL, the query language of the Web of Data
Example triple: http://uniprot.org/EGFR :overExpressedIn Alzheimer’s
Example SPARQL triple patterns:
?Gene :overExpressedIn ?Disease
?Gene :hasFunction ?GOterm
?Pathway :hasParticipant ?Gene
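To show how the three triple patterns on this slide combine, here is a small pure-Python sketch of basic graph pattern matching, the core idea behind a SPARQL query. It is not a SPARQL engine. Only the `:overExpressedIn` triple comes from the slide; the `ex:` terms and the `extend` and `solve` helpers are hypothetical and exist only for illustration.

```python
# Toy data. Only the first triple is taken from the slide; the rest are
# hypothetical placeholders so that every pattern has something to match.
TRIPLES = [
    ("http://uniprot.org/EGFR", ":overExpressedIn", "Alzheimer's"),
    ("http://uniprot.org/EGFR", ":hasFunction", "ex:someGOterm"),
    ("ex:somePathway", ":hasParticipant", "http://uniprot.org/EGFR"),
    ("ex:otherGene", ":hasFunction", "ex:someGOterm"),
]

# The three patterns from the slide; terms starting with "?" are variables.
PATTERNS = [
    ("?Gene", ":overExpressedIn", "?Disease"),
    ("?Gene", ":hasFunction", "?GOterm"),
    ("?Pathway", ":hasParticipant", "?Gene"),
]


def extend(binding, pattern, triple):
    """Return the binding extended to fit the triple, or None on a mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solve(patterns, triples):
    """Join the patterns one by one, keeping only consistent bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = extend(binding, pattern, triple)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


RESULTS = solve(PATTERNS, TRIPLES)
for row in RESULTS:
    print(row)
```

The shared variable `?Gene` is what joins the patterns: `ex:otherGene` has a function but is dropped because it is not over-expressed in any disease in the toy data.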
Integrate Biological Data - the easy way
• NCBI: nih:EGFR; nci:has_description epidermal growth factor receptor; nih:sequence CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGG …; nih:organism Homo sapiens; nih:interacts nih:EGF
• Reactome: rea:EGFR; rea:keyword rea:Membrane, rea:Receptor, rea:Transferase
• nih:EGFR sameAs rea:EGFR
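The integration step on this slide rests on a single `sameAs` link between the NCBI and Reactome records. The sketch below shows one way such a link could merge two descriptions into one. It assumes a particular reading of the flattened diagram (which property hangs off which node is a guess), leaves out the truncated sequence, and uses hypothetical helper names `find` and `merge_same_as`.

```python
# Assumed reading of the slide's diagram: NCBI facts use the nih:/nci:
# prefixes, Reactome facts use rea:, and one sameAs link connects them.
TRIPLES = [
    ("nih:EGFR", "nci:has_description", "epidermal growth factor receptor"),
    ("nih:EGFR", "nih:organism", "Homo sapiens"),
    ("nih:EGFR", "nih:interacts", "nih:EGF"),
    ("rea:EGFR", "rea:keyword", "rea:Membrane"),
    ("rea:EGFR", "rea:keyword", "rea:Receptor"),
    ("rea:EGFR", "rea:keyword", "rea:Transferase"),
    ("nih:EGFR", "sameAs", "rea:EGFR"),
]


def find(parent, node):
    """Follow parent links to the canonical representative of a node."""
    while parent.setdefault(node, node) != node:
        node = parent[node]
    return node


def merge_same_as(triples):
    """Group all properties of sameAs-linked resources under one name."""
    parent = {}
    # First pass: union every pair of resources linked by sameAs.
    for s, p, o in triples:
        if p == "sameAs":
            parent[find(parent, o)] = find(parent, s)
    # Second pass: file every other statement under the canonical subject.
    merged = {}
    for s, p, o in triples:
        if p != "sameAs":
            subject = find(parent, s)
            merged.setdefault(subject, {}).setdefault(p, set()).add(find(parent, o))
    return merged


MERGED = merge_same_as(TRIPLES)
for predicate, objects in sorted(MERGED["nih:EGFR"].items()):
    print(predicate, "->", sorted(objects))
```

After the merge, a single lookup on `nih:EGFR` returns the NCBI organism and interaction data together with the Reactome keywords, without either source having changed its own identifiers.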
The Linked Data Cloud “Life sciences will drive adoption of the Semantic Web, just as high-energy physics drove the early Web.” - Sir Tim Berners-Lee, 2005
Building a Knowledge Continuum
• Knowledge: top-down approaches; formal logical models to be validated by reality; knowledge re-engineering bottleneck
• Linked Data Cloud
• Data: bottom-up approaches; data-driven knowledge generation
Biological Knowledge Continuum
• Metabolomics, Protein 3D structure, Microarrays, Proteomics, Transcriptomics, Genomics, Electrophoresis, Sequencing
Mapping genes to their functional roles Src: Science Jan 2010: Vol. 327 no. 5964 pp. 425-431
Querying the UCSC Genome Browser
• Look up annotation for all genes with functions similar to protein P04637
SQL:
select uniProt.gene.val, go.association.term_id, go.term.name
from uniProt.gene, go.gene_product, go.association, go.term
where uniProt.gene.acc = 'P04637'
and go.gene_product.symbol = uniProt.gene.val
and go.gene_product.id = go.association.gene_product_id
and go.association.term_id = go.term.id
SPARQL (diagram labels): uniprot:P04637; ?gene; :product; go:term; ?goterm
Ack: Nigam Shah & Eric Prud’hommeaux
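To make the four-table join in the SQL above concrete, the following sketch runs the slide's query against toy in-memory SQLite databases. Everything about the data is an assumption: the tables contain only the columns the query touches, the row ids are made up, and the single GO term row is illustrative. The real UCSC tables are much larger and are not reproduced here.

```python
import sqlite3

# The query text follows the slide; only whitespace has been adjusted.
QUERY = """
select uniProt.gene.val, go.association.term_id, go.term.name
from uniProt.gene, go.gene_product, go.association, go.term
where uniProt.gene.acc = 'P04637'
  and go.gene_product.symbol = uniProt.gene.val
  and go.gene_product.id = go.association.gene_product_id
  and go.association.term_id = go.term.id
"""


def build_toy_databases():
    """Create two attached in-memory schemas with one made-up row each."""
    conn = sqlite3.connect(":memory:")
    # Attached schemas stand in for the separate uniProt and go databases.
    conn.execute("attach database ':memory:' as uniProt")
    conn.execute("attach database ':memory:' as go")
    conn.execute("create table uniProt.gene (acc text, val text)")
    conn.execute("create table go.gene_product (id integer, symbol text)")
    conn.execute(
        "create table go.association (gene_product_id integer, term_id integer)"
    )
    conn.execute("create table go.term (id integer, name text)")
    # Toy rows: the numeric ids are arbitrary placeholders.
    conn.execute("insert into uniProt.gene values ('P04637', 'TP53')")
    conn.execute("insert into go.gene_product values (1, 'TP53')")
    conn.execute("insert into go.association values (1, 10)")
    conn.execute("insert into go.term values (10, 'DNA binding')")
    return conn


ROWS = build_toy_databases().execute(QUERY).fetchall()
print(ROWS)
```

The SQL has to spell out each join condition between tables, whereas the graph form on the slide expresses the same lookup as a short chain of triple patterns starting from uniprot:P04637.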
How about Experimental Results?
• ~20 000 genes
• ~100 interesting genes/proteins
• ~10 interesting pathways
• ~5 proteins testable in the lab
Linked Data; High-throughput technologies; Literature; Browse databases; Computational statistics; Hypothesis Generation
“I like to call it low-input, high-throughput, no-output biology.”
From genes to discovery Drugbank ClinicalTrial OMIM MDM2 EGFR PTEN KIT PDGFRA NME4 ARL6IP6 NOTCH1 unknown MTHFD2
Linking genes to diseases to drugs Sources: Marc Vidal; Albert-Laszlo Barabasi; Michael Cusick; Proceedings of the National Academy of Sciences
Linked data to follow MRSA spread UK MRSA Portugal MRSA
Can we model Systems Biology? Src: Nature Reviews 2010:11; 414-426 Ras CPLA2 RAF MEK ERK
Start using Linked Data NOW!! http://sindice.com HELENA.DEUS@DERI.ORG http://www.w3.org/wiki/HCLSIG/LODD/Data
Who are we talking to?
• At NUIG:
• Professor Cathal Seoighe
• Professor Frank Barry