Web Services, Workflows & Taverna: Superglue for the Semantic Web • Tom Oinn – EMBL-EBI, tmo@ebi.ac.uk • http://mygrid.org.uk • http://taverna.sf.net
Who are we? • myGrid • An EPSRC-funded ‘eScience Pilot Project’ • Based across multiple sites in the UK • Taverna • A tethered spin-off of the myGrid project • Aimed at producing powerful tools to complement the basic research work [Image: EBI Hinxton Campus]
What is Taverna? • Allows scientists to graphically construct complex processes in the form of workflows • What is a workflow? • Set of activities that make up a process • Definitions about how data moves between these activities • The user specifies what to do but not how to do it • Insulates users from the complexity of distributed computing
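The definition of a workflow above (activities plus definitions of how data moves between them) can be sketched in a few lines. This is a minimal illustration, not Taverna's workflow language or engine: the activity functions, port names and `run_workflow` helper are all hypothetical. The user only declares activities and data links; the execution order is derived from the links.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: each takes a dict of named inputs
# and returns a dict of named outputs.
def fetch_sequence(inputs):
    return {"sequence": "acgtacgt"}  # stand-in for a remote service call

def to_upper(inputs):
    return {"sequence": inputs["sequence"].upper()}

def count_bases(inputs):
    return {"length": len(inputs["sequence"])}

# What to do: the set of activities.
activities = {"fetch": fetch_sequence, "upper": to_upper, "count": count_bases}

# How data moves: (source activity, output port, sink activity, input port).
links = [
    ("fetch", "sequence", "upper", "sequence"),
    ("upper", "sequence", "count", "sequence"),
]

def run_workflow(activities, links):
    """Run every activity once, in an order derived from the data links."""
    # Each sink activity depends on the activities that feed it.
    deps = {name: set() for name in activities}
    for src, _, dst, _ in links:
        deps[dst].add(src)

    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        # Gather this activity's inputs from upstream outputs.
        inputs = {
            dst_port: outputs[src][src_port]
            for src, src_port, dst, dst_port in links
            if dst == name
        }
        outputs[name] = activities[name](inputs)
    return outputs

results = run_workflow(activities, links)
print(results["count"])
```

Nothing in `activities` or `links` says when or where each step runs, which is the sense in which the user specifies what to do but not how to do it.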
myGrid, Taverna and WBS • The WBS group is one of several early adopters of Taverna • Manchester-based group working on Williams-Beuren Syndrome in the medical genetics department • Workflows written by life scientists, not computer scientists • Following slides stolen at the last minute from Hannah Tipney at Manchester!
Williams-Beuren Syndrome (WBS) • Contiguous sporadic gene deletion disorder • 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis • Haploinsufficiency of the region results in the phenotype • Multisystem phenotype – muscular, nervous, circulatory systems • Characteristic facial features • Unique cognitive profile • Mental retardation (IQ 40-100, mean ~60; ‘normal’ mean ~100) • Outgoing personality, friendly nature, ‘charming’
[Figure: physical map of the Williams-Beuren Syndrome microdeletion, ~1.5 Mb at 7q11.23 on chromosome 7 (~155 Mb). Labels: duplicated blocks A, B and C (copies A-cen, B-cen, C-cen, A-mid, B-mid, C-mid, A-tel, B-tel, C-tel); a gap between clones CTA-315H11 and CTB-51J22; genes and pseudogenes GTF2IRD2P, FKBP6T, POM121, GTF2IP, NOLR1, NCF1P, PMS2L, STAG3, FKBP6, FZD9, BAZ1B, BCL7B, TBL2, WBSCR14, WBSCR18, WBSCR22, STX1A, WBSCR21, CLDN3, CLDN4, ELN, LIMK1, WBSCR1/EIF4H, WBSCR5/LAB, RFC2, CYLN2, GTF2IRD1, GTF2I, NCF1, GTF2IRD2]
References: • Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354 • Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
[Figure: diagram of the sequence analysis experiment, starting from a query nucleotide sequence. Labels recoverable from the diagram:] • Data items: query nucleotide sequence, GenBank accession no., URL inc. GB identifier, GenBank entry, nucleotide seq (Fasta), ‘sort for appropriate sequences only’ • RepeatMasker: repetitive elements • Seqret: nucleotide seq (Fasta) • Promoter prediction, TF binding prediction, regulation element prediction: identify regulatory elements in genomic sequence • cpgreport: CpG island locations and % • restrict: restriction enzyme map • GenScan: coding sequence • sixpack: 6 ORFs • transeq: amino acid translation • prettyseq: translation/sequence file, good for records and publications • Blastn vs nr, est databases; tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr (BLASTwrapper, ncbiBlastWrapper) • epestfind: identifies PEST seq • pscan: identifies FingerPRINTS • pepstats: MW, length, charge, pI, etc. • pepcoil: predicts coiled-coil regions • SignalP, TargetP, PSORTII: predict cellular location • InterPro: identifies functional and structural domains/motifs • Pepwindow? Octanol?: hydrophobic regions
Analysis via ‘Cut and Paste’
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
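The sequence on this slide is laid out GenBank-style: a position number followed by groups of ten bases. Part of the tedium of cut-and-paste analysis is stripping that layout before a tool will accept the sequence. A minimal sketch of that one step, assuming the input looks like the excerpt above; the function names and the FASTA identifier are made up for illustration, and this is not how Seqret or any Taverna service is implemented.

```python
def genbank_lines_to_sequence(text):
    """Strip position numbers and spacing from GenBank-style sequence lines."""
    bases = []
    for line in text.splitlines():
        fields = line.split()
        # Sequence lines start with a numeric position; skip anything else.
        if not fields or not fields[0].isdigit():
            continue
        bases.extend(fields[1:])
    return "".join(bases)

def to_fasta(identifier, sequence, width=60):
    """Format a plain sequence as a FASTA record with fixed-width rows."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(rows) + "\n"

# First two lines of the slide's excerpt.
excerpt = """\
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
"""

sequence = genbank_lines_to_sequence(excerpt)
fasta = to_fasta("excerpt_12181_12300", sequence)  # hypothetical identifier
print(fasta)
```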
Workflows • A: Identification of overlapping sequence • B: Characterisation of nucleotide sequence • C: Characterisation of protein sequence
The Biological Results • Four workflow cycles totalling ~10 hours • The gap was correctly closed and all known features identified • 314,004 bp extension • All nine known genes identified (40/45 exons identified) [Figure labels: clones CTA-315H11, RP11-148M21, RP11-731K22, RP11-622P13, CTB-51J22; genes WBSCR14, WBSCR21, WBSCR27, WBSCR24, WBSCR18, WBSCR22, WBSCR28, STX1A, CLDN3, CLDN4, ELN]
Different Kinds of Services • Pure web services are not always the solution • Abstraction Level? • Typing? • Description? • Data Volumes? • Taverna employs a hybrid architecture which includes web services amongst other components
Complex Invocation Patterns • E.g. Soaplab uses a typical factory pattern: ‘create job’, ‘set parameter’, ‘run task’, ‘wait’, ‘get results’, ‘destroy task’ • Multiple web service calls per conceptual operation • Handled in Taverna by embedding this invocation pattern within a Soaplab processor
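The idea of hiding a multi-call invocation pattern behind a single conceptual operation can be sketched as follows. This is an illustration under stated assumptions only: `FakeAnalysisService` is an in-memory stand-in whose method names mirror the steps listed on the slide, not the real Soaplab interface, and `FactoryPatternProcessor` is a hypothetical wrapper, not Taverna's Soaplab processor.

```python
import itertools
import time

class FakeAnalysisService:
    """In-memory stand-in for a remote factory-style service (hypothetical API)."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)
        self.calls = []  # records each call, to show how many round trips occur

    def create_job(self):
        self.calls.append("create_job")
        job_id = next(self._ids)
        self._jobs[job_id] = {"params": {}, "polls_left": 2, "result": None}
        return job_id

    def set_parameter(self, job_id, name, value):
        self.calls.append("set_parameter")
        self._jobs[job_id]["params"][name] = value

    def run(self, job_id):
        self.calls.append("run")
        job = self._jobs[job_id]
        job["result"] = job["params"]["sequence"][::-1]  # toy analysis: reverse

    def is_finished(self, job_id):
        self.calls.append("is_finished")
        job = self._jobs[job_id]
        job["polls_left"] -= 1  # pretend the job needs a couple of polls
        return job["polls_left"] <= 0

    def get_results(self, job_id):
        self.calls.append("get_results")
        return self._jobs[job_id]["result"]

    def destroy(self, job_id):
        self.calls.append("destroy")
        del self._jobs[job_id]

class FactoryPatternProcessor:
    """Presents the whole create/set/run/wait/get/destroy cycle as one call."""

    def __init__(self, service, poll_interval=0.0):
        self.service = service
        self.poll_interval = poll_interval

    def invoke(self, **params):
        job_id = self.service.create_job()
        try:
            for name, value in params.items():
                self.service.set_parameter(job_id, name, value)
            self.service.run(job_id)
            while not self.service.is_finished(job_id):
                time.sleep(self.poll_interval)
            return self.service.get_results(job_id)
        finally:
            # Always release the server-side job, even if a step fails.
            self.service.destroy(job_id)

service = FakeAnalysisService()
processor = FactoryPatternProcessor(service)
result = processor.invoke(sequence="acgt")
print(result, service.calls)
```

From the workflow author's side there is one operation, `invoke`; the recorded `calls` list shows the several service calls made on its behalf.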
Large Data Sets • No explicit limit to message size in WS specs but… • Most common toolkits equally terrible at handling large data. • WS Standards for bulk data transfer insufficiently mature or lacking interoperability. • Transfer references across WS calls, transfer actual data ‘out of band’ • More info from Jon later, handled in Taverna via a Styx Grid Service plugin.
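The 'transfer references, move the data out of band' idea can be sketched with a local temporary file standing in for the bulk-data channel. This is only an illustration: the `produce` and `consume` functions are hypothetical, a `file://` URI is used purely for convenience, and nothing here reflects how the Styx Grid Service plugin works.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

def produce(n_bytes):
    """Producer: write bulk data out of band and return only a small reference."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    with os.fdopen(fd, "wb") as out:
        out.write(b"acgt" * (n_bytes // 4))
    return Path(path).as_uri()  # this short string is all the message carries

def consume(reference, chunk_size=64 * 1024):
    """Consumer: resolve the reference and stream the data in chunks."""
    path = url2pathname(urlparse(reference).path)
    digest = hashlib.sha256()
    total = 0
    with open(path, "rb") as data:
        for chunk in iter(lambda: data.read(chunk_size), b""):
            digest.update(chunk)
            total += len(chunk)
    return total, digest.hexdigest()

reference = produce(1_000_000)
total, checksum = consume(reference)
print(len(reference), "characters of reference for", total, "bytes of data")

# Clean up the stand-in bulk-data store.
os.remove(url2pathname(urlparse(reference).path))
```

The message passed between the two steps stays a short string however large the data grows, and the consumer never holds more than one chunk in memory.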
Service Description • WS standards fail to address the description of a service. • Registries – UDDI is an old standard and predates work on semantic description • BioMoby and myGrid include Semantic Description and Discovery components. • Search for services by task, by input or by past involvement in another workflow • Essential for AI assisted workflow construction
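Searching for services by task or by input relies on each service carrying a semantic description. A minimal sketch, assuming a toy is-a ontology and a three-entry in-memory registry: the ontology terms, the annotations attached to each service name and the `find_services` function are all invented for illustration and do not represent the BioMoby or myGrid registries.

```python
from dataclasses import dataclass

# Hypothetical is-a ontology: child term -> parent term.
IS_A = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "similarity_search": "analysis",
    "translation": "analysis",
}

def is_a(term, ancestor):
    """True if term equals ancestor or descends from it in the ontology."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False

@dataclass(frozen=True)
class ServiceDescription:
    name: str
    task: str
    input_type: str
    output_type: str

# Illustrative annotations only.
REGISTRY = [
    ServiceDescription("blastn", "similarity_search", "nucleotide_sequence", "blast_report"),
    ServiceDescription("blastp", "similarity_search", "protein_sequence", "blast_report"),
    ServiceDescription("transeq", "translation", "nucleotide_sequence", "protein_sequence"),
]

def find_services(task=None, accepts=None):
    """Find services by the task they perform and/or the data type they accept.

    A service matches a task if its task is-a the requested one; it accepts
    a data type if that type is-a the service's declared input type.
    """
    return [
        s.name
        for s in REGISTRY
        if (task is None or is_a(s.task, task))
        and (accepts is None or is_a(accepts, s.input_type))
    ]

by_task = find_services(task="similarity_search")
by_input = find_services(accepts="nucleotide_sequence")
combined = find_services(task="analysis", accepts="protein_sequence")
print(by_task, by_input, combined)
```

A keyword registry could only match exact strings; the is-a lookup is what lets a broad request such as `task="analysis"` find services annotated with more specific terms.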
Multiple Service Types • Colour key: BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue)
Taverna Demo • There should be a live demo of the Workflow Workbench here…
Obtaining Taverna • Taverna is available under the LGPL from our project site on Sourceforge.net • http://taverna.sourceforge.net • Release 1.0 as of the 20th Jan 2005 (after twelve beta releases) • Includes online and downloadable user manual, examples etc. • Support via project mailing lists
myGrid and WBS People! • Core: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe • Users: Simon Pearce and Claire Jennings (Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK); Hannah Tipney, May Tassabehji, Andy Brass (St Mary’s Hospital, Manchester, UK) • Postgraduates: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire • Industrial: Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM); Robin McEntire (GSK) • Collaborators: Keith Decker
Acknowledgements • myGrid is an EPSRC-funded UK eScience Program Pilot Project • Particular thanks to the other members of the Taverna project, http://taverna.sf.net