An Introduction to Taverna

An Introduction to Taverna Dr. Georgina Moulton and Stian Soiland The University of Manchester (georgina.moulton@manchester.ac.uk; ssoiland@cs.man.ac.uk ) (on behalf of the myGRID team)

Outline of the day • Introduction to workflows • Introduction to Taverna • Case-studies • Hands-on Taverna workshop • Build you own workflows • Explore features of Taverna • Taverna in a little more detail

What you will learn • No prior knowledge of workflow technology • By the end of the tutorial participant will know how to • install the workbench software, import and run existing workflows and build their own from components available on the public internet. • use the semantic search technologies in myGrid assist this process by enabling service discovery • do basic troubleshooting of workflows using Taverna's fault tolerance and debug mechanisms • manage the import and export of data to and from the workflow system.

What is Taverna? Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments • Access to local and remote resources and analysis tools • Automation of data flow • Iteration over large data sets

Workflows • Workflow language specifies how processes (web services) fit together • Describes what you want to do, not how you want to do it • High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows • Workflow is a kind of script or protocol that you configure when you run it. • Easier to explain, share, relocate, reuse and repurpose. • Workflow <=> Model • Workflow is the integrator of knowledge Predicted Genes out Sequence RepeatMasker Web service GenScan Web Service BlastWeb Service

Two types of workflows • Data workflows • A task is invoked once its expected data has been received, and when complete passes any resulting data downstream • Control workflows • A task is invoked once its dependant tasks have completed A B C D E F

Williams-Beuren Syndrome (WBS) • Contiguous sporadic gene deletion disorder • 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis • Haploinsufficiency of the region results in the phenotype • Multisystem phenotype – muscular, nervous, circulatory systems • Characteristic facial features • Unique cognitive profile • Mental retardation (IQ 40-100, mean~60, ‘normal’ mean ~ 100 ) • Outgoing personality, friendly nature, ‘charming’

~1.5 Mb 7q11.23 Patient deletions * * WBS SVAS GTF2IRD2P Physical Map CTA-315H11 ‘Gap’ CTB-51J22 FKBP6T POM121 GTF2IP NOLR1 NCF1P PMS2L STAG3 Chr 7 ~155 Mb Block B Block A Block C Williams-Beuren Syndrome Microdeletion Eicher E, Clark R & She, X An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Genetics Reviews (2004) 5:345-354 Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164 A-cen B-cen C-cen C-mid B-mid A-mid B-tel A-tel C-tel WBSCR1/E1f4H WBSCR5/LAB GTF2IRD1 WBSCR21 WBSCR22 WBSCR18 WBSCR14 GTF2IRD2 POM121 NOLR1 BAZ1B BCL7B FKBP6 GTF2I CLDN3 CLDN4 CYLN2 STX1A LIMK1 NCF1 TBL2 RFC2 FZD9 ELN

Filling a genomic gap in silico • Two steps to filling the genomic gap: • Identify new, overlapping sequence of interest • Characterise the new sequence at nucleotide and amino acid level • Number of issues if we are to do it the traditional way: • Frequently repeated – info rapidly added to public databases • Time consuming and mundane • Don’t always get results • Huge amount of interrelated data is produced

Traditional Bioinformatics 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Requirements • Automation • Reliability • Repeatability • Few programming skill required • Works on distributed resources

The Williams Workflows A B C A: Identification of overlapping sequence B: Characterisation of nucleotide sequence C: Characterisation of protein sequence

WBSCR21 WBSCR27 WBSCR24 WBSCR18 WBSCR22 WBSCR28 STX1A CLDN3 CLDN4 RP11-148M21 RP11-731K22 RP11-622P13 314,004bp extension All nine known genes identified (40/45 exons identified) The Biological Results Four workflow cycles totalling ~ 10 hours The gap was correctly closed and all known features identified WBSCR14 ELN CTA-315H11 CTB-51J22

Workflow Advantages • Automation • Capturing processes in an explicit manner • Tedium! Computers don’t get bored/distracted/hungry/impatient! • Saves repeated time and effort • Modification, maintenance, substitution and personalisation • Easy to share, explain, relocate, reuse and build • Releases Scientists/Bioinformaticians to do other work • Record • Provenance: what the data is like, where it came from, its quality • Management of data (LSID - Life Science Identifiers)

Benefit to the Scientist? • Automated plumbing • Systematic. Making boring stuff easier so can do more funky stuff. Data chaining replaces manual hand-offs. Accelerated creation of results. Repetitive and unbiased analysis. Potentially reproducible but not always. • Easier to use (but maybe not design) • Gives non-developers access to sophisticated codes and applications. Avoids need to download-install-learn how to use someone else's code. • A framework to leverage a community’s applications, services, datasets and codes • Honours original codes and applications. Heterogeneous coding styles and tools sets. The best applications. • Promoting community metadata and common formats & standards • A framework for extensibility, adaptability & innovation. • Add my code, reuse and repurpose

It’s more than plumbing…. • Workflows are protocols and records. • Explicit and precise descriptions of a scientific protocol • Scientific transparency. Easier to explain, share, relocate, reuse and repurpose and remember. • Provenance of results for credibility. • Workflows are know-how. • Specialists create applications; experts design and set parameters; inexperienced punch above their weight with sophisticated protocols • Workflows are collaborations. • Multi-disciplinary workflows promote even broader collaborations.

In silico experiment lifecycle

Finding and Sharing Tools 3rd Party Applications and Portals Taverna Workbench myExperiment DAS Utopia Feta Workflow Enactor Clients Workflow enactor Service Management LSIDs Log Metadata DefaultData Store Custom Store Results Management KAVE BAKLAVA Part of a bigger picture (which we will talk in more detail later)

Taverna Workflow Workbench

Taverna • Taverna is : • A workflow language based on a dataflow model. • A graphical editing environment for that language. • An invocation system to run instances of that language on data supplied by a user of the system. • When you download it you get all this rolled into a single piece of desktop software • The enactor can be run independently of the GUI • Java based, runs on Windows, Mac OS, Linux, Solaris …. • It doesn't necessarily run "on a grid". • Can be used to access resources, either on a grid, or anywhere else.

OMII-UK • Funded through the Open Middleware Infrastructure Institute (OMII-UK) as part of the myGrid project run by Carole Goble • Four years old, funding secured through 2008 and beyond. • Development team at Manchester & Hinxton, UK • Wide group of ‘friends and allies’ across the world particularly within UK eScience • Implemented in Java, released under LGPL licence.

Workflow diagram Available services Biomart query Soaplab operation wrapping an EMBOSS tool Tree view of workflow structure Version 1.5.1 Shown running on a Mac but written in Java, Runs & developed on Windows, OS X and Linux.

An Open World • Open domain services and resources. • Taverna accesses 3500+ operation. • Third party. • All the major providers • NCBI, DDBJ, EBI … • Enforce NO common data model. • Quality Web Services considered desirable • .

Services • Taverna can interoperate the following by default : • SOAP based web services • Biomart data warehouses • Soaplab wrapped command line tools • BioMoby services and object constructors (talk tomorrow) • Inline interpreted scripting (Java based) • Other service classes can be added through an extension point (but you probably don’t need to)

Multi-disciplinary • ~37000 downloads • Ranked 210 on sourceforge • Users in US, Singapore, UK, Europe, Australia, • Systems biology • Proteomics • Gene/protein annotation • Microarray data analysis • Medical image analysis • Heart simulations • High throughput screening • Phenotypical studies • Plants, Mouse, Human • Astronomy • Aerospace • Dilbert Cartoons

What do Scientists use Taverna for? • Data gathering and annotating • Distributed data and knowledge • Data analysis • Distributed analysis tools and • Data mining and knowledge management • Hypothesis generation and modelling

Case Study – Graves Disease • Autoimmune disease that causes hyperthyroidism • Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone • Original myGrid Case Study Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves? disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004

Graves Disease The experiment: • Analysing microarray data to determine genes differentially-expressed in Graves Disease patients and healthy controls • Characterising these genes (and any proteins encoded by them) in an annotation pipeline • From affymetrix probeset identifier, extract information about genes encoded in this region. • For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement

Annotation Pipeline Evidence includes: • SNPs in coding and non-coding regions • Protein products • Protein structure and functional features • Metabolic Pathways • Gene Ontology terms

Data Analysis • Access to local and remote analysis tool • You start with your own data / public data of interest • You need to analyse it to extract biological knowledge

Case study: Investigating Genotype-Phenotype Correlations in Trypanotolerance Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis.Nucleic Acids Res.35(16):5625-33

Susceptibility Infected host Why is one mouse resistant and another one susceptible? What did the immune system of the susceptible mouse do inappropriately? General Which of the strain differences between resistant/susceptible mice are significant? Which of the differently activated pathways in resistant/susceptible mice are significant? Biological Lessons Top-down: Immunology-driven Which strain differences can be found in resistant/ susceptible mice? Which pathways are differently activated in resistant/susceptible mice? Complex Which genes in a region are differently expressed in resistant/ susceptible mice and have SNPs? What are the expression levels of all genes involved in a particular pathway in resistant/susceptible mice? Composition Which genes are between two genes? Which genes are up-regulated in a data set? In which pathways is a set of genes involved? Bottom-up: Data driven Simple Data sources Transcriptome Genome Pathway

Bioinformatics Challenges • Linking from genotype to phenotype • Integrated ‘omics (GIMS) • Microarray analysis • Working with the literature • Presentation of results to non-bioinformaticians • Separating cause and effect

Genotype to Phenotype

Current Methods Genotype Phenotype 200 ? What processes to investigate?

Phenotype Genotype 200 ? Metabolic pathways Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping Genes captured in microarray experiment and present in QTL (Quantitative Trait Loci ) region Microarray + QTL

Trypanosomiasis (“Sleeping sickness”) • Trypanosoma species parasite • Human sleeping sickness => T.brucei • Cattle => T.congolense and T.vivax • Major problem on cattle production in sub-Saharan Africa • Symptoms include: • severe anaemia - weight loss • foetal abortion • cachexia and associated intermittent fever • Oedema - general loss of condition • Some breeds of cattle are tolerate mild and moderate infections

Trypanosomiasis • Quantitative Trait Loci data available for cattle and mouse • Issues – to identify the genetic difference responsible for resistance and breed them into productive cattle. • Only need to be right (not for the right reasons) • Model system in mice

Mouse Model • A/J or Balb/C strains are susceptible • C57BL/6 (B6) are resistant • QTL regions defined (Iraqi, Kemp, Gibson) • Tir1 (17.4-18.3cM Chr 17), Tir2 and Tir3 • Which genes are responsible for resistance?

Tir1 region • Contains > 130 genes including TNF and MHC region • Markers not mapped • Can microarray help? • Issues: What tissue, what time?

Data used • Samples were taken from Liver, Spleen and Kidney at time points 0, 3, 7, 9 and 17 days post infection for all three strains of mouse. In total 225 oligonucleotide arrays were used to capture cellular responses to infection, with 5 biological replicates per condition, and samples from 5 mice were used to create each biological replicate LOTS OF DATA RIGOUROUS STATISTICAL ANALYSIS

Phenotype Pathway A CHR literature Pathway linked to phenotype – high priority QTL Gene A Pathway B Gene B literature Pathway not linked to phenotype – medium priority Gene C Pathway C Genotype literature Pathway not linked to QTL – low priority

Key: A – Retrieve genes in QTL region B – Annotate genes with external database Ids C – Cross-reference Ids with KEGG gene ids D – Retrieve microarray data from MaxD database E – For each KEGG gene get the pathways it’s involved in F – For each pathway get a description of what it does G – For each KEGG gene get a description of what it does

Workflow Breakdown • Stage 1: Microarray Analyses using MADAT and R • Stage 2: Finding genes in the QTL • Stage 3: Finding pathways • Stage 4: Ranking gene lists by SNPs in susceptible vs resistant strains

Finding Genes in the QTL • Find where probesets are in genomic sequence • Get list of genes in that region – by searching mmusculus_Ensembl and then both Uniprot and Entrez gene • Get the corresponding ids from KEGG • Concatenate gene lists and remove duplicates

An Introduction to Taverna

An Introduction to Taverna

Presentation Transcript

Taverna: From Biology to Astronomy

An Introduction to Taverna Workflows

An Introduction to Running, Reusing and Sharing Workflows with Taverna – part 2

An introduction to Taverna workflows Franck Tanoh

AN INTRODUCTION TO:

An Introduction to:

AN INTRODUCTION TO:

AN INTRODUCTION TO:

Taverna Workflow

An Introduction to Designing and Executing Workflows with Taverna

AN INTRODUCTION TO:

Taverna Roadmap

An Introduction to Taverna Workflows

Taverna

Taverna

An Introduction to Designing, Executing and Sharing Workflows with Taverna

An Introduction to Designing and Executing Workflows with Taverna

An Introduction to

AN INTRODUCTION TO:

Introduction to Workflows with Taverna

An Introduction to Designing, Executing and Sharing Workflows with Taverna