Bioinformatics workflow integration

Yike Guo / Jiancheng Lin
InforSense Ltd.
23 September 2014

Life Science Challenges
• Information resides on different:
  • Granularity levels (individual records vs. massive repositories)
  • Abstraction levels (models ranging from entire systems to compound patterns)
  • Domain levels (clinical, sequence, instrument…)
• Researchers
  • Grouped in Virtual Organizations (VOs)
  • Working on the Grid
  • Need to communicate across physical and scientific/cultural barriers
• Tools
  • Legacy, well-established in the process
  • Novel, essential to innovation
  • In need of a consistent infrastructure to connect the two groups

Discovery Informatics in Post-Genome Era
[Figure: data types brought together in discovery informatics. Labels: sequences (ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT), alignments, secondary structure, tertiary structure, polymorphism, expression patterns, receptors, signals, pathways, physiology, patient records, epidemiology, linkage maps, cytogenetic maps, physical maps]

Integrative Analytics Workflow Environment
[Figure: architecture diagram. Labels: Oracle DM, Portal, Workflow Warehouse, Informatician, Deployed Web App for End Users, Data Analysis Group, Data, Applications, Components, Data Preprocess, Inbuilt Analytics, 3rd Party & Custom Apps, Files, Web Services, Oracle, Matlab, S-Plus, R, SAS, BioTeam iNquiry, MDL, WEKA, Spotfire, KXEN, Daylight, DB, Healthcare]

InforSense Workflow Life Cycle
• Constructing a ubiquitous workflow: by scientists
  • Integrate your information resources/software applications cross-domain
  • Support innovation and capture the best practice of your scientific research
• Warehousing workflows: for scientists
  • Manage discovery processes in your organisation
  • Construct an enterprise process knowledge bank
• Deployment workflow: to scientists
  • Turn your workflows into reusable applications
  • Turn every scientist into a solution builder

Workflow Creation, Integration, and Deployment
1 Select:
  • Data Sources
  • Data Mining / Statistics
  • Data Processing / Transformation
  • 3rd Party applications (e.g. Haploview)
  • Interactive data visualization / reporting
2 Connect:
  • Connect data and components in GUI
  • Workflow describes complex data processing and analysis
3 Execute:
  • “In database” processing & analytics
  • “Cluster / Grid” execution
4 Deploy:
  • Define parameters of workflow to expose
  • Publish as: portlet, web application, SOAP service, command line app
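The select / connect / execute / deploy cycle can be illustrated with a minimal sketch in plain Python. This is not the InforSense API: the `Workflow` class, its `connect`, `execute` and `deploy` methods, the `<step>__<param>` naming scheme and the three toy components are all hypothetical, chosen only to show how a chain of components can be published with a restricted set of exposed parameters.

```python
from typing import Any, Callable, Dict, List, Tuple


class Workflow:
    """Hypothetical linear workflow: each step receives the previous step's output."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Callable[..., Any], Dict[str, Any]]] = []

    def connect(self, name: str, func: Callable[..., Any], **params: Any) -> "Workflow":
        # "Connect": append a component with its default parameters.
        self.steps.append((name, func, params))
        return self

    def execute(self, data: Any, **overrides: Any) -> Any:
        # "Execute": run the steps in order; overrides are named "<step>__<param>".
        for name, func, params in self.steps:
            merged = dict(params)
            prefix = name + "__"
            for key, value in overrides.items():
                if key.startswith(prefix):
                    merged[key[len(prefix):]] = value
            data = func(data, **merged)
        return data

    def deploy(self, exposed: List[str]) -> Callable[..., Any]:
        # "Deploy": return a callable that accepts only the exposed parameters.
        def app(data: Any, **kwargs: Any) -> Any:
            unknown = set(kwargs) - set(exposed)
            if unknown:
                raise TypeError(f"parameters not exposed: {sorted(unknown)}")
            return self.execute(data, **kwargs)

        return app


# "Select": three toy components (illustrative only).
def load(rows, column):
    return [row[column] for row in rows]


def keep_above(values, threshold=0.0):
    return [v for v in values if v > threshold]


def mean(values):
    return sum(values) / len(values) if values else None


wf = (
    Workflow()
    .connect("load", load, column="expr")
    .connect("filter", keep_above, threshold=1.0)
    .connect("summary", mean)
)
app = wf.deploy(exposed=["filter__threshold"])
rows = [{"expr": 0.5}, {"expr": 2.0}, {"expr": 4.0}]
print(app(rows))                          # uses the default threshold
print(app(rows, filter__threshold=3.0))   # end user changes the exposed parameter
```

Restricting the deployed callable to the exposed parameters mirrors the idea that the workflow author decides what end users may change; a real deployment would wrap the same callable in a portlet, web application or service.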

Biology to Chemistry
• Novel sequences are compared to known protein structures
• The resulting set of ligands on these matching structures is used to search small molecule databases for similar compounds
• Compounds are then analyzed using KDE tools such as PCA and clustering to provide a diverse, representative subset for further assays
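To make the "diverse, representative subset" step concrete, here is a small sketch of one way to pick a diverse subset from numeric compound descriptors. It uses greedy max-min picking, which is a swapped-in technique for illustration and not the PCA and clustering nodes the slide refers to. The function name `maxmin_pick`, the descriptor vectors and the compound names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def maxmin_pick(descriptors: Dict[str, Sequence[float]], n: int) -> List[str]:
    """Greedily pick up to n items, each as far as possible from those already picked."""
    names = sorted(descriptors)
    if not names or n <= 0:
        return []
    picked = [names[0]]  # deterministic seed: first name in sorted order
    # Distance from every item to its nearest already-picked item.
    nearest = {m: math.dist(descriptors[m], descriptors[picked[0]]) for m in names}
    while len(picked) < min(n, len(names)):
        remaining = [m for m in names if m not in picked]
        choice = max(remaining, key=lambda m: nearest[m])
        picked.append(choice)
        for m in names:
            nearest[m] = min(nearest[m], math.dist(descriptors[m], descriptors[choice]))
    return picked


# Made-up two-dimensional descriptors for four hypothetical compounds.
compounds = {
    "c1": (0.0, 0.0),
    "c2": (0.1, 0.0),
    "c3": (10.0, 0.0),
    "c4": (5.0, 5.0),
}
print(maxmin_pick(compounds, 3))
```

In this toy data set `c2` is nearly identical to `c1`, so it is the one left out when three compounds are requested.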

Navigating KEGG pathways
• Gene names from EMBL are used to query KEGG via their Webservice API for appropriate pathways
• Further Webservice API calls allow navigation of the data to find:
  • Pathway compounds
  • Other genes in the pathways
  • Visualization of query genes on their pathways
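The gene-to-pathway lookup can be sketched as follows. The sketch assumes KEGG's REST-style `link` operation at `https://rest.kegg.jp`, which returns two tab-separated columns (source entry, linked entry); that interface is an assumption here and is not necessarily the web service used in the original workflow. The helper names are hypothetical, and the identifiers in the sample response are placeholders, not real KEGG entries.

```python
from typing import Dict, List
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # assumed base URL of the KEGG REST service


def link_url(target_db: str, entries: List[str]) -> str:
    """Build a 'link' request URL; multiple entries are joined with '+'."""
    joined = "+".join(entries)
    return f"{KEGG_REST}/link/{target_db}/{joined}"


def parse_link(text: str) -> Dict[str, List[str]]:
    """Parse a two-column, tab-separated 'link' response into a mapping."""
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        source, target = line.split("\t")[:2]
        mapping.setdefault(source, []).append(target)
    return mapping


def pathways_for_genes(genes: List[str]) -> Dict[str, List[str]]:
    """Fetch pathways for the given gene identifiers (requires network access)."""
    with urlopen(link_url("pathway", genes)) as response:
        return parse_link(response.read().decode("utf-8"))


# Placeholder response text, for illustration only.
sample = "org:gene1\tpath:org00001\norg:gene1\tpath:org00002\norg:gene2\tpath:org00001\n"
print(parse_link(sample))
```

Under the same assumption, pointing `link_url` at a different target database with a pathway identifier as the entry would cover the follow-up navigation steps, such as listing the compounds or the other genes on a pathway.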

cDNA sequence annotation and alignment
• A novel cDNA is annotated using EMBOSS tools, and a BLAST similarity search is performed against human proteins
• Annotations are used to aid identification of predicted proteins derived from the cDNA

Ortholog analysis using BLAST
• Sequence libraries from 2 organisms are cross-compared using BLAST to determine the best bi-directional matches of sufficient quality
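The "best bi-directional matches" step is commonly implemented as a reciprocal best hit filter. The sketch below assumes the BLAST results for each direction have already been reduced to `(query, subject, bit score)` tuples; the function names, the score cutoff of 50 and the sequence identifiers are illustrative assumptions, not values from the original workflow.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit], min_score: float) -> Dict[str, str]:
    """Keep, for each query, the highest-scoring subject at or above min_score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if score < min_score:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit], min_score: float = 50.0
) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, min_score)
    backward = best_hits(b_vs_a, min_score)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up hits between organism A and organism B.
a_vs_b = [("A1", "B1", 200.0), ("A1", "B2", 80.0), ("A2", "B2", 150.0),
          ("A3", "B1", 120.0), ("A4", "B4", 30.0)]
b_vs_a = [("B1", "A1", 195.0), ("B2", "A2", 140.0), ("B2", "A1", 90.0),
          ("B4", "A4", 35.0)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data `A3` is dropped because its best hit `B1` prefers `A1`, and the `A4`/`B4` pair is dropped because both scores fall below the cutoff.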

Clustering of Affymetrix data with R
• Native Affymetrix CEL files are loaded using R/Bioconductor
• Differentially expressed genes calculated using KDE statistical nodes
• The resulting list of genes is then clustered using HCLUST in R
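The final clustering step can be illustrated with a plain-Python stand-in for hierarchical clustering. The workflow itself calls `hclust` in R; the sketch below is not that implementation but a naive agglomerative procedure with Euclidean distance and complete linkage, cut at a requested number of clusters. The function name and the expression profiles are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def complete_linkage(profiles: Dict[str, Sequence[float]], k: int) -> List[List[str]]:
    """Merge the two closest clusters until only k remain (complete linkage)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[name] for name in sorted(profiles)]

    def distance(c1: List[str], c2: List[str]) -> float:
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up expression profiles for four hypothetical genes across three arrays.
genes = {
    "g1": [1.0, 1.1, 0.9],
    "g2": [1.2, 1.0, 1.0],
    "g3": [5.0, 5.2, 4.9],
    "g4": [5.1, 5.1, 5.0],
}
print(complete_linkage(genes, 2))
```

The pairwise search makes this cubic in the number of genes, which is acceptable for a short list of differentially expressed genes but not for a full array.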

Microarray analysis using text mining
• Microarray data normalized in KDE
• Upregulated genes annotated from PubMed to obtain a set of related scientific papers
• Text mining used to mine the paper collection and extract information most relevant to the researcher
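One simple way to surface the papers "most relevant to the researcher" is to rank the retrieved abstracts against query terms with a TF-IDF score. The slides do not say which text mining method KDE applies, so the sketch below is only an assumed, generic example; the function names, the smoothed IDF formula and the document texts are all illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(docs: Dict[str, str], query: str) -> List[Tuple[str, float]]:
    """Score each document by the summed TF-IDF weight of the query terms."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    scores = []
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        score = 0.0
        for term in set(tokenize(query)):
            if term in term_counts:
                tf = term_counts[term] / total
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed
                score += tf * idf
        scores.append((doc_id, score))
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Placeholder texts standing in for retrieved abstracts.
abstracts = {
    "d1": "insulin resistance in mouse liver",
    "d2": "yeast cell cycle",
    "d3": "insulin signalling and insulin receptor",
}
for doc_id, score in rank_documents(abstracts, "insulin"):
    print(doc_id, round(score, 3))
```

In practice the query terms would come from the annotated upregulated genes, and the documents from the abstracts retrieved for them.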

BAIR project
Biological Atlas of Insulin Resistance
[Figure: study design over time, Normal Diet then Fat Fed; 6 to 10 animals; endpoint: culling or death]
• Genetic data
• Mouse ID
• Cage ID
• Environmental conditions
• Management records
• Physiological data after change in diet
  • One time point in end-point experiment
  • Several time points in longitudinal study
  • Weight
  • Blood analysis
    • Physiological parameters
    • Metabonomics
  • Urine analysis
    • Physiological parameters
    • Metabonomics
  • Tissue sampling
    • Liver, Fat, Muscle, Kidney
    • Metabonomics
    • Proteomics (general, glyco-, phospho-proteomics)
    • Transcriptomics
  • Culling conditions
• Physiological data prior to change in diet
  • Weight
  • Blood analysis
  • Urine analysis
• Data formats
  • Affymetrix
  • XLS files
  • Chromatograms
  • Filemaker Pro
  • Metabonomics
  • NMR spectra
  • Raw data
  • Normalised data
  • Processed data
• Sampling conditions
• Sample storage conditions
• Ref of biological assays used across the study
Similar data will be recorded regarding experiments performed with cell lines
cDNA arrays: ATF, GAL files

Collaborative Visualisation

Literature mining and compound analysis

Grid Computing

BAIR Portal

Integrative support
• Information:
  • Data models to support individual domains (sequences, NMR profiles…) and methods to map them into generic analysis (tables, text)
  • Annotation databases integrated through Web Service APIs
• Researchers
  • Sharing of work and knowledge through reusable workflow components
  • Aim for minimum technical overhead when linking new resources
• Tools
  • Focus on integration methods rather than one-off tool linkage
  • Researchers able to link to standard tools without the need for an IT specialist
  • Databases accessed through aggregators (SRS, BioMart…)
