
Download Presentation

Plant Genomics & Bioinformatics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Plant genomics bioinformatics

Plant Genomics & Bioinformatics

Jen Taylor : Bioinformatics Leader

CSIRO Plant Industry

EMBL Australia April 2011


Overview

  • Introductions

    • Definitions - Bioinformatics

    • Our scope

  • Plant Genomics and Bioinformatics

    • Research aims

    • “Unique” challenges

    • Case Study 1 : Solving QTLs with data integration

  • Our challenges and activities

    • Building the biology informatics interface

      • Peopleware

      • Infrastructure – software and systems

    • Ideas for CSIRO-EMBL interaction

CSIRO. EMBL Australia - April 2011 - Jen Taylor


Plant Genomes – Research Aims

  • Molecular drivers of phenotype

    • Performance traits

      • Food security

    • Stress resistance

      • Environment

      • Disease

  • Molecular profiles of diversity

    • Genotypic diversity

    • Robustness to

      • Climate change

      • Human impact


Crop Bioinformatics

  • Large, complex, repeat-rich genomes

  • Lack of a reference genome – de novo

  • Ploidy

    • Genome duplication, Gene duplication – large gene families

  • Pan genome / kingdom

    • Large evolutionary distances

    • Plants, microbial, viral, fungal


…and yet, large gains are being made

[Figure: panels for Rice and Wheat. Source: Feuillet et al., Trends in Plant Science, February 2011, Vol. 16, No. 2]


Plant Genomes – “Unique Challenges”

  • No / partially sequenced reference genome

  • Ploidy

    • Genome duplication, Gene duplication – large gene families

  • Genome Size

    • Large range of genome sizes


Plant Genomes – Haploid Size

[Figure: haploid genome sizes compared for Human, Arabidopsis, Rice, Potato, Sugarcane, Cotton and Barley]


Plant Genomes – Total Size

[Figure: total genome sizes compared for Human, Cotton, Barley, Sugarcane and Wheat]


Plant Genomes – “Unique Challenges”

  • Pan genome / kingdom

    • Large evolutionary distances

    • Plants, microbial, viral, fungal, insect

[Figure labels: 75 MYA, 155 MYA]


Plant Bioinformatics – “Unique Challenges”

1. Data deluge

  • Rapid capacity increases

  • Large, complex genomes

  • High-throughput potential

2. Heterogeneous and asynchronous data release

  • Many large public sequence data sets and annotations emerging

  • PlantEnsembl, 1001 Arabidopsis genomes

  • Genome efforts: Barley, Cotton, Wheat, Sugarcane, Lupin, Eucalyptus, Compositae, Brachypodium

3. Deep and universal customisation

  • Lack of universally applicable analysis strategies

  • Need flexible modular structures

4. Diverse analytical needs – different architectures

  • Parallel processing – many processors (100s to 1000s)

  • Moderate to high RAM needs (16 – 250 GB, up to 1 TB), i.e. a “fat node”


Crop Bioinformatics @ CSIRO

  • Rice / Cotton

    • Transcriptome, SNP detection

    • Methylome - MeDIP

    • Small RNA profiling

  • Sugarcane

    • SNP detection

  • Barley

    • Transcriptome

    • Small RNA and PARE projects

  • Wheat

    • Genome sequencing

    • Transcriptome

    • Pathogenomics

High-throughput platforms

  • RNA-Sequencing

  • Small RNA sequencing

  • PARE

  • RNA-IP

  • ChIP-Seq

  • Soil Metagenomics

  • Illumina, SOLiD, 454

  • Genome Sequencing

  • Bisulfite Sequencing

  • MeDIP Sequencing

  • Expression arrays

  • Tiling arrays

  • Custom Arrays


Case Study – Solving QTLs

  • Quantitative Trait Loci (QTLs)

    • A region of the genome thought to contribute significantly to the control of a phenotypic trait

  • Key questions:

    • What are the functional elements within a QTL?

    • Which of these control the trait?


Case Study – Solving QTLs

  • Aim: deeply annotate the region to find the causative gene

  • 1. Compare with public sequence collections from related species

    • Hieracium – 1,080,343 expressed sequences, 25,743 proteins

    • Wheat – 4 sequenced genomes and 2 incomplete genomes

  • 2. Integrate computational predictions and evidence

    • > 20 different annotation algorithms

    • > 12 different sequence comparisons × parameter optimisations


Current activities – Wheat CR genome

Wheat : Colin Cavanagh, Matt Hayden, Darren Cullerne

  • Complexity-reduced (CR) genome in Yitpi, Baxter, Westonia and Chara.

[Figure labels: PstI, Random Shear, PE1, PE2, 475 ± 50 bp]

Baird et al., (2008)


Current activities – Wheat CR Genome

[Figure labels: PstI, Random Shear, PE1, PE2]

  • Build consensus stacks

  • Map reads to consensus stacks

  • Map PE reads between stacks

    • Exclude “foreign” reads from stacks

Baird et al., (2008)


k-mer counts and frequencies

GCGAGATCCAACGGTGAACAGCTGCCCAAAAGAAAAaCCGCCTGGAAGTCCGAGGACCTTTAGTACTGTACTCTACCCCCGAACCAGCAGCCTTCGtGCCAaGCAAGACCGCCCTTGTCCCTTTCCTTTATCCATTCCGCcTCCTTCTTTGCTTTGTTCCAATAGAGTCTAAGGCAAAGCTAAAGTGGTTCGTaTGCCTACTTTACCTACTTGACGAAAGGGAACGAACTTCGTTTCGTTTCCGGGTTTATGGATTGGATTCAGTCAGCCTCACTCCTTCCTTTTTATGTTGTCGTGATGGTTACCGGCGAACGCTCCCAAAGGCGACCCTCTCGAGTTTCCGGCTGTTTTCTAGATTGAAGTAGCCTTTCGTCGCCCCGAAAGAAGTCACTATCAAAGAGCTCGCCCTACTGAAGTACCAAAGGTGCGCTCAGCCCGGTGACTAAGAAATGGGTTTGCGCTTGAATTGAAGTGATGAGGTTTTTCGAGGGAAGTAGGGCTCTTATTGACTAAAAGTGGGTTCTTCGCTTTCCTTTAGAATGAAAGTTGCTATGAAGCCCCTACTACTTACTTTGTTTGATTCAAAAGGCGAACGGCCCCCCAACAAGTCGTATGGGGTGGGGTGCTTGTGATAAGCTGCCTTGGATATGAGGAATTCTCAAATTGGGAAAGCATTTCTTGATTTGAAGAAACAAGAAAGTTAGGGTTTTTGGAATTGGATTCGGATAATGTTTGTTGTTTTTtGTAAGTGTGAGATTAGAGGTTCACGAAATTTTGATGGG

k = 8

Total n = 782

8-mers = 775

Unique = 98%

Multiple = 10 x k-mers
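
A sequence of length n contains n − k + 1 overlapping k-mers, which is consistent with the figures above (782 − 8 + 1 = 775). As a minimal sketch of how such counts could be produced, the Python below slides a window of width k along a sequence and tallies each k-mer. The function names `count_kmers` and `summarise` are illustrative, and treating "unique" as "seen exactly once" is an assumption about how the slide defines it.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def summarise(counts):
    """Return (total k-mers, distinct k-mers, k-mers seen exactly once)."""
    total = sum(counts.values())
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return total, distinct, singletons


# Toy example: ACGT occurs twice, the other three 4-mers once each.
counts = count_kmers("ACGTACGT", 4)
print(summarise(counts))  # (5, 4, 3)
```

Upper-casing the input means lower-case bases, such as those in the sequence shown above, are counted together with their upper-case equivalents; that choice is also an assumption.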


Knomes – Genomes as k-mers

1. The majority of k-mers are at low frequencies within genomes

[Figure: maize genome, proportion of unique k-mers plotted against k-mer size in bases]

Kurtz et al., 2008


Knomes – Genomes as k-mers

2. k-mers are lonely

K-mer neighbourhoods can be defined by minimum distances and numbers of matching neighbours.

[Figure: example words VIOLIN, BIOTIN, DARKER, MASKER, MARKED, MARKER, MARKET]
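
One way to make the neighbourhood idea concrete is to use Hamming distance (the number of mismatching positions) as the distance measure. The sketch below applies it to the example words from the slide. The choice of Hamming distance, the threshold of 1, and the function names are assumptions made for illustration; the slide does not say which distance was used.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbours(query, kmers, max_dist=1):
    """All other k-mers within max_dist mismatches of the query."""
    return [w for w in kmers if w != query and hamming(query, w) <= max_dist]


words = ["VIOLIN", "BIOTIN", "DARKER", "MASKER", "MARKED", "MARKER", "MARKET"]

print(neighbours("MARKER", words))  # ['DARKER', 'MASKER', 'MARKED', 'MARKET']
print(neighbours("VIOLIN", words))  # [] (BIOTIN is two mismatches away)
```

Under this definition MARKER sits in a crowded neighbourhood while VIOLIN has no neighbour at distance 1, which is the sense in which a k-mer can be "lonely".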


Knomes – transcriptomes as k-mers

  • Transcriptome k-mer profiles have been less well studied

    • Barley RNASeq 50-mer [Barrero-Sanchez]

    • The majority of k-mers are at low counts

98% of k-mers < 50 counts


Why do we kare? k-mers are useful

k-mers used in analysis :

  • Sequence alignment

    • k-mer overlaps used as seeds

    • k-mer pair-wise distances used to make alignments more robust

  • Genome Assembly

    • Look for overlaps between k-mers to generate k-mer graphs

  • Repeat annotation [e.g. Campagna et al., 2005]

    • Looking for small groups of highly frequent k-mers
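
The genome assembly point above can be illustrated with a small de Bruijn style k-mer graph: each k-mer links its first k − 1 bases to its last k − 1 bases, and a contig is read off by following unambiguous links. This is a toy sketch only. The function names are illustrative, and real assemblers also handle errors, reverse complements and repeats, none of which is attempted here.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Extend a contig from start while exactly one unvisited successor exists."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two reads that overlap by the 3-mer GTG.
graph = build_kmer_graph(["ACGTG", "GTGCA"], k=3)
print(walk(graph, "AC"))  # ACGTGCA
```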


Why do we kare? k-mers are useful

k-mers used in analysis :

  • Error correction in NGS [Kelley et al., 2010]

    • Removing k-mers with abnormal frequencies

  • Genome size and coverage [Marcais and Kingsford, 2011]

    • If a large fraction of k-mers occur C times, then coverage ~ C

    • Genome size can be inferred from C and total read length

  • Clustering of mixtures of sequences

    • Metagenomics

    • Haplotypes and SNPs

  • RNA sequencing

    • “annotation free” detection of differential expression
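
The genome size and coverage point above can be sketched as follows: build the k-mer count spectrum, take the most common depth C as the coverage estimate, and divide the total number of k-mers by C. The names `estimate_genome_size` and `min_depth` are illustrative, and ignoring depths below 2 as probable errors is an assumption. Note that the estimate is in k-mer positions (roughly genome length minus k − 1), not a direct use of total read length.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Tally every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=2):
    """Return (peak depth C, total k-mers / C) from a k-mer count table."""
    spectrum = Counter(counts.values())  # depth -> number of distinct k-mers
    # Assumption: depths below min_depth are dominated by sequencing errors.
    usable = {d: n for d, n in spectrum.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    return peak_depth, sum(counts.values()) / peak_depth


# Toy data: a random 1000-base "genome" read end to end 20 times, error-free.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = [genome] * 20

peak, size = estimate_genome_size(count_kmers(reads, k=15))
print(peak, size)  # 20 986.0 (1000 - 15 + 1 k-mer positions)
```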


Error correction using k-mers

[Figure labels: edit, Poisson, Gaussian]

High coverage = trusted

Low coverage = untrusted

  • Minimizing edit distances

  • Assembly

  • Alignments

  • QUAKE – weights edit sites

  • Incorporates quality scores

  • Error properties of Illumina sequencing

Kelley et al., 2010
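
The trusted/untrusted idea above can be sketched in a few lines: k-mers at or above a coverage cutoff are trusted, and a read containing untrusted k-mers is repaired by the first single-base substitution that makes all of its k-mers trusted. This is a deliberately simplified illustration and not the QUAKE algorithm, which weights candidate edits using quality scores and Illumina error properties. The fixed cutoff and the function names are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Tally every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def untrusted_positions(read, trusted, k):
    """Start positions of k-mers in the read that are not in the trusted set."""
    return [i for i in range(len(read) - k + 1) if read[i:i + k] not in trusted]


def correct_read(read, trusted, k):
    """Try every single-base substitution; return the first fully trusted read."""
    if not untrusted_positions(read, trusted, k):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not untrusted_positions(candidate, trusted, k):
                return candidate
    return read  # no single edit fixes it; leave the read unchanged


true_read = "ACGTTGCAAGGCTA"
bad_read = "ACGTTGCTAGGCTA"  # A -> T substitution at position 7
counts = count_kmers([true_read] * 10 + [bad_read], k=5)
trusted = {kmer for kmer, c in counts.items() if c >= 3}  # hypothetical cutoff

print(correct_read(bad_read, trusted, k=5))  # ACGTTGCAAGGCTA
```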


Kelley error correction

Kelley et al., 2010


Transcriptome Analysis – Differential Expression

[Figure: workflow diagram with the labels Sequence Reads, Align, Genomic loci, Assemble, Contigs, Align Reads, Profile, K-mer spectra, Differential expression and SNPs]


k-mers in de novo differential expression

9,458 (0.3%) Significant 50-mers

Barley : Barrero-Sanchez, Stephen, Gubler, Helliwell

  • RNA sequencing to compare transcriptomes with different dormancy phenotypes

Next steps:

Look for 1 base differences between k-mers that might be SNPs

Map DE k-mers to public databases

Compare k-mer measures against contig assemblies
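
The first of these next steps, finding k-mers that differ by a single base, could be approached by bucketing each k-mer under every "one position masked" key, so that two k-mers share a bucket only if they agree everywhere except that position. The sketch below is one possible approach under that assumption, not the project's method; `one_base_pairs` is an illustrative name, and a one-base difference is only a candidate SNP since it may also be a sequencing error.

```python
from collections import defaultdict
from itertools import combinations


def one_base_pairs(kmers):
    """Return sorted pairs of distinct k-mers that differ at exactly one position."""
    buckets = defaultdict(set)
    for kmer in set(kmers):
        for i in range(len(kmer)):
            # Key = masked position plus the unchanged flanks on either side.
            buckets[(i, kmer[:i], kmer[i + 1:])].add(kmer)
    pairs = set()
    for group in buckets.values():
        pairs.update(combinations(sorted(group), 2))
    return sorted(pairs)


print(one_base_pairs(["ACGTA", "ACCTA", "TTTTT"]))  # [('ACCTA', 'ACGTA')]
```

Compared with checking every pair of k-mers, the bucketing does work proportional to the number of k-mers times k, which matters when there are thousands of significant 50-mers.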

[Figure: samples at 0 hrs, 3 hrs and 6 hrs for Genotype 35 and Genotype 36]


Acknowledgements

CSIRO Plant Industry Bioinformatics

Darren Cullerne

Jose Robles

Andrew Spriggs

Stuart Stephen

Hua Ying

Paul Greenfield

David Lovell

CSIRO Transformational Biology Capability Platform

Projects

Iain Wilson (Cotton)

Jose Barrero Sanchez (Barley)

Frank Gubler (Barley)



Bio - Jen Taylor

Current appointments

CSIRO Plant Industry Bioinformatics Leader

Adjunct Fellow Mathematical Sciences Institute, ANU

Brief CV

University of Queensland & QIMR

  • PhD (Biochemistry, Genetics)

University of Oxford (Department of Statistics)

  • Functional genomics

Wellcome Trust Centre for Human Genetics

  • Head of Functional Analysis, Bioinformatics Core
