Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth

Boston College Biology Department

Cold Spring Harbor Laboratory

Personal Genomes meeting

October 9-12, 2008


Large-scale individual human resequencing


Next-gen sequencers offer vast throughput…

[Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp)]

  • ABI capillary sequencer

  • 454 pyrosequencer (100-400 Mb in 200-450 bp reads)

  • Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)


The resequencing informatics pipeline

[Figure: pipeline diagram with tracks labeled IND and REF]

  • (i) base calling

  • (ii) read mapping

  • (iii) read assembly

  • (iv) SNP and short INDEL calling

  • (v) SV calling

  • (vi) data validation, hypothesis generation


The variation discovery “toolbox”

  • base callers

  • read mappers

  • SNP callers

  • SV callers

  • assembly viewers


1. Base calling

base sequence

base quality (Q-value) sequence

  • early manufacturer-supplied base callers were imperfect

  • third party software made substantial improvements

  • machine manufacturers are now focusing more on base calling
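The slide does not spell out how a Q-value relates to an error probability. Assuming the conventional Phred scale, Q = -10 log10(P_error), the conversion can be sketched as follows; the function names are illustrative and not taken from any base caller.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Phred-scaled quality value for an error probability 0 < p <= 1."""
    return -10.0 * math.log10(p)

# Print the error probability implied by a few common Q-values.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Under this scale each step of 10 in Q corresponds to a tenfold drop in the estimated error probability.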


2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

… and they give you the picture on the box

Larger, more unique pieces are easier to place than others…


Next-gen reads are generally short

[Figure: read length [bp] by platform, axis from 0 to 400 bp]

  • 20-60 (variable)

  • 25-50 (fixed)

  • 25-70 (fixed)

  • ~200-450 (variable)


Base error rates are low

[Figure panels: Illumina, 454]


Strategies to deal with non-unique mapping


Mapping probabilities (qualities)

[Figure: one read with candidate placements carrying mapping probabilities 0.8, 0.19, and 0.01]
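One way to read the 0.8 / 0.19 / 0.01 example: each candidate placement of a read gets a probability, and the probabilities over all placements sum to 1. The sketch below normalizes per-placement likelihoods and reports a Phred-scaled mapping quality for the best placement. It is an illustration under assumed conventions (Phred scaling, a cap of 60), not the scoring used by any particular mapper.

```python
import math

def placement_probabilities(likelihoods):
    """Normalize per-placement likelihoods so that they sum to 1."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

def mapping_quality(probabilities, cap=60.0):
    """Phred-scaled probability that the best placement is wrong (capped)."""
    p_wrong = 1.0 - max(probabilities)
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_wrong))

# Hypothetical likelihoods chosen to reproduce the slide's three probabilities.
probs = placement_probabilities([80.0, 19.0, 1.0])
print(probs, mapping_quality(probs))
```

A read whose best placement has probability 0.8 gets a mapping quality of about 7 on this scale, which reflects the remaining 20% chance of misplacement.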


Error types are very different

[Figure panels: Illumina, 454]


Gapped alignments


MOSAIK

  • fast

  • accurate

  • gapped

  • versatile (short + long reads)


3. SNP and short-INDEL calling

  • deep alignments of 100s / 1000s of individuals

  • trio sequences


Allele discovery is a multi-step sampling process

[Figure: Population → Samples → Reads]


Capturing the allele in the sample
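The first sampling step can be illustrated with a simple calculation. Assuming diploid individuals drawn at random and independent chromosomes, the chance that an allele with population frequency f appears at least once among n individuals is 1 - (1 - f)^(2n). This is a textbook approximation given for orientation, not a formula taken from the slide.

```python
def capture_probability(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes

# A 1% allele across a range of sample sizes.
for n in (10, 100, 1000):
    print(n, round(capture_probability(0.01, n), 3))
```

Rare alleles need many samples before they are likely to be present at all, regardless of how deeply each sample is sequenced.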


Allele calling in the reads

[Figure labels: number of individuals, allele call in read, base quality]


How many reads needed to call an allele?

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
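In the pileup above, 2 of 14 reads show C. A rough way to frame the question is a binomial tail: how likely are at least k C-reads among n reads if the site is truly heterozygous (each read shows C with probability 0.5), versus if every C is a sequencing error? The per-base error rate of 1% used below is an assumed value for illustration.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_reads, min_alt = 14, 2
# Chance a true heterozygote shows at least 2 alternate reads.
print(prob_at_least(min_alt, n_reads, 0.5))
# Chance that errors alone produce at least 2 alternate reads (assumed 1% error rate).
print(prob_at_least(min_alt, n_reads, 0.01))
```

The gap between the two probabilities widens as coverage grows, which is why a threshold that works at 14 reads may be too strict or too lax at other depths.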


The need for accurate data…


… and realistic base quality values


Recalibrated base quality values (Illumina)
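The slide gives no detail on how the recalibration was done. One common idea, sketched here purely as an assumption, is to group bases by their reported quality, count mismatches against the reference at positions believed to be non-variant, and replace each reported value by the empirical Phred-scaled mismatch rate. The pseudocounts and the input layout are illustrative choices.

```python
import math
from collections import defaultdict

def recalibration_table(observations):
    """Map reported Q -> empirical Q.

    observations: iterable of (reported_q, is_mismatch) pairs, taken from
    aligned bases at sites assumed to be non-variant (hypothetical input).
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        if is_mismatch:
            mismatches[reported_q] += 1
    table = {}
    for q, total in totals.items():
        # Pseudocounts (+1, +2) avoid log(0) for bins with no mismatches.
        rate = (mismatches[q] + 1) / (total + 2)
        table[q] = -10.0 * math.log10(rate)
    return table

# Toy data: bases reported as Q30 that mismatch about 1% of the time.
toy = [(30, True)] * 9 + [(30, False)] * 989
print(recalibration_table(toy))
```

In this toy input the reported Q30 bases behave like Q20 bases, the kind of overestimate that recalibration is meant to correct.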


More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan


Analysis indicates a balance
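The trade-off can be illustrated with a toy model; it is not the simulation credited to Aaron Quinlan. Assume a fixed sequencing budget (individuals × coverage), Poisson-distributed read depth, a heterozygous carrier whose alternate allele appears in half of the reads, and a call that requires at least two alternate reads. Homozygous carriers and sequencing errors are ignored, and all parameter values are assumptions.

```python
import math

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below

def discovery_probability(allele_freq, n_individuals, coverage, min_alt_reads=2):
    """Chance that at least one carrier chromosome is sampled and detected."""
    # Alternate-allele reads in a heterozygote: Poisson with mean coverage / 2.
    p_detect = poisson_tail(min_alt_reads, coverage / 2.0)
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq * p_detect) ** n_chromosomes

# Same total budget (400x) split in different ways, for a 1% allele.
for n, cov in ((400, 1), (200, 2), (100, 4), (50, 8), (10, 40)):
    print(n, cov, round(discovery_probability(0.01, n, cov), 3))
```

Under these assumptions both extremes do worse than an intermediate design: very shallow coverage rarely confirms the allele in a carrier, and very few samples rarely include a carrier.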


SNP calling in trios

  • the child inherits one chromosome from each parent

  • there is a small probability for a mutation in the child
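The two points above amount to a prior on the child's genotype given the parents' genotypes. A minimal sketch for a biallelic A/C site: each parent passes on one of its two alleles with probability 0.5, and the transmitted allele changes to the other allele with a small probability mu. The value of mu and the function names are assumptions for illustration.

```python
from itertools import product

def transmitted_allele_probs(parent_genotype, mu):
    """Distribution of the allele a parent passes on, allowing mutation."""
    dist = {"A": 0.0, "C": 0.0}
    for allele in parent_genotype:
        for outcome in dist:
            dist[outcome] += 0.5 * ((1.0 - mu) if outcome == allele else mu)
    return dist

def child_genotype_probs(father_genotype, mother_genotype, mu=1e-8):
    """P(child genotype | parental genotypes) at a biallelic A/C site."""
    paternal = transmitted_allele_probs(father_genotype, mu)
    maternal = transmitted_allele_probs(mother_genotype, mu)
    probs = {"AA": 0.0, "AC": 0.0, "CC": 0.0}
    for (a, pa), (b, pb) in product(paternal.items(), maternal.items()):
        probs["".join(sorted(a + b))] += pa * pb
    return probs

print(child_genotype_probs("AA", "AC"))
print(child_genotype_probs("AA", "AA"))
```

A caller can multiply this prior by the read-based likelihoods of the three family members, so that weak evidence in the child is supported or contradicted by the parents.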


SNP calling in trios

[Figure: read pileups of aatgtagtaNgtacctac for father, mother, and child, with a few reads showing C instead of A at the variable position; probabilities shown: P=0.86 and P=0.79]


4. Structural variation discovery

Read pair mapping pattern (breakpoint detection)

[Figure columns: DNA, reference, pattern]

  • Deletion (Ldel): LM ~ LF+Ldel & depth: low

  • Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high

  • Translocation (LT1, LT2): LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal

  • Inversion (Linv): LM ~ +Linv, LM ~ -Linv & ends flipped & depth: normal

  • Insertion (Lins): un-paired read clusters & depth: normal

  • Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
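The read-pair signatures can be turned into a simple classifier. The sketch below reads LM as the mapped distance between the two ends of a pair and LF as the expected fragment length; the slide does not define these symbols, so that reading, the tolerance parameter, and the function name are all assumptions. Depth of coverage and read clustering, which the slide also uses, are left out.

```python
def classify_read_pair(chrom1, chrom2, mapped_distance, ends_flipped,
                       expected_length, tolerance):
    """Label one read pair by its mapping pattern (single-pair heuristic)."""
    if chrom1 != chrom2:
        return "chromosomal translocation candidate"
    if ends_flipped:
        return "inversion candidate"
    if mapped_distance > expected_length + tolerance:
        # LM ~ LF + Ldel: the ends map farther apart than the fragment length.
        return "deletion candidate"
    if mapped_distance < expected_length - tolerance:
        # LM ~ LF - Ldup: the ends map closer together, or in reverse order.
        return "tandem duplication candidate"
    return "concordant"

# Hypothetical pairs from a library with 500 bp fragments.
print(classify_read_pair("chr1", "chr1", 3000, False, 500, 100))
print(classify_read_pair("chr1", "chr1", -200, False, 500, 100))
print(classify_read_pair("chr1", "chr2", 0, False, 500, 100))
```

A single aberrant pair is weak evidence; in practice candidates would be kept only where several pairs with the same pattern cluster at one breakpoint.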


Copy number estimation

Depth of read coverage
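Depth-based copy number estimation can be sketched as a ratio: the read depth in each genomic window divided by the depth expected for a normal diploid region, scaled by the ploidy. Using the median window depth as the expected value is an assumption made here (it presumes most windows are at normal copy number); window size and any GC or mappability correction are omitted.

```python
from statistics import median

def estimate_copy_number(window_depths, ploidy=2):
    """Per-window copy number estimates, normalized by the median depth."""
    expected = median(window_depths)
    return [ploidy * depth / expected for depth in window_depths]

# Hypothetical window depths: one heterozygous deletion, one duplication.
depths = [30, 31, 29, 15, 30, 45, 30]
print([round(cn, 2) for cn in estimate_copy_number(depths)])
```

A heterozygous deletion shows up as a window at roughly half the normal depth, which is only visible once the depths have been normalized.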


Deletion: Aberrant positive mapping distance


Tandem duplication: negative mapping distance


Het deletion “revealed” by normalization

Chip Stewart

Saturday poster session


5. Data visualization

  • software development

  • data validation

  • hypothesis generation


Summary

  • Next-generation sequencing is a boon for large-scale individual human resequencing

  • Basic data mining tools are getting applied and tested in the 1000 Genomes Project

  • There is still a lot of fine-tuning to do

  • A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes


Credits

Michael Stromberg

Chip Stewart

Aaron Quinlan

Michele Busby

Damien Croteau-Chonka

Eric Tsung

Derek Barnett

Weichun Huang

Several postdoc positions are available… mail [email protected]


Software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release




Individual genotype directly from sequence

individual 1 (A/C)
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 2 (C/C)
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 3 (A/A)
AACGTTAGCATA
AACGTTAGCATA
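Calling a genotype directly from the reads of one individual can be sketched with a simple Bayesian model: count the reads that show A and C at the site, compute a binomial likelihood under each of the three genotypes, and normalize. The uniform prior and the 1% per-base error rate are assumptions, and per-base quality values are ignored for brevity.

```python
from math import comb

def genotype_posteriors(n_a, n_c, error_rate=0.01):
    """Posterior over A/A, A/C, C/C from read counts (uniform prior assumed)."""
    # Probability that a single read shows C under each genotype.
    p_c_read = {"A/A": error_rate, "A/C": 0.5, "C/C": 1.0 - error_rate}
    n = n_a + n_c
    unnormalized = {
        genotype: comb(n, n_c) * p ** n_c * (1.0 - p) ** n_a
        for genotype, p in p_c_read.items()
    }
    total = sum(unnormalized.values())
    return {genotype: value / total for genotype, value in unnormalized.items()}

def call_genotype(n_a, n_c, error_rate=0.01):
    """Most probable genotype under the model above."""
    posteriors = genotype_posteriors(n_a, n_c, error_rate)
    return max(posteriors, key=posteriors.get)

# Read counts (A, C) for the three individuals shown on the slide.
for name, counts in (("individual 1", (2, 2)),
                     ("individual 2", (0, 4)),
                     ("individual 3", (2, 0))):
    print(name, call_genotype(*counts))
```

With only two reads, as for individual 3, the heterozygous genotype keeps a substantial posterior probability, which is one reason low-coverage genotypes are less certain than the SNP call itself.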


Genotyping from primary sequence data


Most reads contain no or few errors


Paired-end reads help unique read placement

PE (paired-end)

  • fragment amplification: fragment length 100-600 bp

  • fragment length limited by amplification efficiency

MP (mate-pair)

  • circularization: 500 bp - 10 kb (sweet spot ~3 kb)

  • fragment length limited by library complexity

Korbel et al. Science 2007


How many reads needed to call an allele?

[Figure: read pileups of aatgtagtaNgtacctac at several depths, most reads showing A and a few showing C at the variable position; probabilities shown: P=0.08 and P=0.82]

