Informatics challenges and computer tools for sequencing 1000s of human genomes

Informatics challenges and computer tools for sequencing 1000s of human genomes. Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory Personal Genomes meeting October 9-12, 2008. Large-scale individual human resequencing. Next-gen sequencers offer vast throughput….

Presentation Transcript
Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth

Boston College Biology Department

Cold Spring Harbor Laboratory

Personal Genomes meeting

October 9-12, 2008



Next-gen sequencers offer vast throughput…

[Figure: bases per machine run (1 Mb to 10 Gb) plotted against read length (10 bp to 1,000 bp)]

  • Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)

  • 454 pyrosequencer (100-400 Mb in 200-450 bp reads)

  • ABI capillary sequencer


The resequencing informatics pipeline

[Figure labels: IND, REF]

(i) base calling

(ii) read mapping

(iii) read assembly

(iv) SNP and short INDEL calling

(v) SV calling

(vi) data validation, hypothesis generation


The variation discovery “toolbox”

  • base callers

  • read mappers

  • SNP callers

  • SV callers

  • assembly viewers


1. Base calling

base sequence

base quality (Q-value) sequence

  • early manufacturer-supplied base callers were imperfect

  • third party software made substantial improvements

  • machine manufacturers are now focusing more on base calling
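
The base quality (Q-value) attached to each called base can be read as an error probability. The sketch below assumes the Q-values follow the standard Phred convention, Q = -10 log10(P_error); the slides do not say which scale each base caller uses, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value into an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability into a Phred-scaled quality value."""
    return -10 * math.log10(p)


# Assumed Phred scale: Q10, Q20, Q30 mean 1 error in 10, 100, 1000 calls.
for q in (10, 20, 30):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):g}")
```

Under this convention a 10-point increase in Q corresponds to a ten-fold drop in the probability that the base call is wrong.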


2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

… and they give you the picture on the box

Larger, more unique pieces are easier to place than others…


Next-gen reads are generally short

[Figure: read-length ranges on a read length [bp] axis from 0 to 400]

  • 20-60 (variable)

  • 25-50 (fixed)

  • 25-70 (fixed)

  • ~200-450 (variable)


Base error rates are low

[Figure panels: Illumina, 454]



Mapping probabilities (qualities)

[Figure: a read with mapping probabilities 0.8, 0.19 and 0.01 for its alternative placements]
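
One way such mapping probabilities can arise is by normalizing a per-location alignment score over all candidate locations. The sketch below is an illustration under stated assumptions, not the method used by any particular mapper: the input likelihoods are hypothetical, the prior over locations is uniform, and the Phred-style scaling with a cap of 60 is a common convention rather than something stated on the slide.

```python
import math


def mapping_probabilities(likelihoods):
    """Normalize per-location alignment likelihoods into probabilities.

    Assumes a uniform prior over the candidate locations.
    """
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def mapping_quality(p_best, cap=60.0):
    """Phred-scale the probability that the best placement is wrong."""
    p_wrong = 1.0 - p_best
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10 * math.log10(p_wrong))


# Hypothetical likelihoods chosen to reproduce the 0.8 / 0.19 / 0.01 split.
probs = mapping_probabilities([80.0, 19.0, 1.0])
print(probs, round(mapping_quality(max(probs)), 2))
```

A read whose best placement has probability 0.8 gets a low mapping quality (about 7 on this scale), which is a signal to down-weight it in later variant calling.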


Error types are very different

[Figure panels: Illumina, 454]


Gapped alignments


MOSAIK

  • fast

  • accurate

  • gapped

  • versatile (short + long reads)


3. SNP and short-INDEL calling

  • deep alignments of 100s / 1000s of individuals

  • trio sequences


Allele discovery is a multi-step sampling process

Population → Samples → Reads


Capturing the allele in the sample
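
The first sampling step can be made concrete with a small calculation: an allele can only be discovered if at least one copy of it is present among the sampled chromosomes. The sketch below assumes chromosomes are drawn independently from the population, so an allele of frequency f is present among n diploid individuals with probability 1 - (1 - f)^(2n). The function name and the example numbers are illustrative.

```python
def capture_probability(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele.

    Assumes independent sampling of chromosomes from the population.
    """
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# Illustrative values: a 1% allele in panels of increasing size.
for n in (10, 100, 1000):
    print(n, round(capture_probability(0.01, n), 3))
```

Under this model, rare alleles are often absent from small panels no matter how deeply each individual is sequenced.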


Allele calling in the reads

[Figure labels: number of individuals, allele call in read, base quality]


How many reads needed to call an allele?

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac
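
The pileup above (12 reads showing A and 2 showing C at the variable position) can be scored with a simple Bayesian genotype model. This is a minimal sketch under assumptions that are not from the slides: a diploid individual, reads drawn from either chromosome with equal probability, independent base errors, a uniform error of Q20 (0.01) on every base, and a flat genotype prior unless one is supplied. It is not the algorithm of any specific SNP caller.

```python
from itertools import combinations_with_replacement


def base_likelihood(observed, allele, error_prob):
    """P(observed base | true allele), spreading errors over 3 other bases."""
    return 1.0 - error_prob if observed == allele else error_prob / 3.0


def genotype_posteriors(bases, error_probs, alleles=("A", "C"), priors=None):
    """Posterior probability of each diploid genotype given a pileup column."""
    genotypes = ["".join(g) for g in combinations_with_replacement(alleles, 2)]
    if priors is None:
        # Assumed flat prior; a real caller would use a population prior.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    unnormalized = {}
    for g in genotypes:
        value = priors[g]
        for base, err in zip(bases, error_probs):
            # Each read comes from either chromosome with probability 1/2.
            value *= 0.5 * base_likelihood(base, g[0], err) \
                + 0.5 * base_likelihood(base, g[1], err)
        unnormalized[g] = value
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


# Column taken from the pileup above: 12 A reads and 2 C reads.
pileup = "AAAACCAAAAAAAA"
post = genotype_posteriors(pileup, [0.01] * len(pileup))
print({g: round(p, 3) for g, p in post.items()})
```

Passing a prior that strongly favors the reference homozygote (through the `priors` argument) shifts the call back toward A/A, which is one reason the number of supporting reads matters.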


The need for accurate data…




More samples or deeper coverage / sample?

Shallower read coverage from more individuals … or deeper coverage from fewer samples?

(simulation analysis by Aaron Quinlan)


Analysis indicates a balance


SNP calling in trios

  • the child inherits one chromosome from each parent

  • there is a small probability for a mutation in the child
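
These two rules can be written as a transmission probability for the child's genotype given the parents' genotypes. The sketch below assumes a biallelic site, independent transmission from each parent, and a hypothetical per-site mutation rate `mu` of 1e-8; none of these values come from the slides. A trio-aware caller could use such a term as the prior that links the three individuals.

```python
def transmit(parent_genotype, allele, mu):
    """Probability that a parent passes on `allele` at a biallelic site.

    Each parental allele is chosen with probability 1/2 and then mutates
    to the other allele with probability mu.
    """
    prob = 0.0
    for parental in parent_genotype:
        prob += 0.5 * ((1.0 - mu) if parental == allele else mu)
    return prob


def child_genotype_prob(child, father, mother, mu=1e-8):
    """P(child genotype | father genotype, mother genotype)."""
    c1, c2 = child
    prob = transmit(father, c1, mu) * transmit(mother, c2, mu)
    if c1 != c2:
        # A heterozygous child can receive either allele from either parent.
        prob += transmit(father, c2, mu) * transmit(mother, c1, mu)
    return prob


# Father A/A, mother A/C: the child is A/A or A/C with about equal odds.
for genotype in ("AA", "AC", "CC"):
    print(genotype, child_genotype_prob(genotype, "AA", "AC"))
```

With both parents A/A, an A/C child is only possible through a new mutation, so its prior probability is on the order of `mu`.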


SNP calling in trios

P=0.86

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

father

mother

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

P=0.79

child


4. Structural variation discovery

Read pair mapping pattern (breakpoint detection)

[Figure rows: DNA, reference, pattern; lengths LM and LF]

  • Deletion (Ldel): LM ~ LF+Ldel & depth: low

  • Tandem duplication (Ldup): LM ~ LF-Ldup & depth: high

  • Translocation (LT1, LT2): LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal

  • Inversion (Linv): LM ~ +Linv, LM ~ -Linv & ends flipped & depth: normal

  • Insertion (Lins): un-paired read clusters & depth: normal

  • Chromosomal translocation (LT): LM ~ LF+LT & depth: normal & cross-paired read clusters
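
The read-pair patterns for each event type can be turned into a rough per-pair classifier. The sketch below assumes LM is the separation of a read pair as mapped on the reference and LF is the expected fragment length of the library; the slide does not spell these out. The `ReadPair` fields, the three-standard-deviation threshold and the label strings are all hypothetical. A single discordant pair is only a hint: calling an event would need clusters of such pairs plus the read-depth evidence listed for each event type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadPair:
    chrom1: str
    chrom2: Optional[str]        # None when the mate is unmapped
    mapped_span: Optional[int]   # LM; None unless both ends share a chromosome
    ends_flipped: bool = False   # unexpected relative orientation


def classify_pair(pair, frag_mean, frag_sd, n_sd=3.0):
    """Label one read pair by how its mapping departs from the library."""
    if pair.chrom2 is None:
        return "unpaired (insertion candidate)"
    if pair.chrom1 != pair.chrom2:
        return "cross-paired (chromosomal translocation candidate)"
    if pair.ends_flipped:
        return "ends flipped (inversion candidate)"
    excess = pair.mapped_span - frag_mean       # LM - LF
    if excess > n_sd * frag_sd:
        return "deletion candidate"             # LM ~ LF + Ldel
    if excess < -n_sd * frag_sd:
        return "tandem duplication candidate"   # LM ~ LF - Ldup
    return "concordant"


# Hypothetical library: 500 bp fragments with a 50 bp standard deviation.
print(classify_pair(ReadPair("chr1", "chr1", 1500), 500, 50))
```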


Copy number estimation

Depth of read coverage
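
Depth of coverage translates into a copy number estimate once it is compared with the depth expected for a normal two-copy region. The sketch below assumes per-window read depths and a matching expected depth per window are already available; how the expected depth is obtained (for example from other samples) is not specified on the slide, and all numbers are made up for illustration.

```python
def copy_number_estimates(observed_depth, expected_depth, normal_copies=2):
    """Estimate copy number per window as normal_copies * observed / expected."""
    estimates = []
    for observed, expected in zip(observed_depth, expected_depth):
        if expected > 0:
            estimates.append(normal_copies * observed / expected)
        else:
            estimates.append(float("nan"))  # no expectation for this window
    return estimates


# Hypothetical windows: the middle two have roughly half the expected depth.
observed = [30, 31, 15, 14, 29]
expected = [30, 30, 30, 30, 30]
estimates = copy_number_estimates(observed, expected)
print([round(value) for value in estimates])
```

Windows at about half the expected depth come out near one copy, which is the signature of a heterozygous deletion.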




Het deletion “revealed” by normalization

(Chip Stewart, Saturday poster session)


5. Data visualization

  • software development

  • data validation

  • hypothesis generation


Summary

  • Next-generation sequencing is a boon for large-scale individual human resequencing

  • Basic data mining tools are getting applied and tested in the 1000 Genomes Project

  • There is still a lot of fine-tuning to do

  • A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes


Credits

Michael Stromberg

Chip Stewart

Aaron Quinlan

Michele Busby

Damien Croteau-Chonka

Eric Tsung

Derek Barnett

Weichun Huang

Several postdoc positions are available… mail [email protected]


Software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release


Positions

Several postdoc positions are available… mail [email protected]


Individual genotype directly from sequence

individual 1: A/C

AACGTTAGCATA

AACGTTAGCATA

AACGTTCGCATA

AACGTTCGCATA

individual 2: C/C

AACGTTCGCATA

AACGTTCGCATA

AACGTTCGCATA

AACGTTCGCATA

individual 3: A/A

AACGTTAGCATA

AACGTTAGCATA



Most reads contain no or few errors


Paired-end reads help unique read placement

PE

  • fragment amplification: fragment length 100 - 600 bp

  • fragment length limited by amplification efficiency

MP

  • circularization: 500bp - 10kb (sweet spot ~3kb)

  • fragment length limited by library complexity

Korbel et al. Science 2007


How many reads needed to call an allele?

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

P=0.08

P=0.82

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

