Informatics challenges and computer tools for sequencing 1000s of human genomes

Informatics challenges and computer tools for sequencing 1000s of human genomes. Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory Personal Genomes meeting October 9-12, 2008. Large-scale individual human resequencing. Next-gen sequencers offer vast throughput….

Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth

Boston College Biology Department

Cold Spring Harbor Laboratory

Personal Genomes meeting

October 9-12, 2008

Next-gen sequencers offer vast throughput…

[Figure: bases per machine run (axis ticks 1 Mb to 10 Gb) versus read length (axis ticks 10 bp to 1,000 bp)]
  • Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
  • 454 pyrosequencer (100-400 Mb in 200-450 bp reads)
  • ABI capillary sequencer

The resequencing informatics pipeline

  • (i) base calling
  • (ii) read mapping
  • (iii) read assembly
  • (iv) SNP and short INDEL calling
  • (v) SV calling
  • (vi) data validation, hypothesis generation

[Figure labels: IND, REF]

The variation discovery “toolbox”
  • base callers
  • read mappers
  • SNP callers
  • SV callers
  • assembly viewers
1. Base calling

base sequence

base quality (Q-value) sequence

  • early manufacturer-supplied base callers were imperfect
  • third party software made substantial improvements
  • machine manufacturers are now focusing more on base calling
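The Q-value attached to each base can be made concrete with a small sketch. It assumes the conventional Phred scale, Q = -10 log10(P_error), and an ASCII offset of 33 for encoded quality strings; both are assumptions here, since the slide does not state the scale or encoding, and some platforms have used other offsets. The function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Turn an ASCII-encoded quality string into integer Q-values.

    The offset of 33 is an assumption; check the encoding of your data.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

A base with Q20 is therefore expected to be wrong about once in a hundred calls, which is why downstream SNP callers weight each read by its base quality.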
2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

… and they give you the picture on the box

Larger, more unique pieces are easier to place than others…
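The jigsaw analogy can be illustrated with a toy exact-match mapper: the reference is the picture on the box, the reads are the pieces. This is only a sketch with a made-up reference string and a simple k-mer lookup table; it is not how MOSAIK or any other production mapper works, and it ignores mismatches and gaps entirely.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def place_read(read: str, reference: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Return every reference position where the read matches exactly.

    The first k bases of the read are used as a seed; each seed hit is
    then verified against the full read.
    """
    seed = read[:k]
    return [pos for pos in index.get(seed, [])
            if reference[pos:pos + len(read)] == read]


# Hypothetical reference containing a short repeat ("ACGT" occurs three times).
reference = "ACGTACGTTTGACCAGTACGTAAGC"
index = build_index(reference, k=4)

short_hits = place_read("ACGT", reference, index, k=4)      # short, repetitive piece
long_hits = place_read("ACGTTTGA", reference, index, k=4)   # longer, unique piece

print("short read placements:", short_hits)
print("long read placements:", long_hits)
```

The short read fits in several places, while the longer read has a single placement, which is the point of the "larger, more unique pieces" remark.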

Next-gen reads are generally short

[Figure: read-length ranges per platform; the platform labels were lost in extraction. Axis: read length [bp], 0 to 400]
  • 20-60 (variable)
  • 25-50 (fixed)
  • 25-70 (fixed)
  • ~200-450 (variable)

MOSAIK
  • fast
  • accurate
  • gapped
  • versatile (short + long reads)
3. SNP and short-INDEL calling
  • deep alignments of 100s / 1000s of individuals
  • trio sequences
Allele calling in the reads

number of individuals

allele call in read

base quality

How many reads needed to call an allele?

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac
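One way to get a feel for the question is a simple binomial calculation. The sketch below assumes a heterozygous site where each read comes from either chromosome with probability 0.5, ignores sequencing error and mapping bias, and asks how likely it is to see the second allele in at least a given number of reads. The threshold of two supporting reads is an arbitrary choice for illustration, not a rule from the talk.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(at least k of n reads carry the allele), each read independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Chance of seeing the second allele of a heterozygote in >= 2 reads.
    for coverage in (2, 4, 8, 14, 20):
        print(f"coverage {coverage:2d}: {prob_at_least(2, coverage):.3f}")
```

Under these assumptions the chance of observing both alleles rises quickly with coverage, but at very low coverage a heterozygote will often look homozygous.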

More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan
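The trade-off can be explored with a small Monte Carlo sketch. This is not the simulation shown in the talk; it is a toy model with stated assumptions: diploid individuals, a fixed number of reads per individual at the site, no sequencing error, and a variant counted as "detected" when at least `min_alt_reads` reads across all individuals carry it. All names and parameter values are hypothetical.

```python
import random


def detection_rate(freq: float, n_individuals: int, coverage: int,
                   min_alt_reads: int = 2, trials: int = 500, seed: int = 0) -> float:
    """Fraction of simulated experiments in which the variant allele is detected."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        alt_reads = 0
        for _ in range(n_individuals):
            # Diploid genotype: number of chromosomes carrying the variant (0, 1 or 2).
            alt_copies = (rng.random() < freq) + (rng.random() < freq)
            # Each read samples one of the two chromosomes at random.
            for _ in range(coverage):
                if rng.random() < alt_copies / 2:
                    alt_reads += 1
        if alt_reads >= min_alt_reads:
            detected += 1
    return detected / trials


if __name__ == "__main__":
    budget = 400  # total reads covering the site, split across individuals
    for n_individuals in (10, 50, 100, 200):
        coverage = budget // n_individuals
        rate = detection_rate(0.01, n_individuals, coverage)
        print(f"{n_individuals:3d} individuals x {coverage:2d}x coverage: {rate:.2f}")
```

Changing the allele frequency, the read budget, or the detection threshold shows how the balance between sample count and per-sample depth shifts; a realistic analysis would also model base-calling error and genotype calling per individual.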

SNP calling in trios
  • the child inherits one chromosome from each parent
  • there is a small probability for a mutation in the child
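These two statements can be written down as a prior over the child's genotype given the parents' genotypes. The sketch below is a simplified two-allele model: genotypes are coded as the number of alternate alleles (0, 1 or 2), each parent passes on one of its two alleles at random, and a mutation flips the transmitted allele. The default mutation rate is an illustrative placeholder, not a value from the talk.

```python
def transmit_alt_prob(parent_alt_copies: int, mutation_rate: float) -> float:
    """Probability that a parent passes on the alternate allele."""
    p = parent_alt_copies / 2
    # A mutation flips the transmitted allele (two-allele simplification).
    return p * (1 - mutation_rate) + (1 - p) * mutation_rate


def child_genotype_prior(father: int, mother: int,
                         mutation_rate: float = 1e-8) -> list[float]:
    """P(child has 0, 1, 2 alternate alleles | parental genotypes)."""
    pf = transmit_alt_prob(father, mutation_rate)
    pm = transmit_alt_prob(mother, mutation_rate)
    return [
        (1 - pf) * (1 - pm),            # child inherits reference from both
        pf * (1 - pm) + (1 - pf) * pm,  # child inherits one alternate allele
        pf * pm,                        # child inherits alternate from both
    ]


if __name__ == "__main__":
    print("het x het:", child_genotype_prior(1, 1, mutation_rate=0.0))
    print("hom-ref x hom-ref:", child_genotype_prior(0, 0))
```

In a trio caller, a prior like this would be combined with the read evidence from all three family members, so that a variant seen in only a few of the child's reads is more believable when a parent also carries it.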
SNP calling in trios

P=0.86

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

father

mother

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

P=0.79

child

4. Structural variation discovery

Read pair mapping pattern (breakpoint detection)

[Figure columns: DNA, reference, pattern]
  • Deletion (length Ldel): LM ~ LF + Ldel; depth: low
  • Tandem duplication (length Ldup): LM ~ LF - Ldup; depth: high
  • Translocation (lengths LT1, LT2): LM ~ LF + LT1, LM ~ LF + LT2, LM ~ LF - LT1 - LT2; depth: normal
  • Inversion (length Linv): LM ~ +Linv, LM ~ -Linv; ends flipped; depth: normal
  • Insertion (length Lins): un-paired read clusters; depth: normal
  • Chromosomal translocation (length LT): LM ~ LF + LT; depth: normal; cross-paired read clusters
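The mapping patterns can be sketched as a rule-based classifier for a single read pair. The slide does not define its symbols; this sketch assumes LM is the distance between the mapped mates and LF is the expected fragment length. The tolerance, the depth margin, the `ReadPair` fields, and the labels are all hypothetical, and the within-chromosome translocation pattern is not handled.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    chrom2: str
    mapped_distance: int      # LM (assumed meaning: distance between mapped mates)
    ends_flipped: bool        # mate orientation differs from the expected one
    mate_mapped: bool = True  # False when only one end could be placed


def classify_pair(pair: ReadPair, expected_length: int, tolerance: int,
                  relative_depth: float, depth_margin: float = 0.25) -> str:
    """Assign a candidate event type following the slide's patterns.

    relative_depth is the local read depth divided by the genome-wide average.
    """
    if not pair.mate_mapped:
        return "insertion"                   # un-paired read cluster
    if pair.chrom1 != pair.chrom2:
        return "chromosomal translocation"   # cross-paired reads
    if pair.ends_flipped:
        return "inversion"
    excess = pair.mapped_distance - expected_length  # LM - LF
    if excess > tolerance and relative_depth < 1 - depth_margin:
        return "deletion"                    # LM ~ LF + Ldel, depth low
    if excess < -tolerance and relative_depth > 1 + depth_margin:
        return "tandem duplication"          # LM ~ LF - Ldup, depth high
    if abs(excess) <= tolerance:
        return "concordant"
    return "unclassified"


if __name__ == "__main__":
    pair = ReadPair("chr1", "chr1", mapped_distance=1400, ends_flipped=False)
    print(classify_pair(pair, expected_length=400, tolerance=100, relative_depth=0.5))
```

In practice a call would rest on clusters of pairs that agree on the same breakpoint rather than on any single pair.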

Copy number estimation

Depth of read coverage
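A minimal sketch of depth-based copy number estimation follows. It assumes that read depth has been counted in fixed windows and that an expected depth per window is available (for example from other samples), so that window-specific effects can be divided out; the talk does not say which normalization was used, so this is one plausible choice. The depth values are invented for illustration.

```python
from statistics import median


def copy_number_estimates(sample_depth: list[float], expected_depth: list[float],
                          ploidy: int = 2) -> list[float]:
    """Estimate copy number per window from normalized read depth.

    Each window's depth is divided by its expected depth, and the ratios are
    rescaled so that the median window corresponds to the normal ploidy.
    Expected depths must be positive.
    """
    ratios = [s / e for s, e in zip(sample_depth, expected_depth)]
    center = median(ratios)
    return [ploidy * ratio / center for ratio in ratios]


# Hypothetical windows: window 2 has low depth in every sample (not a deletion),
# while windows 3 and 4 have half the expected depth (heterozygous deletion).
expected_depth = [30, 12, 40, 40, 22, 30]
sample_depth = [30, 12, 20, 20, 22, 30]

estimates = copy_number_estimates(sample_depth, expected_depth)
print(estimates)
```

In the raw depths the second window looks lowest, but after normalization only the third and fourth windows drop to a copy number of one.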

Het deletion “revealed” by normalization

Chip Stewart

Saturday poster session

5. Data visualization
  • software development
  • data validation
  • hypothesis generation
Summary
  • Next-generation sequencing is a boon for large-scale individual human resequencing
  • Basic data mining tools are being applied and tested in the 1000 Genomes Project
  • There is still a lot of fine-tuning to do
  • A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes
Credits

Michael Stromberg

Chip Stewart

Aaron Quinlan

Michele Busby

Damien Croteau-Chonka

Eric Tsung

Derek Barnett

Weichun Huang

Several postdoc positions are available…

… mail [email protected]

Software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release

Individual genotype directly from sequence

individual 1: A/C
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 2: C/C
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

individual 3: A/A
AACGTTAGCATA
AACGTTAGCATA
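The genotypes on this slide can be reproduced with a small likelihood calculation over the variable column of the reads. The sketch assumes A is the reference allele and C the alternate (the slide does not say which is which), a single fixed per-base error rate instead of real base qualities, and a flat prior over the three genotypes. It is an illustration, not the caller used in the talk.

```python
def genotype_posteriors(bases: str, ref: str, alt: str,
                        error_rate: float = 0.01) -> dict[str, float]:
    """Posterior over ref/ref, ref/alt, alt/alt given the bases observed at one site."""
    # Probability that a single read shows the alternate allele under each genotype.
    alt_prob = {
        f"{ref}/{ref}": error_rate,
        f"{ref}/{alt}": 0.5,
        f"{alt}/{alt}": 1 - error_rate,
    }
    likelihood = {}
    for genotype, p_alt in alt_prob.items():
        value = 1.0
        for base in bases:
            # Any base other than alt is treated as reference (simplification).
            value *= p_alt if base == alt else 1 - p_alt
        likelihood[genotype] = value
    total = sum(likelihood.values())
    return {g: v / total for g, v in likelihood.items()}  # flat prior


def call_genotype(bases: str, ref: str = "A", alt: str = "C") -> str:
    posteriors = genotype_posteriors(bases, ref, alt)
    return max(posteriors, key=posteriors.get)


# Reads from the slide; the variable site is at 0-based column 6.
reads = {
    "individual 1": ["AACGTTAGCATA", "AACGTTAGCATA", "AACGTTCGCATA", "AACGTTCGCATA"],
    "individual 2": ["AACGTTCGCATA"] * 4,
    "individual 3": ["AACGTTAGCATA"] * 2,
}
calls = {name: call_genotype("".join(r[6] for r in rs)) for name, rs in reads.items()}
print(calls)
```

With only two reads, as for individual 3, the homozygous call is the most likely one but a heterozygote is far from excluded, which connects back to the question of how many reads are needed.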

Paired-end reads help unique read placement

PE

  • fragment amplification: fragment length 100 - 600 bp
  • fragment length limited by amplification efficiency

MP

  • circularization: 500bp - 10kb (sweet spot ~3kb)
  • fragment length limited by library complexity

Korbel et al. Science 2007
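How a mate constrains read placement can be sketched in a few lines. The example assumes both ends map to the same chromosome, measures distance between start coordinates, and ignores strand and orientation; the function name and coordinates are hypothetical. The 100 to 600 bp window matches the paired-end fragment range quoted above.

```python
def resolve_with_mate(read_hits: list[int], mate_hits: list[int],
                      min_fragment: int, max_fragment: int) -> list[tuple[int, int]]:
    """Keep (read, mate) placements whose separation fits the fragment length range."""
    return [
        (r, m)
        for r in read_hits
        for m in mate_hits
        if min_fragment <= abs(m - r) <= max_fragment
    ]


# A read from a repeat maps to three places; its mate maps uniquely.
read_hits = [1_000, 52_000, 910_000]
mate_hits = [52_300]

placements = resolve_with_mate(read_hits, mate_hits, min_fragment=100, max_fragment=600)
print(placements)
```

Only one of the three candidate placements is consistent with the mate, so the pair as a whole is placed uniquely even though one end is not.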

How many reads needed to call an allele?

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaCgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

P=0.08

P=0.82

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac

aatgtagtaAgtacctac
