CS 6293 Advanced Topics: Current Bioinformatics

Genome Assembly: a brief introduction

Slides Adapted from Mihai Pop, Art Delcher, and Steven Salzberg

Homework #2
  • #1: questions will be posted online before Monday class
  • #2: Form groups of 3
    • Each group reads two papers on a topic: short reads alignment or assembly
    • Present the papers and do some comparison
    • ~8 minutes presentation
      • You can choose to go into some really cool details
      • Or give the main idea of the paper
    • Other teams (and I) will judge you
    • Send me names in your group and optionally papers you want to present
    • List of papers: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

Genome sequencing

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

~500 nucleotides

3 × 10^9 nucleotides

A big puzzle

~60 million pieces
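
The "~60 million pieces" figure can be checked with simple arithmetic. The sketch below assumes 10x coverage (the coverage value is an assumption here; it is the figure quoted later for whole genome shotgun sequencing) and uses a hypothetical helper name, `reads_needed`.

```python
def reads_needed(genome_size: int, read_length: int, coverage: float) -> int:
    """Reads required so that total sequenced bases = coverage * genome_size."""
    return round(coverage * genome_size / read_length)


# 3 x 10^9 nucleotides, ~500-nucleotide reads, assumed 10x coverage
print(reads_needed(3_000_000_000, 500, 10))  # 60000000
```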

Computational Fragment Assembly

Introduced ~1980

1995: assemble up to 1,000,000 long DNA pieces

2000: assemble whole human genome

Shotgun DNA Sequencing (Technology)

  • DNA target sample
  • SHEAR
  • SIZE SELECT: e.g., 10Kbp ± 8% std.dev.
  • LIGATE & CLONE into the vector
  • SEQUENCE from the primer: End Reads (Mates), 550bp

Whole Genome Shotgun Sequencing

+ single highly automated process

+ only three library constructions

– assembly is much more difficult

  • Collect 10x sequence in a 1-to-1 ratio of two types of read pairs, short (2Kbp) and long (10Kbp): ~35 million reads for Human.
  • Collect another 20X in clone coverage of 50Kbp end sequence pairs (BAC 5’ and BAC 3’ ends): ~1.2 million pairs for Human.
  • Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

Celera’s Sequencing Factory (circa 2001)

  • 300 ABI 3700 DNA Sequencers
  • 50 Production Staff
  • 20,000 sq. ft. of wet lab
  • 20,000 sq. ft. of sequencing space
  • 800 tons of A/C (160,000 cfm)
  • $1 million / year for electrical service
  • $10 million / month for reagents

Human Data (April 2000)

Collected 27.27 million reads = 5.11X coverage

21.04 million are paired (77%) = 10.52 million pairs

  • 2Kbp: 5.045M pairs, 98.6% true*, <6% std.dev.
  • 10Kbp: 4.401M pairs, 98.6% true*, <8% std.dev.
  • 50Kbp: 1.071M pairs, 90.0% true*, <15% std.dev.

* validated against finished Chrom. 21 sequence

The clones cover the genome 38.7X

Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
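
The pair counts and the clone coverage quoted for the human data can be cross-checked with a few lines. This is a rough sketch: the genome size of 2.78 Gbp is an assumption chosen for illustration (the slide does not state the value used), and `clone_coverage` is a hypothetical helper.

```python
# (insert size in bp, number of pairs) for the three libraries on the slide
LIBRARIES = [(2_000, 5_045_000), (10_000, 4_401_000), (50_000, 1_071_000)]

# Assumption: effective genome size, not given on the slide
ASSUMED_GENOME_SIZE = 2.78e9


def clone_coverage(libraries, genome_size):
    """Total bases spanned by all clones divided by the genome size."""
    return sum(size * pairs for size, pairs in libraries) / genome_size


total_pairs = sum(pairs for _, pairs in LIBRARIES)
print(total_pairs)  # 10517000, i.e. about 10.52 million pairs
print(round(clone_coverage(LIBRARIES, ASSUMED_GENOME_SIZE), 1))  # 38.7
```

Clone coverage is much higher than sequence coverage because each clone is counted over its full insert length, while only its two ends are actually sequenced.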

Pairs Give Order & Orientation

Assembly without pairs results in contigs whose order and orientation are not known.

Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean & std. dev. is known).

[Figure: reads assembled into a contig with its consensus (15-30Kbp); a 2-pair linking two contigs into a scaffold]
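
A minimal sketch of how mate pairs characterize a gap, assuming contig A precedes contig B, both in forward orientation, and that each link is a forward read in A whose reverse mate lies in B. The class `MateLink`, the function `estimate_gap`, and the simple averaging of independent links are illustrative assumptions, not any particular assembler's method.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass
class MateLink:
    start_in_a: int      # start of the forward read, in contig A coordinates
    end_in_b: int        # end of the reverse mate, in contig B coordinates
    insert_mean: float   # library insert size (mean)
    insert_std: float    # library insert size (std. dev.)


def estimate_gap(len_a, links):
    """Return (mean, std. dev.) of the gap between contig A and contig B.

    Each insert covers the tail of A, the gap, and the head of B, so
    gap = insert - (len_a - start_in_a) - end_in_b.
    """
    estimates = [
        link.insert_mean - (len_a - link.start_in_a) - link.end_in_b
        for link in links
    ]
    # Std. dev. of the average of independent estimates (assumption)
    std = sqrt(sum(link.insert_std ** 2 for link in links)) / len(links)
    return mean(estimates), std


links = [MateLink(19_000, 500, 2_000.0, 120.0),
         MateLink(18_800, 350, 2_000.0, 120.0)]
gap, gap_std = estimate_gap(20_000, links)
print(gap, round(gap_std, 1))  # 475.0 84.9
```

Two corroborating links narrow the uncertainty from 120 to about 85 bases, which is why groups of pairs are preferred over single links.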

Anatomy of a WGS Assembly

[Figure, from largest to smallest scale: chromosome with STS markers; STS-mapped scaffolds; contigs separated by gaps (mean & std. dev. known) and linked by read pairs (mates); consensus; reads (of several haplotypes) with SNPs; external “reads”]

Assembly gaps

  • Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
  • Physical gap: no information known about the adjacent contigs, nor about the DNA spanning the gap

Assembly paradigms

  • Overlap-layout-consensus
    • greedy (TIGR Assembler, phrap, CAP3...)
    • graph-based (Celera Assembler, Arachne)
  • Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap

Greedy:

1. Build a rough map of fragment overlaps
2. Pick the largest scoring overlap
3. Merge the two fragments
4. Repeat until no more merges can be done
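
The greedy loop can be sketched in a few lines. This toy version makes several simplifying assumptions that real assemblers do not: overlaps must be exact, the score is just the overlap length, reverse complements are ignored, and no read is contained in another. The function names are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags  # no more merges can be done
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


print(greedy_assemble(["ACGTTGCA", "TGCATGGA", "TGGACCTA"]))
# ['ACGTTGCATGGACCTA']
```

Because the largest overlap is always taken first, a long repeat-induced overlap can be merged before the true one, which is the failure mode of the greedy approach.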


(A) Overlap between two reads—note that agreement within overlapping region need not be perfect; (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Assembly produced by the greedy approach.

Pop M Brief Bioinform 2009;10:354-366

© The Author 2009. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org

Overlap-layout-consensus

Main entity: read

Relationship between reads: overlap

[Figure: overlap graph over reads 1-9 and the corresponding layout; aligned sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA illustrating the consensus step]

Paths through graphs and assembly

  • Hamiltonian circuit: visit each node (city) exactly once, returning to the start
  • Hamiltonian path: visit each node (city) exactly once
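
To make the Hamiltonian path definition concrete, here is a brute-force sketch over a tiny directed graph whose nodes stand for reads and whose edges stand for overlaps. It tries every ordering of the nodes, so it is only usable on toy inputs; the function name and the example graph are illustrative assumptions.

```python
from itertools import permutations


def hamiltonian_paths(nodes, edges):
    """Yield every ordering of nodes in which consecutive nodes are joined by an edge."""
    edge_set = set(edges)
    for order in permutations(nodes):
        if all((u, v) in edge_set for u, v in zip(order, order[1:])):
            yield list(order)


reads = [1, 2, 3, 4]
overlaps = [(1, 2), (2, 3), (3, 4), (1, 3)]
print(list(hamiltonian_paths(reads, overlaps)))  # [[1, 2, 3, 4]]
```

The number of orderings grows factorially with the number of reads, which hints at why layout is the hard step of overlap-layout-consensus.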

Overlap between two sequences

        GGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGC

overlap (19 bases), overhang (6 bases), % identity = 18/19 = 94.7%

  • overlap - region of similarity between the two sequences
  • overhang - un-aligned ends of the sequences
  • The assembler screens merges based on:
    • length of overlap
    • % identity in overlap region
    • maximum overhang size.
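
The numbers on this slide can be reproduced with a small screening function. This is a sketch under stated assumptions: only substitutions are allowed (no insertions or deletions), the thresholds `min_length` and `min_identity` are illustrative defaults, and the names `Overlap` and `find_overlap` are hypothetical. It reports the un-aligned ends but does not screen on them, since how a maximum overhang is applied depends on the assembler.

```python
from typing import NamedTuple, Optional


class Overlap(NamedTuple):
    length: int          # number of aligned bases
    identity: float      # fraction of aligned bases that match
    left_overhang: int   # un-aligned bases at the start of a
    right_overhang: int  # un-aligned bases at the end of b


def find_overlap(a: str, b: str, min_length: int = 10,
                 min_identity: float = 0.9) -> Optional[Overlap]:
    """Longest suffix(a)/prefix(b) overlap that passes both screens, or None."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        matches = sum(x == y for x, y in zip(a[len(a) - n:], b[:n]))
        if matches / n >= min_identity:
            return Overlap(n, matches / n, len(a) - n, len(b) - n)
    return None


ov = find_overlap("CAGTACTTGGATGCGCTGACACGTAGC", "GGATGCGCGGACACGTAGCCAGGAC")
print(ov.length, round(ov.identity * 100, 1), ov.right_overhang)  # 19 94.7 6
```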

All pairs alignment

  • Needed by the assembler
  • Try all pairs: must consider ~n^2 pairs
  • Smarter solution: only n × coverage (e.g. 8) pairs are possible
    • Build a table of k-mers contained in sequences (single pass through the genome)
    • Generate the pairs from k-mer table (single pass through k-mer table)
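
The k-mer table idea can be sketched as follows. The function name `candidate_pairs` is hypothetical, reverse complements are ignored, and no cap is placed on very frequent k-mers, which a practical implementation would need; the returned pairs are only candidates that still have to be aligned.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGTTGCA", "TGCATGGA", "TGGACCTA", "CCCCCCCC"]
print(sorted(candidate_pairs(reads, 4)))  # [(0, 1), (1, 2)]
```

Only two of the six possible pairs are proposed here, so the expensive alignment step runs on far fewer than n^2 pairs.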

BWT-based overlap detection
  • Efficient construction of an assembly string graph using the FM-index, Jared T. Simpson and Richard Durbin, Bioinformatics, 26 (12): i367-i373 (2010)
  • Read it yourself for more details

[Figure: BWT for multiple sequences, with ACT as the example string (ACT$...)]

OVERLAP GRAPH

Edge Types:

  • Regular Dovetail
  • Prefix Dovetail
  • Suffix Dovetail

Edges are annotated with deltas of overlaps

[Figure: reads A and B drawn in each dovetail configuration, with an example]

The Unitig Reduction

1. Remove “Transitively Inferrable” Overlaps

[Figure: reads A, B and C with their pairwise overlaps]
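
An overlap A→C is transitively inferable when overlaps A→B and B→C already exist, so it adds no information to the graph. A minimal sketch of removing such edges, assuming an acyclic overlap graph given as (source, target) pairs and ignoring the overlap-length consistency checks a real assembler would apply; the function name is hypothetical.

```python
def remove_transitive_edges(edges):
    """Drop every edge (u, w) for which some v gives edges (u, v) and (v, w)."""
    edge_set = set(edges)
    succ = {}
    for u, v in edge_set:
        succ.setdefault(u, set()).add(v)
    reduced = set()
    for u, w in edge_set:
        inferable = any(w in succ.get(v, ()) for v in succ[u] if v != w)
        if not inferable:
            reduced.add((u, w))
    return reduced


overlaps = [("A", "B"), ("B", "C"), ("A", "C")]
print(sorted(remove_transitive_edges(overlaps)))  # [('A', 'B'), ('B', 'C')]
```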

2. Collapse “Unique Connector” Overlaps

[Figure: reads A and B joined by a single overlap; edge labels 412, 352 and 45]
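
One way to read "unique connector" is an edge u→v where u has no other outgoing overlap and v has no other incoming overlap; chains of such edges can be collapsed into a single unitig. The sketch below follows that reading, which is an assumption, and returns each unitig as a list of read names. It requires Python 3.8+ for the `:=` operator.

```python
def collapse_unique_connectors(nodes, edges):
    """Group nodes into chains joined only by unique-connector edges."""
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def unique_next(u):
        # u -> v is a unique connector if it is u's only out-edge and v's only in-edge
        if len(succ[u]) == 1 and len(pred[succ[u][0]]) == 1:
            return succ[u][0]
        return None

    def unique_prev(v):
        if len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1:
            return pred[v][0]
        return None

    seen = set()

    def walk(start):
        chain = [start]
        seen.add(start)
        while (nxt := unique_next(chain[-1])) is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
        return chain

    unitigs = []
    for n in nodes:  # chains start where no unique connector enters
        if n not in seen and unique_prev(n) is None:
            unitigs.append(walk(n))
    for n in nodes:  # whatever is left lies on a simple cycle
        if n not in seen:
            unitigs.append(walk(n))
    return unitigs


print(collapse_unique_connectors(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]))
# [['A', 'B', 'C'], ['D'], ['E']]
```

The chain stops at C because C has two outgoing overlaps, the kind of branching a repeat boundary produces.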

Celera Assembly Pipeline

  • Trim & Screen
  • Overlapper: find all overlaps ≥ 40bp allowing 6% mismatch. An overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED one.
  • Unitiger: compute all overlap-consistent sub-assemblies: Unitigs (Uniquely Assembled Contigs)
  • Scaffolder: scaffold U-unitigs with confirmed pairs (mated reads)
  • Repeat Rez I, II: fill repeat gaps with doubly anchored positive unitigs (Unitig>0)

Handling repeats

  • Repeat detection
    • pre-assembly: find fragments that belong to repeats
      • statistically (most existing assemblers)
      • repeat database (RepeatMasker)
    • during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
    • post-assembly: find repetitive regions and potential mis-assemblies
      • Reputer, RepeatMasker
      • "unhappy" mate-pairs (too close, too far, mis-oriented)
  • Repeat resolution
    • find DNA fragments belonging to the repeat
    • determine correct tiling across the repeat
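
The "unhappy" mate-pair check mentioned under post-assembly detection can be sketched as a simple classifier. The assumptions here are illustrative: mates are expected to face each other (forward read upstream, reverse read downstream), and a pair counts as too close or too far when its span deviates by more than `n_std` standard deviations (default 3) from the library mean.

```python
def classify_mate_pair(start1, strand1, end2, strand2,
                       insert_mean, insert_std, n_std=3.0):
    """Classify a placed mate pair.

    start1/strand1: start and strand of the leftmost read in the assembly
    end2/strand2:   end and strand of the rightmost read in the assembly
    """
    if strand1 != "+" or strand2 != "-":
        return "mis-oriented"
    span = end2 - start1
    if span < insert_mean - n_std * insert_std:
        return "too close"
    if span > insert_mean + n_std * insert_std:
        return "too far"
    return "happy"


# 2Kbp library with a 120bp std. dev. (about 6% of the mean)
print(classify_mate_pair(1000, "+", 3050, "-", 2000, 120))  # happy
print(classify_mate_pair(1000, "+", 1500, "-", 2000, 120))  # too close
print(classify_mate_pair(1000, "+", 9000, "-", 2000, 120))  # too far
print(classify_mate_pair(1000, "+", 3000, "+", 2000, 120))  # mis-oriented
```

Clusters of unhappy pairs in one region, rather than isolated ones, are the more convincing sign of a mis-assembly, since about 1-10% of pairs are wrong to begin with according to the human data slide.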

Statistical repeat detection

Significant deviations from average coverage flagged as repeats.

  • frequent k-mers are ignored
  • “arrival” rate of reads in contigs compared with theoretical value

Problem 1: assumption of uniform distribution of fragments - leads to false positives

  • non-random libraries
  • poor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives
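
One way to compare the arrival rate of reads with its theoretical value is a log-odds score under a Poisson model, sketched below. The model is an assumption: if reads start uniformly at random, a unique contig of length L expects λ = L × N / G read starts (N reads, genome size G), while a collapsed two-copy repeat expects 2λ. The log ratio of the two Poisson probabilities for k observed reads simplifies to λ − k ln 2. The function name and the example numbers are illustrative.

```python
from math import log


def arrival_log_odds(contig_len, reads_in_contig, total_reads, genome_size):
    """ln P(k reads | unique) - ln P(k reads | two-copy repeat), Poisson model.

    Positive values favor a unique region, negative values a collapsed repeat.
    """
    expected = contig_len * total_reads / genome_size  # lambda for a unique contig
    return expected - reads_in_contig * log(2)


# 10Kbp contig, 27.27 million reads, assumed genome size of 2.9 Gbp (about 94 reads expected)
print(arrival_log_odds(10_000, 94, 27_270_000, 2.9e9) > 0)   # True: looks unique
print(arrival_log_odds(10_000, 190, 27_270_000, 2.9e9) < 0)  # True: looks repeated
```

Both problems listed above show up in this model: cloning bias breaks the uniform-arrival assumption, and short contigs expect too few reads for the score to separate one copy from two.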

Mis-assembled repeats

  • excision
  • collapsed tandem
  • rearrangement

Eulerian path-based assembly
  • Break each read into k-mers (typically k >= 19)
  • Construct a de Bruijn graph using the k-mers from all reads
    • Each k-mer is a node
    • v1 has a directed edge to v2 if the last k-1 chars of v1 equal the first k-1 chars of v2 (v2 drops the first char of v1 and adds a new char at the end), e.g.
      v1 = acgtctgact
      v2 = cgtctgactg
  • Find an Eulerian path in the graph
    • visits each edge exactly once
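
The steps above can be sketched end to end on a toy example. The assumptions: reads are error-free, reverse complements are ignored, an edge is added only between k-mers that are adjacent in some read, and k = 3 is used purely for readability (the slide suggests k >= 19 in practice). The path search is the standard Hierholzer algorithm; all function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins k-mers adjacent in a read (overlap of k-1)."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for u, v in sorted(edges):
        graph[u].append(v)
    return graph


def eulerian_path(graph):
    """Return a node sequence that uses every edge exactly once (Hierholzer)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    n_edges = sum(len(targets) for targets in graph.values())
    # Start at the node with one more outgoing than incoming edge, if any
    start = next((u for u in sorted(graph)
                  if len(graph[u]) - in_deg.get(u, 0) == 1), min(graph))
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    if len(path) != n_edges + 1:
        raise ValueError("graph has no Eulerian path")
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
print(spell(eulerian_path(de_bruijn(reads, 3))))  # ATGGCGTGCA
```

No pairwise read overlaps are computed at any point: the reads only contribute k-mers and the adjacencies between them.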

1. Sequencing
2. Constructing a de Bruijn graph
3. Simplification
4. Error removal

Eulerian path-based assembly
  • No need to compute pairwise overlaps – important for NGS data
  • Eulerian paths are much easier to find than Hamiltonian paths
    • Catch: multiple Eulerian paths may exist
    • Loss of information
  • Repeats appear as cycles in the graph
    • Less likely to cause mis-assembly
  • More suitable for short-reads assembly
    • Newbler
    • VELVET
    • EDENA
    • ABySS
    • See Flicek & Birney, Nat Methods, 2009
References
  • Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009)
  • Genome assembly reborn: recent computational challenges, Mihai Pop, Briefings in Bioinformatics, 10(4): 354-366 (2009)