# Hierarchical Sequencing


### Hierarchical Sequencing

(Figure labels: a BAC clone, map, genome)

Hierarchical Sequencing Strategy
• Obtain a large collection of BAC clones
• Map them onto the genome (Physical Mapping)
• Select a minimum tiling path
• Sequence each clone in the path with shotgun sequencing
• Assemble the reads of each clone
• Put everything together
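The "select a minimum tiling path" step can be sketched as a greedy interval cover. This is only an illustration under simplifying assumptions: every clone already has exact start and end coordinates on the physical map, and the names `minimum_tiling_path`, `clones`, `target_start` and `target_end` are hypothetical, not taken from the slides.

```python
def minimum_tiling_path(clones, target_start, target_end):
    """Pick a smallest set of clones covering [target_start, target_end).

    clones: dict mapping a clone name to (start, end) map coordinates.
    This coordinate model is an assumption made for the example.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    path = []
    covered_to = target_start
    i = 0
    while covered_to < target_end:
        best_name, best_end = None, covered_to
        # Among clones starting at or before the covered frontier,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1][0] <= covered_to:
            name, (start, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best_name)
        covered_to = best_end
    return path


example_clones = {
    "A": (0, 150), "B": (50, 220), "C": (100, 300),
    "D": (140, 310), "E": (280, 450), "F": (300, 460),
}
example_path = minimum_tiling_path(example_clones, 0, 450)  # ["A", "D", "F"]
```

A real tiling path would probably also require a minimum overlap between neighbouring clones so that their sequences can be joined; the sketch only requires that they touch.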

Methods of physical mapping

Goal:

Make a map of the locations of each clone relative to one another

Use the map to select a minimal set of clones to sequence

Methods:

• Hybridization
• Digestion

1. Hybridization

Short words, the probes, attach to complementary words

• Construct many probes
• Treat each BAC with all probes
• Record which ones attach to it
• Same words attaching to BACs X, Y ⇒ X and Y overlap

(Figure labels: probes p1, …, pn)
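The hybridization rule above (two BACs that bind the same probes probably overlap) can be sketched as a set intersection. The function name, the `min_shared` threshold and the example data are assumptions made for illustration only.

```python
from itertools import combinations


def overlaps_by_shared_probes(probe_hits, min_shared=1):
    """Report BAC pairs that share at least `min_shared` probes.

    probe_hits: dict mapping a BAC name to the set of probe ids
    that attached to it (hypothetical input format).
    """
    result = {}
    for x, y in combinations(sorted(probe_hits), 2):
        shared = probe_hits[x] & probe_hits[y]
        if len(shared) >= min_shared:
            result[(x, y)] = shared
    return result


example_hits = {
    "X": {"p1", "p2", "p5"},
    "Y": {"p2", "p5", "p7"},
    "Z": {"p9"},
}
example_overlaps = overlaps_by_shared_probes(example_hits)
# {("X", "Y"): {"p2", "p5"}}
```

Raising `min_shared` trades sensitivity for fewer false overlaps, which may matter when a probe word also occurs in repeats.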

2. Digestion

Restriction enzymes cut DNA where specific words appear

• Cut each clone separately with an enzyme
• Run fragments on a gel and measure length
• Clones Ca, Cb both have fragments of lengths { li, lj, lk } ⇒ Ca and Cb overlap

Double digestion:

Cut with enzyme A, enzyme B, then enzymes A + B
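Comparing digestion fingerprints can be sketched as counting fragment lengths that two clones have in common, allowing for gel measurement error. The relative `tolerance` and the `min_shared` cutoff are assumed values chosen for the example, not figures from the slides.

```python
def shared_fragments(lengths_a, lengths_b, tolerance=0.02):
    """Count fragment lengths that match within a relative tolerance."""
    a, b = sorted(lengths_a), sorted(lengths_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1          # treat as the same fragment
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def likely_overlap(lengths_a, lengths_b, min_shared=3, tolerance=0.02):
    """Heuristic: enough shared fragment lengths suggests an overlap."""
    return shared_fragments(lengths_a, lengths_b, tolerance) >= min_shared


clone_a = [1200, 3400, 5100, 800]
clone_b = [3410, 5090, 2500, 790]
example_shared = shared_fragments(clone_a, clone_b)  # 3
```

Different fragments can have similar lengths by chance, so a shared-length count is evidence of overlap rather than proof.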

### Online Clone-by-clone: The Walking Method

The Walking Method
• Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
• Sequence some “seed” clones
• “Walk” from seeds using clone-ends to pick library clones that extend left & right

Some Terminology

insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)

vector: the circular genome (host) that incorporated the fragment

BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb

read: a 500-900 bp long word that comes out of a sequencing machine

coverage: the average number of reads (or inserts) that cover a position in the target DNA piece

shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
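The definition of coverage can be made concrete with a small calculation. The uncovered-fraction estimate uses the standard Poisson (Lander-Waterman) approximation, which assumes reads start at uniformly random positions; that model and the example numbers are assumptions, not part of the slides.

```python
import math


def coverage(num_reads, read_length, target_length):
    """Average number of reads covering a position of the target."""
    return num_reads * read_length / target_length


def expected_uncovered_fraction(c):
    """Poisson approximation: fraction of positions covered by no read."""
    return math.exp(-c)


# Hypothetical example: a 160 kb BAC sequenced with 2,000 reads of 800 bp.
bac_coverage = coverage(2_000, 800, 160_000)           # 10.0
bac_uncovered = expected_uncovered_fraction(bac_coverage)
```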

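Whole Genome Shotgun Sequencing

(Figure: the genome is cut many times at random; the pieces are cloned as inserts in plasmids (2 – 10 Kbp) and cosmids (40 Kbp); ~800 bp are read from each end of an insert, a known distance apart.)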

### Fragment Assembly (in whole-genome shotgun sequencing)

Fragment Assembly

Where N ~ 30 million…

We need to use a linear-time algorithm

Steps to Assemble a Genome

1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence

..ACGATTACAATAGGTT..

Some Terminology

read: a 500-900 bp long word that comes out of the sequencer

mate pair: a pair of reads from two ends of the same insert fragment

contig: a contiguous sequence formed by overlapping reads, with no gaps

supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs

consensus sequence: the sequence derived from the aligned reads in a contig

Example reads:

aaactgcag aactgcagt actgcagta gtacggatc tacggatct
gggcccaaa ggcccaaac gcccaaact actgcagta ctgcagtac
gtacggatc tacggatct acggatcta ctactacac tactacaca

Reads in sorted order:

aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg
cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac
gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca

Longer sequences, each followed by reads it contains:

aaactgcagtacggatct: aaactgcag aactgcagt gtacggatct tacggatct
gggcccaaactgcagtac: gggcccaaa ggcccaaac actgcagta ctgcagtac
gtacggatctactacaca: gtacggatc tacggatct ctactacac tactacaca
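Merging "good" pairs of reads into longer contigs can be illustrated with a toy greedy merger that repeatedly joins the two sequences with the longest exact suffix-prefix overlap. This is a sketch under strong assumptions (error-free reads, a single strand, no repeats, a hypothetical `min_len` of 3), not the method a production assembler would use, and it runs in roughly cubic time.

```python
def overlap_length(a, b, min_len):
    """Longest suffix of `a` equal to a prefix of `b`, or 0 if < min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(reads, min_len=3):
    """Greedily merge reads by best exact overlap; returns the contigs."""
    seqs = list(dict.fromkeys(reads))        # drop exact duplicates
    while True:
        # Drop sequences fully contained in another sequence.
        seqs = [s for s in seqs if not any(s != t and s in t for t in seqs)]
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return seqs                      # nothing left to merge
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]


example_reads = ["aaactgcag", "aactgcagt", "actgcagta", "gtacggatc", "tacggatct"]
example_contigs = greedy_merge(example_reads)   # ["aaactgcagtacggatct"]
```

On reads that span a repeat, the greedy choice can join the wrong pair, which is one reason assemblers work on an overlap graph instead of merging immediately.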

• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment – throw away if not >98% similar

TAGATTACACAGATTAC

|||||||||||||||||

TAGATTACACAGATTAC
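The two bullets above (seed on a shared k-mer, then extend and filter by similarity) can be sketched as follows. To stay short, the sketch replaces the full alignment with an ungapped comparison at the offset implied by the shared k-mer, and it ignores reverse-complement matches; both are simplifications, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Map each pair of reads sharing a k-mer to the implied shifts."""
    index = defaultdict(list)                 # k-mer -> [(read, position)]
    for r, seq in enumerate(reads):
        for p in range(len(seq) - k + 1):
            index[seq[p:p + k]].append((r, p))
    pairs = defaultdict(set)                  # (i, j) -> {shift of j in i}
    for hits in index.values():
        for (i, p), (j, q) in combinations(hits, 2):
            if i != j:
                pairs[(i, j)].add(p - q)
    return pairs


def ungapped_identity(a, b, shift):
    """Fraction of equal bases when b is placed `shift` positions into a."""
    start, end = max(0, shift), min(len(a), shift + len(b))
    if end <= start:
        return 0.0
    matches = sum(a[x] == b[x - shift] for x in range(start, end))
    return matches / (end - start)


def find_overlaps(reads, k=24, min_identity=0.98):
    """Return (i, j, shift, identity) for candidate pairs that pass."""
    found = []
    for (i, j), shifts in sorted(candidate_pairs(reads, k).items()):
        for shift in sorted(shifts):
            identity = ungapped_identity(reads[i], reads[j], shift)
            if identity > min_identity:
                found.append((i, j, shift, identity))
    return found


toy_reads = ["TAGATTACACAGATTAC", "ATTACACAGATTACTGA", "GGGCCCAAACTGCAGTA"]
toy_overlaps = find_overlaps(toy_reads, k=8)   # [(0, 1, 3, 1.0)]
```

The toy reads are shorter than 24 bases, so the example call uses k = 8; with real reads the defaults follow the values on the slide.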

• Caveat: repeats
• ALU k-mers could cause up to 1,000,000² comparisons
• Solution:
• Discard all k-mers that occur “too often”
• Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
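The repeat caveat and its solution can be sketched by counting k-mers and keeping only those below a cutoff. The cutoff parameter `max_occurrences` is a hypothetical name; as the slide notes, its value would be tuned to the genome and the available computing resources.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for seq in reads:
        for p in range(len(seq) - k + 1):
            counts[seq[p:p + k]] += 1
    return counts


def usable_kmers(reads, k, max_occurrences):
    """Keep only k-mers that do not occur 'too often'."""
    counts = kmer_counts(reads, k)
    return {kmer for kmer, n in counts.items() if n <= max_occurrences}


def pairs_from_occurrences(m):
    """Number of read pairs seeded by a k-mer that occurs m times."""
    return m * (m - 1) // 2


toy = ["AAAAAA", "AAAACG", "TTACGT"]
kept = usable_kmers(toy, k=3, max_occurrences=2)
# "AAA" occurs 6 times and is discarded; 5 k-mers remain.
repeat_cost = pairs_from_occurrences(1_000_000)   # 499,999,500,000
```

A k-mer seen a million times seeds on the order of 1,000,000² pairwise comparisons, which is why discarding it costs little sensitivity relative to the time saved.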

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA

TAGATTACACAGATTACTGA

• Correct errors using multiple alignment

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTATTGA

TAG-TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG-TTACACAGATTACTGA

TAG-TTACACAGATTATTGA

insert A

correlated errors, probably caused by repeats ⇒ disentangle overlaps

replace T with C

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

In practice, error correction removes up to 98% of the errors

TAG-TTACACAGATTATTGA

TAG-TTACACAGATTATTGA
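Error correction from a multiple alignment can be sketched as a column-wise majority vote. In this sketch a base is only corrected when few reads disagree with the majority; a disagreement shared by several reads is left alone, on the assumption that it is a correlated difference from a repeat copy. The threshold `max_minority` is hypothetical, and the rows are assumed to be already aligned to equal length with "-" marking gaps.

```python
from collections import Counter


def column_consensus(rows):
    """Most common character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def correct_reads(rows, max_minority=1):
    """Replace isolated minority characters by the column majority."""
    corrected = [list(row) for row in rows]
    for c, column in enumerate(zip(*rows)):
        counts = Counter(column)
        major = counts.most_common(1)[0][0]
        for base, n in counts.items():
            # Only isolated disagreements are treated as sequencing errors.
            if base != major and n <= max_minority:
                for r, row in enumerate(rows):
                    if row[c] == base:
                        corrected[r][c] = major
    return ["".join(row) for row in corrected]


good = "TAGATTACACAGATTACTGA"
isolated = [good, good, good,
            "TAG-TTACACAGATTACTGA",     # missing A: "insert A"
            "TAGATTACACAGATTATTGA"]     # T instead of C: "replace T with C"
fixed = correct_reads(isolated)         # every row becomes `good`

correlated = [good, good, good,
              "TAGATTACACAGATTATTGA",
              "TAGATTACACAGATTATTGA"]
untouched = correct_reads(correlated)   # rows are left unchanged
```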

• Overlap graph:
• Nodes: reads
• Edges: overlaps (ri, rj, shift, orientation, score)

(Figure: reads from two regions of the genome (blue and red) that contain the same repeat.)

Note: of course, we don’t know the “color” of these nodes.
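One possible in-memory representation of the overlap graph stores each overlap with the five fields listed on the slide. The class names and the "+" / "-" orientation encoding are assumptions made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    ri: int            # index of the first read
    rj: int            # index of the second read
    shift: int         # offset of rj relative to ri
    orientation: str   # "+" same strand, "-" reverse complement (assumed encoding)
    score: float       # alignment quality, e.g. fraction identity


class OverlapGraph:
    """Nodes are read indices; edges are Overlap records."""

    def __init__(self, num_reads):
        self.num_reads = num_reads
        self.edges = defaultdict(list)

    def add(self, overlap):
        self.edges[overlap.ri].append(overlap)
        self.edges[overlap.rj].append(overlap)

    def neighbors(self, read):
        """Sorted indices of the reads that overlap `read`."""
        found = self.edges.get(read, [])
        return sorted({o.rj if o.ri == read else o.ri for o in found})


graph = OverlapGraph(num_reads=4)
graph.add(Overlap(0, 1, 3, "+", 1.0))
graph.add(Overlap(1, 2, 5, "+", 0.99))
graph.add(Overlap(1, 3, -2, "-", 0.985))
```

A read whose neighbours cannot all be placed consistently on one line is a hint that it lies in a repeat, which is the situation the blue and red nodes illustrate.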