Bioinformatics workshop 1 sequences and similarity searches
This presentation is the property of its rightful owner.
Sponsored Links
1 / 84

Bioinformatics Workshop 1 Sequences and Similarity Searches PowerPoint PPT Presentation


  • 66 Views
  • Uploaded on
  • Presentation posted in: General

Bioinformatics Workshop 1 Sequences and Similarity Searches. Open a web browser and type in the URL: informatics.gurdon.cam.ac.uk/online/workshops Bookmark this page Click on the link to the file: useful-websites.html Bookmark this page too

Download Presentation

Bioinformatics Workshop 1 Sequences and Similarity Searches

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Bioinformatics workshop 1 sequences and similarity searches

Bioinformatics Workshop 1Sequences and Similarity Searches

  • Open a web browser and type in the URL:

    • informatics.gurdon.cam.ac.uk/online/workshops

    • Bookmark this page

  • Click on the link to the file:

    • useful-websites.html

    • Bookmark this page too

    • It also contains links to the example sequence files used in the workshop, and the presentations themselves


The basic questions

The Basic Questions

Where, and how, do I find something?

How do I know it’s real?

Exercise 0:

Write a concise definition of what a gene is.


Part 1 structural genomics

Part 1: Structural Genomics

DNA arranged in chromosomes

Vertebrate ~ 109 base pairs


Chromosomes and genes

Chromosomes and Genes

Total of ~30,000 genes on ~20 chromosomes

1000 – 2000 genes per chromosome


Gene to protein

Gene to Protein

~ gene

locus

genome

primary transcript

mRNA

protein


Sequence signals

Sequence Signals

mRNA

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

M

L

T

I

L

L

A


Genomic signals

promoters

splice sites

===CACGATCGAGTC===================

==ACGTA…………CAGTA====================

enhancers

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

Genomic Signals


Derivative sequences

5’ EST

3’ EST

Derivative Sequences

mRNA

5’

3’

capture by cloning into cDNA library

EST: single pass sequence from each end of the clone

cDNA sequence

cDNA: multiple pass sequencing over whole length of the clone


Gene models

Gene Models

exons

gene model


Sequences and genes accession numbers and names

mRNAs/cDNAs

S43105.1

‘similar to Cyclin B1 [mus musculus]’

BT006437.1

‘Cyclin B1, isoform 1 [mus musculus]’

X58708.1

‘Cyclin B1, isoform 2 [mus musculus]’

NM_111985.3

‘CCNB1, Cyclin B1 [mus musculus]’

AAB22970.1

proteins

AAP21245.1

CAA41545.1

NP_187759.2

Sequences and Genes(Accession Numbers and Names)

gene


Gene symbols names etc

Gene Symbols, Names, Etc.

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases:CCNB, CYCB1


A gene centric view

genomic location

S43105.1

AAB22970.1

BT006437.1

AAP21245.1

X58708.1

CAA41545.1

NM_111985.3

NP_187759.2

expression data

A Gene-Centric View

Entrez Gene

http://www.ncbi.nlm.nih.gov/

Cyclin B1

Exercise 1:

Go to Entrez Gene and look for your favourite gene or genes.


Sequences and accession numbers

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NP_001015922.1 protein translated from mRNA

XM_001102567.1 predicted mRNA

XP_001089765.1 predicted protein translated from predicted mRNA


Mrna splicing signals

genome

exon intron exon intron exon

mRNA Splicing Signals

mRNA

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

gene model

splice sites

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG.donor

.TTTCAG acceptor


Gene predictions

Gene Predictions

  • Given:

  • coding sequence must run from ATG – STOP codon in-frame

  • introns GT. . . . . . AG can be spliced out

  • Also take a statistical approach:

  • coding and non-coding sequence are slightly different in composition

  • some ‘possible’ splice sites are more likely than others

scan genomic sequence …

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

. . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

. . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

most likely gene model

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .


Supporting evidence

Supporting Evidence!

exons: 1 2 3 4

gene model

genome

EST evidence

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…).

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.


Theoretical predicted sequences

Theoretical/Predicted Sequences

exons: 1 2 3 4

predicted gene model

genome

predicted transcript

predicted protein

We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know if these predicted proteins exists – especially where supporting EST evidence is weak, or non-existent.


Sequences for a model organism

Sequences for a model organism

ESTs – millions @ £10 each

Cheap to sequence – so we get millions per organism

But lots of errors

And incomplete gene sequences

Can give us relative expression levels

cDNAs – tens of thousands @ £1000 each

Expensive – but only need to do one (or a small number) per gene

Few errors with multipass sequencing

Gives us protein sequences

Genomes – one ! @ £30,000,000

Extremely expensive

But the only way to get the whole picture

Gives us gene regulation


So what s in the databases now

So What’s in the Databases Now?

NCBI July 2005

15,000,000

ESTs

3,300,000

cDNAs

DNA

nr

RefSeq

2,700,000

proteins

950,000

proteins

Proteins


Part 2 comparative genomics

Part 2: Comparative Genomics

Evolution by sequence mutation

Gene sequence

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCAACGACTGCCGTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

ATGCATGCTGCCAACGACTGCCGTG

ATGCATGCTGCCAACGACTGCCCTG

ATGCATGCTGCCAACGGCTGCCCTG

ATGCATGCTGCCAACGGATGCCCTG

ATGCATGCCGCCAACGGATGCCCTG

ATGCATGCCGCCAACGGATGTCCTG


Speciation

Speciation

Gene A

ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCAACGACTGTCGTG

ATGCAGGCTGCCAACGACTGCCGTG

ATGAAAGCCGCCAACGACTGTCGTG

ATGCATGCTGCCAACGACTGCCGTG

ATGAAAGCCGCCAACGACAGTCGTG

ATGCATGCTGCCAACGACTGCCCTG

ATGCATGCTGCCAACGGCTGCCCTG

ATGAAAGCCGCCTACGACAGTCGTG

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG

||| | || ||| ||| | ||||

ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed, with fertile offspring – then we have a new species…


Residual similarity

ATGCATGCTGCCAACGGATGCCCTG

||| | || ||| ||| | ||||

ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG

||| | | || | | | || |

ATGGAAGGCGCTTAGGATAGTCCAG

Residual Similarity

We can still easily detect residual similarity between these sequences, this is what we call homology – detectable similarity because of common evolutionary origin.

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…


Computers can detect homology

ATGCATGCTGCCAACGGATGCCCTG

||| | || ||| ||| | ||||

ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT

| || | | | |

CCAACATCTAGCCAGATTAGTTAGT

Computers Can Detect Homology

In fact computers are very good at this task – the two primary challenges are:

(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientist’s attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…


Orthologs

A

A

A

Orthologs

Gene duplication though speciation

The two copies of Gene A will now evolve independently, but will continue to have the ~same function

They are ORTHOLOGS


Paralogs

A

A

A

A

A’

Paralogs

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function

Gene duplication though internal genome duplication

They are PARALOGS


Other logs

A

A

A

A

A’

‘Other’-logs

What about gene duplication after speciation ?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.


The essential paradigm

A

A

A

A

cyclin b1

cyclin b1

The Essential Paradigm

1. any group of modern species can be traced back to some extinct common ancestor

2. in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function


Function conserved longer than detectable similarity

living organisms

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

whole genome duplication

local duplication

same function

detectable similarity


Redundancy in the genetic code

Redundancy in the Genetic Code

GCA Aalanine

GCC A

GCG A

GCT A

TGC Ccystine

TGT C

GAC Daspartate

GAT D

GGA Gglycine

GGC G

GGG G

GGT G

‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against this…


Protein similarity persists longer

LSREPV

CTATCACGAGAACCTGTG

||||||

|| || || || || ||

CTGTCCCGTGAGCCAGTT

LSREPV

LSREPV

CTATCACGAGAACCTGTG

||| ||

| || | || ||

TTGTCCCGGTCGCCAGTT

LSRFPV

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG

CTATCCCGAGAACCTGTG

CTATCCCGAGAACCAGTG

CTATCCCGTGAACCAGTG

CTATCCCGTGAGCCAGTG

CTATCCCGTGAGCCAGTT

CTGTCCCGTGAGCCAGTT

100%

67%

80%

44%


Always compare protein sequences

Always Compare Protein Sequences

amino acid comparison

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG MNAAYDCRARMLR

||||| || || || || || || || ||||| || || | ||||||||+||

ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.


Exercise 1 nucleotide vs amino acid search

Exercise 1nucleotide vs amino acid search

Go to the file example-sequences.html, and locate the section for this exercise. There should be two sequences: ‘surfeit1’ for frog and fly.

Go to NCBI Blast home page, then ‘Align two sequences’ (bottom left ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes, and translate them into amino acid protein sequences, then repeat the two sequences comparison.

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen, hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit ‘Align’ change the ‘Program’ (top left) to blastp…


Answers exercise 1

Answers: Exercise 1


The essential task

The Essential Task

experiment

data mining

what is its function?

gene sequence

Cyclin-A

FoxA1

database of proteins in other species

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

we can only do this because of implied function based on orthology

frizzled

Wint8

Troponin T3


Functional orthologs

Xenopus gene

function unknown

Functional Orthologs ?

Human gene

function known, annotation ‘Gravin’ available

sequence similarity

orthologs

same function ?

But we know that function is largely determined by shape

similar shape?

Which in general we cannot determine – but it is probably SHAPE not SEQUENCE that is conserved!

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence


Finding orthologs

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of you own organism, and the one you wish to find an ortholog in.

database of human proteins

database of frog proteins

best match human protein

frog protein

x


Using synteny is better

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

Mouse chromosome 10

Human chromosome 5

Mouse chromosome 2

And we find the same genes (i.e. orthologs!) in more or less the same order in the syntenic sections.

These of course represent chromosomal re-arrangements since these organisms diverged.


Metazome

Metazome

Fortunately someone has done all the hard work for us….

Dan Rokhsar

http://www.metazome.net/


Metazome exercise

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net/), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)…


Part 3 finding sequence similarities

ATGCATGCTGCCAACGGATGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG

||| | || ||| ||| | ||||||

||| | || ||| | | | |

ATGAAAGCCGCCTACGAAAGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments:


Gaps in alignments

Gaps in Alignments

Consider these two obviously similar sequences:

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA

| || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |

TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA

|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||

TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps, until we find the optimal alignment.

But where should this process stop?


The downside of gaps

The Downside of Gaps

Take two random sequences, with no ‘real’ similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA

|| | | | | | ||| |||| ||

ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful!

To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’.

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …


Blast

BLAST

There are many programs used to find similarities between sequences.

They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known, and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best.

database

>target1

AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG

>target2

CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG

>target3

GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC

>target4

CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

>query

AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

COMPARE

LIST MATCHES


Flavours of blast

query sequence

other operation?

database sequences

Flavours of BLAST

BLASTn

ACGATAGATCCCATCCATAAAT

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

FAST

BLASTp

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

MQWCGYRWTYQGYRW

FAST

BLASTx

ACGATAGATCCCATCCATAAAT

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

6 frame translation

SLOW

tBLASTn

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

MQWCGYRWTYQGYRW

SLOWER

tBLASTx

HORRIBLY SLOW!

ACGATAGATCCCATCCATAAAT

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT


How does it work

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is:

query

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

||||||||||||||||||||||||| ||||||||||||||||||||||||

| | | | | |||||||||||||||||||||||||

|||||||||||||||||||||||| | | | | | |

|| | | | | | | | |

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach

- it pre-INDEXES database sequences


Blast words and indexing

BLAST: WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Database of sequences

Numbered list of all possible ‘words’

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 : ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 : GACAAATC 33568 GACAAATG 33569 : TCCAAACC 64321 TCCAAACC 64322 :

Build a position index of all words in the database

sequence position word

1 133658

1207967

1316210 :

315 33568

3 16 07967 :


Analyse the query sequence

Analyse the Query Sequence

>query AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

QUERY SEQUENCE

Numbered list of all possible ‘words’

Analyse QUERY SEQUENCE

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 : ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 : GACAAATC 33568 GACAAATG 33569 : TCCAAACC 64321 TCCAAACC 64322 :

position word

1 14236

2 33658

3 07967 :

Index of database

sequence position word

1 133658

1207967

1316210 :

315 33568

3 16 07967 :


Expand from word based matches

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties), to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common, consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…


Blast typical output

BLAST –Typical Output

INPUT:

>partial cDNA sequence, Xenopus tropicalis

CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT:

Query= (311 letters)

Database: NCBI Protein Reference Sequences 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1|   similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)

Expect = 6e-31

Identities = 76/98 (77%)

Positives = 82/98 (83%)

Gaps = 4/98 (4%)

Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196

MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK +

Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310

SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA

Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97


When is a match significant

When is a match significant?

Here is a ‘typical’ weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++

NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there

is no biologically significant alignment…

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS


E values

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also “expect value“ or “expectation”


E values from first principles

E-values From First Principles

Notation:

1.2e-35 = 1.2 x 10-35

4.8 x 106 = 4,800,000

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA

272,619 sequences; 503,566,580 total letters (~5.0 x 108)

Database: NCBI nr

3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 1010)

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is of course what we mostly do.


Calculating an e value

Calculating an E-value

The RefSeq mRNA database has ~ 5.0 x 108 letters

There are 4 possible nucleotides - ACGT

How many matches do we expect to find by chance?

Query = ‘A’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

A

A

A

A

A

A

A

A

A

A

A

A

A

A

Expected number of matches = (5.0 x 108) / 4 = ~1.2 x 108

Query = ‘AC’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

AC

AC

AC

AC

Expected number of matches = (5.0 x 108) / (4x 4) = ~3.1 x 107

Query = ‘ACG’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

ACG

Expected number of matches = (5.0 x 108) / (4 x 4 x 4) = ~8.1 x 106

Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer

Expected number of matches = (5.0 x 108) / (4 x 4 x 4 x 4 … 60 times )

= (5.0 x 108) / 1036

= 5.0 x 10-28

E-value = 5.0 x 10-28


E values in practice

E-values In Practice

So if I take a 60 nt sequence:

>sequence

ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT:

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds Length=6060

Score = 119 bits (60), Expect = 2e-26

Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT:

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds Length=6060

Score = 119 bits (60), Expect = 6e-25

Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 5.0e-28 - !?


E value exercise

E-value Exercise

Given a transcription factor binding site:

ACC[T/G]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions?

I.e.

ACC[T/G]TA

OR

ACC[T/G]?TA


E value exercise answer

E-value Exercise: Answer

ACC[T/G]TA

Expect ‘A’ every 4 nt

Expect ‘ACC’ every 4x4x4 = 64 nt

Expect ‘T or G’ every 2nd nt

Expect ‘ACC[T/G]’ every 64x2 nt = 128 nt

Expect ‘TA’ every 4x4 = 16 nt

Expect ‘ACC[T/G]TA’ every 128x16 nt = 2048 nt

(4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance

If also ACC[T/G]?TAA allowed?

The two motifs independently have the same E-value.

To allow either means we expect twice as many.


E values effect of database size

E-values: Effect of Database Size

The nr mRNA database has ~ 1.4 x 1010 letters (was RefSeq and 5.0 x108)

There are 4 possible nucleotides - ACGT

How many matches do we expect to find by chance?

Query = ‘A’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

A

A

A

A

A

A

A

A

A

A

A

A

A

A

Expected number of matches = (1.4 x 1010) / 4 = ~1.2 x 108

Query = ‘AC’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

AC

AC

AC

AC

Expected number of matches = (1.4 x 1010) / (4x 4) = ~3.1 x 107

Query = ‘ACG’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

ACG

Expected number of matches = (1.4 x 1010) / (4 x 4 x 4) = ~8.1 x 106

Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer

Expected number of matches = (1.4 x 1010) / (4 x 4 x 4 x 4 … 60 times )

= (1.4 x 1010) / 1036

= 1.4 x 10-26

E-value = 1.4 x 10-26

(was E-value = 5.0 x 10-28)


E values effect of database size1

E-values: Effect of Database Size

1.4 x 1010 letters

5.0 x108 letters

nr

RefSeq

30 x bigger

E-value = 5.0e-28

BLAST the same sequence

against each

E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

The E-value is simply dependent on database size.


Why were the values different

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10-28.

But our actual BLAST search at NCBI gave a value of

2.0 x 10-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

ACGTAC?GTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria:

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc.

|||||||||||| |||||| |||||| |||||| ||||||

ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database, if we allow our alignments to have gaps. The E-value will be larger.


E values effect of query length

E-values: Effect of Query Length

BLAST 500 nt sequence against a database

>sequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLASTn

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database

>sequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

database

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

BLASTn

Biologically it’s the same match! Does it mean we are any less sure that this match didn’t occur by chance?

The E-value is simply dependent on match length.


Why not just use identity

Why not just use % identity?

At some levels this a good question.

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG

||||||||||| || | || | || || |||| | | | |||||| | ||||||||||

CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10-19

And, Query2 only 16 nt long:

ACGTACGTACGTACGT

||| || | |||| ||

ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…


So what s the real problem

The difficulty is because:

BLAST

ORTHOLOGY

BLAST

Similarity + Probability

So what’s the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Are there any useful guidelines though, at least for biological meaningfulness?


Rules of thumb

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

 larger/worse smaller/better 

E-values 10-510-10 10-40 10-100 0.0

fantasy borderline encouraging pretty good can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then % identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…


Protein blast

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.

Does this cause us to treat expected values any differently?

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database,

E-value = ~3.5 x 108 / (20 x 20 x 20 …20 times) = ~3.5 x 10-18.

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04

Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)

Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62

SSSSFRAYRAALSEVEPPCI

Sbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving…


Amino acid substitutions

A S

C

F LWY

G

I LMV

L IMFV

M ILV

P

V ILM

W FY

NDHS

Q REHK

SANT

T S

Y HFW

H NQY

K RQE

R QK

DNE

E DQK

Amino Acid Substitutions

In fact we need to take into account both amino acid substitutability, as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about 1/7th chance of ‘matching’ rather than 1/20th.

So now we get:

E-value = ~3.5 x 108 / (7 x 7 x 7 …20 times) = ~4.4 x 10-9,

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices


Exercises

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?


Part 4 tweaking blast

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall –p <blast type> –i <input sequence> –d <database>

This is the simplest form: where the basic program ‘blastall’ takes a number of different options, or parameters, indicated by the –x and followed by its value.

-p <which blast flavour to run>

-i <file with query sequence in>

-d <pre-indexed database name>


Not all parameters are

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested.

E.g. for –W <word size> blastn uses –W 11, where blastx uses –W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search.


The many parameters of blast

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me! Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value

-moutput format(graphical or tabular/spreadsheet)

-F filter query sequence for low complexity(default TRUE)

-U use only upper case regions of query (default FALSE)

-Ggap opening cost

-E gap extension cost

-q nucleotide mismatch penalty (BLASTx uses matrices)

-r nucleotide match reward

-b number of matching sequences to report

-g allow gaps (default TRUE)

-W word size

-z effective database size (removes effect of actual database size!)

-S query strands to search(default both directions)

-lrestrict database sequences to given list of ‘gi‘ numbers


Blast parameters exercises

BLAST Parameters Exercises

1. BLASTn vs. BLASTx

Open the file example-sequences.html, copy the sequence:

>blastn-vs-blastx

This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page/Nucleotide-nucleotide BLAST (blastn)section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options.

Then hit [format] to wait for the results in a new page.

(hint if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat but go to the TRANSLATED BLAST section, and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?


Results for exercise 1

Results for Exercise 1.

BLASTn

BLASTx


Blast parameters exercises1

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html, copy the sequence:

>low-complexity-filtering-A

This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page/TRANSLATED BLAST section/BLASTx. Paste your sequence into the box.

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments?

Re-run, but leave the low-complexity filter ON this time.

Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and –C.

C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!


Results for exercise 2a off

Results for Exercise 2A (OFF)

BLASTn – low complexity filtering OFF


Results for exercise 2a on

Results for Exercise 2A (ON)

BLASTn – low complexity filtering ON


Results for exercise 2b

Results for Exercise 2B

ON

OFF


Results for exercise 2c

Results for Exercise 2C

ON

OFF

There is a sequence error, an extra G at position 117 in the sequence:

cDNA (117)

AGAAAAGAAGAAACATGGCAATGGATCAGAA

|||||||||||||||| ||||||||||||||

AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence


Blast parameters exercises2

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items.

For instance to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list).

To combine items use logical AND, OR or NOT.

Open the file example-sequences.html.

Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page/ TRANSLATED BLAST section/BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…


Blast parameters exercises3

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page/SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.

Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window

Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison, should be BLASTn. Note the alignment.

Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn, and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties….

What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…


Results for exercise 4 i

Results for Exercise 4 (i)

BLASTn

tBLASTx


Results for exercise 4 ii

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default)

Mismatch penalty = -1


Blast parameters exercises4

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page.

Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?


Blast parameters exercises5

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page.

Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter, and then run the search.

Now re-run the search, setting the following parameters:

Low complexity OFF

Expect 100

Word Size7

Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make?


Bioinformatics workshop 1 sequences and similarity searches

END


  • Login