
EBI Roadshow

James Watson, PhD

Senior Scientific Training Officer

EBI-EMBL

[email protected]


Sequence Searching and Alignments

Andrew Cowley

External Services, EMBL-EBI


External Services

Andrew Cowley

Bioinformatics Trainer

Hamish McWilliam

Software engineer

Rodrigo Lopez

Head of External Services

+ many others!

Sequence searching and alignments - Andrew Cowley


Contents

  • Sequence databases

    • Database browsing tools

  • Similarity searching and alignments

    • Alignment basics

    • Similarity searching tools

    • More advanced tools

    • Alignment tools

    • Guidelines

  • (slightly) More advanced tools

  • Problem sequences

Materials

Presentations and tutorials can be found on

the roadshow course page at the EBI

Data files for exercises can be found at:

www.ebi.ac.uk/~watson/africa



Data

Simplistically, much of the data at the EBI can be thought of as a container with two parts:

  • The raw data (e.g. sequence)

  • Annotation on this data


Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtcarttaattcacagagccctgaggtctctcgctcctttctgcgtcactctct 60
     cttaccgtcatcatgtacaagtggttgctttgcctggctctgattggctgcgtcagcggc 120
     tgcaacatcctccagaggctgaaggtcaagaaccagtggcaggaggctttcggctatgct 180
     gacgacaggacatcccycggtaccgcattgtggagatccatcatcatgcagaagcccgag 240
//
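The "container" view of an entry (annotation plus raw sequence) can be sketched in code. This is a minimal sketch, assuming the usual EMBL flat-file layout: two-letter line codes, an `SQ` line that opens the sequence block, and `//` as the record terminator. The function name `split_embl_record` is hypothetical and not part of any EBI library.

```python
def split_embl_record(text):
    """Split one EMBL-style flat-file record into (annotation, sequence)."""
    annotation = {}       # line code -> list of values, e.g. "AC" -> ["AJ131285;"]
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end of record
            break
        if in_sequence:
            # Sequence lines hold residues plus a running base count; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("SQ"):        # the sequence block starts after this line
            in_sequence = True
        elif line[:2].strip() and line[:2] != "XX":   # skip blank and spacer lines
            annotation.setdefault(line[:2], []).append(line[2:].strip())
    return annotation, "".join(sequence_parts)


record = """ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac        20
//
"""
annotation, sequence = split_embl_record(record)
print(annotation["AC"], sequence)
```

For real work a maintained parser is the safer choice; the sketch only shows where the annotation ends and the raw data begins.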


Data - Nucleotide

  • ENA/EMBL-Bank:

    • Release and updates

    • Divided into classes and divisions

    • Supplementary sets: EMBL-CDS, EMBL-MGA

  • Specialist data sets, e.g.:

    • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.

    • Alternative splicing: ASD, ASTD, etc.

    • Completed genomes: Ensembl, Integr8, etc.

    • Variation: HGVBase, dbSNP, etc.


Individual sequencing

What sequence data is submitted?

Individual scientists: sequence individual gene → add annotation → submission


High throughput sequencing

Large-scale sequencing projects (e.g. whole genome shotgun):

chromosome → fragment → sequencing library → sequence reads → assemble sequence → annotation (e.g. cyp30, cyp309, insv, cg343)

Submissions can be made at several stages of this pipeline.


What are primary sequence databases?

Individual scientists, large-scale sequencing projects and patent offices → primary sequence data → primary sequence database

Primary sequence data:

  • Original sequence data
    • Experimental data
    • Patent data
  • Submitter-defined


How do primary and derived databases differ?

  • Primary sequence database: primary sequence data submitted by individual scientists, large-scale sequencing projects and patent offices
  • Derived database: derived data, e.g. protein sequence


Primary v. derived data

Submitted (primary) DNA sequence:

ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT

transcribe → derived mRNA sequence:

AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

translate → derived protein sequence:

MRSDECCCAMSC
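The transcribe and translate steps can be sketched as follows. This is a minimal sketch, assuming the standard genetic code and assuming the DNA given to `transcribe` is the coding strand; both function names are illustrative. Applied to the mRNA on this slide, the fourth codon GAU encodes aspartate (D).

```python
from itertools import product

# Standard genetic code, codons ordered by first, second, third base over U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_dna):
    """Derive the mRNA from a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 1, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"))
```

Because the protein is computed from the nucleotide record, it can be regenerated at any time, which is what makes it derived data.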


How do primary and derived databases differ?

  • Primary sequence database: redundant. If anything in a submission varies (e.g. source / submitter / sequence) → generates a new entry
  • Derived database (derived data, e.g. protein sequence): may be non-redundant
  • If derived data is lost, it can be regenerated from the primary data; if primary data is lost, the data is lost


Primary nucleotide sequence databases

  • INSDC: International Nucleotide Sequence Database Collaboration
    • GenBank (U.S.A.)
    • DDBJ (Japan)
    • ENA (Europe)
  • Daily exchange of data
  • Submission can be made to any INSDC database


Sequence information

How is sequence data processed?

  • Reads
    • Sequence machine output (reads)
    • Quality scores
  • Assembly
    • Fragmented sequence reads → assembled into contigs → mapped onto chromosomes
  • Annotation
    • Functional information assigned to assembled regions


Sequence information

What type of sequence data is submitted?

  • Raw data (reads)
    • Input information: sample, set-up, machine configuration
    • Output machine data: sequence traces, reads, quality scores
    • Metagenomic data: where originated
  • Assembled sequences (assembly) and annotated sequence (annotation)
    • Interpreted information: assembly, mapping, functional annotation, sample information


European Nucleotide Archive

How does ENA store the data?

Submitters: large-scale sequencing projects, individual scientists, patent offices

  • Raw data
    • Trace Archive: trace sequence reads from capillary sequencing instruments
    • Sequence Read Archive (SRA): intensity reads from next-generation sequencing instruments
  • Assembled and annotated sequences
    • ENA-Annotation (formerly EMBL-Bank)


INSDC Sequencing Projects

Can data be traced to an Institute?

  • Project database records pull information together for a complete genome / metagenome
    • Data: genomic, ESTs, shotgun, ...
    • Submitter: institute or consortium
    • Assembly & annotation (single organism / metagenomic study)
  • Uses: track projects, comparative analysis


Nucleotides: European Nucleotide Archive (ENA)

The ENA has a three-tiered data architecture.

It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).

Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).


Data Quality

Is the data cleaned up?

Validation of submitted data:

  • Automatic quality checks
  • Some manual inspection and curation

Errors can still exist in sequence and annotation


Database Structure

How is the data organized?

Data in ENA-Annotation is divided in 2 ways:

1) Data classes

  • Type of data or methodology used to obtain data
  • Each entry belongs to one data class

2) Taxonomic divisions

  • Each entry belongs to one taxonomic division


Data Classes

  • CON: Constructed from sequence assemblies
  • EST: Expressed Sequence Tag (cDNA)
  • GSS: Genome Survey Sequence (high-throughput short sequence)
  • HTC: High-Throughput cDNA (unfinished)
  • HTG: High-Throughput Genome sequencing (unfinished)
  • MGA: Mass Genome Annotation
  • PAT: Patent sequences
  • STS: Sequence Tagged Site (short unique genomic sequences)
  • STD: Standard (high quality annotated sequence)
  • TPA: Third Party Annotation (re-annotated and re-assembled)
  • TSA: Transcriptome Shotgun Assembly (computational assembly)
  • WGS: Whole Genome Shotgun

Data Classes: notes on selected classes

  • EST: single pass reads → variable quality; need to search both EST and RNA data
  • PAT: often copies of existing entries; records not clean, even for taxonomy
  • STD: bulk of entries; highest level of tracked information
  • TPA: derived data entries (e.g. patch genomic and RNA data to construct complete coverage); must have a publication; must show which entries the data is derived from
  • TSA: also derived data entries; ESTs assembled to construct RNA; must show which EST/HTC entries the data is derived from
  • WGS: entries change over time (completely replaced); raw WGS entries assembled into contigs → CON entries


Data Classes

How stable is the data?

Data is always changing:

  • Assembly of sequences into larger fragments
  • Deletion of obsolete entries (i.e. once assembled)
  • Sequence modifications
  • Daily updates
  • Identifier changes
  • Corrections (databases can contain errors)
  • etc…


Data Classes

How does assembly affect entries?

Example:

  • WGS (Shotgun)
    • Fragments in separate entries
  • CON (Constructed)
    • Join to make new CON entries
    • Old WGS entries archived
  • STD (Standard)
    • Join into large STD entry (e.g. completed genome)
    • Add annotation
    • Old CON entries archived


Taxonomy

  • HUM: Human
  • MUS: Mouse
  • ROD: Rodent
  • MAM: Mammal
  • VRT: Vertebrate
  • FUN: Fungi
  • Other:
    • INV: Invertebrate
    • PLN: Plant
    • ENV: Environmental
    • PRO: Prokaryote
    • SYN: Synthetic
    • PHG: Phage
    • TGN: Transgenic
    • VIR: Viral
    • UNC: Unclassified


Taxonomy: notes on selected divisions

  • ENV (Environmental): CAUTION: organism never isolated; may blast sequence to assign putative organism
  • SYN / TGN (Synthetic / Transgenic): CAUTION: not consistently handled, variable quality; transgenics may be from multiple organisms
  • UNC (Unclassified): division primarily used by GenBank for PAT (patent) sequences


Taxonomy exclusion

Some species are excluded from certain taxonomic ranges:

  • Rodent (ROD) excludes mouse
  • Mammal (MAM) excludes human, mouse, rodent
  • Vertebrate (VRT) excludes human, mouse, rodent, mammal

  • Applies to: ftp files and sequence search tools
  • But not: ENA Browser
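The exclusion rule follows from each entry belonging to exactly one division, with the most specific division taking priority. The following is a minimal sketch under that assumption; the group labels and the function `assign_division` are hypothetical simplifications, not an ENA API.

```python
# Most specific division first: a mouse is also a rodent, a mammal and a
# vertebrate, but its entries are filed under MUS only.
DIVISION_PRIORITY = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]


def assign_division(groups):
    """Return the single division code for an organism.

    `groups` is the set of (simplified, hypothetical) group labels the
    organism belongs to, e.g. {"vertebrate", "mammal", "rodent"} for rat.
    Returns None if no listed division applies.
    """
    for code, group in DIVISION_PRIORITY:
        if group in groups:
            return code
    return None


print(assign_division({"vertebrate", "mammal", "rodent", "mouse"}))  # mouse
print(assign_division({"vertebrate", "mammal", "rodent"}))           # rat
```

A search restricted to the rodent division therefore does not return mouse entries; to cover all rodents, both divisions need to be searched.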


Taxonomy Database

Which taxonomy database does ENA use?

  • All INSDC databases use the NCBI Taxonomy Browser
  • Only organisms with sequence are represented

EBI Taxonomy Portal

  • EBI-wide service → maps resources into taxonomy service
  • Culture collection – physical data, e.g. sample or stored version
  • Biomaterial
  • Specimen voucher
  • Representation, e.g. picture


Database Structure

How does data organization differ from GenBank?

GenBank (data classes and taxonomic divisions)

  • Data split into parallel slices
  • Large search sets
  • Classes incomplete for taxonomy
  • Taxonomy incomplete for classes

ENA-Annotation (data classes × taxonomic divisions)

  • Data split into intersecting slices
  • Reduces search set
  • Ensures complete result set
  • Example:
    • ‘EST’ set: large data set, includes all EST entries
    • ‘Mouse’ set: large data set, includes all mouse entries
    • ‘Mouse’ + ‘EST’ intersection: small data set, ensures a complete set of mouse ESTs
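The intersecting-slices idea can be illustrated with plain sets. This is a minimal sketch: the entries and their accessions (`E1` to `E5`) are made up for illustration, and `select` is a hypothetical helper, not an ENA interface.

```python
# Each entry carries both a data class and a taxonomic division.
ENTRIES = [
    {"acc": "E1", "data_class": "EST", "division": "MUS"},
    {"acc": "E2", "data_class": "EST", "division": "HUM"},
    {"acc": "E3", "data_class": "STD", "division": "MUS"},
    {"acc": "E4", "data_class": "EST", "division": "MUS"},
    {"acc": "E5", "data_class": "WGS", "division": "PLN"},
]


def select(entries, data_class=None, division=None):
    """Return accessions matching the given class and/or division."""
    return {
        e["acc"]
        for e in entries
        if (data_class is None or e["data_class"] == data_class)
        and (division is None or e["division"] == division)
    }


est_set = select(ENTRIES, data_class="EST")      # all ESTs, any organism
mouse_set = select(ENTRIES, division="MUS")      # all mouse entries, any class
mouse_ests = est_set & mouse_set                 # the small intersecting slice
print(sorted(mouse_ests))
```

Searching only the intersection keeps the search set small while still returning every mouse EST.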


Data – Protein Sequence

  • UniProt databases:

    • UniProtKB: human curated and automatic translation sections

    • UniRef: non-redundant sequence clusters

    • UniParc: non-identical sequence archive

  • Sequence from structures:

    • PDB

    • SGT

  • Specialist data sets, e.g.:

    • Immunoglobulins: IMGT/HLA

    • Alternative splicing: ASD, ASTD

    • Completed proteomes: Ensembl, Integr8

    • Protein Interactions: IntAct

    • Patent Proteins: EPO, JPO, KIPO and USPTO


Protein sequence: UniProt

Some data sources for annotation

  • Functional info: GO
  • Protein identification data: PRIDE
  • Protein families and domains: InterPro
  • Molecular interactions: IntAct
  • Enzymes: IntEnz
  • Microbial protein families: HAMAP
  • Post-translational modifications: RESID
  • Protein classification: InterPro classification
  • Automated annotation: signal prediction, transmembrane prediction, other predictions


UniProt data sources and data flow

Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, FlyBase, WormBase, patent data, Sub/Peptide Data

  • UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences
  • UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities. It provides a richly curated protein database.
  • UniRef (UniRef 100, UniRef 90, UniRef 50): pre-computed clusters of similar proteins
  • UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)
  • UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)
  • Proteome Sets
  • IPI


The Two Sides of UniProtKB

  • UniProtKB/TrEMBL: redundant, automatically annotated - unreviewed
  • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation - reviewed


Databases

  • Many databases and they are getting bigger

  • Efficient searching involves knowledge of what is stored in these

  • Don’t assume that everything in the databases is correct

  • Nothing is constant; the data changes:

    • Deletions, sequence modifications

    • Daily updates, identifier changes, etc.


Searching databases


What is the difference between a primary and secondary database?

What methods of searching databases do you know of?

What is the best protein sequence database to search (specific part)?


Searching

  • Many ways of searching databases

  • Annotation/title

    • Know something about your sequence

      • Gene name

      • Function

      • Accession


New search service

Access from the EBI’s homepage

Species selector allows for easy comparison

  • Data organised according to:

  • gene

  • expression

  • protein

  • structure

  • literature

Explore data, return easily to

your results


Database webpages


Database searching


Searching

  • Many ways of searching databases

  • Annotation/title

    • Know something about your sequence

      • Gene name

      • Function

      • Accession

  • Raw data

    • Don’t know!

    • Or want to check...

  • Infer extra information

    • Homology?

    • Annotation?

    • Function?


Sequence alignment

  • Relatively easy if we have an exact match

  • .. But sequence is variable

    • Between individuals, species, location etc.

  • That variability is useful data too!

  • Need a search method that allows for some variability

  • And even better – helps us assess that variability


Sequence alignment

Query:       ACATAGGT

Sequence 1:  TCATAGAT    Score: 6/8

Sequence 2:  AAATTCTG    Score: 3/8
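The scores in this example are simply the number of identical positions when the sequences are laid side by side without gaps. A minimal sketch of that count (the function name `identity_score` is illustrative):

```python
def identity_score(query, subject):
    """Count positions where two equal-length, ungapped sequences agree."""
    if len(query) != len(subject):
        raise ValueError("this naive score needs sequences of equal length")
    return sum(q == s for q, s in zip(query, subject))


query = "ACATAGGT"
for name, subject in [("Sequence 1", "TCATAGAT"), ("Sequence 2", "AAATTCTG")]:
    print(f"{name}: {identity_score(query, subject)}/{len(query)}")
```

This works only for sequences of equal length with no insertions or deletions, which is why the slides move on to scoring schemes that allow gaps.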


Sequence alignment

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaag

atgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttct

ttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaagg

cacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatct

caagggcacctttgcccagcttgagt

Query:

1

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggc

catggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccacc

aagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggc

aagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgcc

ctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggct

cctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgc

ctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttcctt

gggagatgccataaagcacctggatgatctcaagggca

2


Dot plot

GATACT

Sequence 1

A C A T A G

Query

Maybe a dot plot will help
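A dot plot marks every cell where a letter of one sequence equals a letter of the other; runs of matches show up as diagonals. A minimal text-based sketch, using the two short sequences on this slide (the function name `dot_plot` is illustrative):

```python
def dot_plot(query, subject):
    """Return one text row per subject letter; '*' marks a match with the query."""
    return ["".join("*" if q == s else "." for q in query) for s in subject]


query, subject = "ACATAG", "GATACT"
print("  " + query)
for letter, row in zip(subject, dot_plot(query, subject)):
    print(letter + " " + row)
```

Real dot plot tools usually compare short windows rather than single letters, which suppresses much of the background noise that single-letter matching produces.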


Dot plot

[Figure: two dot plots, Query vs Sequence 1 and Query vs Sequence 2]


We can see the difference, but how to turn that into something a computer can evaluate?

Computers rely on algorithms which give them a score

They can then compare scores


[Figure: dot-plot grid in which steps along the diagonal score 0 and steps away from it score -10]

  • Simple algorithm – penalise movement away from diagonal – gap penalty


Gap extend

  • Having opened a gap, we should assign a lesser penalty to extending it

  • Actual implementation is usually to apply the gap extension penalty to every gap position

[Figure: the same grid annotated with a gap opening penalty of -10 and a gap extension penalty of -0.5, so that opening a gap scores -10.5 in total]


Why a lesser gap extend penalty?

NVELKAET

NVDEATNFELKAET

NV-ELKAET

NV------ELKAET

NVDE--A-TNFELKAET

NVDEATNFELKAET

Single block of insertions/deletions is more likely than multiple in/del events
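The effect of a separate opening and extension penalty can be checked with a little arithmetic. This is a minimal sketch, assuming the convention described earlier (opening costs 10, and an extension cost of 0.5 is applied to every gap position, including the first); the function names are illustrative.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one contiguous gap of `length` positions."""
    if length <= 0:
        return 0.0
    return -(gap_open + gap_extend * length)


def total_gap_penalty(gap_lengths, gap_open=10.0, gap_extend=0.5):
    """Sum the penalties of several separate gaps."""
    return sum(gap_penalty(n, gap_open, gap_extend) for n in gap_lengths)


# Six missing residues as one block versus the same six spread over three gaps.
print(total_gap_penalty([6]))        # one insertion/deletion event
print(total_gap_penalty([2, 1, 3]))  # three separate events
```

With these values the single block costs -13.0 and the three scattered gaps cost -33.0, so the scoring scheme prefers the alignment that needs only one insertion/deletion event.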


Match/mismatch

Of course, we need to tell the algorithm that matching letters are better than mismatches too

This is done via a scoring matrix

        A    C    G    T
   A    5   -4   -4   -4
   C   -4    5   -4   -4
   G   -4   -4    5   -4
   T   -4   -4   -4    5


  • Putting the two together gives us a scoring mechanism

          A       C       A
   A   -13.5   -13       6
   C   -18       1     -13
   T    -4     -18     -22.5

  • To pick the optimal alignment, start at the end and trace back the highest scoring route.


Needleman-Wunsch

  • Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!

    • An example of dynamic programming

  • Comparing the full length of both sequences is called a global-global or just global alignment
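The fill-and-trace-back procedure can be written down compactly. The following is a minimal sketch of Needleman-Wunsch, assuming match = 5, mismatch = -4 and, for brevity, a single linear gap penalty of -10 per gap position rather than separate opening and extension penalties. The function name is illustrative.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two letters
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the end cell along one highest-scoring route.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACA", "TCA"))
```

Each cell is computed once from its three neighbours, which is the dynamic programming step: the work grows with the product of the two sequence lengths instead of with the number of possible alignments.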


Global vs Local

  • But global-global might not be suitable for sequences of very different lengths

  • A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.

    • Sets negative scores in matrix to 0, and allows trace back to end and restart
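The two changes for local alignment (flooring scores at 0, and starting the trace back from the best cell rather than the end) can be sketched as follows. This is a minimal sketch of Smith-Waterman, assuming match = 5, mismatch = -4 and a linear gap penalty of -10 for brevity; the function name is illustrative.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Negative scores are reset to 0, so an alignment can restart anywhere.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_i, best_j = h[i][j], i, j
    # Trace back from the best cell until a 0 is reached.
    out_a, out_b = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACGTTT", "CCACGTAA"))
```

Only the shared region is reported; the unrelated flanks, which a global alignment would be forced to pair up or gap out, are ignored.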


QUESTION: Global vs Local - which is which?

A T G T A T A C G C

A - T G T A T A C G C

- A G T A T A - G C

A G T A T A - - - G C

LOCAL

GLOBAL


Scoring

  • Parameters so far:

    • Match/mismatch

    • Gap opening

    • Gap extending

  • Can we improve it?


Substitutions

  • Some substitutions are more likely than others

  • DNA:

    • Purines (A, G) – dual ring

    • Pyrimidines (C, T) – single ring

  • Substitutions within the same class (purine for purine, or pyrimidine for pyrimidine) are called transitions, whereas exchanging a purine for a pyrimidine (or vice versa) is called a transversion

  • Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
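A scoring matrix that distinguishes transitions from transversions can be built directly from the two classes of bases. This is a minimal sketch; the score values (5, -2, -4) are hypothetical and chosen only to show transitions being penalised less than transversions.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_type(x, y):
    """Classify a pair of DNA bases as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(match=5, transition=-2, transversion=-4):
    """Build a DNA scoring matrix as a dict keyed by (base, base)."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return {(x, y): scores[substitution_type(x, y)] for x in "ACGT" for y in "ACGT"}


matrix = build_matrix()
print(matrix[("A", "G")], matrix[("A", "C")])
```

The resulting dictionary can replace the flat match/mismatch rule in an alignment routine by looking up `matrix[(x, y)]` for each pair of letters.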


Proteins

What about proteins?


Protein substitution matrices

  • Can look at closely related proteins to determine substitution rates

  • Two most commonly used models:

    • BLOSUM

    • PAM


BLOSUM

BLOcks SUbstitution Matrix

Align conserved regions of evolutionarily divergent sequences clustered at a given % identity

Count relative frequencies of amino acids and substitution probability

Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely.

Higher BLOSUM number = more closely related
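The step from counted frequencies to matrix entries is a log-odds calculation: how often a pair is observed, relative to how often it would occur by chance. The following is a minimal sketch, assuming scores in half-bit units (2 × log2, the scaling commonly used for BLOSUM62) and using made-up frequencies purely for illustration.

```python
import math


def log_odds_score(p_pair, q_a, q_b, scale=2.0):
    """Rounded log-odds score for an amino acid pair.

    p_pair: observed frequency of the aligned pair in the blocks
    q_a, q_b: background frequencies of the two amino acids
    scale: 2.0 gives half-bit units (an assumption of this sketch)
    """
    return round(scale * math.log2(p_pair / (q_a * q_b)))


# Hypothetical frequencies: a pair seen twice as often as chance, and one
# seen half as often as chance.
print(log_odds_score(0.02, 0.1, 0.1))
print(log_odds_score(0.005, 0.1, 0.1))
```

A pair observed more often than expected gets a positive score, a pair observed less often than expected gets a negative one, which is the sign convention described above.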


PAM

Point Accepted Mutation

Observed mutations in a set of closely related proteins

Markov chain model created to describe substitutions

Normalised so that PAM1 = 1 mutation per 100 amino acids

Extrapolate matrices from model (e.g. PAM 250)

Higher PAM number = less closely related
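Extrapolation in a Markov chain model means multiplying the one-step matrix by itself: PAM n is PAM1 raised to the n-th power. The following is a minimal sketch using a toy two-letter alphabet with a 1% change per step; the real PAM1 is a 20 × 20 matrix estimated from data, and `PAM1_TOY` is purely hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": each letter has a 1% chance of changing per step.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]
pam250_toy = mat_pow(PAM1_TOY, 250)
print(pam250_toy[0])
```

After 250 steps the probability of still seeing the original letter has dropped from 0.99 to just over 0.5 in this toy model, which illustrates why higher PAM numbers suit more divergent sequences.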


Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence

[Figure: panels for PAM 10, 100, 200, 300, 400 and 500]


Roughly equivalent matrices, from more divergent to less divergent:

  • BLOSUM 45 / PAM 250 (more divergent)

  • BLOSUM 62 / PAM 160

  • BLOSUM 90 / PAM 100 (less divergent)


Scoring

  • Parameters:

    • Match/mismatch

    • Gap opening

    • Gap extending

    • Substitution matrix
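To show how these parameters combine, here is a minimal sketch that scores an already-aligned pair of sequences. It assumes a common convention: the first position of a gap costs the gap opening penalty and each further position costs the gap extension penalty. The default values are placeholders, and a real tool would look up match/mismatch scores in a substitution matrix.

```python
def score_alignment(row_a: str, row_b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -10, gap_extend: int = -1) -> int:
    """Score two aligned rows of equal length, using '-' for gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_in_a = gap_in_b = False   # are we currently inside a gap in that row?
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-":
            total += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            total += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            total += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return total
```

Under this convention one gap of length two is cheaper than two separate gaps of length one, which is the reason for having separate opening and extension penalties.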


Dynamic programming alignments at the EBI

  • EMBOSS Pairwise Alignment Algorithms

  • European Molecular Biology Open Software Suite

    • Suite of useful tools for molecular biology

    • Command line based

    • Designed to be used as part of scripts/chained programs

  • We implement selected tools to provide web-based access


Where to find at the EBI?

http://www.ebi.ac.uk/Tools/sequence.html

Or...


EMBOSS align tools

Needle

  • Global alignment

Water

  • Local alignment
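The difference between global and local alignment can be sketched with one small dynamic programming function. This is a simplified illustration, not the EMBOSS implementation: it assumes a linear gap penalty and plain match/mismatch scores, whereas the real tools use affine gaps and a substitution matrix.

```python
def alignment_score(a: str, b: str, local: bool, match: int = 2,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global (Needleman-Wunsch style) or local
    (Smith-Waterman style) alignment score of two sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)        # local alignments may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning everything; local: best-scoring region
    return best if local else score[-1][-1]
```

For two sequences that share only a central region, such as `TTTGATTACATTT` and `CCGATTACACC`, the local score reflects the shared `GATTACA` while the global score is dragged down by the unrelated ends.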


Program selection

Parameters

Sequence input

Submit!


Key

- Gap

: Positive match

. Negative match

| Identity


Pairwise Alignments - Example sequences

www.ebi.ac.uk/~watson/africa

Pairwise_align1.fsa

Pairwise_align2.fsa

Pages 25-30 in full booklet: Questions 7-10


Dynamic programming sequence search methods at the EBI

GGSEARCH

  • Global alignment

SSEARCH

  • Local alignment

GLSEARCH

  • Global query vs local database


Where to find at the EBI?

www.ebi.ac.uk/Tools/sss/

Or...


Similarity search

Database selection

Sequence input

Parameters

Submit!


Dynamic programming methods are rigorous and guarantee an optimal result

But have to store the matrix of both sequences in memory

And evaluate each position of the matrix

Predictably, this makes them slow and demanding when you are aligning large sequences


Heuristics

  • Therefore we need methods of estimating alignments

  • Estimation methods are called heuristics

    • Try and take short cuts in an intelligent manner

    • Speed up the search

    • At the possible expense of accuracy

  • Accuracy in sequence searches is important for:

    • Aligning the right bits

    • Scoring the alignment correctly

    • Identifying similar sequences - sensitivity


Going back to our dot plot


Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.


FASTA – step 1

Ktup parameter:

How many consecutive identities before considered a ‘run’

Also called ‘word size’

Increase Ktup = faster, but less sensitive

Identify runs of identical sequence and pick regions with highest density of runs
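The first FASTA step can be sketched as a word lookup followed by counting hits per diagonal. This is only an illustration of the idea, not the FASTA source code. The function name is hypothetical and the diagonal is defined here as query position minus subject position.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count identical words of length ktup on each diagonal."""
    # Index every word of the query by its start position
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Look up each subject word and record which diagonal the hit falls on
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals
```

Calling `ktup_diagonals("HEAGAWGHEE", "PAWHEAE", ktup=2)` finds four word hits, two of them on the same diagonal. With `ktup=3` only one hit remains, which illustrates why a larger Ktup is faster but less sensitive.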


FASTA – step 2

Parameter:

Substitution matrix

Weight scoring of runs using matrix, trim back regions to those contributing to highest scores


FASTA – step 3

Joining threshold:

Internally determined

Discard regions too far from the highest scoring region


FASTA – step 4

Parameters:

Gap open

Gap extend

Substitution matrix

Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions


FASTA

Repeat against all sequences in the database


FASTA – programs available at EBI

  • FASTA: “a fast approximation to Smith & Waterman”

    • FASTA – scan a protein or DNA sequence library for similar sequences.

    • FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.

    • TFASTX/Y – compare a protein sequence to a translated DNA data bank.

    • FASTF – compares ordered peptides (Edman degradation) to a protein databank.

    • FASTS – compares unordered peptides (Mass Spec.) to a protein databank.

    • SSEARCH – Rigorous scan of protein or DNA sequence library (S&W Algorithm).


Where to find at the EBI?

www.ebi.ac.uk/Tools/sss/

Or...


Similarity search

Database selection

Sequence input

Parameters

Submit!


FASTA - results

Key

- Gap

: Identity

. Similarity

X Filtered


Using FASTA - Example sequence

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Pages 37-46 in full booklet: Questions 11-14


BLAST – Basic Local Alignment Search Tool

Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Instead of narrowing the dynamic programming search space, BLAST works a different way

Firstly, it creates a word list both of the exact sequence and high scoring substitutions


BLAST – step 1

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Words (w=3): SEW, EWR, WRF, ...

Parameter:

Word length (w)

Increase = faster, but less sensitive


BLAST – step 1 (cont'd)

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Neighbourhood words for GQP (w=3, T=13):

  • GQP 18

  • GEP 15

  • GRP 14

  • GKP 14

  • GNP 13

  • GDP 13

Below the threshold (not kept):

  • AQP 12

  • NQP 12

Parameters:

Neighbourhood threshold (T)

Substitution matrix
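The word list and neighbourhood threshold can be sketched as follows. The scores on this slide match BLOSUM62, so the sketch hard-codes just the BLOSUM62 entries needed for the word GQP; a real search would use the full matrix and consider every possible three-letter word. The function names and the candidate list are assumptions made for this example.

```python
# Only the BLOSUM62 entries needed for this example
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("Q", "Q"): 5, ("P", "P"): 7,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1, ("Q", "N"): 0, ("Q", "D"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}

CANDIDATES = ["GQP", "GEP", "GRP", "GKP", "GNP", "GDP", "AQP", "NQP"]


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair in either order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def words(sequence: str, w: int = 3) -> list:
    """Split a query into all overlapping words of length w."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def neighbourhood(word: str, candidates: list, threshold: int) -> dict:
    """Keep candidate words scoring at least `threshold` against `word`."""
    scored = {c: sum(pair_score(a, b) for a, b in zip(word, c)) for c in candidates}
    return {c: s for c, s in scored.items() if s >= threshold}
```

With `threshold=13`, `neighbourhood("GQP", CANDIDATES, 13)` keeps the six words scoring 13 or more and drops AQP and NQP, which score 12.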


BLAST – step 2

Then it scans database sequences for exact matches with these words


BLAST – step 3

Parameters:

Drop off

Substitution matrix

If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

This results in a High-scoring Segment Pair (HSP)
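The "extend until the score drops by a certain amount" rule can be sketched as an ungapped extension with a drop-off limit. This simplified version extends to the right only and uses plain match/mismatch scores. BLAST itself extends in both directions and scores with the substitution matrix, and the parameter values here are assumptions.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 word_len: int, x_drop: int = 5,
                 match: int = 2, mismatch: int = -3) -> tuple:
    """Extend a word hit to the right without gaps.

    Returns (best_score, best_length). Extension stops once the running
    score has fallen more than x_drop below the best score seen so far.
    """
    # Score of the initial word hit
    score = sum(match if query[q_start + k] == subject[s_start + k] else mismatch
                for k in range(word_len))
    best_score, best_length = score, word_len
    length = word_len
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > x_drop:
            break                      # dropped too far: give up extending
        i += 1
        j += 1
    return best_score, best_length
```

For `extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, 4)` the extension grows the hit to eight matching bases, then stops after two mismatches push the score more than `x_drop` below its best value.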


BLAST – step 4

Parameters:

Extension threshold (Sg)

Substitution matrix

If the total HSP score is above another threshold then a gapped extension is initiated


BLAST

The steps rule out many database sequences early on

Large increase in speed


BLAST – programs available at the EBI

Combines several parameters into ‘sensitivity’ option

  • Basic Local Alignment Search Tool

  • NCBI-BLAST programs:

    • BLASTP – protein sequence vs. protein sequence library

    • BLASTN – nucleotide query vs. nucleotide database

    • BLASTX – translated DNA vs. protein sequence library

  • WU-BLAST programs:

    • BLASTP – protein query vs. protein database

    • BLASTN – nucleotide query vs. nucleotide database

    • BLASTX – translated nucleotide query vs. protein database

    • TBLASTN – protein query vs. translated nucleotide database

    • TBLASTX – translated nucleotide query vs. translated nucleotide database


Using BLAST - Example sequence

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Pages 47-50 in full booklet: Questions 15-17


Key

- Gap

[residue] Identity

+ Similarity

X Filtered


Differences between BLAST and FASTA

  • BLAST

    • Fast

    • Good with proteins

    • Produces good local alignments + short global alignments

    • Produces HSPs (reports internal matches in long sequences)

    • Might miss a potential alignment due to ruling out sequences early on in the process

    • Good at finding siblings

  • FASTA

    • Not as fast as BLAST

    • Much better with DNA than BLASTN

    • Produces S&W alignments

    • Checks each possible alignment with database sequences

    • Good at finding cousins


When to use what?

[Figure: NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA placed on axes of query length against database size]


When to use what?

[Figure: time to search with NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA against PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]


Homology and Similarity


Similarity


Homology


Unrelated!


Homology vs. Similarity

  • Homology

    • Presence of similar features because of common descent

    • Cannot be observed, since the ancestors no longer exist

    • Is inferred as a conclusion based on ‘similarity’

    • Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)

  • Similarity

    • Quantifies a ‘likeness’

    • Uses statistics to determine ‘significance’ of a similarity

    • Statistically significant similar sequences are considered ‘homologous’


Score vs significance

A A A

A C A T A A G G C T

A A A

A T A C A A G C C T

High score

High significance


“Lies, damn lies, and statistics”

Not just interested in score...

...But how likely we are to get that alignment by chance alone

It is from this ‘non-random’ alignment that we infer homology

Statistics are used to estimate this chance


E-value

‘Expect’ value

Number of alignments scoring this well or better that we expect to find by chance

Best measure of how good an alignment is

Often used for ranking results by default
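As a hedged illustration of what an E-value measures, the sketch below uses the standard Karlin-Altschul relationship between a normalised bit score and the size of the search space (query length times database length). FASTA estimates its statistics differently, so treat this as the BLAST-style calculation only; the function names are made up for the example.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of seeing at least one such chance alignment."""
    return 1.0 - math.exp(-e_value)
```

The same bit score gives a larger E-value against a larger database, which is one reason to search the smallest database likely to contain the sequence of interest. For small E-values the probability and the E-value are nearly equal.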


Calculated in different ways for BLAST and FASTA

Short query sequences are more likely to be found by chance so have higher E-values

Affected by parameter values like gap penalties and substitution matrices


FASTA statistics

Compares query sequence with every sequence in database

As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance


FASTA - histogram

Key

* Predicted distribution of scores

= Observed distribution of scores

High scoring region


BLAST statistics

“Appears to yield fairly accurate results”

Main reason for speed is that it doesn’t compare query with lots of other sequences

Therefore it pre-estimates statistical values using a random sequence model


Search Guidelines


Search guidelines 1

Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc.)

Then with translated DNA sequences (fastx, blastx)

Search with DNA vs. DNA as the next resort

And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!


Search guidelines 2

  • Search the smallest database that is likely to contain the sequence(s) of interest

  • Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology


Search guidelines 3

  • Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence

    • Examine the histograms

    • Use programs such as prss3 to confirm the expectation values.

    • Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() of ~1.0


Search guidelines 4

  • Consider searches with different gap penalties and other scoring matrices

    • Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences

    • Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)

    • Remember to change the gap penalty defaults!

MATRIX     open   ext.
BLOSUM50   -10    -2
BLOSUM62   -7     -1
BLOSUM80   -16    -4
PAM250     -10    -2
PAM120     -16    -4


Search guidelines 5

  • Homology can be reliably inferred from statistically significant similarity

  • But remember:

    • Orthologous sequences have similar functions

    • Paralogous sequences can acquire very different functional roles

  • So further work might be needed to tease out details


Search guidelines 6

  • Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues

    • However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!

  • Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data

    • ClustalW

    • MUSCLE

    • T-Coffee

    • Kalign

    • MAFFT

    • Mview (available from EBI FASTA & BLAST services)

    • DBCLUSTAL (available from EBI BLAST services)


Advanced


Conserved regions

Structural information

Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H

In general, the more information we can add to an alignment, the better the result
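A motif written like the one above maps directly onto a regular expression, with each bracketed set of alternatives becoming a character class. The sketch below is one way to scan a protein sequence for it; the function name and the test sequences are invented for illustration.

```python
import re

# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
MOTIF = re.compile(r"[RTD][DAQ][FEA]ATH")


def find_motif(sequence: str) -> list:
    """Return (start position, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence.upper())]
```

For example, `find_motif("MKRDFATHLL")` reports the motif at position 2 (counting from zero), while a sequence with a single residue changed in one of the fixed positions returns no matches.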


Conserved regions

We can even modify a normal search to generate a position specific scoring matrix, or PSSM

We can add a new ‘position’ parameter to the substitution matrix


PSI-BLAST

Position Specific Iterative – BLAST:

  • Takes the result of a normal BLAST

  • Aligns them and generates profile of conserved positions

  • Uses this to weight scoring on next iteration
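The profile-building step can be sketched as turning each column of an alignment into position-specific log-odds scores. This is a simplified illustration and not PSI-BLAST's actual procedure, which also weights sequences and derives pseudocounts from the substitution matrix. A four-letter alphabet and a flat pseudocount are assumed here to keep the example short.

```python
import math
from collections import Counter


def build_pssm(aligned: list, background: dict, pseudocount: float = 1.0) -> list:
    """Build a position-specific scoring matrix from equal-length aligned rows.

    Returns one dictionary per column, mapping residue -> log-odds score.
    """
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in background)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background[r])
            for r in alphabet
        })
    return pssm


def score_with_pssm(pssm: list, segment: str) -> float:
    """Score an ungapped segment against the profile, position by position."""
    return sum(column[residue] for column, residue in zip(pssm, segment))
```

A residue that is conserved in a column scores positively at that position only, so a sequence matching the conserved positions scores higher than it would with a single position-independent matrix.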


PSI-BLAST

More sensitive

By adding importance to conserved residues we might be able to find more distant sequences

But iterate too far and we might be assigning importance where there is none


PHI-BLAST

Pattern Hit Initiated-BLAST

User provides a pattern alongside a protein

Database hits must contain this pattern and be similar to the rest of the sequence

Results can initiate a PSI-BLAST search as well


PSI-SEARCH

Smith-Waterman implementation (SSEARCH)

But with iterative position specific scoring


Using PSI-BLAST - Example sequence

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Pages 52-55 in full booklet: Questions 18-20


Problem Sequences


Short sequences

  • What about short sequences?

  • Depends on their nature:

    • Protein

      • Reduce word length and/or increase the E() value cut off

      • Use shallow matrices

    • DNA

      • Reduce the word length

      • Ignore gap penalties (force local alignments only)

  • Use rigorous methods

  • But ask what you are trying to do!


Low complexity regions

Sometimes biologically relevant, but always likely to skew alignment scoring

E.g. CA repeats, poly-A tails and Proline rich regions


Good Statistics:

The inset shows good correlation between the observed and expected numbers of scores.

This is the region of the histogram to look out for first when evaluating results.


Bad Statistics:

The inset shows bad correlation between the observed and expected scores in this search.

The spaces between the = and * symbols indicate this poor correlation.

One reason for this can be low complexity regions.


Low complexity regions

  • Sometimes biologically relevant, but always likely to skew alignment scoring

  • E.g. CA repeats, poly-A tails and Proline rich regions

  • Compensate by filtering sequence so these regions don’t contribute to scoring

    • Filters: seg, xnu, dust, CENSOR

  • But check what you are filtering!


Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.

Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.


Using Filters - Example sequence

www.ebi.ac.uk/~watson/africa

Filtertest_seq.fsa

Pages 56-57 in full booklet: Questions 21-22


Vector contamination

You think you know what your sequence is...

...but the results are really confusing!

Maybe you have vector contamination

Search against known vectors to check


Vector Contamination - Example sequences

www.ebi.ac.uk/~watson/africa

vectortest_seq1.fsa

vectortest_seq2.fsa

Page 57 in full booklet: Question 23


Multiple Sequence Alignments


Uses of MSA

Functional prediction

Phylogeny

Structural prediction

Homology detection

Protein analysis

To distinguish between orthology and paralogy


Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)

But this is too computationally intensive


Human beta --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST

Horse beta --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN

Human alpha ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-

Horse alpha ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-

Whale myoglobin ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT

Lamprey globin PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT

Lupin globin --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

*: : : * . : .: * : * : .

Human beta PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL

Horse beta PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL

Human alpha ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL

Horse alpha ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL

Whale myoglobin EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF

Lamprey globin ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV

Lupin globin VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

. .:: *. : . : *. * . : .

Human beta LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------

Horse beta LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------

Human alpha LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------

Horse alpha LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------

Whale myoglobin ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

Lamprey globin LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------

Lupin globin VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

: : .: . .. . :

Weighted Sums of Pairs: WSP

Time O(L^N)

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years
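The timings above grow by a factor of about 150 for every sequence added. The sketch below reproduces that growth under the assumption, inferred from the first two rows, that two sequences take 1 second and each extra sequence multiplies the time by 150. The function name and constants are illustrative only.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def wsp_seconds(n_sequences: int, factor: int = 150) -> int:
    """Estimated run time in seconds for an exact alignment of n sequences."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    return factor ** (n_sequences - 2)


for n in range(2, 8):
    seconds = wsp_seconds(n)
    print(f"{n} sequences: {seconds:>14,} s  "
          f"({seconds / SECONDS_PER_DAY:,.2f} days, "
          f"{seconds / SECONDS_PER_YEAR:,.2f} years)")
```

Each added sequence adds another dimension to the dynamic programming matrix, which is why the exact approach becomes impractical after only a handful of sequences.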

Therefore we have to use heuristics and progressive alignment methods


Clustal

  • >60,000 citations

  • Clustal1-Clustal4

    • 1988, Paul Sharp, Dublin

  • Clustal V 1992

    • EMBL Heidelberg,

    • Rainer Fuchs

    • Alan Bleasby

  • Clustal W, Clustal X 1994-2005

    • Toby Gibson, EMBL, Heidelberg

    • Julie Thompson, IGBMC, Strasbourg

  • Clustal W and Clustal X 2.0 2006

    • University College Dublin

www.clustal.org


CLUSTAL

Quick, pairwise alignment of all sequences

Line up pairs, with the most similar first


CLUSTAL

Fix the alignment between pairs and treat as one sequence


CLUSTAL

Align your fixed pairs with each other


Note, this is not a phylogram!

Only a guide tree for the alignment


ClustalW at the EBI

Help!

Parameters

Sequence input

Submit!


ClustalW

Interactive – results in browser, deleted after 24 hours

Email – receive URL to results page, deleted after 7 days


Jalview


ClustalW

Advantages

  • Fast

  • Not too demanding

  • Widely used

  • Fine for most uses

Disadvantages

  • Fixing of early alignments

    • Propagate errors

  • Doesn’t search far

    • Local minima

  • Compresses gaps


Use of Clustal/JalView - Example sequences

www.ebi.ac.uk/~watson/africa

prot_MSA.fasta

Problem_MSA1.fsa

Problem_MSA2.fsa

Problem_MSA3.fsa

Problem_MSA4.fsa

Pages 59-66 in full booklet: Questions 24-28


Other Tools


COFFEE

  • Consistency based Objective Function For alignmEnt Evaluation

  • Maximum Weight Trace (John Kececioglu)

  • Maximise similarity to a LIBRARY of residue pairs

  • Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.


COFFEE

SAGA is another alignment method, using genetic algorithms

  • Library of reference pairwise alignments

    • For your given set of sequences

  • Objective Function

    • Evaluates consistency between multiple alignment and the library of pairwise alignments

    • Use SAGA to optimise this function

  • Weigh depending on quality of alignment


COFFEE

  • More accurate than ClustalW

  • Much less prone to problems in early alignment stages

  • VERY slow!


T-Coffee

  • Tree-based COFFEE

  • Heuristic approach to COFFEE

    • Gets rid of genetic algorithm portion

    • Uses progressive alignments

    • Changes algorithm based on number of sequences


T-Coffee

Much faster than COFFEE

Avoids some of ClustalW’s pitfalls

Can take information from several data sources

Still not that fast

Can be very demanding of memory etc.


Others

  • MUSCLE – Bob Edgar

    • Iterative/progressive alignment

    • Fast

    • Good for big alignments, proteins

  • MAFFT

    • Iterative based Fast Fourier Transform

    • Fast and accurate

    • Good for huge alignments

  • Kalign

    • Very fast, local-regions aligning

    • Good for very large numbers of alignments!


Which tool should I use?

Input data                                   Recommendation
2-100 sequences of typical protein length    MUSCLE, T-Coffee, MAFFT, ClustalW
100-500 sequences                            MUSCLE, MAFFT
>500 sequences                               MUSCLE, KALIGN
Small number of unusually long sequences     ClustalW


How to evaluate?

Use a benchmark

BaliBASE


BaliBASE

Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics

  • IGBMC Strasbourg

  • 141 manual alignments using structures

    • 5 sections

      1. Equidistant (82)

      2. Orphan (23)

      3. Two groups (12)

      4. Long internal gaps (13)

      5. Long terminal gaps (11)

    • core alignment regions marked


Benchmark pitfalls

  • Benchmark dataset may not be representative

  • Danger of over-training towards benchmark

  • Goldman: Most MSAs have unrealistic gaps

    • Tend towards multiple, independent deletions

    • Insertions are rare

      • Sequences shrink in length over evolution

        • No supporting evidence that this is the case

Solutions

Use phylogenetic data to guide alignment

Keep track of changes to ancestor sequences

Don’t change them again so easily in descendants

PRANK

  • Probabilistic Alignment Kit

  • webPRANK

  • Better suited for closely related sequences

  • Ties between equally good solutions are broken at random

    • Avoids incorrect confidence in result

    • Means alignments might not be reproducible

  • Alignments look quite different

    • Might look worse!

    • But gap patterns make sense

    • Gaps are good!

Comparing Alignments - Example sequences

www.ebi.ac.uk/~watson/africa

prot_MSA.fasta

Pages 67-74 in full booklet: Questions 29-30

Common problems with MSA

  • Input format

    • FASTA format

    • Unique sequence identifiers

    • Include sequence!

  • Job can’t be found

    • Interactive results deleted after 24 hours

    • Use email

    • Consider another tool
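
The input-format problems listed on this slide can be caught before a job is submitted. The following is a rough sketch of such a pre-submission check. The function names and messages are made up for illustration, and it only tests the three points from the slide (FASTA headers, unique identifiers, non-empty sequences), not residue alphabets or tool-specific limits.

```python
def parse_fasta(text):
    """Split FASTA text into (identifier, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # The identifier is taken to be the first word after '>'.
            records.append([header[0] if header else "", []])
        elif records:
            records[-1][1].append(line)
        else:
            raise ValueError("sequence data found before the first '>' header")
    return [(identifier, "".join(chunks)) for identifier, chunks in records]


def check_fasta(text):
    """Return a list of problems found in the input (empty if none)."""
    problems = []
    seen = set()
    records = parse_fasta(text)
    if not records:
        problems.append("no FASTA records found")
    for identifier, sequence in records:
        if not identifier:
            problems.append("header with no identifier")
        elif identifier in seen:
            problems.append(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if not sequence:
            problems.append(f"no sequence for: {identifier}")
    return problems


print(check_fasta(">a\nACGT\n>a\n"))
# ['duplicate identifier: a', 'no sequence for: a']
```

An empty list means the input passed these three checks. Input that does not start with a `>` header raises a `ValueError` instead of being reported as a problem.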

Common misuses of MSA

  • Performing a sequence assembly

    • Specialist type of MSA

    • Use other tools (Staden etc.)

  • Aligning ESTs to a reference genome

    • Use EST2Genome

  • Designing primers

    • Use primer tools (primer3 etc.)

  • Aligning two sequences

    • Use a pairwise alignment tool!
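
To show why two sequences are a job for a pairwise method, here is a minimal sketch of a global (Needleman-Wunsch) alignment score computed by dynamic programming. The scoring values are arbitrary assumptions chosen for the example. Real pairwise tools use substitution matrices and usually affine gap penalties, and they also report the alignment itself rather than just the score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty.

    The match, mismatch and gap values are illustrative defaults only.
    """
    # Row for the empty prefix of a: aligning it to b[:j] costs j gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "AGT"))  # 1
```

The example scores three matches and one gap (3 - 2 = 1). The table is filled once for the pair, so no guide tree or progressive merging is needed.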

Putting it all together

EB-Eye search

Sequence retrieval

Sequence search

Sequence retrieval

Multiple sequence alignment

Analysis

Final remarks

  • Don’t assume a single tool will cater for all your needs

  • DO change the parameters of the tools

  • Remember where the tool excels and what its limitations are

  • A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

  • Crazy input will always give crazy results!

Getting Help

  • Database documentation

  • Frequently Asked Questions

    • http://www.ebi.ac.uk/help/faq.html

  • 2can Support Portal

    • http://www.ebi.ac.uk/2can/

  • EBI Support

    • http://www.ebi.ac.uk/support/

  • Hands-on training programme

    • http://www.ebi.ac.uk/training/handson/

Thanks!

Hamish McWilliam and Andrew Cowley

Vicky Schneider

Rodrigo Lopez

EMBL-EBI

SLING

You!
