EBI Roadshow

James Watson, PhD

Senior Scientific Training Officer

EBI-EMBL

[email protected]


Sequence Searching and Alignments

Andrew Cowley

External Services, EMBL-EBI


External Services

Andrew Cowley

Bioinformatics Trainer

Hamish McWilliam

Software engineer

Rodrigo Lopez

Head of External Services

+ many others!

Sequence searching and alignments - Andrew Cowley


Contents

  • Sequence databases

    • Database browsing tools

  • Similarity searching and alignments

    • Alignment basics

    • Similarity searching tools

    • More advanced tools

    • Alignment tools

    • Guidelines

  • (slightly) More advanced tools

  • Problem sequences

Materials

Presentations and tutorials can be found on the roadshow course page at the EBI

Data files for exercises can be found at:

www.ebi.ac.uk/~watson/africa

Data

Simplistically, much of the data at the EBI can be thought of as a container with two parts:

  • the raw data (e.g. sequence)

  • the annotation on this data

Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtcarttaattcacagagccctgaggtctctcgctcctttctgcgtcactctct        60
     cttaccgtcatcatgtacaagtggttgctttgcctggctctgattggctgcgtcagcggc       120
     tgcaacatcctccagaggctgaaggtcaagaaccagtggcaggaggctttcggctatgct       180
     gacgacaggacatcccycggtaccgcattgtggagatccatcatcatgcagaagcccgag       240
//
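The "container" idea (raw data plus annotation) can be sketched in code. The following is a minimal illustration, not an official ENA parser: it assumes each annotation line starts with a two-letter line code, that `SQ` starts the sequence block and `//` ends the entry. The function name `parse_embl` and the shortened `entry` string are made up for this example.

```python
from collections import defaultdict


def parse_embl(text):
    """Split an EMBL-style flat-file entry into (annotation, sequence)."""
    annotation = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("XX"):
            continue  # blank and spacer lines carry no data
        if line.startswith("//"):
            break  # end of entry
        if in_sequence:
            # Sequence lines: keep the letters, drop spaces and the position counter.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, _, rest = line.partition(" ")
        if code == "SQ":
            in_sequence = True
        annotation[code].append(rest.strip())
    return dict(annotation), "".join(sequence_parts)


# Shortened copy of the example entry (assumption: only a few lines kept).
entry = """\
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtcarttaattcacagagccctgaggtctctcgctcctttctgcgtcactctct        60
//
"""

annotation, sequence = parse_embl(entry)
print(annotation["DE"])   # annotation part
print(len(sequence))      # raw data part
```

Keeping the two parts separate mirrors how the search tools are used later: text searches work on the annotation, similarity searches work on the sequence.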

Data - Nucleotide

  • ENA/EMBL-Bank:

    • Release and updates

    • Divided into classes and divisions

    • Supplementary sets: EMBL-CDS, EMBL-MGA

  • Specialist data sets, e.g.:

    • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.

    • Alternative splicing: ASD, ASTD, etc.

    • Completed genomes: Ensembl, Integr8, etc.

    • Variation: HGVBase, dbSNP, etc.

Individual sequencing

What sequence data is submitted?

Individual scientists: sequence an individual gene -> add annotation -> submission


High throughput sequencing

Large-scale sequencing projects (e.g. whole genome shotgun):

chromosome -> fragment -> sequencing library -> sequence reads -> assemble sequence -> annotation (e.g. cyp30, cyp309, insv, cg343)

Submission is marked at three points along this pipeline.
What are primary sequence databases?

Submitters: individual scientists, large-scale sequencing projects, patent offices

Primary sequence data:

  • Original sequence data
    • Experimental data
    • Patent data
  • Submitter-defined

Primary sequence data is stored in a primary sequence database.

How do primary and derived databases differ?

  • Primary sequence database: primary sequence data submitted by individual scientists, large-scale sequencing projects and patent offices
  • Derived database: derived data, e.g. protein sequence

Primary v. derived data

submit: DNA sequence

ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT

transcribe: derived mRNA sequence

AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

translate: derived protein sequence

MRSDECCCAMSC
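How a derived sequence is computed from a primary one can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names are made up for this example, and `transcribe` assumes its input is the coding strand.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Coding-strand DNA -> mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein, reading frame 1, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"))  # MRSDECCCAMSC
```

Because the protein is computed from the nucleotide entry, it can be regenerated whenever the primary entry changes.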


How do primary and derived databases differ?

  • Primary sequence database: redundant. If anything in a submission varies (e.g. source / submitter / sequence), a new entry is generated
  • Derived database (e.g. protein sequence): may be non-redundant

How do primary and derived databases differ?

  • Primary sequence database: if primary sequence data is lost, the data is lost
  • Derived database: derived data (e.g. protein sequence) can be regenerated from the primary data

Primary nucleotide sequence databases

INSDC: International Nucleotide Sequence Database Collaboration

  • GenBank (U.S.A.)
  • DDBJ (Japan)
  • ENA (Europe)
  • Daily exchange of data
  • Submission can be made to any INSDC database


Sequence information

How is sequence data processed?

  • Reads
    • Sequence machine output (reads)
    • Quality scores
  • Assembly
    • Fragmented sequence reads -> assembled into contigs -> mapped onto chromosomes
  • Annotation
    • Functional information assigned to assembled regions

Sequence information

What type of sequence data is submitted?

  • Raw data (reads)
    • Input information: sample, set-up, machine configuration
    • Output machine data: sequence traces, reads, quality scores
    • Metagenomic data: where originated
  • Assembled sequences (assembly) and annotated sequence (annotation)
    • Interpreted information: assembly, mapping, functional annotation, sample information


European Nucleotide Archive

How does ENA store the data?

  • Raw data
    • Trace Archive: trace sequence reads from capillary sequencing instruments
    • Sequence Read Archive (SRA): intensity reads from next-generation sequencing instruments
  • Assembled sequences and annotated sequence
    • ENA-Annotation (formerly EMBL-Bank)
  • Submitters: large-scale sequencing projects, individual scientists, patent offices


INSDC Sequencing Projects

Can data be traced to an Institute?

  • A project pulls information together: the database records (genomic, ESTs, shotgun, ...) for a complete genome / metagenome
  • Track projects by institute or consortium
  • Assembly & annotation (single organism / metagenomic study)
  • Comparative analysis


Nucleotides: European Nucleotide Archive (ENA)

The ENA has a three-tiered data architecture.

It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).

Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).

Data Quality

Is the data cleaned up?

Validation of submitted data:

  • Automatic quality checks
  • Some manual inspection and curation

Errors can still exist in sequence and annotation


Database Structure

How is the data organized?

Data in ENA-Annotation is divided in 2 ways:

1) Data classes

  • Type of data or methodology used to obtain data
  • Each entry belongs to one data class

2) Taxonomic Divisions

  • Each entry belongs to one taxonomic division


Data Classes

  • CON: Constructed from sequence assemblies
  • EST: Expressed Sequence Tag (cDNA)
  • GSS: Genome Survey Sequence (high-throughput short sequence)
  • HTC: High-Throughput cDNA (unfinished)
  • HTG: High-Throughput Genome sequencing (unfinished)
  • MGA: Mass Genome Annotation
  • PAT: Patent sequences
  • STS: Sequence Tagged Site (short unique genomic sequences)
  • STD: Standard (high quality annotated sequence)
  • TPA: Third Party Annotation (re-annotated and re-assembled)
  • TSA: Transcriptome Shotgun Assembly (computational assembly)
  • WGS: Whole Genome Shotgun


Data Classes: EST

  • Single pass reads -> variable quality
  • Need to search both EST and RNA data


Data Classes: PAT

  • Often copies of existing entries
  • Records not clean, even for taxonomy


Data Classes: STD

  • Bulk of entries
  • Highest level of tracked information


Data Classes: TPA

  • Derived data entries
    • e.g. patch genomic and RNA data to construct complete coverage
  • Must have publication
  • Must show which entries data is derived from


Data Classes: TSA

  • Also derived data entries
  • ESTs assembled to construct RNA
  • Must show which EST/HTC entries data is derived from


Data Classes: WGS

  • Entries change over time (completely replaced)
  • Raw WGS entries assembled into contigs -> CON entries


Data Classes

How stable is the data?

Data is always changing:

  • Assembly of sequences into larger fragments
  • Deletion of obsolete entries (i.e. once assembled)
  • Sequence modifications
  • Daily updates
  • Identifier changes
  • Corrections (databases can contain errors)
  • etc.


Data Classes

How does assembly affect entries?

Example:

  • WGS (Shotgun)
    • Fragments in separate entry
  • CON (Constructed)
    • Join to make new CON entries
    • Old WGS entries archived
  • STD (Standard)
    • Join into large STD entry (e.g. completed genome)
    • Add annotation
    • Old CON entries archived


Taxonomy

  • HUM: Human
  • MUS: Mouse
  • ROD: Rodent
  • MAM: Mammal
  • VRT: Vertebrate
  • FUN: Fungi

Other:

  • INV: Invertebrate
  • PLN: Plant
  • ENV: Environmental
  • PRO: Prokaryote
  • SYN: Synthetic
  • PHG: Phage
  • TGN: Transgenic
  • VIR: Viral
  • UNC: Unclassified
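Each entry records its data class and taxonomic division on its ID line, so both can be read directly from a flat file. The sketch below assumes the field order seen in the Example entry earlier (accession; sequence version; topology; molecule type; data class; taxonomic division; length). The function name and the dictionaries are made up for this example and list only a few codes.

```python
# A few codes only, taken from the Data Classes and Taxonomy lists.
DATA_CLASSES = {
    "EST": "Expressed Sequence Tag (cDNA)",
    "STD": "Standard (high quality annotated sequence)",
    "WGS": "Whole Genome Shotgun",
}
TAX_DIVISIONS = {
    "HUM": "Human",
    "MUS": "Mouse",
    "INV": "Invertebrate",
}


def parse_id_line(line):
    """Split an EMBL-style ID line into named fields (assumed field order)."""
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    accession, version, topology, mol_type, data_class, division, length = fields
    return {
        "accession": accession,
        "version": version,
        "topology": topology,
        "mol_type": mol_type,
        "data_class": data_class,
        "division": division,
        "length": length,
    }


record = parse_id_line("ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.")
print(record["data_class"], "=", DATA_CLASSES[record["data_class"]])
print(record["division"], "=", TAX_DIVISIONS[record["division"]])
```

For the globin example this gives data class STD and division INV, which is where that entry would be found when a search is restricted by class or by taxonomy.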


Taxonomy: ENV (Environmental)

  • CAUTION: organism never isolated
  • May blast sequence to assign putative organism


Taxonomy: SYN (Synthetic) and TGN (Transgenic)

  • CAUTION: not consistently handled, variable quality
  • Transgenics may be from multiple organisms


Taxonomy: UNC (Unclassified)

  • Division primarily used by GenBank for PAT (patent) sequences


Taxonomy exclusion

Some species are excluded from certain taxonomic ranges:

  • Rodent (ROD) excludes mouse
  • Mammal (MAM) excludes human, mouse, rodent
  • Vertebrate (VRT) excludes human, mouse, rodent, mammal

  • Applies to: ftp files and sequence search tools
  • But not: ENA Browser

Taxonomy Database

Which taxonomy database does ENA use?

  • All INSDC databases use the NCBI Taxonomy Browser
  • Only organisms with sequence are represented

EBI Taxonomy Portal

  • EBI-wide service -> maps resources into taxonomy service
  • Culture collection – physical data, e.g. sample or stored version
  • Biomaterial
  • Specimen voucher – representation, e.g. picture


Database Structure

How does data organization differ from GenBank?

GenBank (data classes: con, est, gss, htc, htg, pat, std, sts, ...; taxonomic divisions: hum, mus, rod, mam, vrt, fun, inv, pln, ...):

  • Data split into parallel slices
  • Large search sets
  • Classes incomplete for taxonomy
  • Taxonomy incomplete for classes

ENA-Annotation (same data classes and taxonomic divisions):

  • Data split into intersecting slices
  • Reduces search set
  • Ensures complete result set

Example:

  • ‘EST’ set: large data set, includes all EST entries
  • ‘Mouse’ set: large data set, includes all mouse entries
  • ‘Mouse’ + ‘EST’ intersection: small data set, ensures a complete set of mouse ESTs
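The effect of intersecting slices can be illustrated with plain sets. This is only a sketch: the accessions and their class/division assignments below are invented, and real ENA search sets are far larger.

```python
# Hypothetical entries: (accession, data class, taxonomic division).
entries = [
    ("X0001", "EST", "MUS"),
    ("X0002", "EST", "HUM"),
    ("X0003", "STD", "MUS"),
    ("X0004", "EST", "MUS"),
    ("X0005", "WGS", "PLN"),
]


def by_class(records, data_class):
    """Accessions of every entry in one data class."""
    return {acc for acc, cls, _ in records if cls == data_class}


def by_division(records, division):
    """Accessions of every entry in one taxonomic division."""
    return {acc for acc, _, div in records if div == division}


est_set = by_class(entries, "EST")        # all ESTs, any organism
mouse_set = by_division(entries, "MUS")   # all mouse entries, any class
mouse_ests = est_set & mouse_set          # the intersecting slice

print(sorted(mouse_ests))  # ['X0001', 'X0004']
```

Searching the intersection touches fewer entries than either large set, yet still contains every mouse EST.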


Data – Protein Sequence

  • UniProt databases:

    • UniProtKB: human curated and automatic translation sections

    • UniRef: non-redundant sequence clusters

    • UniParc: non-identical sequence archive

  • Sequence from structures:

    • PDB

    • SGT

  • Specialist data sets, e.g.:

    • Immunoglobulins: IMGT/HLA

    • Alternative splicing: ASD, ASTD

    • Completed proteomes: Ensembl, Integr8

    • Protein Interactions: IntAct

    • Patent Proteins: EPO, JPO, KIPO and USPTO

Sequence Databases

Genbank


Protein sequence: UniProt

  • Manual curation
    • Literature-based annotation
    • Sequence analysis
  • Automated annotation
    • Signal prediction
    • Transmembrane prediction
    • Other predictions
    • Protein classification (InterPro classification)

Some data sources for annotation:

  • GO: functional info
  • PRIDE: protein identification data
  • InterPro: protein families and domains
  • IntAct: molecular interactions
  • IntEnz: enzymes
  • HAMAP: microbial protein families
  • RESID: post-translational modifications


UniProt data sources and data flow

  • Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, FlyBase, WormBase, Patent Data, Sub/Peptide Data
  • UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences
  • UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities. It provides a richly curated protein database.
  • UniRef (UniRef 100, UniRef 90, UniRef 50): pre-computed clusters of similar proteins
  • UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)
  • UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)
  • Proteome Sets, IPI


The Two Sides of UniProtKB

  • UniProtKB/TrEMBL: redundant, automatically annotated - unreviewed
  • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation - reviewed


Databases

  • Many databases and they are getting bigger

  • Efficient searching requires knowing what each database stores

  • Don’t assume that everything in the databases is correct

  • Nothing is constant; the data keeps changing:

    • Deletions, sequence modifications

    • Daily updates, identifier changes, etc.

Searching databases

What is the difference between a primary and secondary database?

What methods of searching databases do you know of?

What is the best protein sequence database to search (specific part)?

Searching

  • Many ways of searching databases

  • Annotation/title

    • Know something about your sequence

      • Gene name

      • Function

      • Accession

New search service

Access from the EBI’s homepage

Species selector allows for easy comparison

Data organised according to:

  • gene
  • expression
  • protein
  • structure
  • literature

Explore data, return easily to your results


Database webpages

Database searching

Searching

  • Many ways of searching databases

  • Annotation/title

    • Know something about your sequence

      • Gene name

      • Function

      • Accession

  • Raw data

    • Don’t know!

    • Or want to check...

  • Infer extra information

    • Homology?

    • Annotation?

    • Function?

Sequence alignment

  • Relatively easy if we have an exact match

  • ... but sequence is variable

    • Between individuals, species, location etc.

  • That variability is useful data too!

  • Need a search method that allows for some variability

  • And even better – helps us assess that variability

Sequence alignment

Query: ACATAGGT

  1: TCATAGAT   Score: 6/8
  2: AAATTCTG   Score: 3/8
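The scores in this toy example are simple position-by-position identity counts. A minimal sketch, assuming both sequences have the same length and no gaps are allowed (the function name is made up for this example):

```python
def identity(query, subject):
    """Number of positions at which two equal-length sequences agree."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(q == s for q, s in zip(query, subject))


query = "ACATAGGT"
for name, subject in [("1", "TCATAGAT"), ("2", "AAATTCTG")]:
    print(name, f"{identity(query, subject)}/{len(query)}")
# 1 6/8
# 2 3/8
```

This only works when the sequences line up without insertions or deletions, which is why the following slides introduce gap penalties and dynamic programming.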

Sequence alignment

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaag

atgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttct

ttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaagg

cacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatct

caagggcacctttgcccagcttgagt

Query:

1

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggc

catggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccacc

aagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggc

aagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgcc

ctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggct

cctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgc

ctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttcctt

gggagatgccataaagcacctggatgatctcaagggca

2

Dot plot

GATACT

Sequence 1

A C A T A G

Query

Maybe a dot plot will help
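A dot plot marks every cell where a letter of one sequence equals a letter of the other; runs of dots along a diagonal show matching stretches. A minimal text-only sketch using the two short sequences above (function names are made up for this example):

```python
def dot_plot(query, subject):
    """Boolean matrix: rows follow subject, columns follow query."""
    return [[q == s for q in query] for s in subject]


def render(query, subject):
    """Draw the dot plot as text, '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(query)]
    for letter, row in zip(subject, dot_plot(query, subject)):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


matrix = dot_plot("ACATAG", "GATACT")
print(render("ACATAG", "GATACT"))
```

On real sequences a window and a threshold are normally used to suppress chance single-letter matches; that refinement is left out here.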

Dot plot

Figure: dot plots of Query vs Sequence 1 and Query vs Sequence 2

We can see the difference, but how to turn that into something a computer can evaluate?

Computers rely on algorithms which give them a score

They can then compare scores

  • Simple algorithm – penalise movement away from diagonal – gap penalty

In the example, staying on the diagonal scores 0 and each move away from it scores -10.

Gap extend

  • Having opened a gap, we should assign a lesser penalty to extending it
  • Actual implementation is usually to apply gap extension penalty to every gap

In the example, gap open = -10 and gap extend = -0.5, so the first gap position scores -10.5 and each further position in the same gap scores -0.5.


Why a lesser gap extend penalty?

NVELKAET

NVDEATNFELKAET

NV-ELKAET

NV------ELKAET

NVDE--A-TNFELKAET

NVDEATNFELKAET

Single block of insertions/deletions is more likely than multiple in/del events
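The arithmetic behind this can be sketched with an affine gap cost, using the gap open (-10) and gap extend (-0.5) values from the earlier slides and the convention stated there that the extension penalty applies to every gap position. The function name and the split into gap lengths 1, 2 and 3 are chosen for illustration.

```python
def gap_cost(length, gap_open=-10.0, gap_extend=-0.5):
    """Affine gap penalty: one opening cost plus an extension cost per position."""
    return gap_open + gap_extend * length


# Six gap positions as a single block ...
one_block = gap_cost(6)
# ... versus the same six positions spread over three separate gaps.
three_events = gap_cost(1) + gap_cost(2) + gap_cost(3)

print(one_block)      # -13.0
print(three_events)   # -33.0
```

With these values the single block scores much better, so the alignment algorithm prefers one long insertion/deletion over several short ones.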

Match/mismatch

Of course, we need to tell the algorithm that matching letters are better than mismatches too

This is done via a scoring matrix

       A   C   G   T
   A   5  -4  -4  -4
   C  -4   5  -4  -4
   G  -4  -4   5  -4
   T  -4  -4  -4   5

  • Putting the two together gives us a scoring mechanism

Example: ACA (columns) against TCA (rows, read from the bottom up), scored with match = 5, mismatch = -4, gap open = -10 and gap extend = -0.5:

          A      C      A
   A  -13.5    -13      6
   C    -18      1    -13
   T     -4    -18  -22.5


  • To pick the optimal alignment, start at the end and trace back the highest scoring route.

In the ACA against TCA example, the highest scoring route is the diagonal, with scores -4, 1 and 6.


Needleman-Wunsch

  • Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!

    • An example of dynamic programming

  • Comparing the full length of both sequences is called a global-global or just global alignment
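A compact sketch of global alignment by dynamic programming follows. It is a simplified variant, not the EMBOSS implementation: it uses the match (5) and mismatch (-4) scores from the scoring matrix but a single linear gap penalty (-10 per gap position) instead of separate open and extend penalties, and the function name is made up for this example.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the end along a highest scoring route.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACA", "TCA"))  # (6, 'ACA', 'TCA')
```

For ACA against TCA the best route is the ungapped diagonal (-4 + 5 + 5 = 6), the same final score as in the worked grid.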

Global vs Local

  • But global-global might not be suitable for sequences that are very different lengths

  • A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.

    • Sets negative scores in matrix to 0, and allows trace back to end and restart
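The local variant can be sketched in the same style. Again this is a simplified illustration with a linear gap penalty and made-up function name, not the EMBOSS or FASTA implementation: negative cell scores are reset to 0, and the trace back starts at the best cell and stops when it reaches a 0.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Local alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                           # restart the alignment here
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("ACATAGGT", "TCATAGAT"))  # (26, 'CATAGGT', 'CATAGAT')
```

The mismatching first letters are left out of the local alignment because including them would only lower the score.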

QUESTION: Global vs Local - which is which?

  A T G T A T A C G C
  - A G T A T A - G C

  A - T G T A T A C G C
  A G T A T A - - - G C

Which alignment is LOCAL and which is GLOBAL?

Scoring

  • Parameters so far:

    • Match/mismatch

    • Gap opening

    • Gap extending

  • Can we improve it?

Substitutions

  • Some substitutions are more likely than others

  • DNA:

    • Purines (A,G) – dual ring

    • Pyrimidines (C, T) – single ring

  • Substitutions within the same type (purine for purine, or pyrimidine for pyrimidine) are called transitions, whereas exchanging a purine for a pyrimidine (or vice versa) is called a transversion

  • Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
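One way to encode this in a nucleotide scoring matrix is sketched below. The match (5) and transversion (-4) values reuse the earlier match/mismatch matrix; the transition score of -1 is an arbitrary illustrative value, not taken from any EBI tool, and the names are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label a pair of bases as match, transition or transversion."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Illustrative scores only: transitions are penalised less than transversions.
SCORES = {"match": 5, "transition": -1, "transversion": -4}


def build_matrix(alphabet="ACGT"):
    """Scoring matrix as a dict keyed by (base, base)."""
    return {(x, y): SCORES[classify(x, y)] for x in alphabet for y in alphabet}


matrix = build_matrix()
print(matrix[("A", "G")], matrix[("A", "C")])  # -1 -4
```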

Proteins

What about proteins?

Protein substitution matrices

  • Can look at closely related proteins to determine substitution rates

  • Two most commonly used models:

    • BLOSUM

    • PAM

BLOSUM

Blocks of Amino Acid Substitution Matrix

Align conserved regions of evolutionarily divergent sequences clustered at a given % identity

Count relative frequencies of amino acids and substitution probability

Turn that into a matrix where the more positive a substitution score is, the more likely that substitution is to be found, and the more negative, the less likely.

Higher BLOSUM number = more closely related
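The step from counted frequencies to matrix entries is a log-odds calculation: how often a pair is observed, relative to how often it would be expected by chance. The sketch below is a simplification; the frequencies are invented, the scaling factor of 2 (half-bit units) is an assumption, and the function name is made up for this example.

```python
import math


def log_odds_score(p_pair, q_a, q_b, scale=2.0):
    """Rounded log-odds score for an aligned pair of residues.

    p_pair: observed probability of the aligned pair
    q_a, q_b: background frequencies of the two residues
    """
    return round(scale * math.log2(p_pair / (q_a * q_b)))


# Invented numbers: a pair seen twice as often as chance scores positive,
# a pair seen half as often as chance scores negative.
print(log_odds_score(0.02, 0.1, 0.1))    # 2
print(log_odds_score(0.005, 0.1, 0.1))   # -2
```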

PAM

PAM 250

Point Accepted Mutation

Observed mutations in a set of closely related proteins

Markov chain model created to describe substitutions

Normalised so that PAM1 = 1 mutation per 100 amino acids

Extrapolate matrices from model

Higher PAM number = less closely related
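The extrapolation step can be sketched as repeated multiplication of the PAM1 mutation probability matrix by itself. The two-letter alphabet and its 1% mutation probability below are a toy stand-in for the real 20 x 20 amino acid matrix, and the function names are made up for this example.

```python
def mat_mul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def pam_n(pam1, n):
    """Mutation probability matrix after n PAM units: pam1 raised to the power n."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1: each of two residues has a 1% chance of mutating into the other.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

pam2 = pam_n(toy_pam1, 2)
pam250 = pam_n(toy_pam1, 250)
print(round(pam2[0][0], 4))    # 0.9802
print(round(pam250[0][0], 3))
```

The probability of a residue staying unchanged falls as n grows, which is why higher PAM numbers suit less closely related sequences. Scoring matrices are then derived from these probabilities as log-odds values.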

Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence

Figure: panels for PAM 10, 100, 200, 300, 400 and 500

  • More divergent: BLOSUM 45, PAM 250
  • BLOSUM 62, PAM 160
  • Less divergent: BLOSUM 90, PAM 100


Scoring

  • Parameters:

    • Match/mismatch

    • Gap opening

    • Gap extending

    • Substitution matrix
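
To illustrate how these parameters combine, here is a sketch that scores an existing pairwise alignment. It assumes a gap of length k costs gap_open + (k - 1) * gap_extend, uses simple match/mismatch values in place of a full substitution matrix, and all default values are illustrative.

```python
def score_alignment(row_a, row_b, match=5, mismatch=-4,
                    gap_open=-10, gap_extend=-1):
    """Score two equal-length aligned rows ('-' marks a gap) with affine gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap = None  # which row the previous column's gap was in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # The first column of a gap pays the opening cost,
            # each further column of the same gap pays the extension cost.
            total += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            total += match if x == y else mismatch
            prev_gap = None
    return total


print(score_alignment("ACGT--A", "ACGTTTA"))
```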


Dynamic programming alignments at the EBI

  • EMBOSS Pairwise Alignment Algorithms

  • European Molecular Biology Open Software Suite

    • Suite of useful tools for molecular biology

    • Command line based

    • Designed to be used as part of scripts/chained programs

  • We implement selected tools to provide web-based access


Where to find at the EBI?

http://www.ebi.ac.uk/Tools/sequence.html

Or...


EMBOSS align tools

Needle

  • Global alignment

Water

  • Local alignment
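
The difference between the two can be sketched with one dynamic programming routine. This is a simplified illustration (linear gap penalty, score only, illustrative parameter values), not the EMBOSS implementation.

```python
def align_score(a, b, local=False, match=2, mismatch=-1, gap=-2):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            h[i][0] = i * gap
        for j in range(1, cols):
            h[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if local:
                # Local alignment: scores never drop below zero,
                # and the best cell anywhere in the matrix wins.
                cell = max(cell, 0)
                best = max(best, cell)
            h[i][j] = cell
    return best if local else h[-1][-1]


a = "GGGGACGTGGGG"
b = "TTTTACGTTTTT"
print("global:", align_score(a, b))
print("local: ", align_score(a, b, local=True))
```

The two made-up sequences share only a short core, so the local score is much higher than the global one, which has to pay for the mismatched flanks.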


Program selection

Parameters

Sequence input

Submit!


Key

- Gap

: Positive match

. Negative match

| Identity


Pairwise Alignments - Example sequences

www.ebi.ac.uk/~watson/africa

Pairwise_align1.fsa

Pairwise_align2.fsa

Pages 25-30 in full booklet: Questions 7-10


Dynamic programming sequence search methods at the EBI

GGSEARCH

  • Global alignment

SSEARCH

  • Local alignment

GLSEARCH

  • Global query vs local database


Where to find at the EBI?

www.ebi.ac.uk/Tools/sss/

Or...


Similarity search

Database selection

Sequence input

Parameters

Submit!


Dynamic programming methods are rigorous and guarantee an optimal result

But they have to hold the full scoring matrix for both sequences in memory

And evaluate each position of the matrix

Predictably, this makes them slow and demanding when you are aligning large sequences
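
A back-of-envelope sketch of why this gets demanding, assuming one matrix cell per residue pair and 4 bytes per cell (both assumptions for illustration):

```python
def dp_cost(len_a, len_b, bytes_per_cell=4):
    """Cells and memory for a full (len_a + 1) x (len_b + 1) scoring matrix."""
    cells = (len_a + 1) * (len_b + 1)
    return cells, cells * bytes_per_cell


for n in (300, 10_000, 1_000_000):
    cells, mem = dp_cost(n, n)
    print(f"{n:>9} x {n:<9} -> {cells:.2e} cells, {mem / 1e9:.3g} GB")
```

Both the number of cells to evaluate and the memory needed grow with the product of the two sequence lengths.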


Heuristics

  • Therefore we need methods of estimating alignments

  • Estimation methods are called heuristics

    • Try to take short cuts in an intelligent manner

    • Speed up the search

    • At the possible expense of accuracy

  • Accuracy in sequence searches is important for:

    • Aligning the right bits

    • Scoring the alignment correctly

    • Identifying similar sequences - sensitivity


Going back to our dot plot


Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.


  • Of course, we have to identify likely regions – not all alignments will be as nice as that one!

  • This is the method used by FASTA

    • W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448


FASTA – step 1

Ktup parameter:

How many consecutive identities before considered a ‘run’

Also called ‘word size’

Increase Ktup = faster, but less sensitive

Identify runs of identical sequence and pick regions with highest density of runs
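
A rough sketch of this first step, assuming identical words are looked up in a hash table and tallied per diagonal (the function name and layout are illustrative, not FASTA's actual code):

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count identical ktup-length words on each diagonal.

    A diagonal is identified by (subject position - query position), so
    words that line up without gaps fall on the same diagonal.
    """
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ACDEFG", "XXACDEFG", ktup=2)
print(hits.most_common(1))  # the densest diagonal and its hit count
```

Raising ktup means fewer, longer words to look up, which is faster but misses regions with only short runs of identity.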


FASTA – step 2

Parameter:

Substitution matrix

Weight scoring of runs using matrix, trim back regions to those contributing to highest scores


FASTA – step 3

Joining threshold:

Internally determined

Discard regions too far from the highest scoring region


FASTA – step 4

Parameters:

Gap open

Gap extend

Substitution matrix

Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions


FASTA

Repeat against all sequences in the database


FASTA – programs available at EBI

  • FASTA: “a fast approximation to Smith & Waterman”

    • FASTA – scan a protein or DNA sequence library for similar sequences.

    • FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.

    • TFASTX/Y – compare a protein sequence to a translated DNA data bank.

    • FASTF – compares ordered peptides (Edman degradation) to a protein databank.

    • FASTS – compares unordered peptides (Mass Spec.) to a protein databank.

    • SSEARCH – Rigorous scan of protein or DNA sequence library (S&W Algorithm).


Where to find at the EBI?

www.ebi.ac.uk/Tools/sss/

Or...


Similarity search

Database selection

Sequence input

Parameters

Submit!


FASTA - results

Key

- Gap

: Identity

. Similarity

X Filtered


Using FASTA - Example sequence

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Pages 37-46 in full booklet: Questions 11-14


BLAST – Basic Local Alignment Search Tool

Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Instead of narrowing the dynamic programming search space, BLAST works a different way

First, it creates a word list containing both the exact words from the query sequence and their high-scoring substitutions


BLAST – step 1

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Words (w=3): SEW, EWR, WRF, ...

Parameter:

Word length (w)

Increase = faster, but less sensitive
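
Generating the word list from the query can be sketched in a few lines (the function name is illustrative):

```python
def words(seq, w=3):
    """Return every overlapping word of length w in seq, in order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


query = "SEWRFKHIYRGQPRRHLLTTGWSTFVT"
word_list = words(query, w=3)
print(word_list[:3], "...", len(word_list), "words in total")
```

A longer word length gives fewer words and fewer chance matches to follow up, hence the trade-off between speed and sensitivity.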


BLAST – step 1 (cont'd)

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Neighbourhood words for GQP (w=3, T=13):

GQP 18
GEP 15
GRP 14
GKP 14
GNP 13
GDP 13
(threshold T=13: words scoring below this are discarded)
AQP 12
NQP 12

Parameters:

Neighbourhood threshold (T)

Substitution matrix
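
A sketch of how the neighbourhood of GQP could be enumerated. The score table is a partial set of BLOSUM62 values covering only the substitutions on the slide, so a full matrix would produce further neighbours.

```python
from itertools import product

# Partial BLOSUM62 scores, restricted to the substitutions shown on the slide.
# A residue missing from this table raises KeyError; a real search would use
# the complete 20 x 20 matrix.
SUBST = {
    "G": {"G": 6, "A": 0, "N": 0},
    "Q": {"Q": 5, "E": 2, "R": 1, "K": 1, "N": 0, "D": 0},
    "P": {"P": 7},
}


def neighbourhood(word, threshold):
    """Return (neighbour, score) pairs scoring at least threshold, best first."""
    options = [SUBST[res].items() for res in word]
    scored = []
    for combo in product(*options):
        neighbour = "".join(res for res, _ in combo)
        score = sum(s for _, s in combo)
        if score >= threshold:
            scored.append((neighbour, score))
    return sorted(scored, key=lambda item: -item[1])


for neighbour, score in neighbourhood("GQP", threshold=13):
    print(neighbour, score)
```

Lowering T admits more neighbours (here AQP and NQP at T=12), which increases sensitivity at the cost of more database hits to evaluate.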


BLAST – step 2

Then it scans database sequences for exact matches with these words


BLAST – step 3

Parameters:

Drop off

Substitution matrix

If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

This results in a High-scoring Segment Pair (HSP)
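
The extension step can be sketched for one direction as follows. The match/mismatch scores and drop-off value are illustrative assumptions; real BLAST extends in both directions and scores with a substitution matrix.

```python
def extend_right(query, subject, q_start, s_start, score_fn, drop_off=5):
    """Extend an ungapped hit rightwards until the score falls drop_off below its best.

    Returns (best score, number of aligned positions that achieved it).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_fn(query[q_start + i], subject[s_start + i])
        if score > best:
            best, best_len = score, i + 1
        elif best - score >= drop_off:
            break  # the score has dropped too far; stop extending
        i += 1
    return best, best_len


def simple_score(a, b):
    """Hypothetical scoring: +2 for identity, -3 otherwise."""
    return 2 if a == b else -3


print(extend_right("ACDEFGHIK", "ACDEFWWWW", 0, 0, simple_score))
```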


BLAST – step 4

Parameters:

Extension threshold (Sg)

Substitution matrix

If the total HSP score is above another threshold then a gapped extension is initiated


BLAST

The steps rule out many database sequences early on

Large increase in speed


BLAST – programs available at the EBI

Combines several parameters into ‘sensitivity’ option

  • Basic Local Alignment Search Tool

  • NCBI-BLAST programs:

    • BLASTP – protein sequence vs. protein sequence library

    • BLASTN – nucleotide query vs. nucleotide database

    • BLASTX – translated DNA vs. protein sequence library

  • WU-BLAST programs:

    • BLASTP – protein query vs. protein database

    • BLASTN – nucleotide query vs. nucleotide database

    • BLASTX – translated nucleotide query vs. protein database

    • TBLASTN – protein query vs. translated nucleotide database

    • TBLASTX – translated nucleotide query vs. translated nucleotide database


Using BLAST - Example sequence

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Pages 47-50 in full booklet: Questions 15-17


Key

- Gap

[residue] Identity

+ Similarity

X Filtered


Differences between BLAST and FASTA

  • BLAST

    • Fast

    • Good with proteins

    • Produces good local alignments + short global alignments

    • Produces HSP (reports internal matches in long sequences)

    • Might miss a potential alignment due to ruling out sequences early on in the process

    • Good at finding siblings

  • FASTA

    • Not as fast as BLAST

    • Much better with DNA than BLASTN

    • Produces S&W alignments

    • Checks each possible alignment with database sequences

    • Good at finding cousins


When to use what?

[Chart: NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA plotted by query length against database size]


When to use what?

[Chart: time to search for NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA across databases: PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB, UniParc]


Homology and Similarity


Similarity


Homology


Unrelated!


Homology vs. Similarity

Homology

  • Presence of similar features because of common descent

  • Cannot be observed, since the ancestors no longer exist

  • Is inferred as a conclusion based on ‘similarity’

  • Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)

Similarity

  • Quantifies a ‘likeness’

  • Uses statistics to determine ‘significance’ of a similarity

  • Statistically significant similar sequences are considered ‘homologous’


  • So far, we’ve talked about scoring alignments

    • Direct function of the algorithm

  • But what we want is to assign some kind of quality to that score


Score vs significance

A A A

A C A T A A G G C T

A A A

A T A C A A G C C T

High score

High significance


“Lies, damn lies, and statistics”

Not just interested in score...

...But how likely we are to get that alignment by chance alone

It is this ‘non-random’ alignment that lets us infer homology

Statistics are used to estimate this chance


E-value

‘Expect’ value

Expected number of alignments scoring at least this well that would occur by chance (for small values, close to the probability)

Best measure of how good an alignment is

Often used for ranking results by default


Calculated in different ways for BLAST and FASTA

Short query sequences are more likely to be found by chance so have higher E-values

Affected by parameter values like gap penalties and substitution matrices
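
One widely used form of the calculation, the Karlin-Altschul equation associated with BLAST, is E = K * m * n * exp(-lambda * S). The sketch below assumes that form; the lambda and K values are illustrative and in practice depend on the matrix and gap penalties.

```python
import math

# Illustrative statistical parameters; real values depend on the scoring
# matrix and gap penalties chosen for the search.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalise a raw alignment score so it is comparable across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


for s in (40, 60, 80):
    print(f"S={s}: bits={bit_score(s):.1f}, E={e_value(s, 300, 1e8):.2e}")
```

Under this form E grows linearly with database size, which is one reason the same alignment score is less significant in a larger database.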


FASTA statistics

Compares query sequence with every sequence in database

As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance


FASTA - histogram

Key

* Predicted distribution of scores

= Observed distribution of scores

High scoring region


BLAST statistics

“Appears to yield fairly accurate results”

Main reason for speed is that it doesn’t compare query with lots of other sequences

Therefore it pre-estimates statistical values using a random sequence model


Search Guidelines



Search guidelines 1

Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc.)

Then with translated DNA sequences (fastx, blastx)

Search with DNA vs. DNA as the next resort

And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!


Search guidelines 2

  • Search the smallest database that is likely to contain the sequence(s) of interest

  • Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology


Search guidelines 3

  • Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence

    • Examine the histograms

    • Use programs such as prss3 to confirm the expectation values.

    • Search with shuffled sequences (use MLE/Shuffle in fasta), which should give an E() of ~1.0
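
Shuffling a query to act as a negative control can be sketched as below (a plain uniform shuffle that preserves composition; this is an illustration, not the FASTA MLE/Shuffle option itself):

```python
import random


def shuffled(seq, seed=None):
    """Return a random permutation of seq with the same residue composition."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


query = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
print(query)
print(shuffled(query, seed=1))
```

Because the shuffled sequence keeps the composition but loses the order, any strong hits it produces point to composition bias rather than homology.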


Search guidelines 4

  • Consider searches with different gap penalties and other scoring matrices

    • Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences

    • Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)

    • Remember to change the gap penalty defaults!

MATRIX     open   ext.
BLOSUM50   -10    -2
BLOSUM62    -7    -1
BLOSUM80   -16    -4
PAM250     -10    -2
PAM120     -16    -4


Search guidelines 5

  • Homology can be reliably inferred from statistically significant similarity

  • But remember:

    • Orthologous sequences have similar functions

    • Paralogous sequences can acquire very different functional roles

  • So further work might be needed to tease out details


Search guidelines 6

  • Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues

    • However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!

  • Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data

    • ClustalW

    • MUSCLE

    • T-Coffee

    • Kalign

    • MAFFT

    • Mview (available from EBI FASTA & BLAST services)

    • DBCLUSTAL (available from EBI BLAST services)


Advanced



Conserved regions

Structural information

Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H

In general, the more information we can add to an alignment, the better the result
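
A motif written this way maps directly onto a regular expression. A small sketch, using the motif above and a made-up test sequence:

```python
import re

# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H written as a character-class pattern.
MOTIF = re.compile(r"[RTD][DAQ][FEA]ATH")


def find_motif(seq):
    """Return (start position, matched text) for each motif occurrence in seq."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(seq)]


print(find_motif("MKRDFATHLL"))
```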


Conserved regions

We can even modify a normal search to generate a position specific scoring matrix, or PSSM

We can add a new ‘position’ parameter to the substitution matrix


PSI-BLAST

Position Specific Iterative – BLAST:

  • Takes the result of a normal BLAST

  • Aligns them and generates profile of conserved positions

  • Uses this to weight scoring on next iteration
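
A heavily simplified sketch of building and using a position specific scoring matrix from aligned hits. It assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more careful estimates.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, background=0.05, pseudocount=1.0):
    """Per-column log2-odds scores from equal-length aligned sequences (gaps ignored)."""
    pssm = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return pssm


def pssm_score(pssm, seq):
    """Score an ungapped sequence of the same length against the PSSM."""
    return sum(column[res] for column, res in zip(pssm, seq))


hits = ["GQP", "GEP", "GQP", "GKP"]  # made-up aligned hits
pssm = build_pssm(hits)
print(round(pssm_score(pssm, "GQP"), 2), round(pssm_score(pssm, "AQP"), 2))
```

Positions that are conserved across the hits (G and P here) contribute more to the score than variable ones, which is how the next iteration gains sensitivity.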


PSI-BLAST

More sensitive

By adding importance to conserved residues we might be able to find more distant sequences

But iterate too far and we might be assigning importance where there is none


PHI-BLAST

Pattern Hit Initiated-BLAST

User provides a pattern alongside a protein

Database hits have to contain this pattern, and similarity to rest of sequence

Results can initiate a PSI-BLAST search as well


PSI-SEARCH

Smith-Waterman implementation (SSEARCH)

But with iterative position specific scoring


Using PSI-BLAST - Example sequence

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Pages 52-55 in full booklet: Questions 18-20


Problem Sequences



Short sequences

  • What about short sequences?

  • Depends on their nature:

    • Protein

      • Reduce word length and/or increase the E() value cut off

      • Use shallow matrices

    • DNA

      • Reduce the word length

      • Ignore gap penalties (force local alignments only)

  • Use rigorous methods

  • But ask what you are trying to do!


Low complexity regions

Sometimes biologically relevant, but always likely to skew alignment scoring

E.g. CA repeats, poly-A tails and Proline rich regions


Good Statistics:

The inset shows good correlation between the observed and expected numbers of scores. This is the region of the histogram to look at first when evaluating results.


Bad Statistics:

The inset shows bad correlation between the observed and expected scores in this search. The spaces between the = and * symbols indicate this poor correlation. One reason for this can be low complexity regions.


Low complexity regions

  • Sometimes biologically relevant, but always likely to skew alignment scoring

  • E.g. CA repeats, poly-A tails and Proline rich regions

  • Compensate by filtering sequence so these regions don’t contribute to scoring

    • Filters: seg, xnu, dust, CENSOR

  • But check what you are filtering!
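
To illustrate what a low complexity filter does, here is a simplified sliding-window entropy mask. It is not the seg or dust algorithm; the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace residues in any window whose entropy is below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


# Made-up sequence with a proline-rich stretch in the middle.
seq = "MKTAYIAKQRQISFVKSHFSRQ" + "P" * 15 + "LEERLGLIEVQAPILSRVGDGT"
print(mask_low_complexity(seq))
```

Note that a crude window-based mask like this can also blank out a few flanking residues, which is one reason to check what has actually been filtered.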


Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity. Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.


Using Filters - Example sequence

www.ebi.ac.uk/~watson/africa

Filtertest_seq.fsa

Pages 56-57 in full booklet: Questions 21-22


Vector contamination

You think you know what your sequence is...

... but the results are really confusing!

Maybe you have vector contamination

Search against known vectors to check


Vector Contamination - Example sequences

www.ebi.ac.uk/~watson/africa

vectortest_seq1.fsa

vectortest_seq2.fsa

Page 57 in full booklet: Question 23


Multiple Sequence Alignments



Uses of MSA

Functional prediction

Phylogeny

Structural prediction

Homology detection

Protein analysis

To distinguish between orthology and paralogy


Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)

But this is too computationally intensive


Human beta --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST

Horse beta --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN

Human alpha ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-

Horse alpha ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-

Whale myoglobin ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT

Lamprey globin PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT

Lupin globin --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

*: : : * . : .: * : * : .

Human beta PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL

Horse beta PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL

Human alpha ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL

Horse alpha ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL

Whale myoglobin EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF

Lamprey globin ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV

Lupin globin VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

. .:: *. : . : *. * . : .

Human beta LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------

Horse beta LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------

Human alpha LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------

Horse alpha LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------

Whale myoglobin ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

Lamprey globin LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------

Lupin globin VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

: : .: . .. . :

Weighted Sums of Pairs: WSP

Time: O(L^N)

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years
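
The slide's timings grow by a factor of about 150 for each added sequence, which is what O(L^N) predicts for sequences of length L = 150 (an inference from the numbers, not stated on the slide). A sketch of that growth, calibrated to 1 second for two sequences:

```python
def wsp_seconds(n_sequences, length=150, base_seconds=1.0):
    """Estimated run time if cost grows as length**n, set to base_seconds for n=2."""
    return base_seconds * length ** (n_sequences - 2)


for n in range(2, 8):
    seconds = wsp_seconds(n)
    print(f"{n} sequences: {seconds:.3g} s = {seconds / 86400:.3g} days")
```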


Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)

But this is too computationally intensive

Therefore we have to use heuristics and progressive alignment methods


Clustal

  • >60,000 citations

  • Clustal1-Clustal4

    • 1988, Paul Sharp, Dublin

  • Clustal V 1992

    • EMBL Heidelberg,

    • Rainer Fuchs

    • Alan Bleasby

  • Clustal W, Clustal X 1994-2005

    • Toby Gibson, EMBL, Heidelberg

    • Julie Thompson, ICGEB, Strasbourg

  • Clustal W and Clustal X 2.0 2006

    • University College Dublin

www.clustal.org


CLUSTAL

Quick, pairwise alignment of all sequences

Line up pairs, with the most similar first


CLUSTAL

Fix the alignment between pairs and treat as one sequence


CLUSTAL

Align your fixed pairs with each other


Note, this is not a phylogram!

Only a guide tree for the alignment


ClustalW at the EBI


ClustalW

Help!

Parameters

Sequence input

Submit!


ClustalW

Interactive – results in browser, deleted after 24 hours

Email – receive URL to results page, deleted after 7 days


ClustalW


Jalview


ClustalW

Advantages

  • Fast

  • Not too demanding

  • Widely used

  • Fine for most uses

Disadvantages

  • Fixing of early alignments

    • Propagate errors

  • Doesn’t search far

    • Local minima

  • Compresses gaps


Use of Clustal/JalView - Example sequences

www.ebi.ac.uk/~watson/africa

prot_MSA.fasta

Problem_MSA1.fsa

Problem_MSA2.fsa

Problem_MSA3.fsa

Problem_MSA4.fsa

Pages 59-66 in full booklet: Questions 24-28


Other Tools


COFFEE

  • Consistency based Objective Function For alignmEnt Evaluation

  • Maximum Weight Trace (John Kececioglu)

  • Maximise similarity to a LIBRARY of residue pairs

  • Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.


COFFEE

SAGA is another alignment method, using genetic algorithms

  • Library of reference pairwise alignments

    • For your given set of sequences

  • Objective Function

    • Evaluates consistency between multiple alignment and the library of pairwise alignments

    • Use SAGA to optimise this function

  • Weight depending on quality of alignment


COFFEE

  • More accurate than ClustalW

  • Much less prone to problems in early alignment stages

  • VERY slow!


T-Coffee

  • Tree-based COFFEE

  • Heuristic approach to COFFEE

    • Gets rid of genetic algorithm portion

    • Uses progressive alignments

    • Changes algorithm based on number of sequences


T-Coffee

Much faster than COFFEE

Avoids some of ClustalW’s pitfalls

Can take information from several data sources

Still not that fast

Can be very demanding of memory etc.


Others

  • MUSCLE – Bob Edgar

    • Iterative/progressive alignment

    • Fast

    • Good for big alignments, proteins

  • MAFFT

    • Iterative based Fast Fourier Transform

    • Fast and accurate

    • Good for huge alignments

  • Kalign

    • Very fast, local-regions aligning

    • Good for very large numbers of alignments!


Which tool should I use?

Input data: Recommendation

  • 2-100 sequences of typical protein length: MUSCLE, T-Coffee, MAFFT, ClustalW

  • 100-500 sequences: MUSCLE, MAFFT

  • >500 sequences: MUSCLE, KALIGN

  • Small number of unusually long sequences: ClustalW


How to evaluate?

Use a benchmark

BaliBASE


BaliBASE

Thompson, J.D., Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics

  • ICGEB Strasbourg

  • 141 manual alignments using structures

    • 5 sections

    • core alignment regions marked

1. Equidistant (82)

2. Orphan (23)

3. Two groups (12)

4. Long internal gaps (13)

5. Long terminal gaps (11)


Benchmark pitfalls

  • Benchmark dataset may not be representative

  • Danger of over-training towards benchmark

  • Goldman: Most MSAs have unrealistic gaps

    • Tend towards multiple, independent deletions

    • Insertions are rare

      • Sequences shrink in length over evolution

        • No supporting evidence that this is the case


Solutions

Use phylogenetic data to guide alignment

Keep track of changes to ancestor sequences

Don’t change them again so easily in descendants


PRANK

  • Probabilistic Alignment Kit

  • webPRANK

  • Better suited for closely related sequences

  • Tied solutions are chosen from at random

    • Avoids incorrect confidence in result

    • Means alignments might not be reproducible

  • Alignments look quite different

    • Might look worse!

    • But gap patterns make sense

    • Gaps are good!


Comparing Alignments - Example sequences

www.ebi.ac.uk/~watson/africa

prot_MSA.fasta

Pages 67-74 in full booklet: Questions 29-30


Common problems with MSA

  • Input format

    • FASTA format

    • Unique sequence identifiers

    • Include sequence!

  • Job can’t be found

    • Interactive results deleted after 24hrs

    • Use email

    • Consider other tool
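
A small sketch of a pre-submission check for the input problems listed above (the function and messages are illustrative, not part of any EBI tool):

```python
def check_fasta(text):
    """Return a list of problems found in FASTA-formatted text (empty if none)."""
    records = []  # each entry is [identifier, sequence]
    problems = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            records.append([header[0] if header else "", ""])
        elif records:
            records[-1][1] += line
        else:
            problems.append("sequence data before the first '>' header")
    seen = set()
    for identifier, sequence in records:
        if not identifier:
            problems.append("header with no identifier")
        elif identifier in seen:
            problems.append(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if not sequence:
            problems.append(f"no sequence for: {identifier}")
    if not records:
        problems.append("no FASTA records found")
    return problems


print(check_fasta(">seq1\nACGT\n>seq1\nGGCC\n>seq2\n"))
```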


Common misuses of MSA

  • Performing a sequence assembly

    • Specialist type of MSA

    • Use other tools (Staden etc.)

  • Aligning ESTs to a reference genome

    • Use EST2Genome

  • Designing primers

    • Use primer tools (primer3 etc.)

  • Aligning two sequences

    • Use a pairwise alignment tool!


Putting it all together

EB-Eye search

Sequence retrieval

Sequence search

Sequences retrieval

Multiple sequence alignment

Analysis


Final remarks

  • Don’t assume a single tool will cater for all your needs

  • DO change the parameters of the tools

  • Remember where the tool excels and what its limitations are

  • A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

  • Crazy input will always give crazy results!


Getting Help

  • Database documentation

  • Frequently Asked Questions

    • http://www.ebi.ac.uk/help/faq.html

  • 2can Support Portal

    • http://www.ebi.ac.uk/2can/

  • EBI Support

    • http://www.ebi.ac.uk/support/

  • Hands-on training programme

    • http://www.ebi.ac.uk/training/handson/


Thanks!

Hamish McWilliam and Andrew Cowley

Vicky Schneider

Rodrigo Lopez

EMBL-EBI

SLING

You!
