Large scale genomes comparisons Bioinformatics aspects (Introduction)

Fredj Tekaia, Institut Pasteur, tekaia@pasteur.fr. EMBO Bioinformatic and Comparative Genome Analysis Course, Institut Pasteur, Paris, June 27 - July 9, 2011

Starting from genomes (whole sequence, whole gene sequences or whole protein sequences of given species), what do large-scale genome comparisons include?

Large-scale genome comparisons: Comparing a genome (in terms of whole sequence, whole set of predicted genes or whole set of predicted proteins) to itself (intra-species comparisons) or to another genome (inter-species comparisons).

Large-scale genome comparisons
• Duplication;
• Conservation;
• Specificity (species-specific genes, proteins);
• Paralogues, orthologues;
• Families (clusters) of paralogues, of orthologues;
• Genome organisation (duplicated, conserved genes);
• Search for shared motifs in proteins of the same cluster;
• Protein conservation profiles;
• Selection pressure analyses (synonymous, non-synonymous substitutions, ...).

Evolution

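Speciation - Duplication
[Figure: gene tree along a time axis. An ancestral gene G is duplicated into G1 and G2; a speciation then separates species A and B; a further duplication gives the leaves A-G1, A-G2, B-G1, B-G21 and B-G22. Relationships labelled on the leaves: orthologs, inparalogs, outparalogs.]
• Speciation
• Duplication
• Inparalogs
• Orthologs
• Outparalogs
• Loss of genes
Can these events be predicted by comparing genomes?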

Orthologs / Paralogs
• How to detect orthologous genes?
  - easy way: reciprocal best hit (RBH)
[Figure: genes 1a, 2.1a, 2.2a, 3a and 1b, 2.1b, 2.2b, 3b in Organism A and Organism B.]
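The reciprocal best hit criterion can be sketched in a few lines of Python. This is a minimal illustration, not the course's own tool: it assumes the similarity search results are already available as (query, subject, score) triples, and the gene names and scores below are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query_id, subject_id, score); higher score = better hit


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep, for each query, the subject with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Pairs (a, b) such that b is the best hit of a AND a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Hypothetical scores for genes of organism A searched against B, and B against A.
a_vs_b = [("1a", "1b", 250), ("2.1a", "2.1b", 300), ("2.1a", "2.2b", 180),
          ("2.2a", "2.2b", 280), ("2.2a", "2.1b", 170), ("3a", "3b", 90),
          ("4a", "3b", 60)]
b_vs_a = [("1b", "1a", 250), ("2.1b", "2.1a", 300), ("2.1b", "2.2a", 170),
          ("2.2b", "2.2a", 280), ("2.2b", "2.1a", 180), ("3b", "3a", 90),
          ("3b", "4a", 60)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(rbh))
```

In this toy data set, gene 4a has 3b as its best hit, but 3b prefers 3a, so 4a is left without a putative ortholog.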

Evolutionary processes include
[Figure labels: Ancestor, Phylogeny*, Expansion*, genesis, duplication, species genome, HGT, selection*, Deletion*, Exchange*, loss.]
• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion.

[Figure: S. cerevisiae genome; colours reveal duplications. Kellis et al., Nature, 2004]

[Figure: duplication, speciation and deletion; current content of the 2 copies and reconstruction of the ancestral organization. Kellis et al., Nature, 2004]

[Figure: original version versus current version.] Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

Search for similarity

Methods:
• It is important to know how the algorithms that allow sequence comparisons work;
• There are many comparison methods;
• Among the most used:
  - BLAST
  - FASTA
  - Smith-Waterman algorithm (dynamic programming method)
  - HMM (Hidden Markov Model)

Sequence comparisons
V I T K L G T C V G S
V I T K L G T C V G S
V I S . . . T Q V G S
V . S K . G T Q V . S
• Identity
• Similarity
• Homology

Comparison of 2 sequences
• Aims at finding the optimal alignment: the one that best shows which regions are most similar and which are less similar.
• In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology.
• Need for a score that evaluates:
  - matches
  - mismatches
  - gaps
• and for a method that evaluates the numerous possible alignments.
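As an illustration of a score combined with a method that explores the possible alignments, here is a minimal sketch of the Smith-Waterman local alignment score computed by dynamic programming. The match, mismatch and gap values are arbitrary assumptions chosen for the example (a real protein comparison would use a substitution matrix and usually affine gap penalties), and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("AAGTCVGSAA", "WWGTCVGSWW"))  # shared segment GTCVGS
```

The table has (len(a)+1) x (len(b)+1) cells, so the cost grows with the product of the two sequence lengths, which is why heuristic methods such as BLAST and FASTA are preferred for database-scale searches.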

Identity
• Refers to the occurrence of identical nucleotides or amino acids at the same position in aligned sequences;
• Identity is objective and well defined;
• Identity can be quantified as a percentage, i.e. the number of identical matches divided by the length of the aligned region (times 100).

Similarity
• Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of «difference», with conservative substitutions assigned more favorable scores than non-conservative ones (substitution matrices).
• Given a number of parameters (alphabet, scoring matrix, filtering procedure, etc.), the similarity of an aligned region is defined by a score calculated on that region;
• The score depends on the chosen parameters;
• Unlike homology, similarity is a matter of degree: expressions like significant or weak similarity are often used.

Homology
• Sequence homology denotes common ancestry, which is reflected in sequence conservation;
• Homology can be inferred, under suitable conditions, from sequence similarity;
• The main objective of sequence similarity searches is to infer homology between sequences;
• Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like significant or weak homology are meaningless!).
Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.

Local Alignment Global Alignment

Compare one query sequence to a BLAST formatted database

Amino acid scoring schemes (substitution matrices)
• All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids (190 pairs of different residues plus 20 identical pairs). As a result, what a local alignment program produces depends strongly upon the scores it uses.
• Implicitly, a scheme may represent a particular theory of evolution;
• The choice of a matrix can strongly influence the outcome of an analysis.
• The scores in the matrix are integer values which assign a positive score to identical or similar character pairs, and a negative value to dissimilar character pairs.
$S_{ij} = \ln\left(q_{ij} / (p_i p_j)\right) / u$, where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $u$ is a statistical (scaling) parameter.
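The log-odds formula can be checked numerically with a small sketch. The frequencies below are made-up values chosen only to give round numbers, and the scale u = ln(2)/3 (third-bit units) is used as an example; none of these numbers come from a real matrix derivation.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, u: float) -> float:
    """S_ij = ln(q_ij / (p_i * p_j)) / u, before rounding to an integer."""
    return math.log(q_ij / (p_i * p_j)) / u


THIRD_BIT = math.log(2) / 3  # example scale, about 0.231049

# Hypothetical frequencies: the pair is observed 4 times more often than
# expected by chance (0.01 versus 0.05 * 0.05 = 0.0025).
favoured = log_odds_score(0.01, 0.05, 0.05, THIRD_BIT)
# Pair observed exactly as often as expected by chance.
neutral = log_odds_score(0.0025, 0.05, 0.05, THIRD_BIT)
# Pair observed half as often as expected by chance.
disfavoured = log_odds_score(0.00125, 0.05, 0.05, THIRD_BIT)

print(round(favoured), round(neutral), round(disfavoured))
```

A pair seen more often than chance gets a positive score, a pair seen at the chance level scores about zero, and a pair seen less often than chance gets a negative score, which is the sign convention described above.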

Examples of substitution matrices
# PAM250 substitution matrix, scale = ln(2)/3 = 0.231049
# Expected score = -0.844, Entropy = 0.354 bits
# Lowest score = -8, Highest score = 17
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1  -1   0  -1  -8
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   0   0   0  -8
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   0  -1   0  -8
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -5  -6  -4  -8
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -3  -4  -2  -8
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4  -2  -2  -1  -8
B   0  -1   2   3  -4   1   3   0   1  -2  -3   1  -2  -4  -1   0   0  -5  -3  -2   3   2  -1  -8
Z   0   0   1   3  -5   3   3   0   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3  -1  -8
X   0  -1   0  -1  -3  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1   0   0  -4  -2  -1  -1  -1  -1  -8
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1

• PAM matrices (Dayhoff et al. (1978))
PAM stands for “point accepted mutation”.
• 1 PAM corresponds to 1 amino acid change per 100 residues;
• 1 PAM ~ 1% divergence;
• Extrapolate to predict patterns at longer distances.
Assumptions:
• replacements are independent of surrounding residues;
• sequences being compared are of average composition;
• all sites are equally mutable.
Sources of error:
• small, globular proteins were used to derive PAM matrices (departure from average composition);
• errors in PAM1 are magnified up to PAM250;
• does not account for conserved blocks or motifs.
Strategy:
• PAM40: short alignments, highly similar;
• PAM120: average similarity;
• PAM250: longer, weaker local alignments.

• BLOSUM matrices (Henikoff, S., and Henikoff, J. G. (1992))
BlosumX denotes a matrix obtained from alignments of sequence segments clustered at more than X% identity.
Examples:
- Blosum62 is obtained from sequence segments clustered at more than 62% identity.
- Blosum80 is obtained from sequence segments clustered at more than 80% identity.
Which substitution matrix to choose?
Less divergent <------ searching ------> More divergent
Blosum80        Blosum62        Blosum45
PAM10           PAM120          PAM250

• Position Specific Scoring Matrix (PSSM)
• Conserved motifs are identified and an amino acid profile matrix is calculated for each motif.
• This matrix (n positions x 20 amino acids) represents the relative amino acid probabilities at specific positions and is characteristic of a protein family.
• Such matrices are used by profile database searching programs (including PSI-BLAST and HMM-based programs).
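To make the idea of an n x 20 profile concrete, here is a deliberately simplified sketch that builds a log-odds PSSM from a few ungapped motif instances. It is not the PSI-BLAST procedure: sequence weighting and data-dependent pseudocounts are omitted, a flat pseudocount and a uniform background (1/20 per amino acid) are assumed, and the motif instances are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(motifs: Sequence[str], pseudocount: float = 1.0,
               background: Optional[Dict[str, float]] = None) -> List[Dict[str, float]]:
    """One row per motif position; each row maps an amino acid to a log2-odds score."""
    if not motifs:
        raise ValueError("at least one motif instance is required")
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motif instances must have the same length")
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}  # assumption
    total = len(motifs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(m[pos] for m in motifs)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            row[aa] = math.log2(freq / background[aa])
        pssm.append(row)
    return pssm


def score_segment(pssm: List[Dict[str, float]], segment: str) -> float:
    """Sum of the position-specific scores of an ungapped segment."""
    return sum(row[aa] for row, aa in zip(pssm, segment))


motifs = ["GLKQ", "GLKQ", "GIKQ", "GLRQ"]  # hypothetical aligned motif instances
pssm = build_pssm(motifs)
print(round(score_segment(pssm, "GLKQ"), 2), round(score_segment(pssm, "WWWW"), 2))
```

Unlike a substitution matrix, the score of a residue here depends on its position in the motif: L is rewarded at position 2 but penalised at position 1.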

Example of a PSSM determined by the PSI-BLAST program:
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1 M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
  2 S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
  3 S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
  4 S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
  5 S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
  6 G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
  7 L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
  8 K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
  9 Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
 10 Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
 11 G   0  -2   0  -1  -2  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
 12 L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
 13 A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
 14 Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
 15 K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
 16 K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
 17 K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
 18 F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
 19 Q  -1   1   0   0  -3   5   3  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
 20 L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
 21 E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
 22 F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
 23 D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
.....................................................................
573 I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
574 P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
575 L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1

BLAST algorithm:
(1) Query sequence: build the list of high-scoring words of length w.
    - A query sequence of length L contains at most L-w+1 words (w = 3 for proteins, w = 11 for nucleotides).
    - List the words that score at least T using a substitution matrix (Blosum62 or PAM250, ...).
(2) Compare the word list to the database (DB) sequences and identify exact matches.
(3) For each word match, extend the alignment in both directions to find alignments with scores > S: High-scoring Segment Pairs (HSPs), also called Maximal Segment Pairs (MSPs).
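Steps (1) and (2) can be sketched as follows. This is an illustration of the seeding idea only, not the BLAST implementation: a toy scoring function (+5 for identical residues, -1 otherwise) stands in for Blosum62 or PAM250, the threshold values are arbitrary, and the extension step (3) is left out.

```python
from itertools import product
from typing import Dict, List, Tuple

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def toy_score(a: str, b: str) -> int:
    """Stand-in for a substitution matrix lookup (assumption, not Blosum62)."""
    return 5 if a == b else -1


def query_words(query: str, w: int = 3) -> List[Tuple[int, str]]:
    """All overlapping words of length w: at most L - w + 1 of them."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def neighborhood(word: str, threshold: int) -> List[str]:
    """Every word of the same length scoring at least `threshold` against `word`."""
    return ["".join(candidate)
            for candidate in product(AMINO_ACIDS, repeat=len(word))
            if sum(toy_score(x, y) for x, y in zip(word, candidate)) >= threshold]


def find_seeds(query: str, subject: str, w: int = 3,
               threshold: int = 9) -> List[Tuple[int, int]]:
    """(query_pos, subject_pos) pairs where a subject word exactly matches the word list."""
    index: Dict[str, List[int]] = {}
    for q_pos, word in query_words(query, w):
        for neighbor in neighborhood(word, threshold):
            index.setdefault(neighbor, []).append(q_pos)
    seeds = []
    for s_pos in range(len(subject) - w + 1):
        for q_pos in index.get(subject[s_pos:s_pos + w], []):
            seeds.append((q_pos, s_pos))
    return seeds


print(find_seeds("VITKLG", "AAVISKLG", threshold=15))  # only identical words
print(find_seeds("VITKLG", "AAVISKLG", threshold=9))   # one mismatch tolerated
```

Lowering T enlarges the word list and so increases sensitivity, at the price of more seeds to extend; this trade-off is the main lever on BLAST speed.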

Large-scale proteome comparisons

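The expected number of HSPs with score at least S is given by: $E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths, and K and $\lambda$ are statistical parameters that depend on the scoring system.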
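A numerical sketch of this expectation, assuming the standard Karlin-Altschul form E = K m n exp(-lambda S). The values of K and lambda below are illustrative placeholders, not parameters taken from these slides; the bit score, which folds K and lambda into a normalised score so that E = m n 2^(-S'), is included as a convenience.

```python
import math


def expected_hsps(score: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score: float, K: float, lam: float) -> float:
    """Normalised score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


K, LAM = 0.041, 0.267          # illustrative values only (assumption)
m, n = 300, 3_000_000          # hypothetical query length and database length

for raw in (40, 60, 80):
    e = expected_hsps(raw, m, n, K, LAM)
    print(raw, round(bit_score(raw, K, LAM), 1), f"{e:.2e}")
```

Two properties are visible directly from the formula: E falls exponentially as the score rises, and it grows linearly with the size of the search space m x n, so the same raw score is less significant in a larger database.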

Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons;
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-MCL);
• Families of orthologs (Partition-MCL);
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles.

Working Examples Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome

SC vs SC

- Paralogs - multiple matches - Partitions/clustering
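The step from multiple matches to partitions can be illustrated with the simplest possible clustering: grouping proteins into connected components of the graph of significant self-comparison hits (single linkage). This is a stand-in chosen for brevity, not the MCL algorithm, which additionally uses edge weights to split loosely connected groups; the protein names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_families(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group proteins linked, directly or transitively, by significant hits."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families: Dict[str, set] = {}
    for protein in parent:
        families.setdefault(find(protein), set()).add(protein)
    # Largest families first, then alphabetical, for a stable output.
    return sorted((sorted(f) for f in families.values()), key=lambda f: (-len(f), f))


# Hypothetical significant hits from a proteome compared with itself.
hits = [("p1", "p2"), ("p2", "p3"), ("p4", "p5"), ("p6", "p6")]
print(cluster_families(hits))
```

Here p1 and p3 end up in the same family although they do not match each other directly; this chaining effect, often caused by multi-domain proteins, is one reason a method such as MCL is preferred on real data.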

SC/CE CE/SC Reciprocal Best Hits (RBH)

segmatchSCCE

Conclusion: Large-scale analyses of completely sequenced genomes allow a systematic view of genes, of genome organization, and of their macro- as well as micro-evolution. They are the starting step for the further evolutionary analyses that will be dealt with during this course.

Practical sessions (see text)
