Large scale genomes comparisons Bioinformatics aspects (Introduction)

Fredj Tekaia, Institut Pasteur (tekaia@pasteur.fr). EMBO Bioinformatic and Comparative Genome Analysis Course, Stazione Zoologica Anton Dohrn, Naples, Italy, May 7 - 19, 2012

Large-scale genome comparisons: Comparing a genome (in terms of whole sequence, whole set of predicted genes or whole set of predicted proteins) to itself (intra-species comparisons) or to another genome (inter-species comparisons).

Large scale genome comparisons
• Duplication;
• Conservation;
• Specificity (species-specific genes, proteins);
• Paralogues, orthologues;
• Families (clusters) of paralogues, of orthologues;
• Genome organisation (duplicated, conserved genes);
• Search for shared motifs in proteins of the same cluster;
• Protein conservation profiles;
• Selection pressure analyses (synonymous, non-synonymous substitutions, ...).

Evolution

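Speciation - Duplication
[Figure: gene tree drawn along a time axis. An ancestral gene G duplicates into G1 and G2; a speciation then gives species A (genes A-G1, A-G2) and species B (genes B-G1, B-G21, B-G22, where B-G21 and B-G22 come from a later duplication). Labels mark orthologs, inparalogs and outparalogs.]
• Speciation • Duplication • Inparalogs • Orthologs • Outparalogs • Loss of genes
Predict these events by comparing genomes?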

Orthologs / Paralogs
• How to detect orthologous genes?
- easy way: reciprocal best hit (RBH)
[Figure: genes 1a, 1b, 2.1a, 2.1b, 2.2a, 2.2b, 3a, 3b in Organism A and Organism B.]
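The reciprocal-best-hit rule can be sketched in a few lines. This is a minimal illustration, not the course's own procedure: it assumes the similarity search results are already available as (query, subject, score) tuples, and the gene names and scores below are invented for the example.

```python
def best_hits(hits):
    """Map each query to its best-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples
    (a hypothetical, simplified stand-in for parsed search output).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented example data: organism A proteins searched against B, and B against A.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0),
          ("A2", "B2", 180.0), ("A3", "B2", 120.0)]
b_vs_a = [("B1", "A1", 248.0), ("B2", "A2", 181.0),
          ("B2", "A3", 119.0), ("B3", "A3", 60.0)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy data A3 finds B2 as its best hit, but B2 prefers A2, so A3 is not reported: the rule keeps only mutual best matches and says nothing about additional paralogous copies.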

Evolutionary processes include
[Figure labels: ancestor, phylogeny*, expansion* (genesis, duplication, HGT), exchange* (HGT), deletion* (loss, selection*), species genome.]
• Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion.

S. cerevisiae genome: colours reveal duplications. Kellis et al. Nature, 2004

[Figure: duplication, speciation and deletion; present-day content of the 2 copies; reconstruction of the ancestral organization.] Kellis et al. Nature, 2004

[Figure: original version and current version.] Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

Genome duplication. a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on whether their Ks value is below or above 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes. Jaillon et al. Nature 431, 946-957. 2004.

Search for similarity

Methods:
• It is important to know how the algorithms that allow sequence comparisons work.
• There are many comparison methods.
• Among the most used:
  - BLAST
  - FASTA
  - Smith-Waterman algorithm (dynamic programming method)
  - HMM (Hidden Markov Model)

Sequence comparisons
[Figure: example alignments of the peptide VITKLGTCVGS with variant sequences, illustrating identical positions, similar positions and gaps.]
• Identity • Similarity • Homology

Comparison of 2 sequences
• Aims at finding the optimal alignment: the one that shows the most similar regions and the regions that are less similar.
• In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology.
• Need for a score that evaluates:
  - matches
  - mismatches
  - gaps
• and for a method that evaluates the numerous possible alignments.
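As a small illustration of such a score, the sketch below evaluates one given alignment by counting matches, mismatches and gaps. The scoring values (+1, -1, -2) and the second sequence are assumptions chosen for the example; real protein comparisons would use a substitution matrix rather than a flat match/mismatch score.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap.

    The default scoring values are illustrative assumptions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "score": matches * match + mismatches * mismatch + gaps * gap,
        # Percent identity over all alignment columns, gaps included.
        "identity": 100.0 * matches / len(row_a),
    }


# Hypothetical aligned pair: 8 identical columns, 2 mismatches, 1 gap.
result = score_alignment("VITKLGTCVGS", "VISK-GTQVGS")
print(result)
```

Identity is computed here over all alignment columns; other conventions (for example dividing by the shorter sequence length) give different percentages, so the denominator should always be stated.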

Homology
• Sequence homology reflects common ancestry and underlies sequence conservation.
• Homology can be inferred, under suitable conditions, from sequence similarity.
• The main objective of sequence similarity searches is to infer homology between sequences.
• Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like "significant homology" or "weak homology" are meaningless!). Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.

Local Alignment Global Alignment
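The difference between the two alignment modes can be sketched with one dynamic programming routine. This is a simplified illustration under stated assumptions: a flat match/mismatch score and a linear gap penalty (values invented for the example), returning only the optimal score and not the alignment itself. The global mode follows the Needleman-Wunsch recurrence and the local mode the Smith-Waterman recurrence.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of sequences a and b by dynamic programming.

    local=False: global alignment (every residue of both sequences is aligned).
    local=True:  local alignment (best-scoring pair of subsequences).
    Scoring values are illustrative assumptions.
    """
    cols = len(b) + 1
    # First row: all zeros for local mode, accumulated gap penalties otherwise.
    prev = [0] * cols if local else [j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr[j] = cell
        prev = curr
    return best if local else prev[-1]


# Two invented sequences sharing the segment GTCVGS at different positions.
seq_a = "AAAGTCVGS"
seq_b = "GTCVGSPPP"
local_score = alignment_score(seq_a, seq_b, local=True)
global_score = alignment_score(seq_a, seq_b)
print(local_score, global_score)
```

The local score rewards only the shared segment, whereas the global score also pays for the unrelated flanking residues, which is why local alignment is the usual choice for database searches.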

Compare one query sequence to a BLAST formatted database

Amino acid scoring schemes (substitution matrices)
• All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids (20 identical pairs plus 190 distinct unordered pairs). As a result, what a local alignment program produces depends strongly upon the scores it uses.
• Implicitly, a scheme may represent a particular theory of evolution.
• The choice of a matrix can strongly influence the outcome of an analysis.
• The scores in the matrix are integer values that assign a positive score to identical or similar character pairs and a negative score to dissimilar character pairs:
$S_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$
where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $\lambda$ is a statistical parameter.
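The pair count and the log-odds formula can be checked numerically. The sketch below is illustrative only: the frequencies are invented round numbers, not values from any published matrix, and the statistical scaling parameter of the formula is called `lam` in the code. Setting it to ln(2)/2 expresses scores in half-bit units.

```python
import math
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Unordered pairs of the 20 amino acids, identical pairs included.
n_pairs = sum(1 for _ in combinations_with_replacement(AMINO_ACIDS, 2))


def log_odds_score(q_ij, p_i, p_j, lam):
    """Log-odds score: ln(q_ij / (p_i * p_j)) / lam."""
    return math.log(q_ij / (p_i * p_j)) / lam


HALF_BIT = math.log(2) / 2  # scaling that yields scores in 1/2 bit units

# Hypothetical frequencies: the pair is seen 8 times more often than
# expected by chance, i.e. 3 bits = 6 half-bits.
favoured = log_odds_score(q_ij=0.02, p_i=0.05, p_j=0.05, lam=HALF_BIT)

# Hypothetical pair seen half as often as expected: -1 bit = -2 half-bits.
disfavoured = log_odds_score(q_ij=0.00125, p_i=0.05, p_j=0.05, lam=HALF_BIT)

print(n_pairs, round(favoured), round(disfavoured))
```

Pairs observed more often than chance predicts receive positive scores and pairs observed less often receive negative scores; rounding to integers gives the kind of entries seen in published matrices.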

BLOSUM62 Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62
# Lowest score = -4, Highest score = 11
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

• BLOSUM matrices (Henikoff, S. and Henikoff, J.G., 1992)
BlosumX denotes a matrix obtained from alignments of clustered sequence segments with more than X% identity.
Examples:
- Blosum62 is obtained from clustered sequences with identity greater than 62%.
- Blosum80 is obtained from clustered sequences with identity greater than 80%.
Which substitution matrix to choose?
Less divergent <------ searching ------> More divergent
Blosum80          Blosum62          Blosum45
PAM10             PAM120            PAM250

Blast algorithm:
(1) Query sequence of length L: list the words of length w (maximum of L-w+1 words; w = 3 for proteins or 11 for nucleotides) and keep the high-scoring words, i.e. those that score at least T using a substitution matrix (Blosum62 or PAM250, ...).
(2) Compare the word list to the database (DB) sequences and identify exact matches.
(3) For each word match, extend the alignment in both directions to find alignments with scores > S: Maximal Segment Pairs (MSPs), i.e. High-scoring Segment Pairs (HSPs).
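The three steps can be illustrated with a deliberately simplified seed-and-extend sketch. It is not the BLAST implementation: it uses only the exact words of the query (no neighbourhood words scored against a threshold T), a flat +1/-1 score instead of a substitution matrix, ungapped extension, and a simple drop-off rule (`x_drop`) to stop extending. All function names, parameters and sequences are assumptions made for the example.

```python
def query_words(query, w):
    """Step 1 (simplified): all L-w+1 overlapping words of length w."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def extend_hit(query, subject, q_pos, s_pos, w, match=1, mismatch=-1, x_drop=2):
    """Step 3 (simplified): ungapped extension of an exact word match.

    Extension in a direction stops once the running score has fallen
    more than x_drop below the best score seen so far.
    Returns (best_score, q_start, q_end) with q_end exclusive.
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    score = best = w * match
    q_start, q_end = q_pos, q_pos + w
    i, j = q_pos + w, s_pos + w                      # extend to the right
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += pair(i, j)
        i, j = i + 1, j + 1
        if score > best:
            best, q_end = score, i
    score = best
    i, j = q_pos - 1, s_pos - 1                      # extend to the left
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += pair(i, j)
        if score > best:
            best, q_start = score, i
        i, j = i - 1, j - 1
    return best, q_start, q_end


def search(query, database, w=3, min_score=5):
    """Step 2 plus step 3: find exact word matches, extend them, keep HSPs."""
    index = {}
    for pos, word in query_words(query, w):
        index.setdefault(word, []).append(pos)
    hsps = set()
    for name, subject in database.items():
        for s_pos in range(len(subject) - w + 1):
            for q_pos in index.get(subject[s_pos:s_pos + w], []):
                score, q_start, q_end = extend_hit(query, subject, q_pos, s_pos, w)
                if score >= min_score:
                    hsps.add((name, score, query[q_start:q_end]))
    return sorted(hsps)


# Invented query and database.
query = "VITKLGTCVGS"
database = {"seq1": "MMVITKLGTQVGSAA", "seq2": "PPPPPPPP"}
hits = search(query, database)
print(len(query_words(query, 3)), hits)
```

Several seeds on the same diagonal extend to the same segment pair, which is why the results are collected in a set before being reported.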

Large-scale proteome comparisons

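The expected number of HSPs with score at least S is given by: $E = K m n e^{-\lambda S}$, where m and n are the query sequence and database lengths, and K and $\lambda$ are statistical parameters of the scoring system.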
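A short numerical sketch shows how this expectation behaves, reading the formula in its usual Karlin-Altschul form E = K m n exp(-lambda S). The values of `K` and `lam` below are illustrative placeholders, not parameters taken from the slides, and the sequence and database lengths are invented.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


# Illustrative placeholder parameters (assumptions for this example only).
K, lam = 0.134, 0.318
m = 300            # hypothetical query length
n = 3_000_000      # hypothetical database length

e_small_db = expected_hsps(60, m, n, K, lam)
e_double_db = expected_hsps(60, m, 2 * n, K, lam)
e_higher_score = expected_hsps(80, m, n, K, lam)

print(f"{e_small_db:.3g} {e_double_db:.3g} {e_higher_score:.3g}")
```

E grows linearly with the size of the search space (doubling the database doubles E) and decreases exponentially with the score, so the same raw score is less significant in a larger database.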

Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons;
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-mcl);
• Families of orthologs (Partition-mcl);
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles.

Working Examples Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome

SC vs SC

- Paralogs - multiple matches - Partitions/clustering
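Grouping the matches of a genome against itself into families can be sketched as a graph partition. The example below uses plain single-linkage clustering (connected components found with union-find), which is a much simpler technique than the MCL procedure used in the course and tends to merge families more aggressively. The score threshold, protein names and scores are invented for the illustration.

```python
def cluster_paralogs(hits, min_score=50.0):
    """Group proteins into families by single linkage.

    `hits` is an iterable of (protein_a, protein_b, score) tuples from a
    self-comparison; two proteins are linked if their score >= min_score.
    Returns families with at least two members, sorted for stable output.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b, score in hits:
        root_a, root_b = find(a), find(b)   # registers both proteins
        if a != b and score >= min_score:
            parent[root_a] = root_b         # merge the two groups

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), []).append(protein)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Invented self-comparison hits (self-hits and a weak hit included).
sc_vs_sc = [
    ("YA", "YA", 500.0),
    ("YA", "YB", 120.0),
    ("YB", "YC", 80.0),
    ("YD", "YE", 200.0),
    ("YF", "YA", 20.0),
]

families = cluster_paralogs(sc_vs_sc)
print(families)
```

YA and YC end up in the same family although they have no direct hit, because both match YB: this chaining through multiple matches is the main weakness of single linkage and one reason to prefer a dedicated clustering method.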

SC/CE CE/SC Reciprocal Best Hits (RBH)

segmatchSCCE

Conclusion
Large-scale analyses of completely sequenced genomes allow a systematic view of genes and genome organization, and of their macro- as well as micro-evolution. They are the starting step for the sophisticated evolutionary analyses that will be dealt with during this course.

Practical sessions (see text)
