CSE182-L18

CSE182-L18 Population Genetics

Perfect Phylogeny • Assume an evolutionary model in which no recombination takes place, only mutation. • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. • How can one reconstruct such a tree?

The 4-gamete condition i A 0 B 0 C 0 D 1 E 1 F 1 • A column i partitions the set of species into two sets i0, and i1 • A column is homogeneous w.r.t a set of species, if it has the same value for all species. Otherwise, it is heterogenous. • EX: i is heterogenous w.r.t {A,D,E} i0 i1

4 Gamete Condition • 4 Gamete Condition • There exists a perfect phylogeny if and only if for all pair of columns (i,j), either j is not heterogenous w.r.t i0, or i1. • Equivalent to • There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the following 4 rows do not exist (0,0), (0,1), (1,0), (1,1)

i i0 i1 4-gamete condition: proof • Depending on which edge the mutation j occurs, either i0, or i1 should be homogenous. • (only if) Every perfect phylogeny satisfies the 4-gamete condition • (if) If the 4-gamete condition is satisfied, does a prefect phylogeny exist?

An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later. • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop if all columns have been considered.

Inclusion Property • For any pair of columns i,j • i < j if and only if i1 j1 • Note that if i<j then the edge containing i is an ancestor of the edge containing i i j

r A B C D E Example 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 Initially, there is a single clade r, and each node has r as its parent

Sort columns • Sort columns according to the inclusion property (note that the columns are already sorted here). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0

Add first column 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 • In adding column i • Check each edge and decide which side you belong. • Finally add a node if you can resolve a clade r u B D A C E

Adding other columns 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 • Add other columns on edges using the ordering property r 1 3 E 2 B 5 4 D A C

Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case

Summary :No recombination leads to correlation between sites A B 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 5 8 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. • The history of a population can be expressed as a tree. • The tree can be constructed efficiently

Recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to violation of 4 gamete property. • Recombination leads to loss of correlation between columns 00000000 11111111 00011111

Studying recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to loss of correlation between columns • How can we measure recombination?

Linkage (Dis)-equilibrium (LD) A B 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 0 A B 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 Extensive Recombination • Pr[A,B=(0,1)=0.125 • Linkage equilibrium • No recombination • Pr[A,B=0,1] = 0.25 • Linkage disequilibrium

Measuring LD • Consider two bi-allelic sites A and B, with values 0 and 1. • Let p1 = probability[individual has allele 1 in site A] • q1 = probability[individual has allele 1 in site B] • P11 = Prob [individual has allele 1 in site A, and B] • Linkage Disequilibrium, D = |P11-p1q1| = |P01-p0q1| =…. • If D=0, sites are uncorrelated, (are in linkage equilibrium) • If |D| >>0, sites are highly correlated (have high LD) • Other measures exist, but they all measure similar quantities.

LD can be used to map disease genes LD • LD decays with distance from the disease allele. • By plotting LD, one can short list the region containing the disease gene. D N N D D N 0 1 1 0 0 1

Population sub-structure can cause problems in disease gene mapping

Pop. A 0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 p1=0.1 q1=0.9 P11=0.1 D=0.01 Pop. B 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0 p1=0.9 q1=0.1 P11=0.1 D=0.01 Population sub-structure can increase LD • Consider two populations that were isolated and evolving independently. • They might have different allele frequencies in some regions. • Pick two regions that are far apart (LD is very low, close to 0)

Recent ad-mixing of population • If the populations came together recently (Ex: African and European population), artificial LD might be created. • D = 0.15 (instead of 0.01), increases 10-fold • This spurious LD might lead one false associations • Other genetic events can cause LD to arise, and one needs to be careful 0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 Pop. A+B p1=0.5 q1=0.5 P11=0.1 D=0.1-0.25=0.15 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0

Determining population sub-structure • Given a mix of people, can you sub-divide them into ethnic populations. • Turn the ‘problem’ of spurious LD into a clue. • Find markers that are too far apart to show LD • If they do show LD (correlation), that shows the existence of multiple populations. • Sub-divide them into populations so that LD disappears.

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0 Determining Population sub-structure • Same example as before: • The two markers are too similar to show any LD, yet they do show LD. • However, if you split them so that all 0..1 are in one population and all 1..0 are in another, LD disappears

Iterative Algorithm for Population Substructure • Assume that there are 2 sub-populations • Randomly partition the individuals into two. • Select an individual, and compute the probabilities Pr(x|A), and Pr (x|B) • Assign the individual to A with probability • Pr(x|A)/ (Pr(x|A)+Pr(x|B)) • Continue.

Iterative algorithm for population sub-structure • Define • N = number of individuals (each has a single chromosome) • k = number of sub-populations. • Z  {1..k}N is a vector giving the sub-population. • Zi=k’ => individual i is assigned to population k’ • Xi,j = allelic value for individual i in position j • Pk,j,l = frequency of allele l at position j in population k

Example • Ex: consider the following assignment • P1,1,0 = 0.9 • P2,1,0 = 0.1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0

Goal • X is known. • P, Z are unknown. • The goal is to estimate Pr(P,Z|X) • Various learning techniques can be employed. • Here a Bayesian (MCMC) scheme is employed. We will only consider a simplified version

Algorithm:Structure • Iteratively estimate • (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m)) • After ‘convergence’, Z(m) is the answer. • Iteration • Guess Z(0) • For m = 1,2,.. • Sample P(m) from Pr(P | X, Z(m-1)) • Sample Z(m) from Pr(P | X, P(m-1)) • How is this sampling done?

Example • Choose Z at random, so each individual is assigned to be in one of 2 populations. See example. • Now, we need to sample P(1) from Pr(P | X, Z(0)) • Simply count • Nk,j,l = number of people in pouplation k which have allele l in position j • pk,j,l = Nk,j,l / N 1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1 0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0

Example 1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1 0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 • Nk,j,l = number of people in population k which have allele l in position j • pk,j,l = Nk,j,l / N1,1,* • N1,1,0 = 4 • N1,1,1 = 6 • p1,1,0 = 4/10 • p1,2,0 = 4/10 • Thus, we can sample P(m) 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0

Sampling Z • Pr[Z1 = 1] = Pr[”01” belongs to population 1]? • We know that each position should be in linkage equilibrium and independent. • Pr[”01” |Population 1] = p1,1,0 * p1,2,1 =(4/10)*(6/10)=(0.24) • Pr[”01” |Population 2] = p2,1,0 * p2,2,1 = (6/10)*(4/10)=0.24 • Pr [Z1 = 1] = 0.24/(0.24+0.24) = 0.5

Sampling 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 2 1 2 2 1 0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 0 .. 1 • Suppose, during the iteration, there is a bias. • Then, in the next step of sampling Z, we will do the right thing • Pr[“01”| pop. 1] = p1,1,0 * p1,2,1 = 0.7*0.7 = 0.49 • Pr[“01”| pop. 2] = p2,1,0 * p2,2,1 =0.3*0.3 = 0.09 • Pr[Z1 = 1] = 0.49/(0.49+0.09) = 0.85 • Pr[Z6 = 1] = 0.49/(0.49+0.09) = 0.85 • Eventually all “01” will become 1 population, and all “10” will become a second population 1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0 1 .. 0

Population Structure • 377 locations (loci) were sampled in 1000 people from 52 populations. • 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003) Oceania Eurasia East Asia America Africa

Other topics Assembly Genomic Analysis/ Pop. Genetics Protein Sequence Analysis Sequence Analysis Gene Finding ncRNA

ncRNA gene finding • Gene is transcribed but not translated. • What are the clues to non-coding genes? • Look for signals selecting start of transcription and translation. Non coding genes are transcribed by Pol III • Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure • Structure: Given a sequence, what is the structure into which it can fold with minimum energy?

tRNA structure

RNA structure: Basics • Key: RNA is single-stranded. Think of a string over 4 letters, AC,G, and U. • The complementary bases form pairs. • Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

RNA structure: pseudoknots Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots

A Static picture of the cell is insufficient • Each Cell is continuously active, • Genes are being transcribed into RNA • RNA is translated into proteins • Proteins are PT modified and transported • Proteins perform various cellular functions • Can we probe the Cell dynamically Pathways Gene Regulation Proteomic profiling Transcript profiling

Gene expression • The expression of transcripts and protein in the cell is not static. It changes in response to signals. • The expression can be measured using micro-arrays. • What causes the change in expression?

Transcriptional machinery • DNA polymerase (II) scans the genome, initiating transcription, and terminating it. • The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity

TF binding Transcription factors • Other transcription factors interact with the core machinery and upstream DNA to provide specificity. • TFs bind to TF binding sites which are clustered in upstream enhancer and promoter elements. • The enhancer elements may be located many kb upstream of the core-promoter Upstream elements

TF binding sites • TF binding sites are weak signal (about 10 bp with 5bp conserved) • If two genes are co-regulated, they are likely to share binding sites • Discovery of binding site motifs is an important research problem. TCAGGAG g1 TGAGGAG g2 g3 TCAGGTG g4 TGAGGTG g5 TCAGGTG

http://www.gene-regulation.com/pub/databases.html#transfac

Discovering TF binding sites • Identification of these TF binding sites/switches is critical. • Requires identification of co-regulated genes (genes containing the same set of switches). • How do we find co-regulated genes?

Idea1: Use orthologous genes from different species The species are too close (EX: humans and chimps). Binding & non-binding sites are both conserved. The species are distant. Binding sites are conserved but not other sequence. The species are very distant. Even binding sites are not conerved. The genes have alternative regulators. ACGGCAGCTCGCCGCCGCGC ||||| || ||||||| || ACGGC-GGGCGCCGCCCCGC ACGGCAGCTCGCCGCCGC-C | || | ||||||| | AGTGC-GGGCGCCGCCTCAT ACGGC-GC-TCGCCGCCGCGC | | | || | | AT-ACGAAGTAGCGG-ATGGT

Idea2: Microarray • Expression level of all genes

Pathways • Proteins interact to transduce signal, catalyze reactions, etc. • The interactions can be captured in a database. • Queries on this database are about looking for interesting sub-graphs in a large graph.

Summary • Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data. • While database technology (XML, Relational OO databases, text formats) is used to store this data, its use is (often) transparent for Bioinformatics people. • In this course, we looked at various data-streams, and pointed to databases that store these data-streams • Nucleic Acids Research brings out a database issue every January 2005 issue

CSE182-L18

CSE182-L18

Presentation Transcript

L18

CSE182-L10

CSE182-L16

CSE182-L12

CSE182-L11

CSE182-L12

CSE182-L9

CSE182-L6

CSE182-L12

CSE182-L9

CSE182-L7

CSE182-L11

CSE182-L10

CSE182-L13