
CSE182-L17

CSE182-L17. Clustering Population Genetics: Basics. Clusters. Unsupervised Clustering. Given a set of points (in n-dimensions), and k, compute the k “best clusters”. In k-means, clustering is done by choosing k centers (means). Each point is assigned to the closest center.

mchappell

Presentation Transcript


  1. CSE182-L17 Clustering Population Genetics: Basics

  2. Clusters Unsupervised Clustering • Given a set of points (in n-dimensions), and k, compute the k “best clusters”. • In k-means, clustering is done by choosing k centers (means). • Each point is assigned to the closest center. • The notion of “best” is defined by distances to the center. • Question: How can we compute the k best centers?

  3. Distance • Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point of X. • Given a set of n data points V = {v_1, …, v_n} and a set of k points X, define the Squared Error Distortion $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
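
The two definitions on this slide translate almost directly into code. The following is a minimal sketch, assuming points are tuples of floats; the function names `dist_to_set` and `distortion` and the sample data are illustrative and not part of the lecture.

```python
import math

def dist_to_set(v, centers):
    """d(v, X): Euclidean distance from v to the closest point of X."""
    return min(math.dist(v, x) for x in centers)

def distortion(points, centers):
    """d(V, X): mean of the squared distances d(v_i, X)^2 over all n points."""
    return sum(dist_to_set(v, centers) ** 2 for v in points) / len(points)

# Illustrative data (an assumption): every point lies at distance 1 from its nearest center.
V = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
X = [(0.0, 1.0), (10.0, 1.0)]
print(distortion(V, X))
```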

  4. K-Means Clustering Problem: Formulation • Input: A set, V, consisting of n points and a parameter k • Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X. This optimization problem is NP-hard in general.

  5. 1-Means Clustering Problem: an Easy Case • Input: A set, V, consisting of n points. • Output: A single point X that minimizes d(V,X) over all possible choices of X. This problem is easy. However, it becomes very difficult for more than one center. An efficient heuristic method for k-Means clustering is the Lloyd algorithm
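
Why the single-center case is easy can be seen from a short derivation, sketched here under the squared error distortion defined on the Distance slide. The symbols f and x are introduced only for this derivation.

```latex
f(x) = \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2,
\qquad
\nabla f(x) = -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\;\Longrightarrow\;
x = \frac{1}{n} \sum_{i=1}^{n} v_i .
```

Since f is convex in x, this stationary point is the minimum: the best single center is the mean (centroid) of the points.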

  6. K-means: Lloyd’s algorithm • Choose k centers at random: • X’ = {x1, x2, x3, …, xk} • Repeat • X = X’ • Assign each v ∈ V to the closest cluster j: • d(v, xj) = d(v, X) ⇒ Cj ← Cj ∪ {v} • Recompute X’: • x’j ← (∑_{v ∈ Cj} v) / |Cj| • until (X’ = X)
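
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes the initial centers are k distinct data points chosen at random, keeps the old center when a cluster becomes empty, and adds an iteration cap. The names `lloyd`, `seed`, and `max_iter` are illustrative.

```python
import math
import random

def lloyd(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in points:
            j = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[j].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[j])  # assumption: an empty cluster keeps its center
        if new_centers == centers:  # until (X' = X)
            break
        centers = new_centers
    return centers, clusters

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers, clusters = lloyd(points, k=2)
print(sorted(centers))
```

Lloyd's algorithm only finds a local optimum, so on less well-separated data the result can depend on the random initial centers.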

  7-10. (Figures: four successive snapshots of the clustering, with the centers labeled x1, x2, x3.)

  11. Conservative K-Means Algorithm • The Lloyd algorithm is fast, but in each iteration it moves many data points, which does not necessarily lead to better convergence. • A more conservative method is to move one point at a time, and only if the move improves the overall clustering cost. • The smaller the clustering cost of a partition of the data points, the better the clustering. • Different measures can be used for the clustering cost (for example, the previous algorithm used the squared error distortion).
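
A minimal sketch of the one-point-at-a-time idea follows. It assumes the clustering cost is the squared error distortion with each cluster's mean as its center, and it accepts the first improving move it finds; the lecture does not specify the order in which moves are tried, so that part is an assumption, as are the names `cost` and `conservative_kmeans`.

```python
import math

def cost(clusters):
    """Squared error distortion of a partition; each cluster's center is its mean."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        if not c:
            continue
        center = tuple(sum(xs) / len(c) for xs in zip(*c))
        total += sum(math.dist(p, center) ** 2 for p in c)
    return total / n

def conservative_kmeans(clusters):
    """Repeatedly move a single point to another cluster if that strictly lowers the cost."""
    clusters = [list(c) for c in clusters]
    best = cost(clusters)
    improved = True
    while improved:
        improved = False
        for i, src in enumerate(clusters):
            for p in list(src):  # iterate over a snapshot, since src changes below
                for j, dst in enumerate(clusters):
                    if i == j:
                        continue
                    src.remove(p)
                    dst.append(p)
                    new_cost = cost(clusters)
                    if new_cost < best - 1e-12:
                        best, improved = new_cost, True
                        break  # keep this move
                    dst.pop()  # otherwise undo it
                    src.append(p)
    return clusters, best

# Illustrative start (an assumption): a deliberately poor partition of four 1-D points.
result, final_cost = conservative_kmeans([[(0.0,), (10.0,)], [(1.0,), (11.0,)]])
print(result, final_cost)
```

Each accepted move strictly lowers the cost and there are finitely many partitions, so the loop terminates.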

  12. Microarray summary • Microarrays (like MS) are a technology for probing the dynamic state of the cell. • We answered questions like the following: • Which genes are coordinately regulated (that is, have similar expression patterns in different conditions)? • How can we reduce the dimensionality of the system? • Using gene expression values from a sample, can you predict whether the sample is normal (state A) or diseased (state B)? • The techniques employed for classification, clustering, etc. are general and can be applied in a number of contexts.

  13. Microarray non-summary • We did not cover: • How are the gene expression values measured (the technology)? (CSE183) • How do you control variability across different experiments (normalization)? (CSE183) • What controls the expression of a gene (gene regulation), or a set of genes? (CSE 181)

  14. Population Genetics • The sequence of an individual does not say anything about the diversity of a population. • Small individual genetic differences can have a profound impact on “phenotypes” • Response to drugs • Susceptibility to diseases • Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge.

  15. Population Structure • 377 locations (loci) were sampled in 1000 people from 52 populations. • 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions (Rosenberg et al., Science 2002). (Figure labels: Oceania, Eurasia, East Asia, America, Africa.)

  16. Population Genetics • What is it about our genetic makeup that makes us measurably different? • These genetic differences are correlated with phenotypic differences • With cost reduction in sequencing and genotyping technologies, we will know the sequence for entire populations of individuals. • Here, we will study the basics of this polymorphism data, and tools that are being developed to analyze it.

  17. What causes variation in a population? • Mutations (may lead to SNPs) • Recombinations • Other genetic events (Ex: microsatellite repeats) • Deletions, inversions

  18. Single Nucleotide Polymorphisms • Infinite Sites Assumption: Each site mutates at most once • Example sequences (one row each): 00000101011, 10001101001, 01000101010, 01000000011, 00011110000, 00101100110

  19. Short Tandem Repeats • Sequences with their repeat counts: GCTAGATCATCATCATCATTGCTAG (4), GCTAGATCATCATCATTGCTAGTTA (3), GCTAGATCATCATCATCATCATTGC (5), GCTAGATCATCATCATTGCTAGTTA (3), GCTAGATCATCATCATTGCTAGTTA (3), GCTAGATCATCATCATCATCATTGC (5)
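
The repeat counts on this slide can be reproduced by counting consecutive copies of the repeat unit. The sketch below assumes the unit is `ATC`. The slide does not name the unit, and `TCA` or `CAT` would give the same counts here. The function name `max_tandem_repeats` is illustrative.

```python
import re

def max_tandem_repeats(seq, unit):
    """Number of copies in the longest run of consecutive copies of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

seqs = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
counts = [max_tandem_repeats(s, "ATC") for s in seqs]
print(counts)
```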

  20. STR can be used as a DNA fingerprint • Consider a collection of regions with variable length repeats. • Variable length repeats will lead to variable length DNA. • The vector of lengths is a fingerprint. (Figure: a matrix of repeat counts with axes labeled individuals and positions; entries 4 2 3 3 5 1 3 2 3 1 5 3.)

  21. Recombination • (Figure: the sequences 00000000 and 11111111 recombine to give 00011111.)

  22. What if there were no recombinations? • Life would be simpler • Each sequence would have a single parent • The relationship is expressed as a tree.

  23. The Infinite Sites Assumption • (Figure: a tree of sequences. The root 00000000 acquires a mutation at site 3, giving 00100000; one descendant then mutates at site 5, giving 00101000, and another mutates at site 8, giving 00100001.) • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. • Some phenotypes could be linked to the polymorphisms • Some of the linkage is “destroyed” by recombination

  24. Infinite sites assumption and Perfect Phylogeny • Each site is mutated at most once in the history. • All descendants must carry the mutated value, and all others must carry the ancestral value. (Figure: the mutation at site i sits on one edge of the tree; sequences below that edge have 1 in position i, all others have 0 in position i.)

  25. Perfect Phylogeny • Assume an evolutionary model in which no recombination takes place, only mutation. • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. • How can one reconstruct such a tree?

  26. The 4-gamete condition • Example column i: A 0, B 0, C 0, D 1, E 1, F 1, so i0 = {A, B, C} and i1 = {D, E, F}. • A column i partitions the set of species into two sets, i0 and i1. • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. • Ex: i is heterogeneous w.r.t. {A, D, E}.

  27. 4 Gamete Condition • There exists a perfect phylogeny if and only if, for all pairs of columns (i, j), j is homogeneous w.r.t. i0 or w.r.t. i1. • Equivalently: there exists a perfect phylogeny if and only if, for all pairs of columns (i, j), the four rows (0,0), (0,1), (1,0), (1,1) do not all occur.
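
The second form of the condition is easy to check mechanically. The following sketch assumes the data are given as equal-length strings of 0s and 1s, one per species; the function name `four_gamete_violations` and the two small inputs are illustrative.

```python
from itertools import combinations

def four_gamete_violations(rows):
    """Return the column pairs (i, j), 0-based, in which all of 00, 01, 10, 11 occur."""
    m = len(rows[0])
    bad = []
    for i, j in combinations(range(m), 2):
        gametes = {(r[i], r[j]) for r in rows}
        if len(gametes) == 4:
            bad.append((i, j))
    return bad

print(four_gamete_violations(["00", "01", "10", "11"]))  # all four gametes occur in the single pair
print(four_gamete_violations(["000", "010", "110"]))     # no pair shows all four
```

An empty result means the condition holds for every pair of columns.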

  28. 4-gamete condition: proof • Depending on which edge the mutation j occurs on, j must be homogeneous w.r.t. either i0 or i1. • (only if) Every perfect phylogeny satisfies the 4-gamete condition. • (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

  29. An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction is lifted later, in the unrooted case. • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.

  30. Inclusion Property • For any pair of columns i, j: • i < j if and only if i1 ⊇ j1 • Note that if i < j, then the edge containing i is an ancestor of the edge containing j

  31. Example • Matrix (rows = species, columns 1 to 5): A 11000; B 00100; C 11010; D 00101; E 10000. • Initially, there is a single clade r, and each node has r as its parent. (Figure: root r with children A, B, C, D, E.)

  32. Sort columns • Sort columns according to the inclusion property (note that the columns are already sorted here). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order. • Matrix (rows = species, columns 1 to 5): A 11000; B 00100; C 11010; D 00101; E 10000.

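  33. Add first column • Matrix (rows = species, columns 1 to 5): A 11000; B 00100; C 11010; D 00101; E 10000. • In adding column i: • Check each edge and decide which side each node belongs to. • Finally, add a node if you can resolve a clade. (Figure: after adding column 1, a new node u under r has children A, C, E; B and D remain children of r.)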

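  34. Adding other columns • Matrix (rows = species, columns 1 to 5): A 11000; B 00100; C 11010; D 00101; E 10000. • Add other columns on edges using the ordering property. (Figure: the final tree. From r, edge 1 leads to a node with leaf E and edge 2; edge 2 leads to a node with leaf A and edge 4, which leads to C. From r, edge 3 leads to a node with leaf B and edge 5, which leads to D.)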
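
The column-by-column construction for the rooted case (0 ancestral) can be sketched as follows. This is one possible rendering of the idea, not necessarily the exact procedure from the lecture: columns are sorted as binary numbers in decreasing order, and each column adds a node `colN` below the node that currently holds all of its species. The function name `perfect_phylogeny`, the node names `r` and `colN`, and the dict-based input format are assumptions made for illustration.

```python
def perfect_phylogeny(matrix):
    """matrix: dict mapping species name -> 0/1 string.
    Returns a dict child -> parent, or None if no perfect phylogeny exists."""
    species = list(matrix)
    m = len(matrix[species[0]])
    ones = [frozenset(s for s in species if matrix[s][j] == "1") for j in range(m)]

    # Read each column as a binary number (first species = most significant bit)
    # and sort in decreasing order, so a superset always precedes its subsets.
    def as_number(j):
        return int("".join(matrix[s][j] for s in species), 2)
    order = sorted(range(m), key=as_number, reverse=True)

    parent = {s: "r" for s in species}  # initially every species hangs off the root r
    for j in order:
        if not ones[j]:
            continue
        tops = {parent[s] for s in ones[j]}
        if len(tops) != 1:
            return None  # the column's species straddle two subtrees: a conflict
        node = f"col{j + 1}"  # node just below the edge carrying mutation j+1
        parent[node] = tops.pop()
        for s in ones[j]:
            parent[s] = node
    return parent

example = {"A": "11000", "B": "00100", "C": "11010", "D": "00101", "E": "10000"}
tree = perfect_phylogeny(example)
print(tree)
```

On the example matrix, the returned parent map places E below mutation 1, A below mutation 2, C below mutation 4, B below mutation 3, and D below mutation 5.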

  35. Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
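
The column-switching step can be sketched as below, assuming rows are 0/1 strings and that a column is flipped only when 1 is the strict majority (ties are left unchanged, which is an assumption the slide does not address). The function name `normalize_columns` is illustrative.

```python
def normalize_columns(rows):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority."""
    n = len(rows)
    m = len(rows[0])
    flip = [2 * sum(r[j] == "1" for r in rows) > n for j in range(m)]
    swap = {"0": "1", "1": "0"}
    return ["".join(swap[c] if flip[j] else c for j, c in enumerate(r)) for r in rows]

print(normalize_columns(["110", "100", "101"]))
```

The normalized rows can then be passed to any rooted construction that treats 0 as the ancestral state.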

  36. Handling recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to loss of correlation between columns

  37. Linkage (Dis)-equilibrium (LD) • Consider sites A and B, observed in 8 sequences as (A, B) = (0,1), (0,1), (0,0), (0,0), (1,0), (1,0), (1,0), (1,0). • Case 1: No recombination • Pr[A,B = (0,1)] = 0.25 • Linkage disequilibrium • Case 2: Extensive recombination • Pr[A,B = (0,1)] = Pr[A = 0] · Pr[B = 1] = 0.5 × 0.25 = 0.125 • Linkage equilibrium
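
The two probabilities on this slide can be recomputed from the table by reading it as eight (A, B) pairs, which is an assumption about the flattened figure. The sketch below compares the observed joint frequency with the product of the marginals; their difference is the usual LD coefficient D. The function name `ld_stats` is illustrative.

```python
def ld_stats(pairs, a=0, b=1):
    """Return (observed Pr[A=a, B=b], Pr[A=a] * Pr[B=b], and their difference D)."""
    n = len(pairs)
    p_ab = sum(1 for x, y in pairs if (x, y) == (a, b)) / n
    p_a = sum(1 for x, _ in pairs if x == a) / n
    p_b = sum(1 for _, y in pairs if y == b) / n
    return p_ab, p_a * p_b, p_ab - p_a * p_b

# The slide's table, read as (A, B) pairs (an assumption about the original layout).
pairs = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
observed, expected, d = ld_stats(pairs)
print(observed, expected, d)
```

A value of D near zero indicates linkage equilibrium; here the observed frequency is twice the value expected under independence.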
