
Top down systems biology



Presentation Transcript


  1. Top down systems biology Introduction systems biology 2011/2012 Huub Hoefsloot

  2. Schedule • 09:00-09:45 lecture • 09:45-11:00 coffee + assignments 1&2 • 11:00-11:45 lecture • 11:45-12:45 coffee + assignment 3 • 13:30-15:15 MATLAB tutorials P323 • 15:15- ?? finish report F229 • 16:30-?? iGEM KC159

  3. Aim • What is the top down approach? • Basic methods: • Clustering • Classification • Principal Component Analysis (PCA) • Looking at single genes, proteins and metabolites • How to draw biological conclusions?

  4. Top down • Measure as much as possible • Let the data speak!!! • Common data types • Transcriptomics • Proteomics • Metabolomics

  5. Methods • Clustering • Test single genes/proteins……. • Principal component analysis • Discrimination

  6. Hierarchical clustering • Various hierarchical clustering algorithms exist • Single linkage (nearest neighbour) • Complete linkage (furthest neighbour) • Average linkage, unweighted pair-group method with arithmetic mean (UPGMA) • Weighted pair-group average (WPGMA) • ...

  7. Hierarchical clustering • Small example: five points, labelled 1–5, in the plane [figure]

  8. Distances • Calculate the distances between all points • For n points this yields n(n-1)/2 distances: Y = pdist(X) gives
  (1,2) 2.9155  (1,3) 1.0000  (1,4) 3.0414  (1,5) 3.0414  (2,3) 2.5495
  (2,4) 3.3541  (2,5) 2.5000  (3,4) 2.0616  (3,5) 2.0616  (4,5) 1.0000
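The pdist call above (MATLAB) can be mirrored in Python with SciPy; a sketch, where the five 2-D points are invented for illustration, not the slide's data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Five invented 2-D points (not the slide's data)
X = np.array([[0.0, 0.0],   # point 1
              [2.5, 1.5],   # point 2
              [1.0, 0.0],   # point 3
              [3.0, 0.5],   # point 4
              [3.0, 1.5]])  # point 5

# n(n-1)/2 = 10 pairwise Euclidean distances, in the order
# (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5)
Y = pdist(X)

# Single linkage (nearest neighbour); method='complete' and method='average'
# give the furthest-neighbour and average-linkage trees of the later slides.
Z = linkage(Y, method='single')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(len(Y), labels)
```

Cutting the tree with `fcluster` corresponds to the "define a number of clusters" option of slide 12; passing `criterion='distance'` instead corresponds to a cut-off value.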

  9. Dendrogram (nearest neighbour / single linkage), built from the pairwise distances of slide 8 [figure]

  10. Furthest neighbour (complete linkage) dendrogram for the same distances [figure]

  11. Average linkage dendrogram for the same distances [figure]

  12. Clustering from a tree • Define a cut-off value, or • Define a number of clusters.

  13. An example http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/

  14. Distances • Distances in R^n • Euclidean: d(x,y) = sqrt(sum_i (x_i - y_i)^2) • City block: d(x,y) = sum_i |x_i - y_i| • Mahalanobis: d(x,y) = sqrt((x-y)^T S^-1 (x-y)), with S the covariance matrix • Standardized Euclidean: as Mahalanobis with S = D, where D is diagonal
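A sketch of the three distance measures with numpy only; the vectors x, y and the covariance matrix S are invented example values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
d = x - y

euclidean = np.sqrt(np.sum(d ** 2))    # sqrt(1 + 4 + 0) = sqrt(5)
city_block = np.sum(np.abs(d))         # 1 + 2 + 0 = 3

# Mahalanobis: sqrt((x - y)^T S^-1 (x - y)), S an invented covariance matrix
S = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
mahalanobis = np.sqrt(d @ np.linalg.solve(S, d))

# With a diagonal S = D, Mahalanobis reduces to the standardized Euclidean
# distance: each variable is simply scaled by its own variance.
standardized = np.sqrt(np.sum(d ** 2 / np.diag(S)))
print(euclidean, city_block, mahalanobis, standardized)
```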

  15. K-means clustering • 1. The number of clusters has to be chosen in advance • 2. Initial positions of the cluster centres have to be chosen • 3. For each data point the distance (Euclidean; many more are available) to each cluster centre is calculated • 4. Each data point is assigned to its nearest cluster centre • 5. Cluster centres are shifted to the centre of the data points assigned to them • Steps 3–5 are iterated until the cluster centres no longer shift
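The steps above can be sketched as a minimal k-means in Python (numpy only; the two-blob data set and the choice of initial centres are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well separated invented blobs of 50 points each
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

def kmeans(X, centres, n_iter=100):
    centres = centres.copy()
    for _ in range(n_iter):
        # Steps 3-4: assign each point to its nearest centre (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: shift each centre to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centres))])
        if np.allclose(new, centres):   # centres no longer shift: done
            break
        centres = new
    return centres, labels

# Steps 1-2: k = 2 clusters, initial centres chosen here as two data points
centres, labels = kmeans(X, X[[0, 50]])
```

In practice the initial centres are often chosen at random, and the result can depend on that choice, so k-means is usually restarted several times.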

  16. K-means: first guesses for the blue and red centres (x) and the resulting dividing line [figure]

  17. K-means: second guesses for the blue and red centres and the updated dividing line [figure]

  18. K-means: final centres and dividing line for the blues and the reds [figure]

  19. K-means • How to determine the number of clusters? • Try several values and use a quality criterion to pick the best.

  20. Multiple testing

  21. What is the problem? • 30,000 genes are measured in a case-control study • For each gene a t-test is performed at a significance level of α = 0.05 • About 0.05 × 30,000 = 1500 genes will then be found significant even if the two groups do not differ.
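This can be checked with a quick simulation; a sketch in which the group size (10 samples per group) is an invented choice:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_genes = 30_000
# Pure noise: cases and controls come from the same distribution,
# so any "significant" gene is a false positive.
cases = rng.normal(size=(n_genes, 10))
controls = rng.normal(size=(n_genes, 10))

# One t-test per gene (vectorized over the gene axis)
_, p = ttest_ind(cases, controls, axis=1)
n_sig = int(np.sum(p < 0.05))
print(n_sig)   # around 0.05 * 30000 = 1500 false positives
```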

  22. False positives/false negatives • False positives are variables that are found to differ but actually do not differ between the groups. • False negatives are variables that are not found by the test but actually do differ.

  23. Bonferroni correction • If n tests are performed, use the adjusted value αadj = α/n • This gives a procedure in which the chance of finding at least one false positive among the n variables is at most α.
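A small sketch of the correction: under the null hypothesis p-values are uniform on [0, 1], so they can be simulated directly without assuming any data set:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30_000
p = rng.uniform(size=n)            # null p-values: uniform on [0, 1]

alpha = 0.05
n_uncorr = int(np.sum(p < alpha))          # ~1500 uncorrected "findings"
alpha_adj = alpha / n                      # Bonferroni-adjusted threshold
n_corr = int(np.sum(p < alpha_adj))        # usually 0: a test now needs
print(n_uncorr, n_corr)                    # p < 0.05/30000 ~ 1.7e-6
```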

  24. Permutation approach • Permute the group labels to make a nonsense data set • Count the number of significant variables; repeat this many times • The number of findings in the true data should lie in the right tail of this permutation distribution
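The procedure can be sketched as follows; the data set is invented and kept small (500 genes, 8 samples per group, 50 genes with a true effect) so that the example runs quickly:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n_genes, n_per_group = 500, 8
data = rng.normal(size=(n_genes, 2 * n_per_group))
data[:50, :n_per_group] += 2.0      # give the first 50 genes a true effect

labels = np.array([True] * n_per_group + [False] * n_per_group)

def count_significant(data, labels, alpha=0.05):
    _, p = ttest_ind(data[:, labels], data[:, ~labels], axis=1)
    return int(np.sum(p < alpha))

observed = count_significant(data, labels)

# Permute the labels many times and recount: the null distribution
perm_counts = np.array([count_significant(data, rng.permutation(labels))
                        for _ in range(200)])

# The observed count should lie in the right tail of the permutation
# distribution; the mean of that distribution estimates the expected
# number of false discoveries.
print(observed, perm_counts.mean())
```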

  25. Distribution found in permutations [histogram: frequency vs. number of genes found] • Only significant (at a level of 0.05) if the observed result lies in the right-hand 5% of the permutation distribution • The false discovery rate can be estimated using the mean of the permutation distribution

  26. Result is a list of genes • What to do with this list? • How to get a biological interpretation? • Look whether the genes in the list are related • See, with the help of GO, whether some biological processes are overrepresented (bioinformatics)

  27. Assignment 1 • Read the primer on gene expression clustering (ignore SOM) and answer the following question: • Come up with a biological question for which Pearson correlation, and not Euclidean distance, is the appropriate measure. Please explain!

  28. Assignment 2 • Read the primer on multiple testing and answer the following question: • What are the advantages of an empirical null, and what are the advantages of an analytical null?

  29. Principal Components Analysis An intuitive explanation

  30. Points in space: four objects and a single variable x1; each object is a point on a line [figure]

  31. Points in space: variables x1 and x2; each object is a point in the plane [figure]

  32. Points in space: variables x1, x2, x3; objects are points in R^3 [figure]

  33. Points in space: variables x1 ... xn; objects are points in R^n [figure]

  34. How to visualize R^n? • Use 2 variables only? Many, many plots!! This does not help. • Find "important" directions and plot the objects with respect to these directions. This is the idea behind Principal Component Analysis (PCA).

  35. Important directions: scatter of objects in the (x1, x2) plane; the main direction of spread is related to the covariance [figure]

  36. Principal components: PC1 along the direction of largest spread, PC2 perpendicular to it [figure]

  37. Any gain? Only consider the most important components: objects projected onto PC1 [figure]

  38. Dimension reduction • From 2D to 1D is a small gain • From 30,000D to 2D is a large gain, and a plot (visualization) can be made.

  39. What are the results of a PCA? • Scores • Loadings

  40. Scores and loadings: DATA = scores1 × loadings1 + scores2 × loadings2 + ... (each score/loading pair = one principal component) [figure] • Scores tell something about an object • Loadings about the variables

  41. Explained variance • PCA tries to explain the variance in the data • The first PC explains as much variance as possible • Each next PC explains as much as possible of the remainder
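PCA can be sketched via the singular value decomposition (numpy only; the data set is invented, with most of its variance along one hidden direction):

```python
import numpy as np

rng = np.random.default_rng(4)
# 20 objects, 5 variables: one strong hidden direction plus a little noise
t = rng.normal(size=(20, 1))
X = t @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(20, 5))

Xc = X - X.mean(axis=0)                 # centre each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                          # one row per object
loadings = Vt.T                         # one column per PC, one row per variable
explained = s**2 / np.sum(s**2)         # fraction of variance per PC

print(explained.round(3))               # PC1 should dominate
# DATA = scores x loadings (slide 40): summing all PCs recovers Xc exactly;
# keeping only PC1 gives the rank-1 approximation used for visualization.
X1 = np.outer(scores[:, 0], loadings[:, 0])
```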

  42. Discrimination

  43. Question • To which group does an object belong?

  44. Discriminant analysis • LDA: linear discriminant analysis • PCDA: principal component discriminant analysis

  45. Normal distributions

  46. Multivariate normal distribution: no covariance, equal variances [figure]

  47. Linear discriminant analysis (LDA): class means m1 and m2, the discriminant direction D, and the dividing line between the classes [figure]

  48. Multivariate normal, with covariance [figure]

  49. Linear discriminant analysis (LDA): class means M1 and M2 and the discriminant direction [figure]

  50. LDA calculations • D = W^-1 (m1 - m2) • W is the pooled within-class covariance matrix • D is the discriminant vector (m1 and m2 are the class means)
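The calculation can be sketched as follows, assuming the standard Fisher discriminant direction D = W^-1 (m1 - m2); the two classes are invented, well-separated Gaussian clouds:

```python
import numpy as np

rng = np.random.default_rng(5)
class1 = rng.normal([0.0, 0.0], 0.5, size=(40, 2))
class2 = rng.normal([3.0, 1.0], 0.5, size=(40, 2))

m1, m2 = class1.mean(axis=0), class2.mean(axis=0)
# Pooled within-class covariance W
W = (np.cov(class1.T) * (len(class1) - 1) +
     np.cov(class2.T) * (len(class2) - 1)) / (len(class1) + len(class2) - 2)

D = np.linalg.solve(W, m1 - m2)        # discriminant direction D = W^-1 (m1 - m2)

# Classify by projecting onto D relative to the midpoint of the means
midpoint = (m1 + m2) / 2
def predict(x):
    return 1 if (x - midpoint) @ D > 0 else 2

acc = np.mean([predict(x) == 1 for x in class1] +
              [predict(x) == 2 for x in class2])
print(acc)
```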
