
Introduction to Bioinformatics

LECTURE 5: Variation within and between species. Chapter 5: Are Neanderthals among us?


Presentation Transcript


  1. Introduction to Bioinformatics

  2. Introduction to Bioinformatics. LECTURE 5: Variation within and between species * Chapter 5: Are Neanderthals among us?

  3. Neandertal, Germany, 1856
      Initial interpretations:
      * bear skull
      * pathological idiot
      * Old Dutchman ...

  4. Introduction to Bioinformatics. LECTURE 5: INTER- AND INTRASPECIES VARIATION

  5. Introduction to Bioinformatics. LECTURE 5: INTER- AND INTRASPECIES VARIATION

  6. Introduction to Bioinformatics. LECTURE 5: INTER- AND INTRASPECIES VARIATION

  7. LECTURE 5: INTER- AND INTRASPECIES VARIATION. 5.1 Variation in DNA sequences
      * Even closely related individuals differ in their genetic sequences
      * (Point) mutations: copy errors at a particular location
      * Sexual reproduction: diploid genome

  8. Diploid chromosomes

  9. Mitosis: diploid reproduction

  10. Meiosis: diploid (= double) → haploid (= single)

  11. 5.1 VARIATION IN DNA SEQUENCES
      * Typing error rate of a very good typist: 1 error / 1K typed letters
      * All our diploid cells constantly copy about 7 billion letters
      * A typical cell copying error rate is ~ 1 error / 1 Gbp

  12. GERM LINE
      * Reverse time and follow your cells:
        * Now you count ~ $10^{13}$ cells
        * One generation ago you had 2 cells 'somewhere' in your parents' bodies
        * A small number T of generations ago you had ($2^T$ – multiple ancestors) cells
        * A large number T of generations ago you counted #(fertile ancestors) cells
        * Congratulations: you are 3.4 billion years old!
      * Fast-forward time and follow your cells:
        * Only a few cells in your reproductive organs have a chance to live on in the next generations
        * The rest (including you) will die …

  13. GERM LINE MUTATIONS
      * This potentially immortal lineage of (germ) cells is called the GERM LINE
      * All mutations that we have accumulated are en route on the germ line

  14. 5.1 VARIATION IN DNA SEQUENCES
      * Polymorphism: multiple possibilities for a nucleotide (alleles)
      * Single Nucleotide Polymorphism, SNP ("snip"): a point mutation, for example AAATAAA vs AAACAAA
      * Humans: about 1 SNP per 1500 bases = 0.067%
      * STR = Short Tandem Repeats (microsatellites), for example CACACACACACACACACA …
      * Transition versus transversion

  15. Purines (A, G) – Pyrimidines (C, T)

  16. Transitions (purine ↔ purine or pyrimidine ↔ pyrimidine) – Transversions (purine ↔ pyrimidine)
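
The transition/transversion distinction can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` and the set names are hypothetical and do not come from the lecture.

```python
# Purines and pyrimidines; a transition stays within one class,
# a transversion crosses between the two classes.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old, new):
    """Classify a single-nucleotide change (hypothetical helper)."""
    old, new = old.upper(), new.upper()
    if old not in PURINES | PYRIMIDINES or new not in PURINES | PYRIMIDINES:
        raise ValueError("expected one of A, C, G, T")
    if old == new:
        return "identical"
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# The SNP example AAATAAA vs AAACAAA changes T into C.
print(classify_substitution("T", "C"))
print(classify_substitution("A", "T"))
```

Of the 12 possible substitutions, 4 are transitions and 8 are transversions.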

  17. LECTURE 5: INTER- AND INTRASPECIES VARIATION. 5.2 Mitochondrial DNA
      * Mitochondria are inherited only via the maternal line
      * Very suitable for comparing evolution, because mtDNA is not reshuffled by recombination

  18. H. sapiens mitochondrion

  19. EM photograph of H. sapiens mtDNA

  20. Introduction to Bioinformatics 5.2 MITOCHONDRIAL DNA

  21. LECTURE 5: INTER- AND INTRASPECIES VARIATION. 5.3 Variation between species
      * Genetic variation accounts for morphological, physiological, and behavioral variation
      * Genetic variation (i.e. distance) relates to phylogenetic relationship
      * Hence the need to measure distances between sequences: a metric

  22. Substitution rate
      * Mutations originate in single individuals
      * Mutations can become fixed in a population
      * Mutation rate: rate at which new mutations arise
      * Substitution rate: rate at which a species fixes new mutations
      * For neutral mutations the two rates are equal

  23. Substitution rate and mutation rate
      * For neutral mutations: ρ = 2Nμ · 1/(2N) = μ. A diploid population of N individuals produces 2Nμ new mutations per generation, and each neutral mutation becomes fixed with probability 1/(2N)
      * ρ = K/(2T), where K is the genetic distance between two species that diverged a time T ago
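
The factor 1/(2N) in ρ = 2Nμ · 1/(2N) is the probability that a single new neutral mutation eventually becomes fixed. The sketch below checks this by simulation. It assumes a Wright-Fisher style resampling of the 2N gene copies each generation, a model the slides do not name, and the function name `fixation_fraction` is hypothetical.

```python
import random


def fixation_fraction(n_diploid, trials, seed=0):
    """Fraction of single new neutral mutations that reach fixation.

    Assumption: each generation, all 2N gene copies are redrawn at random
    from the previous generation (Wright-Fisher resampling).
    """
    rng = random.Random(seed)
    copies = 2 * n_diploid
    fixed = 0
    for _ in range(trials):
        count = 1  # the mutation starts as one copy out of 2N
        while 0 < count < copies:
            p = count / copies
            count = sum(1 for _ in range(copies) if rng.random() < p)
        if count == copies:
            fixed += 1
    return fixed / trials


n = 10
estimate = fixation_fraction(n, trials=4000)
print("simulated fixation fraction:", estimate)
print("expected 1/(2N):            ", 1 / (2 * n))
```

With N = 10 the simulated fraction should lie close to 1/(2N) = 0.05, so the 2Nμ new mutations per generation yield about μ substitutions per generation.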

  24. LECTURE 5: INTER- AND INTRASPECIES VARIATION. 5.4 Estimating genetic distance
      * Substitutions are independent (?)
      * Substitutions are random
      * Multiple substitutions may occur
      * Back-mutations mutate a nucleotide back to an earlier value

  25. Multiple substitutions and back-mutations conceal the real genetic distance

          GACTGATCCACCTCTGATCCTTTGGAACTGATCGT
          TTCTGATCCACCTCTGATCCTTTGGAACTGATCGT
          TTCTGATCCACCTCTGATCCATCGGAACTGATCGT
          GTCTGATCCACCTCTGATCCATTGGAACTGATCGT

      observed: 2 (= d)
      actual: 4 (= K)
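
The observed count on this slide is the number of positions at which the first and the last sequence differ. A minimal sketch of that count follows; the helper name `observed_differences` is hypothetical.

```python
def observed_differences(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# First and last sequence from the slide.
first = "GACTGATCCACCTCTGATCCTTTGGAACTGATCGT"
last = "GTCTGATCCACCTCTGATCCATTGGAACTGATCGT"

d_count = observed_differences(first, last)
print("observed differences:", d_count)
print("observed proportion d:", d_count / len(first))
```

Only the end points enter this count, so any site that changed and later changed back contributes nothing to it.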

  26. 5.4 ESTIMATING GENETIC DISTANCE
      * Saturation: on average one substitution per site
      * Two random sequences of equal length will match at approximately ¼ of their sites
      * In saturation the observed proportional genetic distance is therefore ¾

  27. 5.4 ESTIMATING GENETIC DISTANCE
      * True genetic distance (proportion): K
      * Observed proportion of differences: d
      * Due to multiple substitutions and back-mutations: K ≥ d

  28. SEQUENCE EVOLUTION is a Markov process: a sequence at generation (= time) t depends only on the sequence at generation t-1

  29. The Jukes-Cantor model
      * Correction for multiple substitutions
      * Substitution probability per site per second is α
      * Substitution means there are 3 possible replacements (e.g. C → {A,G,T})
      * Non-substitution means there is 1 possibility (e.g. C → C)

  30. 5.4 THE JUKES-CANTOR MODEL
      Therefore, the one-step Markov process has the following transition matrix (rows and columns ordered A, C, G, T):

      $$M_{JC} = \begin{pmatrix} 1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha \end{pmatrix}$$

  31. 5.4 THE JUKES-CANTOR MODEL
      After t generations the substitution probability is: $M(t) = M_{JC}^t$
      Eigenvalues and eigenvectors of $M_{JC}$ (M(t) has the same eigenvectors, with eigenvalues $\lambda_i^t$):
      * $\lambda_1 = 1$ (multiplicity 1): $v_1 = \tfrac{1}{2}(1\ 1\ 1\ 1)^T$
      * $\lambda_{2..4} = 1 - 4\alpha/3$ (multiplicity 3): $v_2 = \tfrac{1}{2}(-1\ -1\ 1\ 1)^T$, $v_3 = \tfrac{1}{2}(-1\ 1\ 1\ -1)^T$, $v_4 = \tfrac{1}{2}(1\ -1\ 1\ -1)^T$

  32. 5.4 THE JUKES-CANTOR MODEL
      Spectral decomposition of M(t): $M_{JC}^t = \sum_i \lambda_i^t v_i v_i^T$
      Define M(t) as:

      $$M_{JC}^t = \begin{pmatrix} r(t) & s(t) & s(t) & s(t) \\ s(t) & r(t) & s(t) & s(t) \\ s(t) & s(t) & r(t) & s(t) \\ s(t) & s(t) & s(t) & r(t) \end{pmatrix}$$

      Therefore, the substitution probability s(t) per site (towards one particular other nucleotide) after t generations is: $s(t) = \tfrac{1}{4} - \tfrac{1}{4}(1 - 4\alpha/3)^t$
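
The closed form for s(t) can be checked numerically by multiplying the one-step Jukes-Cantor matrix with itself t times. The sketch below uses plain Python lists and hypothetical helper names. The diagonal entry r(t) = 1 - 3s(t) is not stated on the slide; it is an assumption that follows from each row summing to 1.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor matrix, rows and columns ordered A, C, G, T."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]


def mat_mul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_pow(m, t):
    """m raised to the integer power t by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


def s_closed(alpha, t):
    """Off-diagonal entry of M(t) according to the slide."""
    return 0.25 - 0.25 * (1 - 4 * alpha / 3) ** t


def r_closed(alpha, t):
    """Diagonal entry of M(t); assumes r(t) = 1 - 3 s(t)."""
    return 1 - 3 * s_closed(alpha, t)


alpha, t = 0.01, 50
m_t = mat_pow(jc_matrix(alpha), t)
print("s(t): matrix power", m_t[0][1], "closed form", s_closed(alpha, t))
print("r(t): matrix power", m_t[0][0], "closed form", r_closed(alpha, t))
```

The two columns of numbers should agree up to floating-point rounding.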

  33. 5.4 THE JUKES-CANTOR MODEL
      Substitution probability s(t) per site (towards one particular other nucleotide) after t generations: $s(t) = \tfrac{1}{4} - \tfrac{1}{4}(1 - 4\alpha/3)^t$
      A site can change into any of the 3 other nucleotides, so the observed genetic distance d after t generations is ≈ 3s(t): $d = \tfrac{3}{4} - \tfrac{3}{4}(1 - 4\alpha/3)^t$
      For small α: $(1 - 4\alpha/3)^t \approx e^{-4\alpha t/3}$

  34. 5.4 THE JUKES-CANTOR MODEL
      For small α the observed genetic distance is: $d \approx \tfrac{3}{4}\left(1 - e^{-4\alpha t/3}\right)$
      The actual genetic distance is (of course): $K = \alpha t$
      So: $K \approx -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}d\right)$
      This is the Jukes-Cantor formula: independent of α and t.

  35. 5.4 THE JUKES-CANTOR MODEL
      The Jukes-Cantor formula: $K \approx -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}d\right)$
      * For small d, using ln(1+x) ≈ x: K ≈ d. So: actual distance ≈ observed distance
      * For saturation: d ↑ ¾: K → ∞. So: if the observed distance corresponds to the distance between random sequences, the actual distance becomes indeterminate
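
Both limits can be seen by evaluating the standard Jukes-Cantor correction K = -¾ ln(1 - 4d/3) directly. The function name `jukes_cantor_distance` is hypothetical, and returning infinity at or beyond d = ¾ is a choice made for this sketch.

```python
import math


def jukes_cantor_distance(d):
    """Jukes-Cantor corrected distance K for an observed proportion d."""
    if d < 0:
        raise ValueError("d must be non-negative")
    if d >= 0.75:
        # At or beyond saturation the correction diverges.
        return math.inf
    return -0.75 * math.log(1 - 4 * d / 3)


for d in (0.01, 0.1, 0.3, 0.5, 0.7, 0.74):
    print(f"d = {d:4.2f}  ->  K = {jukes_cantor_distance(d):.4f}")
```

For small d the corrected value stays close to d, while it grows without bound as d approaches ¾.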

  36. Jukes-Cantor

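  37. 5.4 THE JUKES-CANTOR MODEL
      Variance in K
      If: K = f(d) then: $\delta K \approx f'(d)\,\delta d$
      So: $\mathrm{Var}(K) \approx \left(\frac{\partial K}{\partial d}\right)^2 \mathrm{Var}(d)$
      Generation of a sequence of length n with substitution rate d is a binomial process: $P(k) = \binom{n}{k} d^k (1-d)^{n-k}$
      and therefore with variance: Var(d) = d(1-d)/n
      Because of the Jukes-Cantor formula: $\frac{\partial K}{\partial d} = \frac{1}{1 - \tfrac{4}{3}d}$

  38. 5.4 THE JUKES-CANTOR MODEL
      Variance in K
      Variance: Var(d) = d(1-d)/n
      Jukes-Cantor: $K \approx -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}d\right)$
      So: $\mathrm{Var}(K) \approx \frac{d(1-d)}{n\left(1 - \tfrac{4}{3}d\right)^2}$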
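
Combining Var(d) = d(1-d)/n with the derivative of the Jukes-Cantor formula, dK/dd = 1/(1 - 4d/3), gives the first-order approximation Var(K) ≈ d(1-d) / (n (1 - 4d/3)²). A minimal sketch, with the hypothetical function name `jukes_cantor_variance`:

```python
def jukes_cantor_variance(d, n):
    """First-order (delta method) variance of the Jukes-Cantor distance.

    d: observed proportion of differing sites, 0 <= d < 0.75
    n: number of sites compared
    """
    if not 0 <= d < 0.75:
        raise ValueError("d must satisfy 0 <= d < 0.75")
    if n <= 0:
        raise ValueError("n must be positive")
    var_d = d * (1 - d) / n        # binomial variance of d
    slope = 1 / (1 - 4 * d / 3)    # dK/dd from the Jukes-Cantor formula
    return slope ** 2 * var_d


for d in (0.05, 0.3, 0.6, 0.7):
    print(f"d = {d:4.2f}  ->  Var(K) = {jukes_cantor_variance(d, 1000):.6f}")
```

The variance grows sharply as d approaches ¾, which is why distance estimates near saturation are unreliable.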

  39. Var(K)

  40. EXAMPLE 5.4 on page 90
      * Create artificial data with n = 1000: generate K* mutations
      * Count d
      * With the Jukes-Cantor relation, reconstruct the estimate K(d)
      * Plot K(d) – K*
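
The steps of this example can be sketched as follows. This is not the book's code: the helper names, the random seed, and the choice to express K* as the number of applied substitutions divided by n are assumptions made for illustration, and the sketch prints K(d) - K* rather than plotting it.

```python
import math
import random

BASES = "ACGT"


def observed_distance(n_sites, n_mutations, rng):
    """Apply random substitutions to a random sequence and return d."""
    original = [rng.choice(BASES) for _ in range(n_sites)]
    current = list(original)
    for _ in range(n_mutations):
        site = rng.randrange(n_sites)
        # A substitution always changes the nucleotide at the chosen site.
        current[site] = rng.choice([b for b in BASES if b != current[site]])
    differing = sum(1 for a, b in zip(original, current) if a != b)
    return differing / n_sites


def jukes_cantor(d):
    """Jukes-Cantor estimate K(d); infinite at or beyond saturation."""
    return math.inf if d >= 0.75 else -0.75 * math.log(1 - 4 * d / 3)


rng = random.Random(42)  # fixed seed so that runs are repeatable
n = 1000
for n_mutations in (100, 300, 500, 1000):
    k_star = n_mutations / n
    d = observed_distance(n, n_mutations, rng)
    k_est = jukes_cantor(d)
    print(f"K* = {k_star:.2f}  d = {d:.3f}  K(d) = {k_est:.3f}  "
          f"K(d) - K* = {k_est - k_star:+.3f}")
```

The observed d can never exceed K*, because every substitution changes at most one site, and the corrected K(d) should scatter around K*.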

  41. Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90

  42. Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90

  43. Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90

  44. Introduction to Bioinformatics 5.4 EXAMPLE 5.4 on page 90 (= FIG 5.3)

  45. The Kimura 2-parameter model
      * Include substitution bias in the correction factor
      * Transition probability (G↔A and T↔C) per site per second is α
      * Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per second is β

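  46. 5.4 THE KIMURA 2-PARAM MODEL
      The one-step Markov process substitution matrix now becomes (rows and columns ordered A, C, G, T; β is taken per transversion target, so that each row sums to 1):

      $$M_{K2P} = \begin{pmatrix} 1-\alpha-2\beta & \beta & \alpha & \beta \\ \beta & 1-\alpha-2\beta & \beta & \alpha \\ \alpha & \beta & 1-\alpha-2\beta & \beta \\ \beta & \alpha & \beta & 1-\alpha-2\beta \end{pmatrix}$$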

  47. 5.4 THE KIMURA 2-PARAM MODEL
      After t generations the substitution probability is: $M(t) = M_{K2P}^t$
      Determine for M(t): eigenvalues {λi} and eigenvectors {vi}

  48. 5.4 THE KIMURA 2-PARAM MODEL
      Spectral decomposition of M(t): $M_{K2P}^t = \sum_i \lambda_i^t v_i v_i^T$
      * Determine the fraction of transitions per site after t generations: P(t)
      * Determine the fraction of transversions per site after t generations: Q(t)
      * Genetic distance: K ≈ -½ ln(1 - 2P - Q) - ¼ ln(1 - 2Q)
      * Fraction of substitutions: d = P + Q → Jukes-Cantor
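
The Kimura distance can be computed from two aligned sequences by counting transitions and transversions separately. The sketch below assumes ungapped sequences over A, C, G, T. The function name `kimura_2p_distance` and the two example sequences are made up for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p_distance(seq_a, seq_b):
    """K = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q) for two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("need two non-empty sequences of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_a)     # fraction of transitions, P
    q = transversions / len(seq_a)   # fraction of transversions, Q
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        return math.inf              # saturated: the logarithm is undefined
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Made-up example: one transition (A -> G) and one transversion (C -> A).
x = "ACGTACGTACGTACGTACGT"
y = "GCGTACGTACGTACGTAAGT"
print("K2P distance:", kimura_2p_distance(x, y))
```

With P = Q = 0.05 in this example the corrected distance comes out slightly above the raw proportion d = P + Q = 0.1.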

  49. 5.4 ESTIMATING GENETIC DISTANCE
      Other models for nucleotide evolution
      * Different types of transitions/transversions
      * Pairwise substitutions: GTR (= General Time Reversible) model
      * Amino-acid substitution matrices
      * …

  50. 5.4 ESTIMATING GENETIC DISTANCE
      Other models for nucleotide evolution
      * DEFICIT: all the models above assume symmetric substitution probabilities: prob(A→T) = prob(T→A)
      * There is now strong evidence that this assumption is not true
      * Challenge: incorporate this in a self-consistent model
