Searching for applications of EVT in biology

Adam Butler, Biomathematics & Statistics Scotland. UK extremes, April 2007. Acknowledgements: Len Thomas, Clive Anderson, Dirk Husmeier.

Presentation Transcript


  1. Searching for applications of EVT in biology. Adam Butler, Biomathematics & Statistics Scotland. UK extremes, April 2007. Acknowledgements: Len Thomas, Clive Anderson, Dirk Husmeier.

  2. Overview • Biologists are frequently interested in the properties of extreme or rare events (e.g. extinction, long-range dispersal, genetic mutation), but EVT (extreme value theory) is not widely known or used in many branches of biology • Some possible reasons: • Biological sciences have tended to be data-poor relative to, e.g., hydrology • Focus on testing scientific hypotheses rather than on risk assessment • Difficulty in deriving a meaningful quantitative definition of an extreme event • Opportunities come from the large datasets now generated in modern biology (e.g. genetics, ecological modelling), & from an increasing focus on quantitative risk assessment

  3. Genetics • “…a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences…” (Wikipedia) • EVT has been used for sequence alignment since the early 90s (Karlin et al., 1990; Mott, 1992; Mott & Tribe, 1999), and is now embedded within standard software (BLAST, FASTA)

  4. The basic idea is to compare the target sequence with a (very) large database of known sequences, by: • defining a similarity score • using a fast algorithm to search for the best match(es) within the database • using EVT to evaluate the statistical significance of this match • Theoretical arguments are used to justify the use of a Gumbel model for the best score • Current interest is in the alignment of multiple sequences (Frommlet & Futschik, 2004; Wang & Sen, 2006), & this requires the use of multivariate extreme value methods
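The significance step on slide 4 can be sketched as follows. This is a minimal illustration, not the procedure used inside BLAST or FASTA: the scoring function (`best_local_score`, an ungapped match/mismatch score), the method-of-moments Gumbel fit, and the simulated random "database" are all assumptions made for the example.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def best_local_score(a, b, match=1, mismatch=-1):
    """Best ungapped local alignment score between a and b (toy similarity score)."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):      # every diagonal of the comparison
        run = 0
        for i in range(max(0, offset), min(len(a), offset + len(b))):
            step = match if a[i] == b[i - offset] else mismatch
            run = max(0, run + step)                 # restart when the running score drops below 0
            best = max(best, run)
    return best


def fit_gumbel_moments(scores):
    """Method-of-moments estimates (location mu, scale beta) of a Gumbel model."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(best score >= score) under the fitted Gumbel null model."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


# Null distribution: best scores of the query against unrelated random sequences.
rng = random.Random(0)
random_seq = lambda n: "".join(rng.choice("ACGT") for _ in range(n))
query = random_seq(60)
null_scores = [best_local_score(query, random_seq(60)) for _ in range(200)]
mu, beta = fit_gumbel_moments(null_scores)

# A perfect self-match scores 60, far into the upper tail of the null model.
print(gumbel_p_value(best_local_score(query, query), mu, beta))
```

A small p-value suggests that the best match is unlikely to have arisen from unrelated sequences. Production tools rely on analytical or pre-computed Gumbel parameters rather than a fresh simulation per query.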

  5. Ecology. Review papers: Gaines & Denny (1993), Katz et al. (2005) • Disturbance: study the extremes of environmental processes that are known to lead to ecological disturbance (sediment rates, fire sizes, frost days) • Longevity & survival: study the maximum lifespan or size of an individual • Population dynamics: evaluate the probability of extinction or explosion of a population

  6. Dispersal & spread: spatial spread (of diseases, pollen, invasive species, native species responding to climate change) is known to be influenced by long-range dispersal events; use EVT to analyse dispersal data? Issues: spatial structure; censoring &/or non-reporting; mixtures • Ecological modelling: study the properties of extreme events simulated by complex process-based ecological models, e.g. mass extinction events. Deterministic models: find the region of the parameter space in which the process exceeds a particular level. Stochastic models: calculate the probability that the process exceeds a threshold for a given parameter set
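For the stochastic-model case on slide 6, the exceedance probability for one parameter set can be approximated by direct simulation. The sketch below assumes a toy stochastic Ricker population model; `simulate_peak`, `exceedance_probability` and all parameter values are hypothetical stand-ins for a real process-based ecological model.

```python
import math
import random


def simulate_peak(growth_rate, noise_sd, n_steps, rng, n0=10.0, capacity=100.0):
    """Peak abundance reached in one run of a toy stochastic Ricker model."""
    n = peak = n0
    for _ in range(n_steps):
        # Density-dependent growth with multiplicative log-normal noise.
        n *= math.exp(growth_rate * (1.0 - n / capacity) + rng.gauss(0.0, noise_sd))
        peak = max(peak, n)
    return peak


def exceedance_probability(level, growth_rate, noise_sd, n_runs=2000, n_steps=50, seed=1):
    """Monte Carlo estimate of P(peak abundance > level) for one parameter set."""
    rng = random.Random(seed)
    hits = sum(simulate_peak(growth_rate, noise_sd, n_steps, rng) > level
               for _ in range(n_runs))
    return hits / n_runs


print(exceedance_probability(150.0, growth_rate=0.5, noise_sd=0.3))
```

Plain Monte Carlo becomes inefficient when the level is so extreme that few or no runs exceed it, which is the situation where a tail model such as the GPD is useful.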

  7. EVT for complex stochastic models: some vague ideas. Y(θ) ~ CSM(θ), likelihood of CSM intractable, θ high dimensional. Possible approach if simulation is quick & we have real data x…?: EVT + ABC: 1. generate a value θ from the prior 2. use the model to simulate a dataset y(θ) ~ CSM(θ) 3. fit y(θ)|{y(θ) > u} ~ GPD to estimate P(Y(θ) > v), for v >> u 4. accept θ if P(Y(θ) > v) lies within a 95% confidence interval about P(X > v), else reject. Or perhaps could use ABC-MCMC on (θ, v) with pseudo-prior on v
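The four EVT + ABC steps on slide 7 can be sketched as below. Everything specific is an assumption for illustration: the complex stochastic model is replaced by a hypothetical `simulate_csm` that draws exponential values with mean θ, the prior is uniform, the GPD is fitted by the method of moments, and the 95% interval about P(X > v) comes from a percentile bootstrap of the observed data.

```python
import math
import random
import statistics


def gpd_tail_prob(sample, u, v):
    """Estimate P(Y > v), v > u, from a GPD fitted by moments to excesses over u."""
    excesses = [y - u for y in sample if y > u]
    if len(excesses) < 5:                      # too few exceedances: sketch-level fallback
        return 0.0
    m = statistics.fmean(excesses)
    ratio = m * m / statistics.variance(excesses)
    xi = 0.5 * (1.0 - ratio)                   # GPD shape
    sigma = 0.5 * m * (ratio + 1.0)            # GPD scale
    p_u = len(excesses) / len(sample)          # empirical P(Y > u)
    if abs(xi) < 1e-9:
        return p_u * math.exp(-(v - u) / sigma)
    base = 1.0 + xi * (v - u) / sigma
    return p_u * base ** (-1.0 / xi) if base > 0 else 0.0


def simulate_csm(theta, n, rng):
    """Hypothetical stand-in for the CSM: exponential draws with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]


def bootstrap_ci(data, u, v, rng, n_boot=200):
    """Percentile 95% interval for the GPD-based estimate of P(X > v)."""
    est = sorted(gpd_tail_prob(rng.choices(data, k=len(data)), u, v)
                 for _ in range(n_boot))
    return est[int(0.025 * n_boot)], est[int(0.975 * n_boot) - 1]


def evt_abc(data, u, v, prior_lo, prior_hi, n_draws, n_sim, seed=0):
    rng = random.Random(seed)
    lo, hi = bootstrap_ci(data, u, v, rng)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(prior_lo, prior_hi)      # 1. draw theta from the prior
        y = simulate_csm(theta, n_sim, rng)          # 2. simulate a dataset
        p_tail = gpd_tail_prob(y, u, v)              # 3. GPD estimate of P(Y > v)
        if lo <= p_tail <= hi:                       # 4. accept or reject
            accepted.append(theta)
    return accepted, (lo, hi)


observed = simulate_csm(1.5, 1000, random.Random(42))   # pretend "real" data x
accepted, ci = evt_abc(observed, u=3.0, v=8.0, prior_lo=1.0, prior_hi=3.0,
                       n_draws=200, n_sim=1000)
print(len(accepted), statistics.fmean(accepted), ci)
```

The accepted values of θ form an approximate posterior sample that is informed only by tail behaviour, so parameters that barely affect the extremes will remain poorly identified.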

  8. Y(θ) ~ CSM(θ), likelihood of CSM intractable, θ high dimensional. Possible approach if simulation is slow & we do not have data…? EVT + GP: Run CSM for a relatively small set of parameter values θ. Assume y(θ)|{y(θ) > u} ~ GPD(ψ(θ)). Assume ψ(θ) ~ N(μ, Σ). Impose structure on Σ & fit by hierarchical Bayes. Could be used to draw inferences about P(Y(θ) > v) for v >> u, even if we have not simulated from CSM(θ)
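A rough two-stage approximation of the EVT + GP idea on slide 8 is sketched below. It is not the hierarchical Bayes fit proposed on the slide: the GPD is fitted separately at each design point, and the fitted quantities are then interpolated across θ using the posterior mean of a Gaussian process with a fixed squared-exponential kernel. The toy model `simulate_csm`, the kernel length scale and the nugget are assumptions.

```python
import math
import random
import statistics


def fit_gpd_moments(sample, u):
    """Return (P(Y > u), GPD scale, GPD shape) by the method of moments."""
    excesses = [y - u for y in sample if y > u]
    m = statistics.fmean(excesses)
    ratio = m * m / statistics.variance(excesses)
    return len(excesses) / len(sample), 0.5 * m * (ratio + 1.0), 0.5 * (1.0 - ratio)


def solve_spd(a, b):
    """Solve a x = b for a symmetric positive-definite matrix a via Cholesky."""
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    z = [0.0] * n
    for i in range(n):                      # forward substitution: low z = b
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution: low^T x = z
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gp_mean(thetas, values, theta_new, length=1.0, nugget=1e-2):
    """Posterior mean of a constant-mean GP with a squared-exponential kernel."""
    kern = lambda s, t: math.exp(-0.5 * ((s - t) / length) ** 2)
    centre = statistics.fmean(values)
    gram = [[kern(s, t) + (nugget if i == j else 0.0) for j, t in enumerate(thetas)]
            for i, s in enumerate(thetas)]
    weights = solve_spd(gram, [val - centre for val in values])
    return centre + sum(w * kern(theta_new, t) for w, t in zip(weights, thetas))


def simulate_csm(theta, n, rng):
    """Hypothetical stand-in for the slow CSM: exponential draws with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]


def tail_prob_at(theta_new, design, fits, u, v):
    """Estimate P(Y(theta_new) > v) at a parameter value that was never simulated."""
    eta = gp_mean(design, [math.log(f[0] / (1.0 - f[0])) for f in fits], theta_new)
    sigma = math.exp(gp_mean(design, [math.log(f[1]) for f in fits], theta_new))
    xi = gp_mean(design, [f[2] for f in fits], theta_new)
    p_u = 1.0 / (1.0 + math.exp(-eta))
    if abs(xi) < 1e-9:
        return p_u * math.exp(-(v - u) / sigma)
    base = 1.0 + xi * (v - u) / sigma
    return p_u * base ** (-1.0 / xi) if base > 0 else 0.0


rng = random.Random(7)
design = [1.0, 1.5, 2.0, 2.5, 3.0]                      # small set of simulated thetas
fits = [fit_gpd_moments(simulate_csm(t, 5000, rng), u=2.0) for t in design]
estimate = tail_prob_at(1.75, design, fits, u=2.0, v=8.0)
print(estimate, math.exp(-8.0 / 1.75))                  # estimate vs. exact toy value
```

A full hierarchical Bayes treatment would also propagate the uncertainty in the GPD fits and estimate the covariance structure, which this plug-in version ignores.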

  9. Some references
     Karlin, S., Dembo, A. and Kawabata, T. (1990) Statistical composition of high-scoring segments from molecular sequences. The Annals of Statistics, 18, 571-581.
     Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology, 54, 59-75.
     Gaines, S. D. and Denny, M. W. (1993) The largest, smallest, highest, lowest, longest, and shortest: extremes in ecology. Ecology, 74, 1677-1692.
     Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments. Journal of Computational Biology, 6(1), 91-112.
     Frommlet, F. and Futschik, A. (2004) On the dependence structure of sequence alignment scores calculated with multiple scoring matrices. Statistical Applications in Genetics and Molecular Biology, 3, article 24.
     Katz, R. W., Brush, G. S. and Parlange, M. B. (2005) Statistics of extremes: modeling ecological disturbances. Ecology, 86, 1124-1134.
     Wang, L. and Sen, P. K. (2006) Extreme value theory in some statistical analysis of genomic sequences. Extremes, 8, 295-310.
     Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
     Karlin, S. and Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences USA, 87(6), 2264-2268.
     Karlin, S. and Altschul, S. F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences USA, 90(12), 5873-5877.
     Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.
     Pearson, W. R. (1998) Empirical statistical estimates for sequence similarity searches. Journal of Molecular Biology, 276, 71-84.
     Mott, R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. Journal of Molecular Biology, 300(3), 649-659.
     Altschul, S. F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research, 29(2), 351-361.
     Ewens, W. J. and Grant, G. R. (2001) Statistical Methods in Bioinformatics: An Introduction. Statistics for Biology and Health. Springer-Verlag.
     Park, Y. and Spouge, J. L. (2002) The correlation error and finite-size correction in an ungapped sequence alignment. Bioinformatics, 18(9), 1236-1242.
