Applications of Extreme Value Theory (EVT) in Biology: A Comprehensive Overview

Searching for applications of EVT in biology
Adam Butler, Biomathematics & Statistics Scotland
UK extremes, April 2007
Acknowledgements: Len Thomas, Clive Anderson, Dirk Husmeier

Overview
• Biologists are frequently interested in the properties of extreme or rare events (e.g. extinction, long-range dispersal, genetic mutation), but EVT is not widely known or used in many branches of biology
• Some possible reasons:
  • Biological sciences have tended to be data-poor relative to, e.g., hydrology
  • Focus on testing scientific hypotheses rather than on risk assessment
  • Difficulty in deriving a meaningful quantitative definition of an extreme event
• Opportunities arise from the large datasets now available in modern biology (e.g. genetics, ecological modelling), & from an increasing focus on quantitative risk assessment

Genetics
• “…a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences…” (Wikipedia)
• EVT has been used for sequence alignment since the early 90s (Karlin et al., 1990; Mott, 1992; Mott & Tribe, 1999), and is now embedded within standard software (BLAST, FASTA)

• The basic idea is to compare the target sequence with a (very) large database of known sequences, by:
  • defining a similarity score
  • using a fast algorithm to search for the best match(es) within the database
  • using EVT to evaluate the statistical significance of this match
• Theoretical arguments are used to justify the use of a Gumbel model for the best score
• Current interest is in the alignment of multiple sequences (Frommlet & Futschik, 2004; Wang & Sen, 2006), which requires the use of multivariate extreme value methods
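The Gumbel significance calculation for a best local-alignment score can be sketched as follows. This is a minimal illustration of the Karlin-Altschul form, in which the expected number of chance hits is E = K·m·n·exp(-λS) and the p-value is 1 - exp(-E). The function name `gumbel_pvalue` is hypothetical, and the values of λ, K, the query length m and the database length n are illustrative assumptions, not parameters taken from the slides or from any particular scoring system.

```python
import math

def gumbel_pvalue(score, m, n, lam, K):
    """Approximate P(best local alignment score >= score) under a Gumbel model.

    m, n : query length and total database length
    lam, K : scale and pre-factor of the score distribution (assumed known)
    """
    expected_hits = K * m * n * math.exp(-lam * score)  # E-value
    # 1 - exp(-E), computed with expm1 so tiny E-values do not round to zero
    return -math.expm1(-expected_hits)

# Illustrative (assumed) parameter values, not tied to a real search
lam, K = 0.318, 0.134
m, n = 250, 50_000_000

for s in (60, 80, 100):
    print(f"score={s:3d}  p-value={gumbel_pvalue(s, m, n, lam, K):.3g}")
```

Because the p-value depends on the product m·n, the same raw score becomes less significant as the database grows, which is why the significance of the best match has to be assessed with an extreme value model rather than a model for a single comparison.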

Ecology
Review papers: Gaines & Denny (1993), Katz et al. (2005)
• Disturbance: study the extremes of environmental processes that are known to lead to ecological disturbance (sediment rates, fire sizes, frost days)
• Longevity & survival: study the maximum lifespan or size of an individual
• Population dynamics: evaluate the probability of extinction or explosion of a population

• Dispersal & spread: spatial spread (of diseases, pollen, invasive species, native species responding to climate change) is known to be influenced by long-range dispersal events: use EVT to analyse dispersal data?
  Issues: spatial structure; censoring &/or non-reporting; mixtures
• Ecological modelling: study the properties of extreme events simulated by complex process-based ecological models, e.g. mass extinction events
  • Deterministic models: find the region of the parameter space associated with the process exceeding a particular level
  • Stochastic models: calculate the probability of the process exceeding a threshold for a given parameter set
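For the stochastic case, the exceedance probability for one parameter set can be estimated by direct simulation. The sketch below uses a toy stochastic Ricker population model as a stand-in for a process-based ecological model. The model, the function names and all parameter values are assumptions made for illustration and do not come from the slides.

```python
import math
import random

def simulate_peak(r, capacity, sigma, n0, steps, rng):
    """Return the largest population size reached along one simulated trajectory."""
    n = n0
    peak = n
    for _ in range(steps):
        noise = rng.gauss(0.0, sigma)  # environmental noise on the log scale
        n = n * math.exp(r * (1.0 - n / capacity) + noise)
        peak = max(peak, n)
    return peak

def exceedance_probability(level, r, capacity, sigma,
                           n0=50.0, steps=100, runs=2000, seed=1):
    """Monte Carlo estimate of P(peak population > level), with its standard error."""
    rng = random.Random(seed)
    hits = sum(simulate_peak(r, capacity, sigma, n0, steps, rng) > level
               for _ in range(runs))
    p = hits / runs
    se = math.sqrt(p * (1.0 - p) / runs)
    return p, se

for level in (150.0, 200.0, 300.0):
    p, se = exceedance_probability(level, r=0.5, capacity=100.0, sigma=0.3)
    print(f"level={level:5.0f}  P(exceed)={p:.4f}  (s.e. {se:.4f})")
```

Direct counting of this kind needs many runs once the level is high enough that exceedances are rare, which is the situation in which fitting a tail model to the simulated output becomes attractive.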

EVT for complex stochastic models: some vague ideas
Y(θ) ~ CSM(θ), likelihood of the CSM intractable, θ high dimensional
Possible approach if simulation is quick & we have real data x…?
EVT + ABC:
1. generate a value from the prior, θ ~ π
2. use the model to simulate a dataset y(θ) ~ CSM(θ)
3. fit y(θ)|{y(θ) > u} ~ GPD to estimate P(Y(θ) > v), for v >> u
4. accept θ if P(Y(θ) > v) lies within a 95% confidence interval about P(X > v), else reject
Or perhaps could use ABC-MCMC on (θ, v) with pseudo-prior on v
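The four EVT + ABC steps can be sketched as a simple rejection sampler. Everything specific here is an assumption made for illustration: the "complex model" is replaced by a toy exponential simulator with a single parameter (called `theta` in the code), the GPD is fitted by the method of moments rather than by maximum likelihood, and the 95% interval about P(X > v) is a Wilson interval for the empirical exceedance proportion in the observed data. The slides do not specify any of these choices.

```python
import math
import random

def simulate(theta, n, rng):
    """Hypothetical stand-in for the complex model: n exponential draws with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]

def gpd_tail_prob(y, u, v, min_exceedances=20):
    """Estimate P(Y > v), v > u, from a GPD fitted to the excesses of y over u.

    Method-of-moments fit (valid for shape < 1/2). Returns None when there
    are too few exceedances to fit.
    """
    excesses = [yi - u for yi in y if yi > u]
    k = len(excesses)
    if k < min_exceedances:
        return None
    mean = sum(excesses) / k
    var = sum((e - mean) ** 2 for e in excesses) / (k - 1)
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)
    z = (v - u) / scale
    if abs(shape) < 1e-9:
        survival = math.exp(-z)  # exponential limit of the GPD
    else:
        base = 1.0 + shape * z
        # base <= 0 means v lies beyond the fitted upper endpoint
        survival = base ** (-1.0 / shape) if base > 0.0 else 0.0
    return (k / len(y)) * survival

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson interval for a binomial proportion k/n."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def abc_evt(x, u, v, prior_draw, n_sim, n_draws, rng):
    """Rejection ABC that keeps parameter values whose tail probability matches the data."""
    lo, hi = wilson_interval(sum(xi > v for xi in x), len(x))
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)              # step 1: draw from the prior
        y = simulate(theta, n_sim, rng)      # step 2: simulate a dataset
        p = gpd_tail_prob(y, u, v)           # step 3: GPD estimate of P(Y > v)
        if p is not None and lo <= p <= hi:  # step 4: accept or reject
            accepted.append(theta)
    return accepted

rng = random.Random(2007)
x = simulate(2.0, 5000, rng)  # pretend "real" data, generated with theta = 2
accepted = abc_evt(x, u=4.0, v=10.0,
                   prior_draw=lambda r: r.uniform(0.5, 5.0),
                   n_sim=2000, n_draws=500, rng=rng)
print(len(accepted), "accepted; mean theta =", sum(accepted) / len(accepted))
```

In this sketch a draw for which the simulated data contain too few exceedances of u is simply rejected. A real application would need to decide how to treat such draws, and how to choose u and v.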

Y() ~ CSM(), likelihood of CSM intractable,  high dimensional Possible approach if simulation is slow & we do not have data…? EVT + GP: Run CSM for a relatively small set of parameter values  Assume y()|{y()> u} ~ GPD(()) Assume () ~ N(, ) Impose structure on  & fit by hierarchical Bayes Could be used to draw inferences about P(Y() > v) for v >> u, even if we have not simulated from CM()

Some references
Karlin, S., Dembo, A. and Kawabata, T. (1990) Statistical composition of high-scoring segments from molecular sequences. The Annals of Statistics, 18, 571-581.
Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology, 54, 59-75.
Gaines, S.D. and Denny, M.W. (1993) The largest, smallest, highest, lowest, longest, and shortest: extremes in ecology. Ecology, 74, 1677-1692.
Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments. Journal of Computational Biology, 6(1), 91-112.
Frommlet, F. and Futschik, A. (2004) On the dependence structure of sequence alignment scores calculated with multiple scoring matrices. Statistical Applications in Genetics and Molecular Biology, 3, article 24.
Katz, R.W., Brush, G.S. and Parlange, M.B. (2005) Statistics of extremes: modeling ecological disturbances. Ecology, 86, 1124-1134.
Wang, L. and Sen, P.K. (2006) Extreme value theory in some statistical analysis of genomic sequences. Extremes, 8, 295-310.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences USA, 87(6), 2264-2268.
Karlin, S. and Altschul, S.F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences USA, 90(12), 5873-5877.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.
Pearson, W.R. (1998) Empirical statistical estimates for sequence similarity searches. Journal of Molecular Biology, 276, 71-84.
Mott, R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. Journal of Molecular Biology, 300(3), 649-659.
Altschul, S.F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research, 29(2), 351-361.
Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Statistics for Biology and Health. Springer Verlag.
Park, Y. and Spouge, J.L. (2002) The correlation error and finite-size correction in an ungapped sequence alignment. Bioinformatics, 18(9), 1236-1242.
