Speaker: Bin-Shenq Ho, Dec. 19, 2011

Inadequacies of Minimum Spanning Trees in Molecular Epidemiology
Stephen J. Salipante and Barry G. Hall
Journal of Clinical Microbiology, Oct. 2011, p. 3568–3575

Underlying Reasoning
• How representative is a single, arbitrarily selected MST of the potentially many equally optimal solutions?
• What role can statistical metrics play in assessing the credibility of MST estimates?

Materials and Methods
MSTgold (http://www.bellinghamresearchinstitute.com, http://web.me.com/barryghall/)
• Max amount of time
• Max number of unique MSTs
• Min rate of new discovery

Materials and Methods
Distance matrix calculation
• Equidistant method: sequence, spoligotype, SNP
• Difference method: VNTR

Spoligotype: spacer oligonucleotide type (http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

VNTR: variable number of tandem repeats (http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

MLST: multilocus sequence type
• The procedure characterizes isolates of bacterial species using the DNA sequences of internal fragments of multiple housekeeping genes.
• For each housekeeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the loci define the allelic profile or sequence type (ST).
• Nucleotide differences between strains can be checked at a variable number of genes depending on the degree of discrimination desired.
(http://en.wikipedia.org/wiki/Multilocus_sequence_typing)

Materials and Methods
MST estimation and MSN creation
• Kruskal’s algorithm, run repeatedly with randomized node input order
• The combination of all edges found in the unique MSTs constitutes the minimum spanning network (MSN).
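A minimal sketch of the two bullets above, not the implementation used in the paper. It assumes edges are given as hypothetical `(weight, u, v)` tuples, and it shuffles the edge list as a stand-in for node order randomization: both only change how ties between equal-length edges are broken. The function names `kruskal` and `sample_msts` are illustrative.

```python
import random

def kruskal(n, edges):
    """Kruskal's algorithm on n nodes; ties follow the order of `edges`."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # sorted() is stable, so equal-weight edges keep their shuffled order.
    for _, u, v in sorted(edges, key=lambda e: e[0]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((min(u, v), max(u, v)))
    return frozenset(tree)

def sample_msts(n, edges, runs, seed=0):
    """Collect unique MSTs over repeated runs and merge them into an MSN."""
    rng = random.Random(seed)
    edges = list(edges)
    unique = set()
    for _ in range(runs):
        rng.shuffle(edges)              # randomize tie-breaking order
        unique.add(kruskal(n, edges))
    msn = set().union(*unique)          # all edges seen in any unique MST
    return unique, msn

# Hypothetical data: four isolates joined in a ring by equal distances.
# Dropping any one of the four edges gives an MST, so four MSTs exist.
edges = [(1, 0, 1), (1, 1, 2), (1, 2, 3), (1, 0, 3)]
unique, msn = sample_msts(4, edges, runs=200)
```

A single run would report only one of the equally short trees; the repeated runs are what expose the alternatives and the network they form together.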

Materials and Methods
Estimating the number of possible MSTs through mark-recapture (Schnabel method)
N = [(M + 1)(C + 1)] ÷ (R + 1) - 1
N + 1 = [(M + 1)(C + 1)] ÷ (R + 1)
(M + 1) ÷ (N + 1) = (R + 1) ÷ (C + 1)
M: Mark, C: Current, R: Recapture
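A small worked example of the formula above. The reading of the symbols is an assumption, since the slide gives only one-word labels: M is taken as the number of unique MSTs already found (marked), C as the number found in the current cycle, and R as how many of those were already seen. The counts are hypothetical.

```python
def estimate_total(marked, current, recaptured):
    """N = [(M + 1)(C + 1)] / (R + 1) - 1, as written on the slide."""
    return (marked + 1) * (current + 1) / (recaptured + 1) - 1

# Hypothetical counts: 49 MSTs already known, 39 found in the current
# cycle, 19 of which had been seen before.
M, C, R = 49, 39, 19
N = estimate_total(M, C, R)

# The proportion form of the same relation: the marked fraction of the
# population matches the recaptured fraction of the current sample.
lhs = (M + 1) / (N + 1)
rhs = (R + 1) / (C + 1)
```

With these numbers the estimate is 99 possible MSTs, and both sides of the proportion equal 0.5.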

Materials and Methods
Bootstrapping
• Establishes the confidence level of a model
• 100 individual pseudoreplicates for each MST
• Bootstrap value expressed as the fraction of pseudoreplicates yielding the same inference as the original data
• Given enough information, there should be sufficiently redundant data that independent pseudoreplicates will yield analyses identical to that of the complete data set.
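The bootstrap bullets above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: pseudoreplicates are built by resampling loci (columns) with replacement, distances use the equidistant method, each pseudoreplicate yields a single MST with deterministic tie-breaking, and an edge counts as supported when it reappears in that tree. The profiles and the names `mst_edges` and `bootstrap_support` are hypothetical.

```python
import random
from itertools import combinations

def mst_edges(profiles):
    """One MST (Kruskal) over equidistant distances; ties broken by pair order."""
    n = len(profiles)
    edges = sorted(
        (sum(x != y for x, y in zip(profiles[i], profiles[j])), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = set()
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

def bootstrap_support(profiles, replicates=100, seed=0):
    """Fraction of pseudoreplicates whose MST contains each original edge."""
    rng = random.Random(seed)
    n_loci = len(profiles[0])
    original = mst_edges(profiles)
    counts = dict.fromkeys(original, 0)
    for _ in range(replicates):
        # Resample loci with replacement to build one pseudoreplicate.
        cols = [rng.randrange(n_loci) for _ in range(n_loci)]
        pseudo = [tuple(p[c] for c in cols) for p in profiles]
        tree = mst_edges(pseudo)
        for edge in original:
            if edge in tree:
                counts[edge] += 1
    return {edge: counts[edge] / replicates for edge in original}

# Hypothetical allelic profiles: 4 isolates scored at 6 loci.
profiles = [
    (1, 1, 1, 1, 1, 1),
    (1, 1, 1, 1, 1, 2),
    (2, 2, 2, 1, 1, 2),
    (2, 2, 2, 2, 2, 2),
]
support = bootstrap_support(profiles)
average = sum(support.values()) / len(support)  # average bootstrap value of the tree
```

Averaging the per-edge values gives one number per tree, which is one way to compare alternative MSTs of equal length.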

Bootstrap
Efron and Gong (1983)
Diaconis and Efron (1983)
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791
Inferring the variability in an unknown distribution from which your data were drawn by resampling from the data

Results
Estimating alternative MSTs
• Multiple, equally parsimonious solutions are possible.
• Kruskal’s MST algorithm is sensitive to node input order.
• The Schnabel method is appropriate for estimating the number of alternative MSTs, especially after discarding the early cycles of node order randomization.


The number of possible MSTs is proportional only to the number of minimal pairwise distances with equal lengths. There is a relationship between the number of possible MSTs and the method used to compute the pairwise distance matrix.

Note on distance matrix computation
• Equidistant method: sites scored merely as “same” or “different” such that any difference carries the same weight
• Difference method: distances between sites calculated on the basis of the difference between the values of the two sites
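The two methods can be compared on a toy data set. This is a sketch under assumptions: the difference method is taken here as the sum of absolute differences across loci, which the slide does not spell out, and the VNTR repeat counts are hypothetical.

```python
from itertools import combinations

def equidistant_distance(a, b):
    """Count the loci at which two profiles differ; any difference weighs 1."""
    return sum(1 for x, y in zip(a, b) if x != y)

def difference_distance(a, b):
    """Sum the absolute differences between the values at each locus (assumed form)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(profiles, dist):
    """Return {(i, j): distance} for every pair i < j."""
    return {
        (i, j): dist(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }

# Hypothetical VNTR repeat counts (rows: isolates, columns: loci).
profiles = [
    (2, 5, 3),
    (2, 7, 3),
    (4, 5, 3),
    (2, 5, 9),
]

equi = distance_matrix(profiles, equidistant_distance)
diff = distance_matrix(profiles, difference_distance)

# Fewer distinct values means more ties, and ties are what allow
# several equally short spanning trees.
distinct_equi = len(set(equi.values()))
distinct_diff = len(set(diff.values()))
```

On this toy data the equidistant method yields only two distinct distance values across the six pairs, while the difference method yields four, so it leaves fewer ties to be broken arbitrarily.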

There were significantly fewer alternative MSTs possible when the same data were processed using the difference method. There is a relationship between the type of data used and the number of possible alternative MSTs.

Results
Estimating alternative MSTs
• When there are limited numbers of informative sites and alleles are treated as equidistant from one another, there are many pairwise distances of the same length, and large numbers of MSTs are possible.
• Basing analyses on the arithmetic number of pairwise differences among individuals both limits the number of possible MSTs and more faithfully represents the genetic distances between individuals.

Results
Creating MSN
• Approximation by majority rule
  dashed line: edges present in ≥ 50% of MSTs
  solid line: edges present in 100% of MSTs
• Fraction ≠ Credibility
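The majority-rule approximation can be sketched as a simple edge count. The three trees below are hypothetical, each given as a set of `(u, v)` edges, and the dashed class is assumed to mean at least 50% but less than 100%.

```python
from collections import Counter

def edge_fractions(msts):
    """Fraction of MSTs in which each edge appears."""
    counts = Counter(edge for tree in msts for edge in tree)
    return {edge: c / len(msts) for edge, c in counts.items()}

def classify_edges(msts):
    """Split edges into solid (in every MST) and dashed (in >= 50%, not all)."""
    fractions = edge_fractions(msts)
    solid = {e for e, f in fractions.items() if f == 1.0}
    dashed = {e for e, f in fractions.items() if 0.5 <= f < 1.0}
    return solid, dashed

# Three hypothetical alternative MSTs over four isolates.
msts = [
    {(0, 1), (1, 2), (2, 3)},
    {(0, 1), (1, 2), (1, 3)},
    {(0, 1), (0, 2), (2, 3)},
]
solid, dashed = classify_edges(msts)
```

Here edge (0, 1) is drawn solid, edges (1, 2) and (2, 3) are drawn dashed, and the two edges found in only one tree are left out. These fractions describe how often an edge occurs among the alternative MSTs; they are not bootstrap values.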

Results
Estimating credibility of MSTs
Within any set of alternative MSTs examined, the individual trees demonstrated a considerable range of average bootstrap values. Although all MSTs in the MSN are equally parsimonious, some tree configurations are more statistically robust.

Results
Estimating credibility of MSTs
• By restricting analysis to a single, arbitrary MST, there is considerable risk of picking a tree with inferior credibility.
• By surveying and evaluating trees within the MSN, it is possible to identify those with more credible configurations.

Results Systematic approach to MST estimation

Discussion
• Failing to consider alternative solutions (MSTs) can easily mislead or confound our understanding of population structure.
• Molecular epidemiology has yet to adopt measures to evaluate the credibility of the estimation.
• Presenting a single MST neither explores the range of alternative hypotheses nor evaluates the quality of MSTs based on their relative credibilities.

Discussion: proposed approach to MST analysis
1. The distance matrix that maximizes the differences between individuals is calculated. For VNTR data, a distance matrix calculated by the difference method should be used, and for MLST data, distances should be computed from the underlying DNA sequence data.
2. Instead of returning a single, arbitrarily selected MST, the MSN (representing or approximating the entire population of alternative MSTs) is reported. The total number of possible MSTs is estimated using a mark-recapture calculation.

Discussion: proposed approach to MST analysis (continued)
3. A bootstrapping metric is employed to estimate the credibility of individual MSTs within the population of alternative solutions comprising the MSN. As many MSTs as time permits are subjected to bootstrap analysis so that the most reliable MST topology can be estimated and statistical support for particular relationships may be ascertained.
4. The most credible hypothesis or hypotheses within the larger population of MSTs are reported.

Thanks for Your Attention!
