Pavel B. Klimov, Barry M. OConnor
University of Michigan, Museum of Zoology, 1109 Geddes Ave., Ann Arbor, MI

The next generation of identification tools: interactive programs incorporating multivariate models

Context: The vast majority of interactive identification programs use a sequential approach to assign an unknown specimen to a known group. This algorithm works when the distinguishing characters do not have overlapping values. If the boundaries between taxa overlap, simultaneous (=probabilistic, matching) methods of identification are more likely to lead to the correct assignment, but these methods usually require time-consuming measurements or experiments. We discuss how the sequential approach can be enhanced by incorporating multivariate statistics into it.

1. INTRODUCTION

Computer-assisted interactive identification allows quick assignment of an unknown specimen to a known taxon, with minimal cost in obtaining data and learning about the unknown. The number of characters used in an identification is substantially reduced compared to traditional taxonomic keys. For example, in the ideal case any of 128 taxa can be identified using only seven binary characters (2^7 = 128), or even fewer numeric or multistate characters.

There are two major approaches to identification: sequential (=elimination, diagnostic) and simultaneous (=probabilistic, matching). In the sequential approach, only one character is used at each step of identification until the unknown specimen is assigned to a particular group. In the simultaneous approach, some or all characters are entered simultaneously, and the probability of group membership of the unknown specimen is calculated.

The advantage of the sequential algorithm, particularly its multi-entry variant (=freedom to choose any character), is obvious when a taxon set is large and the taxa have distinct boundaries.
At each step, taxa matching the unknown are retained, and the diagnostic characters for this subset are ordered according to their separating power. This algorithm has been implemented in a variety of interactive identification programs, such as DELTA and Lucid, that are widely used at present. In contrast, simultaneous methods usually require data obtained by time-consuming measurements or experiments and offer less freedom in choosing characters, but they are more likely to lead to the correct assignment if the boundaries between some or all taxa overlap.

When a data set is large and contains taxa that cannot be completely separated using qualitative, univariate, or bivariate characters, both methods of identification must be combined, with each approach handling the data it suits.

2. MULTIVARIATE MODELS

Multivariate statistics summarizes the variation of many variables across many specimens as a concise model that contains essential and comprehensive information about the groups and that has predictive power. We consider two multivariate techniques that are commonly used to analyze intergroup differences: canonical variates analysis (CVA) and binomial logistic regression (LR). Both analyses handle metric and non-metric independent variables.

A canonical variates function is a latent variable created as a linear combination of the independent variables,

CV = b1*x1 + b2*x2 + ... + bn*xn + c    (1)

where the b's are coefficients, the x's are independent variables, and c is a constant. If there are k groups, k-1 CVs are calculated. For assignment purposes, the estimated posterior probability of group membership is calculated or, when multivariate normality of the independent variables is assumed, the value of CV can be used equivalently.

Logistic regression models can be expressed as the following equation,

P(0) = exp(b1*x1 + b2*x2 + ... + bn*xn + c) / (1 + exp(b1*x1 + b2*x2 + ... + bn*xn + c))    (2)

where P(0) is the probability of an unknown specimen being taxon 0; the other notation is the same as for CVA above. If P(0) exceeds 0.5, the unknown is assigned to taxon 0, otherwise to taxon 1.

A great advantage of LR over CVA is that it is a direct estimator of posterior probabilities: it calculates the class posterior probabilities without ever estimating the classes' individual density functions. Estimating those densities, as CVA-based assignment does, requires additional data (group means, prior probabilities, and the value of the mean square within groups).

3. INCORPORATING THE MODELS IN THE SEQUENTIAL ALGORITHM

Both (1) and (2) can be used in any sequential identification program as a single character, “Model classifies the unknown specimen to”, with the character states “group 1, group 2, …, group k”. The user, however, should simply be asked to enter the measurements or observations x1, x2, …, xn; the Bayesian probabilities of membership in each group are then calculated, and the greatest of these probabilities is used to classify the specimen.

Implementing the new data type will require some adjustment to the internal logic of an identification program. In the general case, some characters in the identification matrix can separate a subset of taxa without recourse to multivariate models. These characters, whether they are binary, multistate, or variable, should be given more weight than the complex character generated by a multivariate model. The latter should also be coded only for the subset of taxa included in the model; for all other taxa this character should be coded as "missing". Because a multivariate model may contain characters that are used elsewhere in the identification matrix, these matching characters should be cross-referenced.

Results

• When a data matrix contains both discrete and overlapping groups, the best way to identify specimens is to combine sequential and probabilistic strategies, applying each to the data it suits.
• Canonical variates and logistic regression models can be used in the context of the sequential approach to calculate posterior probabilities and to classify the unknown specimen.

http://insects.ummz.lsa.umich.edu/beemites/Morphometrics.html

Research supported by NSF DEB-0118766 (PEET) and the USDA (CSREES #2002-35302-12654).
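
As an illustration of Section 3, the following minimal Python sketch shows one way the logistic regression model of equation (2) could be exposed to a sequential program as the single character “Model classifies the unknown specimen to”. It is not the authors' implementation. The function names, the coefficients, the measurements, and the taxon labels are all hypothetical, and the rule that taxa coded as "missing" for the model character are retained is an assumption about how the host program treats missing codes.

```python
import math


def logistic_probability(x, b, c):
    """P(0) from equation (2): exp(z) / (1 + exp(z)), z = b1*x1 + ... + bn*xn + c."""
    if len(x) != len(b):
        raise ValueError("x and b must have the same length")
    z = sum(bi * xi for bi, xi in zip(b, x)) + c
    # Algebraically equal to exp(z) / (1 + exp(z)), arranged to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def model_character_state(x, b, c, taxa=("taxon 0", "taxon 1")):
    """Return (state, probability) for the derived model character.

    Taxon 0 is chosen when P(0) exceeds 0.5, otherwise taxon 1.
    """
    p0 = logistic_probability(x, b, c)
    if p0 > 0.5:
        return taxa[0], p0
    return taxa[1], 1.0 - p0


def filter_taxa(candidates, model_taxa, state):
    """One sequential step using the model character.

    Taxa outside the model are coded "missing" for this character and are
    therefore retained (an assumption, not stated in the source).
    """
    return [t for t in candidates if t not in model_taxa or t == state]


# Hypothetical coefficients and measurements, for illustration only.
b = [0.8, -1.2]
c = 0.5
x = [2.0, 1.0]

state, prob = model_character_state(x, b, c)
remaining = filter_taxa(
    ["taxon 0", "taxon 1", "taxon 2"], {"taxon 0", "taxon 1"}, state
)
print(state, round(prob, 3), remaining)
```

The user supplies only the measurements; the program computes the probability and then treats the resulting state like any other character when retaining or eliminating taxa.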
