The next generation of identification tools: interactive programs incorporating multivariate models

Pavel B. Klimov, Barry M. OConnor

University of Michigan, Museum of Zoology, 1109 Geddes Ave., Ann Arbor, MI

Context: The vast majority of interactive identification programs use a sequential approach to assign an unknown specimen to a known group. This algorithm works when the distinguishing characters do not have overlapping values. If the boundaries between taxa overlap, simultaneous (=probabilistic, matching) methods of identification are more likely to lead to the correct assignment, but these methods usually require time-consuming measurements or experiments. We discuss how the sequential approach can be enhanced by incorporating multivariate statistics into it.

1. INTRODUCTION

Computer-assisted interactive identification allows quick assignment of an unknown specimen to a known taxon with minimal costs in obtaining data and learning about the unknown. The number of characters used in the identification is substantially reduced compared to traditional taxonomic keys. For example, any of 128 taxa can be identified using only eight binary characters, or even fewer numeric or multistate characters. There are two major approaches to identification: sequential (=elimination, diagnostic) and simultaneous (=probabilistic, matching). In the sequential approach, only one character is used at each step of identification until the unknown specimen is assigned to a particular group. In the simultaneous approach, some or all characters are entered simultaneously, and the probability of group membership of the unknown specimen is calculated.
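The arithmetic behind the 128-taxon example can be checked with a short sketch. It assumes an idealized case that the text does not state explicitly: every character splits the remaining taxa evenly, so k characters with s states each can distinguish at most s^k taxa. The function name `min_characters` is hypothetical.

```python
def min_characters(n_taxa: int, n_states: int = 2) -> int:
    """Smallest k with n_states**k >= n_taxa (idealized, evenly splitting characters)."""
    k = 0
    capacity = 1  # number of taxa that k characters can tell apart
    while capacity < n_taxa:
        capacity *= n_states
        k += 1
    return k

print(min_characters(128))     # binary characters -> 7
print(min_characters(128, 4))  # four-state characters -> 4
```

Under this assumption seven perfectly splitting binary characters are the lower bound for 128 taxa, so the eight characters quoted above leave a margin of one, and multistate characters reduce the count further.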

The advantage of the sequential algorithm, particularly its multi-entry variant (=freedom to choose any character), is obvious when a taxon set is large and the taxa have distinct boundaries. At each step, taxa matching the unknown are retained and diagnostic characters for this subset are ordered according to their separating power. This algorithm has been implemented in a variety of widely used interactive identification programs such as DELTA and Lucid. In contrast, simultaneous methods usually require data obtained by time-consuming measurements or experiments and offer less freedom in choosing characters, but they are more likely to lead to the correct assignment if the boundaries between some or all taxa overlap.
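One step of the multi-entry sequential algorithm described above (retain the matching taxa, then re-rank the remaining characters) might look like the following minimal sketch. The taxon names, characters, and states are invented for illustration, and counting separated taxon pairs is only one simple measure of separating power; it is an assumption here, not the measure used by DELTA or Lucid.

```python
from itertools import combinations

# Hypothetical taxon-by-character matrix; all names and states are illustrative.
MATRIX = {
    "taxon A": {"setae": "long", "legs": "stout", "shield": "present"},
    "taxon B": {"setae": "long", "legs": "slender", "shield": "absent"},
    "taxon C": {"setae": "short", "legs": "stout", "shield": "absent"},
    "taxon D": {"setae": "short", "legs": "slender", "shield": "absent"},
}

def retain(taxa, character, state):
    """Keep only the taxa whose coded state matches the observation."""
    return [t for t in taxa if MATRIX[t][character] == state]

def separating_power(taxa, character):
    """Number of taxon pairs that the character tells apart (assumed measure)."""
    return sum(MATRIX[a][character] != MATRIX[b][character]
               for a, b in combinations(taxa, 2))

def rank_characters(taxa, used):
    """Order the unused characters from most to least useful for these taxa."""
    chars = [c for c in MATRIX[taxa[0]] if c not in used]
    return sorted(chars, key=lambda c: separating_power(taxa, c), reverse=True)

# The user observes short setae; two taxa remain and "legs" now separates them.
remaining = retain(list(MATRIX), "setae", "short")
print(remaining)
print(rank_characters(remaining, used={"setae"}))
```

Because the user may enter any character first, the ranking is only a suggestion; the elimination step works the same way whichever character is chosen.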

When a data set is large and contains taxa that cannot be completely separated using qualitative, univariate, or bivariate characters, a combination of both identification methods is required, with each approach handling the data appropriate to it.

2. MULTIVARIATE MODELS

Multivariate statistics summarizes variation in many variables in many specimens in the form of a concise model that contains essential and comprehensive information about the groups and that has predictive power. We consider two multivariate techniques that are usually used to analyze intergroup differences: canonical variates analysis (CVA), and binomial logistic regression (LR). Both analyses handle metric and non-metric independent variables.

A canonical variates function is a latent variable that is created as a linear combination of independent variables,

CV = b1*x1 + b2*x2 + ... + bn*xn + c (1),

where the b's are coefficients, the x's are independent variables, and c is a constant.

If there are n groups, n-1 CV's are calculated. For assignment purposes, the estimated posterior probability of group membership is calculated, or, when multivariate normality of the independent variables is assumed, the value of CV can be equivalently used.
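As an illustration of equation (1) for two groups (hence a single CV), the sketch below computes the CV score and assigns the specimen to the group whose mean CV score is closer. The coefficients, constant, and group centroids are made-up values, and the nearest-centroid rule is a simplifying assumption that corresponds to equal prior probabilities and equal within-group variance on the CV axis; it is not taken from the authors' models.

```python
def canonical_variate(x, b, c):
    """Equation (1): CV = b1*x1 + b2*x2 + ... + bn*xn + c."""
    if len(x) != len(b):
        raise ValueError("x and b must have the same length")
    return sum(bi * xi for bi, xi in zip(b, x)) + c

def assign_by_cv(score, centroids):
    """Pick the group whose mean CV score is closest (equal priors assumed)."""
    return min(centroids, key=lambda g: abs(score - centroids[g]))

# Hypothetical model: two measurements, two groups.
B = [0.8, -0.5]
C = -1.0
CENTROIDS = {"group 1": -1.5, "group 2": 1.5}

score = canonical_variate([4.0, 2.0], B, C)  # 0.8*4.0 - 0.5*2.0 - 1.0, about 1.2
print(round(score, 3), assign_by_cv(score, CENTROIDS))
```

With unequal priors or more than two groups, the posterior probabilities would have to be computed from the group means, priors, and within-group variance instead of a simple distance comparison.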

Logistic regression models can be expressed as the following equation,

P(0) = exp(b1*x1 + b2*x2 + ... + bn*xn + c)/(1+exp(b1*x1 + b2*x2 + ... + bn*xn + c)) (2),

where P(0) is the probability of an unknown specimen being taxon 0; the other notation is the same as for CVA above.

If P(0) exceeds 0.5, then the unknown belongs to taxon 0, otherwise to taxon 1.
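Equation (2) and the 0.5 decision rule can be written out as a small sketch. The coefficients and measurements are hypothetical values chosen only to make the example run; the function names are likewise illustrative.

```python
import math

def prob_taxon0(x, b, c):
    """Equation (2): P(0) = exp(z) / (1 + exp(z)), with z = b1*x1 + ... + bn*xn + c."""
    z = sum(bi * xi for bi, xi in zip(b, x)) + c
    if z >= 0:
        # Algebraically the same expression, arranged to avoid overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(x, b, c):
    """Taxon 0 if P(0) exceeds 0.5, otherwise taxon 1."""
    return "taxon 0" if prob_taxon0(x, b, c) > 0.5 else "taxon 1"

# Hypothetical coefficients and measurements; here z = 2.4 - 0.7 - 0.5 = 1.2.
b, c = [1.2, -0.7], -0.5
print(round(prob_taxon0([2.0, 1.0], b, c), 3), classify([2.0, 1.0], b, c))
```

Since P(0) exceeds 0.5 exactly when the linear predictor z is positive, the decision rule can also be read directly from the sign of z.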

A great advantage of LR over CVA is that it is a direct estimator of posterior probabilities: it calculates the class posterior probabilities without ever estimating the classes' individual density functions. Estimating those density functions requires additional data (group means, prior probabilities, and the value of mean square within groups).

3. INCORPORATING THE MODELS IN THE SEQUENTIAL ALGORITHM

Both (1) and (2) can be used in any sequential identification program as a single character, “Model classifies the unknown specimen to”, with the character states “group 1, group 2, …, group n”. The user, however, should simply be asked to enter the measurements or observations x1, x2, …, xn; the Bayesian probabilities of membership in each group are then calculated, and the greatest of these probabilities is used to classify the specimen.
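The idea of exposing a model as one ordinary character can be sketched as follows. The class `ModelCharacter`, the measurement names, and the coefficients are all hypothetical; the sketch only shows the flow described above: the user supplies x1…xn, the model returns group probabilities, and the most probable group becomes the character state.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass
class ModelCharacter:
    """A multivariate model wrapped as one character of a sequential key."""
    name: str
    inputs: Sequence[str]  # measurements the user is asked to enter
    posterior: Callable[[Sequence[float]], Dict[str, float]]

    def state(self, values: Sequence[float]) -> str:
        """Return the character state, i.e. the group with the greatest probability."""
        if len(values) != len(self.inputs):
            raise ValueError("one value is needed per requested measurement")
        probs = self.posterior(values)
        return max(probs, key=probs.get)

def two_group_logistic(values):
    """Hypothetical two-group logistic model in the form of equation (2)."""
    z = 1.2 * values[0] - 0.7 * values[1] - 0.5
    p1 = 0.5 * (1.0 + math.tanh(z / 2.0))  # identical to exp(z) / (1 + exp(z))
    return {"group 1": p1, "group 2": 1.0 - p1}

character = ModelCharacter(
    name="Model classifies the unknown specimen to",
    inputs=["body length", "seta length"],
    posterior=two_group_logistic,
)
print(character.state([2.0, 1.0]))
```

The rest of the key then treats the returned state like any other observed character state, so the elimination logic does not need to know that a model produced it.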

Implementation of the new data type will require some adjustment in the internal logic of an identification program. In the general case, some characters in the identification matrix can separate a subset of taxa without resorting to multivariate models. These characters, whether they are binary, multistate, or variable, should be given more weight than the complex character generated by a multivariate model. The latter should be coded only for the subset of taxa included in the model and coded as "missing" for all other taxa. Because a multivariate model may contain characters that are used elsewhere in the identification matrix, these matching characters should be cross-referenced.

4. RESULTS

  • When a data matrix contains both discrete and overlapping groups, the optimal way of identification is to combine sequential and probabilistic strategies, applying each to the appropriate data.
  • Canonical variates and logistic regression models can be used in the context of the sequential approach to calculate posterior probabilities and to classify the unknown specimen.
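The coding rules from section 3 (the model character is coded only for taxa included in the model, is "missing" elsewhere, and carries less weight than simple characters) might be sketched like this. The matrix, the weights, and the choice to let a missing value never eliminate a taxon are assumptions made for illustration; cross-referencing of shared characters is left out.

```python
MISSING = None  # stands for the "missing" coding of the model character

# Hypothetical matrix: taxon A is not part of the multivariate model.
MATRIX = {
    "taxon A": {"shield": "present", "model": MISSING},
    "taxon B": {"shield": "absent", "model": "group 1"},
    "taxon C": {"shield": "absent", "model": "group 2"},
}
# Simple characters outweigh the model-generated character (assumed values).
WEIGHTS = {"shield": 2.0, "model": 1.0}

def retain(taxa, character, state):
    """Keep matching taxa; a missing coding does not eliminate a taxon."""
    return [t for t in taxa if MATRIX[t][character] in (MISSING, state)]

def is_informative(taxa, character):
    """True if the character still shows more than one coded state."""
    states = {MATRIX[t][character] for t in taxa} - {MISSING}
    return len(states) > 1

def next_character(taxa, used):
    """Suggest the heaviest unused character that can still separate the taxa."""
    candidates = [c for c in WEIGHTS
                  if c not in used and is_informative(taxa, c)]
    return max(candidates, key=WEIGHTS.get) if candidates else None

taxa = list(MATRIX)
print(next_character(taxa, used=set()))          # simple character comes first
taxa = retain(taxa, "shield", "absent")
print(taxa, next_character(taxa, used={"shield"}))
```

In this toy matrix the model character is offered only after the simple character has narrowed the specimen down to the two taxa that the model actually covers.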

http://insects.ummz.lsa.umich.edu/beemites/Morphometrics.html

Research supported by NSF DEB-0118766 (PEET) and the USDA (CSREES #2002-35302-12654).