A linear least squares framework for learning ordinal classes
Ioannis Mariolis, PhD

Outline • Introduction to Ordinal Data Modeling • Generalized Linear Models • Ordinary Least Squares (OLS) Regression • Ordinal Logistic Regression (OLR) • Linear Classifier of Ordinal Classes • learns a linear model • modifies OLS regression • Experimental Results • synthetic datasets • real datasets • visual features • textile seam quality control • Conclusions

Ordinal Data Modeling • A collection of measurements is called data • Modeling means building a model to fit the data • The term ordinal refers to the scale of measurement of the data

Scales of Measurement • Measurement is the assignment of numbers to objects or events in a systematic fashion • Four levels of measurement scales are commonly distinguished • Nominal • Ordinal • Interval • Ratio

Nominal Scale • Nominal measurement consists of assigning items to groups or categories • No quantitative information is conveyed and no ordering of the items is implied • qualitative rather than quantitative • Variables measured on a nominal scale are often referred to as categorical or qualitative variables

Ordinal Scale • Measurements with ordinal scales are ordered • higher numbers represent higher values • The intervals between the numbers are not necessarily equal • There is no "true" zero point for ordinal scales • the zero point is chosen arbitrarily

Interval Scale • On interval scales, one unit represents the same magnitude across the whole range of the scale • Interval scales do not have a "true" zero point • It is not possible to make statements about how many times higher one score on that scale is than another • e.g. the Celsius scale for temperature • equal differences on this scale represent equal differences in temperature • but a temperature of 30 degrees is not twice as warm as one of 15 degrees

Ratio Scale • Ratio scales are like interval scales except that they have true zero points • e.g. the Kelvin scale of temperature • this scale has an absolute zero • a temperature of 300 K is twice as high as a temperature of 150 K • Example: Earth's mean temperature is about 14 °C (287 K) and is inversely proportional to the square root of the Earth-Sun distance • doubling the distance therefore lowers the temperature by a factor of about 1.4 • the calculation must be made in Kelvin: 287 / 1.4 = 205 K, a drop of 82 degrees • the new temperature would be -68 °C, not 14 / 1.4 = 10 °C
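The arithmetic on this slide can be checked with a short Python sketch. It uses the slide's rounded values (an offset of 273 between Celsius and Kelvin, and 1.4 for the square root of 2); the function names are illustrative only.

```python
# Worked version of the slide's example, using the slide's rounded numbers.
KELVIN_OFFSET = 273  # rounded Celsius-to-Kelvin offset, as implied by 14 °C = 287 K


def celsius_to_kelvin(celsius):
    return celsius + KELVIN_OFFSET


def kelvin_to_celsius(kelvin):
    return kelvin - KELVIN_OFFSET


mean_temp_c = 14
factor = 1.4  # rounded sqrt(2): effect of doubling the Earth-Sun distance

# Correct: divide on the ratio scale (Kelvin), then convert back.
new_temp_k = celsius_to_kelvin(mean_temp_c) / factor
new_temp_c = kelvin_to_celsius(new_temp_k)

# Incorrect: dividing directly on the interval scale (Celsius).
wrong_temp_c = mean_temp_c / factor

print(round(new_temp_k), round(new_temp_c), round(wrong_temp_c))
```

Ratios are only meaningful on the scale with a true zero, which is why the division is applied to the Kelvin value.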

Classification to Ordinal Classes • Pattern classification addresses the issue of assigning objects to different categories called classes • Most often those classes are of nominal scale • discrete classes with no established relationship among them • In some cases, additional information regarding the arrangement of the classes is available • e.g. an order among the classes is exhibited • in that case the predicted classes are of ordinal scale • classification is bridged to metric regression, which is applied to variables measured on interval or ratio scales, in a setting called ranking learning or ordinal regression

State of the Art • Ordinal regression problems have been addressed in both the machine learning and the statistics domains • The approaches fall into four groups: extending binary classifiers, extending SVM classifiers, explicitly ordinal approaches, and treating ordinal data as numeric • In Frank (2001) the ordering of the classes was encoded by a set of nested binary classifiers • the classification results were organized for prediction accordingly • A constrained classification approach, based on binary classifiers, was proposed in Har-Peled (2003) • A loss function between pairs of ranks was used in Herbrich (2000) • employing distribution-independent methods • Modifications of support vector machines have been proposed in Shashua (2003), Chu (2005), Pelckmans (2006) • incorporating information regarding the order of the classes into the design of SVMs • A probabilistic kernel approach to ordinal regression was proposed by Chu (2005) • In McCullagh (1980) multinomial logistic regression is extended to ordinal data by using cumulative probabilities • proportional odds model • proportional hazards model • In Tutz (2003) generalized additive models were extended into a semi-parametric approach • based on the maximization of the penalized log-likelihood • the parameters are chosen by minimizing the Akaike criterion • In Johnson (1999) sampling techniques were employed in order to apply Bayesian inference to parametric models for ordinal data • In Krammer (2001) and Torra (2006) the ordinal values are transformed into numeric ones, and then standard metric regression analysis is performed • Ordinary Least Squares is implied whenever metric regression is mentioned

Generalized Linear Models • GLMs are a generalization of OLS regression • they were formulated as a way of unifying under one framework • linear regression • logistic regression • Poisson regression • a general algorithm for maximum likelihood estimation in all these models has been developed • According to GLM theory • a linear predictor is related to the distribution function of the dependent variables through a link function • each outcome of the dependent variables, Y, is assumed to be generated from a particular exponential-type probability density function • Normal, Binomial, Poisson distributions, etc. • The mean, μ, of the distribution depends on the independent variables, x, through $E\{Y\} = \mu = g^{-1}(x b)$, where E{Y} is the expected value of Y; g is the link function; b are the unknown weights of the linear model • The unknown weights b, also called regression coefficients, are typically estimated with maximum likelihood or Bayesian techniques • If Y follows the Normal distribution and g is the identity function, the GLM is the standard linear regression model • In the context of this presentation, x corresponds to feature vectors and Y to classes
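As a minimal illustration of how the link function connects the linear predictor to the mean, the sketch below evaluates the inverse of three common link functions on one linear predictor. The feature vector and weights are made-up values, and the function names are not taken from the presentation.

```python
import numpy as np


def inverse_identity(eta):
    # Identity link: used with the Normal distribution (standard linear regression).
    return eta


def inverse_logit(eta):
    # Logit link: used with the Binomial distribution (logistic regression).
    return 1.0 / (1.0 + np.exp(-eta))


def inverse_log(eta):
    # Log link: used with the Poisson distribution (Poisson regression).
    return np.exp(eta)


# Hypothetical feature vector x (with a unit element) and weights b.
x = np.array([1.0, 0.5, -2.0])
b = np.array([0.2, 1.0, 0.3])

eta = x @ b  # linear predictor: 0.2 + 0.5 - 0.6 = 0.1

mu_normal = inverse_identity(eta)
mu_binomial = inverse_logit(eta)
mu_poisson = inverse_log(eta)
print(eta, mu_normal, mu_binomial, mu_poisson)
```

The same linear predictor yields a different mean under each link, which is the sense in which the three regression models share one framework.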

Ordinary Least Squares • The simplest GLM, and a very popular one • The distribution function is the normal distribution with constant variance and the link function is the identity • Unlike most other GLMs, the maximum likelihood estimates of the linear weights are available in closed form • X is the matrix consisting of all available feature vectors x • Y is the vector consisting of the observed values of the dependent variable • The model's linear weights b are given by $b = (X^T X)^{-1} X^T Y$
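The closed-form solution can be sketched in a few lines of NumPy. The data below are randomly generated for illustration only; the normal-equation result is compared with `np.linalg.lstsq`, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 2 features plus a unit element for the intercept.
X = np.hstack([rng.uniform(size=(50, 2)), np.ones((50, 1))])
b_true = np.array([2.0, -1.0, 0.5])
Y = X @ b_true + 0.01 * rng.standard_normal(50)

# Closed form: solve (X^T X) b = X^T Y instead of inverting X^T X explicitly.
b_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b_normal_eq)
print(b_lstsq)
```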

Ordinary Least Squares (cont.) • OLS is designed to process interval or ratio variables • OLS estimates are likely to be satisfactory from a statistical perspective when an ordinal-level variable is examined • if it is measured in a relatively high number of ascending categories • if it can be assumed that the interval each category represents is the same as the prior interval • Thus, OLS can be applied to ordinal measurements treated as if they were interval • however, it is most likely that some of the assumptions of the Gauss-Markov theorem are then not met and the regression is not the Best Linear Unbiased Estimator

Ordinal Logistic Regression • Explicitly takes into account an ordered categorical dependent variable and does not assume any specific distance among the categories • Different regression models applicable to ordinal measurements have been proposed • here the proportional odds model is assumed • As in multinomial logistic regression (MLR), in OLR • a multinomial distribution is assumed • the logit is selected as the link function • The main difference between MLR and OLR is that rather than estimating the probability of a single category, OLR estimates a cumulative probability • i.e. the probability that the outcome is equal to or less than the category of interest c • c denotes the integer values used to label the classes

Ordinal Logistic Regression (cont.) • Using the logit equation, the probability of each instance belonging to each class can be estimated • the proportional odds model employs the logit equation of the cumulative probability • The threshold values are different for each category • The weights of the linear model contained in vector b are assumed to remain constant for every category • A log-likelihood function (LL) is created and the parameter values that maximize that function are estimated using computational methods
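The step from cumulative probabilities to per-class probabilities can be sketched as follows. The slide's own equation is not reproduced in the transcript, so the sign convention used here, logit P(Y ≤ c | x) = θ_c − x b, is an assumption (other texts use θ_c + x b), and the thresholds and weights are made-up values.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def class_probabilities(x, b, thresholds):
    """Per-class probabilities under a proportional odds model.

    ASSUMPTION: logit P(Y <= c | x) = theta_c - x b, with increasing thresholds
    theta_1 < ... < theta_{K-1} and one weight vector b shared by all categories.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    cumulative = sigmoid(thresholds - x @ b)  # P(Y <= c) for c = 1..K-1
    cumulative = np.concatenate([[0.0], cumulative, [1.0]])
    return np.diff(cumulative)  # P(Y = c) for c = 1..K


# Hypothetical instance, weights and thresholds for K = 4 classes.
x = np.array([0.4, -1.2])
b = np.array([1.5, 0.7])
thresholds = [-1.0, 0.0, 1.5]

probs = class_probabilities(x, b, thresholds)
predicted_class = int(np.argmax(probs)) + 1
print(probs, predicted_class)
```

Because b is shared and only the thresholds differ, the model has as many parameters as there are weights plus K − 1 thresholds.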

Linear Classifier of Ordinal Classes • Performs a numerical mapping of the K ordered classes $\omega_1, \dots, \omega_K$ into real numbers • Classification is based on the assumption of a linear relationship between • the numerical input vectors and • the numerical values assigned to the ordered classes • A linear output y is produced as the dot product of input vector x and vector b containing the weights of the linear model • The output o is the class $\omega_j$ whose assigned numerical value is the nearest to the linear output y; j is given by $j = \arg\min_{k} |y - z_k|$, where $z_k$ is the numerical value assigned to class $\omega_k$ • As in metric regression, a numerical mapping is needed and the results do not correspond to probabilities
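The decision rule described above, picking the class whose numerical value is nearest to the linear output, can be sketched as follows. The vector name `z` for the class values is borrowed from the training slides; all numbers are illustrative.

```python
import numpy as np


def predict_class(x, b, z):
    """Return the 1-based index j of the class whose value z_j is nearest to y = x b."""
    y = x @ b
    return int(np.argmin(np.abs(y - np.asarray(z)))) + 1


# Hypothetical weights and input vector (last element is the unit element).
b = np.array([0.5, 1.0, 1.0])
x = np.array([2.0, 1.2, 1.0])  # y = 1.0 + 1.2 + 1.0 = 3.2

equally_spaced = np.array([1.0, 2.0, 3.0, 4.0])
unequally_spaced = np.array([1.0, 1.5, 3.9, 4.0])

print(predict_class(x, b, equally_spaced))
print(predict_class(x, b, unequally_spaced))
```

With equally spaced integer values the rule reduces to rounding y; with unequal spacing the decision boundaries shift to the midpoints between neighbouring class values.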

Training LCOC-the naïve case • Arbitrary consecutive numbers are assigned to the ordered classes • The linear output of the classifier is xb, where vector b has been estimated by minimizing the Sum of Squared Errors, $\mathrm{SSE} = \lVert Xb - t \rVert^2$ • matrix X is the design matrix consisting of all available input vectors, t denotes the vector of the corresponding targets • Then $b = (X^T X)^{-1} X^T t$
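A minimal sketch of the naïve case follows, assuming the consecutive numbers 1, ..., K are used as targets. The data set is generated here purely for illustration and is not one of the datasets from the presentation.

```python
import numpy as np


def fit_naive(X, labels):
    # Targets are the class labels themselves (consecutive integers 1..K).
    t = np.asarray(labels, dtype=float)
    b, *_ = np.linalg.lstsq(X, t, rcond=None)  # minimizes ||X b - t||^2
    return b


def predict_naive(X, b, n_classes):
    # Nearest consecutive integer, clipped to the valid label range.
    y = X @ b
    return np.clip(np.rint(y), 1, n_classes).astype(int)


# Hypothetical data: 4 ordered classes obtained by quantizing a linear score.
rng = np.random.default_rng(1)
features = rng.uniform(size=(200, 3))
X = np.hstack([features, np.ones((200, 1))])
scores = features @ np.array([1.0, 2.0, 3.0])
labels = np.digitize(scores, np.quantile(scores, [0.25, 0.5, 0.75])) + 1

b = fit_naive(X, labels)
predictions = predict_naive(X, b, 4)
accuracy = np.mean(predictions == labels)
print(b, accuracy)
```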

Training LCOC-the proposed case • Target vector t is decomposed into a product, t = Sz, of • a known matrix S coding the target classes of the training samples • and a parameter vector z containing the numerical values assigned to the K classes • SSE becomes $\mathrm{SSE} = \lVert Xb - Sz \rVert^2$ • where the first and last elements of z are fixed to the bounding values A1 and AK and the remaining elements ζ are unknown • SSE is then minimized with respect to both b and the unknown ζ values • the selection of A1 and AK does not affect the classification results • Least Squares Ordinal Classification (LSOC)

Training LCOC-the proposed case (cont.) • Since SSE is quadratic with respect to b and z, setting the partial derivatives of SSE to zero results in a system of linear equations: $b = (X^T X)^{-1} X^T S z$ and $\zeta_k = \frac{1}{N_k} \sum_{i \in \omega_k} x_i b$, where $N_k$ is the number of training samples of class $\omega_k$ and $x_i$ is the i-th row of X • if the estimated z parameters were also employed by OLS, the same b parameters would have been estimated by both training methods • the estimated ζ values are in fact the intra-class average values of the linear outputs • By substituting the b vector given by the first equation into the second, the system reduces to linear equations in the unknown ζ values only • Least Squares Ordinal Classification (LSOC)
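One way to sketch the joint minimization is to stack b and the unknown interior class values into a single least squares problem. This is an illustrative reformulation, not necessarily the derivation used in the presentation: it assumes S is a one-hot coding of the classes, that the first and last elements of z are fixed to A1 and AK, and all names (`one_hot`, `fit_lsoc`, `a_first`, `a_last`) are hypothetical.

```python
import numpy as np


def one_hot(labels, n_classes):
    # ASSUMPTION: S codes each sample's class as a one-hot row.
    S = np.zeros((len(labels), n_classes))
    S[np.arange(len(labels)), np.asarray(labels) - 1] = 1.0
    return S


def fit_lsoc(X, labels, n_classes, a_first=1.0, a_last=None):
    """Minimize ||X b - S z||^2 over b and the interior elements of z."""
    if a_last is None:
        a_last = float(n_classes)
    S = one_hot(labels, n_classes)
    S_mid = S[:, 1:-1]                          # columns of the unknown values
    c = a_first * S[:, 0] + a_last * S[:, -1]   # contribution of the fixed values
    # ||X b - S z||^2 = ||[X, -S_mid] [b; zeta] - c||^2
    A = np.hstack([X, -S_mid])
    solution, *_ = np.linalg.lstsq(A, c, rcond=None)
    b = solution[:X.shape[1]]
    zeta = solution[X.shape[1]:]
    z = np.concatenate([[a_first], zeta, [a_last]])
    return b, z


# Hypothetical data: 4 ordered classes obtained by quantizing a linear score.
rng = np.random.default_rng(1)
features = rng.uniform(size=(200, 3))
X = np.hstack([features, np.ones((200, 1))])
scores = features @ np.array([1.0, 2.0, 3.0])
labels = np.digitize(scores, np.quantile(scores, [0.25, 0.5, 0.75])) + 1

b, z = fit_lsoc(X, labels, n_classes=4)
linear_outputs = X @ b
print(z)
print([linear_outputs[labels == k].mean() for k in (2, 3)])
```

The two printed lines illustrate the property stated on the slide: each estimated interior value coincides with the average linear output of its class.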

Invariant Error Measure • When the numerical values of the classes are not fixed, the classification results depend not only on the magnitude of the error but also on the distance among the classes • A measure is proposed that is also minimized by the LSOC training method • However, unlike SSE, it • takes into account the distance between the classes • is invariant to the selection of the bounding values A1 and AK

Experimental Evaluation • Both synthetic and real datasets are examined • Synthetic input vectors were produced by means of a random number generator • an arbitrary linear model produces linear targets • quantizing the linear targets produces class targets • quantization levels correspond to ordered classes • initial error is introduced into the linear model only by quantization • the performance of the proposed training method was also assessed in the case of weaker linear dependency • Additive White Gaussian Noise (AWGN) was introduced into the linear model before quantization • Real datasets involve visual inspection of seam specimens classified into five grades of quality • the critical assumption of linear dependency is unverified • if it is not valid, the classification accuracy of LSOC is anticipated to be as poor as that of OLS or even worse • the produced results were also compared to those of Ordinal Logistic Regression (OLR) • OLR is a good choice for comparison, since its model employs the same number of parameters as LSOC • however, OLR relies on computational methods to estimate these parameters, whereas LSOC employs a closed-form solution

Synthetic Datasets • Using a uniform random number generator, the following were artificially generated • 1000 5-dimensional input vectors • the vectors were augmented by adding an extra unit element • grouped into a design matrix of size 1000×6 • 6 arbitrary values were randomly selected as the weights of the linear model • the design matrix was multiplied by the weight vector to create the vector of linear targets • consisting of 1000 values linearly dependent on the corresponding input vectors • the elements of the linear target vector were placed in monotonically increasing order by rearranging the rows of the design matrix accordingly • The 1st synthetic dataset contains 10 ordered classes with 100 input vectors in each class • the 1000 input vectors were grouped together in hundreds • the first 100 input vectors of the design matrix were assigned to the first class, and so on until the 10th class • The 2nd synthetic dataset used the same design matrix and vector of linear weights • the 1st and the 2nd class were assigned 300 input vectors each • the 8 remaining classes were assigned 50 vectors each • the class targets of the input vectors are therefore different for the second dataset
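The construction described above can be sketched in NumPy as follows. The seed, the position of the unit element (last column) and the distribution of the weights are assumptions, since the slide only says the values were randomly selected.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# 1000 five-dimensional input vectors, augmented with a unit element -> 1000 x 6.
inputs = rng.uniform(size=(1000, 5))
X = np.hstack([inputs, np.ones((1000, 1))])

# ASSUMPTION: the 6 arbitrary weights are drawn uniformly at random.
weights = rng.uniform(size=6)
linear_targets = X @ weights

# Rearrange the rows so that the linear targets are monotonically increasing.
order = np.argsort(linear_targets)
X = X[order]
linear_targets = linear_targets[order]

# 1st dataset: 10 ordered classes with 100 vectors each.
labels_1 = np.repeat(np.arange(1, 11), 100)

# 2nd dataset: classes 1 and 2 get 300 vectors each, the other 8 get 50 each.
labels_2 = np.repeat(np.arange(1, 11), [300, 300] + [50] * 8)

print(X.shape, np.bincount(labels_1)[1:], np.bincount(labels_2)[1:])
```

Both label vectors refer to the same sorted design matrix, so only the quantization of the linear targets differs between the two datasets.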

Synthetic Datasets • Euclidean distance of the z values from the normalized centers • 1st dataset • LSOC: 0.05 • OLS: 0.32 • 2nd dataset • LSOC: 0.54 • OLS: 0.90

Synthetic Datasets • R² denotes the coefficient of determination • CA denotes Classification Accuracy • V denotes 10-fold Cross-Validation

Synthetic Datasets • [Figure: panels for the 1st synthetic dataset and the 2nd synthetic dataset] • AWGN was introduced into the estimation of the linear targets • The Mean Distance (MD) among the classes was calculated • the standard deviation of the added noise ranged from 5% of MD to 100% of MD • in increments of 5% of MD • Thus, for each dataset 20 different cases with increasing noise ratios were constructed and tested
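The noise schedule can be sketched as below. The transcript does not define MD precisely, so this sketch assumes it is the mean gap between consecutive class centers of the noise-free linear targets, and it assumes the noisy targets are re-quantized into classes of the same sizes; both choices, and all function names, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(size=(1000, 5)), np.ones((1000, 1))])
weights = rng.uniform(size=6)
clean_targets = X @ weights


def mean_class_distance(targets, class_sizes):
    # ASSUMPTION: MD is the mean gap between consecutive class centers.
    sorted_targets = np.sort(targets)
    edges = np.cumsum(class_sizes)[:-1]
    centers = np.array([chunk.mean() for chunk in np.split(sorted_targets, edges)])
    return np.mean(np.diff(centers))


def noisy_labels(targets, class_sizes, sigma, rng):
    # Add AWGN before quantization, then quantize by rank into the given class sizes.
    noisy = targets + rng.normal(0.0, sigma, size=targets.shape)
    ranks = np.argsort(np.argsort(noisy))
    boundaries = np.cumsum(class_sizes)
    return np.searchsorted(boundaries, ranks, side="right") + 1


class_sizes = [100] * 10
md = mean_class_distance(clean_targets, class_sizes)

# Noise standard deviation from 5% to 100% of MD in 5% steps -> 20 cases.
ratios = np.arange(1, 21) * 0.05
cases = [noisy_labels(clean_targets, class_sizes, r * md, rng) for r in ratios]
print(len(cases), ratios[0], ratios[-1])
```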

Real Datasets • Image database of 325 seam specimens, belonging to three different types of fabric • Specimen size approximately 20×4 cm • A committee of three experts labelled each specimen by assigning a grade denoting the quality of the seam • 1 (worst) to 5 (best) • For each specimen three ratings are assigned • the median is selected as the actual grade • the average agreement of each expert with the median ratings was 80.3% ± 1.8% • 3 different feature sets, all based on intensity curves • Roughness features • FFT features • Fractal features • 4 different features in each set
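The labelling rule, taking the median of the three expert ratings as the grade and measuring each expert's agreement with it, can be sketched as follows. The ratings below are made up for illustration and are not taken from the seam database.

```python
import numpy as np

# Hypothetical ratings: one row per specimen, one column per expert (grades 1..5).
ratings = np.array([
    [3, 3, 4],
    [5, 4, 4],
    [2, 1, 2],
    [1, 1, 1],
    [4, 5, 3],
])

# The median of the three ratings is taken as the actual grade.
grades = np.median(ratings, axis=1).astype(int)

# Fraction of specimens on which each expert agrees with the median grade.
agreement = (ratings == grades[:, None]).mean(axis=0)

print(grades, agreement, agreement.mean())
```

With an odd number of raters the median is always one of the assigned grades, so no rounding is needed.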

Textile Seam Quality Control ISO 7700 Standard

[Figure: panels (a) to (e), shown before and after the pre-processing step]

Intensity Curves • [Figure: intensity curves I(1) to I(4) and S(1) to S(4), with the label "image row"] • Mean intensity values (column-wise)

Roughness Features Feature Extraction • Moving Average filter • Intensity Deviation

FFT Features Feature Extraction • Using the first 40 FFT coefficients produced from each intensity curve • Applying averaging using different window centers and sizes • Selecting the window settings that present the highest correlation with the quality grades
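The three steps listed above could be sketched as follows. Several details are assumptions, since the slide does not specify them: coefficient magnitudes are used, a window is described by a center index and a size, and Pearson correlation is the selection criterion. The curves are synthetic and all function names are hypothetical.

```python
import numpy as np


def fft_magnitudes(curve, n_coeffs=40):
    # ASSUMPTION: features are built from the magnitudes of the first coefficients.
    return np.abs(np.fft.rfft(curve)[:n_coeffs])


def window_feature(magnitudes, center, size):
    # Average of the magnitudes inside a window given by its center and size.
    lo = max(center - size // 2, 0)
    hi = min(lo + size, len(magnitudes))
    return magnitudes[lo:hi].mean()


def best_window(all_magnitudes, grades, centers, sizes):
    # Keep the window whose feature correlates most strongly with the grades.
    best = None
    for center in centers:
        for size in sizes:
            feature = np.array([window_feature(m, center, size) for m in all_magnitudes])
            r = abs(np.corrcoef(feature, grades)[0, 1])
            if np.isfinite(r) and (best is None or r > best[0]):
                best = (r, center, size)
    return best


# Synthetic intensity curves: the amplitude at frequency bin 5 grows with the grade.
rng = np.random.default_rng(0)
n = np.arange(256)
grades = np.tile(np.arange(1, 6), 10)
curves = [g * np.sin(2 * np.pi * 5 * n / 256) + 0.1 * rng.standard_normal(256)
          for g in grades]

all_magnitudes = [fft_magnitudes(curve) for curve in curves]
best_r, best_center, best_size = best_window(
    all_magnitudes, grades, centers=range(40), sizes=(1, 3, 5))
print(best_r, best_center, best_size)
```

In this toy setting the selected window covers bin 5, the only frequency that carries grade information.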
