Modeling the probability of a binary outcome


Presentation Transcript


  1. Alla Stolyarevska The Eastern-Ukrainian Branch of the International Solomon University, Kharkov, Ukraine Modeling the probability of a binary outcome

  2. The annotation

  3. Supervised and unsupervised learning models

  4. The prerequisites for the course

  5. Dichotomous variables Many variables in the real world are dichotomous: for example, consumers make a decision to buy or not buy, a product may pass or fail quality control, a credit risk may be good or poor, an employee may be promoted or not.

  6. The binary outcomes

  7. Problem 1 Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. For each training example, you have the applicant's scores on two exams and the admissions decision.

  8. The result of classification

  9. Can we use linear regression on this problem?

  10. From linear regression to logistic regression

  11. Logistic regression

  12. Results of 100 students (Exam 1 & Exam 2 scores) plotted against the admission categories

  13. The results plotted against probability of allocation to Admitted/Not admitted categories This curve is not a straight line; it is an s-shaped curve. Predicted values are interpreted as probabilities. The outcome is not a prediction of a Y value, as in linear regression, but the probability of belonging to one of the two conditions of Y, which can take on any value between 0 and 1 rather than just 0 and 1 as in the two previous figures.
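
The s-shaped curve is the logistic (sigmoid) function. A standard way to write the model sketched here, reusing the β notation of the regression slides below, is

$$P(Y = 1 \mid x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$$

which always lies strictly between 0 and 1, whatever the exam scores are.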

  14. Regression with Excel There are 100 observations of exam score 1 (x1), exam score 2 (x2), and admission value (y). We wish to see if the admission value can be predicted from exam score 1 and exam score 2 based on a linear relationship. A portion of the data as it appears in an Excel worksheet is presented here.

  15. Data analysis We can fit a multiple regression to the data by choosing Data Analysis... under the Tools menu and then selecting the Regression Analysis Tool. We will be presented with the following dialog:

  16. The tables We wish to estimate the regression line: y = β0 + β1x1 + β2x2. We do this using the Data Analysis Add-in and Regression. We should obtain the following results:
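
For comparison, the same least-squares estimates can be obtained in Octave with the backslash operator. A minimal sketch, assuming the data sit in a text file admissions.txt (a hypothetical name) with the two exam scores and the 0/1 admission value as columns:

data = load('admissions.txt');            % columns: exam 1, exam 2, admitted (0/1); file name is hypothetical
X = [ones(rows(data), 1), data(:, 1:2)];  % design matrix with an intercept column
y = data(:, 3);
beta = X \ y                              % [beta0; beta1; beta2], matching Excel's estimates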

  17. Interpretation of the regression coefficients The Y-intercept is the β0 term; the coefficient of variable X1 is the slope β1, and the coefficient of variable X2 is the slope β2.

  18. Graphs Finally, we have a quick look at the graphs. We asked for (and got) residual plots, but what we really wanted was a plot of the residuals against the predicted values. In simple linear regression the default plots would be fine; in multiple regression they are not what we want.

  19. Multiple regression in Maple (we were not impressed by the regression command)

  20. Maple: Compare original & fitted data
xref := [seq(j, j = 1 .. nops(yy))];
p1 := ScatterPlot(xref, yy, symbol = cross):
p2 := ScatterPlot(xref, yf, symbol = circle):
display([p1, p2], title = "crosses = data, circles = fitted");

  21. Multiple regression in Statistica

  22. Problem 2 Suppose you are the product manager of a factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

  23. The result

  24. Specifying the dependent and independent variables

  25. Assumptions of logistic regression

  26. Notation

  27. Likelihood function
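
For logistic regression with hypothesis hθ(x) = 1/(1 + e^(−θᵀx)) over m training examples (the standard notation, assumed here), the log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and training chooses θ to maximize it.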

  28. Numerical optimization

  29. Gradient Descent Iteratively updating the weights in this fashion increases the likelihood each round. The log-likelihood is concave, so we eventually reach the maximum. We are near the maximum when changes in the weights are small. We choose to stop when the sum of the absolute values of the weight differences is less than some small number, e.g. 10⁻⁶.
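
A minimal Octave sketch of this loop (the step size alpha and all variable names are assumptions, not the presentation's code):

alpha = 0.001;                       % assumed step size
w = zeros(columns(X), 1);            % X is the design matrix with an intercept column
do
  h = 1 ./ (1 + exp(-X * w));        % current predicted probabilities
  w_new = w + alpha * X' * (y - h);  % gradient ascent step on the log-likelihood
  delta = sum(abs(w_new - w));       % total change in the weights
  w = w_new;
until delta < 1e-6                   % stop once the updates are tiny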

  30. Octave GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It provides capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments. It also provides extensive graphics capabilities for data visualization and manipulation. Octave is normally used through its interactive command line interface, but it can also be used to write non-interactive programs. Octave is distributed under the terms of the GNU General Public License.

  31. Octave’s language Octave has extensive tools for solving common numerical linear algebra problems, finding the roots of nonlinear equations, integrating ordinary functions, manipulating polynomials, and integrating ordinary differential and differential-algebraic equations. It is easily extensible and customizable via user-defined functions written in Octave's own language, or using dynamically loaded modules written in C++.

  32. Octave implementation (Problem 1)
• Cost at initial theta: 0.693137
• Gradient at initial theta: -0.1, -12.009217, -11.262842
• Train accuracy: 89%
• Theta:
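
These values are consistent with the usual cross-entropy cost and gradient. A sketch of an Octave function that would produce them (the name costFunction is an assumption):

function [J, grad] = costFunction(theta, X, y)
  % Cross-entropy cost and gradient for logistic regression
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                      % sigmoid hypothesis
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  grad = (1 / m) * X' * (h - y);
end

With theta initialized to zeros, h is 0.5 for every example, so J = log 2 ≈ 0.693, as reported above.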

  33. Evaluating logistic regression (Problem 1, prediction) After learning the parameters, we can use the model to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should expect to see an admission probability of 0.776:
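
In Octave the prediction is a single expression (a sketch; theta holds the parameters learned above):

x = [1; 45; 85];                     % intercept term plus the two exam scores
prob = 1 / (1 + exp(-theta' * x))    % approximately 0.776 with the trained theta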

  34. Logistic regression in Statistica. Problem 1 In one STATISTICA application, multiple analyses can be open simultaneously and can be of the same or a different kind, each of them performed on the same or a different input data set (multiple input data files can be opened simultaneously).

  35. Logistic regression in Statistica. Problem 2 All graphs and spreadsheets are automatically linked to the data

  36. Overfitting Overfitting is a very important problem for all machine learning algorithms. We can find a hypothesis that predicts the training data perfectly but does not generalize well to new data. We are seeing an instance of this here: if we have a lot of parameters, the hypothesis "memorizes" the data points but is wild everywhere else.

  37. Problem 2. Feature mapping One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power. As a result of this mapping, our vector of two features (the scores on two tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimension feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot.
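
A sketch of this mapping in Octave (the function name mapFeature is an assumption); for degree 6 the output has 1 + 2 + 3 + ... + 7 = 28 columns:

function out = mapFeature(x1, x2)
  % All polynomial terms x1^i * x2^j with i + j <= 6
  degree = 6;
  out = ones(size(x1));                            % bias term
  for i = 1:degree
    for j = 0:i
      out(:, end + 1) = (x1 .^ (i - j)) .* (x2 .^ j);
    end
  end
end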

  38. Regularized logistic regression The derivation and optimization of regularized logistic regression is very similar to regular logistic regression. The benefit of adding the regularization term is that we enforce a tradeoff between matching the training data and generalizing to future data. For our regularized objective, we add the squared L2 norm. The derivatives are nearly the same, the only differences being the addition of regularization terms.
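
With the squared L2 norm added, the regularized cost takes the standard form (conventionally the intercept θ₀ is left out of the penalty):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$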

  39. Octave implementation (Problem 2, prediction)
• We should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5.
• This means guessing 1 whenever θᵀx is non-negative, and 0 otherwise.
• If we use the second degree polynomials, then p = 0.534488.
• If we use the sixth degree polynomials, then p = 0.804873.
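
The decision rule is one vectorized line in Octave (a sketch; X and theta as before):

p = (1 ./ (1 + exp(-X * theta))) >= 0.5;   % 1 exactly where theta' * x is non-negative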

  40. Regularization Regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting. [Figure: decision boundaries for four settings of the regularization parameter: λ = 0 (no regularization, overfitting), λ = 1, λ = 50 (underfitting), and λ = 100 (too much regularization, underfitting).]

  41. Model accuracy A way to test for errors in models created by step-wise regression is to not rely on the model's F-statistic, significance, or multiple R, but instead to assess the model against a set of data that was not used to create the model. This class of techniques is called cross-validation. Accuracy is measured as the proportion of correctly classified records in the holdout sample. There are four possible classifications:
• prediction of 0 when the holdout sample has a 0: True Negative (TN)
• prediction of 0 when the holdout sample has a 1: False Negative (FN)
• prediction of 1 when the holdout sample has a 0: False Positive (FP)
• prediction of 1 when the holdout sample has a 1: True Positive (TP)

  42. Four possible outcomes

  43. Precision and Recall These classifications are used to measure precision and recall. The percentage of correctly classified observations in the holdout sample is referred to as the assessed model accuracy. Additional accuracy measures are the model's ability to correctly classify 0s, and its ability to correctly classify 1s, in the holdout dataset. The holdout model assessment method is particularly valuable when data are collected in different settings (e.g., at different times or places) or when models are assumed to be generalizable.
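
In the notation of slide 41, the standard definitions are

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$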

  44. Additional analysis

  45. Evaluation of two models The model accuracy increases when sixth degree polynomials are used. [Tables: classification results with second degree polynomials vs. sixth degree polynomials.]

  46. Question 1

  47. Question 2

  48. Question 3

  49. Question 3. Choice

  50. Question 3. Answer
