
Lecture 2: Math Primer

Machine Learning

CUNY Graduate Center



Today

  • Probability and Statistics

    • Naïve Bayes Classification

  • Linear Algebra

    • Matrix Multiplication

    • Matrix Inversion

  • Calculus

    • Vector Calculus

    • Optimization

    • Lagrange Multipliers



Classical Artificial Intelligence

Expert Systems

Theorem Provers

Shakey

Chess

Largely characterized by determinism.



Modern Artificial Intelligence

Fingerprint ID

Internet Search

Vision – facial ID, object recognition

Speech Recognition

Asimo

Jeopardy!

Statistical modeling to generalize from data.



Two Caveats about Statistical Modeling

Black Swans

The Long Tail



Black Swans

In the 17th Century, all known swans were white; based on the available evidence, it was impossible for a swan to be anything other than white.

  • In the 18th Century, black swans were discovered in Western Australia.

  • Black Swans are rare, sometimes unpredictable events that have extreme impact.

  • Almost all statistical models underestimate the likelihood of unseen events.



The Long Tail

  • Many events follow an exponential distribution.

  • These distributions have a very long “tail”.

    • I.e., a large region with significant probability mass but low likelihood at any particular point.

  • Interesting events often occur in the long tail, but it is difficult to accurately model behavior in this region.



Boxes and Balls

Two boxes, one red and one blue.

Each contains colored balls.



Boxes and Balls

Suppose we randomly select a box, then randomly draw a ball from that box.

The identity of the Box is a random variable, B.

The identity of the ball is a random variable, L.

B can take 2 values: r or b.

L can take 2 values: g or o.



Boxes and Balls

Given some information about B and L, we want to ask questions about the likelihood of different events.

What is the probability of selecting a green ball?

If I chose an orange ball, what is the probability that I chose from the blue box?


Some Basics

  • The probability (or likelihood) of an event is the fraction of times that the event occurs out of n trials, as n approaches infinity.

  • Probabilities lie in the range [0,1]

  • Mutually exclusive events are events that cannot simultaneously occur.

    • The probabilities of a set of mutually exclusive and exhaustive events must sum to 1.

  • If two events are independent, then:

    p(X, Y) = p(X)p(Y)

    p(X|Y) = p(X)


Joint probability p x y

Joint Probability – P(X,Y)

A Joint Probability function defines the likelihood of two (or more) events occurring.

Let n_ij be the number of times event i and event j simultaneously occur.
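
The formula itself did not survive the transcript; with N the total number of trials, the joint probability being described is:

  p(X = x_i, Y = y_j) = n_ij / N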



Generalizing the Joint Probability



Marginalization

Consider the probability of X irrespective of Y.

The number of instances in column j is the sum of the instances in each cell of that column.

Therefore, we can marginalize or “sum over” Y:
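
In symbols (the slide's equation is missing), with c_j the total count in column j:

  p(X = x_j) = c_j / N = \sum_i p(X = x_j, Y = y_i)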



Conditional Probability

  • Consider only instances where X = x_j.

  • The fraction of these instances where Y = y_i is the conditional probability.

    • “The probability of y given x.”
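
Written out (the formula is missing from the transcript):

  p(Y = y_i | X = x_j) = n_ij / c_j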



Relating the Joint, Conditional and Marginal
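
The derivation on this slide was an image; the standard chain it illustrates is:

  p(X = x_j, Y = y_i) = n_ij / N = (n_ij / c_j)(c_j / N) = p(Y = y_i | X = x_j) p(X = x_j)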



Sum and Product Rules

Sum Rule

Product Rule
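
The rules themselves are missing from the transcript; in the notation above they are:

  Sum rule: p(X) = \sum_Y p(X, Y)

  Product rule: p(X, Y) = p(Y | X) p(X)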

In general, we’ll refer to a distribution over a random variable as p(X) and a distribution evaluated at a particular value as p(x).



Bayes Rule
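
The slide's equation was an image; Bayes' rule follows from the symmetry of the product rule:

  p(Y | X) = p(X | Y) p(Y) / p(X)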



Interpretation of Bayes Rule

In the formula, p(Y|X) is the posterior, p(X|Y) is the likelihood, and p(Y) is the prior.

Prior: Information we have before observation.

Posterior: The distribution of Y after observing X.

Likelihood: The likelihood of observing X given Y.



Boxes and Balls with Bayes Rule

  • Assume I’m inherently more likely to select the red box (66.6%) than the blue box (33.3%).

  • If I selected an orange ball, what is the likelihood that I selected the red box?

    • The blue box?
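
The arithmetic on this slide was an image. Here is a minimal sketch of the computation, with hypothetical box contents (the actual contents were shown only in a lost figure): say the red box holds 2 green and 6 orange balls, and the blue box holds 3 green and 1 orange ball.

    # Hypothetical contents: red box has 2 green + 6 orange balls,
    # blue box has 3 green + 1 orange ball (the original figure is lost).
    p_box = {'r': 2/3, 'b': 1/3}        # prior p(B)
    p_orange = {'r': 6/8, 'b': 1/4}     # likelihood p(L=o | B)

    # Evidence p(L=o), by marginalizing over boxes.
    p_o = sum(p_box[b] * p_orange[b] for b in p_box)

    # Posterior p(B | L=o), by Bayes' rule.
    posterior = {b: p_box[b] * p_orange[b] / p_o for b in p_box}
    print(posterior)  # {'r': 0.857..., 'b': 0.142...}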



Boxes and Balls



Naïve Bayes Classification

This is a simple example of a Bayesian classification approach.

Here the Box is the class, and the colored ball is a feature, or the observation.

We can extend this Bayesian classification approach to incorporate more independent features.
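
To make the extension to several independent features concrete, here is a minimal Naïve Bayes sketch for categorical features; the function names and data are illustrative, not from the slides.

    from collections import Counter, defaultdict

    def train_naive_bayes(examples):
        # examples: list of (feature_tuple, label) pairs.
        priors = Counter(label for _, label in examples)
        likelihoods = defaultdict(Counter)  # (feature_index, label) -> value counts
        for features, label in examples:
            for i, value in enumerate(features):
                likelihoods[(i, label)][value] += 1
        n = len(examples)
        return {y: c / n for y, c in priors.items()}, likelihoods

    def predict(priors, likelihoods, features):
        # Pick the label y maximizing p(y) * prod_i p(x_i | y).
        def score(y):
            s = priors[y]
            for i, value in enumerate(features):
                counts = likelihoods[(i, y)]
                total = sum(counts.values())
                s *= counts[value] / total if total else 0.0
            return s
        return max(priors, key=score)

    data = [(('sunny', 'hot'), 'no'),
            (('rainy', 'cool'), 'yes'),
            (('sunny', 'cool'), 'yes')]
    priors, likelihoods = train_naive_bayes(data)
    print(predict(priors, likelihoods, ('sunny', 'cool')))  # 'yes'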



Naïve Bayes Classification

Some theory first.



Naïve Bayes Classification

Assuming independent features simplifies the math.
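
Concretely, the simplification is that the joint likelihood factorizes over features (the slide's equation is missing):

  p(y | x_1, ..., x_n) ∝ p(y) \prod_i p(x_i | y)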


Naïve Bayes Example Data

Prior: (the worked example tables on these slides were images and did not survive the transcript)



Continuous Probabilities

So far, X has been discrete where it can take one of M values.

What if X is continuous?

Now p(x) is a continuous probability density function.

The probability that x will lie in an interval (a,b) is:
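
The integral was an image on the slide; it is:

  p(x \in (a, b)) = \int_a^b p(x) dx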


Continuous Probability Example


Properties of Probability Density Functions

Sum Rule

Product Rule
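
The continuous analogues of the earlier rules (missing from the transcript) are:

  Sum rule: p(x) = \int p(x, y) dy

  Product rule: p(x, y) = p(y | x) p(x)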



Expected Values

Given a random variable, with a distribution p(X), what is the expected value of X?
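
The defining equations were images; they are:

  E[X] = \sum_x x p(x) (discrete case), E[X] = \int x p(x) dx (continuous case)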



Multinomial Distribution

If a variable, x, can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.

The probability of x being in state k is μ_k.
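
With x written as a 1-of-K binary vector (x_k = 1 for the active state, 0 otherwise), the distribution is:

  p(x | μ) = \prod_k μ_k^{x_k}, where μ_k ≥ 0 and \sum_k μ_k = 1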



Expected Value of a Multinomial

The expected value is the vector of means: E[x | μ] = μ = (μ_1, ..., μ_K)^T.



Gaussian Distribution

One Dimension

D-Dimensions
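
The density formulas are missing from the transcript; the standard forms are:

  One dimension: N(x | μ, σ^2) = (1 / \sqrt{2π σ^2}) \exp(-(x - μ)^2 / (2σ^2))

  D dimensions: N(x | μ, Σ) = (2π)^{-D/2} |Σ|^{-1/2} \exp(-(1/2)(x - μ)^T Σ^{-1} (x - μ))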



Gaussians


How Machine Learning Uses Statistical Modeling

  • Expectation

    • The expected value of a function is the hypothesis

  • Variance

    • The variance is the confidence in that hypothesis



Variance

The variance of a random variable describes how much variability around the expected value there is.

It is calculated as the expected squared deviation from the expected value.
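
In symbols (the slide's equation is missing):

  Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2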



Covariance

The covariance of two random variables expresses how they vary together.

If two variables are independent, their covariance equals zero.
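
In symbols (the slide's equation is missing), and noting that the converse does not hold in general (zero covariance does not imply independence):

  cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y]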



Linear Algebra

  • Vectors

    • A one-dimensional array.

    • If not specified, assume x is a column vector.

  • Matrices

    • A higher-dimensional array.

    • Typically denoted with capital letters.

    • n rows by m columns.


Transposition

Transposing a matrix swaps columns and rows.
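
In index notation (the slide's example matrices are lost): (A^T)_ij = A_ji, so an n-by-m matrix becomes m-by-n.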



Addition

  • Two matrices can be added iff they have the same dimensions.

    • A and B are both n-by-m matrices.
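
Addition is elementwise (the slide's equation is missing): (A + B)_ij = A_ij + B_ij.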



Multiplication

  • To multiply two matrices, the inner dimensions must be the same.

    • An n-by-m matrix can be multiplied by an m-by-k matrix
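
Entrywise, (AB)_ij = \sum_l A_il B_lj. A quick check of the shape rule in numpy (numpy is just a convenient illustration here, not part of the lecture):

    import numpy as np

    A = np.ones((2, 3))   # an n-by-m matrix, n=2, m=3
    B = np.ones((3, 4))   # an m-by-k matrix, m=3, k=4
    C = A @ B             # inner dimensions (3) match, so this works
    print(C.shape)        # (2, 4): the result is n-by-k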



Inversion

  • The inverse of an n-by-n or square matrix A (when it exists) is denoted A^{-1}, and has the property A A^{-1} = A^{-1} A = I.

  • Here I, the identity matrix, is an n-by-n matrix with ones along the diagonal.

    • I_ij = 1 if i = j, and 0 otherwise.



Identity Matrix

Matrices are invariant under multiplication by the identity matrix: AI = IA = A.


Helpful Matrix Inversion Properties
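
The identities were shown as an image; the usual set (my reconstruction) is:

  (AB)^{-1} = B^{-1} A^{-1}

  (A^T)^{-1} = (A^{-1})^T

  (A^{-1})^{-1} = A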



Norm

The norm of a vector, x, represents the Euclidean length of the vector.
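
In symbols (the slide's formula is missing):

  ||x|| = \sqrt{x^T x} = \sqrt{\sum_i x_i^2}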


Positive Definiteness

  • Quadratic form

    • Scalar

    • Vector

  • Positive Definite matrix M

  • Positive Semi-definite
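
The definitions behind these bullets were images; the standard ones are:

  Quadratic form (a scalar built from a vector x and a matrix M): x^T M x

  M is positive definite if x^T M x > 0 for all x ≠ 0

  M is positive semi-definite if x^T M x ≥ 0 for all x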



Calculus

Derivatives and Integrals

Optimization



Derivatives

The derivative of a function f defines the slope of f at a point x.
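
The defining limit (missing from the transcript) is:

  f'(x) = \lim_{h \to 0} (f(x + h) - f(x)) / h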



Derivative Example



Integrals

Integration is the inverse operation of differentiation (up to an additive constant).

Graphically, an integral can be considered the area under the curve defined by f(x).



Integration Example



Vector Calculus

Differentiation with respect to a matrix or vector

Gradient

Change of Variables with a Vector



Derivative w.r.t. a vector

Given a vector x, and a function f(x), how can we find f’(x)?
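
The answer (the slide's formula is an image) is the vector of partial derivatives:

  ∇_x f = [∂f/∂x_1, ..., ∂f/∂x_n]^T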




Example Derivation

Also referred to as the gradient of a function.
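
The worked example itself is lost; a representative computation of this kind (my choice of function, not necessarily the slide's) is:

  f(x) = x^T A x  ⇒  ∇_x f = (A + A^T) x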



Useful Vector Calculus Identities

Scalar Multiplication

Product Rule
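
The identities were images; the usual forms (my reconstruction, in denominator layout) are:

  Scalar multiplication: ∂(a^T x)/∂x = a

  Product rule: ∂(u^T v)/∂x = (∂u/∂x)^T v + (∂v/∂x)^T u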



Useful Vector Calculus Identities

Derivative of an inverse

Change of Variable
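
Likely forms of these identities (the slide's equations are lost):

  Derivative of an inverse: ∂(A^{-1})/∂t = -A^{-1} (∂A/∂t) A^{-1}

  Change of variable: \int f(x) dx = \int f(x(u)) |∂x/∂u| du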



Optimization

We have an objective function f(x) that we'd like to maximize or minimize.

At an optimum the derivative vanishes, so set the first derivative to zero and solve.
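
A one-line worked instance (my example, not from the slide): to minimize f(x) = x^2 - 4x + 1, set f'(x) = 2x - 4 = 0, giving x = 2; since f''(x) = 2 > 0, this is a minimum.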


Optimization with Constraints

  • What if we want to constrain the parameters of the model?

    • For example: the mean is less than 10.

  • Find the best likelihood, subject to a constraint.

  • Two functions:

    • An objective function to maximize.

    • An inequality that must be satisfied.



Lagrange Multipliers

Find maxima of f(x,y) subject to a constraint.


General Form

Maximizing:

Subject to:

Introduce a new variable, λ, and find a maximum.
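
The equations on this slide were images; the standard setup is to maximize f(x) subject to g(x) = 0 by finding stationary points of the Lagrangian:

  L(x, λ) = f(x) + λ g(x), requiring ∇_x L = 0 and ∂L/∂λ = g(x) = 0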



Example

Maximizing:

Subject to:

Introduce a new variable, λ, and find a maximum.



Example

We now have 3 equations with 3 unknowns.



Example

Eliminate Lambda

Substitute and Solve
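
The actual numbers on these Example slides were images. A representative example with the same structure (my choice, not necessarily the slide's): maximize f(x, y) = xy subject to g(x, y) = x + y - 1 = 0.

  L(x, y, λ) = xy + λ(x + y - 1)

  ∂L/∂x = y + λ = 0,  ∂L/∂y = x + λ = 0,  ∂L/∂λ = x + y - 1 = 0

Eliminating λ from the first two equations gives x = y; substituting into the constraint gives x = y = 1/2, so the constrained maximum is f(1/2, 1/2) = 1/4.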



Why does Machine Learning need these tools?

  • Calculus

    • We need to identify the maximum likelihood or the minimum risk; this is optimization.

    • Integration allows the marginalization of continuous probability density functions.

  • Linear Algebra

    • Many features lead to high-dimensional spaces.

    • Vectors and matrices allow us to compactly describe and manipulate high-dimensional feature spaces.



Why does Machine Learning need these tools?

  • Vector Calculus

    • All of the optimization needs to be performed in high-dimensional spaces.

    • Optimization of multiple variables simultaneously – Gradient Descent (see the sketch below).

    • We want to take marginals over high-dimensional distributions like Gaussians.
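
Since the slide names gradient descent, here is a minimal sketch of the update x ← x - η ∇f(x); the objective and step size are my own illustrative choices, not from the lecture:

    import numpy as np

    def gradient_descent(grad, x0, eta=0.1, steps=100):
        # Repeatedly step against the gradient to minimize a function.
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - eta * grad(x)
        return x

    # Example: minimize f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
    x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0, 0.0])
    print(x_min)  # approximately [3. 3.]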



Next Time

Linear Regression and Regularization

Read Sections 1.1, 3.1, and 3.3.

