Machine Learning CS 165B, Spring 2012



Course outline

  • Introduction (Ch. 1)

  • Concept learning (Ch. 2)

  • Decision trees (Ch. 3)

  • Ensemble learning

  • Neural Networks (Ch. 4)

  • Linear classifiers

  • Support Vector Machines

  • Bayesian Learning (Ch. 6)

  • Bayesian Networks

  • Clustering

  • Computational learning theory

Midterm on Wednesday



Midterm Wednesday May 2

  • Topics (till today’s lecture)

  • Content

    • (40%) Short questions

    • (20%) Concept learning and hypothesis spaces

    • (20%) Decision trees

    • (20%) Artificial Neural Networks

  • Practice midterm will be posted today

  • Can bring one regular 2-sided sheet & calculator



Background on Probability & Statistics

  • Random variable, sample space, event (union, intersection)

  • Probability distribution

    • Discrete (pmf)

    • Continuous (pdf)

    • Cumulative (cdf)

  • Conditional probability

    • Bayes Rule

    • P(C ≥ 2 | M = 0)

  • Independence of random variables

    • Are C and M independent?

  • Choose which of two envelopes contains the higher number

    • You are allowed to peek at one of them

Running example: 3 coins are tossed; C is the count of heads; M = 1 iff all coins match.
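As a quick check (not from the slides; a minimal sketch of the 3-coin example above), the eight equally likely outcomes can be enumerated in Python:

```python
# Minimal sketch (assumes three fair coins, as in the example above): enumerate
# the 8 equally likely outcomes to compute P(C >= 2 | M = 0) and test independence.
from itertools import product

outcomes = list(product([0, 1], repeat=3))        # 1 = heads, 0 = tails
p = 1 / len(outcomes)                             # each outcome has probability 1/8

def C(o): return sum(o)                           # count of heads
def M(o): return int(len(set(o)) == 1)            # 1 iff all coins match

p_m0 = sum(p for o in outcomes if M(o) == 0)
p_joint = sum(p for o in outcomes if C(o) >= 2 and M(o) == 0)
print("P(C >= 2 | M = 0) =", p_joint / p_m0)      # 0.5

# C and M are NOT independent: e.g. P(C = 3, M = 1) = 1/8 != P(C = 3) * P(M = 1) = 1/32
independent = all(
    abs(sum(p for o in outcomes if C(o) == c and M(o) == m)
        - sum(p for o in outcomes if C(o) == c) * sum(p for o in outcomes if M(o) == m)) < 1e-12
    for c in range(4) for m in range(2))
print("independent?", independent)                # False
```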



Background on Probability & Statistics

  • Common distributions

    • Bernoulli

    • Uniform

    • Binomial

    • Gaussian (Normal)

    • Poisson

  • Expected value, variance, standard deviation
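As an optional illustration (not from the slides), the expected value and variance of one of these distributions can be computed directly from its pmf and compared with the closed forms:

```python
# Sketch: E[X] and Var(X) of a Binomial(n, p) from its pmf, versus the closed forms
# n*p and n*p*(1-p). The values of n and p are arbitrary illustration choices.
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var = sum((k - mean)**2 * binomial_pmf(k, n, p) for k in range(n + 1))
print(mean, n * p)            # 3.0  3.0  (up to floating-point rounding)
print(var, n * p * (1 - p))   # 2.1  2.1
```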



Approaches to classification

  • Discriminant functions:

    • Learn the boundary between classes.

  • Infer conditional class probabilities:

    • Choose the most probable class

What kind of classifier is logistic regression?


Discriminant Functions

  • They can be arbitrary functions of x, such as linear functions, nonlinear functions, decision trees, or nearest-neighbor rules.

Sometimes it helps to transform the data and then learn a linear function in the transformed space.


High-dimensional data

  • Examples: gene expression profiles, face images, handwritten digit images.



Why feature reduction?

  • Most machine learning and data mining techniques may not be effective for high-dimensional data

    • Curse of Dimensionality

    • Query accuracy and efficiency degrade rapidly as the dimension increases.

  • The intrinsic dimension may be small.

    • For example, the number of genes responsible for a certain type of disease may be small.



Why feature reduction?

  • Visualization: projection of high-dimensional data onto 2D or 3D.

  • Data compression: efficient storage and retrieval.

  • Noise removal: positive effect on query accuracy.



Applications of feature reduction

  • Face recognition

  • Handwritten digit recognition

  • Text mining

  • Image retrieval

  • Microarray data analysis

  • Protein classification



Feature reduction algorithms

  • Unsupervised

    • Latent Semantic Indexing (LSI): truncated SVD

    • Independent Component Analysis (ICA)

    • Principal Component Analysis (PCA)

  • Supervised

    • Linear Discriminant Analysis (LDA)



Principal Component Analysis (PCA)

  • Summarization of data with many variables by a smaller set of derived (synthetic, composite) variables

  • PCA is based on the SVD, so we look at the SVD first.



Singular Value Decomposition (SVD)

  • Intuition: find the axis that shows the greatest variation, and project all points onto this axis

[Figure: 2D data shown in the original axes (f1, f2) and in the rotated axes (e1, e2) found by SVD, with e1 along the direction of greatest variation]



SVD: mathematical formulation

  • Let A be an m × n real matrix of m n-dimensional points

  • SVD decomposition

    • A = U Λ V^T

    • U (m × m) is orthogonal: U^T U = I

    • V (n × n) is orthogonal: V^T V = I

    • Λ (m × n) has the r positive singular values in descending order on its diagonal

  • Columns of U are the orthonormal eigenvectors of AA^T (called the left singular vectors of A)

    • AA^T = (U Λ V^T)(U Λ V^T)^T = U Λ V^T V Λ^T U^T = U Λ² U^T

  • Columns of V are the orthonormal eigenvectors of A^T A (called the right singular vectors of A)

    • A^T A = (U Λ V^T)^T (U Λ V^T) = V Λ^T U^T U Λ V^T = V Λ² V^T

  • Λ contains the square roots of the eigenvalues of AA^T (or A^T A)

    • These are called the singular values (positive reals)

    • r is the rank of A, AA^T, and A^T A

  • U spans the column space of A, V the row space (a numerical check follows below).
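A small NumPy check of these properties (an illustrative sketch, not course code; the matrix below is made up):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                          # m = 3 points in n = 2 dimensions

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U @ diag(s) @ Vt

print(np.allclose(U.T @ U, np.eye(2)))              # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(2)))            # rows of Vt (columns of V) are orthonormal
print(np.allclose(U @ np.diag(s) @ Vt, A))          # reconstruction: A = U Λ V^T

eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]         # eigenvalues of A^T A, descending
print(np.allclose(s, np.sqrt(eigvals)))             # singular values are their square roots
```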


SVD - example

  • A = U Λ V^T

[Figure: a worked numerical example multiplying out U × Λ × V^T, where v1 is the axis capturing the variance ('spread') of the data]


Dimensionality reduction

  • Set the smallest singular values to zero; dropping them (and the corresponding columns of U and V) gives a lower-rank approximation of A.

[Figure: the U, Λ, V^T factors with the smallest singular values zeroed out]


Dimensionality reduction

  • 'Spectral decomposition' of the matrix:

    • A = λ1 u1 v1^T + λ2 u2 v2^T + ... + λr ur vr^T   (r terms)

    • Each ui is an m × 1 column of U and each vi^T is a 1 × n row of V^T.


Dimensionality reduction

  • Approximation / dimensionality reduction: keep only the first few terms (how many?)

    • Assume λ1 ≥ λ2 ≥ ...

    • A heuristic: keep 80-90% of the 'energy' (= sum of squares of the λi's); see the sketch below.
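A sketch of the energy heuristic in NumPy (the data matrix and the 90% threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10)) @ rng.standard_normal((10, 20))   # roughly low-rank data

U, s, Vt = np.linalg.svd(A, full_matrices=False)

energy = np.cumsum(s**2) / np.sum(s**2)          # fraction of 'energy' kept by the first i terms
k = int(np.searchsorted(energy, 0.90)) + 1       # smallest k that keeps >= 90% of the energy

A_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))     # spectral-decomposition sum
print(k, np.linalg.norm(A - A_k) / np.linalg.norm(A))               # relative error of A_k
```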



Dimensionality reduction

  • Matrix V in the SVD decomposition A = U Λ V^T is used to transform the data.

  • AV (= UΛ) gives the transformed dataset.

  • For a new data element x, xV gives the transformed data.

  • Keeping the first k (k < n) dimensions amounts to keeping only the first k columns of V (sketched below).


Optimality of SVD

  • Let A = U Λ V^T

  • A = ∑i λi ui vi^T

  • The Frobenius norm of an m × n matrix M is ||M||_F = √( ∑i,j M[i, j]² ); for A, ||A||_F = √( ∑i λi² )

  • Let A_k = the above summation truncated to the k largest singular values.

    Theorem [Eckart and Young]: Among all m × n matrices B of rank at most k,

      ||A - A_k||_2 ≤ ||A - B||_2   and   ||A - A_k||_F ≤ ||A - B||_F

  • “Residual” variation is information in A that is not retained. Balancing act between

    • clarity of representation, ease of understanding

    • oversimplification: loss of important or relevant information.

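An illustrative numerical check of the Eckart-Young statement (not a proof; the matrices below are random):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]                         # rank-k SVD truncation
B = rng.standard_normal((30, k)) @ rng.standard_normal((k, 20))     # some other rank-k matrix

for name, ord_ in [("spectral", 2), ("Frobenius", "fro")]:
    print(name,
          np.linalg.norm(A - A_k, ord=ord_),      # error of the SVD truncation
          "<=",
          np.linalg.norm(A - B, ord=ord_))        # error of the other rank-k matrix
```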



Principal Components Analysis (PCA)

  • Center the dataset by subtracting the mean of each variable; let matrix A be the result.

  • Compute the covariance matrix A^T A (up to a constant factor).

  • Project the dataset onto a subset of the eigenvectors of A^T A.

    • Matrix V in the SVD decomposition contains these.

  • Also known as the Karhunen-Loève (K-L) transform.
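A minimal PCA sketch following these steps (illustration only; it uses the SVD of the centered matrix rather than forming A^T A explicitly):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components."""
    A = X - X.mean(axis=0)                       # center: subtract the mean of each variable
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    components = Vt[:k]                          # top-k eigenvectors of A^T A (as rows)
    return A @ components.T, components

X = np.random.default_rng(3).standard_normal((200, 6))   # toy data: 200 objects, 6 variables
Z, components = pca(X, k=2)
print(Z.shape)                                   # (200, 2)
print(np.round(np.cov(Z, rowvar=False), 3))      # off-diagonal ~0: the PCs are uncorrelated
```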



Principal Component Analysis (PCA)

  • Takes a data matrix of m objects by n variables, which may be correlated, and summarizes it by uncorrelated axes (principal components or principal axes) that are linear combinations of the original n variables

  • The first k components display as much as possible of the variation among objects.



2D Example of PCA



Configuration is Centered

  • Each variable is adjusted to a mean of zero (by subtracting its mean from each value).



Principal Components are Computed

  • PC 1 has the highest possible variance (9.88)

  • PC 2 has a variance of 3.03

  • PC 1 and PC 2 have zero covariance.


[Figure: the centered 2D configuration with the principal axes PC 1 and PC 2 drawn through it]

  • Each principal axis is a linear combination of the original two variables



Feature reduction algorithms

  • Unsupervised

    • Latent Semantic Indexing (LSI): truncated SVD

    • Independent Component Analysis (ICA)

    • Principal Component Analysis (PCA)

  • Supervised

    • Linear Discriminant Analysis (LDA)



Course outline

  • Introduction (Ch. 1)

  • Concept learning (Ch. 2)

  • Decision trees (Ch. 3)

  • Ensemble learning

  • Neural Networks (Ch. 4)

  • Linear classifiers

  • Support Vector Machines

  • Bayesian Learning (Ch. 6)

  • Bayesian Networks

  • Clustering

  • Computational learning theory



Midterm analysis

  • Grade distribution

  • Solution to ANN problem

  • Makeup problem on Wednesday

    • 20 minutes

    • 15 points

    • Bring a calculator



Fisher’s linear discriminant

  • A simple linear discriminant function is a projection of the data down to 1-D.

    • So choose the projection that gives the best separation of the classes. What do we mean by “best separation”?

  • An obvious direction to choose is the direction of the line joining the class means.

    • But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation (see the next figure).

  • Fisher’s method chooses the direction that maximizes the ratio of between-class variance to within-class variance.

    • This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions)



Fisher’s linear discriminant

When projected onto the line joining the class means, the classes are not well separated.

Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.



Fisher’s linear discriminant (derivation)

Find the best direction w for accurate classification.

A measure of the separation between the projected points is the difference of the sample means.

If m_i is the d-dimensional sample mean of class D_i (with n_i points), given by m_i = (1/n_i) ∑_{x ∈ D_i} x,

then the sample mean of the projected points Y_i is m̃_i = (1/n_i) ∑_{y ∈ Y_i} y = w^T m_i,

and the difference of the projected sample means is |m̃_1 - m̃_2| = |w^T (m_1 - m_2)|.



Fisher’s linear discriminant (derivation)

Define the scatter of the projected points: s̃_i² = ∑_{y ∈ Y_i} (y - m̃_i)².

Choose w to maximize J(w) = |m̃_1 - m̃_2|² / (s̃_1² + s̃_2²);

s̃_1² + s̃_2² is called the total within-class scatter.

Define the scatter matrices S_i (i = 1, 2) and S_W by S_i = ∑_{x ∈ D_i} (x - m_i)(x - m_i)^T and S_W = S_1 + S_2.



Fisher’s linear discriminant (derivation)

We obtain s̃_i² = ∑_{x ∈ D_i} (w^T x - w^T m_i)² = w^T S_i w, so that s̃_1² + s̃_2² = w^T S_W w.



Fisher’s linear discriminant (derivation)

Similarly, |m̃_1 - m̃_2|² = (w^T m_1 - w^T m_2)² = w^T S_B w, where S_B = (m_1 - m_2)(m_1 - m_2)^T is the between-class scatter matrix.

In terms of S_B and S_W, J(w) can be written as J(w) = (w^T S_B w) / (w^T S_W w).





Fisher’s linear discriminant (derivation)

A vector w that maximizes J(w) must satisfy the generalized eigenvalue problem S_B w = λ S_W w.

In the case that S_W is nonsingular, since S_B w always points in the direction of (m_1 - m_2), the solution is w ∝ S_W^{-1} (m_1 - m_2) (see the sketch below).
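A minimal two-class sketch of this result in NumPy (the Gaussian toy data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
cov = [[3.0, 2.0], [2.0, 2.0]]
X1 = rng.multivariate_normal([0, 0], cov, size=100)      # class 1 samples (toy)
X2 = rng.multivariate_normal([3, 3], cov, size=100)      # class 2 samples (toy)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                # d-dimensional sample means
S1 = (X1 - m1).T @ (X1 - m1)                             # scatter matrix of class 1
S2 = (X2 - m2).T @ (X2 - m2)                             # scatter matrix of class 2
S_w = S1 + S2                                            # total within-class scatter matrix

w = np.linalg.solve(S_w, m1 - m2)                        # Fisher direction: w ∝ S_W^-1 (m1 - m2)
w /= np.linalg.norm(w)

y1, y2 = X1 @ w, X2 @ w                                  # 1-D projections of the two classes
J = (y1.mean() - y2.mean())**2 / (len(y1) * y1.var() + len(y2) * y2.var())
print("w =", w, " J(w) =", J)                            # Fisher criterion for this w
```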



Linear discriminant

  • Advantages:

    • Simple: O(d) space/computation

    • Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)



Non-linear models

  • Quadratic discriminant: g(x) = w0 + w^T x + x^T W x, which adds second-order terms x_i x_j.

  • Higher-order (product) terms, e.g. z1 = x1, z2 = x2, z3 = x1², z4 = x2², z5 = x1 x2:

    Map from x to z using nonlinear basis functions and use a linear discriminant in z-space (see the sketch below).
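A small sketch of this idea (the quadratic basis and the least-squares fit are illustrative choices, not the course's prescribed method):

```python
import numpy as np

def quadratic_basis(X):
    """Map each row (x1, x2) to z = (x1, x2, x1**2, x2**2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Toy data that is not linearly separable in x-space: label = inside/outside a circle
rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(float)

Z = np.hstack([quadratic_basis(X), np.ones((len(X), 1))])   # add a bias column
w, *_ = np.linalg.lstsq(Z, 2 * y - 1, rcond=None)           # simple least-squares linear fit in z-space
pred = (Z @ w > 0).astype(float)
print("training accuracy:", (pred == y).mean())             # high, despite a linear rule in z
```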


Linear model: two classes

  • Discriminant: g(x) = w^T x + b; decide class 1 if g(x) > 0 and class 2 otherwise (the geometry is on the next slide).



Geometry of classification

  • w is orthogonal to the decision surface; the bias is w0 = b.

  • D = distance of the decision surface from the origin.

    • Consider any point x on the decision surface: w^T x + b = 0, so D = w^T x / ||w|| = -b / ||w||.

  • d(x) = distance of x from the decision surface.

    • Write x = x_p + d(x) w / ||w||, where x_p is the projection of x onto the decision surface.

    • Then w^T x + b = w^T x_p + d(x) w^T w / ||w|| + b,

    • i.e. g(x) = (w^T x_p + b) + d(x) ||w|| = d(x) ||w||, since g(x_p) = 0.

    • Hence d(x) = g(x) / ||w|| = w^T x / ||w|| - D.
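A quick numerical check of this geometry (w, b, and x are arbitrary illustration values):

```python
import numpy as np

w = np.array([3.0, 4.0])                    # made-up weight vector
b = -5.0                                    # made-up bias (w0 = b)

def g(x):                                   # linear discriminant g(x) = w^T x + b
    return w @ x + b

D = -b / np.linalg.norm(w)                  # distance of the surface from the origin
print("D =", D)                             # 1.0

x = np.array([2.0, 3.0])
d = g(x) / np.linalg.norm(w)                # signed distance of x from the surface
x_p = x - d * w / np.linalg.norm(w)         # project x back onto the surface
print("d(x) =", d, " g(x_p) =", round(g(x_p), 10))   # g(x_p) is ~0, as expected
```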

