Lecture 3 Machine Learning


Presentation Transcript
slide1

CENTER FOR INTEGRATIVE BIOINFORMATICS VU

Lecture 3

Machine Learning

(Elena Marchiori’s slides adapted) Bioinformatics Data Analysis and Tools

heringa@few.vu.nl

supervised learning
Supervised Learning

[Diagram] An unknown system produces observations together with a property of interest; a supervisor labels them to form the train dataset. The ML algorithm learns a model from the train dataset, and the model predicts the property of interest for a new observation (classification).

unsupervised learning
Unsupervised Learning

ML for unsupervised learning attempts to discover interesting structure in the available data

Data mining, Clustering
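A minimal sketch of unsupervised structure discovery (not from the slides), assuming NumPy and scikit-learn; the toy "expression" matrix and the choice of k-means are illustrative only:

```python
# Unsupervised learning sketch (NumPy/scikit-learn assumed): discover structure
# in unlabeled data by clustering toy "gene expression" profiles with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 20 "genes" x 6 "conditions": two hidden groups with different mean levels
data = np.vstack([rng.normal(0.0, 1.0, size=(10, 6)),
                  rng.normal(3.0, 1.0, size=(10, 6))])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print(clusters)   # cluster label per gene, found without any class labels
```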

what is your question
What is your question?
  • What are the targets genes for my knock-out gene?
  • Look for genes that have different time profiles between different cell types.

Gene discovery, differential expression

  • Are the genes in a specified group all up-regulated under a specified condition?

Gene set, differential expression

  • Can I use the expression profile of cancer patients to predict survival?
  • Identification of groups of genes that are predictive of a particular class of tumors?

Class prediction, classification

  • Are there tumor sub-types not previously identified?
  • Are there groups of co-expressed genes?

Class discovery, clustering

  • Detection of gene regulatory mechanisms.
  • Do my genes group into previously undiscovered pathways?

Clustering. Expression data alone are often not enough; functional and other information need to be incorporated

slide5

Basic principles of discrimination

  • Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
  • Aim: predict Y from X.

[Diagram] Objects belong to one of the predefined classes {1, 2, …, K}. Each object carries a class label Y (e.g. Y = 2) and a feature vector X (e.g. the features {colour, shape}). The classification rule must answer: given X = {red, square}, what is Y?

discrimination and prediction
Discrimination and Prediction

[Diagram] Discrimination: a classification technique is applied to a learning set (data with known classes) to derive a classification rule. Prediction: the classification rule assigns a class to data with unknown classes.
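A minimal sketch of this learning set → classification rule → class assignment workflow, assuming scikit-learn; the feature values, labels, and the choice of a k-NN classifier are illustrative only:

```python
# Learning set -> classification rule -> class assignment (scikit-learn assumed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Learning set: feature vectors X with known class labels y (classes in {1, 2})
X_train = np.array([[5.1, 0.2], [4.9, 0.4], [6.3, 1.8], [6.7, 2.0]])
y_train = np.array([1, 1, 2, 2])

rule = KNeighborsClassifier(n_neighbors=3)   # the classification technique
rule.fit(X_train, y_train)                   # derive the classification rule

X_new = np.array([[6.0, 1.7]])               # data with unknown class
print(rule.predict(X_new))                   # class assignment (prediction)
```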

example a classification problem
Example: A Classification Problem
  • Categorize images of fish—say, “Atlantic salmon” vs. “Pacific salmon”
  • Use features such as length, width, lightness, fin shape & number, mouth position, etc.
  • Steps
    • Preprocessing (e.g., background subtraction)
    • Feature extraction/feature weighting
    • Classification

example from Duda & Hart

classification in bioinformatics
Classification in Bioinformatics
  • Computational diagnostic: early cancer detection
  • Tumor biomarker discovery
  • Protein structure prediction (threading)
  • Protein-protein binding site prediction
  • Gene function prediction
slide9

[Diagram] Learning set: arrays (the objects) with gene expression feature vectors and predefined classes given by clinical outcome: good prognosis (recurrence > 5 yrs) vs. bad prognosis (recurrence < 5 yrs). The classification rule learned from this set predicts the outcome for a new array (?).

Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

classification techniques
Classification Techniques
  • K Nearest Neighbor classifier
  • Support Vector Machines
instance based learning ibl
Instance Based Learning (IBL)

Key idea: just store all training examples <xi,f(xi)>

Nearest neighbor:

  • Given query instance xq, first locate nearest training example xn, then estimate f(xq)=f(xn)

K-nearest neighbor:

  • Given xq, take vote among its k nearest neighbors (if discrete-valued target function)
  • Take the mean of the values of the k nearest neighbors (if real-valued): f(xq) = (1/k) Σi=1..k f(xi)
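A from-scratch sketch of these estimates, assuming NumPy and a Euclidean distance; the helper name knn_predict is illustrative, not from the slides:

```python
# k-NN estimate sketch (NumPy assumed): majority vote for discrete targets,
# mean of the neighbors' values for real-valued targets.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3, discrete=True):
    dists = np.linalg.norm(X_train - x_q, axis=1)   # distance from query to every stored example
    nn = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    if discrete:
        return Counter(y_train[nn]).most_common(1)[0][0]   # majority vote
    return float(np.mean(y_train[nn]))              # f(xq) = (1/k) * sum f(xi)

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.05]), k=3))   # -> 1
```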
k nearest neighbor
K-Nearest Neighbor
  • The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms.
  • An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors.
  • k is a positive integer, typically small. If k = 1, then the object is simply assigned to the class of its nearest neighbor.
  • K-NN can do multiple class prediction (more than two cancer subtypes, etc.)
  • In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes.

Adapted from Wikipedia

k nearest neighbor1
K-Nearest Neighbor
  • A lazy learner …
  • Issues:
    • How many neighbors?
    • What similarity measure?

Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is classified to the first class (3 squares vs. 2 triangles inside the outer circle).

From Wikipedia
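A small numeric sketch of the k = 3 vs. k = 5 situation described above (scikit-learn assumed); the coordinates are invented to reproduce the vote counts, not taken from the figure:

```python
# Sketch of the k = 3 vs k = 5 example (scikit-learn assumed).
# Points are placed so the 3 nearest neighbors of the query are 2 "triangles" + 1 "square",
# while the 5 nearest are 2 triangles + 3 squares.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.5, 0.0], [0.0, 0.6],                 # class 1: triangles
              [0.7, 0.0], [0.0, 0.9], [-1.0, 0.0]])   # class 0: squares
y = np.array([1, 1, 0, 0, 0])
query = np.array([[0.0, 0.0]])                        # the test sample

for k in (3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(query))                      # k=3 -> [1] (triangle), k=5 -> [0] (square)
```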

which similarity or dissimilarity measure
Which similarity or dissimilarity measure?
  • A metric is a measure of the similarity or dissimilarity between two data objects
  • Two main classes of metric:
    • Correlation coefficients (similarity)
      • Compares shape of expression curves
      • Types of correlation:
        • Centered.
        • Un-centered.
        • Rank-correlation
    • Distance metrics (dissimilarity)
      • City Block (Manhattan) distance
      • Euclidean distance
correlation a measure between 1 and 1
Correlation (a measure between -1 and 1)
  • Pearson Correlation Coefficient (centered correlation):

r(x, y) = (1/n) Σi=1..n ((xi − x̄)/Sx)·((yi − ȳ)/Sy)

Sx = Standard deviation of x

Sy = Standard deviation of y

You can use absolute correlation to capture both positive and negative correlation

[Plots: a positively correlated and a negatively correlated pair of profiles]
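A short sketch of the correlation variants listed above (centered, un-centered, and rank), assuming NumPy and SciPy; the two profiles are made up:

```python
# Correlation variants between two toy expression profiles (NumPy/SciPy assumed).
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_centered, _ = pearsonr(x, y)            # Pearson (centered) correlation
r_uncentered = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # un-centered (cosine)
r_rank, _ = spearmanr(x, y)               # rank correlation (Spearman)

print(r_centered, r_uncentered, r_rank)
# abs(r_centered) treats strong negative correlation like strong positive correlation
```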

potential pitfalls
Potential pitfalls

[Plot: a pitfall case where correlation = 1]

distance metrics
Distance metrics

City Block (Manhattan) distance: d(X, Y) = Σi |xi − yi|

Sum of differences across dimensions

Less sensitive to outliers

Diamond shaped clusters

Euclidean distance: d(X, Y) = √( Σi (xi − yi)² )

Most commonly used distance

Sphere shaped clusters

Corresponds to the geometric distance in multidimensional space

[Plots: genes X and Y plotted against condition 1 and condition 2]

where gene X = (x1,…,xn) and gene Y = (y1,…,yn)
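A minimal sketch of the two distance metrics, assuming NumPy; gene_x and gene_y are illustrative profiles:

```python
# City Block (Manhattan) and Euclidean distance between two gene profiles (NumPy assumed).
import numpy as np

gene_x = np.array([1.0, 2.0, 3.0, 4.0])
gene_y = np.array([1.5, 1.0, 3.5, 2.0])

manhattan = np.sum(np.abs(gene_x - gene_y))          # sum of differences across dimensions
euclidean = np.sqrt(np.sum((gene_x - gene_y) ** 2))  # geometric distance in expression space

print(manhattan, euclidean)
```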

euclidean vs correlation i
Euclidean vs Correlation (I)
  • Euclidean distance
  • Correlation
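A tiny illustration (not from the slides, NumPy assumed) of how the two measures differ: a profile and a scaled copy of it have correlation 1 but a large Euclidean distance.

```python
# Euclidean distance vs correlation for a profile and a scaled copy (NumPy assumed).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10 * x                                   # same shape, much larger magnitude

euclidean = np.linalg.norm(x - y)            # large: the profiles are far apart in space
correlation = np.corrcoef(x, y)[0, 1]        # 1.0: the shapes of the curves are identical

print(euclidean, correlation)
```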
when to consider nearest neighbors
When to Consider Nearest Neighbors
  • Instances map to points in R^N
  • Less than 20 attributes per instance
  • Lots of training data

Advantages:

  • Training is very fast
  • Learn complex target functions
  • Do not lose information

Disadvantages:

  • Slow at query time
  • Easily fooled by irrelevant attributes
voronoi diagrams
Voronoi Diagrams
  • Voronoi diagrams partition a space with objects in the same way as happens when you throw a number of pebbles in water -- you get concentric circles that will start touching and by doing so delineate the area for each pebble (object).
  • The area assigned to each object can now be used for weighting purposes
  • A nice example from sequence analysis is by Sibbald and Argos (1990)
  • Sibbald, P. and Argos, P. (1990). Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol. 216:813-818.
voronoi diagram
Voronoi Diagram

[Diagram: query point qf and its nearest neighbor qi in a Voronoi partition]

3 nearest neighbors
3-Nearest Neighbors

[Diagram: query point qf with its 3 nearest neighbors (2 x, 1 o); the Voronoi areas can be used for weighting]

7 nearest neighbors
7-Nearest Neighbors

[Diagram: query point qf with its 7 nearest neighbors (3 x, 4 o)]

k nearest neighbors
k-Nearest Neighbors
  • The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct.
  • A good k can be selected by various heuristic techniques, for example, cross-validation. If k = 1, the algorithm is called the nearest neighbor algorithm.
  • The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance.
  • Much research effort has been put into selecting or scaling features to improve classification, e.g. using evolutionary algorithms to optimize feature scaling.
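A possible sketch of selecting k by cross-validation, as suggested above (scikit-learn assumed; iris is used as a stand-in dataset):

```python
# Selecting k by cross-validation (scikit-learn assumed; iris as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                      cv=5)                      # 5-fold cross-validation per candidate k
search.fit(X, y)
print(search.best_params_, search.best_score_)  # the k with the best CV accuracy
```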
nearest neighbor
Nearest Neighbor
  • Approximate the target function f(x) at the single query point x = xq
  • Locally weighted regression = generalization of IBL
curse of dimensionality
Curse of Dimensionality

Imagine instances are described by 20 attributes (features) but only 10 are relevant to target function

Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional

One approach: weight the features according to their relevance!

  • Stretch the j-th axis by weight zj, where z1,…,zn are chosen to minimize prediction error
  • Use cross-validation to automatically choose the weights z1,…,zn
  • Note: setting zj to zero eliminates this dimension altogether (feature subset selection)
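A rough sketch of this axis-stretching idea (NumPy and scikit-learn assumed); the weights zj are fixed by hand here rather than chosen by cross-validation:

```python
# Axis stretching for k-NN (NumPy/scikit-learn assumed): multiply each feature j
# by a weight z_j before computing distances; z_j = 0 removes that feature.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
relevant = rng.normal(size=(200, 2))                   # 2 informative features
y = (relevant[:, 0] + relevant[:, 1] > 0).astype(int)  # target depends only on these
X = np.hstack([relevant, rng.normal(size=(200, 8))])   # plus 8 irrelevant features

z = np.array([1.0, 1.0] + [0.0] * 8)                   # hand-picked weights (not tuned by CV)
for X_used in (X, X * z):                              # original vs weighted instance space
    print(cross_val_score(KNeighborsClassifier(n_neighbors=5), X_used, y, cv=5).mean())
```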
practical implementations
Practical implementations
  • Weka – IBk
  • Optimized implementation – TiMBL
example tumor classification
Example: Tumor Classification
  • Reliable and precise classification essential for successful cancer treatment
  • Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
  • Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
  • Characterize molecular variations among tumors by monitoring gene expression (microarray)
  • Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
tumor classification using gene expression data
Tumor Classification Using Gene Expression Data

Three main types of ML problems associated with tumor classification:

  • Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
  • Classification of malignancies into known classes (supervised learning – discrimination)
  • Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).
slide30

Example Leukemia experiments (Golub et al 1999)

  • Goal. To identify genes which are differentially expressed in acute lymphoblastic leukemia (ALL) tumours in comparison with acute myeloid leukemia (AML) tumours.
  • 38 tumour samples: 27 ALL, 11 AML.
  • Data from Affymetrix chips, some pre-processing.
  • Originally 6,817 genes; 3,051 after reduction.
  • Data therefore a 3,051 × 38 array of expression values.

Acute lymphoblastic leukemia (ALL) is the most common malignancy in children 2-5 years in age, representing nearly one third of all pediatric cancers.

Acute Myeloid Leukemia (AML) is the most common form of myeloid leukemia in adults (chronic lymphocytic leukemia is the most common form of leukemia in adults overall). In contrast, acute myeloid leukemia is an uncommon variant of leukemia in children. The median age at diagnosis of acute myeloid leukemia is 65 years of age.

slide31

[Diagram] Learning set: arrays (the objects) with gene expression feature vectors and predefined classes given by tumor type (B-ALL, T-ALL, AML). The classification rule learned from this set predicts the tumor type of a new array (?).

Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

slide33
SVM
  • SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
  • SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
  • SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
  • The most popular optimization algorithms for SVMs are SMO [Platt ’99] and SVMlight [Joachims ’99]; both use decomposition to hill-climb over a subset of αi’s at a time.
  • Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
slide34
SVM
  • In order to discriminate between two classes, given a training dataset
    • Map the data to a higher dimension space (feature space)
    • Separate the two classes using an optimal linear separator
feature space mapping
Feature Space Mapping
  • Map the original data to some higher-dimensional feature space where the training set is linearly separable:

Φ: x→φ(x)

the kernel trick
The “Kernel Trick”
  • The linear classifier relies on an inner product between vectors, K(xi, xj) = xiᵀxj
  • If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

K(xi, xj) = φ(xi)ᵀφ(xj)

  • A kernel function is some function that corresponds to an inner product in some expanded feature space.
  • Example:

2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)²

Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):

K(xi, xj) = (1 + xiᵀxj)²
          = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
          = [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
          = φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
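A quick numeric check of this identity (not from the slides), assuming NumPy; phi implements the φ(x) written out above:

```python
# Numeric check (NumPy assumed): (1 + xi.xj)^2 equals phi(xi).phi(xj) for the
# mapping phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2].
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])

print((1 + xi @ xj) ** 2)      # kernel computed in the original 2-D space: 0.25
print(phi(xi) @ phi(xj))       # inner product in the 6-D feature space: 0.25
```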

linear separators
Linear Separators

Which one is the best?

optimal hyperplane

Optimal hyperplane

[Diagram: the optimal hyperplane maximizes the margin ρ; the support vectors uniquely characterize the optimal hyperplane]

optimal hyperplane geometric view
Optimal hyperplane: geometric view

[Diagram: geometric view of the optimal hyperplane separating the first class from the second class]

soft margin classification
Soft Margin Classification
  • What if the training set is not linearly separable?
  • Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

[Diagram: examples with slack variables ξj and ξk]

weakening the constraints
Weakening the constraints

Allow that the objects do not strictly obey the constraints

Introduce ‘slack’ variables
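A minimal soft-margin sketch under stated assumptions (scikit-learn, synthetic overlapping classes); the parameter C controls how strongly the slack variables are penalized:

```python
# Soft-margin sketch (scikit-learn assumed): C controls how strongly the slack
# variables are penalized; small C tolerates more constraint violations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),   # two overlapping classes,
               rng.normal(1.0, 1.0, size=(50, 2))])   # not linearly separable
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, int(clf.n_support_.sum()))   # smaller C -> wider margin, more support vectors
```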

slide43
SVM
  • Advantages:
    • maximize the margin between two classes in the feature space characterized by a kernel function
    • are robust with respect to high input dimension
  • Disadvantages:
    • difficult to incorporate background knowledge
    • Sensitive to outliers
classifying new examples
Classifying new examples
  • Given a new point x, its class membership is sign[f(x, α*, b*)], where

f(x, α*, b*) = Σi αi* yi xiᵀx + b*

Data enters only in the form of dot products!

and in general the dot product is replaced by a kernel function:

f(x, α*, b*) = Σi αi* yi K(xi, x) + b*
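As an illustrative check (scikit-learn assumed, not part of the slides), the sketch below recomputes this decision value for an RBF kernel from a fitted classifier's support vectors, dual coefficients αi* yi, and bias b*:

```python
# Sanity-check sketch (scikit-learn assumed): recompute the decision value from
# the support vectors, the dual coefficients alpha_i* y_i, and the bias b*.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)

x_new = X[:1]
k = np.exp(-0.1 * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))  # K(xi, x) for each SV
manual = clf.dual_coef_ @ k + clf.intercept_     # sum_i alpha_i* y_i K(xi, x) + b*
print(manual, clf.decision_function(x_new))      # the two values agree
```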

classification cv error
Classification: CV error

N samples

  • Training error
    • Empirical error
  • Error on independent test set
    • Test error
  • Cross validation (CV) error
    • Leave-one-out (LOO)
    • n-fold CV

[Diagram] Splitting for n-fold CV: the N samples are repeatedly split into N/n samples for testing and N(n−1)/n samples for training; the errors are counted and summarized as the CV error rate.
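A brief sketch of the two CV variants named above, assuming scikit-learn with iris as a stand-in dataset and a 3-NN classifier:

```python
# CV error sketch (scikit-learn assumed): n-fold CV and leave-one-out (LOO)
# for a 3-nearest-neighbor classifier on a stand-in dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

nfold = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(1 - nfold.mean())   # 10-fold CV error rate
print(1 - loo.mean())     # leave-one-out CV error rate
```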