Instance-based Classification

Instance-based Classification. Examine the training samples each time a new query instance is given. The relationship between the new query instance and training examples will be checked to assign a class label to the query instance. KNN: k-Nearest Neighbor.


PowerPoint Slideshow about 'Instance-based Classification' - remy

Instance-based Classification

  • Examine the training samples each time a new query instance is given.

  • The relationship between the new query instance and training examples will be checked to assign a class label to the query instance.

KNN: k-Nearest Neighbor

  • A test sample x can be best predicted by determining the most common class label among k training samples to which x is most similar.

  • Notation: xj is the jth training sample, yj is the class label for xj, and Nx is the set of the k nearest neighbors of x in the training set. Estimate the probability that x belongs to the ith class:

KNN: k-Nearest Neighbor, con’t

  • Proportion of the k nearest neighbors that belong to the ith class: P(i | x) = (number of xj in Nx with yj = i) / k

  • The ith class which maximizes the proportion above will be assigned as the label of x.

  • Variants of KNN: filtering out irrelevant genes before applying KNN.
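The voting rule on the two KNN slides can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `knn_predict`, the choice of Euclidean distance as the similarity measure, and the toy data are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Return (predicted_label, class_proportions) for one query sample."""
    # Assumption: "most similar" is measured with Euclidean distance.
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    neighbor_labels = [label for _, label in by_distance[:k]]
    # Proportion of the nearest neighbors that belong to each class.
    counts = Counter(neighbor_labels)
    proportions = {c: n / len(neighbor_labels) for c, n in counts.items()}
    # The class with the largest proportion becomes the label of the query.
    predicted = max(proportions, key=proportions.get)
    return predicted, proportions

# Toy two-gene expression profiles (made-up numbers).
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML"]
label, proportions = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
print(label, proportions)
```

Nothing is learned ahead of time: the training samples are re-examined on every call, which is what makes the method instance-based.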

Molecular Classification of Cancer

Class Discovery and Class Prediction by Gene Expression Monitoring

Publication Info

  • "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

  • Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Lo, Downing, Caligiuri, Bloomfield, Lander

    • Appears in Science Volume 286, October 15, 1999

  • Whitehead Institute/MIT Center for Genome Research


    • ...and Dana-Farber (Boston), St. Jude (Memphis), Ohio State

  • Additional publications by the same group show similar techniques applied to other diseases.

Cancer Classification

Class Discovery: defining previously unrecognized tumor subtypes

Class Prediction: assignment of particular tumor samples to already-defined classes

  • Given bone marrow samples:

    • Which cancer classes are present among the samples?

    • How many cancer classes? 2, 4?

    • Given samples are from leukemia patients, what type of leukemia is each sample (AML vs ALL)?

Leukemia: Definitions & Symptoms

  • Cancer of bone marrow

  • Myelogenous or lymphocytic, acute or chronic

  • Acute Myelogenous Leukemia (AML) vs Acute Lymphocytic Leukemia (ALL)

  • Marrow cannot produce appropriate amount of red and white blood cells

  • Anemia -> weakness, minor infections; Platelet deficiency -> easy bruising

  • AML: 10,000 new adult cases per year

  • ALL: 3,500/2,400 new adult/child cases per year

  • AML vs. ALL in adults & children

Leukemia: Treatment & expected outcome

  • Diagnosis via highly specialized laboratory

  • ALL: 58% survival rate

  • AML: 14% survival rate

  • Treatment: chemotherapy, bone marrow transplant

    • ALL: corticosteroids, vincristine, methotrexate, L-asparaginase

    • AML: daunorubicin, cytarabine

  • Correct diagnosis very important for treatment options and expected outcome!!!

  • Microarray could provide systematic diagnosis option


Leukemia: Data set

  • 38 bone marrow samples (27 ALL, 11 AML)

  • 6817 human gene probes

Cancer Class Prediction

  • Learning Task

    • Given: Expression profiles of leukemia patients

    • Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data.

  • Classification Task

    • Given: Expression profile of a new patient + A learned model (e.g., one computed in a learning task)

    • Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)

Cancer Class Prediction

  • n genes measured in m patients

  • Each row below is the expression vector for one patient, followed by that patient's class label:

    g1,1 … g1,n → class1

    g2,1 … g2,n → class2

    ⋮

    gm,1 … gm,n → classm

Cancer Class Prediction Approach

  • Rank genes by their correlation with class variable (AML/ALL)

  • Select subset of “informative” genes

  • Have these genes do a weighted vote to classify a previously unclassified patient.

  • Test validity of predictors.

Ranking Genes

  • Rank genes by how predictive they are (individually) of the class…

    g1,1 … g1,n → class1

    g2,1 … g2,n → class2

    ⋮

    gm,1 … gm,n → classm

Ranking Genes

  • Split the expression values for a given gene g into two pools – one for each class (AML vs. ALL)

  • Determine the mean m and standard deviation s of each pool

  • Rank genes by correlation metric (separation)

    P(g, class) = (mALL - mAML)/(sALL + sAML)

    The mean difference between the classes

    relative to the SD within the classes.
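The ranking step can be sketched as follows. This is a hedged illustration rather than the original analysis code: the names `separation_score` and `rank_genes`, the use of the sample standard deviation, and the three toy genes are assumptions.

```python
from statistics import mean, stdev

def separation_score(values, labels):
    """P(g, class) = (mALL - mAML) / (sALL + sAML) for one gene."""
    all_pool = [v for v, c in zip(values, labels) if c == "ALL"]
    aml_pool = [v for v, c in zip(values, labels) if c == "AML"]
    # Assumption: s is the sample standard deviation of each pool.
    return (mean(all_pool) - mean(aml_pool)) / (stdev(all_pool) + stdev(aml_pool))

def rank_genes(expression, labels):
    """Sort genes from most ALL-like (positive P) to most AML-like (negative P)."""
    scores = {g: separation_score(v, labels) for g, v in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up expression values: one list per gene, one entry per sample.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "geneA": [10.0, 12.0, 11.0, 2.0, 3.0, 1.0],   # high in ALL
    "geneB": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],      # uninformative
    "geneC": [1.0, 2.0, 3.0, 9.0, 11.0, 10.0],    # high in AML
}
ranked = rank_genes(expression, labels)
print(ranked)
```

A gene whose two pools are both constant would make the denominator zero; real data would need a guard for that case.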

Neighborhood Analysis

Each gene g: V(g) = (e1, e2, …, en), where ei is the expression level of gene g in the ith sample.

Idealized pattern: c = (c1, c2, …, cn), where ci is 1 or 0 according to whether sample i belongs to class 1 or class 2.

c*: a random idealized pattern (for example, c with its entries randomly permuted).

Count the number of genes having various levels of correlation with c, and compare with the corresponding distribution obtained for random patterns c*.
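A permutation-style version of this comparison might look like the sketch below. It assumes the separation score (m1 - m0)/(s1 + s0) as the correlation measure, builds each c* by shuffling c, and uses synthetic data; the function names, the threshold of 1.0, and the number of permutations are all illustrative choices.

```python
import random
from statistics import mean, stdev

def separation(values, labels):
    """(m1 - m0) / (s1 + s0) for one gene; labels are 1 or 0."""
    pool1 = [v for v, c in zip(values, labels) if c == 1]
    pool0 = [v for v, c in zip(values, labels) if c == 0]
    return (mean(pool1) - mean(pool0)) / (stdev(pool1) + stdev(pool0))

def neighborhood_size(genes, labels, threshold):
    """Number of genes whose score against `labels` exceeds `threshold`."""
    return sum(1 for values in genes if separation(values, labels) > threshold)

def random_neighborhood_sizes(genes, labels, threshold, n_perm=200, seed=0):
    """Neighborhood sizes for random patterns c* (shuffled copies of c)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        sizes.append(neighborhood_size(genes, shuffled, threshold))
    return sizes

# Synthetic data: 5 informative genes and 20 noise genes, 12 samples.
rng = random.Random(1)
c = [1] * 6 + [0] * 6
genes = [[10 * ci + rng.gauss(0, 1) for ci in c] for _ in range(5)]
genes += [[rng.gauss(0, 1) for _ in c] for _ in range(20)]

observed = neighborhood_size(genes, c, threshold=1.0)
random_sizes = random_neighborhood_sizes(genes, c, threshold=1.0)
# Fraction of random patterns whose neighborhood is at least as large.
p_value = sum(s >= observed for s in random_sizes) / len(random_sizes)
print(observed, p_value)
```

If the observed neighborhood is larger than nearly all random ones, the class distinction is reflected in more genes than chance alone would explain.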

Selecting Informative Genes

  • Select the kALL top ranked genes (highly expressed in ALL) and the kAML bottom ranked genes (highly expressed in AML)

    P(g, class) = (mALL - mAML)/(sALL + sAML)

In Golub's paper, the 25 most positively correlated and the 25 most negatively correlated genes are selected.
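Given a score per gene, the selection itself is a pair of slices over the ranked list. The sketch below assumes a hypothetical `scores` mapping from gene name to P(g, class) and made-up values.

```python
def select_informative(scores, k_all=25, k_aml=25):
    """Split a {gene: P(g, class)} mapping into ALL and AML marker lists."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_all = ranked[:k_all]                            # most positive P: high in ALL
    bottom_aml = ranked[max(0, len(ranked) - k_aml):]   # most negative P: high in AML
    return top_all, bottom_aml

# Made-up scores for six genes.
scores = {"g1": 4.5, "g2": 0.1, "g3": -4.0, "g4": 2.2, "g5": -1.3, "g6": -0.2}
all_genes, aml_genes = select_informative(scores, k_all=2, k_aml=2)
print(all_genes, aml_genes)
```

If k_all + k_aml exceeds the number of genes, the two lists overlap, so the caller is assumed to pick values well below that.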

Determine significant genes

A 1% significance level means that 1% of random neighborhoods contain as many points as the observed neighborhood.

For P(g,c) > 0.30, the observed neighborhood contains 709 genes (this is where it intersects the 1% level).

The median for random patterns is ~150 genes.

Weighted Voting

  • Given a new patient to classify, each of the selected genes casts a weighted vote for only one class.

  • The class that gets the most votes is the prediction.

Weighted Voting

  • Suppose that x is the expression level measured for gene g in the patient

    V = P(g,class) × |x - (mALL + mAML)/2|

  • |x - (mALL + mAML)/2|: distance from the measurement to the class boundary, reflecting the deviation of the expression level in the sample from the average of the AML and ALL means

  • P(g,class): weight for gene g, a weighting factor reflecting how well the gene is correlated with the class distinction

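  • Weighted vote:

    • VAML = Σ wi vi, summed over the genes i whose vote is for AML, where vi = |xi - (mAML + mALL)/2| and wi is the weight of gene i; VALL is defined in the same way over the genes voting for ALL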
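The voting scheme can be sketched as below. The slides give only the magnitude of each vote; deciding which class a gene votes for from the sign of P(g,class) × (x - midpoint) is an assumption here, as are the function name `weighted_vote` and the toy numbers.

```python
def weighted_vote(sample, genes):
    """Return (V_ALL, V_AML) for one patient.

    sample: {gene: expression level x}
    genes:  {gene: (P, m_all, m_aml)} for the selected informative genes
    """
    v_all = v_aml = 0.0
    for gene, (p, m_all, m_aml) in genes.items():
        midpoint = (m_all + m_aml) / 2        # class boundary for this gene
        signed = p * (sample[gene] - midpoint)
        # Assumption: a positive signed vote counts for ALL, otherwise AML.
        if signed > 0:
            v_all += signed
        else:
            v_aml += -signed
    return v_all, v_aml

# Made-up statistics for two selected genes.
genes = {"geneA": (4.5, 11.0, 2.0), "geneC": (-4.0, 2.0, 10.0)}
clear_case = weighted_vote({"geneA": 10.0, "geneC": 3.0}, genes)
mixed_case = weighted_vote({"geneA": 10.0, "geneC": 8.0}, genes)
print(clear_case, mixed_case)
```

In the first patient both genes vote for ALL; in the second, geneC sits on the AML side of its boundary and votes the other way.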

Prediction Strength

  • Can assess the “strength” of a prediction as follows:

    PS = (Vwinner – Vloser)/(Vwinner+ Vloser)

    where Vwinner is the summed vote (absolute value) from the winning class, and Vloser is the summed vote (absolute value) for the losing class
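As a small worked example, the prediction strength can be computed directly from the two vote totals. The function name and the handling of the zero-vote case are assumptions.

```python
def prediction_strength(v_all, v_aml):
    """PS = (V_winner - V_loser) / (V_winner + V_loser), from vote totals."""
    v_winner, v_loser = max(v_all, v_aml), min(v_all, v_aml)
    total = v_winner + v_loser
    if total == 0:
        return 0.0  # assumption: no votes at all means no confidence
    return (v_winner - v_loser) / total

print(prediction_strength(15.75, 8.0))   # moderately confident
print(prediction_strength(12.0, 12.0))   # tie between the classes
print(prediction_strength(5.0, 0.0))     # unanimous vote
```

PS ranges from 0 (the two classes tie) to 1 (every gene votes for the same class).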

Prediction Strength

  • When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold…

  • Prediction =

    • ALL, if VALL > VAML and PS > q

    • AML, if VAML > VALL and PS > q

    • No-call, otherwise

    (q is the prediction strength threshold)


  • Cross validation with the original set of patients

    • For i = 1 to 38

      • Hold the ith sample aside

      • Use the other 37 samples to determine weights

      • With this set of weights, make a prediction on the ith sample

  • Testing with another set of 34 patients…
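The decision rule and the hold-one-out loop fit together roughly as in the sketch below. It is a toy pipeline under stated assumptions: synthetic data with six genes, k = 2 genes per class instead of 25, a threshold of 0.3 chosen for illustration, the sign of each vote deciding its class, and hypothetical function names throughout.

```python
import random
from statistics import mean, stdev

def fit(train_x, train_y, k=2):
    """Return {gene_index: (weight, midpoint)} for the top-k and bottom-k genes."""
    stats = []
    for g in range(len(train_x[0])):
        a = [row[g] for row, c in zip(train_x, train_y) if c == "ALL"]
        b = [row[g] for row, c in zip(train_x, train_y) if c == "AML"]
        p = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        stats.append((p, g, (mean(a) + mean(b)) / 2))
    stats.sort(reverse=True)                      # rank genes by P(g, class)
    chosen = stats[:k] + stats[-k:]               # informative genes
    return {g: (p, mid) for p, g, mid in chosen}

def classify(model, row, threshold=0.3):
    """Weighted vote followed by the prediction-strength no-call rule."""
    v_all = v_aml = 0.0
    for g, (p, mid) in model.items():
        vote = p * (row[g] - mid)
        if vote > 0:
            v_all += vote
        else:
            v_aml -= vote
    total = v_all + v_aml
    ps = abs(v_all - v_aml) / total if total else 0.0
    if ps <= threshold:
        return "No-call"
    return "ALL" if v_all > v_aml else "AML"

def leave_one_out(xs, ys, k=2, threshold=0.3):
    """Hold each sample aside, refit on the rest, and predict the held-out one."""
    results = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], k)
        results.append(classify(model, xs[i], threshold))
    return results

# Synthetic patients: genes 0-1 high in ALL, genes 2-3 high in AML, 4-5 noise.
rng = random.Random(0)

def make_sample(label):
    hi_all = 10.0 if label == "ALL" else 0.0
    means = [hi_all, hi_all, 10.0 - hi_all, 10.0 - hi_all, 5.0, 5.0]
    return [m + rng.gauss(0, 1) for m in means]

ys = ["ALL"] * 8 + ["AML"] * 6
xs = [make_sample(y) for y in ys]
predictions = leave_one_out(xs, ys)
print(sum(p == y for p, y in zip(predictions, ys)), "of", len(ys), "correct")
```

Gene selection is repeated inside every fold, so the held-out sample never influences which genes vote on it.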

Prediction: Results

  • Leave-one-out cross-validation on the "training set" (37 train, 1 test): predictions made for 36/38 samples with 100% accuracy; 2 no-calls

  • Independent "test set" consisted of 34 samples

    • 24 bone marrow samples, 10 peripheral blood samples

    • NOTE: "training set" was ONLY bone marrow samples

    • "test set" also contained childhood AML samples and samples from different laboratories

  • Strong predictions (PS=0.77) for 29/34 samples with 100% accuracy

  • Low prediction strength from questionable laboratory

Selection of 8-200 genes gives roughly the same prediction quality.

Cancer Class Discovery

  • Given

    • Expression profiles of leukemia patients

  • Do

    • Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients

Cancer Class Discovery Experiment

  • Cluster the expression profiles of 38 patients in the training set

    • Using self-organizing maps with a predefined number of clusters (say, k)

  • Run with k = 2

    • Cluster 1 contained 1 AML, 24 ALL

    • Cluster 2 contained 10 AML, 3 ALL
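To make the clustering step concrete, here is a very small one-dimensional self-organizing map in plain Python. It is a teaching sketch, not the software used in the study: the linear decay schedules, the Gaussian neighborhood function, the initialization from data points, and the two-dimensional toy profiles are all assumptions.

```python
import math
import random

def nearest_node(nodes, x):
    """Index of the best-matching unit (closest node) for sample x."""
    return min(range(len(nodes)), key=lambda j: math.dist(nodes[j], x))

def train_som(data, k=2, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D self-organizing map with k nodes; return node weights."""
    rng = random.Random(seed)
    nodes = [list(x) for x in rng.sample(data, k)]  # init from data points
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            decay = 1 - step / total_steps
            lr = lr0 * decay          # learning rate shrinks toward 0
            sigma = decay + 0.01      # neighborhood width shrinks too
            bmu = nearest_node(nodes, x)
            for j in range(k):
                # Nodes close to the best-matching unit on the grid move more.
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                nodes[j] = [w + lr * h * (xi - w) for w, xi in zip(nodes[j], x)]
            step += 1
    return nodes

# Two well-separated groups of made-up expression profiles.
group_a = [(0.0, 0.2), (0.3, -0.1), (-0.2, 0.1), (0.1, 0.0)]
group_b = [(10.0, 9.8), (9.7, 10.2), (10.1, 10.0), (9.9, 10.1)]
data = group_a + group_b
nodes = train_som(data, k=2)
clusters = [nearest_node(nodes, x) for x in data]
print(clusters)
```

No class labels are passed in: the grouping comes from the profiles alone, and the labels are only compared with the clusters afterwards.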

Cancer Class Discovery Experiment

  • Run with k = 4

    • Cluster 1 contained mostly AML

    • Cluster 2 contained mostly T-cell ALL

    • Cluster 3 contained mostly B-cell ALL

    • Cluster 4 contained mostly B-cell ALL

  • The clusters suggest that the algorithm was able to discover the distinction between T-cell and B-cell ALL cases