Presentation Transcript
Instance-based Classification
  • Examine the training samples each time a new query instance is given.
  • The relationship between the new query instance and training examples will be checked to assign a class label to the query instance.
KNN: k-Nearest Neighbor
  • A test sample x is best predicted by determining the most common class label among the k training samples to which x is most similar.
  • Notation: xj is the jth training sample, yj the class label of xj, and Nx the set of k nearest neighbors of x in the training set. The probability that x belongs to the ith class is estimated as

P(i | x) = (1/k) Σ over xj in Nx of I(yj = i)

where I(yj = i) equals 1 when yj = i and 0 otherwise.
KNN: k-Nearest Neighbor, con’t
  • The estimate is simply the proportion of the k nearest neighbors that belong to the ith class.
  • The class i that maximizes this proportion is assigned as the label of x.
  • Variants of KNN: filter out irrelevant genes before applying KNN.
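The voting scheme above can be sketched in a few lines of numpy. This is a minimal illustration, not code from the lecture; the function name and toy data are invented here:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples.

    X_train: (m, n) array of training samples; y_train: (m,) class labels.
    Returns (label, probs) where probs maps each class label to the
    proportion of the k nearest neighbors carrying that label.
    """
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors (the set Nx)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    probs = dict(zip(labels, counts / k))      # estimated P(class i | x)
    return labels[np.argmax(counts)], probs

# Toy data: two well-separated classes in the plane
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
label, probs = knn_predict(X, y, np.array([0.05, 0.1]), k=3)
```

With k = 3, the query point sits next to both class-0 samples, so two of its three neighbors vote for class 0.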
Molecular Classification of Cancer

Class Discovery and Class Prediction by Gene Expression Monitoring


Publication Info

  • "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
  • Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Lo, Downing, Caligiuri, Bloomfield, Lander
    • Appears in Science Volume 286, October 15, 1999
  • Whitehead Institute/MIT Center for Genome Research
  • http://www-genome.wi.mit.edu/cancer
    • ...and Dana-Farber (Boston), St. Jude (Memphis), Ohio State
  • ...additional publications by the same group show similar techniques applied to other disease modalities.

Cancer Classification

Class Discovery: defining previously unrecognized tumor subtypes

Class Prediction: assignment of particular tumor samples to already-defined classes

  • Given bone marrow samples:
    • Which cancer classes are present among the samples?
    • How many cancer classes are there: 2? 4?
    • Given samples are from leukemia patients, what type of leukemia is each sample (AML vs ALL)?

Leukemia: Definitions & Symptoms

  • Cancer of bone marrow
  • Myelogenous or lymphocytic, acute or chronic
  • Acute Myelogenous Leukemia (AML) vs Acute Lymphocytic Leukemia (ALL)
  • Marrow cannot produce appropriate amount of red and white blood cells
  • Anemia -> weakness, minor infections; Platelet deficiency -> easy bruising
  • AML: 10,000 new adult cases per year
  • ALL: 3,500/2,400 new adult/child cases per year
  • AML vs. ALL in adults & children

Leukemia: Treatment & expected outcome

  • Diagnosis via highly specialized laboratory
  • ALL: 58% survival rate
  • AML: 14% survival rate
  • Treatment: chemotherapy, bone marrow transplant
    • ALL: corticosteroids, vincristine, methotrexate, L-asparaginase
    • AML: daunorubicin, cytarabine
  • Correct diagnosis very important for treatment options and expected outcome!!!
  • Microarray could provide systematic diagnosis option
  • BUT ONLY ONE TYPE OF DIAGNOSTIC TOOL!!!

Leukemia: Data set

  • 38 bone marrow samples (27 ALL, 11 AML)
  • 6817 human gene probes
Cancer Class Prediction
  • Learning Task
    • Given: Expression profiles of leukemia patients
    • Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data.
  • Classification Task
    • Given: Expression profile of a new patient + A learned model (e.g., one computed in a learning task)
    • Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)
Cancer Class Prediction
  • n genes measured in m patients

g1,1 … g1,n → class1
g2,1 … g2,n → class2
  ⋮
gm,1 … gm,n → classm

(each row is the expression vector for one patient)
Cancer Class Prediction Approach
  • Rank genes by their correlation with class variable (AML/ALL)
  • Select subset of “informative” genes
  • Have these genes do a weighted vote to classify a previously unclassified patient.
  • Test validity of predictors.
Ranking Genes
  • Rank genes by how predictive they are (individually) of the class…

g1,1 … g1,n → class1
g2,1 … g2,n → class2
  ⋮
gm,1 … gm,n → classm

Ranking Genes
  • Split the expression values for a given gene g into two pools – one for each class (ALL vs. AML)
  • Determine the mean m and standard deviation s of each pool
  • Rank genes by the correlation metric (separation):

P(g, class) = (mALL - mAML)/(sALL + sAML)

This is the mean difference between the classes relative to the standard deviation within the classes.
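The pooling-and-ranking step can be sketched as follows; the function name and the toy expression matrix are illustrative assumptions, with y = 1 marking ALL patients and y = 0 marking AML patients:

```python
import numpy as np

def signal_to_noise(X, y):
    """Per-gene correlation metric P(g, class) = (mALL - mAML)/(sALL + sAML).

    X: (patients, genes) expression matrix; y: 1 for ALL, 0 for AML.
    Large positive scores mark ALL-correlated genes, large negative
    scores mark AML-correlated genes.
    """
    mu_all = X[y == 1].mean(axis=0)     # pool means, one per gene
    mu_aml = X[y == 0].mean(axis=0)
    sd_all = X[y == 1].std(axis=0)      # pool standard deviations
    sd_aml = X[y == 0].std(axis=0)
    return (mu_all - mu_aml) / (sd_all + sd_aml)

# Toy matrix: 4 patients x 2 genes; gene 0 separates the classes, gene 1 barely does
X = np.array([[1.0, 5.0],
              [1.2, 5.2],
              [3.0, 5.1],
              [3.2, 4.9]])
y = np.array([1, 1, 0, 0])              # first two patients ALL, last two AML
scores = signal_to_noise(X, y)
```

Gene 0 gets a score of -10 (strongly AML-correlated), while gene 1 scores only 0.5, so ranking by this metric puts the discriminating gene at the extreme of the list.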


Neighborhood Analysis

Each gene g is represented by a vector V(g) = (e1, e2, …, en), where ei is the expression level of gene g in the ith sample.

The idealized pattern is c = (c1, c2, …, cn), where ci is 1 or 0 according to whether the ith sample belongs to class 1 or class 2.

Let c* be an idealized random pattern. Count the number of genes having various levels of correlation with c, and compare this with the corresponding distribution obtained for the random pattern c*.

Selecting Informative Genes
  • Select the kALL top-ranked genes (most positive P(g, class), highly expressed in ALL) and the kAML bottom-ranked genes (most negative, highly expressed in AML)

P(g, class) = (mALL - mAML)/(sALL + sAML)

In Golub’s paper, the 25 most positively correlated and the 25 most negatively correlated genes are selected.
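Given the ranking scores, the selection step is a simple sort; this sketch assumes a `scores` array of P(g, class) values (the helper name is invented):

```python
import numpy as np

def select_informative(scores, k_all=25, k_aml=25):
    """Pick the k_all genes with the highest P(g, class) (ALL-correlated)
    and the k_aml genes with the lowest (AML-correlated)."""
    order = np.argsort(scores)           # indices, ascending by score
    aml_genes = order[:k_aml]            # most negative scores: AML markers
    all_genes = order[-k_all:]           # most positive scores: ALL markers
    return np.concatenate([all_genes, aml_genes])

scores = np.array([0.5, -2.0, 3.0, 0.1, -0.7])
chosen = select_informative(scores, k_all=1, k_aml=1)
```

With one gene per side, the call returns gene 2 (score 3.0, the strongest ALL marker) and gene 1 (score -2.0, the strongest AML marker).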


Determining Significant Genes

A 1% significance level means that only 1% of random neighborhoods contain as many genes as the observed neighborhood.

At P(g, c) > 0.30 the observed count is 709 genes (where it intersects the 1% level); the median under a totally random pattern is ~150 genes.

Weighted Voting
  • Given a new patient to classify, each of the selected genes casts a weighted vote for only one class.
  • The class that gets the most votes is the prediction.
Weighted Voting
  • Suppose that x is the expression level measured for gene g in the patient:

V = P(g, class) × |x – (mALL + mAML)/2|

The second factor is the distance from the measurement to the class boundary – the deviation of the sample’s expression level from the midpoint of the AML and ALL means. The first factor, P(g, class), is the weight for gene g – it reflects how well the gene is correlated with the class distinction.
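A single gene's vote can be sketched as below. The function is a hedged illustration of the formula above, not the paper's code; the sign convention assumed here is that a positive weight marks an ALL-correlated gene, so an expression level above the midpoint votes ALL:

```python
def gene_vote(x, m_all, m_aml, weight):
    """One gene's weighted vote for a new patient.

    x: the patient's expression level for this gene; weight: P(g, class).
    The sign of weight * (x - midpoint) picks the class the gene votes
    for; the absolute value is the vote's magnitude V.
    """
    midpoint = (m_all + m_aml) / 2       # the class boundary
    v = weight * (x - midpoint)
    return ("ALL" if v > 0 else "AML"), abs(v)

# A gene highly expressed in ALL (positive weight), measured above the boundary
cls, magnitude = gene_vote(x=5.0, m_all=6.0, m_aml=2.0, weight=1.5)
```

Here the midpoint is 4.0, so the gene casts a vote of magnitude 1.5 × |5.0 - 4.0| = 1.5 for ALL.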


Prediction

  • Weighted vote:
    • VAML = Σ vi wi, summed over genes i that vote for AML, where vi = |xi – (mAML + mALL)/2| and wi = P(gi, class)
Prediction Strength
  • Can assess the “strength” of a prediction as follows:

PS = (Vwinner – Vloser)/(Vwinner+ Vloser)

where Vwinner is the summed vote (absolute value) from the winning class, and Vloser is the summed vote (absolute value) for the losing class

Prediction Strength
  • When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold…
  • Prediction =
    • ALL, if VALL > VAML and PS > q
    • AML, if VAML > VALL and PS > q
    • No-call, otherwise.
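Combining the votes and applying the threshold can be sketched as follows (an illustrative function, with the threshold q passed as `theta`; 0.3 is an arbitrary default here, not a value from the slides):

```python
def predict(votes, theta=0.3):
    """Combine per-gene votes and apply the prediction-strength threshold.

    votes: (class, magnitude) pairs from individual genes.
    Returns "ALL", "AML", or "No-call" when PS = (Vwinner - Vloser) /
    (Vwinner + Vloser) does not exceed theta.
    """
    v_all = sum(m for c, m in votes if c == "ALL")
    v_aml = sum(m for c, m in votes if c == "AML")
    winner, loser = max(v_all, v_aml), min(v_all, v_aml)
    ps = (winner - loser) / (winner + loser)
    if ps <= theta:
        return "No-call"
    return "ALL" if v_all > v_aml else "AML"

confident = predict([("ALL", 3.0), ("ALL", 2.0), ("AML", 1.0)])  # PS = 4/6
marginal = predict([("ALL", 1.1), ("AML", 1.0)])                 # PS = 0.1/2.1
```

The first case clears the threshold and is called ALL; the second is too close to call.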
Experiments
  • Cross validation with the original set of patients
    • For i = 1 to 38
      • Hold the ith sample aside
      • Use the other 37 samples to determine weights
      • With this set of weights, make a prediction on the ith sample
  • Testing with another set of 34 patients…
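The leave-one-out loop above can be sketched generically; the nearest-class-mean classifier used below is only a stand-in to exercise the loop (the real experiment refits the weighted-voting model on each fold):

```python
import numpy as np

def loocv_accuracy(X, y, fit, predict):
    """Leave-one-out cross-validation: hold each sample out in turn,
    train on the remaining samples, and test on the held-out one."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i    # all samples except the ith
        model = fit(X[mask], y[mask])
        hits += predict(model, X[i]) == y[i]
    return hits / len(y)

# Stand-in classifier: assign to the class with the nearest mean
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, x):
    return min(model, key=lambda c: np.linalg.norm(x - model[c]))

X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])
acc = loocv_accuracy(X, y, fit, predict)
```

On this trivially separable toy set every held-out sample is classified correctly.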

Prediction: Results

  • "Training set" results via cross-validation (train on 37, test on 1): 36/38 samples predicted, all correctly; 2 no-calls
  • Independent "test set" consisted of 34 samples
    • 24 bone marrow samples, 10 peripheral blood samples
    • NOTE: "training set" was ONLY bone marrow samples
    • "test set" contained childhood AML samples, different laboratories
  • Strong predictions (median PS = 0.77) for 29/34 samples, with 100% accuracy
  • Low prediction strength from questionable laboratory

Selection of 8-200 genes gives roughly the same prediction quality.

Cancer Class Discovery
  • Given
    • Expression profiles of leukemia patients
  • Do
    • Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients
Cancer Class Discovery Experiment
  • Cluster the expression profiles of 38 patients in the training set
    • Using self-organizing maps with a predefined number of clusters (say, k)
  • Run with k = 2
    • Cluster 1 contained 1 AML, 24 ALL
    • Cluster 2 contained 10 AML, 3 ALL
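The paper uses self-organizing maps for this step; as a rough stand-in, the sketch below partitions profiles with plain k-means, a different algorithm chosen here only because it illustrates unsupervised clustering into a predefined k without extra dependencies:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: partition rows of X into k clusters without
    ever looking at the class labels (a stand-in for the SOM)."""
    rng = np.random.default_rng(seed)
    # start from k distinct profiles as initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every profile to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its assigned profiles
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

# Two well-separated groups of "patients" (rows = profiles)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = kmeans(X, k=2)
```

With k = 2 the two groups end up in separate clusters, mirroring the AML-dominated vs. ALL-dominated split the paper reports.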
Cancer Class Discovery Experiment
  • Run with k = 4
    • Cluster 1 contained mostly AML
    • Cluster 2 contained mostly T-cell ALL
    • Cluster 3 contained mostly B-cell ALL
    • Cluster 4 contained mostly B-cell ALL
  • Notably, the clustering algorithm was able to discover the distinction between T-cell and B-cell ALL cases without being told about it