Knowledge-based Analysis of Microarray Gene Expression Data using Support Vector Machines

Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler

Proceedings of the National Academy of Sciences. 2000


Overview

  • Objective: Classify genes based on functionality

  • Observation: Genes of similar function yield similar expression patterns in microarray hybridization experiments

  • Method: Use SVMs to build classifiers from microarray gene expression data


Previous Methods

  • At the time of publication, most methods employed unsupervised learning

  • Genes are grouped using clustering algorithms based on a distance measure

    • Hierarchical clustering

    • Self-organizing maps


DNA Microarray Data

  • Each data point is the ratio of a particular gene's expression level in an experimental condition to its level in a reference condition

    • n genes on a single chip

    • m experiments performed

    • The result is an n-by-m matrix of expression-level ratios

[Diagram: the n-by-m expression matrix; rows correspond to the n genes (each row is the m-element expression vector for a single gene), columns correspond to the m experiments]


DNA Microarray Data (continued)

  • Normalized logarithmic ratio

    • For gene X, in experiment i, define:

      • Ei is the expression level in the experimental condition

      • Ri is the expression level in the reference state

      • Xi = log(Ei/Ri) / sqrt(Σj log²(Ej/Rj)) is the normalized logarithmic ratio, so each gene's m-element expression vector X = (X1, X2, ..., Xm) has unit length (see the sketch after this list)

  • Xi is positive when the gene is induced (turned up)

  • Xi is negative when the gene is repressed (turned down)
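
A minimal numpy sketch of this normalization; the array contents and shapes are illustrative toy values, not data from the paper:

```python
import numpy as np

def normalized_log_ratios(ratios):
    """Normalize an (n_genes, m_experiments) array of E/R expression ratios.

    Each entry is log-transformed, then every gene's row is scaled to unit
    length, matching Xi = log(Ei/Ri) / sqrt(sum_j log^2(Ej/Rj)).
    """
    logs = np.log(ratios)                                    # log(Ei/Ri) per gene, per experiment
    norms = np.sqrt((logs ** 2).sum(axis=1, keepdims=True))  # per-gene Euclidean norm
    return logs / norms                                      # unit-length expression vectors

# Toy example: 3 genes measured in 4 experiments (made-up ratios).
ratios = np.array([[2.0, 0.5, 1.2, 0.8],
                   [0.3, 0.4, 0.2, 0.5],
                   [1.0, 1.1, 0.9, 1.05]])
X = normalized_log_ratios(ratios)
print(X.shape)                    # (3, 4)
print(np.linalg.norm(X, axis=1))  # each row has unit length
```

Entries greater than zero correspond to induced genes (ratio above 1), entries below zero to repressed genes.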


Support Vector Machines

  • An SVM searches for a separating hyperplane that

    • Maximizes the margin

    • Minimizes violations of the margin (the soft-margin objective is sketched below)
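
For reference, these two goals correspond to the standard textbook soft-margin objective (this notation is not taken from the slides): w and b define the hyperplane, the slack variables ξi measure margin violations, and C trades off the two goals.

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w \rVert^{2} \;+\; C \sum_{i=1}^{N} \xi_{i}
\qquad \text{subject to} \qquad
y_{i}\,(w \cdot x_{i} + b) \;\ge\; 1 - \xi_{i}, \quad \xi_{i} \ge 0 .
```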

[Figure: maximum-margin hyperplane illustration; credit: Edda Leopold and Jörg Kindermann]


Linear Inseparability

  • What if data points are not linearly separable?

[Figure: linearly inseparable data; credit: Andrew W. Moore]


Linear Inseparability (continued)

  • Map the data to a higher-dimensional space

[Figure: data mapped to a higher-dimensional space where it becomes linearly separable; credit: Andrew W. Moore]


Linear Inseparability (continued)

  • Problems with mapping data to a higher-dimensional space

    • Overfitting

      • The SVM chooses the maximum-margin hyperplane, which helps control overfitting

    • High computational cost

      • SVM kernels only require dot products between points, which are cheap to compute


SVM Kernels

  • K(X, Y) is a function that computes a measure of similarity between X and Y (code sketches of these kernels follow this list)

    • Dot product

      • K(X, Y) = X · Y

      • Simplest kernel; gives a linear hyperplane

    • Degree-d polynomial

      • K(X, Y) = (X · Y + 1)^d

    • Gaussian

      • K(X, Y) = exp(-‖X - Y‖² / (2σ²))
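
Each kernel above is a simple function of a dot product or a squared distance. A minimal numpy sketch of the three kernels (the degree and sigma defaults are illustrative choices, not values from the paper):

```python
import numpy as np

def linear_kernel(x, y):
    """Dot-product kernel: K(X, Y) = X . Y"""
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3):
    """Degree-d polynomial kernel: K(X, Y) = (X . Y + 1)^d"""
    return (np.dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))"""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Example on two toy expression vectors.
x = np.array([0.2, -0.5, 0.1])
y = np.array([0.3, -0.4, 0.0])
print(linear_kernel(x, y), polynomial_kernel(x, y, d=2), gaussian_kernel(x, y))
```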


Experimental Dataset

  • Expression data from the budding yeast

    • 2467 genes (n)

    • 79 experiments (m)

    • Dataset available on the Stanford web site

  • Six functional classes

    • From the Munich Information Center for Protein Sequences (MIPS) Yeast Genome Database

    • Class definitions come from biochemical and genetic studies

  • Training data:

    • Positive examples: the set of genes known to share the common function

    • Negative examples: the set of genes known not to be members of that functional class


Experimental Design

  • Compare the performance of the following methods (see the sketch after this list)

    • SVM (with degree 1 polynomial kernel, i.e. linear)

    • SVM (with degree 2 kernel)

    • SVM (with degree 3 kernel)

    • SVM (Gaussian)

    • Parzen Windows

    • Fisher’s Linear Discriminant

    • C4.5 Decision Trees

    • MOC1 Decision Trees
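
A hedged sketch of how the four SVM variants in this comparison could be trained and cross-validated today. It uses scikit-learn's SVC as a stand-in for the authors' original software, and the random X and y below are placeholders for the real expression matrix and ±1 class labels:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

# Placeholder data: stand-ins for the (n_genes x 79) normalized log-ratio
# matrix and the +1/-1 functional-class labels from the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 79))
y = np.where(rng.random(200) < 0.1, 1, -1)

# gamma=1, coef0=1 makes the polynomial kernel equal to (X.Y + 1)^d.
svms = {
    "degree-1 polynomial (linear)": SVC(kernel="poly", degree=1, gamma=1, coef0=1),
    "degree-2 polynomial": SVC(kernel="poly", degree=2, gamma=1, coef0=1),
    "degree-3 polynomial": SVC(kernel="poly", degree=3, gamma=1, coef0=1),
    "Gaussian (RBF)": SVC(kernel="rbf"),
}

for name, clf in svms.items():
    pred = cross_val_predict(clf, X, y, cv=3)  # 3-fold cross-validated predictions
    fp = int(np.sum((pred == 1) & (y == -1)))  # false positives
    fn = int(np.sum((pred == -1) & (y == 1)))  # false negatives
    print(f"{name}: fp={fp}, fn={fn}")
```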


Experimental Design (continued)

  • Define the cost of method M as

    • C(M) = fp(M) + 2·fn(M)

    • False negatives are weighted more heavily because negative examples greatly outnumber positives

  • The cost of each method is compared to

    • C(N) = the cost of classifying every gene as negative

  • The cost savings of method M is (see the sketch after this list)

    • S(M) = C(N) - C(M)
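
A small numeric sketch of these cost and savings definitions. It assumes the false-positive and false-negative counts for a method are already in hand and that n_pos is the number of genes in the functional class; the example numbers are made up, not results from the paper:

```python
def cost(fp, fn):
    """C(M) = fp(M) + 2*fn(M): false negatives are weighted twice as heavily."""
    return fp + 2 * fn

def savings(fp, fn, n_pos):
    """S(M) = C(N) - C(M), where N classifies every gene as negative.

    The all-negative baseline makes no false-positive calls and misses all
    n_pos members of the class, so C(N) = 2 * n_pos.
    """
    return cost(0, n_pos) - cost(fp, fn)

# Illustrative numbers only.
print(cost(fp=4, fn=3))               # 4 + 2*3 = 10
print(savings(fp=4, fn=3, n_pos=20))  # 40 - 10 = 30
```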


Experimental Results

  • SVMs outperform other methods

  • All classifiers fail to recognize the helix-turn-helix (HTH) protein class

    • This is expected

    • Members of this class are not “similarly regulated”


Consistently Misclassified Genes

  • 20 genes are consistently misclassified by all four SVM kernels across different experiments

    • This reflects discrepancies between the expression data and class definitions based on protein structure

    • Many of the false positives are known to be important for the functional class, even though they are not annotated as members of it

