
Source: Gert Lanckriet's slides


Presentation Transcript


    1. Source: Gert Lanckriet’s slides In this… KM / SDP

    2. Project and Presentation Final project Due on April 26th Length: 20-30 pages Hand in a hard copy, as well as an electronic copy of the report and any related source code (Blackboard). Presentation (April 26th and May 1st) 5-minute presentation for each student Email the TA your slides one day before the presentation: jianhuic@gmail.com

    3. Overview Find a mapping f such that, in the new feature space, problem solving is easier (e.g. linear). SVM, PCA, LDA, CCA, etc The kernel is defined as the inner product between data points in this new feature space. Similarity measure Valid kernels Kernel construction Kernels from pairwise similarities Diffusion kernels for graphs Kernels for vectors Kernels for abstract data Documents (string) Protein Sequence New kernels from existing kernels
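The feature-map view above can be made concrete: the kernel trick evaluates the inner product in the new feature space without ever computing the mapping f explicitly. A minimal numpy sketch (illustrative, not from the slides), using the degree-2 homogeneous polynomial kernel and its explicit feature map:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel
    # in 2-D: (x1^2, sqrt(2)*x1*x2, x2^2).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    # Kernel trick: the same inner product, without mapping explicitly.
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# Both routes give the same similarity value.
assert np.isclose(phi(x) @ phi(z), poly_kernel(x, z))
```

For a degree-d polynomial kernel in n dimensions the explicit map has on the order of n^d coordinates, while the kernel evaluation stays O(n), which is why working with the kernel instead of the map pays off.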

    4. How to choose the optimal kernel? Many different types of kernels for the same data. Different kernels for protein sequences Many different kernels from different types of data Different data sources in bioinformatics Question: How to choose the optimal kernel? Active research in machine learning Simple approach: kernel alignment (with the kernel based on label)
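The simple kernel-alignment approach mentioned above can be sketched as follows (a numpy sketch under my own toy data, not code from the slides): the ideal target kernel yy^T is built from the labels, and each candidate kernel is scored by its normalized Frobenius inner product with it.

```python
import numpy as np

def alignment(K1, K2):
    # Empirical alignment <K1, K2>_F / (||K1||_F ||K2||_F);
    # for PSD matrices the value lies in [0, 1].
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

y = np.array([1, 1, -1, -1])
K_target = np.outer(y, y)               # ideal kernel built from the labels
K_good = np.outer(y, y).astype(float)   # perfectly aligned candidate
K_bad = np.eye(4)                       # uninformative candidate

assert np.isclose(alignment(K_target, K_good), 1.0)
assert alignment(K_target, K_good) > alignment(K_target, K_bad)
```

The kernel with the higher alignment score would be the "simple approach" pick.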

    5. Outline of lecture Introduction Kernel based learning Kernel design for different data sources Learning the optimal Kernel Experiments

    6. During the past decade, a heterogeneous spectrum of data became available describing the genome: - Sequence data -> similarities between proteins / genes - mRNA expression levels associated with a gene, under different experimental conditions

    7. Membrane protein prediction

    8. Different data sources are likely to contain different and thus partly independent information about the task at hand. Protein-protein interactions are best expressed as graphs.

    9. Kernel-based learning methods have already proven to be a very useful tool in bioinformatics.

    10. Kernel methods work by embedding data items (genes, proteins, etc.) into a (possibly high-dimensional) Euclidean vector space. The embedding is performed implicitly: instead of giving explicit coordinates, the inner product is specified. This can be done by defining a kernel function that specifies the inner product between any pair of data items (whichever are needed); this function can be regarded as a similarity measure between data items. When the amount of data is finite (e.g., here: a finite number of genes under consideration), the kernel values between all pairs of data points can be organized in a kernel matrix: the Gram matrix, which fully describes the embedding. A matrix that is symmetric and positive semidefinite is a valid kernel matrix in the sense that a mapping/embedding exists.
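The validity condition in these notes (symmetric and positive semidefinite implies an embedding exists) is easy to test numerically; a small numpy sketch with an illustrative random Gram matrix:

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    # A valid Gram matrix must be symmetric and positive semidefinite;
    # check symmetry directly and PSD via the eigenvalue spectrum.
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.default_rng(0).normal(size=(5, 3))
K = X @ X.T    # a matrix of inner products is always a valid kernel matrix
assert is_valid_kernel_matrix(K)

# Symmetric but indefinite (eigenvalues -1 and 1): not a valid kernel.
assert not is_valid_kernel_matrix(np.array([[0.0, 1.0], [1.0, 0.0]]))
```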

    14. Find a good hyperplane (w, b) ∈ R^(d+1) that classifies this and future data points as well as possible

    16. Intuition (Vapnik, 1965), if linearly separable: Separate the data Place the hyperplane “far” from the data: large margin

    17. If not linearly separable: Allow some errors Still, try to place the hyperplane “far” from each class
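The soft-margin idea above (allow some errors, keep the hyperplane far from each class) can be sketched with sub-gradient descent on the hinge loss. This is a toy numpy illustration with made-up data, not the dual QP formulation used in the lecture:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    # Sub-gradient descent on the soft-margin objective
    # lam/2 * ||w||^2 + mean(max(0, 1 - y * (X w + b))).
    # The hinge terms play the role of the slack variables ("allow errors").
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # points violating the margin
        gw = lam * w - (X[active] * y[active, None]).sum(axis=0) / len(y)
        gb = -y[active].sum() / len(y)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Tiny separable toy set (illustrative).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
assert np.array_equal(np.sign(X @ w + b), y)
```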

    20. Hand-writing recognition (e.g., USPS) Computational biology (e.g., micro-array data) Text classification Face detection Face expression recognition Time series prediction (regression) Drug discovery (novelty detection)

    21. Kernel-based learning methods represent data by means of a kernel matrix or function, which defines similarities between pairs of genes, proteins, etc. Such similarities can be established using a broad spectrum of data (examples later on); as long as the corresponding kernel matrix is positive semidefinite, it’s OK! In that case, we can interpret it as inner products in some high-dimensional space, in which we can train our favorite linear classification algorithm. Kernel matrix <-> kernel function. So we can have a very heterogeneous set of data, and every kernel function/matrix is geared towards a specific type of data, thus extracting a specific type of information from a data set. Just as each set of data describes the genome partially, in a heterogeneous way, so does each kernel, but in a homogeneous way (all compatible matrices). So here we have a chance to fuse the many partial descriptions of the data: by combining those compatible kernel matrices in a way that is statistically optimal, computationally efficient and robust, we can try to find a kernel K that best represents all of the information available for a given learning task. We’ll explain further on how this can be accomplished.

    23. Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.

    24. Normalized scalar product Similar vectors receive high values, and vice versa.
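The normalized scalar product on expression vectors is just the cosine similarity; a minimal numpy sketch (with made-up expression profiles):

```python
import numpy as np

def cosine_kernel(X):
    # Normalized scalar product: k(x, z) = <x, z> / (||x|| ||z||).
    # Similar profiles get values near 1, dissimilar ones near 0 or -1.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],    # same direction as row 0
              [-1.0, 0.0, 1.0]])
K = cosine_kernel(X)
assert np.isclose(K[0, 1], 1.0)   # parallel profiles: maximal similarity
assert np.allclose(np.diag(K), 1.0)
```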

    25. Use general similarity measurement for vector data: Gaussian kernel
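The Gaussian kernel mentioned above can be computed for a whole data set at once; a short numpy sketch (the bandwidth sigma is an illustrative choice):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)): a general-purpose
    # similarity for vector data, always a valid (PSD) kernel.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances
    return np.exp(-d2 / (2 * sigma**2))

X = np.array([[0.0, 0.0], [3.0, 4.0]])   # points at distance 5
K = gaussian_kernel(X, sigma=2.0)
assert np.allclose(np.diag(K), 1.0)      # every point is maximally
                                         # similar to itself
```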

    30. Pairwise interactions can be represented as a graph or a matrix. The simplest kernel counts the number of shared interactions between each pair.
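Counting shared interactions, as described above, amounts to one matrix product on the adjacency matrix. A numpy sketch on a hypothetical 4-protein interaction graph:

```python
import numpy as np

# Hypothetical interaction graph as a symmetric adjacency matrix A.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# K[i, j] = number of interaction partners proteins i and j share.
# As a matrix of inner products between rows of A, K is automatically PSD.
K = A @ A.T

assert K[0, 1] == 1.0   # proteins 0 and 1 share one partner (protein 2)
```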

    31. A general method for establishing similarities between nodes of a graph. Based upon a random walk. Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
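The random-walk intuition above corresponds to the matrix exponential of the (negative) graph Laplacian, which sums contributions of all paths between two nodes with longer paths down-weighted. A numpy sketch (the 3-node path graph and the diffusion parameter beta are illustrative choices):

```python
import numpy as np

def diffusion_kernel(A, beta=0.5):
    # K = exp(beta * H) with H = A - D, the negative graph Laplacian.
    # Computed via the eigendecomposition of the symmetric matrix H.
    H = A - np.diag(A.sum(axis=1))
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.exp(beta * vals)) @ vecs.T

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph on 3 nodes
K = diffusion_kernel(A)

# A valid kernel: symmetric with strictly positive eigenvalues.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > 0)
```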

    32. Integral plasma membrane proteins serve several functions. Often, one divides them into four classes: transporters, linkers, enzymes and receptors. - Transporters serve as gates through the cell membrane, generally for charged or polar molecules that otherwise could not pass the hydrophobic lipid bilayer the plasma membrane consists of. - Linkers have a structural function in the cell membrane. - Some membrane proteins are merely enzymes, moderating biochemical reactions inside or outside the cell. - Receptors are capable of receiving biochemical signals from inside or outside the cell, thus triggering a reaction on the other side of the membrane. In particular, inside the membrane, receptors often interact with kinases (kinase is a generic name for enzymes that attach a phosphate to a protein, opposite in action to phosphatases; these enzymes are important metabolic regulators), thus initiating a signaling pathway in the cell triggered by an extracellular stimulus.

    33. We will develop a kernel motivated by the low-frequency alternation of hydrophobic and hydrophilic regions in membrane proteins. However, we also demonstrate that the hydropathy profile only provides partial information: additional information is gained from sequence homology and protein-protein interactions.

    34. Dir. Inc. …. -> known to be useful in identifying membrane proteins


    36. Let’s forget about everything and consider learning the optimal kernel. How can we do this? Convex: local = global optimum

    37. Let’s forget about everything and consider learning the optimal kernel. How can we do this?

    38. Learning the Optimal Kernel

    39. - Convex subset: good for us. We want a subset obtained by mixing our kernels somehow --- here we take a linear subspace in the cone, spanned by those kernels, where we want to learn the weights - for SVMs, maximum margin classifiers:
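The lecture formulates learning the mixing weights as a semidefinite program; as a much simpler stand-in (not the SDP from the slides), one can weight each base kernel by its alignment with the label kernel yy^T and normalize, which already yields a valid convex combination. A numpy sketch with toy kernels:

```python
import numpy as np

def alignment(K1, K2):
    # Normalized Frobenius inner product between two kernel matrices.
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

def combine_kernels(kernels, y):
    # Score each base kernel by its alignment with the ideal label kernel
    # y y^T, then form the convex combination sum_i mu_i K_i.
    # Assumes at least one kernel has positive alignment.
    K_target = np.outer(y, y).astype(float)
    scores = np.array([max(alignment(K, K_target), 0.0) for K in kernels])
    mu = scores / scores.sum()     # nonnegative weights summing to 1
    return sum(m * K for m, K in zip(mu, kernels)), mu

y = np.array([1, 1, -1, -1])
K1 = np.outer(y, y).astype(float)  # informative base kernel
K2 = np.eye(4)                     # uninformative base kernel
K, mu = combine_kernels([K1, K2], y)
assert mu[0] > mu[1]               # informative kernel gets more weight
```

A convex combination of PSD matrices is itself PSD, so the result K stays a valid kernel; the SDP in the lecture instead chooses the weights to directly maximize the SVM margin.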

    40. Learning the optimal Kernel

    41. Learning the optimal Kernel

    46. Next class Student presentation Schedule is online Send the ppt slides to TA one day earlier Email: jianhuic@gmail.com

    47. Survey Clustering Classification Regression Semi-supervised learning Dimensionality reduction Manifold learning Kernel learning
