Efficient Gene Selection with Rough Sets From Gene Expression Data

Presentation Transcript


  1. Efficient Gene Selection with Rough Sets From Gene Expression Data • Lijun Sun, Duoqian Miao, Hongyun Zhang

  2. Introduction • Rough Sets Based Feature Selection • Rough Sets Based Gene Selection Method • Experimental Results • Conclusion

  3. Introduction-1 • Different cells, or the same cell under different conditions, yield different microarray results • Comparing gene expression data derived from microarray results between normal and tumor cells can provide important information for tumor classification

  4. Introduction-2 • Gene expression data sets have characteristics that differ markedly from the data previously used for classification • Most publicly available gene expression data has the following properties: • High dimensionality: up to tens of thousands of genes • Very small data set size: fewer than 100 samples • Most genes are not related to cancer classification • Problems: • Noise • Dimensionality • Cost of classification algorithms

  5. Introduction-3 • Only a very small fraction of genes is informative for a given task • Base the classification on only a subset of the genes: • Reduce dimensionality, for convenience and shorter running time • Drop noisy or irrelevant genes, for accuracy • Gain biological insight

  6. Introduction-4 • Feature ranking is the most commonly used approach to feature selection • Filter approaches remove irrelevant features according to general characteristics of the data: • (1) Use a filter to rank all the genes in the data • (2) Choose the first n − 1 genes as the best feature subset • Advantages: simple, easy, better generalization • Problem: feature sets obtained this way contain redundancy, because genes in similar pathways probably all have very similar scores
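
The rank-then-cut procedure on this slide can be sketched in a few lines of Python. This is only an illustration: the gene names, the scores, and the cut-off `k` are hypothetical, and ranking by absolute score is an assumption about how a signed filter score would be used.

```python
def rank_genes(scores):
    """Return gene names sorted by absolute filter score, highest first."""
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


def select_top(scores, k):
    """Keep the k best-ranked genes as the feature subset."""
    return rank_genes(scores)[:k]


# Hypothetical filter scores for five genes.
scores = {"g1": 0.3, "g2": -4.1, "g3": 2.2, "g4": 4.0, "g5": 0.1}
top2 = select_top(scores, 2)
print(top2)
```

Nothing in this procedure checks whether the selected genes carry the same information, which is the redundancy problem the slide points out.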

  7. Rough Sets Based Feature Selection-1 • Basic concepts of rough sets • Decision table: denoted by T = (U, C ∪ {d}), • where U is the universe of discourse, • C is the set of condition attributes, • and {d} is the decision attribute • Rows of the decision table correspond to objects, and columns correspond to attributes

  8. Rough Sets Based Feature Selection-2 • Indiscernibility relation. Let a ∈ A, P ⊆ A. A binary relation IND(P), called the indiscernibility relation, is defined as follows: IND(P) = {(x, y) ∈ U × U : a(x) = a(y) for all a ∈ P}
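
As a small illustration of the indiscernibility relation, the sketch below groups the objects of a toy decision table into the equivalence classes U/IND(P): two objects fall in the same class when they agree on every attribute in P. The table contents, the discretized values "high" and "low", and the function name `partition` are hypothetical.

```python
from collections import defaultdict


def partition(table, attrs):
    """Group object ids into the equivalence classes of IND(attrs)."""
    classes = defaultdict(list)
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)  # objects with equal keys are indiscernible
        classes[key].append(obj)
    return sorted(classes.values())


# Toy decision table: rows are objects, columns are attributes.
table = {
    "x1": {"g1": "high", "g2": "low", "d": "AML"},
    "x2": {"g1": "high", "g2": "low", "d": "AML"},
    "x3": {"g1": "low", "g2": "low", "d": "ALL"},
    "x4": {"g1": "low", "g2": "high", "d": "ALL"},
}
blocks_g1 = partition(table, ["g1"])
blocks_both = partition(table, ["g1", "g2"])
print(blocks_g1)
print(blocks_both)
```

Adding attributes to P can only split classes further, never merge them.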

  9. Rough Sets Based Feature Selection-3 • Indispensable and dispensable attributes • An attribute c ∈ C is an indispensable attribute if [formula not preserved in the transcript] • An attribute c ∈ C is a dispensable attribute if [formula not preserved in the transcript]

  10. Rough Sets Based Feature Selection-4 • Reduct. A subset of attributes R ⊆ C is a reduct of the attribute set C if [formula not preserved in the transcript] and [formula not preserved in the transcript]

  11. Rough Sets Based Feature Selection-5 • Core. The set of all indispensable features in C is the core of C; it equals the intersection of all reducts of C with respect to D.
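
The formulas for dispensable attributes, reducts, and the core did not survive in this transcript. The sketch below therefore assumes the standard positive-region formulation from rough set theory, which may differ from the one the authors used: an attribute is dispensable if removing it leaves the positive region unchanged, and a reduct is a subset that preserves the positive region and contains no dispensable attribute. The toy table and all function names are hypothetical, and the exhaustive search is only practical for a handful of attributes.

```python
from itertools import combinations


def blocks(table, attrs):
    """Equivalence classes of IND(attrs) as a list of sets of object ids."""
    groups = {}
    for obj, row in table.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(groups.values())


def positive_region(table, attrs, d="d"):
    """Objects whose IND(attrs) class contains a single decision value."""
    pos = set()
    for block in blocks(table, attrs):
        if len({table[o][d] for o in block}) == 1:
            pos |= block
    return pos


def is_reduct(table, subset, cond, d="d"):
    """True if subset keeps the positive region of cond and is minimal."""
    full = positive_region(table, cond, d)
    if positive_region(table, subset, d) != full:
        return False
    # Every attribute of the subset must be indispensable within it.
    return all(
        positive_region(table, [a for a in subset if a != r], d) != full
        for r in subset
    )


def reducts(table, cond, d="d"):
    """All reducts, found by exhaustive search (toy-sized tables only)."""
    return [
        set(s)
        for k in range(1, len(cond) + 1)
        for s in combinations(cond, k)
        if is_reduct(table, list(s), cond, d)
    ]


def core(table, cond, d="d"):
    """Attributes whose removal from cond shrinks the positive region."""
    full = positive_region(table, cond, d)
    return {
        c for c in cond
        if positive_region(table, [a for a in cond if a != c], d) != full
    }


# Toy table in which g2 and g3 carry identical information.
table = {
    "x1": {"g1": 0, "g2": 0, "g3": 0, "d": "ALL"},
    "x2": {"g1": 0, "g2": 1, "g3": 1, "d": "AML"},
    "x3": {"g1": 1, "g2": 0, "g3": 0, "d": "AML"},
    "x4": {"g1": 1, "g2": 1, "g3": 1, "d": "AML"},
}
cond = ["g1", "g2", "g3"]
all_reducts = reducts(table, cond)
core_attrs = core(table, cond)
print(all_reducts, core_attrs)
```

On this table the two reducts are {g1, g2} and {g1, g3}, and their intersection {g1} coincides with the core, as the definition of the core states.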

  12. Rough Sets Based Feature Selection-6 • Rough sets based feature selection • Selecting an optimal feature subset with rough set theory can be viewed as finding a reduct R with the best classifying properties. R is then used instead of C in a rule discovery algorithm.

  13. Rough Sets Based Gene Selection Method-1 • Our learning problem is to select highly discriminative genes for cancer classification from gene expression data • A gene expression data set can be formalized as a decision system T = (U, C ∪ {d}), • where the universe U = {x1, x2, …, xm} is a set of tumors, • the conditional attribute set C = {g1, g2, …, gn} contains each gene, • and the decision attribute d corresponds to the class label of each sample • Each attribute gi ∈ C is represented by a vector gi = {x1,i, x2,i, …, xm,i}, i = 1, 2, …, n, where xk,i is the expression level of gene i at sample k, k = 1, 2, …, m

  14. Rough Sets Based Gene Selection Method-2 • Two steps of our method • Step 1: use a filter method to obtain a feature subset • The t-test is used as the filter • Assuming that there are two classes of samples in a gene expression data set, the t-value for gene g is given by: [formula not preserved in the transcript]
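
The t-value formula itself is missing from this transcript. As a hedged illustration, the sketch below uses the common unequal-variance (Welch-style) two-sample statistic, t = (μ1 − μ2) / sqrt(σ1²/n1 + σ2²/n2), with sample variances; whether the authors used exactly this form is an assumption. The expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_value(class1, class2):
    """Two-sample t statistic for one gene, assuming unequal class variances.

    class1 and class2 hold the gene's expression levels in each class;
    each needs at least two samples for the sample variance to exist.
    """
    n1, n2 = len(class1), len(class2)
    spread = sqrt(variance(class1) / n1 + variance(class2) / n2)
    return (mean(class1) - mean(class2)) / spread


# Hypothetical expression levels of one gene in two classes.
t = t_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t, 4))
```

Genes with a large absolute t-value separate the two classes well and would be ranked first by the filter.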

  15. Rough Sets Based Gene Selection Method-3 • Step 2: use rough set attribute reduction to find a minimal reduct • Information entropy is used as the heuristic information • Given the partition of U by D, U/IND(D), the entropy based on the partition of U by a ∈ C, U/IND(a), is given by: [formula not preserved in the transcript]
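
Since the entropy formula was lost, the sketch below assumes the usual conditional entropy H(D | a) = −Σ p(X) Σ p(Y | X) log2 p(Y | X), where X ranges over U/IND(a) and Y over U/IND(D), and pairs it with a simple greedy forward search. The search strategy, the function names, and the discretized toy data are assumptions rather than the authors' exact algorithm, and a greedy search of this kind is not guaranteed to return a minimal reduct.

```python
from collections import Counter
from math import log2


def conditional_entropy(table, attrs, d="d"):
    """H(d | attrs), computed over the partition U/IND(attrs)."""
    groups = {}
    for row in table:
        groups.setdefault(tuple(row[a] for a in attrs), []).append(row[d])
    n = len(table)
    h = 0.0
    for labels in groups.values():
        p_block = len(labels) / n
        for count in Counter(labels).values():
            p = count / len(labels)
            h -= p_block * p * log2(p)
    return h


def greedy_reduct(table, cond, d="d"):
    """Add the attribute that lowers H(d | selected) most, until the
    entropy matches that of the full condition attribute set."""
    target = conditional_entropy(table, cond, d)
    selected = []
    while conditional_entropy(table, selected, d) > target + 1e-12:
        best = min(
            (a for a in cond if a not in selected),
            key=lambda a: conditional_entropy(table, selected + [a], d),
        )
        selected.append(best)
    return selected


# Toy data with expression levels already discretized (an assumption).
table = [
    {"g1": "high", "g2": "low", "d": "AML"},
    {"g1": "high", "g2": "high", "d": "AML"},
    {"g1": "low", "g2": "low", "d": "ALL"},
    {"g1": "low", "g2": "high", "d": "ALL"},
]
chosen = greedy_reduct(table, ["g1", "g2"])
print(chosen)
```

Here g1 alone determines the class, so its conditional entropy is zero and the search stops after one step.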

  16. Experiment-1 • Data Set: The acute leukemia data of Golub et al. (1999) http://www.genome.wi.mit.edu/MPR • Consists of samples from two different types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). • The training data set has 38 bone marrow samples (27 ALL and 11 AML). Each sample has expression patterns of 7129 genes measured by the Affymetrix oligonucleotide microarray. • The test data set consists of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).

  17. Experiment-2 • Only one gene, X95735, is selected in the reduct • X95735 is also selected by many other methods • X95735 is the only gene identified by the J48 pruned tree and the emerging patterns algorithm • X95735 is also selected by the voting machine, SVM, Deb's NSGA algorithm, and Cho's work

  18. Experiment-3 • Two rules are derived: • If the expression level of X95735 > 938, then the sample is classified as AML • If the expression level of X95735 < 938, then the sample is classified as ALL • 31 of 34 samples in the test data set are correctly classified
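
The two rules amount to a single-threshold classifier, sketched below. The threshold 938 and the 31-of-34 figure come from the slide; the function name and the example expression levels are hypothetical. The slide does not say how a level of exactly 938 is handled, so the sketch leaves that case unclassified.

```python
THRESHOLD = 938  # cut point on the X95735 expression level, from the slide


def classify(x95735_level):
    """Apply the two derived rules to one sample's X95735 expression level."""
    if x95735_level > THRESHOLD:
        return "AML"
    if x95735_level < THRESHOLD:
        return "ALL"
    return None  # exactly 938 is not covered by either rule


# Reported test-set accuracy: 31 correct out of 34 samples.
accuracy = 31 / 34
print(classify(1200), classify(500), round(accuracy, 3))
```

The reported result corresponds to a test accuracy of about 91.2%.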

  19. The Comparison of Feature Selection and Classification Results

  20. Conclusion • The expression level of X95735 plays an important role in distinguishing the two types of acute leukemia. The role of X95735 in discerning between the two types of acute leukemia samples has also been verified by biological researchers • Rough set based methods can find informative genes for classification • Our method needs to be verified on more data sets

  21. Thanks!