
Selecting Informative Genes with Parallel Genetic Algorithms



  1. Selecting Informative Genes with Parallel Genetic Algorithms • Deodatta Bhoite • Prashant Jain

  2. Terminology • Genes • DNA, mRNA • Gene expression • Microarrays

  3. Microarray output

  4. Gene Selection • A large number of irrelevant genes introduces “biological noise” • Analysis of results can be simplified by selecting only the relevant genes for study • Two categories of gene selection • Filter approach • Wrapper approach

  5. Gene Selection

  6. Classifier • What is a classifier used for? • Mapping of label pairs <xi, li> to {0,1,?} • Golub-Slonim classifier • Positive value = class 1, negative value = class 2

  7. Ranking based gene selection methods • GS-correlation • Genes with most positive and negative correlation values are selected. • Tends to not select genes for which class values have large standard deviations with respect to training data (some of them may be most relevant and informative).
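The GS-correlation ranking can be sketched as follows. The slide does not give the formula, so this assumes the usual signal-to-noise form P(g) = (mean1 - mean2) / (std1 + std2); the function names, the use of population standard deviation, and the weighted-vote form (each gene votes P(g) times the distance of the sample from the midpoint of the class means) are assumptions for illustration, not the authors' code.

```python
from statistics import mean, pstdev

def gs_correlation(class1, class2):
    """Signal-to-noise score of one gene from its expression values per class."""
    spread = pstdev(class1) + pstdev(class2)
    return (mean(class1) - mean(class2)) / spread if spread else 0.0

def split_by_class(X, y, g):
    """Expression values of gene g in class 1 and in class 2."""
    c1 = [row[g] for row, label in zip(X, y) if label == 1]
    c2 = [row[g] for row, label in zip(X, y) if label == 2]
    return c1, c2

def rank_genes(X, y):
    """X: list of samples (lists of expression values); y: labels 1 or 2."""
    return [gs_correlation(*split_by_class(X, y, g)) for g in range(len(X[0]))]

def select_extremes(scores, k):
    """Indices of the k most negative and the k most positive genes."""
    order = sorted(range(len(scores)), key=lambda g: scores[g])
    return order[:k] + order[-k:]

def weighted_vote(sample, X, y, genes, scores):
    """Positive total = class 1, negative total = class 2."""
    total = 0.0
    for g in genes:
        c1, c2 = split_by_class(X, y, g)
        midpoint = (mean(c1) + mean(c2)) / 2
        total += scores[g] * (sample[g] - midpoint)
    return total
```

Because the score divides by the sum of the class standard deviations, a gene with a large spread in either class gets a small score, which is the weakness the slide points out.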

  8. Ranking with disorder • This method doesn’t use the actual expression levels. • Ng_I represents the set of indices that belong to class I and h(x) is the indicator function.

  9. Need for subset ranking • Individual ranking methods may not always select informative genes. • They ignore the relationships between genes because they rely solely on individual scores. • Thus we need to explore subsets of genes to find the optimal subset for classification.

  10. Genetic Algorithm • What is a genetic algorithm? • “Genetic Algorithms are defined as global optimization procedures that use an analogy of genetic evolution of biological organisms.” • Basically genetic algorithms tend to find the best solution to a problem by following an evolutionary process.

  11. Basic Genetic Algorithm
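A basic genetic algorithm over bit strings can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the tournament selection, single-point crossover, bit-flip mutation, default parameter values, and the name `basic_ga` are all chosen here for the example.

```python
import random

def basic_ga(fitness, n_bits, pop_size=20, generations=50,
             p_cross=0.6, p_mut=0.01, seed=0):
    """Evolve bit strings of length n_bits (n_bits >= 2) to maximize fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def pick():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick()[:], pick()[:]
            if rng.random() < p_cross:
                # Single-point crossover swaps the tails of the parents.
                cut = rng.randrange(1, n_bits)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        child[i] ^= 1  # bit-flip mutation
                nxt.append(child)
        pop = nxt[:pop_size]
        best = max(pop + [best], key=fitness)  # remember the best seen so far
    return best
```

Calling `basic_ga(sum, 10)` searches for the 10-bit string with the most ones; any function that scores a bit string can be passed as `fitness`.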

  12. Parallel Genetic Algorithm • For large population sizes, G.A. is computationally infeasible. • Hence the use of Parallel Genetic Algorithms.

  13. Parallel Genetic Algorithm

  14. Model and Encoding • Island model: each processor runs a G.A. on a subset of the population, and there is periodic migration. • Fixed-length binary string encoding: if a gene is included in the subset, its value is 1, else 0.
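The encoding and the migration step can be sketched as follows. The slides state only that migration is periodic; the ring topology, the policy of replacing the worst individual on the receiving island with a copy of the sender's best, and the names `decode` and `migrate` are assumptions made for this example.

```python
import random

def decode(mask):
    """Turn a fixed-length bit string into the list of selected gene indices."""
    return [i for i, bit in enumerate(mask) if bit == 1]

def migrate(islands, fitness, p_migrate, rng):
    """With probability p_migrate, each island sends a copy of its best
    individual to the next island in a ring, replacing that island's worst."""
    n = len(islands)
    bests = [max(pop, key=fitness) for pop in islands]  # snapshot before moving
    for i in range(n):
        if rng.random() < p_migrate:
            target = islands[(i + 1) % n]
            worst = min(range(len(target)), key=lambda j: fitness(target[j]))
            target[worst] = bests[i][:]
```

For instance, `decode([1, 0, 1, 1])` yields `[0, 2, 3]`: genes 0, 2 and 3 are in the subset.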

  15. Fitness Evaluation • Two different criteria • Classification accuracy • Size of the subset • fitness(x) = w1 * accuracy(x) + w2 * (1 - dimensionality(x)) • Here, • accuracy(x) = test accuracy of the classifier built with the gene subset represented by x • dimensionality(x) ∈ [0,1] = the dimension of the subset

  16. Fitness Evaluation • w1 = weight assigned to accuracy • w2 = weight assigned to dimensionality • A subset with high classification accuracy and low dimension has high fitness.
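The fitness formula above translates directly into code. The slides do not give the values of w1 and w2 or say how dimensionality(x) is scaled into [0,1]; the defaults below and the choice of "fraction of genes selected" are assumptions for illustration.

```python
def fitness(mask, accuracy, w1=0.75, w2=0.25):
    """fitness(x) = w1 * accuracy(x) + w2 * (1 - dimensionality(x)).

    mask     -- bit string, 1 where a gene is included in the subset
    accuracy -- classifier accuracy in [0, 1] for that subset
    w1, w2   -- hypothetical weights; the slides do not state them
    """
    dimensionality = sum(mask) / len(mask)  # assumed: fraction of genes kept
    return w1 * accuracy + w2 * (1.0 - dimensionality)
```

At equal accuracy, the smaller subset scores higher: `fitness([1, 0, 0, 0], 0.9)` exceeds `fitness([1, 1, 1, 0], 0.9)`.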

  17. Data Sets Used

  18. Test Parameters • The tests were run on two processors. • The parameters of the G.A. in each processor were set as follows: • Population size: 1000 • Trials: 400000 • Crossover probability: 0.6 • Mutation probability: 0.001

  19. Test Parameters • Selection strategy: Elitist • Migration probability: 0.002 • The crossover probability is kept at a moderate level so that offspring form varied subpopulations that retain the good traits of their parents. • The mutation probability is kept low to avoid randomness of selection. • The elitist selection strategy ensures that the best individuals are kept, and hence leads to more accurate subsets of genes.
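The parameter values from the two slides above can be collected in one place, together with a sketch of an elitist replacement step. The class name `GAParams`, the function `elitist_replace`, and the choice of carrying over a single elite individual are assumptions; the slides say only that the strategy is elitist.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GAParams:
    """Per-processor settings as listed on the slides."""
    population_size: int = 1000
    trials: int = 400_000
    p_crossover: float = 0.6
    p_mutation: float = 0.001
    p_migration: float = 0.002

def elitist_replace(old_pop, offspring, fitness, n_elite=1):
    """Keep the n_elite best of the old population, fill the rest with
    the fittest offspring, and return a population of the same size."""
    elite = sorted(old_pop, key=fitness, reverse=True)[:n_elite]
    keep = len(old_pop) - n_elite
    survivors = sorted(offspring, key=fitness, reverse=True)[:keep]
    return elite + survivors
```

With this step the best fitness in the population can never decrease from one generation to the next, which is the property the slide attributes to elitism.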

  20. Results

  21. Results • Leukemia Data Set • Subset with 29 Genes found • Classifies 36/38 training instances correctly • Classifies 30/34 test instances correctly • Colon Data Set • Subset with 30 genes found • 92% accuracy on the training data set
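For comparison with the percentage reported for the colon data set, the leukemia counts above can be converted to accuracies. The helper name `accuracy` is introduced here only for the arithmetic.

```python
def accuracy(correct, total):
    """Fraction of instances classified correctly."""
    return correct / total

train_acc = accuracy(36, 38)  # leukemia training set
test_acc = accuracy(30, 34)   # leukemia test set
print(f"training: {train_acc:.1%}, test: {test_acc:.1%}")
# prints "training: 94.7%, test: 88.2%"
```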

  22. Results Comparison • Results are better than those of other algorithms such as the G-S and NB algorithms, which have accuracies below 90% and use between 10 and 500 genes.

  23. Average Performance Graphs

  24. Conclusion • The method does well at finding smaller gene subsets with better accuracies. • The fitness function needs to be more sophisticated than the simple one currently used in order to ensure a compact final subset every time.

  25. Questions Thank You.
