Genetic Programming for Mining DNA Chip data from Cancer Patients
W.B. Langdon & B.F. Buxton, Genetic Programming and Evolvable Machines, 5 (3): 251-257, September 2004. Presenter: John Dynan.

Why Genetic Programming?

  • Applies the principles of Darwinism to AI

  • Allows natural selection of the fittest models

  • Iterative process that evolves numerous solutions

  • Similar to the biology of genetics

  • Resolves the overfitting issue found in other approaches

  • DNA arrays with limited data sets (<100 tissues)

  • Predictive nature of low-expression genes

  • Disease treatment and prevention


What is Genetic Programming (GP)?

  • Replicates genetic processes:

    • Crossover (recombination)

    • Duplication

    • Mutation

    • Production

    • Deletion

    • DNA string of elements (A, C, G, U=T)
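The recombination and mutation steps listed above can be sketched on toy programs. This is a minimal illustration, not the operators used by Langdon and Buxton: programs are assumed to be nested Python tuples, the terminal names `g1`..`g5` are placeholders, and the helper names (`subtrees`, `replace_at`, `crossover`, `mutate`) are hypothetical.

```python
import random

# Assumption: a program is a nested tuple (function_name, child, child)
# or a terminal given as a string.
def subtrees(tree, path=()):
    """Yield (path, subtree) pairs for every node in the tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(mum, dad, rng):
    """Subtree crossover: graft a random subtree of dad onto a random point of mum."""
    point, _ = rng.choice(list(subtrees(mum)))
    _, donor = rng.choice(list(subtrees(dad)))
    return replace_at(mum, point, donor)

def mutate(tree, terminals, rng):
    """Point mutation: overwrite one random node with a random terminal."""
    point, _ = rng.choice(list(subtrees(tree)))
    return replace_at(tree, point, rng.choice(terminals))

rng = random.Random(0)
mum = ("add", "g1", ("mul", "g2", "g3"))
dad = ("sub", "g4", "g5")
child = crossover(mum, dad, rng)
mutant = mutate(child, ["g1", "g2", "g3", "g4", "g5"], rng)
```

Both operators return new tuples and leave the parents unchanged, so the same parents can be reused for several offspring.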


GP Cross Over


Biological Genetic Cross Over


What it is not

  • Clustering K-means

  • Heuristic Combination of fixed Rules

  • Single set of features

  • Sequential learning process for features

  • Optimal solution

  • Controlled Feature Deletion or Addition


History

  • Extension of Holland's (1975) genetic algorithms work (Stanford):

    • Structures are programs

    • Syntax trees

    • Nodes

      • Functions (Mul, Add, Div, Sub, Exp, ...)

      • Terminals (attributes, gene expression, ...)

  • GP is a search for Terminals and Functions
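As a rough illustration of function nodes and terminal nodes, the sketch below evaluates a small syntax tree against one sample's gene expression values. The tuple encoding, the `evaluate` helper and the protected division are assumptions made for the example. The two expression values are the first ones listed for each gene on the data snippet slide.

```python
def protected_div(a, b):
    # Assumption: division by zero returns 1.0, a common GP convention.
    return a / b if b != 0 else 1.0

# Binary function nodes named as on the slide (Exp is omitted for brevity).
FUNCTIONS = {
    "Add": lambda a, b: a + b,
    "Sub": lambda a, b: a - b,
    "Mul": lambda a, b: a * b,
    "Div": protected_div,
}

def evaluate(node, expression):
    """Recursively evaluate a tree; strings are gene terminals, numbers are constants."""
    if isinstance(node, (int, float)):
        return float(node)
    if isinstance(node, str):
        return expression[node]
    name, left, right = node
    return FUNCTIONS[name](evaluate(left, expression), evaluate(right, expression))

# U41737_at + 2 * U08998_at, written as a syntax tree.
tree = ("Add", "U41737_at", ("Mul", 2, "U08998_at"))
sample = {"U08998_at": 206.0, "U41737_at": 15.0}
value = evaluate(tree, sample)  # 15.0 + 2 * 206.0 = 427.0
```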


Syntax Tree


µarray Problem

  • Pomeroy Data Set (url)

  • 7129 Gene Expressions

  • 60 Patients

    • 39 Survivors

    • 21 Treatment failures

  • Compare with Pomeroy (K = 5, 8 genes)


Pomeroy Data Set Snippet

  • Brain_MD_30 Brain_MD_31 Brain_MD_32 Brain_MD_33 Brain_MD_34 Brain_MD_35

  • Brain_MD_36 Brain_MD_37 Brain_MD_38 Brain_MD_39 Brain_MD_40 Brain_MD_41

  • Brain_MD_42 Brain_MD_43 Brain_MD_44 Brain_MD_45 Brain_MD_46 Brain_MD_47

  • Brain_MD_48 Brain_MD_49 Brain_MD_50 Brain_MD_51 Brain_MD_52 Brain_MD_53

  • Brain_MD_54 Brain_MD_55 Brain_MD_56 Brain_MD_57 Brain_MD_58 Brain_MD_59

  • Brain_MD_60

  • U08998_at TAR RNA binding protein (TRBP) mRNA 206.0 55.0 106.0 323.0 209.0 88.0

  • 179.0 -493.0 -40.0 60.0 -200.0 312.0 -26.0 -234.0 127.0 10.0 135.0 -72.0

  • 46.0 -77.0 50.0 375.0 -252.0 -189.0 -112.0 -931.0 193.0 -125.0 -1244.0 -470.0

  • -683.0 -261.0 -18.0 -90.0 -3.0 -57.0 -201.0 50.0 -197.0 -141.0 -353.0 -132.0

  • -408.0 -262.0 20.0 239.0 -232.0 -593.0 -443.0 6.0 -316.0 116.0 -7.0 169.0

  • -260.0 -137.0 17.0 100.0 -954.0 -353.0

  • U41737_at Pancreatic beta cell growth factor (INGAP) mRNA 15.0 -87.0 11.0 173.0 177.0

  • -105.0 35.0 13.0 53.0 8.0 25.0 28.0 21.0 61.0 -8.0 75.0 24.0

  • -135.0 55.0 162.0 139.0 22.0 -89.0 13.0 -177.0 -384.0 45.0 -38.0 -38.0

  • -136.0 -152.0 -42.0 -85.0 -31.0 70.0 -76.0 -74.0 -50.0 29.0 -81.0 145.0

  • 42.0 -79.0 25.0 18.0 -20.0 44.0 -78.0 192.0 -66.0 -73.0 -39.0 57.0

  • -122.0 -90.0 25.0 -10.0 -80.0 -306.0 -3.0

  • 60 2 1

  • # class0 class1

  • 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

  • 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  • 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
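The class-label block at the end of the snippet can be read with a few lines of Python. This is a sketch under assumptions: the layout (a header line with sample and class counts, a `#` line with class names, then the labels) is inferred from the snippet itself, and the variable names are hypothetical. The label rows are rebuilt here with the same counts as on the slide (21 ones, then 21 and 18 zeros).

```python
from collections import Counter

# Label block rebuilt from the slide: header, class names, then three label rows.
cls_text = "\n".join([
    "60 2 1",
    "# class0 class1",
    " ".join(["1"] * 21),
    " ".join(["0"] * 21),
    " ".join(["0"] * 18),
])

lines = cls_text.strip().splitlines()
n_samples, n_classes, _ = (int(x) for x in lines[0].split())
class_names = lines[1].lstrip("#").split()
labels = [int(x) for line in lines[2:] for x in line.split()]
counts = Counter(labels)  # 21 samples labelled 1, 39 labelled 0
```

The counts match the 39 and 21 patients on the µarray Problem slide, but the snippet does not say which label denotes which outcome, so that mapping is left open here.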


Method

  • Each individual consists of 5 trees (mating pools)

  • 60-fold (N = 60) cross-validation generates 60 random models

  • The 60-fold run is repeated 10 times

  • 600 predictive patient survival models

  • If Tree(i = 1..5) > 0, the GP model is positive (node)

  • Genetic modifications in trees 1 and 2

  • Trees may specialize (tissue)

  • Program fitness: (Pos/Neg) accuracy > 0.5
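The arithmetic behind the 600 models, and one possible reading of the five-tree decision rule, can be sketched as follows. Two assumptions are made and are not confirmed by the slide: that "N=60 fold" means leave-one-out splits over the 60 patients, and that the five tree outputs are summed before being compared with zero. The function names are hypothetical.

```python
def predict_positive(tree_outputs):
    # Assumption: the five tree outputs are summed and compared with zero.
    return sum(tree_outputs) > 0

def leave_one_out_folds(n):
    """Yield (train_indices, test_index) for each of the n folds."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out

N_PATIENTS, N_REPEATS = 60, 10

# One GP model is evolved per fold and per repeat: 60 * 10 = 600 models.
runs = [
    (repeat, test_index)
    for repeat in range(N_REPEATS)
    for _, test_index in leave_one_out_folds(N_PATIENTS)
]
n_models = len(runs)  # 600
```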


GP Conditions

  • Terminals (µarray data)

  • Functions (+, -, /, *, exp, <, >, ...)

  • Fitness measurement (data)

  • Program control (loop, time)

  • Termination (generations)


GP DNA Parameters


GP 1st/2nd Data Mining

  • 600 GP models

  • 6970 of 7129 Attributes in GP Models

  • 404 Genes in ten or more GP Models

  • 404 Genes were used in 2nd GP run

  • Two genes appear in more than 100 GP models

    • U08998 - 182 GP Models

    • U41737 - 193 GP Models
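The gene-counting step between the first and second GP runs can be sketched as a frequency count with a threshold of ten models. The model lists below are toy stand-ins, not the 600 evolved models, and `X_at` and `Y_at` are made-up gene names. The helpers `gene_usage` and `frequent_genes` are hypothetical.

```python
from collections import Counter

def gene_usage(models):
    """Count how many models use each gene (a gene counts once per model)."""
    usage = Counter()
    for genes in models:
        usage.update(set(genes))
    return usage

def frequent_genes(usage, min_models=10):
    """Genes used in at least min_models models, as kept for the second run."""
    return sorted(gene for gene, n in usage.items() if n >= min_models)

# Toy stand-in for the evolved models: each entry lists the genes one model uses.
models = (
    [["U08998_at", "U41737_at"]] * 12
    + [["U08998_at", "X_at"]] * 3
    + [["Y_at", "Y_at"]] * 9
)
usage = gene_usage(models)
selected = frequent_genes(usage)  # ['U08998_at', 'U41737_at']
```

Counting a gene once per model, rather than once per occurrence, matches the slide's wording "in ten or more GP Models".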


Gene Biology

  • Genes NOT highly Expressed

  • Not found in Pomeroy K-means cluster analysis

  • U08998_at

    • TAR RNA binding protein – promotes cancer

    • TARBP1 GeneCard

  • U41737_at

    • Pancreatic beta cell growth

    • REG3A GeneCard


Gene Frequency 2nd GP


Final GP

  • Limited number of functions

  • Single IF statements (<, >, ≥, ≤)

  • Random generation of function and Genes

  • 60-fold (N = 60) × 10: accuracy = 68%

  • 147 of 192 were incorrect predictors

  • 39 of 192 were correct two gene predictors


Two Gene Profile


Two Gene Outcome

  • Survived / predicted correctly: TP

  • Failed treatment / predicted wrongly: FP

  • Survived / predicted wrongly: FN

  • Failed treatment / predicted correctly: TN

  • Darkened points are poor predictors

  • GP Model predictor:

  • -42 < U41737_at + 2*U08998_at
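The two-gene rule and the TP/FP/FN/TN bookkeeping can be sketched as below. The slide does not say which outcome the inequality predicts, so treating "rule fires" as "predicted to survive" is an assumption made only for this example. The expression pairs and outcomes are toy values, apart from the first pair, which uses the first value listed for each gene in the data snippet.

```python
def rule_fires(u41737, u08998):
    """The slide's predictor: -42 < U41737_at + 2 * U08998_at."""
    return -42 < u41737 + 2 * u08998

def tally(survived, predicted_survival):
    """Count outcomes with 'survived' as the positive class, as on the slide."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(survived, predicted_survival):
        if actual:
            counts["TP" if predicted else "FN"] += 1
        else:
            counts["FP" if predicted else "TN"] += 1
    return counts

# (U41737_at, U08998_at) pairs; all but the first are hypothetical.
pairs = [(15.0, 206.0), (-100.0, 10.0), (0.0, -21.0), (50.0, -60.0)]
survived = [True, True, False, False]  # hypothetical outcomes

# Assumption: a firing rule is read as a prediction of survival.
predicted = [rule_fires(a, b) for a, b in pairs]
counts = tally(survived, predicted)
```

The third pair sits exactly on the boundary (0 + 2 * -21 = -42), so the strict inequality does not fire for it.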


Limitations

  • Extensive computer resources (exponential)

  • NP solution

  • Only a heuristically optimal solution

  • Repeating the random selection process with different genetic evolutionary change rates can produce different results


Bioinformatics

  • Allows low-expression genes to be selected into the predictive model

  • New information can be harvested by repeating execution of GP

  • The 5 tree members can be isolated members of different organ tissues

  • Disease treatment, prediction and cure


References

  • 1 J. DeRisi, et al. 1998. The transcriptional program of sporulation in budding yeasts. Science 282:699-705.

  • 2 Mitra, A.; Almal, A.; George, B.; Fry, D.; Lenehan et al. The use of genetic programming analysis of quantitative expression profiles… BMC Cancer 2006;6:159.

  • 3 University of Manchester GP Web Site URL: http://dbkgroup.org/gp_home.htm

  • 4 Bibliography of GP references: http://liinwww.ira.uka.de/bibliography/Ai/genetic.programming.html

  • 5 Langdon, W.B. and Poli, R. Foundations of Genetic Programming, Springer-Verlag, Berlin. 2001.

  • 6 Koza, John; Bennett, F.; Andre, D. and Keane, Martin. Genetic Programming, Morgan Kaufmann Publishing, San Francisco, 1999.

  • 7 Hartl, D. and Jones, E. 2002. Essential Genetics 3rd ed. Boston, MA: Jones and Bartlett Publishers.

