Mining the Genome

Filip Železný

ČVUT FEL, Prague

Dept. of Cybernetics

Gerstner Laboratory

Intro
  • Research at ČVUT FEL Dept. of Cybernetics
    • Nature Inspired Technologies
      • machine learning
      • evolutionary computation
    • Agent Computing
    • Robotics
    • Computer Vision
  • EU projects (6th Framework Programme, FP6)
    • 14 running in 2005, 9 new ones starting in 2006
Machine Learning & Data Mining
  • Supervised learning
    • given examples and their class labels
    • find a model for predicting class labels of new examples
    • also: “concept learning”, “predictive classification”, ...
  • Example
    • Given:
    • Discover:

size=small & luxury=low → affordable
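As a rough illustration of how such a rule could be discovered, the sketch below searches all conjunctions of attribute=value conditions for the one that covers the most positive examples and no negative ones. The training table is hypothetical (the slide's original examples are not reproduced in this transcript), and the names `EXAMPLES`, `covers` and `learn_rule` are made up for the illustration.

```python
from itertools import combinations

# Hypothetical training examples: (attribute tuple, class label "affordable").
EXAMPLES = [
    ({"size": "small", "luxury": "low"}, True),
    ({"size": "small", "luxury": "high"}, False),
    ({"size": "large", "luxury": "low"}, False),
    ({"size": "large", "luxury": "high"}, False),
    ({"size": "small", "luxury": "low"}, True),
]


def covers(rule, example):
    """A rule is a tuple of (attribute, value) conditions, all of which must hold."""
    return all(example.get(attr) == value for attr, value in rule)


def learn_rule(examples):
    """Exhaustively search conjunctions, shortest first.

    Keeps the rule covering the most positives among those covering no negatives.
    """
    conditions = sorted({(a, v) for ex, _ in examples for a, v in ex.items()})
    best, best_coverage = None, 0
    for length in range(1, len(conditions) + 1):
        for rule in combinations(conditions, length):
            if any(covers(rule, ex) for ex, label in examples if not label):
                continue  # rule covers a negative example: reject it
            coverage = sum(covers(rule, ex) for ex, label in examples if label)
            if coverage > best_coverage:
                best, best_coverage = rule, coverage
    return best


rule = learn_rule(EXAMPLES)
print(" & ".join(f"{a}={v}" for a, v in rule), "-> affordable")
```

Exhaustive search is only feasible for toy data like this; practical rule learners prune the search heuristically.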

Machine Learning

Plethora of paradigms

  • Decision trees (“Symbolic”)
  • Artificial Neural Networks (“Subsymbolic”)
  • Support Vector Machines (“Statistical”)

Learning = optimization in structure / parameter space

Learning = search

AI techniques employed (gradient descent, heuristic search)
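To make "learning = optimization in parameter space" concrete, here is a minimal gradient descent sketch. It assumes the simplest possible model, y = w·x with a single parameter w, and a mean squared error objective; the data, the learning rate and the function name `fit_slope` are illustrative choices, not taken from the slides.

```python
def fit_slope(xs, ys, learning_rate=0.05, steps=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is 2 * mean(x * (w*x - y))
        grad = 2.0 / n * sum(x * (w * x - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad
    return w


# Toy data generated by y = 2x, so the search should settle near w = 2.
slope = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(slope, 4))
```

Heuristic search over model structures (e.g. which conditions to put into a rule) follows the same idea, with discrete moves in place of gradient steps.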

Relational Learning

What if examples have a structure?

Not an attribute tuple!

Description spread across multiple tables of a relational database

Relational Learning
  • Relational learning
    • Representing data and rules in relational logic (Prolog)
    • Exploits background knowledge (e.g. “charge”)
    • Inductive Logic Programming

carcinogenic(Compound) IF has_atom(Compound, Atom1) & type(Atom1, carbon) & charge(Atom1, Charge) & Charge > 0.0133 & has_atom(Compound, Atom2) & double_bond(Atom1, Atom2)
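The sketch below shows how a rule of this kind is evaluated against data spread over several relational tables. It assumes that the atom in the double bond is the same carbon atom whose charge is tested. All table contents and identifiers (`c1`, `a1`, ...) are hypothetical.

```python
# Hypothetical relational tables; compound and atom identifiers are made up.
HAS_ATOM = {("c1", "a1"), ("c1", "a2"), ("c2", "a3"), ("c2", "a4")}
ATOM_TYPE = {"a1": "carbon", "a2": "oxygen", "a3": "carbon", "a4": "carbon"}
CHARGE = {"a1": 0.02, "a2": -0.3, "a3": 0.001, "a4": 0.0}
DOUBLE_BOND = {("a1", "a2")}


def carcinogenic(compound):
    """True if some pair of the compound's atoms satisfies the rule body."""
    atoms = [atom for comp, atom in HAS_ATOM if comp == compound]
    return any(
        ATOM_TYPE[atom1] == "carbon"
        and CHARGE[atom1] > 0.0133
        and (atom1, atom2) in DOUBLE_BOND
        for atom1 in atoms
        for atom2 in atoms
    )


for compound in ("c1", "c2"):
    print(compound, carcinogenic(compound))
```

Note that the rule cannot be checked against a single attribute tuple: it needs a join over the atom and bond tables, which is what relational learning provides.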

Applications of Interest

Intersection of three hot fields:

  • BIOtechnologies (genomics)
  • INFORMATION technologies (machine learning)
  • NANOtechnologies (microarray chips)

Background: GENETICS

How does a cell know what to do?

Chromosomes

Chromosomes get copied during mitosis

They carry the assembly instructions?

How?

Chromosomes = proteins + DNA

where is the information ??

DNA

1953: Jim Watson & Francis Crick discover the structure of DNA.

That is where the information is.

4-symbol alphabet: Guanine, Adenine, Cytosine, Thymine

Double-helix pairing:

C-G A-T
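The pairing rule means that one strand fully determines the other. A minimal sketch (the function names are illustrative):

```python
# Watson-Crick base pairing: C pairs with G, A pairs with T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand):
    """Complement read in the opposite direction, as the two strands are antiparallel."""
    return complement(strand)[::-1]


print(complement("GATTACA"))
print(reverse_complement("GATTACA"))
```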

video

The CENTRAL DOGMA of Molecular Biology
  • Gene = DNA subsequence
  • Genes code for proteins
  • Gene expression
    • A piece of DNA is transcribed into RNA
    • RNA is translated into a protein
    • Proteins “do the job”
      • enzymes
      • building blocks
      • ...

video

Protein Coding

Figure: DNA strand → codon (3 bases) → amino acid → protein
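The coding scheme can be sketched in a few lines. The example below is a simplification under stated assumptions: it transcribes the coding strand by replacing T with U (ignoring the template strand and any RNA processing), and `CODON_TABLE` holds only a handful of entries of the standard genetic code, so other codons raise a `KeyError`.

```python
# A small excerpt of the standard genetic code; None marks a stop codon.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "AAA": "Lys", "UGG": "Trp", "UAA": None, "UAG": None, "UGA": None,
}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (thymine is replaced by uracil)."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon (3 bases each) until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCTAA")
print(mrna, translate(mrna))
```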

Protein structures

“resolution”

Secondary structure prediction

Two common secondary structures

β-sheet

α-helix

Primary structure determines secondary structure.

Computational problem: given the primary structure, predict whether it folds into a β-sheet or an α-helix.

NOBODY CAN DO THAT!

Secondary structure prediction
  • Secondary structure prediction with ILP

[Muggleton 1992]

Using ILP, obtained rules such as

alpha0(A,B) ← ... position(A,D,O) & not_aromatic(O) & small_or_polar(O) & position(A,B,C) & very_hydrophobic(C) & not_aromatic(C) ... etc.

(22 literals)

  • Note the incorporation of background knowledge
  • Accuracy 81%, best at the time
  • Published in the journal Protein Engineering
The Genome project
  • 1990 – 2003

All human genes sequenced

Celera vs. NIH race

  • Challenge NOW:

annotate the genes

    • discover functions
    • interactions
    • dynamic pathways

video

Genomics research

Human intuition → Hypotheses → Verification (targeted assay)

  • Traditional functional genomics research
  • Hypothesis-driven
    • e.g. a gene is suspected to be responsible for ...
    • then tracing its expression in relevant tissues
  • “First hypothesize, then measure”
Gene Expression Microarrays
  • Microarray chip:
    • Measures the expression of tens of thousands of genes simultaneously: “high-throughput”
    • pioneering technology (mid to late 1990s)
  • A grid carrying synthesized DNA probes
  • Breakthrough in genomics research?

photo scan

Genomics Research
  • High-throughput approach to functional genomics?
  • Data-driven, unbiased, “First measure, then hypothesize”
  • Might reveal never-thought-of relationships

Microarray data: expression of almost the entire genome (tens of thousands of genes)

Microarray data → Human analysis → Hypotheses: IMPOSSIBLE (TOO MUCH DATA)

Genomics Research through Machine Learning
  • AI-based high-throughput functional genomics?

High-throughput screening

High-performance computing

Microarray data

Machine Learning

Hypotheses

Interpretation

Genomics Research with AI
  • This concept has recently been shown to work
  • Golub et al., Science 286:531-537, 1999
    • leukemia classification model (AML vs. ALL)
    • voting of informative attributes (genes)
    • Discovery of new classes (clustering)
  • Ramaswamy et al., PNAS 98:15149-15154, 2001
    • Tumor classification
    • 14 classes of cancer
    • used Support Vector Machines
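The "voting of informative genes" idea can be sketched as follows. This is a simplified reading of weighted voting, not the published implementation: each gene gets a signal-to-noise weight (difference of class means divided by the sum of class standard deviations) and votes according to which side of the midpoint between the class means a new sample falls on. The selection of the most informative genes is omitted, and the expression values are made up.

```python
from statistics import mean, stdev


def fit_weighted_voting(samples, labels, class_a, class_b):
    """Return one (weight, boundary) pair per gene."""
    model = []
    for g in range(len(samples[0])):
        a = [s[g] for s, lab in zip(samples, labels) if lab == class_a]
        b = [s[g] for s, lab in zip(samples, labels) if lab == class_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))  # signal-to-noise
        boundary = (mean(a) + mean(b)) / 2  # midpoint between class means
        model.append((weight, boundary))
    return model


def predict(model, sample, class_a, class_b):
    """Sum the signed votes of all genes; a positive total favours class_a."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return class_a if total > 0 else class_b


# Hypothetical expression levels of two genes in six samples.
samples = [[9.0, 1.0], [8.0, 2.0], [10.0, 1.5], [2.0, 7.0], [3.0, 8.0], [1.0, 9.0]]
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
model = fit_weighted_voting(samples, labels, "AML", "ALL")
print(predict(model, [8.5, 2.0], "AML", "ALL"))
print(predict(model, [2.5, 7.5], "AML", "ALL"))
```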

video

Interpretable classifiers
  • Comprehensibility Pursuit: Rule Based Models
  • Models interpretable by biologists
  • Our work
    • D. Gamberger, N. Lavrač, F. Železný, J. Tolar, Journal of Biomedical Informatics 37(5):269-284, 2004

IF gene_20056 EXPRESSED AND gene_23984 NOT_EXPRESSED THEN cancer_class = AML
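Applying such a rule to a sample is straightforward, which is part of its appeal. In the sketch below the cut-off that turns a measured level into EXPRESSED / NOT_EXPRESSED is a hypothetical value; the slides do not say how expression was discretized.

```python
EXPRESSION_THRESHOLD = 100.0  # hypothetical cut-off, not taken from the slides


def is_expressed(level):
    return level >= EXPRESSION_THRESHOLD


def classify(sample):
    """Apply the rule; return None when the rule does not fire."""
    if is_expressed(sample["gene_20056"]) and not is_expressed(sample["gene_23984"]):
        return "AML"
    return None


print(classify({"gene_20056": 350.0, "gene_23984": 12.0}))
print(classify({"gene_20056": 350.0, "gene_23984": 480.0}))
```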

Class

Exploiting Background knowledge
  • Tons of genomic background knowledge available
  • Relational learning would make it possible to exploit it!
Relational Genomic Data Mining
  • Our current work

Combining expression & gene annotation data

Rule Based Model

Relational Genomic Data Mining
  • Example rule algorithmically discovered
  • ... open end, no conclusions

expressed_in_all(Gene) IF has_location(Gene, integral_to_membrane) & has_function(Gene, receptor_activity)

Expression of genes coding for proteins located in the “integral to membrane” cell component, whose functions include receptor activity, has a high correlation with the BCR class of acute lymphoblastic leukemia (ALL) and a low correlation with other classes of ALL.
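A minimal sketch of how annotation and expression data can be combined: the rule body selects genes by their annotations, and the expression of the selected genes is then compared between sample classes. Gene names, annotations, expression levels and sample labels are all hypothetical, and the comparison of class means merely stands in for the correlation measure mentioned above.

```python
from statistics import mean

# Hypothetical gene annotations (location and function terms).
ANNOTATIONS = {
    "g1": {"location": {"integral_to_membrane"}, "function": {"receptor_activity"}},
    "g2": {"location": {"nucleus"}, "function": {"receptor_activity"}},
    "g3": {"location": {"integral_to_membrane"}, "function": {"kinase_activity"}},
}
# Hypothetical expression levels: gene -> {sample: level}.
EXPRESSION = {
    "g1": {"s1": 9.0, "s2": 8.0, "s3": 2.0},
    "g2": {"s1": 4.0, "s2": 5.0, "s3": 4.5},
    "g3": {"s1": 1.0, "s2": 1.5, "s3": 1.0},
}
SAMPLE_CLASS = {"s1": "BCR", "s2": "BCR", "s3": "other"}


def matches_rule(gene):
    """has_location(Gene, integral_to_membrane) & has_function(Gene, receptor_activity)"""
    annotation = ANNOTATIONS[gene]
    return ("integral_to_membrane" in annotation["location"]
            and "receptor_activity" in annotation["function"])


def mean_expression(genes, sample_class):
    """Average level of the given genes over all samples of one class."""
    return mean(
        EXPRESSION[gene][sample]
        for gene in genes
        for sample, cls in SAMPLE_CLASS.items()
        if cls == sample_class
    )


covered = sorted(gene for gene in ANNOTATIONS if matches_rule(gene))
print(covered, mean_expression(covered, "BCR"), mean_expression(covered, "other"))
```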