Machine Learning in Bioinformatics

Simon Colton

The Computational Bioinformatics Laboratory

Talk Overview
  • Our research group
    • Aims, people, publications
  • Machine learning
    • A balancing act
  • Bioinformatics
    • Holy grails
  • Our bioinformatics research projects
    • From small to large
  • A future direction
    • Integration of reasoning techniques
Computational Bioinformatics Laboratory
  • Our aim is to:
    • Study the theory, implementation and application of computational techniques to problems in biology and medicine
  • Our emphasis is on:
    • Machine learning representations, algorithms and applications
  • Our favourite techniques are:
    • ILP, SLPs, ATF, ATP, CSP, GAs, SVMs
    • Kernel methods, Bayes nets, Action Languages
  • The (major) research tools we’ve produced are:
    • Progol, HR, MetaLog (in production)
The Research Group Members
  • Hiroaki Watanabe (RA, BBSRC)
  • Alireza Tamaddoni-Nezhad (RA, DTI)
  • Stephen Muggleton (Professor)
  • Ali Hafiz (PhD)
  • Huma Lodhi (RA, DTI)
  • Simon Colton (Lecturer)
  • Jung-Wook Bang (RA, DTI)
  • Nicos Angelopoulos (RA, BBSRC; now in York)
  • Room 407
    • http://www.doc.ic.ac.uk/bioinformatics
Some External Collaborators
  • Mike Sternberg (Biochemistry, Imperial)
  • Jeremy Nicholson (Biomedical Sciences, Imperial)
  • Steve Oliver (Biology, Manchester)
  • Ross King (Computing, Aberystwyth)
  • Doug Kell (Chemistry, Manchester)
  • Chris Rawlings (Oxagen)
  • Charlie Hodgman (GSK)
  • Alan Bundy (Informatics, Edinburgh)
  • Toby Walsh (Cork Constraint Computation Centre)
Some Departmental Collaborators
  • Krysia Broda, Alessandra Russo, Oliver Ray
    • Aspects of ILP and ALP
  • Marek Sergot
    • Action Languages
  • Tony Kakas (Visiting professor, Cyprus)
    • Abductive Logic Programming
Machine Learning Overview
  • Ultimately about writing programs which improve with experience
    • Experience through data
    • Experience through knowledge
    • Experience through experimentation (active)
  • Some common tasks:
    • Concept learning for prediction
    • Clustering
    • Association rule mining
Maintaining a Balance

[Diagram: a spectrum running from predictive tasks (supervised learning; know what you’re looking for), through “don’t know what you’re looking for”, to descriptive tasks (unsupervised learning; don’t know you’re even looking)]

A Partial Characterisation of Learning Tasks
  • Concept learning
  • Outlier/anomaly detection
  • Clustering
  • Concept formation
  • Conjecture making
  • Puzzle generation
  • Theory formation
Maintaining a Balance in Predictive/Descriptive Tasks
  • Predictive tasks
    • From accuracy to understanding
    • Need to show statistical significance
      • But hypotheses generated often need to be understandable
    • Difference between the stock market and biology
  • Descriptive tasks
    • From pebbles to pearls
    • Lots of rubbish produced
      • Cannot rely on statistical significance
    • Have to worry about notions of interestingness
      • And provide tools to extract useful information from output
Maintaining a Balance in Scientific Discovery Tasks
  • Machine learning researchers
    • Are generally not also domain scientists
  • Extremely important to collaborate
    • To provide interesting projects
      • Remembering that we are scientists not IT consultants
    • To gain materials
      • Data, background knowledge, heuristics
    • To assess the value of the output
Inductive Logic Programming
  • Concept/rule learning technique (usually)
    • Hypotheses represented as Logic Programs
  • Search for LPs
    • From general to specific or vice-versa
      • One method is inverse entailment
    • Use measures to guide the search
      • Predictive accuracy and compression (info. theory)
    • Search performed within a language bias
  • Produces good accuracy and understanding
    • Logic programs are easier to decipher than ANNs
  • Our implementation: Progol (and others)
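As a rough illustration of a general-to-specific search guided by a compression-style measure, here is a minimal propositional sketch. It is not Progol and does not use inverse entailment: hypotheses are plain sets of attribute tests rather than logic programs, and the score (positives covered, minus negatives covered, minus rule length) and all names and data are assumptions made for the example.

```python
def covers(rule, example):
    """A rule (set of attribute/value tests) covers an example if every test holds."""
    return all(example.get(attr) == value for attr, value in rule)

def score(rule, positives, negatives):
    """Assumed compression-style score: reward coverage of positives,
    penalise coverage of negatives and the length of the rule."""
    p = sum(covers(rule, e) for e in positives)
    n = sum(covers(rule, e) for e in negatives)
    return p - n - len(rule)

def specialise(positives, negatives):
    """Greedy general-to-specific search, starting from the empty (most general) rule."""
    rule = frozenset()
    # Candidate tests are drawn from the positive examples (a simple language bias).
    candidates = sorted({(a, v) for e in positives for a, v in e.items()})
    while True:
        refinements = [rule | {c} for c in candidates if c not in rule]
        if not refinements:
            return rule
        best = max(refinements, key=lambda r: score(r, positives, negatives))
        if score(best, positives, negatives) <= score(rule, positives, negatives):
            return rule
        rule = best

# Hypothetical toy data: helices described by two attributes.
positives = [{"length": "hi", "pos": "start"},
             {"length": "hi", "pos": "mid"},
             {"length": "hi", "pos": "start"}]
negatives = [{"length": "lo", "pos": "start"},
             {"length": "lo", "pos": "mid"},
             {"length": "lo", "pos": "end"}]
learned = specialise(positives, negatives)
```

On this toy data the search stops after adding a single test on `length`, because adding a second test costs more in rule length than it gains in coverage.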
Example Learned LP
  • Predicting protein folds from helices

fold('Four-helical up-and-down bundle',P) :-
    helix(P,H1),
    length(H1,hi),
    position(P,H1,Pos),
    interval(1 =< Pos =< 3),
    adjacent(P,H1,H2),
    helix(P,H2).
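To make the reading of the rule concrete, the sketch below checks its body against a few toy facts in Python. The protein and helix names, and the fact tables themselves, are hypothetical; only the conditions (a long helix at position 1 to 3 that is adjacent to another helix) come from the rule above.

```python
# Toy background facts; all protein and helix names are hypothetical.
HELICES = {"p1": ["h1", "h2"], "p2": ["h3"]}
LENGTH = {"h1": "hi", "h2": "lo", "h3": "hi"}
POSITION = {("p1", "h1"): 2, ("p1", "h2"): 3, ("p2", "h3"): 5}
ADJACENT = {("p1", "h1", "h2")}

def is_four_helical_bundle(protein):
    """Return True if the body of the learned rule holds for the protein."""
    helices = HELICES.get(protein, [])
    for h1 in helices:
        # length(H1, hi)
        if LENGTH.get(h1) != "hi":
            continue
        # position(P, H1, Pos), interval(1 =< Pos =< 3)
        pos = POSITION.get((protein, h1))
        if pos is None or not 1 <= pos <= 3:
            continue
        # adjacent(P, H1, H2), helix(P, H2)
        if any((protein, h1, h2) in ADJACENT for h2 in helices):
            return True
    return False
```

Here `is_four_helical_bundle("p1")` holds, while `"p2"` fails because its only long helix sits at position 5.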

Stochastic Logic Programs
  • Generalisation of HMMs
  • Probabilistic logic programs
    • More expressive language than LPs
    • Quantitative rather than qualitative
      • Express arbitrary intervals over probability distributions
  • Issues in learning SLPs
    • Structure estimation
    • Parameter estimation
  • Applications
    • More appropriate for biochemical networks
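A minimal sketch of the idea behind labelled clauses, assuming a normalised program in which the labels of the clauses for one predicate sum to 1. It covers only a grammar-like special case without logical variables, and the clause set and function names are invented for illustration. The probability of a derivation is taken as the product of the labels of the clauses it uses.

```python
import math
import random

# Labelled clauses for one predicate; labels sum to 1 (hypothetical example).
CLAUSES = {
    "seq": [
        (0.4, ["a", "seq"]),  # 0.4 : seq -> a, seq
        (0.6, ["b"]),         # 0.6 : seq -> b
    ],
}

def sample(symbol, rng):
    """Sample a string from the program, with the probability of its derivation."""
    if symbol not in CLAUSES:
        return [symbol], 1.0  # terminal symbol: nothing left to choose
    labels = [label for label, _ in CLAUSES[symbol]]
    label, body = rng.choices(CLAUSES[symbol], weights=labels)[0]
    output, prob = [], label
    for part in body:
        sub_output, sub_prob = sample(part, rng)
        output += sub_output
        prob *= sub_prob
    return output, prob

def derivation_probability(clause_indices, symbol="seq"):
    """Product of the labels of the clauses chosen at each step."""
    return math.prod(CLAUSES[symbol][i][0] for i in clause_indices)
```

For example, choosing the first clause twice and then the second gives the string `a a b` with probability 0.4 × 0.4 × 0.6.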
Automated Theory Formation
  • Descriptive learning technique
    • Which can also be used for prediction tasks
  • Cycle of activity
    • Form concepts, make hypotheses, explain hypotheses, evaluate concepts, start again,…
    • 15 production rules for concepts
    • 7 methods to discover and extract conjectures
    • Uses third party software to prove/disprove (maths)
    • 25 heuristic measures of interestingness
  • Project: see whether this works in bioinformatics
  • Our implementation: HR
Other Machine Learning Methods used in our Group
  • Genetic algorithms
    • To perform ILP search (Alireza)
  • Bayes nets
    • Introduction of hidden nodes (Philip)
  • Kernel methods
    • Relational kernels for SVMs and regression (Huma)
  • Action Languages
    • Stochastic (re)actions (Hiroaki)
Bioinformatics Overview
  • “Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman)
    • Not just storage and analysis of huge DNA sequences
  • “Bioinformaticians have to be a Jack of all trades and a master of one” (Charlie Hodgman, GSK)
  • Highly collaborative
    • biology, mathematics, statistics, computer science, biochemistry, physics, chemistry, medicine, …
From Sequence to Structure
  • There is a computer program…?

attcgatcgatcgatcgatcaggcgcgcta

cgagcggcgaggacctcatcatcgatcag…

MRPQAPGSLVDPNEDELRMAPWYWGRISREEAKSILHGKPDGSFLVRDALSMKGEYTLTLMKDGCEKLIKICHMDRKYGFIETDLFNSVVEMINYYKENSLSMYNKTLDITLSNPIVRAREDEESQPHGDLCLLSNEFIRTCQLLQNLEQNLENKRNSFNAIREELQEKKLHQSVFGNTEKIFRNQIKLNESFMKAPADA……

Holy Grail Number One
  • From protein sequence to protein function
  • HGP data needs to be interpreted
    • Genome split into genes, each of which codes for a protein
    • Biological function of protein dictated by structure
  • Structure of many proteins already determined
    • By X-ray crystallography
  • Best idea so far: given a new gene sequence
    • Find sequence most similar to it with known structure
      • And look at the structure/function of the protein
  • Other alternatives
    • Use ML techniques to predict where secondary structures will occur (e.g., hairpins, alpha-helices, beta-sheets)
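The “most similar sequence with known structure” idea can be sketched as a nearest-neighbour lookup. The example below uses Python's `difflib.SequenceMatcher` as a crude stand-in for a proper sequence alignment score; the sequences and fold labels are made up, and a real system would use an alignment tool and a curated structure database.

```python
from difflib import SequenceMatcher

# Hypothetical proteins with known structure: sequence -> fold annotation.
KNOWN_STRUCTURES = {
    "MKTAYIAKQR": "four-helical bundle",
    "GGGGGGGGGG": "disordered",
}

def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def closest_known(query, known=KNOWN_STRUCTURES):
    """Return (sequence, annotation, similarity) of the best-matching known protein."""
    best = max(known, key=lambda seq: similarity(query, seq))
    return best, known[best], similarity(query, best)

match = closest_known("MKTAYIAKQRQ")
```

The annotation of the best match is then used as a first guess at the structure, and hence the function, of the new sequence.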
Holy Grail Number Two
  • Drug companies lose millions
    • Developing drugs which turn out to be toxic
  • Predictive Toxicology
    • Determine in advance which will be toxic
  • Approach 1: Mapping molecules to toxicity
    • Using ML and statistical techniques
  • Approach 2:
    • Producing metabolic explanations of toxic effects
    • Using probabilistic logics to represent pathways
      • And learning structures and parameters over this
Other Aims of Bioinformatics
  • Organisation of Data
    • Cross referencing
    • Data integration is a massive problem
  • Analysing data from
    • High-throughput methods for gene expression
    • Ask Yike about this!
  • Produce Ontologies
    • And get everyone to use them?
Some Current Bioinformatics Projects
  • SGC
    • The Substructure Server
  • SGC and SHM
    • Discovery in medical ontologies
  • SHM
    • Studying biochemical networks (£400k, BBSRC)
    • Closed loop learning (£200k, EPSRC)
    • The Metalog project (£1.1 million, DTI)
    • APRIL 2 (£400k, EC)
A Substructure Server
  • Lesson from Automated Theorem Proving
    • Best (most complex) methods not most used
      • Other considerations: ease of use, stability, simplicity, e.g., Otter
  • Aim: provide a simple predictive toxicology program
    • Via a server with a very simple interface
  • Sub-projects
    • Find substructures in many positives, few negatives: Colton
      • Simple Prolog program, writing Java version, use ILP??
    • Put program on server: Anandathiyagar (MSc.)
    • Distribute process over our Linux cluster: Darby (MEng.)
    • Babel preprocessor (50+ representations), Rasmol back-end: ???
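The first sub-project, finding substructures that occur in many positives and few negatives, might look roughly like this. The sketch assumes each molecule has already been reduced to a set of named fragments; the fragment names, thresholds, and function name are all hypothetical, and this is not the Prolog program mentioned above.

```python
from collections import Counter

def discriminating_substructures(positives, negatives, min_pos=0.6, max_neg=0.2):
    """Fragments present in at least min_pos of the positives and
    at most max_neg of the negatives, best first."""
    pos_counts = Counter(frag for mol in positives for frag in set(mol))
    neg_counts = Counter(frag for mol in negatives for frag in set(mol))
    found = []
    for frag, count in pos_counts.items():
        pos_frac = count / len(positives)
        neg_frac = neg_counts[frag] / len(negatives)
        if pos_frac >= min_pos and neg_frac <= max_neg:
            found.append((frag, pos_frac, neg_frac))
    # Sort by positive coverage (descending), then negative coverage, then name.
    return sorted(found, key=lambda t: (-t[1], t[2], t[0]))

# Hypothetical toy data: each molecule is a set of fragment labels.
toxic = [{"nitro", "benzene"}, {"nitro", "amine"}, {"nitro", "benzene", "chloro"}]
non_toxic = [{"benzene"}, {"amine", "benzene"}, {"hydroxyl"}]
alerts = discriminating_substructures(toxic, non_toxic)
```

Here only `nitro` passes both thresholds: `benzene` is common among the toxic molecules but just as common among the non-toxic ones.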
Using Medical Ontologies
  • Use Ontology and ML for database integration
    • Muggleton and Tamaddoni-Nezhad
    • Bridge between two disparate databases
      • LIGAND (biochemical reactions)
      • Enzyme classification system (EC) = ontology
  • Automated ontology maintenance
    • Colton and Traganidas (MSc., last year)
    • Gene Ontology (big project)
    • Use data to find links between GO terms
      • Equivalence and implication finding using HR
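One simple way to look for equivalences and implications between ontology terms from data is to compare the sets of genes annotated with each term. The sketch below is a set-based stand-in for that idea, not HR itself; the term and gene identifiers are placeholders.

```python
from itertools import permutations

def find_links(annotations):
    """annotations maps a term to the set of genes annotated with it.
    Returns (equivalences, implications) supported by the data."""
    equivalences, implications = [], []
    for a, b in permutations(sorted(annotations), 2):
        genes_a, genes_b = annotations[a], annotations[b]
        if not genes_a or not genes_a <= genes_b:
            continue  # skip empty terms and pairs where a does not imply b
        if genes_a == genes_b:
            if a < b:  # report each equivalent pair once
                equivalences.append((a, b))
        else:
            implications.append((a, b))  # every gene with a also has b
    return equivalences, implications

# Hypothetical annotation data.
annotations = {
    "term_A": {"g1", "g2"},
    "term_B": {"g1", "g2"},
    "term_C": {"g1", "g2", "g3"},
    "term_D": {"g4"},
}
equivalences, implications = find_links(annotations)
```

Links found this way are only conjectures supported by the current data; they would still need to be checked by a domain expert before changing the ontology.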
Studying Biochemical Networks
  • Use SLPs to find mappings between genomes
    • Map function of pairs of homologous proteins
      • E.g., mouse and human
    • Homology is probabilistic
  • Developed SLP learning algorithms
  • Initial results applying them in biological networks
  • Work by
    • Muggleton, Angelopoulos and Watanabe
Closed Loop Machine Learning
  • Active learning
    • Information theoretic algorithm designs and chooses the most informative and lowest cost experiments to carry out
      • Implemented in the ASE-Progol system
    • Learning generates hypotheses
    • Being studied by Ali Hafiz (PhD)
  • Idea: use machine learning to guide experimentation
    • using a real robot geneticist in a cyclic process
  • Aims of current project: determine the function of genes
  • Cost savings of 2 to 4 times over alternatives
  • Upcoming Nature article
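The trade-off between informativeness and cost can be sketched as follows. This is not the ASE-Progol algorithm: it assumes a finite set of competing hypotheses with prior probabilities, experiments whose outcome each hypothesis predicts deterministically, and a selection rule of expected information gain per unit cost. All names and numbers are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_gain(prior, predictions):
    """Expected reduction in entropy over hypotheses from running one experiment.
    predictions maps each hypothesis to the outcome it predicts."""
    by_outcome = {}
    for hypothesis, p in prior.items():
        by_outcome.setdefault(predictions[hypothesis], []).append(p)
    expected_posterior = 0.0
    for probs in by_outcome.values():
        total = sum(probs)  # probability of observing this outcome
        expected_posterior += total * entropy([p / total for p in probs])
    return entropy(prior.values()) - expected_posterior

def choose_experiment(prior, experiments):
    """Pick the experiment with the highest expected gain per unit cost."""
    return max(
        experiments,
        key=lambda name: expected_gain(prior, experiments[name]["predictions"])
        / experiments[name]["cost"],
    )

# Hypothetical example: four equally likely hypotheses about a gene's function.
prior = {"h1": 0.25, "h2": 0.25, "h3": 0.25, "h4": 0.25}
experiments = {
    "cheap_split": {"cost": 1.0, "predictions":
                    {"h1": "growth", "h2": "growth", "h3": "none", "h4": "none"}},
    "costly_full": {"cost": 4.0, "predictions":
                    {"h1": "a", "h2": "b", "h3": "c", "h4": "d"}},
    "useless": {"cost": 1.0, "predictions":
                {"h1": "growth", "h2": "growth", "h3": "growth", "h4": "growth"}},
}
chosen = choose_experiment(prior, experiments)
```

In this toy setting the cheap experiment that halves the hypothesis space is preferred over the expensive one that would identify the hypothesis outright, since it yields more information per unit cost.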
APRIL 2
  • Applications of Probabilistic Relational Induction in Logic
  • Aim: develop representations and learning algorithms for probabilistic logics
  • Applications: bioinformatics
    • Metabolic networks
    • Phylogenetics
  • 2 RAs at Imperial (with Mike Sternberg)
    • Starting in January
The Metalog Project Overview
  • Aim:
    • Modelling disease pathways and predicting toxicity
    • Gap filling: existing representations correct but incomplete
    • Predict where the toxin is acting (focus)
  • Multi-layered problem representation
    • Meta-network level (Bayes nets) Philip
    • Network level (SLPs) Huma
    • Biochemical reaction level (LPs) Alireza
    • Problog lingua-franca developed
      • to represent learned knowledge
  • NMR Data from metabonomics from Jeremy Nicholson
  • KEGG Background knowledge from Mike Sternberg
The Metalog Project Progress
  • Year 1 achievements (all objectives achieved)
  • Function predictions from LIGAND
  • Mapping between KEGG and metabolic networks
  • Initial Bayes-net model
    • Drawn much interest from experts
      • Agrees with KEGG, and disagrees in interesting ways
      • Interactions between metabolites which are not explained
  • Year 2
    • Working towards abductive model for gap filling
Future Directions for Machine Learning in Bioinformatics
  • In-silico modelling of complete organisms
  • Representation and reasoning at all levels
    • From patient to the molecule
  • Probabilistic models
    • For more complex biological processes
      • Such as biochemical pathways
Biochemical Pathways
  • 1/120th of a biochemical network
Future Directions for My Research
  • Descriptive Induction meets Biology data
  • Most ML bioinformatics projects are predictive
    • Very carefully compressed notions of interestingness
      • Into a single measure: predictive accuracy
      • Domain scientist not bombarded with a lot of information
      • A correctly answered question can be highly revealing
  • Can we push this envelope slightly?
    • Use descriptive induction (WARMR, CLAUDIEN, HR)
      • To tell biologists something they weren’t expecting about the data they have collated
    • Have to worry hard about dull output
      • Need to determine heuristics from domain scientists
More Future Directions
  • Put “Automated Reasoning” back together again
    • Essential for scientific discovery
  • ML, ATP, CSP, etc., all work well individually
    • Surely work better in combination…
  • Improve ATP to prove a different theorem?
    • Make flexible using CSP and ATP
  • Improve ML by rationalising input concepts?
    • Use ATF and ATP to find concepts and hypotheses
  • Improve CSP by introducing additional constraints
    • Use ATF, ML to find constraints, ATP to prove them