- By **zaria**

## PowerPoint Slideshow about 'Machine Learning in Bioinformatics' - zaria


Presentation Transcript

Talk Overview

- Our research group
- Aims, people, publications
- Machine learning
- A balancing act
- Bioinformatics
- Holy grails
- Our bioinformatics research projects
- From small to large
- A future direction
- Integration of reasoning techniques

Computational Bioinformatics Laboratory

- Our aim is to:
- Study the theory, implementation and application of computational techniques to problems in biology and medicine
- Our emphasis is on:
- Machine learning representations, algorithms and applications
- Our favourite techniques are:
- ILP, SLPs, ATF, ATP, CSP, GAs, SVMs
- Kernel methods, Bayes nets, Action Languages
- The (major) research tools we’ve produced are:
- Progol, HR, MetaLog (in production)

The Research Group Members

- Hiroaki Watanabe (RA, BBSRC)
- Alireza Tamaddoni-Nezhad (RA, DTI)
- Stephen Muggleton (Professor)
- Ali Hafiz (PhD)
- Huma Lodhi (RA, DTI)
- Simon Colton (Lecturer)
- Jung-Wook Bang (RA, DTI)
- (Nicos Angelopoulos, now in York) (RA, BBSRC)
- Room 407
- http://www.doc.ic.ac.uk/bioinformatics

Some External Collaborators

- Mike Sternberg (Biochemistry, Imperial)
- Jeremy Nicholson (Biomedical Sciences, Imperial)
- Steve Oliver (Biology, Manchester)
- Ross King (Computing, Aberystwyth)
- Doug Kell (Chemistry, Manchester)
- Chris Rawlings (Oxagen)
- Charlie Hodgman (GSK)
- Alan Bundy (Informatics, Edinburgh)
- Toby Walsh (Cork Constraint Computation Centre)

Some Departmental Collaborators

- Krysia Broda, Alessandra Russo, Oliver Ray
- Aspects of ILP and ALP
- Marek Sergot
- Action Languages
- Tony Kakas (Visiting professor, Cyprus)
- Abductive Logic Programming

Machine Learning Overview

- Ultimately about writing programs which improve with experience
- Experience through data
- Experience through knowledge
- Experience through experimentation (active)
- Some common tasks:
- Concept learning for prediction
- Clustering
- Association rule mining

Maintaining a Balance

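[Diagram: a spectrum running from predictive tasks (supervised learning) at one end to descriptive tasks (unsupervised learning) at the other, with three stages along it]

- Know what you’re looking for
- Don’t know what you’re looking for
- Don’t know you’re even looking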

A Partial Characterisation of Learning Tasks

- Concept learning
- Outlier/anomaly detection
- Clustering
- Concept formation
- Conjecture making
- Puzzle generation
- Theory formation

Maintaining a Balance in Predictive/Descriptive tasks

- Predictive tasks
- From accuracy to understanding
- Need to show statistical significance
- But hypotheses generated often need to be understandable
- Difference between the stock market and biology
- Descriptive tasks
- From pebbles to pearls
- Lots of rubbish produced
- Cannot rely on statistical significance
- Have to worry about notions of interestingness
- And provide tools to extract useful information from output

Maintaining a Balance in Scientific Discovery tasks

- Machine learning researchers
- Are generally not domain scientists as well
- Extremely important to collaborate
- To provide interesting projects
- Remembering that we are scientists not IT consultants
- To gain materials
- Data, background knowledge, heuristics
- To assess the value of the output

Inductive Logic Programming

- Concept/rule learning technique (usually)
- Hypotheses represented as Logic Programs
- Search for LPs
- From general to specific or vice-versa
- One method is inverse entailment
- Use measures to guide the search
- Predictive accuracy and compression (info. theory)
- Search performed within a language bias
- Produces good accuracy and understanding
- Logic programs are easier to decipher than ANNs
- Our implementation: Progol (and others)
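
The idea of guiding the search with a measure that trades coverage against hypothesis size can be sketched roughly as below. This is a simplified illustration, not Progol's actual evaluation function: the score (positives covered, minus negatives covered, minus clause length), the `Rule` class and the toy examples are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Example = Dict[str, int]  # hypothetical feature dictionary describing one protein


@dataclass
class Rule:
    name: str
    length: int                        # number of literals in the clause body
    covers: Callable[[Example], bool]  # does the rule fire on this example?


def compression_score(rule: Rule, positives: List[Example], negatives: List[Example]) -> int:
    """Toy compression-style score: reward covered positives,
    penalise covered negatives and long clauses."""
    p = sum(1 for e in positives if rule.covers(e))
    n = sum(1 for e in negatives if rule.covers(e))
    return p - n - rule.length


def best_rule(rules: List[Rule], positives: List[Example], negatives: List[Example]) -> Rule:
    """Pick the candidate hypothesis with the highest score."""
    return max(rules, key=lambda r: compression_score(r, positives, negatives))


# Made-up data: counts of helices and strands per protein.
positives = [{"helices": 4, "strands": 0}, {"helices": 5, "strands": 0}, {"helices": 4, "strands": 1}]
negatives = [{"helices": 1, "strands": 6}, {"helices": 0, "strands": 8}]

# Candidates ordered from general to specific.
candidates = [
    Rule("anything", 0, lambda e: True),
    Rule("helices>=4", 1, lambda e: e["helices"] >= 4),
    Rule("helices>=4,strands==0", 2, lambda e: e["helices"] >= 4 and e["strands"] == 0),
]

winner = best_rule(candidates, positives, negatives)
print(winner.name)  # helices>=4
```

The most general rule covers every negative, and the most specific one pays for an extra literal while losing a positive, so the middle rule wins under this particular score.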

Example learned LP

- Predicting protein folds from helices

fold('Four-helical up-and-down bundle',P) :-
    helix(P,H1),
    length(H1,hi),
    position(P,H1,Pos),
    interval(1 =< Pos =< 3),
    adjacent(P,H1,H2),
    helix(P,H2).
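
One way to read the rule procedurally: a protein has this fold if a long helix sits among its first three secondary-structure elements and is adjacent to another helix. The Python sketch below mirrors that reading on toy data. The `Element` representation, the reading of `position` as a 1-based index, and the reading of `adjacent` as "immediately followed by" are assumptions for illustration, not the definitions used in the background knowledge.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    kind: str          # "helix" or "strand"
    length_class: str  # "hi", "med" or "lo" (hypothetical discretisation)


def is_four_helical_bundle(protein: List[Element]) -> bool:
    """Mirror of the learned clause under the assumptions stated above."""
    for i in range(len(protein) - 1):
        h1, h2 = protein[i], protein[i + 1]  # adjacent(P,H1,H2), assumed: H2 follows H1
        pos = i + 1                          # position(P,H1,Pos), assumed 1-based
        if (h1.kind == "helix"               # helix(P,H1)
                and h1.length_class == "hi"  # length(H1,hi)
                and 1 <= pos <= 3            # interval(1 =< Pos =< 3)
                and h2.kind == "helix"):     # helix(P,H2)
            return True
    return False


# Made-up proteins, listed as ordered secondary-structure elements.
bundle_like = [Element("helix", "hi"), Element("helix", "med"),
               Element("helix", "hi"), Element("helix", "hi")]
sheet_like = [Element("strand", "med"), Element("helix", "lo"), Element("strand", "hi")]
late_helix = [Element("strand", "lo")] * 3 + [Element("helix", "hi"), Element("helix", "hi")]

print(is_four_helical_bundle(bundle_like))  # True
print(is_four_helical_bundle(sheet_like))   # False
print(is_four_helical_bundle(late_helix))   # False
```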

Stochastic Logic Programs

- Generalisation of HMMs
- Probabilistic logic programs
- More expressive language than LPs
- Quantitative rather than qualitative
- Express arbitrary intervals over probability distributions
- Issues in learning SLPs
- Structure estimation
- Parameter estimation
- Applications
- More appropriate for biochemical networks
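
A minimal sketch of the idea, assuming a very restricted form of SLP: each predicate has clauses labelled with probabilities that sum to one, and each clause either emits one symbol and calls another predicate or terminates. This restricted, HMM-like form and the toy program `SLP` are assumptions for illustration; real SLPs allow arbitrary logic-program clauses.

```python
import random
from typing import Dict, List, Optional, Tuple

# A labelled clause: (probability, emitted symbol, next predicate or None to stop).
Clause = Tuple[float, str, Optional[str]]
Program = Dict[str, List[Clause]]

# Toy program: s emits "a" or "b" and recurses, or stops.
SLP: Program = {
    "s": [(0.4, "a", "s"), (0.3, "b", "s"), (0.3, "", None)],
}


def string_probability(program: Program, goal: str, symbols: List[str]) -> float:
    """Sum, over all derivations of `symbols` from `goal`,
    the product of the clause labels used."""
    total = 0.0
    for prob, emitted, next_goal in program[goal]:
        if next_goal is None:                      # terminating clause
            if not symbols:
                total += prob
        elif symbols and symbols[0] == emitted:    # emit one symbol and continue
            total += prob * string_probability(program, next_goal, symbols[1:])
    return total


def sample(program: Program, goal: Optional[str], rng: random.Random) -> List[str]:
    """Sample one derivation by choosing clauses according to their labels."""
    out: List[str] = []
    while goal is not None:
        clauses = program[goal]
        _, emitted, goal = rng.choices(clauses, weights=[c[0] for c in clauses])[0]
        if emitted:
            out.append(emitted)
    return out


print(string_probability(SLP, "s", ["a", "b"]))  # 0.4 * 0.3 * 0.3
print(sample(SLP, "s", random.Random(0)))
```

Parameter estimation corresponds to fitting the numeric labels for a fixed set of clauses, while structure estimation corresponds to choosing the clauses themselves.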

Automated Theory Formation

- Descriptive learning technique
- Which can also be used for prediction tasks
- Cycle of activity
- Form concepts, make hypotheses, explain hypotheses, evaluate concepts, start again,…
- 15 production rules for concepts
- 7 methods to discover and extract conjectures
- Uses third party software to prove/disprove (maths)
- 25 heuristic measures of interestingness
- Project: see whether this works in bioinformatics
- Our implementation: HR

Other Machine Learning Methods used in our Group

- Genetic algorithms
- To perform ILP search (Alireza)
- Bayes nets
- Introduction of hidden nodes (Philip)
- Kernel methods
- Relational kernels for SVMs and regression (Huma)
- Action Languages
- Stochastic (re)actions (Hiroaki)

Bioinformatics Overview

- “Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman)
- Not just storage and analysis of huge DNA sequences
- “Bioinformaticians have to be a Jack of all trades and a master of one” (Charlie Hodgman, GSK)
- Highly collaborative
- biology, mathematics, statistics, computer science, biochemistry, physics, chemistry, medicine, …

From Sequence to Structure

- There is a computer program…?

attcgatcgatcgatcgatcaggcgcgcta

cgagcggcgaggacctcatcatcgatcag…

MRPQAPGSLVDPNEDELRMAPWYWGRISREEAKSILHGKPDGSFLVRDALSMKGEYTLTLMKDGCEKLIKICHMDRKYGFIETDLFNSVVEMINYYKENSLSMYNKTLDITLSNPIVRAREDEESQPHGDLCLLSNEFIRTCQLLQNLEQNLENKRNSFNAIREELQEKKLHQSVFGNTEKIFRNQIKLNESFMKAPADA……

Holy Grail Number One

- From protein sequence to protein function
- HGP data needs to be interpreted
- Genome split into genes, each of which codes for a protein
- Biological function of protein dictated by structure
- Structure of many proteins already determined
- By X-ray crystallography
- Best idea so far: given a new gene sequence
- Find sequence most similar to it with known structure
- And look at the structure/function of the protein
- Other alternatives
- Use ML techniques to predict where secondary structures will occur (e.g., hairpins, alpha-helices, beta-sheets)
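
The "best idea so far" amounts to a nearest-neighbour lookup: find the most similar sequence with a known structure and borrow its annotation. The sketch below uses Python's `difflib.SequenceMatcher` as a crude stand-in for a proper alignment tool. The tiny database and its structure labels are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple

# Hypothetical database: sequence -> structure annotation (labels are invented).
KNOWN_STRUCTURES: Dict[str, str] = {
    "MRPQAPGSLVDPNEDELRMAPWYW": "hypothetical structure A",
    "GGSSGGSSGGSSGGSSGGSSGGSS": "hypothetical structure B",
}


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a real system would use sequence alignment."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def closest_known(query: str, known: Dict[str, str]) -> Tuple[str, str, float]:
    """Return the most similar known sequence, its annotation and the score."""
    best = max(known, key=lambda seq: similarity(query, seq))
    return best, known[best], similarity(query, best)


query = "MRPQAPGSLVDPNEDELRMAPWYF"  # differs from the first entry in one residue
match, structure, score = closest_known(query, KNOWN_STRUCTURES)
print(structure, round(score, 3))
```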

Holy Grail Number Two

- Drug companies lose millions
- Developing drugs which turn out to be toxic
- Predictive Toxicology
- Determine in advance which will be toxic
- Approach 1: Mapping molecules to toxicity
- Using ML and statistical techniques
- Approach 2:
- Producing metabolic explanations of toxic effects
- Using probabilistic logics to represent pathways
- And learning structures and parameters over this

Other aims of Bioinformatics

- Organisation of Data
- Cross referencing
- Data integration is a massive problem
- Analysing data from
- High-throughput methods for gene expression
- Ask Yike about this!
- Produce Ontologies
- And get everyone to use them?

Some Current Bioinformatics Projects

- SGC
- The Substructure Server
- SGC and SHM
- Discovery in medical ontologies
- SHM
- Studying biochemical networks (£400k, BBSRC)
- Closed loop learning (£200k, EPSRC)
- The Metalog project (£1.1 million, DTI)
- APRIL 2 (£400k, EC)

A Substructure Server

- Lesson from Automated Theorem Proving
- Best (most complex) methods not most used
- Other considerations: ease of use, stability, simplicity, e.g., Otter
- Aim: provide a simple predictive toxicology program
- Via a server with a very simple interface
- Sub-projects
- Find substructures in many positives, few negatives: Colton
- Simple Prolog program, writing Java version, use ILP??
- Put program on server: Anandathiyagar (MSc.)
- Distribute process over our Linux cluster: Darby (MEng.)
- Babel preprocessor (50+ repns), Rasmol back-end: ???
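
The first sub-project (substructures found in many positives and few negatives) can be sketched as a simple coverage filter. Here each molecule is reduced to a set of named fragments. That representation, the thresholds and the toy molecules are assumptions for illustration and say nothing about how the actual Prolog program works.

```python
from typing import Dict, List, Set, Tuple


def discriminating_substructures(
    positives: List[Set[str]],
    negatives: List[Set[str]],
    min_pos: float = 0.6,
    max_neg: float = 0.2,
) -> Dict[str, Tuple[float, float]]:
    """Return substructures present in at least `min_pos` of the positives
    and at most `max_neg` of the negatives, with their coverage rates."""
    candidates: Set[str] = set().union(*positives)
    results: Dict[str, Tuple[float, float]] = {}
    for sub in candidates:
        pos_rate = sum(sub in mol for mol in positives) / len(positives)
        neg_rate = sum(sub in mol for mol in negatives) / len(negatives)
        if pos_rate >= min_pos and neg_rate <= max_neg:
            results[sub] = (pos_rate, neg_rate)
    return results


# Made-up molecules described by the fragments they contain.
toxic = [{"nitro", "benzene"}, {"nitro", "amine"}, {"nitro", "benzene", "chloro"}]
non_toxic = [{"benzene", "hydroxyl"}, {"amine"}, {"benzene"}]

found = discriminating_substructures(toxic, non_toxic)
print(found)  # {'nitro': (1.0, 0.0)}
```

In this toy data `benzene` is common among the positives but also among the negatives, so only `nitro` survives the filter.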

Using Medical Ontologies

- Use Ontology and ML for database integration
- Muggleton and Tamaddoni-Nezhad
- Bridge between two disparate databases
- LIGAND (biochemical reactions)
- Enzyme classification system (EC) = ontology
- Automated ontology maintenance
- Colton and Traganidas (MSc, last year)
- Gene Ontology (big project)
- Use data to find links between GO terms
- Equivalence and implication finding using HR
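
Finding links between terms from data can be illustrated with a purely empirical check: term A "implies" term B if every gene annotated with A is also annotated with B, and the two are "equivalent" if they annotate exactly the same genes. This is only a sketch of the kind of conjecture involved, not HR's algorithm; the gene names, term names and the `min_support` cut-off are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def find_conjectures(
    annotations: Dict[str, Set[str]], min_support: int = 2
) -> Tuple[List[Pair], List[Pair]]:
    """Return (implications, equivalences) that hold on the data for terms
    annotating at least `min_support` genes."""
    terms = sorted(set().union(*annotations.values()))
    # Extension of a term: the set of genes annotated with it.
    extension = {t: {g for g, ts in annotations.items() if t in ts} for t in terms}
    implications: List[Pair] = []
    equivalences: List[Pair] = []
    for a, b in permutations(terms, 2):
        if len(extension[a]) < min_support:
            continue  # too few examples to be interesting
        if extension[a] == extension[b]:
            if a < b:  # report each equivalence once
                equivalences.append((a, b))
        elif extension[a] <= extension[b]:
            implications.append((a, b))
    return implications, equivalences


# Made-up annotations: gene -> terms (not real GO identifiers).
annotations = {
    "g1": {"T1", "T2", "T3"},
    "g2": {"T1", "T2", "T3"},
    "g3": {"T3"},
    "g4": {"T4"},
}

implications, equivalences = find_conjectures(annotations)
print(implications)  # [('T1', 'T3'), ('T2', 'T3')]
print(equivalences)  # [('T1', 'T2')]
```

Conjectures found this way are only as good as the data, which is why the output still has to be assessed for interestingness by a domain scientist.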

Studying Biochemical Networks

- Use SLPs to find mappings between genomes
- Map function of pairs of homologous proteins
- E.g., mouse and human
- Homology is probabilistic
- Developed SLP learning algorithms
- Initial results applying them in biological networks
- Work by
- Muggleton, Angelopoulos and Watanabe

Closed Loop Machine Learning

- Active learning
- Information theoretic algorithm designs and chooses the most informative and lowest cost experiments to carry out
- Implemented in the ASE-Progol system
- Learning generates hypotheses
- Being studied by Ali Hafiz (PhD)
- Idea: use machine learning to guide experimentation
- using a real robot geneticist in a cyclic process
- Aims of current project: determine the function of genes
- Cost savings of 2 to 4 times over alternatives
- Upcoming Nature article
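
Choosing the most informative, lowest-cost experiment can be sketched as follows. The sketch assumes each hypothesis deterministically predicts a yes/no outcome for each experiment, so the expected information gain is the entropy of the predicted outcome, and it ranks experiments by gain per unit cost. That ranking rule, the priors and the experiments are assumptions for illustration, not the cost function used in ASE-Progol.

```python
import math
from typing import Dict, Iterable, Tuple

Predictions = Dict[str, bool]  # hypothesis name -> predicted outcome


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: Dict[str, float], predictions: Predictions) -> float:
    """With deterministic predictions, the expected reduction in uncertainty
    about the hypotheses equals the entropy of the experiment's outcome."""
    p_true = sum(p for h, p in prior.items() if predictions[h])
    return entropy([p_true, 1.0 - p_true])


def choose_experiment(
    prior: Dict[str, float], experiments: Dict[str, Tuple[float, Predictions]]
) -> str:
    """Pick the experiment with the highest expected gain per unit cost."""
    return max(
        experiments,
        key=lambda name: expected_information_gain(prior, experiments[name][1])
        / experiments[name][0],
    )


# Made-up hypotheses about a gene's function, with prior probabilities.
prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}

# Experiment name -> (cost, outcome predicted by each hypothesis).
experiments = {
    "e1": (1.0, {"h1": True, "h2": False, "h3": False}),
    "e2": (1.0, {"h1": True, "h2": True, "h3": False}),
    "e3": (4.0, {"h1": True, "h2": False, "h3": False}),
}

print(choose_experiment(prior, experiments))  # e1
```

Experiment `e3` is as informative as `e1` but four times as expensive, and `e2` splits the hypotheses less evenly, so `e1` is chosen.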

APRIL 2

- Applications of Probabilistic Relational Induction in Logic
- Aim: develop representations and learning algorithms for probabilistic logics
- Applications: bioinformatics
- Metabolic networks
- Phylogenetics
- 2 RAs at Imperial (with Mike Sternberg)
- Starting in January

The Metalog Project Overview

- Aim:
- Modelling disease pathways and predicting toxicity
- Gap filling: existing representations correct but incomplete
- Predict where the toxin is acting (focus)
- Multi-layered problem representation
- Meta-network level (Bayes nets) Philip
- Network level (SLPs) Huma
- Biochemical reaction level (LPs) Alireza
- Problog lingua-franca developed
- to represent learned knowledge
- NMR Data from metabonomics from Jeremy Nicholson
- KEGG Background knowledge from Mike Sternberg

The Metalog Project Progress

- Year 1 achievements (all objectives achieved)
- Function predictions from LIGAND
- Mapping between KEGG and metabolic networks
- Initial Bayes-net model
- Drawn much interest from experts
- Agrees with KEGG, and disagrees in interesting ways
- Interactions between metabolites which are not explained
- Year 2
- Working towards abductive model for gap filling

Future Directions for Machine Learning in Bioinformatics

- In-silico modelling of complete organisms
- Representation and reasoning at all levels
- From patient to the molecule
- Probabilistic models
- For more complex biological processes
- Such as biochemical pathways

Biochemical Pathways

- 1/120th of a biochemical network

Future Directions for My Research

- Descriptive Induction meets Biology data
- Most ML bioinformatics projects are predictive
- Very carefully compressed notions of interestingness
- Into a single measure: predictive accuracy
- Domain scientist not bombarded with a lot of information
- A correctly answered question can be highly revealing
- Can we push this envelope slightly?
- Use descriptive induction (WARMR, CLAUDIEN, HR)
- To tell biologists something they weren’t expecting about the data they have collated
- Have to worry hard about dull output
- Need to determine heuristics from domain scientists

More Future Directions

- Put “Automated Reasoning” back together again
- Essential for scientific discovery
- ML, ATP, CSP, etc., all work well individually
- Surely work better in combination…
- Improve ATP to prove a different theorem?
- Make flexible using CSP and ATP
- Improve ML by rationalising input concepts?
- Use ATF and ATP to find concepts and hypotheses
- Improve CSP by introducing additional constraints
- Use ATF, ML to find constraints, ATP to prove them
