Automated Theory Formation: First Steps in Bioinformatics

Simon Colton

Computational Bioinformatics Laboratory

Machine Learning (ML) Questions
  • Given some background information
    • Concepts, hypotheses (axioms)
  • Given some positive examples
    • And some negative examples
  • Find me an explanation
    • Why the positives are positive
    • And the negatives are negative
Example: Predictive Toxicology
  • Given some theory from chemistry
    • Structure of molecules, well known substructures
  • Given some examples of toxic drugs
    • And some examples of non-toxic drugs
  • Question: Why are the toxic drugs toxic?
Automated Theory Formation (ATF) Questions
  • Given some background information
    • Concepts, hypotheses (axioms)
  • And some objects of interest
    • Numbers, Molecules, etc.
  • Find something interesting
  • Interesting things could be:
    • Concepts, examples, hypotheses, explanations
ATF Overview
  • Scientific theories contain (at least):
    • Concepts: salt, acid, base
    • Hypotheses: acid + base => salt + water
    • Explanations: transfer of electrons, dissolving
  • So, ATF should do (at least):
    • Concept formation, Conjecture making
    • Hypothesis proving and disproving.
  • Also needs to:
    • Measure interestingness, present results, etc.
HR Theory Formation System
  • Developed in maths
    • Designed to be a general-purpose system
  • Concept-based theory formation
    • Tries to make a concept
    • Makes conjecture when it can’t make a concept
    • Tries to explain conjectures
  • Conjecture-based theory formation
    • Fix faulty conjectures with concept formation
    • PhD work of Alison Pease, based on Lakatos
Concept Formation in HR
  • 10 General Production Rules
    • Take in old concepts, produce new concepts

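  • Example: four of the rules build the concept of odd prime numbers
    • Start: [a,b] : b|a
    • Size (from [a,b] : b|a): [a,n] : n = |{b : b|a}|
    • Split (from [a,n]): [a] : 2 = |{b : b|a}|
    • Split (from [a,b] : b|a): [a] : 2|a
    • Negate: [a] : not 2|a
    • Compose: [a] : 2 = |{b : b|a}| & not 2|a (Odd Prime Numbers)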
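
The chain of production rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not HR's implementation: the function names, the set-based representation of concepts, and the bound `LIMIT` are all assumptions made for the example.

```python
# Hypothetical sketch of four production rules over the integers 1..LIMIT.
# Concepts are stored as plain sets; this is not how HR represents them.
LIMIT = 30
NUMBERS = range(1, LIMIT + 1)

# Background concept [a,b] : b|a, stored as a set of (a, b) pairs.
divisors = {(a, b) for a in NUMBERS for b in NUMBERS if a % b == 0}

def size(pairs):
    """Size rule: [a,n] where n counts the b's paired with each a."""
    counts = {}
    for a, _ in pairs:
        counts[a] = counts.get(a, 0) + 1
    return set(counts.items())

def split(pairs, value):
    """Split rule: fix the second column to `value`, keep the first."""
    return {a for a, second in pairs if second == value}

def negate(concept, universe):
    """Negate rule: objects of the universe that do not satisfy the concept."""
    return set(universe) - concept

def compose(left, right):
    """Compose rule: objects satisfying both concepts."""
    return left & right

tau = size(divisors)                # [a,n] : n = |{b : b|a}|
primes = split(tau, 2)              # [a] : 2 = |{b : b|a}|
evens = split(divisors, 2)          # [a] : 2|a
odds = negate(evens, NUMBERS)       # [a] : not 2|a
odd_primes = compose(primes, odds)  # [a] : 2 = |{b : b|a}| & not 2|a

print(sorted(odd_primes))
```

Each rule takes old concepts in and returns a new one, so the rules can be chained in any order the search chooses.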

Conjecture Making
  • Empirical checks are performed
    • After each attempt to invent a new concept
  • If the concept has no examples
    • Makes non-existence conjecture
  • If the concept has the same examples as a previous concept
    • Makes an equivalence conjecture
  • If another concept subsumes the concept
    • Makes an implication conjecture
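
A minimal sketch of these empirical checks, assuming each concept is summarised by the set of its examples. The function name `check_new_concept` and the dictionary of known concepts are hypothetical; HR's own data structures are not described on the slide.

```python
def check_new_concept(name, examples, known):
    """Return (kind, statement) conjectures for a candidate concept.

    `examples` is the candidate's example set; `known` maps the names of
    existing concepts to their example sets. Both are assumed inputs.
    """
    if not examples:
        return [("non-existence", f"no object satisfies {name}")]
    conjectures = []
    for other, other_examples in known.items():
        if examples == other_examples:
            # Same examples as a previous concept.
            conjectures.append(("equivalence", f"{name} <=> {other}"))
        elif examples < other_examples:
            # Proper subset: the other concept subsumes the candidate.
            conjectures.append(("implication", f"{name} => {other}"))
    return conjectures


known = {"odd": {1, 3, 5, 7, 9}, "prime": {2, 3, 5, 7}}
print(check_new_concept("odd_prime", {3, 5, 7}, known))
print(check_new_concept("two_divisors", {2, 3, 5, 7}, known))
print(check_new_concept("even_and_odd", set(), known))
```

The checks are purely empirical: a conjecture produced this way holds on the available examples and still needs proving or disproving.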
Conjecture Extraction
  • Suppose HR makes equivalence conjecture:
    • P(a) & Q(a) <=> R(a) & S(a)
  • Extracts:
    • P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
    • R(a) & S(a) => P(a), R(a) & S(a) => Q(a)
  • Tries to Extract: P(a) => R(a), Q(a) => R(a), etc.
    • Prime implicates (require proving, though)
  • Important: gets Horn Clauses
    • Can be expressed in Prolog…..
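
The first extraction step can be sketched as follows, assuming each side of the equivalence is given as a list of literal strings. The helper names `extract_implications` and `as_prolog` are hypothetical, and the sketch does not attempt the prime-implicate step, which needs a prover.

```python
def extract_implications(left, right):
    """Split `left <=> right` into Horn clauses with one head literal each."""
    clauses = []
    for body, heads in ((left, right), (right, left)):
        for head in heads:
            clauses.append((tuple(body), head))
    return clauses


def as_prolog(clause):
    """Render a (body, head) pair in Prolog syntax."""
    body, head = clause
    return f"{head} :- {', '.join(body)}."


for clause in extract_implications(["p(A)", "q(A)"], ["r(A)", "s(A)"]):
    print(as_prolog(clause))
```

Because every extracted clause has a single head literal, each one is a Horn clause and maps directly onto a Prolog rule.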
Explanation Generation
  • In mathematical domains
    • HR relies on automated theorem provers
    • And Model generators
      • To find counterexamples
    • E.g., group theory: a*a=a <=> a=id (easily proved)
  • In biological/chemistry domains
    • Possibly: visualisation tools, reaction pathways
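
To illustrate the role of a model generator, the sketch below searches small finite structures for a counterexample to a*a=a <=> a=id. It is a brute-force stand-in under stated assumptions, not a theorem prover or any tool HR actually calls; `find_counterexample` and `cyclic_group` are hypothetical names.

```python
def find_counterexample(elements, op, identity):
    """Return an element where (a*a == a) and (a == id) disagree, else None."""
    for a in elements:
        if (op(a, a) == a) != (a == identity):
            return a
    return None


def cyclic_group(n):
    """The integers 0..n-1 under addition mod n, with identity 0."""
    return list(range(n)), (lambda x, y: (x + y) % n), 0


# No counterexample turns up in small cyclic groups.
for n in range(1, 13):
    assert find_counterexample(*cyclic_group(n)) is None

# Multiplication mod 6 is not a group, and there the conjecture fails at 0.
print(find_counterexample(list(range(6)), lambda x, y: (x * y) % 6, 1))
```

Failing to find a counterexample in small models is only evidence; the proof itself still has to come from a prover.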
Greatest Hits
  • Please ask me over coffee about:
    • Pre-processing constraint problems
    • Learning properties of quadratic residues
    • Inventing integer sequences
    • Puzzle generation
    • Adding to the TPTP library
    • Setting mathematical tutorial questions
Long term aim in Bioinformatics
  • Develop an ATF system similar to HOMER
    • But working in biological domains
  • Biologist provides little background info
    • In a format they are happy with
  • Program provides results
    • Intelligent, interesting, not too much,
    • And very little rubbish
  • Automated assistant for biology
Short term aim in Bioinformatics
  • HR can work with biological data
    • Takes input similar to Muggleton’s Progol
  • Use HR to solve ML problems
    • See how bad an idea that is
  • Use theory formation to improve ML
    • Integrate HR and Progol somehow
Naïve Approach to ML Tasks
  • Give HR the same input as Progol
    • Get it to form a theory
  • Look at the theory
    • Extract concepts which do well on the task
    • i.e., they look similar to target concept
  • Not a goal-based approach
    • Bad idea (slow)
Less Naïve Approach
  • Improve search using “forward look-ahead”
    • ICML Paper
  • This has evolved to “reactive search”
    • Uses HR’s own Java interpreter
    • HR reacts to certain events in theory formation
      • Scripts supplied by the user
  • HR also makes “near-conjectures”
  • Faster approach, but still fairly slow
Example – Mutagenesis42 Data
  • Mutagenesis is similar to carcinogenesis
  • 42 drugs supplied with atom-bond details
    • Atom type, number & charge, bond type (1-8)
  • 13 are mutagenic (active), 29 are not active
  • Progol learned this concept (88% accurate)
    • active(A) :- bond(A,B,C,2), bond(A,D,B,1),atm(A,D,c,21,E)

[Diagram of Progol's substructure: bond types 1 and 2; atom labels c,21, ?, ?]
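
To make the learned clause concrete, here is a sketch that checks it against toy molecules and computes accuracy. The data layout (directed bond triples, an atom dictionary) and the two molecules are invented for illustration and are not taken from the Mutagenesis42 data; `progol_rule_covers` and `rule_accuracy` are hypothetical helpers.

```python
# Hypothetical toy data: bonds are (atom1, atom2, bond_type) triples and
# atoms map a name to (element, atom_type, charge). Not Mutagenesis42 data.
toy_molecules = {
    "m1": {"bonds": [("a1", "a2", 2), ("a3", "a1", 1)],
           "atoms": {"a1": ("c", 22, -0.1), "a2": ("o", 40, -0.4),
                     "a3": ("c", 21, 0.1)}},
    "m2": {"bonds": [("a1", "a2", 1)],
           "atoms": {"a1": ("c", 22, 0.0), "a2": ("h", 3, 0.1)}},
}
toy_labels = {"m1": True, "m2": False}


def progol_rule_covers(molecule):
    """active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E)."""
    bonds, atoms = molecule["bonds"], molecule["atoms"]
    for b, _c, first_type in bonds:
        if first_type != 2:
            continue
        # Look for a type-1 bond from some atom D into the same atom B.
        for d, b_again, second_type in bonds:
            if second_type == 1 and b_again == b:
                element, atom_type, _charge = atoms[d]
                if element == "c" and atom_type == 21:
                    return True
    return False


def rule_accuracy(rule, molecules, labels):
    """Fraction of molecules whose prediction matches the label."""
    correct = sum(rule(mol) == labels[name] for name, mol in molecules.items())
    return correct / len(molecules)


print(rule_accuracy(progol_rule_covers, toy_molecules, toy_labels))
```

Bonds are matched in the direction they are stored, mirroring how the clause would match `bond/4` facts.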

HR’s Results
  • Using reactive search, four PRs, 30K steps
  • HR learned this concept:
    • active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
    • Also 88% accurate
    • But, Progol’s answer “better”
    • Because higher information content (fewer ?s)
    • Biologists sometimes want more information
      • Is this really a simpler answer?

[Diagram of HR's substructure: bond types 1 and ?; atom labels ?,21, ?, ?]

But…..
  • HR also made these equivalence conjectures
    • And extracted them (+100 more) for us

atm(B,X,21) <=> atm(B,c,21)

atm(B,X,38) <=> atm(B,n,38)

bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)

bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2) & atm(B,X5,38)

  • We used these to re-write HR’s answer
    • By hand, but hope to automate
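
One way the hand rewriting might be automated is sketched below, using only the first two conjectures, read here as "atom type 21 is always carbon" and "atom type 38 is always nitrogen". The `(element, atom_type)` representation with `None` for a blank, and the names `ELEMENT_FOR_TYPE` and `fill_element`, are assumptions for illustration.

```python
# Assumed reading of the first two equivalence conjectures:
# in this data the atom type determines the element.
ELEMENT_FOR_TYPE = {21: "c", 38: "n"}


def fill_element(atom):
    """Fill a blank element when the atom type determines it."""
    element, atom_type = atom
    if element is None and atom_type in ELEMENT_FOR_TYPE:
        return (ELEMENT_FOR_TYPE[atom_type], atom_type)
    return atom


# Atoms of a learned concept, with None standing for a "?" in the diagrams.
learned_atoms = [(None, 21), (None, None), (None, 38)]
print([fill_element(atom) for atom in learned_atoms])
```

The rewrite never changes which molecules the concept covers, because each substitution is backed by an equivalence that holds on the data.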
Giving us this answer:
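[Diagram of the rewritten HR answer: bond types 1 and 2; atom labels c,21, ?, n,38]

  • Remember that Progol’s answer was:

[Diagram of Progol's answer: bond types 1 and 2; atom labels c,21, ?, ?]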

  • So, we filled in one of the blanks!
Are we making a meal of this?
  • Yes, possibly for the mutagenesis data
    • I was worried about the difficulty of this problem
  • In the last week I’ve written a
    • 200-line Prolog program which runs quite fast
    • And can be distributed over multiple processors
    • And can be easily understood by biologists
  • And gets these results….
Template search – Results
  • Nice result one (88% accurate, lots of info)

[Diagram of result one: bond types 1, 2, 2; atom labels c,21, n,38, o,40, o,40]

  • Nice result two (95% accurate)

[Diagram of result two, labels as extracted: bond types 1, 1, 2, 7, 1, 7; atom labels c,21, c,?, c,195, n,38, o,40, c,22, ?, c,22, h,3; values -0.132, 0.145]

Template Search - Assumptions
  • Connected substructures
    • Are interesting answers
    • Progol’s answers are all substructures
  • More specific substructures are not so bad
    • Biologists may even want lots of information
    • Don’t forget that they want to do science
  • Each learned concept will be true of
    • At least one active (positive) molecule
Template Search - Overview

[Diagram of a template: three atoms, each labelled ?,?, joined by two bonds, each labelled ?; eight ?s in total]

  • User chooses template for substructures
  • User specifies how many ?s are allowed
    • E.g., 3 out of 8 in the above template
  • Algorithm starts with the first positive
    • Extracts all substructures in the template
  • Then takes the next positive,
    • for each substructure in the set
      • Add the LGG (least general generalisation) so that it fits both positives
      • Don’t go under the IC limit
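
The generalisation loop above can be sketched as follows. The sketch assumes that substructures have already been extracted into fixed-length tuples matching the template, and it reads the IC limit as a cap on the number of ?s; both are assumptions, as are the names `lgg`, `generalise` and `template_search`. The original program is in Prolog, so this is not its code.

```python
WILDCARD = "?"


def lgg(first, second):
    """Least general generalisation of two same-template substructures."""
    return tuple(x if x == y else WILDCARD for x, y in zip(first, second))


def generalise(candidates, positive_substructures, max_wildcards):
    """Add every LGG of a candidate with the next positive's substructures."""
    result = set(candidates)
    for candidate in candidates:
        for sub in positive_substructures:
            general = lgg(candidate, sub)
            # Assumed reading of the IC limit: a cap on the number of ?s.
            if general.count(WILDCARD) <= max_wildcards:
                result.add(general)
    return result


def template_search(positives, max_wildcards):
    """`positives` is a list of substructure sets, one set per positive."""
    candidates = set(positives[0])
    for substructures in positives[1:]:
        candidates = generalise(candidates, substructures, max_wildcards)
    return candidates


# Template fields: (bond1, bond2, elem1, type1, elem2, type2, elem3, type3)
positives = [
    {(1, 2, "c", 21, "n", 38, "o", 40)},
    {(1, 2, "c", 21, "n", 38, "c", 22)},
]
print(template_search(positives, max_wildcards=3))
```

Keeping the original substructures alongside their generalisations matches the assumption that each learned concept is true of at least one positive.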
Template Search – Final Part
  • For all the substructures
    • Take a disjunction
      • Which achieves the best accuracy
  • Distribution of this algorithm possible
    • We’re getting a big Linux farm
    • PPP – Processor Per Positive
      • finds substructures true of one positive
      • combine answers at the end
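
The slide does not say how the best disjunction is chosen, so the sketch below swaps in a simple greedy selection: keep adding the substructure that raises accuracy the most, and stop when nothing helps. The coverage sets, molecule ids and function names are all hypothetical.

```python
def accuracy(covered, positives, negatives):
    """Accuracy of predicting 'active' exactly for the covered molecules."""
    true_pos = len(covered & positives)
    true_neg = len(negatives - covered)
    return (true_pos + true_neg) / (len(positives) + len(negatives))


def greedy_disjunction(coverage, positives, negatives):
    """Greedy stand-in for 'take the disjunction with the best accuracy'."""
    chosen, covered = [], set()
    best = accuracy(covered, positives, negatives)
    while True:
        gains = [(accuracy(covered | mols, positives, negatives), name)
                 for name, mols in coverage.items() if name not in chosen]
        if not gains:
            break
        score, name = max(gains)
        if score <= best:
            break
        chosen.append(name)
        covered |= coverage[name]
        best = score
    return chosen, best


# Hypothetical coverage: which molecule ids each substructure is true of.
coverage = {"s1": {1, 2}, "s2": {3, 4}, "s3": {3}}
print(greedy_disjunction(coverage, positives={1, 2, 3}, negatives={4, 5, 6, 7}))
```

A greedy pass is not guaranteed to find the best disjunction, but it is cheap, and the per-positive substructure sets it consumes can be computed independently and merged at the end.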
Conclusions & Future Work
  • Automated Theory Formation
    • May be useful to bioinformatics
    • Use HR’s theory to improve Progol’s results
      • Possibly by pre-processing Progol’s input
      • Or by post-processing the learned concept
  • Template search
    • Maybe a good idea? Possibly not new….
    • Not bad results for the Mutagenesis42 dataset