Automated Theory Formation: First Steps in Bioinformatics

Simon Colton

Computational Bioinformatics Laboratory

- Given some background information
- Concepts, hypotheses (axioms)

- Given some positive examples
- And some negative examples

- Find me an explanation
- Why the positives are positive
- And the negatives are negative

- Given some theory from chemistry
- Structure of molecules, well known substructures

- Given some examples of toxic drugs
- And some examples of non-toxic drugs

- Question: Why are the toxic drugs toxic?

- Given some background information
- Concepts, hypotheses (axioms)

- And some objects of interest
- Numbers, Molecules, etc.

- Find something interesting
- Interesting things could be:
- Concepts, examples, hypotheses, explanations

- Scientific theories contain (at least):
- Concepts: salt, acid, base
- Hypotheses: acid + base => salt + water
- Explanations: transfer of electrons, dissolving

- So, ATF should do (at least):
- Concept formation, Conjecture making
- Hypothesis proving and disproving.

- Also needs to:
- Measure interestingness, present results, etc.

- HR was developed in maths
- Designed to be a general-purpose system

- Concept-based theory formation
- Tries to make a concept
- Makes a conjecture when it can’t make a concept
- Tries to explain conjectures

- Conjecture-based theory formation
- Fix faulty conjectures with concept formation
- PhD work of Alison Pease, based on Lakatos

- 10 General Production Rules
- Take in old concepts, produce new concepts

- Start: [a,b] : b|a
- Size: [a,n] : n = |{b : b|a}|
- Split: [a] : 2 = |{b : b|a}|
- Split: [a] : 2|a
- Negate: [a] : not 2|a
- Compose: [a] : 2 = |{b : b|a}| & not 2|a (Odd Prime Numbers)
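The route from the divisor concept to odd primes can be sketched in a few lines of Python. This is only an illustration of the four production rules named on the slide; the function names are hypothetical and say nothing about how HR itself represents concepts.

```python
def divisors(a):
    """Concept [a,b] : b|a, given as the set of b for one a."""
    return {b for b in range(1, a + 1) if a % b == 0}

def size(a):
    """Size rule: [a,n] : n = |{b : b|a}|."""
    return len(divisors(a))

def has_two_divisors(a):
    """Split rule with n fixed to 2: [a] : 2 = |{b : b|a}|."""
    return size(a) == 2

def is_even(a):
    """Split rule with b fixed to 2: [a] : 2|a."""
    return 2 in divisors(a)

def negate(concept):
    """Negate rule: true exactly where the old concept is false."""
    return lambda a: not concept(a)

def compose(c1, c2):
    """Compose rule: conjunction of two concepts."""
    return lambda a: c1(a) and c2(a)

odd_prime = compose(has_two_divisors, negate(is_even))
examples = [a for a in range(1, 30) if odd_prime(a)]
print(examples)
```

Each rule takes old concepts and returns a new one, so the same small set of rules can be chained in many different orders.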

- Empirical checks are performed
- After each attempt to invent a new concept

- If the concept has no examples
- Makes a non-existence conjecture

- If the concept has the same examples as a previous one
- Makes an equivalence conjecture

- If another concept subsumes the concept
- Makes an implication conjecture
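The three empirical checks can be sketched as a comparison of example sets. This is a minimal sketch, assuming each concept is stored with the set of objects that satisfy it; `check_new_concept` and the toy theory are hypothetical names, not HR's interface.

```python
def check_new_concept(name, examples, theory):
    """Return the conjectures suggested by a new concept's examples.

    theory maps the name of each existing concept to a frozenset of examples.
    """
    examples = frozenset(examples)
    if not examples:
        return [("non-existence", f"no object satisfies {name}")]
    for old_name, old_examples in theory.items():
        if examples == old_examples:
            return [("equivalence", f"{name} <=> {old_name}")]
    conjectures = []
    for old_name, old_examples in theory.items():
        # The old concept subsumes the new one: every new example is an old example.
        if examples < old_examples:
            conjectures.append(("implication", f"{name} => {old_name}"))
    return conjectures

# Toy theory over the integers 1..20.
theory = {
    "odd": frozenset(range(1, 21, 2)),
    "prime": frozenset({2, 3, 5, 7, 11, 13, 17, 19}),
}
print(check_new_concept("even_prime_above_2", set(), theory))
print(check_new_concept("two_divisors", {2, 3, 5, 7, 11, 13, 17, 19}, theory))
print(check_new_concept("odd_prime", {3, 5, 7, 11, 13, 17, 19}, theory))
```

The conjectures are only empirically true of the examples seen, which is why they still need proving or disproving afterwards.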

- Suppose HR makes equivalence conjecture:
- P(a) & Q(a) <=> R(a) & S(a)

- Extracts:
- P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
- R(a) & S(a) => P(a), R(a) & S(a) => Q(a)

- Tries to Extract: P(a) => R(a), Q(a) => R(a), etc.
- Prime implicates (require proving, though)

- Important: gets Horn Clauses
- Can be expressed in Prolog…
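The first extraction step can be sketched as follows: an equivalence between two conjunctions is split into implications with a single literal in the head, which is the Horn clause shape. The function names are hypothetical, and the literals are written with a capital `A` only because Prolog variables start with an upper-case letter. The sketch does not attempt the prime implicates, which need a prover.

```python
def extract_implications(lhs, rhs):
    """Split lhs <=> rhs into (body, head) pairs with one literal per head."""
    clauses = []
    for body, heads in ((lhs, rhs), (rhs, lhs)):
        for head in heads:
            clauses.append((tuple(body), head))
    return clauses

def to_prolog(clause):
    """Render one (body, head) pair in Prolog clause syntax."""
    body, head = clause
    return f"{head} :- {', '.join(body)}."

clauses = extract_implications(["p(A)", "q(A)"], ["r(A)", "s(A)"])
for clause in clauses:
    print(to_prolog(clause))
```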

- In mathematical domains
- HR relies on automated theorem provers
- And Model generators
- To find counterexamples

- E.g., group theory: a*a=a <=> a=id (easily proved)
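The group theory example can also be checked by hand. The derivation below is the standard textbook argument, not output from HR or a prover; it writes $e$ for the identity (the slide's `id`) and $a^{-1}$ for the inverse of $a$.

```latex
\begin{align*}
a * a = a
  &\;\Longrightarrow\; a^{-1} * (a * a) = a^{-1} * a \\
  &\;\Longrightarrow\; (a^{-1} * a) * a = e \\
  &\;\Longrightarrow\; e * a = e \\
  &\;\Longrightarrow\; a = e,
\end{align*}
and conversely $e * e = e$, so $a * a = a \iff a = e$.
```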

- In biological/chemistry domains
- Possibly: visualisation tools, reaction pathways

- Please ask me over coffee about:
- Pre-processing constraint problems
- Learning properties of quadratic residues
- Inventing integer sequences
- Puzzle generation
- Adding to the TPTP library
- Setting mathematical tutorial questions
- …

- Develop an ATF system similar to HOMER
- But working in biological domains

- Biologist provides little background info
- In a format they are happy with

- Program provides results
- Intelligent, interesting, not too much,
- And very little rubbish

- Automated assistant for biology

- HR can work with biological data
- Takes input similar to Muggleton’s Progol

- Use HR to solve ML problems
- See how bad an idea that is

- Use theory formation to improve ML
- Integrate HR and Progol somehow

- Give HR the same input as Progol
- Get it to form a theory

- Look at the theory
- Extract concepts which do well on the task
- i.e., they look similar to target concept

- Not a goal-based approach
- Bad idea (slow)

- Improve search using “forward look-ahead”
- ICML Paper

- This has evolved to “reactive search”
- Uses HR’s own Java interpreter
- HR reacts to certain events in theory formation
- Scripts supplied by the user

- HR also makes “near-conjectures”
- Faster approach, but still fairly slow

- Mutagenesis is similar to carcinogenesis
- 42 drugs supplied with atom-bond details
- Atom type, number & charge, bond type (1-8)

- 13 are mutagenic (active), 29 are not active
- Progol learned this concept (88% accurate)
- active(A) :- bond(A,B,C,2), bond(A,D,B,1),atm(A,D,c,21,E)

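[Figure: Progol’s substructure: a c,21 atom joined by a type-1 bond to a second atom (?), which is joined by a type-2 bond to a third atom (?)]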
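To make Progol's clause concrete, the sketch below checks its body against one molecule. It is a minimal sketch, assuming atoms are stored as `id -> (element, type, charge)` and bonds as ordered `(atom1, atom2, bond_type)` facts; the function name and both toy molecules are hypothetical and are not taken from the Mutagenesis42 data.

```python
def progol_rule_fires(atoms, bonds):
    """Return True if the body of Progol's clause holds for one molecule.

    atoms: dict mapping atom id -> (element, atom_type, charge)
    bonds: set of (atom1, atom2, bond_type) facts, in the stored order
    """
    for (b, c, type_bc) in bonds:              # bond(A,B,C,2)
        if type_bc != 2:
            continue
        for (d, b2, type_db) in bonds:         # bond(A,D,B,1)
            if type_db != 1 or b2 != b:
                continue
            element, atom_type, _charge = atoms[d]   # atm(A,D,c,21,E)
            if element == "c" and atom_type == 21:
                return True
    return False

# Hypothetical toy molecules; the charges are placeholders.
mol_yes = ({"d1": ("c", 21, 0.0), "d2": ("n", 38, 0.0), "d3": ("o", 40, 0.0)},
           {("d1", "d2", 1), ("d2", "d3", 2)})
mol_no = ({"d1": ("c", 21, 0.0), "d2": ("c", 21, 0.0)},
          {("d1", "d2", 7)})

print(progol_rule_fires(*mol_yes), progol_rule_fires(*mol_no))
```

The unbound variables in the clause (the elements and types of atoms B and C, and the charge E) correspond to the ?s in the figure.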

- Using reactive search, four PRs, 30K steps
- HR learned this concept:
- active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
- Also 88% accurate
- But Progol’s answer is “better”
- Because higher information content (fewer ?s)
- Biologists sometimes want more information
- Is this really a simpler answer?

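[Figure: HR’s substructure: a type-21 atom of unknown element (?,21) joined by a type-1 bond to a second atom (?), which is joined by a bond of unknown type (?) to a third atom (?)]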

- HR also made these equivalence conjectures
- And extracted them (+100 more) for us
- atm(B,X,21) <=> atm(B,c,21)
- atm(B,X,38) <=> atm(B,n,38)
- bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)
- bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2) & atm(B,X5,38)

- We used these to re-write HR’s answer
- By hand, but hope to automate

- Remember that Progol’s Answer was:

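[Figure: two substructures side by side. Rewritten answer: bond types 1 and 2; atoms c,21, ? and n,38. Progol’s answer: bond types 1 and 2; atoms c,21, ? and ?]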

- So, we filled in one of the blanks!

- Yes, possibly for the mutagenesis data
- I was worried about the difficulty of this problem

- In the last week I’ve written a
- 200-line Prolog program which runs quite fast
- And can be distributed over multiple processors
- And can be easily understood by biologists

- And gets these results…

- Nice result one (88% accurate, lots of info)

[Figure: substructure with bond types 1, 2 and 2; atoms c,21, n,38, o,40 and o,40]

- Nice result two (95% accurate)

[Figure: substructure with bond types 1, 1, 2, 7, 1 and 7; atoms c,21, c,?, c,195, n,38, o,40, c,22, ?, c,22 and h,3; charges -0.132 and 0.145]

- Connected substructures
- Are interesting answers
- Progol’s answers are all substructures

- More specific substructures are not so bad
- Biologists may even want lots of information
- Don’t forget that they want to do science

- Each learned concept will be true of
- At least one active (positive) molecule

[Figure: substructure template with two bond slots (?) and three atom slots (?,?)]

- User chooses template for substructures

- User specifies how many ?s are allowed
- E.g., 3 out of 8 in the above template

- Algorithm starts with the first positive
- Extracts all substructures in the template

- Then takes the next positive
- For each substructure in the set
- Add the least general generalisation (LGG) that fits both positives
- Don’t go below the information content (IC) limit (the allowed number of ?s)

- For all the substructures
- Take a disjunction
- Which achieves the best accuracy
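The template search can be sketched as below. This is one reading of the slides, not the 200-line Prolog program: a substructure is a tuple of template slots with `?` for an unknown slot, the LGG of two substructures keeps a slot only where they agree, and a greedy loop picks the disjunction (the slides do not say how the disjunction is chosen). The three-slot template `(element, bond type, element)` and all molecules are hypothetical toy data.

```python
WILD = "?"

def lgg(s, t):
    """Least general generalisation of two instances of the same template."""
    return tuple(x if x == y else WILD for x, y in zip(s, t))

def wildcards(s):
    """Number of unknown slots; fewer ?s means more information content."""
    return sum(1 for x in s if x == WILD)

def covers(pattern, molecule):
    """True if some template instance of the molecule matches the pattern."""
    return any(all(p == WILD or p == x for p, x in zip(pattern, inst))
               for inst in molecule)

def template_search(positives, max_wild):
    """Seed with the first positive, then add LGGs against each later positive."""
    candidates = set(positives[0])
    for molecule in positives[1:]:
        generalised = set()
        for s in candidates:
            for t in molecule:
                g = lgg(s, t)
                if wildcards(g) <= max_wild:   # respect the IC limit
                    generalised.add(g)
        candidates |= generalised
    return candidates

def accuracy(patterns, positives, negatives):
    """Fraction of molecules classified correctly by the disjunction of patterns."""
    predict = lambda m: any(covers(p, m) for p in patterns)
    correct = sum(predict(m) for m in positives)
    correct += sum(not predict(m) for m in negatives)
    return correct / (len(positives) + len(negatives))

def best_disjunction(candidates, positives, negatives):
    """Greedy choice (an assumption): add the pattern that most improves accuracy."""
    chosen, best = [], accuracy([], positives, negatives)
    improved = True
    while improved:
        improved = False
        for c in sorted(candidates):
            if c in chosen:
                continue
            score = accuracy(chosen + [c], positives, negatives)
            if score > best:
                best, pick, improved = score, c, True
        if improved:
            chosen.append(pick)
    return chosen, best

# Hypothetical molecules, each given as its set of template instances.
positives = [{("c", "2", "o"), ("c", "1", "h")}, {("n", "2", "o"), ("c", "1", "h")}]
negatives = [{("c", "1", "h")}, {("c", "1", "c")}]

candidates = template_search(positives, max_wild=1)
chosen, best = best_disjunction(candidates, positives, negatives)
print(chosen, best)
```

Because the search is seeded from a single positive, every candidate is true of at least one active molecule, and each seed can be processed independently of the others.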

- Distribution of this algorithm possible
- We’re getting a big Linux farm
- PPP – Processor Per Positive
- finds substructures true of one positive
- combine answers at the end

- Automated Theory Formation
- May be useful to bioinformatics
- Use HR’s theory to improve Progol’s results
- Possibly by pre-processing Progol’s input
- Or by post-processing the learned concept

- Template search
- Maybe a good idea? Possibly not new….
- Not bad results for the Mutagenesis42 dataset