
Biological Data Mining

A comparison of Neural Network and Symbolic Techniques

Grantholder



Professor Martyn Ford

Centre for Molecular Design

University of Portsmouth

[email protected]

Collaborators


  • Dr Anthony Browne, School of Computing, Information Systems and Mathematics, London Guildhall University. [email protected]

  • Professor Philip Picton, School of Technology and Design, University College Northampton. [email protected]

  • Dr David Whitley, Centre for Molecular Design, University of Portsmouth. [email protected]


  • The project aims:

    • to develop and validate techniques for extracting explicit information from bioinformatic data

    • to express this information as logical rules and decision trees

    • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

Extracting information

  • Artificial neural networks (ANNs) can be used to identify the non-linear relationships that underlie bioinformatic data, but . . .

    • trained ANNs do not lead to a concise and explicit model

    • specifying the underlying structure is therefore difficult

    • as a result, ANNs are often regarded as ‘black boxes’

Data Mining and Neural Networks

  • Standard data mining algorithms already exist (such as ID3 or C5), so why use an ANN? It would be advantageous if the rules extracted from the ANN:

    • Give a better fit to the data with the same number of rules (i.e. explain the data more accurately);

    • Give the same fit to the data with fewer rules (i.e. explain the data more comprehensibly); or

    • Give both a better fit to the data and use fewer rules (i.e. explain the data more comprehensibly and more accurately).
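
The three criteria above can be written down as a small comparison helper. This is only a sketch: `RuleSet`, `advantage` and the tolerance `tol` are hypothetical names introduced for illustration, and accuracy is assumed to be the measure of fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    """Summary of an extracted rule set (hypothetical structure)."""
    name: str
    accuracy: float  # fraction of examples the rules classify correctly
    n_rules: int     # number of rules (or leaves of a decision tree)


def advantage(candidate: RuleSet, baseline: RuleSet, tol: float = 1e-9) -> str:
    """Say which of the three criteria, if any, the candidate satisfies."""
    diff = candidate.accuracy - baseline.accuracy
    better_fit = diff > tol
    same_fit = abs(diff) <= tol
    fewer_rules = candidate.n_rules < baseline.n_rules
    same_size = candidate.n_rules == baseline.n_rules

    if better_fit and fewer_rules:
        return "more accurate and more comprehensible"
    if better_fit and same_size:
        return "more accurate"
    if same_fit and fewer_rules:
        return "more comprehensible"
    return "no clear advantage"
```

A rule set that is more accurate but larger falls into "no clear advantage" here, because none of the three criteria covers that trade-off.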

Extracting Decision Trees

  • The TREPAN procedure (Craven, 1996)

    • extracts decision trees from ANNs

    • performs better than the symbolic learning algorithms ID3 and C5

    • the current implementation is restricted to a particular network architecture, but

    • the underlying algorithm is independent of network architecture


  • Recursively partitions the input space to build a decision tree representing the function the ANN has learnt.

  • Draws query instances by taking into account the distribution of instances in the problem domain.

  • For real-valued features, uses kernel density estimates to build a model of the underlying data, which is then used to select instances for presentation to the network.
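
The query-drawing step can be illustrated with a short sketch. It assumes a Gaussian kernel, a normal-reference bandwidth, and independent modelling of each feature; none of these choices is stated in the slides, and all function names are hypothetical. The `oracle` argument stands for the trained network, treated purely as a function from an input vector to a class label.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def kde_bandwidth(values: Sequence[float]) -> float:
    """Normal-reference rule of thumb: h = 1.06 * sigma * n^(-1/5) (an assumption)."""
    sigma = statistics.stdev(values)
    return 1.06 * sigma * len(values) ** (-1 / 5)


def sample_from_kde(values: Sequence[float], n_samples: int,
                    rng: random.Random) -> List[float]:
    """Draw from a Gaussian KDE: pick an observed value, then add N(0, h^2) noise."""
    h = kde_bandwidth(values)
    return [rng.choice(values) + rng.gauss(0.0, h) for _ in range(n_samples)]


def draw_queries(data: Sequence[Sequence[float]], n_samples: int,
                 rng: random.Random) -> List[Tuple[float, ...]]:
    """Sample every real-valued feature independently from its own KDE."""
    columns = list(zip(*data))  # one tuple of observed values per feature
    sampled = [sample_from_kde(col, n_samples, rng) for col in columns]
    return [tuple(row) for row in zip(*sampled)]


def label_with_oracle(queries: Sequence[Tuple[float, ...]],
                      oracle: Callable[[Tuple[float, ...]], int]):
    """Present each query instance to the network and record its answer."""
    return [(x, oracle(x)) for x in queries]
```

Because the network is only ever called through `oracle`, nothing in the sketch depends on its architecture, which is the property the slides attribute to the underlying algorithm.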


  • Builds the decision tree in a best-first manner:

    • as each node is added, the fidelity of the decision tree to the ANN is maximised

    • this is done by examining the significance of the distributions at consecutive levels of the tree (Kolmogorov-Smirnov test for real-valued features, chi-squared test for discrete ones)

  • Allows the user to control the size of the final tree by selecting appropriate stopping criteria.
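
Two of the quantities named above can be sketched in a few lines. Fidelity is taken here to be the fraction of instances on which the tree and the network agree, and the Kolmogorov-Smirnov statistic is the largest gap between two empirical distribution functions. The function names are hypothetical, and turning the statistic into a significance level (or a stopping rule) is left out because the slides do not specify it.

```python
from bisect import bisect_right
from typing import Any, Callable, Sequence


def fidelity(tree_predict: Callable[[Any], Any],
             network_predict: Callable[[Any], Any],
             instances: Sequence[Any]) -> float:
    """Fraction of instances on which the tree gives the network's answer."""
    if not instances:
        raise ValueError("need at least one instance")
    agree = sum(1 for x in instances if tree_predict(x) == network_predict(x))
    return agree / len(instances)


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    sa, sb = sorted(a), sorted(b)

    def ecdf(sorted_vals, x):
        # Proportion of the sample that is <= x.
        return bisect_right(sorted_vals, x) / len(sorted_vals)

    # The largest gap is always attained at one of the observed values.
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in sa + sb)
```

A statistic near 0 suggests that the feature values reaching a node look like those at its parent, while a value near 1 indicates that the two distributions barely overlap.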


  • Implement the TREPAN algorithm in a portable format, independent of network architecture.

  • Extend the algorithm to enable the extraction of regression trees.

  • Provide a Bayesian formulation for the decision tree extraction algorithm.

  • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).


  • Apply the extracted decision trees

    • to searches of bioinformatic databases

      • protein databases

      • genomic databases

    • to searches of cheminformatic databases

      • chemical libraries

      • natural product databases

    • to investigate ligand/receptor binding

    • to quantify molecular similarity/diversity

    • to identify new leads and optimise properties

Case study: ligand interaction with GPCRs

  • 28 GPCRs

  • a number of putative interaction sites

  • 3 principal properties of amino acids (AAs)

  • MLR results for 2 ligands