Biological Data Mining

A comparison of Neural Network and Symbolic Techniques

Grantholder

Professor Martyn Ford
Centre for Molecular Design
University of Portsmouth
[email protected]

Collaborators
  • Dr Anthony Browne, School of Computing, Information Systems and Mathematics, London Guildhall University. [email protected]
  • Professor Philip Picton, School of Technology and Design, University College Northampton. [email protected]
  • Dr David Whitley, Centre for Molecular Design, University of Portsmouth. [email protected]
  • The project aims:
    • to develop & validate techniques for extracting explicit information from bioinformatic data
    • to express this information as logical rules and decision trees
    • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics
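
To illustrate what expressing information "as logical rules and decision trees" can mean, here is a minimal sketch that walks a small decision tree and emits one IF/THEN rule per leaf. The tree, the feature names (hydrophobicity, size) and the thresholds are hypothetical and chosen only for illustration; they do not come from the project.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    feature: str
    threshold: float
    low: Union["Node", Leaf]    # branch taken when feature <= threshold
    high: Union["Node", Leaf]   # branch taken when feature > threshold


def tree_to_rules(tree, conditions=()):
    """Return one IF ... THEN ... rule per leaf by following root-to-leaf paths."""
    if isinstance(tree, Leaf):
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN class = {tree.label}"]
    low = tree_to_rules(
        tree.low, conditions + (f"{tree.feature} <= {tree.threshold}",))
    high = tree_to_rules(
        tree.high, conditions + (f"{tree.feature} > {tree.threshold}",))
    return low + high


# Hypothetical toy tree: names and thresholds are illustrative assumptions.
toy_tree = Node("hydrophobicity", 0.5,
                Leaf("inactive"),
                Node("size", 2.0, Leaf("active"), Leaf("inactive")))

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each leaf yields exactly one rule, so the number of leaves is a simple measure of how concise the extracted model is.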
Extracting information
  • Artificial neural networks (ANNs) can be used to identify the non-linear relationships that underlie bioinformatic data, but:
    • trained ANNs do not lead to a concise and explicit model
    • specifying the underlying structure is therefore difficult
    • as a result, ANNs are often regarded as ‘black boxes’
Data Mining and Neural Networks
  • Standard data mining algorithms such as ID3 and C5 already exist, so why use an ANN? Extracting rules from an ANN is worthwhile if the extracted rules:
    • give a better fit to the data with the same number of rules (i.e. explain the data more accurately);
    • give the same fit to the data with fewer rules (i.e. explain the data more comprehensibly); or
    • give a better fit to the data with fewer rules (i.e. explain the data both more comprehensibly and more accurately).
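
The three criteria above amount to comparing rule sets on two axes, fit and number of rules. The sketch below shows one way to make that comparison concrete. The rule sets, the data and the helper names (predict, accuracy, is_preferable) are hypothetical and exist only for this illustration.

```python
def predict(rules, default, x):
    """Apply an ordered rule list; the first matching condition decides the class."""
    for condition, label in rules:
        if condition(x):
            return label
    return default


def accuracy(rules, default, X, y):
    """Fraction of instances whose predicted class matches the target."""
    hits = sum(predict(rules, default, x) == target for x, target in zip(X, y))
    return hits / len(y)


def is_preferable(acc_a, n_a, acc_b, n_b):
    """True if rule set A is at least as accurate and as concise as B,
    and strictly better on at least one of the two."""
    return acc_a >= acc_b and n_a <= n_b and (acc_a > acc_b or n_a < n_b)


# Illustrative data only: one feature, two classes.
X = [(0.1,), (0.4,), (0.6,), (0.9,)]
y = [0, 0, 1, 1]

# Two hypothetical rule sets, each a list of (condition, predicted class).
rules_a = [(lambda x: x[0] > 0.5, 1)]
rules_b = [(lambda x: x[0] > 0.8, 1), (lambda x: x[0] < 0.3, 0)]

acc_a = accuracy(rules_a, 0, X, y)
acc_b = accuracy(rules_b, 0, X, y)
print(acc_a, len(rules_a), acc_b, len(rules_b))
print(is_preferable(acc_a, len(rules_a), acc_b, len(rules_b)))
```

When neither rule set is preferable to the other, for example one is more accurate and the other shorter, the choice depends on how much comprehensibility matters for the application.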
Extracting Decision Trees
  • The TREPAN procedure (Craven, 1996)
    • extracts decision trees from ANNs
    • is reported to perform better than the symbolic learning algorithms ID3 and C5
    • the current implementation is restricted to a particular network architecture, but
    • the underlying algorithm is independent of network architecture
  • Builds a decision tree representing the function the ANN has learnt by recursively partitioning the input space.
  • Draws query instances by taking into account the distribution of instances in the problem domain.
  • For real-valued features, uses kernel density estimates to build a model of the underlying data, which is then used to select instances for presentation to the network.
  • Builds the decision tree in a best-first manner:
    • as each node is added, the fidelity of the decision tree to the ANN is maximised
    • this is done by testing whether the distributions at consecutive levels of the tree differ significantly (Kolmogorov-Smirnov test for real-valued features, chi-squared test for discrete ones)
  • Allows the user to control the size of the final tree by selecting appropriate stopping criteria.
  • Implement the TREPAN algorithm in a portable format, independent of network architecture.
  • Extend the algorithm to enable the extraction of regression trees.
  • Provide a Bayesian formulation for the decision tree extraction algorithm.
  • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).
  • Apply the extracted decision trees
    • to searches of bioinformatic databases
      • protein databases
      • genomic databases
    • to searches of cheminformatic databases
      • chemical libraries
      • natural product databases
    • to investigate ligand/receptor binding
    • to quantify molecular similarity/diversity
    • to identify new leads and optimise properties
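
The TREPAN steps described earlier (querying the network as an oracle, drawing query instances from a kernel density estimate, and growing the tree best-first under a size limit) can be sketched roughly as follows. This is a simplified illustration, not Craven's implementation. The oracle function is a hypothetical stand-in for a trained ANN, splits are plain axis-parallel thresholds, the priority reach × (1 − fidelity) is an assumption of this sketch, and the Kolmogorov-Smirnov and chi-squared tests are omitted.

```python
import heapq
import random
from collections import Counter

def oracle(x):
    # Hypothetical stand-in for a trained ANN; any callable returning a class works.
    return int(x[0] + x[1] > 1.0)

def satisfies(x, constraints):
    # Each constraint is (feature index, threshold, is_left); left means x[f] <= threshold.
    return all((x[f] <= t) == is_left for f, t, is_left in constraints)

def draw_queries(data, constraints, n, bandwidth, rng, max_tries=20000):
    # Sample a Gaussian kernel density estimate of the data (pick a point, add noise)
    # and keep only the samples that reach the current node.
    queries = []
    for _ in range(max_tries):
        if len(queries) == n:
            break
        x = [v + rng.gauss(0.0, bandwidth) for v in rng.choice(data)]
        if satisfies(x, constraints):
            queries.append(x)
    return queries

def gini(labels):
    return 1.0 - sum((c / len(labels)) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    # Axis-parallel threshold with the lowest weighted Gini impurity.
    best = None
    for f in range(len(X[0])):
        values = sorted({x[f] for x in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [c for x, c in zip(X, y) if x[f] <= t]
            right = [c for x, c in zip(X, y) if x[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def extract_tree(oracle, data, max_internal=5, n_queries=80, bandwidth=0.1, seed=0):
    rng = random.Random(seed)
    queue = []   # max-heap via negated priority
    tick = 0     # unique tie-breaker so the heap never compares dicts

    def new_leaf(constraints, reach, fallback):
        nonlocal tick
        X = draw_queries(data, constraints, n_queries, bandwidth, rng)
        y = [oracle(x) for x in X]   # the network labels the query instances
        label, hits = Counter(y).most_common(1)[0] if y else (fallback, 0)
        fidelity = hits / len(y) if y else 1.0
        node = {"label": label}
        tick += 1
        heapq.heappush(queue, (-reach * (1.0 - fidelity), tick, node, constraints, X, y, reach))
        return node

    root = new_leaf([], 1.0, 0)
    internal = 0
    while queue and internal < max_internal:   # stopping criterion: tree size
        _, _, node, constraints, X, y, reach = heapq.heappop(queue)
        if len(set(y)) < 2:
            continue   # the oracle is constant on this node's queries
        _, f, t = best_split(X, y)
        frac_left = sum(x[f] <= t for x in X) / len(X)
        label = node.pop("label")   # turn the leaf into an internal node in place
        node.update(
            feature=f, threshold=t,
            left=new_leaf(constraints + [(f, t, True)], reach * frac_left, label),
            right=new_leaf(constraints + [(f, t, False)], reach * (1 - frac_left), label))
        internal += 1
    return root

def tree_predict(node, x):
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

# Illustrative data only: 200 uniformly random points in the unit square.
data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(200)]
tree = extract_tree(oracle, data)
fidelity = sum(tree_predict(tree, x) == oracle(x) for x in data) / len(data)
print(f"fidelity of the extracted tree to the oracle: {fidelity:.2f}")
```

Fidelity here measures agreement with the oracle rather than with true class labels, which is the quantity the extraction step tries to maximise. Raising max_internal trades a larger tree for a closer match to the network.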
Case study: ligand interaction with GPCRs
  • 28 GPCRs
  • a number of putative interaction sites
  • 3 principal properties of amino acids (AAs)
  • multiple linear regression (MLR) results for 2 ligands