190 likes | 492 Views
7 th December 2010. The Sixth Australasian Ontology Workshop, Adelaide University of South Australia. A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain. Elma Akand *, Mike Bain, Mark Temple
 
                
                E N D
7th December 2010 The Sixth Australasian Ontology Workshop, Adelaide University of South Australia A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain Elma Akand*, Mike Bain, Mark Temple *CSE, UNSW/School of Biomedical and Health Sciences,UWS
Outline • Machine learning and data mining in bioinformatics • Domain Ontologies in biomedical applications • Formal Concept Analysis • MCW algorithm (Mining Closed itemsets for Web apps) • BioLattice – a web based browser • Experimental Application:systems biology Part-1: Concept ranking by gene interaction Part-2: Relational learning of multiple-stress rules
Machine learning & Data mining in Bioinformatics • Bioinformatics “Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman,1995) • Machine Learning & Data mining -Can offer automatic knowledge acquisition -Process to discover knowledge by analyzing data from different perspectives and can contribute greatly in building knowledge base • Our work: focus on knowledge-based machine learning • Previous work:learning from ontologies • Current work:ontology construction by learning • Potential application areas: ontologies– central to eCommerce, eHealth • Current application area: systemsbiology– predict gene function, data integration
Ontology • In philosophy - concerned with nature and relations of being • In knowledge representation-study of categorization of things: Ontology Ontology – "specification of a conceptualization” (Gruber, 1993) Conceptualization – "formalization of knowledge in declarative form” (Genesereth and Nilsson, 1987) Informal Ontology Upper Ontology Natural language General Formal Ontology Domain Ontology First order logic or a variant Specific
Gene Ontology x y a b x gene: x concepts : a ,b relations : (i) x a (ii) x b and (iii) b  a • Missing concepts and relations • One gene annotated with different GO terms with a term specialization of other
Formal Concept Analysis (FCA) • Mathematical order theory (Rudolf Wille in the early 80s) -Derives conceptual structures out of data -Method for data analysis, knowledge representation and information management • Components -Formal context, concept , concept lattice
({cats, gibbons, dogs, dolphins, humans, whales}, {}) Top ({cats, gibbons, dogs}, {hair-covered}) ({gibbons, dolphins, humans, whales}, {intelligent}) 5 6 ({dolphins, whales}, {intelligent, marine}) ({cats, dogs}, {hair-covered, four-legged}) 3 ({gibbons, humans}, {intelligent, thumbed}) 2 4 1 ({gibbons}, {intelligent, hair-covered, thumbed}) Bottom ({}, {intelligent, hair-covered, thumbed, marine, four-legged}) Formal concepts in a concept lattice • Formal context: an n by m Boolean matrix m attributes A columns n objects O rows • Formal concept: Galois connection <X, Y> X is a subset of A, Y is a subset of O • Concept lattice loosely interpretable in ontology terms: concept definitions and cf. T-box sub-concept relations concept membership cf. A-box by objects
FCA in data mining • FCA can be seen as a clustering technique in machine learning -Most of the work is ina propositional framework • In data mining closed itemset miningis an efficient alternative to FCA A frequent itemset X is closed if there exists no proper superset Y such that Y⊃X with support(Y)=support(X) E.g., if X = {a,b,c,d} and Y ={a,b,c,d,e} and support(Y)=support(X), then X is not closed • Parameters to avoid building entire lattice -Extent size must be greater than minsup • Existing closed itemset mining algorithms -Data structures to speed up closeditemset mining -But may not build lattice, or include extents
MCW algorithm (Mining Closed itemsets for Web apps) • Vertical data format • IT-tree (itemset-tidsettree) search space -node has Xxt(X) and all children have prefix X • Pruning - 4 set difference closure operators • Subsumption check - A look-up table to record all attributes and their occurrences in closed concepts • Lattice - adding concepts following a general to specific order Is {TA}{135} closed? i(135)={TAWC}
Closure operators Based on CHARM (Zaki, 2005) {TA}{135}={TW}{135} ->{TAW}{135} {D}{2456}⊂{C}{123456}->{DC}{2456} {D}{2456} and {W}{12345}->{DW}{245}
Concept lattice as a visual analytics approach • Visual analytics -combination of information visualization with machine learning and data analysis (Keim et al., 2008) • Visualization of concept lattice -provides overview of the structure of the domain • means for further data analysis, e.g., classification, clustering,implicationdiscovery, rule learning • Previous work -lattice navigation since Godin etal. (1993) -Browsable concept lattice, e.g., Kim & Compton (2004) • Our current work - on augmenting concept lattice by integrating multiple sources of knowledge(Gene Ontology, protein interactions) for further analysis & machine learning
Browsable concept lattice more general
Biological validation (1) : synthetic lethality Synthetic lethal interaction if cell is viable when either gene A or B are individually deleted, but cannot grow when both are deleted. Our results show that 72 (119) concepts in the lattice more likely than random chance at p < 0.01 (p < 0.05) to contain synthetic lethal pairs.
Biological validation (2) : ILP learning of concept definitions Biochemical pathway data Protein-protein interaction data Ontology data Inductive Logic Programming Transcription factor binding data (ChIP-chip) Microarray gene-expression data First-order rule concept(A):- ppi(B,A,C), ppi(B,A,E), ppi(B,C,E) tfbinds(D,C),fbinds(F,E)
Example rule: RSM19 required for H2O2 response; RSM19, RSM22 and MRPS17 in “mitochondrial ribosomal small subunit” stable complex; and RSM22, MRPS17 bound by transcription factors under amino acid starvation. Transcription factors
Conclusions • Many real-world domains are data-intensive • Machine learning and data mining applications required to generate predictive and useful outputs • We focus on knowledge-based learning for comprehensibility – use ontologies • Formal concept analysis as a framework for ontology structure • Use data mining techniques for efficient concept lattice generation • Visual analytics approach: browsable lattice, added background knowledge • Initial validation on a case study from yeast systems biology
Future work • Investigate pseudo-intents to simplify concept lattice • Investigate variants of concept lattice structures • -e.g., concept lattice of inverse context • Add concept definitions to background knowledge in ILP