Planning to Learn with a Knowledge Discovery Ontology

Monika Žáková, Petr Křemen, Filip Železný (Czech Technical University, Prague); Nada Lavrač (Institute Jozef Stefan, Ljubljana).


Presentation Transcript

  1. Planning to Learn with a Knowledge Discovery Ontology
     Monika Žáková, Petr Křemen, Filip Železný (Czech Technical University, Prague)
     Nada Lavrač (Institute Jozef Stefan, Ljubljana)

  2. Motivation
     FP6 SEVENPRO project: “semantic engineering environment”
     • integration of knowledge from various sources (e.g. different CAD software, ERP) by means of a layer of semantic annotations
     • a significant part of engineering knowledge has a rich relational structure (CAD designs, documents, simulation models, ERP databases) → traditional ML techniques and tools are unsuitable
     Goals:
     • make implicit knowledge contained e.g. in CAD designs explicit for reuse, training and quality control
     • develop a tool for relational data mining (RDM) capable of dealing with semantic annotations and producing results in a semantic format

  3. Design Example

  4. Example
     In the CAD ontology:
       <rdfs:Class rdf:ID="PrismSolFeature"></rdfs:Class>
       <rdfs:Class rdf:ID="SolidExtrude">
         <rdfs:subClassOf rdf:resource="#PrismSolFeature"/>
       </rdfs:Class>
     Declaring it in background knowledge:
       subclass(prismSolFeature, solidExtrude).
       hasFeature(B, F1) :- hasFeature(B, F2), subclassTC(F1, F2).
     Problem with subsumption:
       C = liner(P) :- hasBody(P, B), hasFeature(B, prismSolFeature).
       D = liner(P) :- hasBody(P, B), hasFeature(B, solidExtrude).
     • C does not θ-subsume D (Cθ ⊆ D does not hold)
     • so clause D is not obtained by applying a specialization refinement operator to clause C
     Our approach: extend the refinement operator with taxonomies on predicates and terms
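The background-knowledge rule on this slide can be illustrated with a minimal Python sketch. It assumes the taxonomy is given as (superclass, subclass) pairs, mirroring `subclass(prismSolFeature, solidExtrude)`. The function names `subclass_tc`, `has_feature` and `covers` are hypothetical and only mimic the Prolog rule; they are not the authors' implementation.

```python
# Taxonomy facts as (superclass, subclass) pairs, as in subclass/2 on the slide.
SUBCLASS = {("prismSolFeature", "solidExtrude")}


def subclass_tc(pairs):
    """Return the transitive closure of a (superclass, subclass) relation."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def has_feature(stated_features, closure):
    """Extend the explicitly stated features with all their superclasses.

    Mirrors: hasFeature(B, F1) :- hasFeature(B, F2), subclassTC(F1, F2).
    """
    derived = set(stated_features)
    for f2 in stated_features:
        derived |= {f1 for (f1, sub) in closure if sub == f2}
    return derived


def covers(required_feature, stated_features, closure):
    """True if a body with the stated features satisfies hasFeature(B, required)."""
    return required_feature in has_feature(stated_features, closure)
```

With this encoding, a body that has a `solidExtrude` feature also satisfies `hasFeature(B, prismSolFeature)`, but not the other way round. Clause D is therefore semantically more specific than C even though it cannot be reached from C by an ordinary refinement step.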

  5. Sorted Refinement
     Downward Δ,Σ-refinement
     • extension of sorted refinement proposed by Frisch
     • defined using 3 refinement rules:
       • adding a literal to the conjunction
       • replacing a sort τ1 in a literal pred1(x1:τ1,…,xn:τn) with one of its direct subsorts τ1’, giving pred1(x1:τ1’,…,xn:τn)
       • replacing a literal pred1(x1:τ1,…,xn:τn) with one of its direct subrelations pred2(x1:τ1,…,xn:τn)
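The three refinement rules can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the operator as implemented by the authors. A literal is a pair `(predicate, ((variable, sort), ...))`, a conjunction is a tuple of literals, and sorts are narrowed per literal occurrence without propagating to other occurrences of the same variable. The dictionaries `SUBSORTS` and `SUBRELATIONS` and the function `refine` are hypothetical names; their contents reuse taxonomy entries that appear elsewhere in the slides.

```python
# Direct subsorts and direct subrelations (hypothetical encoding).
SUBSORTS = {"PrismSolFeature": ["SolidExtrude"]}
SUBRELATIONS = {"hasFeature": ["firstFeature"], "hasSketch": ["hasCircularSketch"]}


def refine(conj, candidate_literals):
    """Return all one-step downward refinements of a conjunction."""
    refinements = []

    # Rule 1: add a literal to the conjunction.
    for lit in candidate_literals:
        if lit not in conj:
            refinements.append(conj + (lit,))

    for i, (pred, args) in enumerate(conj):
        # Rule 2: replace one sort in a literal with a direct subsort.
        for k, (var, sort) in enumerate(args):
            for sub in SUBSORTS.get(sort, []):
                new_args = args[:k] + ((var, sub),) + args[k + 1:]
                refinements.append(conj[:i] + ((pred, new_args),) + conj[i + 1:])

        # Rule 3: replace the literal's predicate with a direct subrelation.
        for subrel in SUBRELATIONS.get(pred, []):
            refinements.append(conj[:i] + ((subrel, args),) + conj[i + 1:])

    return refinements
```

For a body consisting of the single literal `hasFeature(B:Body, F:PrismSolFeature)` and one candidate literal, the sketch yields three refinements, one per rule.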

  6. Feature Taxonomy
     • information about the feature subsumption hierarchy is stored and passed to the propositional learner
     • assume that features $f_1, \dots, f_n$ have been generated with corresponding conjunctive bodies $b_1, \dots, b_n$
     • the elementary subsumption matrix $E$ of $n$ rows and $n$ columns is defined such that $E_{i,j} = 1$ whenever $b_i \in \rho_{\Delta,\Sigma}(b_j)$ and $E_{i,j} = 0$ otherwise
     • the exclusion matrix $X$ of $n$ rows and $n$ columns is defined such that $X_{i,j} = 1$ whenever $i = j$ or $b_i \in \rho_{\Delta,\Sigma}(\rho_{\Delta,\Sigma}(\dots \rho_{\Delta,\Sigma}(b_j) \dots))$ and $X_{i,j} = 0$ otherwise
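The two matrices can be computed with a short Python sketch. It assumes the direct-refinement relation is already available as a set of index pairs `(i, j)` meaning that body i is a one-step refinement of body j. The function names are hypothetical. The exclusion matrix is then the reflexive-transitive closure of the elementary matrix, computed here with Warshall's algorithm.

```python
def elementary_matrix(n, direct_refinements):
    """E[i][j] = 1 iff body i is a direct (one-step) refinement of body j."""
    E = [[0] * n for _ in range(n)]
    for i, j in direct_refinements:
        E[i][j] = 1
    return E


def exclusion_matrix(E):
    """X[i][j] = 1 iff i == j or body i is reachable from body j by refinements."""
    n = len(E)
    X = [[1 if i == j else E[i][j] for j in range(n)] for i in range(n)]
    # Warshall's algorithm: close the relation under chaining through k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if X[i][k] and X[k][j]:
                    X[i][j] = 1
    return X
```

For a chain in which feature 1 refines feature 0 and feature 2 refines feature 1, `E[2][0]` stays 0 because the refinement is not direct, while `X[2][0]` becomes 1.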

  7. Propositional Rule Learning
     2 propositional algorithms adapted to utilize matrices E, X
     • Top-down deterministic algorithm
       • stems from the rule inducer of RSD
     • Stochastic local DNF algorithm
       • (Rückert 2003, Paes 2006)
       • search in the space of DNF formulas
       • refinement done by local non-deterministic DNF term changes
     • using matrices E, X we can:
       • prevent the combination of a feature and its subsumee within the conjunction (both algorithms)
       • specialize a conjunction by replacing a feature with its direct subsumee (Top-down only)
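The two uses of the matrices listed above can be sketched as follows. This is an illustration only, assuming that `E[i][j] = 1` means feature i is a direct subsumee of feature j and that `X[i][j] = 1` means i equals j or feature i is a (possibly indirect) subsumee of feature j. A conjunction is a frozenset of feature indices. The names `can_add` and `specialise` and the example matrices are hypothetical.

```python
# Example taxonomy: 1 refines 0, 2 refines 1, feature 3 is unrelated.
E = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
X = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]


def can_add(conj, f, X):
    """Reject f if it subsumes, or is subsumed by, a feature already present."""
    return all(not X[f][g] and not X[g][f] for g in conj)


def specialise(conj, E, X):
    """Replace one feature with one of its direct subsumees (top-down step)."""
    results = []
    for g in sorted(conj):
        rest = conj - {g}
        for i in range(len(E)):
            if E[i][g] and can_add(rest, i, X):
                results.append(frozenset(rest | {i}))
    return results
```

Because `X` is reflexive, `can_add` also rejects a feature that is already in the conjunction.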

  8. RDM Core Overview
     [Diagram components: Predicate declarations, Sort theory, Background knowledge (Horn logic), Examples, Feature construction, Features, Feature subsumption table, Subsumption and exclusion matrix, Propositional learning (Weka, R), Propositional rule learning (adapted)]
     Predicate declarations:
       mode hasBody(+CADPart, -Body).
       mode hasMaterial(+CADPart, -Material).
       mode hasSketch(+CADPart, -Sketch).
       mode hasLength(+Sketch, -float).
     Sort theory:
       subClassOf(CADPart, CADEntity).
       subClassOf(CADAssembly, CADEntity).
       …
       subPropertyOf(hasCircularSketch, hasSketch).
       subPropertyOf(firstFeature, hasFeature).
     Examples:
       eItem(eItemT_BA1341).
       eItem(eItemT_BA1342).
       eItem(eItemT_BA1343).

  9. RDM Manager
     = tool developed for running the RDM tasks
     Functionalities:
     • Obtaining relevant data by means of a SPARQL query to the semantic repository
     • Converting data from the semantic representation into a format acceptable by the DM algorithms (Prolog, arff, csv, etc.)
     • Propositionalization by generating first-order features
     • Enhanced propositional rule learning algorithms
     • Third-party propositional learning algorithms integrated by means of wrappers, e.g.
       • rule learner RIPPER (Cohen 1995)
       • association rules: Apriori
       • decision trees: J48 algorithm
         (for all of the above the WEKA implementation is used)
       • clustering: distance-based PCA (implemented in R)
     • Storing information about DM processes and their results in semantic representation
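The conversion step from semantic data to Prolog can be illustrated with a small sketch. It assumes the query result is already available as (subject, property, object) triples and that property names are valid unquoted Prolog atoms. The function names and the sample identifiers in the usage note are hypothetical, and the slides do not describe the tool's actual converter.

```python
def quote_atom(value):
    """Wrap a value in single quotes so Prolog reads it as an atom, not a variable."""
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def triples_to_prolog(triples):
    """Turn (subject, property, object) triples into binary Prolog facts."""
    facts = []
    for subject, prop, obj in triples:
        # Assumption: prop starts with a lower-case letter and needs no quoting.
        facts.append(f"{prop}({quote_atom(subject)}, {quote_atom(obj)}).")
    return facts
```

For example, the triple `("eItemT_BA1341", "hasBody", "body1")` becomes the fact `hasBody('eItemT_BA1341', 'body1').`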

  10. Knowledge Discovery Ontology
     Foreseen queries that guided the design of the ontology
     • User:
       • Give me all rule-based classifiers found for class C on dataset D with error estimate < 5%
       • Give me the rule-based algorithm with the shortest average runtime for datasets D, E and F
     • Developer:
       • Give me all pairs of model classes with equivalent expressiveness for which no conversion program is available
       • Give me all parameter settings for experiments with dataset D and algorithm A and their respective runtimes and accuracy results

  11. Example Queries to the KD ontology
     • Obvious idea: if the system knows all it can do, it can plan complex KD workflows
     • Example: a planning system queries the ontology to generate a decision tree from a relational dataset through propositionalization
       • Give me a program that takes a classified relational dataset represented as Prolog facts and produces an arff file
       • Give me a program that takes an arff file and produces a decision tree
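The two example queries can be mimicked with a toy in-memory registry. This is a minimal sketch and not the ontology-based mechanism of the slides: the `Program` record, the registry entries and `find_programs` are all hypothetical, and the real system would pose such queries to the OWL ontology instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str
    input_kind: str
    input_format: str
    output_kind: str
    output_format: str


# Hypothetical descriptions of available programs.
REGISTRY = [
    Program("propositionalizer", "ClassifiedRelationalDataset", "Prolog",
            "ClassifiedDataset", "ARFF"),
    Program("tree_learner", "ClassifiedDataset", "ARFF", "DecisionTree", "text"),
    Program("csv_exporter", "ClassifiedDataset", "ARFF", "ClassifiedDataset", "CSV"),
]


def find_programs(registry, **constraints):
    """Return the programs whose description matches every given constraint."""
    return [
        p for p in registry
        if all(getattr(p, key) == value for key, value in constraints.items())
    ]


# Query 1: classified relational dataset as Prolog facts -> arff file.
step1 = find_programs(REGISTRY, input_kind="ClassifiedRelationalDataset",
                      input_format="Prolog", output_format="ARFF")
# Query 2: arff file -> decision tree.
step2 = find_programs(REGISTRY, input_format="ARFF", output_kind="DecisionTree")
```

Chaining the two answers gives the propositionalization-then-tree-learning workflow that the slide describes.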

  12. Motivation for Workflow Generation
     • user:
       • RDM algorithms utilizing background knowledge, and relational learning through propositionalization and subsequent propositional learning, are quite complex → we want to hide as much of the complexity as possible from the user
     • developer/data miner:
       • storing information about the whole process → repeatability of experiments
       • individual components are developed by different people → one can focus on experimenting with parameters of some components and view the others as black boxes

  13. Main Classes of KD Ontology
     • main notions: Knowledge and Algorithm
     • representation language: OWL-DL
       • densely interlinked knowledge structures, not just taxonomies
       • highly optimized reasoners available (Pellet, RacerPro, FaCT++, ...)

  14. Knowledge
     5 subclasses:
     • Dataset
     • LogicalKnowledge
     • NonLogicalKnowledge
     • Pattern = MiningResult
     • multiple formats may be attached to each Knowledge class
     • each knowledge instance has a specified KnowledgeFormat
     Class expressions shown on the slide:
       Knowledge and example some Example
       subclassOf Knowledge and hasExpressivity some Expressivity and hasFormat some KnowledgeFormat
       Knowledge and not LogicalKnowledge
       Knowledge and producedBy some AlgorithmExecution

  15. Expressivity
     [Figure: Expressivity hierarchy, shown in Protégé]

  16. Algorithms
     Algorithm
     • a mapping from knowledge to knowledge
     • not just induction, all executable elements incl. preprocessing, ...
     • definition of inputs, outputs and parameters
       Apriori subclassOf NamedAlgorithm
         and input some (Dataset and hasExpressivity only SingleRelationStructure and format only {ARFF, CSV})
         and output some (MiningResult and contains only AssociationRule)
         and minMetric some double
         and minSupport some double
         and numOfRules some positiveInteger

  17. Algorithms (2) • atomic (named) vs. composite (workflows) • types of algorithms modeled as classes e.g. ClusteringAlgorithm • each algorithm description is modeled as a subclass of class NamedAlgorithm (like Apriori above) • instances of class AlgorithmExecution represent executions of algorithms • thus, to access a particular algorithm, we need to pose a schema query to the OWL ontology – SPARQL-DL

  18. Pattern
     • Result of a data mining algorithm
     • Describes a mapping from knowledge to knowledge
     • Defined as:
         Knowledge and producedBy some AlgorithmExecution
           subclassOf contains only (AtomicKnowledge and singleResultAnnotation some anySimpleType)
     • Example: association rules
         MiningResult and producedBy only AssociationRulesAlgorithmExecution and contains only AssociationRule
         AssociationRule subclassOf AtomicKnowledge
           and antecedent some And
           and consequent some And
           and confidence some double
           and support some double

  19. Anticipated Usage of the KD Ontology
     • a specialization of relevant OWL-S ontology parts, mainly the Process class
     • during planning, inputs and outputs will be matched w.r.t. their format and expressivity to filter out invalid algorithm bindings
     • beyond workflow generation:
       • management of the SoA knowledge in the KD domain
       • storing and managing KD workflow results, for example for meta-learning and experiment repeatability

  20. Workflow Construction Automatic workflow construction • Converting KD task described using classes from the KD ontology into a planning problem described in PDDL • Generating a plan using a planning algorithm • Storing the generated abstract workflow in form of semantic annotation • Instantiating the abstract workflow with specific algorithm configurations available in the KD ontology

  21. Workflow-related Classes of KD Ontology
     KD ontology extended with workflow-related classes:
     • ProblemDescription: defined using properties
       • init, specifying the available input data and knowledge
       • goal, specifying the desired results
     • Action: defined by
       • Algorithm, which is executed
       • startTime, duration and
       • immediately preceding Actions
     • Workflow: currently a DAG of Actions with a link to the ProblemDescription from which it was generated
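The workflow classes above can be mirrored in a few lines of Python to show how a DAG of actions yields a valid execution order. This is a sketch under assumptions: the field names loosely follow the slide, the `Action` record and `execution_order` are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used for the topological sort in place of whatever the RDM Manager does internally.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Action:
    name: str
    algorithm: str
    # Names of the immediately preceding actions in the workflow DAG.
    preceding: list = field(default_factory=list)


def execution_order(actions):
    """Return action names in an order that respects all precedence links.

    graphlib raises CycleError if the actions do not form a DAG.
    """
    graph = {a.name: set(a.preceding) for a in actions}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-step workflow.
workflow = [
    Action("learn", "PropositionalLearning", preceding=["evaluate"]),
    Action("construct", "FeatureConstruction"),
    Action("evaluate", "FeatureEvaluation", preceding=["construct"]),
]
order = execution_order(workflow)
```

Because this example is a simple chain, the order is unique; a workflow with independent branches admits several valid orders.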

  22. Problem Description Example
     • Example: generating relational association rules from a classified relational dataset with relational background knowledge expressed in OWL-DL
       RelationalAssociationRules subClassOf ProblemDescription
         and goal some (MiningResult and contains only AssociationRule)
         and init some (LogicalKnowledge and hasExpressivity some OWL-DL and hasFormat some {RDFXML})
         and init some (LogicalKnowledge and hasExpressivity some RelationalStructure and hasFormat some {RDFXML})
         and init some (ClassifiedInstanceSet and hasFormat some {RDFXML})

  23. Conversion into a Planning Task Described in PDDL • ontology classified using FACT reasoner to generate inferred hierarchy on algorithms, knowledge and patterns • names generated for classes defined using OWL restrictions • domain description in PDDL • generated by converting Algorithms into PDDL actions, with inputs specifying the preconditions and outputs specifying the effects • both inputs and outputs are currently restricted to conjunction of OWL classes • problem description in PDDL • generated in the same way from ProblemDescription

  24. Algorithm Definition Example
     Description in KD ontology (in DL formalism):
       Apriori subClassOf NamedAlgorithm
         and input some (Dataset and hasExpressivity only SingleRelationStructure and format only {ARFF})
         and output some (MiningResult and contains only AssociationRule)
         and minMetric some double
         and minSupport some double
         and numOfRules some positiveInteger
     Description used for planning (in PDDL):
       (:action AprioriAlgorithm
         :parameters (?v0 - Dataset_SingleRelationStructure ?v1 - ARFF ?v2 - MiningResult_contains_AssociationRule)
         :precondition (and (available ?v0) (format ?v0 ?v1))
         :effect (and (available ?v2)))
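The rewriting of an algorithm description into a PDDL action can be sketched as a string template. It assumes the ontology classes have already been flattened into type names such as `Dataset_SingleRelationStructure`, and that each input is a (knowledge type, format type) pair. The function `to_pddl_action` is hypothetical and only reproduces the shape of the action shown on the slide.

```python
def to_pddl_action(name, inputs, outputs):
    """Render a PDDL action.

    inputs:  list of (knowledge_type, format_type) pairs -> preconditions
    outputs: list of knowledge_type names                -> effects
    """
    params, preconditions, effects = [], [], []
    counter = 0
    for knowledge_type, format_type in inputs:
        k_var, f_var = f"?v{counter}", f"?v{counter + 1}"
        counter += 2
        params += [f"{k_var} - {knowledge_type}", f"{f_var} - {format_type}"]
        preconditions += [f"(available {k_var})", f"(format {k_var} {f_var})"]
    for knowledge_type in outputs:
        o_var = f"?v{counter}"
        counter += 1
        params.append(f"{o_var} - {knowledge_type}")
        effects.append(f"(available {o_var})")
    return (
        f"(:action {name}\n"
        f"  :parameters ({' '.join(params)})\n"
        f"  :precondition (and {' '.join(preconditions)})\n"
        f"  :effect (and {' '.join(effects)}))"
    )


apriori = to_pddl_action(
    "AprioriAlgorithm",
    inputs=[("Dataset_SingleRelationStructure", "ARFF")],
    outputs=["MiningResult_contains_AssociationRule"],
)
```

The algorithm parameters (`minMetric`, `minSupport`, `numOfRules`) are deliberately left out, as they are in the PDDL action on the slide.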

  25. Planning Algorithm
     • based on the Fast-Forward planning system (Hoffmann, 2001)
     • enforced hill climbing algorithm to perform forward state space search
     • goal distances estimated using relaxed GRAPHPLAN
       • i.e. ignoring delete lists of the operators
     • returns the discovered workflows with the lowest number of processing steps
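The search scheme can be illustrated with a compact Python sketch of enforced hill climbing over STRIPS-style actions. Note the simplification: Fast-Forward estimates goal distance by extracting a relaxed plan, whereas this sketch uses the cruder number of delete-free layers needed to reach the goal. The action encoding `(name, preconditions, add list, delete list)`, the function names and the example actions are hypothetical.

```python
from collections import deque


def relaxed_distance(state, goal, actions):
    """Layers of delete-free action application needed to reach the goal, or None."""
    reached, layers = set(state), 0
    while not goal <= reached:
        new_facts = set()
        for _, pre, add, _ in actions:
            if pre <= reached:
                new_facts |= add - reached
        if not new_facts:
            return None  # goal unreachable even when deletes are ignored
        reached |= new_facts
        layers += 1
    return layers


def enforced_hill_climbing(init, goal, actions):
    """Repeatedly breadth-first search for a state with a strictly better estimate."""
    state, plan = frozenset(init), []
    h = relaxed_distance(state, goal, actions)
    while h is not None and h > 0:
        queue, seen, found = deque([(state, [])]), {state}, None
        while queue and found is None:
            current, path = queue.popleft()
            for name, pre, add, delete in actions:
                if not pre <= current:
                    continue
                successor = frozenset((current - delete) | add)
                if successor in seen:
                    continue
                seen.add(successor)
                estimate = relaxed_distance(successor, goal, actions)
                if estimate is not None and estimate < h:
                    found = (successor, path + [name], estimate)
                    break
                queue.append((successor, path + [name]))
        if found is None:
            return None  # enforced hill climbing failed
        state, h = found[0], found[2]
        plan += found[1]
    return plan if h == 0 else None


# Hypothetical two-step KD domain: propositionalize, then learn a tree.
ACTIONS = [
    ("propositionalize", frozenset({"relational_prolog"}),
     frozenset({"dataset_arff"}), frozenset()),
    ("j48", frozenset({"dataset_arff"}), frozenset({"decision_tree"}), frozenset()),
]
plan = enforced_hill_climbing({"relational_prolog"}, frozenset({"decision_tree"}), ACTIONS)
```

In this toy domain the plan consists of `propositionalize` followed by `j48`.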

  26. Generated Workflow for CAD Designs

  27. RDM Manager implementation
     [Architecture diagram components: RDM GUI, Semantic Server Agent, RDM Ontology, RDM Manager Tool, RDM Web Service, RDM Engine, Algorithm Implementation 1 … Algorithm Implementation n]

  28. RDM GUI

  29. Related Work (planning to learn)
     • Most relevant: NEXT System [Bernstein & Daenzer]
       • (Our best understanding:)
       • Linear plans
       • Preprocessing-Induction-Postprocessing template
     • We try for a template-free plan (DAG)
     [Figure labels: Multi-relational data, Feature construction (inductive), Feature evaluation (deductive), Propositionalized data, Propositional learning (inductive)]

  30. Related Work (DM workflows and DM assistants ) • workflows for DM • myGrid/Taverna, Triana, DataMiningGrid, Kepler, KnowledgeGrid, CAMLET, Pegasus, MiningMart • manual workflow composition, focus on workflow execution • focus on DM from relational databases • relevant efforts in formalization of DM processes • DM assistants • MetaL, StatLog - classification of DM methods, metrics for comparing the methods, finding suitable methods for a given dataset

  31. Related Work (DM ontologies) • existing DM ontologies • ontologies for classical DM - 3 stages: induction, pre- and post-processing • focus on hierarchy of DM algorithms and propositional dataset description • DAMON – KnowledgeGrid project [Cannataro & Comito] • DataMiningGrid application description schema [Stankovski et al.] • DM ontology for IDEA [Bernstein et al.] • myGrid ontology – for bioinformatics, includes biological domain concepts http://www.mygrid.org.uk/ontology/ • other work towards KD process formalization • CinQ and IQ projects (EU FP6) • Sašo Džeroski: Towards a General Framework for Data Mining

  32. Related Work (Semantic Web Service Composition) • essentially creating workflows based on semantic description of the ingredients • popular approach: convert semantic description to PDDL and use suitably adapted planning techniques [Klusch et al.], [Liu et al.] • we have adapted this approach for DM workflows using KD ontology • future work: individual DM algorithms as web services?

  33. Open Issues • Reactive planning / exploration • Currently planning towards a desired kind of result, not quality • Conversion of knowledge • From more to less expressive • How can we constrain what should remain from the original information? • Can this be done at all without semantic meta-data?

  34. Open Issues • Tighter integration of the ontology with planning • Currently: simple rewriting of algorithm annotations into PDDL actions • Work-in-progress: planner poses SPARQL queries to retrieve relevant actions • Computational platform: • GRID or web services?