KDDML: A Middleware Language and System for Knowledge Discovery in Databases

A. Romei, S. Ruggieri, F. Turini. Dipartimento di Informatica, Università di Pisa. Thirteenth Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD-2005), Brixen, Italy, 19-22 June 2005. Application area: KDD.

Presentation Transcript


  1. KDDML: A Middleware Language and System for Knowledge Discovery in Databases
     A. Romei, S. Ruggieri, F. Turini
     Dipartimento di Informatica, Università di Pisa
     Thirteenth Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD-2005)
     Brixen, Italy, 19-22 June 2005

  2. Application Area: KDD
     • Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.
     SEBD 2005 - Brixen, June 2005

  3. The CRISP-DM process
     • Main focus on the automatic phases:
       • Data pre-processing
       • Modeling
       • Post-processing
       • Model evaluation

  4. In this work
     • KDDML: an XML-based middleware language and system in support of the KDD process.
     • KDDML as language.
     • KDDML as system.

  5. Requirements
     • R1: a data/model repository should be available for storing the input, output and intermediate objects of the KDD process.
       • Several representations of the data can be available.
       • Automatic format conversions.
       • Automatic meta-data mapping (e.g., ARFF, SQL).
     • R2: logical meta-data (meta-model) are specified in addition to the physical data (model).
     • R3: compositionality of mining operations in the design of the language (closure principle).
     • R4: high extensibility of the system architecture.

  6. KDDML as XML-based System
     • XML as data/model representation (R1, R2).
     • Machine-processable language.
     • XML as language definition.
     • Ensures compositionality of operators (R3).
     • Extensibility and modularity (R4).

  7. Data/Model Representation

  8. Data Format
     • Separating the logical data from the physical instances.
     • Data schema via proprietary XML.
     • Actual data stored in CSV (Comma Separated Values).
     • CSV was chosen as a trade-off between readability (poor in a binary file) and space occupation (high in XML).

  9. Data Format: Example
     [Slide callouts: Physical Data, Logical Metadata]

       <KDDML_TABLE data_file="census.csv">
         <SCHEMA logical_name="census" number_of_attributes="6" number_of_instances="16">
           <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric">
             <NUMERIC_DESCRIPTION mean="40.75" variance="237.8" min="18.0" max="70.0"/>
           </ATTRIBUTE>
           <ATTRIBUTE name="education" number_of_missed_values="3" type="nominal">
             <NOMINAL_DESCRIPTION number_of_values="4">
               <VALUE value="HS-grad" cardinality="3"/>
               <VALUE value="masters" cardinality="2"/>
               ….
             </NOMINAL_DESCRIPTION>
           </ATTRIBUTE>
           ….
         </SCHEMA>
       </KDDML_TABLE>
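A minimal sketch of how a client might read the logical metadata of such a table document, assuming the elided parts of the example are simply absent. The function name `read_schema` is hypothetical and is not part of KDDML; only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Schema fragment adapted from the slide; the "…." elisions are dropped,
# so only the two attributes shown there are described here.
SCHEMA_XML = """
<KDDML_TABLE data_file="census.csv">
  <SCHEMA logical_name="census" number_of_attributes="6" number_of_instances="16">
    <ATTRIBUTE name="age" number_of_missed_values="0" type="numeric">
      <NUMERIC_DESCRIPTION mean="40.75" variance="237.8" min="18.0" max="70.0"/>
    </ATTRIBUTE>
    <ATTRIBUTE name="education" number_of_missed_values="3" type="nominal">
      <NOMINAL_DESCRIPTION number_of_values="4">
        <VALUE value="HS-grad" cardinality="3"/>
        <VALUE value="masters" cardinality="2"/>
      </NOMINAL_DESCRIPTION>
    </ATTRIBUTE>
  </SCHEMA>
</KDDML_TABLE>
"""


def read_schema(xml_text):
    """Return (physical data file, {attribute name: attribute type})."""
    root = ET.fromstring(xml_text)
    schema = root.find("SCHEMA")
    attributes = {a.get("name"): a.get("type") for a in schema.findall("ATTRIBUTE")}
    return root.get("data_file"), attributes


data_file, attributes = read_schema(SCHEMA_XML)
print(data_file, attributes)
```

The physical instances stay in the CSV file named by `data_file`; the sketch touches only the logical description.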

  10. Model Format
      • PMML (Predictive Model Markup Language)
        • An industry standard for representing actual models as XML documents.
        • Consists of DTDs for a wide spectrum of models, including association rules (RdA), decision trees, clustering, regression, neural networks.
        • It does not cover the process of extracting models, only the exchange of the extracted knowledge.

  11. Model Format: Example
      [Slide callouts: Logical Metadata, Physical Model]

        <PMML version="2.0">
          ….
          <DataDictionary>
            <DataField name="id" optype="continuous" />
            …
            <DataField name="amount" optype="continuous" />
          </DataDictionary>
          <TreeModel modelName="censusTree" splitCharacteristic="multiSplit">
            <MiningSchema>
              <MiningField name="id" usageType="supplementary" />
              …
              <MiningField name="class" usageType="predicted" />
            </MiningSchema>
            <Node score="" recordCount="48842">
              <True/>
              <ScoreDistribution value="&lt;=50K" recordCount="37155" />
              ...
            </Node>
          </TreeModel>
        </PMML>

  12. Language

  13. Closure Principle (1)
      • The arguments of an operator must be of the appropriate types and appear in the appropriate order.
      • The signature of an operator, op: t1 × … × tn → t, is expressed by defining a DTD for KDDML queries that constrains the sub-elements to be of types t1, …, tn.

  14. Closure Principle (2)

        <!ELEMENT TREE_CLASSIFY ((%kdd_query_trees;), (%kdd_query_table;))>
        <!ATTLIST TREE_CLASSIFY xml_dest %string; #IMPLIED>

      Where:
      • kdd_query_trees: all operators returning a classification tree;
      • kdd_query_table: all operators returning a table;
      • TREE_CLASSIFY belongs to the kdd_query_table entity.
      f_TREE_CLASSIFY: tree × table → table
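The closure principle can be illustrated outside the DTD with a small type checker. This is a sketch under stated assumptions, not the KDDML validator: only the signature of TREE_CLASSIFY is given on the slide, while the other entries of `SIGNATURES` and the tuple encoding of queries are hypothetical choices made for illustration.

```python
# Hypothetical signature table: operator name -> (argument types, result type).
# Only TREE_CLASSIFY's signature is stated on the slide; the other entries
# are assumptions inferred from the examples in this transcript.
SIGNATURES = {
    "TREE_CLASSIFY": (("tree", "table"), "table"),
    "TREE_MINER": (("table", "algs"), "tree"),
    "ARFF_LOADER": ((), "table"),
    "TABLE_LOADER": ((), "table"),
    "ALGORITHM": ((), "algs"),
}


def result_type(query):
    """Type-check a nested (operator, [sub-queries]) tuple and return its type."""
    name, args = query
    expected, result = SIGNATURES[name]
    actual = tuple(result_type(arg) for arg in args)
    if actual != expected:
        raise TypeError(f"{name} expects {expected}, got {actual}")
    return result


well_typed = ("TREE_CLASSIFY", [
    ("TREE_MINER", [("ARFF_LOADER", []), ("ALGORITHM", [])]),
    ("TABLE_LOADER", []),
])
ill_typed = ("TREE_CLASSIFY", [("TABLE_LOADER", []), ("TABLE_LOADER", [])])

print(result_type(well_typed))
try:
    result_type(ill_typed)
    rejected = False
except TypeError:
    rejected = True
```

Because TREE_CLASSIFY itself returns a table, its result can be nested wherever a table argument is expected, which is the compositionality that requirement R3 asks for.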

  15. KDDML Types
      • The set of types of KDDML operators consists of:
        • Table, PPtable
        • Tree, clusters, rda, sequence, hierarchy
        • Algs, condition, expression

  16. KDDML Query structure

        <OPERATOR_NAME xml_dest="results.xml" att1="v1" ... attM="vM">
          <ARG1_NAME> .... </ARG1_NAME>
          ...
          <ARGn_NAME> .... </ARGn_NAME>
        </OPERATOR_NAME>

      • The structure of a KDDML query has a precise format.
      • XML elements (tags) correspond to operations on data and models;
      • XML attributes correspond to the parameters of those operations;
      • XML sub-elements define the arguments passed to the operators (KDDML types).

  17. Example (1)
      • Construction and application of a decision tree.
        • Loading of an ARFF source as training set.
        • Simple sampling on the training set.
        • Construction of a decision tree on the sampled training set.
          • Target attribute: play.
          • Algorithm: C4.5.
        • Loading of a test set from the system repository.
        • Application of the decision tree to the test set.

  18. Example (2)
      [Diagram: Repository ARFF; Arff Loader (source: weather.arff); Sampling (alg: simple sampling, percentage: 66%); Tree Miner (alg: c4.5, pruning confidence: 40%, num instances: 6); Repository Data; Table Loader (source: weather_test.xml); Tree Classify. The slide highlights each operator's fragment of the query below in turn.]

        <KDDML_OBJECT>
          <KDD_QUERY name="sample">
            <TREE_CLASSIFY xml_dest="results.xml">
              <TREE_MINER xml_dest="weather.xml" target_attribute="play">
                <PP_SAMPLING>
                  <ARFF_LOADER arff_file_name="weather.arff"/>
                  <ALGORITHM algorithm_name="simple_sampling">
                    <PARAM name="percentage" value="0.66"/>
                  </ALGORITHM>
                </PP_SAMPLING>
                <ALGORITHM algorithm_name="C4.5">
                  <PARAM name="confidence_for_pruning" value="0.4"/>
                  <PARAM name="num_instances_for_leaf" value="6"/>
                </ALGORITHM>
              </TREE_MINER>
              <TABLE_LOADER xml_source="weather_test.xml"/>
            </TREE_CLASSIFY>
          </KDD_QUERY>
        </KDDML_OBJECT>
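Since a KDDML query is plain XML, a front end could assemble it programmatically instead of writing it by hand. The sketch below rebuilds the query of this slide with Python's standard `xml.etree.ElementTree`; the helper `add_algorithm` is a hypothetical convenience function, not something KDDML provides.

```python
import xml.etree.ElementTree as ET


def add_algorithm(parent, name, params):
    """Append an ALGORITHM element with one PARAM child per parameter."""
    alg = ET.SubElement(parent, "ALGORITHM", {"algorithm_name": name})
    for key, value in params.items():
        ET.SubElement(alg, "PARAM", {"name": key, "value": value})
    return alg


root = ET.Element("KDDML_OBJECT")
query = ET.SubElement(root, "KDD_QUERY", {"name": "sample"})
classify = ET.SubElement(query, "TREE_CLASSIFY", {"xml_dest": "results.xml"})

# First argument of TREE_CLASSIFY: the tree mined from a sampled training set.
miner = ET.SubElement(
    classify, "TREE_MINER", {"xml_dest": "weather.xml", "target_attribute": "play"}
)
sampling = ET.SubElement(miner, "PP_SAMPLING")
ET.SubElement(sampling, "ARFF_LOADER", {"arff_file_name": "weather.arff"})
add_algorithm(sampling, "simple_sampling", {"percentage": "0.66"})
add_algorithm(
    miner, "C4.5", {"confidence_for_pruning": "0.4", "num_instances_for_leaf": "6"}
)

# Second argument of TREE_CLASSIFY: the test table from the repository.
ET.SubElement(classify, "TABLE_LOADER", {"xml_source": "weather_test.xml"})

kddml_text = ET.tostring(root, encoding="unicode")
print(kddml_text)
```

The nesting of the `SubElement` calls mirrors the operator tree in the diagram: each operator's arguments are its child elements.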

  19. Language Operators
      • Data/Model access.
      • Preprocessing.
        • Data cleaning, sampling, normalization, discretization.
      • Model extraction.
      • Model application and evaluation.
      • Model meta-reasoning and filtering.

  20. Example one: Discretization
      Discretization of the numeric attribute "age" into three intervals using the natural binning method.

        ....
        <PP_NUMERIC_DISCRETIZATION xml_dest="census_discrete.xml"
                                   attribute_name="age"
                                   label_type="enumeration"
                                   enumerated_label_list="young, middle, old">
          <TABLE_LOADER xml_source="census.xml"/>
          <ALGORITHM algorithm_name="natural_binning">
            <PARAM name="cardinality" value="3"/>
            <PARAM name="having_number_of_intervals" value="true"/>
          </ALGORITHM>
        </PP_NUMERIC_DISCRETIZATION>
        ....
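The slide does not define natural binning. The sketch below assumes it means equal-width binning, that is, splitting the range [min, max] of the attribute into intervals of the same width; if KDDML uses a different definition, the interval boundaries will differ. The function name and the sample ages are hypothetical.

```python
def equal_width_labels(values, labels):
    """Map each numeric value to the label of its equal-width interval.

    Assumption: "natural binning" is treated as equal-width binning.
    """
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    result = []
    for value in values:
        index = int((value - low) / width) if width > 0 else 0
        # The maximum falls on the upper boundary, so clamp it to the last bin.
        result.append(labels[min(index, len(labels) - 1)])
    return result


# Hypothetical ages spanning the min (18.0) and max (70.0) of the census schema.
ages = [18.0, 30.0, 45.0, 70.0]
discretized = equal_width_labels(ages, ["young", "middle", "old"])
print(discretized)
```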

  21. Example two: RdA filtering
      Selects the association rules that have the item "bread" in the body, do not have the item "milk" in the head, have exactly two items in the head, and have support greater than 30%.

        ....
        <RDA_FILTER>
          <RDA_LOADER xml_source="rules.xml"/>
          <CONDITION>
            <AND_COND>
              <BASE_COND op_type="is_in" term1="@body" term2="bread"/>
              <BASE_COND op_type="is_not_in" term1="@head" term2="milk"/>
              <BASE_COND op_type="equal" term1="@head_cardinality" term2="2"/>
              <BASE_COND op_type="greater" term1="@support" term2="0.3"/>
            </AND_COND>
          </CONDITION>
        </RDA_FILTER>
        ....
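To make the semantics of the condition concrete, here is a minimal sketch that evaluates the CONDITION element of this slide against a few made-up rules. The dictionary representation of a rule, the sample rules, and the way "@" terms are resolved are assumptions for illustration and do not reflect KDDML's internal code.

```python
import xml.etree.ElementTree as ET

CONDITION_XML = """
<CONDITION>
  <AND_COND>
    <BASE_COND op_type="is_in" term1="@body" term2="bread"/>
    <BASE_COND op_type="is_not_in" term1="@head" term2="milk"/>
    <BASE_COND op_type="equal" term1="@head_cardinality" term2="2"/>
    <BASE_COND op_type="greater" term1="@support" term2="0.3"/>
  </AND_COND>
</CONDITION>
"""

# Assumed meaning of each op_type: (rule property, constant) -> bool.
OPERATORS = {
    "is_in": lambda prop, const: const in prop,
    "is_not_in": lambda prop, const: const not in prop,
    "equal": lambda prop, const: prop == float(const),
    "greater": lambda prop, const: prop > float(const),
}


def rule_property(rule, term):
    """Resolve an '@name' term against a rule stored as a dictionary."""
    if term == "@head_cardinality":
        return len(rule["head"])
    return rule[term[1:]]


def holds(cond, rule):
    """Evaluate an AND_COND or BASE_COND element on one rule."""
    if cond.tag == "AND_COND":
        return all(holds(child, rule) for child in cond)
    test = OPERATORS[cond.get("op_type")]
    return test(rule_property(rule, cond.get("term1")), cond.get("term2"))


# Made-up rules; only the first one satisfies all four conditions.
rules = [
    {"body": {"bread"}, "head": {"butter", "jam"}, "support": 0.4},
    {"body": {"bread"}, "head": {"milk", "jam"}, "support": 0.5},
    {"body": {"beer"}, "head": {"chips", "nuts"}, "support": 0.6},
    {"body": {"bread", "eggs"}, "head": {"butter"}, "support": 0.5},
    {"body": {"bread"}, "head": {"butter", "jam"}, "support": 0.2},
]

condition = ET.fromstring(CONDITION_XML)[0]  # the AND_COND element
selected = [rule for rule in rules if holds(condition, rule)]
print(selected)
```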

  22. System Architecture

  23. Design targets
      • Extensibility
        • Data sources
        • Algorithms
        • Models
      • Portability
      • Modularity.
        • Architecture structured in 3 layers.

  24. Architecture Layers
      [Diagram: Interpreter Layer, Operators Layer and Repository Layer stacked over Data and Models, with an arrow labelled "To upper layers…"]
      • Repository Layer:
        • Manages read/write access to the data and models repository.
        • Manages read/write access to data and models from external sources.
        • Provides a programmatic interface to the higher layers.
      • Interpreter Layer:
        • Accepts a validated KDDML query and returns the result as an XML document.
        • Recursively traverses the DOM tree representation of the query.
        • The interpreter is not affected by data/algorithm/model extensions.
      • Operators Layer:
        • Implementation of the language operators.
        • Each <OPERATOR_NAME> is implemented as a Java class satisfying an interface.
        • The interface is task-dependent.
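The interplay of the three layers can be sketched in a few lines. The code below is an illustration only: the actual system implements each operator as a Java class, whereas here plain Python functions in a registry stand in for those classes, a dictionary stands in for the repository, and the sampling operator keeps a deterministic prefix of the table instead of drawing a random sample. All names other than the XML tags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for the repository layer: named tables kept in memory.
REPOSITORY = {
    "weather_test.xml": [
        {"outlook": "sunny"},
        {"outlook": "rainy"},
        {"outlook": "overcast"},
        {"outlook": "sunny"},
    ]
}

# Stand-in for the operators layer: tag name -> implementation.
OPERATORS = {}


def operator(tag):
    """Register a function as the implementation of an XML tag."""
    def register(func):
        OPERATORS[tag] = func
        return func
    return register


@operator("PARAM")
def param(attrs, args):
    return attrs["name"], attrs["value"]


@operator("ALGORITHM")
def algorithm(attrs, args):
    return {"algorithm_name": attrs["algorithm_name"], **dict(args)}


@operator("TABLE_LOADER")
def table_loader(attrs, args):
    return REPOSITORY[attrs["xml_source"]]


@operator("PP_SAMPLING")
def pp_sampling(attrs, args):
    table, alg = args
    keep = int(len(table) * float(alg["percentage"]))
    return table[:keep]  # deterministic prefix, not a random sample


def interpret(element):
    """Interpreter layer: evaluate sub-elements, then apply the operator."""
    args = [interpret(child) for child in element]
    return OPERATORS[element.tag](element.attrib, args)


QUERY = """
<PP_SAMPLING>
  <TABLE_LOADER xml_source="weather_test.xml"/>
  <ALGORITHM algorithm_name="simple_sampling">
    <PARAM name="percentage" value="0.5"/>
  </ALGORITHM>
</PP_SAMPLING>
"""

result = interpret(ET.fromstring(QUERY))
print(result)
```

Adding an operator only means adding an entry to the registry; `interpret` itself does not change, which mirrors the claim that the interpreter is not affected by extensions.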

  25. KDDML as Middleware System
      [Diagram labels: High Level GUI; MQL Query; MQL Results; Compiler; Query KDDML; Results; Interpreter Layer; Operators Layer; Repository Layer; Data; Models]

  26. Experiences with KDDML

  27. ClickWorld
      • Extract DM models from visits to a city-news portal with the intent to characterize the topics of interest of new visitors.
      • M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, F. Turini. Preprocessing and mining web log data for web personalization. 8th Italian Conf. on Artificial Intelligence, Vol. 2829 of LNCS, pp. 237-249, September 2003.

  28. KDDML-G
      [Diagram labels: OP, OP1, OP2, OP3]
      • A system for KDD on the GRID.
      • Exploits the parallelism offered by the GRID.
      • Data immovability: the data stay in place and the code is moved to where they reside.

  29. Download KDDML
      http://kdd.di.unipi.it/kddml/
      GNU General Public License (GPL)
