Presentation Transcript


  1. A Multi-Relational Decision Tree Learning Algorithm – Implementation and Experiments Anna Atramentov Major: Computer Science Program of Study Committee: Vasant Honavar, Major Professor Drena Leigh Dobbs Yan-Bin Jia Iowa State University, Ames, Iowa 2003

  2. KDD and Relational Data Mining
• The term KDD stands for Knowledge Discovery in Databases
• Traditional KDD techniques work with instances represented by a single table
• Relational Data Mining is a subfield of KDD in which instances are represented by several tables

  3. Motivation
Importance of relational learning:
• Growth of data stored in MRDBs (multi-relational databases)
• Techniques for learning from unstructured data often extract the data into an MRDB
Promising approach to relational learning:
• The MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
• The MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)
Goals:
• Speed up the MRDM framework, and in particular the MRDTL algorithm
• Incorporate handling of missing values
• Perform a more extensive experimental evaluation of the algorithm

  4. Relational Learning Literature
• Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
• First-order extensions of probabilistic models:
  • Relational Bayesian Networks (Jaeger, 1997)
  • Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  • Bayesian Logic Programs (Kersting et al., 2000)
• Combining first-order logic and probability theory
• Multi-Relational Data Mining (Knobbe et al., 1999)
• Propositionalization methods (Krogel and Wrobel, 2001)
• PRM extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
• Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)

  5. Problem Formulation
Given: data stored in a relational database
Goal: build a decision tree for predicting the target attribute in the target table
[Slide figure: example of a multi-relational database schema and its instances]

  6. Propositional decision tree algorithm. Construction phase
[Slide figure: a sample tree over instances {d1, d2, d3, d4}, splitting first on Outlook (sunny / not sunny) and then on Temperature (hot / not hot), with leaf labels Yes / No]
Tree_induction(D: data)
  A = optimal_attribute(D)
  if stopping_criterion(D)
    return leaf(D)
  else
    Dleft := split(D, A)
    Dright := splitcomplement(D, A)
    childleft := Tree_induction(Dleft)
    childright := Tree_induction(Dright)
    return node(A, childleft, childright)
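The construction phase above can be sketched as a short, self-contained Python program (a minimal illustration, not the thesis implementation; the data layout, the test predicates, and the entropy-based choice inside `optimal_attribute` are assumptions made for the example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def tree_induction(data, tests, min_size=1):
    """data: list of (attributes_dict, label) pairs; tests: (name, predicate) pairs."""
    labels = [y for _, y in data]
    # Stopping criterion: pure node, too few instances, or no tests left.
    if len(set(labels)) == 1 or len(data) <= min_size or not tests:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    def gain(test):
        # Information gain of the binary split Dleft / Dright.
        _, pred = test
        left = [y for x, y in data if pred(x)]
        right = [y for x, y in data if not pred(x)]
        if not left or not right:
            return -1.0
        n = len(data)
        return (entropy(labels)
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
    best = max(tests, key=gain)  # optimal_attribute(D)
    if gain(best) <= 0:
        return Counter(labels).most_common(1)[0][0]
    name, pred = best
    d_left = [(x, y) for x, y in data if pred(x)]       # split(D, A)
    d_right = [(x, y) for x, y in data if not pred(x)]  # splitcomplement(D, A)
    rest = [t for t in tests if t is not best]
    return (name, tree_induction(d_left, rest), tree_induction(d_right, rest))

# Toy weather data in the spirit of the slide's example:
data = [({"outlook": "sunny", "temp": "hot"}, "No"),
        ({"outlook": "sunny", "temp": "hot"}, "No"),
        ({"outlook": "rain", "temp": "hot"}, "No"),
        ({"outlook": "rain", "temp": "mild"}, "Yes")]
tests = [("outlook=sunny", lambda x: x["outlook"] == "sunny"),
         ("temp=hot", lambda x: x["temp"] == "hot")]
tree = tree_induction(data, tests)  # nested (test, left_child, right_child) tuples
```

Each internal node stores the chosen test and the two children built from the split and its complement, mirroring the pseudocode's `node(A, childleft, childright)`.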

  7. MR setting. Splitting data with Selection Graphs
[Slide figure: a population of Staff and Grad.Student records from the Department / Graduate Student / Staff schema, split by the selection graph "Staff with a Grad.Student with GPA > 2.0" and its complement selection graph]

  8. What is a selection graph?
• It corresponds to a subset of the instances from the target table
• Nodes correspond to the tables from the database
• Edges correspond to the associations between tables
• Open edge = "have at least one"
• Closed edge = "have none of"
[Slide figure: a selection graph over Grad.Student (GPA > 3.9), Department (Specialization = math) and Staff]

  9. Transforming selection graphs into SQL queries
Generic query:
select distinct T0.primary_key
from table_list
where join_list
and condition_list

Staff with Position = Professor:
select distinct T0.id
from Staff T0
where T0.Position = 'Professor'

Staff with at least one graduate student:
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor

Staff with no graduate students:
select distinct T0.id
from Staff T0
where T0.id not in (select T1.id from Graduate_Student T1)

Staff with at least one graduate student, but none with GPA > 3.9:
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
and T0.id not in (select T1.id from Graduate_Student T1 where T1.GPA > 3.9)
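Mechanically, the generic query on the slide is just string assembly from the selection graph's table, join, and condition lists; a hedged sketch (the helper name and argument layout are illustrative, not Knobbe's notation):

```python
def selection_graph_to_sql(primary_key, table_list, join_list, condition_list):
    """Assemble the generic MRDTL query:
    select distinct T0.primary_key from table_list where join_list and condition_list."""
    clauses = join_list + condition_list
    sql = f"select distinct T0.{primary_key} from {', '.join(table_list)}"
    if clauses:
        sql += " where " + " and ".join(clauses)
    return sql

# Second example from the slide: staff with at least one graduate student.
q = selection_graph_to_sql("id",
                           ["Staff T0", "Graduate_Student T1"],
                           ["T0.id = T1.Advisor"],
                           [])
```

A subgraph behind a closed edge would contribute a `T0.id not in (select ...)` entry to `condition_list`, as in the last two queries on the slide.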

  10. MR decision tree
• Each node contains a selection graph
• Each child selection graph is a supergraph of the parent selection graph
[Slide figure: a tree of selection graphs over Staff and Grad.Student, refining on GPA > 3.9]

  11. How to choose selection graphs in nodes?
Problem: there are too many supergraph selection graphs to choose from at each node
Solution:
• start with an initial selection graph
• use a greedy heuristic to choose supergraph selection graphs: refinements
• use binary splits for simplicity
• for each refinement, get the complement refinement
• choose the best refinement based on the information gain criterion
Problem: some potentially good refinements may give no immediate benefit
Solution:
• look-ahead capability
[Slide figure: the tree of selection graphs over Staff and Grad.Student, refining on GPA > 3.9]

  12. Refinements of selection graph
• add condition to the node - explore attribute information in the tables
• add present edge and open node - explore relational properties between the tables
[Slide figure: the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  13. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add the condition Position = Professor to the Staff node
Complement refinement: add the condition Position != Professor instead
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  14. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add the condition GPA > 2.0 to the Grad.Student node
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  15. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add the condition #Students > 200 to the Department node
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  16. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add a present edge and open a new Department node
Note: information gain = 0
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  17. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add a present edge and open a new Staff node
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  18. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add a present edge and open another new Staff node
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  19. Refinements of selection graph
• add condition to the node
• add present edge and open node
Refinement: add a present edge and open a new Grad.Student node
[Slide figure: the parent selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff, with the refined and complement selection graphs]

  20. Look-ahead capability
[Slide figure: the refinement that adds a present edge to a new Department node, and its complement refinement, over the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  21. Look-ahead capability
[Slide figure: the look-ahead refinement adds the edge to the new Department node together with the condition #Students > 200 in a single step, shown with its complement refinement]

  22. MRDTL algorithm. Construction phase
For each non-leaf node:
• consider all possible refinements of the node's selection graph and their complements
• choose the best ones based on the information gain criterion
• create children nodes
[Slide figure: the growing tree of selection graphs over Staff and Grad.Student (GPA > 3.9)]

  23. MRDTL algorithm. Classification phase
For each leaf:
• apply the selection graph of the leaf to the test data
• classify the resulting instances with the classification of the leaf
[Slide figure: a tree of selection graphs over Staff, Grad.Student (GPA > 3.9, Position = Professor) and Department (Spec = math / Spec = physics), with leaf classifications 70-80k and 80-100k]

  24. The most time consuming operations of MRDTL
Entropy associated with this selection graph:
E = -Σ (n_i / N) log (n_i / N)
Query associated with the counts n_i:
select distinct Staff.Salary, count(distinct Staff.ID)
from Staff, Grad.Student, Department
where join_list
and condition_list
group by Staff.Salary
The result of the query is a list of pairs (c_i, n_i).
[Slide figure: the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]
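Given the list of (c_i, n_i) pairs returned by the grouped query, the entropy is a one-liner; a sketch (base-2 logarithm assumed):

```python
import math

def entropy_from_counts(counts):
    """counts: list of (class_value, n_i) pairs from the 'group by' query.
    Returns E = -sum_i (n_i / N) * log2(n_i / N)."""
    total = sum(n for _, n in counts)
    return -sum((n / total) * math.log2(n / total) for _, n in counts)

# Example: two salary classes with 6 and 2 instances.
e = entropy_from_counts([("70-80k", 6), ("80-100k", 2)])
```

Only the counts per class value are needed, which is why a single aggregate query per candidate refinement suffices.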

  25. The most time consuming operations of MRDTL
Entropy associated with each of the refinements is obtained with the same kind of query:
select distinct Staff.Salary, count(distinct Staff.ID)
from table_list
where join_list
and condition_list
group by Staff.Salary
[Slide figure: the candidate refinements (e.g. add condition GPA > 2.0) of the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  26. A way to speed up - eliminate redundant calculations
Problem: for a selection graph with 162 nodes, the time to execute a query is more than 3 minutes!
Redundancy in calculation: for this selection graph, the tables Staff and Grad.Student will be joined over and over for all the children refinements of the tree
A way to fix it: perform the join only once and save the result for all further calculations
[Slide figure: the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  27. Speed Up Method. Sufficient tables
[Slide figure: the sufficient table built for the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  28. Speed Up Method. Sufficient tables
Entropy associated with this selection graph:
E = -Σ (n_i / N) log (n_i / N)
Query associated with the counts n_i (run against the sufficient table S):
select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary
The result of the query is a list of pairs (c_i, n_i).
[Slide figure: the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  29. Speed Up Method. Sufficient tables
Queries associated with the "add condition" refinement:
select S.Salary, X.A, count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A
Calculations for the complement refinement:
count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
[Slide figure: the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]
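Because the complement counts follow by subtraction, only one query per refinement is needed; a sketch of the bookkeeping (the dictionary layout is an assumption for the example):

```python
def complement_counts(parent_counts, refinement_counts):
    """count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S)), per class c_i."""
    return {c: parent_counts[c] - refinement_counts.get(c, 0)
            for c in parent_counts}

# Class counts at the parent node vs. after an 'add condition' refinement:
parent = {"70-80k": 10, "80-100k": 6}
refined = {"70-80k": 4, "80-100k": 6}
comp = complement_counts(parent, refined)
```

The parent's counts are already known from the previous level of the tree, so the complement refinement costs no extra database work.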

  30. Speed Up Method. Sufficient tables
Queries associated with the "add edge" refinement:
select S.Salary, count(distinct S.Staff_ID)
from S, X, Y
where S.X_ID = X.ID
and e.cond
group by S.Salary
Calculations for the complement refinement:
count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
[Slide figure: the selection graph Grad.Student (GPA > 3.9) - Department (Specialization = math) - Staff]

  31. Speed Up Method
• Significant speed up in obtaining the counts needed for the calculation of entropy and information gain
• The speed up is achieved at the cost of the additional space used by the algorithm

  32. Handling Missing Values
For each attribute that has missing values we build a Naïve Bayes model:
[Slide figure: Graduate Student, Department and Staff tables; evidence attributes include Staff.Position, Staff.Name, Staff.Dep, Department.Spec, ...]

  33. Handling Missing Values
The most probable value for the missing attribute is then calculated by the formula:
P(v_i | X1.A1, X2.A2, X3.A3, ...)
  = P(X1.A1, X2.A2, X3.A3, ... | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3, ...)
  = P(X1.A1 | v_i) P(X2.A2 | v_i) P(X3.A3 | v_i) ... P(v_i) / P(X1.A1, X2.A2, X3.A3, ...)
[Slide figure: Graduate Student, Department and Staff tables]
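That argmax can be sketched with plain frequency counts (an illustration under simplifying assumptions: flat rows rather than joined tables, and no smoothing for unseen values):

```python
from collections import Counter

def impute(train_rows, target, evidence):
    """Return argmax_v P(v) * prod_j P(evidence_j | v), with probabilities
    estimated by counting over rows where `target` is observed."""
    prior = Counter(r[target] for r in train_rows)
    best_value, best_score = None, -1.0
    for v, nv in prior.items():
        rows_v = [r for r in train_rows if r[target] == v]
        score = nv / len(train_rows)  # P(v)
        for attr, val in evidence.items():
            score *= sum(1 for r in rows_v if r[attr] == val) / nv  # P(attr=val | v)
        if score > best_score:
            best_value, best_score = v, score
    return best_value

# Hypothetical rows: guess a missing Staff.Position from Department.Spec.
rows = [{"position": "Professor", "spec": "math"},
        {"position": "Professor", "spec": "math"},
        {"position": "Postdoc", "spec": "physics"},
        {"position": "Postdoc", "spec": "math"}]
guess = impute(rows, "position", {"spec": "math"})
```

The shared denominator P(X1.A1, X2.A2, ...) is constant across the candidate values v_i, so it can be dropped from the argmax.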

  34. Experimental results. Mutagenesis
• The most widely used DB in ILP.
• Describes molecules of certain nitroaromatic compounds.
• Goal: predict their mutagenic activity (label attribute) - the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
• Two subsets: regression friendly (188 molecules) and regression unfriendly (42 molecules). We used only the regression friendly subset.
• 5 levels of background knowledge: B0, B1, B2, B3, B4. They provide progressively richer descriptions of the examples. We used the B2 level.

  35. Experimental results. Mutagenesis
• Schema of the mutagenesis database
• Results of 10-fold cross-validation for the regression friendly set: the best-known reported accuracy is 86%

  36. Experimental results. KDD Cup 2001
• Consists of a variety of details about the various genes of one particular type of organism.
• Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another to perform crucial functions.
• 2 tasks: prediction of gene/protein localization and function.
• 862 training genes, 381 test genes.
• Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table.

  37. Experimental results. KDD Cup 2001
[Slide tables: results for the two tasks; the best-known reported accuracies are 72.1% and 93.6%]

  38. Experimental results. PKDD 2001 Discovery Challenge
• Consists of 5 tables
• The target table consists of 1239 records
• The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table
• Results for 5:2 cross-validation: the best-known reported accuracy is 99.28%

  39. Summary
• The new implementation significantly outperforms the original MRDTL in terms of running time
• The accuracy results are comparable with the best reported results obtained using other data mining algorithms
Future work:
• Incorporation of more sophisticated techniques for handling missing values
• Incorporation of more sophisticated pruning techniques or complexity regularization
• More extensive evaluation of MRDTL on real-world data sets
• Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
• Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration

  40. Thanks to
• Dr. Honavar for providing guidance, help and support throughout this research
• Colleagues from the Artificial Intelligence Lab for various helpful discussions
• My committee members, Drena Dobbs and Yan-Bin Jia, for their help
• Professors and lecturers of the Computer Science department for the knowledge they gave me through lectures and discussions
• Iowa State University and the Computer Science department for funding this research in part
