
Anna Atramentov and Vasant Honavar* Artificial Intelligence Laboratory

Speeding Up Multi-Relational Data Mining. Anna Atramentov and Vasant Honavar* Artificial Intelligence Laboratory Department of Computer Science Iowa State University Ames, IA 50011, USA www.cs.iastate.edu/~honavar/aigroup.html.


Presentation Transcript


  1. Speeding Up Multi-Relational Data Mining
     Anna Atramentov and Vasant Honavar*
     Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50011, USA
     www.cs.iastate.edu/~honavar/aigroup.html
     * Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.

  2. Motivation
     Importance of relational learning:
     • Growth of data stored in multi-relational databases (MRDB)
     • Techniques for learning from unstructured data often extract the data into an MRDB
     One of the promising approaches to relational learning:
     • MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
     • MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)
     Goal:
     • Speed up the MRDM framework and, in particular, the MRDTL algorithm

  3. Problem Formulation
     Given: data stored in a relational database
     Goal: learn a predictive model for the instances in the target table
     [Figure: example of a multi-relational database, showing its schema and instances]

  4. MRDM overview. Selection graphs
     • Nodes correspond to the tables from the database
     • Edges correspond to the associations between tables
     • A selection graph corresponds to the subset of instances from the target table that have some property
     • A selection graph is a way of specifying attributes in the relational setting
     [Figure: selection graph with nodes Staff (Specialization=math), Grad.Student, Department, and Grad.Student (GPA > 3.9)]

  5. MRDM overview. Transforming selection graphs into SQL queries
     Generic query:
       select distinct T0.primary_key
       from table_list
       where join_list and condition_list
     Selection graph with nodes Staff, Grad. Student:
       Select distinct T0.id
       From Staff T0, Graduate_Student T1
       Where T0.id = T1.Advisor
     Selection graph with nodes Staff, Grad. Student:
       Select distinct T0.id
       From Staff T0
       Where T0.id not in (Select T1.id From Graduate_Student T1)
     Selection graph with nodes Staff, Grad. Student, Grad. Student (GPA > 3.9):
       Select distinct T0.id
       From Staff T0, Graduate_Student T1
       Where T0.id = T1.Advisor
         and T0.id not in (Select T1.id From Graduate_Student T1 Where T1.GPA > 3.9)
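The generic query template on this slide can be tried out on a toy database. The following is a minimal sketch using Python's `sqlite3`: the rows, the helper names `selection_query` and `ids`, and the alias `T2` are all assumptions made for illustration. The `not in` subqueries here return `Advisor` (rather than `id` as printed on the slide) so that they yield Staff keys in this toy schema; that choice is also an assumption of the sketch.

```python
import sqlite3

# Toy data: table names follow the slide, the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (id INTEGER PRIMARY KEY);
    CREATE TABLE Graduate_Student (id INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
    INSERT INTO Staff VALUES (1), (2), (3);
    INSERT INTO Graduate_Student VALUES (10, 1, 3.95), (11, 1, 3.2), (12, 2, 3.5);
""")


def selection_query(primary_key, table_list, join_list, condition_list):
    """Fill in the generic template (hypothetical helper, not from the slides)."""
    sql = f"SELECT DISTINCT {primary_key} FROM {', '.join(table_list)}"
    where = list(join_list) + list(condition_list)
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


def ids(sql):
    """Run a query and return the selected keys as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


# Assumption: the subqueries return Advisor so that they yield Staff keys.
no_student = "T0.id NOT IN (SELECT T2.Advisor FROM Graduate_Student T2)"
no_top_student = (
    "T0.id NOT IN (SELECT T2.Advisor FROM Graduate_Student T2 WHERE T2.GPA > 3.9)"
)

# Staff who advise at least one graduate student.
advisors = ids(selection_query(
    "T0.id", ["Staff T0", "Graduate_Student T1"], ["T0.id = T1.Advisor"], []))

# Staff who advise no graduate student.
non_advisors = ids(selection_query("T0.id", ["Staff T0"], [], [no_student]))

# Staff who advise some student but no student with GPA > 3.9.
advisors_without_top = ids(selection_query(
    "T0.id", ["Staff T0", "Graduate_Student T1"],
    ["T0.id = T1.Advisor"], [no_top_student]))

print(advisors, non_advisors, advisors_without_top)
```

Keeping `join_list` and `condition_list` as separate arguments mirrors the template on the slide; a real translator would derive both lists from the nodes and edges of the selection graph.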

  6. MRDM overview. Refinements of selection graphs
     [Figure: a selection graph with nodes Staff (Specialization=math), Grad.Student, Department, and Grad.Student (GPA > 3.9), shown next to a refinement that adds the condition GPA > 2.0 to a Grad.Student node and the corresponding complement refinement]
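One way to read the refinement / complement pair on this slide is that the two together split the instances covered by the parent selection graph into two disjoint groups. The sketch below checks that on toy data with `sqlite3`. The rows, the alias `T2`, the helper `ids`, and the exact SQL used for the complement are assumptions; the slide itself only shows the graphs.

```python
import sqlite3

# Toy data: table names follow the slides, the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (id INTEGER PRIMARY KEY);
    CREATE TABLE Graduate_Student (id INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
    INSERT INTO Staff VALUES (1), (2), (3);
    INSERT INTO Graduate_Student VALUES (10, 1, 3.95), (11, 1, 3.2), (12, 2, 3.5);
""")


def ids(sql):
    """Return the set of Staff keys selected by a query."""
    return {row[0] for row in conn.execute(sql)}


# Parent selection graph: Staff with at least one graduate student.
parent_sql = (
    "SELECT DISTINCT T0.id FROM Staff T0, Graduate_Student T1 "
    "WHERE T0.id = T1.Advisor"
)

# Refinement: add the condition GPA > 3.9 to the Graduate_Student node.
refined_sql = parent_sql + " AND T1.GPA > 3.9"

# Complement (assumed form): covered by the parent, but with no student above 3.9.
complement_sql = parent_sql + (
    " AND T0.id NOT IN "
    "(SELECT T2.Advisor FROM Graduate_Student T2 WHERE T2.GPA > 3.9)"
)

parent = ids(parent_sql)
refined = ids(refined_sql)
complement = ids(complement_sql)

print(parent, refined, complement)
```

On this data the refinement and its complement do not overlap and together give back the parent's instances, which is the property a decision tree split relies on.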

  7. The most time consuming operations of MRDTL
     Query associated with the selection graph:
       select distinct Staff.Salary, count(distinct Staff.ID)
       from Staff, Grad.Student, Department
       where join_list and condition_list
       group by Staff.Salary
     [Figure: selection graph with nodes Staff (Specialization=math), Grad.Student, Department, and Grad.Student (GPA > 3.9)]
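The counting query on this slide can be run on a small example to see what it returns and why `count(distinct ...)` matters. This is a sketch under assumptions: the rows, the column `Dept_ID`, the categorical `Salary` values, and the concrete join and condition lists are invented for illustration, and `Graduate_Student` stands in for the slide's `Grad.Student`.

```python
import sqlite3

# Toy schema and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (id INTEGER PRIMARY KEY);
    CREATE TABLE Staff (id INTEGER PRIMARY KEY, Specialization TEXT,
                        Salary TEXT, Dept_ID INTEGER);
    CREATE TABLE Graduate_Student (id INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
    INSERT INTO Department VALUES (1), (2);
    INSERT INTO Staff VALUES (1, 'math', 'high', 1), (2, 'math', 'low', 1),
                             (3, 'math', 'high', 2), (4, 'physics', 'low', 2);
    INSERT INTO Graduate_Student VALUES (10, 1, 3.95), (11, 1, 3.2), (12, 2, 3.5),
                                        (13, 3, 3.8), (14, 4, 3.0);
""")

# The slide's query with an assumed join_list and condition_list filled in.
COUNT_QUERY = """
    SELECT Staff.Salary, COUNT({distinct} Staff.id)
    FROM Staff, Graduate_Student, Department
    WHERE Staff.id = Graduate_Student.Advisor
      AND Staff.Dept_ID = Department.id
      AND Staff.Specialization = 'math'
    GROUP BY Staff.Salary
"""

# Number of distinct Staff instances per Salary value.
counts = dict(conn.execute(COUNT_QUERY.format(distinct="DISTINCT")))

# Without DISTINCT the query counts joined rows, so a staff member
# with two students is counted twice.
row_counts = dict(conn.execute(COUNT_QUERY.format(distinct="")))

print(counts, row_counts)
```

Each candidate refinement needs counts of this kind, and every such query repeats the same joins, which is the cost the following slides address.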

  8. A way to speed up: eliminate redundant calculations
     • Problem: for a selection graph with 160 nodes, the time to execute a query is more than 3 minutes!
     • Redundancy in calculation: tables Staff and Grad.Student will be joined again for every child refinement
     • A way to fix: make the join only once and save the necessary information for all further calculations
     [Figure: selection graph with nodes Staff (Specialization=math), Grad.Student, Department, and Grad.Student (GPA > 3.9)]

  9. Speed Up Method. Sufficient tables
     [Figure: selection graph with nodes Staff (Specialization=math), Grad.Student, Department, and Grad.Student (GPA > 3.9)]

  10. Speed Up Method. Sufficient tables
      Query associated with the selection graph:
        select S.Salary, count(distinct S.Staff_ID)
        from S
        group by S.Salary
      [Figure: selection graph with nodes Staff (Specialization=math), Grad.Student, Department, and Grad.Student (GPA > 3.9)]
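The join-once idea behind the table `S` can be illustrated with `sqlite3`. The slides do not show how sufficient tables are actually built or updated, so this is only a sketch of the principle: the rows, the choice of columns stored in `S` (`Staff_ID`, `Salary`, `GPA`), and the helpers `counts_from_s` and `counts_from_join` are assumptions.

```python
import sqlite3

# Toy schema and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (id INTEGER PRIMARY KEY, Specialization TEXT, Salary TEXT);
    CREATE TABLE Graduate_Student (id INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
    INSERT INTO Staff VALUES (1, 'math', 'high'), (2, 'math', 'low'),
                             (3, 'math', 'high'), (4, 'physics', 'low');
    INSERT INTO Graduate_Student VALUES (10, 1, 3.95), (11, 1, 3.2), (12, 2, 3.5),
                                        (13, 3, 3.8), (14, 4, 3.0);
""")

# Join once and keep the columns that later queries need
# (assumed here: the Staff key, the Salary label, and the student's GPA).
conn.execute("""
    CREATE TABLE S AS
    SELECT Staff.id AS Staff_ID, Staff.Salary AS Salary, Graduate_Student.GPA AS GPA
    FROM Staff, Graduate_Student
    WHERE Staff.id = Graduate_Student.Advisor AND Staff.Specialization = 'math'
""")


def counts_from_s(condition="1"):
    """Counts per Salary value, read from the materialized table S."""
    sql = (f"SELECT S.Salary, COUNT(DISTINCT S.Staff_ID) FROM S "
           f"WHERE {condition} GROUP BY S.Salary")
    return dict(conn.execute(sql))


def counts_from_join(condition="1"):
    """The same counts, recomputed by joining the base tables every time."""
    sql = (f"SELECT Staff.Salary, COUNT(DISTINCT Staff.id) "
           f"FROM Staff, Graduate_Student "
           f"WHERE Staff.id = Graduate_Student.Advisor "
           f"AND Staff.Specialization = 'math' AND {condition} "
           f"GROUP BY Staff.Salary")
    return dict(conn.execute(sql))


# Candidate refinements are evaluated against S without repeating the join.
candidates = ["1", "GPA > 3.5", "GPA > 3.9"]
results = {cond: counts_from_s(cond) for cond in candidates}
print(results)
```

In this sketch both helpers return the same counts for every candidate condition; the difference is that `counts_from_s` scans one precomputed table instead of joining `Staff` and `Graduate_Student` again for each refinement.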

  11. Experimental results

  12. Summary
      • A general approach for speeding up the MRDM framework
      • The MRDTL algorithm is competitive for learning from relational databases in terms of both accuracy and time
      Future work:
      • Techniques for handling missing values
      • Pruning techniques or complexity regularization
      • Use of aggregates for the attribute values
      • More extensive evaluation of MRDTL on real-world data sets
