
Scalable Mining For Classification Rules in Relational Databases


Presentation Transcript


1. Scalable Mining For Classification Rules in Relational Databases
Min Wang, Bala Iyer, Jeffrey Scott Vitter
Presented by: Nadav Grossaug

2. Abstract
• Problem: the size of training sets keeps increasing
• MIND (MINing in Databases) classifier
• Can be implemented easily in SQL
• Other classifiers need O(N) space in memory
• MIND scales well with:
  • I/O
  • # of processors

3. Overview
• Introduction
• Algorithm
• Database Implementation
• Performance
• Experimental Results
• Conclusions

4. Introduction - Classification Problem
A classifier is built from the DETAIL table (the training set) and is represented as a decision tree, e.g.:
Age <= 30?
  yes: salary <= 62K?
    yes: risky
    no: safe
  no: safe

5. Introduction - Scalability in Classification
Importance of scalability:
• Very large training sets – the data is not memory resident
• Number of CPUs – better use of resources

6. Introduction - Scalability in Classification
Properties of MIND:
• Scalable in memory
• Scalable in the number of CPUs
• Uses SQL
• Easy to implement
Assumptions:
• Attribute values are discrete
• We focus on the growth phase (no pruning)

7. The Algorithm - Data Structure
The data is kept in the DETAIL table:
DETAIL(attr1, attr2, …, class, leaf_num)
• attri = the i-th attribute
• class = the class label
• leaf_num = the leaf the example currently belongs to (this value can be derived from the tree built so far)
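As a concrete illustration (not from the slides), a DETAIL table with two discrete attributes could be declared roughly as follows; the column types are assumptions:

  -- Illustrative sketch of the DETAIL schema (two attributes assumed; types are guesses)
  CREATE TABLE DETAIL (
      attr1    INTEGER,   -- first discrete attribute (e.g. age)
      attr2    INTEGER,   -- second discrete attribute (e.g. salary)
      class    INTEGER,   -- class label of the training example
      leaf_num INTEGER    -- leaf of the current tree the example belongs to
  );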

8. The Algorithm - gini index
• S - a data set
• C - the number of classes
• pi - the relative frequency of class i in S
gini index: gini(S) = 1 - Σi pi²
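A quick worked example: with C = 2 and p1 = p2 = 0.5, gini(S) = 1 - (0.25 + 0.25) = 0.5 (maximally mixed), while a pure set with p1 = 1 gives gini(S) = 0. When a split divides S into S1 and S2, the quantity minimized is the standard weighted form gini_split(S) = (|S1|/|S|)·gini(S1) + (|S2|/|S|)·gini(S2); the slides do not spell this out, but it is presumably what the UP/DOWN counts introduced later feed into.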

9. The Algorithm
GrowTree(DETAIL table)
  initialize tree T and put all records of DETAIL in the root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each split value of attribute i
    for each non-STOP leaf do
      get the overall best split for it
    partition the records and grow the tree one more level according to the best splits
    mark all small or pure leaves as STOP nodes
  return T

10. Database Implementation - Dimension Tables
For each attribute i and each level of the tree:
INSERT INTO DIMi
  SELECT leaf_num, class, attri, COUNT(*)
  FROM DETAIL
  WHERE leaf_num <> STOP
  GROUP BY leaf_num, class, attri
Size of DIMi = #leaves × #distinct values of attri × #classes

11. Database Implementation - Dimension Tables (SQL)
INSERT INTO DIM1
  SELECT leaf_num, class, attr1, COUNT(*)
  FROM DETAIL
  WHERE leaf_num <> STOP
  GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
  SELECT leaf_num, class, attr2, COUNT(*)
  FROM DETAIL
  WHERE leaf_num <> STOP
  GROUP BY leaf_num, class, attr2

12. Database Implementation - UP/DOWN (split points)
For each attribute i we evaluate all possible split points:
INSERT INTO UP
  SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
  FROM DIMi d1 FULL OUTER JOIN DIMi d2
    ON d1.leaf_num = d2.leaf_num
   AND d2.attri <= d1.attri
   AND d1.class = d2.class
  GROUP BY d1.leaf_num, d1.attri, d1.class
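Only the UP table is shown on this slide; assuming DOWN accumulates the counts on the other side of each candidate split, a symmetric sketch (not taken from the slides) would be:

INSERT INTO DOWN
  SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
  FROM DIMi d1 FULL OUTER JOIN DIMi d2
    ON d1.leaf_num = d2.leaf_num
   AND d2.attri > d1.attri
   AND d1.class = d2.class
  GROUP BY d1.leaf_num, d1.attri, d1.class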

13. Database Implementation - Class Views
Create a view for each class k and attribute i:
CREATE VIEW Ck_UP (leaf_num, attri, count) AS
  SELECT leaf_num, attri, count
  FROM UP
  WHERE class = k
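The next slide also refers to C1_DOWN, …, Cc_DOWN; assuming those views mirror the UP views over the DOWN table, a sketch would be:

CREATE VIEW Ck_DOWN (leaf_num, attri, count) AS
  SELECT leaf_num, attri, count
  FROM DOWN
  WHERE class = k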

14. Database Implementation - GINI VALUE
Create a view with all gini values:
CREATE VIEW GINI_VALUE (leaf_num, attri, gini) AS
  SELECT u1.leaf_num, u1.attri, ƒgini
  FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
  WHERE u1.attri = .. = uc.attri = .. = dc.attri
    AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
(ƒgini is the gini-index expression computed from the UP/DOWN counts.)
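The slide leaves ƒgini abstract. Purely as an illustration for C = 2 classes, the weighted gini expression could be spelled out as below; this expansion is an assumption based on the standard formula, not text from the slides (counts are assumed to be cast to floating point to avoid integer division):

  -- Sketch only: ƒgini expanded for two classes
  -- = (nUP/n)·gini(UP side) + (nDOWN/n)·gini(DOWN side)
  ( (u1.count + u2.count)
      * (1.0 - (u1.count*u1.count + u2.count*u2.count)
               / ((u1.count + u2.count) * (u1.count + u2.count)))
  + (d1.count + d2.count)
      * (1.0 - (d1.count*d1.count + d2.count*d2.count)
               / ((d1.count + d2.count) * (d1.count + d2.count))) )
  / (u1.count + u2.count + d1.count + d2.count)   AS gini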

15. Database Implementation - MIN GINI VALUE
Create a table with the minimum gini value for attribute i:
INSERT INTO MIN_GINI
  SELECT leaf_num, i, attri, gini
  FROM GINI_VALUE a
  WHERE a.gini = (SELECT MIN(gini)
                  FROM GINI_VALUE b
                  WHERE a.leaf_num = b.leaf_num)

16. Database Implementation - BEST SPLIT
Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
  SELECT leaf_num, attr_name, attr_value
  FROM MIN_GINI a
  WHERE a.gini = (SELECT MIN(gini)
                  FROM MIN_GINI b
                  WHERE a.leaf_num = b.leaf_num)

17. Database Implementation - Partitioning
• Build new nodes by splitting the old nodes according to the BEST_SPLIT values
• Assign each record to its correct node: leaf_num is computed by a user function (see the sketch below)
• No need to UPDATE the data or the database
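The slides do not show the user function itself. As a hypothetical sketch of the idea that leaf_num is derived from the current tree rather than written back to DETAIL, using the toy tree from slide 4 (attr1 = age, attr2 = salary in thousands; all names are illustrative):

-- Hypothetical: derive leaf_num on the fly instead of UPDATE-ing DETAIL
CREATE VIEW DETAIL_WITH_LEAF AS
  SELECT attr1, attr2, class,
         CASE
           WHEN attr1 <= 30 AND attr2 <= 62 THEN 1   -- age <= 30, salary <= 62K -> risky leaf
           WHEN attr1 <= 30                 THEN 2   -- age <= 30, salary >  62K -> safe leaf
           ELSE                                  3   -- age >  30                -> safe leaf
         END AS leaf_num
  FROM DETAIL;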

18. Performance
• I/O cost of MIND:
• I/O cost of SPRINT:

19. Experimental Results
• Normalized time to finish building the tree
• Normalized time to build the tree per example

20. Experimental Results
• Normalized time to build the tree per number of processors
• Time to build the tree by training-set size

21. Conclusions
• MIND works on top of a DBMS
• MIND works well because:
  • it rephrases classification as a database problem
  • it avoids UPDATEs to the DETAIL table
• Parallelism and scaling are achieved through the RDBMS
• MIND uses a user-defined function to gain performance in the creation of the DIMi tables
