Boosting tuple propagation in multi-relational classification

15th International Database Engineering & Applications Symposium, Lisbon, Portugal, 21-23 September 2011. Lucantonio Ghionna, Gianluigi Greco. Dept. of Mathematics, University of Calabria, Italy.

Presentation Transcript


  1. 15th International Database Engineering & Applications Symposium, Lisbon, Portugal, 21-23 September 2011 • Lucantonio Ghionna, Gianluigi Greco • Boosting tuple propagation in multi-relational classification • Dept. of Mathematics, University of Calabria, Italy

  2. Outline • Background • Multi-Relational Classification • Problem Complexity • Tractability Islands • Heuristic Approaches • DBMS Implementation • System Design • Experiments • Concluding Remarks

  3. Multi-Relational Classification • How to make a decision on loan granting? • Target relation: Loan. Each tuple has a class label, indicating whether a loan is paid on time. • Schema (figure): Loan(loan-id, account-id, date, amount, duration, payment); Account(account-id, district-id, frequency, date); District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96); Card(card-id, disp-id, type, issue-date); Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol); Disposition(disp-id, account-id, client-id); Order(order-id, account-id, bank-to, account-to, amount, type); Client(client-id, birth-date, gender, district-id)

  4. Multi-Relational Classification • Search for good predicates across multiple relations • Example: do good payers access their account with a "monthly" frequency? • Figure: loan applications (Applicant #1 to Applicant #4) linked to Orders, Accounts, Districts, and other relations

  5. Solving CLP: State of the Art • Flattening approach [Krogel03] • Build the universal relation through joins • Combinatorial explosion of data, large tables with many attributes [Mugg92] • Upgrading approach [Xu06] • Keep the universal relation virtual by propagating labels through foreign keys • Global perspective [Xu06] • Local perspective [Blockeel03, Yin04, Xu06]
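The upgrading idea on this slide (propagating labels through foreign keys instead of materializing the join) can be illustrated with a minimal Python sketch. The toy rows, the column names, and the helpers `propagate_labels` and `label_counts` are assumptions made for illustration; they are not taken from the presentation or from any of the cited systems.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: Loan is the target relation (it carries the class label),
# Account is reached through the foreign key account_id.
loans = [
    {"loan_id": 1, "account_id": 10, "label": "paid"},
    {"loan_id": 2, "account_id": 10, "label": "unpaid"},
    {"loan_id": 3, "account_id": 11, "label": "paid"},
]
accounts = [
    {"account_id": 10, "frequency": "monthly"},
    {"account_id": 11, "frequency": "weekly"},
]


def propagate_labels(target, other, key):
    """Annotate each tuple of `other` with the class-label counts of the
    target tuples that join with it on `key`; the join is never materialized."""
    counts = defaultdict(Counter)
    for t in target:
        counts[t[key]][t["label"]] += 1
    return [(row, counts.get(row[key], Counter())) for row in other]


def label_counts(annotated, predicate):
    """Sum the propagated label counts over the tuples satisfying `predicate`."""
    total = Counter()
    for row, c in annotated:
        if predicate(row):
            total += c
    return total


annotated = propagate_labels(loans, accounts, "account_id")
monthly = label_counts(annotated, lambda r: r["frequency"] == "monthly")
```

Here `monthly` holds the class distribution of the loans whose account has a "monthly" frequency, which is the kind of statistic a classifier needs to score a candidate predicate.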

  6. Contributions • We show that the propagation problem can effectively be solved on databases whose hypergraphs are nearly acyclic • We design effective algorithms for the global/local perspectives • We provide an implementation of a complete JDBC-based system for tuple propagation • Experiments

  7. Problem Complexity • Tractability Islands • Heuristic Approaches

  8. Global Perspective: Tractability Islands of CLP • Good news: an exponentially large universal relation does not imply that CLP is intractable [Xu06] • Figure atoms: p1(X,Y), p2(X,Z,W), p5(Y,T,X)

  9. Global Perspective: Tractability Islands of CLP • Good news: an exponentially large universal relation does not imply that CLP is intractable [Xu06] • Bottom-up pass • Figure atoms: p1(X,Y), p2(X,Z,W), p5(Y,T,X)

  11. Global Perspective: Tractability Islands of CLP • Good news: an exponentially large universal relation does not imply that CLP is intractable [Xu06] • Top-down pass • Figure atoms: p1(X,Y), p2(X,Z,W), p5(Y,T,X)

  12. Global Perspective: Tractability Islands of CLP • Good news: an exponentially large universal relation does not imply that CLP is intractable [Xu06] • Top-down pass • Figure atoms: p1(X,Y), p2(X,Z,W), p5(Y,T,X) • CLP is tractable on dependency graphs whose undirected versions are (forests of) trees [Xu06]
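The bottom-up and top-down passes named on these slides can be sketched as a two-pass semijoin reduction over a rooted tree, in the spirit of the semijoin techniques listed in the references (Bernstein and Goodman; Yannakakis). This is only a sketch under stated assumptions: the tree shape (p1 as root with children p2 and p5), the sample tuples, and the helper names `semijoin` and `full_reduce` are hypothetical and not taken from the slides.

```python
def semijoin(left, right):
    """Keep the tuples of `left` that join with at least one tuple of `right`
    on their shared attributes. Tuples are dicts mapping attribute -> value."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    keys = {tuple(r[a] for a in shared) for r in right}
    return [t for t in left if tuple(t[a] for a in shared) in keys]


def full_reduce(relations, children, root):
    """Bottom-up pass (children filter parents), then top-down pass
    (parents filter children) over a rooted tree of relations."""
    rel = {name: list(rows) for name, rows in relations.items()}

    def bottom_up(node):
        for child in children.get(node, []):
            bottom_up(child)
            rel[node] = semijoin(rel[node], rel[child])

    def top_down(node):
        for child in children.get(node, []):
            rel[child] = semijoin(rel[child], rel[node])
            top_down(child)

    bottom_up(root)
    top_down(root)
    return rel


# Hypothetical instance over the atoms p1(X,Y), p2(X,Z,W), p5(Y,T,X).
relations = {
    "p1": [{"X": 1, "Y": 1}, {"X": 2, "Y": 2}, {"X": 3, "Y": 3}],
    "p2": [{"X": 1, "Z": 0, "W": 0}, {"X": 2, "Z": 0, "W": 0}],
    "p5": [{"Y": 1, "T": 0, "X": 1}, {"Y": 3, "T": 0, "X": 3}, {"Y": 2, "T": 0, "X": 9}],
}
children = {"p1": ["p2", "p5"]}  # assumed tree: p1 is the root
reduced = full_reduce(relations, children, "p1")
```

After both passes every surviving tuple takes part in at least one tuple of the full join, yet no intermediate result ever exceeds the size of an input relation.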

  13. Tractability Islands of CLP. Are trees enough? • $Q = R_1(B_1, A_1, \dots, A_m), R_2(B_2, A_1, \dots, A_m), \dots, R_n(B_n, A_1, \dots, A_m), R'_1(A_1), R'_2(A_2), \dots, R'_m(A_m)$ • The (undirected) dependency graph is a bipartite clique of size m × n; hence it is not a tree, and the result in [Xu06] does not apply • CLP is still tractable!
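A small check makes the claim concrete. The sketch below assumes, as the slide states, that the undirected dependency graph links every R_i to every R'_j; the helper names `is_forest` and `bipartite_clique` are hypothetical. A union-find pass reports a cycle as soon as an edge joins two nodes that are already connected.

```python
def is_forest(nodes, edges):
    """Return True if the undirected graph (nodes, edges) has no cycle."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, w in edges:
        ru, rw = find(u), find(w)
        if ru == rw:  # u and w are already connected: this edge closes a cycle
            return False
        parent[ru] = rw
    return True


def bipartite_clique(n, m):
    """Dependency graph assumed on the slide: every R_i is linked to every R'_j."""
    left = [f"R{i}" for i in range(1, n + 1)]
    right = [f"R'{j}" for j in range(1, m + 1)]
    edges = [(r, s) for r in left for s in right]
    return left + right, edges
```

With n = m = 2 the four edges already form a cycle, so the tree-based result does not apply; with n = 1 or m = 1 the graph degenerates to a star, which is a tree.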

  14. Tractability Islands of CLP. Hypertree Decompositions • $Q = R_1(B_1, A_1, \dots, A_m), R_2(B_2, A_1, \dots, A_m), \dots, R_n(B_n, A_1, \dots, A_m), R'_1(A_1), R'_2(A_2), \dots, R'_m(A_m)$ • Figure: the hypergraph of Q and a hypertree decomposition whose root covers all the B and A variables with the relations R_i, and whose children are {A1} R'1, {A2} R'2, …, {Am} R'm • For fixed k, deciding whether $hw(Q) \le k$ is in P [Gottlob02] • For fixed k, computing hypertree decompositions is in P [Gottlob02]

  15. Tractability Islands of CLP. Hypertree Decompositions • Cyclic dependency graph… • …but bounded width!

  16. CLPonDBk algorithm • Bottom-up phase • Decomposition nodes (figure): Loan; Transaction; Order; Account; Account, Disposition; Card; Client; District

  24. CLPonDBk algorithm • Top-down phase • Decomposition nodes (figure): Loan; Transaction; Order; Account; Account, Disposition; Card; Client; District

  31. CLPonDBk algorithm • Top-down phase • Decomposition nodes (figure): Loan; Transaction; Order; Account; Account, Disposition; Card; Client; District; seven <1,1> annotations appear in the final figure • CLPonDBk solves CLP in time $O(|D| \times \max_{R_i \in D} \|R_i\|^{k+3})$ on the class of instances whose associated hypergraphs have hypertree width bounded by k.

  32. L-CLP: Local Perspective on the Propagation Problem • In several multi-relational approaches, CLP is heuristically restricted to portions of the database • Reducing the search space can pragmatically speed up the computation • Still, joining many relations may be challenging from a computational viewpoint.

  33. L-CLP: NTtoT_onDBMS and TtoNT_onDBMS • A propagation path from R1 to Rm only requires joining pairs of “adjacent” relations • “Target to Non-Target” propagation (TtoNT_onDBMS): propagate information from R1 to Rm, then evaluate C on the result • “Non-Target to Target” propagation (NTtoT_onDBMS): start by filtering Rm with the condition C, then join the result with Rm-1, and iterate the process back to R1
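The two propagation directions can be sketched against an in-memory SQLite database using Python's standard `sqlite3` module. This is an illustration only, not the authors' JDBC system: the three-table path (loan, account, district), the sample rows, the condition on `region`, and the function names `t_to_nt` and `nt_to_t` are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loan(loan_id INTEGER, account_id INTEGER, label TEXT);
CREATE TABLE account(account_id INTEGER, district_id INTEGER);
CREATE TABLE district(district_id INTEGER, region TEXT);
INSERT INTO loan VALUES (1, 10, 'paid'), (2, 11, 'paid'), (3, 12, 'unpaid');
INSERT INTO account VALUES (10, 100), (11, 101), (12, 100);
INSERT INTO district VALUES (100, 'north'), (101, 'south');
""")


def t_to_nt(conn, region):
    """Target to non-target: carry loan ids and labels forward along the path,
    joining one pair of adjacent relations at a time, then evaluate C at the end."""
    sql = """
    WITH prop_account AS (
        SELECT l.loan_id, l.label, a.district_id
        FROM loan l JOIN account a ON a.account_id = l.account_id),
    prop_district AS (
        SELECT p.loan_id, p.label, d.region
        FROM prop_account p JOIN district d ON d.district_id = p.district_id)
    SELECT DISTINCT loan_id, label FROM prop_district
    WHERE region = ? ORDER BY loan_id
    """
    return conn.execute(sql, (region,)).fetchall()


def nt_to_t(conn, region):
    """Non-target to target: filter the last relation with C first, then
    semi-join back towards the target one adjacent pair at a time."""
    sql = """
    SELECT loan_id, label FROM loan
    WHERE account_id IN (
        SELECT account_id FROM account
        WHERE district_id IN (
            SELECT district_id FROM district WHERE region = ?))
    ORDER BY loan_id
    """
    return conn.execute(sql, (region,)).fetchall()
```

Both functions return the same labeled target tuples; they differ only in where the filtering happens, which is what drives the cost difference between the two strategies.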

  34. L-CLP: NTtoT_onDBMS and TtoNT_onDBMS • Figures: TtoNT_onDBMS; NTtoT_onDBMS

  35. DBMS Implementation • System Design • Experiments

  36. A JDBC System for CLP

  37. Experimentation Settings • Scenarios: CROSSMINE + NTtoT_onDBMS; CROSSMINE + TtoNT_onDBMS; CROSSMINE + TupleIDPropagation • Parameters: the number m of relations; the number ||target|| of tuples in the target relation; the “propagation ratio” ||target||/||R||; the selectivity s of each join attribute • Environment: 2.1 GHz Centrino PC, 1 GB RAM, 5400 rpm hard disk (Windows XP Professional)

  38. Computation Time and Propagation Time • Settings: m = 5; ||target||/||R|| = 1; s = 50% • Dramatic improvements w.r.t. standard CrossMine • Effective scaling for large relations • …

  39. Gains w.r.t. CrossMine • Settings: m = 5; s = 50% • NTtoT_onDBMS or TtoNT_onDBMS? • Gain on propagation up to 95% • Gain on computation time up to 90% • …

  40. NTtoT_onDBMS vs TtoNT_onDBMS • Settings (two plots): ||target|| = 100000; m = 5; s = 50%; ||target||/||R|| = 1 • TtoNT_onDBMS is the best with a low propagation ratio • NTtoT_onDBMS is the best when the target relation is much larger than the other relations • Semi-join operators are a winning choice in practical database applications

  41. Conclusion and Discussion • CLP is a challenging task that can be effectively addressed using state-of-the-art query-optimization methods • Propagation over a large class of nearly acyclic database schemas is in fact tractable (polynomial upper bound guarantee) • The result in [Xu06] emerges as a special case • Database implementation of local-perspective methods shows tremendous benefits w.r.t. standard in-memory strategies • Potential benefits for many classification algorithms, such as Bayesian classifiers [Getoor01], probabilistic models [Taskar02], and decision tree learning methods [Leiva03].

  42. THANK YOU!

  43. References • P. A. Bernstein and N. Goodman. Power of natural semijoins. SIAM Journal on Computing, 10(4):751–771, 1981. • H. Blockeel and L. De Raedt. Top-down Induction of First-Order Logical Decision Trees. Artificial Intelligence, 101(1-2):285–297, 1998. • H. Blockeel and M. Sebag. Scalability and Efficiency in Multi-relational Data Mining. SIGKDD Explorations Newsletter, 5(1):17–30, 2003. • M. Ceci and D. Malerba. Mr-SBC: a Multi-Relational Naive Bayes Classifier. In Proc. of PKDD’03, pages 95–106, 2003. • S. Džeroski. Multi-relational Data Mining: an Introduction. SIGKDD Explorations Newsletter, 5(1):1–16, 2003. • P. A. Flach and N. Lachiche. IBC2: A True First-Order Bayesian Classifier. In Proc. of ILP’02, pages 133–148, 2002. • R. Frank and F.M.M. Ester. A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions. In Proc. of PKDD’07, pages 430–437, 2007. • L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proc. of ICML’01, pages 170–177, 2001. • G. Gottlob, N. Leone, and F. Scarcello. Hypertree decompositions and tractable queries. Journal of Computer and System Sciences, 64:579–627, 2002. • G. Gottlob, Z. Miklos, and T. Schwentick. Generalized hypertree decompositions: NP-hardness and tractable variants. In Proc. of PODS’07, pages 13–22, 2007. • H. Guo and H. L. Viktor. Multirelational classification: a multiple view approach. Knowledge and Information Systems, 17(3):287–312, 2008.

  44. References • G. Jing-Feng, L. Jing, and B. Wei-Feng. An Efficient Relational Decision Tree Classification Algorithm. In Proc. of ICNC’07, pages 530–534, 2007. • M. A. Krogel, S. Rawles, F. Zelezny, P. A. Flach, N. Lavrac, and S. Wrobel. Comparative Evaluation of Approaches to Propositionalization. In Proc. of ILP’03, pages 197–214, 2003. • H. Leiva, A. Atramentov, and V. Honavar. A Multi-relational Decision Tree Learning Algorithm. In Proc. of ILP’03, pages 97–112, 2002. • H. Liu, X. Yin, and J. Han. An efficient Multi-relational Naïve Bayesian classifier based on Semantic Relationship Graph. In Proc. of MRDM’05, pages 39–48, 2005. • S. Muggleton. Inductive Logic Programming. Academic Press, New York, 1992. • J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning Relational Probability Trees. In Proc. of KDD’03, pages 625–630, 2003. • J. Neville, D. Jensen, and B. Gallagher. Simple Estimators for Relational Bayesian Classifiers. In Proc. of ICDM’03, page 609, 2003. • U. Pompe and I. Kononenko. Naive Bayesian Classifier within ILP-R. In Proc. of ILP’95, pages 417–436, 1995. • B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI’02, 2002. • K. Wang, Y. Xu, P.S. Yu, and R. She. Building Decision Trees on Records Linked through Key References. In Proc. of SDM’05, 2005. • Y. Xu, K. Wang, A. Wai-Chee Fu, R. She, and J. Pei. Classification Spanning Correlated Data Streams. In Proc. of CIKM’06, pages 132–141, 2006. • M. Yannakakis. Algorithms for acyclic database schemes. In Proc. of VLDB’81, pages 82–94, 1981. • X. Yin, J. Han, J. Yang, and P.S. Yu. CrossMine: Efficient Classification Across Multiple Database Relations. In Proc. of ICDE’04, page 399, 2004.

  45. Multi-Relational Classification: Formal Framework • Input: D (with target having attribute CL), I, a class label ‘l’, and a condition C over the attributes of some relation $R \in D$ • Output: $\pi_{key[target]}\big(\sigma_{C \,\wedge\, target.CL = 'l'}(R(D, I))\big)$

  46. Hypertree decomposition nodes (attributes; relations) • {account-id, district-id}; {Account} • {transaction-id, account-id}; {Transaction} • {account-id, disp-id, client-id, district-id}; {Account, Disposition} • {loan-id, account-id}; {Loan} • {disp-id, card-id}; {Card} • {client-id, district-id}; {Client} • {order-id, account-id}; {Order} • {district-id}; {District}
