Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments

Presentation Transcript


  1. Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments Haimonti Dutta1 and Hillol Kargupta2 1Center for Computational Learning Systems (CCLS), Columbia University, NY, USA. 2University of Maryland, Baltimore County, Baltimore, MD. Also affiliated with Agnik, LLC, Columbia, MD.

  2. Motivation: Support Vector (Kernel) Regression – An Illustration • Find a function f(x) = y that fits a set of example data points • The problem can be phrased as a constrained optimization task • It is solved using a standard LP solver
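A minimal sketch of how such a fit can be posed as a linear program and handed to a standard LP solver. This is not necessarily the authors' exact formulation: it fits f(x) = Σj αj K(x, xj) + b by minimizing the sum of absolute residuals, and the kernel, toy data, and variable layout are all invented for illustration.

```python
# Sketch: kernel regression as an LP (L1 loss), assuming a Gaussian kernel
# and toy 1-D data. Solved with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * (X[i] - Y[j])**2)."""
    return np.exp(-gamma * (X[:, None] - Y[None, :]) ** 2)

# Toy sample points (x_i, y_i).
x = np.linspace(0, 6, 15)
y = np.sin(x)
n = len(x)
K = rbf_kernel(x, x)

# Decision variables: [alpha_1..alpha_n, b, r_1..r_n]; r_i bound the residuals.
c = np.concatenate([np.zeros(n + 1), np.ones(n)])          # minimize sum_i r_i

# Residual constraints:  K a + b - y <= r   and   y - K a - b <= r.
A_upper = np.hstack([ K,  np.ones((n, 1)), -np.eye(n)])     #  K a + b - r <= y
A_lower = np.hstack([-K, -np.ones((n, 1)), -np.eye(n)])     # -K a - b - r <= -y
A_ub = np.vstack([A_upper, A_lower])
b_ub = np.concatenate([y, -y])

bounds = [(None, None)] * (n + 1) + [(0, None)] * n         # alpha, b free; r >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
alpha, b = res.x[:n], res.x[n]
print("LP status:", res.message)
print("fit at x = 2.0:", rbf_kernel(np.array([2.0]), x) @ alpha + b)
```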

  3. Motivation contd. Knowledge-Based Kernel Regression • In addition to sample points, give advice • If (x ≥ 3) and (x ≤ 5) then (y ≥ 5) • Rules add constraints about regions of the input space • The advice constraints are added to the LP, and a new solution (respecting the advice) can be constructed Fung, Mangasarian and Shavlik, “Knowledge Based Support Vector Machine Classifiers”, NIPS, 2002. Mangasarian, Shavlik and Wild, “Knowledge-Based Kernel Approximation”, JMLR, 5, 1127–1141, 2005. Figure adapted from Maclin, Shavlik, Walker and Torrey, “Knowledge-Based Support Vector Regression for Reinforcement Learning”, IJCAI, 2005.
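A crude, hedged illustration of turning such advice into extra LP rows, continuing the previous sketch (and reusing its variables x, n, c, A_ub, b_ub, bounds). This is not the Fung–Mangasarian–Shavlik construction, which handles the implication exactly; here the advice "If 3 ≤ x ≤ 5 then y ≥ 5" is only approximated by requiring f(x_k) ≥ 5 at a few sample points inside the advice region.

```python
# Approximate the advice region with sample points and append the rows
# f(x_k) >= 5, i.e. -K_adv a - b <= -5, to the LP built above, then re-solve.
advice_x = np.linspace(3, 5, 5)                 # sample points in the region
K_adv = rbf_kernel(advice_x, x)                 # kernel rows against training x

A_advice = np.hstack([-K_adv, -np.ones((len(advice_x), 1)),
                      np.zeros((len(advice_x), n))])
b_advice = -5.0 * np.ones(len(advice_x))

res_adv = linprog(c, A_ub=np.vstack([A_ub, A_advice]),
                  b_ub=np.concatenate([b_ub, b_advice]),
                  bounds=bounds, method="highs")
print("status with advice constraints:", res_adv.message)
```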

  4. Distributed Data Mining Applications – An Example of Scientific Data Mining in Astronomy • Distributed data and computing resources on the National Virtual Observatory • Need for distributed optimization strategies • P2P data mining on homogeneously partitioned sky survey data H. Dutta, Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure, Ph.D. Thesis, UMBC, Maryland, 2007.

  5. Road Map • Motivation • Related Work • Framing a Linear Programming problem • The simplex algorithm • The distributed simplex algorithm • Experimental Results • Conclusions and Directions for Future Work

  6. Related Work Resource Discovery in Distributed Environments • Iamnitchi, “Resource Discovery in Large Resource-Sharing Environments”, Ph.D. Thesis, University of Chicago, 2003. • Livny and Solomon, “Matchmaking: Distributed Resource Management for High Throughput Computing”, HPDC, 1998. Optimization Techniques • Yarmish, “Distributed Implementation of the Simplex Method”, Ph.D. Thesis, CIS, Polytechnic University, 2001. • Hall and McKinnon, “Update Procedures for Parallel Revised Simplex Methods”, Technical Report, University of Edinburgh, UK, 1992. • Craig and Reed, “Hypercube Implementation of the Simplex Algorithm”, ACM, pages 1473–1482, 1998.

  7. The Optimization Problem • Assumptions: • n nodes in the network • The network is static • Dataset Di at node i • Processing cost at the i-th node – νi per record • Transportation cost between nodes i and j – μij • Amount of data transferred from node i to node j – xij • Cost function: Z = Σij (μij + νj) xij = Σij cij xij, where cij = μij + νj
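A small numeric sketch of the cost function above. The transportation costs, processing costs, and transfer amounts below are made-up illustrative values, not data from the paper.

```python
# Z = sum_{i,j} (mu_ij + nu_j) * x_ij = sum_{i,j} c_ij * x_ij, on toy numbers.
import numpy as np

mu = np.array([[0.0, 3.8, 6.1],        # transportation cost mu_ij (hypothetical)
               [3.8, 0.0, 2.5],
               [6.1, 2.5, 0.0]])
nu = np.array([1.0, 0.5, 2.0])         # processing cost per record at node j
x  = np.array([[0.0, 10.0, 5.0],       # records shipped from node i to node j
               [0.0,  0.0, 8.0],
               [0.0,  0.0, 0.0]])

c = mu + nu[None, :]                   # c_ij = mu_ij + nu_j
Z = np.sum(c * x)
print("Total cost Z =", Z)
```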

  8. Framing the Linear Programming Problem: An Illustration • Objective function: z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45 • Cost: C(X) = Σij (μij + νj) xij = Σij cij xij, with cij = μij + νj • Constraints: x12 + x14 + x15 ≤ 300; x12 + x25 + x23 ≤ 600; x15 + x25 + x45 ≤ 300; x23 + x34 ≤ 300; 0 ≤ x12 ≤ D1; 0 ≤ x23 ≤ D2; 0 ≤ x15 ≤ D1; 0 ≤ x14 ≤ D1; 0 ≤ x25 ≤ D2; 0 ≤ x34 ≤ D3; 0 ≤ x45 ≤ D4 • [Figure: five-node network with edge transportation costs 3.8, 6.1, 6.5, 2.5, 10.4, 7.8, 8.3, one node holding 600 GB and four nodes holding 300 GB each]
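A hedged sketch of solving this LP with an off-the-shelf solver (scipy.optimize.linprog). To make the example non-trivial, the per-node equality constraints from slide 14 are also included, and the capacities D1 = 300, D2 = 600, D3 = 300, D4 = 300 (GB) are assumptions read off the node labels rather than values stated in the slides.

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: [x12, x23, x15, x14, x25, x34, x45]
c = np.array([6.03, 9.04, 6.52, 8.28, 14.42, 9.58, 12.32])

A_ub = np.array([
    [1, 0, 1, 1, 0, 0, 0],   # x12 + x14 + x15 <= 300
    [1, 1, 0, 0, 1, 0, 0],   # x12 + x23 + x25 <= 600
    [0, 0, 1, 0, 1, 0, 1],   # x15 + x25 + x45 <= 300
    [0, 1, 0, 0, 0, 1, 0],   # x23 + x34       <= 300
])
b_ub = np.array([300, 600, 300, 300])

A_eq = np.array([
    [ 1,  0,  2, 0, -1, 0,  0],   # x12 + 2x15 - x25 = 2   (slide 14)
    [-1, -1,  0, 0,  2, 0,  0],   # 2x25 - x12 - x23 = 4   (slide 14)
    [ 0,  0, -2, 0,  1, 0, -1],   # x25 - 2x15 - x45 = 5   (slide 14)
])
b_eq = np.array([2, 4, 5])

D1, D2, D3, D4 = 300, 600, 300, 300          # assumed capacities (GB)
bounds = [(0, D1), (0, D2), (0, D1), (0, D1), (0, D2), (0, D3), (0, D4)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print("optimal transfers:", dict(zip(
    ["x12", "x23", "x15", "x14", "x25", "x34", "x45"], res.x.round(3))))
print("minimum cost z =", round(res.fun, 3))
```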

  9. The Simplex Algorithm • Find x1 ≥ 0, x2 ≥ 0, …, xn ≥ 0 minimizing z = c1x1 + c2x2 + … + cnxn • Subject to the constraints A1x1 + A2x2 + … + Anxn = B • [Figure: the simplex tableau]

  10. The Simplex Algorithm – Contd. The Problem • Maximize z = x1 + 2x2 – x3 • Subject to • 2x1 + x2 + x3 ≤ 14 • 4x1 + 2x2 + 3x3 ≤ 28 • 2x1 + 5x2 + 5x3 ≤ 30 The Steps of the Simplex Algorithm (Dantzig) • Obtain a canonical representation (introduce slack variables) • Find a column pivot • Find a row pivot • Perform Gauss-Jordan elimination
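A plain sketch of the four steps above as a tableau simplex for maximization problems with ≤ constraints, run on the slide-10 example. It is an illustrative toy implementation, not a production solver and not the authors' code.

```python
import numpy as np

def simplex_max(c, A, b, tol=1e-9, max_iter=100):
    """Dantzig's tableau simplex for: maximize c.x s.t. A x <= b, x >= 0."""
    m, n = A.shape
    # Canonical form: append one slack variable per constraint.
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n] = A
    T[:m, n:n + m] = np.eye(m)
    T[:m, -1] = b
    T[-1, :n] = c                          # objective row (maximization)
    basis = list(range(n, n + m))          # slacks start in the basis

    for _ in range(max_iter):
        # Column pivot: variable with the largest positive objective coefficient.
        col = int(np.argmax(T[-1, :-1]))
        if T[-1, col] <= tol:
            break                          # no improving column -> optimal
        # Row pivot: minimum ratio test over rows with positive pivot entries.
        ratios = [T[r, -1] / T[r, col] if T[r, col] > tol else np.inf
                  for r in range(m)]
        row = int(np.argmin(ratios))
        if ratios[row] == np.inf:
            raise ValueError("unbounded LP")
        # Gauss-Jordan elimination on the pivot column.
        T[row] /= T[row, col]
        for r in range(m + 1):
            if r != row:
                T[r] -= T[r, col] * T[row]
        basis[row] = col

    x = np.zeros(n + m)
    x[basis] = T[:m, -1]
    return x[:n], -T[-1, -1]

# The slide-10 example: max z = x1 + 2*x2 - x3.
A = np.array([[2, 1, 1], [4, 2, 3], [2, 5, 5]], dtype=float)
b = np.array([14, 28, 30], dtype=float)
x, z = simplex_max(np.array([1, 2, -1], dtype=float), A, b)
print("x =", x, "z =", z)   # first iteration pivots on column x2, row 3 (30/5 = 6)
```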

  11. The Simplex Tableau and Iterations • Canonical representation for the slide-10 problem (slack variables s1, s2, s3 added):

  Basis |  x1   x2   x3   s1   s2   s3 |   B | Ratio
   s1   |   2    1    1    1    0    0 |  14 | 14/1 = 14
   s2   |   4    2    3    0    1    0 |  28 | 28/2 = 14
   s3   |   2    5    5    0    0    1 |  30 | 30/5 = 6  ← pivot row
    z   |   1    2   -1    0    0    0 |   0 |

  • Pivot column: x2 (largest objective-row coefficient) • Pivot row: the s3 row (smallest ratio, 30/5 = 6)

  12. Simplex iterations contd. • Perform Gauss-Jordan elimination • The final tableau yields the optimum x1 = 5, x2 = 4, x3 = 0 with z = 13

  13. Road Map • Motivation • Related Work • Framing a Linear Programming problem • The simplex algorithm • The distributed simplex algorithm • Experimental Results • Conclusions and Future Work

  14. The Distributed Problem – An Example • Each site observes different constraints but wants to optimize the same objective function z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45 • Local constraints at the nodes include: x12 + x15 + x14 + 2x25 ≤ 300; x12 + 2x15 − x25 = 2; x12 + x23 + x25 ≤ 600; 2x25 − x12 − x23 = 4; x23 + x34 ≤ 300; x15 + x25 + x45 ≤ 300; x25 − 2x15 − x45 = 5; x34 + 8x25 ≤ 300 • [Figure: five-node network (Nodes 1–5), one node holding 600 GB and four holding 300 GB each, with each node's local constraints attached]

  15. Distributed Canonical Representation • An initialization step • Number of slack (initial basic) variables to add = total number of constraints in the system • Build a spanning tree in the network • Perform a distributed sum-estimation algorithm • This builds a canonical representation exactly identical to the one that would be obtained if the data were centralized
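A simulated sketch of that initialization step: each node knows only its own number of constraints, and a convergecast up a spanning tree followed by a broadcast back down gives every node the global total, i.e. how many slack variables the shared canonical representation needs. The tree shape and per-node counts below are invented for illustration, and the message passing is simulated by ordinary function calls.

```python
# Spanning tree given as parent pointers; node 0 is the root.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
local_constraints = {0: 2, 1: 1, 2: 2, 3: 1, 4: 1}   # hypothetical counts

children = {v: [u for u, p in parent.items() if p == v] for v in parent}

def subtree_sum(v):
    """Convergecast: each node sends the sum of its subtree to its parent."""
    return local_constraints[v] + sum(subtree_sum(c) for c in children[v])

total = subtree_sum(0)                        # the root learns the global count ...
global_count = {v: total for v in parent}     # ... and broadcasts it back down
print("slack variables to add at every node:", global_count)
```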

  16. The Distributed Algorithm for Solving the LP Problem • Steps involved: • Estimate the column pivot • Estimate the row pivot (requires communication with neighbors) • Perform Gauss-Jordan elimination

  17. Illustration of the Distributed Algorithm • Column pivot selection is done at each node • [Figure: five-node network (Nodes 1–5), each node selecting its column pivot locally]

  18. Distributed Row Pivot Selection • Protocol Push-Min (gossip based) • A minimum-estimation problem • At iteration t, node i receives the values {mr} sent by its neighbors at iteration t−1 and sets mit = min({mr} ∪ {its current local row pivot}) • Termination: all nodes hold exactly the same minimum value
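A simulated sketch of a Push-Min style gossip for the row-pivot step: every node starts with its own local pivot ratio and, in each round, pushes its current minimum to a random neighbor; each node then keeps the minimum of what it held and what it received. The topology and the ratios below are invented for illustration, and the termination test uses global knowledge purely for the simulation.

```python
import random

# Hypothetical five-node topology and local row-pivot ratios.
neighbours = {1: [2, 4, 5], 2: [1, 3, 5], 3: [2, 4], 4: [1, 3, 5], 5: [1, 2, 4]}
local_ratio = {1: 14.0, 2: 9.5, 3: 6.0, 4: 11.2, 5: 8.7}

estimate = dict(local_ratio)
rounds = 0
while len(set(estimate.values())) > 1:            # all nodes agree -> stop (simulation only)
    inbox = {v: [] for v in neighbours}
    for v in neighbours:                           # each node pushes its current minimum
        u = random.choice(neighbours[v])           # to one randomly chosen neighbour
        inbox[u].append(estimate[v])
    estimate = {v: min([estimate[v]] + inbox[v]) for v in neighbours}
    rounds += 1

print("global row-pivot ratio at every node:", estimate, "after", rounds, "rounds")
```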

  19. Analysis of Protocol Push-Min • Based on the spread of an epidemic in a large population • Susceptible, infected, and dead nodes • The “epidemic” spreads exponentially fast • [Figure: five-node network (Nodes 1–5) illustrating the spread]

  20. Comments and Discussions • Assume η nodes in the network • Communication complexity is O(number of simplex iterations × η) • In the worst case, simplex may require an exponential number of iterations • For most practical purposes it takes λm iterations (λ < 4), where m is the number of constraints

  21. Road Map • Motivation • Related Work • Framing a Linear Programming problem • The simplex algorithm • The distributed simplex algorithm • Experimental Results • Conclusions and Directions for Future Work

  22. Experimental Results • Artificial data set • Simulated constraint matrices at each node • Used the Distributed Data Mining Toolkit (DDMT), developed at the University of Maryland, Baltimore County (UMBC), to simulate the network structure • Two metrics for evaluation: • TCC (Total Communication Cost in the network) • ACCN (Average Communication Cost per Node)

  23. Communication Cost • Average Communication Cost Per Node versus Number of Nodes in the network

  24. More Experimental Results • TCC versus the number of variables at each node • TCC versus the number of constraints at each node

  25. Conclusions and Future Work • Resource management and pattern recognition present formidable challenges on distributed systems • We present a distributed algorithm for resource management based on the simplex algorithm • We test the algorithm on simulated data Future Work • Incorporate the dynamics of the network • Test the algorithm on a real distributed network • Study the effect of the size and structure of the network on the mining results • Examine the trade-off between accuracy and communication cost incurred before and after using distributed simplex on a mining task such as classification or clustering

  26. Selected Bibliography • G. B. Dantzig, “Linear Programming and Extensions”, Princeton University Press, Princeton, NJ, 1963. • Kargupta and Chan, “Advances in Distributed and Parallel Knowledge Discovery”, AAAI Press, Menlo Park, CA, 2000. • A. L. Turinsky, “Balancing Cost and Accuracy in Distributed Data Mining”, Ph.D. Thesis, University of Illinois at Chicago, 2002. • Haimonti Dutta, “Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure”, Ph.D. Thesis, UMBC, 2007. • Mangasarian, “Mathematical Programming in Data Mining”, DMKD, Vol. 42, pp. 183–201, 1997.

  27. Questions?
