
Scaling Up Classifiers to Cloud Computers

Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V. Chawla (University of Notre Dame)


Presentation Transcript


  1. Scaling Up Classifiers to Cloud Computers Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V. Chawla University of Notre Dame

  2. Distributed Data Mining • Data Mining on Clouds • Abstraction for Distributed Data Mining • Implementing the Abstraction • Evaluating the Abstraction • Take-aways

  3. Distributed Data Mining For training set D, test set T, and classifier F: divide D into N partitions with partitioner P; run N copies of F, one on each partition, generating a set of votes on T for each partition; collect the votes from all copies of F and combine them into a final result R.
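The procedure on this slide can be sketched sequentially in a few lines of Python. The names partition, train_and_vote, and combine are illustrative, not the framework's API, and in the real system each train_and_vote call would run as a separate distributed job:

```python
# Sketch of the distributed data-mining abstraction (illustrative names):
# divide D into N partitions, train one copy of classifier F per partition,
# collect each copy's votes on test set T, and combine them into result R.
from collections import Counter

def partition(D, N):
    # A simple round-robin partitioner P.
    parts = [[] for _ in range(N)]
    for i, instance in enumerate(D):
        parts[i % N].append(instance)
    return parts

def train_and_vote(F, part, T):
    model = F(part)                   # train one copy of F on this partition
    return [model(t) for t in T]      # one vote per test instance

def combine(vote_sets):
    # Majority vote across all copies of F.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*vote_sets)]

# Toy classifier: predicts the majority label seen in its own partition.
def majority_classifier(part):
    label = Counter(lbl for _, lbl in part).most_common(1)[0][0]
    return lambda t: label

D = [(x, x % 2) for x in range(100)]  # (feature, label) pairs
T = [1, 2, 3]
votes = [train_and_vote(majority_classifier, p, T) for p in partition(D, 4)]
R = combine(votes)
```

In the actual system the loop over partitions is what gets farmed out to cloud nodes; only the partitioner and the combiner see all of the data.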

  4. Challenges in Distributed DM • When dealing with large amounts of data (MB to GB to TB), there are systems problems in addition to data mining problems. • Why should data miners have to be distributed systems experts too? • Scalable (in terms of data size and number of resources) distributed data mining architectures tend to be finely tailored to an application and algorithm.

  5. Proposed Solution • An abstraction framework for distributed data mining • An abstraction allows users to declare a distributed workload based on only what they know (sequential programs, data) • Why an abstraction? • Abstractions hide many complexities from users • Unlike a specially-tailored implementation, a conceptual abstraction provides a general-purpose solution for a problem which may be implemented in any of several ways depending on requirements.

  6. Clusters versus Cloud Computers • Clusters: small (4-16 nodes) to very large; shared, often centralized filesystem; dedicated resources assigned, often in large blocks; often static and generally homogeneous; managed by a batch or grid engine. • Cloud computers: large (~500 CPUs, ~300 disks @ ND); individual disks rather than a central filesystem; resources assigned dynamically, without a guarantee of dedicated access; commodity, dynamic, and heterogeneous; managed by a batch or grid engine.

  7. Implementing the Abstraction There are several factors to consider: How many nodes to use for computation? How many nodes to use for data? How to connect the data and computation nodes?

  8. Streaming Each process is connected via a data stream. Data exist only in buffers in memory, and stream writers block until stream readers have consumed the buffer. Requires every stage to run in parallel to complete; not robust to failure.
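The blocking-buffer behavior that defines streaming can be sketched with a bounded queue and two threads. This is illustrative only; the real system connects separate processes with data streams rather than threads in one process:

```python
# Minimal sketch of the streaming model: the partitioner writes instances
# into a bounded in-memory buffer and blocks until the downstream stage has
# consumed them, so all stages must run concurrently, and one stalled or
# failed stage stalls the whole pipeline.
import queue
import threading

buf = queue.Queue(maxsize=4)   # data exist only in this in-memory buffer
SENTINEL = None

def writer(instances):
    for inst in instances:
        buf.put(inst)          # blocks whenever the buffer is full
    buf.put(SENTINEL)

def reader(out):
    while (inst := buf.get()) is not SENTINEL:
        out.append(inst)       # stand-in for classifying the instance

out = []
t1 = threading.Thread(target=writer, args=(range(100),))
t2 = threading.Thread(target=reader, args=(out,))
t1.start(); t2.start(); t1.join(); t2.join()
```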

  9. Pull Partitioning is done ahead of computation, and the partitions are stored on the source node. Computation jobs pull in the proper partition from the source node. Flexible and robust to failure, but not scalable to a large number of computation nodes.
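A minimal sketch of Pull, with hypothetical names; in the real system the Condor matchmaker places the jobs and each job transfers its partition from the source node when it starts:

```python
# Minimal sketch of the Pull strategy (names are hypothetical): partitions
# are created ahead of computation and stay on the source node; each job,
# wherever the batch system schedules it, pulls its partition at start-up.
SOURCE_NODE = {f"P{i}": f"<bytes of P{i}>" for i in range(1, 5)}

def run_job(part_name):
    data = SOURCE_NODE[part_name]      # transfer from the source node at job start
    return f"votes from {part_name}"   # stand-in for training and voting

# Any host can run any job, so a failed job is simply rerun elsewhere, but
# every transfer goes through the single source node, which limits scaling.
results = [run_job(p) for p in sorted(SOURCE_NODE)]
```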

  10. Pull [Diagram: computation jobs, scheduled by the Condor matchmaker, pull partitions P1-P4 from the .data source node]

  11. Push Work assignments are made ahead of partitioning, and partitioning distributes data to where it will be used. Data are accessed locally where possible, or accessed in place remotely. This improves scalability to larger numbers of computation nodes, but can decrease flexibility and increase reliance on unreliable nodes.
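A minimal sketch of Push, with hypothetical node and path names:

```python
# Minimal sketch of the Push strategy (names are hypothetical): work is
# assigned before partitioning, and the partitioner writes each partition
# directly to the node that will consume it, so most reads are local.
assignment = {"P1": "node-1", "P2": "node-2", "P3": "node-3", "P4": "node-4"}

def push_partitions(assignment):
    # The partitioner places each partition on its assigned node's disk.
    return {part: f"{node}:/local/{part}" for part, node in assignment.items()}

placed = push_partitions(assignment)
# node-2 now reads P2 from its own disk; if node-2 fails, P2 must be
# re-pushed, which is the flexibility cost the slide describes.
```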

  12. Push [Diagram: the partitioner pushes partitions P1-P4 from the .data source directly to the computation nodes chosen by the Condor matchmaker]

  13. Hybrid Push to a well-known set of intermediate nodes, then pull from those nodes. This combines the advantages of Pull (flexibility, reliability) and Push (I/O performance).
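The two-stage idea can be sketched as follows (names are hypothetical): the partitioner pushes partitions to a small, well-known set of intermediate servers, and computation jobs then pull from whichever server holds their partition:

```python
# Minimal sketch of the Hybrid strategy: push partitions round-robin onto a
# few intermediate servers, then let each job pull from its server.
def push(partitions, servers):
    # Spread partitions round-robin across the intermediate servers.
    return {part: servers[i % len(servers)] for i, part in enumerate(partitions)}

def pull(part, placement):
    return f"fetched {part} from {placement[part]}"

parts = ["P1", "P2", "P3", "P4"]
placement = push(parts, ["server-a", "server-b"])
# The source node only talks to two servers (Push-like I/O performance),
# while any worker can still fail and be rescheduled (Pull-like robustness).
fetched = [pull(p, placement) for p in parts]
```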

  14. Hybrid [Diagram: partitions P1-P4 are pushed from the .data source to intermediate servers, from which Condor-matched computation jobs pull them]

  15. Implementing the Abstraction The effectiveness of these strategies hinges on the flexibility, reliability, and performance of their components. One such component is the partitioning algorithm.

  16. Partitioning Algorithms • Shuffle: copy one instance at a time from the training data into successive partitions. • Chop: fill one partition at a time, copying a contiguous block of instances from the training data.
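Both partitioners can be sketched in Python (illustrative only; the real implementations stream instances from disk rather than slicing in-memory lists):

```python
# Shuffle deals one instance at a time round-robin across partitions;
# Chop copies contiguous blocks, one partition at a time. Both yield N
# partitions; they differ in their access pattern over the training data.
def shuffle(data, n):
    parts = [[] for _ in range(n)]
    for i, inst in enumerate(data):
        parts[i % n].append(inst)       # one instance at a time, round-robin
    return parts

def chop(data, n):
    size = -(-len(data) // n)           # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]

data = list("ABCDEFGHIJKL")
# shuffle(data, 4) -> AEI, BFJ, CGK, DHL (as on slide 17)
# chop(data, 4)    -> ABC, DEF, GHI, JKL (as on slide 18)
```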

  17. Shuffle [Diagram: instances A-L dealt round-robin into partitions {A, E, I}, {B, F, J}, {C, G, K}, {D, H, L}]

  18. Chop [Diagram: instances A-L split into contiguous partitions {A, B, C}, {D, E, F}, {G, H, I}, {J, K, L}]

  19. [Chart: partitioning a 5.4 GB data set. Local partitioning (Locals) uses fgets and fprintf; remote partitioning to 16 servers (R16s) uses fgets and chirp_stream_write within the sc0 cluster.]

  20. Partitioning Conclusions • Remote partitioning is faster, but less reliable, than local partitioning • Shuffle is slower locally and to a small number of remote hosts but scales better to a large number of remote hosts • Shuffle is less robust than Chop for large data sets

  21. Evaluating the Architectures Evaluation is based on performance and scalability. The classifier algorithms were decision trees, K-nearest neighbors, and support vector machines.

  22. Protein Data Set (3.3M instances, 170MB), Using Decision Trees

  23. KDDCup Data Set (4.9M instances, 700MB), Using Decision Trees

  24. Alpha Data Set (400K instances, 1.8GB), Using KNN

  25. System Architectures • Push • Fastest (remote partitioning, mainly local access, etc.) • 1-to-1 matching or heavy preference. • Could use pure 1-to-1 matching, but it is more fragile. • Pull • Slowest (local partitioning, transfer at job start) • Most robust (central data; “any” host can run jobs) • Hybrid • Combination: push to a subset of nodes, then pull. • Faster than Pull (remote partitioning, multiple servers) • More robust than Push (small set of servers)

  26. Future Work • Performance vs. accuracy for long-tail jobs • Is there a viable tradeoff between turnaround time and degraded classification accuracy? • Efficient data management on multicores • Hierarchical abstraction framework • Submit jobs to clouds of subnets of multicores

  27. Conclusions • Hybrid method is amenable to both cluster-like environments and larger, more-diverse clouds, and its use of intermediate data servers mitigates some of shuffle’s problems. • A fundamental limit of scalability is the available memory on each workstation. For our largest sets, even 16 nodes were not sufficient to run effectively.

  28. Questions? • Data Analysis and Inference Laboratory • Karsten Steinhaeuser (ksteinha@cse.nd.edu) • Nitesh V. Chawla (nchawla@cse.nd.edu) • Cooperative Computing Laboratory • Christopher Moretti (cmoretti@cse.nd.edu) • Douglas Thain (dthain@cse.nd.edu) • Acknowledgements: • NSF CNS-06-43229, CCF-06-21434, CNS-07-20813
