
myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Presentation Transcript


  1. myHadoop - Hadoop-on-Demand on Traditional HPC Resources Sriram Krishnan, Ph.D. sriram@sdsc.edu

  2. Acknowledgements • Mahidhar Tatineni • Chaitanya Baru • Jim Hayes • Shava Smallen

  3. Outline • Motivations • Technical Challenges • Implementation Details • Performance Evaluation

  4. Motivations • An open source tool for running Hadoop jobs on HPC resources • Easy to configure and use for the end-user • Plays nicely with existing batch systems on HPC resources • Why do we need such a tool? • End-users: I already have Hadoop code – and I only have access to regular HPC-style resources • Computer scientists: I want to study the implications of using Hadoop on HPC resources • And I don’t have root access to these resources

  5. Some Ground Rules • What this presentation is: • A “how-to” for running Hadoop jobs on HPC resources using myHadoop • A description of the performance implications of using myHadoop • What this presentation is not: • Propaganda for the use of Hadoop on HPC resources

  6. Main Challenges • Shared-nothing (Hadoop) versus HPC-style architectures • In terms of philosophies and implementation • Control and co-existence of Hadoop and HPC batch systems • Typically, both Hadoop and HPC batch systems (viz., SGE, PBS) need complete control over the resources for scheduling purposes

  7. Traditional HPC Architecture vs. Shared-Nothing (MapReduce-style) Architectures [diagram: Ethernet-connected compute/data cluster with local storage]

  8. Hadoop and HPC Batch Systems • Access to HPC resources is typically via batch systems – viz. PBS, SGE, Condor, etc. • These systems have complete control over the compute resources • Users typically can’t log in directly to the compute nodes (via ssh) to start various daemons • Hadoop manages its resources using its own set of daemons • NameNode & DataNode for the Hadoop Distributed File System (HDFS) • JobTracker & TaskTracker for MapReduce jobs • Hadoop daemons and batch systems can’t co-exist seamlessly • They will interfere with each other’s scheduling algorithms
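Because users cannot ssh to compute nodes, the Hadoop daemons have to be launched from inside the batch job itself, on whatever nodes the scheduler granted. A minimal sketch of how a bootstrap step might derive daemon roles from the scheduler's node list (this is illustrative, not the exact myHadoop code; in a real PBS job `$PBS_NODEFILE` is set by the scheduler, so it is mocked here to make the sketch runnable):

```shell
# Sketch: assign Hadoop daemon roles from the batch system's node list.
# In a real PBS job, $PBS_NODEFILE is provided by the scheduler; we mock
# it here so the sketch runs standalone.
PBS_NODEFILE="${PBS_NODEFILE:-$(mktemp)}"
[ -s "$PBS_NODEFILE" ] || printf 'node1\nnode2\nnode2\nnode3\n' > "$PBS_NODEFILE"

NODES=$(sort -u "$PBS_NODEFILE")              # unique nodes in this allocation
MASTER=$(printf '%s\n' "$NODES" | head -n 1)  # first node: NameNode + JobTracker

echo "master: $MASTER"
# remaining nodes: DataNode + TaskTracker
printf '%s\n' "$NODES" | tail -n +2 | sed 's/^/slave:  /'
```

The master/slave lists would then be written into the per-job Hadoop configuration before the daemons are started.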

  9. myHadoop Requirements • Enabling execution of Hadoop jobs on shared HPC resources via traditional batch systems • Working with a variety of batch systems (PBS, SGE, etc) • Allowing users to run Hadoop jobs without needing root-level access • Enabling multiple users to simultaneously execute Hadoop jobs on the shared resource • Allowing users to either run a fresh Hadoop instance each time (a), or store HDFS state for future runs (b)

  10. myHadoop Architecture [diagram: batch processing system (PBS, SGE) allocates compute nodes (steps 1–3); Hadoop daemons run on the allocated nodes; non-persistent mode (4a) uses node-local storage, persistent mode (4b) uses the parallel file system]

  11. Implementation Details: PBS, SGE

  12. User Workflow BOOTSTRAP TEARDOWN
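The bootstrap/teardown workflow above maps naturally onto a batch script: bootstrap the per-job Hadoop configuration, start the daemons, run the Hadoop job, then tear everything down. A rough sketch of such a PBS submission script follows; the myHadoop script names and flags approximate those in the myHadoop releases but should be checked against the version you install, and all paths are illustrative:

```shell
#!/bin/bash
#PBS -N myhadoop-demo
#PBS -l nodes=4:ppn=1,walltime=00:30:00

# Illustrative paths -- adjust to your installation
export HADOOP_HOME=$HOME/hadoop-0.20.2
export HADOOP_CONF_DIR=$HOME/myhadoop-conf

# BOOTSTRAP: generate per-job Hadoop configs for the allocated nodes
$MY_HADOOP_HOME/bin/pbs-configure.sh -n 4 -c $HADOOP_CONF_DIR

# Start the HDFS and MapReduce daemons on the allocated nodes
$HADOOP_HOME/bin/start-all.sh

# Run the actual Hadoop job
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR \
    jar $HADOOP_HOME/hadoop-0.20.2-examples.jar wordcount input output

# TEARDOWN: stop daemons and clean up (non-persistent mode discards HDFS)
$HADOOP_HOME/bin/stop-all.sh
$MY_HADOOP_HOME/bin/pbs-cleanup.sh -n 4
```

In persistent mode, the bootstrap step would additionally point the HDFS directories at the parallel file system so state survives across jobs.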

  13. Performance Evaluation • Goals and non-goals • Study the performance overheads and implications of myHadoop • Not to optimize/improve existing Hadoop code • Software and Hardware • Triton Compute Cluster (http://tritonresource.sdsc.edu/) • Triton Data Oasis (Lustre-based parallel file system) for data storage, and for HDFS in “persistent mode” • Apache Hadoop version 0.20.2 • Various parameters tuned for performance on Triton • Applications • Compute-intensive: HadoopBlast (Indiana University) • Modest-sized inputs – 128 query sequences (70K each) • Compared against the NR database – 200MB in size • Data-intensive: Data Selections (OpenTopography Facility at SDSC) • Input sizes from 1GB to 100GB • Sub-selecting around 10% of the entire dataset

  14. HadoopBlast

  15. Data Selections

  16. Related Work • Recipe for running Hadoop over PBS in the blogosphere • http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using-pbs.html • myHadoop is “inspired” by their approach – but is more general-purpose and configurable • Apache Hadoop On Demand (HOD) • http://hadoop.apache.org/common/docs/r0.17.0/hod.html • Supports only PBS, needs an external HDFS, is harder to use, and has trouble with multiple concurrent Hadoop instances • CloudBatch – a batch queuing system on clouds • Uses Hadoop to run batch systems like PBS • The exact opposite of our goals – but a similar approach

  17. Center for Large-Scale Data Systems Research (CLDS) [diagram: an industry-university consortium on software for large-scale data systems, with an Industry Advisory Board (student internships, joint collaborations), an Academic Advisory Board, and Visiting Fellows; projects include the “How Much Information?” information-metrology project (data growth, information management – public/private/personal), cloud storage architecture and performance benchmarking, systems development, and industry forums and professional education (industry interchange management, technical forums)]

  18. Summary • myHadoop – an open source tool for running Hadoop jobs on HPC resources • Without need for root-level access • Co-exists with traditional batch systems • Allows “persistent” and “non-persistent” modes to save HDFS state across runs • Tested on SDSC Triton, TeraGrid and UC Grid resources • More information • Software: https://sourceforge.net/projects/myhadoop/ • SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR-2011-2-Hadoop.pdf

  19. Questions? • Email me at sriram@sdsc.edu

  20. Appendix

  21. Configuration files: core-site.xml, hdfs-site.xml
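The configuration files shown on this slide are not reproduced in the transcript. A minimal sketch of what per-job, myHadoop-style configs might contain (the property names are standard Hadoop 0.20 keys; the host, port, and paths are illustrative placeholders that a bootstrap step would fill in):

```xml
<!-- core-site.xml: point clients at the per-job NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER_NODE:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/scratch/USERNAME/JOBID/hadoop-tmp</value> <!-- node-local scratch -->
  </property>
</configuration>

<!-- hdfs-site.xml: HDFS directories; persistent mode would point these at
     the parallel file system (e.g. Data Oasis) instead of local scratch -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/scratch/USERNAME/JOBID/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/scratch/USERNAME/JOBID/hdfs/data</value>
  </property>
</configuration>
```

Switching between non-persistent and persistent modes then amounts to regenerating these files with different directory values at bootstrap time.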

  22. Data SelectCounts on Dash
