
ARMS Active Resource Management Services For Big Data Processing


Presentation Transcript


  1. ARMS Active Resource Management Services For Big Data Processing Presentation Two

  2. Agenda • 1: Title • 2: Outline • 3: Members • 4: Mentor • 5-6: Societal Issue • 7: History • 8-9: Dr. Li • 10-11: Cluster Computing • 12-14: Case Study • 15: Accuracy • 16: Current Major Functional Component Diagram • 17: Current Process Flow • 18: Problem Statement • 19: Proposed Major Functional Component Diagram • 20: Proposed Process Flow • 21-24: Dinosolve Walkthrough • 25: Dinosolve Issues • 26: Software • 27: Hardware • 28: Solution Statement • 29: Competition Identified • 30-32: 508 Compliance • 33: Objectives • 34: Benefits of Solution • 35-41: Milestones • 42: Sitemap • 43: Database Schema • 44: Entity Relationship Diagram • 45: Risks • 46: Conclusion • 47-50: References • 51-54: Appendix

  3. Group Members and Roles • Scott Pardue (Team Leader) • Michael Rajs (Risk Manager) • Adam Willis (Algorithm Specialist) • Sybil Acotanza (Documentation Specialist) • Jordan Heinrichs (Database Designer) • David Crook (User Interface Designer)

  4. Dr. Yaohang Li • Associate Professor in the Department of Computer Science at Old Dominion University. • Research interests include: • Computational Biology: applies computational simulation techniques to solve biological problems • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem

  5. How do researchers manage the massive amounts of data they are collecting in order to benefit their research?

  6. “Every day, [mankind] creates 2.5 quintillion (2.5*10^18) bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” - IBM http://www-01.ibm.com/software/data/bigdata/

  7. Data Management Examples • Large Hadron Collider [2] • 150 million sensors report 40 million times per second • Watson on Jeopardy • 200 million pages • Structured and unstructured • 4 terabytes of information • DinoSolve Protein Prediction Server • Proteins are chains of one or more amino acids • 20 different amino acids • A protein made up of 5 amino acids therefore has 20^5 = 3,200,000 possible sequences
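The 20^5 figure on this slide follows from simple counting: each position in the chain can hold any of the 20 amino acids. A minimal sketch of that arithmetic is below; the function name `count_sequences` is hypothetical and is not part of DinoSolve.

```python
def count_sequences(length: int, alphabet_size: int = 20) -> int:
    """Count distinct chains of `length` residues drawn from `alphabet_size` amino acids."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Each of the `length` positions is an independent choice among `alphabet_size` letters.
    return alphabet_size ** length


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"{n:>2} residues: {count_sequences(n):,} possible sequences")
```

The count grows by a factor of 20 with every added residue, which is why exhaustive enumeration is not practical for realistic protein lengths.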

  8. Big Data Analysis Hardware • Scheduling and Queuing Subsystem • Physical Resource Management Subsystem • Job Management Subsystem

  9. Dr. Li’s Cluster Configuration

  10. Dinosolve Issues As Dinosolve continues to grow in popularity, the following issues are expected: • Limited hardware resources for computation • CPU cycles • Memory • Disk space • Network bandwidth • Server crashes The goal is to prepare the system to keep supporting the research community as the number of requests grows, and to improve the design of the user interface

  11. Job Management Subsystem

  12. Physical resource management

  13. Scheduling and Queueing

  14. Dr. Li’s Grants • DinoSolve • Funded by a five-year, $400,000 CAREER Award from the National Science Foundation • Dr. Li • Principal or co-principal investigator • Research grants totaling more than $15.3 million

  15. Dr. Yaohang Li and Dinosolve • Dinosolve examines a protein's sequence of amino acids and determines whether the protein can be manipulated by the addition of a disulfide bond • Each computational result improves the prediction accuracy of future results • 40^20 (more than 10^32) possible combinations for even the shortest sequence

  16. What is the problem? 300 simultaneous requests will cause the web server to crash

  17. Dinosolve Case Study • Bioinformatics [7] • Disulfide bond prediction program • Disulfide bond creation is important to the research community • http://www.merriam-webster.com/dictionary/bioinformatics

  18. Dinosolve Users • Drug design • Pharmaceutical companies • Antibody design • To combat viruses • Bio-energy development • Creation of new fuels to replace diminishing fossil fuels • Genetic mapping [5] • Research to cure cancer, HIV, and other diseases

  19. Accuracy of Popular Tools More users use Dinosolve because of the enhanced accuracy

  20. Current Major Functional Component Diagram

  21. Current Process Flow

  22. RWP Major Functional Component Diagram

  23. RWP Process Flow

  24. Objectives • Configure, utilize, and optimize the SGE • Provide an aesthetically pleasing and professional user interface • Achieve 508 compliance • Improve the existing database schema and add user accounts

  25. Benefits from Goals • Efficient utilization of available resources and increased throughput of the cluster • Professional user interface leading to a rise in popularity • Accessibility • Security and efficient access of previous submissions

  26. User interface will be improved to be more aesthetically pleasing

  27. Working with Dinosolve • Input title • Input protein sequence • Input e-mail address • Submit, then wait for confirmation... Protein sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein
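A submission form like the one on this slide would typically check the protein sequence before queuing a job. The sketch below assumes the input is restricted to the 20 standard one-letter amino acid codes; whether Dinosolve also accepts ambiguity codes such as X, B, or Z is not stated in the slides, and the function name `normalize_sequence` is hypothetical.

```python
# One-letter codes of the 20 standard amino acids (assumed input alphabet).
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(raw: str) -> str:
    """Strip whitespace, upper-case the input, and reject non-amino-acid letters."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("protein sequence is empty")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"invalid residue codes: {', '.join(invalid)}")
    return sequence


if __name__ == "__main__":
    print(normalize_sequence("mkt ayia\nkqr"))  # whitespace and case are normalized
    try:
        normalize_sequence("MKTB123")
    except ValueError as err:
        print(f"rejected: {err}")
```

Rejecting malformed input at the form keeps obviously invalid jobs from consuming execution slots on the cluster.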

  28. Working with Dinosolve • Confirmation of request • Now wait for results

  29. Working with Dinosolve • Check your e-mail • Click the link provided • The results are displayed

  30. Why is it important to be compliant? If an entity wishes to receive government funding, any electronic form it uses must be 508 compliant

  31. 508 Compliance • 1998 amendment to the Rehabilitation Act of 1973 • Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32] • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32] • http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973

  32. Compliance of Popular Tools

  33. Milestones

  34. Three Computational Nodes • Each processor has four execution slots

  35. Processors • Each computational node has two processors • 6 processors yield 24 execution slots
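The slot count on these two slides is 3 nodes × 2 processors × 4 slots = 24. The sketch below works through that arithmetic and shows how a burst of 300 requests (the figure from the problem statement) would split into running and queued jobs. It assumes one job per execution slot and jobs of equal duration, which the slides do not state; `ClusterLayout` and `queue_snapshot` are hypothetical names, not SGE APIs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Cluster dimensions taken from the slides."""
    nodes: int = 3
    processors_per_node: int = 2
    slots_per_processor: int = 4

    @property
    def total_slots(self) -> int:
        return self.nodes * self.processors_per_node * self.slots_per_processor


def queue_snapshot(pending_jobs: int, layout: ClusterLayout) -> tuple[int, int]:
    """Return (running, waiting), assuming one job occupies exactly one slot."""
    running = min(pending_jobs, layout.total_slots)
    return running, pending_jobs - running


if __name__ == "__main__":
    layout = ClusterLayout()
    running, waiting = queue_snapshot(300, layout)
    # With equal-length jobs, the backlog drains in ceil(jobs / slots) rounds.
    rounds = math.ceil(300 / layout.total_slots)
    print(f"slots={layout.total_slots} running={running} waiting={waiting} rounds={rounds}")
```

Under these assumptions a scheduler holds 276 of the 300 requests in the queue instead of starting them all at once, which is the behavior the proposed configuration is meant to provide.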

  36. Software Milestones

  37. Testing Milestones • Cluster Performance • Stress testing • Prevention of denial-of-service attacks • Database Performance • Stress testing • Prevention of SQL injection attacks against the MySQL database
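The usual defense against injection attacks is to pass user input as bound parameters instead of concatenating it into the SQL text. The sketch below illustrates the idea with Python's built-in `sqlite3` module as a stand-in for the MySQL database; the `submissions` table and its columns are hypothetical and are not taken from the Dinosolve schema. MySQL drivers for Python generally use a different placeholder style than the `?` shown here.

```python
import sqlite3


def find_submissions(conn: sqlite3.Connection, email: str) -> list[str]:
    """Look up submission titles by e-mail using a bound parameter."""
    # The driver sends `email` as data, so quotes inside it cannot change the query.
    cursor = conn.execute(
        "SELECT title FROM submissions WHERE email = ? ORDER BY id",
        (email,),
    )
    return [row[0] for row in cursor.fetchall()]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical submissions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions (id INTEGER PRIMARY KEY, email TEXT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO submissions (email, title) VALUES (?, ?)",
        [("a@example.org", "run 1"), ("b@example.org", "run 2")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(find_submissions(db, "a@example.org"))
    # A classic injection string is treated as a literal e-mail and matches nothing.
    print(find_submissions(db, "' OR '1'='1"))
```

An injection test for the milestone could submit strings like the second one through every form field and confirm that no extra rows are returned.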

  38. Complete Milestone Tree

  39. Sitemap

  40. Database Schema

  41. Entity Relationship

  42. Risks (risk matrix: Probability vs. Impact) • T1: Larger volumes of queries could cause slower processing speeds, which may be a result of limited hardware capacity • T2: Improper synchronization of cluster resources could lead to a deadlock • T3: Race conditions between the HPCR cluster and the MySQL database • T4: A local attacker could exploit these vulnerabilities and cause a crash or execute arbitrary code on the system • C1: Users may not like the new design • C2: SGE does not enforce exclusive access to the reserved processors

  43. Technical Risks and Mitigations • T1: Larger volumes of queries could cause slower processing speeds, which may be a result of limited hardware capacity. Probability: 1. Impact: 5. Mitigation: Create indexes; use specialized data structures and aggregate tables. • T2: Improper synchronization of cluster resources can lead to a deadlock. Probability: 2. Impact: 4. Mitigation: Modify and read application data. Alter execution logic and basic software configuration of SGE.

  44. Technical Risks and Mitigations • T3: Race conditions between the HPCR cluster and the MySQL database. Probability: 3. Impact: 3. Mitigation: Use software control on the SGE. • T4: A local attacker could exploit these vulnerabilities and cause a crash or execute arbitrary code on the system. Probability: 2. Impact: 2. Mitigation: Keep virus protection up to date. Use very specific types of passwords. Keep scripts current, since attackers look for outdated scripts that are likely to contain vulnerabilities. Limit access to certain files.

  45. Risks • C1: Users may not like the new design. Probability: 3. Impact: 3. Mitigation: Create a new, more aesthetically pleasing design. • C2: SGE does not enforce exclusive access to the reserved processors. Probability: 4. Impact: 4. Mitigation: Qsub and knowledge of node memory capacity.

  46. With the updated user interface and correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server

  47. References for history • http://www-01.ibm.com/software/data/bigdata/ • http://en.wikipedia.org/wiki/Big_data • http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/ • http://en.wikipedia.org/wiki/Computer_cluster

  48. References for case study 5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471. 6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium. 7. bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics 8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/D-index.cfm?definition=disulfide_bond 9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx. 10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/

  49. References for competition 11. Arvind Krishna, “Why Big Data? Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF 12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf) 13. Dr. Li’s site http://hpcr.cs.odu.edu/dinosolve/ 14. Scratch Predictor http://scratch.proteomics.ics.uci.edu/ 15. DiANNA server http://clavius.bc.edu/~clotelab/DiANNA/ • Portable Batch System (PBS): 16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf 17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1 18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf 19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf 20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf 21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf • Moab HPC Suite: 22. http://www.adaptivecomputing.com/publication/420/wppa_open/ • IBM Platform LSF: 23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF • Apache Hadoop with Zookeeper: 24. http://zookeeper.apache.org/doc/current/zookeeperOver.html 25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf

  50. Reference for 508 Compliance 26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
