A.R.M.S. Active Resource Management Services For Big Data Processing

Revised Presentation One

Outline
  • Title
  • Outline
  • Members
  • Mentor
  • Societal Issue
  • History
  • Dr. Li
  • Cluster Computing
  • Case Study
  • Accuracy
  • Current Major Functional Component Diagram
  • Current Process Flow
  • Problem Statement
  • Proposed Major Functional Component Diagram
  • Proposed Process Flow
  • Dinosolve Walkthrough
  • Dinosolve Issues
  • Software
  • Hardware
  • Solution Statement
  • Competition Identified
  • 508 Compliance
  • Objectives
  • Benefits of Solution
  • Conclusion
  • References
  • Appendix
Group Members and Roles
  • Scott Pardue (Team Leader)
  • Michael Rajs (Risk Manager)
  • Adam Willis (Algorithm Specialist)
  • Sybil Acotanza (Documentation Specialist)
  • Jordan Heinrichs (Database Designer)
  • David Crook (User Interface Designer)
Dr. Yaohang Li
  • Associate Professor in the Department of Computer Science at Old Dominion University.
  • Research interests include:
    • Computational Biology: applies computational simulation techniques to solve biological problems
    • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
    • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem

How do researchers handle the massive amounts of data they are collecting in order to benefit their research?


“Every day, [mankind] creates 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.” [1]

http://www-01.ibm.com/software/data/bigdata/


Data Management Examples

  • Large Hadron Collider [2]
    • 150 million sensors report 40 million times per second
  • Facebook, per day [3]
    • 2.5 billion content items shared
    • 2.7 billion “Likes”
    • 300 million photos uploaded
  • Walmart [2]
    • 1 million customer transactions every hour
    • 2.5 × 10^15 bytes (2.5 petabytes) of data

http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/

Dr. Li’s Research
  • Ideally, his research can be used to develop new protein-modeling programs. Computational approaches can be more efficient and less expensive than experiments run in lab settings by biologists, chemists and others
  • This could lead to the manufacture of additional drugs to fight conditions as varied as Alzheimer’s disease, cystic fibrosis and mad cow disease

http://diverseeducation.com/article/13348/

Dr. Li’s Grants
  • Dinosolve, his current project, is funded by a five-year, $400,000 CAREER Award from the National Science Foundation
  • Dr. Li has been the principal or co-principal investigator on research grants totaling more than $15.3 million
Big Data Analysis Hardware
  • Cluster Computing [4]
    • A cluster consists of many nodes (computers).
    • Big data can be generated and analyzed more quickly by spreading the workload among the nodes.
  • Head node
    • Logging data
    • Job submission
  • 3 computation nodes
    • 2 processors each
    • 4 execution slots per processor
    • 24 total execution slots

The head node packages data from the computation nodes and presents it in a readable format so that it is usable by the research community.
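
The slot count on this slide can be checked with a short calculation. The Python sketch below is only illustrative: the function names are hypothetical, and the second helper assumes, for simplicity, that every job occupies exactly one slot for one round of execution.

```python
# Cluster layout taken from the slide; the helper names are hypothetical.
COMPUTATION_NODES = 3
PROCESSORS_PER_NODE = 2
SLOTS_PER_PROCESSOR = 4


def total_execution_slots(nodes: int, processors: int, slots: int) -> int:
    """Number of jobs the cluster can run at the same time."""
    return nodes * processors * slots


def waves_needed(jobs: int, slots: int) -> int:
    """Rounds of execution needed if every job takes one slot for one round."""
    return -(-jobs // slots)  # ceiling division


slots = total_execution_slots(COMPUTATION_NODES, PROCESSORS_PER_NODE, SLOTS_PER_PROCESSOR)
print(slots)                     # 24
print(waves_needed(300, slots))  # 13
```

Under that simplifying assumption, 300 queued requests would need 13 rounds on 24 slots, which is why spreading and scheduling the workload matters.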

Managing the Cluster

Distributed Resource Management Systems (D-RMS)

  • Job management subsystem
  • Physical resource management subsystem
  • Scheduling and queuing subsystem
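
To make the three subsystems concrete, here is a toy Python model. It is a sketch under stated assumptions, not how Sun Grid Engine or any real D-RMS is implemented: the class name, the first-come-first-served policy, and the one-slot-per-job rule are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Toy model of the three D-RMS subsystems; not Sun Grid Engine itself."""
    total_slots: int
    pending: deque = field(default_factory=deque)  # scheduling and queuing
    running: set = field(default_factory=set)      # job management

    def free_slots(self) -> int:
        # Physical resource management: track how many slots are unused.
        return self.total_slots - len(self.running)

    def submit(self, job_id: str) -> None:
        self.pending.append(job_id)
        self._dispatch()

    def finish(self, job_id: str) -> None:
        self.running.remove(job_id)
        self._dispatch()

    def _dispatch(self) -> None:
        # First come, first served: start queued jobs while slots are free.
        while self.pending and self.free_slots() > 0:
            self.running.add(self.pending.popleft())


manager = ToyResourceManager(total_slots=24)
for n in range(300):
    manager.submit(f"job-{n}")
print(len(manager.running), len(manager.pending))  # 24 276
```

The point of the sketch is that excess requests wait in a queue instead of all running at once; each finished job frees a slot for the next one in line.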
Dr. Yaohang Li and Dinosolve
  • Dinosolve examines a protein's amino acid sequence and determines whether the protein can be manipulated by the addition of a disulfide bond
  • Each computational result enhances the prediction accuracies for future results

http://hpcr.cs.odu.edu/dinosolve/index.php

Dinosolve Case Study
  • Bioinformatics [7]
    • Disulfide bond prediction program
    • Disulfide bond creation is important to the research community
Dinosolve Users
  • Drug design
    • Pharmaceutical companies
  • Antibody design
    • To combat viruses
  • Bio-energy development
    • Creation of new fuels to replace diminishing fossil fuels
  • Genetic mapping [5]
    • Research to cure cancer, HIV, and other diseases
Accuracy of Popular Tools

More users choose Dinosolve because of its enhanced accuracy.

References 13, 14, and 15

What is the problem?
  • Processing big data sets is computationally expensive. As the volume of queries grows, performance degrades progressively until the system fails.
  • 300 simultaneous requests will cause the web server to crash

Working with Dinosolve

Input title

Input protein sequence

Input e-mail address

Submit, then wait for confirmation...

Protein Sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein
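
As an illustration of this input format, the Python sketch below checks that a submitted string uses only the one-letter codes of the 20 standard amino acids and lists the cysteine positions, since disulfide bonds form between cysteine residues. The function names are hypothetical, and enumerating candidate pairs is not Dinosolve's prediction method; it only shows what the input represents.

```python
from itertools import combinations

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Uppercase the input, drop whitespace, and reject unknown letters."""
    sequence = "".join(raw.split()).upper()
    unknown = sorted(set(sequence) - AMINO_ACIDS)
    if not sequence or unknown:
        raise ValueError(f"invalid protein sequence, unknown symbols: {unknown}")
    return sequence


def cysteine_pairs(sequence: str) -> list[tuple[int, int]]:
    """All pairs of cysteine positions (1-based) that could in principle bond."""
    positions = [i for i, residue in enumerate(sequence, start=1) if residue == "C"]
    return list(combinations(positions, 2))


seq = clean_sequence("mkc lvac\nGtC")
print(seq)                  # MKCLVACGTC
print(cysteine_pairs(seq))  # [(3, 7), (3, 10), (7, 10)]
```

A predictor's job is to decide which, if any, of these candidate pairs actually bond, which is the computationally expensive part.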


Working with Dinosolve

Confirmation of request

Now wait for results


Working with Dinosolve

Check your e-mail,

Click the link provided

The results are displayed

Dinosolve Issues

As Dinosolve continues to grow in popularity, the following problems are expected to occur:

  • Hard resources for computation
    • CPU cycles
    • Memory
    • Disk space
    • Network bandwidth
  • Server crashes

The goal is to prepare the system so that it can keep supporting the research community as the number of requests grows.

Software
  • Unix operating system installed on the Dinosolve cluster
  • Dinosolve algorithm
  • Sun Grid Engine, which will serve as our Distributed Resource Management System (D-RMS) on the cluster
  • MySQL (database software)
  • Web-based user interface (website)
Hardware
  • MySQL database server
  • A computer cluster to run the Dinosolve algorithm
  • Web server for web-based user interface
How will we correct the problem?

Configure a distributed resource management system
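
One way a web front end might hand each request to such a system is by submitting it as a batch job. The Python sketch below only assembles a Sun Grid Engine `qsub` command line and does not run it. The job name, script name, and input file name are hypothetical and do not come from the Dinosolve project; the flags shown (`-N`, `-cwd`, `-M`, `-m e`) are standard `qsub` options.

```python
import shlex


def build_qsub_command(job_name: str, email: str, script: str, sequence_file: str) -> list[str]:
    """Assemble a Sun Grid Engine qsub call; script and file names are hypothetical."""
    return [
        "qsub",
        "-N", job_name,  # job name shown in the queue
        "-cwd",          # run the job from the current working directory
        "-M", email,     # address to notify
        "-m", "e",       # send mail when the job ends
        script,
        sequence_file,
    ]


command = build_qsub_command(
    "dinosolve_42", "user@example.org", "run_prediction.sh", "request_42.fasta"
)
print(shlex.join(command))
```

Submitting through the queue lets the resource manager decide when each request runs, rather than starting every request the moment it arrives.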

Competing Distributed Resource Management Systems
  • Sun Grid Engine (SGE)
  • Portable Batch System (PBS)
  • Load Sharing Facility (LSF)
508 Compliance
  • Section 508 of the Rehabilitation Act of 1973, as amended in 1998
    • Requires federal agencies to make their electronic and information technology accessible to people with disabilities [32]
    • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]
Why is it important to be compliant?

If an entity wishes to receive government funding, any electronic form it uses must be 508 compliant.

Objectives
  • Interpret and visualize current usage statistics
  • Configure, utilize, and optimize the SGE
  • Deliver an aesthetically pleasing and professional user interface
What benefits will come from attaining the goals?
  • Efficient utilization of available resources
  • Increased throughput of the cluster
  • An intuitive and professional user interface
  • Rise in popularity due to excellent accuracy, efficiency, and professional design
Conclusion

With the updated user interface and correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server.

References for history

1.  http://www-01.ibm.com/software/data/bigdata/

2.  http://en.wikipedia.org/wiki/Big_data

3.  http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/

4.  http://en.wikipedia.org/wiki/Computer_cluster

References for case study

5.  Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471.

6.  Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium

7.  bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics

8.  Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/D-index.cfm?definition=disulfide_bond

9.  Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx.

10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/

References for competition

11. Arvind Krishna, “Why Big Data? Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF

12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)

13. Dr. Li’s site http://hpcr.cs.odu.edu/dinosolve/

14. Scratch Predictor http://scratch.proteomics.ics.uci.edu/

15. DiANNAserver http://clavius.bc.edu/~clotelab/DiANNA/

Portable Batch System (PBS)

16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf

17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1

18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf

19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf

20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf

21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf

Moab HPC Suite

22. http://www.adaptivecomputing.com/publication/420/wppa_open/

IBM Platform LSF

23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF

Apache Hadoop with Zookeeper

24. http://zookeeper.apache.org/doc/current/zookeeperOver.html

25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf

Reference for 508 Compliance

26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973

Appendix
  • Competition Matrix for Resource Management Systems
  • 508.22 Compliance Statistics for Dinosolve