Summer Internship Douglas Drobny Idaho National Laboratory High Performance Computing
Who I worked for • Idaho National Laboratory • Idaho Falls • High Performance Computing group • Manages ~4 different clusters • Supports and maintains software for large research projects • User Support group
Clusters • Fission • 12,512 processors • 25 TBytes of memory • Icestorm • 2048 processors • 4 TBytes of memory • Quark • Eos
Compute Manager • Current job submissions are made from the command line • Goals • Web interface for PBS Scheduler • Easy to use • Behaves the same as current job submissions • Improved error message handling
Setup • Application Services • On the server head nodes • Receive web requests • Submits Jobs • Compute Manager • On the web server • Creates web forms • Sends results to App. Services • Displays Results
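The flow above (web form in, job submission out) can be illustrated with a minimal Python sketch. It is not Compute Manager's actual code: the field names, limits, defaults, and the `build_qsub_command` helper are hypothetical, and the `qsub` options assume PBS Professional-style syntax (`-N`, `-l select=`, `-l walltime=`).

```python
import re

# Hypothetical form definition: field name -> (default value, validation pattern).
FORM_FIELDS = {
    "job_name": ("webjob", r"[A-Za-z][A-Za-z0-9_]{0,14}"),
    "nodes": ("1", r"[1-9][0-9]?"),
    "walltime": ("01:00:00", r"[0-9]{2}:[0-5][0-9]:[0-5][0-9]"),
}
MAX_NODES = 16  # assumed site limit, for illustration only


def build_qsub_command(form, script="job.sh"):
    """Turn submitted form values into a qsub argument list.

    Empty or missing fields fall back to their defaults; anything that
    fails validation is rejected before a job is ever submitted.
    """
    values = {}
    for name, (default, pattern) in FORM_FIELDS.items():
        value = (form.get(name) or "").strip() or default
        if not re.fullmatch(pattern, value):
            raise ValueError(f"invalid value for {name}: {value!r}")
        values[name] = value
    if int(values["nodes"]) > MAX_NODES:
        raise ValueError(f"nodes must be at most {MAX_NODES}")
    return [
        "qsub",
        "-N", values["job_name"],
        "-l", f"select={values['nodes']}",
        "-l", f"walltime={values['walltime']}",
        script,
    ]


if __name__ == "__main__":
    print(build_qsub_command({"nodes": "4"}))
```

Returning an argument list rather than a shell string keeps user input out of shell parsing, which matters when the values come from a web form.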
What I did • Installed Compute Manager and AIF on Eos • Created test cases for PBS features • Created test cases for user inputs • Submitted feedback / bug reports with PBS • Documented the process for future implementations / troubleshooting
Results • Good • Easy to create different application forms • Instant job monitoring • Restrict input values • Default input values • Secure file transfer
Results • Bad • Easy to put results in insecure location • Always copies the input files • Missing a form entry can result in lost output files • Spams the sudo log • “Fixed in next version (Week after I leave)”
Updating HPC Wiki • MoinMoin wiki (Python) • Upgraded from 1.8.8 to 1.9.4 • Used a temporary virtual machine to test the update and fix issues • Added support for viewing reports • Deployed on hpcweb • Note: Learn what type of service monitoring is being used before taking down a system.
Wiki Reports • Automatically generate a visual report of an XML document each month • Created the XSL • Putting data into charts • Automation ('Right' way vs. Working way) • Editing to reduce transcription errors • <script/>
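The reports themselves were produced with XSL. As a rough illustration of the same idea (XML data in, a chart out), here is a small Python sketch using only the standard library. The XML layout, element and attribute names, and the numbers are all hypothetical, not the real report format.

```python
import xml.etree.ElementTree as ET

# Hypothetical monthly report; the real XML schema is not shown in the slides.
SAMPLE_XML = """
<report month="example-month">
  <cluster name="A" jobs="10"/>
  <cluster name="B" jobs="5"/>
</report>
"""


def render_bar_chart(xml_text, width=20):
    """Render one text bar per <cluster>, scaled to the largest job count."""
    root = ET.fromstring(xml_text.strip())
    rows = [(c.get("name"), int(c.get("jobs"))) for c in root.findall("cluster")]
    largest = max((count for _, count in rows), default=0)
    lines = [f"Report for {root.get('month')}"]
    for name, count in rows:
        bar_len = round(width * count / largest) if largest else 0
        lines.append(f"{name:<10} {'#' * bar_len} {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_bar_chart(SAMPLE_XML))
```

Reading the values straight from the XML, rather than retyping them into a chart, is what removes the transcription step.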
Intel Compiler Issue (ICC) • Issue • Compile times on Quark are much longer than on Fission (head nodes) • Quark should be faster (hardware-wise) • 17 minutes on Quark • 8 minutes on Fission
Intel Compiler Steps • Create test cases • Determine affected systems • Enable debugging • Strace • Wireshark • Hardware Test Environment
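For the strace step, one way to see where the time goes is to run the compile under `strace -T` (which appends the time spent in each system call) and total the times per call. The sketch below assumes that output format; the sample lines, paths, and address are invented for illustration and are not from the actual investigation.

```python
import re
from collections import defaultdict

# Matches lines such as:  open("/path", O_RDONLY) = 3 <0.000045>
# with an optional "[pid N]" prefix, as printed by `strace -f -T`.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?"      # optional pid prefix
    r"(?P<call>\w+)\(.*\)"         # syscall name and arguments
    r"\s+=\s+.*"                   # return value (and errno text, if any)
    r"<(?P<secs>\d+\.\d+)>\s*$"    # time spent in the call, from -T
)


def time_per_syscall(strace_text):
    """Sum the -T durations for each syscall name; skip unparsable lines."""
    totals = defaultdict(float)
    for line in strace_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            totals[match.group("call")] += float(match.group("secs"))
    return dict(totals)


# Invented sample output, for illustration only.
SAMPLE = """\
open("/tmp/example.lic", O_RDONLY) = 3 <0.250000>
open("/tmp/missing.lic", O_RDONLY) = -1 ENOENT (No such file or directory) <0.250000>
connect(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("192.0.2.10")}, 16) = 0 <2.500000>
+++ exited with 0 +++
"""

if __name__ == "__main__":
    for call, secs in sorted(time_per_syscall(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(f"{call:<10} {secs:.3f}s")
```

Comparing these totals between a slow and a fast system points at the calls worth inspecting more closely, for example with Wireshark if network calls dominate.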
ICC Solution • License sources were checked in this order • License manager • User's home directory • /opt/intel • /apps/intel/..../license • 'Errors' in the license file cause the system to check all of the sources
ICC Solution • The /opt/intel license files pointed to the license manager • This caused additional requests to the license manager (each request takes time) • Quark's /opt/intel license files had the most entries pointing to the license servers • *Removed /opt/intel/license folder to fix the problem.
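The effect described above can be shown with a toy model in Python. This is a simplified sketch, not Intel's actual lookup logic: the `manager_requests` helper, the tuple layout, and the entry counts are hypothetical, chosen only to show why removing one source reduces the number of license manager requests.

```python
def manager_requests(sources, check_all):
    """Count license manager requests for one compiler invocation.

    sources:   ordered list of (name, manager_entries, satisfies) tuples,
               where manager_entries is how many entries in that source
               point at the license manager.
    check_all: True models the case where an 'error' in a license file
               forces every source to be checked.
    """
    requests = 0
    for _name, manager_entries, satisfies in sources:
        requests += manager_entries
        if satisfies and not check_all:
            break  # a valid license was found, stop looking
    return requests


# Hypothetical counts; the lookup order follows the slide.
BEFORE = [
    ("license manager", 1, True),
    ("user's home directory", 0, False),
    ("/opt/intel", 3, False),
    ("/apps/intel/..../license", 0, False),
]
# Same list with the /opt/intel source removed, as in the fix.
AFTER = [source for source in BEFORE if source[0] != "/opt/intel"]

if __name__ == "__main__":
    print("before fix:", manager_requests(BEFORE, check_all=True))
    print("after fix: ", manager_requests(AFTER, check_all=True))
```

In this model the extra requests only appear when every source is checked, which matches the observation that the slowdown depended on both the file 'errors' and the redundant /opt/intel entries.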
Things Learned • Python • XSL • Creating and Signing SSL Keys • Unix permissions • Strace • Testing • Refactoring • Monitoring • Vim!