
Introduction to Openmind

Learn about the basics of high-performance computing (HPC) and Openmind, a powerful computer cluster at MIT. Explore Linux commands, storage, software usage, and job management. Discover the advantages of parallel computing and the structure of a computer cluster.

Presentation Transcript


  1. Introduction to Openmind Shaohao Chen Department of Brain and Cognitive Sciences (BCS) Massachusetts Institute of Technology (MIT)

  2. Outline • Introduction to high performance computing (HPC) • Basic facts of Openmind (OM) • Basic Linux: bash shell commands, text editors • Storage, File transfer and GUI • Using software (hands on) • Basic job management (hands on) • Other computing resources

  3. What is HPC? • High Performance Computing (HPC) refers to the practice of aggregating computing power in order to solve large problems in science, engineering, or business. • The purpose of HPC: accelerate computer programs, and thus accelerate the overall work process. • Computer cluster: a set of connected computers that work together. The computers are connected by a high-speed network and can be viewed as a single system. • Similar terms: supercomputing, parallel computing. • Parallel computing: many computations are carried out simultaneously, typically on a computer cluster. • Related term: cloud computing (loosely coupled resources, typically accessed through a web-browser interface rather than a command-line interface).

  4. Computing power of CPU • Moore's law is the observation that the number of transistors on a chip (and hence CPU computing power) doubles approximately every two years. • Nowadays the multi-core design is the key to keeping up with Moore's law. • The figure is from Wikipedia.

  5. Why computer cluster? • Drawbacks of increasing CPU clock frequency: --- Electric power consumption is proportional to the cube of the CPU clock frequency (ν³). --- More heat is generated. • A drawback of increasing the number of cores within one CPU chip: --- Heat dissipation becomes difficult. • Computer cluster: connect many computers with high-speed networks. • Currently a computer cluster is the best solution for scaling up computing power. • Consequently, software/programs need to be designed for parallel computing.

  6. Basic structure of a computer cluster • Cluster – a collection of many computers/nodes. • Rack – a cabinet that holds a group of nodes. • Node – a computer (with processors, memory, hard drives, etc.). • Socket/processor – one multi-core processor. • Core – one processing unit within a processor. • Hyperthreaded core – a virtual (logical) core. • Network switch • Storage system (shared by all nodes) • Power supply system • Cooling system Picture: Computer Clusters at MGHPCC

  7. Inside a node • CPU (e.g. multi-core processors): carries out the instructions of programs; includes built-in cache (fast memory). • Memory (RAM): fast but temporary storage, to hold data for immediate use. • Hard drives (magnetic or solid-state): relatively slow but permanent storage, to keep data long term. • Network devices (e.g. Ethernet, InfiniBand, Omni-Path): to transfer data between nodes or between sites. • Accelerator (e.g. GPU): to accelerate programs with parallel computing.

  8. Graphics Processing Unit (GPU) • A GPU is a device attached to a CPU-based system. • Many computer programs can be parallelized and thus accelerated on a GPU. • CPU memory and GPU memory are separate, so data transfer between CPU and GPU (via PCIe) is needed. • NVLink: high-speed data communication between GPUs.

  9. How to measure computer performance? • Floating-point operations per second (FLOPS): FLOPS = (number of nodes) × (cores per node) × (clock cycles per second) × (FLOPs per cycle). • The 3rd term, clock cycles per second, is known as the clock frequency, typically 2 ~ 4 GHz. • The 4th term, FLOPs per cycle, is how many floating-point operations are done in one clock cycle. For typical Intel Xeon CPUs: 16 DP FLOPs/cycle, 32 SP FLOPs/cycle. • GigaFLOPS – 10^9 FLOPS (a multicore CPU ~ hundreds of GFLOPS). • TeraFLOPS – 10^12 FLOPS (a GPU ~ several TFLOPS). • PetaFLOPS – 10^15 FLOPS (a computer cluster ~ 1-100 PFLOPS). • ExaFLOPS – 10^18 FLOPS (the next generation of computer clusters).
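A rough worked example with illustrative numbers: a single 20-core Xeon CPU at 2.5 GHz with 16 DP FLOPs per cycle has a theoretical peak of 20 × 2.5×10^9 × 16 = 800 × 10^9 FLOPS, i.e. 800 GigaFLOPS in double precision, consistent with "hundreds of GFLOPS" per multicore CPU.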

  10. Computing power grows rapidly • iPhone 4 vs. the 1985 Cray-2 supercomputer

  11. The Top 500 List • The list of June 2019

  12. Parallel Computing • Parallel computing is a type of computation in which many calculations are carried out simultaneously, based on the principle that large problems can often be divided into smaller ones, which are then solved at the same time. • Speedup of a parallel program (Amdahl's law): S(p) = 1 / (α + (1 − α)/p), where p is the number of processors/cores and α is the fraction of the program that is serial. • The figure is from: https://en.wikipedia.org/wiki/Parallel_computing
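For example, if α = 0.1 (10% of the program is serial) and p = 16 cores, the speedup is S = 1 / (0.1 + 0.9/16) = 6.4, well below the ideal factor of 16; the serial fraction limits the maximum speedup to 1/α = 10 no matter how many cores are used.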

  13. Distributed or shared memory systems • Shared memory system • A single node on a cluster • Open Multi-processing (OpenMP) or MPI • Distributed memory system • Multiple nodes on a cluster • Message Passing Interface (MPI) • Figures are from the book Using OpenMP: Portable Shared Memory Parallel Programming

  14. Openmind for BCS at MIT • An HPC cluster for BCS at MIT. • Department contribution + research group contributions. • Total resources: 72 compute nodes, 1620 CPU cores, ~220 GPUs, ~900 TB Lustre storage, ~1.3 PB NFS storage. • Located at the MGHPCC in Holyoke, MA, and connected to the MIT campus via high-bandwidth Ethernet. Map: MIT BCS @ Cambridge, MGHPCC @ Holyoke.

  15. Structure of Openmind • Diagram: a user terminal connects to the head node (openmind7); the head node and the compute nodes (node001 … node072) are linked by a network switch and a fast network switch, with shared access to the Lustre and NFS storage. • Green: CPU + GPU nodes • Blue: CPU-only nodes • Red: CPU-only, large-memory nodes

  16. Compute resources on Openmind One head node: 16 CPU cores (32 hyperthreaded cores), 128 GB memory. 72 compute nodes: • 59 CPU + GPU nodes: 20 or 24 CPU cores (40 or 48 hyperthreaded cores); 256 or 512 GB RAM; 2-8 GPUs per node. GPU types: tesla-k20, tesla-k40, tesla-k80, titan-black, titan-x, GEFORCEGTX1080TI, GEFORCERTX2080. • 13 CPU-only nodes: 20 CPU cores (40 hyperthreaded cores), 256 GB RAM. • One big-memory node: 32 CPU cores (64 hyperthreaded cores), 1 TB RAM. Storage: • Lustre: 452 TB shared, 452 TB local • NFS: ~1,300 TB shared

  17. Get started on Openmind (OM) • User guide on MIT GitHub: https://github.mit.edu/MGHPCC/openmind/wiki/Cookbook:-Getting-started • Log in to MIT GitHub using your MIT Kerberos account. • A research group has to enroll for membership to use OM. • Every user in an enrolled group can request an OM account by submitting an issue here: https://github.mit.edu/MGHPCC/openmind/issues • The OM user name is the same as your MIT Kerberos account name. • Use a terminal on macOS or MobaXterm on Windows to log in to OM: ssh user@openmind7.mit.edu • A new user needs to complete the online training course at MIT Atlas to remove the initial limit on submitting jobs.

  18. Graphical User Interface (GUI) • On macOS, install XQuartz, then log in to OM with X-forwarding: ssh -Y user@openmind7.mit.edu • On Windows, install MobaXterm; X-forwarding is supported automatically at login. • Remote desktop: software GUIs (e.g. Matlab, Jupyter notebook) perform much better in a remote desktop than in a terminal with X-forwarding. • Xfast (preferred): https://openmind.mit.edu:3443/auth/ssh • X2go (less preferred). • To run time-consuming jobs in a GUI: 1. Open a terminal in the remote desktop; 2. Request an interactive session; 3. Open the software GUI; 4. Execute the commands.

  19. Basic Linux Shell Commands • Operating system on OM: CentOS 7 (a Linux distribution) • Basic bash shell commands: ls # list files/directories ls -l # listing with details ls -a # list all files (including hidden) cd /path/to/dir # change directory to dir cd # change directory to home mkdir dir # create a directory rm file # delete a file rm -r dir # delete directory dir cp file1 file2 # copy file1 to file2 cp -r dir1 dir2 # copy dir1 to dir2 mv file1 file2 # rename file1 to file2 pwd # show current directory groups # check group names cat file # print file head -n 10 file # print the first 10 lines of file tail -n 10 file # print the last 10 lines of file less file # show a file in one screen echo "hello" # print the string hello grep pattern file # search for pattern in file which app # show which app will be run Ctrl + c # halt the current command man command # show the manual for command

  20. Linux Text Editors • Easy to start with: gedit, nano • Most recommended (with a steeper learning curve): vim, emacs • Open vim: vim filename • Basic usage of vim: Ctrl + d # move forward 1/2 a screen Ctrl + u # move back 1/2 a screen gg # go to the first line G # go to the last line yy # copy a line dd # delete (cut) a line 10dd # delete (cut) ten lines p # paste i # insert before the cursor a # insert (append) after the cursor A # insert (append) at the end of the line Esc # exit the current mode :wq # save and quit :q # quit (fails if there are unsaved changes) :q! # quit without saving /pattern # search for pattern

  21. Bash script • A Bash script is a plain text file which contains a series of bash commands. • Environment variables: export PATH=~/bin:$PATH export LD_LIBRARY_PATH=~/lib:$LD_LIBRARY_PATH export PYTHONPATH=~/lib/python2.7/site-packages:$PYTHONPATH echo $PATH • Source vs. execution (see the sketch below): ./setenv.sh source setenv.sh • Customize the environment at login: set variables in the ~/.bashrc file. • Loops: for i in `seq 10` # Loop from 1 to 10 do echo $i # Print the loop index done files=./* # All files in current directory for file in $files # Loop over all files do cat $file # Print the file done
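A minimal sketch of the source-vs.-execution difference, assuming setenv.sh contains the single line export MYVAR=hello (MYVAR is a made-up variable for illustration):
./setenv.sh # runs in a child shell; MYVAR is not set in your current shell
echo $MYVAR # prints an empty line
source setenv.sh # runs in your current shell; MYVAR is now set
echo $MYVAR # prints hello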

  22. Storage on Openmind (1) • User home space: the default directory at login; small storage, to save documents and source code. Path: /home/$USER . User quota: 5 GB. To check quota: quota -s • Lustre storage: fast and large storage, to save large and often-used data (when possible). Path: /om . Default group quota: 10 TB (additional storage can be rented). To check quota: lfs quota -g group_name /om -h • NFS storage: large storage, to save large but rarely-used data. Path: /om2 . Default group quota: 10 TB (additional storage can be rented). Path: /om3 or /om4 (some groups have purchased additional storage). To check quota: quota -s -g group_name

  23. Storage on Openmind (2) • Local Lustre storage: fast, local, temporary. Path: /nobackup/scratch • Local NFS storage: fast, local, temporary. Path: /tmp • Backup: the Lustre storage (/om) and the NFS storage (/om2, /om3, /om4) are backed up. Ask the system administration group to recover files.

  24. Transfer files • Ex1 SFTP clients: Install and use an SFTP client on your local computer. macOS: Cyberduck, Transmit, Fetch. Windows: MobaXterm, WinSCP. Ubuntu: FileZilla. • Ex2 scp: Execute the scp command in a terminal on your local computer. Upload: scp file user@openmind7.mit.edu:~/destination Download: scp user@openmind7.mit.edu:~/source/file ./ • Ex3 wget: Download a file from a web site (given its URL): wget url • Ex4 rsync: Synchronize data between two sites (see the sketch below). • Ex5 sshfs: Mount remote directories on your local computer (see the sketch below). • Ex6 Globus: A platform to transfer files at high speed based on the GridFTP technique. A Globus server is not installed on the OM system, but it can be used via a Singularity container.
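Illustrative commands for Ex4 and Ex5, run on your local computer (the directory names ./data, /path/on/om, and ~/om_mount are placeholders):
rsync -avz ./data user@openmind7.mit.edu:~/ # Synchronize the local directory ./data into your OM home
mkdir ~/om_mount # Create a local mount point
sshfs user@openmind7.mit.edu:/path/on/om ~/om_mount # Mount an OM directory at the local mount point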

  25. Module: software environment management • Use module to set up environment variables easily. module avail # List all available software module avail openmind # List the software installed for openmind module avail mit # List the software licensed for MIT module load mit/matlab/2019a # Load a specific version of a software (e.g. Matlab 2019a) module show openmind/hdf5/1.10.1 # Show the environment variables for a software (e.g. hdf5) module list # List currently loaded software module remove openmind/hdf5/1.10.1 # Unload a software module switch mit/matlab/2019a mit/matlab/2018b # Switch a software to a different version module purge # Unload all currently loaded software module use /path/to/modulefiles # Use modules in a nonstandard location (e.g. user-defined modules)

  26. Core software • Core software on OM: module load openmind/gcc/5.3.0 # GNU compiler for C/C++ and Fortran module load mit/matlab/2019a # A multi-paradigm numerical computing software module load mit/mathematica/10.3.1 # A technical computing software module load openmind/anaconda/2.5.0 # Provides a useful Python environment module load openmind/R/3.6.1 # A software package for statistics module load cuda70/toolkit/7.0.28 # For GPU programming and applications module load openmind/hdf5/1.10.1 # A library for fast I/O module load openmind/singularity/3.2.0 # A container system for HPC • Use a Singularity container to customize the system and software stack as needed (especially useful for Python programs).

  27. Exercises (1) 1. Practice the basic Linux bash shell commands listed above. 2 (a). The following Matlab command creates a random matrix. Run it 5 times in a bash script. Note: you need to load a Matlab module. matlab -nodisplay -r "A=rand(3), exit" 2 (b). The following Python command creates a list of random integers. Run it 5 times in a bash script. python -c 'import random; a=[random.randint(0,100) for i in range(5)]; print(a)' Hints: • Use a text editor to create a bash script and write a loop in it: vim script.sh or gedit script.sh & • Make the script executable: chmod +x script.sh • Execute the script: ./script.sh • A sample script for 2 (a) is sketched below.
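One possible script.sh for exercise 2 (a), a sketch that assumes the Matlab module name from slide 26:
#!/bin/bash
module load mit/matlab/2019a # Load Matlab
for i in `seq 5` # Repeat 5 times
do
  matlab -nodisplay -r "A=rand(3), exit" # Create a random 3x3 matrix and exit
done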

  28. SLURM Job Scheduler Why use a job scheduler? • There is only one head node (openmind7), shared by all users. Running time-consuming programs on the head node is not allowed! • There are 72 compute nodes. All time-consuming programs should run on compute nodes! • To run programs on compute nodes, submit batch jobs using the Slurm job scheduler. • Job scheduler: submits jobs in batch mode, assigns appropriate compute resources to jobs, queues jobs in priority order, and monitors job information. SLURM (Simple Linux Utility for Resource Management) • An open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. • Widely used on computer clusters around the world.

  29. Request an interactive session • Interactive session: log in to a compute node with requested resources (e.g. CPU, GPU, memory, walltime). • Purpose: run your program using the requested resources and see the standard output right away. • Typical cases: running test programs; debugging code; visualization. • Examples (the options can also be combined; see below): srun -n 1 -t 02:00:00 --pty bash # Request 1 CPU core for 2 hours (default walltime is 10 minutes) srun -n 1 -t 03:00:00 --x11 --pty bash # Support GUI srun -n 1 -t 01:30:00 --mem=20G --pty bash # Request 1 CPU core and 20 GB RAM (default is 2 GB/core) srun -n 1 --gres=gpu:titan-x:1 --pty bash # Request 1 CPU core and 1 titan-x GPU • Check whether you are on a compute node by printing the host name: hostname # Check the host name of the node
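The options above can be combined in one request; for example (the values are illustrative):
srun -n 1 -t 01:00:00 --mem=10G --gres=gpu:titan-x:1 --pty bash # 1 CPU core, 10 GB RAM, 1 titan-x GPU, 1 hour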

  30. Work in an interactive session • Check CPU and memory on the node: cat /proc/cpuinfo # Check CPU info top # Check current usage of CPU and memory • Check GPU info on the node: nvidia-smi # Check GPU resources and status • Run a Matlab program: module load mit/matlab/2019a # Load Matlab matlab -nodesktop # Open Matlab without GUI • Run a Python program: python # Open Python (Use a Singularity container for a customized environment.)

  31. Submit a batch job • Batch jobs run in the background and keep running whether or not your terminal stays connected to the cluster. • Create a batch script (e.g. named example.sh), then submit it: sbatch example.sh • An example script for running a Matlab program (e.g. named main.m): #!/bin/bash # Bash shell #SBATCH -t 01:00:00 # walltime = 1 hour #SBATCH -N 1 # one node #SBATCH -n 1 # one CPU core #SBATCH --mem=10G # 10 GB memory module load mit/matlab/2019a # Load Matlab matlab -nodisplay -r "maxNumCompThreads(1), main, exit" # Run a Matlab program

  32. Check and cancel jobs • Check job status for a user: squeue -u $USER # User name should be provided. • Check information of a job: scontrol show job JobID # Job ID should be provided. • Cancel a job: scancel JobID # Job ID should be provided. • Output files: by default both standard output and standard error are directed to the same file. The default file name is "slurm-%j.out", where "%j" is replaced by the job ID. Use the -o option to specify an output file name. Use the -e option to direct the standard error to a separate file (see the example below).
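For example, adding these two lines to a job script separates the two streams (the file names here are illustrative):
#SBATCH -o myjob-%j.out # Standard output file; %j is replaced by the job ID
#SBATCH -e myjob-%j.err # Standard error file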

  33. Exercises (2) 1 (a). The following Matlab command creates a random matrix. Submit a batch job to run it using one CPU core. Note: you need to load a Matlab module. matlab -nodisplay -r "A=rand(10), exit" 1 (b). The following Python command creates a list of random integers. Submit a batch job to run it using one CPU core. python -c 'import random; a=[random.randint(0,100) for i in range(10)]; print(a)' 2. Check the job status and output files. Hints: • Use a text editor to create a job script, then submit it: sbatch job.sh • A sample job script for 1 (b) is sketched below.
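One possible job.sh for exercise 1 (b), a sketch modeled on the example in slide 31 (the walltime is illustrative):
#!/bin/bash # Bash shell
#SBATCH -t 00:10:00 # walltime = 10 minutes
#SBATCH -N 1 # one node
#SBATCH -n 1 # one CPU core
python -c 'import random; a=[random.randint(0,100) for i in range(10)]; print(a)' # Run the Python command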

  34. Other Computing Resources Consider other available resources • if OM is too busy and you need to obtain results quickly (e.g. before deadlines), or • if the computing resources on OM do not meet your computational needs. MIT resources: • SuperCloud at Lincoln Laboratory: an HPC cluster with a web-browser interface. • Engaging Cluster: an HPC cluster for MIT researchers, located at MGHPCC. Nationwide resources: • XSEDE: provides HPC resources at many institutions in the US. • Amazon Web Services (AWS): a commercial platform for cloud computing.

  35. XSEDE resources • Pros: free; campus allocations available; easy to get a start-up allocation. • Cons: a large research allocation requires a strong proposal. XSEDE resources: • Bridges at PSC: GPU nodes (K80, P100), GPU-AI nodes (V100 GPUs, NVLink), large-memory nodes (3 TB, 12 TB), large storage (Lustre, ~6 PB local, ~10 PB shared) • Comet at SDSC: Haswell CPUs (~2,000 nodes), GPU nodes (K80, P100), fast storage (local SSD scratch) • Stampede2 at TACC: Skylake CPUs (~1,750 nodes), Knights Landing Xeon Phis (~4,200 nodes, 68 cores per node) • Jetstream at IU/TACC: cloud computing, virtual machines, web-based user interface

  36. Further Help • Openmind webpage on MIT GitHub: https://github.mit.edu/MGHPCC/OpenMind/wiki • To get notifications, set your primary email address. • Submit an issue at https://github.mit.edu/MGHPCC/openmind/issues • "Not watching": be notified when participating or @mentioned. • "Watching": be notified of all conversations. To avoid a deluge of emails, set up email filtering. • If you have an answer to an issue, or have seen the problem before, consider replying to it. Communication between users helps keep the cluster running optimally. • Slack channel for OM users: openmind-46.slack.com • Meetings: walk in for quick questions; make an appointment for longer discussions. Monthly group-representative meetings. • Contact: email shaohao@mit.edu . Office: 46-4115D.
