CFAM
Cluster Information
Ganglia Monitoring System
http://cfam-cluster.engr.unr.edu/ganglia/
Hardware
Head Node
8 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 32GiB RAM, 8TiB HDD, /home shared with the nodes via NFS
Nodes 1-6
20 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 128GiB RAM, 8TiB HDD, /home via the head node
Highmem 1-2
20 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 256GiB RAM, 8TiB HDD, /home via the head node
Networking
Mellanox Technologies MT27500 Family [ConnectX-3] FDR
1GbE Netgear Switch
Connecting to the CFAM Cluster
To connect to the cluster remotely, use either SSH or X2Go. Use X2Go if you would like a desktop interface, and SSH if you would like a command-line interface.
SSH
If you are on Mac or Linux, open the terminal. If you are on Windows, download and install Cygwin with the Cygwin Installer and use its terminal.
During Cygwin's setup, it will ask you which packages to install. Under the "Net" category, there is a package called "openssh". Click the "Skip" button to the left of openssh to cycle it to version 7.2p1-1, so that it will install that version.
Once at a terminal, run the following command, replacing <netid> with your netid. It will prompt you for your password, and then you will be connected.
ssh -Y <netid>@cfam-cluster.engr.unr.edu
X2Go
If you prefer a desktop environment, you can use X2Go to connect.
Download and install X2Go here:
X2Go Download
Once in X2Go, type cfam-cluster.engr.unr.edu for the host, and your netid for the login.
Under the Session Type, select XFCE and you are ready to connect.
Number Pad
If the number pad is not working in X2Go, go to Keyboard under Settings. Under the Layout Tab, check Use system defaults. Log out and log back in and your number pad should work.
Restarting X2Go
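If X2Go will not start properly, connect to the cluster through the command line (SSH) and run pkill -u <username>, replacing <username> with your netid. This kills every process owned by your account, including your current SSH session, so save any work first. X2Go should then start.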
SLURM Introduction
SLURM (Simple Linux Utility for Resource Management) is a cluster management and job scheduling utility. It lets commands use all of the nodes in the CFAM cluster. You will likely use srun and sbatch most often; additional documentation on SLURM can be found here: SLURM Documentation
srun is used to run jobs in parallel. There are two options you will primarily use: -N specifies the number of nodes to run the command on, and -n specifies the number of cores (one task per core by default) to run it with. For example, to run the command hostname on all 8 of the compute nodes in the CFAM cluster, you would run srun -N 8 hostname, and to run hostname on 100 cores you would run srun -n 100 hostname. Additional documentation for srun can be found here: srun
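To choose a sensible -N for a given -n, it can help to work out how many nodes a task count needs. The following is a minimal sketch, assuming each compute node offers 20 cores to SLURM, as the hardware list suggests. The function name min_nodes and the variable CORES_PER_NODE are hypothetical, and the actual SLURM configuration may expose a different count.

```shell
#!/bin/bash
# Assumption: every compute node exposes 20 cores to SLURM.
CORES_PER_NODE=20

# min_nodes TASKS: print the smallest node count that can hold TASKS tasks
# at one task per core (integer ceiling division).
min_nodes() {
    local tasks=$1
    echo $(( (tasks + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Print (do not run) an srun command sized for 100 tasks.
echo "srun -N $(min_nodes 100) -n 100 hostname"
```

With these assumptions, 100 tasks need at least 5 nodes. SLURM will also pick a node count on its own when only -n is given, so this is only useful when you want to pin -N explicitly.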
sbatch is used to queue batch scripts for the cluster to run. You would specify a batch script to add to the queue, and when the cluster has the resources necessary, it will execute the script. To add scripts to the queue, run sbatch /path/of/script. To view the queue of scripts, you can use the command squeue. Additional documentation for sbatch can be found here: sbatch
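As a rough illustration of the queueing workflow described above, the sketch below writes a small batch script and submits it only when sbatch is available. The job name, task count, and time limit are illustrative assumptions, not site requirements.

```shell
#!/bin/bash
# Write a minimal batch script to a temporary file.
# The job name and resource values below are illustrative assumptions.
script=$(mktemp "${TMPDIR:-/tmp}/hello_job.XXXXXX")
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH -n 4
#SBATCH --time=00:05:00
#SBATCH --job-name=hello
srun hostname
EOF

# Submit only when sbatch is available (i.e. on the cluster);
# otherwise just report where the script was written.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$script"
    squeue -u "$(id -un)"
else
    echo "sbatch not found; script written to $script"
fi
```

The #SBATCH lines are read by sbatch as options, so the same values could instead be passed on the sbatch command line.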
Compiling SLURM Jobs
#!/bin/bash
# We store some example code from Lawrence Livermore National Laboratory in /opt/mpi
# Copy it to your home directory
cp -r /opt/mpi/tutorials/mpi/samples/C ~/mpi
cd ~/mpi
# Compile an example
mpicc -lpmi -o mpi_hello mpi_hello.c
# Run the example
srun -n16 --mpi=pmi2 mpi_hello
Output
$ srun -n16 mpi_hello
Hello from task 10 on node03!
Hello from task 9 on node03!
Hello from task 11 on node03!
Hello from task 12 on node03!
Hello from task 14 on node03!
Hello from task 8 on node03!
Hello from task 15 on node03!
Hello from task 13 on node03!
Hello from task 2 on node03!
Hello from task 0 on node03!
MASTER: Number of MPI tasks is: 16
Hello from task 1 on node01!
Hello from task 4 on node01!
Hello from task 7 on node01!
Hello from task 5 on node01!
Hello from task 6 on node01!
Hello from task 3 on node01!
Running Tasks
SRUN
https://slurm.schedmd.com/srun.html
srun is synchronous and blocks until the job finishes. To queue a job and return immediately, use sbatch.
# -n indicates the number of cores
# --mem indicates the memory needed per node in megabytes
# --time indicates the run time limit of the job
$ srun -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello
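Since --mem is given in megabytes and --time as HH:MM:SS in the example above, a small helper can do the unit conversions. This is only a sketch: build_srun_cmd is a hypothetical name, and it prints the command rather than running it.

```shell
#!/bin/bash
# build_srun_cmd TASKS MEM_GIB MINUTES COMMAND...
# Print (do not run) an srun command line, converting GiB to the
# megabytes that --mem expects and minutes to HH:MM:SS for --time.
build_srun_cmd() {
    local tasks=$1 mem_gib=$2 minutes=$3
    shift 3
    local mem_mb=$(( mem_gib * 1024 ))
    local limit
    limit=$(printf '%02d:%02d:00' $(( minutes / 60 )) $(( minutes % 60 )))
    echo "srun -n${tasks} --mem=${mem_mb} --time=${limit} $*"
}

# 16 tasks, 2 GiB per node, 5 minute limit.
build_srun_cmd 16 2 5 '~/mpi/mpi_hello'
```

Printing the command first makes it easy to check the values before pasting the line into a terminal on the cluster.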
SBATCH
https://slurm.schedmd.com/sbatch.html
$ cat ~/mpi/run.sh
#!/bin/bash
#SBATCH -n 16
#SBATCH --mem=2048MB
#SBATCH --time=00:30:00
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --mail-type=ALL
# --mpi is an srun option, so it is passed to srun rather than set with #SBATCH
srun --mpi=pmi2 ~/mpi/mpi_hello
Submit the job:
$ sbatch ~/mpi/run.sh
Submitted batch job 536
$ squeue
JOBID PARTITION   NAME     USER ST TIME NODES NODELIST(REASON)
  536      main run.sh cse-admi  R 0:03     2 node0[1,3]
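When a script needs to wait for a submitted job, the job ID can be taken from the "Submitted batch job" line and polled with squeue. The following is a sketch under that assumption; job_id_from_output and wait_for_job are hypothetical helper names, and the 10 second polling interval is arbitrary.

```shell
#!/bin/bash
# job_id_from_output TEXT: print the last word of sbatch's
# "Submitted batch job <id>" message.
job_id_from_output() {
    echo "${1##* }"
}

# wait_for_job ID: poll squeue until the job no longer appears in the queue.
wait_for_job() {
    while [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]; do
        sleep 10
    done
}

if command -v sbatch >/dev/null 2>&1; then
    job_id=$(job_id_from_output "$(sbatch ~/mpi/run.sh)")
    wait_for_job "$job_id"
    echo "Job $job_id has left the queue"
else
    # Off the cluster, only demonstrate the parsing step.
    job_id_from_output "Submitted batch job 536"
fi
```

A job leaving the queue does not mean it succeeded, so the job's output file should still be checked afterwards.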
Check the cluster status:
$ sinfo
PARTITION  AVAIL TIMELIMIT NODES STATE NODELIST
main*      up    infinite      2 alloc node0[1,3]
controller up    infinite      1 idle  head
test       up    infinite      1 down* test
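To get a quick count of nodes in a given state, the sinfo listing can be summed with awk. This sketch assumes the default six-column layout shown above (NODES in column 4, STATE in column 5); nodes_in_state is a hypothetical helper name.

```shell
#!/bin/bash
# nodes_in_state STATE: read default-format sinfo output on stdin and print
# the total of the NODES column over rows whose STATE column equals STATE.
nodes_in_state() {
    awk -v state="$1" 'NR > 1 && $5 == state { n += $4 } END { print n + 0 }'
}

# Sample taken from the sinfo output shown above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 2 alloc node0[1,3]
controller up infinite 1 idle head
test up infinite 1 down* test'

printf '%s\n' "$sample" | nodes_in_state alloc
```

On the cluster the same function can be fed live data with sinfo | nodes_in_state idle. A node that belongs to more than one partition is listed once per partition, so the total can count it more than once.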
Applications
Interactive ANSYS
salloc
salloc -N6 -w node01,node02,node03,node04,node05,node06 \
    srun --x11 -N1 /opt/ansys17-2/v172/fluent/bin/fluent -r17.2.0 \
    3ddp -t110 -pinfiniband -mpi=openmpi \
    -cnf=node01,node02,node03,node04,node05,node06 -nm -ssh
salloc allocates 6 nodes and srun --x11 runs an interactive job with X forwarding.
sinteractive
sinteractive -p nodes -N 2
You can find the name of the allocated hosts with the following:
$ nodeset -e $SLURM_JOB_NODELIST
highmem01 highmem02
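The Fluent commands in this document pass hosts to -cnf= as a comma-separated list, while nodeset -e prints them separated by spaces. A small sketch of the conversion follows; hosts_to_cnf is a hypothetical helper name, and the fallback host names are just the sample output above.

```shell
#!/bin/bash
# hosts_to_cnf HOSTS...: join host names with commas, the form that the
# Fluent -cnf= option takes in the examples in this document.
hosts_to_cnf() {
    local IFS=,
    echo "$*"
}

# Inside an allocation, expand the node list with nodeset;
# off the cluster, fall back to the sample hosts shown above.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v nodeset >/dev/null 2>&1; then
    hosts=$(nodeset -e "$SLURM_JOB_NODELIST")
else
    hosts="highmem01 highmem02"
fi

# $hosts is left unquoted on purpose so it splits into one word per host.
hosts_to_cnf $hosts
```

The result can then be used as -cnf=$(hosts_to_cnf $hosts) instead of typing the node names by hand.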
SBATCH
Template submit.sh
#!/bin/bash
#SBATCH -N 6 # Nodes
#SBATCH -n 100 # Total number of tasks
#SBATCH --exclusive # exclusive lock
#SBATCH --mail-type=end
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --workdir=/home/YOUR/PROJECT/PATH
#SBATCH -w highmem01,highmem02,node01,node02,node03,node04

FLUENT_HOSTS=highmem01,highmem02,node01,node02,node03,node04
FLUENT_BIN=~/ansys16/v162/fluent/bin/fluent
export FLUENT_GUI=off

# Use SLURM_NPROCS when set; otherwise derive the task count from
# SLURM_TASKS_PER_NODE (e.g. "20(x5)" becomes 20 * 5).
if [ -z "$SLURM_NPROCS" ]; then
    N=$(( $(echo $SLURM_TASKS_PER_NODE | sed -r 's/([0-9]+)\(x([0-9]+)\)/\1 * \2/') ))
else
    N=$SLURM_NPROCS
fi

echo -e "N: $N\n";

"$FLUENT_BIN" 3ddp -g -slurm -ssh \
    -cnf=$FLUENT_HOSTS -t $N \
    -mpi=openmpi -pinfiniband -i journal
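The sed expression in the template handles a single group such as 20(x5). SLURM can also report mixed layouts such as 20(x4),16, which that expression would not turn into valid arithmetic. The sketch below is one possible way to sum such a value in plain bash; count_tasks is a hypothetical name and is not part of the original template.

```shell
#!/bin/bash
# count_tasks LAYOUT: sum a SLURM_TASKS_PER_NODE style value, where each
# comma-separated group is either "N" or "N(xM)" (N tasks on each of M nodes).
count_tasks() {
    local layout=$1 total=0 group per_node repeat
    local IFS=,
    for group in $layout; do
        case $group in
            *"(x"*")")
                per_node=${group%%"("*}   # text before "("
                repeat=${group#*"(x"}     # text after "(x"
                repeat=${repeat%")"}      # drop the trailing ")"
                total=$(( total + per_node * repeat ))
                ;;
            *)
                total=$(( total + group ))
                ;;
        esac
    done
    echo "$total"
}

# Use the real variable inside a job; fall back to a sample value otherwise.
count_tasks "${SLURM_TASKS_PER_NODE:-"20(x5)"}"
```

In the template this could replace the sed line as N=$(count_tasks "$SLURM_TASKS_PER_NODE").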
Administration
Modules
$ module avail
--------------------------------------- /usr/share/Modules/modulefiles --------------
dot  module-git  module-info  modules  null  use.own
-------------------------------------- /etc/modulefiles --------------------------------
mpi/mpich-3.2-x86_64
-------------------------------------- /act/modulefiles --------------------------------
impi   mpich/gcc    mvapich2-2.1/gcc    openmpi-1.6/gcc    openmpi-1.8/gcc
intel  mpich/intel  mvapich2-2.1/intel  openmpi-1.6/intel  openmpi-1.8/intel
Load a module
module load mpich/intel
Adding Users
For users to be able to run SLURM commands across the entire cluster, they need the same user account on each node in the cluster. An Ansible playbook has been created to do this. To run the playbook, you need to be logged in as root: run su root and enter the root password. Once logged in as root, enter the virtual environment for Ansible by running source ~/ansible-env/bin/activate.
Now that you are in the virtual environment, edit the file containing the users to add by running nano ~/ansible-env/addusers/roles/common/tasks/main.yml. Add the user and their uid to this file under both of the sections, following the scheme below. A user's uid can be found by running the command id <netid>. Make sure to keep the spacing consistent with the lines already in the file.
- { name: 'netid', uid: enter_uid_here }
After you've added the user to both sections of this file, press Ctrl-X to exit and Y to save the file. Now the playbook is ready to be run. Run the following command and watch the terminal for errors:
ansible-playbook ~/ansible-env/addusers/site.yml -i ~/ansible-env/addusers/hosts
If the playbook executed without any errors, then the user has been added across all of the nodes.
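To avoid typos when adding entries to main.yml, the line can be generated from the output of id. This is a sketch only: user_entry is a hypothetical helper, "netid" is a stand-in for a real user name, and the output still has to be pasted into both sections of the file by hand.

```shell
#!/bin/bash
# user_entry NAME UID: print one list entry in the format used by main.yml.
user_entry() {
    printf -- "- { name: '%s', uid: %s }\n" "$1" "$2"
}

# Look up the uid with id; fall back to a placeholder when the account does
# not exist on this machine. "netid" is a stand-in for a real user name.
name=netid
uid=$(id -u "$name" 2>/dev/null) || uid=enter_uid_here
user_entry "$name" "$uid"
```

Indentation is not added by the helper, so the pasted line still needs to be aligned with the existing entries.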
HOWTO: Set up SLURM on your personal computer
https://source2.cse.unr.edu/w/cse/tutorials/slurm-mpi-setup/
Last Author: ctrujillo
Last Edited: May 30 2017, 11:43 AM