# Connecting to the CFAM Cluster

To connect to the cluster remotely, use either SSH or X2Go. Use X2Go if you want a desktop interface, and SSH if you want a command-line interface.

**SSH:** On Mac or Linux, open a terminal. On Windows, download and install Cygwin and use its terminal.

[[ https://cygwin.com/install.html | Cygwin Installer ]]

During setup, Cygwin asks which packages to install. Under the "Net" category there is a package called "openssh". Click the "Skip" button to the left of openssh to cycle it to 7.2p1-1, so that it installs that version.

Once at a terminal, run the following command, replacing `<netid>` with your NetID. You will be prompted for your password, and then you are connected.

```
ssh -Y <netid>@cfam-cluster.engr.unr.edu
```

**X2Go:** If you prefer a desktop environment, you can use X2Go to connect. Download and install X2Go here:

[[ http://wiki.x2go.org/doku.php/download:start | X2Go Download ]]

Once in X2Go, enter `cfam-cluster.engr.unr.edu` for the host and your NetID for the login. Under Session Type, select **XFCE** and you are ready to connect.

# SLURM Introduction

SLURM (Simple Linux Utility for Resource Management) is a cluster management and job scheduling utility. It lets commands use all of the nodes in the CFAM cluster. You will likely use `srun` and `sbatch` most often; additional documentation on SLURM can be found here:

[[ http://slurm.schedmd.com/documentation.html | SLURM Documentation ]]

`srun` runs jobs in parallel. There are two options you will use most often: `-N` specifies the number of nodes to run the command on, and `-n` specifies the number of tasks (cores) to run it with. For example, to run the command `hostname` on all 8 nodes of the CFAM cluster, run `srun -N 8 hostname`; to run `hostname` on 100 cores, run `srun -n 100 hostname`.
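
Each task started by `srun -n <tasks> hostname` prints the name of the node it ran on, so counting repeated hostnames shows how the tasks were spread across nodes. The sketch below assumes a hypothetical helper named `count_tasks_per_node` (it is not part of SLURM) and feeds it stand-in hostnames so that it also runs on a machine without SLURM.

```bash
#!/bin/bash
# Hypothetical helper (not a SLURM command): read one hostname per line,
# as printed by `srun -n <tasks> hostname`, and print "<node> <task count>".
count_tasks_per_node() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# On the cluster you would run:
#   srun -n 100 hostname | count_tasks_per_node
# Stand-in input so the sketch runs without SLURM:
printf 'node01\nnode03\nnode01\nnode03\nnode03\n' | count_tasks_per_node
```

With the stand-in input this prints one line per node followed by the number of tasks that ran there.
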
Additional documentation for `srun` can be found here:

[[ http://slurm.schedmd.com/srun.html | srun ]]

`sbatch` queues batch scripts for the cluster to run. You specify a batch script to add to the queue, and the cluster executes it once the necessary resources are available. To add a script to the queue, run `sbatch /path/of/script`. To view the queue, use the command `squeue`. Additional documentation for `sbatch` can be found here:

[[ http://slurm.schedmd.com/sbatch.html | sbatch ]]

==Compiling SLURM Jobs

```
#!/bin/bash
# We store some example code from Lawrence Livermore National Laboratory in /opt/mpi
# Copy it to your home directory
cp -r /opt/mpi/tutorials/mpi/samples/C ~/mpi
cd ~/mpi
# Compile an example (libraries are listed after the source file)
mpicc -o mpi_hello mpi_hello.c -lpmi
# Run the example
srun -n16 --mpi=pmi2 mpi_hello
```

Output

```
$ srun -n16 mpi_hello
Hello from task 10 on node03!
Hello from task 9 on node03!
Hello from task 11 on node03!
Hello from task 12 on node03!
Hello from task 14 on node03!
Hello from task 8 on node03!
Hello from task 15 on node03!
Hello from task 13 on node03!
Hello from task 2 on node03!
Hello from task 0 on node03!
MASTER: Number of MPI tasks is: 16
Hello from task 1 on node01!
Hello from task 4 on node01!
Hello from task 7 on node01!
Hello from task 5 on node01!
Hello from task 6 on node01!
Hello from task 3 on node01!
```

==Running Tasks

===SRUN

https://slurm.schedmd.com/srun.html

`srun` is synchronous and blocking. Use `sbatch` to submit a job to the queue.
```
# -n indicates the number of cores
# --mem indicates the memory needed per node in megabytes
# --time indicates the time limit of the job
$ srun -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello
```

===SBATCH

https://slurm.schedmd.com/sbatch.html

```
$ cat ~/mpi/run.sh
#!/bin/bash
#SBATCH -n 16
#SBATCH --mem=2048MB
#SBATCH --time=00:30:00
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --mail-type=ALL
#SBATCH --mpi=pmi2
srun ~/mpi/mpi_hello
```

Submit the job:

```
$ sbatch ~/mpi/run.sh
Submitted batch job 536
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    536      main   run.sh cse-admi  R       0:03      2 node0[1,3]
```

Check the cluster status:

```
$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  alloc node0[1,3]
controller    up   infinite      1   idle head
test          up   infinite      1  down* test
```

=Applications

==Interactive ANSYS

===salloc

```
lang=bash
salloc -N6 -w node01,node02,node03,node04,node05,node06 \
    srun --x11 -N1 /opt/ansys17-2/v172/fluent/bin/fluent -r17.2.0 \
    3ddp -t110 -pinfiniband -mpi=openmpi \
    -cnf=node01,node02,node03,node04,node05,node06 -nm -ssh
```

**salloc** allocates 6 nodes and **srun --x11** runs an interactive job with X forwarding.
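
The `salloc` command above repeats the same node list for `-w` and `-cnf`. One way to avoid typing it twice is to build the list once and reuse it. The sketch below assumes a hypothetical `node_list` helper and the `nodeNN` naming seen on this page; it only prints the two arguments rather than calling `salloc`, so it can be tried off-cluster.

```bash
#!/bin/bash
# Hypothetical helper (not a SLURM command): print node01,node02,...
# for the requested number of nodes, assuming the nodeNN naming scheme.
node_list() {
    list=$(printf 'node%02d,' $(seq 1 "$1"))
    echo "${list%,}"   # drop the trailing comma
}

NODES=$(node_list 6)

# Print the arguments instead of running salloc.
printf '%s\n' "-w $NODES"
printf '%s\n' "-cnf=$NODES"
```

The printed values can then be pasted into, or substituted for, the `-w` and `-cnf=` arguments of the command.
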
===sinteractive

```
lang=bash
sinteractive -p nodes -N 2
```

You can find the names of the allocated hosts with the following:

```
lang=bash
$ nodeset -e $SLURM_JOB_NODELIST
highmem01 highmem02
```

==SBATCH Template

**submit.sh**

```
lang=bash
#!/bin/bash
#SBATCH -N 6 # Nodes
#SBATCH -n 100 # Total number of tasks
#SBATCH --exclusive # exclusive lock
#SBATCH --mail-type=end
#SBATCH --mail-user=$YOUR_EMAIL
#SBATCH --workdir=/home/YOUR/PROJECT/PATH
#SBATCH -w highmem01,highmem02,node01,node02,node03,node04

FLUENT_HOSTS=highmem01,highmem02,node01,node02,node03,node04
FLUENT_BIN=~/ansys16/v162/fluent/bin/fluent
export FLUENT_GUI=off

# If SLURM_NPROCS is unset, derive the task count from SLURM_TASKS_PER_NODE,
# turning a value such as "20(x5)" into the arithmetic expression "20 * 5".
if [ -z "$SLURM_NPROCS" ]; then
    N=$(( $(echo $SLURM_TASKS_PER_NODE | sed -r 's/([0-9]+)\(x([0-9]+)\)/\1 * \2/') ))
else
    N=$SLURM_NPROCS
fi

echo -e "N: $N\n"

"$FLUENT_BIN" 3ddp -g -slurm -ssh \
    -cnf=$FLUENT_HOSTS -t $N \
    -mpi=openmpi -pinfiniband -i journal
```

= Administration

==Modules

```
lang=bash
$ module avail
--------------------------------------- /usr/share/Modules/modulefiles --------------
dot module-git module-info modules null use.own
-------------------------------------- /etc/modulefiles --------------------------------
mpi/mpich-3.2-x86_64
-------------------------------------- /act/modulefiles --------------------------------
impi mpich/gcc mvapich2-2.1/gcc openmpi-1.6/gcc openmpi-1.8/gcc
intel mpich/intel mvapich2-2.1/intel openmpi-1.6/intel openmpi-1.8/intel
```

===Load a module

```
module load mpich/intel
```

== Adding Users

For users to be able to run SLURM commands across the entire cluster, they need the same user account on every node in the cluster. An Ansible playbook has been created to do this. To run the playbook you need to be logged in as root: run `su root` and enter the root password. Once logged in as root, enter the Ansible virtual environment by running `source ~/ansible-env/bin/activate`.
Now that you are in the virtual environment, edit the file containing the users to add by running `nano ~/ansible-env/addusers/roles/common/tasks/main.yml`. Add the user and their uid to this file under both sections, following the scheme below. A user's uid can be found by running the command `id <netid>`. Keep the spacing consistent with the lines already in the file.

```
- { name: 'netid', uid: enter_uid_here }
```

After you have added the user to both sections of the file, press Ctrl-X to exit and Y to save the file. The playbook is now ready to run. Run the following command and watch the terminal for errors:

```
ansible-playbook ~/ansible-env/addusers/site.yml -i ~/ansible-env/addusers/hosts
```

If the playbook executes without any errors, the user has been added across all of the nodes.

==HOWTO: Set up SLURM on your personal computer

https://source2.cse.unr.edu/w/cse/tutorials/slurm-mpi-setup/

## Cluster Information

Ganglia Monitoring System: http://cfam-cluster.engr.unr.edu/ganglia/

===Hardware

**Head Node**

```
8 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
32GiB RAM
8TiB /home share with nodes via NFS
```

**Nodes 1-6**

```
20 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
128GiB RAM
8TiB /home via the head node
```

**Highmem 1-2**

```
20 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
256GiB RAM
8TiB /home via the head node
```