
CFAM

Cluster Information

Ganglia Monitoring System
http://cfam-cluster.engr.unr.edu/ganglia/

Hardware

Head Node

8 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
32GiB RAM
8TiB HDD, /home shared with the nodes via NFS

Nodes 1-6

20 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
128GiB RAM
8TiB HDD /home via the head node

Highmem 1-2

20 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
256GiB RAM
8TiB HDD /home via the head node

Networking
Mellanox Technologies MT27500 Family [ConnectX-3] FDR
1GbE Netgear Switch

Connecting to the CFAM Cluster

To connect to the cluster remotely, use either SSH or X2Go. Use X2Go if you would like a desktop interface, and SSH if you would like a command line interface.

SSH

If you are on Mac or Linux, open a terminal. If you are on Windows, download and install Cygwin first: Cygwin Installer

During setup, Cygwin asks which packages to install. Under the "Net" category there is a package called "openssh". Click the "Skip" button to the left of openssh until it shows version 7.2p1-1, so that this version is installed.

Once at a terminal, run the following command, replacing <netid> with your NetID. You will be prompted for your password, and then you are connected. The -Y option enables trusted X11 forwarding, so graphical programs started on the cluster can display on your machine if an X server is running locally.

ssh -Y <netid>@cfam-cluster.engr.unr.edu
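If you connect often, an entry in ~/.ssh/config can save typing. The sketch below only prints such an entry; the alias "cfam" and the NetID "jdoe" are placeholders, not names used by the cluster. ForwardX11 and ForwardX11Trusted together correspond to the -Y option above.

```shell
# Print an ssh_config Host block for the cluster.
# "cfam" is a hypothetical alias; choose any name you like.
cfam_ssh_config() {
  local netid=$1
  cat <<EOF
Host cfam
    HostName cfam-cluster.engr.unr.edu
    User ${netid}
    ForwardX11 yes
    ForwardX11Trusted yes
EOF
}

# Preview the block for a placeholder NetID.
cfam_ssh_config jdoe
```

To use it, append the output to your configuration with cfam_ssh_config <netid> >> ~/.ssh/config, after which ssh cfam should behave like the full command.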

X2Go

If you prefer a desktop environment, you can use X2Go to connect.
Download and install X2Go here:
X2Go Download

Once in X2Go, type cfam-cluster.engr.unr.edu for the host, and your netid for the login.

Under the Session Type, select XFCE and you are ready to connect.

Number Pad

If the number pad is not working in X2Go, go to Keyboard under Settings. Under the Layout Tab, check Use system defaults. Log out and log back in and your number pad should work.

Restarting X2Go

If X2Go will not start properly, connect to the cluster through the command line and run pkill -u <username>. This kills every process owned by your user, including your current SSH session, after which X2Go should start.

SLURM Introduction

SLURM (Simple Linux Utility for Resource Management) is a workload manager and job scheduler for clusters. It lets you run commands across all of the nodes in the CFAM cluster. You will mostly use srun and sbatch; additional documentation on SLURM can be found here: SLURM Documentation

srun runs a command in parallel. The two options you will use most are -N, which specifies the number of nodes to run the command on, and -n, which specifies the number of tasks (by default, each task gets one core). For example, to run hostname on all 8 compute nodes in the CFAM cluster, run srun -N 8 hostname; to run it as 100 tasks, run srun -n 100 hostname. Additional documentation for srun can be found here: srun
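To see how SLURM spread the tasks, you can count how often each hostname appears in the output. The helper below is a minimal sketch (tasks_per_node is a made-up name); it reads one hostname per line and is demonstrated with sample input so it runs without the cluster.

```shell
# Read one hostname per line on stdin and print "host: count" for each node.
tasks_per_node() {
  sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Local demonstration with sample hostnames (no cluster needed).
printf 'node03\nnode01\nnode03\n' | tasks_per_node
# prints:
# node01: 1
# node03: 2
```

On the cluster, the same helper can be fed directly: srun -n 100 hostname | tasks_per_node.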

sbatch queues batch scripts for the cluster to run. You submit a script to the queue, and the cluster executes it once the required resources are available. To submit a script, run sbatch /path/of/script. To view the queue, run squeue. Additional documentation for sbatch can be found here: sbatch

Compiling SLURM Jobs

#!/bin/bash

#We store some example code from Lawrence Livermore National Laboratory in
#/opt/mpi

#Copy it to your home directory
cp -r /opt/mpi/tutorials/mpi/samples/C ~/mpi

cd ~/mpi

#Compile an example
mpicc -o mpi_hello mpi_hello.c -lpmi

#Run the example
srun -n16 --mpi=pmi2 mpi_hello

Output

$ srun -n16 mpi_hello
Hello from task 10 on node03!
Hello from task 9 on node03!
Hello from task 11 on node03!
Hello from task 12 on node03!
Hello from task 14 on node03!
Hello from task 8 on node03!
Hello from task 15 on node03!
Hello from task 13 on node03!
Hello from task 2 on node03!
Hello from task 0 on node03!
MASTER: Number of MPI tasks is: 16
Hello from task 1 on node01!
Hello from task 4 on node01!
Hello from task 7 on node01!
Hello from task 5 on node01!
Hello from task 6 on node01!
Hello from task 3 on node01!

Running Tasks

SRUN

https://slurm.schedmd.com/srun.html

srun is synchronous and blocking. Use sbatch to submit a job to the queue.

#-n is the number of tasks (by default, one core per task)
#--mem is the memory needed per node, in megabytes
#--time is the run time limit for the job (HH:MM:SS)
$ srun -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello
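The --time value above is written as HH:MM:SS. If you think of a limit in seconds, a small conversion helper avoids arithmetic slips. This is a sketch; to_slurm_time is a made-up name and not part of SLURM.

```shell
# Convert a number of seconds into the HH:MM:SS form used with --time.
to_slurm_time() {
  local total=$1
  printf '%02d:%02d:%02d\n' \
    $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

to_slurm_time 300    # prints 00:05:00
to_slurm_time 1800   # prints 00:30:00
```

It could then be used as srun -n16 --mem=2048 --time="$(to_slurm_time 300)" ~/mpi/mpi_hello.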

SBATCH

https://slurm.schedmd.com/sbatch.html

$ cat ~/mpi/run.sh

#!/bin/bash
#SBATCH -n 16
#SBATCH --mem=2048MB
#SBATCH --time=00:30:00
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --mail-type=ALL

# --mpi is an srun option, so it is passed to srun rather than as an #SBATCH directive
srun --mpi=pmi2 ~/mpi/mpi_hello

Submit the job:

$ sbatch ~/mpi/run.sh 
Submitted batch job 536

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               536      main   run.sh cse-admi  R       0:03      2 node0[1,3]
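When scripting submissions, it helps to capture the job ID that sbatch prints so that squeue can be limited to that job. The helper below is a sketch that assumes the "Submitted batch job <id>" message shown above; job_id_from_sbatch is a made-up name.

```shell
# Extract the job ID from sbatch output such as "Submitted batch job 536".
job_id_from_sbatch() {
  awk '/^Submitted batch job/ { print $4 }'
}

# Local demonstration using the message shown above.
echo 'Submitted batch job 536' | job_id_from_sbatch   # prints 536
```

On the cluster this could be combined as jobid=$(sbatch ~/mpi/run.sh | job_id_from_sbatch) followed by squeue -j "$jobid".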

Check the cluster status:

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  alloc node0[1,3]
controller    up   infinite      1   idle head
test          up   infinite      1  down* test
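In the sinfo output, a "*" after a partition name marks the default partition, and a "*" after a state marks nodes that are not responding. A quick filter for partitions with down nodes might look like the sketch below; down_partitions is a made-up name, and the column positions assume the default sinfo layout shown above.

```shell
# Print the names of partitions whose STATE column starts with "down".
# Skips the header line and strips the "*" default marker from names.
down_partitions() {
  awk 'NR > 1 && $5 ~ /^down/ { sub(/\*$/, "", $1); print $1 }'
}

# Local demonstration using the sample output above.
down_partitions <<'EOF'
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  alloc node0[1,3]
controller    up   infinite      1   idle head
test          up   infinite      1  down* test
EOF
# prints: test
```

On the cluster, run sinfo | down_partitions.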

Applications

Interactive ANSYS

salloc

salloc -N6 -w node01,node02,node03,node04,node05,node06 \
  srun --x11 -N1 /opt/ansys17-2/v172/fluent/bin/fluent -r17.2.0 \
  3ddp -t110 -pinfiniband -mpi=openmpi \
  -cnf=node01,node02,node03,node04,node05,node06  -nm -ssh

salloc allocates 6 nodes and srun --x11 runs an interactive job with X forwarding.
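The command above spells out the same node list twice, once for -w and once for -cnf. A small helper can build that list from a count so the two stay in sync. This is a sketch that assumes the nodeNN naming used above; node_list is a made-up name.

```shell
# Print a comma-separated list node01,node02,... for the given count.
node_list() {
  local count=$1 list
  list=$(printf 'node%02d,' $(seq 1 "$count"))
  printf '%s\n' "${list%,}"   # drop the trailing comma
}

node_list 6
# prints: node01,node02,node03,node04,node05,node06
```

The result can be substituted in both places, for example -w "$(node_list 6)" and -cnf="$(node_list 6)".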

sinteractive

sinteractive -p nodes -N 2

You can find the names of the allocated hosts with the following:

$ nodeset -e $SLURM_JOB_NODELIST
highmem01 highmem02
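Fluent's -cnf option, used elsewhere on this page, takes a comma-separated host list. If nodeset is not available, scontrol show hostnames also expands $SLURM_JOB_NODELIST, printing one host per line. The sketch below joins such lines with commas; join_hosts is a made-up name.

```shell
# Join hostnames given one per line on stdin into a comma-separated list.
join_hosts() {
  paste -sd, -
}

# Local demonstration with the hosts from the example above.
printf 'highmem01\nhighmem02\n' | join_hosts
# prints: highmem01,highmem02
```

Inside an allocation this could be used as scontrol show hostnames "$SLURM_JOB_NODELIST" | join_hosts.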

SBATCH

Template submit.sh

#!/bin/bash
#SBATCH -N 6  # Nodes
#SBATCH -n 100  # Total number of tasks
#SBATCH --exclusive # exclusive lock
#SBATCH --mail-type=end
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --workdir=/home/YOUR/PROJECT/PATH
#SBATCH -w highmem01,highmem02,node01,node02,node03,node04

FLUENT_HOSTS=highmem01,highmem02,node01,node02,node03,node04
FLUENT_BIN=~/ansys16/v162/fluent/bin/fluent

export FLUENT_GUI=off

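# Use SLURM_NPROCS when it is set; otherwise derive the task count from
# SLURM_TASKS_PER_NODE, e.g. "16(x6)" -> 16 * 6, "17(x4),16(x2)" -> 17 * 4 + 16 * 2
if [ -z "$SLURM_NPROCS" ]; then
  N=$(( $(echo "$SLURM_TASKS_PER_NODE" | sed -r 's/([0-9]+)\(x([0-9]+)\)/\1 * \2/g; s/,/ + /g') ))
else
  N=$SLURM_NPROCS
fi

echo -e "N: $N\n"

"$FLUENT_BIN" 3ddp -g -slurm -ssh \
  -cnf="$FLUENT_HOSTS" -t "$N" \
  -mpi=openmpi -pinfiniband -i journal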
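The template falls back to SLURM_TASKS_PER_NODE when SLURM_NPROCS is not set. SLURM reports that variable in a compact form such as 16(x6), or 17(x4),16(x2) when nodes carry different task counts. The sketch below shows one way to total it outside a job; total_tasks is a made-up name, and the sample values are illustrative rather than taken from a real run.

```shell
# Sum a SLURM_TASKS_PER_NODE value such as "16(x6)" or "17(x4),16(x2)".
total_tasks() {
  local expr
  # "16(x6)" -> "16 * 6"; commas between groups become " + ".
  expr=$(printf '%s' "$1" | sed -E 's/([0-9]+)\(x([0-9]+)\)/\1 * \2/g; s/,/ + /g')
  echo $(( $expr ))
}

total_tasks '16(x6)'          # prints 96
total_tasks '17(x4),16(x2)'   # prints 100
total_tasks '20'              # prints 20
```

Inside a job, the same function can be called as total_tasks "$SLURM_TASKS_PER_NODE".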

Administration

Modules

$ module avail

--------------------------------------- /usr/share/Modules/modulefiles --------------
dot         module-git  module-info modules     null        use.own

-------------------------------------- /etc/modulefiles --------------------------------
mpi/mpich-3.2-x86_64

-------------------------------------- /act/modulefiles --------------------------------
impi               mpich/gcc          mvapich2-2.1/gcc   openmpi-1.6/gcc    openmpi-1.8/gcc
intel              mpich/intel        mvapich2-2.1/intel openmpi-1.6/intel  openmpi-1.8/intel

Load a module

module load mpich/intel

Adding Users

To run SLURM commands across the entire cluster, a user needs the same account on every node. An Ansible playbook has been created for this. The playbook must be run as root: run su root and enter the root password. Once logged in as root, enter the Ansible virtual environment by running source ~/ansible-env/bin/activate.
Inside the virtual environment, edit the file that lists the users to add by running nano ~/ansible-env/addusers/roles/common/tasks/main.yml. Add the user and their UID under both sections of this file, following the scheme below. A user's UID can be found by running id <netid>. Keep the spacing consistent with the lines already in the file.

- { name: 'netid', uid: enter_uid_here }
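To avoid typos when writing the entry by hand, the line can be generated from the NetID and UID. This is a sketch only; user_entry is a made-up name, and "jdoe" and 1234 are placeholder values.

```shell
# Print a playbook entry in the scheme shown above.
user_entry() {
  local netid=$1 uid=$2
  printf -- "- { name: '%s', uid: %s }\n" "$netid" "$uid"
}

user_entry jdoe 1234
# prints: - { name: 'jdoe', uid: 1234 }
```

On the cluster, the UID can be filled in directly with user_entry <netid> "$(id -u <netid>)". Indent the resulting line to match the entries already in the file.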

After you have added the user to both sections of the file, press Ctrl-X to exit, Y to save, and Enter to confirm the file name. The playbook is now ready to be run. Run the following command and watch the terminal for errors:

ansible-playbook ~/ansible-env/addusers/site.yml -i ~/ansible-env/addusers/hosts

If the playbook executed without any errors, then the user has been added across all of the nodes.

HOWTO: Set up SLURM on your personal computer

https://source2.cse.unr.edu/w/cse/tutorials/slurm-mpi-setup/

Last Author
ctrujillo
Last Edited
May 30 2017, 11:43 AM
