GPUH Cluster
This server is behind the campus firewall and is not directly accessible from off-campus. If you are off-campus, ssh into a jumphost first (e.g. alpine.cse.unr.edu).
ssh $CSE-ID@gpuh.cse.unr.edu
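If your local OpenSSH client supports ProxyJump (OpenSSH 7.3 or newer), you can hop through the jumphost in one step; this is a convenience sketch, substitute your own CSE username:
#Jump through alpine, then land on gpuh
ssh -J $CSE-ID@alpine.cse.unr.edu $CSE-ID@gpuh.cse.unr.edu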
If you are unable to run jobs across multiple nodes following the instructions below, please email ehelp@cse.unr.edu.
Configuration
NFS Mounts
/scratch   #Compile and run stuff
/opt       #Install stuff here
/home      #Overrides IPA1 homedir, i.e. /cse/home/$USER
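To confirm these shares are mounted on a node, a quick sanity check like the following should work (output will vary):
#Show the filesystems backing the paths above
df -h /scratch /opt /home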
Playbook
https://source2.cse.unr.edu/diffusion/GPUH/
Ansible Playbook
sudo su
source /srv/python_env/bin/activate
ansible-playbook /srv/playbook/site.yml
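To preview what the playbook would change before applying it, Ansible's standard check mode can be used; this is a sketch, not part of the original run procedure:
#Dry run: report pending changes without applying them
ansible-playbook /srv/playbook/site.yml --check --diff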
Ganglia
https://www.cse.unr.edu/gpuh/ganglia/
Libraries
OpenMPI > /opt/openmpi
Compiled with SLURM PMI and CUDA
CUDA > /usr/local/cuda
dpkg -l | grep $WHATEVER_YOU_ARE_LOOKING_FOR
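To confirm which versions the install paths above actually provide, the compilers can report themselves (paths follow the locations listed above):
#Report the MPI wrapper and CUDA compiler versions
/opt/openmpi/bin/mpicc --version
/usr/local/cuda/bin/nvcc --version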
OpenMPI
cd /opt/src/openmpi-2.0.2
./configure --prefix=/opt/openmpi --with-pmi=/usr \
    --with-pmi-libdir=/usr/lib/x86_64-linux-gnu --with-pmix=internal --with-cuda
make uninstall
make -j 20 all install
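After the install finishes, ompi_info can verify that the PMI and CUDA components were built in; a quick check along these lines:
#Confirm SLURM PMI and CUDA support in the installed build
/opt/openmpi/bin/ompi_info | grep -i -E 'pmi|cuda'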
Compiling SLURM Jobs
#!/bin/bash
#We store some example code from Lawrence Livermore National Lab in
#/opt/llnl/tutorials/mpi
#Copy it to your home directory
cp -r /opt/llnl/tutorials/mpi/samples/C ~/mpi
cd ~/mpi
#Compile an example
mpicc -lpmi -o mpi_hello mpi_hello.c
#Run the example
srun -n16 mpi_hello
Output
$ srun -n16 mpi_hello
Running Tasks
SRUN
https://slurm.schedmd.com/srun.html
srun is synchronous and blocking. Use sbatch to submit a job to the queue.
#-n indicates the number of tasks (one core per task by default)
#--mem indicates the memory needed per node in megabytes
#--time sets the maximum run time of the job
$ srun -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello
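To spread the tasks across several nodes explicitly (see the multi-node note at the top of this page), add -N; a sketch using the four nodes in this cluster:
#-N indicates the number of nodes to distribute the tasks across
$ srun -N4 -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello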
SBATCH
https://slurm.schedmd.com/sbatch.html
$ cat ~/mpi/run.sh
#!/bin/bash
#SBATCH -n 16
#SBATCH --mem=2048MB
#SBATCH --time=00:30:00
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --mail-type=ALL

srun ~/mpi/mpi_hello
Submit the job:
$ sbatch ~/mpi/run.sh
Submitted batch job 536

$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   536      main   run.sh cse-admi  R  0:03     2 head,node[01-03]
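Unless overridden with --output, sbatch writes the job's stdout/stderr to slurm-<jobid>.out in the submission directory, so for the job above:
$ cat slurm-536.out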
Check the Cluster status:
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      2  alloc node[01-02]
main*        up   infinite      6   idle node[03], head
Node Hardware:
The cluster consists of 4 nodes, each with 64 GB of RAM, two 10-core CPUs, and four NVIDIA GTX 1080 GPUs.
Network: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
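To see what SLURM itself reports for a node's CPUs, memory, and state, scontrol can dump the node record (node01 is one of the node names from the sinfo output above):
$ scontrol show node node01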
HOWTO: Set up SLURM on your personal computer
https://source2.cse.unr.edu/w/cse/tutorials/slurm-mpi-setup/