GPUH Cluster
Updated 2,705 Days Ago

This server is behind the campus firewall, so it is not directly accessible from off-campus. If you are off-campus, you will need to ssh into a jumphost first (e.g. alpine.cse.unr.edu).

ssh $CSE-ID@gpuh.cse.unr.edu
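From off-campus, the two hops can usually be combined with OpenSSH's -J (ProxyJump) option, available in OpenSSH 7.3 and later. The sketch below only prints the command. The helper name gpuh_ssh_cmd and the CSE ID jdoe are made up for illustration, and it assumes the same CSE ID is valid on both the jumphost and gpuh.

```shell
# Hypothetical helper (not part of the cluster setup): print the ssh command
# that reaches gpuh through a jumphost using OpenSSH's -J (ProxyJump) option.
gpuh_ssh_cmd() (
    user="$1"                          # your CSE ID
    jump="${2:-alpine.cse.unr.edu}"    # jumphost named in the text above
    printf 'ssh -J %s@%s %s@gpuh.cse.unr.edu\n' "$user" "$jump" "$user"
)

# Example with a made-up CSE ID:
gpuh_ssh_cmd jdoe
```

Run the printed command from your off-campus machine. A different jumphost can be passed as the second argument.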

If you are unable to run jobs across multiple nodes following the instructions below, please email ehelp@cse.unr.edu.

Configuration

NFS Mounts

/scratch    #Compile and run stuff
/opt        #Install stuff here
/home       #Overrides IPA1 homedir, i.e. /cse/home/$USER

Playbook
https://source2.cse.unr.edu/diffusion/GPUH/

Ansible Playbook

sudo su

source /srv/python_env/bin/activate
ansible-playbook /srv/playbook/site.yml

Ganglia

https://www.cse.unr.edu/gpuh/ganglia/

Libraries

OpenMPI > /opt/openmpi
Compiled with SLURM PMI and CUDA

CUDA > /usr/local/cuda

dpkg -l | grep $WHATEVER_YOU_ARE_LOOKING_FOR

OpenMPI

cd /opt/src/openmpi-2.0.2
./configure --prefix=/opt/openmpi --with-pmi=/usr \
    --with-pmi-libdir=/usr/lib/x86_64-linux-gnu --with-pmix=internal --with-cuda

make uninstall

make -j 20 all install
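After a rebuild it can be worth confirming that the new install really has CUDA support. A minimal sketch, assuming the ompi_info from /opt/openmpi/bin is the one on your PATH: Open MPI reports this in the mpi_built_with_cuda_support parameter of its parsable output. The function name ompi_has_cuda is hypothetical, and the sample line is for demonstration only.

```shell
# Reads `ompi_info --parsable --all` output on stdin and reports whether the
# build advertises CUDA support. The function name is hypothetical.
ompi_has_cuda() {
    if grep -q 'mpi_built_with_cuda_support:value:true'; then
        echo "CUDA support: yes"
    else
        echo "CUDA support: no"
    fi
}

# On the cluster (assuming /opt/openmpi/bin is on PATH):
#   ompi_info --parsable --all | ompi_has_cuda
# Self-contained demonstration with a sample line:
echo 'mca:mpi:base:param:mpi_built_with_cuda_support:value:true' | ompi_has_cuda
```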

Compiling SLURM Jobs

#!/bin/bash

#We store some example code from Lawrence Livermore National Laboratory in
#/opt/llnl/tutorials/mpi

#Copy it to your home directory
cp -r /opt/llnl/tutorials/mpi/samples/C ~/mpi

cd ~/mpi

#Compile an example
mpicc -o mpi_hello mpi_hello.c -lpmi

#Run the example
srun -n16 mpi_hello

Output

$ srun -n16 mpi_hello

Running Tasks

SRUN

https://slurm.schedmd.com/srun.html

srun runs the job synchronously and blocks until it finishes. To queue a job and get your shell back immediately, use sbatch.

#-n sets the number of tasks (by default, one core per task)
#--mem sets the memory required per node, in megabytes
#--time sets the time limit for the job (HH:MM:SS)
$ srun -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello
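The --mem and --time values are easy to get wrong when converting by hand. As a rough sketch, the hypothetical helper below (srun_cmd is not a SLURM tool) takes a task count, memory in gigabytes, and a time limit in seconds, and prints the matching srun command without running it.

```shell
# Hypothetical helper: build an srun command line from a task count,
# memory per node in GB, and a time limit in seconds. It only prints the
# command; nothing is submitted.
srun_cmd() (
    ntasks="$1"; mem_gb="$2"; secs="$3"; shift 3
    mem_mb=$((mem_gb * 1024))                 # --mem is in megabytes by default
    limit=$(printf '%02d:%02d:%02d' \
        $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60)))
    echo "srun -n${ntasks} --mem=${mem_mb} --time=${limit} $*"
)

# 16 tasks, 2 GB per node, 5 minutes:
srun_cmd 16 2 300 ./mpi_hello
```

With these arguments the printed command carries the same options as the example above.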

SBATCH

https://slurm.schedmd.com/sbatch.html

$ cat ~/mpi/run.sh

#!/bin/bash
#SBATCH -n 16
#SBATCH --mem=2048MB
#SBATCH --time=00:30:00
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --mail-type=ALL

srun ~/mpi/mpi_hello
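If you submit the same program with different resource requests, the batch script can be generated rather than edited by hand. This is only a sketch: make_batch_script is a hypothetical helper, it leaves out the mail options, and it writes the script to standard output so you can inspect it before saving.

```shell
# Hypothetical generator: print an sbatch script for a given task count,
# memory per node (MB), time limit (HH:MM:SS), and command.
make_batch_script() (
    ntasks="$1"; mem_mb="$2"; limit="$3"; shift 3
    cat <<EOF
#!/bin/bash
#SBATCH -n ${ntasks}
#SBATCH --mem=${mem_mb}
#SBATCH --time=${limit}

srun $*
EOF
)

# The tilde is quoted so it is expanded when the job runs, not here.
make_batch_script 16 2048 00:30:00 '~/mpi/mpi_hello'
```

Redirect the output to a file (for example, make_batch_script ... > ~/mpi/run.sh) and submit that file with sbatch.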

Submit the batch job:

$ sbatch ~/mpi/run.sh 
Submitted batch job 536

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               536      main   run.sh cse-admi  R       0:03      2 head,node[01-03]

Check the cluster status:

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  alloc node[01-02]
main*         up   infinite      6   idle node[03], head
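To see at a glance how many nodes are free before submitting, the sinfo output can be summed with awk. A minimal sketch, assuming the default column layout shown above (NODES in column 4, STATE in column 5). The helper name nodes_in_state is hypothetical, and it matches the state exactly, so flagged states such as idle* are not counted.

```shell
# Hypothetical helper: sum the NODES column of `sinfo` output (read from
# stdin) for rows whose STATE column equals the given state.
nodes_in_state() {
    awk -v state="$1" 'NR > 1 && $5 == state { total += $4 } END { print total + 0 }'
}

# On the cluster: sinfo | nodes_in_state idle
# Demonstration using the sample output above:
nodes_in_state idle <<'EOF'
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  alloc node[01-02]
main*         up   infinite      6   idle node[03], head
EOF
```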