
GPUH Cluster

This server is behind the campus firewall, so it is not directly accessible from off-campus. If you are off-campus, you will need to ssh into a jumphost first (e.g. alpine.cse.unr.edu).

ssh $CSE-ID@gpuh.cse.unr.edu
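If you prefer a single command from off-campus, OpenSSH's ProxyJump option can hop through the jumphost in one step (a sketch; it assumes your CSE ID is the login on both hosts):

ssh -J $CSE-ID@alpine.cse.unr.edu $CSE-ID@gpuh.cse.unr.edu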

If you are unable to run jobs across multiple nodes following the instructions below, please email ehelp@cse.unr.edu.

Configuration

NFS Mounts

/scratch           #Compile and run stuff
/opt               #Install stuff here
/home              #Overrides the IPA1 homedir, i.e. /cse/home/$USER
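To confirm the NFS mounts are present on a node, something like the following should work (df and mount are standard tools; the paths are the ones listed above):

df -h /scratch /opt /home
mount | grep nfs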

Playbook
https://source2.cse.unr.edu/diffusion/GPUH/

Ansible Playbook

#Become root
sudo su

#Activate the Python environment that provides Ansible, then run the playbook
source /srv/python_env/bin/activate
ansible-playbook /srv/playbook/site.yml
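Before applying changes, you can do a dry run with Ansible's --check flag (a sketch; it assumes the same playbook path as above):

ansible-playbook --check /srv/playbook/site.yml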

Ganglia

https://www.cse.unr.edu/gpuh/ganglia/

Libraries

OpenMPI > /opt/openmpi
Compiled with SLURM PMI and CUDA

CUDA > /usr/local/cuda

dpkg -l | grep $WHATEVER_YOU_ARE_LOOKING_FOR
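For example, to see which CUDA-related packages are installed ("cuda" here is just an illustrative search term):

dpkg -l | grep -i cuda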

OpenMPI

cd /opt/src/openmpi-2.0.2
./configure --prefix=/opt/openmpi --with-pmi=/usr \
--with-pmi-libdir=/usr/lib/x86_64-linux-gnu --with-pmix=internal --with-cuda

#Remove any previous installation
make uninstall

#Build and install
make -j 20 all install
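To verify that the freshly built OpenMPI picked up CUDA support, ompi_info can be grepped (a sketch; it assumes the install prefix used above):

/opt/openmpi/bin/ompi_info | grep -i cuda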

Compiling SLURM Jobs

#!/bin/bash

#We store some example code from Lawrence Livermore National Laboratory in
#/opt/llnl/tutorials/mpi

#Copy it to your home directory
cp -r /opt/llnl/tutorials/mpi/samples/C ~/mpi

cd ~/mpi

#Compile an example
mpicc -lpmi -o mpi_hello mpi_hello.c

#Run the example
srun -n16 mpi_hello
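If mpicc or the MPI libraries cannot be found when following the steps above, a likely fix (an assumption based on the install prefix in the Libraries section, not verified on this cluster) is:

export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH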

Output

$ srun -n16 mpi_hello

Running Tasks

SRUN

https://slurm.schedmd.com/srun.html

srun is synchronous and blocking. Use sbatch to submit a job to the queue.

#-n indicates the number of cores
#--mem indicates the memory needed per node in megabytes
#--time indicates the specified run time of the job
$ srun -n16 --mem=2048 --time=00:05:00 ~/mpi/mpi_hello
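To confirm a job really spreads across machines (relevant to the multi-node note at the top of this page), hostname is a handy stand-in for a real program; -N requests a node count (a sketch):

$ srun -N2 -n16 --mem=2048 --time=00:05:00 hostname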

SBATCH

https://slurm.schedmd.com/sbatch.html

$ cat ~/mpi/run.sh

#!/bin/bash
#SBATCH -n 16
#SBATCH --mem=2048M
#SBATCH --time=00:30:00
#SBATCH --mail-user=YOUR_EMAIL@DOMAIN.COM
#SBATCH --mail-type=ALL

srun ~/mpi/mpi_hello
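If you want the job's stdout and stderr written somewhere specific, SLURM's --output and --error directives can be added to the script above (an optional addition, not part of the original example; %j expands to the job ID):

#SBATCH --output=mpi_hello_%j.out
#SBATCH --error=mpi_hello_%j.err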

Submit the batch job:

$ sbatch ~/mpi/run.sh 
Submitted batch job 536

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               536      main   run.sh cse-admi  R       0:03      2 head,node[01-03]
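If you need to stop a queued or running job, scancel takes the job ID shown by squeue (536 here is just the ID from the example above):

$ scancel 536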

Check the Cluster status:

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*         up   infinite      2  alloc node[01-02]
main*         up   infinite      6   idle node[03], head
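For a per-node view of the same information, sinfo's node-oriented long format is useful (standard SLURM flags):

$ sinfo -N -l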

Node Hardware:

The cluster consists of 4 nodes, each with 64 GB of RAM, 2x 10-core CPUs, and 4x NVIDIA GTX 1080 GPUs.
Interconnect: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
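To check the GPUs on a particular node, nvidia-smi can be launched through SLURM (a sketch; --nodelist targets a specific node and assumes nvidia-smi is on the nodes' PATH, which is typical for a CUDA install):

$ srun --nodelist=node01 nvidia-smi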

HOWTO: Setup SLURM on your personal computer

https://source2.cse.unr.edu/w/cse/tutorials/slurm-mpi-setup/
