GPU Cluster

The GPU compute cluster currently offers several hybrid nodes with a balanced ratio of CPU, RAM, and GPU resources. Access may be granted on request through additional user permissions. The cluster uses the workload manager Slurm. You can start your jobs from the login node login.gpu.cit-ec.net. For compute tasks it is mandatory to use the Slurm scheduler.

Although the cluster nodes run the TechFak netboot installation, the locations /homes and /vol are not available. Instead, a separate homes and vol area is located under /media/compute/. On initial login your compute home directory will be provisioned under /media/compute/homes/user. It is separate from your regular TechFak home location /homes/user. The compute home is accessible via files.techfak.de and from a regular TechFak netboot system such as compute.techfak.de.

Slurm Basics

Slurm jobs are scheduled with the Slurm client tools. For a brief introduction, check out the Slurm Quickstart.

There are several ways to schedule a task in a Slurm-controlled system.

The main paradigm for job scheduling in a Slurm cluster is sbatch. It schedules a job and requests the resources the user asks for.

The cluster provides two types of resources: CPUs and GPUs which can be requested for jobs in variable amounts.

The GPUs in the cluster come in two flavours: the GPU objects tesla and gtx.

You may request a single GPU object via the option --gres=gpu:1. The Slurm scheduler reserves one GPU object exclusively for your job and schedules jobs according to the free resources.
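As a minimal sketch, the --gres value for a specific GPU flavour could be assembled as follows. The helper name gres_for is hypothetical, and the typed form gpu:tesla:N is an assumption made by analogy with the gpu:gtx:1 request used in example.job.sbatch.

```bash
#!/bin/bash
# gres_for TYPE COUNT  (hypothetical helper, not part of Slurm)
# Prints a --gres option for COUNT GPUs. TYPE may be "tesla", "gtx",
# or empty for "any GPU". The typed form is an assumption based on
# the gpu:gtx:1 request in example.job.sbatch.
gres_for() {
    local type=$1 count=${2:-1}
    if [ -n "$type" ]; then
        printf '%s\n' "--gres=gpu:${type}:${count}"
    else
        printf '%s\n' "--gres=gpu:${count}"
    fi
}

gres_for "" 1       # prints: --gres=gpu:1
gres_for tesla 1    # prints: --gres=gpu:tesla:1
gres_for gtx 2      # prints: --gres=gpu:gtx:2
```

The printed option can be placed in an #SBATCH line or passed to sbatch on the command line.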

CPUs are requested with the -c or --cpus-per-task= option. For further information, have a look at the man pages of srun and sbatch. Reading the Slurm documentation is also highly recommended.

The commands sinfo and squeue provide detailed information about the cluster's state and the running jobs.
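For a quick look at the cluster before submitting, the two commands can be wrapped as in the following sketch. The function name cluster_status is hypothetical, and the guard only exists so that the snippet also runs on machines without the Slurm client tools.

```bash
#!/bin/bash
# cluster_status (hypothetical helper): node overview plus your own jobs.
cluster_status() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "Slurm client tools not found; run this on the login node"
        return 0
    fi
    sinfo                  # partitions and node states
    squeue -u "$USER"      # only the jobs of the current user
}

cluster_status
```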

CPU management and CPU-only jobs

Though the facility is called GPU cluster, it is also appropriate for CPU-only computing, as it provides not only 12 GPUs but also 240 CPUs (logical cores). Effective utilization of the CPU resources can be tricky, so you should familiarize yourself with CPU management.
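A CPU-only job might look like the following sketch. The CPU count, the time limit, and the use of OMP_NUM_THREADS are illustrative assumptions (OpenMP is only one way to use several cores). SLURM_CPUS_PER_TASK is set by Slurm inside a job when --cpus-per-task is given.

```bash
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --time=1:00:00

# Fall back to 1 so the script also runs outside of a Slurm job.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Tell OpenMP-based programs how many threads they may start.
export OMP_NUM_THREADS="$threads"
echo "using $OMP_NUM_THREADS CPU(s)"
```

Matching the thread count to the requested CPUs keeps the payload from starting more threads than it has cores.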

Choosing the appropriate partition

The cluster offers two partitions. Partitions can be considered as separate queues with slightly different features.

Partition selection is done with the parameter -p or --partition= in your srun commands and sbatch scripts. The default partition is 'cpu'. Jobs that are not assigned to a partition will be started there.

We have the 'cpu' and 'gpu' partitions. If you have a CPU-only job (not requesting any GPU resources with --gres=gpu:n), you should start it on the 'cpu' partition.

A job using a GPU should be started on the 'gpu' partition, with one exception. Jobs which request one GPU (with --gres=gpu:1) and more than 2 CPUs (with the -c or --cpus-per-task option) should use the 'cpu' partition.

The reason for this policy is not obvious and is explained under GPU-blocking.
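The partition rule can be written down as a small decision function. This is only a sketch of the policy as stated in the text. The helper name choose_partition is hypothetical and is not provided on the cluster.

```bash
#!/bin/bash
# choose_partition GPUS CPUS  (hypothetical helper)
# Applies the partition rule from the text:
#   no GPU                    -> cpu
#   one GPU, more than 2 CPUs -> cpu
#   everything else           -> gpu
choose_partition() {
    local gpus=$1 cpus=$2
    if [ "$gpus" -eq 0 ]; then
        echo cpu
    elif [ "$gpus" -eq 1 ] && [ "$cpus" -gt 2 ]; then
        echo cpu
    else
        echo gpu
    fi
}

choose_partition 0 16   # prints: cpu
choose_partition 1 2    # prints: gpu
choose_partition 1 4    # prints: cpu
choose_partition 2 8    # prints: gpu
```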

The example example.job.sbatch requests one GTX 1080 Ti for the job and calls the payload example.job.sh via srun.

File: example.job.sbatch

#!/bin/bash
#SBATCH --gres=gpu:gtx:1
#SBATCH --partition=gpu
#SBATCH --time=4:00:00
srun example.job.sh

File: example.job.sh

#!/bin/bash
module load cuda/9.0
CUDA_DEVICE=$(echo "$CUDA_VISIBLE_DEVICES," | cut -d',' -f $((SLURM_LOCALID + 1)) );
T_REGEX='^[0-9]$';
if ! [[ "$CUDA_DEVICE" =~ $T_REGEX ]]; then
        echo "error no reserved gpu provided"
        exit 1;
fi
echo "Process $SLURM_PROCID of Job $SLURM_JOBID with the local id $SLURM_LOCALID using gpu id $CUDA_DEVICE (we may use gpu: $CUDA_VISIBLE_DEVICES on $(hostname))"
echo "computing on $(nvidia-smi --query-gpu=gpu_name --format=csv -i $CUDA_DEVICE | tail -n 1)"
sleep 15
echo "done"

In the payload script example.job.sh, the variable CUDA_DEVICE tells your job which GPU device to use. The script derives it from CUDA_VISIBLE_DEVICES and SLURM_LOCALID.
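To see how the cut expression in example.job.sh picks a device, the same logic can be tried outside of a job with hand-set values. The function name pick_cuda_device and the sample values are illustrative. Inside a real job, Slurm provides CUDA_VISIBLE_DEVICES and SLURM_LOCALID.

```bash
#!/bin/bash
# pick_cuda_device VISIBLE_DEVICES LOCAL_ID  (hypothetical helper)
# Returns the (LOCAL_ID + 1)-th entry of a comma-separated device list,
# mirroring the CUDA_DEVICE line in example.job.sh.
pick_cuda_device() {
    local visible=$1 local_id=$2
    # The appended comma guarantees that cut always sees a delimiter;
    # without one, cut would print the whole line for any field number.
    echo "${visible}," | cut -d',' -f "$((local_id + 1))"
}

pick_cuda_device "0,1" 0   # prints: 0
pick_cuda_device "0,1" 1   # prints: 1
pick_cuda_device "1" 1     # prints an empty line: no GPU for this task
```

The empty result in the last call is the case that the regular expression check in example.job.sh catches and reports as an error.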

If you want to be notified when a job completes, you may add the following sbatch parameters to your sbatch file.

#SBATCH --mail-user=user@techfak.uni-bielefeld.de
#SBATCH --mail-type=END

Please only use your email address @techfak.uni-bielefeld.de, because others will be dropped. You may set up a mail forward within the TechFak webmailer. For a tutorial, see /media/compute/vol/tutorial.

GPU-blocking

The Slurm resource manager provides an abstraction layer for resource allocation. The user simply submits jobs, including their resource requests, to the cluster. Slurm distributes the jobs to the hardware resources automatically. An appropriate node is chosen for each job, and if the requested resources are not available, the job is queued and started later, once the requested resources become available again.

Even though the resource abstraction layer exists, it's a good idea to keep the real hardware structure of the cluster in mind. It consists of 6 nodes (real machines), each providing 2 GPUs and 40 CPUs (2 sockets with 10 real cores + 10 virtual cores each).

Starting with the simple default configuration, we had nodes in a state where GPUs were unusable because CPU-intensive jobs were able to consume all CPU resources on a single node. If all 40 CPUs on a node are in use, no further jobs can be started on that node. The node is then in the allocated state. If there are unused GPUs on an allocated node, they cannot be utilized, because every job needs at least one CPU to run.

To prevent this GPU-lock, we established a second partition (this is the Slurm term for a job queue) called 'cpu'. The key feature of this partition is that on a single node a maximum of 36 CPUs will be allocated to jobs from the 'cpu'-partition.
Jobs started via the 'cpu'-partition therefore cannot eat up all CPUs of a node; there are always 4 CPUs left.

On the other hand, jobs using GPU resources should be started on the other partition, called 'gpu'. These jobs are able to utilize the 4 'spare' CPUs on a node.

But even with this configuration, it's still possible to generate a GPU-lock: Imagine a node where 36 CPUs are allocated to CPU-only jobs. No GPU is yet in use on this node. Then a job from the 'gpu'-partition which requests 1 GPU and 4 CPUs is started on this node. The node is then allocated (36+4 CPUs are in use) but only 1 GPU is used. The second GPU is blocked because there's no CPU left to start another job on the node.

To avoid this, just follow this simple rule: If a job utilizes one GPU and more than two CPUs, start it on the 'cpu'-partition. Jobs requesting 2 GPUs don't fall under this category, as they already utilize both GPUs on a node and no blocking condition can occur.
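The arithmetic behind the rule can be checked with a small sketch. It is not a model of the Slurm scheduler. It only assumes the worst case described above, in which the 'cpu'-partition already occupies its full share of a node. The helper name both_gpus_usable is hypothetical.

```bash
#!/bin/bash
# Figures from the text: 40 CPUs per node, of which jobs from the
# 'cpu' partition may occupy at most 36.
NODE_CPUS=40
CPU_PARTITION_LIMIT=36

# both_gpus_usable FIRST_JOB_CPUS SECOND_JOB_CPUS  (hypothetical helper)
# Worst case: the 'cpu' partition already holds its full 36 CPUs and two
# single-GPU jobs from the 'gpu' partition share the remaining CPUs.
both_gpus_usable() {
    local first=$1 second=$2
    local spare=$((NODE_CPUS - CPU_PARTITION_LIMIT))
    if [ $((first + second)) -le "$spare" ]; then
        echo yes
    else
        echo no
    fi
}

both_gpus_usable 4 1   # prints: no  (36 + 4 = 40, the second GPU is blocked)
both_gpus_usable 2 2   # prints: yes (36 + 2 + 2 = 40)
```

With at most two CPUs per single-GPU job on the 'gpu'-partition, the four spare CPUs are always enough for both GPUs of a node.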

Cluster Nodes

Hostname      CPU-Cores  RAM     GPU
papaschlumpf  20         256 GB  2 x Tesla P-100
schlumpfine   20         256 GB  2 x Tesla P-100
schlaubi      20         128 GB  2 x GTX 1080 Ti
hefty         20         128 GB  2 x GTX 1080 Ti
clumsy        20         128 GB  2 x GTX 1080 Ti
handy         20         128 GB  2 x GTX 1080 Ti

Environment Modules Basics

Environment Modules helps you manage multiple versions of tools such as compilers or libraries. To do so, it manages user environment variables such as PATH, LD_LIBRARY_PATH, and INCLUDE. Its main benefit is that you can extend and reduce your environment at runtime by loading and unloading modules.

Currently available modules

  • cuda/6.5
  • cuda/7.5
  • cuda/8.0
  • cuda/9.0
  • license
  • matlab/R2017b
  • matlab/R2018b

Examples

show available modules

module avail

show loaded modules

module list

load a module

module load cuda/9.0

unload a module

module remove cuda/9.0

See also /media/compute/vol/tutorial.