More (and easier!) GPU scheduling options

GPU scheduling is now easier and more powerful on Sherlock, thanks to new job submission options specifically targeted at GPU workloads.

The most visible change is that you can now use the --gpus option when submitting jobs, like this:

$ srun -p gpu --gpus=2 ...

A number of additional submission options can now be used (and combined, as shown in the example after this list), such as:

  • --cpus-per-gpu, to request a number of CPUs per allocated GPU,
  • --gpus-per-node, to request a given number of GPUs per node,
  • --gpus-per-task, to request a number of GPUs per spawned task,
  • --mem-per-gpu, to allocate a given amount of host memory per GPU.
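
For instance, several of these options can be combined in a batch script. The sketch below requests two tasks with one GPU each, plus four CPUs and 16 GB of host memory per allocated GPU; the values and the application name are illustrative assumptions, not recommendations:

#!/bin/bash
#SBATCH -p gpu
# two tasks, one GPU per task (illustrative values)
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1
# 4 CPUs and 16 GB of host memory for each allocated GPU (illustrative values)
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=16G

srun ./my_gpu_app    # hypothetical application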

You can now also allocate a different number of GPUs per node in multi-node jobs, set the frequency of the GPUs allocated to your job, and explicitly define task-to-GPU binding maps.
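
For example, to bind each task to the GPU closest to its CPUs and request the highest supported GPU frequency, a command along these lines could be used (a sketch; the GPU count and whether these settings help depend on your workload):

$ srun -p gpu --gpus-per-node=4 --gpu-bind=closest --gpu-freq=high ...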

All of those options are detailed in the updated documentation at https://www.sherlock.stanford.edu/docs/user-guide/gpu/, and a more complete description is available in the Slurm manual.

Under the hood, the scheduler is now fully aware of the specifics of each GPU node: it knows how the GPUs on a node are interconnected and how they map to CPU sockets, and it can select preferred GPUs for co-scheduling. It has all the information it needs to make optimal decisions about the placement of tasks within a job.
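
You can inspect that topology yourself from within a job. For instance, assuming the nvidia-smi utility is available on the allocated node, the following command prints the GPU interconnect matrix and CPU affinities:

$ srun -p gpu --gpus=2 nvidia-smi topo -m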

The end result? Better performance with less hassle for multi-GPU jobs.

So please take the new options for a spin, and let us know how they work for your jobs!