Slurm is an open-source resource manager and job scheduling system developed by SchedMD. Its trackable resources (TRES) include nodes, CPUs, memory, and generic resources (GRES). Slurm has three key functions: it allocates exclusive or non-exclusive access to resources (compute nodes) to users, it provides a framework for starting, executing, and monitoring work on the allocated nodes, and it arbitrates contention for resources by managing a queue of pending work. Nodes are grouped into partitions, which can be thought of as job queues; each partition has a set of constraints such as job size limits, time limits, default memory limits, and a maximum number of nodes. Submitting a job requires you to specify a partition, an account, a Quality of Service (QoS), the number of nodes, and a wallclock time limit; a memory request is optional (the partition default is used if none is given). Jobs within a partition are then allocated to nodes according to the scheduling policy until the partition's resources are exhausted.
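The submission options above are typically collected in a batch script. The following is a minimal sketch; the partition, account, and QoS names are placeholders you would replace with values valid on your cluster:

```shell
#!/bin/bash
# Example Slurm batch script. The #SBATCH lines are shell comments,
# but Slurm reads them as job options at submission time.
#SBATCH --job-name=myjob
#SBATCH --partition=compute       # assumed partition name
#SBATCH --account=myaccount      # assumed account name
#SBATCH --qos=normal             # assumed QoS name
#SBATCH --nodes=1                # number of nodes
#SBATCH --time=01:00:00          # wallclock limit (HH:MM:SS)
#SBATCH --mem=4G                 # optional; partition default used if omitted

# Commands below run on the allocated node(s) when the job starts.
msg="Running on $(hostname)"
echo "$msg"
```

You would submit this script with `sbatch myjob.sh`; because the `#SBATCH` lines are comments, the same file also runs as an ordinary shell script.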
There are several basic commands you will need to know to submit jobs, cancel jobs, and check status. These are:
- sbatch – submit a job to the batch queue system, e.g., sbatch myjob.sh
- squeue – check the current jobs in the batch queue system, e.g., squeue
- sinfo – view the current status of partitions and nodes, e.g., sinfo
- srun – run an interactive job, e.g., srun --pty bash
- scancel – cancel a job, e.g., scancel 123