By Gary M Jung on 2016-03-28T05:57:58Z
We are pleased to announce the “Low Priority QoS (Quality of Service)” pilot program, which allows users to use Lawrencium Cluster resources at no charge when running jobs at a lower priority.
This program, tested by Lawrencium Condo users, is now available to all Lawrencium users. We hope it helps users increase their productivity by letting them make use of available computing resources.
Two new QoSs, “lr_lowprio” and “mako_lowprio”, have been added that allow users to run jobs requesting up to 64 nodes and 3 days of runtime. This covers all general purpose partitions, such as lr2, lr3, lr4, and mako, as well as special purpose partitions such as lr_amd, lr_bigmem, lr_manycore, and mako_manycore. Jobs submitted with these new QoSs are NOT subject to the usage recharge that we currently collect through the “lr_normal” and “mako_normal” QoSs; however, these QoSs receive a lower priority than the general, debug, and condo QoSs, and their jobs are subject to preemption by jobs submitted at normal priority.
This has two implications for you:
1. When the system is busy, any job submitted with a Low Priority QoS will yield to jobs with higher priorities. If you are running debug, interactive, or other types of jobs that require quick turnaround of resources, or have an important deadline to meet, you may still want to use the general QoSs.
2. Further, when the system is busy and higher priority jobs are pending, the scheduler will automatically preempt jobs running under these low priority QoSs. The preempted jobs are chosen by the scheduler automatically, and we have no way to set selection criteria to control this behavior. Users can choose at submission time whether a preempted job should simply be killed or be automatically requeued after it is killed. Hence, we recommend that your application perform periodic checkpoints so that it can restart from the last checkpoint (see the sketch after this list). If you have a job that cannot checkpoint/restart by itself, or that cannot be interrupted during its runtime, you may want to use the general QoSs.
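As a rough sketch only (not an official template), a requeue-aware job script can detect that it was restarted after preemption via the SLURM_RESTART_COUNT environment variable, which SLURM sets once a job has been requeued; the checkpoint file name and the application’s restart option below are hypothetical:
====
### Resume from the last checkpoint if this job was requeued after preemption.
### SLURM_RESTART_COUNT is greater than 0 once a job has been requeued.
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ] && [ -f checkpoint.dat ]; then
    mpirun a.out --restart checkpoint.dat   ### hypothetical restart option
else
    mpirun a.out                            ### fresh start
fi
====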
To submit jobs to this QoS, you will need to provide all the normal parameters, e.g., --partition=lr3, --account=ac_projectname, etc.; for the QoS please use “--qos=lr_lowprio” or “--qos=mako_lowprio”, and make sure the job requests no more than 64 nodes and 3 days of runtime. If you would like the scheduler to requeue the job in its entirety when it is preempted, please add “--requeue” to your srun or sbatch command; otherwise the job will simply be killed when preemption happens. An example job script looks like the following:
====
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=lr3 ### other partition options: lr2, lr4, lr_bigmem, lr_manycore, lr_amd, mako, mako_manycore
#SBATCH --account=ac_projectname
#SBATCH --qos=lr_lowprio ### another QoS option: mako_lowprio
###SBATCH --requeue ### only needed if automatic requeue is desired
#SBATCH --ntasks=20
#SBATCH --time=24:00:00
mpirun a.out
====
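Assuming the script above is saved as, e.g., lowprio.sh (a hypothetical file name), it can be submitted and monitored with the standard SLURM commands:
====
sbatch lowprio.sh   ### submit under the low priority QoS
squeue -u $USER     ### monitor; a preempted job with --requeue returns to pending
====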
For condo users who have been helping us test these low priority QoSs on the lr2 and mako partitions: your current associations with your “lr_condo” account have not changed, so you can continue to use them, but they are limited to the lr2 and mako partitions only. If you intend to use other partitions, you will need to change the account from “lr_condo” to “ac_condo”, e.g., “lr_nanotheory” -> “ac_nanotheory”. We will phase out the associations connected to your “lr_condo” account within the next month without further notice, so please make the change now.
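As an illustration only (reusing the example account name above), the switch is a one-line change in your job script, and sacctmgr can list the associations available to you:
====
#SBATCH --account=ac_nanotheory   ### was: #SBATCH --account=lr_nanotheory

### List your accounts/QoSs to confirm the new association:
sacctmgr show associations user=$USER format=account,qos
====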
For more information about this program and how to use the low priority QoSs properly, please check our online user guide.
The pilot program will run for two months (March 22 to May 22), and we will decide how to proceed from there based on usage and feedback.
Please forward your requests, questions, and comments to hpcshelp@lbl.gov during this pilot period.