Experimental Scheduling Directives

The mix of job sizes run on SHARCNET systems makes effective job scheduling difficult. In particular, small jobs tend to run first regardless of scheduling priority, and a steady stream of small jobs can prevent larger jobs from ever starting. To help address these problems, we will be experimenting with some system scheduling directives that are in use at other supercomputing consortia.

The directives outlined below will take effect on the indicated clusters, both of which are more suitable for jobs requiring large numbers of cores than for smaller threaded or serial jobs.

Narwhal - beginning Dec 3/07

When submitting jobs, you will be required to specify a runtime estimate. This estimate will be treated as an upper limit: if your job exceeds it, the job will be cancelled. (Your job will receive SIGTERM, followed by SIGKILL 10 minutes later. If possible, you should trap SIGTERM and checkpoint immediately.) To specify a runtime limit to sqsub, use a form like "-r 2.5d" (two and a half days). You may also use -W instead of -r, to match the equivalent flag in LSF. Run sqsub --man for more details.
There will be a maximum runtime of seven days.
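
The advice to trap SIGTERM and checkpoint can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not SHARCNET-provided code: the file name checkpoint.json, the contents of the state dictionary, and the work loop are hypothetical placeholders for whatever your application actually computes. The handler only sets a flag, and the main loop writes the checkpoint at the next safe point, which has to finish well inside the 10 minutes before SIGKILL arrives.

```python
import json
import os
import signal

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request; the main loop does the saving."""
    global stop_requested
    stop_requested = True


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write to a temporary file, then rename it over the previous checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run(total_steps, path=CHECKPOINT_PATH):
    """Placeholder work loop that checkpoints and returns once SIGTERM is seen."""
    signal.signal(signal.SIGTERM, request_stop)
    state = {"step": 0, "total": 0}
    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if stop_requested:
            save_checkpoint(state, path)
            return state
    save_checkpoint(state, path)
    return state
```

Keeping the handler this small avoids doing file I/O at an arbitrary point in the computation; the cost is that a single unit of work must be short compared with the 10-minute grace period.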

These changes allow the scheduler to make better use of system resources, for example, by backfilling short jobs before a long one begins, while still executing longer jobs in a timely fashion. Please note the following points:

All non-trivial jobs must save checkpoints.
If you underestimate the runtime of your job, your job will be cancelled before it completes. Thus, if you do not save checkpoints, all the work will be lost.
If you overestimate the runtime of your job, your job start may be delayed, since the scheduler will use your estimate to decide whether your job may run in a given time slot. A reasonably accurate estimate thus benefits you (by expediting your job start time) and other users (by making better use of the resources).
If your job is known to need more than seven days of runtime, it cannot complete on Narwhal, and you should contact help@sharcnet.ca to investigate options.
We recommend saving a checkpoint every six hours. If you need help with checkpointing, send e-mail to help@sharcnet.ca and we will put you in touch with a High Performance Computing Analyst.
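
One possible way to follow the six-hour recommendation is a small time-based helper that the main loop consults after each unit of work. The sketch below is only illustrative: the class name PeriodicCheckpointer, the JSON file format, and the injectable clock parameter are assumptions made for this example, and a real application would save whatever data it needs to restart.

```python
import json
import os
import time

SIX_HOURS = 6 * 60 * 60  # recommended checkpoint interval, in seconds


class PeriodicCheckpointer:
    """Saves state to disk once at least `interval` seconds have elapsed."""

    def __init__(self, path, interval=SIX_HOURS, clock=time.monotonic):
        self.path = path
        self.interval = interval
        self.clock = clock  # injectable so the timing logic can be tested
        self.last_saved = clock()

    def save(self, state):
        """Write to a temporary file first, then rename over the old checkpoint."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, self.path)
        self.last_saved = self.clock()

    def maybe_save(self, state):
        """Save only if the interval has elapsed; return True when a save happened."""
        if self.clock() - self.last_saved >= self.interval:
            self.save(state)
            return True
        return False
```

In a job, the loop would call maybe_save(state) after each iteration, so the check costs almost nothing and a checkpoint is written only about every six hours. Writing to a temporary file and renaming it means that a job killed in the middle of a save still leaves the previous checkpoint intact.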

Requin - beginning Dec 3/07

Every Monday, all running jobs will be cancelled. Jobs that are queued/pending will not be affected. This will ensure that large multi-node jobs (which have a higher scheduling priority on Requin) are started first after the cancellation. Points to note:

All non-trivial jobs must save checkpoints.
If your job begins executing close to the cancellation time, it will not run very long before being cancelled. No allowance will be made for short-running jobs.
If your large checkpointing job is cancelled while the general cluster load is high and you resubmit it to continue processing, it will likely not begin executing again until after the following cancellation.
These directives will negatively impact serial jobs on Requin. We recommend that serial jobs be run on Whale instead.
We recommend saving a checkpoint every six hours. If you need help with checkpointing, send e-mail to help@sharcnet.ca and we will put you in touch with a High Performance Computing Analyst.
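
Because a cancelled job has to be resubmitted, it helps if the program itself decides at startup whether to begin fresh or continue from its last checkpoint. The following Python sketch shows one way this might look. The function names, the JSON state layout, and the stop_after parameter (which merely simulates a cancellation for demonstration) are all hypothetical and not part of any SHARCNET tool.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_state(state, path):
    """Write to a temporary file, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run(total_steps, path, stop_after=None):
    """Continue from the last checkpoint; stop_after simulates a cancellation."""
    state = load_state(path)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        done_this_run += 1
        save_state(state, path)  # a real job would checkpoint far less often
    return state
```

With this structure the same submission can be repeated unchanged after every cancellation: run(5, path, stop_after=2) followed by run(5, path) ends in the same state as a single uninterrupted run(5, path).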