Scheduling Policies



A Scheduling Policy determines the order in which jobs will be run on computational resources like CPUs and GPUs. (Note that we discuss scheduling of Virtual Machines (VMs) separately below.)

Scheduling on OpenHPC CPUs and GPUs

We use SLURM (the Simple Linux Utility for Resource Management) to determine the order in which jobs will run on CPU and GPU resources. Generally speaking, SLURM uses job priority levels to determine the order
in which jobs will be run. The higher the priority, the sooner the job will be run. Whenever a job is submitted, it must specify a compute time Allocation to charge. Compute time Allocations have an Allocation Priority
level that SLURM uses to order jobs. Because Allocation Priority levels are implicitly associated with an Allocation's Purpose (see the table below), we often talk about the scheduling algorithm in terms of Allocation Purposes:

Priority Level   Allocation Purposes with this Priority
--------------   ---------------------------------------------
1                Condo / CCS Discretionary
2                Meritorious Research / Educational / Startup
3                Condo Incentive
4                Open Access

Allocation Priorities (priority 1 is the highest priority -- i.e., most important)

Basically, Condo and CCS Discretionary jobs are given the highest priority by SLURM. Meritorious Research, Educational, and Startup allocations receive the next highest priority. The last two priority levels are used for jobs that are
not guaranteed compute time, but rather run on nodes that are otherwise unused. Among these, jobs run using Condo Incentive allocations have priority over all other jobs (Open Access jobs).
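
For example, a User charges a job to a particular Allocation by naming the corresponding SLURM Account at submission time; the account and script names below are hypothetical placeholders:

    sbatch --account=condo_mygroup myjob.sh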

Within jobs of a particular Priority/Purpose, the ordering of jobs uses a complex algorithm that includes factors such as how "wide" a job is, how long a job has been waiting in the queue, and the order in which jobs were submitted -- among
other factors. Moreover, SLURM uses a technique called backfilling to reorder jobs based on their expected completion time. For example, consider a situation where a (wide) job A arrives first and is waiting for 32 nodes to
become free. If SLURM knows the 32 nodes that A needs will not become available for one week, SLURM may decide to allow narrow jobs to run in the meantime. Even though (narrow) job B arrives after A, if B only needs 4
nodes and will complete in less than a week, SLURM will schedule B to begin running immediately, knowing that B can make use of the resources while A is waiting. In order to take advantage of backfilling, it is critical that jobs include an expected runtime (using the sbatch --time option) when they are submitted to SLURM.
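
As an illustration, here is a minimal sketch of a batch script that names an Allocation to charge and provides an expected runtime so SLURM can consider the job for backfilling. The account name, node count, and program name are hypothetical placeholders; substitute values appropriate for your own Project.

    #!/bin/bash
    # Hypothetical Allocation (SLURM Account) to charge for this job.
    #SBATCH --account=startup_mygroup
    # A "narrow" 4-node job.
    #SBATCH --nodes=4
    # Expected runtime (2 hours); providing this lets SLURM consider the job for backfilling.
    #SBATCH --time=02:00:00

    # Hypothetical executable.
    srun ./my_program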

Scheduling on OpenStack VMs

VMs, unlike HPC jobs, are often intended to run forever. Consequently, systems like OpenStack are designed either to allow a VM to be created if there are sufficient resources, or to deny the creation request. The concept of "scheduling", in the HPC sense, therefore has no clear counterpart in OpenStack. Moreover, the default OpenStack interface for creating VMs (the OpenStack GUI/Web Interface) only allows Users to repeatedly try to create a VM -- it does not "queue up" creation requests the way a scheduling system would.

To support both standard general-purpose VMs and HPC-style jobs executed in VMs, we provide two ways to create OpenStack VMs:

  1. Via the OpenStack GUI/Web Interface -- useful when resources are not over-subscribed.

    • "Scheduling" is not supported. Instead, Users must repeatedly try to create the VM until they succeed.

    • VMs created this way have no associated priority and thus are not started in any particular order.

    • In general, this approach is only useful when resources are plentiful and VM creation always succeeds.

  2. Via SLURM -- useful when resources are over-subscribed (or in cases where a batch interface is desired)

    • A User submits a batch job that is queued/scheduled and then automatically launched via OpenStack API calls when resources become available (see the sketch after this list).

    • Scheduling is based on SLURM's scheduling policy.

    • The job is charged against the SLURM Account supplied by the User.

      • This assumes that a SLURM Account (corresponding to resource type "VM") has been created for the Project and has been given an Allocation.
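
The exact launch mechanism is site-specific, but the sketch below conveys the idea: a batch job holds the VM's resources under SLURM's scheduling policy and, once it starts, creates the server using the standard OpenStack CLI. The account, image, flavor, network, and server names are hypothetical placeholders, and OpenStack credentials are assumed to already be loaded (e.g., from an openrc file).

    #!/bin/bash
    # Hypothetical SLURM Account of resource type "VM" to charge.
    #SBATCH --account=vm_mygroup
    # How long the resources are held for the VM (here, 7 days).
    #SBATCH --time=7-00:00:00

    # Create the VM once SLURM starts the job and the resources are available.
    openstack server create \
        --image ubuntu-22.04 \
        --flavor m1.large \
        --network project-net \
        --wait \
        my-vm

    # Keep the batch job alive while the VM runs so the reserved resources
    # remain charged against the Allocation.
    sleep infinity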
         



Center for Computational Sciences