Constraints on Allocations

Although Allocations are granted to Projects immediately, that does not mean the entire Allocation can be used all at once. In general, Allocation Constraints -- in the form of Queues and Guard Rails -- are used to prevent a User (or group of Users on a Project) from monopolizing all the resources.

SLURM Queues

First, the system limits how long a job is allowed to run based on the number of nodes the job requires: the more nodes needed, the shorter the allowed run time. This is achieved via SLURM Queues.

When a User submits a job to be run, the User must specify the SLURM Queue in which the job should run. Each SLURM Queue is designed to support a different type of job.
Queues place constraints on the types of jobs that can be submitted. For example:

  • We may allow a User to run a "wide" job that uses a large percentage of the system's CPU nodes, but we may constrain them to one job at a time with a maximum run time of 24 hours.

  • On the other hand, we might allow a "narrow" job (4 nodes) to run for as much as 30 days.

  • Or we might allow a "debug" job to run for only 1 hour -- say, at high priority implemented via backfilling.

  • Queues are also used to constrain jobs to the types of resources they should run on (e.g., CPU or GPU nodes).

Jobs submitted to a SLURM Queue are checked to make sure they fall within the constraints of the Queue and are monitored to make sure they terminate within the specified maximum run time.

A list of the SLURM queues that are currently available can be found on the SLURM Queues Names page.
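
For illustration, a job submission might look like the minimal batch script below. The Queue (partition) name, node count, time limit, and SLURM Account name are hypothetical placeholders -- consult the SLURM Queues Names page for the actual Queue names and their limits.

    #!/bin/bash
    #SBATCH --job-name=example_job     # descriptive name for the job
    #SBATCH --partition=cpu_normal     # hypothetical Queue (partition) name
    #SBATCH --nodes=4                  # a "narrow" 4-node job
    #SBATCH --time=24:00:00            # requested run time; must fall within the Queue's maximum
    #SBATCH --account=my_project       # hypothetical SLURM Account (Project) to charge

    srun ./my_application              # launch the application on the allocated nodes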

Guard Rails

We introduce the concept of Guard Rails to prevent a User from unintentionally (or intentionally) consuming an excessive amount of their Allocations. Because Allocations are hard to come by (i.e., there is often a financial cost or proposal-writing effort involved in obtaining them), Users will want to use their Allocations judiciously. A buggy program with an infinite loop, for example, could quickly consume a User's entire Allocation -- burning through the entire CPU Allocation or writing enough data to disk to consume the entire storage Allocation. Not only does excessive consumption of resources eat up a User's precious Allocation, but it also has the potential to monopolize a resource, preventing other Users from utilizing it. Even if a User would like to use an excessive portion of their Allocation all at once, they may need to be prevented from doing so to ensure the resource remains available to other Users as well.

Unlike Allocation limits, which are fixed, hard limits that ensure Projects only use the resources they purchased or were given, Guard Rails are put in place to ensure safe operation of the entire system. As such, they can be adjusted if needed (on a per-Project basis), provided CCS/ITS-RC deems the adjustment will not affect the overall operation of the system. However, the existing Guard Rails established by CCS/ITS-RC were designed such that adjustments should be a rare occurrence -- with most jobs able to run without coming close to the Guard Rails.

Guard Rails are designed to keep usage at reasonable levels (i.e., prevent excessive usage). Guard Rails are associated with Projects and act as resource limits that terminate jobs (or prevent them from running) if the Project hits the Guard Rail limit. For example, while Projects (in theory) have unlimited access to GPFS scratch space, a Guard Rail of 50 TB has been put in place to prevent runaway programs from eating up the entire GPFS scratch space. Most (correctly behaving) Projects will never hit the 50 TB Guard Rail. However, in certain circumstances, a correctly behaving Project may need more than 50 TB of GPFS scratch space. In such cases, the PI of the Project can request that the Guard Rail be increased within reason. Such requests can usually be handled by the CCS/ITS-RC staff, but in some cases may need to involve the CCS Allocations and Review Committee (ARC).

As another example, Guard Rails are used to limit the burn rate of CPU Allocations (despite a Project being given its entire CPU Allocation at one time). In this case, the Guard Rail limits the amount of CPU that can be consumed in any given month. For example, the Guard Rail mechanism may divide a year's CPU Allocation by 12 months and then assign that value (1/12 of the yearly value) to the SLURM Account -- thereby only allowing a User to use up to 1/12 of their Allocation in any given month. This prevents a User from accidentally consuming their entire Allocation in a single month while also ensuring that Users do not monopolize a resource and block other Users from accessing it. We still need to gain more experience with Users' usage patterns to know where to set the Guard Rails, but we expect that the Guard Rail will be set at 2 months' worth of usage in a single month, thereby allowing Users to "burst" above and beyond their normal usage level in a given month.
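
As a purely illustrative example (the numbers below are hypothetical, not actual Allocation values), the monthly limit under this scheme would be computed as follows:

    Annual CPU Allocation:                    1,200,000 core-hours
    Strict 1/12 monthly limit:                1,200,000 / 12 = 100,000 core-hours per month
    Expected Guard Rail (2 months of usage):  2 * 100,000 = 200,000 core-hours per month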

HPC systems at other institutions sometimes employ approaches such as "Pre-allocation" (pre-allocating next month's Allocation), "Roll Over Allocations" (carrying over Allocations from month to month), or "Use it or lose it" (this month's Allocation must be used, or it is lost at the end of the month). Guard Rails combine the best of all these approaches. Allocations can be pre-allocated up to the Guard Rail. Unused Allocations can be rolled/carried over into future months because Allocations are only lost when they expire (e.g., annually for Startup). The only (intended and desired) limitation of Guard Rails is that they prevent excessive use of a resource. As such, Guard Rails may prevent a User from using available Project Allocation in a particular month -- which is intended. As a result, requests to remove Guard Rails for the purpose of gaining substantial additional time on a resource (something that could be viewed as excessive or monopolizing) will, in most cases, be denied. Also note that Allocations have an expiration date and cannot be "rolled over" past the expiration date.

Guard Rail Settings

The current Guard Rails settings are as follows:

  • Storage Guard Rails

    • GPFS Storage:

      • User Scratch: 25 TB

      • Project Scratch: 50 TB

    • Object Storage:

      • Project Scratch: 25 TB

  • Compute Guard Rails

    • Startup and Education: No Guard Rails -- given in small amounts, so no Guard Rails are needed

    • Condo: 5 months of the annual Allocation (This may be adjusted by CCS in response to competition for the resource)

    • Meritorious: 2 * 1/3 of the Allocation for the quarter (3 months), i.e., two months' worth of the quarterly Allocation (this may be adjusted by CCS in response to competition for the resource; see the worked example after this list)

    • Condo Incentive: 1/12 of the annual Allocation

    • CCS Discretionary: In general, no Guard Rails -- Discretionary is typically given out to meet an immediate need

    • Open Access: No Guard Rails -- SLURM's fair-share scheduling algorithm protects against excessive use
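
To illustrate how the Compute Guard Rail settings above translate into monthly limits, consider a hypothetical Project with an annual Allocation of 1,200,000 core-hours (i.e., 300,000 core-hours per quarter). The figures below are for illustration only and assume each Guard Rail acts as a cap on usage within a single month, as described above, with "x months of the Allocation" meaning x/12 of the annual total:

    Condo (5 months of the annual Allocation):          5 * (1,200,000 / 12) = 500,000 core-hours per month
    Meritorious (2 * 1/3 of the quarterly Allocation):  2 * (300,000 / 3) = 200,000 core-hours per month
    Condo Incentive (1/12 of the annual Allocation):    1,200,000 / 12 = 100,000 core-hours per month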

CCS reserves the right to adjust the standard Guard Rail settings at any time. PIs may request changes to these Guard Rail settings for a particular Project, but such requests must be clearly justified, and changes will be made on a case-by-case basis.
