SLURM Queues
SLURM Queues for Open Access Only Resources
Summary
This page explains Simple Linux Utility for Resource Management (SLURM) queue options for Open Access and Condo (owned-share) resources, usage limits, and naming conventions. Tables list queue parameters and definitions.
The roughly 400 nodes comprising the Lipscomb Compute Cluster (LCC) and Morgan Compute Cluster (MCC) supercomputers continue to be supported, offering approximately 260 million Service Units (SUs) per year to all users for Open Access allocation use only.
As was the case with the previous systems, these resources are scheduled using a fair-share scheduling algorithm. (Unlike the Condo resources, there are no privileged SLURM accounts; all users have the same priority.) The only allocations that can be used on this system are the Open Access allocations, specifically the collaborative (col) and general (gol) allocations. Like the previous DLX 2/3 systems, there are Long, Medium, Short, and Debug SLURM queues that limit the number of days a job can run based on the number of cores it needs: the more cores required, the shorter the maximum run time.
Moreover, SLURM is configured to give higher priority to jobs that run for short periods of time (see Note 2). In addition, no Principal Investigator (PI) group may use more than 1536 cores on LCC or 2048 cores on MCC at any given time. The maximum runtime for the long queues has been set at 14 days based on historical data indicating that few jobs need more than 14 days. If you need to run a job longer than 14 days, please submit a support request.
The Queue Names shown below are based on the Naming Conventions described at the bottom of this page.
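As a concrete illustration, a batch job is submitted to one of these queues with `sbatch`. The sketch below is a minimal job script, not a site-verified template: the partition name `SKY32M192_L` is taken from Table 1 below, and the account name `col_myproject` is a placeholder; substitute the queue and allocation account you are actually authorized to use.

```shell
#!/bin/bash
# Minimal SLURM job-script sketch (assumptions: partition and account
# names below are illustrative placeholders, not guaranteed to exist).
#SBATCH --partition=SKY32M192_L   # long queue: jobs may run up to 14 days
#SBATCH --time=7-00:00:00         # requested wall time, D-HH:MM:SS
#SBATCH --nodes=1
#SBATCH --ntasks=32               # all 32 cores of one SKY32M192 node
#SBATCH --account=col_myproject   # placeholder collaborative (col) account

srun ./my_program                 # launch the job step
```

Requesting only the wall time you need helps here, since shorter jobs receive higher scheduling priority (see Note 2).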
SLURM Queues for Condo Resources
Table 1. Condo queues, maximum runtimes, billing rates, and node details.
Condo jobs are "billed" for the resources they use. Because each type of resource has a different billing rate, there are different queues for each type of Condo resource. At present, the Condo resource types are SKY1 (Skylake 6130 CPUs), P100 (NVIDIA P100 GPUs), and V100 (NVIDIA V100 GPUs). In the future there may be additional resource types (e.g., a new Skylake processor, say SKY2). Within each type of resource, there are short, medium, long, and flex queues to prevent users from monopolizing any particular resource (see below). In addition, there are two debug queues used to debug CPU or GPU jobs, respectively.
No PI may use more than 1024 cores at any given time.
No PI may use more than 32 GPU cards at any given time.
If you need to run a job longer than 14 days, please submit a support request.
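Since Condo jobs are billed per resource-hour, the expected SU charge can be estimated before submitting. The sketch below assumes billing is simply (resources used) × (wall-clock hours) × (rate per resource-hour); `gpu_su` is a hypothetical helper, not a SLURM command.

```shell
# Sketch: estimate the SUs a Condo GPU job will be billed, assuming
# charge = GPUs x hours x rate-per-GPU-hour (gpu_su is hypothetical).
gpu_su() {
  # $1 = number of GPUs, $2 = wall-clock hours, $3 = SU rate per GPU-hour
  awk -v g="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", g * h * r }'
}

# A 4-GPU job on a V100 queue (1.48 GPU SUs per GPU-hour) running the
# full 72-hour limit:
gpu_su 4 72 1.48   # prints 426.24
```

The same arithmetic applies to CPU queues at 1 CPU SU per core-hour.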
| Queue Name | Max Time | Rate per Resource-Hour | Allowed Accounts | Cluster | Notes | Number of Nodes |
|---|---|---|---|---|---|---|
| SKY32M192_L | 14 days | 1 CPU SU | c*l | LCC | | |
| SKY32M192_D | 1 hour | 0 CPU SUs | col | LCC | Only one job per group allowed for debugging | |
| CAL48M192_L | 14 days | 1 CPU SU | c*l | LCC | | |
| CAL48M192_D | 1 hour | 0 CPU SUs | col | LCC | | |
| CAC48M192_L | 14 days | 1 CPU SU | c*l | LCC | | |
| P4V16_HAS16M128_L | 72 hours | 1 GPU SU | g*l | LCC | | |
| P4V12_SKY32M192_L | 72 hours | 1 GPU SU | g*l | LCC | | |
| P4V12_SKY32M192_D | 1 hour | 0 GPU SUs | gol | LCC | | |
| V4V16_SKY32M192_L | 72 hours | 1.48 GPU SUs | g*l | LCC | | |
| V4V32_SKY32M192_L | 72 hours | 1.48 GPU SUs | g*l | LCC | | |
| V4V32_CAS40M192_L | 72 hours | 1.48 GPU SUs | g*l | LCC | | |
| A2V80_ICE56M256_L | 72 hours | 2.1 GPU SUs | g*l | LCC | | |
| H4V80_ICE64M512_L | 72 hours | 7.2 GPU SUs | g*l | LCC | | |
Naming Conventions (used in the above names)

CPU queue names have the following format:

<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

and GPU queue names have the following format (note: GPU cards are housed in a server with CPUs):

<GPU_TYPE><GPU_COUNT><MEM_TYPE><MEM_SIZE>_<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

where

Table 2. CPU naming convention components.

| CPU_TYPE | Description | CORE_COUNT | MEM_TYPE | Description | MEM_SIZE |
|---|---|---|---|---|---|
| SKY | Skylake CPU | # of CPU Cores | M | RAM | GB of RAM |
| HAS | Haswell CPU | # of CPU Cores | M | RAM | GB of RAM |
| SAN | Sandy Bridge CPU | # of CPU Cores | M | RAM | GB of RAM |
| CAS | Cascade Lake CPU on GPU node | # of CPU Cores | M | RAM | GB of RAM |
| CAL | Cascade Lake CPU w/ 50 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| CAC | Cascade Lake CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| ICE | Ice Lake CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| SAP | Sapphire Rapids CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| AMD/ROME | AMD Rome 7702P CPU | # of CPU Cores | M | RAM | GB of RAM |

Table 3. GPU naming convention components.
| GPU_TYPE | Description | GPU_COUNT | MEM_TYPE | Description | MEM_SIZE |
|---|---|---|---|---|---|
| M | M2075 GPU | # of GPUs | V | VRAM | GB of VRAM |
| P | P100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| V | V100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| A | A100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| H | H100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| H | H200 GPU | # of GPUs | V | VRAM | GB of VRAM |
| QUEUE_TYPE | Description |
|---|---|
| L | Long-running jobs |
| D | Extremely short debug jobs |
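The convention above can be applied mechanically. The bash sketch below splits a queue name into the components defined in Tables 2 and 3; `parse_queue` is a hypothetical helper written for this page, and it assumes the letter-and-digit patterns shown above (so a name like AMD/ROME would need special handling).

```shell
# Sketch: decompose a queue name per the naming convention
# (parse_queue is a hypothetical helper, not a SLURM command).
parse_queue() {
  local name="$1"
  # GPU queues: <GPU_TYPE><GPU_COUNT>V<VRAM>_<CPU_TYPE><CORES>M<RAM>_<QUEUE_TYPE>
  local gpu_re='^([A-Z])([0-9]+)V([0-9]+)_([A-Z]+)([0-9]+)M([0-9]+)_([A-Z])$'
  # CPU queues: <CPU_TYPE><CORES>M<RAM>_<QUEUE_TYPE>
  local cpu_re='^([A-Z]+)([0-9]+)M([0-9]+)_([A-Z])$'
  if [[ $name =~ $gpu_re ]]; then
    echo "GPU_TYPE=${BASH_REMATCH[1]} GPU_COUNT=${BASH_REMATCH[2]} VRAM_GB=${BASH_REMATCH[3]} CPU_TYPE=${BASH_REMATCH[4]} CORE_COUNT=${BASH_REMATCH[5]} RAM_GB=${BASH_REMATCH[6]} QUEUE_TYPE=${BASH_REMATCH[7]}"
  elif [[ $name =~ $cpu_re ]]; then
    echo "CPU_TYPE=${BASH_REMATCH[1]} CORE_COUNT=${BASH_REMATCH[2]} RAM_GB=${BASH_REMATCH[3]} QUEUE_TYPE=${BASH_REMATCH[4]}"
  else
    echo "unrecognized queue name: $name" >&2
    return 1
  fi
}

parse_queue SKY32M192_L         # a Skylake CPU long queue
parse_queue V4V32_SKY32M192_L   # 4x V100 (32 GB VRAM) on a Skylake host
```

Reading the components off a name this way is often quicker than consulting the tables, e.g. V4V32_SKY32M192_L is a long queue with four 32 GB V100 GPUs hosted by 32 Skylake cores and 192 GB of RAM.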
Notes
2: For both the Lipscomb Compute Cluster (LCC) and Morgan Compute Cluster (MCC), and for Condo resources, SLURM is configured to give slightly higher priority to "wide" jobs requiring many cores, to ensure they are not starved by "narrow" jobs. So-called "backfilling" is also used to allow narrow jobs to make effective use of available cores.