SLURM Queues
SLURM Queues for Open Access Only Resources
The roughly 400 nodes comprising the LCC/MCC supercomputers continue to be supported, offering roughly 260M SUs/year to all users for Open Access use only.
As was the case with the previous systems, these resources are scheduled using a fair-share scheduling algorithm [1]. (Unlike the Condo resources, there are no privileged SLURM accounts; all users have the same priority.) The only allocations that can be used on this system are the Open Access allocations -- specifically the col and gol accounts. Like the previous DLX 2/3 systems, there are Long, Medium, Short, and Debug SLURM queues that limit the number of days a job can run based on the number of cores it needs. The more cores required, the shorter the maximum run time.
Moreover, SLURM is configured to give higher priority [2] to jobs that run for short periods of time (see the table below). In addition, no PI may use more than 1024 cores (on LCC) or 1536 cores (on MCC) at any given time. The maximum runtime for the long queues has been set at 14 days based on historical data indicating that few jobs need more than 14 days. If you need to run a job longer than 14 days, please submit a support request to CCS.
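As a concrete illustration, a minimal batch script for an Open Access job might look like the sketch below. It targets the SAN16M64_S queue from the table that follows and stays within that queue's limits; the account name and program are placeholders, not actual CCS values.

```bash
#!/bin/bash
# Hedged example: a 1-day, 64-core job on the SAN16M64_S (Short) queue.
#SBATCH --partition=SAN16M64_S        # queue name from the table below
#SBATCH --time=1-00:00:00             # must not exceed the queue's 1-day Max Time
#SBATCH --nodes=4                     # SAN nodes have 16 cores each
#SBATCH --ntasks=64                   # well under the queue's 1/1024 core limit
#SBATCH --account=<your-col-account>  # placeholder: an Open Access (col) allocation
#SBATCH --job-name=open_access_example

srun ./my_program                     # placeholder executable
```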
The Queue Names shown below are based on the Naming Conventions described at the bottom of this page.
Queue Name | Max Time | Min/Max Cores Per Job | Queue Priority |
---|---|---|---|
HAS24M128_L | 14 days | 1/64 | 0 |
HAS24M128_M | 7 days | 1/128 | 5000 |
HAS24M128_S | 1 day | 1/256 | 10000 |
HAS24M128_D | 1 hour | 1/24 | N/A |
SAN16M64_L | 14 days | 1/64 | 0 |
SAN16M64_M | 7 days | 1/512 | 5000 |
SAN16M64_S | 1 day | 1/1024 | 10000 |
SAN16M64_D | 1 hour | 1/16 | N/A |
SAN32M512_L | 14 days | 1/32 | N/A |
SAN32M3000_L | 14 days | 1/32 | N/A |
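Before submitting, standard SLURM commands can be used to check a queue's limits and to request only part of a node so that SLURM can pack several small jobs together (see footnote [1]). The commands below are a hedged sketch; the partition is taken from the table above, and the account and program are placeholders.

```bash
# Show the time limit, node count, and CPU availability for one Open Access partition
sinfo --partition=HAS24M128_S --format="%P %l %D %C"

# Request only 4 of a node's 24 cores; SLURM may schedule other small jobs
# on the remaining cores of the same node.
sbatch --partition=HAS24M128_S --time=02:00:00 \
       --ntasks=4 --account=<your-col-account> \
       --wrap="srun ./my_small_program"        # placeholders
```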
SLURM Queues for Condo Resources
Condo jobs are "billed" for the resources they use. Because each type of resource has a different billing rate, there are different queues for each type of condo resource. At present, the condo resource types are SKY1 (Skylake 6130 CPUs), P100 (NVIDIA P100 GPUs), and V100 (NVIDIA V100 GPUs). In the future there may be additional resource types (e.g., a new Skylake processor, say SKY2). Within each type of resource there are short, medium, long, and flex queues to prevent users from monopolizing any particular resource (see below). In addition, there are two debug queues used to debug CPU and GPU jobs, respectively.
No PI may use more than 1024 cores at any given time.
No PI may use more than 32 GPU cards at any given time.
If you need to run a job longer than 14 days, please submit a support request to CCS.
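A condo GPU job is submitted in the same way as an Open Access job, with the GPU count requested via --gres and charged against a condo (g*l) account. The script below is a hedged sketch based on the table that follows; the account name, CPU count, and program are placeholders rather than official values.

```bash
#!/bin/bash
# Hedged example: a 2-GPU, 48-hour job on the V4V32_SKY32M192_L queue.
#SBATCH --partition=V4V32_SKY32M192_L       # V100 condo queue from the table below
#SBATCH --time=48:00:00                     # must not exceed the 72-hour queue limit
#SBATCH --gres=gpu:2                        # GPU cards requested (PI-wide cap is 32)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8                   # placeholder CPU request on the Skylake host
#SBATCH --account=<your-gpu-condo-account>  # placeholder: a g*l account

srun ./my_gpu_program                       # placeholder executable
```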
Queue Name | Max Time | Rate per Resource-Hour | Allowed Accounts |
---|---|---|---|
SKY32M192_L | 14 days | 1 CPU SU | c*l |
SKY32M192_D | 1 hour | 0 CPU SUs | col |
CAL48M192_L | 14 days | 1 CPU SU | c*l |
CAL48M192_D | 1 hour | 0 CPU SUs | col |
CAC48M192_L | 14 days | 1 CPU SU | c*l |
P4V16_HAS16M128_L | 72 hours | 1 GPU SU | g*l |
P4V12_SKY32M192_L | 72 hours | 1 GPU SU | g*l |
P4V12_SKY32M192_D | 1 hour | 0 GPU SUs | gol |
V4V16_SKY32M192_L | 72 hours | 1.48 GPU SUs | g*l |
V4V32_SKY32M192_L | 72 hours | 1.48 GPU SUs | g*l |
V4V32_CAS40M192_L | 72 hours | 1.48 GPU SUs | g*l |
A2V80_ICE56M256_L | 72 hours | 2.1 GPU SUs | g*l |
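To make the billing rates concrete (a hedged reading, assuming each reserved resource is billed per hour at the listed rate): a job holding 2 V100 cards on V4V32_SKY32M192_L for 10 hours would be charged 2 × 10 × 1.48 ≈ 29.6 GPU SUs, the same 10 hours on 2 P100 cards in P4V12_SKY32M192_L would cost 2 × 10 × 1 = 20 GPU SUs, and debug-queue jobs are not billed at all (0 SUs).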
Naming Conventions (used in the above names)
CPU queue names have the following format:
<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>
and GPU queue names have the following format (note that the GPU cards are housed in a server with CPUs):
<GPU_TYPE><GPU_COUNT><MEM_TYPE><MEM_SIZE>_<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>
where
CPU_TYPE | Description | CORE_COUNT | MEM_TYPE | Description | MEM_SIZE |
---|---|---|---|---|---|
SKY | Skylake CPU | # of CPU Cores | M | RAM | GB of RAM |
HAS | Haswell CPU | # of CPU Cores | M | RAM | GB of RAM |
SAN | Sandy Bridge CPU | # of CPU Cores | M | RAM | GB of RAM |
CAS | Cascade Lake CPU on GPU node | # of CPU Cores | M | RAM | GB of RAM |
CAL | Cascade Lake CPU w/ 50 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
CAC | Cascade Lake CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
ICE | IceLake CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
GPU_TYPE | Description | GPU_COUNT | MEM_TYPE | Description | MEM_SIZE |
---|---|---|---|---|---|
M | M2075 GPU | # of GPUs | V | VRAM | GB of VRAM |
P | P100 GPU | # of GPUs | V | VRAM | GB of VRAM |
V | V100 GPU | # of GPUs | V | VRAM | GB of VRAM |
A | A100 GPU | # of GPUs | V | VRAM | GB of VRAM |
QUEUE_TYPE | Description |
---|---|
L | Long running jobs |
M | Medium running jobs |
S | Short running jobs |
D | Extremely short debug jobs |
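Putting these conventions together: HAS24M128_S is the Short queue on Haswell nodes with 24 cores and 128 GB of RAM, and V4V32_SKY32M192_L is the Long queue for nodes with 4 V100 GPUs (32 GB of VRAM) hosted on a Skylake server with 32 cores and 192 GB of RAM.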
[1]: Note that it is possible to reserve partial nodes. For example, two different users might each need only 4 cores. By reserving only the number of cores they need (4 each), SLURM is able to schedule both users on a single node at the same time.
[2]: For both LCC/MCC and Condo, SLURM is configured to give slightly higher priority to "wide jobs" requiring many cores to ensure they are not starved by "narrow jobs". So-called "back filling" is also used to allow narrow jobs to make effective use of available cores.