SLURM Queues

SLURM Queues for Open Access Only Resources

The roughly 400 nodes comprising the LCC and MCC supercomputers continue to be supported, offering roughly 260M SUs per year to all users for Open Access use only.

As was the case with the previous systems, these resources are scheduled using a fair share scheduling algorithm [1]. (Unlike the Condo resources, there are no privileged SLURM accounts; all users have the same priority.) The only allocations that can be used on this system are the Open Access allocations -- specifically the col and gol accounts. Like the previous DLX 2/3 systems, this system has Long, Medium, Short, and Debug SLURM queues that limit how long a job may run based on the number of cores it requests: the more cores required, the shorter the maximum run time.

Moreover, SLURM is configured to give higher priority [2] to jobs that run for short periods of time (see the table below). In addition, no PI may use more than 1024 cores (on LCC) or 1536 cores (on MCC) at any given time. The maximum runtime for the long queues has been set at 14 days based on historical data indicating that few jobs need more than 14 days. If you need to run a job longer than 14 days, please submit a support request: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4 .
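As a sketch of how these queues are used, the batch script below targets one of the Open Access queues listed in the table further down. The job name, core count, wall time, and program are placeholders to adjust for your own work, and your exact account name may differ from the one shown:

    #!/bin/bash
    #SBATCH --job-name=example_job     # placeholder job name
    #SBATCH --account=col              # Open Access CPU account (see above); your exact account name may differ
    #SBATCH --partition=SAN16M64_S     # Short queue from the table below: 1 day max, 1/1024 cores per job
    #SBATCH --ntasks=16                # cores requested; more cores generally means a shorter maximum run time
    #SBATCH --time=12:00:00            # requested wall time, which must not exceed the queue's maximum

    # Load your software environment and launch the program here.
    srun ./my_program                  # placeholder executable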

The Queue Names shown below are based on the Naming Conventions described at the bottom of this page.

Queue Name      Max Time   Min/Max Cores Per Job   Queue Priority
HAS24M128_L     14 days    1/64                    0
HAS24M128_M     7 days     1/128                   5000
HAS24M128_S     1 day      1/256                   10000
HAS24M128_D     1 hour     1/24                    N/A
SAN16M64_L      14 days    1/64                    0
SAN16M64_M      7 days     1/512                   5000
SAN16M64_S      1 day      1/1024                  10000
SAN16M64_D      1 hour     1/16                    N/A
SAN32M512_L     14 days    1/32                    N/A
SAN32M3000_L    14 days    1/32                    N/A

SLURM Queues for Condo Resources

Condo jobs are "billed" for the resources they use. Because each type of resource has a different billing rate, there are different queues for each type of condo resource. At present the condo resource types are SKY1 (Skylake 6130 CPUs), P100 (NVIDIA P100 GPUs), and V100 (NVIDIA V100 GPUs). In the future there may be additional resource types (e.g., a new Skylake processor, say SKY2). Within each type of resource there are short, medium, long, and flex queues to prevent users from monopolizing any particular resource (see below). In addition, there are two debug queues, used to debug CPU and GPU jobs respectively.

No PI may use more than 1024 cores at any given time.

No PI may use more than 32 GPU cards at any given time.

If you need to run a job longer than 14 days, please submit a support request: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4 .
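As a minimal sketch, a Condo GPU job could be submitted with a script like the one below, which targets a V100 queue from the table that follows. The account placeholder, GPU count, core count, wall time, and program are assumptions to be replaced with your own values:

    #!/bin/bash
    #SBATCH --job-name=gpu_example            # placeholder job name
    #SBATCH --account=<your_gXl_account>      # replace with your group's g*l GPU allocation account
    #SBATCH --partition=V4V32_SKY32M192_L     # V100 queue from the table below: 72 hour max
    #SBATCH --gres=gpu:1                      # request one GPU card (no PI may use more than 32 at a time)
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8                 # CPU cores to pair with the GPU; adjust as needed
    #SBATCH --time=24:00:00                   # requested wall time, within the 72 hour limit

    # Load your software environment and launch the GPU program here.
    srun ./my_gpu_program                     # placeholder executable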

Queue Name           Max Time   Rate per Resource-Hour   Allowed Accounts
SKY32M192_L          14 days    1 CPU SU                 c*l
SKY32M192_D          1 hour     0 CPU SUs                col
CAL48M192_L          14 days    1 CPU SU                 c*l
CAL48M192_D          1 hour     0 CPU SUs                col
CAC48M192_L          14 days    1 CPU SU                 c*l
P4V16_HAS16M128_L    72 hours   1 GPU SU                 g*l
P4V12_SKY32M192_L    72 hours   1 GPU SU                 g*l
P4V12_SKY32M192_D    1 hour     0 GPU SUs                gol
V4V16_SKY32M192_L    72 hours   1.48 GPU SUs             g*l
V4V32_SKY32M192_L    72 hours   1.48 GPU SUs             g*l
V4V32_CAS40M192_L    72 hours   1.48 GPU SUs             g*l
A2V80_ICE56M256_L    72 hours   2.1 GPU SUs              g*l
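To illustrate how the rates above translate into charges, assuming that the billed resource is a CPU core for the CPU queues and a GPU card for the GPU queues: a job that holds 2 V100 cards in V4V32_SKY32M192_L for 10 hours would be billed 2 x 10 x 1.48 = 29.6 GPU SUs, while a job that holds 16 cores in SKY32M192_L for 10 hours would be billed 16 x 10 x 1 = 160 CPU SUs. Jobs in the debug queues are billed at a rate of 0 and therefore consume no SUs.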


Naming Conventions (used in the queue names above)

CPU queue names have the following format:

<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

and GPU queue names have the following format (note that the GPU cards are housed in a server that also contains CPUs):

<GPU_TYPE><GPU_COUNT><MEM_TYPE><MEM_SIZE>_<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

where the components have the meanings given in the tables below; a worked example follows the tables.

CPU_TYPE   Description                               CORE_COUNT       MEM_TYPE   Description   MEM_SIZE
SKY        Skylake CPU                               # of CPU Cores   M          RAM           GB of RAM
HAS        Haswell CPU                               # of CPU Cores   M          RAM           GB of RAM
SAN        Sandy Bridge CPU                          # of CPU Cores   M          RAM           GB of RAM
CAS        Cascade Lake CPU on GPU node              # of CPU Cores   M          RAM           GB of RAM
CAL        Cascade Lake CPU w/ 50 Gbps InfiniBand    # of CPU Cores   M          RAM           GB of RAM
CAC        Cascade Lake CPU w/ 100 Gbps InfiniBand   # of CPU Cores   M          RAM           GB of RAM
ICE        IceLake CPU w/ 100 Gbps InfiniBand        # of CPU Cores   M          RAM           GB of RAM

GPU_TYPE   Description   GPU_COUNT   MEM_TYPE   Description   MEM_SIZE
M          M2075 GPU     # of GPUs   V          VRAM          GB of VRAM
P          P100 GPU      # of GPUs   V          VRAM          GB of VRAM
V          V100 GPU      # of GPUs   V          VRAM          GB of VRAM
A          A100 GPU      # of GPUs   V          VRAM          GB of VRAM

QUEUE_TYPE   Description
L            Long running jobs
M            Medium running jobs
S            Short running jobs
D            Extremely short debug jobs
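As a worked example of these conventions, the Condo queue name V4V32_SKY32M192_L breaks down as V4V32 (4 V100 GPUs with 32 GB of VRAM), SKY32M192 (a Skylake host with 32 CPU cores and 192 GB of RAM), and _L (the long-running-jobs queue). Likewise, the Open Access name SAN16M64_S denotes Sandy Bridge nodes with 16 cores and 64 GB of RAM, scheduled through the Short queue.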

1: Note that it is possible to reserve partial nodes. For example, two different users might each need only 4 cores. By reserving only the number of cores they need (4 each), SLURM is able to schedule both users on a single node at the same time.

2: For both LCC/MCC and Condo, SLURM is configured to give slightly higher priority to "wide" jobs requiring many cores to ensure they are not starved by "narrow" jobs. So-called "backfilling" is also used to allow narrow jobs to make effective use of available cores.


