SLURM Queues

SLURM Queues for Open Access Only Resources

The roughly 400 nodes comprising the LCC/MCC supercomputers continue to be supported, offering roughly 260M SUs/year to all users for Open Access-only use.

As was the case with the previous systems, these resources are scheduled using a fair-share scheduling algorithm¹. (Unlike the Condo resources, there are no privileged SLURM accounts; all users have the same priority.) The only allocations that can be used on this system are the Resource Allocations -- specifically the col and gol purposes. Like the previous DLX 2/3 systems, there are Long, Medium, Short, and Debug SLURM queues that limit the number of days a job can run based on the number of cores it needs: the more cores required, the shorter the maximum run time.

Moreover, SLURM is configured to give higher priority² to jobs that run for short periods of time (see the table below). In addition, no PI group may use more than 1536 cores on LCC or 2048 cores on MCC at any given time. The maximum runtime for the long queues has been set at 14 days based on historical data indicating that few jobs need more than 14 days. If you need to run a job longer than 14 days, please submit a support request: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4 .

The Queue Names shown below are based on the Naming Conventions described at the bottom of this page.
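As a sketch, a batch script targeting one of these queues might look like the following. The partition is a queue name from the table below; the account name and resource sizes are placeholders, not real values -- substitute your own col-purpose allocation and requirements:

```shell
#!/bin/bash
# Sketch of an Open Access CPU job. The partition is a queue name from
# the table below; the account is a PLACEHOLDER for your own col-purpose
# allocation, not a real account name.
#SBATCH --partition=SKY32M192_L
#SBATCH --account=col_mygroup      # placeholder account name
#SBATCH --ntasks=4                 # partial nodes can be reserved (see note 1)
#SBATCH --time=7-00:00:00          # must stay under the queue's 14-day max

srun ./my_program
```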

SLURM Queues for Condo Resources

Condo jobs are "billed" for the resources they use. Because each type of resource has a different billing rate, there are different queues for each type of condo resource. At present, the condo resource types are SKY1 (Skylake 6130 CPUs), P100 (NVIDIA P100 GPUs), and V100 (NVIDIA V100 GPUs). In the future there may be additional resource types (e.g., a new Skylake processor, say SKY2). Within each type of resource, there are short, medium, long, and flex queues to prevent users from monopolizing any particular resource (see below). In addition, there are two debug queues used to debug CPU or GPU jobs, respectively.

No PI may use more than 1024 cores at any given time.

No PI may use more than 32 GPU cards at any given time.

If you need to run a job longer than 14 days, please submit a support request: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4 .
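For instance, a GPU job on one of the V100 queues could be sketched as follows. The queue name comes from the table below; the account name is a placeholder for an allowed g*l account:

```shell
#!/bin/bash
# Sketch of a GPU job on a V100 queue. The account below is a PLACEHOLDER
# for an allowed g*l account (see the "Allowed Accounts" column below).
#SBATCH --partition=V4V32_SKY32M192_L
#SBATCH --account=gol_mygroup       # placeholder account name
#SBATCH --gres=gpu:1                # request one GPU card
#SBATCH --time=72:00:00             # this queue's 72-hour maximum

srun ./my_gpu_program
```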

| Queue Name | Max Time | Rate per Resource-Hour | Allowed Accounts | Cluster | Notes | Number of Nodes |
|---|---|---|---|---|---|---|
| SKY32M192_L | 14 days | 1 CPU SU | c*l | LCC |  | LCC system overview |
| SKY32M192_D | 1 hour | 0 CPU SUs | col | LCC | Only 1 debug job per group allowed | LCC system overview |
| CAL48M192_L | 14 days | 1 CPU SU | c*l | LCC |  | LCC system overview |
| CAL48M192_D | 1 hour | 0 CPU SUs | col | LCC |  | LCC system overview |
| CAC48M192_L | 14 days | 1 CPU SU | c*l | LCC |  | LCC system overview |
| P4V16_HAS16M128_L | 72 hours | 1 GPU SU | g*l | LCC |  | LCC system overview |
| P4V12_SKY32M192_L | 72 hours | 1 GPU SU | g*l | LCC |  | LCC system overview |
| P4V12_SKY32M192_D | 1 hour | 0 GPU SUs | gol | LCC |  | LCC system overview |
| V4V16_SKY32M192_L | 72 hours | 1.48 GPU SUs | g*l | LCC |  | LCC system overview |
| V4V32_SKY32M192_L | 72 hours | 1.48 GPU SUs | g*l | LCC |  | LCC system overview |
| V4V32_CAS40M192_L | 72 hours | 1.48 GPU SUs | g*l | LCC |  | LCC system overview |
| A2V80_ICE56M256_L | 72 hours | 2.1 GPU SUs | g*l | LCC |  | LCC system overview |
| H4V80_ICE64M512_L | 72 hours | 7.2 GPU SUs | g*l | LCC |  | LCC system overview |
| H8V144_SAP112M2048_L | 72 hours | 7.2 GPU SUs | g*l | LCC |  | LCC system overview |
| jumbo | 14 days | 1 CPU SU | c*a | MCC |  | MCC system overview |
| normal | 14 days | 1 CPU SU | c*a | MCC |  | MCC system overview |
| short | 3 days | 1 CPU SU | c*a | MCC | Reserved for short MPI or many-core jobs (not meant for single-core jobs) | MCC system overview |

Naming Conventions (used in the above names)

CPU queue names have the following format:

<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

and GPU queue names have the following format (note: GPU cards are housed in a server with CPUs):

<GPU_TYPE><GPU_COUNT><MEM_TYPE><MEM_SIZE>_<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

where

| CPU_TYPE | Description | CORE_COUNT | MEM_TYPE | Description | MEM_SIZE |
|---|---|---|---|---|---|
| SKY | Skylake CPU | # of CPU Cores | M | RAM | GB of RAM |
| HAS | Haswell CPU | # of CPU Cores | M | RAM | GB of RAM |
| SAN | Sandy Bridge CPU | # of CPU Cores | M | RAM | GB of RAM |
| CAS | Cascade Lake CPU on GPU node | # of CPU Cores | M | RAM | GB of RAM |
| CAL | Cascade Lake CPU w/ 50 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| CAC | Cascade Lake CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| ICE | IceLake CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| SAP | Sapphire Rapids CPU w/ 100 Gbps InfiniBand | # of CPU Cores | M | RAM | GB of RAM |
| AMD/ROME | 7702P |  |  |  |  |

| GPU_TYPE | Description | GPU_COUNT | MEM_TYPE | Description | MEM_SIZE |
|---|---|---|---|---|---|
| M | M2075 GPU | # of GPUs | V | VRAM | GB of VRAM |
| P | P100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| V | V100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| A | A100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| H | H100 GPU | # of GPUs | V | VRAM | GB of VRAM |
| H | H200 GPU | # of GPUs | V | VRAM | GB of VRAM |

| QUEUE_TYPE | Description |
|---|---|
| L | Long running jobs |
| D | Extremely short debug jobs |
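The naming convention above can be decoded mechanically. The following bash sketch derives its two regular expressions from the format strings above; the patterns are an assumption based on the documented convention, not something SLURM itself provides:

```shell
#!/bin/bash
# Decode an LCC queue name into its components. The regexes below are an
# ASSUMPTION derived from the documented formats:
#   CPU:  <CPU_TYPE><CORE_COUNT>M<MEM_SIZE>_<QUEUE_TYPE>
#   GPU:  <GPU_TYPE><GPU_COUNT>V<VRAM>_<CPU_TYPE><CORE_COUNT>M<MEM_SIZE>_<QUEUE_TYPE>
decode_queue() {
  local name=$1
  if [[ $name =~ ^([A-Z])([0-9]+)V([0-9]+)_([A-Z]+)([0-9]+)M([0-9]+)_([A-Z])$ ]]; then
    # GPU queue: GPU type/count/VRAM, then the host CPU description
    echo "gpu_type=${BASH_REMATCH[1]} gpus=${BASH_REMATCH[2]} vram_gb=${BASH_REMATCH[3]}" \
         "cpu_type=${BASH_REMATCH[4]} cores=${BASH_REMATCH[5]} ram_gb=${BASH_REMATCH[6]}" \
         "queue=${BASH_REMATCH[7]}"
  elif [[ $name =~ ^([A-Z]+)([0-9]+)M([0-9]+)_([A-Z])$ ]]; then
    # CPU-only queue
    echo "cpu_type=${BASH_REMATCH[1]} cores=${BASH_REMATCH[2]} ram_gb=${BASH_REMATCH[3]}" \
         "queue=${BASH_REMATCH[4]}"
  else
    echo "unrecognized queue name: $name" >&2
    return 1
  fi
}

decode_queue SKY32M192_L
decode_queue P4V12_SKY32M192_D
```

For example, `decode_queue P4V12_SKY32M192_D` reports 4 P100 GPUs with 12 GB of VRAM hosted on a 32-core Skylake node with 192 GB of RAM, in the debug queue.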

1: Note that it is possible to reserve partial nodes. For example, two different users might each need only 4 cores. By reserving only the number of cores they need (4 each), SLURM is able to schedule both users on a single node at the same time.

2: For both LCC/MCC and Condo, SLURM is configured to give slightly higher priority to "wide" jobs requiring many cores, to ensure they are not starved by "narrow" jobs. So-called "backfilling" is also used to allow narrow jobs to make effective use of available cores.



Center for Computational Sciences