SLURM Queues

SLURM Queues for Open Access Only Resources

The roughly 400 nodes comprising the LCC and MCC supercomputers continue to be supported, offering roughly 260M SUs per year to all users for Open Access use only.

As was the case with the previous systems, these resources are scheduled using a fair share scheduling algorithm [1]. (Unlike the Condo resources, there are no privileged SLURM accounts; all users have the same priority.) The only allocations that can be used on this system are the Open Access allocations -- specifically the col and gol accounts. Like the previous DLX 2/3 systems, this system has Long, Medium, Short, and Debug SLURM queues that limit how long a job may run based on the number of cores it requests: the more cores required, the shorter the maximum run time.

Moreover, SLURM is configured to give higher priority [2] to jobs that run for short periods of time (see the table below). In addition, no PI may use more than 1024 cores (on LCC) or 1536 cores (on MCC) at any given time. The maximum runtime for the long queues has been set at 14 days based on historical data indicating that few jobs need more than 14 days. If you need to run a job longer than 14 days, please submit a support request: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4 .
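As a sketch of how these queues are used, the batch script below targets one of the Open Access queues listed in the table further down. The job name, core count, wall time, and program are placeholders to adjust for your own work, and your exact account name may differ from the one shown:

    #!/bin/bash
    #SBATCH --job-name=example_job     # placeholder job name
    #SBATCH --account=col              # Open Access CPU account (see above); your exact account name may differ
    #SBATCH --partition=SAN16M64_S     # Short queue from the table below: 1 day max, 1/1024 cores per job
    #SBATCH --ntasks=16                # cores requested; more cores generally means a shorter maximum run time
    #SBATCH --time=12:00:00            # requested wall time, which must not exceed the queue's maximum

    # Load your software environment and launch the program here.
    srun ./my_program                  # placeholder executable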

The Queue Names shown below are based on the Naming Conventions described at the bottom of this page.

Queue Name      Max Time   Min/Max Cores Per Job   Queue Priority
HAS24M128_L     14 days    1/64                    0
HAS24M128_M     7 days     1/128                   5000
HAS24M128_S     1 day      1/256                   10000
HAS24M128_D     1 hour     1/24                    N/A
SAN16M64_L      14 days    1/64                    0
SAN16M64_M      7 days     1/512                   5000
SAN16M64_S      1 day      1/1024                  10000
SAN16M64_D      1 hour     1/16                    N/A
SAN32M512_L     14 days    1/32                    N/A
SAN32M3000_L    14 days    1/32                    N/A

SLURM Queues for Condo Resources

Condo jobs are "billed" for the resources they use. Because each type of resource has a different billing rate, there are different queues for each type of condo resource. At present the condo resource types are SKY1 (Skylake 6130 CPUs), P100 (NVIDIA P100 GPUs), and V100 (NVIDIA V100 GPUs). In the future there may be additional resource types (e.g., a new Skylake processor, say SKY2). Within each type of resource there are short, medium, long, and flex queues to prevent users from monopolizing any particular resource (see below). In addition, there are two debug queues, used to debug CPU and GPU jobs respectively.

No PI may use more than 1024 cores at any given time.

No PI may use more than 32 GPU cards at any given time.

If you need to run a job longer than 14 days, please submit a support request: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4 .
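As a minimal sketch, a Condo GPU job could be submitted with a script like the one below, which targets a V100 queue from the table that follows. The account placeholder, GPU count, core count, wall time, and program are assumptions to be replaced with your own values:

    #!/bin/bash
    #SBATCH --job-name=gpu_example            # placeholder job name
    #SBATCH --account=<your_gXl_account>      # replace with your group's g*l GPU allocation account
    #SBATCH --partition=V4V32_SKY32M192_L     # V100 queue from the table below: 72 hour max
    #SBATCH --gres=gpu:1                      # request one GPU card (no PI may use more than 32 at a time)
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8                 # CPU cores to pair with the GPU; adjust as needed
    #SBATCH --time=24:00:00                   # requested wall time, within the 72 hour limit

    # Load your software environment and launch the GPU program here.
    srun ./my_gpu_program                     # placeholder executable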

Queue Name           Max Time   Rate per Resource-Hour   Allowed Accounts
SKY32M192_L          14 days    1 CPU SU                 c*l
SKY32M192_D          1 hour     0 CPU SUs                col
CAL48M192_L          14 days    1 CPU SU                 c*l
CAL48M192_D          1 hour     0 CPU SUs                col
CAC48M192_L          14 days    1 CPU SU                 c*l
P4V16_HAS16M128_L    72 hours   1 GPU SU                 g*l
P4V12_SKY32M192_L    72 hours   1 GPU SU                 g*l
P4V12_SKY32M192_D    1 hour     0 GPU SUs                gol
V4V16_SKY32M192_L    72 hours   1.48 GPU SUs             g*l
V4V32_SKY32M192_L    72 hours   1.48 GPU SUs             g*l
V4V32_CAS40M192_L    72 hours   1.48 GPU SUs             g*l
A2V80_ICE56M256_L    72 hours   2.1 GPU SUs              g*l
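To illustrate how the rates above translate into charges, assuming that the billed resource is a CPU core for the CPU queues and a GPU card for the GPU queues: a job that holds 2 V100 cards in V4V32_SKY32M192_L for 10 hours would be billed 2 x 10 x 1.48 = 29.6 GPU SUs, while a job that holds 16 cores in SKY32M192_L for 10 hours would be billed 16 x 10 x 1 = 160 CPU SUs. Jobs in the debug queues are billed at a rate of 0 and therefore consume no SUs.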


Naming Conventions (used in the queue names above)

CPU queue names have the following format:

<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

and GPU queue names have the following format (note that the GPU cards are housed in a server that also contains CPUs):

<GPU_TYPE><GPU_COUNT><MEM_TYPE><MEM_SIZE>_<CPU_TYPE><CORE_COUNT><MEM_TYPE><MEM_SIZE>_<QUEUE_TYPE>

where the components have the meanings given in the tables below; a worked example follows the tables.

CPU_TYPE   Description                               CORE_COUNT       MEM_TYPE   Description   MEM_SIZE
SKY        Skylake CPU                               # of CPU Cores   M          RAM           GB of RAM
HAS        Haswell CPU                               # of CPU Cores   M          RAM           GB of RAM
SAN        Sandy Bridge CPU                          # of CPU Cores   M          RAM           GB of RAM
CAS        Cascade Lake CPU on GPU node              # of CPU Cores   M          RAM           GB of RAM
CAL        Cascade Lake CPU w/ 50 Gbps InfiniBand    # of CPU Cores   M          RAM           GB of RAM
CAC        Cascade Lake CPU w/ 100 Gbps InfiniBand   # of CPU Cores   M          RAM           GB of RAM
ICE        IceLake CPU w/ 100 Gbps InfiniBand        # of CPU Cores   M          RAM           GB of RAM

GPU_TYPE   Description   GPU_COUNT   MEM_TYPE   Description   MEM_SIZE
M          M2075 GPU     # of GPUs   V          VRAM          GB of VRAM
P          P100 GPU      # of GPUs   V          VRAM          GB of VRAM
V          V100 GPU      # of GPUs   V          VRAM          GB of VRAM
A          A100 GPU      # of GPUs   V          VRAM          GB of VRAM

QUEUE_TYPE   Description
L            Long running jobs
M            Medium running jobs
S            Short running jobs
D            Extremely short debug jobs
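As a worked example of these conventions, the Condo queue name V4V32_SKY32M192_L breaks down as V4V32 (4 V100 GPUs with 32 GB of VRAM), SKY32M192 (a Skylake host with 32 CPU cores and 192 GB of RAM), and _L (the long-running-jobs queue). Likewise, the Open Access name SAN16M64_S denotes Sandy Bridge nodes with 16 cores and 64 GB of RAM, scheduled through the Short queue.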

1: Note that it is possible to reserve partial nodes. For example, two different users might each need only 4 cores. By reserving only the number of cores they need (4 each), SLURM is able to schedule both users on a single node at the same time.

2: For both LCC/MCC and Condo, SLURM is configured to give slightly higher priority to "wide" jobs requiring many cores to ensure they are not starved by "narrow" jobs. So-called "backfilling" is also used to allow narrow jobs to make effective use of available cores.


