Submitting jobs on LCC (for first-time users)
This page introduces the basic process for submitting jobs on the Lipscomb Compute Cluster (LCC) using SLURM. It covers job scripts, partitions, time limits, job submission, interactive use, and common SLURM commands.
- 1 Overview
- 2 Example Job Script
- 3 Viewing Job Output
- 4 Viewing Job Details with scontrol
- 5 A More Realistic Example
- 6 Queues and Partitions
- 7 Setting Time Limits
- 8 Submitting a Job and Checking Status
- 9 Using Debug Partitions
- 10 Interactive Use of a Compute Node
- 11 Common SLURM Commands
- 12 SLURM Job Script Options
- 13 Notes
Overview
Suppose you have a scientific application that you want to run in a terminal. On your local machine, you might run it directly like this:
$ my_app -x -y -z    # Running my_app directly on your local machine
On LCC, you are not allowed to run computational jobs directly on the login node. Instead, you must create a job script and submit it to SLURM using sbatch.
Example:
$ sbatch my_job_script.sh
Example script files are available in:
/share/examples/LCC
You can create a script file with a text editor on LCC, or create it locally and copy it to the cluster.
Example:
vim ./first_job.sh
Example Job Script
Below is a short example of a SLURM job script.
#!/bin/bash
#SBATCH --time=00:15:00 # Time limit for the job (REQUIRED)
#SBATCH --job-name=my_test_job # Job name
#SBATCH --ntasks=1 # Number of cores for the job
#SBATCH --partition=SKY32M192_D # Partition/queue to run the job in (REQUIRED)
#SBATCH -e slurm-%j.err # Error file for this job
#SBATCH -o slurm-%j.out # Output file for this job
#SBATCH -A <your project account> # Project allocation account name (REQUIRED)
echo "Hello world. This is my first job"
Submit the job with:
sbatch ./test_job.sh
After submission, SLURM returns a job ID:
Submitted batch job 123027
Once submitted, your job will run on a compute node, not on the login node. You may safely log out after submission; the job remains in the queue or continues running in the system.
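Before inspecting a job in detail, you can confirm it is queued or running with squeue (job ID 123027 is the example ID from the submission message above):

```shell
# Show the status of a specific job by its ID
squeue -j 123027

# Or list all of your own queued and running jobs
squeue -u $USER
```

Both forms are standard SLURM usage; a job that no longer appears in squeue has finished (or failed) and its output files should be in the working directory.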
Viewing Job Output
When the job is complete, the SLURM output files will be present in your working directory:
ls .
slurm-123027.err slurm-123027.out test_job.sh
You can inspect the output file:
cat slurm-123027.out
Hello world. This is my first job
Viewing Job Details with scontrol
After submitting a job, you can inspect job metadata with scontrol:
scontrol show job 123027
Example output:
JobId=123027 JobName=my_test_job
UserId=userid(1234) GroupId=users(100) MCS_label=N/A
Priority=10018 Nice=0 Account=col_cwa236_uksr QOS=sl2
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=13-01:14:11 TimeLimit=14-00:00:00 TimeMin=N/A
SubmitTime=2019-10-26T16:26:49 EligibleTime=2019-10-26T16:26:49
AccrueTime=2019-10-26T16:26:49
StartTime=2019-10-26T16:26:50 EndTime=2019-11-09T15:26:50 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-10-26T16:26:50
Partition=CAS48M192_L AllocNode:Sid=login002:65722
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cascade030
BatchHost=cascade030
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=4000M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=./my_app
WorkDir=/scratch/userid/myproj
StdErr=/scratch/userid/myproj/slurm-123027.err
StdIn=/dev/null
StdOut=/scratch/userid/myproj/slurm-123027.out
In the example above, JobState=RUNNING indicates that the job is currently running.
A More Realistic Example
Below is a more realistic SLURM script with additional options:
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --partition=SKY32M192_D
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out
#SBATCH -A <your project account>
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<enter email address here>
module purge
module load intel/19.0.3.199
module load impi/2019.3.199
module load ccs/nwchem/6.8
mpirun -n 8 nwchem Input_c240_pbe0.nw
To see more examples, review the files in:
/share/examples/LCC
Queues and Partitions
Each job runs in a queue or partition. A partition is a set of nodes with specific resource limits, such as processor type, memory, and maximum walltime.
You can view partition information with:
sinfo
The sinfo command shows:
partition name
time limit
number of nodes
node state
node list
Example output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SKY32M192_L up 14-00:00:0 5 mix skylake[001,003,040,042,050]
SKY32M192_L up 14-00:00:0 13 alloc skylake[002,004-007,009,012,022,036-038,046,049]
SKY32M192_L up 14-00:00:0 33 idle skylake[008,011,013-021,023-034,039,043-045,047-048,051-054]
SKY32M192_D up 1:00:00 1 idle skylake056
P4V16_HAS16M128_L up 3-00:00:00 2 idle gpdnode[001-002]
P4V12_SKY32M192_L up 3-00:00:00 1 comp gphnode001
P4V12_SKY32M192_L up 3-00:00:00 1 alloc gphnode002
P4V12_SKY32M192_L up 3-00:00:00 5 idle gphnode[004-006,008-009]
P4V12_SKY32M192_D up 1:00:00 1 idle gphnode010
V4V16_SKY32M192_L up 3-00:00:00 2 idle gvnode[001-002]
V4V32_SKY32M192_L up 3-00:00:00 1 mix gvnode003
V4V32_SKY32M192_L up 3-00:00:00 1 alloc gvnode004
V4V32_SKY32M192_L up 3-00:00:00 2 idle gvnode[005-006]
A2V80_ICE56M256_L up 3-00:00:00 2 mix ganode[002-003]
A2V80_ICE56M256_L up 3-00:00:00 2 idle ganode[001,004]
...
In the example above:
- SKY32M192_L has a time limit of 14 days
- some nodes are alloc (allocated)
- some nodes are idle (available)
- some nodes are mix (partially allocated)
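When looking for free capacity, sinfo output can be narrowed to a single partition and node state using its standard filter flags:

```shell
# List only the idle (immediately available) nodes in one partition
sinfo -p SKY32M192_L -t idle
```

The -p flag selects the partition and -t filters by node state (idle, alloc, mix, etc.), which is quicker than scanning the full sinfo listing.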
For detailed information on a specific partition, use:
scontrol show partition SKY32M192_L
Example:
PartitionName=SKY32M192_L
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=skylake[001-009,011-034,036-054]
PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1664 TotalNodes=52 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=6000 MaxMemPerCPU=6000
TRESBillingWeights=CPU=1.0,Mem=0.1666G
Setting Time Limits
It is important to set a time limit for each job.
The time limit tells SLURM the maximum walltime your job may use; a job that exceeds its limit is killed. If you do not specify a time limit, SLURM may use the maximum time allowed for the selected partition. This can negatively affect scheduling, because SLURM may assume your job will need the longest possible runtime.
If your application will likely finish in 3 hours, it is better to request something like 3.5 hours rather than the maximum allowed by the partition.
Each partition has its own maximum time limit. Review partition details with sinfo and scontrol show partition.
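For example, a run expected to finish in about 3 hours might request 3.5 hours as a buffer. A minimal sketch (the partition and account values are placeholders you must replace):

```shell
#!/bin/bash
#SBATCH --time=03:30:00            # 3.5 hours: a modest buffer over the expected 3-hour runtime
#SBATCH --partition=SKY32M192_L    # choose a partition whose MaxTime covers the request
#SBATCH -A <your project account>
./my_app
```

Requesting a realistic limit instead of the partition maximum helps the scheduler backfill your job into earlier gaps.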
Submitting a Job and Checking Status
If your job script is named run.sh, submit it with:
sbatch run.sh
Example output:
Submitted batch job 100868
You can check job details while it is queued or running with:
scontrol show job 100868
Example output:
JobId=100868 JobName=sandbox
UserId=linkblueid(2006) GroupId=users(100) MCS_label=N/A
Priority=21277 Nice=0 Account=col_exampleprojectname_uksr QOS=sl2
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2019-04-25T14:31:05 EligibleTime=2019-04-25T14:31:05
AccrueTime=2019-04-25T14:31:05
StartTime=2019-04-25T14:31:05 EndTime=2019-04-25T14:31:06 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-04-25T14:31:05
Partition=SAN16M64_D AllocNode:Sid=login002:87012
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cnode256
BatchHost=cnode256
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=4000M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/linkblueid/sandbox/run.sh
WorkDir=/scratch/linkblueid/sandbox
StdErr=/scratch/linkblueid/sandbox/slurm-100868.err
StdIn=/dev/null
If you specify mail options in the job script, SLURM can email you when the job is queued and when it finishes.
Using Debug Partitions
When creating a new submission script, it is often best to test it first in a debug partition.
Debug partitions:
typically have a 1-hour time limit
are useful for identifying script syntax or configuration errors
allow faster testing before submitting to longer-running partitions
You can view the available debug partitions with:
sinfo | grep _D
Example output:
SKY32M192_D up 1:00:00 1 idle skylake056
P4V12_SKY32M192_D up 1:00:00 1 idle gphnode010
CAL48M192_D up 1:00:00 1 idle cascade001
To use one of these, specify it in your script with #SBATCH -p.
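Because sbatch command-line options override the matching #SBATCH directives in the script, you can also test an existing script in a debug partition without editing it:

```shell
# Run the unmodified script in the SKY32M192_D debug partition
# (command-line -p and --time override the script's own directives)
sbatch -p SKY32M192_D --time=00:30:00 run.sh
```

This is standard sbatch behavior; once the script works in the debug partition, resubmit it without the overrides to use its original partition and time limit.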
Interactive Use of a Compute Node
To allocate a compute node for interactive use, run:
srun -A col_exampleprojectname_uksr -t 01:00:00 -p SKY32M192_D --pty bash
When you are finished, exit the interactive session:
exit
For more information on partitions, see the relevant SLURM queue documentation.
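Once the node is allocated, you work in a normal shell on the compute node. A typical session might look like this (the module and program names are illustrative, not specific LCC software):

```shell
# Request a one-hour interactive shell on a debug node
srun -A col_exampleprojectname_uksr -t 01:00:00 -p SKY32M192_D --pty bash

# Now on the compute node: load software and run interactively
module load intel/19.0.3.199
./my_app -x -y -z

# Return to the login node when done
exit
```

Interactive sessions are subject to the same partition time limits as batch jobs; when the limit expires, the session is terminated.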
Common SLURM Commands
| Command | Description |
|---|---|
| sbatch <script> | Submit a SLURM job script |
| scancel <job_id> | Cancel the job with the specified job ID |
| squeue -u <user> | Show queued or running jobs for a user |
| sinfo | Show partitions, time limits, node counts, and node state |
SLURM Job Script Options
| Option | Short Version | Long Version | Example | Explanation |
|---|---|---|---|---|
| Job name | -J | --job-name | #SBATCH --job-name=myjob | Assign a custom label to the job |
| Partition / queue | -p | --partition | #SBATCH --partition=SKY32M192_D | Select the partition where the job will run |
| Time limit | -t | --time | #SBATCH --time=00:15:00 | Set the maximum runtime |
| Memory | (none) | --mem | #SBATCH --mem=8G | Request memory per node |
| Project account | -A | --account | #SBATCH -A <your project account> | Charge the job to a project account |
| Standard error filename | -e | --error | #SBATCH -e slurm-%j.err | Save stderr to a file |
| Standard output filename | -o | --output | #SBATCH -o slurm-%j.out | Save stdout to a file |
Notes
Do not run computational workloads directly on login nodes. Always submit jobs through SLURM or use an approved interactive workflow.
For additional examples, review:
/share/examples/LCC