Submitting jobs on LCC (for first-time users)

This page introduces the basic process for submitting jobs on the Lipscomb Compute Cluster (LCC) using SLURM. It covers job scripts, partitions, time limits, job submission, interactive use, and common SLURM commands.



Overview

Suppose you have a scientific application that you want to run in a terminal. On your local machine, you might run it directly like this:

$ my_app -x -y -z    # running my_app directly on your local machine

On LCC, you are not allowed to run computational jobs directly on the login node. Instead, you must create a job script and submit it to SLURM using sbatch.

Example:

$ sbatch my_job_script.sh

Example script files are available in:

/share/examples/LCC

You can create a script file with a text editor on LCC, or create it locally and copy it to the cluster.

Example:

vim ./test_job.sh

 

Example Job Script

Below is a short example of a SLURM job script.

#!/bin/bash
#SBATCH --time=00:15:00              # Time limit for the job (REQUIRED)
#SBATCH --job-name=my_test_job       # Job name
#SBATCH --ntasks=1                   # Number of cores for the job
#SBATCH --partition=SKY32M192_D      # Partition/queue to run the job in (REQUIRED)
#SBATCH -e slurm-%j.err              # Error file for this job
#SBATCH -o slurm-%j.out              # Output file for this job
#SBATCH -A <your project account>    # Project allocation account name (REQUIRED)

echo "Hello world. This is my first job"

Submit the job with:

sbatch ./test_job.sh

After submission, SLURM returns a job ID:

Submitted batch job 123027

Once submitted, your job will run on a compute node, not on the login node. You may safely log out after submission; the job remains in the queue or continues running in the system.
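After logging back in, you can confirm that the job is still queued or running. A minimal check, assuming the job ID 123027 from the example above (these commands only work on the cluster itself):

```shell
# List your own queued and running jobs (replace <user_id> with your LCC user ID)
squeue -u <user_id>

# Or check one specific job by its job ID
squeue -j 123027
```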

 

Viewing Job Output

When the job is complete, the SLURM output files will be present in your working directory:

ls .
slurm-123027.err  slurm-123027.out  test_job.sh

You can inspect the output file:

cat slurm-123027.out
Hello world. This is my first job.

 

Viewing Job Details with scontrol

After submitting a job, you can inspect job metadata with scontrol:

scontrol show job 123027

Example output:

JobId=123027 JobName=my_test_job
   UserId=userid(1234) GroupId=users(100) MCS_label=N/A
   Priority=10018 Nice=0 Account=col_cwa236_uksr QOS=sl2
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=13-01:14:11 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2019-10-26T16:26:49 EligibleTime=2019-10-26T16:26:49
   AccrueTime=2019-10-26T16:26:49
   StartTime=2019-10-26T16:26:50 EndTime=2019-11-09T15:26:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-10-26T16:26:50
   Partition=CAS48M192_L AllocNode:Sid=login002:65722
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cascade030
   BatchHost=cascade030
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./my_app
   WorkDir=/scratch/userid/myproj
   StdErr=/scratch/userid/myproj/slurm-123027.err
   StdIn=/dev/null
   StdOut=/scratch/userid/myproj/slurm-123027.out

In the example above, JobState=RUNNING indicates that the job is currently running.
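If you only want the job's state rather than the full record, the scontrol output can be filtered with standard tools (grep here is ordinary shell filtering, not a SLURM feature; the job ID is the one from the example above):

```shell
# Show only the state and runtime fields of the job record
scontrol show job 123027 | grep -oE 'JobState=[A-Z]+|RunTime=[0-9:-]+'
```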

 

A More Realistic Example

Below is a more realistic SLURM script with additional options:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --partition=SKY32M192_D
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out
#SBATCH -A <your project account>
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<enter email address here>

module purge
module load intel/19.0.3.199
module load impi/2019.3.199
module load ccs/nwchem/6.8

mpirun -n 8 nwchem Input_c240_pbe0.nw

To see more examples, review the files in:

/share/examples/LCC

 

Queues and Partitions

Each job runs in a queue, called a partition in SLURM. A partition is a set of nodes with specific resource limits, such as processor type, memory, and maximum walltime.

You can view partition information with:

sinfo

The sinfo command shows:

  • partition name

  • time limit

  • number of nodes

  • node state

  • node list

Example output:

PARTITION          AVAIL  TIMELIMIT   NODES  STATE  NODELIST
SKY32M192_L        up     14-00:00:0      5  mix    skylake[001,003,040,042,050]
SKY32M192_L        up     14-00:00:0     13  alloc  skylake[002,004-007,009,012,022,036-038,046,049]
SKY32M192_L        up     14-00:00:0     33  idle   skylake[008,011,013-021,023-034,039,043-045,047-048,051-054]
SKY32M192_D        up     1:00:00         1  idle   skylake056
P4V16_HAS16M128_L  up     3-00:00:00      2  idle   gpdnode[001-002]
P4V12_SKY32M192_L  up     3-00:00:00      1  comp   gphnode001
P4V12_SKY32M192_L  up     3-00:00:00      1  alloc  gphnode002
P4V12_SKY32M192_L  up     3-00:00:00      5  idle   gphnode[004-006,008-009]
P4V12_SKY32M192_D  up     1:00:00         1  idle   gphnode010
V4V16_SKY32M192_L  up     3-00:00:00      2  idle   gvnode[001-002]
V4V32_SKY32M192_L  up     3-00:00:00      1  mix    gvnode003
V4V32_SKY32M192_L  up     3-00:00:00      1  alloc  gvnode004
V4V32_SKY32M192_L  up     3-00:00:00      2  idle   gvnode[005-006]
A2V80_ICE56M256_L  up     3-00:00:00      2  mix    ganode[002-003]
A2V80_ICE56M256_L  up     3-00:00:00      2  idle   ganode[001,004]
...

In the example above:

  • SKY32M192_L has a time limit of 14 days

  • some nodes are alloc (allocated)

  • some nodes are idle (available)

  • some nodes are mix (partially allocated)

For detailed information on a specific partition, use:

scontrol show partition SKY32M192_L

Example output:

PartitionName=SKY32M192_L
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=skylake[001-009,011-034,036-054]
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1664 TotalNodes=52 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=6000 MaxMemPerCPU=6000
   TRESBillingWeights=CPU=1.0,Mem=0.1666G

 

Setting Time Limits

It is important to set a time limit for each job.

The time limit tells SLURM the maximum walltime your job may use; if the job is still running when the limit is reached, SLURM kills it. If you do not specify a time limit, SLURM may use the maximum time allowed for the selected partition. This can negatively affect scheduling, because SLURM must assume your job will need the longest possible runtime.

If your application will likely finish in 3 hours, it is better to request something like 3.5 hours rather than the maximum allowed by the partition.

Each partition has its own maximum time limit. Review partition details with sinfo and scontrol show partition.
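SLURM's sbatch accepts several formats for --time. The following directives are equivalent ways of requesting the 3.5 hours suggested above:

```shell
#SBATCH --time=210            # minutes
#SBATCH --time=03:30:00       # hours:minutes:seconds
#SBATCH --time=0-03:30:00     # days-hours:minutes:seconds
```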

 

Submitting a Job and Checking Status

If your job script is named run.sh, submit it with:

sbatch run.sh

Example output:

Submitted batch job 100868

You can check job details while it is queued or running with:

scontrol show job 100868

Example output:

JobId=100868 JobName=sandbox
   UserId=linkblueid(2006) GroupId=users(100) MCS_label=N/A
   Priority=21277 Nice=0 Account=col_exampleprojectname_uksr QOS=sl2
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-04-25T14:31:05 EligibleTime=2019-04-25T14:31:05
   AccrueTime=2019-04-25T14:31:05
   StartTime=2019-04-25T14:31:05 EndTime=2019-04-25T14:31:06 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-25T14:31:05
   Partition=SAN16M64_D AllocNode:Sid=login002:87012
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cnode256
   BatchHost=cnode256
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/linkblueid/sandbox/run.sh
   WorkDir=/scratch/linkblueid/sandbox
   StdErr=/scratch/linkblueid/sandbox/slurm-100868.err
   StdIn=/dev/null

If you specify mail options in the job script, SLURM can email you when the job starts and when it finishes or fails.

 

Using Debug Partitions

When creating a new submission script, it is often best to test it first in a debug partition.

Debug partitions:

  • typically have a 1-hour time limit

  • are useful for identifying script syntax or configuration errors

  • allow faster testing before submitting to longer-running partitions

You can view the available debug partitions with:

sinfo | grep _D

Example output:

SKY32M192_D        up  1:00:00  1  idle  skylake056
P4V12_SKY32M192_D  up  1:00:00  1  idle  gphnode010
CAL48M192_D        up  1:00:00  1  idle  cascade001

To use one of these, specify it in your script with #SBATCH -p.
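As a sketch, a minimal script for a debug run might look like the following (the account name is a placeholder, and SKY32M192_D is the debug partition shown in the sinfo output above):

```shell
#!/bin/bash
#SBATCH --time=00:10:00              # well under the 1-hour debug limit
#SBATCH --job-name=debug_test
#SBATCH --ntasks=1
#SBATCH -p SKY32M192_D               # debug partition
#SBATCH -A <your project account>

echo "Debug run OK"
```

If this runs cleanly in the debug partition, resubmit the same script to a production partition with a realistic time limit.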

 

Interactive Use of a Compute Node

To allocate a compute node for interactive use, run:

srun -A col_exampleprojectname_uksr -t 01:00:00 -p SKY32M192_D --pty bash

When you are finished, exit the interactive session:

exit
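Inside the interactive session you are on a compute node, so you may run your application directly. A hypothetical session might look like this (my_app and the module name are placeholders, not software guaranteed to be installed):

```shell
# Confirm you are on a compute node rather than a login node
hostname

# Load whatever software your application needs, then run it directly
module load ccs/nwchem/6.8   # example module from the script above
./my_app -x -y -z            # placeholder application
```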

For more information on partitions, see the relevant SLURM queue documentation.

 

Common SLURM Commands

Command               Description
-------               -----------
sbatch script_file    Submit a SLURM job script
scancel job_id        Cancel the job with the specified job ID
squeue -u user_id     Show queued or running jobs for a user
sinfo                 Show partitions, time limits, node counts, and node state
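For example, to cancel a running job and confirm it is gone (using the job ID from the earlier example; these commands only work on the cluster):

```shell
scancel 100868         # cancel the job with ID 100868
squeue -u <user_id>    # the job should no longer be listed
```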

 

SLURM Job Script Options

Job name
  Short form:   #SBATCH -J jobname
  Long form:    #SBATCH --job-name=jobname
  Example:      #SBATCH --job-name=my_first_job
  Explanation:  Assign a custom label to the job

Partition / queue
  Short form:   #SBATCH -p partition_id
  Long form:    #SBATCH --partition=partition_id
  Example:      #SBATCH --partition=HAS24M128_D
  Explanation:  Select the partition where the job will run

Time limit
  Short form:   #SBATCH -t time_limit
  Long form:    #SBATCH --time=time_limit
  Example:      #SBATCH --time=01:00:00
  Explanation:  Set the maximum runtime

Memory
  Short form:   (none)
  Long form:    #SBATCH --mem=memory_amount
  Example:      #SBATCH --mem=32g
  Explanation:  Request memory per node

Project account
  Short form:   #SBATCH -A account
  Long form:    #SBATCH --account=account
  Example:      #SBATCH --account=col_pi123_uksr
  Explanation:  Charge the job to a project account

Standard error filename
  Short form:   #SBATCH -e filename
  Long form:    #SBATCH --error=filename
  Example:      #SBATCH --error=prog_error.log
  Explanation:  Save stderr to a file

Standard output filename
  Short form:   #SBATCH -o filename
  Long form:    #SBATCH --output=filename
  Example:      #SBATCH --output=prog_output.log
  Explanation:  Save stdout to a file
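Putting the long-form options above together, a template script might look like this (the account and the program name are placeholders to fill in):

```shell
#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --partition=SKY32M192_D
#SBATCH --time=01:00:00
#SBATCH --mem=32g
#SBATCH --account=<your project account>
#SBATCH --error=prog_error.log
#SBATCH --output=prog_output.log

./my_app    # placeholder for your program
```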

 

Notes

Do not run computational workloads directly on login nodes. Always submit jobs through SLURM or use an approved interactive workflow.

For additional examples, review:

/share/examples/LCC

Center for Computational Sciences