Submitting jobs on LCC (for first-time users)

Say you have a scientific application that you want to run from a terminal. On your desktop/local machine, you can just run it directly like this (if the application is named "my_app"):

$ my_app -x -y -z  # Running my_app directly on your local machine

On LCC, you are not allowed to run your program directly like this. Instead, you write a job script and submit it to the system with the program `sbatch`, which places the job in a queue. The job script contains the line that runs your application, plus lines at the top that tell the job scheduler about your job; these lines begin with "#SBATCH ...". So, instead of running your program as in the previous example, you would run it through sbatch like this:

$ sbatch my_job_script.sh

Example script files are located in /share/examples/LCC. Below is a short example. First, use a text editor (e.g. vim) to create the script file. Or, you can create a script file on your local machine and copy it to LCC:

[linkblueid@login001 ~]$ vim ./test_job.sh

Example file content:

#!/bin/bash
#SBATCH --time=00:15:00     # Time limit for the job (REQUIRED). 
#SBATCH --job-name=my_test_job    # Job name
#SBATCH --ntasks=1       # Number of cores for the job. Same as SBATCH -n 1
#SBATCH --partition=SKY32M192_D     # Partition/queue to run the job in. (REQUIRED)
#SBATCH -e slurm-%j.err  # Error file for this job. 
#SBATCH -o slurm-%j.out  # Output file for this job.
#SBATCH -A <your project account>  # Project allocation account name (REQUIRED)

echo "Hello world. This is my first job"   # This is the program that will be executed. You will substitute this with your scientific program.

Then run this job using `sbatch`. Remember, you are not allowed to run computations on the login nodes (i.e. you can't just execute your program directly). You must use sbatch because running a program directly on a login node can bog it down and slow down other users. When you do `sbatch test_job.sh`, the system runs your job on a special set of machines called "compute nodes". After submitting the job, you will get a message saying so, including the job's id. Once the job is submitted, you can safely log off the login node and the job will remain in the system:

[linkblueid@login001 ~]$ sbatch ./test_job.sh
Submitted batch job 123027

When you run the above command, your job may have to wait in the queue before it is actually executed, since many other users share the system. That is, it may take some time before your scientific program actually runs.
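
While the job is waiting (or running), you can list your own jobs with `squeue`. Below is a minimal illustrative example (the job id, partition, and reason shown are made up for this sketch); in the ST column, PD means pending and R means running:

[linkblueid@login001 ~]$ squeue -u linkblueid
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123027 SKY32M192 my_test_ linkblue PD       0:00      1 (Priority)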

Once your job is done, you can see the Slurm job output files (Slurm is LCC's automated job scheduler):

[linkblueid@login006 guide]$ ls .
slurm-123027.err  slurm-123027.out  test_job.sh

The job script should have printed our Hello World message to the Slurm output file:

[linkblueid@login006 guide]$ cat slurm-123027.out 
Hello world. This is my first job
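
If something had gone wrong, error messages would normally end up in the job's .err file; for this job it should simply be empty:

[linkblueid@login006 guide]$ cat slurm-123027.err
[linkblueid@login006 guide]$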

After you submit your job, you can see some useful details about it by calling `scontrol` (the example output below is from a different, longer-running job):

[userid@login001 sample_job]$ scontrol show job 123027
JobId=123027 JobName=my_test_job
   UserId=userid(1234) GroupId=users(100) MCS_label=N/A
   Priority=10018 Nice=0 Account=col_cwa236_uksr QOS=sl2
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=13-01:14:11 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2019-10-26T16:26:49 EligibleTime=2019-10-26T16:26:49
   AccrueTime=2019-10-26T16:26:49
   StartTime=2019-10-26T16:26:50 EndTime=2019-11-09T15:26:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-10-26T16:26:50
   Partition=CAS48M192_L AllocNode:Sid=login002:65722
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cascade030
   BatchHost=cascade030
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./my_app
   WorkDir=/scratch/userid/myproj
   StdErr=/scratch/userid/myproj/slurm-123027.err
   StdIn=/dev/null
   StdOut=/scratch/userid/myproj/slurm-123027.out
   Power=

Above, "JobState" tells us the job is currently running

Here is another, more realistic example. Note the additional SBATCH flags and commands, such as the ones that specify your email address; with those set, you will get an automated email when your job starts and when it finishes.

#!/bin/bash
#SBATCH --time 00:15:00     # Time limit for the job (REQUIRED)
#SBATCH --job-name=myjob    # Job name
#SBATCH --nodes=1        # Number of nodes to allocate. Same as SBATCH -N (Don't use this option for mpi jobs)
#SBATCH --ntasks=8       # Number of cores to allocate. Same as SBATCH -n
#SBATCH --partition=SKY32M192_D     # Partition/queue to run the job in. (REQUIRED)
#SBATCH -e slurm-%j.err  # Error file for this job. 
#SBATCH -o slurm-%j.out  # Output file for this job.
#SBATCH -A <your project account>  # Project allocation account name (REQUIRED)
#SBATCH --mail-type ALL    # Send email when job starts/ends 
#SBATCH --mail-user <enter email address here>   # Where email is sent to (optional)

module purge    # Unload other software modules
module load intel/19.0.3.199  # Load necessary software modules
module load impi/2019.3.199
module load ccs/nwchem/6.8 

mpirun -n $SLURM_NTASKS nwchem Input_c240_pbe0.nw   # Run my application on the number of cores requested with --ntasks above
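
The module names and versions above change over time, so it is worth checking what is currently installed before writing your own `module load` lines. A few commands that help (the nwchem pattern is just an example):

$ module avail            # list all software modules available on LCC
$ module avail nwchem     # search the available modules for nwchem
$ module list             # show modules currently loaded in your session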

To see more examples, look at the script files under /share/examples/LCC.

Queues/partitions

Each job needs to run in a queue/partition: a set of nodes with a specific set of resource limits (e.g. a maximum run time and a memory limit per CPU). Partition/queue information can be found by running `sinfo` and `scontrol show partition <partition name>`:

`sinfo` shows all partitions, the time limit for each partition, and the state of the nodes in each partition. A queue/partition may have some nodes that are allocated (jobs are running on them) while other nodes are idle and ready to accept jobs. This is why, for a given queue/partition, you may see one row listing nodes in the "alloc" (allocated) state and another row with the same partition name listing nodes in the "idle" state.

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
HAS24M128_L up 14-00:00:0 15 alloc haswell[001-004,009-019]
HAS24M128_L up 14-00:00:0 4 idle haswell[005-008]
HAS24M128_M up 7-00:00:00 15 alloc haswell[001-004,009-019]

A shortened output for LCC is shown above; LCC has many more queues/partitions. Above, there are two queues/partitions, HAS24M128_L and HAS24M128_M. HAS24M128_L has 15 nodes that are allocated and 4 that are idle, and its time limit is 14 days. You can also see the specific node names (e.g. haswell001 is in the HAS24M128_L queue/partition).
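
If you only care about one queue/partition, you can pass its name to sinfo (the partition name here is one of those shown above):

$ sinfo -p HAS24M128_L    # show only the HAS24M128_L partition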

For more information on queues/partitions, see: /wiki/spaces/PreReleaseUKYHPCDocs/pages/21332176

To see detailed information on a specific partition, do this:

[linkblueid@login002 ~]$ scontrol show partition HAS24M128_L
PartitionName=HAS24M128_L
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=haswell[001-019]
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=456 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=5000 MaxMemPerCPU=5000

Setting time limits

It's important to set a time limit for a job. The time limit tells SLURM to kill your job after the specified time, so the idea is that you give an estimate of when the job will finish. If you don't specify a time limit, SLURM fills in a default equal to the maximum time for that partition (for example, some partitions have a maximum of 7 days). This is bad because SLURM then assumes your job needs the longest time possible for that partition and must wait until enough resources are free for that long; jobs from other users with much shorter time limits may therefore be scheduled ahead of yours. So if you know that your program finishes in 3 hours, set the time limit to, say, 3.5 hours. If you have no idea how long the program runs, you may omit the time limit the first time you run the job, or judiciously choose a long value. See the section "Queues/partitions" above: each queue has a different maximum time limit.
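
For the 3-hour estimate mentioned above, the corresponding line in the job script would look like the first line below; you can also override a script's limit at submission time with the --time flag (the script name here is just a placeholder):

#SBATCH --time=03:30:00     # 3.5-hour limit, leaving a margin over the expected 3-hour run

$ sbatch --time=03:30:00 my_job_script.sh    # overrides whatever --time the script itself sets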

Submitting a job and checking status

If you've named your job script run.sh, submit the job like this:

sbatch run.sh
Submitted batch job 100868

You can note the job number above and use it to check the job's status while it's waiting in the queue or while it's running. Shortly after the job has finished, the information below will no longer be available from scontrol:

$ scontrol show job 100868
JobId=100868 JobName=sandbox
   UserId=linkblueid(2006) GroupId=users(100) MCS_label=N/A
   Priority=21277 Nice=0 Account=col_griff_uksr QOS=sl2
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-04-25T14:31:05 EligibleTime=2019-04-25T14:31:05
   AccrueTime=2019-04-25T14:31:05
   StartTime=2019-04-25T14:31:05 EndTime=2019-04-25T14:31:06 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-25T14:31:05
   Partition=SAN16M64_D AllocNode:Sid=login002:87012
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cnode256
   BatchHost=cnode256
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/linkblueid/sandbox/run.sh
   WorkDir=/scratch/linkblueid/sandbox
   StdErr=/scratch/linkblueid/sandbox/slurm-100868.err
   StdIn=/dev/null
   StdOut=/s
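
If a job has already disappeared from `scontrol show job`, the Slurm accounting command `sacct` can usually still report a summary of it (this assumes job accounting is enabled on the cluster; the job id is the one from above):

$ sacct -j 100868 --format=JobID,JobName,Partition,State,Elapsed,ExitCode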


If you have specified the mail options (see the second example script above), SLURM will email you when your job starts and when it finishes (or fails).

Using debug partitions

If you've just created a new job script, it's advisable to run it first on a debug queue/partition. A debug queue/partition has a time limit of 1 hour, but it's useful for checking whether there are errors in your submission script (if you skip the debug queue and submit to the other queues, you may have to wait while long-running jobs occupy the nodes). Once you see that the script runs, you can cancel it and submit it to the queues with longer time limits.

Here are the 4 debug partitions (the partition name is in the far left column):

[shuso2@login001 ~]$ sinfo | grep _D
HAS24M128_D          up    1:00:00      1   idle haswell020
SAN16M64_D           up    1:00:00      1   idle cnode256
SKY32M192_D          up    1:00:00      1   idle skylake056
CAS48M192_D          up    1:00:00      1   idle cascade001

You can then submit a Slurm script by specifying #SBATCH -p with one of the above partition names, as in the sketch below.
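
For example, you could point the earlier test job at the skylake debug partition either by editing the script's partition line or by overriding it on the command line (the script name is reused from the first example above):

#SBATCH -p SKY32M192_D      # in the script: run in the skylake debug partition

$ sbatch -p SKY32M192_D ./test_job.sh    # or override the partition at submission time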

Interactive use of a compute node

To allocate a node and use it interactively:

srun -A col_exampleprojectname_uksr -t 01:00:00 -p SAN16M64_D --pty bash

Once done working interactively, please exit the interactive session by executing:

exit
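
Putting it together, an interactive session might look like the following sketch (the hostnames, account, time, and module version are illustrative):

[linkblueid@login001 ~]$ srun -A col_exampleprojectname_uksr -t 01:00:00 -p SAN16M64_D --pty bash
[linkblueid@cnode256 ~]$ module load intel/19.0.3.199    # you are now on a compute node; commands run on it directly
[linkblueid@cnode256 ~]$ ./my_app -x -y -z               # run your application interactively
[linkblueid@cnode256 ~]$ exit                            # leave the compute node and return to the login node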

Common Slurm Commands

Command              Description
sbatch script_file   Submit a SLURM job script
scancel job_id       Cancel the job that has job_id
squeue -u user_id    Show jobs that are on the queue for user_id
sinfo                Show partitions/queues, their time limits, number of nodes, and which compute nodes are running jobs or idle.
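
For example, to cancel a job you no longer need (using the job id that sbatch printed when you submitted it):

$ scancel 123027    # cancel the job with id 123027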

Slurm Job Script Options

Option: Job name
  Short version: #SBATCH -J jobname
  Long version:  #SBATCH --job-name=jobname
  Example:       #SBATCH --job-name=my_first_job
  The job will be labeled with jobname (in addition to the integer id automatically assigned by SLURM).

Option: Partition/queue
  Short version: #SBATCH -p partition_id
  Long version:  #SBATCH --partition=partition_id
  Examples:      #SBATCH --partition=HAS24M128_D      # haswell partition
                 #SBATCH --partition=SKY32M192_D      # skylake partition
  The job will run on compute node(s) that are in partition_id.

Option: Time limit
  Short version: #SBATCH -t time_limit
  Long version:  #SBATCH --time=time_limit
  Examples:      #SBATCH --time=01:00:00       # one hour limit
                 #SBATCH --time=2-00:00:00     # 2 day limit
  The job will be killed if it reaches the specified time_limit.

Option: Memory (RAM)
  Long version:  #SBATCH --mem=memory_amount
  Example:       #SBATCH --mem=32g    # ask for 32 GB of RAM
  The job can use up to the specified memory_amount per node.

Option: Project account
  Short version: #SBATCH -A account
  Long version:  #SBATCH --account=account
  Example:       #SBATCH --account=col_pi123_uksr
  Run the job under this project account.

Option: Standard error filename
  Short version: #SBATCH -e filename
  Long version:  #SBATCH --error=filename
  Examples:      #SBATCH --error=slurm%A_%a.err    # special variables; substituted with the job (array) id and array task id
                 #SBATCH --error=prog_error.log    # any file name without whitespace works
  Standard error of the job will be stored under filename.

Option: Standard output filename
  Short version: #SBATCH -o filename
  Long version:  #SBATCH --output=filename
  Examples:      #SBATCH --output=slurm%A_%a.out
                 #SBATCH --output=prog_output.log
  Standard output of the job will be stored under filename.
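
As a reference, a job script header combining several of the options above might look like the following sketch (the job name, partition, account, memory amount, and program are placeholders to replace with your own):

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --partition=SKY32M192_D
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=32g
#SBATCH --account=col_pi123_uksr
#SBATCH --output=prog_output.log
#SBATCH --error=prog_error.log

./my_app -x -y -z    # replace with your scientific program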
