Submitting jobs on MCC (for first-time users)

Say you have a scientific application that you want to run in a terminal. On your desktop or local machine, you can run it directly, like this (assuming the application is named "my_app"):

$ my_app -x -y -z  # Running my_app directly on a local machine

On MCC, you are not allowed to run applications directly like this. Instead, you write a job script and submit it to the system's queue with the program `sbatch`. The job script contains the line that runs your application, preceded by lines at the top that tell the job scheduler about your job. These lines begin with "#SBATCH ...". So, instead of running your program as in the previous example, you run it through sbatch like below:

$ sbatch my_job_script.sh

Example script files are located in /share/examples/MCC. Below is a short example. First, use a text editor (e.g. vim) to create the script file. Alternatively, you can create the script file on your local machine and copy it to MCC:

[linkblueid@mcc-login001 ~]$ vim ./test_job.sh

Example file content:

#!/bin/bash
#SBATCH --time=00:15:00            # Time limit for the job (REQUIRED)
#SBATCH --job-name=my_test_job     # Job name
#SBATCH --ntasks=1                 # Number of cores for the job. Same as SBATCH -n 1
#SBATCH --partition=normal         # Partition/queue to run the job in (REQUIRED)
#SBATCH -e slurm-%j.err            # Error file for this job
#SBATCH -o slurm-%j.out            # Output file for this job
#SBATCH -A <your project account>  # Project allocation account name (REQUIRED)

echo "Hello world. This is my first job"   # This is the program that will be executed. You will substitute this with your scientific program.
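
If you would rather not open an editor, the same file can be written from the command line with a here-document. The sketch below is one possible way to do it, not the only one. The account name col_pi123_uksr is just a placeholder (substitute your own project account), and `bash -n` only checks the script's syntax; it does not run or submit anything.

```shell
# Write the example job script without opening an editor.
# col_pi123_uksr is a placeholder; substitute your own project account.
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=00:15:00            # Time limit for the job (REQUIRED)
#SBATCH --job-name=my_test_job     # Job name
#SBATCH --ntasks=1                 # Number of cores for the job
#SBATCH --partition=normal         # Partition/queue to run the job in (REQUIRED)
#SBATCH -e slurm-%j.err            # Error file for this job
#SBATCH -o slurm-%j.out            # Output file for this job
#SBATCH -A col_pi123_uksr          # Project allocation account name (REQUIRED)

echo "Hello world. This is my first job"
EOF

# Syntax check only: the script is parsed but not executed or submitted.
bash -n test_job.sh && echo "syntax OK"
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the here-document, so the file is written exactly as typed.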

Then submit this job using `sbatch`. Remember, you are not allowed to run computations on the login nodes (i.e. you can't just execute your program directly). You must use sbatch because running the program directly on a login node can bog down that node and slow down other users. After you run `sbatch test_job.sh`, the system runs your job on a special set of machines called "compute nodes". After submitting the job, you will get an output confirming the submission, including the job's ID. Once the job is submitted, you can safely log off the login node and the job will remain in the system:

[linkblueid@mcc-login001]$ sbatch ./test_job.sh
Submitted batch job 123027

After you run the above command, your job may have to wait in the queue before it is actually executed, because many other users are using the system simultaneously. That is, it may take some time before your scientific program actually runs.

Once your job is done, you can see the Slurm job output files (Slurm is MCC's automated job scheduler):

[linkblueid@mcc-login001]$ ls .
slurm-123027.err  slurm-123027.out  test_job.sh

The job script should have printed our Hello World message to the Slurm output file:

[linkblueid@mcc-login001]$ cat slurm-123027.out
Hello world. This is my first job
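
The link between the job ID printed by sbatch and the names of the output files can be sketched in a few lines of shell. This is only an illustration: the submission line is copied from the example above instead of being produced by a real `sbatch` call, and the variable names are made up for this sketch.

```shell
# In a real session you would capture the line with:
#   submit_line=$(sbatch ./test_job.sh)
# Here it is copied from the example output above.
submit_line="Submitted batch job 123027"

# Keep only the last word of the line, which is the job ID.
job_id="${submit_line##* }"

# %j in "slurm-%j.out" and "slurm-%j.err" is replaced by the job ID.
out_file="slurm-${job_id}.out"
err_file="slurm-${job_id}.err"

echo "job id:      $job_id"
echo "output file: $out_file"
echo "error file:  $err_file"
```

Keeping the job ID in a variable makes it easy to reuse later, for example with `scontrol show job` or `scancel`.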

After you submit your job, you can see some useful details about it by calling `scontrol`:

[userid@mcc-login001 sample_job]$ scontrol show job 123027

In the output of this command, the "JobState" field shows the job's current state (for example, RUNNING while the job is running).

Here is a more realistic example. Note the additional SBATCH flags and commands, such as the flags that specify your email address: with them, you get an automated email when your job finishes.

#!/bin/bash
#SBATCH --time=00:15:00            # Time limit for the job (REQUIRED)
#SBATCH --job-name=myjob           # Job name
#SBATCH --nodes=1                  # Number of nodes to allocate. Same as SBATCH -N (Don't use this option for mpi jobs)
#SBATCH --ntasks=8                 # Number of cores to allocate. Same as SBATCH -n
#SBATCH --partition=normal         # Partition/queue to run the job in (REQUIRED)
#SBATCH -e slurm-%j.err            # Error file for this job
#SBATCH -o slurm-%j.out            # Output file for this job
#SBATCH -A <your project account>  # Project allocation account name (REQUIRED)
#SBATCH --mail-type=ALL            # Send email when job starts/ends
#SBATCH --mail-user=<enter email address here>   # Where email is sent to (optional)

module purge    			  # Unload other software modules
module load gnu9/9.3.0    ucx/1.9.0    libfabric/1.10.1   openmpi4/4.0.5
# Program execution command
mpirun -np 8 ./hello   # Run my application; -np matches the --ntasks value requested above

To see more examples, look at the script files under /share/examples/MCC.
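
The process count passed to mpirun should agree with the number of tasks requested through --ntasks. One way to keep the two in sync, sketched here under assumptions, is to read SLURM_NTASKS, an environment variable that Slurm sets inside a job when --ntasks is given. The helper name `mpi_ranks` and the fallback value of 1 are choices made for this illustration so that the snippet also runs outside a job.

```shell
# Number of MPI processes to start: SLURM_NTASKS inside a job,
# or 1 when the variable is not set (e.g. when testing outside Slurm).
mpi_ranks() {
    echo "${SLURM_NTASKS:-1}"
}

# Print the command instead of running it, so this is safe to try anywhere.
echo "mpirun -np $(mpi_ranks) ./hello"
```

Inside a job script you would drop the `echo` and run the mpirun command itself.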

Queues/partitions

Each job must run in a queue (also called a partition), which is a set of nodes with a specific set of resource limits, such as a maximum time limit. Partition/queue information can be found by running `sinfo` and `scontrol show partition <partition name>`:

`sinfo` shows all partitions, the time limit for each partition, and the state of each set of nodes in the partition. A queue/partition may have some nodes that are allocated (jobs are running on them) while other nodes are idle and ready to run jobs. This is why, for a given queue/partition, you may see one row whose nodes have the state "alloc" (allocated) and another row with the same partition name and the state "idle". The state "mix" means that a node has some of its CPUs allocated while others are idle.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 14-00:00:0     16    mix rome[008-009,012-013,018,021,026,033-036,038-039,041-043]
normal*      up 14-00:00:0     18  alloc rome[001-004,015-017,019-020,027-032,037,040,044]
normal*      up 14-00:00:0     10   idle rome[005-007,010-011,014,022-025]
jumbo        up 14-00:00:0      1    mix frome001

The sinfo output above is shortened. It shows two queues/partitions, normal and jumbo. The time limit for both partitions is 14 days.
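
To get a quick count of the idle nodes in one partition, the sinfo table can be summed with awk. The helper below is a hypothetical sketch, not an MCC-provided tool; it assumes the default column layout shown above and is fed the sample rows from this page so that it can be tried without access to the cluster.

```shell
# Sum the NODES column for rows of one partition that are in the "idle" state.
# Usage on the cluster: sinfo | idle_nodes normal
idle_nodes() {
    awk -v part="$1" '
        {
            name = $1
            sub(/\*$/, "", name)   # the default partition is marked with "*"
        }
        name == part && $5 == "idle" { total += $4 }
        END { print total + 0 }
    '
}

# Sample input copied from the shortened sinfo output above.
sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 14-00:00:0     16    mix rome[008-009,012-013,018,021,026,033-036,038-039,041-043]
normal*      up 14-00:00:0     18  alloc rome[001-004,015-017,019-020,027-032,037,040,044]
normal*      up 14-00:00:0     10   idle rome[005-007,010-011,014,022-025]
jumbo        up 14-00:00:0      1    mix frome001'

printf '%s\n' "$sample" | idle_nodes normal   # prints 10
printf '%s\n' "$sample" | idle_nodes jumbo    # prints 0
```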

For more information on queues/partitions, see: /wiki/spaces/UPR/pages/7668656

To see detailed information on a specific partition, do this:

[linkblueid@mcc-login001]$ scontrol show partition normal
PartitionName=normal
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=rome[001-020]
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2560 TotalNodes=20 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4000 MaxMemPerNode=UNLIMITED


Setting time limits

It's important to set a time limit for a job. The time limit tells Slurm to kill your job once it has run for the specified time (the idea is that you have an estimate of when the job will finish). If you don't specify a time limit, Slurm applies a default value equal to the maximum time for that partition (for example, some partitions have a maximum limit of 7 days). This is bad because Slurm will assume your job takes the longest time possible for that partition, and it will have to wait until enough resources are available to run your job for that long. As a result, jobs from other users may be placed ahead of yours in the queue if their time limits are much shorter than yours. Thus, if you know that your program will finish in 3 hours, you can set the time limit to, say, 3.5 hours (`--time=03:30:00`). If you have no idea how long a program runs, you may omit the time limit the first time you run a job, or you can judiciously choose a long time. See the "Queues/partitions" section above: each queue has its own maximum time limit.
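
Turning an estimate such as "3.5 hours" into the value that --time expects is simple arithmetic. The function below is a small sketch (the name `to_slurm_time` is made up for this page) that converts a whole number of minutes into the HH:MM:SS or D-HH:MM:SS form used in the examples on this page.

```shell
# Convert a whole number of minutes into [D-]HH:MM:SS.
to_slurm_time() {
    total_min=$1
    days=$(( total_min / 1440 ))            # 1440 minutes per day
    hours=$(( (total_min % 1440) / 60 ))
    mins=$(( total_min % 60 ))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
    else
        printf '%02d:%02d:00\n' "$hours" "$mins"
    fi
}

to_slurm_time 210    # 3.5 hours -> 03:30:00
to_slurm_time 2880   # 2 days    -> 2-00:00:00
```

The result can be pasted into a job script, for example as `#SBATCH --time=03:30:00`.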

Submitting a job and checking status

If you've named your job script run.sh, submit the job by doing this:

$ sbatch run.sh
Submitted batch job 100868

Note the job number above: you can use it to check the status of the job while it is waiting in the queue or running. Shortly after the job finishes, the information below is no longer available:

$ scontrol show job 100868
JobId=100868 JobName=sandbox
   UserId=linkblueid(2006) GroupId=users(100) MCS_label=N/A
   Priority=21277 Nice=0 Account=col_griff_uksr QOS=sl2
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-04-25T14:31:05 EligibleTime=2019-04-25T14:31:05
   AccrueTime=2019-04-25T14:31:05
   StartTime=2019-04-25T14:31:05 EndTime=2019-04-25T14:31:06 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-25T14:31:05
   Partition=SAN16M64_D AllocNode:Sid=login002:87012
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cnode256
   BatchHost=cnode256
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/linkblueid/sandbox/run.sh
   WorkDir=/scratch/linkblueid/sandbox
   StdErr=/scratch/linkblueid/sandbox/slurm-100868.err
   StdIn=/dev/null
   StdOut=/s
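
Since the scontrol output is a list of Key=Value pairs, a single field such as JobState can be pulled out with standard text tools. The helper below is a hypothetical sketch, not a Slurm command. It assumes simple field names (letters only, no "/" or other special characters) and values without spaces, and it is demonstrated on two lines copied from the output above.

```shell
# Print the value of one Key=Value field from scontrol output read on stdin.
# Usage on the cluster: scontrol show job 100868 | scontrol_field JobState
scontrol_field() {
    # Put every Key=Value pair on its own line, then strip the "Key=" prefix.
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Two lines copied from the example output above.
sample='   JobState=COMPLETED Reason=None Dependency=(null)
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A'

printf '%s\n' "$sample" | scontrol_field JobState    # prints COMPLETED
printf '%s\n' "$sample" | scontrol_field TimeLimit   # prints 00:01:00
```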


If you have specified the mail options (`--mail-type` and `--mail-user`, as in the second example script above), then Slurm will email you when your job starts and when it is done.

Interactive use of a compute node

To allocate a node and use it interactively:

srun -A col_exampleprojectname_uksr -t 01:00:00 -p normal --pty bash

or

salloc -N 1 --exclusive --partition=normal -A col_exampleprojectname_uksr -t 12:00:00

Once the node is allocated, you can ssh to that compute node from that terminal or from any login node. When you are done working interactively, please cancel the job by running the command

scancel jobid

or type exit, and then exit again, in the terminal where the salloc command was issued.
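
When several terminals are open, it is easy to lose track of whether a shell belongs to an allocation. A possible check, sketched below, relies on SLURM_JOB_ID, an environment variable that Slurm sets in shells started through salloc or srun. The function name `where_am_i` is invented for this example.

```shell
# Report whether this shell is running inside a Slurm allocation.
# SLURM_JOB_ID is set by Slurm in shells started by salloc or srun.
where_am_i() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside job $SLURM_JOB_ID (cancel with: scancel $SLURM_JOB_ID)"
    else
        echo "not inside a Slurm job"
    fi
}

where_am_i
```

On a login node outside any allocation this prints "not inside a Slurm job"; inside an interactive session it prints the job ID to pass to scancel.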

Common Slurm Commands

Command             Description
sbatch script_file  Submit a Slurm job script
scancel job_id      Cancel the job that has job_id
squeue -u user_id   Show the jobs in the queue for user_id
sinfo               Show partitions/queues, their time limits, number of nodes, and which compute nodes are running jobs or idle

Slurm Job Script Options

Each option is listed with its short version, long version, example(s), and explanation.

Job name
  Short version: #SBATCH -J jobname
  Long version:  #SBATCH --job-name=jobname
  Example:       #SBATCH --job-name=my_first_job
  Explanation:   The job is labeled with jobname (in addition to the integer job ID that is assigned automatically).

Partition/queue
  Short version: #SBATCH -p partition_id
  Long version:  #SBATCH --partition=partition_id
  Examples:      #SBATCH --partition=normal      # normal partition
                 #SBATCH --partition=jumbo       # jumbo partition
  Explanation:   The job runs on compute node(s) in partition_id.

Time limit
  Short version: #SBATCH -t time_limit
  Long version:  #SBATCH --time=time_limit
  Examples:      #SBATCH --time=01:00:00      # one hour limit
                 #SBATCH --time=2-00:00:00    # 2 day limit
  Explanation:   The job is killed if it reaches the specified time_limit.

Memory (RAM)
  Short version: (none)
  Long version:  #SBATCH --mem=memory_amount
  Example:       #SBATCH --mem=32g    # 32 GB of RAM requested
  Explanation:   The job can use up to the specified memory_amount.

Project account
  Short version: #SBATCH -A account
  Long version:  #SBATCH --account=account
  Example:       #SBATCH --account=col_pi123_uksr
  Explanation:   Run the job under this project account.

Standard error filename
  Short version: #SBATCH -e filename
  Long version:  #SBATCH --error=filename
  Examples:      #SBATCH --error=slurm%A_%a.err    # special variables: %A is replaced with the job array's job ID and %a with the array index
                 #SBATCH --error=prog_error.log    # You can use any file name (no whitespace)
  Explanation:   Standard error of the job is stored in filename.

Standard output filename
  Short version: #SBATCH -o filename
  Long version:  #SBATCH --output=filename
  Examples:      #SBATCH --output=slurm%A_%a.out
                 #SBATCH --output=prog_output.log
  Explanation:   Standard output of the job is stored in filename.
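
Because the example scripts on this page mark the time limit, partition, and project account as REQUIRED, it can help to check a script for those three options before submitting it. The checker below is a rough sketch with hypothetical function names. It only looks for #SBATCH lines that start at the beginning of a line, and the account col_pi123_uksr is a placeholder.

```shell
# Succeeds if the script ($1) has an #SBATCH line using the long option
# name ($2) or the short option letter ($3).
has_option() {
    grep -Eq "^#SBATCH[[:space:]]+(--$2([=[:space:]]|\$)|-$3([[:space:]]|\$))" "$1"
}

# Print "ok", or list which of the three REQUIRED options are missing.
check_required() {
    missing=""
    has_option "$1" time t      || missing="$missing --time"
    has_option "$1" partition p || missing="$missing --partition"
    has_option "$1" account A   || missing="$missing --account"
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Two throwaway scripts for demonstration, removed when the shell exits.
good=$(mktemp)
bad=$(mktemp)
trap 'rm -f "$good" "$bad"' EXIT
printf '%s\n' '#!/bin/bash' '#SBATCH --time=01:00:00' \
    '#SBATCH --partition=normal' '#SBATCH -A col_pi123_uksr' > "$good"
printf '%s\n' '#!/bin/bash' '#SBATCH -t 01:00:00' \
    '#SBATCH -p normal' > "$bad"

check_required "$good"          # prints: ok
check_required "$bad" || true   # prints: missing: --account
```

A typical use would be `check_required run.sh && sbatch run.sh`, so that the script is submitted only when all three options are present.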

Center for Computational Sciences