Submitting jobs on LCC (for first-time users)
Say that you have a scientific application that you want to run in a terminal. On your desktop or local machine, you can run it directly like this (if the application is named "my_app"):
$ my_app -x -y -z # Running my_app directly in local machine
On LCC, you are not allowed to run applications directly like this. Instead, you write a job script and use the program `sbatch` to submit it to a queue. The job script includes the line that runs your application, preceded by lines at the top that tell the job scheduler about your job. These lines begin with "#SBATCH ...". So, instead of running your program as in the previous example, you run it through sbatch like below:
$ sbatch my_job_script.sh
Example script files are located in /share/examples/LCC. Below is a short example. First, use a text editor (e.g. vim) to create the script file. Or, you can create a script file on your local machine and copy it to LCC:
[linkblueid@login001 ~]$ vim ./test_job.sh
Example file content:
#!/bin/bash
#SBATCH --time=00:15:00             # Time limit for the job (REQUIRED).
#SBATCH --job-name=my_test_job      # Job name
#SBATCH --ntasks=1                  # Number of cores for the job. Same as SBATCH -n 1
#SBATCH --partition=SKY32M192_D     # Partition/queue to run the job in. (REQUIRED)
#SBATCH -e slurm-%j.err             # Error file for this job.
#SBATCH -o slurm-%j.out             # Output file for this job.
#SBATCH -A <your project account>   # Project allocation account name (REQUIRED)

echo "Hello world. This is my first job"   # This is the program that will be executed. You will substitute this with your scientific program.
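Because three of the options above are marked REQUIRED, it can help to check a script for them before submitting it. The following is a minimal sketch, not a tool provided by LCC or Slurm: the function name `has_directive`, the account name `my_project_account`, and the temporary sample file are all made up for illustration, and the check only recognizes the plain `--option=value`, `--option value`, and `-X value` spellings used in this guide.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or LCC): succeeds if the script in $1
# has an #SBATCH line that uses the long option --$2 or the short option -$3.
has_directive() {
    grep -Eq "^#SBATCH[[:space:]]+(--$2[=[:space:]]|-$3[[:space:]])" "$1"
}

# Write a sample job script to a temporary file for this demonstration.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --partition=SKY32M192_D
#SBATCH -A my_project_account
echo "Hello world. This is my first job"
EOF

# Check the three options that this guide marks as REQUIRED.
missing=0
has_directive "$script" time t      || { echo "Missing --time (or -t)"; missing=1; }
has_directive "$script" partition p || { echo "Missing --partition (or -p)"; missing=1; }
has_directive "$script" account A   || { echo "Missing --account (or -A)"; missing=1; }

if [ "$missing" -eq 0 ]; then
    echo "All required options are present"
fi
rm -f "$script"
```

A check like this only looks at the text of the script; the scheduler still has the final say on whether the values are valid.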
Then run this job using `sbatch`. Remember, you are not allowed to run computations on the login nodes (i.e. you can't just execute your program directly). You must use sbatch because running the program directly on a login node can bog it down and slow down other users. After you run `sbatch ./test_job.sh`, the system runs your job on a special set of machines called "compute nodes". After submitting the job, you will get a confirmation message that includes the job's id. Once the job is submitted, you can safely log off the login node and the job will still be in the system:
[linkblueid@login001 ~]$ sbatch ./test_job.sh
Submitted batch job 123027
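If you want to reuse the job id in later commands, you can pull it out of the confirmation line. This is a minimal sketch under the assumption that the message always has the form "Submitted batch job <id>" as shown above; the function name `extract_job_id` is made up for illustration. So that it runs anywhere, it parses a sample line instead of calling sbatch.

```shell
#!/bin/bash
# Hypothetical helper: print the numeric job id found in sbatch's
# confirmation line (the fourth word of "Submitted batch job <id>").
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# On LCC you would capture the real message instead, for example:
#   submit_msg=$(sbatch ./test_job.sh)
submit_msg="Submitted batch job 123027"   # sample line taken from this guide

job_id=$(extract_job_id "$submit_msg")
echo "Job id: $job_id"
```

sbatch also has a `--parsable` option that prints the job id without the surrounding text, which may be simpler if you do not need the human-readable message.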
After you run the above command, your job may have to wait in the queue before it is actually executed, because many other users are using the system simultaneously. That is, it may take some time before your scientific program actually runs.
Once your job is done, you can see the slurm job output files (slurm is LCC's automated job scheduler):
[linkblueid@login006 guide]$ ls .
slurm-123027.err  slurm-123027.out  test_job.sh
The job script should have printed our Hello World message to the slurm output file:
[linkblueid@login006 guide]$ cat slurm-123027.out
Hello world. This is my first job
After you submit your job, you can see some useful details about it by calling `scontrol`:
[userid@login001 sample_job]$ scontrol show job 123027
JobId=123027 JobName=my_test_job
   UserId=userid(1234) GroupId=users(100) MCS_label=N/A
   Priority=10018 Nice=0 Account=col_cwa236_uksr QOS=sl2
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=13-01:14:11 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2019-10-26T16:26:49 EligibleTime=2019-10-26T16:26:49
   AccrueTime=2019-10-26T16:26:49
   StartTime=2019-10-26T16:26:50 EndTime=2019-11-09T15:26:50 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-10-26T16:26:50
   Partition=CAS48M192_L AllocNode:Sid=login002:65722
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cascade030
   BatchHost=cascade030
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./my_app
   WorkDir=/scratch/userid/myproj
   StdErr=/scratch/userid/myproj/slurm-123037.err
   StdIn=/dev/null
   StdOut=/scratch/userid/myproj/slurm-123037.out
   Power=
Above, "JobState" tells us that the job is currently running.
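When you only care about one field, such as JobState, you can filter the `scontrol show job` text instead of reading all of it. This is a minimal sketch: the helper name `job_field` is made up for illustration, and it assumes that the field you ask for has a value without spaces (true for JobState, not necessarily for every field). It runs on a short excerpt of the output shown above rather than calling scontrol.

```shell
#!/bin/bash
# Hypothetical helper: read `scontrol show job` text on stdin and print the
# value of the first Key=Value pair whose key equals $1.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        $1 == key { print substr($0, length(key) + 2); exit }'
}

# Excerpt of the sample output shown in this guide.
sample='JobId=123027 JobName=my_test_job
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=13-01:14:11 TimeLimit=14-00:00:00 TimeMin=N/A'

state=$(printf '%s\n' "$sample" | job_field JobState)
echo "Job state: $state"

# On LCC the same helper could be fed real output:
#   scontrol show job 123027 | job_field JobState
```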
Here is a more realistic example. Note the additional SBATCH flags and commands, such as the flags that specify your email address: with them, you will get an automated email when your job finishes.
#!/bin/bash
#SBATCH --time 00:15:00                          # Time limit for the job (REQUIRED)
#SBATCH --job-name=myjob                         # Job name
#SBATCH --nodes=1                                # Number of nodes to allocate. Same as SBATCH -N (Don't use this option for mpi jobs)
#SBATCH --ntasks=8                               # Number of cores to allocate. Same as SBATCH -n
#SBATCH --partition=SKY32M192_D                  # Partition/queue to run the job in. (REQUIRED)
#SBATCH -e slurm-%j.err                          # Error file for this job.
#SBATCH -o slurm-%j.out                          # Output file for this job.
#SBATCH -A <your project account>                # Project allocation account name (REQUIRED)
#SBATCH --mail-type ALL                          # Send email when job starts/ends
#SBATCH --mail-user <enter email address here>   # Where email is sent to (optional)

module purge                                     # Unload other software modules
module load intel/19.0.3.199                     # Load necessary software modules
module load impi/2019.3.199
module load ccs/nwchem/6.8

mpirun -n 8 nwchem Input_c240_pbe0.nw            # Run my application (process count matches --ntasks)
To see more examples, look at the script files under /share/examples/LCC.
Queues/partitions
Each job needs to run in a queue or partition, that is, a set of nodes that share a specific set of resource limits (e.g. a maximum time limit). Partition/queue information can be found by running `sinfo` and `scontrol show partition <partition name>`:
`sinfo` shows all partitions, the time limit for each partition, and the state of each set of nodes in the partition. A queue/partition may have some nodes that are allocated (jobs are running on them) while other nodes are idle and ready to run jobs. This is why, for a given queue/partition, you may see one row with a set of nodes in the "alloc" (allocated) state, while another row shows the same partition name with an "idle" state.
$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
HAS24M128_L  up     14-00:00:0  15     alloc  haswell[001-004,009-019]
HAS24M128_L  up     14-00:00:0  4      idle   haswell[005-008]
HAS24M128_M  up     7-00:00:00  15     alloc  haswell[001-004,009-019]
The output above is shortened; LCC has many more queues/partitions. It shows two queues/partitions: HAS24M128_L and HAS24M128_M. HAS24M128_L has 15 nodes that are allocated and 4 that are idle. The time limit for this partition is 14 days. You can also see the specific node names (e.g. haswell001 is in the L queue/partition).
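To pick a partition that can start your job quickly, you can total the idle nodes per partition from the `sinfo` listing. The following is a minimal sketch that works on the default six-column layout shown above; the function name `idle_nodes` is made up for illustration, and the sample text is the shortened listing from this guide rather than live output.

```shell
#!/bin/bash
# Hypothetical helper: read sinfo output on stdin and print the number of
# idle nodes in the partition named by $1 (0 if there are none).
idle_nodes() {
    awk -v part="$1" '
        $1 == part && $5 == "idle" { n += $4 }   # column 4 = NODES, column 5 = STATE
        END { print n + 0 }'
}

# Shortened sinfo listing copied from this guide.
sample='PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
HAS24M128_L  up     14-00:00:0  15     alloc  haswell[001-004,009-019]
HAS24M128_L  up     14-00:00:0  4      idle   haswell[005-008]
HAS24M128_M  up     7-00:00:00  15     alloc  haswell[001-004,009-019]'

printf '%s\n' "$sample" | idle_nodes HAS24M128_L   # prints 4
printf '%s\n' "$sample" | idle_nodes HAS24M128_M   # prints 0

# On LCC the helper could be fed live output:
#   sinfo | idle_nodes HAS24M128_L
```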
For more information on queues/partitions, see: /wiki/spaces/PreReleaseUKYHPCDocs/pages/21332176
To see detailed information on a specific partition, do this:
[linkblueid@login002 ~]$ scontrol show partition HAS24M128_L
PartitionName=HAS24M128_L
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=haswell[001-019]
   PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=456 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=5000 MaxMemPerCPU=5000
Setting time limits
It's important to set a time limit for a job. The time limit tells SLURM to kill your job once it has run for the specified time (the idea is that you have an estimate of when the job will finish). If you don't specify a time limit, SLURM uses a default value equal to the maximum time for that partition (for example, some partitions have a max limit of 7 days). This is bad because SLURM will assume that your job takes the longest time possible for that partition, and it will have to wait until enough resources are available for that long. It's possible, then, that jobs by other users will be put ahead of yours in the queue if their time limits are much shorter than yours. Thus, if you know that your program will finish in 3 hours, you can set the time limit to, say, 3.5 hours. If you have no idea how long a program runs, you may omit the time limit the first time you run a job, or you can judiciously choose a long time. See the section "Queues/partitions" above for the max time limit of each queue.
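The "estimate plus a margin" rule above can be turned into a value for `--time` with a little arithmetic. This is a minimal sketch: the function name `to_slurm_time` and the 30-minute margin are illustrative choices, not LCC policy. It formats a number of minutes as HH:MM:SS, or as D-HH:MM:SS once the value reaches a full day, which are the two forms used in this guide.

```shell
#!/bin/bash
# Hypothetical helper: convert a whole number of minutes into a time string
# suitable for #SBATCH --time (HH:MM:SS, or D-HH:MM:SS for one day or more).
to_slurm_time() {
    total_min=$1
    days=$((total_min / 1440))            # 1440 minutes per day
    hours=$(((total_min % 1440) / 60))
    mins=$((total_min % 60))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
    else
        printf '%02d:%02d:00\n' "$hours" "$mins"
    fi
}

estimate_min=180    # the program is expected to finish in about 3 hours
margin_min=30       # illustrative safety margin

limit=$(to_slurm_time $((estimate_min + margin_min)))
echo "#SBATCH --time=$limit"    # prints: #SBATCH --time=03:30:00
```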
Submitting a job and checking status
If you've named your job script run.sh, submit a job by doing this:
$ sbatch run.sh
Submitted batch job 100868
Note the job number above; you can use it to see the status of the job while it is waiting in the queue or while it is running. Shortly after the job finishes, the information below will no longer be available:
$ scontrol show job 100868
JobId=100868 JobName=sandbox
   UserId=linkblueid(2006) GroupId=users(100) MCS_label=N/A
   Priority=21277 Nice=0 Account=col_griff_uksr QOS=sl2
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-04-25T14:31:05 EligibleTime=2019-04-25T14:31:05
   AccrueTime=2019-04-25T14:31:05
   StartTime=2019-04-25T14:31:05 EndTime=2019-04-25T14:31:06 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-25T14:31:05
   Partition=SAN16M64_D AllocNode:Sid=login002:87012
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cnode256
   BatchHost=cnode256
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/linkblueid/sandbox/run.sh
   WorkDir=/scratch/linkblueid/sandbox
   StdErr=/scratch/linkblueid/sandbox/slurm-100868.err
   StdIn=/dev/null
   StdOut=/s
If you have specified the mail options (`--mail-type` and `--mail-user`, as in the second example script above), then SLURM will email you when your job starts and when it is done.
Using debug partitions
If you just created a new job script, it's advisable to first run it on a debug queue/partition. A debug queue/partition has a time limit of 1 hour, but it's useful for checking whether there are syntax errors in your submission script (if you don't use the debug queue and submit to other queues instead, you may have to wait while long-running jobs take up the nodes). Once you see that the script runs, you can cancel it and submit it to a queue with a longer time limit.
Here are the 4 different debug partitions (partition name is on the far left column):
[shuso2@login001 ~]$ sinfo | grep _D
HAS24M128_D  up  1:00:00  1  idle  haswell020
SAN16M64_D   up  1:00:00  1  idle  cnode256
SKY32M192_D  up  1:00:00  1  idle  skylake056
CAS48M192_D  up  1:00:00  1  idle  cascade001
You can then submit a slurm script by specifying #SBATCH -p with one of the above partition names.
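The partition names in this guide appear to follow a pattern in which the part after the last underscore distinguishes the queues of one node type (for example CAS48M192_L and CAS48M192_D). Assuming that pattern holds, which is inferred from the names shown here and is not a documented rule, a small helper can derive the debug partition for a test run. The function name `debug_partition` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: replace everything after the last underscore of a
# partition name with "D". Relies on the naming pattern seen in this guide.
debug_partition() {
    printf '%s_D\n' "${1%_*}"
}

debug_partition CAS48M192_L   # prints CAS48M192_D
debug_partition HAS24M128_M   # prints HAS24M128_D
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so a test submission could look like `sbatch -p "$(debug_partition CAS48M192_L)" -t 00:10:00 ./run.sh` without editing the script itself.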
Interactive use of a compute node
To allocate a node and use it interactively:
srun -A col_exampleprojectname_uksr -t 01:00:00 -p SAN16M64_D --pty bash
Once done working interactively, please exit the interactive session by executing:
exit
Common Slurm Commands
Command | Description |
---|---|
sbatch script_file | Submit SLURM job script |
scancel job_id | Cancel job that has job_id |
squeue -u user_id | Show jobs that are on queue for user_id |
sinfo | Show partitions/queues, their time limits, number of nodes, and which compute nodes are running jobs or idle. |
Slurm Job Script Options
Option | Short Version | Long Version | Example(s) | Explanation |
---|---|---|---|---|
Job name | #SBATCH -J jobname | #SBATCH --job-name=jobname | #SBATCH --job-name=my_first_job | The job is labeled with jobname (in addition to the integer job id that the scheduler assigns automatically) |
Partition/queue | #SBATCH -p partition_id | #SBATCH --partition=partition_id | #SBATCH --partition=HAS24M128_D (haswell partition); #SBATCH --partition=SKY32M192_D (skylake partition) | The job runs on compute node(s) in partition_id |
Time limit | #SBATCH -t time_limit | #SBATCH --time=time_limit | #SBATCH --time=01:00:00 (one hour limit); #SBATCH --time=2-00:00:00 (2 day limit) | The job is killed if it reaches the specified time_limit. |
Memory (RAM) | (none) | #SBATCH --mem=memory_amount | #SBATCH --mem=32g (32 GB of RAM requested) | The job can use up to the specified memory_amount per node. |
Project account | #SBATCH -A account | #SBATCH --account=account | #SBATCH --account=col_pi123_uksr | Run the job under this project account. |
Standard error filename | #SBATCH -e filename | #SBATCH --error=filename | #SBATCH --error=slurm%A_%a.err (special variables: %A is replaced with the job array's master job id and %a with the array index); #SBATCH --error=prog_error.log (you can use any file name without whitespace) | Standard error of the job is stored in filename |
Standard output filename | #SBATCH -o filename | #SBATCH --output=filename | #SBATCH --output=slurm%A_%a.out; #SBATCH --output=prog_output.log | Standard output of the job is stored in filename |
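To see which file names the special variables produce, the substitution can be imitated in the shell. This is only an illustration of what Slurm does for you: `%j` becomes the job id, `%A` the master job id of a job array, and `%a` the array index. The function name `expand_name` and the array job id 4000 with index 7 are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper that imitates Slurm's file name substitution.
# $1 = pattern, $2 = job id (%j), $3 = array master job id (%A), $4 = array index (%a)
expand_name() {
    printf '%s\n' "$1" | sed -e "s/%j/$2/g" -e "s/%A/$3/g" -e "s/%a/$4/g"
}

expand_name 'slurm-%j.out' 123027 '' ''     # prints slurm-123027.out
expand_name 'slurm%A_%a.err' '' 4000 7      # prints slurm4000_7.err
```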
Center for Computational Sciences