Submitting jobs on MCC (for first-time users)
Say that you have a scientific application that you want to run in a terminal. On your desktop or local machine, you can run it directly, like this (assuming the application is named "my_app"):
$ my_app -x -y -z # Running my_app directly on the local machine
On MCC, you are not allowed to run a program directly like this. Instead, you write a job script and use the program `sbatch` to submit it to a queue. The job script contains the line that runs your application, preceded by lines that tell the job scheduler about your job; these lines begin with "#SBATCH ...". So, instead of running your program as in the previous example, you submit it through `sbatch`:
$ sbatch my_job_script.sh
Example script files are located in /share/examples/MCC. Below is a short example. First, use a text editor (e.g. vim) to create the script file. Alternatively, create the script file on your local machine and copy it to MCC:
[linkblueid@mcc-login001 ~]$ vim ./test_job.sh
Example file content:
    #!/bin/bash
    #SBATCH --time=00:15:00             # Time limit for the job (REQUIRED).
    #SBATCH --job-name=my_test_job      # Job name
    #SBATCH --ntasks=1                  # Number of cores for the job. Same as SBATCH -n 1
    #SBATCH --partition=normal          # Partition/queue to run the job in. (REQUIRED)
    #SBATCH -e slurm-%j.err             # Error file for this job.
    #SBATCH -o slurm-%j.out             # Output file for this job.
    #SBATCH -A <your project account>   # Project allocation account name (REQUIRED)

    echo "Hello world. This is my first job"   # This is the program that will be executed. You will substitute this with your scientific program.
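If you prefer not to open an editor, the same file can be written from the command line with a here-document. The following is a minimal sketch, assuming the file name `./test_job.sh` (the name used with `sbatch` further down) and keeping the account placeholder unfilled. The final `bash -n` step only checks shell syntax; it does not validate the `#SBATCH` options.

```shell
# Assumed file name; change it to whatever you want to submit.
job_script=./test_job.sh

# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --job-name=my_test_job
#SBATCH --ntasks=1
#SBATCH --partition=normal
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out
#SBATCH -A <your project account>
echo "Hello world. This is my first job"
EOF

# Check the script for shell syntax errors without executing it.
bash -n "$job_script" && echo "syntax OK"
```

Because the `#SBATCH` lines are ordinary comments to bash, the syntax check passes even while the account placeholder is still in place; replace it before submitting.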
Then run this job using `sbatch`. Remember, you are not allowed to run computations on the login nodes (i.e. you can't just execute your program directly). You must use `sbatch` because running a program directly on a login node can bog down that node and slow down other users. After you run `sbatch ./test_job.sh`, the system runs your job on a special set of machines called "compute nodes". After submitting the job, you will see a message confirming the submission, including the job's ID. Once the job is submitted, you can safely log off the login node; the job stays in the system:
    [linkblueid@mcc-login001]$ sbatch ./test_job.sh
    Submitted batch job 123027
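The job ID is the last word of the confirmation message, so it can be kept in a shell variable for later commands. The sketch below works on the sample message shown above, so it runs anywhere; the variable names `submit_output` and `job_id` are illustrative. On MCC you would capture the real message with `submit_output=$(sbatch ./test_job.sh)`, or use `sbatch --parsable`, which prints the job ID without the surrounding text.

```shell
# Sample confirmation line copied from the sbatch output above.
submit_output="Submitted batch job 123027"

# Strip everything up to and including the last space, leaving the job ID.
job_id=${submit_output##* }

echo "$job_id"   # prints: 123027
```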
After you run the above command, your job may have to wait in the queue before it is actually executed, because many other users are using the system simultaneously. That is, it may take some time before your scientific program actually runs.
Once your job is done, you can see the slurm job output files (slurm is MCC's automated job scheduler):
    [linkblueid@mcc-login001]$ ls .
    slurm-123027.err  slurm-123027.out  test_job.sh
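These file names come from the `-o slurm-%j.out` and `-e slurm-%j.err` lines in the job script: Slurm replaces `%j` with the job ID. The following sketch only imitates that substitution with `sed`, so that the resulting name can be previewed; Slurm performs the substitution itself, and `pattern` and `job_id` are illustrative variable names.

```shell
pattern='slurm-%j.out'   # value given to "#SBATCH -o" in the job script
job_id=123027            # job ID reported by sbatch

# Replace every %j in the pattern with the job ID.
out_file=$(printf '%s\n' "$pattern" | sed "s/%j/$job_id/g")

echo "$out_file"   # prints: slurm-123027.out
```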
The job script should have printed our "Hello world" message to the Slurm output file:
    [linkblueid@mcc-login001]$ cat slurm-123027.out
    Hello world. This is my first job
After you submit your job, you can see some useful details about it by calling `scontrol`:
[userid@mcc-login001 sample_job]$ scontrol show job 123027
In the output of this command, the "JobState" field shows the job's current state (for example, RUNNING while the job is executing).
Here's another, more realistic example. Note the additional SBATCH flags and commands, such as the flags that specify your email address; with these, you get an automated email when your job finishes.
    #!/bin/bash
    #SBATCH --time 00:15:00                          # Time limit for the job (REQUIRED)
    #SBATCH --job-name=myjob                         # Job name
    #SBATCH --nodes=1                                # Number of nodes to allocate. Same as SBATCH -N (Don't use this option for mpi jobs)
    #SBATCH --ntasks=8                               # Number of cores to allocate. Same as SBATCH -n
    #SBATCH --partition=normal                       # Partition/queue to run the job in. (REQUIRED)
    #SBATCH -e slurm-%j.err                          # Error file for this job.
    #SBATCH -o slurm-%j.out                          # Output file for this job.
    #SBATCH -A <your project account>                # Project allocation account name (REQUIRED)
    #SBATCH --mail-type ALL                          # Send email when job starts/ends
    #SBATCH --mail-user <enter email address here>   # Where email is sent to (optional)

    module purge                                     # Unload other software modules
    module load gnu9/9.3.0 ucx/1.9.0 libfabric/1.10.1 openmpi4/4.0.5

    # Program execution command
    mpirun -np 8 ./hello                             # Run my application; the process count matches --ntasks above
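The number of processes given to `mpirun -np` should agree with the number of cores requested with `--ntasks`. One way to keep the two in step is to read the count from the `SLURM_NTASKS` environment variable, which Slurm sets inside a job when `--ntasks` is given. The sketch below assumes a fallback of 1 so that it also runs outside a job, and it only prints the command rather than launching it; `launch_cmd` is an illustrative name.

```shell
# Inside a job, SLURM_NTASKS holds the value of --ntasks; fall back to 1 elsewhere.
ntasks=${SLURM_NTASKS:-1}

# Build the launch command; in a real job script you would run it directly.
launch_cmd="mpirun -np $ntasks ./hello"

echo "$launch_cmd"
```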
To see more examples, look at the script files under /share/examples/MCC.
Queues/partitions
Each job needs to run in a queue, or partition: a set of nodes that share a specific set of resource limits (e.g. a maximum run time). Partition/queue information can be found by running `sinfo` and `scontrol show partition <partition name>`.
`sinfo` shows all partitions, the time limit for each partition, and the state of each set of nodes in the partition. A queue/partition may have some nodes that are allocated (jobs are running on them) while other nodes are idle and ready to run jobs. This is why, for a given queue/partition, you may see one row whose nodes have the state "alloc" (allocated), while another row shows the same partition name with the state "idle". A third state, "mix", marks nodes where only some of the cores are allocated.
    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    normal*      up 14-00:00:0     16    mix rome[008-009,012-013,018,021,026,033-036,038-039,041-043]
    normal*      up 14-00:00:0     18  alloc rome[001-004,015-017,019-020,027-032,037,040,044]
    normal*      up 14-00:00:0     10   idle rome[005-007,010-011,014,022-025]
    jumbo        up 14-00:00:0      1    mix frome001
A shortened output for MCC is shown above. It lists two queues/partitions, normal and jumbo, each with a time limit of 14 days.
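Because `sinfo` prints one row per partition and node state, a short `awk` filter can add up the idle nodes in each partition. The sketch below runs on the shortened sample output above, stored in an illustrative variable named `sinfo_sample`; on MCC you could pipe `sinfo` into the same `awk` command instead.

```shell
# Sample rows copied from the shortened sinfo output above.
sinfo_sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 14-00:00:0 16 mix rome[008-009,012-013,018,021,026,033-036,038-039,041-043]
normal* up 14-00:00:0 18 alloc rome[001-004,015-017,019-020,027-032,037,040,044]
normal* up 14-00:00:0 10 idle rome[005-007,010-011,014,022-025]
jumbo up 14-00:00:0 1 mix frome001'

# Skip the header row, then sum the NODES column (field 4) of every row
# whose STATE column (field 5) is "idle", grouped by partition (field 1).
idle_summary=$(printf '%s\n' "$sinfo_sample" |
    awk 'NR > 1 && $5 == "idle" { idle[$1] += $4 }
         END { for (p in idle) print p, idle[p] }')

echo "$idle_summary"   # prints: normal* 10
```

In this sample only the normal partition has idle nodes, so jumbo does not appear in the result.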
For more information on queues/partitions, see: /wiki/spaces/UPR/pages/7668656
To see detailed information on a specific partition, do this:
    [linkblueid@mcc-login001]$ scontrol show partition normal
    PartitionName=normal
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=ALL Default=YES QoS=N/A
       DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=rome[001-020]
       PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=2560 TotalNodes=20 SelectTypeParameters=NONE
       JobDefaults=(null)
       DefMemPerCPU=4000 MaxMemPerNode=UNLIMITED
Setting time limits
It's important to set a time limit for a job. The time limit tells SLURM how long your job may run; the job is killed once it reaches that limit, so the limit should be based on your estimate of when the job will finish. If you don't specify a time limit, SLURM uses a default equal to the maximum time for that partition (for example, some partitions have a maximum of 7 days). This is bad because SLURM will assume your job takes the longest time possible for that partition and will have to wait until enough resources are available to run a job of that length. Jobs from other users with much shorter time limits may then be placed ahead of yours in the queue.
Thus, if you know that your program will finish in 3 hours, you can set the time limit to, say, 3.5 hours. If you have no idea how long a program runs, then you may omit the time limit the first time you run a job, or you can judiciously choose a long time. Each queue has a different maximum time limit; see the "Queues/partitions" section above.
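The value passed to `--time` is written as `HH:MM:SS` or, for multi-day limits, `D-HH:MM:SS`. As a small sketch, the hypothetical helper below (`slurm_time` is not a Slurm command) converts an estimate in whole minutes into that format, which can be handy when the estimate comes from a timing run.

```shell
# Convert a number of minutes into a Slurm time string.
slurm_time() {
    total_min=$1
    days=$(( total_min / 1440 ))            # 1440 minutes per day
    hours=$(( (total_min % 1440) / 60 ))
    mins=$(( total_min % 60 ))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
    else
        printf '%02d:%02d:00\n' "$hours" "$mins"
    fi
}

slurm_time 210    # 3.5 hours, prints: 03:30:00
slurm_time 2880   # 2 days,    prints: 2-00:00:00
```

The first result could be used as `#SBATCH --time=03:30:00` for the 3-hour program mentioned above.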
Submitting a job and checking status
If you've named your job script run.sh, submit a job by doing this:
    $ sbatch run.sh
    Submitted batch job 100868
Note the job number above; you can use it to check the status of the job while it is waiting in the queue or running. Shortly after the job finishes, the information below will no longer be available:
    $ scontrol show job 100868
    JobId=100868 JobName=sandbox
       UserId=linkblueid(2006) GroupId=users(100) MCS_label=N/A
       Priority=21277 Nice=0 Account=col_griff_uksr QOS=sl2
       JobState=COMPLETED Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
       SubmitTime=2019-04-25T14:31:05 EligibleTime=2019-04-25T14:31:05
       AccrueTime=2019-04-25T14:31:05
       StartTime=2019-04-25T14:31:05 EndTime=2019-04-25T14:31:06 Deadline=N/A
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       LastSchedEval=2019-04-25T14:31:05
       Partition=SAN16M64_D AllocNode:Sid=login002:87012
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=cnode256
       BatchHost=cnode256
       NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=1,mem=4000M,node=1,billing=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=/scratch/linkblueid/sandbox/run.sh
       WorkDir=/scratch/linkblueid/sandbox
       StdErr=/scratch/linkblueid/sandbox/slurm-100868.err
       StdIn=/dev/null
       StdOut=/s
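The `scontrol` output is a long list of `Key=Value` pairs, so a single field such as JobState can be pulled out with standard text tools. The sketch below defines a hypothetical helper, `job_field`, and applies it to a few lines copied from the output above; on MCC you could pipe `scontrol show job <job id>` into it instead.

```shell
# Print the value of one Key=Value field read from standard input.
# $1 is the field name, e.g. JobState.
job_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# A few lines copied from the scontrol output above.
sample='JobId=100868 JobName=sandbox
   JobState=COMPLETED Reason=None Dependency=(null)
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A'

state=$(printf '%s\n' "$sample" | job_field JobState)
limit=$(printf '%s\n' "$sample" | job_field TimeLimit)

echo "$state"   # prints: COMPLETED
echo "$limit"   # prints: 00:01:00
```

The helper splits on spaces, so it suits simple fields like these; values that themselves contain spaces would need a different approach.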
If you have specified the mail options (`--mail-type` and `--mail-user`, as in the more realistic example script above), then SLURM will email you when the job starts and when it is done.
Interactive use of a compute node
To allocate a node and use it interactively:
srun -A col_exampleprojectname_uksr -t 01:00:00 -p normal --pty bash
Once done working interactively, please exit the interactive session by executing:
exit
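Before typing `exit`, it can help to confirm that the current shell really is the interactive session and not the login node. Slurm sets the `SLURM_JOB_ID` environment variable inside a job, so a minimal check, with illustrative message text, might look like this:

```shell
# SLURM_JOB_ID is set by Slurm inside a job (including an srun --pty session).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    where="inside Slurm job $SLURM_JOB_ID on $(hostname)"
else
    where="not inside a Slurm job (for example, on a login node)"
fi

echo "$where"
```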
Common Slurm Commands
Command | Description |
---|---|
sbatch script_file | Submit SLURM job script |
scancel job_id | Cancel job that has job_id |
squeue -u user_id | Show jobs that are on queue for user_id |
sinfo | Show partitions/queues, their time limits, number of nodes, and which compute nodes are running jobs or idle. |
Slurm Job Script Options
Option | Short Version | Long Version | Example(s) | Explanation |
---|---|---|---|---|
Job name | #SBATCH -J jobname | #SBATCH --job-name=jobname | #SBATCH --job-name=my_first_job | The job will be labeled with jobname (in addition to the integer job ID that the scheduler assigns automatically) |
Partition/queue | #SBATCH -p partition_id | #SBATCH --partition=partition_id | #SBATCH --partition=normal (normal partition); #SBATCH --partition=jumbo (jumbo partition) | The job will run on compute node(s) in partition_id |
Time limit | #SBATCH -t time_limit | #SBATCH --time=time_limit | #SBATCH --time=01:00:00 (one hour limit); #SBATCH --time=2-00:00:00 (2 day limit) | The job will be killed if it reaches the specified time_limit. |
Memory (RAM) | (none) | #SBATCH --mem=memory_amount | #SBATCH --mem=32g (32 GB of RAM requested) | The job can use up to the specified memory_amount per node. |
Project account | #SBATCH -A account | #SBATCH --account=account | #SBATCH --account=col_pi123_uksr | Run the job under this project account. |
Standard error filename | #SBATCH -e filename | #SBATCH --error=filename | #SBATCH --error=slurm%A_%a.err (special variables: %A is replaced with the job array's master job ID and %a with the array task index); #SBATCH --error=prog_error.log (you can use any file name without whitespace) | Standard error of the job will be stored in filename |
Standard output filename | #SBATCH -o filename | #SBATCH --output=filename | #SBATCH --output=slurm%A_%a.out; #SBATCH --output=prog_output.log | Standard output of the job will be stored in filename |
Center for Computational Sciences