The two most important commands for monitoring your job status are squeue and scontrol show job.
$ squeue -l -u <username>
Shows all your jobs that are in the SLURM queue.
$ squeue -l -u <username> -p <partition name>
Shows all your jobs in the SLURM queue that are in the given partition (useful if you submitted to several).
$ scontrol show job -dd <job_id>
Shows all information about a specific SLURM job. It is worth paying attention to the following fields:
- Requeue and Restarts. Requeue shows whether your job may be re-queued (1) or not (0); Restarts shows how many times it has been restarted. Some jobs may have higher priority and may preempt (i.e. cancel) your running jobs and put them back in the queue. If your job is taking much longer than expected and Restarts is greater than 0, the most likely reason is that it was cancelled and re-queued, possibly several times.
- TimeLimit. Shows the time limit of your job.
- Command. The SLURM script that was executed (only for sbatch script.sh).
- StdErr. File where STDERR is written.
- StdOut. File where STDOUT is written.
- BatchScript. The command that was executed (only for sbatch --wrap="script.sh args...").
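The full scontrol output is long, so it can help to filter it down to the fields listed above. The following is a minimal sketch, not a site-provided tool: job_fields is a hypothetical helper name, and the field names in the pattern are assumed from typical scontrol output and may need adjusting for your SLURM version.

```shell
# Print selected fields from `scontrol show job` output read on stdin.
# Assumes the usual whitespace-separated Key=Value layout; a value that
# itself contains spaces (possible for Command) is cut at the first space.
job_fields() {
    # Put every Key=Value pair on its own line, then keep the ones of interest.
    tr -s ' \t' '\n\n' |
        grep -E '^(JobId|JobState|Requeue|Restarts|TimeLimit|StdErr|StdOut)='
}

# Usage (on a machine where SLURM is available):
#   scontrol show job -dd <job_id> | job_fields
```

Because the helper only reads standard input, it can also be applied to output that was saved to a file earlier.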
Individual job status can be queried using the squeue command with the -j option, followed by the JobID:
$ squeue -j [JOB-ID]
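When a script needs to wait for a job to finish, the same query can be polled in a loop. This is a rough sketch under stated assumptions: wait_for_job is a hypothetical helper, the 30-second default interval is arbitrary, and it assumes that squeue -h -j prints nothing once the job has left the queue.

```shell
# Block until the given job no longer appears in the queue.
# Assumption: `squeue -h -j <id>` (no header) prints nothing once the job
# has left the queue; any error message for an unknown ID is discarded.
wait_for_job() {
    job_id=$1
    interval=${2:-30}   # seconds between polls (arbitrary default)
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Usage (on a machine where SLURM is available):
#   wait_for_job <job_id> 60
```

Polling too frequently puts load on the scheduler, so a generous interval is usually the safer choice.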
SCC allows users to see all the jobs in the queue by using
$ showq
Jobs can be cancelled with the scancel command, followed by the JobID:
$ scancel [JOB-ID]
Again, these commands have many options, which are documented in their man pages.