SCC uses a queueing system called Slurm to manage compute resources and to schedule the jobs that use them. Users submit batch and interactive jobs with Slurm commands and use them to monitor job progress during execution.
Log in
First, log into an SCC login node. This step depends on your local environment, but in an OS X or Linux environment you should be able to use the standard OpenSSH command-line client.
$ ssh -X -l $username scclogin.camhres.ca
DO NOT RUN ANY PROGRAMS ON THIS NODE. Please connect to one of our development nodes to prepare your jobs for submission to the cluster:
ssh dev01
ssh dev02
Transitioning from PBS to Slurm
The original SCC cluster ran the PBS batch system. We are planning to migrate the batch system to Slurm.
In general, a PBS batch script is a bash or csh script that will work in Slurm. Slurm will attempt to convert PBS directives appropriately. In many cases, you may not need to change your existing PBS batch scripts to work with Slurm. This is fine for scripts that have simple PBS directives, e.g. #PBS -m be.
The following PBS environment variables are converted to their Slurm equivalents:
- PBS_JOBID
- PBS_JOBNAME
- PBS_O_WORKDIR
- PBS_O_HOST
- PBS_NUM_NODES
- PBS_NUM_PPN
- PBS_NP
- PBS_O_NODENUM
- PBS_O_VNODENUM
- PBS_O_TASKNUM
- PBS_ARRAYID
Note that, apart from the variables listed above, PBS environment variables will not be converted by Slurm. For anything more complicated, you should rewrite your batch scripts in Slurm syntax. Batch scripts for parallel jobs in particular should be rewritten for Slurm.
Equivalent commands in PBS and Slurm
Purpose | PBS | Slurm |
---|---|---|
Submit a job | qsub jobscript | sbatch jobscript |
Delete a job | qdel job_id | scancel job_id |
Delete all jobs belonging to user | qdel `qselect -u user` | scancel -u user |
Job status | qstat -u user | squeue -u user |
Show all jobs | qstat -a | squeue |
Environment variables | | |
Job ID | $PBS_JOBID | $SLURM_JOBID |
Submit directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
Allocated node list | $PBS_NODEFILE | $SLURM_JOB_NODELIST |
Job array index | $PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID |
Number of cores/processes | - | $SLURM_CPUS_PER_TASK, $SLURM_NTASKS |
Job specifications | | |
Set a wallclock limit | qsub -l nodes=1,walltime=HH:MM:SS | sbatch -t [min] OR -t [days-hh:mm:ss] |
Standard output file | qsub -o filename | sbatch -o filename |
Standard error file | qsub -e filename | sbatch -e filename |
Combine stdout/stderr | qsub -j oe | This is the default. |
Location of out/err files | qsub -k oe | Not needed. By default, Slurm writes stdout/stderr files to the directory from which the job is submitted. |
Export environment to allocated node | qsub -V | sbatch --export=all (default) |
Export a single variable | qsub -v np=12 | sbatch --export=np |
Email notifications | qsub -m be | sbatch --mail-type=END, FAIL, or ALL |
Job name | qsub -N jobname -l nodes=1 jobscript | sbatch --job-name=name jobscript |
Job restart | qsub -r [y/n] | sbatch --requeue OR --no-requeue |
Working directory | - | sbatch --workdir=[dirname] |
Memory requirement | qsub -l nodes=1:g8 | sbatch --mem=8g |
Memory requirement | qsub -l nodes=1,mem=256gb | sbatch --mem=256g |
Job dependency | qsub -W depend=afterany:jobid | sbatch --depend=afterany:jobid |
Job blocking | qsub -W block=true | No equivalent |
Job arrays | qsub -J 1-100 jobscript | sbatch --array=1-100 jobscript |
Licenses | qsub -l nodes=1,matlab=1 | sbatch --licenses=matlab |
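As an illustrative sketch drawn from the table above (the job name, file names, and times are placeholders), here is a simple PBS header and a hand-written Slurm equivalent:
PBS version:
#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1,walltime=01:00:00
#PBS -o myjob.out
#PBS -m be
Slurm version:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --output=myjob.out
#SBATCH --mail-type=BEGIN,END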
Converting a PBS batch script to a Slurm batch script
Defaults:
Slurm will, by default, attempt to understand all PBS options in the batch script. For example, a batch script containing
#PBS -N JobName
will be internally translated by Slurm into
#SBATCH --job-name=JobName
and the job will show up in the squeue output with the job name 'JobName'.
Thus, most of your old PBS batch scripts should work in Slurm without problems. For new batch scripts, we recommend that you start using the SLURM options.
Ignore PBS directives:
If you do not want the PBS directives in your batch script to be internally translated by Slurm, use the --ignore-pbs option to Slurm. For example, submitting with:
[biowulf ~]$ sbatch --ignore-pbs jobscript
will cause Slurm to ignore all #PBS directives in the batch script.
pbs2slurm:
A script called pbs2slurm.py can be used to convert your existing PBS batch scripts to Slurm scripts.
Sample session.
[biowulf2 ~]$ pbs2slurm.py < run1.pbs > run1.slurm
run1.pbs | run1.slurm |
---|---|
#!/bin/bash -l | #!/bin/bash -l |
Note that the directive #PBS -k oe is not translated. This directive is unnecessary in Slurm, so there is no equivalent. Slurm defaults to writing a single stderr/stdout file to the directory from which the job was submitted. (This Slurm behaviour can be changed with the #SBATCH -o filename and #SBATCH -e filename flags).
Batch jobs
Submitting a job
Slurm is primarily a resource manager for batch jobs: a user writes a job script that Slurm schedules to run non-interactively when resources are available. Users primarily submit computational jobs to the Slurm queue using the sbatch command.
$ sbatch job-script.sh
sbatch takes a number of command-line arguments. These arguments can be supplied on the command-line:
$ sbatch --ntasks 16 job-script.sh
or embedded in the header of the job script itself using #SBATCH directives:
#!/bin/bash
#SBATCH --ntasks 16
You can use the scancel command to cancel a job that has been queued, whether the job is pending or currently running. Jobs are cancelled by specifying the job id that is assigned to the job during submission.
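For example, assuming your job was assigned the (hypothetical) ID 12345 at submission:
$ scancel 12345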
Example batch job script: hello-world.sh
#!/bin/bash --login
#SBATCH --ntasks 1
#SBATCH --ntasks-per-node=1
#SBATCH --output hello-world.out
#SBATCH --qos debug
#SBATCH --time=00:05:00
echo Running on $(hostname --fqdn): 'Hello, world!'
This minimal example job script, hello-world.sh, when submitted with sbatch, writes the name of the cluster node on which the job ran, along with the standard programmer's greeting, "Hello, world!", into the output file hello-world.out.
$ sbatch hello-world.sh
Note that any Slurm arguments must precede the name of the job script.
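For example, to override the time limit at submission time (the value is illustrative), place the option before the script name:
$ sbatch --time=00:10:00 hello-world.sh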
Job requirements
Slurm uses the requirements declared by job scripts and submission arguments to schedule and execute jobs as efficiently as possible. To minimize the time your jobs spend waiting to run, define your job's resource requirements as accurately as possible.
--nodes
The number of nodes your job requires to run.
--mem
The amount of memory required on each node.
--ntasks
The number of simultaneous tasks your job requires. (These tasks are analogous to MPI ranks.)
--ntasks-per-node
The number of tasks (or cores) your job will use on each node.
--time
The amount of time your job needs to run.
The --time requirement (also referred to as "walltime") deserves special mention. Job execution time can be somewhat variable, leading some users to overestimate (or even maximize) the defined time limit to prevent premature job termination; but an unnecessarily long time limit may delay the start of the job and allow undetected stuck jobs to waste more resources before they are terminated.
If the --mem requirement is not defined in your script, it will be set to the default of 16GB per core.
For all resources, --time included, smaller resource requirements generally lead to shorter wait times.
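As a sketch, a job header requesting one node, four tasks, 8 GB of memory per node, and a two-hour limit (all values are illustrative) would look like:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8g
#SBATCH --time=02:00:00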
SCC nodes can be shared, meaning each such node may execute multiple jobs simultaneously, even jobs from different users.
Additional job parameters are described in sbatch --help and man sbatch.
SCC Partitions
On SCC, partitions are defined as shown in the following table.
Partition name | Description | MaxNodes | Max Walltime |
---|---|---|---|
short | short (default) | 24 | 4H |
medium | Medium time | 16 | 8H |
long | Long time | 8 | 4000H |
debug | Debug | 2 | 20H/core |
gpu | GPU-enabled | 1 | n/a |
For non-GPU jobs, you do not need to specify a partition in your scripts; we will assign the job to the correct partition according to its requested time.
Quality of service (QOS)
On SCC, QoSes are used to constrain or modify the characteristics that a job can have. For example, by selecting the "debug" QoS, a user can obtain higher queue priority for a job with the tradeoff that the maximum allowed wall time is reduced from what would otherwise be allowed on that partition.
The currently available SCC QoSes are:
QOS name | Description | Max walltime | Max jobs/user | Node limits | Priority boost |
---|---|---|---|---|---|
normal | default | Derived from partition | n/a | n/a | 0 |
debug | For quicker turnaround when testing | 20h for 1 core | 2 | 2/job | Equiv. of 1-day queue wait time |
Shell variables and environment
Jobs submitted to SCC are not automatically set up with the same environment variables as the shell from which they were submitted. Thus, you must load any necessary modules and set any environment variables needed by the job within the job script itself. These settings should be included after any #SBATCH directives in the job script.
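For example (a sketch; the module name and program are placeholders for whatever your job actually needs):
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time=00:30:00
# Environment setup goes after the #SBATCH directives:
module load gcc
export OMP_NUM_THREADS=1
./my-program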
Job arrays
Job arrays provide a mechanism for running several instances of the same job with minor variations.
Job arrays are submitted using sbatch, similar to standard batch jobs.
$ sbatch --array=0-9 job-script.sh
Each job in the array will have access to a variable, $SLURM_ARRAY_TASK_ID, set to the value that represents that job's position in the array. By consulting this variable, the running job can perform the appropriate variant task.
Example array job script: array-job.sh
#!/bin/bash
#SBATCH --array 0-9
#SBATCH --ntasks 1
#SBATCH --output array-job.out
#SBATCH --open-mode append
#SBATCH --qos debug
#SBATCH --time=00:05:00
echo "$(hostname --fqdn): index ${SLURM_ARRAY_TASK_ID}"
This minimal example job script, array-job.sh, when submitted with sbatch, submits ten jobs with indexes 0 through 9. Each job appends the name of the cluster node on which the job ran, along with the job's array index, into the output file array-job.out.
$ sbatch array-job.sh
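As a further sketch (the input file naming and program are hypothetical), the array index is typically used inside the job script to select a per-task input:
#!/bin/bash
#SBATCH --array 0-9
#SBATCH --ntasks 1
#SBATCH --time=00:05:00
# Each array task processes its own input file, e.g. input-0.dat ... input-9.dat
INPUT=input-${SLURM_ARRAY_TASK_ID}.dat
./my-analysis "$INPUT" > result-${SLURM_ARRAY_TASK_ID}.txt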
Allocations
Access to computational resources is allocated via shares of CPU time assigned to Slurm allocation accounts. You can determine your default allocation account using the sacctmgr command.
$ sacctmgr list Users Users=$USER format=DefaultAccount
Use the --account argument to submit a job for an account other than your default.
#SBATCH --account=crcsupport
You can use the sacctmgr command to list your available accounts.
$ sacctmgr list Associations Users=$USER format=Account
Job mail
Slurm can be configured to send email notifications at different points in a job's lifetime. This is configured using the --mail-type and --mail-user arguments.
#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.com
The --mail-type argument configures which points during job execution should generate notifications. Valid values include BEGIN, END, FAIL, and ALL.
Resource accounting
Resources used by Slurm jobs are recorded in the Slurm accounting database. This accounting data is used to track allocation usage.
The sacct command displays accounting data from the Slurm accounting database. To query the accounting data for a single job, use the --job argument.
$ sacct --job $jobid
sacct queries can take some time to complete. Please be patient.
You can change the fields that are printed with the --format option, and the fields available can be listed using the --helpformat option.
$ sacct --job=200 --format=jobid,jobname,qos,user,nodelist,state,start,maxrss,end
If you don't have a record of your job IDs, you can use date-range queries in sacct to find your job.
$ sacct --user=$USER --starttime=2017-01-01 --endtime=2017-01-03
To query the resources being used by a running job, use sstat instead:
$ sstat -a -j JobID.batch
where you should replace JobID with the actual ID of your running job. sstat is especially useful for determining how much memory your job is using; see the "MaxRSS" field.
Monitoring job progress
The squeue command can be used to inspect the Slurm job queue and a job's progress through it.
By default, squeue will list all jobs currently queued by all users. This is useful for inspecting the full queue; but, more often, users simply want to inspect the current state of their own jobs.
$ squeue --user=$USER
Slurm can provide an estimate of when your jobs will start, along with what resources it expects to dispatch your jobs to. Please keep in mind that this is only an estimate!
$ squeue --user=$USER --start
More detailed information about a specific job can be accessed using the scontrol command.
$ scontrol show job $SLURM_JOB_ID
Interactive jobs
Interactive jobs allow users to log in to a compute node to run commands interactively on the command line. They are commonly run with the debug QoS as part of an interactive programming and debugging workflow. The simplest way to establish an interactive session is to use the srun command:
$ srun -p debug --time=01:00:00 --pty /bin/bash
This will open a login shell using one core on one node for one hour. If you prefer to submit an existing job script or other executable as an interactive job, use the salloc command.
$ salloc -p debug --time=01:00:00 job-script.sh
If you do not provide a command to execute, salloc starts up a Slurm job that nodes will be assigned to, but it does not log you in to the allocated node(s).
The srun and salloc commands each support the same parameters as sbatch, and can override any default configuration. Note that any #SBATCH directives in your job script will not be interpreted by salloc when it is executed in this way. You must specify all arguments directly on the command line.
Interactive jobs are only allowed in the debug and gpu partitions on the SCC cluster. The maximum walltime allowed is 2 hours.
Temporary Directories
When a SLURM job starts, the scheduler creates a temporary directory for the job on the compute node's local ramdisk. This $SLURM_TMPDIR directory is very useful for jobs that need to use or generate a large number of small files, as the /export/ramdisk ramdisk filesystem is optimized for small files. The default maximum size of this local ramdisk is around half of the total memory of the node.
The directory is owned by the user running the job. The path to the temporary directory is made available as the $SLURM_TMPDIR variable. At the end of the job, the temporary directory is automatically removed.
You can use the ${SLURM_TMPDIR} variable in job scripts to copy temporary data to the temporary job directory. If necessary, it can also be used as argument for applications that accept a temporary directory argument.
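A minimal sketch (file names, paths, and the program are illustrative) of staging data through the temporary directory:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time=01:00:00
# Stage input onto the node-local temporary directory, run there, then copy results back
cp $HOME/input.dat ${SLURM_TMPDIR}/
cd ${SLURM_TMPDIR}
./my-program input.dat > output.dat
cp output.dat $HOME/results/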
Note - Default Paths
Many applications and programming languages use the $TMPDIR environment variable, if available, as the default temporary directory path. If this variable is not set, applications will default to using the /tmp directory, which is not desirable. SLURM sets $TMPDIR to the same value as $SLURM_TMPDIR unless $TMPDIR has already been set, in which case the existing value is left unchanged.
If you are running a large-memory program that will use around half of the node's memory or more, check your job script(s) and shell initialization files (such as .bashrc and .bash_profile) to make sure $TMPDIR points somewhere other than the ramdisk, because files written to the ramdisk consume the node's usable memory.
If a personal Singularity container is used, make sure that the $SINGULARITYENV_TMPDIR variable is set within the job to export the local scratch location into the Singularity container.
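For example (a sketch, relying on Singularity's standard behaviour of passing SINGULARITYENV_-prefixed variables into the container; the container and program names are placeholders):
export SINGULARITYENV_TMPDIR=${SLURM_TMPDIR}
singularity exec my-container.sif ./my-program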
CIFS Directories
Users with a Windows AD account can access the X and Y drives on SCC directly, with read-only permission.
Before submitting a job, do the following:
- Check your Kerberos ticket
[andytest@camh.ca@scclogin01 ~]$ klist
Usually you will get the following result:
Ticket cache: KEYRING:persistent:1861908071:1861908071
Default principal: andytest@CAMH.CA
Valid starting Expires Service principal
04/04/2018 14:58:31 04/05/2018 00:58:31 krbtgt/CAMH.CA@CAMH.CA
renew until 04/11/2018 14:58:29
If you get something like:
klist: Credentials cache keyring 'persistent:1861908071:1861908071' not found
you need to reinitialize your Kerberos ticket.
Initialize a Kerberos ticket
If you stay logged in for a long time, your Kerberos ticket may expire; use kinit to reinitialize it:
[andytest@camh.ca@scclogin01 ~]$ kinit
Submitting a job with the auks parameter
When submitting a job, please add the --auks=yes parameter to all Slurm commands, as in the following example:
Example batch job script: test_cifs.sh
#!/bin/bash --login
#SBATCH --ntasks 1
#SBATCH --ntasks-per-node=1
#SBATCH --output test_cifs.out
#SBATCH --qos debug
#SBATCH --time=00:05:00
echo Running on $(hostname --fqdn): 'Hello, world!'
ls /cifs/X
ls /cifs/Y
sbatch --auks=yes test_cifs.sh
Check the file test_cifs.out; you should see something like:
[andytest@camh.ca@scclogin01 ~]$ cat test_cifs.out
Running on node20.camhres.ca: Hello, world!
ReseachIT_Agreements
Andytest_CS
ReseachIT_Agreements