Submitting Jobs to the Queue
The SCC is a shared system, and jobs that are to run on it are submitted to a queue; the scheduler then orders the jobs to make the best use of the machine and launches them when resources become available. Because of this scheduling, jobs are not necessarily run in first-in, first-out order.
The maximum wall clock time for a job in the queue is 4000 hours. The cluster places your job in a different queue depending on the walltime you request: short jobs get access to more resources and start sooner. So if your job needs less time, specify that in your script and it will start sooner, since it is easier for the scheduler to fit in a short job than a long one. On the downside, the queue manager software will automatically kill the job at the end of the specified wallclock time, so if you guess wrong you might lose some work. The standard procedure is therefore to estimate how long your job will take and add 10% or so.
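For example, if you estimate that a run needs about 10 hours of wallclock time, requesting 11 hours builds in roughly a 10% safety margin while still keeping the request short enough to schedule quickly:

    #SBATCH --time=11:00:00   # estimated 10 hours of work plus a ~10% safety margin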
Because of the group-based allocation, it is conceivable that your jobs won't run if your colleagues have already exhausted your group's limits.
Note that scheduling large jobs significantly affects the queue and other users, so you must talk to us first before running massively parallel jobs. We will help make sure that your jobs start and run efficiently.
Batch Submission Script
You interact with the queuing system through the Slurm queue/resource manager. To submit a job, you must write a script that describes the job and how it is to be run, and submit it to the queue using the sbatch command. A sample submission script is shown below, with the #SBATCH directives at the top and the rest being what will be executed on the compute node.
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=02:00:00
    #SBATCH --mem=128G
    #SBATCH --mail-user=netid@gmail.com
    #SBATCH --mail-type=begin
    #SBATCH --mail-type=end
    #SBATCH --error=JobName.%J.err
    #SBATCH --output=JobName.%J.out

    cd $SLURM_SUBMIT_DIR
    module load modulename

    # your commands go here, for example:
    yourscripts data_1 > output
The lines that begin with #SBATCH are directives that are parsed and interpreted by sbatch at submission time and control administrative aspects of your job. In this example, the script requests one node with 8 cores, a wallclock time of two hours, and 128 GB of memory.
Not all of the lines above are required in a submission script; some have default values, such as 1 core for the CPU request and 8G for memory. If you do not need mail notifications, you can omit the --mail lines. However, we strongly recommend that you always specify the CPU, time, and memory parameters in your scripts, as in the minimal example below.
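As a minimal sketch, a script that accepts the defaults for everything except the recommended CPU, time, and memory requests might look like this (the module name my_module and the program my_program are placeholders):

    #!/bin/bash
    #SBATCH --cpus-per-task=4      # 4 cores instead of the default of 1
    #SBATCH --time=01:00:00        # one hour of wallclock time
    #SBATCH --mem=16G              # 16 GB of memory for the job

    cd $SLURM_SUBMIT_DIR           # run from the directory the job was submitted from
    module load my_module          # placeholder module name
    ./my_program data_1 > output   # placeholder program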
Slurm Directives
Resource | Flag Syntax | Description | Notes |
---|---|---|---|
partition | --partition=general-compute | Partition is a queue for jobs. | default on ub-hpc is general-compute |
time | --time=01:00:00 | Time limit for the job. | example requests 1 hour |
nodes | --nodes=2 | Number of compute nodes for the job. | default is 1 |
cpus/cores | --ntasks-per-node=8 | Corresponds to number of cores on the compute node. | default is 1 |
resource feature | --gres=gpu:2 | Request use of GPUs on compute nodes | default is no feature specified; |
memory | --mem=24000 | Memory limit per compute node for the job. Do not use with mem-per-cpu flag. | memory in MB; default limit is 3000MB per core |
memory | --mem-per-cpu=4000 | Per-core memory limit. Do not use with the mem flag. | memory in MB; default limit is 3000MB per core |
job name | --job-name="hello_test" | Name of job. | default is the JobID |
output file | --output=test.out | Name of file for stdout. | default is the JobID |
email address | --mail-user=username@buffalo.edu | User's email address | required |
email notification | --mail-type=ALL or --mail-type=END | When email is sent to the user. | omit for no email |
access | --exclusive | Exclusive access to compute nodes. | default is sharing nodes |
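To illustrate how these flags combine, the sketch below requests two GPUs and exclusive access to a node on the general-compute partition; the module name cuda and the program gpu_program are placeholders:

    #!/bin/bash
    #SBATCH --partition=general-compute   # default partition on ub-hpc
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8           # 8 cores on the node
    #SBATCH --gres=gpu:2                  # request two GPUs
    #SBATCH --mem=24000                   # 24000 MB per node; do not combine with --mem-per-cpu
    #SBATCH --time=04:00:00
    #SBATCH --job-name="gpu_test"
    #SBATCH --output=gpu_test.out
    #SBATCH --exclusive                   # do not share the node with other jobs

    module load cuda                      # placeholder module name
    ./gpu_program data_1 > output         # placeholder program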
Slurm environment variables
The Slurm controller sets a number of variables in the environment of the batch script. They are listed here along with the corresponding Torque/MOAB environment variables for comparison.
SLURM Variables | Torque/MOAB | Description |
---|---|---|
SLURM_ARRAY_JOB_ID | PBS_JOBID | Job array's master job ID number |
SLURM_ARRAY_TASK_COUNT | | Total number of tasks in a job array |
SLURM_ARRAY_TASK_ID | PBS_ARRAYID | Job array ID (index) number |
SLURM_ARRAY_TASK_MAX | | Job array's maximum ID (index) number |
SLURM_ARRAY_TASK_MIN | | Job array's minimum ID (index) number |
SLURM_ARRAY_TASK_STEP | | Job array's index step size |
SLURM_CLUSTER_NAME | | Name of the cluster on which the job is executing |
SLURM_CPUS_ON_NODE | | Number of CPUs on the allocated node |
SLURM_CPUS_PER_TASK | PBS_VNODENUM | Number of CPUs requested per task. Only set if the --cpus-per-task option is specified. |
SLURM_JOB_ACCOUNT | | Account name associated with the job allocation |
SLURM_JOB_CPUS_PER_NODE | PBS_NUM_PPN | Count of processors available to the job on this node. |
SLURM_JOB_DEPENDENCY | | Set to the value of the --dependency option |
SLURM_JOB_NAME | PBS_JOBNAME | Name of the job |
SLURM_JOBID, SLURM_JOB_ID | PBS_JOBID | The ID of the job allocation |
SLURM_MEM_PER_CPU | | Same as --mem-per-cpu |
SLURM_MEM_PER_NODE | | Same as --mem |
SLURM_NNODES, SLURM_JOB_NUM_NODES | | Total number of different nodes in the job's resource allocation |
SLURM_NODELIST, SLURM_JOB_NODELIST | PBS_NODEFILE | List of nodes allocated to the job |
SLURM_NTASKS_PER_NODE | | Number of tasks requested per node. Only set if the --ntasks-per-node option is specified. |
SLURM_NTASKS_PER_SOCKET | | Number of tasks requested per socket. Only set if the --ntasks-per-socket option is specified. |
SLURM_NTASKS, SLURM_NPROCS | PBS_NUM_NODES | Same as -n, --ntasks |
SLURM_SUBMIT_DIR | PBS_O_WORKDIR | The directory from which sbatch was invoked |
SLURM_SUBMIT_HOST | PBS_O_HOST | The hostname of the computer from which sbatch was invoked |
SLURM_TASK_PID | | The process ID of the task being started |
SLURMD_NODENAME | | Name of the node running the job script |
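As a brief sketch of how these variables can be used inside a job script, the example below records where the job ran and passes the requested core count to a threaded program; my_parallel_program is a placeholder, and the use of OMP_NUM_THREADS assumes an OpenMP-style program:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=4
    #SBATCH --time=00:30:00

    cd $SLURM_SUBMIT_DIR      # the directory from which sbatch was invoked

    echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on $SLURMD_NODENAME"
    echo "Cores available on this node: $SLURM_CPUS_ON_NODE"

    # pass the requested core count to a threaded program (placeholder name)
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_parallel_program data_1 > output_$SLURM_JOB_ID.log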
Job Submission
$ sbatch [SCRIPT-FILE-NAME]
where you will replace [SCRIPT-FILE-NAME] with the name of the file containing the submission script. This will return a job ID, for example 51923, which is used to identify the job. Information about a queued job can be found using