Below are some frequently asked questions and their answers. If there are other questions you would like to see answered here, let us know.
Q: Why are my jobs in the “PD” state?
A: Jobs in the ‘PD’ state are ‘pending’ in the queue. There are several reasons why a job may be pending; the most common is that the job requests more resources, either through the Slurm directives or in-line with the ‘sbatch’ command, than the system can satisfy. For example, a job submitted as follows would automatically be placed in the ‘pending’ state: sbatch --mem=600gb jobs.sh. Here we are requesting 600 GB of RAM, which is more than any of the compute nodes is currently equipped with. In this case the queue recognizes that the existing resources are insufficient and holds the job rather than attempting to run it.
If you believe that your resource request has been set incorrectly, you can change the values using the ‘scontrol’ command:
scontrol update JobID=<job_id> MinMemoryNode=<megabytes>
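Because MinMemoryNode is given in megabytes, a request written with a unit suffix has to be converted first. The following is a minimal sketch of that conversion. The helper name to_megabytes is hypothetical (it is not a Slurm command), and the sketch assumes the usual convention of 1024 MB per GB.

```shell
# Hypothetical helper: convert a memory request such as "600gb" or "4G"
# into megabytes, the unit expected by MinMemoryNode.
# Assumes 1 GB = 1024 MB and that a bare number already means megabytes.
to_megabytes() {
  # Lower-case the input and drop an optional trailing "b" (600GB -> 600g).
  value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  value=${value%b}
  case $value in
    *k) echo $(( ${value%k} / 1024 )) ;;
    *m) echo "${value%m}" ;;
    *g) echo $(( ${value%g} * 1024 )) ;;
    *t) echo $(( ${value%t} * 1024 * 1024 )) ;;
    *)  echo "$value" ;;
  esac
}

# Print (not run) the command that would lower a job's request to 64 GB.
echo "scontrol update JobID=<job_id> MinMemoryNode=$(to_megabytes 64gb)"
```

Replace <job_id> with the ID of your pending job before running the printed command.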
Jobs will also be held when there are no ‘queue slots’ available, meaning that the maximum number of jobs allowed to run at once is already running. In this case, your job will run automatically as soon as a slot in the running queue becomes available.
If your jobs use dependencies, they will be held until the dependency requirements are satisfied.
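To see which of these cases applies to a particular job, squeue can print the reason Slurm records for each pending job. The sketch below is only illustrative: explain_reason is a hypothetical helper, the descriptions are paraphrases, and only a handful of Slurm's reason codes are covered.

```shell
# Hypothetical helper: translate a few common Slurm pending-reason codes
# into plain language. The list is intentionally incomplete.
explain_reason() {
  case $1 in
    Resources)  echo "waiting for the requested resources to become free" ;;
    Priority)   echo "higher-priority jobs are ahead in the queue" ;;
    Dependency) echo "waiting for a job it depends on to finish" ;;
    JobHeldUser|JobHeldAdmin) echo "held; it must be released before it can run" ;;
    *Limit*)    echo "a configured limit (for example, maximum running jobs) has been reached" ;;
    *)          echo "see the squeue documentation for this reason code" ;;
  esac
}

# Only query the scheduler where Slurm is actually installed.
if command -v squeue >/dev/null 2>&1; then
  # -t PD: pending jobs only; -h: no header; %i: job ID; %r: reason.
  squeue -u "$USER" -t PD -h -o "%i %r" | while read -r id reason; do
    echo "$id: $reason ($(explain_reason "$reason"))"
  done
fi
```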
Q: My jobs are running longer than anticipated and will soon reach their original ‘walltime’. How can I resolve this?
A: If the job has not started running yet, use the following command to change its time limit:
scontrol update jobid=<job_id> TimeLimit=<new_timelimit>
If it is already running, please contact us and we will help you.
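One of the formats Slurm accepts for TimeLimit is days-hours:minutes:seconds. As a minimal sketch, a whole number of hours can be turned into that form as follows; the helper name hours_to_timelimit is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: format a whole number of hours as D-HH:MM:SS,
# one of the time formats Slurm accepts for TimeLimit.
hours_to_timelimit() {
  hours=$1
  days=$(( hours / 24 ))
  rem=$(( hours % 24 ))
  printf '%d-%02d:00:00\n' "$days" "$rem"
}

# Print (not run) the command that would set a 36-hour limit.
echo "scontrol update jobid=<job_id> TimeLimit=$(hours_to_timelimit 36)"
```

Here 36 hours becomes 1-12:00:00, that is, one day and twelve hours.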
Q: Is the SCC backed up? Are my data secure?
A: A full daily backup of your /home is maintained on a tape library. Furthermore, all system data, as well as the contents of your /scratch directory, are backed up to one day prior (only one snapshot is maintained due to space limitations). This means that if you delete important files in your /home or /scratch directory, they can be restored for you.
The operating system and system files are backed up on a separate, monthly schedule. Furthermore, whenever there is a major update to the system, an additional backup is taken in case the changes negatively affect system performance.
Note that the storage arrays use RAID 6, so the data remain intact after two drive failures. Hot spares are also configured, so it would take three simultaneous drive failures before data are in danger of corruption.
Q: How can I launch graphical utilities (such as MATLAB) on the SCC?
A: The SCC is accessed via secure shell (“ssh”). To use graphical user interfaces you must use “X-forwarding”. This is done by including the “-X” flag when you run the ssh command: ssh -X user@scclogin.camhres.ca. If you are using PuTTY, you can enable X11 forwarding in the configuration window under Connection > SSH > X11.
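Once logged in, a quick way to confirm that forwarding is active is to check whether the DISPLAY environment variable has been set by ssh. The following is a small sketch of such a check; the function name check_display is hypothetical.

```shell
# Hypothetical helper: report whether X11 forwarding appears to be active
# in the current session by looking at the DISPLAY variable.
check_display() {
  if [ -n "${DISPLAY:-}" ]; then
    echo "X11 forwarding looks active: DISPLAY=$DISPLAY"
  else
    echo "DISPLAY is not set; reconnect with ssh -X before starting a GUI"
    return 1
  fi
}

# Run the check; "|| true" keeps the script going when DISPLAY is missing.
check_display || true
```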
Q: Some of my jobs are failing. How can I debug them?
A: By default, both standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is replaced with the job allocation number. The file is generated on the first node of the job allocation. Other than the batch script itself, Slurm does no movement of user files. Review the standard output and standard error for your jobs; they will likely describe the issues that are preventing your job from completing. Prologue and epilogue scripts, which run automatically with every job you submit, also provide some general information about where the job was run and under what conditions. For detailed information about a particular job that is currently running, you can call ‘qstat -f JOBID’. This command returns information about the job and the environment variables associated with it.
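As a starting point for reviewing that output, the sketch below builds the default log file name for a job and lists the lines that look like errors. Both function names are hypothetical, the search keywords are a guess at common failure messages rather than an exhaustive list, and the log is assumed to be in the current directory.

```shell
# Hypothetical helper: default Slurm output file name for a job ID.
slurm_log_name() {
  printf 'slurm-%s.out\n' "$1"
}

# Hypothetical helper: print numbered lines of the job's log that contain
# typical failure keywords (an assumed, non-exhaustive list).
show_job_errors() {
  log=$(slurm_log_name "$1")
  if [ ! -f "$log" ]; then
    echo "no log file $log in $(pwd)"
    return 1
  fi
  grep -inE 'error|killed|exceeded|not found' "$log" \
    || echo "no obvious error lines in $log"
}

# Usage, from the directory the job was submitted in:
#   show_job_errors 12345
```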
Q: What software is available on the SCC? Can new software be installed?
A: You can see the software suites available on the SCC by listing the available modules: ‘module avail’. Multiple versions of a package are often supported and can be loaded individually.
To request that new software be installed as a module, contact the SCC Administrator or SCC Support at any time.
Q: module: command not found
A: This happens because the module command is an alias or shell function rather than a standalone program (see "Package Initialization" in module(1)): “The Modules package and the module command are initialized when a shell-specific initialization script is sourced into the shell. The script creates the module command, either as an alias or shell function, …” If you need to run the module command in a script, find the initialization script that defines the module command and source it from the script.
There are two solutions:
1. Run the script in an interactive or login shell, ignoring the state of your current shell, by changing the shebang of your script to
#!/bin/bash -i
or
#!/bin/bash -l
2. If you would rather inherit the state of your current shell, source the script in a subshell instead: ( source runit.sh )
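The approach of sourcing the initialization script from within your own script can be sketched as follows. The path /etc/profile.d/modules.sh is an assumption (a common default location, not confirmed for the SCC), and ensure_module_command is a hypothetical helper name.

```shell
# Sketch: make sure the "module" command exists inside a script.
# ASSUMPTION: this path is a common default and may differ on the SCC.
MODULES_INIT=${MODULES_INIT:-/etc/profile.d/modules.sh}

# Hypothetical helper: source the init script only if "module" is missing.
ensure_module_command() {
  init_script=${1:-$MODULES_INIT}
  # Nothing to do if the shell already knows "module".
  if type module >/dev/null 2>&1; then
    return 0
  fi
  # Otherwise source the initialization script, if it is readable.
  if [ -r "$init_script" ]; then
    . "$init_script"
  fi
  # Succeed only if "module" is now defined.
  type module >/dev/null 2>&1
}

if ensure_module_command; then
  echo "module command is available"
else
  echo "module command still missing; check the init script path"
fi
```

After a successful check, the rest of the script can call module as it would in an interactive session.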
Q: "Warning: no display specified." with -X flag
Mac OS X
By default, X11 forwarding is not enabled on Mac OS X Leopard. To enable it, the file /private/etc/sshd_config must contain the line "X11Forwarding yes". To add it, run this command from the terminal: $ echo "X11Forwarding yes" | sudo tee -a /private/etc/sshd_config and enter your password when prompted.
Windows
In order to run X Window applications from remote systems, you must have an X server installed on your Windows system.
We highly recommend MobaXterm, an enhanced terminal for Windows with an X11 server, tabbed SSH client, network tools and much more.
Xming is another implementation of the X Window System that runs under Microsoft Windows.