National Computational Infrastructure
NCI National Facility
Frequently asked questions

Accessing the AC and LC

Problems with batch jobs


Where do I go to apply for an account? Details on applying for accounts are given on the Accounts web page.
How do I log on to the AC and LC? You need to use ssh (in particular, ssh2) or slogin to access the National Facility computers. Details of how to log on are given in the User Guide, and more details on ssh are available from the Software web page; scroll down to Network Access and Grid Services.
How do I change my password? You can now do this on the AC and the LC, so log on and follow the instructions in the User Guide.
How do I open new windows on the AC? You need to have set up X11 forwarding, and how you do this depends on what sort of desktop machine you are using to access the AC or LC. Some suggestions for Windows, Linux and Mac are given on the ssh software web page and on the vnc software web page. VNC is generally faster than some of the alternatives, but you must use TightVNC.
How do I get help? Email sent to the help address goes to several people, so you should get an answer quite quickly during work hours. More information on emailing help is given here.
Is there a simple editor on the AC? There are several editors on the AC and LC listed on the software web page. Of these, nano is very straightforward.
Where do I look for my output? If your batch job is named, for example, runjob.sh and your output is not redirected in the batch script, then your job output will appear in runjob.sh.o****, where the final digits are the job number. The final entries in the .o file give details of the walltime and virtual memory used by the job.
Where do I look for error messages if the job doesn't work? If your batch script was called runjob.sh then the error messages will be in runjob.sh.e****. There is a limit to the length of this filename, so if you have a particularly long batch script name, it may be truncated in the resulting error and output file names.
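As a rough illustration of the naming scheme above, the following sketch lists the .o and .e files left by a batch script in the current directory and prints the last few lines of the newest .o file. The function names job_files and job_summary are made up for this example and are not NCI commands, and the simple filename patterns assume the script name was short enough not to be truncated.

```shell
#!/bin/sh
# Hypothetical helpers, not NCI-provided tools.

# List the output (.o) and error (.e) files for a batch script, newest first.
# Errors from patterns that match nothing are silenced.
job_files() {
    ls -t "$1".o* "$1".e* 2>/dev/null
}

# Print the last lines of the newest .o file, where the walltime and
# virtual memory details appear.
job_summary() {
    newest=$(ls -t "$1".o* 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no output file found for $1" >&2
        return 1
    fi
    echo "== $newest =="
    tail -n 5 "$newest"
}
```

After defining these in your shell, job_summary runjob.sh would show the end of the most recent output file for runjob.sh.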
What does it mean if the error message says ...ac-pb.SC: Command not found? This can happen if you created your batch script file on a Windows machine and then copied it across to the AC or LC. Windows adds an invisible carriage return character at the end of each line in the file. To remove these from a batch script called, say, runjob.sh and convert it from DOS to Unix format, run:
      module load dos2unix
      dos2unix runjob.sh
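
If you want to check a file before converting it, or dos2unix is not to hand, standard Unix tools can do much the same job. The following is a minimal sketch; has_crlf and strip_cr are illustrative names, not commands installed on the AC or LC.

```shell
#!/bin/sh
# Sketch: detect and remove DOS carriage returns using only standard tools.

# Succeeds (exit status 0) if any line of the file contains a carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of file $1 to file $2 with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Demonstration on a throwaway file with DOS line endings.
demo=$(mktemp)
printf '#!/bin/sh\r\necho hello\r\n' > "$demo"
has_crlf "$demo" && echo "before: DOS line endings found"
strip_cr "$demo" "$demo.unix"
has_crlf "$demo.unix" || echo "after: no carriage returns left"
rm -f "$demo" "$demo.unix"
```

Note that tr removes every carriage return in the file, not only those at the ends of lines, which is normally what you want for a batch script.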
      
The PBS resources I requested are being ignored. Your job script should have all the resources set at the top, with no blank lines and no other commands among them. When specifying a list of resources with the -l option, there must also be no spaces within the comma-separated list. For example, the start of the script should be something like:
    #!/bin/sh
    #PBS -Pa99 
    #PBS -lwalltime=20:00:00,vmem=300MB
    #PBS -ljobfs=1GB
    #PBS -wd
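
The rules above can be checked mechanically before you submit. The sketch below is one possible pre-submission check, not an NCI tool: check_pbs_header is a hypothetical name, and the test for spaces in a -l list is a simple pattern match that may miss unusual cases. It assumes comment lines are harmless and treats the first blank line or command as the end of the header.

```shell
#!/bin/sh
# Hypothetical pre-submission check for the #PBS header of a job script.
# Prints one message per problem and returns non-zero if any were found.
check_pbs_header() {
    awk '
        /^#PBS/ {
            # A directive that follows a blank line or a command is a problem.
            if (body) {
                printf "line %d: #PBS directive after the header block ended\n", NR
                bad = 1
            }
            # Flag a space next to a comma in a -l resource list.
            if ($0 ~ /-l.*(,[ \t]|[ \t],)/) {
                printf "line %d: space inside -l resource list\n", NR
                bad = 1
            }
            next
        }
        /^#/ { next }     # other comment lines, including the #! line
        { body = 1 }      # a blank line or a command ends the header
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

One way to use it is check_pbs_header runjob.sh && qsub runjob.sh, so that the job is only submitted when the header passes.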
My batch job is accepted but takes no time and produces no output. This could be due to a couple of problems:
  1. If you submit a script as an argument to qsub, check that there is a newline character at the end of the last executable line. The easiest way to do this is to simply cat the script - if the last line of the output has your shell prompt attached, edit the file to put a blank line at the end.

  2. Often when using a workstation, people run their job in the background, say ./runjob &, which works fine interactively. However, when translated to a queue batch script the result is often
        #!/bin/sh
        #PBS -q normal
        #PBS -l walltime=00:10:00,vmem=400MB
    
        ./runjob &
    

    This script will exit almost immediately because it runs runjob in the background. Since the script exits immediately, the queue system assumes that the job is finished and kills off all user processes. Consequently, your code, which runs fine interactively, gets killed almost immediately out on the queue.

    There are two solutions. Try the batch job script

        #!/bin/sh
        #PBS -q normal
        #PBS -l walltime=00:10:00,vmem=400MB
    
        ./runjob
    
    NOTE: the & has been removed.

    BUT, if your runjob is itself a complicated script which starts up all sorts of programs in the background, try

        #!/bin/sh
        #PBS -q normal
        #PBS -l walltime=00:10:00,vmem=400MB
    
        ./runjob
    
        wait
    
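    which tells the shell (/bin/sh) to wait until all of its background jobs have finished before exiting. This prevents your background jobs from being killed and allows your program to complete. Note that wait only covers background processes started by the shell that calls it, so if runjob is a separate script that launches programs in the background and then returns, the wait needs to go at the end of runjob itself.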
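
Both problems above can be tried out in an ordinary shell. The following sketch is for illustration only: ends_with_newline is a made-up helper name, and the sleep command stands in for real background work started by the script itself.

```shell
#!/bin/sh
# Sketch only; ends_with_newline is an illustrative name, not an NCI command.

# Problem 1: succeed if the last byte of the file is a newline. Command
# substitution strips a trailing newline, so an empty result means the
# last byte was a newline (an empty file also passes).
ends_with_newline() {
    [ -z "$(tail -c 1 "$1")" ]
}

# Problem 2: background work started by this shell, standing in for runjob.
( sleep 1; echo "work finished" ) &
echo "without wait, the script would exit here"
wait    # block until every background child of this shell has finished
echo "script exits after wait"
```

With the function defined, ends_with_newline runjob.sh || echo "add a newline at the end of runjob.sh" gives the same answer as inspecting the output of cat by eye.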
Why won't my batch job start? A batch job requires various resources before it can start running, such as adequate free memory, dedicated processors and licenses for any licensed software being used. The queue scheduler will not start your job until it can get all the resources it needs. The priority for starting up jobs also depends on jobs in the queue submitted by other members of your project and how much of your grant has been used. Your job may be held up because another user from your project has queued a job which cannot start and it is above yours in the queue.

Check the software license web page to see if there are sufficient software licenses for your job to start.

Why does my batch job get suspended? If your job is suspended, either a higher priority job (e.g. an express queue job) or a larger parallel job is running on the CPUs your job had been running on.

Parallel jobs require x CPUs to be available to start running. To start a parallel job we have a choice: either CPUs are held idle as other jobs finish until x CPUs become available, or we allow other jobs to jump ahead of the x-CPU parallel job in the queue until roughly x CPUs' worth of jobs have jumped ahead, at which point the parallel job suspends all the jobs that jumped it in the queue and starts running itself.

In order to optimize CPU utilization on the machine, we opt to start jobs running as soon as possible, which results in them spending less time in the queued state but potentially some time in the suspended state. We believe that job preemption (suspension/resumption) is giving us around 25-30% better system utilization than systems based on FIFO, where CPUs have to be held idle until enough have accumulated for parallel jobs. That means everyone's grants are effectively around 25-30% larger, and we hope you will accept occasional longer suspensions for this win.

There is more explanation in the scheduling policy document.

Can I keep submitting jobs when my grant runs out? Yes, you can continue to submit jobs to the batch queues when you have used up all your quarterly grant. These jobs will run at a lower priority than jobs of projects with remaining grant. As a result they may take longer to start and may be suspended more often. However, as time goes by, more and more users are in the same situation.
Email problems, suggestions, questions to