http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64Cluster/Doc/Jobs.html
Jobs should not be run on the interactive nodes. Their use is primarily
for compiling and building your programs. Instead, please run jobs on
the compute nodes. See the section on qsub -I for instructions on how
to run an interactive job on the compute nodes.
MPI
All the MPI implementations on the NCSA Intel 64 Linux Cluster provide
the mpirun script for running an MPI program. See the sample
batch scripts for syntax details for each MPI implementation.
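As a sketch, an mpirun line in a batch script for a two-node, 16-core job might look like the following. The -np and -machinefile options are common to MPICH-style mpirun scripts, but the exact flags vary between implementations, and a.out is a placeholder executable:

```shell
# Launch 16 MPI ranks on the nodes PBS assigned to this job;
# $PBS_NODEFILE names the file listing those nodes.
mpirun -np 16 -machinefile $PBS_NODEFILE ./a.out
```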
Notes:
- The environment variable $PBS_NODEFILE is automatically
defined in a batch job to point to a temporary file that
contains the list of nodes assigned to the job.
- The arguments to mpirun need to come before your executable.
Any arguments after your executable are considered to be arguments
to your executable.
- The VMI2 MPI implementation does not propagate environment
variables well. The workaround is to create a wrapper script that
sets all the environment variables your code needs and then invokes
the executable. In your batch script, use the wrapper script as the
executable on your mpirun line.
- As noted in the MVAPICH2 sample
batch script, in order to run MVAPICH2 jobs, a file named
.mpd.conf needs to exist in your home directory with the
line:
MPD_SECRETWORD=XXXXXXX
where XXXXXXX is a string of random alphanumeric characters,
with at least one alphabetic character.
The file should also be readable and writable only by the owner, so
set the permissions as follows:
chmod 600 $HOME/.mpd.conf
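The wrapper-script workaround for VMI2 described above can be sketched as follows; MY_INPUT_DIR is a hypothetical variable your code might read, and a.out is a placeholder executable:

```shell
#!/bin/sh
# wrapper.sh -- set the environment the program needs, then replace
# this shell with the real executable so mpirun runs it directly.
MY_INPUT_DIR=$HOME/input    # hypothetical environment variable
export MY_INPUT_DIR
exec ./a.out "$@"           # forward any arguments to the program
```

In the batch script, the mpirun line would then use ./wrapper.sh in place of ./a.out.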
OpenMP
Before you run an OpenMP program, set the environment variable
OMP_NUM_THREADS to the number of threads you want. For example,
to run program a.out interactively with two threads:
setenv OMP_NUM_THREADS 2
./a.out
The following environment variables may also be useful when running
your OpenMP programs:
OMP_SCHEDULE | Sets the schedule type and (optionally) the chunk size for DO and PARALLEL DO loops declared with a schedule of RUNTIME. The default is STATIC.
KMP_LIBRARY | Sets the run-time execution mode. The default is throughput, but it can be set to turnaround so that worker threads do not yield while waiting for work.
KMP_STACKSIZE | Sets the number of bytes to allocate for the stack of each parallel thread. You can use a suffix k, m, or g to specify kilobytes, megabytes, or gigabytes. The default is 4m.
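For example, to request dynamic loop scheduling with a chunk size of 4 and an 8 MB per-thread stack (illustrative values, not recommendations); these lines use sh/bash syntax, whereas the setenv example above is for csh:

```shell
export OMP_SCHEDULE="dynamic,4"   # schedule type, chunk size
export KMP_STACKSIZE=8m           # 8 MB stack per worker thread
```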
Hybrid MPI/OpenMP
To run a hybrid MPI/OpenMP program, set the environment variable
OMP_NUM_THREADS to the number of threads you want, and change the
number of CPUs per node for MPI accordingly. For example, to run a
program with 10 MPI ranks and 8 threads for each rank, do the
following in your batch script:
#PBS -l nodes=10:ppn=1
setenv OMP_NUM_THREADS 8
See the note on VMI2 in the MPI section above regarding the use of a
wrapper script.
(See the qsub
section for information on PBS directives.)
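Putting those two lines into context, a hybrid batch script might look like the sketch below (csh syntax, matching the example above; a.out is a placeholder, and the exact mpirun flags depend on which MPI implementation you use):

```shell
#!/bin/csh
#PBS -l walltime=01:00:00,nodes=10:ppn=1
#PBS -N hybrid_job

# One MPI rank per node; each rank spawns 8 OpenMP threads,
# filling the 8 cores of each node.
setenv OMP_NUM_THREADS 8

cd $PBS_O_WORKDIR    # directory from which the job was submitted
mpirun -np 10 -machinefile $PBS_NODEFILE ./a.out
```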
The NCSA Intel 64 Linux Cluster uses the
Torque Resource
Manager
with the Moab Workload
Manager
for running jobs. Torque is based upon OpenPBS, so the commands are the
same
as PBS commands.
3.1 Scheduling Policies
The
scheduling policy on Abe is set to highly favor large node-count jobs.
Also, as with other HPC systems at NCSA, the scheduling policy includes
fair-share.
Under fair-share, a job's priority may be increased or decreased
based on other jobs that the user's project is running or has
recently run. To give everyone a fair opportunity to run jobs, a
user's job receives a higher priority if users in their project have
not run jobs in the recent past.
Fair-share also factors in the
ratio of the service units the user's project is allocated and the
time to the allocation expiration.
To maximize utilization,
the scheduler will also back-fill jobs. When trying to schedule
large blocks of nodes for large jobs, there are often "holes" where some
nodes are idle waiting to be added to a pool to start a large waiting
job.
The scheduler back-fills smaller jobs into these holes.
When figuring out a job's priority relative to other jobs, there are
several factors which are taken into account. Some of these factors
include:
- job size (how many nodes)
- job expansion factor (the ratio of the time the job has spent
eligible
to be run versus how much time the job has requested)
- the raw amount of time the job has spent eligible to be run
- fair-share factors
A relative weighting of these factors contributes to a job's priority.
A debug queue is available to facilitate fast turnaround on
debugging/testing jobs. Jobs in this queue have an intrinsically
higher priority; additionally, they accrue priority
at a much higher rate because the expansion factor (and its associated
priority factor) increases very quickly.
In order to keep jobs from the long queue from dominating the
system and causing shorter jobs to wait behind them, there is a limit on
the nodes currently running jobs from the long queue.
Given the fluid nature of our job load, this limit is adjusted from time
to time, but in the general case we tend to keep it between 1/4 and 1/3
of
the available nodes.
When that limit is reached, subsequent jobs in the queue may go into a
blocked state until running jobs finish and free up resources.
Then the jobs will automatically be moved from the blocked state and
get scheduled to run.
3.2 Queues
The following queues are currently available for users:
Queue | Walltime | Max # Nodes
debug | 30 mins | 16
normal (default) | 48 hours | 256 (as of July 1 2009)
wide | 48 hours | 600 (as of September 16 2009)
long | 168 hours | 256 (as of July 1 2009)
NOTES:
- Jobs submitted to any queue except wide will stay within the
bounds of either the 16GB-memory or the 8GB-memory nodes; i.e., jobs
will not span the two types of resources unless submitted to the
wide queue.
- The minimum node count for the wide queue is 64.
- Access to resources over 600 nodes (up to a maximum of 1024
nodes) is available by special request.
Please send email to consult@ncsa.uiuc.edu
to request access. Include the number of nodes and the wall time
required, and the number of jobs to be run.
Below are brief descriptions of the useful batch commands.
For more detailed information, refer to the individual man pages.
The qsub command is used to submit a batch job to a queue.
All options to qsub can be specified either on the command line or
as a line in a script (known as an embedded option). Command-line
options take precedence over embedded options.
Scripts can be submitted using
qsub [list of qsub options] script_name
The main qsub options are listed below. The sample batch scripts
illustrate qsub usage and options.
Also see the qsub man page for other options.
- -l resource_list: specifies resource limits.
The resource_list argument is of the form:
resource_name[=[value]][,resource_name[=[value]],...][:resource]
The resource_names are:
walltime: maximum wall clock time (hh:mm:ss) [default: 10
mins]
nodes: number of 8-core nodes [default: 1 node]
ppn: how many cores per node to use (1 through 8)
[default: ppn=1]
resource: resource to be used. The available resource is
himem to access the 16 GB memory nodes.
Note: Specify the himem resource only if you absolutely need the
higher-memory nodes, since it can impact the turnaround time of the
job.
Examples:
#PBS -l walltime=00:30:00,nodes=2:ppn=8
#PBS -l walltime=00:30:00,nodes=2:ppn=8:himem
- -q queue_name: specifies the queue name. [default: normal]
- -N jobname: specifies the job name.
- -o out_file:
store the standard output of the job to file out_file.
After the job is done, this file will be found in the directory from
which the qsub command was issued.
[default: <jobname>.o<PBS_JOBID>]
- -e err_file:
store the standard error of the job to file err_file.
After the job is done, this file will be found in the directory from
which the qsub command was issued.
[default: <jobname>.e<PBS_JOBID>]
- -j oe:
merge standard output and standard error into standard output file.
- -V:
export all your environment variables to the batch job.
- -m be: sends mail at the beginning and end of a job.
- -M myemail@myuniv.edu: sends email notifications to the given
address.
- -A project:
charge your job to a specific project (TeraGrid project or NCSA PSN).
(for users in more than one project)
- -X: enables X11 forwarding.
Notes:
- Using the -N option without the -o and -e options will generate
stdout and stderr files of the form <jobname>.o<jobid> and
<jobname>.e<jobid>, respectively, in the directory from which the
batch job was submitted.
- Temporary stdout/stderr files while the job is running are
located in the
home directory [$HOME/.pbs_spool or $HOME], and named <jobid>.abem5.OU
and
<jobid>.abem5.ER.
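As a sketch, several of the embedded options above might appear together at the top of a batch script like this; the job name, email address, and executable are placeholders:

```shell
#!/bin/csh
#PBS -l walltime=00:30:00,nodes=2:ppn=8
#PBS -N myjob
#PBS -j oe                  # merge stderr into the stdout file
#PBS -m be                  # mail at the beginning and end of the job
#PBS -M myemail@myuniv.edu  # placeholder address
#PBS -V                     # export current environment to the job

cd $PBS_O_WORKDIR           # directory from which qsub was run
./a.out                     # placeholder executable
```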
The -I option tells qsub you want to run an interactive job. You can
also
use other qsub options such as those documented in the batch sample
scripts.
For example, the following command:
qsub -I -V -l walltime=00:30:00,nodes=2:ppn=8
will run an interactive job with a wall clock limit of 30 minutes, using
two nodes and eight cores per node.
After you enter the command, you will have to wait for Torque to start
the
job. As with any job, your interactive job will wait in the queue until
the specified number of nodes is available. If you specify a small
number of nodes for smaller amounts of time, the wait should be shorter
because your job will backfill among larger jobs.
Once the job starts, you
will see something like this:
qsub: waiting for job 1244.abem5.ncsa.uiuc.edu to start
qsub: job 1244.abem5.ncsa.uiuc.edu ready
Now you are logged into the launch node. At this point, you can use the
appropriate command to start your program.
When you are done with your runs, you can use the exit command to end
the job.
The
qstat command displays the status of batch jobs.
- qstat -a gives the status of all jobs on the system.
- qstat -n lists nodes allocated to a running job in
addition to basic information.
- qstat -f PBS_JOBID gives detailed
information
on a particular job.
Note: Currently PBS_JOBID needs to be the full extension:
<jobid>.abem5.ncsa.uiuc.edu.
- qstat -q provides summary information on all the
queues.
See the man page for other options available.
qhist, a locally written tool available on the NCSA Intel 64
Linux Cluster, summarizes the raw accounting record(s) for one or more
jobs.
See the output of "qhist --help" for details.
NOTE: As of May 6 2009, SU charges for a job are available the day after
the job completes.
To display information about a specific job, the syntax is qhist
PBS_JOBID.
The qdel command deletes a queued job or kills a running
job.
The syntax is qdel PBS_JOBID.
Note: You only need to use the numeric part of the Job ID.
Sample batch scripts are available in the directory
/usr/local/doc/batch_scripts for use as a template.
Scratch space for batch jobs is provided via a per-job scratch directory
that
is created at the beginning of the job. This directory is created under
/scratch/batch, and is based on the JobID. If the batch
script uses one of the sample scripts as a template, the name of this
scratch directory is
available to job scripts with the $SCR environment variable.
Your job scratch directory may be deleted soon
[possibly immediately] after your job completes, so
you should take care to transfer results to the mass storage system (see
the section Automated
Saving of Files from Batch Jobs).
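A minimal sketch of that pattern, assuming the $SCR variable from the sample scripts and a hypothetical output file name:

```shell
cd $SCR                  # per-job scratch directory
./a.out > results.out    # placeholder executable and output file
# Copy anything you need before the job ends -- the scratch directory
# may be deleted as soon as the job completes (see saveafterjob below
# for guaranteed transfers to mass storage).
cp results.out $HOME/
```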
The cdjob command
can be used to change the working directory to the scratch directory of a
running batch job.
The syntax is
cdjob PBS_JOBID
The saveafterjob utility is available for automated, guaranteed
saving of output files from batch jobs to the mass storage system.
It is enabled by adding the SoftEnv key +saj.
For details on its use, see the saveafterjob
page and the sample PBS batch scripts.
- To avoid excessive paging, we recommend restricting
job memory
to 875MB/core or 7GB/node.
- While a job is running, you can ssh to
the
compute nodes on which your job is running. qstat -n
provides the list of hosts assigned to your job.
The first host on the list is the
launch node.