Running Jobs
The mpirun command is used to execute a job on the Blue Gene. Because
we do not allow interactive use of the mpirun command on the login
nodes (except by special arrangement), you must submit it as a
job to the LoadLeveler batch system. There are two ways to submit
batch jobs. The standard method is to set up a job
command file (jcf) that specifies the mpirun
command along with the executable name and other parameters, and then
submit this jcf to the batch system.
Here is a sample jcf. The alternative
method is to run a script that accepts input parameters, such as the
executable file name, on the command line. The script
generates a jcf from those parameters
and submits it to the batch queue.
Mpirun
To run a Blue Gene executable via the mpirun command, you should
compile an executable and place it on a file system that can be
accessed by the Blue Gene (/project, /project2, /projectnb1,
/projectnb2, or any of /usr1-/usr4). You should then modify the sample
jcf, particularly by changing the "@ arguments" line to reference
your executable and set appropriate command line arguments for mpirun.
The main command line arguments to mpirun are:
| Argument Usage | Mandatory? | Description |
| -np N | Mandatory | N = number of MPI tasks (See section below) |
| -cwd start_dir | Mandatory | start_dir = full pathname of directory where program runs |
| -exe path_to_executable | Mandatory | Full pathname of your executable. |
| -verbose 0-4 | Optional | controls diagnostic output, default is 0, 1 is recommended |
| -args "list_of_args" | Optional | "list_of_args" = list of args to executable (enclosed in quotes) |
| -mode CO|VN | Optional | CO|VN = COprocessor (default) or Virtual Node mode (See section below) |
| -connect MESH|TORUS | Optional | MESH|TORUS = defaults to MESH, N must be a multiple of 512 to use TORUS |
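As an illustration of how the arguments in the table fit together, the following Python sketch assembles an mpirun command line and rejects values the table rules out. The function name build_mpirun_command is hypothetical and is not part of the Blue Gene software. The sketch only builds the argument list; how that list should be quoted on the "@ arguments" line of a jcf is not covered here.

```python
import shlex


def build_mpirun_command(np, cwd, exe, verbose=None, args=None,
                         mode=None, connect=None):
    """Return an mpirun argument list (hypothetical helper, for illustration)."""
    if np < 1:
        raise ValueError("np must be a positive number of MPI tasks")
    # The three mandatory arguments from the table.
    cmd = ["mpirun", "-np", str(np), "-cwd", cwd, "-exe", exe]
    if verbose is not None:
        if verbose not in range(5):
            raise ValueError("verbose must be between 0 and 4")
        cmd += ["-verbose", str(verbose)]
    if args is not None:
        # Passed as one element so the whole list stays a single argument.
        cmd += ["-args", args]
    if mode is not None:
        if mode not in ("CO", "VN"):
            raise ValueError("mode must be CO or VN")
        cmd += ["-mode", mode]
    if connect is not None:
        if connect not in ("MESH", "TORUS"):
            raise ValueError("connect must be MESH or TORUS")
        # The table states that N must be a multiple of 512 to use TORUS.
        if connect == "TORUS" and np % 512 != 0:
            raise ValueError("TORUS requires np to be a multiple of 512")
        cmd += ["-connect", connect]
    return cmd


if __name__ == "__main__":
    command = build_mpirun_command(
        1000, "/project/xyz/abc", "/project/xyz/exe/a.out", mode="VN")
    print(shlex.join(command))
```

Returning a list rather than a string keeps the -args value intact as one argument; shlex.join is used only to display the result.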
MPI parallel tasks and tasks per node
In the above table, N is defined as the number of MPI parallel tasks rather than the traditional number of processors, because the task count
more appropriately represents the Blue Gene nodes' hardware configuration.
A Blue Gene node consists of 2 processors. In the default
COprocessor mode (CO),
one processor is used for computation and the other is dedicated to communication. This results in
1 MPI task per node and the node's entire 512 MB of memory is available to the
task. In the Virtual Node mode (VN), both processors are used for computations. In this case, there
are 2 tasks per node and both tasks share the node's 512 MB of memory. Our
Blue Gene has a total of 1024 nodes. In the CO mode (1 task per node), the
maximum number of tasks you can request is
1024. In the VN mode (2 tasks per node), the maximum number of tasks is 2048.
Important note on the number of MPI tasks, N: although you can choose N to be any value
up to 1024 (CO) or 2048 (VN), the system will only allocate 32, 128, 512, or
1024 physical nodes to a job. It allocates the smallest allowed number
of physical nodes that can hold the job at one task per node (or two
per node in VN mode).
The following examples demonstrate some typical arguments to mpirun that are
specified in the jcf as well as how the CO|VN mode affects the number
of nodes allocated.
Example 1.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out
In this case, the system allocates 1024 nodes for the job because the job runs under the
default CO mode and 1024 is the smallest allowable number of nodes (among 32, 128, 512, 1024) necessary to accommodate the requested 1000 tasks.
Example 2.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out -mode VN
In this case, the system allocates 512 nodes for the job because the job runs under the VN mode
and 512 is the smallest allowable number of nodes (among 32, 128, 512, 1024) necessary to
accommodate the requested 1000 tasks.
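The allocation rule behind both examples can be written as a short calculation. The Python sketch below is only a model of the rule as described on this page, not the scheduler's actual code, and the name allocated_nodes is hypothetical.

```python
# Node counts the system is willing to allocate, as listed on this page.
ALLOWED_NODE_COUNTS = (32, 128, 512, 1024)

# Tasks placed on each node in COprocessor and Virtual Node mode.
TASKS_PER_NODE = {"CO": 1, "VN": 2}


def allocated_nodes(np, mode="CO"):
    """Return the number of physical nodes allocated for np MPI tasks."""
    if np < 1:
        raise ValueError("np must be a positive number of MPI tasks")
    tasks_per_node = TASKS_PER_NODE[mode]
    # Ceiling division: nodes needed to hold every task.
    nodes_needed = -(-np // tasks_per_node)
    # Pick the smallest allowed allocation that is large enough.
    for count in ALLOWED_NODE_COUNTS:
        if count >= nodes_needed:
            return count
    raise ValueError("request exceeds the 1024 nodes of the machine")


if __name__ == "__main__":
    print(allocated_nodes(1000))        # Example 1: 1024
    print(allocated_nodes(1000, "VN"))  # Example 2: 512
```

Note that a small job still occupies a full allocation: under this rule a request for 33 tasks in CO mode is given 128 nodes.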
For additional mpirun options, enter mpirun -h at the system prompt.
For complete documentation, please refer to Chapter 3 of IBM System Blue Gene Solution: System Administration
(available in html and pdf).
Batch scheduling policy
The scheduler implements the following usage limits:
| Limit | During Business Hours* | During Off Hours |
| Maximum Runtime per Job | 5 hours | 5 hours |
| Maximum Nodes used per User | 512 | 1024 |
(*Business Hours: 9am - 5pm Eastern Time, Monday - Friday.)
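To check a planned job against these limits before submitting it, the table can be expressed as a small Python sketch. This is an illustration only: the function names are hypothetical, the timestamp is assumed to be already expressed in Eastern Time, and treating 9am as inside and 5pm as outside business hours is an assumption about the boundaries. How the scheduler itself accounts for nodes in use is not described here.

```python
from datetime import datetime

MAX_RUNTIME_HOURS = 5          # same limit during business and off hours
BUSINESS_HOURS_NODE_LIMIT = 512
OFF_HOURS_NODE_LIMIT = 1024


def max_nodes_per_user(when):
    """Return the per-user node limit at 'when' (a datetime in Eastern Time)."""
    is_weekday = when.weekday() < 5      # Monday is 0, Friday is 4
    in_hours = 9 <= when.hour < 17       # assumed boundaries: 9am to 5pm
    if is_weekday and in_hours:
        return BUSINESS_HOURS_NODE_LIMIT
    return OFF_HOURS_NODE_LIMIT


def request_within_limits(requested_nodes, nodes_already_in_use,
                          runtime_hours, when):
    """Check a request against the runtime and per-user node limits."""
    if runtime_hours > MAX_RUNTIME_HOURS:
        return False
    total_nodes = requested_nodes + nodes_already_in_use
    return total_nodes <= max_nodes_per_user(when)


if __name__ == "__main__":
    wednesday_morning = datetime(2024, 1, 3, 10, 0)
    print(request_within_limits(512, 0, 4, wednesday_morning))
    print(request_within_limits(1024, 0, 4, wednesday_morning))
```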
After enforcing the above limits, the scheduler prioritizes the
runnable jobs and runs the highest priority one if the necessary
resources are available. If the necessary resources are not yet
available, the scheduler uses a backfilling strategy: it runs lower
priority, short duration jobs on any available resources as long as
doing so will not delay the start of the highest priority job.
The primary ordering criterion used to prioritize jobs is the amount
of recent runtime accumulated by the user. This quantity is displayed
by the qstat command under the SYSPRI column as a negative value. Jobs
with the same SYSPRI are ordered by submission time. Finally, a user
can alter the relative ordering of their own jobs with the llprio
command. This command modifies the "user priority" which is displayed
in the PRI column of the llq and qstat commands. The qstat command
lists the waiting jobs in the above described scheduling order.
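The ordering described above can be modeled as a two-key sort. The Python sketch below is a simplified illustration, not the scheduler's implementation. It assumes that a user with less recent runtime (a SYSPRI closer to zero) is ranked first, which this page implies but does not state outright, and it ignores the per-user adjustment made with llprio. The Job class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    syspri: float       # negative of the owner's recent runtime, as in qstat
    submit_time: float  # hypothetical: seconds since an arbitrary reference


def scheduling_order(jobs):
    """Sort jobs by SYSPRI, breaking ties by submission time."""
    # Assumption: SYSPRI closer to zero (less recent runtime) goes first.
    # Negating SYSPRI recovers the recent runtime, which sorts ascending.
    return sorted(jobs, key=lambda job: (-job.syspri, job.submit_time))


if __name__ == "__main__":
    waiting = [
        Job("a", syspri=-100, submit_time=1),
        Job("b", syspri=-10, submit_time=5),
        Job("c", syspri=-10, submit_time=2),
    ]
    print([job.name for job in scheduling_order(waiting)])
```

In this sample, jobs b and c share the same SYSPRI, so the earlier submission (c) is listed ahead of b, and both come before job a, whose owner has accumulated more recent runtime.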
Alternative method to submit a batch job -- without a jcf
This method uses a wrapper script, bglsub, which accepts
command line input from the user, such as the number of tasks and the executable name, and
generates a jcf as required by the standard method. As its last step,
bglsub submits the newly generated jcf to the batch queue.
Interactive mpirun usage
Sometimes it is convenient (e.g. during program development or
debugging) to execute mpirun directly on the login node rather
than through the batch system. This is normally not permitted, but it
can be arranged by sending a request to help@twister.bu.edu. We will allocate a partition of the machine for your exclusive use, and
you will be able to use it by invoking mpirun in the following way:
levi% mpirun -noallocate -partition YOUR_PARTITION ...
where YOUR_PARTITION is the name of the partition assigned to you and
"..." represents all the other flags you would normally pass to mpirun.
Batch Job Management Commands
|