[ASP-GAU-Users] IBM LoadLeveler commands
Christiane Jablonowski
cjablono at ucar.edu
Wed Mar 16 17:20:01 MST 2005
Hi everybody,
as a follow-up to today's meeting I would like to send you some
LoadLeveler options (for IBM machines only). Here is a copy of a job
script we looked at:
#@ class = com_rg32
#@ node = 1
#@ tasks_per_node = 8
#@ wall_clock_limit = 00:59:00
#@ output = out.$(jobid)
#@ error = out.$(jobid)
#@ job_type = parallel
#@ network.MPI = csss,not_shared,us
#@ node_usage = not_shared
#@ account_no = 54042108
#@ ja_report = yes
#@ environment =
AIXTHREAD_SCOPE=S;MALLOCMULTIHEAP=TRUE;MP_SHARED_MEMORY=yes;
MEMORY_AFFINITY=MCM
#@ queue
...
# must be set equal to (CPUs-per-node / tasks_per_node)
setenv OMP_NUM_THREADS 4
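For completeness, a script like the one above is handled with the
standard LoadLeveler commands (the script filename here is just
illustrative):

```shell
llsubmit run_model.csh   # submit the job script to LoadLeveler
llq -u $USER             # list your queued and running jobs
llcancel <jobid>         # remove a job from the queue if needed
```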
In this particular example, we request 1 node with 32 processors (class
com_rg32). 8 of the 32 processors get assigned to 8 (tasks_per_node)
MPI processes. In the models, this split is most commonly implemented
through a so-called domain decomposition approach.
The OpenMP parallelization sits on top of the MPI parallelization and
works within the (here) 8 domains. In the example above, each MPI
process spawns 4 OpenMP parallel threads that e.g. parallelize a loop
in the vertical direction. Please note that an MPI or OpenMP
parallelization is not automatic and needs to be specified explicitly
in the code. This is already done in NCAR's standard codes. In
general, it's best if
OMP_NUM_THREADS * tasks_per_node = number of processors in a node
(here 32)
The corresponding configuration for the com_rg8 nodes is
#@ class = com_rg8
#@ node = 4
#@ tasks_per_node = 2
...
setenv OMP_NUM_THREADS 4
The
#@ environment ...
option (specified as one long string) might speed up the calculation.
MP_SHARED_MEMORY=yes avoids unnecessary MPI messages across the
communication network if the communication takes place within 1 node
(which corresponds to memory copies).
If you know the approximate run time of your job, setting
#@ wall_clock_limit
accordingly can reduce the waiting time in the queue: short jobs can
then be squeezed in (backfilled) on nodes that are reserved, and would
otherwise sit idle, for bigger jobs. If you don't specify the
wall-clock time, the maximum time for the queue is assumed (e.g. 6h
for the regular queues).
Best,
Christiane