[Wrf-users] Unpredictable crashes - MPI/RSL/Nest related?
scott.rowe at globocean.fr
Wed Aug 24 09:22:55 MDT 2011
I would like to know if others have come across this problem. The best I
can do is give a general description because it is quite unpredictable.
In point form:
- General Details -
o I am performing a simulation with one parent domain (@25km) and three
child domains (@12.5km)
o I am able to run just the parent domain without problem on 2 CPUs with
4 cores each, ie 8 tasks using MPI for communications, all in a single machine
o I can run the parent domain on at least 30-odd cores without problem,
using MPI over a network. --> no nests, no worries
o When I increase maxdom to include from one to three child domains, the
simulations will work fine when run on a single core. --> no MPI, no worries
o As soon as I increase the number of cores, simulation success becomes
less likely. --> nests + MPI = worries
o The strange thing is, when a run performs correctly with, say, two
cores, and I then increase this to three cores, WRF will crash. Upon
returning to two cores, the same simulation will no longer run, and this
without touching any other configuration aspect! Success is highly unpredictable.
o When WRF crashes, it is most often in radiation routines, but
sometimes in cumulus; this too is highly unpredictable.
o Successive runs always crash at the same timestep and in the same routine.
o Timestep values for the parent domain and child domains are very
conservative, and have also been shown to function well when run without MPI.
o Many combinations of physics and dynamics options have been trialled
to no avail. I note again that the options chosen work fine when run
without MPI.
o I have tried several configurations for the widths of the
boundary-condition relaxation zones; a wider relaxation zone does seem to
increase the chance of success, but this is hard to verify.
o No CFL warnings appear in the rsl log files; the crashes are abrupt
and take the form of a segmentation fault whilst treating a child
domain, never the parent domain.
o The only hint I have seen in the output files is the TSK field becoming
NaN over land inside a child domain. This does not occur in 100% of the
failed runs.
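For reference, the timestep at which TSK first goes NaN can be pinpointed
programmatically rather than by eyeballing output. A minimal stdlib-only
Python sketch (it assumes the TSK values have already been extracted from
the wrfout file into a flat list in timestep order, e.g. via ncdump; that
extraction step is not shown, and the sample data below is hypothetical):

```python
import math

def first_nan(values):
    """Return the index of the first NaN in a flat sequence of floats,
    or None if no NaN is present."""
    for i, v in enumerate(values):
        if math.isnan(v):
            return i
    return None

# Hypothetical TSK samples over land: NaN appears at index 3.
sample = [289.4, 290.1, 291.0, float("nan"), 288.7]
print(first_nan(sample))  # -> 3
```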
It would thus appear to be an MPI or compiler issue rather than WRF itself.
That said, it is only the combination of nests AND MPI that causes problems,
not one or the other alone. Could it be RSL?
Does anyone have any debugging ideas, even just general approaches to
try and find the culprit?
Any MPI parameters that could be adjusted?
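One diagnostic I can already run myself: since every rank writes its own
rsl.error.* file, comparing the last completed "Timing for main" stamp per
rank shows which rank and which domain died first. A small stdlib-Python
sketch of that comparison (the function takes the log lines directly; the
excerpt below is a hypothetical sample, and the line format matched is the
per-step timing line WRF normally prints - treat the regex as an assumption
if your build logs differently):

```python
import re

def last_timestep(log_lines):
    """Return (timestamp, domain) of the last completed main step found
    in one rank's rsl log lines, or None if no step completed."""
    stamp = None
    pattern = re.compile(r"Timing for main: time (\S+) on domain\s+(\d+)")
    for line in log_lines:
        m = pattern.search(line)
        if m:
            stamp = (m.group(1), int(m.group(2)))
    return stamp

# Hypothetical excerpt from one rank's rsl.error file.
sample = [
    "Timing for main: time 2011-08-24_00:05:00 on domain   1:    1.2 elapsed seconds.",
    "Timing for main: time 2011-08-24_00:02:30 on domain   2:    0.8 elapsed seconds.",
]
print(last_timestep(sample))  # -> ('2011-08-24_00:02:30', 2)
```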
- Technical Details -
o Using OpenMPI 1.4.3
o Aiming for WRFV3.3 use but have tried v3.2.1 also
o EM/ARW core
o Compilers are ifort and icc v10.1
o Have tried compiling with -O0, -O2 and -O3, with thorough cleaning between builds
o GFS boundary conditions, WPSV3.3. No obvious problems to report here.
geo*.nc and met_em* appear fine.
Thank you for any help you may be able to give.