[Wrf-users] Unpredictable crashes - MPI/RSL/Nest related?
Scott
scott.rowe at globocean.fr
Wed Aug 24 09:22:55 MDT 2011
Hello all,
I would like to know if others have come across this problem. The best I
can do is give a general description, because the behaviour is quite
unpredictable. In point form:
- General Details -
o I am performing a simulation with one parent domain (@25km) and three
child domains (@12.5km); an illustrative namelist excerpt follows this list
o I am able to run just the parent domain without problem on 2 CPUs with
4 cores each, i.e. 8 MPI processes, on a single computer.
o I can run the parent domain on at least 30-odd cores without problem,
using MPI over a network. --> no nests, no worries
o When I increase max_dom to include one to three child domains, the
simulations run fine on a single core. --> no MPI, no worries
o As soon as I increase the number of cores, simulation success becomes
less likely. --> nests + MPI = worries
o The strange thing is, when a run succeeds with, say, two cores and I
then increase this to three cores, WRF will crash. Upon returning to
two cores, that same simulation will no longer run, and this without my
touching any other configuration aspect! Success is highly unpredictable.
o When WRF crashes, it is most often in the radiation routines, but
sometimes in cumulus; this too is highly unpredictable.
o Successive runs always crash at the same timestep and in the same routine.
o Timestep values for the parent and child domains are very
conservative and, I will add, have been shown to work well when run
without MPI.
o Many combinations of physics and dynamics options have been trialled,
to no avail. I note again that the chosen options run fine without MPI.
o I have tried several widths for the boundary-condition relaxation
zones; a wider relaxation zone does seem to increase the chance of
success, but this is hard to verify.
o No CFL warnings appear in the rsl log files; the crashes are abrupt
and take the form of a segmentation fault while processing a child
domain, never the parent domain.
o The only hint I have seen in the output files is the TSK field becoming
NaN over land inside the child domain. This does not occur 100% of the
time, however.
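For clarity, here is a minimal sketch of the kind of nest and boundary
configuration I am describing; the values are illustrative placeholders
rather than my exact namelist.input:

 &domains
  time_step              = 120,            ! conservative for the 25km parent
  max_dom                = 4,              ! 1 parent + 3 children
  parent_grid_ratio      = 1, 2, 2, 2,     ! 25km parent, 12.5km nests
  parent_time_step_ratio = 1, 2, 2, 2,
 /
 &bdy_control
  spec_bdy_width         = 5,              ! = spec_zone + relax_zone
  spec_zone              = 1,
  relax_zone             = 4,              ! widened in some of my tests
  specified              = .true.,
  nested                 = .false., .true., .true., .true.,
 /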
It would thus appear to be an MPI or compiler issue rather than a WRF
problem. That said, it is only the combination of nests AND MPI that
causes problems, not one or the other alone. Could it be RSL?
Does anyone have any debugging ideas, or even just general approaches to
try to find the culprit?
Are there any MPI parameters that could be adjusted?
- Technical Details -
o Using OpenMPI 1.4.3
o Aiming to use WRFV3.3, but have also tried v3.2.1
o EM/ARW core
o Compilers are ifort and icc v10.1
o Have tried compiling with -O0, -O2 and -O3, with thorough cleaning
each time (the rebuild cycle is sketched just after this list)
o GFS boundary conditions, WPSV3.3. No obvious problems to report here.
geo*.nc and met_em* appear fine.
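For reference, the rebuild cycle I use when changing optimisation levels
is essentially the standard one; the em_real case and the dmpar
(ifort/icc) configure option are shown here just for illustration:

 ./clean -a        # thorough clean; this also removes configure.wrf
 ./configure       # re-select the dmpar (ifort/icc) option
 # edit FCOPTIM in the regenerated configure.wrf, e.g. FCOPTIM = -O0
 # (FCDEBUG can also be uncommented there to add -g for debugging)
 ./compile em_real >& compile.log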
Thank you for any help you may be able to give.