[Wrf-users] Unpredictable crashes - MPI/RSL/Nest related? (Scott)
Bart Brashers
bbrashers at Environcorp.com
Fri Aug 26 12:18:24 MDT 2011
I have been running _almost_ the same setup as you, just using
- GFS (and NAM 12km) GRIB files for inits
- WRFv3.3
- Openmpi-1.4.3
- PGI 10.6-0
- CentOS 5.x (2.6.18-53.1.14 kernel) on a Rocks 5.0 system
- GigE interconnect.
I've not seen any similar problems. FWIW, here's how I compiled stuff:
# grep -A11 DMPARALLEL /usr/local/src/wrf/WRFV3.3-openmpi/configure.wrf
DMPARALLEL = 1
OMPCPP = # -D_OPENMP
OMP = # -mp -Minfo=mp -Mrecursive
OMPCC = # -mp
SFC = pgf90
SCC = gcc
CCOMP = pgcc
DM_FC = /usr/local/src/openmpi-1.4.3/bin/mpif90
DM_CC = /usr/local/src/openmpi-1.4.3/bin/mpicc -DMPI2_SUPPORT
FC = $(DM_FC)
CC = $(DM_CC) -DFSEEKO64_OK
LD = $(FC)
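(One sanity check worth doing, if you haven't already: make sure the
wrappers named in DM_FC/DM_CC really expand to the PGI compilers, so you
aren't mixing MPI installs. The OpenMPI wrappers will print what they
expand to; the paths below are mine, adjust for yours.)
# /usr/local/src/openmpi-1.4.3/bin/mpif90 --showme
# /usr/local/src/openmpi-1.4.3/bin/mpicc --showme
(The first word of the output should be pgf90/pgcc, and the -I/-L paths
should all point into the same OpenMPI tree.)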
# cat /usr/local/src/openmpi-1.4.3/my.configure
#!/bin/tcsh -f
setenv CC pgcc
setenv CFLAGS ''
setenv CXX pgCC
setenv CXXFLAGS ''
setenv FC pgf90
setenv FCFLAGS '-fast'
setenv FFLAGS '-O2'
setenv F90 pgf90
./configure --prefix=/usr/local/src/openmpi-1.4.3 --with-tm=/opt/torque --disable-ipv6 >&! my.configure.out
make all >&! make.out
make install >&! make.install.out
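One more generic thing to check: launch with the mpirun from the same
tree WRF was linked against, never with whatever happens to be first in
$PATH. A minimal launch script, sketched along the lines of my setup
(core count and hostfile name are just placeholders):
#!/bin/tcsh -f
# run from the WRFV3/run directory; adjust -np and the hostfile to taste
set MPIRUN = /usr/local/src/openmpi-1.4.3/bin/mpirun
$MPIRUN -np 8 --hostfile hosts.txt ./wrf.exe
Mixing an mpif90 from one OpenMPI build with an mpirun or libmpi.so from
another is an easy way to get exactly this kind of random death.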
Maybe there's something in there that's different, and that will help.
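Also, since the tracebacks in Glenn's message below all end up in
mca_btl_openib.so, one generic experiment (just a guess on my part, I'm
on plain GigE so I never load that BTL) is to take the openib transport
out of the picture and see whether the crashes follow it:
mpirun --mca btl ^openib -np 8 ./wrf.exe
mpirun --mca btl tcp,sm,self -np 8 ./wrf.exe
The first excludes the InfiniBand BTL, the second forces TCP plus shared
memory only. If the nested runs become stable that way, the finger
points at the openib/OFED stack rather than at WRF or RSL itself.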
Bart
> -----Original Message-----
> From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On Behalf Of Creighton, Glenn A Civ USAF AFWA 16 WS/WXN
> Sent: Thursday, August 25, 2011 12:57 PM
> To: wrf-users at ucar.edu
> Subject: Re: [Wrf-users] Unpredictable crashes - MPI/RSL/Nest related? (Scott)
>
> Scott,
>
> I have a similar problem with version 3.3, but not with version 3.2,
> and it may be related to the issue you are experiencing. WRF will
> either segfault somewhere in a call to alloc_space_field or
> collect_on_comm; debugging shows me that in these cases it's dying in
> the MPI code (calling libmpi.so.0 -> libopen-pal.so.0 ->
> mca_btl_openib.so). It seems to die in a different place every time.
> Sometimes it will just hang while creating the first wrfout file for
> d02. It dies more frequently with nested runs. I'm running OpenMPI
> 1.4.2.
>
> I can run it 5 times and it will die 4 different ways.
> 1. module_comm_dm.f90:812 -> c_code.c:627 -> libmpi.so.0:?? ->
> libopen-pal.so.0:?? -> mca_btl_openib.so:?? libmlx4-rdav2.so:??
>
> 2. module_comm_nesting_dm:11793 -> c_code.c:627 -> libmpi.so.0:?? ->
> libopen-pal.so.0:?? -> mca_btl_openib.so:?? libmlx4-rdav2.so:?? ->
> libpthread.so.0:??
>
> 3. Hung writing wrfout_d02
>
> 4. mediation_integrate.f90:234 -> wrf_ext_read_field.f90:130 ->
> module_io.f90:14873 -> module_io.f90:15043 -> module_io.f90:16177 ...
> -> ... -> libpthread.so.0:??
>
> I'm trying to work with the folks at NCAR on this right now. It's a
> weird bug that seems very machine/compiler dependent (I'm running this
> on a Linux box with ifort/icc as well; the same code works just fine
> on our AIX and on another Linux box we have here). Very strange bug.
> Glenn
>
>
> -----Original Message-----
> Date: Wed, 24 Aug 2011 17:22:55 +0200
> From: Scott <scott.rowe at globocean.fr>
> Subject: [Wrf-users] Unpredictable crashes - MPI/RSL/Nest related?
> To: wrf-users at ucar.edu
> Message-ID: <4E55174F.9030306 at globocean.fr>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hello all,
>
> I would like to know if others have come across this problem. The
> best I can do is give a general description, because it is quite
> unpredictable. In point form:
>
> - General Details -
>
> o I am performing a simulation with one parent domain (@25km) and
> three child domains (@12.5km).
> o I am able to run just the parent domain without problem on 2 CPUs
> with 4 cores each, i.e. 8 tasks using MPI for communications, on a
> single computer.
> o I can run the parent domain on at least 30-odd cores without
> problem, using MPI over a network. --> no nests, no worries
> o When I increase maxdom to include from one to three child domains,
> the simulations work fine when run on a single core. --> no MPI, no
> worries
> o As soon as I increase the number of cores, simulation success
> becomes less likely. --> nests + MPI = worries
> o The strange thing is, when it runs correctly with, say, two cores
> and I increase this to three cores, WRF will crash. Upon returning to
> two cores, the simulation will no longer work, and this without
> touching any other configuration aspect! Success is highly
> unpredictable.
> o When WRF crashes, it is most often in radiation routines, but
> sometimes in cumulus; this is also highly unpredictable.
> o Successive runs always crash at the same timestep and in the same
> routine.
> o Timestep values for the parent domain and child domains are very
> conservative, and I will add that they also work well when run
> without MPI.
> o Many combinations of physics and dynamics options have been
> trialled to no avail. I note again that the options chosen run fine
> when run without MPI.
> o I have tried several configurations for the widths of the
> relaxation zones for the boundary conditions; a wider relaxation zone
> does seem to increase the chance of success, but this is hard to
> verify.
> o No CFL warnings appear in the rsl log files; the crashes are abrupt
> and take the form of a segmentation fault while treating a child
> domain, never in the parent domain.
> o The only hint I have seen in the output files is the TSK field
> becoming NaN over land inside the child domain. This does not occur
> 100% of the time, however.
>
> It would thus appear to be an MPI or compiler issue rather than WRF.
> That said, it is only the combination of nests AND MPI that causes
> problems, not one or the other alone. Could it be RSL?
>
> Does anyone have any debugging ideas, even just general approaches to
> try and find the culprit?
> Any MPI parameters that could be adjusted?
>
>
> - Technical Details -
>
> o Using OpenMPI 1.4.3
> o Aiming to use WRFV3.3 but have tried v3.2.1 also
> o EM/ARW core
> o Compilers are ifort and icc v10.1
> o Have tried compiling with -O0, -O2 and -O3, with thorough cleaning
> each time
> o GFS boundary conditions, WPSV3.3. No obvious problems to report
> here; geo*.nc and met_em* appear fine.
>
> Thank you for any help you may be able to give.