[Wrf-users] Unpredictable crashes - MPI/RSL/Nest related? (Scott)

Creighton, Glenn A Civ USAF AFWA 16 WS/WXN Glenn.Creighton at offutt.af.mil
Thu Aug 25 13:57:28 MDT 2011


I have a similar problem with version 3.3, but not with version 3.2.  It may be related to the issue you are experiencing.  WRF will seg fault somewhere in a call to alloc_space_field or collect_on_comm; debugging shows me that in these cases it's dying in the MPI code (calling libmpi.so.0 -> libopen-pal.so.0 -> mca_btl_openib.so).  It seems to die in a different place every time.  Sometimes it will just hang while creating the first wrfout file for d02.  It dies more frequently with nested runs.  I'm running Open MPI 1.4.2.

I can run it 5 times and it will die 4 different ways.
1. module_comm_dm.f90:812 -> c_code.c:627 -> libmpi.so.0:?? -> libopen-pal.so.0:?? -> mca_btl_openib.so:?? libmlx4-rdav2.so:??

2. module_comm_nesting_dm:11793 -> c_code.c:627 -> libmpi.so.0:?? -> libopen-pal.so.0:?? -> mca_btl_openib.so:?? libmlx4-rdav2.so:?? -> libpthread.so.0:??

3. Hung writing wrfout_d02

4. mediation_integrate.f90:234 -> wrf_ext_read_field.f90:130 -> module_io.f90:14873 -> module_io.f90:15043 -> module_io.f90:16177 ... -> ... -> libpthread.so.0:??
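
The crash paths above can be compared mechanically. What follows is a minimal sketch in Python, not part of the original report: the frame lists are transcribed by hand from crash paths 1 and 2 (line numbers dropped), and `common_frames` is a hypothetical helper name. It reports which frames the two traces share.

```python
# Frames transcribed from crash paths 1 and 2 above (line numbers dropped).
TRACES = {
    1: ["module_comm_dm.f90", "c_code.c", "libmpi.so.0", "libopen-pal.so.0",
        "mca_btl_openib.so", "libmlx4-rdav2.so"],
    2: ["module_comm_nesting_dm", "c_code.c", "libmpi.so.0", "libopen-pal.so.0",
        "mca_btl_openib.so", "libmlx4-rdav2.so", "libpthread.so.0"],
}


def common_frames(traces):
    """Return the frames present in every trace, in the order of the first trace."""
    lists = list(traces.values())
    shared = set(lists[0]).intersection(*lists[1:])
    return [frame for frame in lists[0] if frame in shared]


if __name__ == "__main__":
    print(" -> ".join(common_frames(TRACES)))
```

For these two traces, the shared frames are c_code.c followed by the libmpi / libopen-pal / mca_btl_openib / libmlx4 chain. That is consistent with the failure sitting in the MPI transport layer rather than in one particular WRF routine, although two traces are too few to prove it.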

I'm trying to work with the folks at NCAR on this right now. It's a weird bug that seems very machine/compiler dependent (I'm also running this on a Linux box with ifort/icc).  The same code works just fine on our AIX machine and on another Linux box we have here.  Very strange bug.

-----Original Message-----
From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On Behalf Of wrf-users-request at ucar.edu
Sent: Thursday, August 25, 2011 1:00 PM
To: wrf-users at ucar.edu
Subject: Wrf-users Digest, Vol 84, Issue 14

Today's Topics:

   1. Postdoctoral Researcher in hydro-climate predictions and
      impact assessments (Ashfaq, Moetasim)
   2. Unpredictable crashes - MPI/RSL/Nest related? (Scott)
   3. plotting sigma lalyers with rip (Ulas IM)


Message: 1
Date: Wed, 24 Aug 2011 10:28:17 -0400
From: "Ashfaq, Moetasim" <mashfaq at ornl.gov>
Subject: [Wrf-users] Postdoctoral Researcher in hydro-climate
	predictions and impact assessments
To: "wrf-users at ucar.edu" <wrf-users at ucar.edu>
Message-ID: <695D6190-40FD-4D98-96D3-026C73FD83E1 at ornl.gov>
Content-Type: text/plain; charset=us-ascii

Postdoctoral Researcher in hydro-climate predictions and impact assessments at Oak Ridge National Lab
The Computational Earth Sciences group of the Computer Science and Mathematics Division at Oak Ridge National Laboratory seeks to hire a Post Doctoral Researcher to participate in research on understanding the roles of natural and anthropogenic forcing in near-term decadal-scale regional hydro-climatic variability over the continental United States and South Asia. In addition, the research will also focus on the projection of potential impacts of decadal-scale regional hydro-climatic variability on energy, water resources and associated critical infrastructures. This research will use a suite of Earth system models and statistical techniques to downscale predictions from a multi-model ensemble of IPCC-AR5 GCMs to an ultra-high horizontal resolution of 4 km over the United States and South Asia.
The successful candidate will be expected to 1) develop and perform experiments with regional and hydrological models on Oak Ridge Leadership Computing Facility (OLCF), 2) present the research at national and international conferences, and 3) report results in peer reviewed journals, technical manuals, and conference proceedings.
This position requires a PhD in Atmospheric and Hydrological Sciences or a related field within the past five years from an accredited college or university. Candidate is expected to have a strong understanding of North American climate and/or South Asian monsoon system. Experience in the use and application of a regional climate model and/or a hydrological model, and ability to perform advanced data analysis on large datasets is required. Excellent interpersonal skills, oral and written communications skills, organizational skills, and strong personal motivation are necessary. Ability to work effectively and contribute to a dynamic, team environment is required. Ability to assimilate new concepts and adapt to a rapidly evolving scientific and computational environment is necessary. Experience with numerical methods, parallel algorithms, MPI, FORTRAN, C, C++, and parallel software development on large scale computational resources will be an advantage.
We anticipate this to be a two-year position, dependent on continuing funding. Applications will be accepted until the position is filled.
Technical Questions: For more information about this position, please contact Dr. Moetasim Ashfaq (mashfaq at ornl.gov). Please reference this position title in your correspondence.
Interested candidates should apply online: https://www3.orau.gov/ORNL_TOppS/Posting/Details/185
Please refer to the following link for the application requirements:
This appointment is offered through the ORNL Postgraduate Research Participation Program and is administered by the Oak Ridge Institute for Science and Education (ORISE). The program is open to all qualified U.S. and non-U.S. citizens without regard to race, color, age, religion, sex, national origin, physical or mental disability, or status as a Vietnam-era veteran or disabled veteran.


Message: 2
Date: Wed, 24 Aug 2011 17:22:55 +0200
From: Scott <scott.rowe at globocean.fr>
Subject: [Wrf-users] Unpredictable crashes - MPI/RSL/Nest related?
To: wrf-users at ucar.edu
Message-ID: <4E55174F.9030306 at globocean.fr>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hello all,

I would like to know if others have come across this problem. The best I 
can do is give a general description because it is quite unpredictable. 
In point form:

- General Details -

o I am performing a simulation with one parent domain (@25km) and three 
child domains (@12.5km)
o I am able to run just the parent domain without problem on 2 CPUs with 
4 cores each, ie 8 threads using MPI for communications, in a single 
o I can run the parent domain on at least 30 odd cores without problem, 
using MPI over a network. --> no nests, no worries
o When I increase maxdom to include from one to three child domains, the 
simulations will work fine when run on a single core. --> no MPI, no worries
o As soon as I increase the number of cores, simulation success becomes 
less likely. --> nests + MPI = worries
o The strange thing is that when it performs correctly with, say, two
cores and I increase this to three cores, WRF will crash. Upon returning
to two cores, the simulation will no longer function, and this without
touching any other configuration aspect! Success is highly unpredictable.
o When WRF crashes, it is most often in radiation routines, but 
sometimes in cumulus, this is also highly unpredictable.
o Successive runs always crash at the same timestep and in the same routine.
o Timestep values for the parent domain and child domains are very
conservative, and I will add that they are also shown to function well
when run without MPI.
o Many combinations of physics and dynamics options have been trialled 
to no avail. I note again that the options chosen, when run without MPI, 
run fine.
o I have tried several configurations for the widths of relaxation zones 
for boundary conditions, a wider relaxation does seem to increase the 
chance of success, but this is hard to verify.
o No CFL warnings appear in the rsl log files, the crashes are brusque 
and take the form of a segmentation fault whilst treating a child 
domain, never in the parent domain.
o The only hint I have seen in output files is the TSK field becoming 
NaN over land inside the child domain. This does not occur 100% of the 
time however.
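
The TSK check in the last point can be automated once the field is in memory. What follows is a minimal sketch in Python, assuming the skin-temperature field has already been read from a wrfout file into a 2-D list of floats. The reading step is not shown and depends on whichever netCDF reader is available. `nan_cells` is a hypothetical helper name and the sample values are made up.

```python
import math


def nan_cells(field):
    """Return (row, column) indices of every NaN in a 2-D list of floats."""
    return [(j, i)
            for j, row in enumerate(field)
            for i, value in enumerate(row)
            if math.isnan(value)]


if __name__ == "__main__":
    # Made-up 3x3 sample standing in for TSK on a child domain.
    sample = [[288.1, 289.0, 287.5],
              [288.4, float("nan"), 287.9],
              [float("nan"), 288.8, 288.2]]
    print(nan_cells(sample))
```

Applied to each output time of each child domain, a check like this would show when and where the first NaN appears, which can then be compared with the timestep of the crash.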

It would thus appear to be an MPI or compiler issue rather than WRF. This 
said, it is only the combination of nests AND MPI that causes problems, 
not one or the other alone. Could it be RSL?

Does anyone have any debugging ideas, even just general approaches to 
try and find the culprit?
Any MPI parameters that could be adjusted?
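
One general approach to the questions above is to make the failure rate measurable before changing anything else. What follows is a minimal sketch in Python, assuming WRF is launched as `mpirun -np N ./wrf.exe` from the run directory. The executable path, the repeat count and the helper names are assumptions, not taken from the posts. The optional `--mca btl ^openib` arguments ask Open MPI not to use its openib transport. This is offered only as a diagnostic, because the reply at the top of this thread reports crashes inside mca_btl_openib.so on one machine. Whether it applies to another cluster is unknown.

```python
import subprocess
from collections import Counter


def mpirun_command(nprocs, exclude_openib=False, exe="./wrf.exe"):
    """Build an Open MPI launch command as an argument list."""
    cmd = ["mpirun", "-np", str(nprocs)]
    if exclude_openib:
        # Diagnostic only: disable the openib BTL so another transport is used.
        cmd += ["--mca", "btl", "^openib"]
    return cmd + [exe]


def run_once(nprocs, exclude_openib=False):
    """Run WRF once and return the exit code reported by mpirun."""
    return subprocess.run(mpirun_command(nprocs, exclude_openib)).returncode


def tally(core_counts, repeats, runner=run_once):
    """Count exit codes per core count over repeated, otherwise identical runs."""
    results = {}
    for n in core_counts:
        results[n] = Counter(runner(n) for _ in range(repeats))
    return results


# Example (this launches WRF for real, so it is left commented out):
# for n, counts in tally([1, 2, 3, 4], repeats=5).items():
#     print(n, dict(counts))
```

A hang while writing output would block this loop; `subprocess.run` accepts a `timeout` argument for that case. The exit code is also a crude signal, so checking the rsl logs for the "SUCCESS COMPLETE WRF" line after each run would be a more reliable test.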

- Technical Details -

o Using OpenMPI 1.4.3
o Aiming for WRFV3.3 use but have tried v3.2.1 also
o EM/ARW core
o Compiler is ifort and icc v10.1
o Have tried compiling with -O0, -O2 and -O3 with thorough cleaning 
each time
o GFS boundary conditions, WPSV3.3. No obvious problems to report here. 
geo*.nc and met_em* appear fine.

Thank you for any help you may be able to give.


Message: 3
Date: Wed, 24 Aug 2011 13:56:57 +0300
From: Ulas IM <ulasim at chemistry.uoc.gr>
Subject: [Wrf-users] plotting sigma lalyers with rip
To: wrf-users at ucar.edu
	<CAE7P9+=fJGQjGiWGT8K-w0eR=3TmiGLDxVPuz9FFdiwUsEw65g at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dear users

I am trying to plot the terrain following sigma layers on a cross
section from a wrf output. Is there a default way to accomplish this?

thank you

Ulas IM, PhD
University of Crete
Department of Chemistry
Environmental Chemical Processes Laboratory (ECPL)
Voutes, Heraklion
Crete, Greece
E - mail: ulasim at chemistry.uoc.gr
Web: http://ulas-im.tr.gg
Phone: (+30) 2810 545162
Fax: (+30) 2810 545001


End of Wrf-users Digest, Vol 84, Issue 14
