[Wrf-users] WRF 3.2 jobs hanging up sporadically on wrfout output
Michael.Zulauf at iberdrolausa.com
Fri Apr 16 11:11:22 MDT 2010
Thanks for the response, Don.
The specific RDMA suggestion isn't relevant to our case (our hardware
doesn't support it), but you may be right that this is an
optimization-related issue. I'll probably try playing with optimizations
next. I've got the same settings that have worked for previous versions -
but perhaps something in the new code has made one of the settings
problematic.
Regarding the suggestions I've been getting relating to
WRFIO_NCD_LARGE_FILE_SUPPORT - I don't think that's the problem. I'm
splitting my output into single frame files to keep the file size small.
I may try that also, just for the heck of it.
Based on the sporadic nature of this (sometimes it happens, sometimes it
doesn't, when it hangs seems fairly random), I suspect it's some type of
timing issue like a race condition. If I can't get it working, I may
just drop back to 3.1.1, at least until 3.2.1 comes out. ;-)
From: Don Morton [mailto:Don.Morton at alaska.edu]
Sent: Friday, April 16, 2010 9:09 AM
To: Zulauf, Michael
Cc: wrf-users at ucar.edu
Subject: Re: [Wrf-users] WRF 3.2 jobs hanging up sporadically on wrfout
I was having these sorts of problems with WRF 3.1.1 a few weeks ago on
our Sun Opteron cluster. It was always hanging on the writing of
wrfout, typically on an inner nest, and it wasn't consistent from run to
run. I had the luxury of being able to try these cases on other
machines, and didn't experience problems on those.
Our folks here suggested I turn off the MPI RDMA (Remote Direct Memory
Access) optimizations, which slowed performance substantially, but
resolved the issue.
It's been my experience over the years with WRF that these problems are
frequently resolved if you turn off optimizations.
If you're using a Sun cluster, I can give you a little more info.
On Thu, Apr 15, 2010 at 1:49 PM, Zulauf, Michael
<Michael.Zulauf at iberdrolausa.com> wrote:
I'm trying to get WRF V3.2 running by utilizing a setup that I've
successfully run with V3.1.1 (and earlier). The configure/compile
seemed to go fine using the same basic configuration details that have
worked in the past. When I look over the Updates in V3.2, I don't see
anything problematic for me.
We're running with four grids, nesting from 27km to 1km, initialized and
forced with GFS output. The nest initializations are delayed from the
outer grid initialization by 3, 6, and 9 hours, respectively. The 1km
grid has wrfout (netcdf) output every 20 minutes, the other grids every
What I'm seeing is that the job appears to be running fine for some
time, but eventually the job hangs up during wrfout output - usually on
the finest grid - but not exclusively. Changing small details (such as
changing restart_interval) can make it run longer or shorter. Sometimes
even with no changes it will run a different length of time.
I've got debug_level set to 300, so I get tons of output. When it
hangs, the wrf processes don't die, but all output stops. There are no
error messages or anything else that indicate a problem (at least none
that I can find). What I do get is a truncated (always 32 byte) wrfout
file. For example:
-rw-r--r-- 1 p20457 staff 32 Apr 15 13:02
The wrfout's that get written before it hangs appear to be fine, with
valid data. frames_per_outfile is set to 1, so the files never get
excessively large - maybe on the order of 175MB. All of the previous
versions of WRF that I've used continue to work fine on this hardware/OS
combination (a cluster of dual dual-core Opterons, running CentOS) -
just V3.2 has issues.
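
Since the symptom described above is a wrfout file that stays at 32 bytes,
one way to spot an affected run is to scan the run directory for
undersized wrfout files. The following is only a minimal sketch: the
function name, the "wrfout*" glob pattern, and the 1024-byte threshold are
assumptions made for illustration, not anything taken from WRF itself.

```python
from pathlib import Path


def find_truncated_wrfout(run_dir, min_bytes=1024):
    """Return (name, size) pairs for wrfout* files smaller than min_bytes.

    min_bytes is an arbitrary illustrative threshold; a healthy file in
    the setup described above is on the order of 175MB, and a stuck one
    is 32 bytes.
    """
    suspects = []
    for path in sorted(Path(run_dir).glob("wrfout*")):
        if path.is_file():
            size = path.stat().st_size
            if size < min_bytes:
                suspects.append((path.name, size))
    return suspects


if __name__ == "__main__":
    # Scan the current directory and report any suspiciously small files.
    for name, size in find_truncated_wrfout("."):
        print(f"{name}: {size} bytes")
```

A file that is still being written will also be small for a moment, so a
hit only means something if the size stays unchanged across repeated
checks.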
Like I said, the wrf processes don't die, but all output ceases, even
with the massive amount of debug info. The last lines in the rsl.error
and rsl.out files are always something of this type:
ds 1 1 1
de 1 1 1
ps 1 1 1
pe 1 1 1
ms 1 1 1
me 1 1 1
output_wrf.b writing 0d real
The specific times and variables being written vary, depending on
when the job hangs.
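
Because the processes stay alive while all rsl output stops, a hang of
this kind can be detected from outside the job by checking whether any
rsl.* file has been modified recently. The sketch below assumes the rsl
files live in the run directory and that, with debug_level at 300, a quiet
period of several minutes is abnormal. The function names and the
600-second threshold are hypothetical choices for illustration.

```python
import time
from pathlib import Path


def newest_rsl_mtime(run_dir):
    """Return the most recent modification time among rsl.* files, or None."""
    mtimes = [p.stat().st_mtime for p in Path(run_dir).glob("rsl.*")]
    return max(mtimes) if mtimes else None


def is_stalled(run_dir, max_idle_seconds, now=None):
    """True if no rsl.* file has been modified within max_idle_seconds.

    Returns False when no rsl files exist yet (the job may not have started).
    """
    newest = newest_rsl_mtime(run_dir)
    if newest is None:
        return False
    if now is None:
        now = time.time()
    return (now - newest) > max_idle_seconds


if __name__ == "__main__":
    # 600 seconds is an assumed threshold; tune it to the run's output rate.
    if is_stalled(".", max_idle_seconds=600):
        print("rsl output has stopped; the job may be hung")
```

This only reports that output has stopped. It says nothing about the
cause, so it is useful mainly for catching a hung job early rather than
for diagnosing it.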
I haven't dug deeply into what's going on, but it seems like possibly
some sort of race condition or communications deadlock or something.
Does anybody have ideas of where I should go from here? It seems to me
like maybe something basic has changed with V3.2, and perhaps I need to
adjust something in my configuration or setup.
Wind Asset Management
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304 Cell: 503-913-0403
Arctic Region Supercomputing Center