I was having these sorts of problems with WRF 3.1.1 a few weeks ago on our Sun Opteron cluster. It was always hanging on the writing of wrfout, typically on an inner nest, and it wasn't consistent from run to run. I had the luxury of being able to try these cases on other machines, and didn't experience problems on those.<div>
<br></div><div>Our folks here suggested I turn off the MPI RDMA (Remote Direct Memory Access) optimizations, which slowed performance substantially but resolved the issue.</div><div><br></div><div>It's been my experience over the years with WRF that these problems are frequently resolved by turning off optimizations.</div>
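In case it helps, here's roughly how the workaround looks from the mpirun line, assuming an Open MPI-based stack (Sun HPC ClusterTools is built on Open MPI). These are standard Open MPI MCA parameters, though the exact knob that mattered on our system may differ from yours:

```shell
# Avoid the InfiniBand (openib) BTL entirely and fall back to TCP --
# slow, but it rules out RDMA-related hangs:
mpirun --mca btl self,sm,tcp -np 64 ./wrf.exe

# Or keep openib but disable the "leave pinned" RDMA optimization:
mpirun --mca mpi_leave_pinned 0 -np 64 ./wrf.exe
```

The `-np 64` and executable name are placeholders; substitute whatever your job script normally uses.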
<div><br></div><div>If you're using a Sun cluster, I can give you a little more info privately.</div><div><br><div class="gmail_quote">On Thu, Apr 15, 2010 at 1:49 PM, Zulauf, Michael <span dir="ltr"><<a href="mailto:Michael.Zulauf@iberdrolausa.com">Michael.Zulauf@iberdrolausa.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi all,<br>
<br>
I'm trying to get WRF V3.2 running by utilizing a setup that I've<br>
successfully run with V3.1.1 (and earlier). The configure/compile<br>
seemed to go fine using the same basic configuration details that have<br>
worked in the past. When I look over the Updates in V3.2, I don't see<br>
anything problematic for me.<br>
<br>
We're running with four grids, nesting from 27km to 1km, initialized and<br>
forced with GFS output. The nest initializations are delayed from the<br>
outer grid initialization by 3, 6, and 9 hours, respectively. The 1km<br>
grid has wrfout (netcdf) output every 20 minutes, the other grids every<br>
hour.<br>
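(For reference, that setup corresponds to namelist.input entries along these lines - values illustrative, per-column entries are d01 through d04:)

```
&time_control
 start_hour         = 00, 03, 06, 09,   ! nests delayed 3, 6, 9 hours
 history_interval   = 60, 60, 60, 20,   ! minutes: hourly, 20 min on d04
 frames_per_outfile = 1,  1,  1,  1,    ! one frame per wrfout file
/
&domains
 max_dom = 4,
/
```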
<br>
What I'm seeing is that the job appears to be running fine for some<br>
time, but eventually the job hangs up during wrfout output - usually on<br>
the finest grid - but not exclusively. Changing small details (such as<br>
the restart_interval) can make it run longer or shorter. Sometimes<br>
even with no changes it will run a different length of time.<br>
<br>
I've got debug_level set to 300, so I get tons of output. When it<br>
hangs, the wrf processes don't die, but all output stops. There are no<br>
error messages or anything else that indicate a problem (at least none<br>
that I can find). What I do get is a truncated (always 32 byte) wrfout<br>
file. For example:<br>
<br>
-rw-r--r-- 1 p20457 staff 32 Apr 15 13:02<br>
wrfout_d04_2009-12-14_09:00:00<br>
<br>
The wrfout's that get written before it hangs appear to be fine, with<br>
valid data. frames_per_outfile is set to 1, so the files never get<br>
excessively large - maybe on the order of 175MB. All of the previous<br>
versions of WRF that I've used continue to work fine on this hardware/OS<br>
combination (a cluster of dual-dual core Opterons, running CentOS) -<br>
just V3.2 has issues.<br>
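(A quick way to spot these header-only stubs after a run is to scan for wrfout files that are far too small to hold a frame. This is just a sketch of my own - the 1 KiB threshold and the glob pattern are arbitrary choices, not anything WRF provides:)

```python
import glob
import os

def find_truncated_wrfout(pattern="wrfout_d0?_*"):
    """Flag wrfout files that are suspiciously small.

    A hang during output can leave behind a file containing little
    more than the netCDF header (32 bytes in my case); anything far
    below a real frame size (~175MB here) is almost certainly truncated.
    """
    suspect = []
    for path in sorted(glob.glob(pattern)):
        size = os.path.getsize(path)
        if size < 1024:  # threshold is arbitrary, just well below a real frame
            suspect.append((path, size))
    return suspect
```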
<br>
Like I said, the wrf processes don't die, but all output ceases, even<br>
with the massive amount of debug info. The last lines in the rsl.error<br>
and rsl.out files are always something of this type:<br>
<br>
date 2009-12-14_09:00:00<br>
ds 1 1 1<br>
de 1 1 1<br>
ps 1 1 1<br>
pe 1 1 1<br>
ms 1 1 1<br>
me 1 1 1<br>
output_wrf.b writing 0d real<br>
<br>
The specific times and variables being written vary, depending on<br>
when the job hangs.<br>
<br>
I haven't dug deeply into what's going on, but it seems like possibly<br>
some sort of race condition or communications deadlock or something.<br>
Does anybody have ideas of where I should go from here? It seems to me<br>
like maybe something basic has changed with V3.2, and perhaps I need to<br>
adjust something in my configuration or setup.<br>
<br>
Thanks,<br>
Mike<br>
<br>
--<br>
Mike Zulauf<br>
Meteorologist<br>
Wind Asset Management<br>
Iberdrola Renewables<br>
1125 NW Couch, Suite 700<br>
Portland, OR 97209<br>
Office: 503-478-6304 Cell: 503-913-0403<br>
<br>
<br>
_______________________________________________<br>
Wrf-users mailing list<br>
<a href="mailto:Wrf-users@ucar.edu">Wrf-users@ucar.edu</a><br>
<a href="http://mailman.ucar.edu/mailman/listinfo/wrf-users" target="_blank">http://mailman.ucar.edu/mailman/listinfo/wrf-users</a><br>
</blockquote></div><br><br clear="all"><br>-- <br>Arctic Region Supercomputing Center<br><a href="http://www.arsc.edu/~morton/">http://www.arsc.edu/~morton/</a><br>
</div>