[Wrf-users] WRF 3.2 jobs hanging up sporadically on wrfout output

Don Morton Don.Morton at alaska.edu
Fri Apr 16 10:08:46 MDT 2010


I was having these sorts of problems with WRF 3.1.1 a few weeks ago on our
Sun Opteron cluster.  It was always hanging on the writing of wrfout,
typically on an inner nest, and it wasn't consistent from run to run.  I had
the luxury of being able to try these cases on other machines, and didn't
experience problems on those.
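
Because the hang is intermittent and the processes stay alive, it can help
to flag a stalled run automatically rather than watch for it by hand.  What
follows is only a minimal sketch, not anything shipped with WRF: it assumes
the rsl.* files sit in the run directory and that a healthy run keeps
appending to them.  The function names and the 30-minute threshold are
hypothetical, chosen for illustration.

```python
import os
import time
from pathlib import Path


def newest_mtime(run_dir, pattern="rsl.*"):
    """Return the latest modification time among matching files, or None."""
    mtimes = [p.stat().st_mtime
              for p in Path(run_dir).glob(pattern) if p.is_file()]
    return max(mtimes) if mtimes else None


def is_stalled(run_dir, max_quiet_seconds=1800, now=None):
    """True if no rsl.* file has been modified within max_quiet_seconds.

    The threshold is an assumption; pick something comfortably longer
    than the slowest legitimate gap between log writes in your runs.
    """
    last = newest_mtime(run_dir)
    if last is None:
        return False  # nothing written yet, so there is nothing to judge
    now = time.time() if now is None else now
    return (now - last) > max_quiet_seconds
```

Calling is_stalled(".") from cron or a loop in the job script would be one
way to use it.  It only reports the stall; it does not diagnose the cause.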

Our folks here suggested I turn off the MPI RDMA (Remote Direct Memory
Access) optimizations, which slowed performance substantially, but resolved
the issue.

It's been my experience over the years with WRF that these problems are
frequently resolved by turning off optimizations.

If you're using a Sun cluster, I can give you a little more info privately.
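
The truncated 32-byte wrfout file Mike describes below is an easy symptom to
scan for when comparing runs with and without the optimizations.  This is a
rough sketch under stated assumptions: the function name and the 1024-byte
cutoff are hypothetical, and a file that is still being written normally
may also show up as small for a moment.

```python
from pathlib import Path


def find_truncated_wrfout(run_dir, min_bytes=1024):
    """List (name, size) for wrfout files smaller than min_bytes.

    min_bytes is an assumed cutoff; the files in the quoted report are
    either 32 bytes (truncated) or on the order of 175MB (complete).
    """
    hits = []
    for path in Path(run_dir).glob("wrfout_d*"):
        if path.is_file():
            size = path.stat().st_size
            if size < min_bytes:
                hits.append((path.name, size))
    return sorted(hits)
```

An empty list means no suspiciously small wrfout files were found in that
directory.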

On Thu, Apr 15, 2010 at 1:49 PM, Zulauf, Michael <
Michael.Zulauf at iberdrolausa.com> wrote:

> Hi all,
>
> I'm trying to get WRF V3.2 running by utilizing a setup that I've
> successfully run with V3.1.1 (and earlier).  The configure/compile
> seemed to go fine using the same basic configuration details that have
> worked in the past.  When I look over the Updates in V3.2, I don't see
> anything problematic for me.
>
> We're running with four grids, nesting from 27km to 1km, initialized and
> forced with GFS output.  The nest initializations are delayed from the
> outer grid initialization by 3, 6, and 9 hours, respectively.  The 1km
> grid has wrfout (netcdf) output every 20 minutes, the other grids every
> hour.
>
> What I'm seeing is that the job appears to be running fine for some
> time, but eventually the job hangs up during wrfout output - usually on
> the finest grid - but not exclusively.  Changing small details (such as
> changing restart_interval) can make it run longer or shorter.  Sometimes
> even with no changes it will run a different length of time.
>
> I've got debug_level set to 300, so I get tons of output.  When it
> hangs, the wrf processes don't die, but all output stops.  There are no
> error messages or anything else that indicate a problem (at least none
> that I can find).  What I do get is a truncated (always 32 byte) wrfout
> file.  For example:
>
> -rw-r--r--  1 p20457 staff 32 Apr 15 13:02
> wrfout_d04_2009-12-14_09:00:00
>
> The wrfout's that get written before it hangs appear to be fine, with
> valid data.  frames_per_outfile is set to 1, so the files never get
> excessively large - maybe on the order of 175MB.  All of the previous
> versions of WRF that I've used continue to work fine on this hardware/OS
> combination (a cluster of dual-dual core Opterons, running CentOS) -
> just V3.2 has issues.
>
> Like I said, the wrf processes don't die, but all output ceases, even
> with the massive amount of debug info.  The last lines in the rsl.error
> and rsl.out files are always something of this type:
>
>  date 2009-12-14_09:00:00
>  ds             1            1            1
>  de             1            1            1
>  ps             1            1            1
>  pe             1            1            1
>  ms             1            1            1
>  me             1            1            1
>  output_wrf.b writing 0d real
>
> The specific times and variables being written vary, depending on
> when the job hangs.
>
> I haven't dug deeply into what's going on, but it seems like possibly
> some sort of race condition or communications deadlock or something.
> Does anybody have ideas of where I should go from here?  It seems to me
> like maybe something basic has changed with V3.2, and perhaps I need to
> adjust something in my configuration or setup.
>
> Thanks,
> Mike
>
> --
> Mike Zulauf
> Meteorologist
> Wind Asset Management
> Iberdrola Renewables
> 1125 NW Couch, Suite 700
> Portland, OR 97209
> Office: 503-478-6304  Cell: 503-913-0403
>



-- 
Arctic Region Supercomputing Center
http://www.arsc.edu/~morton/