[Wrf-users] WRF restart runs sporadically hanging up

Don Morton Don.Morton at alaska.edu
Mon Feb 3 10:53:57 MST 2014


I can confirm that in the distant past I had similar issues, which were
tracked down to the MPI stack; Open MPI seemed to be more stable at the
time.  These hangs typically happened during a scatter/gather type of
operation - reading a restart file or an LBC file, or even just writing
an output file - where task 0 does heavy communication with all the
other tasks.
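
As an illustration (a toy C/MPI sketch, not WRF's actual I/O code), the
pattern where these hangs tend to surface looks roughly like this:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int chunk = 1024;            /* points per task (illustrative) */
        double *global = NULL, *local;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        local = malloc(chunk * sizeof(double));

        if (rank == 0) {
            /* Task 0 reads the whole field from the restart/lbc/output
               file (actual file I/O omitted here)... */
            global = malloc((size_t)nprocs * chunk * sizeof(double));
            for (int i = 0; i < nprocs * chunk; i++)
                global[i] = (double)i;
        }

        /* ...and then hands one chunk to every other task.  With a flaky
           MPI stack, a hang typically shows up inside a collective like
           this one, with no error message at all. */
        MPI_Scatter(global, chunk, MPI_DOUBLE,
                    local,  chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("scatter of %d x %d values complete\n", nprocs, chunk);

        free(global);
        free(local);
        MPI_Finalize();
        return 0;
    }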

Voice:  +1 907 450 8679
Arctic Region Supercomputing Center
http://weather.arsc.edu/
http://people.arsc.edu/~morton/


On Mon, Feb 3, 2014 at 10:26 AM, Dominikus Heinzeller
<climbfuji at ymail.com> wrote:

> Hi Michael and Dmitry,
>
> I encountered a similar problem with WRF/ARW 3.5 using the following
> combination, very close to yours:
>
> pgi64-12.10
> netcdf-3.6.3 / netcdf-4.2.1.1
> mvapich2-1.9
>
> The problem went away when using mpich-3.0.2 instead of mvapich2.  For
> other compilers (GNU, Intel), this version of mvapich2 worked fine with
> any netcdf version.
>
> Dom
>
> On 31/01/2014, at 11:25 pm, Dmitry N. Mikushin <maemarcus at gmail.com>
> wrote:
>
> > Hi Michael,
> >
> > This looks like a nasty software bug that calls for a state comparison.
> > To get one, I'd modify the model code to signal entry into the abnormal
> > state as early as possible.  At that point I'd also put in an infinite
> > loop, so that an engineer can get back on the cluster and attach a
> > debugger to the processes of the problematic run.  Then, keeping the
> > problematic run on hold, I'd start an identical run under another
> > instance of the debugger and let it work up to the point where the
> > first run entered the fail path.  As a result, you will have the "bad"
> > and "good" runs frozen at the point of the problem, and you will be
> > able to compare their states interactively through the debuggers.
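> >
> > A minimal sketch of such a hold-for-attach hook (plain C just for
> > illustration; the routine name is made up, this is not actual WRF code):
> >
> >     #include <stdio.h>
> >     #include <unistd.h>
> >
> >     /* Call this where the model first detects the abnormal state.
> >        Each affected rank reports its host and PID, then spins until a
> >        debugger attached to that process sets 'hold' to 0. */
> >     void hold_for_debugger(void)
> >     {
> >         volatile int hold = 1;       /* flip to 0 from the debugger */
> >         char host[256];
> >
> >         gethostname(host, sizeof(host));
> >         fprintf(stderr, "abnormal state: attach debugger, host=%s pid=%d\n",
> >                 host, (int)getpid());
> >         while (hold)
> >             sleep(5);                /* keep the run parked here */
> >     }
> >
> > Once attached (e.g. "gdb -p <pid>"), "set var hold = 0" releases the
> > rank after its state has been inspected.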
> >
> > - D.
> >
> >
> > 2014-01-31 Zulauf, Michael <Michael.Zulauf at iberdrolaren.com>:
> >> Hey folks - for my work I have WRF/ARW 3.3.1 running in an operational
> >> mode for forecasting purposes.  We use GFS as the primary forcing data,
> >> and we begin our runs once we have 24 hours of GFS available (in order
> >> to get things underway quickly).  Once that portion is complete, and
> >> assuming we have the remainder of our desired GFS data available, we
> >> restart the run and continue our forecast (after running the usual WPS
> >> pre-processing, etc.).
> >>
> >>
> >>
> >> Generally, this works quite well.  Maybe once a week (i.e., once every
> >> 30 runs or so - we run 4 runs a day, 7 days a week), the job hangs
> >> shortly after beginning the restart.  The job typically outputs the
> >> wrfout files for the first time step and sends significant output to
> >> the rsl.error/rsl.out files.  The processes still show as running, but
> >> no further output of any kind appears, and eventually the queue/batch
> >> system kills the job once it goes beyond the allocated time.  There are
> >> no error messages in any logs or rsl files or anywhere else I've seen.
> >> On rerunning the job, it nearly always runs to completion without
> >> problem.
> >>
> >>
> >>
> >> Has anyone seen this sort of thing?  Until recently I didn't see this
> >> as a major problem, because by far the most important data was within
> >> the first 24 hours.  Lately, our business needs are making the
> >> longer-lead forecasts more valuable than they were.  I suppose I could
> >> put in additional job-monitoring machinery and attempt to "restart the
> >> restart" if it hangs up, but obviously I'd like to minimize the
> >> incidence in the first place.
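> >>
> >> As a rough sketch of that watchdog idea (illustrative C; the log file
> >> name and timeout are assumptions, not anything WRF itself provides):
> >>
> >>     #include <stdio.h>
> >>     #include <time.h>
> >>     #include <sys/stat.h>
> >>
> >>     /* Exit nonzero if the rank-0 log has not been written to for a
> >>        while, so a wrapper can kill and resubmit the hung job. */
> >>     int main(void)
> >>     {
> >>         const char *log = "rsl.out.0000";  /* log WRF keeps appending to */
> >>         const int timeout = 1800;          /* 30 min of silence => assume hang */
> >>         struct stat st;
> >>
> >>         if (stat(log, &st) != 0) { perror(log); return 2; }
> >>         if (difftime(time(NULL), st.st_mtime) > timeout)
> >>             return 1;                      /* stalled: flag for resubmit */
> >>         return 0;                          /* still making progress */
> >>     }
> >>
> >> A cron or batch wrapper could run this periodically and resubmit the
> >> run whenever it returns 1.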
> >>
> >>
> >>
> >> I've got WRF 3.5.1 running experimentally, but not enough yet to see
> >> whether that helps.  We try not to change our WRF version (or other
> >> components) too frequently, since this is an operational system and
> >> even slight changes can alter behavior, skew statistics, etc.  But if
> >> it helped, I'd consider it.
> >>
> >>
> >>
> >> Since it always seems to happen at the same point (after restart, after
> >> first wrfouts, before additional time stepping), I doubt it's a
> >> hardware or systems issue.  Maybe some infrequently triggered race
> >> condition or similar?
> >>
> >>
> >>
> >> Other details:
> >>
> >>   PGI 10.6
> >>   netcdf-4.1.3
> >>   mvapich2-1.7
> >>
> >>
> >>
> >> Thoughts?  Thanks,
> >>
> >> Mike
> >>
> >>
> >>
> >> --
> >> Mike Zulauf
> >> Meteorologist, Lead Senior
> >> Operational Meteorology
> >> Iberdrola Renewables
> >> 1125 NW Couch, Suite 700
> >> Portland, OR 97209
> >> Office: 503-478-6304  Cell: 503-913-0403