[Wrf-users] WRF restart runs sporadically hanging up

Dominikus Heinzeller climbfuji at ymail.com
Mon Feb 3 03:26:29 MST 2014


Hi Michael and Dmitry,

I encountered a similar problem with WRF/ARW 3.5 using a software stack very close to yours:

pgi64-12.10
netcdf-3.6.3 / netcdf-4.2.1.1
mvapich2-1.9

The problem went away when I switched from mvapich2 to mpich-3.0.2. With other compilers (GNU, Intel), this version of mvapich2 worked fine with any netcdf version.

Dom

On 31/01/2014, at 11:25 pm, Dmitry N. Mikushin <maemarcus at gmail.com> wrote:

> Hi Michael,
> 
> This looks like a nasty software bug that needs a state comparison.
> To get one, I'd modify the model code to signal that it has entered
> an abnormal state as early as possible. In that code path I'd also
> put an infinite loop, so that an engineer can get back on the cluster
> and attach a debugger to the processes of the problematic run. Then,
> keeping the problematic run on hold, I'd start an identical run in
> another debugger instance and let it proceed to the point where the
> first run entered the failure path. As a result, you will have "bad"
> and "good" runs frozen at the point of the problem, and will be able
> to compare their states interactively through the debuggers.
> 
> - D.
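
The "signal early, then park the process" idea above can be sketched as follows. WRF itself is Fortran, so this Python version only illustrates the pattern and is not code from WRF. The function name `hold_for_debugger`, the release-file convention, and the `max_polls` escape hatch are assumptions made for this sketch; in the real model the equivalent loop would sit wherever the abnormal state is first detected.

```python
import os
import socket
import time


def hold_for_debugger(release_file, poll_seconds=5.0, max_polls=None, log=print):
    """Park the calling process until `release_file` exists.

    Logs the host name and PID first, so that whoever comes back to the
    cluster knows where to attach a debugger. Returns the number of polls
    spent waiting. `max_polls=None` waits indefinitely, which is the
    behaviour wanted when holding a failed run (hypothetical helper).
    """
    log(f"abnormal state: host={socket.gethostname()} pid={os.getpid()} "
        f"waiting for {release_file}")
    polls = 0
    while not os.path.exists(release_file):
        if max_polls is not None and polls >= max_polls:
            break  # bounded wait, mainly useful for testing the hook itself
        time.sleep(poll_seconds)
        polls += 1
    return polls
```

Sleeping between checks keeps the parked process from burning a core while it waits, and creating the release file lets the run continue without killing it.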
> 
> 2014-01-31 Zulauf, Michael <Michael.Zulauf at iberdrolaren.com>:
>> Hey folks - for my work I have WRF/ARW 3.3.1 running in an operational mode
>> for forecasting purposes.  We use GFS as the primary forcing data, and we
>> begin our runs once we have 24 hours of GFS available (in order to get
>> things underway quickly.)  Once that portion is complete, and assuming we
>> have the remainder of our desired GFS data available, we restart the run and
>> continue our forecast (after running the usual WPS pre-processing, etc).
>> 
>> Generally, this works quite well.  Maybe once a week (i.e., once every 30 runs
>> or so; we run 4 runs a day, 7 days a week), the job hangs up shortly after
>> beginning the restart.  The job typically outputs the wrfout files for the
>> first time step, and sends significant output to the rsl.error/rsl.out
>> files.  The processes still show as running, just no further output of any
>> kind appears, and eventually the queue/batch system kills it once it goes
>> beyond the allocated time.  There are no error messages in any logs or rsl
>> files or anywhere else I've seen.  On rerunning the job, it nearly always
>> runs to completion without problem.
>> 
>> Anyone seen this sort of thing?  Until recently, I didn't see this as a
>> major problem, because by far the most important data was within the first
>> 24 hours.  Lately our business needs are making the longer-lead forecasts
>> more valuable than they were.  I suppose I can put in additional
>> job-monitoring machinery, and attempt to "restart the restart" if it hangs
>> up, but obviously I'd like to minimize the incidence in the first place.
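
A minimal sketch of what such "restart the restart" monitoring could look like, assuming that a log file such as rsl.error.0000 keeps growing while the model is healthy and goes quiet when it hangs. The function names, the thresholds, and the choice of progress file are illustrative assumptions, not part of WRF or of any batch system. Killing the launcher may not clean up every MPI rank on every cluster, so a production version would probably need site-specific handling.

```python
import os
import subprocess
import time


def seconds_since_update(path, now=None):
    """Seconds since `path` was last modified; infinity if it does not exist."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except OSError:
        return float("inf")


def run_with_watchdog(cmd, progress_file, stall_seconds, max_attempts=2,
                      poll_seconds=30.0):
    """Run `cmd`; kill and rerun it if `progress_file` stops being updated.

    Returns (exit_code, attempts_used). exit_code is None if every attempt
    was killed as stalled.
    """
    for attempt in range(1, max_attempts + 1):
        started = time.time()
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll_seconds)
            # Count idle time from the later of launch and last write, so a
            # stale file left by a previous attempt does not trip immediately.
            idle = min(seconds_since_update(progress_file),
                       time.time() - started)
            if idle > stall_seconds and proc.poll() is None:
                proc.kill()
                proc.wait()
                break  # stalled: fall through to the next attempt
        else:
            # The loop ended because the process exited on its own.
            return proc.returncode, attempt
    return None, max_attempts
```

A hypothetical call would be `run_with_watchdog(["mpirun", "-np", "64", "./wrf.exe"], "rsl.error.0000", stall_seconds=1800)`; the launcher arguments and the 30-minute threshold are placeholders to adapt to the actual job.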
>> 
>> I've got WRF 3.5.1 running experimentally, but not enough yet to see if that
>> helps.  We try not to change our WRF version too frequently (or other
>> components), since this is an operational system, and even slight changes
>> can change behavior, skewing statistics, etc.  But if it helped, I'd
>> consider it.
>> 
>> Since it always seems to happen at the same point (after restart, after
>> first wrfouts, before additional time stepping), I doubt it's a hardware or
>> systems issue.  Maybe some infrequently triggered race condition or similar?
>> 
>> Other details:
>> 
>> PGI 10.6
>> netcdf-4.1.3
>> mvapich2-1.7
>> 
>> Thoughts?  Thanks,
>> 
>> Mike
>> 
>> --
>> Mike Zulauf
>> Meteorologist, Lead Senior
>> Operational Meteorology
>> Iberdrola Renewables
>> 1125 NW Couch, Suite 700
>> Portland, OR 97209
>> Office: 503-478-6304  Cell: 503-913-0403
>> 
>> _______________________________________________
>> Wrf-users mailing list
>> Wrf-users at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>> 