[Wrf-users] WRF restart runs sporadically hanging up
Dominikus Heinzeller
climbfuji at ymail.com
Mon Feb 3 03:26:29 MST 2014
Hi Michael and Dmitry,
I encountered a similar problem with WRF/ARW 3.5 for the following combination, very similar to yours:
pgi64-12.10
netcdf-3.6.3 / netcdf-4.2.1.1
mvapich2-1.9
The problem went away when I used mpich-3.0.2 instead of mvapich2. With other compilers (gnu, intel), this version of mvapich2 worked fine with any netcdf version.
Dom
On 31/01/2014, at 11:25 pm, Dmitry N. Mikushin <maemarcus at gmail.com> wrote:
> Hi Michael,
>
> This looks like a nasty software bug that needs a state comparison.
> To get one, I'd try to modify the model code to signal that it has
> entered an abnormal state as early as possible. In this code I'd also
> put an infinite loop, so that an engineer can get back on the cluster
> and attach a debugger to the processes of the problematic run. Then,
> keeping the problematic run on hold, I'd start an identical run in
> another debugger instance and let it run to the point where the first
> run entered the fail path. As a result, you will have "bad" and "good"
> runs frozen at the point of the problem, and will be able to compare
> their states interactively through the debuggers.
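The hold-for-debugger idea above can be sketched roughly as follows. This is
Python for illustration only: WRF itself is Fortran, so a real version would
have to live in the model source. The name hold_for_debugger and the
release-file mechanism are assumptions made for this sketch, not anything
provided by WRF. The process reports its host and PID, then waits until
someone creates the release file.

```python
import os
import socket
import sys
import time


def hold_for_debugger(reason, release_path, poll_seconds=5.0, sleep=time.sleep):
    """Announce an abnormal state, then wait until release_path exists.

    Returns the number of polls performed before the release file appeared.
    All names here are hypothetical; nothing in this sketch is part of WRF.
    """
    # Tell the engineer where to attach: host name and process id.
    print(
        f"abnormal state ({reason}): host={socket.gethostname()} "
        f"pid={os.getpid()}; create {release_path} to continue",
        file=sys.stderr,
        flush=True,
    )
    polls = 0
    # Deliberately unbounded: the process stays alive for inspection
    # until the release file is created by hand.
    while not os.path.exists(release_path):
        polls += 1
        sleep(poll_seconds)
    return polls
```

In compiled code with a debugger attached, the release would more typically
be done by changing a loop variable from the debugger rather than by
creating a file.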
>
> - D.
>
>
> 2014-01-31 Zulauf, Michael <Michael.Zulauf at iberdrolaren.com>:
>> Hey folks - for my work I have WRF/ARW 3.3.1 running in an operational mode
>> for forecasting purposes. We use GFS as the primary forcing data, and we
>> begin our runs once we have 24 hours of GFS available (in order to get
>> things underway quickly). Once that portion is complete, and assuming we
>> have the remainder of our desired GFS data available, we restart the run and
>> continue our forecast (after running the usual WPS pre-processing, etc).
>>
>>
>>
>> Generally, this works quite well. Maybe once a week (i.e., once every 30 runs
>> or so; we run 4 runs a day, 7 days a week), the job hangs up shortly after
>> beginning the restart. The job typically outputs the wrfout files for the
>> first time step, and sends significant output to the rsl.error/rsl.out
>> files. The processes still show as running, just no further output of any
>> kind appears, and eventually the queue/batch system kills it once it goes
>> beyond the allocated time. There are no error messages in any logs or rsl
>> files or anywhere else I've seen. On rerunning the job, it nearly always
>> runs to completion without problem.
>>
>>
>>
>> Anyone seen this sort of thing? Until recently, I didn't see this as a
>> major problem, because by far the most important data was within the first
>> 24 hours. Lately our business needs are making the longer-lead forecasts
>> more valuable than they were. I suppose I can put in additional
>> job-monitoring machinery, and attempt to "restart the restart" if it hangs
>> up, but obviously I'd like to minimize the incidence in the first place.
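The job-monitoring idea above could start from a stall check like the one
below: a minimal sketch, assuming a hang shows up as output files that exist
but have stopped changing. The function names, the glob patterns and the
threshold are all assumptions for illustration. Killing and resubmitting the
job is left to the batch system's own tools.

```python
import glob
import os
import time


def newest_mtime(patterns):
    """Return the latest modification time among matching files, or None."""
    mtimes = [
        os.path.getmtime(path)
        for pattern in patterns
        for path in glob.glob(pattern)
    ]
    return max(mtimes) if mtimes else None


def looks_hung(patterns, stall_seconds, now=None):
    """True if matching files exist but none changed within stall_seconds."""
    latest = newest_mtime(patterns)
    if latest is None:
        return False  # nothing written yet, too early to judge
    now = time.time() if now is None else now
    return (now - latest) > stall_seconds
```

A wrapper script could poll something like
looks_hung(["rsl.out.*", "rsl.error.*", "wrfout_*"], stall_seconds=900) from
the run directory. The 900-second value is only a placeholder and would have
to exceed the longest legitimate gap between writes in a healthy run.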
>>
>>
>>
>> I've got WRF 3.5.1 running experimentally, but not for long enough yet to see
>> if that helps. We try not to change our WRF version too frequently (or other
>> components), since this is an operational system, and even slight changes
>> can change behavior, skewing statistics, etc. But if it helped, I'd
>> consider it.
>>
>>
>>
>> Since it always seems to happen at the same point (after restart, after
>> first wrfouts, before additional time stepping), I doubt it's a hardware or
>> systems issue. Maybe some infrequently triggered race condition or similar?
>>
>>
>>
>> Other details:
>> PGI 10.6
>> netcdf-4.1.3
>> mvapich2-1.7
>>
>>
>>
>> Thoughts? Thanks,
>>
>> Mike
>>
>>
>>
>> --
>> Mike Zulauf
>> Meteorologist, Lead Senior
>> Operational Meteorology
>> Iberdrola Renewables
>> 1125 NW Couch, Suite 700
>> Portland, OR 97209
>> Office: 503-478-6304 Cell: 503-913-0403
>>
>>
>> _______________________________________________
>> Wrf-users mailing list
>> Wrf-users at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>>