[Wrf-users] WRF restart runs sporadically hanging up
Dmitry N. Mikushin
maemarcus at gmail.com
Fri Jan 31 15:25:31 MST 2014
Hi Michael,
This looks like a nasty software bug, the kind that needs a state
comparison to track down. To get one, I'd modify the model code to
signal that it has entered an abnormal state as early as possible. At
that point I'd also put an infinite loop, so that an engineer can log
back in to the cluster and attach a debugger to the processes of the
problematic run. Then, keeping the problematic run on hold, I'd start
an identical run in another debugger instance and let it proceed to
the point where the first run entered the failure path. As a result,
you will have a "bad" and a "good" run frozen at the point of the
problem, and can compare their states interactively through the
debuggers.
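
The hold loop could look roughly like the sketch below. It is written
in C, which Fortran code can call through ISO_C_BINDING. The names
wrf_debug_hold and wrf_debug_release are made up for illustration and
are not part of WRF; deciding where to call it and how to detect the
abnormal state is left open. The max_polls argument is an extra I
added so the loop can be bounded during testing.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for gethostname() under strict -std modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical release flag: stays 0 until someone sets it from a debugger,
   e.g. in gdb:  set var wrf_debug_release = 1
   It is volatile so the compiler re-reads it on every loop iteration. */
volatile int wrf_debug_release = 0;

/*
 * Announce the abnormal state and park the calling process until released.
 *   reason        short text describing what looked wrong
 *   poll_seconds  sleep time between checks of the release flag
 *   max_polls     give up after this many checks; <= 0 means wait forever
 * Returns the number of polls spent waiting.
 */
int wrf_debug_hold(const char *reason, unsigned poll_seconds, int max_polls)
{
    char host[256] = "unknown";
    int polls = 0;

    assert(reason != NULL);

    /* Pass one byte less than the buffer so the string stays terminated. */
    if (gethostname(host, sizeof host - 1) != 0)
        snprintf(host, sizeof host, "unknown");

    /* Tell the engineer which node and process to attach to. */
    fprintf(stderr, "HOLD: %s (host %s, pid %ld), attach with: gdb -p %ld\n",
            reason, host, (long)getpid(), (long)getpid());
    fflush(stderr);

    /* Spin (sleeping between checks) until the flag is set or the
       optional poll limit is reached. */
    while (!wrf_debug_release && (max_polls <= 0 || polls < max_polls)) {
        sleep(poll_seconds);
        polls++;
    }
    return polls;
}
```

With max_polls <= 0 the process waits indefinitely. In an MPI job every
rank that calls the function prints its own host and pid. After
attaching with "gdb -p <pid>" and inspecting the state, the run can be
let go with "set var wrf_debug_release = 1" followed by "continue".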
- D.
2014-01-31 Zulauf, Michael <Michael.Zulauf at iberdrolaren.com>:
> Hey folks - for my work I have WRF/ARW 3.3.1 running in an operational mode
> for forecasting purposes. We use GFS as the primary forcing data, and we
> begin our runs once we have 24 hours of GFS available (in order to get
> things underway quickly). Once that portion is complete, and assuming we
> have the remainder of our desired GFS data available, we restart the run and
> continue our forecast (after running the usual WPS pre-processing, etc.).
>
> Generally, this works quite well. Maybe once a week (i.e., once every 30
> runs or so; we run 4 runs a day, 7 days a week), the job hangs shortly after
> beginning the restart. The job typically outputs the wrfout files for the
> first time step, and sends significant output to the rsl.error/rsl.out
> files. The processes still show as running, but no further output of any
> kind appears, and eventually the queue/batch system kills the job once it
> goes beyond the allocated time. There are no error messages in any logs or
> rsl files or anywhere else I've seen. On rerunning the job, it nearly always
> runs to completion without problem.
>
> Anyone seen this sort of thing? Until recently, I didn't see this as a
> major problem, because by far the most important data was within the first
> 24 hours. Lately our business needs are making the longer-lead forecasts
> more valuable than they were. I suppose I can put in additional
> job-monitoring machinery, and attempt to "restart the restart" if it hangs
> up, but obviously I'd like to minimize the incidence in the first place.
>
> I've got WRF 3.5.1 running experimentally, but not for long enough yet to
> see if that helps. We try not to change our WRF version (or other
> components) too frequently, since this is an operational system, and even
> slight changes can change behavior, skewing statistics, etc. But if it
> helped, I'd consider it.
>
> Since it always seems to happen at the same point (after restart, after
> first wrfouts, before additional time stepping), I doubt it's a hardware or
> systems issue. Maybe some infrequently triggered race condition or similar?
>
> Other details:
>
> PGI 10.6
> netcdf-4.1.3
> mvapich2-1.7
>
> Thoughts? Thanks,
>
> Mike
>
> --
> Mike Zulauf
> Meteorologist, Lead Senior
> Operational Meteorology
> Iberdrola Renewables
> 1125 NW Couch, Suite 700
> Portland, OR 97209
> Office: 503-478-6304 Cell: 503-913-0403
>
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>