[Wrf-users] WRF restart runs sporadically hanging up

Zulauf, Michael Michael.Zulauf at iberdrolaren.com
Fri Jan 31 11:46:25 MST 2014


Hey folks - for my work I have WRF/ARW 3.3.1 running in an operational
mode for forecasting purposes.  We use GFS as the primary forcing data,
and we begin our runs once we have 24 hours of GFS available (in order
to get things underway quickly).  Once that portion is complete, and
assuming we have the remainder of our desired GFS data available, we
restart the run and continue our forecast (after running the usual WPS
pre-processing, etc.).

 

Generally, this works quite well.  Maybe once a week (i.e., once every 30
runs or so, since we run 4 runs a day, 7 days a week), the job hangs up
shortly after beginning the restart.  The job typically outputs the
wrfout files for the first time step, and sends significant output to
the rsl.error/rsl.out files.  The processes still show as running, but
no further output of any kind appears, and eventually the queue/batch
system kills the job once it goes beyond the allocated time.  There are
no error messages in any logs or rsl files or anywhere else I've seen.
On rerunning the job, it nearly always runs to completion without problem.

 

Anyone seen this sort of thing?  Until recently, I didn't see this as a
major problem, because by far the most important data was within the
first 24 hours.  Lately our business needs are making the longer-lead
forecasts more valuable than they were.  I suppose I can put in
additional job-monitoring machinery, and attempt to "restart the
restart" if it hangs up, but obviously I'd like to minimize the
incidence in the first place.
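
A minimal sketch of what such job-monitoring machinery might look like,
assuming that a hung run can be recognized by its log files no longer
being updated.  Everything here is illustrative and not part of WRF: the
function names, the stall threshold, and the choice of watched files
(e.g. rsl.error.0000) are assumptions, and the launch command is a
placeholder.

```python
import os
import subprocess
import sys
import time


def newest_mtime(paths):
    """Return the latest modification time among existing paths, or None."""
    times = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    return max(times) if times else None


def is_stalled(paths, stall_seconds, started, now=None):
    """True if no watched file changed within stall_seconds.

    Modification times older than `started` are ignored, so stale logs
    left over from the first leg of the run do not trigger a false alarm.
    """
    now = time.time() if now is None else now
    last = newest_mtime(paths)
    reference = started if last is None else max(last, started)
    return (now - reference) > stall_seconds


def run_with_watchdog(cmd, watch_paths, stall_seconds,
                      poll_seconds=30, max_attempts=2):
    """Run cmd, killing and rerunning it if the watched files go quiet.

    Returns the exit code of the first attempt that finishes on its own,
    or None if every attempt was killed as stalled.
    """
    for _ in range(max_attempts):
        started = time.time()
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll_seconds)
            if proc.poll() is None and is_stalled(
                    watch_paths, stall_seconds, started):
                # Assumption: killing the launcher also stops all ranks.
                proc.kill()
                proc.wait()
                break
        else:
            # Loop ended without a break: the process exited by itself.
            return proc.returncode
    return None


if __name__ == "__main__":
    # Hypothetical launch command and threshold; adjust for the real job.
    code = run_with_watchdog(["mpirun", "./wrf.exe"],
                             ["rsl.error.0000"], stall_seconds=900)
    sys.exit(1 if code is None else code)
```

Under a queue/batch system, the kill step would more likely be a call
to the scheduler's own cancel command, and the stall threshold would
need to be longer than the slowest legitimate gap between log updates
(for example, while history files are being written).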

 

I've got WRF 3.5.1 running experimentally, but not enough yet to see if
that helps.  We try not to change our WRF version too frequently (or
other components), since this is an operational system, and even slight
changes can change behavior, skewing statistics, etc.  But if it helped,
I'd consider it.

 

Since it always seems to happen at the same point (after restart, after
first wrfouts, before additional time stepping), I doubt it's a hardware
or systems issue.  Maybe some infrequently triggered race condition or
similar?

 

Other details:

    PGI 10.6
    netcdf-4.1.3
    mvapich2-1.7

 

Thoughts?  Thanks,

Mike

 

-- 

Mike Zulauf

Meteorologist, Lead Senior

Operational Meteorology 

Iberdrola Renewables

1125 NW Couch, Suite 700

Portland, OR 97209

Office: 503-478-6304  Cell: 503-913-0403

 

