[Wrf-users] WRF restart runs sporadically hanging up

Tue Apr 22 13:58:14 MDT 2014

Hello (again) all...

Back at the end of January, I posted the following question about my WRF
restarts occasionally hanging up (see original message below).  I got
several useful responses, but was only recently able to begin digging
into the problem further, and trying to address it.  I have gotten some
more information, which I hope will be useful.

Based on some of the responses, I've rebuilt WRF (same version) with
OpenMPI V1.4.3.  I'm still encountering the same restart problem, but
now it crashes (still infrequently) instead of hanging.  And the crash
yields the following message:
	MPI_ABORT was invoked on rank 52 in communicator MPI_COMM_WORLD with errorcode 1.

Having seen that, I then found this at the end of rsl.error.0052:
	d03 2014-04-23_12:00:00  alloc_space_field: domain             4 ,                  78220372  bytes allocated
	  RESTART: nest, opening wrfrst_d04_2014-04-23_12:00:00 for reading
	 d03 2014-04-23_12:00:00 Input data processed for aux input  10 for domain   3
	 WRF TILE   1 IS    141 IE    175 JS    167 JE    194
	 WRF NUMBER OF TILES =   1
	-------------- FATAL CALLED ---------------
	 normalize_basetime:  denominator of seconds cannot be negative
	 -------------------------------------------
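
For anyone wanting to check a whole run directory for the same kind of failure, here is a rough Python sketch that scans the per-rank rsl.error files for a FATAL block like the one above. It assumes the files are named rsl.error.NNNN and that the message is the first non-empty line after the "FATAL CALLED" marker, as in the excerpt; the function name find_fatal_messages is made up for illustration.

```python
import re
from pathlib import Path

FATAL_MARKER = "FATAL CALLED"
RSL_NAME = re.compile(r"rsl\.error\.(\d+)")

def find_fatal_messages(run_dir):
    """Return {rank: message} for each rsl.error.NNNN file with a FATAL block."""
    results = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        match = RSL_NAME.fullmatch(path.name)
        if match is None:
            continue  # skip anything that is not a per-rank rsl file
        lines = path.read_text(errors="replace").splitlines()
        for i, line in enumerate(lines):
            if FATAL_MARKER in line:
                # Assumption: the message is the first non-empty line after the marker.
                message = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
                results[int(match.group(1))] = message
                break
    return results
```

Calling find_fatal_messages(".") from the run directory returns a dictionary keyed by MPI rank, which makes it easy to see whether the same rank fails each time.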

Since switching to OpenMPI, both times this has happened it's been the
same process - #52 out of 72 total.  It's possible that this was
happening the same way with the other version of MPI I used, but I
didn't have the same clues to investigate as well.  It seems suspicious
to me that it was the same taskid, which appears to correspond with the
same geographical sub-region of my domain.  Perhaps something to do with
the topography?
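
To see which patch of the grid a given task covers, the "WRF TILE" lines in that task's rsl file can be parsed. The sketch below is only an illustration based on the line format shown in the excerpt above; the function name tile_extents is made up, and since an rsl file can contain tile lines for several domains, the sketch returns all of them in file order without trying to work out which domain each belongs to.

```python
import re
from pathlib import Path

# Matches lines such as:
#   WRF TILE   1 IS    141 IE    175 JS    167 JE    194
TILE_RE = re.compile(
    r"WRF TILE\s+(\d+)\s+IS\s+(\d+)\s+IE\s+(\d+)\s+JS\s+(\d+)\s+JE\s+(\d+)"
)

def tile_extents(rsl_path):
    """Return a list of (tile, is, ie, js, je) tuples in file order."""
    text = Path(rsl_path).read_text(errors="replace")
    return [tuple(int(g) for g in m.groups()) for m in TILE_RE.finditer(text)]
```

For the excerpt above, tile_extents("rsl.error.0052") would include the entry (1, 141, 175, 167, 194), i.e. i indices 141 to 175 and j indices 167 to 194.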

Does anybody have any thoughts about what causes this normalize_basetime
error, or how to fix it?

I'm going to continue investigating on my own, but hopefully somebody
can help point me in the right direction...

Thanks,
Mike

------------------------------------Original Message------------------------------------
Message: 1
Date: Fri, 31 Jan 2014 10:46:25 -0800
From: "Zulauf, Michael" <Michael.Zulauf at iberdrolaren.com>
Subject: [Wrf-users] WRF restart runs sporadically hanging up
To: <wrf-users at ucar.edu>
Message-ID: <B2A259FAA3CF26469FF9A7C7402C49971BE1109E at POREXUW03.ppmenergy.us>
Content-Type: text/plain; charset="us-ascii"

Hey folks - for my work I have WRF/ARW 3.3.1 running in an operational
mode for forecasting purposes.  We use GFS as the primary forcing data,
and we begin our runs once we have 24 hours of GFS available (in order
to get things underway quickly).  Once that portion is complete, and
assuming we have the remainder of our desired GFS data available, we
restart the run and continue our forecast (after running the usual WPS
pre-processing, etc).

Generally, this works quite well.  Maybe once a week (i.e., once every 30
runs or so; we run 4 runs a day, 7 days a week), the job hangs up
shortly after beginning the restart.  The job typically outputs the
wrfout files for the first time step, and sends significant output to
the rsl.error/rsl.out files.  The processes still show as running, but
no further output of any kind appears, and eventually the queue/batch
system kills it once it goes beyond the allocated time.  There are no
error messages in any logs or rsl files or anywhere else I've seen.  On
rerunning the job, it nearly always runs to completion without problem.

Anyone seen this sort of thing?  Until recently, I didn't see this as a
major problem, because by far the most important data was within the
first 24 hours.  Lately our business needs are making the longer-lead
forecasts more valuable than they were.  I suppose I can put in
additional job-monitoring machinery, and attempt to "restart the
restart" if it hangs up, but obviously I'd like to minimize the
incidence in the first place.
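
As a rough idea of what that job-monitoring machinery might look like, here is a minimal Python sketch of the stall check only. It assumes a hang shows up as no rsl.error, rsl.out, or wrfout file being modified for some time; the function names and the idle threshold are made up for illustration, and killing and resubmitting the job is left to whatever the queue/batch system provides.

```python
import time
from pathlib import Path

def newest_mtime(run_dir, patterns=("rsl.error.*", "rsl.out.*", "wrfout_*")):
    """Return the most recent modification time among matching files, or None."""
    mtimes = [p.stat().st_mtime
              for pattern in patterns
              for p in Path(run_dir).glob(pattern)]
    return max(mtimes) if mtimes else None

def is_stalled(run_dir, max_idle_seconds, now=None):
    """True if no monitored file has been written for longer than max_idle_seconds."""
    now = time.time() if now is None else now
    latest = newest_mtime(run_dir)
    if latest is None:
        return False  # nothing written yet; too early to call it a hang
    return (now - latest) > max_idle_seconds
```

A monitoring loop could call is_stalled(run_dir, 1800) every few minutes and trigger the "restart the restart" step when it returns True. The 1800-second threshold is only a placeholder and would need tuning to the normal gap between outputs.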

I've got WRF 3.5.1 running experimentally, but not for long enough yet to see if
that helps.  We try not to change our WRF version too frequently (or
other components), since this is an operational system, and even slight
changes can change behavior, skewing statistics, etc.  But if it helped,
I'd consider it.

Since it always seems to happen at the same point (after restart, after
first wrfouts, before additional time stepping), I doubt it's a hardware
or systems issue.  Maybe some infrequently triggered race condition or
similar?

Other details:
	PGI 10.6
	netcdf-4.1.3
	mvapich2-1.7

Thoughts?  Thanks,
Mike
------------------------------------Original Message------------------------------------
