[Wrf-users] WRF is "hanging"

Jatin Kala J.Kala at murdoch.edu.au
Wed Mar 30 19:58:19 MDT 2011


 

Hi all,

 

Thanks for all the replies.

 

It turns out that there was something "funny" with our OpenMPI library.
Our system admins have partly fixed the issue, so WRF now only hangs with
certain numbers of CPUs per node (we don't know why yet). They are still
working on it.

 

Cheers,

Jatin 

 

 

 

 

From: Feng Liu [mailto:FLiu at azmag.gov] 
Sent: Tuesday, 29 March 2011 6:01 AM
To: Jatin Kala; wrf-users at ucar.edu
Cc: Jatin Kala
Subject: RE: WRF is "hanging"

 

Jatin and Don, 

If the problem cannot be resolved by reducing time_step, I suspect it is
most likely caused by your system. I experienced a similar issue before and
spent a lot of time researching what was happening on the system. We also
asked wrfhelp for a solution, but in vain; please see the replies from
wrfhelp immediately after my reply.

We have a cluster that runs different models such as WRF, WRF/Chem, CMAQ,
CAMx, etc. The cluster has 8 nodes, each of which is dual quad-core, and
originally had a 100 Mb switch, on which all models, including WRF 3.2.1,
ran perfectly. To improve computing efficiency we upgraded the switch to
1 Gb. CMAQ and CAMx still run fine; however, WRF 3.2.1 hangs more often as
the number of processors involved in the computation increases, and it
seems fairly random: when it hangs there is no error message and no stop.
We checked MPI libraries, MPICH, compiler flags, and many other things. We
have now gone back to the 100 Mb switch; everything works well and WRF no
longer hangs, although the cluster is slower. Why did CMAQ (parallel
version) run successfully with the 1 Gb switch and full nodes, but
WRF 3.2.1 did not? Until we order another 1 Gb or more advanced switch and
test it, that question remains open.

For an effective test, you may use the pilot program test.f attached
(something along the lines of the sketch below): if your WRF hangs, it
should hang too, but it gives you a much quicker check.
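
The test.f attachment itself is not reproduced in this archive. Purely as
an assumed illustration of that kind of pilot program, not the original
attachment, a minimal MPI test that repeatedly forces every rank to
communicate might look like this; if the interconnect or MPI library is
flaky, it should stall across nodes the same way WRF does.

      program mpi_ping_test
      ! Hypothetical stand-in for the scrubbed test.f attachment:
      ! every rank joins a global reduction on each pass, so a bad
      ! interconnect or MPI install should make the loop stall.
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, nprocs, iter
      integer, parameter :: niter = 1000
      double precision :: val, total

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      do iter = 1, niter
         val = dble(rank + iter)
         ! Global reduction: all ranks must exchange data each pass.
         call MPI_Allreduce(val, total, 1, MPI_DOUBLE_PRECISION,
     &                      MPI_SUM, MPI_COMM_WORLD, ierr)
         if (rank == 0 .and. mod(iter, 100) == 0) then
            write(*,*) 'iteration', iter, ' sum =', total
         end if
      end do

      call MPI_Finalize(ierr)
      end program mpi_ping_test

Build it with the same MPI wrapper used for WRF (e.g. mpif90) and run it
across the same nodes and processor counts as the failing WRF job.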

I hope it is helpful. I will keep a close eye on this issue. 

Thanks.

Feng

 

--------------------------------------------

Wrfhelp reply:

 

 

Since we have not had reports from other users, I am guessing the problem
has more to do with your system than with the code. If you can get help
from your system support or the vendor, that might be helpful.

wrfhelp

 

On Jan 21, 2011, at 9:41 AM, Feng Liu wrote:

 

> Hi,

> I re-compiled WRF 3.2.1. The hang-up problem sometimes still happens and

> sometimes does not; it seems fairly random. It hangs more often as the

> number of processors increases. I also consulted our IT staff, but there

> is no solution so far. Your support is highly appreciated.

> Feng

> 

> 

> -----Original Message-----

> From: wrfhelp [mailto:wrfhelp at ucar.edu]

> Sent: Thursday, January 06, 2011 4:38 PM

> To: Feng Liu

> Subject: Re: job hang up without error message when I used

> 

> Could you work with your system support people and see if they can 

> help?

> wrfhelp

> 

> On Jan 5, 2011, at 8:15 PM, Feng Liu wrote:

> 

>> Hi,

>> I had the hang-up problem with version 3.2 whether I used multiple

>> processors or a single one, but when I modified namelist.input it

>> did work. For version 3.2.1 I think the cause of this problem is

>> different, because it does work on the master node.

>> Regarding the stability of the computer you mentioned, you may be

>> right. We upgraded the network from a 100 Mb switch to 1 Gb. We had

>> stable WRF runs with the old network hardware even though its

>> performance was poor. However, I can run CMAQ 4.7.1 with 64 processors

>> (we have 8 nodes, each with 8 processors) successfully, and the speedup

>> factor is almost 2.8 compared with the old network hardware.

>> Thanks.

>> Feng

>> 

>> 

>> -----Original Message-----

>> From: wrfhelp [mailto:wrfhelp at ucar.edu]

>> Sent: Wednesday, January 05, 2011 7:24 PM

>> To: Feng Liu

>> Subject: Re: job hang up without error message when I used

>> 

>> Have you seen this problem with other versions of the model code 

>> before? Is your system stable?

>> Can you run other MPI jobs steadily on this system? What I am saying 

>> is that it is possible that it is a problem with the computer, not the 

>> model code.

>> 

>> wrfhelp

>> 

>> On Jan 5, 2011, at 5:11 PM, Feng Liu wrote:

>> 

>>> Hi,

>>> Thanks for your response. But I am using WRF 3.2.1, which has the same

>>> problem. I have no idea so far.

>>> Thanks.

>>> Feng

>>> 

>>> 

>>> -----Original Message-----

>>> From: wrfhelp [mailto:wrfhelp at ucar.edu]

>>> Sent: Wednesday, January 05, 2011 4:20 PM

>>> To: Feng Liu

>>> Subject: Re: job hang up without error message when I used

>>> 

>>> Mike was using 3.2 at the time, and the fix has been included in 

>>> 3.2.1.

>>> 

>>> wrfhelp

>>> 

>>> On Jan 5, 2011, at 1:30 PM, Feng Liu wrote:

>>> 

>>>> Hi,

>>>> I can run WRF 3.2.1 successfully if I only use the master node with

>>>> 8 processors. However, my jobs (with MPI) hang up when I use 16 or

>>>> more processors, or more than two nodes: no error message, no

>>>> crashes. This problem was described by Michael Zulauf as below (see

>>>> http://mailman.ucar.edu/pipermail/wrf-users/2010/001745.html ):

>>>> 

>>>> "My jobs sporadically (but usually eventually) hang up, most often 

>>>> after a new wrfout file is opened.  No error messages, no crashes

>>>> - the processes continue, but _all_ output stops.  I eventually 

>>>> just have to kill the job.  The wrfouts are small, and all output 

>>>> looks good up until the failed wrfout."

>>>> 

>>>> Mike mentioned he got modified code from wrfhelp that seemed to

>>>> fix this issue. I would also like to know which code needs to be

>>>> modified and what the problem is related to. Thanks for your support

>>>> in fixing this problem.

>>>> Feng

>>>> 

>>> 

>>> wrfhelp

 

From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On
Behalf Of Jatin Kala
Sent: Saturday, March 26, 2011 12:19 AM
To: wrf-users at ucar.edu
Subject: Re: [Wrf-users] WRF is "hanging"

 

Thanks for the suggestion, Feng, but this is not related to the namelist
inputs. The namelist I am running worked fine on a different machine.

The issue here is that WRF simply hangs and does nothing at the
initialisation of Grid 2, i.e., the rsl.out and rsl.error files print out:

 

d01 2009-10-01_00:00:00  alloc_space_field: domain            2, 84045408 bytes allocated
d01 2009-10-01_00:00:00  alloc_space_field: domain            2, 3084672 bytes allocated
d01 2009-10-01_00:00:00 *** Initializing nest domain # 2 from an input file. ***
d01 2009-10-01_00:00:00 med_initialdata_input: calling input_input

 

and that's it. The rsl.error and rsl.out files do not keep growing in
size; there are no further prints, they simply stop. The job, however,
remains in the queue and does NOT error out until the walltime has
elapsed. No wrfout_d0* files are created.

 

Other people seem to have had this issue before:

 

http://mailman.ucar.edu/pipermail/wrf-users/2010/001749.html 

 

http://mailman.ucar.edu/pipermail/wrf-users/2010/001747.html 

 

 

Any help is more than welcome.

 

Regards,

 

Jatin

 

 

 

From: Feng Liu [mailto:FLiu at azmag.gov] 
Sent: Saturday, 26 March 2011 9:04 AM
To: Jatin Kala; wrf-users at ucar.edu
Subject: RE: WRF is "hanging"

 

Hi Jatin,

I do not know exactly what is wrong in your case, but one thing you can
try is to reduce time_step in namelist.input by a factor of 3. Good luck.
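
For illustration only, with assumed numbers rather than anything from
Jatin's actual namelist: if the current time_step were 180 s, cutting it
by a factor of 3 means editing one line in the &domains section of
namelist.input (the ... stands for the existing entries, left unchanged).
The usual WRF guidance is a time_step of no more than roughly 6*dx, with
dx in km.

&domains
 time_step = 60,
 ...
/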

Feng

 

 

From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On
Behalf Of Jatin Kala
Sent: Thursday, March 24, 2011 7:29 PM
To: wrf-users at ucar.edu
Subject: [Wrf-users] WRF is "hanging"

 

Dear WRF-users,

 

I have compiled WRF 3.2 on our new supercomputing facility and am having
some trouble. Namely, WRF is just "hanging" at:

 

d01 2009-10-01_00:00:00  alloc_space_field: domain            2, 84045408 bytes allocated
d01 2009-10-01_00:00:00  alloc_space_field: domain            2, 3084672 bytes allocated
d01 2009-10-01_00:00:00 *** Initializing nest domain # 2 from an input file. ***
d01 2009-10-01_00:00:00 med_initialdata_input: calling input_input

 

 

The job remains in the queue, i.e., it does not error out until the
walltime has elapsed.

 

I have compiled with -O0, but that did not help. I have also compiled
with the updated "gen_allocs.c" from the WRF website, but that has not
helped either. I did do a "clean -a" beforehand.

 

I have compiled WRF with the following libraries:

 

intel-compilers/2011.1.107

jasper/1.900.1

ncarg/5.2.1

mpi/intel/openmpi/1.4.2-qlc

netcdf/4.0.1/intel-2011.1.107

export WRFIO_NCD_LARGE_FILE_SUPPORT=1

 

Any help would be greatly appreciated!

 

Kind regards,

 

Jatin 
