[Wrf-users] WRF is "hanging"

Feng Liu FLiu at azmag.gov
Mon Mar 28 16:00:47 MDT 2011


Jatin and Don,
If the problem cannot be resolved by reducing time_step, I suspect it is caused by your system. I ran into a similar issue before and spent a lot of time investigating what had happened to the system. We also asked wrfhelp for a solution, but in vain; please see the replies from wrfhelp immediately after my reply.

We have a cluster that runs different models such as WRF, WRF/Chem, CMAQ, CAMx, etc. The cluster has 8 nodes, each of which is dual quad-core. It originally had a 100 Mb switch, and all models, including WRF 3.2.1, ran perfectly. In order to improve computing efficiency we upgraded the switch to 1 Gb. CMAQ and CAMx still ran fine; however, WRF 3.2.1 hung up more often as the number of processors involved in the computation increased, and it seemed fairly random. When it hung, there was no error message and no stop. We checked the MPI libraries, MPICH, compiler flags, and many other things. So we have now gone back to the 100 Mb switch. Everything works well and WRF no longer hangs, though the cluster is slower.

Why did CMAQ (the parallel version) run successfully with the 1 Gb switch and full nodes, but WRF 3.2.1 did not? Until we order another 1 Gb or more advanced switch and test it, the question remains open.
For an effective test, you may use the pilot program test.f attached. If your WRF hangs, it should hang too, and it gives you a quick check.
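[The actual test.f attachment is not reproduced here (see the attachment link at the bottom of this message); a minimal stand-in along the following lines (hypothetical, not the original file) hammers an MPI collective across all ranks, which is the kind of traffic that tends to expose a flaky interconnect:

      program mpitest
c     Repeatedly exercise an MPI collective across all ranks. On a
c     healthy interconnect this finishes in seconds; on a flaky one
c     it tends to hang, just as WRF does.
      implicit none
      include 'mpif.h'
      integer ierr, rank, nprocs, i, isum
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      do i = 1, 10000
         call MPI_ALLREDUCE(rank, isum, 1, MPI_INTEGER,
     &                      MPI_SUM, MPI_COMM_WORLD, ierr)
      end do
      if (rank .eq. 0) write(*,*) nprocs, ' ranks alive, sum = ', isum
      call MPI_FINALIZE(ierr)
      end

Compile it with your MPI Fortran wrapper (e.g. mpif90 test.f) and launch it with the same mpirun command and node count you use for wrf.exe.]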
I hope it is helpful. I will keep a close eye on this issue.
Thanks.
Feng

--------------------------------------------
Replies from wrfhelp:


Since we have not had reports from other users, I am guessing the problem has more to do with your system than with the code. If you can get help from your system support or the vendor, that might be helpful.

wrfhelp



On Jan 21, 2011, at 9:41 AM, Feng Liu wrote:



> Hi,
> I re-compiled WRF3.2.1. The hang-up problem still happens sometimes and
> sometimes does not; it seems fairly random. It hangs more often as the
> number of processors increases. I also consulted with our IT staff, but
> there is no solution so far. Your support is highly appreciated.
> Feng
>

> -----Original Message-----
> From: wrfhelp [mailto:wrfhelp at ucar.edu]
> Sent: Thursday, January 06, 2011 4:38 PM
> To: Feng Liu
> Subject: Re: job hang up without error message when I used
>
> Could you work with your system support people and see if they can
> help?
> wrfhelp
>
> On Jan 5, 2011, at 8:15 PM, Feng Liu wrote:
>

>> Hi,
>> I had the hang-up problem with version 3.2 no matter whether multiple
>> processors or a single one was used, but when I modified
>> namelist.input, it did work. For version 3.2.1 I think the cause of
>> this problem is different, because it does work on the master node.
>> On the stability of the computer you mentioned, you may be right. We
>> upgraded the network from a 100 Mb switch to 1 Gb. We had stable WRF
>> runs with the old network card even though its performance was poor.
>> However, I can run CMAQ 4.7.1 with 64 processors successfully (we
>> have 8 nodes, each of which has 8 processors), and the speedup factor
>> is almost 2.8 compared with the system with the old network card.
>> Thanks.
>> Feng
>>

>> -----Original Message-----
>> From: wrfhelp [mailto:wrfhelp at ucar.edu]
>> Sent: Wednesday, January 05, 2011 7:24 PM
>> To: Feng Liu
>> Subject: Re: job hang up without error message when I used
>>
>> Have you seen this problem with other versions of the model code
>> before? Is your system stable?
>> Can you run other MPI jobs steadily on this system? What I am saying
>> is that it is possibly a problem with the computer, not the model
>> code.
>>
>> wrfhelp
>>
>> On Jan 5, 2011, at 5:11 PM, Feng Liu wrote:
>>

>>> Hi,
>>> Thanks for your response. But I am using WRF3.2.1, which has the
>>> same problem. I have no idea so far.
>>> Thanks.
>>> Feng
>>>
>>> -----Original Message-----
>>> From: wrfhelp [mailto:wrfhelp at ucar.edu]
>>> Sent: Wednesday, January 05, 2011 4:20 PM
>>> To: Feng Liu
>>> Subject: Re: job hang up without error message when I used
>>>
>>> Mike was using 3.2 at the time, and the fix has been included in
>>> 3.2.1.
>>>
>>> wrfhelp
>>>
>>> On Jan 5, 2011, at 1:30 PM, Feng Liu wrote:
>>>

>>>> Hi,
>>>> I can run WRF3.2.1 successfully if I only use the master node with
>>>> 8 processors. However, my jobs (with MPI) hang up when I use 16 or
>>>> more processors, i.e. two or more nodes; there are no error
>>>> messages and no crashes. This problem was described by Michael
>>>> Zulauf as below
>>>> (see http://mailman.ucar.edu/pipermail/wrf-users/2010/001745.html):
>>>>
>>>> "My jobs sporadically (but usually eventually) hang up, most often
>>>> after a new wrfout file is opened.  No error messages, no crashes
>>>> - the processes continue, but _all_ output stops.  I eventually
>>>> just have to kill the job.  The wrfouts are small, and all output
>>>> looks good up until the failed wrfout."
>>>>
>>>> Mike mentioned he got modified code from wrfhelp which seemed to
>>>> fix this issue. I need to know which code needs to be modified and
>>>> what the problem is related to. Thanks for your support in fixing
>>>> this problem.
>>>> Feng
>>>
>>> wrfhelp

From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On Behalf Of Jatin Kala
Sent: Saturday, March 26, 2011 12:19 AM
To: wrf-users at ucar.edu
Subject: Re: [Wrf-users] WRF is "hanging"

Thanks for the suggestion Feng, but this is not related to the namelist inputs. The namelist I am running worked fine on a different machine.
The issue here is that WRF simply hangs and does nothing at the initialisation of grid 2, i.e., the rsl.out and rsl.error files print:

d01 2009-10-01_00:00:00  alloc_space_field: domain            2,     84045408 bytes allocated
d01 2009-10-01_00:00:00  alloc_space_field: domain            2,      3084672 bytes allocated
d01 2009-10-01_00:00:00 *** Initializing nest domain # 2 from an input file. ***
d01 2009-10-01_00:00:00 med_initialdata_input: calling input_input

and that's it. The rsl.error and rsl.out files stop growing in size; there are no more prints. The job, however, is still in the queue and does NOT error out until the walltime has elapsed. No wrfout_d0* files are created.

Other people seem to have had this issue before:

http://mailman.ucar.edu/pipermail/wrf-users/2010/001749.html

http://mailman.ucar.edu/pipermail/wrf-users/2010/001747.html


Any help more than welcome.

Regards,

Jatin



From: Feng Liu [mailto:FLiu at azmag.gov]
Sent: Saturday, 26 March 2011 9:04 AM
To: Jatin Kala; wrf-users at ucar.edu
Subject: RE: WRF is "hanging"

Hi Jatin,
I do not know exactly what is wrong in your case, but one thing you can try is to reduce time_step in namelist.input to one third of its current value, as in the example below. Good luck.
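For illustration (the value is hypothetical; use one third of whatever your own setting is), if time_step is currently 180, the &domains section of namelist.input would become:

&domains
 time_step = 60,
 ...
/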
Feng


From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On Behalf Of Jatin Kala
Sent: Thursday, March 24, 2011 7:29 PM
To: wrf-users at ucar.edu
Subject: [Wrf-users] WRF is "hanging"

Dear WRF-users,

I have compiled WRF3.2 on our new supercomputing facility and am having some trouble. Namely, WRF is just "hanging" at:

d01 2009-10-01_00:00:00  alloc_space_field: domain            2,     84045408 bytes allocated
d01 2009-10-01_00:00:00  alloc_space_field: domain            2,      3084672 bytes allocated
d01 2009-10-01_00:00:00 *** Initializing nest domain # 2 from an input file. ***
d01 2009-10-01_00:00:00 med_initialdata_input: calling input_input


The job remains in the queue, i.e., it does not error out until the walltime has elapsed.

I have compiled with -O0, but that did not help. I have also compiled with the updated "gen_allocs.c" from the WRF website, but that has not helped either. I did run "clean -a" beforehand.

I have compiled WRF with the following libraries and settings:

intel-compilers/2011.1.107
jasper/1.900.1
ncarg/5.2.1
mpi/intel/openmpi/1.4.2-qlc
netcdf/4.0.1/intel-2011.1.107
export WRFIO_NCD_LARGE_FILE_SUPPORT=1

Any help would be greatly appreciated!

Kind regards,

Jatin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.f
Type: application/octet-stream
Size: 397 bytes
Desc: test.f
URL: http://mailman.ucar.edu/pipermail/wrf-users/attachments/20110328/77173821/attachment-0001.obj

