[Wrf-users] WRF Stops Running for 64+ Processors (UNCLASSIFIED)

Dumais, Robert E Jr CIV (US) robert.e.dumais.civ at mail.mil
Tue Jun 26 12:31:29 MDT 2012


Classification: UNCLASSIFIED
Caveats: NONE

We are having the same general issues with WRF v3.4 on a PSSC Powerwulf cluster. Earlier versions (such as 3.2.1) run without problems using the same architecture, PBS scripts, grids, input data, namelist, and dimensionality. With v3.4, however, we get seg faults in the gfortran build and seg faults or occasional hangs in the Intel build. With gfortran, the crash seems to come very shortly after the first wrfout* file is written. With Intel, it doesn't always seem to occur right near wrfout writes, but this is still a little unclear. We are using the dmpar option for 64-bit Linux. We are also using WPS 3.4 to provide the input files for WRF v3.4.

The really bizarre aspect: in the Intel version, any run on more than 16 processors (for us, that is 2 nodes) crashes pretty soon into the simulation, while using just 2 nodes gets us a lot further before the seg fault. I tried lowering the compile optimization to O2 and saw the same behavior.

With the gfortran build, the same behavior occurs, except that the crashes always seem to happen near the initial wrfout file write for any PBS job asking for more than 2 nodes. Using just 2 nodes, the few jobs we have submitted (which failed with Intel) are running to completion with good-looking results using gfortran and optimization O3. For both Intel and gfortran, the aborted runs appear to produce reasonable-looking fields up to the point of the sudden crash/seg fault. The RSL files offer almost no clues. We use the large file support option for netCDF when we build.
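
Since the RSL files say so little on their own, one way to get a bit more out of them is to compare how far each MPI rank got before the run died. The following is only a minimal Python sketch under stated assumptions: that the per-rank logs are named rsl.error.NNNN, and that completed time steps are logged as "Timing for main: time <timestamp> on domain <n>" lines. The function name summarize_rsl and the regular expression are illustrative and not part of WRF; adjust them if your log format differs.

```python
import re
from pathlib import Path

# Assumed format of WRF time-step lines; adjust if your rsl files differ.
TIMING = re.compile(r"Timing for main: time (\S+) on domain\s+(\d+)")


def summarize_rsl(run_dir="."):
    """Return {rank: (last_model_time, last_nonblank_line)} for rsl.error.* files."""
    summary = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        suffix = path.suffix.lstrip(".")
        if not suffix.isdigit():
            continue  # skip anything that is not a per-rank log
        last_time, last_line = None, ""
        with open(path, errors="replace") as fh:
            for line in fh:
                match = TIMING.search(line)
                if match:
                    last_time = match.group(1)  # most recent completed step
                if line.strip():
                    last_line = line.strip()    # most recent non-blank line
        summary[int(suffix)] = (last_time, last_line)
    return summary


if __name__ == "__main__":
    for rank, (step, line) in sorted(summarize_rsl().items()):
        print(f"rank {rank:4d}  last step: {step}  last line: {line}")
```

Run from the WRF run directory, this prints one line per rank. A rank whose last completed step lags behind the others, or whose final line differs from the rest, is a reasonable place to start looking.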

I wonder if this is maybe an MPI problem, a physical or system software problem within the cluster itself (which seems unlikely, given that older WRF versions run fine), or perhaps still related somehow to the old "gen_malloc.c" subroutine bug that was discovered back in 3.2, I believe, and was thought to be fixed.


Bob Dumais

  

-----Original Message-----
From: wrf-users-bounces at ucar.edu [mailto:wrf-users-bounces at ucar.edu] On Behalf Of Chris Klich
Sent: Wednesday, June 20, 2012 12:22 PM
To: wrf-users
Subject: [Wrf-users] WRF Stops Running for 64+ Processors

Hi all, I am currently running WRF-Chem 3.4, but I've also had this problem with regular WRF 3.4.  I am running a 3-domain simulation (45, 15, and 5 km).  At varying points, anywhere from writing the hour 0 output for domain 2 up to the first actual time step, WRF just stops running.  It does not give me an error message; one of the nodes simply reports a segmentation fault.  This only seems to happen when running on more than 64 processors.  I am using a PGI-compiled version of WRF, and I do not believe it is a namelist or model issue, as the crash happens at several different points in the model run.  However, on rare occasions, the run will actually finish with more than 64 processors.

Has anyone had this happen before, or possibly know of a way to fix it?  I am using a computer center that I have to submit my run to, so it can get frustrating waiting almost a day for my run to start, only to have it fail within a few minutes.  Any help would be greatly appreciated.

Thanks,
Chris





