[Wrf-users] real.exe failing on huge domains

Don Morton morton at arsc.edu
Mon Aug 31 16:41:04 MDT 2009


First - the basic question - has anybody been successful in WPS'ing
and real.exe'ing a large domain, on the order of 6075x6075x27 grid
points (approximately 1 billion)?

I've almost convinced myself (I say "almost" because I recognize
that I, like others, am capable of making stupid mistakes) that there
is an issue with real.exe which, for large grids, results in an error
message of the form:

=====================
 p_top_requested =     5000.000
 allowable grid%p_top in data   =     55000.00
 -------------- FATAL CALLED ---------------
 FATAL CALLED FROM FILE:  module_initialize_real.b  LINE:     526
 p_top_requested < grid%p_top possible from data
=====================
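
For reference, p_top_requested is simply the model top I ask for in
the &domains section of namelist.input, 50 hPa in my case:

=====================
 &domains
  p_top_requested = 5000,
 /
=====================

In other words, real.exe is claiming the input data only reaches
550 hPa, which doesn't square with what's actually in the met_em
files (more on that below).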

I'm beginning to think that this is somehow related to memory
allocation issues.  I'm currently working on a 1 km resolution case,
centered on Fairbanks, Alaska.  If I use a 3038x3038 horizontal grid,
everything works fine, but with a 6075x6075 grid I get the above
error.  For both cases I've used an NCL script to print the
min/max/avg values of the PRES field in the met_em* files; at the top
level they both come out to 1000 Pa, and at the next level down they
both come out to 2000 Pa.  So I'm guessing my topmost pressure fields
are fine, and that the met_em files being fed to real.exe are good.
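
In case it's useful to anyone, here's a rough Python/netCDF4
equivalent of that NCL check (a sketch only; the met_em file pattern,
the variable name PRES, and the bottom-to-top level ordering are from
my setup, so verify them against your own files):

=====================
import glob
from netCDF4 import Dataset

for fname in sorted(glob.glob("met_em.d01.*.nc")):
    with Dataset(fname) as nc:
        # PRES is dimensioned (Time, num_metgrid_levels, south_north, west_east)
        pres = nc.variables["PRES"][0]
        for k in (-1, -2):   # topmost two metgrid levels, assuming bottom-to-top ordering
            lev = pres[k]
            print(fname, "level index", k, "min/max/avg:",
                  lev.min(), lev.max(), lev.mean())
=====================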

Further information:

- I've tried these cases under a number of varying conditions -
different resolutions, different machines (a Sun Opteron cluster and a
Cray XT5).  In all cases, however, I've been using the PGI compilers
(though I may try Pathscale on one of the machines to see if that
makes a difference).  I feel pretty good about having ruled out
resolution, physics, etc. as the problem, and I've narrowed this down
to a problem that's a function of domain size.

- With some guidance from John Michalakes and folks at Cray, I feel
pretty certain that I'm not running out of memory on the compute
nodes, though I'll be probing this a little more (a quick
back-of-envelope on the sizes involved follows this list).  In one
case (which failed with the above problem) I put MPI Task 0 on a
32 GByte node all by itself, then distributed the other 255 tasks
8 per 8-core node (two quad-core processors), each node with
32 GBytes of memory (4 GBytes per task).

- Have tried this with WRFV3.0.1.1 and WRFV3.1
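
For what it's worth, the back-of-envelope arithmetic behind that
Task 0 test (a sketch only; it assumes 4-byte reals and counts just a
single 3D field, ignoring halos and the many other arrays real.exe
allocates):

=====================
# Rough sizing of the 6075x6075x27 domain, as described above.
nx, ny, nz = 6075, 6075, 27
bytes_per_real = 4
ntasks = 256

points = nx * ny * nz                  # ~1.0e9 grid points
one_field = points * bytes_per_real    # one full-domain 3D field

print("grid points:              %.2e" % points)
print("one full-domain 3D field: %.1f GB" % (one_field / 1e9))
print("that field over %d tasks: %.0f MB per task" % (ntasks, one_field / ntasks / 1e6))
=====================

So a single full-domain 3D field weighs in at right around 4 GBytes,
while each task's decomposed piece of it is tiny by comparison.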


I'll continue to probe, and may need to start digging into the
real.exe source, but I just wanted to know whether anybody else has
had success or failure with a problem of this size.  I'm aware that a
Gordon Bell entry last year was run with about 2 billion grid points,
but I seem to remember someone telling me that that run wasn't
prepared with WPS.

Thanks,

Don Morton
-- 
Arctic Region Supercomputing Center
http://www.arsc.edu/~morton/

