[Wrf-users] real.exe failing on huge domains

Gustafson, William I william.gustafson at pnl.gov
Fri Sep 4 14:19:07 MDT 2009


Don,

I don't think this will help much, but I too am suspicious that there is a
bug somewhere in real.exe when used with MPI. I have not officially reported it
because I haven't been able to isolate it, and I haven't been sure whether it
is a machine-specific issue or due to some custom modification I've made to
the code, though I'm pretty sure it is not the latter. For what it is worth,
the following is my experience on an Intel-based Linux cluster with dual
quad-core chips, using PGI v7.1-6 and MVAPICH v1.0.1 for MPI. My version of
WRF is based on the NCAR repository for v3.1 from around early April 2009.

If I compile real.exe with MPI but run it on just one processor, everything
seems to work fine. But when I use multiple processors, say 4 or 8 for a
domain with 250x250 points, I start getting random points messed up in the
wrfbdy file. The larger the domain, the more likely the problem, even if the
number of MPI tasks is held constant. Often the first couple of days will be
fine, but then one output time will be corrupted, and the exact time at which
it happens is generally not reproducible. Typically I can get around the
problem by using a different number of processors and crossing my fingers,
but this does not always work. For very large domains I cannot always do
this, even with 16 GB per node, and then I get stuck. Since I run with
aerosol chemistry, the memory footprint is much higher than for a met-only
WRF run. The problem, however, is not tied to chemistry, because it also
happens with WRF_CHEM=0.
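
In case it is useful, something along the following lines will at least flag
the bad points in a boundary file. This is just a rough sketch in Python with
netCDF4; the wrfbdy_d01 file name, the restriction to floating-point fields,
and the threshold for "obviously bogus" are all assumptions to adjust.

import numpy as np
from netCDF4 import Dataset

# Rough sketch: scan a wrfbdy file for obviously bogus boundary values,
# defined here as non-finite values or magnitudes above an arbitrary cutoff.
THRESH = 1.0e10

with Dataset("wrfbdy_d01") as nc:
    for name, var in nc.variables.items():
        if getattr(var.dtype, "kind", "") != "f":   # floating-point fields only
            continue
        data = np.ma.filled(var[:], np.nan)
        bad = ~np.isfinite(data) | (np.abs(data) > THRESH)
        if bad.any():
            print(f"{name}: {bad.sum()} suspicious values, "
                  f"min={np.nanmin(data):g}, max={np.nanmax(data):g}")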

The symptoms of our two problems are very different, so they may not be
related. In my case, real.exe runs to completion but produces occasional bogus
values that typically only show up when wrf.exe crashes; in your case, real.exe
itself crashes. So take my comments with a grain of salt. My guess is that an
uninitialized variable somewhere is causing trouble in the halo region. But if
the problem is actually an out-of-bounds access, then something could be
getting corrupted and leading to your error.

-Bill


On 9/3/09 12:53 PM, "Don Morton" <morton at arsc.edu> wrote:

> Howdy,
> 
> Just an update: after a lot of work, I got WRF to compile with the
> Pathscale compilers, and I am experiencing the same problem described
> below.  With a "huge" domain (the threshold is somewhere between
> 3038x3038 and 5000x5000 horizontal points, with 28 levels), real.exe
> fails with the following error:
> 
>> -------------- FATAL CALLED ---------------
>> FATAL CALLED FROM FILE:  module_initialize_real.b  LINE:     526
>> p_top_requested < grid%p_top possible from data
> 
> So, I believe at this point I've made a reasonable case that this is
> not an issue with a specific architecture, or solely with the PGI
> compilers.
> 
> I believe it may be time to go in and operate on the real.exe code!
> 
> By the way, one person asked me to run this in serial and provide
> output, but this problem is much, much too big for serial execution!
> 
> 
> On Aug 31, 2009, at 2:41 PM, Don Morton wrote:
> 
>> First - the basic question - has anybody been successful in WPS'ing
>> and real.exe'ing a large domain, on the order of 6075x6075x27 grid
>> points (approximately 1 billion)?
>> 
>> I've almost convinced myself (I say "almost" because I recognize that
>> I, like others, am capable of making stupid mistakes) that there is an
>> issue with real.exe which, for large grids, results in an error
>> message of the form:
>> 
>> =====================
>> p_top_requested =     5000.000
>> allowable grid%p_top in data   =     55000.00
>> -------------- FATAL CALLED ---------------
>> FATAL CALLED FROM FILE:  module_initialize_real.b  LINE:     526
>> p_top_requested < grid%p_top possible from data
>> =====================
>> 
>> and I'm beginning to think that this is somehow related to memory
>> allocation issues.  I'm currently working on a 1 km resolution case
>> centered on Fairbanks, Alaska.  If I use a 3038x3038 horizontal grid,
>> everything works fine, but with a 6075x6075 grid I get the above error.
>> I've written an NCL script to print the min/max/avg values of the PRES
>> field in the met_em* files for both cases; at the top level both come
>> out to 1000 Pa, and at the next level down both come out to 2000 Pa.
>> So I'm fairly confident that the topmost pressure fields, and therefore
>> the met_em files being fed to real.exe, are fine, which makes the
>> "allowable grid%p_top" of 55000 Pa reported above look suspicious.
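>> 
>> (The check amounts to something like the following; this is a sketch of
>> the same idea in Python with netCDF4 rather than the NCL script itself,
>> and it assumes the usual met_em.d01.* file naming and that the metgrid
>> levels are ordered bottom to top.)
>> 
>> import glob
>> from netCDF4 import Dataset
>> 
>> # Sketch: print min/max/avg of PRES at the top two metgrid levels of
>> # each met_em file.  Assumes level index -1 is the topmost (lowest
>> # pressure) level.
>> for fname in sorted(glob.glob("met_em.d01.*")):
>>     with Dataset(fname) as nc:
>>         pres = nc.variables["PRES"][0]   # (num_metgrid_levels, ny, nx)
>>         for k in (-1, -2):               # top level, then next level down
>>             lev = pres[k]
>>             print(f"{fname} level {k}: min={lev.min():.1f} "
>>                   f"max={lev.max():.1f} avg={lev.mean():.1f} Pa")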
>> 
>> Further information:
>> 
>> - I've tried these cases under a number of varying conditions -
>> different resolutions, different machines (a Sun Opteron cluster and a
>> Cray XT5).  In all cases, however, I've been using the PGI compilers
>> (but I may try Pathscale on one of the machines to see if that makes a
>> difference).  I feel pretty good about having ruled out resolution,
>> physics, etc. as the problem, and feel like I've narrowed this down to
>> something that is a function of domain size.
>> 
>> - With some guidance from John Michalakes and folks at Cray, I feel
>> pretty certain that I'm not running out of memory on the compute
>> nodes, though I'll be probing this a little more.  In one case (which
>> still failed with the above error) I put MPI Task 0 on a 32 GByte node
>> all by itself, then distributed the other 255 tasks 8 to an 8-core node
>> (two quad-core processors), with 32 GBytes of memory per node (4 GBytes
>> per task).
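>> 
>> (For scale, assuming 4-byte reals: a single full-domain 3D field in the
>> 6075x6075x27 case is 6075 x 6075 x 27 x 4 bytes, roughly 4 GBytes.)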
>> 
>> - I have tried this with both WRFV3.0.1.1 and WRFV3.1.
>> 
>> 
>> I'll continue to probe, and may need to start digging into the
>> real.exe source, but just wanted to know if anybody else has
>> experienced success or failure with a problem of this size.  I'm aware
>> that a Gordon Bell entry last year was performed with about 2 billion
>> grid points, but I think I remember someone telling me that the run
>> wasn't prepared with WPS.
>> 
>> Thanks,
>> 
>> Don Morton
>> -- 
>> Arctic Region Supercomputing Center
>> http://www.arsc.edu/~morton/
> 
> ---
> Arctic Region Supercomputing Center
> http://www.arsc.edu/~morton/
> 
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users

___________________________________________
William I. Gustafson Jr., Ph.D.
Scientist
ATMOSPHERIC SCIENCES AND GLOBAL CHANGE DIVISION
 
Pacific Northwest National Laboratory
P.O. Box 999, MSIN K9-30
Richland, WA  99352
Tel: 509-372-6110
William.Gustafson at pnl.gov
http://www.pnl.gov/atmospheric/staff/staff_info.asp?staff_num=5716


