[Wrf-users] Results variability depending on processor count

Wed Sep 24 08:24:29 MDT 2008

"Gustafson, William I" <william.gustafson at pnl.gov> wrote on 09/19/2008
05:50:49 PM:

> Jan,
> 
> This issue is actually a pretty complicated one that is a bit machine and
> compiler dependent.

Bill,

Thanks for responding. Based on my most recent tests, I'm afraid that the
issue is not just machine and compiler dependent. It also depends on the
input data. That is, one namelist.input will give bit-identical results
regardless of the processor count used, whereas another won't.

Example:

I have a two-domain nested configuration with 100x151 grid points per 
domain. I ran it with 1-8 processors and almost every run produced a 
slightly different output after 1 hour of integration. Only the 4 
processor and 8 processor runs agreed with each other. Then I changed this 
configuration, ceteris paribus, to 300x351 grid points per domain, in 
order to execute tests with up to 32 processors (there is a known problem 
with applying too many processors to a small domain in WRF 3.0.1.1, which 
is why I had to enlarge the domain). Surprisingly, this new configuration
produced bit-identical results for all runs with 4-32 processors (I did
not perform 1-3 processor runs because there was not enough memory).
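
To make "bit-identical" concrete: the comparison meant here is on the
stored bit patterns, not on values within a tolerance. A minimal sketch,
assuming the fields from two runs have already been read into plain lists
of floats (the names run_4 and run_8 and the sample values are
hypothetical, and 64-bit floats are used only for simplicity):

```python
import struct

def float_bits(x):
    """Return the 8-byte IEEE 754 binary64 pattern of x."""
    return struct.pack(">d", x)

def first_mismatch(a, b):
    """Index of the first element whose bit pattern differs, or None."""
    if len(a) != len(b):
        raise ValueError("fields have different lengths")
    for i, (x, y) in enumerate(zip(a, b)):
        if float_bits(x) != float_bits(y):
            return i
    return None

# Hypothetical sample data standing in for one field from two runs.
run_4 = [273.15, 0.1 + 0.2, 0.0]
run_8 = [273.15, 0.3, 0.0]

print(first_mismatch(run_4, run_4))  # None: bit-identical
print(first_mismatch(run_4, run_8))  # 1: differs in the last bits only
```

Comparing bit patterns rather than using == also distinguishes 0.0 from
-0.0, which a value comparison would treat as equal.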

This is on an Opteron x86_64 cluster, using the PGI 6.2-5 compiler. WRF was
compiled with the -Kieee option (IEEE-compliant floating point operations)
for the above tests. Without the -Kieee option, the 4 and 8 processor runs
no longer agreed with each other in the small case. In the large case,
there was again perfect agreement among all runs with varying processor
counts, even though the results were different from the ones obtained
with -Kieee.

I'm not concerned about the differences resulting from different compiler 
options - fair enough - just about the differences due to different 
processor counts, which I cannot explain.
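
For illustration only, here is one generic way a processor count can
change a floating-point result; it is not confirmed to be what happens in
WRF. Floating-point addition is not associative, so a sum assembled from
per-processor partial sums can depend on how the data is split. The
sketch below assumes a simple contiguous decomposition; the function
names and the sample field are hypothetical.

```python
def ordered_sum(values):
    """Add values strictly left to right (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total

def decomposed_sum(values, nproc):
    """Sum per-'processor' chunks first, then combine the partial sums.

    Assumes nproc divides len(values) evenly.
    """
    size = len(values) // nproc
    partials = [ordered_sum(values[i * size:(i + 1) * size])
                for i in range(nproc)]
    return ordered_sum(partials)

# Hypothetical field chosen so that rounding is easy to follow by hand.
field = [1.0, 1.0e16, -1.0e16, 1.0]

for nproc in (1, 2, 4):
    print(nproc, decomposed_sum(field, nproc))
# 1 1.0
# 2 0.0
# 4 1.0
```

In this contrived case the 1 and 4 way splits agree while the 2 way split
does not, so agreement between some processor counts does not by itself
rule out order-dependent rounding.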

> In fact, my understanding is that for any of us
> developing code to be released with WRF, we must be able to reproduce the
> wrfout files and get a bit-for-bit match when we change processor counts.

That is what I'd like to achieve, ideally between different clusters that 
share the x86_64 architecture. However, getting invariant results under 
varying processor counts in a single cluster would be a good first step.

> The reasons for the differences are many. The most problematic is bugs in the
> code, e.g. an array not being given a value before being used, which leads
> to random results.

Wouldn't bugs cause easily reproducible deviations? Is it likely that 
varying the processor count triggers these kinds of bugs?

> Another possibility is how optimization is done. Each CPU
> has a set of registers that hold values used for calculation and
> intermediary results. These registers typically operate at a higher
> precision than the numbers held in memory. So, when numbers are passed from
> a register to memory and brought back to another register, a small amount of
> precision is lost.
> The implication is that if multiple calculations can be
> done entirely in the registers, one can gain a little accuracy. However, if
> during the same series of calculations one has to use memory space to hold
> an intermediary value, the result could differ at the end. Compilers often
> have options to prevent these differences by forcing round-off error to be
> handled consistently, e.g. with ifort one would add "-fp-model precise".

I think you are referring to the x87 architecture (which has 80-bit
registers and a 64-bit in-memory representation, according to what I have
read). However, as far as I understand, the SSE/SSE2 architecture of the
Opteron is different and does not suffer from this particular problem. In
any case, I used the -Kieee and -Mnobuiltin options of the PGI compiler
just to be on the safe side (or so I thought). However, as reported above,
the results still varied.
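
The register effect described in the quoted paragraph can be imitated in
a few lines. This is an analogy only: Python's 64-bit float plays the
role of the wider register and a 32-bit float plays the role of the
narrower memory format, which is not the 80/64-bit split of x87. The
helper name store_f32 and the values are hypothetical.

```python
import struct

def store_f32(x):
    """Round x to 32-bit precision, as if written to narrower memory."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a, b, c = 1.0e8, 1.0, -1.0e8

# Intermediate a + b stays at full (64-bit) precision; only the final
# result is stored in the narrow format.
kept_in_register = store_f32((a + b) + c)

# Intermediate a + b is stored in the narrow format and reloaded before
# the second addition, which drops the contribution of b.
spilled_to_memory = store_f32(store_f32(a + b) + c)

print(kept_in_register)   # 1.0
print(spilled_to_memory)  # 0.0
```

Whether a given intermediate is kept or spilled is the compiler's choice,
which is why such differences can appear or vanish with compiler options.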

Right now I tend to believe that there is a bug somewhere in the WRF code,
based on the observation that it *can* produce bit-identical results
regardless of the processor count for *some* namelists (and input data).

Regards,
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
FuE Bereich Energie | R&D Division Energy
Escherweg 2  - 26121 Oldenburg - Germany
Phone/Fax: +49 441 9722 - 184 / 202
E-Mail: Jan.Ploski at offis.de
URL: http://www.offis.de