[Wrf-users] Re: WRF ... mpi Buffering message
Peter Johnsen
pjj at cray.com
Mon Nov 12 08:37:32 MST 2007
Alan,
Typically for WRF we see this MPI situation occur during
data collection for forecast output. As the number of MPI
ranks increases, the I/O collector (usually MPI rank 0) may
not be able to keep up with the numerous senders. Usually
an MPICH_UNEX_BUFFER_SIZE setting of 480M (megabytes) or so
should be sufficient for a 1000x1000x47 grid on thousands
of XT cores.
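
For example, here is a minimal batch-script sketch (assuming a
bash PBS script launched with aprun; the 2048-rank count is taken
from your run, the 480M value from above, and the
MPICH_MAX_SHORT_MSG_SIZE figure is only illustrative):

    #!/bin/bash
    #PBS -N wrf_run
    #PBS -l mppwidth=2048

    cd $PBS_O_WORKDIR

    # Grow the buffer that holds unexpected (not yet matched)
    # short messages at the receiver; the default is 60 MB.
    export MPICH_UNEX_BUFFER_SIZE=480M

    # Optionally lower the eager/short-message cutoff (default
    # 128000 bytes); 64000 here is only an illustrative value.
    export MPICH_MAX_SHORT_MSG_SIZE=64000

    # Uncomment to get a traceback showing where the overflow
    # occurs (per the advice quoted below).
    # export MPICH_DBMASK=0x200

    aprun -n 2048 ./wrf.exe

Lowering the short-message threshold pushes more of the traffic
onto the rendezvous path, which reduces how much rank 0 has to
hold in the unexpected-message buffer in the first place, so
tuning both variables together is often more effective than
enlarging the buffer alone.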
Please feel free to contact me if you have further questions.
Peter
Peter Johnsen
Meteorologist, Applications Engineering
Cray, Inc.
651-605-9173
pjj at cray.com
>
> Date: Fri, 9 Nov 2007 16:16:08 +0000 (GMT)
> From: Alan Gadian <alan at env.leeds.ac.uk>
> Subject: [Wrf-users] WRF ... mpi Buffering message error
> To: wrf-users at ucar.edu
> Cc: Ralph Burton <ralph at see.leeds.ac.uk>, Paul Connolly
> <p.connolly at manchester.ac.uk>
>
>
> Hi,
>
> We are running WRF on 1024 dual-core processors (i.e. np=2048)
> on an XT4.
>
> We had the following error message:-
>
>
>> internal ABORT - process 0: Other MPI error, error stack:
>> MPIDI_PortalsU_Request_PUPE(317): exhausted unexpected receive queue
>> buffering increase via env. var. MPICH_UNEX_BUFFER_SIZE
>
>
> which, we are told, means
>
> "The application is sending too many short, unexpected messages to
> a particular receiver."
>
> We have been advised that to work around the problem we should
>
> "Increase the amount of memory for MPI buffering using the
> MPICH_UNEX_BUFFER_SIZE variable (default is 60 MB) and/or decrease
> the short message threshold using the MPICH_MAX_SHORT_MSG_SIZE
> (default is 128000 bytes) variable. May want to set MPICH_DBMASK
> to 0x200 to get a traceback/coredump to learn where in
> application this problem is occurring."
>
> The question is: has anyone else had this problem? The code
> worked without any problem on 500 cores, and given the size
> of the problem, we think we can get good scalability up to
> 3000 cores. However, does anyone have any advice on what is
> happening, what values we should be using, and how dependent
> they are on the number of processors?
>
> Cheers
> Alan
>
> -----------------------------------
> Address: Alan Gadian, Environment, SEE,
> Leeds University, Leeds LS2 9JT. U.K.
> Email: alan at env.leeds.ac.uk.
> http://www.env.leeds.ac.uk/~alan
>
> Atmospheric Science Letters; the New Journal of R. Met. Soc.
> Free Sample:- http://www.interscience.wiley.com/asl-sample2007
> -----------------------------------
>