[Wrf-users] Re: WRF ... mpi Buffering message

Peter Johnsen pjj at cray.com
Mon Nov 12 08:37:32 MST 2007


Alan,

Typically for WRF we see this MPI situation occur during
data collection for forecast output.  As the number of MPI
ranks increases, the I/O collector (usually MPI rank 0) may
not be able to keep up with the numerous senders.  Usually
an MPICH_UNEX_BUFFER_SIZE setting of 480M (megabytes) or so
should be sufficient for a 1000x1000x47 grid on thousands
of XT cores.
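
For example, on the XT this can be set in the batch job script
before the aprun launch.  A minimal sketch, assuming a PBS script
(the width, directory, and executable name are placeholders for
your own job):

   #PBS -l mppwidth=2048

   # Enlarge the buffer that holds unexpected (eagerly delivered)
   # messages; the Cray MPICH default is 60 MB
   export MPICH_UNEX_BUFFER_SIZE=480M

   cd $PBS_O_WORKDIR
   aprun -n 2048 ./wrf.exe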

Please feel free to contact me if you have further questions.

Peter

Peter Johnsen      Cray, Inc.
Meteorologist, Applications Engineering
651-605-9173       pjj at cray.com



> 
> Today's Topics:
> 
>    1. WRF ... mpi Buffering message error  (Alan Gadian)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 9 Nov 2007 16:16:08 +0000 (GMT)
> From: Alan Gadian <alan at env.leeds.ac.uk>
> Subject: [Wrf-users] WRF ... mpi Buffering message error 
> To: wrf-users at ucar.edu
> Cc: Ralph Burton <ralph at see.leeds.ac.uk>, Paul Connolly <p.connolly at manchester.ac.uk>
> Message-ID: <Pine.LNX.4.64.0711091523510.9362 at see-gw-01>
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
> 
> 
> Hi,
> 
> We are running WRF on 1024 dual-core processors (i.e. np=2048)
> on an XT4.
> 
> We had the following error message:-
> 
> 
>> internal ABORT - process 0: Other MPI error, error stack:
>> MPIDI_PortalsU_Request_PUPE(317): exhausted unexpected receive queue
>> buffering increase via env. var. MPICH_UNEX_BUFFER_SIZE
> 
> 
> which, we are told, means
> 
> "The application is sending too many short, unexpected messages to
> a particular receiver."
> 
> We have been advised that, to work around the problem, we should
> 
> "Increase the amount of memory for MPI buffering using the
> MPICH_UNEX_BUFFER_SIZE variable (default is 60 MB) and/or decrease
> the short message threshold using the MPICH_MAX_SHORT_MSG_SIZE
> variable (default is 128000 bytes).  You may want to set MPICH_DBMASK
> to 0x200 to get a traceback/coredump to learn where in the
> application this problem is occurring."
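
In shell terms this advice amounts to environment settings along
the following lines (the short-message threshold shown is only an
illustrative value, not a recommendation):

   # Raise the unexpected-message buffer above its 60 MB default
   export MPICH_UNEX_BUFFER_SIZE=480M

   # Lower the short-message (eager) cutoff from its 128000-byte
   # default; 64000 here is a hypothetical choice
   export MPICH_MAX_SHORT_MSG_SIZE=64000

   # Request a traceback/coredump at the point of failure
   export MPICH_DBMASK=0x200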
> 
> The question is: has anyone else had this problem?  The code
> worked without any problem on 500 cores, and given the size
> of the problem, we think we can get good scalability up to
> 3000 cores.  Does anyone have advice on what is happening,
> what values we should be using, and how dependent they are
> on the number of processors?
> 
> Cheers
> Alan
> 
> -----------------------------------
> Address: Alan Gadian, Environment, SEE, 
> Leeds University, Leeds LS2 9JT.  U.K.
> Email: alan at env.leeds.ac.uk. 
> http://www.env.leeds.ac.uk/~alan
> 
> Atmospheric Science Letters; the New Journal of R. Met. Soc. 
> Free Sample:-  http://www.interscience.wiley.com/asl-sample2007
> -----------------------------------
> 


