[Wrf-users] WRF Benchmark, Recommended RAM per core

Sam Trahan samtrahan at samtrahan.com
Mon Dec 5 12:38:19 MST 2016


Hi,

A few tips on memory, threading, and domain sizes.


OMP_STACKSIZE - the most common cause of memory trouble when running 
OpenMP.  If an OpenMP application is having memory problems, this is the 
first thing you should check.  This environment variable sets the amount 
of stack memory reserved for each thread.  The default is usually far 
larger than you need, and WRF usually doesn't use much stack.

Experiment with smaller values.  This can give you a HUGE memory savings. 
Try OMP_STACKSIZE=128M and work down from there.  Note that the 
environment variable must be set on the processes that are actually 
running WRF; some MPI implementations require an extra option to forward 
or set the variable on the remote ranks.

   export OMP_STACKSIZE=128M  #  <---  bash/ksh/zsh/sh

   setenv OMP_STACKSIZE 128M  #  <---  tcsh/csh
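
How you forward the variable depends on your MPI launcher; a few common 
forms (the rank counts are only examples, check your own MPI's 
documentation):

   mpirun -np 24 -x OMP_STACKSIZE=128M ./wrf.exe          #  <---  Open MPI
   mpiexec -np 24 -genv OMP_STACKSIZE 128M ./wrf.exe      #  <---  Intel MPI / MPICH (Hydra)
   srun -n 24 --export=ALL,OMP_STACKSIZE=128M ./wrf.exe   #  <---  Slurm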


HYPERTHREADING - It is rare for hyper-threading to speed up anything 
(except FV3).  You can usually disable it without BIOS access, though the 
BIOS is the best way.  Also, some MPI implementations (such as Cray's) 
will let you turn hyperthreading on or off on a per-node basis with a 
simple runtime switch (e.g. aprun -j1).

Tips on turning off hyperthreading at runtime:

http://nicolas.limare.net/pro/notes/2014/11/26_ht_freq_scal/
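
On Linux you can also do this from sysfs without going through the BIOS 
(requires root; the CPU number below is only an example, so check your 
machine's sibling layout first):

   cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list  #  <---  which logical CPUs share core 0
   echo off > /sys/devices/system/cpu/smt/control                  #  <---  recent kernels: disable SMT everywhere
   echo 0 > /sys/devices/system/cpu/cpu24/online                   #  <---  or offline one sibling thread at a time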


NUMA NODES - Make sure all threads of a given rank share the same NUMA 
node.  Otherwise, memory access latency and bandwidth will take a huge 
hit.  Most MPI implementations will take care of this for you, but some 
require explicit binding options.

An explanation of NUMA nodes:

https://support.rackspace.com/how-to/numa-vnuma-and-cpu-scheduling/
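
The right binding options depend on your MPI and batch system.  As one 
hedged sketch with Open MPI plus OpenMP on a two-socket, 24-core node 
(rank and thread counts are only examples):

   export OMP_NUM_THREADS=6
   export OMP_PLACES=cores        #  <---  pin each thread to a core
   export OMP_PROC_BIND=close     #  <---  keep a rank's threads next to each other
   mpirun -np 4 --map-by ppr:2:socket --bind-to socket ./wrf.exe
   #  ^--- two ranks per socket; each rank's threads stay on that socket's NUMA node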


NON-UNIFORM MEMORY USAGE - Keep track of WHICH rank is having memory 
problems.  Frequently, certain ranks will need far more memory than 
others, such as the I/O servers and rank 0.  You can move those ranks to 
another machine without having to spread out all of the compute ranks. 
For example, the operational HWRF has one dedicated I/O server node with 
only 12 cores active, but uses all 24 cores on every other node.
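
With Open MPI, for example, you can express that kind of uneven layout 
with a hostfile (the host names here are placeholders):

   # machines.txt -- the node running the I/O servers gets fewer ranks
   ionode01  slots=12
   compute01 slots=24
   compute02 slots=24

   mpirun --hostfile machines.txt -np 60 ./wrf.exe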


L3 CACHE - Try to keep each domain patch small enough that a few 
variables fit in the L3 cache of each MPI rank.  For large grids, this 
may require so many MPI ranks that the communication overhead will 
dominate; it is a trade-off.  When calculating this, remember that the 
halos extend the patch size by ~2-5 points in each direction.
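
A quick back-of-the-envelope check (the patch dimensions are only an 
example):

   nx=50; ny=50; nz=60; halo=3                         #  <---  one rank's patch
   echo $(( (nx + 2*halo) * (ny + 2*halo) * nz * 4 ))  #  <---  bytes for one real*4 variable, ~0.7 MB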


DOMAIN SIZES AND COUNTS

The last tip is about nesting and resolution choices.

WRF's upscale feedback communication requires a five-dimensional 
transpose (IKJ-species-variable to K-species-variable-IJ), which is very 
expensive.  This means the more nests you have, the more the 
communication will dominate.  In simpler profilers, the delay will appear 
to be in a later halo or global communication, but the actual slowdown is 
in the upscale and downscale communication; the wait filters down to 
later halo or global communication on the parent ranks that are not 
receiving data.

In the ARW (but not the NMM), there is an extra step of allocating and 
deallocating temporary storage several times per timestep, per domain. 
Repeated deallocation and allocation reduces the memory used by grid 
storage by about 10% for a 3:1 nesting ratio.  This has a speed penalty, 
and the cost depends on your malloc implementation and kernel; it was 30% 
of the runtime for a 27:9 km resolution simulation on Power6 AIX because 
every malloc/allocate allocated a new memory page.
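
On Linux with glibc, if profiling points at allocation overhead, the 
malloc tunables can be set from the environment; the values below are 
only illustrative starting points, not WRF-specific recommendations:

   export MALLOC_MMAP_THRESHOLD_=16777216   #  <---  glibc: serve allocations below 16 MB from the heap, not mmap
   export MALLOC_TRIM_THRESHOLD_=16777216   #  <---  glibc: be less eager to return freed heap memory to the OS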

Hence, you may see better runtimes with a few large, high-resolution 
domains than with many levels of telescoping down from a low-resolution 
domain.  A few large domains will use more memory, though.

Sincerely,
Sam Trahan

On Mon, 5 Dec 2016, Abdullah Kahraman wrote:

> Dear Julio,
> 
> * The optimum amount of RAM is a function of your domain size. The more grid points you have, the more RAM you need (without resorting to virtual memory, which slows processing).
> * Hyperthreading doesn't help in WRF. Some even say it is slightly slower. You can disable it from the BIOS.
> 
> Best,
> Abdullah.
> 
> -- 
> Abdullah Kahraman 
> Meteorological Engineer
> Turkish State Meteorological Service 
> Istanbul, Turkey 
> web: http://www.meteogreen.com 
> twitter.com/meteogreen 
> 
> On Mon, Dec 5, 2016 at 6:00 PM, Julio Castro <jcastro at gac.cl> wrote:
>
>       Dear Users,
>
>       Do you have a rough estimate of the recommended GB of RAM per core for running WRF?
>
>       I have 24 cores (2 CPUs), 48 threads (whether or not to use hyperthreading is another question),
>       64 GB of RAM,
>       Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
>
>       I'm using 48 threads for 6 WRF runs (8 threads per run). My guess is that with 60% of the threads my machine almost runs out of RAM.
>
>       Many thanks
>
>       Julio Castro M.
>
>       Civil Engineer, MSc., PhD.
>
>       Head of the Modeling and Air Quality Area
>
>       Padre Mariano 103, Of. 307
>       Providencia, Santiago
>       Phone: +56 2 27195 610
>       Fax: +56 2 22351 100
>       www.gac.cl
>
> 
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users

