[Wrf-users] WRF Benchmark, Recommended RAM per core
Sam Trahan
samtrahan at samtrahan.com
Mon Dec 5 12:38:19 MST 2016
Hi,
A few tips on memory, threading, and domain sizes.
OMP_STACKSIZE - the most common source of memory problems when running
OpenMP. If an OpenMP application is having memory problems, this is the
first thing you should check. This environment variable sets the amount of
stack memory reserved per thread. The default is usually far larger than
you need, and WRF usually doesn't use much stack.
Experiment with smaller values; this can give you HUGE memory savings.
Try OMP_STACKSIZE=128M and work down from there. Note that the
environment variable must be set in the environment of the processes that
are actually running WRF; some MPI implementations require an extra option
to forward or set that variable on the remote ranks.
export OMP_STACKSIZE=128M # <--- bash/ksh/zsh/sh
setenv OMP_STACKSIZE 128M # <--- tcsh/csh
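If your MPI launcher needs to forward the variable to the ranks, something
like the following works (a sketch only; the rank counts are arbitrary and
the flags differ between MPI implementations, so check your mpirun/mpiexec
documentation):
mpirun -np 24 -x OMP_STACKSIZE=128M ./wrf.exe # <--- Open MPI: -x exports the variable to every rank
mpiexec -np 24 -genv OMP_STACKSIZE 128M ./wrf.exe # <--- Intel MPI: -genv sets it for every rank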
HYPERTHREADING - It is rare for hyper-threading to speed up anything
(except FV3). You can usually disable it without BIOS access, though the
BIOS is the best place to do it. Also, some MPI implementations (such as
Cray's) let you turn hyper-threading on or off on a per-node basis with a
simple runtime switch (e.g. aprun -j1).
Tips on turning off hyperthreading at runtime:
http://nicolas.limare.net/pro/notes/2014/11/26_ht_freq_scal/
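One runtime approach on Linux (a sketch only, run as root; CPU numbering
varies by machine, so check lscpu and the sibling lists first) is to
offline the second hardware thread of each core through sysfs:
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  siblings=$(cat $cpu/topology/thread_siblings_list) # e.g. "0,24" or "0-1"
  first=${siblings%%[,-]*}                           # lowest CPU id in that core
  [ "$cpu" != "/sys/devices/system/cpu/cpu$first" ] && echo 0 > $cpu/online
done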
NUMA NODES - Make sure all threads of a given rank share the same NUMA
node. Otherwise, memory access latency and bandwidth will take a huge
hit. Most MPI implementations will take care of this for you. Some
implementations may require customization.
An explanation of NUMA nodes:
https://support.rackspace.com/how-to/numa-vnuma-and-cpu-scheduling/
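For example, to give each rank one socket's worth of cores and keep its
threads there (a sketch only; these options are specific to Open MPI and
Intel MPI respectively, and assume one rank per socket with
OMP_NUM_THREADS set to the number of cores per socket):
mpirun -np 2 --map-by socket --bind-to socket ./wrf.exe # <--- Open MPI
export I_MPI_PIN_DOMAIN=socket; mpiexec -np 2 ./wrf.exe # <--- Intel MPI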
NON-UNIFORM MEMORY USAGE - Keep track of WHICH rank is having memory
problems. Frequently, certain ranks will need far more memory than
others, such as the I/O servers and rank 0. You can move those ranks to
another machine without having to spread out all of the compute ranks.
For example, the operational HWRF has one dedicated I/O server node with
only 12 of its cores active, but uses all 24 cores on every other node.
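One way to get that kind of layout (a sketch only; the hostnames are made
up, and which ranks actually land on the lightly loaded node depends on
your MPI's rank-to-node mapping and on where WRF places its I/O server
ranks) is an MPI hostfile that gives that node fewer slots, e.g. with
Open MPI:
# my_hostfile
node1 slots=12 # node for the memory-heavy ranks; half its cores stay idle
node2 slots=24
node3 slots=24
mpirun -np 60 --hostfile my_hostfile ./wrf.exe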
L3 CACHE - Try to keep each MPI rank's domain patch small enough that a
few variables fit in L3 cache. For large grids, this may require so many
MPI ranks that the communication overhead dominates; it is a trade-off.
When calculating this, remember that the halos extend the patch size by
roughly 2-5 points in each direction.
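As a back-of-the-envelope check (these numbers are made up for
illustration): a 1000x1000x50 grid split across 100 MPI ranks gives
roughly 100x100-point patches, and with a 5-point halo on every side a
single-precision 3-D field per rank is about 2.4 MB, so a handful of such
fields fit in an L3 cache of a few tens of MB:
echo $(( (100+10) * (100+10) * 50 * 4 )) # <--- bytes per real*4 3-D field, ~2.4 MB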
DOMAIN SIZES AND COUNTS
The last tip is about nesting and resolution choices.
WRF's upscale feedback communication requires a five-dimensional transpose
(IKJ-species-variable to K-species-variable-IJ), which is very expensive.
This means that the more nests you have, the more the communication will
dominate. In simpler profilers, the delay will appear to be in a later
halo or global communication, but the actual slowdown is in the upscale
and downscale communication; the wait simply shows up at later halos or
global communications on the parent ranks that are not receiving data.
In the ARW (but not the NMM), there is an extra step of allocating and
reallocating temporary storage several times per timestep, per domain.
This repeated deallocation and reallocation reduces the memory used by
grid storage by about 10% for a 3:1 nesting ratio, but it carries a speed
penalty whose cost depends on your malloc implementation and kernel. It
was 30% of the runtime for a 27:9 km resolution simulation on Power6 AIX,
because every malloc/allocate allocated a new memory page.
Hence, you may see better runtimes with a few large, high-resolution
domains than with many levels of telescoping down from a low-resolution
domain. A few large domains will have higher memory usage, though.
Sincerely,
Sam Trahan
On Mon, 5 Dec 2016, Abdullah Kahraman wrote:
> Dear Julio,
>
> * The optimum amount of RAM is a function of your domain size. The more grid points you have, the more RAM you need (without resorting to virtual memory,
> which would slow processing).
> * Hyperthreading doesn't help in WRF; some even say it is slightly slower. You can disable it in the BIOS.
>
> Best,
> Abdullah.
>
> --
> Abdullah Kahraman
> Meteorological Engineer
> Turkish State Meteorological Service
> Istanbul, Turkey
> web: http://www.meteogreen.com
> twitter.com/meteogreen
>
> On Mon, Dec 5, 2016 at 6:00 PM, Julio Castro <jcastro at gac.cl> wrote:
>
> Dear Users,
>
>
> Do you have a rough estimate of the recommended GB of RAM per core for running WRF?
>
>
>
> I have 24 cores (2 CPUs) and 48 threads (whether or not to use hyper-threading is another question)
>
> 64 GB RAM
>
> Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
>
>
> I'm using 48 threads across 6 WRF runs (8 threads per run). My guess is that with 60% of the threads in use my machine almost runs out of RAM.
>
>
>
> Many thanks
>
>
>
> Julio Castro M.
>
> Ingeniero Civil, MSc., PhD.
>
> Jefe del Área de Modelación y Calidad del Aire
>
>
> Padre Mariano 103, Of. 307
> Providencia, Santiago
> Fono: +56 2 27195 610
> Fax: +56 2 22351 100
> www.gac.cl
>
>
>
>
>
>
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>
>
>
>