[Wrf-users] The efficiency problem to run WRFV3.2.1 on a cluster with 8 nodes

Thu Nov 18 03:54:19 MST 2010

Hi,

There are really a lot of variables that determine whether a job scales 
well or badly.  Just some pointers to start with:

Often the bottleneck is memory bandwidth.  On a multicore CPU, all cores 
access memory over the same bus, and on some machines a single memory bus 
serves all CPUs.  It is best to have a good switched memory system (like 
IBM) or a separate memory bus for each CPU (AMD).  I do not know the 
current state of the different chips.

CPU caching and cache sizes are important.  Each process's share of a 
domain may or may not fit in cache, so different domains scale 
differently.

Check the CPU affinity of each process.  If a thread migrates between 
cores, or worse, between CPUs or nodes, its cache contents are lost.  Pin 
each thread to a specific CPU.  This needs some coding; no options exist 
for it.
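
To illustrate the pinning idea, here is a minimal sketch in Python, 
assuming a Linux system (the os.sched_getaffinity and os.sched_setaffinity 
calls are Linux-only).  The helper name pin_to_cpu and the choice of 
index 0 are made up for the example; a real MPI job would probably derive 
the index from its local rank, and WRF itself would need the equivalent 
call in its own code or launch script.

```python
import os


def pin_to_cpu(cpu_index):
    """Pin the calling process to one CPU taken from its allowed set.

    cpu_index is a hypothetical per-process index (for example a local
    MPI rank); it is wrapped around the number of CPUs actually allowed.
    """
    allowed = sorted(os.sched_getaffinity(0))  # pid 0 = calling process
    target = allowed[cpu_index % len(allowed)]
    os.sched_setaffinity(0, {target})
    return target


if __name__ == "__main__":
    original = os.sched_getaffinity(0)
    print("allowed before pinning:", sorted(original))
    cpu = pin_to_cpu(0)
    print("pinned to CPU", cpu, "-> allowed now:", sorted(os.sched_getaffinity(0)))
    os.sched_setaffinity(0, original)  # restore the original mask
```

Once pinned, the scheduler can no longer move the process to another 
core, so whatever it has in that core's cache stays useful.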

Regards,
Jaakko

On Wed, 17 Nov 2010, Andrew Porter wrote:
> Hi Feng,
>
>> I'm trying to run the parallelized version of the WRF model with 2, 4, 8, or 16 processors on a Linux cluster with 8 nodes (each node has 2 quad-core CPUs). Runs get slower as the number of processors (np) increases! The model runs correctly on all nodes, but very slowly. When I switch to np=2, the model runs on the master node only and is faster. The overall simulation time is longer than for the single-node run... Is the problem associated with bandwidth? The network card? I have no idea. Has anyone experienced the same problem? Thanks.
>
> Is that built in dm or dm+sm mode and how large is your model domain?
>
> If each node on the cluster is dual quad-core then (assuming the job
> scheduler is sensible) you'll only have off-node MPI communications for
> the '16 processor' job (is that 16 MPI processes?). Therefore I doubt
> that the problem is interconnect related.
>
> Cheers,
>
> Andy.