[Wrf-users] The efficiency problem to run WRFV3.2.1 on a cluster with 8 nodes
Jaakko Hyvätti
jaakko.hyvatti at iki.fi
Thu Nov 18 03:54:19 MST 2010
Hi,
There are really a lot of variables that determine whether a job scales
well or awfully. Just some pointers to start with:
Often the bottleneck is memory bandwidth. On a multicore CPU, all cores
are trying to access memory via the same bus, and on some machines a
single memory bus serves all CPUs. It is best to have a good switching
memory system (like IBM) or a separate memory bus for each CPU (AMD). I do
not know the current state of the different chips.
CPU caching and cache sizes are important. Different domains scale
differently.
Check the CPU affinity of each process. If a thread switches between
cores, or worse, between CPUs or nodes, its cached data is lost. Pin the
threads to specific CPUs. This needs some coding; no options exist for this.
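
To illustrate what pinning means, here is a minimal sketch for Linux using
Python's os.sched_getaffinity and os.sched_setaffinity. It is not part of
WRF; the function name pin_to_cpu and the rank argument (for example an MPI
rank) are hypothetical, chosen only for this example.

```python
import os


def pin_to_cpu(rank, pid=0):
    """Pin process `pid` (0 = the calling process) to a single CPU.

    `rank` is a hypothetical per-process index (e.g. an MPI rank) used to
    spread processes over the CPUs this process is allowed to run on.
    Returns the chosen CPU id, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # the affinity calls are Linux-only
    allowed = sorted(os.sched_getaffinity(pid))  # CPUs we may currently use
    target = allowed[rank % len(allowed)]        # wrap around if rank is large
    os.sched_setaffinity(pid, {target})          # restrict to that one CPU
    return target


if __name__ == "__main__":
    cpu = pin_to_cpu(rank=0)
    if cpu is None:
        print("CPU affinity calls are not available on this platform")
    else:
        print("pinned to CPU", cpu, "affinity now", os.sched_getaffinity(0))
```

Once pinned, the scheduler can no longer move the process to another core,
so its cache contents stay useful between time slices.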
Regards,
Jaakko
On Wed, 17 Nov 2010, Andrew Porter wrote:
> Hi Feng,
>
>> I'm trying to run the parallel build of WRF with 2, 4, 8, or 16 processors on a Linux cluster with 8 nodes (each node has 2 quad-core CPUs). Runs get slower as the number of processors (np) increases! The model runs correctly on all nodes, but very slowly. When I switch to np=2, the model runs on the master node only and is faster. The overall simulation time is longer than for the single-node run... Is the problem associated with bandwidth? The network card? I have no idea. Has anyone experienced the same problem? Thanks.
>
> Is that built in dm or dm+sm mode and how large is your model domain?
>
> If each node on the cluster is dual quad-core then (assuming the job
> scheduler is sensible) you'll only have off-node MPI communications for
> the '16 processor' job (is that 16 MPI processes?). Therefore I doubt
> that the problem is interconnect related.
>
> Cheers,
>
> Andy.