[Wrf-users] WRF Scaling

Brent Shaw brent.shaw at wni.com
Mon Apr 3 12:45:39 MDT 2006


Hi Brian,

You can pass this on to Brice.

I don't have any experience with the WRF-NMM, but I think my experience
with the WRF-ARW can be extrapolated.  

First, you did not say how big your WRF domain is.  It could be that 32
processors is already past the point of diminishing returns for
scalability.  With the ARW, once you have decomposed the horizontal
domain into tiles of fewer than 50x50 grid points each, adding more
processors generally doesn't give you much of a speedup.
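
For example (the numbers here are hypothetical, just to show the
arithmetic): a 200x160 grid-point domain decomposed over 16 processes
as 4x4 gives 50x40-point tiles, while 32 processes as 4x8 only shrinks
them to 50x20, so the extra 16 processors mostly buy you communication
overhead.  With the ARW you can set the decomposition explicitly in
namelist.input (I can't say whether the NMM honors these variables):

  &domains
   nproc_x = 4,   ! tiles in the west-east direction
   nproc_y = 8,   ! tiles in the south-north direction
  /

nproc_x times nproc_y has to equal the number of compute processes you
start with mpirun; leave them at the default of -1 and WRF picks the
decomposition itself.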

Second, you say that the system has a GigE interconnect.  Does it have a
single GigE interface on each box, or do you have dual GigE connections
on each node?  For MPI jobs, it is important to have a dedicated network
for the MPI traffic.  Preferably, for 16 nodes or more, you want
something with lower latency, like Myrinet.  But if you are stuck with
GigE, you should have one interface dedicated to the MPI traffic and
the other for general TCP/IP traffic.  Otherwise, your inter-node
communications compete with other traffic, causing a communications
bottleneck that only gets worse as you add more processors.
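
One common way to set up a dedicated MPI network with MPICH is to give
the second interface on each node its own hostname in /etc/hosts and
list those names in the machinefile, so the wrf.exe traffic rides that
network only.  The '-mpi' hostnames and the machinefile name below are
made up; use whatever names you assign to the second interface (listing
a host twice starts two processes on it):

  node01-mpi
  node01-mpi
  node02-mpi
  node02-mpi
  (and so on through node16-mpi)

  mpirun -np 32 -machinefile wrf-mpi.machines ./wrf.exe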

Also, even though the Intel Xeons have hyperthreading, my experience
with AMD dual-core processors suggests you should still run no more
than one WRF process per physical CPU.  So, if your compute nodes are
dual processor, you should only have 2 WRF processes per node.  If your
16 nodes are single-processor boxes, you should not be using more than
16 or 17 MPI processes (17 if you are using an I/O node, 16 if you
allow all processes to do their own I/O...I don't know whether the
quilted I/O option applies to the NMM or not).
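
If quilting does apply to your build, the ARW turns it on through the
&namelist_quilt section of namelist.input (I can't confirm that the
NMM reads this section):

  &namelist_quilt
   nio_tasks_per_group = 1,   ! processes dedicated to writing output
   nio_groups          = 1,
  /

With those settings you launch one extra MPI process beyond your
compute processes, and that one does nothing but collect and write the
output.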

Also, it looks like your run time is significantly longer when not using
the headnode.  Did I interpret this correctly?  Does your master node
have more RAM than the compute nodes?  Often, the node hosting the rank
0 MPI process has to have enough RAM to hold all of the fully
dimensioned arrays, and if you have limited RAM on the compute nodes,
you could be running into memory paging when one of your smaller nodes
holds rank 0.  On our cluster, we don't use the headnode for any part
of a WRF job, because all of the nodes generally need to be dedicated
to the WRF job for maximum efficiency, and our headnode does a lot of
management tasks for our cluster.
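
A quick way to test the paging theory is to log into whichever compute
node ends up as rank 0 during one of your -nolocal runs and watch
vmstat while wrf.exe is running:

  vmstat 5

If the si/so (swap-in/swap-out) columns stay non-zero for long
stretches, that node is paging, and the 1:36:56 wall time would make a
lot more sense.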

Let me know if you have any more questions about all of this.  

Regards,

Brent 

> -----Original Message-----
> From: wrf-users-bounces at ucar.edu 
> [mailto:wrf-users-bounces at ucar.edu] On Behalf Of 
> wrf-users-request at ucar.edu
> Sent: Monday, April 03, 2006 1:00 PM
> To: wrf-users at ucar.edu
> Subject: Wrf-users Digest, Vol 20, Issue 1
> 
> Send Wrf-users mailing list submissions to
> 	wrf-users at ucar.edu
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://mailman.ucar.edu/mailman/listinfo/wrf-users
> or, via email, send a message with subject or body 'help' to
> 	wrf-users-request at ucar.edu
> 
> You can reach the person managing the list at
> 	wrf-users-owner at ucar.edu
> 
> When replying, please edit your Subject line so it is more 
> specific than "Re: Contents of Wrf-users digest..."
> 
> 
> Today's Topics:
> 
>    1. mpirun giving unexpected results (Brian.Hoeth at noaa.gov)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 03 Apr 2006 12:01:14 -0500
> From: Brian.Hoeth at noaa.gov
> Subject: [Wrf-users] mpirun giving unexpected results
> To: wrf-users at ucar.edu
> Message-ID: <59e10599da.599da59e10 at noaa.gov>
> Content-Type: text/plain; charset=us-ascii
> 
> Hello,
> 
> The post below was sent to the online WRF Users Forum by one 
> of our software support group members (Brice), so I will just 
> cut and paste the post here to see if we get any replies here also.
> 
> Thanks,
> Brian Hoeth
> Spaceflight Meteorology Group
> Johnson Space Center
> Houston, TX
> 281-483-3246
> 
> 
> 
> The Spaceflight Meteorology Group here at Johnson Space Center has 
> recently acquired a small Linux-based cluster to run the WRF-NMM in 
> support of Space Shuttle operations. I am the software support lead 
> and have been running some 'bench' testing on the system. The results 
> of the tests have raised some questions that I would appreciate help 
> in answering.
> 
> I may not have the exact details of the configuration of the 
> model run 
> here, but the SMG folks will probably supply that if more information 
> is needed. The testing involved running the WRF-NMM at a 4km 
> resolution over an area around New Mexico, using the real data test 
> case, downloaded from the WRF-NMM user's site.
> 
> The cluster is composed of a head node with dual hyper-threading Intel
> Xeons at 3.2GHz and 16 subnodes with dual Intel Xeons at 3.2GHz.  All
> of the subnodes mount the head node's home drive.  Communication
> between the nodes is via Gigabit Ethernet.
> 
> The WRF-NMM package was installed using the PGI CDK 6.0, as were MPICH
> and netCDF.  One thing I ran into during the installation was a
> mismatch: I started out installing the supporting packages with the
> 32-bit PGI compilers, while the WRF build chose to install itself as
> 64-bit.  That was corrected, and all of the software packages
> associated with the model (MPICH, netCDF, real-nmm.exe and wrf.exe)
> are compiled with 64-bit support.  The head node is running RHEL AS
> 3.4 and the compute nodes are running RHEL WS 3.4.
> 
> Ok.  That's the basic background, to jump past all of those questions.
> Additional information: I have not tried any of the debugging tools
> yet; I am using /usr/bin/time -v to gather timing data; and I am not
> using any scheduling applications, such as OPENPBS, just mpirun and
> various combinations of machine and process files.  I have the time
> results and the actual command lines captured and can supply them if
> someone needs them.  The last bit of 'background' is that I am not a
> long-term cluster development programmer (20+ years programming in
> FORTRAN and other things, but not clusters), nor a heavy Linux
> administrator (though that is changing rapidly, and I have several
> years of experience in HPUX administration).  So now you know some
> measure of how many questions I will ask before I understand the
> answers I get ;-)  The SMG has had a Beowulf cluster for a couple of
> years, but my group was giving it minimal admin support.  So I, like
> any good programmer, am looking for 'prior art' and experience.
> 
> Here are some of the summarized results, and then I will get to the
> questions:
> 
> WRF-NMM run with 1 process on head node and 31 processes on subnodes
> 'mpirun -np 32 ./wrf.exe'
> 13:21.32 wall time (all times from the headnode perspective)
> 
> WRF-NMM run with 3 processes on head node and 32 processes on subnodes
> 'mpirun -p4pg PI-35proc ./wrf.exe'
> 13:53.70 wall time
> 
> WRF-NMM run with 1 process on head node and 15 processes on subnodes
> 'mpirun -np 16 ./wrf.exe'
> 14:09.29 wall time
> 
> WRF-NMM run with 1 process on head node and 7 processes on subnodes
> 'mpirun -np 8 ./wrf.exe'
> 20:08.88 wall time
> 
> WRF-NMM run with NO processes on head node and 16 processes on subnodes
> 'mpirun -np 16 -nolocal -machinefile wrf-16p.machines ./wrf.exe'
> 1:36:56 - an hour and a half of wall time
> 
> and finally, dual runs of the model with 1 process each on the head
> node and 15 processes pushed out to separate banks of the compute nodes
> 
> 'mpirun -np 16 -machinefile wrf-16p-plushead.machines ./wrf.exe'
> 17:27.70 wall time
> 'mpirun -np 16 -machinefile wrf-16p-test2.machines ./wrf.exe'
> 17:08.21 wall time
> 
> The results that raise questions are the minimal difference between 16
> and 32 processes (and, in fact, 8 processes), and the huge difference
> when no processes are put on the head node.  Taking the last case
> first, my thought, based on some web research, is that the difference
> between NFS and local writes could be influencing the time, but could
> it also be a shared memory issue?
> 
> Going back to the base issue of how the number of processes influences
> the run time: does anyone have other experiences with the scaling of
> the WRF to larger or smaller clusters (I did note one in an earlier
> post, but I am unsure what to make of the results at this point)?  And
> I did look at the graph that was referred to, but we are a much
> smaller shop than most of the tests there.  Can anybody suggest some
> tuning that might be useful, or a tool that would assist in gaining a
> better understanding of what is going on and what to expect if (when)
> the users expand their activities?
> 
> Pardon the length of this post, but I figured it was better to get out
> as many details up front as possible.
> 
> Thanks,
> 
> Brice 
> 
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users
> 
> 
> End of Wrf-users Digest, Vol 20, Issue 1
> ****************************************
> 


