[Wrf-users] mpirun giving unexpected results
Brian.Hoeth at noaa.gov
Mon Apr 3 11:01:14 MDT 2006
Hello,
The post below was sent to the online WRF Users Forum by one of our
software support group members (Brice), so I will just cut and paste
the post here to see if we get any replies here also.
Thanks,
Brian Hoeth
Spaceflight Meteorology Group
Johnson Space Center
Houston, TX
281-483-3246
The Spaceflight Meteorology Group here at Johnson Space Center has
recently acquired a small Linux-based cluster to run the WRF-NMM in
support of Space Shuttle operations. I am the software support lead
and have been running some 'bench' testing on the system. The results
of the tests have raised some questions that I would appreciate help
in answering.
I may not have the exact details of the configuration of the model run
here, but the SMG folks will probably supply that if more information
is needed. The testing involved running the WRF-NMM at a 4km
resolution over an area around New Mexico, using the real data test
case, downloaded from the WRF-NMM user's site.
The cluster is composed of a head node with dual hyper-threading Intel
Xeons at 3.2 GHz and 16 subnodes with dual Intel Xeons at 3.2 GHz. All
of the subnodes mount the head node's home directory. Communication
between the nodes is via Gigabit Ethernet.
The WRF-NMM package was installed using the PGI CDK 6.0, as were MPICH
and netCDF. One problem I ran into during installation was a mismatch:
I started out building the supporting packages with the 32-bit PGI
compilers, while the WRF build chose to install itself as 64-bit. That
was corrected, and all of the software packages associated with the
model (MPICH, netCDF, real-nmm.exe and wrf.exe) are now compiled with
64-bit support. The head node is running RHEL AS 3.4 and the compute
nodes are running RHEL WS 3.4.
Ok, that's the basic background to jump past all of those questions.
Additional information: I have not tried any of the debugging tools
yet; I am using /usr/bin/time -v to gather timing data; and I am not
using any scheduling applications, such as OpenPBS, just mpirun and
various combinations of machine and process files. I have captured the
time results and the actual command lines and can supply those if
someone needs them. The last bit of 'background' is that I am not a
long-term cluster programmer (20+ years of programming in FORTRAN and
other things, but not clusters), nor a heavy Linux administrator
(though that is changing rapidly, and I have several years of
experience in HP-UX administration). So now you know some measure of
how many questions I will ask before I understand the answers I get
;-) The SMG has had a Beowulf cluster for a couple of years, but my
group was giving it minimal admin support. So I, like any good
programmer, am looking for 'prior art' and experience.
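In case it helps anyone comment on the runs, the machine files I
mention are plain host lists read by MPICH's mpirun. This is only a
sketch of what a 16-process file such as wrf-16p.machines might
contain; the node names sub01..sub08 are hypothetical, not our actual
hostnames.

```shell
# Hypothetical wrf-16p.machines: one host per line; MPICH's ch_p4
# device also accepts "hostname:n" to start n processes on that host.
sub01:2
sub02:2
sub03:2
sub04:2
sub05:2
sub06:2
sub07:2
sub08:2
```

Each run was then wrapped in the timing command, e.g.
/usr/bin/time -v mpirun -np 16 -machinefile wrf-16p.machines ./wrf.exe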
Here are some of the summarized results, and then I will get to the
questions:
WRF-NMM run with 1 process on head node and 31 processes on subnodes
'mpirun -np 32 ./wrf.exe'
13:21.32 wall time (all times from the headnode perspective)
WRF-NMM run with 3 processes on head node and 32 processes on subnodes
'mpirun -p4pg PI-35proc ./wrf.exe'
13:53.70 wall time
WRF-NMM run with 1 process on head node and 15 processes on subnodes
'mpirun -np 16 ./wrf.exe'
14:09.29 wall time
WRF-NMM run with 1 process on head node and 7 processes on subnodes
'mpirun -np 8 ./wrf.exe'
20:08.88 wall time
WRF-NMM run with NO processes on head node and 16 processes on subnodes
'mpirun -np 16 -nolocal -machinefile wrf-16p.machines ./wrf.exe'
1:36:56 - an hour and a half of wall time
and finally, dual runs of the model with 1 process each on the head
node and 15 processes pushed out to separate banks of the compute nodes
'mpirun -np 16 -machinefile wrf-16p-plushead.machines ./wrf.exe'
17:27.70 wall time
'mpirun -np 16 -machinefile wrf-16p-test2.machines ./wrf.exe'
17:08.21 wall time
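For anyone puzzling over the -p4pg run above: MPICH's ch_p4 device
reads a "procgroup" file with one line per host of the form "hostname
nprocs /path/to/executable", where the count on the first ("local")
line is the number of additional processes beyond the one mpirun
itself starts. A file like PI-35proc (3 head-node processes plus 32 on
the subnodes) would therefore look roughly like this; the hostnames
and the path are illustrative, not the actual file contents.

```shell
# Hypothetical procgroup file: 3 local + 32 remote wrf.exe processes.
local 2 /home/user/WRFV2/run/wrf.exe
sub01 2 /home/user/WRFV2/run/wrf.exe
sub02 2 /home/user/WRFV2/run/wrf.exe
# ... one line each for sub03 through sub16 ...
```

It is then invoked as in the run above: mpirun -p4pg PI-35proc ./wrf.exe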
The results that raise questions are the minimal difference between 16
and 32 processes (and, in fact, 8 processes), and the huge difference
when putting no processes on the head node. Taking the last case
first, my thought, based on some web research, is that the difference
between NFS and local writes could be influencing the time, but could
it instead be a shared-memory issue?
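One cheap way to probe the NFS-versus-local-write idea is to time a
bulk write to each filesystem with dd. A minimal sketch follows; the
assumption that the NFS-mounted area is the home directory under
$HOME reflects my cluster, not necessarily yours.

```shell
# Write a 256 MB file to local disk and let dd report throughput;
# conv=fsync makes the timing include the flush to disk rather than
# just the page cache.
dd if=/dev/zero of=/tmp/ddtest.local bs=1M count=256 conv=fsync
# Repeat against the NFS-mounted home directory (path is an assumption):
# dd if=/dev/zero of=$HOME/ddtest.nfs bs=1M count=256 conv=fsync
```

Running the same command from a subnode against the NFS-mounted home
directory and comparing the MB/s figures dd prints (remember to remove
the test files afterwards) would show whether NFS writes are slow
enough to explain the -nolocal result.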
Going back to the base issue of how the number of processes influences
run time: does anyone have other experiences with the scaling of WRF
to larger or smaller clusters (I did note one in an earlier post, but
I am unsure what to make of the results at this point)? I did look at
the graph that was referred to, but we are a much smaller shop than
most of the tests there. Can anybody suggest some tuning that might be
useful, or a tool that would assist in gaining a better understanding
of what is going on and what to expect if (when) the users expand
their activities?
Pardon the length of this post, but I figured it was better to get out
as many details up front as possible.
Thanks,
Brice