[Wrf-users] mpirun giving unexpected results
Brian.Hoeth at noaa.gov
Mon Apr 3 11:01:14 MDT 2006
Hello,
The post below was sent to the online WRF Users Forum by one of our
software support group members (Brice), so I will just cut and paste
the post here to see if we get any replies here also.
Thanks,
Brian Hoeth
Spaceflight Meteorology Group
Johnson Space Center
Houston, TX
281-483-3246
The Spaceflight Meteorology Group here at Johnson Space Center has
recently acquired a small Linux-based cluster to run the WRF-NMM in
support of Space Shuttle operations. I am the software support lead
and have been running some 'bench' testing on the system. The results
of the tests have raised some questions that I would appreciate help
in answering.
I may not have the exact details of the configuration of the model run
here, but the SMG folks will probably supply that if more information
is needed. The testing involved running the WRF-NMM at a 4km
resolution over an area around New Mexico, using the real data test
case, downloaded from the WRF-NMM user's site.
The cluster is composed of a head node with dual hyper-threading Intel
Xeons at 3.2 GHz and 16 subnodes with dual Intel Xeons at 3.2 GHz. All
of the subnodes mount the head node's home directory. Communication
between the nodes is via Gigabit Ethernet.
The WRF-NMM package was installed using the PGI CDK 6.0, as were MPICH
and netCDF. One problem I ran into during installation was a mismatch:
I started out building the supporting packages with the 32-bit PGI
compilers, while the WRF build chose to install itself as 64-bit. That
was corrected, and all of the software packages associated with the
model (MPICH, netCDF, real-nmm.exe and wrf.exe) are now compiled with
64-bit support. The head node is running RHEL AS 3.4 and the compute
nodes are running RHEL WS 3.4.
Ok, that's the basic background to jump past all of those questions.
Additional information: I have not tried any of the debugging tools
yet; I am using /usr/bin/time -v to gather timing data; and I am not
using any scheduling applications, such as OpenPBS, just mpirun and
various combinations of machine and process files. I have captured the
time results and the actual command lines and can supply those if
someone needs them. The last bit of 'background' is that I am not a
long-term cluster programmer (20+ years of programming in FORTRAN and
other things, but not clusters), nor a heavy Linux administrator
(though that is changing rapidly, and I have several years of
experience in HP-UX administration). So now you know some measure of
how many questions I will ask before I understand the answers I get
;-) The SMG has had a Beowulf cluster for a couple of years, but my
group was giving it minimal admin support. So I, like any good
programmer, am looking for 'prior art' and experience.
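In case it helps anyone comment on the runs, the machine files I
mention are plain host lists read by MPICH's mpirun. This is only a
sketch of what a 16-process file such as wrf-16p.machines might
contain; the node names sub01..sub08 are hypothetical, not our actual
hostnames.

```shell
# Hypothetical wrf-16p.machines: one host per line; MPICH's ch_p4
# device also accepts "hostname:n" to start n processes on that host.
sub01:2
sub02:2
sub03:2
sub04:2
sub05:2
sub06:2
sub07:2
sub08:2
```

Each run was then wrapped in the timing command, e.g.
/usr/bin/time -v mpirun -np 16 -machinefile wrf-16p.machines ./wrf.exe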
Here are some of the summarized results, and then I will get to the
questions:
WRF-NMM run with 1 process on head node and 31 processes on subnodes
'mpirun -np 32 ./wrf.exe'
13:21.32 wall time (all times from the headnode perspective)
WRF-NMM run with 3 processes on head node and 32 processes on subnodes
'mpirun -p4pg PI-35proc ./wrf.exe'
13:53.70 wall time
WRF-NMM run with 1 process on head node and 15 processes on subnodes
'mpirun -np 16 ./wrf.exe'
14:09.29 wall time
WRF-NMM run with 1 process on head node and 7 processes on subnodes
'mpirun -np 8 ./wrf.exe'
20:08.88 wall time
WRF-NMM run with NO processes on head node and 16 processes on subnodes
'mpirun -np 16 -nolocal -machinefile wrf-16p.machines ./wrf.exe'
1:36:56 - an hour and a half of wall time
and finally, dual runs of the model with 1 process each on the head
node and 15 processes pushed out to separate banks of the compute nodes
'mpirun -np 16 -machinefile wrf-16p-plushead.machines ./wrf.exe'
17:27.70 wall time
'mpirun -np 16 -machinefile wrf-16p-test2.machines ./wrf.exe'
17:08.21 wall time
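For anyone puzzling over the -p4pg run above: MPICH's ch_p4 device
reads a "procgroup" file with one line per host of the form "hostname
nprocs /path/to/executable", where the count on the first ("local")
line is the number of additional processes beyond the one mpirun
itself starts. A file like PI-35proc (3 head-node processes plus 32 on
the subnodes) would therefore look roughly like this; the hostnames
and the path are illustrative, not the actual file contents.

```shell
# Hypothetical procgroup file: 3 local + 32 remote wrf.exe processes.
local 2 /home/user/WRFV2/run/wrf.exe
sub01 2 /home/user/WRFV2/run/wrf.exe
sub02 2 /home/user/WRFV2/run/wrf.exe
# ... one line each for sub03 through sub16 ...
```

It is then invoked as in the run above: mpirun -p4pg PI-35proc ./wrf.exe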
The results that raise questions are the minimal difference between 16
and 32 processes (and, in fact, 8 processes), and the huge difference
when putting no processes on the head node. Taking the last case
first, my thought, based on some web research, is that the difference
between NFS and local writes could be influencing the time, but could
it instead be a shared-memory issue?
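One cheap way to probe the NFS-versus-local-write idea is to time a
bulk write to each filesystem with dd. A minimal sketch follows; the
assumption that the NFS-mounted area is the home directory under
$HOME reflects my cluster, not necessarily yours.

```shell
# Write a 256 MB file to local disk and let dd report throughput;
# conv=fsync makes the timing include the flush to disk rather than
# just the page cache.
dd if=/dev/zero of=/tmp/ddtest.local bs=1M count=256 conv=fsync
# Repeat against the NFS-mounted home directory (path is an assumption):
# dd if=/dev/zero of=$HOME/ddtest.nfs bs=1M count=256 conv=fsync
```

Running the same command from a subnode against the NFS-mounted home
directory and comparing the MB/s figures dd prints (remember to remove
the test files afterwards) would show whether NFS writes are slow
enough to explain the -nolocal result.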
Going back to the base issue of how the number of processes influences
run time: does anyone have other experiences with the scaling of WRF
to larger or smaller clusters (I did note one in an earlier post, but
I am unsure what to make of the results at this point)? I did look at
the graph that was referred to, but we are a much smaller shop than
most of the tests there. Can anybody suggest some tuning that might be
useful, or a tool that would assist in gaining a better understanding
of what is going on and what to expect if (when) the users expand
their activities?
Pardon the length of this post, but I figured it was better to get out
as many details up front as possible.
Thanks,
Brice