I have run into these kinds of issues a number of times. In one case, it was a buggy implementation of MPI, in the scatterv() call, and switching to Open MPI fixed the problem. In other cases, there were simply bad nodes on the machine. My own theory (which may be completely wrong) is that these hangs very frequently occur while the master task is scattering data to all the slaves. This seems to be a good operation for stressing MPI and/or node communications. I have found that these kinds of problems are often (but not always) intermittent, and sometimes reducing the number of tasks will get it running (presumably because you're not stressing the underlying software and hardware infrastructure as much).<div>
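Because a hang of this kind produces no error and the job simply sits in the queue until the walltime expires, one way to catch it early is to check whether the rsl.* files are still being written. The following is only a minimal sketch under stated assumptions: the function names, the `rsl.*` glob pattern, and the idea of a fixed "quiet" threshold are all illustrative choices and are not part of WRF or of any scheduler.

```python
import time
from pathlib import Path


def newest_mtime(run_dir, pattern="rsl.*"):
    """Return the latest modification time among matching log files, or None."""
    mtimes = [p.stat().st_mtime for p in Path(run_dir).glob(pattern)]
    return max(mtimes) if mtimes else None


def is_stalled(run_dir, max_quiet_seconds, now=None, pattern="rsl.*"):
    """Return True if log files exist and none was modified recently.

    max_quiet_seconds is a hypothetical threshold chosen by the user; a
    legitimately slow step (e.g. reading a large input file) can also be quiet
    for a while, so a True result is a hint, not proof of a hang.
    """
    latest = newest_mtime(run_dir, pattern)
    if latest is None:
        # No log files yet: too early to say anything about the run.
        return False
    if now is None:
        now = time.time()
    return (now - latest) > max_quiet_seconds
```

A job script could call `is_stalled(".", 900)` every few minutes and cancel or resubmit the job (perhaps with fewer tasks) when it returns True, rather than waiting for the walltime to run out. The 900-second value is arbitrary and would need tuning for the domain size and I/O speed of the machine.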
<br></div><div>To date, I've never found these to be "WRF" problems.</div><div><br></div><div>Good luck!</div><div><br></div><div>Don<br><br><div class="gmail_quote">On Fri, Mar 25, 2011 at 11:19 PM, Jatin Kala <span dir="ltr">&lt;<a href="mailto:J.Kala@murdoch.edu.au">J.Kala@murdoch.edu.au</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div lang="EN-AU" link="blue" vlink="purple">
<div>
<p class="MsoNormal"><span style="color:#1F497D">Thanks for the suggestion Feng,
but this is not related to namelist inputs. The namelist I am running worked
fine on a different machine.</span></p>
<p class="MsoNormal"><span style="color:#1F497D">The issue here is that WRF
simply hangs and does nothing at initialisation of Grid 2, i.e., the rsl.out and
rsl.error files print out:</span></p><div class="im">
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 alloc_space_field: domain 2, 84045408 bytes allocated</p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 alloc_space_field: domain 2, 3084672 bytes allocated</p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 *** Initializing nest domain # 2 from an input file. ***</p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 med_initialdata_input: calling input_input</p>
<p class="MsoNormal"> </p>
</div><p class="MsoNormal"><span style="color:#1F497D">and that’s it. The
rsl.error and rsl.out files do not keep growing in size, there are no more
prints, they just stop printing stuff. The job however is still in the queue
and does NOT error out, until the walltime is elapsed. No wrfout_d0* files are
created. </span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D">Other people seem to have had
this issue before:</span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D"><a href="http://mailman.ucar.edu/pipermail/wrf-users/2010/001749.html" target="_blank">http://mailman.ucar.edu/pipermail/wrf-users/2010/001749.html</a>
</span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D"><a href="http://mailman.ucar.edu/pipermail/wrf-users/2010/001747.html" target="_blank">http://mailman.ucar.edu/pipermail/wrf-users/2010/001747.html</a>
</span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D">Any help is more than welcome.</span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D">Regards,</span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D">Jatin </span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span></p>
<div>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt">From:</span></b><span lang="EN-US" style="font-size:10.0pt"> Feng Liu [mailto:<a href="mailto:FLiu@azmag.gov" target="_blank">FLiu@azmag.gov</a>] <br>
<b>Sent:</b> Saturday, 26 March 2011 9:04 AM<br>
<b>To:</b> Jatin Kala; <a href="mailto:wrf-users@ucar.edu" target="_blank">wrf-users@ucar.edu</a><br>
<b>Subject:</b> RE: WRF is "hanging"</span></p>
</div>
</div><div><div></div><div class="h5">
<p class="MsoNormal"> </p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Hi Jatin,</span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">I do not know exactly
what is wrong in your case, but one thing you can try is to reduce time_step
in namelist.input by a factor of 3. Good luck.</span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Feng</span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"> </span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"> </span></p>
<div>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt">From:</span></b><span lang="EN-US" style="font-size:10.0pt"> <a href="mailto:wrf-users-bounces@ucar.edu" target="_blank">wrf-users-bounces@ucar.edu</a>
[mailto:<a href="mailto:wrf-users-bounces@ucar.edu" target="_blank">wrf-users-bounces@ucar.edu</a>] <b>On Behalf Of </b>Jatin Kala<br>
<b>Sent:</b> Thursday, March 24, 2011 7:29 PM<br>
<b>To:</b> <a href="mailto:wrf-users@ucar.edu" target="_blank">wrf-users@ucar.edu</a><br>
<b>Subject:</b> [Wrf-users] WRF is "hanging"</span></p>
</div>
</div>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal">Dear WRF-users,</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">I have compiled WRF3.2 on our new supercomputing facility,
and am having some trouble. Namely, WRF is just “hanging” at:</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 alloc_space_field: domain 2, 84045408 bytes allocated</p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 alloc_space_field: domain 2, 3084672 bytes allocated</p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 *** Initializing nest domain # 2 from an input file. ***</p>
<p class="MsoNormal">d01 2009-10-01_00:00:00 med_initialdata_input: calling input_input</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">The job remains in the queue, i.e., it does not error out until
the walltime has elapsed.</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">I have compiled with -O0 but that did not help. I have
also compiled with the updated “gen_allocs.c” from the WRF website,
but that has not helped either. I did do a “clean -a” before.</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">I have compiled WRF with the following libs:</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">intel-compilers/2011.1.107</p>
<p class="MsoNormal">jasper/1.900.1</p>
<p class="MsoNormal">ncarg/5.2.1</p>
<p class="MsoNormal">mpi/intel/openmpi/1.4.2-qlc</p>
<p class="MsoNormal">netcdf/4.0.1/intel-2011.1.107</p>
<p class="MsoNormal">export WRFIO_NCD_LARGE_FILE_SUPPORT=1</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Any help would be greatly appreciated!</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Kind regards,</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Jatin </p>
</div></div></div>
</div>
<br></blockquote></div><br><br clear="all"><br>-- <br><div>Voice: 907 450 8679</div>Arctic Region Supercomputing Center<br><a href="http://weather.arsc.edu/" target="_blank">http://weather.arsc.edu/</a><div><a href="http://www.arsc.edu/~morton/" target="_blank">http://www.arsc.edu/~morton/</a></div>
<br>
</div>