Looks to me like OpenMPI is not installed on all the compute nodes within your cluster. Note the line:

bash: orted: command not found

which says bash can't run orted because the OpenMPI daemon executable doesn't exist on the compute node.

You need to install OpenMPI on all the nodes. It looks like you're using a RocksCluster.org cluster (based on the naming of the compute nodes). If so, you could install OpenMPI in /share/apps/openmpi (or somewhere similar under /share/apps), since everything in /share/apps is shared via NFS to all the nodes in the cluster. Alternatively, you could create an RPM of the OpenMPI bits and install that RPM on every node in the cluster.

When running OpenMPI's configure, you could use something like this:

# configure --with-tm=/opt/lsf --prefix=/share/apps/openmpi

where you'll have to adjust the path to LSF to be the real path.

When you run WRF using a properly installed orterun, you won't have to specify -np or -hostfile. Just "orterun real.exe" or "orterun wrf.exe".
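If it helps, here is a rough sketch of what that install might look like. The Open MPI version, the LSF location (/opt/lsf), and the slot count below are placeholders for illustration; adjust them to your actual setup. Also, depending on your Open MPI version, the LSF integration flag may be --with-lsf rather than --with-tm (which targets Torque/PBS), so check ./configure --help first:

# tar xzf openmpi-1.4.3.tar.gz && cd openmpi-1.4.3
# ./configure --prefix=/share/apps/openmpi --with-lsf=/opt/lsf
# make all install

Then make sure the compute nodes see the new install in their non-interactive shell PATH (on a Rocks cluster the home directories are NFS-shared, so ~/.bashrc works), and verify that a remote bash can actually find the daemon:

# echo 'export PATH=/share/apps/openmpi/bin:$PATH' >> ~/.bashrc
# echo 'export LD_LIBRARY_PATH=/share/apps/openmpi/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
# ssh compute-02-01 which orted
/share/apps/openmpi/bin/orted

Once that works, an LSF submission needs nothing more than something like:

# bsub -n 16 -o wrf.%J.out orterun wrf.exe

because orterun gets its host list and process count from LSF itself.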
Just "orterun real.exe" or "orterun wrf.exe".<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Bart Brashers<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> Ahsan Ali [mailto:ahsanshah01@gmail.com] <br><b>Sent:</b> Tuesday, April 05, 2011 11:57 PM<br><b>To:</b> Bart Brashers<br><b>Subject:</b> WRF Problem running in Parallel<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Dear Bart<o:p></o:p></p><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal> It gives following error for each command. We have LSF installed but am not sure how to integrate WRF with LSF.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p><div><div><p class=MsoNormal>[root@pmd02 em_real]# orterun -np 4 -hostfile hosts.txt real.exe <o:p></o:p></p></div><div><p class=MsoNormal>bash: orted: command not found<o:p></o:p></p></div><div><p class=MsoNormal>--------------------------------------------------------------------------<o:p></o:p></p></div><div><p class=MsoNormal>A daemon (pid 13139) died unexpectedly with status 127 while attempting<o:p></o:p></p></div><div><p class=MsoNormal>to launch so we are aborting.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>There may be more information reported by the environment (see above).<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>This may be because the daemon was unable to find all the needed shared<o:p></o:p></p></div><div><p class=MsoNormal>libraries on the remote node. You may set your LD_LIBRARY_PATH to have the<o:p></o:p></p></div><div><p class=MsoNormal>location of the shared libraries on the remote nodes and this will<o:p></o:p></p></div><div><p class=MsoNormal>automatically be forwarded to the remote nodes.<o:p></o:p></p></div><div><p class=MsoNormal>--------------------------------------------------------------------------<o:p></o:p></p></div><div><p class=MsoNormal>--------------------------------------------------------------------------<o:p></o:p></p></div><div><p class=MsoNormal>orterun noticed that the job aborted, but has no info as to the process<o:p></o:p></p></div><div><p class=MsoNormal>that caused that situation.<o:p></o:p></p></div><div><p class=MsoNormal>--------------------------------------------------------------------------<o:p></o:p></p></div><div><p class=MsoNormal>bash: orted: command not found<o:p></o:p></p></div><div><p class=MsoNormal>--------------------------------------------------------------------------<o:p></o:p></p></div><div><p class=MsoNormal>orterun was unable to cleanly terminate the daemons on the nodes shown<o:p></o:p></p></div><div><p class=MsoNormal>below. 
Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
 compute-02-01 - daemon did not report back when launched
 compute-02-02 - daemon did not report back when launched

Are you using any queuing system like SGE, Torque, PBS, etc.? In
OpenMPI, mpirun (and mpiexec) are really just links to orterun.
Orterun is smart enough to get the list of hostnames to use from the
queuing system.

If you're not using a queuing system, then you need to tell orterun
which machines to use. There are several ways; see `man orterun`. You
could do any of these:

# cat hosts.txt
machine1
machine1
machine2
machine2

# orterun -np 4 -hostfile hosts.txt wrf.exe
# mpirun -np 4 -machinefile hosts.txt wrf.exe
# orterun -np 4 -host machine1,machine1,machine2,machine2 wrf.exe

Bart Brashers

-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014