[Wrf-users] WRF Problem running in Parallel

Bart Brashers bbrashers at Environcorp.com
Wed Apr 6 09:46:07 MDT 2011


It looks to me like OpenMPI is not installed on all the compute nodes
in your cluster.  Note the line:

 

bash: orted: command not found

 

which says the job can't start because orted, OpenMPI's launch daemon,
can't be found on the compute node.

 

You need to install OpenMPI on all the nodes.  It looks like you're
using a RocksCluster.org cluster (based on the naming of the compute
nodes).  If so, you could install OpenMPI in /share/apps/openmpi (or
something similar under /share/apps).  Everything in /share/apps is
shared via NFS to all the nodes in the cluster.  Alternatively, you
could create an RPM of the OpenMPI bits, and install the RPM on all the
nodes in your cluster.
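
If you go the RPM route, the OpenMPI source tarball ships with an RPM
spec file, so on a build host something like this should work (the
version number here is just an example):

# rpmbuild -tb openmpi-1.4.3.tar.bz2

and then install the resulting binary RPM on every node (on Rocks, e.g.
with cluster-fork, or by adding it to your Rocks distribution).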

 

When running OpenMPI's configure, you could use something like this:

 

# ./configure --with-lsf=/opt/lsf --prefix=/share/apps/openmpi

where you'll have to adjust /opt/lsf to the real path of your LSF
installation.  (--with-lsf builds OpenMPI's LSF support; --with-tm is
the equivalent option for Torque/PBS.)
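
For completeness, here's a minimal sketch of the rest of the install
and of making orted findable on the compute nodes (the make -j value
and the use of ~/.bashrc are just examples; adjust to your setup):

# make -j4 all && make install
# echo 'export PATH=/share/apps/openmpi/bin:$PATH' >> ~/.bashrc
# echo 'export LD_LIBRARY_PATH=/share/apps/openmpi/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

The "bash: orted: command not found" in your log is exactly what you
get when the remote, non-interactive shell on a compute node can't find
the OpenMPI bin directory in its PATH.  Passing --prefix
/share/apps/openmpi to orterun (or configuring with
--enable-mpirun-prefix-by-default) sidesteps the remote PATH lookup
entirely.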

 

When you run WRF through LSF with a properly installed orterun, you
won't have to specify -np or a -hostfile.  Just "orterun real.exe" or
"orterun wrf.exe".

 

Bart Brashers

 

From: Ahsan Ali [mailto:ahsanshah01 at gmail.com] 
Sent: Tuesday, April 05, 2011 11:57 PM
To: Bart Brashers
Subject: WRF Problem running in Parallel

 

Dear Bart

 

It gives the following error for each command. We have LSF installed
but I am not sure how to integrate WRF with LSF.

 

 

[root@pmd02 em_real]# orterun -np 4 -hostfile hosts.txt real.exe
bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 13139) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
bash: orted: command not found
--------------------------------------------------------------------------
orterun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        compute-02-01 - daemon did not report back when launched
        compute-02-02 - daemon did not report back when launched

Are you using any queuing system like SGE, Torque, PBS, etc.?  In
OpenMPI, mpirun (and mpiexec) are really just links to orterun, and
orterun is smart enough to get the list of hostnames to use from the
queuing system.



If you're not using a queuing system, then you need to tell orterun
which machines to use.  There are several ways; see `man orterun`.  You
could do any of these:



# cat hosts.txt

machine1

machine1

machine2

machine2

# orterun -np 4 -hostfile hosts.txt wrf.exe

# mpirun -np 4 -machinefile hosts.txt wrf.exe

# orterun -np 4 -host machine1,machine1,machine2,machine2 wrf.exe



Bart Brashers

-- 
Syed Ahsan Ali Bokhari 
Electronic Engineer (EE)


Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714

Cell # +923155145014

 


