[cam-users] RE: mpirun on beowulf cluster
Ghan, Steven J
Steve.Ghan@pnl.gov
Fri, 06 Sep 2002 10:22:06 -0700
I have found that the t_setoption messages mentioned below have nothing to do
with the failure of CAM to run on a beowulf cluster. Rather, the failure is
due to the threading option -mp used to compile with pgf90. If -mp is used,
then the thread library must also be linked, that is
-L/usr/local/lib -lmpich -lpthread
where /usr/local/lib contains the Portland Group thread library
libpgthread.so
Otherwise CAM chokes on mpi_init.
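
As a rough sketch of the rule above (not the actual CAM build procedure), the
shell fragment below adds the thread library to the link flags only when -mp
appears in the compile flags. The function name link_flags, the FFLAGS
variable, and the object and executable names are assumptions made for
illustration; only the -L/-l flags are taken from the message.

```shell
#!/bin/sh
# link_flags FFLAGS
# Print the link flags for a given set of compile flags.
# The library directory is the one quoted in the message; adjust as needed.
link_flags() {
  flags="-L/usr/local/lib -lmpich"
  case " $1 " in
    *" -mp "*) flags="$flags -lpthread" ;;  # threaded build: link thread library too
  esac
  echo "$flags"
}

FFLAGS="-mp"   # hypothetical compile flags
# Print the link command instead of running it, so it can be inspected first.
# "cam" and "*.o" are placeholder names for the executable and object files.
echo "pgf90 $FFLAGS -o cam *.o $(link_flags "$FFLAGS")"
```

Printing the command rather than executing it makes it easy to compare
against the link line that the existing makefile actually produces.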
I don't have enough experience to say whether the problem is unique to my
beowulf cluster.
-Steve Ghan
-----Original Message-----
From: Jim Rosinski [mailto:rosinski@cgd.ucar.edu]
Sent: Tuesday, September 03, 2002 5:11 PM
To: Ghan, Steven J
Cc: cam-users@ucar.edu
Subject: Re: mpirun on beowulf cluster
On Fri, 30 Aug 2002, Ghan, Steven J wrote:
> I've compiled cam2 to run spmd on a beowulf cluster (redhat 7.1, portland
> group compiler). But when I try to run
> mpirun -np 2 -machinefile machines cam < namelist > camout
> I get the dreaded broken pipe message to the terminal and the following
> message in camout:
>
> t_setoption: option disabled: Usr Sys
> t_setoption: option disabled: Usr Sys
>
> which is coming from cam2/models/utils/timing/t_setoption.c. It seems that I
> can turn this problem off by defining DISABLE_TIMERS, but why should I have
> to? The code runs fine without spmd. Any ideas on other solutions?
Broken pipes often happen when one process dies unexpectedly and another is
still trying to send data to or receive data from it. If you're using mpich,
this can happen when mpi tries to route stdout to the master process, but one
of the slaves has died. Though stranger things have happened, I doubt that
the problem you are encountering is actually occurring in any of the
utils/timing code. To check for sure, I'd suggest running mpirun with -p4pg
hostfile -p4norem, then firing up master and slaves by hand. That should at
least eliminate the "broken pipe" nonsense.
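
For the -p4pg suggestion above, the hostfile is a p4 procgroup file. The
sketch below builds one from a plain machines file, assuming the usual mpich
ch_p4 layout: a "local 0" line for the master, then one line per remote host
giving the host name, the number of processes, and the full path to the
executable. The function name make_procgroup and the example path are
hypothetical, and the machines file is assumed to hold one bare host name per
line with the master listed first.

```shell
#!/bin/sh
# make_procgroup MACHINEFILE PROGRAM
# Write a p4 procgroup file to stdout.
# The first host in MACHINEFILE is treated as the master (the node you launch
# from); every later host gets one slave process running PROGRAM.
make_procgroup() {
  echo "local 0"
  # Skip blank lines and comment lines; skip the first remaining host (master).
  awk -v prog="$2" 'NF && $1 !~ /^#/ { if (n++ > 0) print $1, 1, prog }' "$1"
}

# Example use (file names and path are placeholders):
#   make_procgroup machines /full/path/to/cam > hostfile
```

With -p4norem the master should wait instead of starting the remote
processes itself, so each slave can then be started by hand on its own node
and watched directly for the point at which it dies.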
Jim Rosinski