[Wrf-users] wrf.exe stops without errors when running with 2 or more procs. through MPI

Maite Merino mmerino at am.ub.es
Mon Feb 2 05:47:41 MST 2009

Dear wrf-users,

I hope you can help me, because I think I'm stuck with a WRF problem in
parallel mode.
I'm trying to run WRF on a cluster with MPI and ifort. I have no
problem compiling it in serial mode or dm mode. In serial mode, geogrid,
ungrib, metgrid, real and wrf worked perfectly.
But in dm mode wrf.exe is not working, and I have no clue why.

After compiling WRFV3 and WPS in dm mode with Linux+ifort without any problem,
all WPS programs worked fine, and real.exe works well when executed as 1
process. real.exe also works OK when executed through mpirun with more procs.
(I've tried up to 4). On this cluster, we submit the mpirun command through a
pair of scripts, called in this case real.sh and qsub_real.sh. We use the same
script structure for running other programs there, such as MM5.

But wrf.exe does not work well. Compiled in serial mode, wrf.exe
works fine. Compiled in dm mode with Linux+ifort, it works OK if I execute it
as ./wrf.exe. It also works fine if I use the respective scripts wrf.sh and
qsub_wrf.sh to execute it through an mpirun command asking for 1 proc.
But if I try to run it with 2 or more procs., it starts doing its tasks
but never finishes. I waited up to 3 days for a simulation (which takes
only a couple of hours with 1 proc.) and finally had to cancel
it. When this happens, I cannot see any error or warning in the rsl.* files,
etc. It seems that it simply hangs and waits for something forever. It always
stops in the same place, after the message "WRF NUMBER OF TILES=1".
Neither I nor the IT staff in my department have any idea what
could be happening. I hope you can help.
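
In case it helps with the diagnosis, here is a minimal sketch (Python, not one
of the attached files) of how the last message written to each rsl.* file
could be compared across processes, to confirm that every rank stops at the
same point. It assumes the usual rsl.out.NNNN / rsl.error.NNNN file names in
the run directory; the function name last_lines is only illustrative.

```python
import glob
import os


def last_lines(run_dir="."):
    """Map each rsl.* file in run_dir to its last non-empty line."""
    result = {}
    for path in sorted(glob.glob(os.path.join(run_dir, "rsl.*"))):
        last = ""
        # errors="replace" avoids a crash on unexpected bytes in the logs.
        with open(path, errors="replace") as handle:
            for line in handle:
                if line.strip():
                    last = line.strip()
        result[os.path.basename(path)] = last
    return result


if __name__ == "__main__":
    # Print one line per rsl file: "<file name>: <last message>".
    for name, line in last_lines().items():
        print(f"{name}: {line}")
```

Run from the directory that holds the rsl.* files; if some ranks stop at an
earlier message than others, that would suggest where to look first.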

To make it easy for you to review, I've made three different runs with
the Colorado(NAM) case that we worked with during the July '08 tutorial in
Boulder. I attach to this e-mail one zip file with the relevant output
files of each:

1.-pr_serial.zip => it has the relevant files from the execution as ./real.exe
and ./wrf.exe. It worked fine. It might be useful for you to compare with the

2.-pr_1procs.zip => it has the relevant files from the execution as ./wrf.sh,
asking for only 1 process through mpirun. I did not redo real.exe in this
case; I used the same wrfinput and wrfbdy files as before. You can see
there that wrf.sh worked fine.

3.-pr_4procs.zip => it has the relevant files from the execution as ./real.sh
and ./wrf.sh, asking for 4 processes through mpirun (of course, I first
cleaned out the outputs of the previous executions). You can see there that
real worked fine but that I finally had to cancel wrf because it ran forever.

I also attach a PDF file with all the information that the cluster can
give us about the pr_4procs.zip run, taken half a day after it
started but before I cancelled it. You can also find there the
specifications of the cluster itself.

With all this information, could you please help me to discover what's
happening and how to solve it? Any answer, help or suggestion will
certainly be welcome.
I look forward to your reply.
Best regards,

Maite Merino

