[Wrf-users] wrf.exe Stops without errors when running with 2 or more procs. through MPI

Jesús Lorenzana jesus.lorenzana at gmail.com
Tue Feb 3 02:47:10 MST 2009


Hi Maite,

Last week I had the same problem on our Linux / MPICH2 / ifort cluster.
wrf.exe ran fine with 1 process, but with 2 or more processes it hung and
never finished.
After a lot of time spent changing configurations, reinstalling MPICH2,
etc., I found that one of the computers that make up the cluster had its
clock running 5 minutes fast. After synchronizing the clocks of all the
computers in the cluster with an SNTP server on the Internet, wrf.exe runs
fine.
I hope this idea helps you.
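
A quick way to look for this kind of clock drift is to collect a timestamp
from every node at roughly the same moment and compare each one against the
median. The following is only a minimal sketch: the function name, the host
names, the readings and the 1-second tolerance are all hypothetical, and
gathering the timestamps (for example with `ssh <host> date +%s`) is left
outside the snippet.

```python
from statistics import median


def find_clock_skew(node_times, tolerance_s=1.0):
    """Return {host: offset_seconds} for hosts whose clock deviates
    from the median of all readings by more than tolerance_s."""
    reference = median(node_times.values())
    return {
        host: t - reference
        for host, t in node_times.items()
        if abs(t - reference) > tolerance_s
    }


# Hypothetical readings (seconds since the epoch), assumed to have been
# taken on every node at approximately the same instant.
readings = {
    "node01": 1233654430.0,
    "node02": 1233654430.2,
    "node03": 1233654730.1,  # about 5 minutes ahead of the others
    "node04": 1233654429.9,
}

skewed = find_clock_skew(readings)
for host, offset in sorted(skewed.items()):
    print(f"{host}: clock differs from the median by {offset:+.1f} s")
```

Using the median as the reference means a single badly set clock does not
shift the baseline. Any node that is flagged would then need to be
resynchronized against the same time server as the rest of the cluster.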

Best regards,

Jesus


2009/2/2 Maite Merino <mmerino at am.ub.es>

> Dear wrf-users,
>
>
> I hope you can help me, because I think I'm stuck with a WRF problem in
> parallel mode.
> I'm trying to run WRF on a cluster with MPI and ifort. I have no problem
> compiling it in serial or dm mode. In serial mode, geogrid, ungrib,
> metgrid, real and wrf all worked perfectly.
> But in dm mode wrf.exe does not work, and I have no clue why.
>
> After compiling WRFV3 and WPS in dm mode with Linux+ifort without any
> problem, all WPS programs worked fine, and real.exe works well with 1
> process when executed as:
> ./real.exe
> real.exe also works when executed through mpirun with more procs (I've
> tried up to 4). On this cluster, we submit the mpirun command through a
> couple of scripts, called in this case real.sh and qsub_real.sh. We use the
> same script structure to run other programs there, such as MM5.
>
> But wrf.exe does not work well. Compiled in serial mode, wrf.exe works
> fine. Compiled in dm mode with Linux+ifort, it works if I execute it as
> ./wrf.exe. It also works fine if I use the respective scripts wrf.sh and
> qsub_wrf.sh to execute it through an mpirun command requesting 1 process.
> But if I try to run it with 2 or more procs, it starts its tasks but never
> finishes. I've waited up to 3 days for a simulation (which takes only a
> couple of hours with 1 proc) and finally had to cancel it. When this
> happens, I cannot see any error or warning in the rsl.* files, etc. It
> seems that it simply hangs and waits for something forever. It always
> stops in the same place, after the message "WRF NUMBER OF TILES=1".
> Neither I nor the IT staff in my department have any idea what could be
> happening. I hope you can help.
>
> To make this easier for you to review, I've made three different runs with
> the Colorado (NAM) case that we worked with during the July '08 tutorial in
> Boulder. I attach to this e-mail one zip file per run with the relevant
> output files:
>
> 1.-pr_serial.zip => it has the relevant files from the execution as
> ./real.exe and ./wrf.exe. It worked fine. It might be useful for you to
> compare with the others.
>
> 2.-pr_1procs.zip => it has the relevant files from the execution as
> ./wrf.sh, requesting only 1 process through mpirun. I did not rerun
> real.exe in this case; I used the same wrfinput and wrfbdy files as before.
> You can see there that wrf.sh worked fine.
>
> 3.-pr_4procs.zip => it has the relevant files from the execution as
> ./real.sh and ./wrf.sh, requesting 4 processes through mpirun (of course, I
> first cleaned out the outputs of the previous runs). You can see there that
> real worked fine but that I finally had to cancel wrf because it ran
> forever.
>
> I also attach a pdf file with all the information that the cluster can
> give us about the pr_4procs.zip run, taken half a day after it started but
> before I cancelled it. You can also find the specifications of the cluster
> itself there.
>
> With all this information, could you please help me find out what is
> happening and how to solve it? Any answer, help or suggestion will
> certainly be welcome.
> I look forward to your reply.
> Best regards,
>
> Maite Merino