[Wrf-users] wrf.exe Stops without errors when running with 2 or more procs. through MPI

Maite Merino mmerino at am.ub.es
Tue Feb 3 04:07:36 MST 2009


Hi Jesus,

Thank you very much for your suggestion. I've checked, and all the
nodes are synchronized with each other and with the main node. So I'm
afraid that's not my case.
Anyway, your comment is a very interesting detail to consider, and it
might also be crucial for other people, so I'm glad you mentioned it.
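
For anyone who wants to repeat the check, here is a minimal Python sketch
of the comparison, not the exact procedure used on our cluster. It assumes
you have already collected one epoch timestamp per node at (nearly) the same
moment, for example by running `date +%s` on each node over ssh. The node
names, sample values and the 1-second tolerance are made up for illustration.

```python
from typing import Dict, List, Tuple


def clock_offsets(timestamps: Dict[str, float], reference: str) -> Dict[str, float]:
    """Return each node's clock offset in seconds relative to the reference node."""
    ref_time = timestamps[reference]
    return {node: t - ref_time for node, t in timestamps.items()}


def nodes_out_of_sync(
    timestamps: Dict[str, float], reference: str, tolerance_s: float = 1.0
) -> List[Tuple[str, float]]:
    """List (node, offset) pairs whose absolute offset exceeds the tolerance."""
    offsets = clock_offsets(timestamps, reference)
    return sorted(
        ((node, off) for node, off in offsets.items() if abs(off) > tolerance_s),
        key=lambda item: item[0],
    )


if __name__ == "__main__":
    # Hypothetical readings: node02 is 5 minutes fast, as in Jesus's case.
    sample = {
        "master": 1233659256.0,
        "node01": 1233659256.2,
        "node02": 1233659556.0,
    }
    for node, off in nodes_out_of_sync(sample, "master"):
        print(f"{node}: {off:+.1f} s")
```

The timestamps are not taken at exactly the same instant, so the tolerance
should be larger than the time it takes to query all the nodes.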

Best Regards,

Maite




On Tue, 3 Feb 2009, Jesús Lorenzana wrote:

> Hi Maite,
>
> Last week I had the same problem with our Linux / MPICH2 / ifort cluster.
> wrf.exe ran OK with 1 process, but with 2 or more processes it hung and
> never finished.
> After a lot of time changing configurations, reinstalling the MPICH2 software,
> etc., I found that one of the computers that make up the cluster had its clock
> 5 minutes fast. After synchronizing the clocks of all computers in the
> cluster with an SNTP server on the Internet, wrf.exe runs fine.
> I hope this idea can help you.
>
> Best regards,
>
> Jesus
>
>
> 2009/2/2 Maite Merino <mmerino at am.ub.es>
>
>> Dear wrf-users,
>>
>>
>> I hope you can help me, because I think I'm stuck with a WRF problem in
>> parallel mode.
>> I'm trying to run WRF on a cluster with MPI and ifort. I have no
>> problem compiling it in serial mode or dm mode. In serial mode, geogrid,
>> ungrib, metgrid, real and wrf worked perfectly.
>> But in dm mode wrf.exe is not working, and I have no clue why.
>>
>> After compiling WRFV3 and WPS in dm mode with Linux+ifort without any problem,
>> all WPS programs worked fine, and real.exe works well with 1 process when
>> executed as:
>> ./real.exe
>> real.exe also works OK when executed through mpirun with more procs (I've
>> tried up to 4). On this cluster, we submit the mpirun command through a couple
>> of scripts, called in this case real.sh and qsub_real.sh. We use the same
>> script structure for running other programs there, such as MM5.
>>
>> But wrf.exe does not work well. Compiled in serial mode, wrf.exe
>> works fine. Compiled in dm mode with Linux+ifort, it works OK if I execute it
>> as ./wrf.exe. It also works fine when I use the corresponding scripts, wrf.sh
>> and qsub_wrf.sh, to execute it through an mpirun command requesting 1
>> process.
>> But when I try to run it with 2 or more procs, it starts its tasks
>> but never finishes. I have waited up to 3 days for a simulation (which
>> takes only a couple of hours with 1 proc) and finally had to cancel
>> it. When this happens, I cannot see any error or warning in the rsl.* files,
>> etc. It seems that it simply hangs and waits for something forever. It always
>> stops in the same place, after the message "WRF NUMBER OF TILES=1".
>> Neither I nor the IT staff in my department have any idea what
>> could be happening. I hope you can help.
>>
>> To make it easy for you to review, I've made three different runs with
>> the Colorado (NAM) case that we worked with during the July '08 tutorial in
>> Boulder. I attach to this e-mail one zip file with the relevant output
>> files of each:
>>
>> 1.-pr_serial.zip => it has the relevant files of the execution as
>> ./real.exe and ./wrf.exe. It worked fine. It might be useful for you to
>> compare with the others.
>>
>> 2.-pr_1procs.zip => it has the relevant files of the execution as ./wrf.sh,
>> asking for only 1 process through mpirun. I did not redo real.exe in this
>> case; I used the same wrfinput and wrfbdy files as before. You can see
>> there that wrf.sh worked fine.
>>
>> 3.-pr_4procs.zip => it has the relevant files of the execution as ./real.sh
>> and ./wrf.sh, asking for 4 processes through mpirun (of course, I first
>> cleaned out the outputs of the previous executions). You can see there that
>> real worked fine but that I finally had to cancel wrf because it ran forever.
>>
>> I also attach a pdf file with all the information that the cluster can
>> give us about the pr_4procs.zip run, half a day after it
>> started but before I cancelled it. You can also find there the
>> specifications of the cluster itself.
>>
>> With all this information, could you please help me to discover what's
>> happening and how to solve it? Any answer, help or suggestion will
>> certainly be welcome.
>> I look forward to your reply.
>> Best regards,
>>
>> Maite Merino
>> _______________________________________________
>> Wrf-users mailing list
>> Wrf-users at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>>
>

________________________________________________________________

    Maria Teresa (Maite) Merino Espasa    Maite.Merino at am.ub.es

    Departament d'Astronomia i Meteorologia
    Facultat de Fisica
    Universitat de Barcelona
    Avinguda Diagonal, 647                Tel: +34 93 403 92 33
    E-08028 Barcelona                     Fax: +34 93 402 11 33
    SPAIN
________________________________________________________________

