[Wrf-users] wrf.exe on a RedHat 5 compatible cluster

Aaron Sims aaron_sims at ncsu.edu
Fri Apr 23 10:49:20 MDT 2010


Hard to say what is going on there without more information. But judging from this error alone, it looks like the program can't find the input file at the very outset.  If the file does exist, it could be that the directory holding it is not mounted on all the machines in your cluster, which would cause the MPI processes to abort.  It could also be a firewall problem or a permissions issue on the mounted filesystems. Try to run an MPI job on just one processor on the same machine you are launching the job from.  If that succeeds, then I would suspect the mount points.
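The checks above can be scripted before launching the job; here is a rough sketch (the input file name comes from the error log, but the commented-out machines-file and ssh steps are assumptions you would adapt to your site):

```shell
#!/bin/sh
# Pre-flight checks before launching an MPI WRF job (sketch; adapt to your site).

input=wrfinput_d01

# 1. Is the input file readable from the launch node?
if [ -r "$input" ]; then
    status=readable
else
    status=missing
fi
echo "$input is $status on $(hostname)"

# 2. (Site-specific) verify every compute node sees the run directory,
#    e.g. by ssh-ing to each host listed in your MPICH1 machines file:
#      for h in $(cat machines); do
#          ssh "$h" "test -r $PWD/$input" || echo "$h cannot read $input"
#      done

# 3. A single-process run on the launch node separates code problems
#    from mount/firewall problems:
#      mpirun -np 1 ./wrf.exe
```

If step 1 reports the file readable on the launch node but other ranks still fail to open it, the exported filesystem (step 2) is the usual culprit.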

Hope this helps,
Aaron

==>  rsl.error.0000<==
     -------------- FATAL CALLED ---------------
     FATAL CALLED FROM FILE:<stdin>  LINE: 71
     program wrf: error opening wrfinput_d01 for reading ierr= -1021
     -------------------------------------------
     [0] MPI Abort by user Aborting program !
     [0] Aborting program!



On 4/23/2010 2:51 AM, Lampros Mountrakis wrote:
> I am trying to run wrf.exe 3.1.1 on a RedHat 5 compatible cluster, and
> all I get are errors. I have tried several compilation options, such as
> dmpar/dmsm and static/dynamic, and all of them fail. Common to all
> attempts are the em_real case, MPICH1, and the Intel compiler.
>
> The very same case produces reasonable output on a RedHat 4 based cluster.
>
> " ulimit -s unlimited " is present at the running script, before the
> mpirun, as well as the assignment of the parameters, which I found on
> several topics, having similar problems:
>
>      export MPICH_UNEX_BUFFER_SIZE=1024M
>      export P4_GLOBMEMSIZE=536870912
>      export MP_STACK_SIZE=64000000
>      export KMP_STACKSIZE=2048M
>
> The most common errors are the following:
>
>      std error
>      rm_l_4_18119: (1065.371648) net_send: could not write to fd=5, errno = 32
>      rm_l_15_12081: (1052.342272) net_send: could not write to fd=5, errno = 32
>      rm_l_6_27731: (1064.724480) net_send: could not write to fd=5, errno = 32
>      rm_l_10_12047: (1063.745536) net_send: could not write to fd=5, errno = 32
>      rm_l_14_12071: (1052.841984) net_send: could not write to fd=5, errno = 32
>      rm_l_13_12065: (1053.071360) net_send: could not write to fd=5, errno = 32
>      rm_l_7_12029: (1064.433664) net_send: could not write to fd=5, errno = 32
>      rm_l_9_12041: (1063.974912) net_send: could not write to fd=5, errno = 32
>      rm_l_11_12053: (1058.514944) net_send: could not write to fd=5, errno = 32
>      rm_l_12_12059: (1053.300736) net_send: could not write to fd=5, errno = 32
>      rm_l_2_21220: (1071.020032) net_send: could not write to fd=5, errno = 32
>
>
>
>
>
>      ==>  rsl.error.0000<==
>      -------------- FATAL CALLED ---------------
>      FATAL CALLED FROM FILE:<stdin>  LINE: 71
>      program wrf: error opening wrfinput_d01 for reading ierr= -1021
>      -------------------------------------------
>      [0] MPI Abort by user Aborting program !
>      [0] Aborting program!
>
>
>
>      ==>  rsl.out.0000<==
>      -------------- FATAL CALLED ---------------
>      FATAL CALLED FROM FILE:<stdin>  LINE: 71
>      program wrf: error opening wrfinput_d01 for reading ierr= -1021
>      -------------------------------------------
>      taskid: 0 hostname: wn024.grid.auth.gr
>      p0_30948: p4_error: : 1
>      p0_30948: (33.830912) net_send: could not write to fd=5, errno = 32
>
>
>      ==>  rsl.out.0001<==
>      alloc_space_field: domain 1, 58257184 bytes allocated
>      -------------- FATAL CALLED ---------------
>      FATAL CALLED FROM FILE:<stdin>  LINE: 71
>      program wrf: error opening wrfinput_d01 for reading ierr= -1021
>      -------------------------------------------
>      taskid: 1 hostname: wn024.grid.auth.gr
>
>
>
>
> From time to time I get:
>
>      starting wrf task 7 of 16
>      starting wrf task 9 of 16
>      starting wrf task 13 of 16
>      starting wrf task 0 of 16
>      starting wrf task 1 of 16
>      starting wrf task 2 of 16
>      starting wrf task 3 of 16
>      starting wrf task 4 of 16
>      starting wrf task 5 of 16
>      starting wrf task 6 of 16
>      starting wrf task 8 of 16
>      starting wrf task 15 of 16
>      starting wrf task 10 of 16
>      starting wrf task 11 of 16
>      starting wrf task 12 of 16
>      starting wrf task 14 of 16
>
>      Killed by signal 2.
>      Killed by signal 2.
>      Killed by signal 2.
>      Killed by signal 2.
>      Killed by signal 2.
>
>
>
> If you have any suggestions, or some kind of solution, I would be grateful.
> Thank you for your time.
>
> __
> Lampros
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>    