[Wrf-users] wrf.exe on a RedHat 5 compatible cluster

Lampros Mountrakis lmount at grid.auth.gr
Fri Apr 23 00:51:51 MDT 2010


I am trying to run wrf.exe 3.1.1 on a RedHat 5 compatible cluster and
all I get is errors. I tried several compilation options, such as
dmpar/dm+sm and static/dynamic linking, and all of them fail. The common
settings are the em_real case, MPICH1 and the Intel compiler.

The very same case produces reasonable output on a RedHat 4 based cluster.

"ulimit -s unlimited" is present in the run script, before the mpirun
call, along with the following environment settings, which I found in
several threads about similar problems:

    export MPICH_UNEX_BUFFER_SIZE=1024M
    export P4_GLOBMEMSIZE=536870912
    export MP_STACK_SIZE=64000000
    export KMP_STACKSIZE=2048M
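
To make the ordering concrete, here is a minimal sketch of how that part of the run script is laid out, assuming a bash script. The function name setup_wrf_env and the commented-out mpirun line (including the machine file name) are only illustrative, not the actual script; the task count of 16 is taken from the log output further down.

```shell
#!/bin/bash
# Sketch of the run script layout: limits and environment first, launch last.

setup_wrf_env() {
    # Raise the stack limit; warn instead of aborting if the hard limit forbids it.
    ulimit -s unlimited 2>/dev/null || echo "warning: could not raise stack limit" >&2

    # Settings collected from other threads about similar problems.
    export MPICH_UNEX_BUFFER_SIZE=1024M
    export P4_GLOBMEMSIZE=536870912
    export MP_STACK_SIZE=64000000
    export KMP_STACKSIZE=2048M
}

setup_wrf_env

# Hypothetical launch line (machine file name is an assumption):
# mpirun -np 16 -machinefile machines ./wrf.exe
```

The point is only that every setting is exported in the same shell that later calls mpirun, so the launcher inherits them.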

The most common errors are the following:

    std error
    rm_l_4_18119: (1065.371648) net_send: could not write to fd=5, errno = 32
    rm_l_15_12081: (1052.342272) net_send: could not write to fd=5, errno = 32
    rm_l_6_27731: (1064.724480) net_send: could not write to fd=5, errno = 32
    rm_l_10_12047: (1063.745536) net_send: could not write to fd=5, errno = 32
    rm_l_14_12071: (1052.841984) net_send: could not write to fd=5, errno = 32
    rm_l_13_12065: (1053.071360) net_send: could not write to fd=5, errno = 32
    rm_l_7_12029: (1064.433664) net_send: could not write to fd=5, errno = 32
    rm_l_9_12041: (1063.974912) net_send: could not write to fd=5, errno = 32
    rm_l_11_12053: (1058.514944) net_send: could not write to fd=5, errno = 32
    rm_l_12_12059: (1053.300736) net_send: could not write to fd=5, errno = 32
    rm_l_2_21220: (1071.020032) net_send: could not write to fd=5, errno = 32





    ==> rsl.error.0000 <==
    -------------- FATAL CALLED ---------------
    FATAL CALLED FROM FILE: <stdin> LINE: 71
    program wrf: error opening wrfinput_d01 for reading ierr= -1021
    -------------------------------------------
    [0] MPI Abort by user Aborting program !
    [0] Aborting program!



    ==> rsl.out.0000 <==
    -------------- FATAL CALLED ---------------
    FATAL CALLED FROM FILE: <stdin> LINE: 71
    program wrf: error opening wrfinput_d01 for reading ierr= -1021
    -------------------------------------------
    taskid: 0 hostname: wn024.grid.auth.gr
    p0_30948: p4_error: : 1
    p0_30948: (33.830912) net_send: could not write to fd=5, errno = 32


    ==> rsl.out.0001 <==
    alloc_space_field: domain 1, 58257184 bytes allocated
    -------------- FATAL CALLED ---------------
    FATAL CALLED FROM FILE: <stdin> LINE: 71
    program wrf: error opening wrfinput_d01 for reading ierr= -1021
    -------------------------------------------
    taskid: 1 hostname: wn024.grid.auth.gr




From time to time I also get:

    starting wrf task 7 of 16
    starting wrf task 9 of 16
    starting wrf task 13 of 16
    starting wrf task 0 of 16
    starting wrf task 1 of 16
    starting wrf task 2 of 16
    starting wrf task 3 of 16
    starting wrf task 4 of 16
    starting wrf task 5 of 16
    starting wrf task 6 of 16
    starting wrf task 8 of 16
    starting wrf task 15 of 16
    starting wrf task 10 of 16
    starting wrf task 11 of 16
    starting wrf task 12 of 16
    starting wrf task 14 of 16

    Killed by signal 2.
    Killed by signal 2.
    Killed by signal 2.
    Killed by signal 2.
    Killed by signal 2.



If you have something to suggest, or some kind of solution, I would be grateful.
Thank you for your time.

__
Lampros

