Hi,<br><br>I am running a fairly large domain on 56 processors for a 5-day simulation; the namelist.input file is included at the bottom of this mail.<br>With only 6 hours of simulated time remaining, wrf.exe dies with the following messages after running for nearly 8 hours of wall-clock time:<br>
<br><i>rank 59 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>  exit status of rank 59: killed by signal 9<br>rank 58 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>
  exit status of rank 58: killed by signal 9<br>rank 3 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>  exit status of rank 3: killed by signal 9<br>rank 42 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>
  exit status of rank 42: return code 1<br>rank 40 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>  exit status of rank 40: return code 1<br>rank 45 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>
  exit status of rank 45: return code 1<br>rank 44 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>  exit status of rank 44: killed by signal 9<br>rank 21 in job 1  garl-fire15.local_39996   caused collective abort of all ranks<br>
  exit status of rank 21: return code 1<br><br></i><br><br>The log file rsl.error.0021 reports these errors:<br><br><i>Fatal error in MPI_Wait: Other MPI error, error stack:<br>MPI_Wait(156).............................: MPI_Wait(request=0x6816130, status0x7fbfff1e10) failed<br>
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>MPIDI_CH3I_Progress_handle_sock_event(420):<br>MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=15,errno=104:Connection reset by peer)[cli_21]: aborting job:<br>
Fatal error in MPI_Wait: Other MPI error, error stack:<br>MPI_Wait(156).............................: MPI_Wait(request=0x6816130, status0x7fbfff1e10) failed<br>MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>
MPIDI_CH3I_Progress_handle_sock_event(420):<br>MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=15,errno=104:Connection reset by peer)</i><br><br><br>Any idea what could be going wrong?<br>I am using mpich2-1.0.8.<br>
<br><br>Thanks in advance<br>Preeti<br><br><br><br><br><b>namelist.input</b><br><br> &amp;time_control<br> run_days                            = 5,<br> run_hours                           = 0,<br> run_minutes                         = 0,<br>
 run_seconds                         = 0,<br> start_year                          = 2009, 2009, 2000,<br> start_month                         = 05,   05,   01,<br> start_day                           = 22,   22,   24,<br>
 start_hour                          = 00,   00,   12,<br> start_minute                        = 00,   00,   00,<br> start_second                        = 00,   00,   00,<br> end_year                            = 2009, 2009, 2000,<br>
 end_month                           = 05,   05,   01,<br> end_day                             = 27,   27,   25,<br> end_hour                            = 00,   00,   12,<br> end_minute                          = 00,   00,   00,<br>
 end_second                          = 00,   00,   00,<br> interval_seconds                    = 21600<br> input_from_file                     = .true.,.false.,.true.,<br> history_interval                    = 90,   30,   60,<br>
 frames_per_outfile                  = 10,   10, 1000,<br> restart                             = .false.,<br> restart_interval                    = 9000,<br> io_form_history                     = 2<br> io_form_restart                     = 2<br>
 io_form_input                       = 2<br> io_form_boundary                    = 2<br> debug_level                         = 0<br> nocolons                            = .false.<br> auxinput1_inname                    = &quot;met_em.d&lt;domain&gt;.&lt;date&gt;&quot;<br>
/<br><br> &amp;domains<br> time_step                           = 90,<br> time_step_fract_num                 = 0,<br> time_step_fract_den                 = 1,<br> max_dom                             = 2,<br> s_we                                = 1,     1,     1,<br>
 e_we                                = 538,   151,   94,<br> s_sn                                = 1,     1,     1,<br> e_sn                                = 366,   193,   91,<br> s_vert                              = 1,     1,     1,<br>
 e_vert                              = 28,    28,    28,<br> num_metgrid_levels                  = 27<br> dx                                  = 15000, 5000,  3333.33,<br> dy                                  = 15000, 5000,  3333.33,<br>
 grid_id                             = 1,     2,     3,<br> parent_id                           = 0,     1,     2,<br> i_parent_start                      = 1,     299,   30,<br> j_parent_start                      = 1,     151,   30,<br>
 parent_grid_ratio                   = 1,     3,     3,<br> parent_time_step_ratio              = 1,     3,     3,<br> feedback                            = 1,<br> smooth_option                       = 0,<br> corral_dist                         = 2,<br>
 /<br><br> &amp;physics<br> mp_physics                          = 3,     3,     3,<br> ra_lw_physics                       = 1,     1,     1,<br> ra_sw_physics                       = 1,     1,     1,<br> radt                                = 30,    30,    30,<br>
 sf_sfclay_physics                   = 1,     1,     1,<br> sf_surface_physics                  = 2,     2,     2,<br> bl_pbl_physics                      = 1,     1,     1,<br> bldt                                = 0,     0,     0,<br>
 cu_physics                          = 1,     1,     0,<br> cudt                                = 5,     5,     5,<br> isfflx                              = 1,<br> ifsnow                              = 0,<br> icloud                              = 1,<br>
 surface_input_source                = 1,<br> num_soil_layers                     = 4,<br> ucmcall                             = 0,<br> maxiens                             = 1,<br> maxens                              = 3,<br>
 maxens2                             = 3,<br> maxens3                             = 16,<br> ensdim                              = 144,<br> /<br><br> &amp;fdda<br> /<br><br> &amp;dynamics<br> w_damping                           = 0,<br>
 diff_opt                            = 1,<br> km_opt                              = 4,<br> diff_6th_opt                        = 0,<br> diff_6th_factor                     = 0.12,<br> base_temp                           = 290.<br>
 damp_opt                            = 0,<br> zdamp                               = 5000.,  5000.,  5000.,<br> dampcoef                            = 0.2,    0.2,    0.2<br> khdif                               = 0,      0,      0,<br>
 kvdif                               = 0,      0,      0,<br> non_hydrostatic                     = .true., .true., .true.,<br> pd_moist                            = .true., .true., .true.,<br> pd_scalar                           = .true., .true., .true.,<br>
 /<br><br> &amp;bdy_control<br> spec_bdy_width                      = 5,<br> spec_zone                           = 1,<br> relax_zone                          = 4,<br> specified                           = .true., .false.,.false.,<br>
 nested                              = .false., .true., .true.,<br> /<br><br> &amp;grib2<br> /<br><br> &amp;namelist_quilt<br> nio_tasks_per_group = 0,<br> nio_groups = 1,<br> /<br><br><br>
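For anyone reading the abort messages quoted at the top of this mail, here is a minimal Python sketch that groups the reported ranks by exit status. It assumes the mpiexec output has been pasted into a string; the names `LOG`, `STATUS_RE` and `tally_exit_statuses` are made up for illustration and are not part of WRF or MPICH2. It only summarizes the messages and does not diagnose the cause. (On Linux, signal 9 is SIGKILL.)

```python
import re
from collections import defaultdict

# Abort messages copied from the mpiexec output quoted earlier in this mail.
LOG = """\
rank 59 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 59: killed by signal 9
rank 58 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 58: killed by signal 9
rank 3 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 3: killed by signal 9
rank 42 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 42: return code 1
rank 40 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 40: return code 1
rank 45 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 45: return code 1
rank 44 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 44: killed by signal 9
rank 21 in job 1  garl-fire15.local_39996   caused collective abort of all ranks
  exit status of rank 21: return code 1
"""

# Hypothetical helper pattern: captures the rank number and the status text.
STATUS_RE = re.compile(r"exit status of rank (\d+): (.+)")


def tally_exit_statuses(text):
    """Return a dict mapping each exit status to the sorted list of ranks."""
    groups = defaultdict(list)
    for match in STATUS_RE.finditer(text):
        rank = int(match.group(1))
        status = match.group(2).strip()
        groups[status].append(rank)
    return {status: sorted(ranks) for status, ranks in groups.items()}


if __name__ == "__main__":
    for status, ranks in tally_exit_statuses(LOG).items():
        print(f"{status}: ranks {ranks}")
```

Splitting the ranks this way separates the processes that were killed from outside (signal 9) from those that exited on their own with return code 1, such as rank 21, whose rsl.error file is quoted above.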