Hi,<br><br>I am running a fairly big domain on 56 processors, simulating a 5-day run; the namelist.input file is included at the bottom of this mail.<br>With only 6 hours of simulation time remaining, wrf.exe dies with the messages below after running for nearly 8 hours:<br>
<br><i>rank 59 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br> exit status of rank 59: killed by signal 9<br>rank 58 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br>
exit status of rank 58: killed by signal 9<br>rank 3 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br> exit status of rank 3: killed by signal 9<br>rank 42 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br>
exit status of rank 42: return code 1<br>rank 40 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br> exit status of rank 40: return code 1<br>rank 45 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br>
exit status of rank 45: return code 1<br>rank 44 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br> exit status of rank 44: killed by signal 9<br>rank 21 in job 1 garl-fire15.local_39996 caused collective abort of all ranks<br>
exit status of rank 21: return code 1<br><br></i><br><br>In the log file rsl.error.0021, these errors are reported:<br><br><i>Fatal error in MPI_Wait: Other MPI error, error stack:<br>MPI_Wait(156).............................: MPI_Wait(request=0x6816130, status0x7fbfff1e10) failed<br>
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>MPIDI_CH3I_Progress_handle_sock_event(420):<br>MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=15,errno=104:Connection reset by peer)[cli_21]: aborting job:<br>
Fatal error in MPI_Wait: Other MPI error, error stack:<br>MPI_Wait(156).............................: MPI_Wait(request=0x6816130, status0x7fbfff1e10) failed<br>MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()<br>
MPIDI_CH3I_Progress_handle_sock_event(420):<br>MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=15,errno=104:Connection reset by peer)</i><br><br><br>Any clue what could be going wrong?<br>I am using mpich2-1.0.8.<br>
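For context, "killed by signal 9" means SIGKILL, which on Linux clusters often comes from the kernel OOM killer when a rank exhausts node memory; the "Connection reset by peer" errors on the other ranks are then usually secondary, as the surviving ranks abort once a peer vanishes. A quick check on each compute node (a sketch, assuming Linux with readable kernel logs) would be:

```shell
# Look for OOM-killer entries in the kernel log; an OOM kill typically
# leaves a line such as "Out of memory: Killed process ... (wrf.exe)".
dmesg | grep -iE 'out of memory|killed process'
```

If nothing shows up there, the kill likely came from elsewhere (a batch-scheduler memory or wall-clock limit, or a node failure).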
<br><br>Thanks in advance<br>Preeti<br><br><br><br><br><b>namelist.input</b><br><br> &time_control<br> run_days = 5,<br> run_hours = 0,<br> run_minutes = 0,<br>
run_seconds = 0,<br> start_year = 2009, 2009, 2000,<br> start_month = 05, 05, 01,<br> start_day = 22, 22, 24,<br>
start_hour = 00, 00, 12,<br> start_minute = 00, 00, 00,<br> start_second = 00, 00, 00,<br> end_year = 2009, 2009, 2000,<br>
end_month = 05, 05, 01,<br> end_day = 27, 27, 25,<br> end_hour = 00, 00, 12,<br> end_minute = 00, 00, 00,<br>
end_second = 00, 00, 00,<br> interval_seconds = 21600<br> input_from_file = .true.,.false.,.true.,<br> history_interval = 90, 30, 60,<br>
frames_per_outfile = 10, 10, 1000,<br> restart = .false.,<br> restart_interval = 9000,<br> io_form_history = 2<br> io_form_restart = 2<br>
io_form_input = 2<br> io_form_boundary = 2<br> debug_level = 0<br> nocolons = .false.<br> auxinput1_inname ="met_em.d<domain>.<date>"<br>
/<br><br> &domains<br> time_step = 90,<br> time_step_fract_num = 0,<br> time_step_fract_den = 1,<br> max_dom = 2,<br> s_we = 1, 1, 1,<br>
e_we = 538, 151, 94,<br> s_sn = 1, 1, 1,<br> e_sn = 366, 193, 91,<br> s_vert = 1, 1, 1,<br>
e_vert = 28, 28, 28,<br> num_metgrid_levels = 27<br> dx = 15000, 5000, 3333.33,<br> dy = 15000, 5000, 3333.33,<br>
grid_id = 1, 2, 3,<br> parent_id = 0, 1, 2,<br> i_parent_start = 1, 299, 30,<br> j_parent_start = 1, 151, 30,<br>
parent_grid_ratio = 1, 3, 3,<br> parent_time_step_ratio = 1, 3, 3,<br> feedback = 1,<br> smooth_option = 0,<br> corral_dist = 2,<br>
/<br><br> &physics<br> mp_physics = 3, 3, 3,<br> ra_lw_physics = 1, 1, 1,<br> ra_sw_physics = 1, 1, 1,<br> radt = 30, 30, 30,<br>
sf_sfclay_physics = 1, 1, 1,<br> sf_surface_physics = 2, 2, 2,<br> bl_pbl_physics = 1, 1, 1,<br> bldt = 0, 0, 0,<br>
cu_physics = 1, 1, 0,<br> cudt = 5, 5, 5,<br> isfflx = 1,<br> ifsnow = 0,<br> icloud = 1,<br>
surface_input_source = 1,<br> num_soil_layers = 4,<br> ucmcall = 0,<br> maxiens = 1,<br> maxens = 3,<br>
maxens2 = 3,<br> maxens3 = 16,<br> ensdim = 144,<br> /<br><br> &fdda<br> /<br><br> &dynamics<br> w_damping = 0,<br>
diff_opt = 1,<br> km_opt = 4,<br> diff_6th_opt = 0,<br> diff_6th_factor = 0.12,<br> base_temp = 290.<br>
damp_opt = 0,<br> zdamp = 5000., 5000., 5000.,<br> dampcoef = 0.2, 0.2, 0.2<br> khdif = 0, 0, 0,<br>
kvdif = 0, 0, 0,<br> non_hydrostatic = .true., .true., .true.,<br> pd_moist = .true., .true., .true.,<br> pd_scalar = .true., .true., .true.,<br>
/<br><br> &bdy_control<br> spec_bdy_width = 5,<br> spec_zone = 1,<br> relax_zone = 4,<br> specified = .true., .false.,.false.,<br>
nested = .false., .true., .true.,<br> /<br><br> &grib2<br> /<br><br> &namelist_quilt<br> nio_tasks_per_group = 0,<br> nio_groups = 1,<br> /<br><br><br>
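As a rough sanity check on whether memory alone could explain a signal-9 kill, the grid sizes in &amp;domains above can be turned into a back-of-the-envelope estimate (a sketch only; the 150-field count and 4-byte-per-value figures are assumptions, not WRF internals, and halo regions and I/O gathering on rank 0 are ignored):

```shell
# d01 point count from &domains: e_we * e_sn * e_vert = 538 * 366 * 28
points=$((538 * 366 * 28))
echo "d01 grid points: $points"
# assume ~150 3-D fields of 4-byte reals, distributed over 56 MPI ranks
echo "approx MB per rank (d01): $((points * 150 * 4 / 56 / 1000000))"
```

That works out to about 5.5 million points and only a few tens of MB per rank for the decomposed fields, so if the OOM killer is involved it is more likely via rank 0 assembling whole-domain output or some other per-node pressure than via the decomposed arrays themselves.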