[Wrf-users] WRF 3.2 jobs hanging up sporadically on wrfout output

Jeff Steward jeffsteward at gmail.com
Tue May 4 14:44:44 MDT 2010


Hello Michael and all,

I would just like to add that we are seeing the same problem with WRF 3.2
compiled with PGI in dmpar mode.  We submit hundreds of jobs, and roughly 3
to 4% of them hang with no output, no error message, and no crash.  A hung
job ends up getting killed by the queueing system when it runs out of
walltime.  We are using PGI on Gentoo Linux.  The workaround has been to
resubmit the hung jobs, which fixes the problem most (perhaps 96 to 97%?)
of the time.
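
A minimal sketch of how such a hang could be detected before the walltime
limit is reached, assuming a hung run simply stops touching its wrfout and
rsl.* files.  The function names and the idle threshold are illustrative
assumptions, not part of WRF, and the kill/resubmit step is left out
because it depends on the queueing system:

```python
import time
from pathlib import Path


def newest_mtime(run_dir, patterns=("wrfout_d*", "rsl.out.*", "rsl.error.*")):
    """Return the latest modification time among matching files, or None."""
    mtimes = [
        p.stat().st_mtime
        for pattern in patterns
        for p in Path(run_dir).glob(pattern)
        if p.is_file()
    ]
    return max(mtimes) if mtimes else None


def is_stalled(run_dir, max_idle_seconds, now=None):
    """True if no matching file changed within the last max_idle_seconds."""
    now = time.time() if now is None else now
    latest = newest_mtime(run_dir)
    if latest is None:
        # Nothing has been written yet, so this check cannot judge the run.
        return False
    return (now - latest) > max_idle_seconds
```

Something like is_stalled("/path/to/run", 1800) could be polled from a
wrapper script.  The threshold (1800 seconds here, an arbitrary choice)
would need to be comfortably longer than the normal gap between output
writes.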

As in Michael's case, the previous 3.1 versions of WRF worked perfectly for us.

Let me know if any additional information is needed from us.

Best wishes,

Jeff

On Fri, Apr 30, 2010 at 2:01 PM, Zulauf, Michael <Michael.Zulauf at iberdrolausa.com> wrote:

> Hi again, all...
>
> I'm reviving my plea for help from a couple weeks ago.  I'm still having
> issues with WRF 3.2 - and _only_ 3.2.
>
> I've tried different versions of the PGI compilers, different versions
> of the support libraries, different optimization levels (all the way down
> to none), etc.  My jobs sporadically (but usually eventually) hang, most
> often just after a new wrfout file is opened.  There are no error messages
> and no crashes - the processes keep running, but _all_ output stops.  I
> eventually just have to kill the job.  The wrfout files are small, and all
> output looks good up until the failed wrfout.
>
> The exact same hardware, OS, compilers, libraries, etc. work for previous
> versions of WRF.
>
> Below is an example namelist.input (WPS seems to be running fine).  Any
> thoughts?
>
> Thanks,
> Mike
>
> ----------------------------------------------------------------------
> &time_control
>  run_days                            = 0,
>  run_hours                           = 24,
>  run_minutes                         = 0,
>  run_seconds                         = 0,
>  start_year                          = 2009,2009,2009,2009,
>  start_month                         = 12,12,12,12,
>  start_day                           = 14,14,14,14,
>  start_hour                          = 00,03,06,09,
>  start_minute                        = 00,   00,   00,   00,   00,   00,
>  start_second                        = 00,   00,   00,   00,   00,   00,
>
>  end_year                            = 2009,2009,2009,2009,
>  end_month                           = 12,12,12,12,
>  end_day                             = 15,15,15,14,
>  end_hour                            = 00,00,00,12,
>  end_minute                          = 00,   00,   00,   00,   00,   00,
>  end_second                          = 00,   00,   00,   00,   00,   00,
>  interval_seconds                    = 10800,
>  input_from_file                     = .true.,.true.,.true.,.true.,.true.,
>  fine_input_stream                   = 0, 2, 2, 2,
>  io_form_auxinput2                   = 2
>  history_interval                    = 60,60,60,20,
>  frames_per_outfile                  =  1,  1,  1,  1,  1,  1,
>  restart                             = .false.,
>  restart_interval                    = 1440,
>  io_form_history                     = 2
>  io_form_restart                     = 2
>  io_form_input                       = 2
>  io_form_boundary                    = 2
>  debug_level                         = 0
>  adjust_output_times                 = .true.
>  /
>
>  &domains
>  time_step                           = 163,
>  time_step_fract_num                 = 7,
>  time_step_fract_den                 = 11,
>  max_dom                             = 4,
>  s_we                                = 1,  1,  1,  1,  1, 1,
>  e_we                                =   142,244,280,382,
>  s_sn                                =  1,  1,  1,  1,  1, 1,
>  e_sn                                =   154,268,250,196,
>  s_vert                              =  1,  1,  1,  1,  1, 1,
>  e_vert                              = 31,  31,  31,  31,  31, 31,
>  num_metgrid_levels                  =  27 ,
>  eta_levels                          = 1.000, 0.993, 0.980, 0.966,
> 0.950, 0.933, 0.913, 0.892, 0.869, 0.844, 0.816, 0.786, 0.753, 0.718,
> 0.680, 0.639, 0.596, 0.550, 0.501, 0.451, 0.398, 0.345, 0.290, 0.236,
> 0.188, 0.145, 0.108, 0.075, 0.046, 0.021, 0.000,
>
>  p_top_requested                     = 5000,
>  dx                                  = 27000,9000,3000,1000,
>  dy                                  = 27000,9000,3000,1000,
>  grid_id        = 1,  2,  3,  4,  5,  6,
>  parent_id      = 1,  1,  2,  3,  4,  5,
>  i_parent_start                      =   1,31,91,92,
>  j_parent_start                      =   1,33,93,93,
>  parent_grid_ratio = 1,  3,  3,  3,  3,  3,
>  parent_time_step_ratio = 1,  3,  3,  3,  3, 3,
>  feedback                            = 0,
>  smooth_option                       = 2
>  use_adaptive_time_step              = .false.
>  step_to_output_time                 = .true.
>  target_cfl                          = 1.1,1.1,1.1,1.1,
>  max_step_increase_pct               = 5, 51, 51, 51, 51, 51
>  starting_time_step                  = 162, 54, 18, 6
>  max_time_step                       = 202.5, 67.5, 22.5, 7.5
>  min_time_step                       = 27, 9, 3, 1
>  adaptation_domain                   = 4
>  /
>
>  &physics
>  mp_physics                          = 5, 5, 5, 5,
>  ra_lw_physics                       = 1, 1, 1, 1,
>  ra_sw_physics                       = 1, 1, 1, 1,
>  radt                                = 30,    30,    30,    30,    30,    30,
>  sf_sfclay_physics                   = 1, 1, 1, 1,
>  sf_surface_physics                  = 1, 1, 1, 1,
>  bl_pbl_physics                      = 1, 1, 1, 1,
>  bldt                                = 0,     0,     0,     0,     0,     0,
>  cu_physics                          = 1,     1,     0,     0,     0,     0,
>  cudt                                = 5,     5,     5,     0,     0,     0,
>  cam_abs_freq_s                      = 21600,
>  levsiz                              = 59,
>  paerlev                             = 29,
>  cam_abs_dim1                        = 4,
>  cam_abs_dim2                        = 31,
>  isfflx                              = 1,
>  ifsnow                              = 0,
>  icloud                              = 1,
>  surface_input_source                = 1,
>  num_soil_layers                     = 5,
>  sf_urban_physics                    = 0,     0,     0,     0,
>  mp_zero_out                         = 0,
>  maxiens                             = 1,
>  maxens                              = 3,
>  maxens2                             = 3,
>  maxens3                             = 16,
>  ensdim                              = 144,
>  slope_rad                           = 0,
>  topo_shading                        = 0,
>  /
>
>  &fdda
>  grid_fdda                           = 1,     0,     0,
>  gfdda_inname                        = "wrffdda_d<domain>",
>  gfdda_interval_m                    = 180,   0,     0,
>  gfdda_end_h                         = 12,    0,     0,
>  io_form_gfdda                       = 2,
>  fgdt                                = 0,     0,     0,
>  if_no_pbl_nudging_uv                = 0,     0,     0,
>  if_no_pbl_nudging_t                 = 1,     0,     0,
>  if_no_pbl_nudging_q                 = 1,     0,     0,
>  if_zfac_uv                          = 0,     0,     0,
>  k_zfac_uv                          = 10,   10,    10,
>  if_zfac_t                           = 1,     0,     0,
>  k_zfac_t                           = 10,   10,    10,
>  if_zfac_q                           = 1,     0,     0,
>  k_zfac_q                           = 10,   10,    10,
>  guv                                 = 0.0001,     0.0001,     0.0001,
>  gt                                  = 0.0001,     0.0001,     0.0001,
>  gq                                  = 0.000001,   0.000001,   0.000001,
>  if_ramping                          = 0,
>  dtramp_min                          = 0.0,
> /
>
>  &dynamics
>  w_damping                           = 1,
>  diff_opt                            = 1,
>  km_opt                              = 4,
>  diff_6th_opt                        = 0,
>  diff_6th_factor                     = 0.12,
>  base_temp                           = 290.
>  damp_opt                            = 0,
>  zdamp                               = 5000.,  5000.,  5000.,
>  dampcoef                            = 0.01,   0.01,   0.01
>  khdif                               = 0,      0,      0,
>  kvdif                               = 0,      0,      0,
>  non_hydrostatic                     = .true., .true., .true.,
>  moist_adv_opt                       = 1,      1,      1,     1
>  scalar_adv_opt                      = 1,      1,      1,     1
>  use_baseparam_fr_nml                = .true.
>  /
>
>  &bdy_control
>  spec_bdy_width                      = 5,
>  spec_zone                           = 1,
>  relax_zone                          = 4,
>  specified                           = .true., .false.,.false.,.false.,.false., .false.,
>  nested                              = .false., .true., .true.,.true., .true., .true.,
>  /
>
>  &grib2
>  /
>
>  &namelist_quilt
>  nio_tasks_per_group = 0,
>  nio_groups = 1,
>  /
> ----------------------------------------------------------------------
>
> -----Original Message-----
> Date: Fri, 16 Apr 2010 10:11:22 -0700
> From: "Zulauf, Michael" <Michael.Zulauf at iberdrolausa.com>
> Subject: Re: [Wrf-users] WRF 3.2 jobs hanging up sporadically on wrfout output
> To: "Don Morton" <Don.Morton at alaska.edu>
> Cc: wrf-users at ucar.edu
>
> Thanks for the response, Don.
>
> The specific RDMA suggestion isn't relevant to our case (our hardware
> doesn't support it), but you may be right that this is an
> optimization-related issue.  I'll probably try playing with optimizations
> next.  I'm using the same settings that worked for previous versions - but
> perhaps something in the new code has made one of the settings problematic.
>
> Regarding the suggestions I've been getting about
> WRFIO_NCD_LARGE_FILE_SUPPORT - I don't think that's the problem.  I'm
> splitting my output into single-frame files to keep the file size small.
> I may try enabling it anyway, just for the heck of it.
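
One quick way to confirm that file size is not the issue is to look for any
wrfout file at or above roughly 2 GiB, the size beyond which classic-format
netCDF files generally need large-file (64-bit offset) support.  A minimal
sketch, assuming the default wrfout_d* naming; the function name and the
exact limit are illustrative assumptions:

```python
from pathlib import Path

# Approximate classic-format netCDF size limit (assumed here to be 2 GiB).
CLASSIC_NETCDF_LIMIT = 2 * 1024**3


def oversized_outputs(run_dir, pattern="wrfout_d*", limit=CLASSIC_NETCDF_LIMIT):
    """Return sorted (name, size_in_bytes) pairs for files at or above limit."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(run_dir).glob(pattern)
        if p.is_file() and p.stat().st_size >= limit
    )
```

An empty list from oversized_outputs("/path/to/run") would support the view
that large-file support is unrelated to the hangs.
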
>
> Based on the sporadic nature of this (sometimes it happens, sometimes it
> doesn't, and when it hangs seems fairly random), I suspect it's some type
> of timing issue like a race condition.  If I can't get it working, I may
> just drop back to 3.1.1, at least until 3.2.1 comes out.  ;-)
>
> Thanks all,
>
> Mike