[Wrf-users] Max number of CPUs for WRF

Alex Fierro alexandre.o.fierro at gmail.com
Mon Mar 26 16:00:09 MDT 2012


Greetings:

I ran WRF on the Oak Ridge supercomputer (jaguarpf, a Cray XT5) on 2,000
cores two years ago and ran into a similar problem, which in my case was
related to I/O quilting.

I had to set:

 &namelist_quilt
   nio_tasks_per_group = 2,
   nio_groups = 1,
 /

and then everything went fine (for that particular case).
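
As I understand it, the quilt (I/O server) tasks come out of the MPI ranks
you launch, so the compute decomposition only gets what is left over. A
minimal sketch of that bookkeeping in Python (the function name and the
2000-rank example are mine, for illustration only):

    def quilt_breakdown(total_mpi_ranks, nio_tasks_per_group, nio_groups):
        # I/O (quilt) server ranks are nio_groups * nio_tasks_per_group;
        # the remaining ranks do the actual computation.
        io_ranks = nio_tasks_per_group * nio_groups
        compute_ranks = total_mpi_ranks - io_ranks
        if compute_ranks <= 0:
            raise ValueError("more I/O servers requested than MPI ranks launched")
        return compute_ranks, io_ranks

    print(quilt_breakdown(2000, 2, 1))   # -> (1998, 2)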

The simulation was run for a 24-h period on a 4-km convection-permitting
grid over CONUS with a grid size of 1200 x 800 x 35. It scaled well up to
8,000 cores, where, again, I/O caused some issues. I also believe that, in
the past, classic-format netCDF files had a hard-wired file size limit of
about 2 GB, similar to Vis5D files. Have you also tried using native (raw)
binaries for the output format?
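
For a sense of scale, here is a rough size estimate for that 1200 x 800 x 35
grid (a back-of-the-envelope sketch in Python; the field counts are made-up
round numbers, not WRF's actual variable list):

    nx, ny, nz = 1200, 800, 35
    bytes_per_value = 4                      # single-precision reals
    n3d_fields, n2d_fields = 20, 40          # assumed, for illustration only

    frame_bytes = (n3d_fields * nx * ny * nz + n2d_fields * nx * ny) * bytes_per_value
    print(frame_bytes / 2**30)               # ~2.6 GiB per history frame

    # With these assumed counts a single frame already exceeds ~2 GB, so a
    # multi-frame netCDF history file needs large-file support or another
    # output format.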

Cheers and hope this helps,

Alexandre-
-- 
-------------------------------------------------------------
Alexandre Fierro, PhD
Research Scientist-
National Severe Storms Laboratory (NSSL/NOAA)
*The Cooperative Institute for Mesoscale Meteorological Studies* (OU/NOAA)
Los Alamos National Laboratory, Los Alamos, NM (LANL)

"Yesterday is History, Tomorrow is a Mystery and Today is a Gift; That is
why it is called the Present"

"There are only 10 types of people in the world:
Those who understand binary, and those who don't"

"My opinions are my own and not representative of OU, NSSL,
AOML, HRD, LANL or any affiliates."
         ^.^
       (o  o)
     /(   V   )\
   ---m---m----



On Mon, Mar 26, 2012 at 7:22 AM, Don Morton <Don.Morton at alaska.edu> wrote:

> Howdy,
>
> I suspect you have over-decomposed your Nest 2.
>
> Your Nest 2 has 151x196 = 29,596 horizontal grid points.  With 1152 tasks,
> each task only has about 26 grid points, or a 5x5 patch.  At this level of
> decomposition, I believe you're running into issues of not having enough grid
> points per task for the halo regions, etc.
>
> Actually, even with 224 cores, you only have about 132 grid points, or an
> 11x11 grid in each task.  Some have suggested in the past that maybe once
> you get below about 15x15 grid points per task, your scalability starts to
> suffer.
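>
> As a rough sketch of that arithmetic in Python (a back-of-the-envelope
> check, not something WRF reports itself; e_we and e_sn are taken from the
> namelist quoted below):
>
>     def points_per_task(e_we, e_sn, ntasks):
>         """Approximate horizontal grid points handled by each MPI task."""
>         return (e_we * e_sn) / ntasks
>
>     print(points_per_task(151, 196, 1152))  # ~25.7 -> roughly a 5x5 patch on nest 2
>     print(points_per_task(151, 196, 224))   # ~132  -> roughly an 11x11 patch
>     print(points_per_task(441, 369, 1152))  # ~141  -> domain 1 alone is still ~12x12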
>
> So, to re-answer your previous question, WRF will work with tens to
> hundreds of thousands of tasks, but you need to do this with sizable
> problems.  You can only decompose a problem of a given size so far before
>
> a) It just doesn't scale well anymore, and
> b) You over-refine it so much that it won't even run.  I suspect this is
> your problem with the 1152 tasks.
>
> Best Regards,
>
> Don Morton
>
>
> --
> Voice:  +1 907 450 8679
> Arctic Region Supercomputing Center
> http://weather.arsc.edu/
> http://people.arsc.edu/~morton/
>
>
> On Mon, Mar 26, 2012 at 8:42 AM, brick <brickflying at gmail.com> wrote:
>
>> Hi
>>
>> Thanks for the help.
>> Today I tested WRF 3.3 with 224 cores and it ran well. But when I increased
>> the core count to 1120, wrf.exe stopped integrating after 6 hours, and it
>> neither exited nor returned any error message.
>> The rsl.out.0000 file shows that wrf.exe hangs while dealing with domain 2.
>> The last 20 lines of rsl.out.0000 are shown here:
>>
>> Timing for main: time 2012-03-22_05:57:30 on domain   1:    0.10120 elapsed seconds.
>> Timing for main: time 2012-03-22_05:58:00 on domain   1:    0.10280 elapsed seconds.
>> Timing for main: time 2012-03-22_05:58:30 on domain   1:    0.10070 elapsed seconds.
>> Timing for main: time 2012-03-22_05:59:00 on domain   1:    0.10150 elapsed seconds.
>> Timing for main: time 2012-03-22_05:59:30 on domain   1:    0.10080 elapsed seconds.
>> Timing for main: time 2012-03-22_06:00:00 on domain   1:    0.09900 elapsed seconds.
>>   *************************************
>>   Nesting domain
>>   ids,ide,jds,jde            1         151           1         196
>>   ims,ime,jms,jme           -4          15          -4          20
>>   ips,ipe,jps,jpe            1           5           1           6
>>   INTERMEDIATE domain
>>   ids,ide,jds,jde          243         278         150         194
>>   ims,ime,jms,jme          238         255         145         162
>>   ips,ipe,jps,jpe          241         245         148         152
>>   *************************************
>>  d01 2012-03-22_06:00:00  alloc_space_field: domain            2,  18001632 bytes allocated
>>  d01 2012-03-22_06:00:00  alloc_space_field: domain            2,   1941408 bytes allocated
>>  d01 2012-03-22_06:00:00 *** Initializing nest domain # 2 from an input file. ***
>>  d01 2012-03-22_06:00:00 med_initialdata_input: calling input_input
>>
>> The namelist is:
>>   1  &time_control
>>   2    run_days = 0,
>>   3    run_hours = 72,
>>   4    run_minutes = 0,
>>   5    run_seconds = 0,
>>   6    start_year = 2012, 2012,
>>   7    start_month = 03,   03,
>>   8    start_day = 22,   22,
>>   9    start_hour = 00,   06,
>>  10    start_minute = 00, 00,
>>  11    start_second = 00, 00,
>>  12    end_year = 2012, 2012,
>>  13    end_month = 03,   03,
>>  14    end_day = 25,   23,
>>  15    end_hour = 00,   06,
>>  16    end_minute = 00, 00,
>>  17    end_second = 00, 00,
>>  18    interval_seconds = 21600,
>>  19    input_from_file = .true.,.true.,
>>  20    history_interval = 60, 60,
>>  21    frames_per_outfile = 13,13,
>>  22    restart = .false.,
>>  23    restart_interval = 36000,
>>  24    io_form_history = 2,
>>  25    io_form_restart = 2,
>>  26    io_form_input = 2,
>>  27    io_form_boundary = 2,
>>  28    debug_level = 0,
>>  29  /
>>  30
>>  31  &domains
>>  32    time_step = 30,
>>  33    time_step_fract_num = 0,
>>  34    time_step_fract_den = 1,
>>  35    max_dom = 2,
>>  36    s_we = 1, 1, 1,
>>  37    e_we = 441, 151,
>>  38    s_sn = 1, 1, 1,
>>  39    e_sn = 369, 196,
>>  40    s_vert = 1, 1, 1,
>>  41    e_vert = 51,51,
>>  42    p_top_requested = 5000,
>>  43    num_metgrid_levels = 27,
>>  44    num_metgrid_soil_levels = 4,
>>  45    dx = 5000, 1000,
>>  46    dy = 5000, 1000,
>>  47    grid_id = 1, 2, 3,
>>  48    parent_id = 0, 1, 2,
>>  49    i_parent_start = 0,     245,
>>  50    j_parent_start = 0,     152,
>>  51    parent_grid_ratio = 1,     5,
>>  52    parent_time_step_ratio = 1,     5,
>>  53    feedback = 0,
>>  54    smooth_option = 0,
>>  55  /
>>  56
>>  57  &physics
>>  58    mp_physics = 6,6,
>>  59    ra_lw_physics = 1, 1, 1,
>>  60    ra_sw_physics = 1, 1, 1,
>>  61    radt = 5,1,
>>  62    sf_sfclay_physics = 1,1,
>>  63    sf_surface_physics = 2, 2, 2,
>>  64    bl_pbl_physics = 1,1,
>>  65    bldt = 0, 0, 0,
>>  66    cu_physics = 0,0,
>>  67    cudt = 5, 5, 5,
>>  68    isfflx = 1,
>>  69    ifsnow = 0,
>>  70    icloud = 1,
>>  71    surface_input_source = 1,
>>  72    num_soil_layers = 4,
>>  73    sf_urban_physics = 0, 0, 0,
>>  74  /
>>  75
>>  76  &fdda
>>  77  /
>>  78
>>  79  &dynamics
>>  80    w_damping = 0,
>>  81    diff_opt = 1,
>>  82    km_opt = 4,
>>  83    diff_6th_opt = 0, 0, 0,
>>  84    diff_6th_factor = 0.12, 0.12, 0.12,
>>  85    base_temp = 290.,
>>  86    damp_opt = 1,
>>  87    zdamp = 5000,
>>  88    dampcoef = 0.01,
>>  89    khdif = 0, 0, 0,
>>  90    kvdif = 0, 0, 0,
>>  91    non_hydrostatic = .true., .true., .true.,
>>  92    moist_adv_opt = 1, 1, 1,
>>  93    scalar_adv_opt = 1, 1, 1,
>>  94  /
>>  95
>>  96  &bdy_control
>>  97    spec_bdy_width = 5,
>>  98    spec_zone = 1,
>>  99    relax_zone = 4,
>> 100    specified = .true., .false., .false.,
>> 101    nested = .false., .true., .true.,
>> 102  /
>> 103
>> 104  &grib2
>> 105  /
>> 106
>> 107  &namelist_quilt
>> 108    nio_tasks_per_group = 0,
>> 109    nio_groups = 1,
>> 110  /
>>
>> Thanks a lot.
>>
>> brick
>>
>>
>>
>>
>> On Sat, Mar 24, 2012 at 12:39 AM, Welsh, Patrick T <pat.welsh at unf.edu> wrote:
>>
>>>  It runs fine with hundreds, ok with thousands.
>>>
>>> Pat
>>>
>>>
>>>
>>> On 3/23/12 4:12 AM, "brick" <brickflying at gmail.com> wrote:
>>>
>>> Hi All
>>>
>>> Is there a limit on the number of cores that WRF can use? I plan to test WRF
>>> with 2048 cores or more next week. Can WRF run with such a huge number?
>>> Thanks a lot.
>>>
>>> brick
>>>
>>>
>>> --
>>>
>>>
>>
>
>
>
> _______________________________________________
> Wrf-users mailing list
> Wrf-users at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/wrf-users
>
>