[Met_help] [rt.rap.ucar.edu #91969] History for grid_stat memory usage

John Halley Gotway via RT met_help at ucar.edu
Wed Nov 6 16:57:50 MST 2019


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Greetings MET team
I am currently testing grid_stat using MET v8.1 on the WCOSS at NCEP, and
I am running into some issues with grid_stat memory usage.  I am trying to
submit a job using bsub to run grid_stat on multiple threads for several
elements, projection hours, regions, etc.  The max memory usage for a
single run of grid_stat is typically under 700 MB.

Resource usage summary:

    CPU time :                                   34.21 sec.
    Max Memory :                                 560 MB
    Average Memory :                             452.00 MB
    Total Requested Memory :                     1300.00 MB
    Delta Memory :                               740.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                5
    Run time :                                   72 sec.
    Turnaround time :                            74 sec.


When I set the resources for bsub accordingly, I occasionally run out of
memory, resulting in my job being killed by the system.  In these
instances, grid_stat seems to use 1000-1300 MB.
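
For reference, the shape of the bsub request I'm describing is roughly the
following.  This is only an illustrative sketch: the queue, project,
wall-clock, and memory values are placeholders and are not copied from
bsub_gridverif.sh.

#!/bin/bash
# Illustrative only: queue, project, wall-clock, and paths are placeholders.
# Memory units for -M / rusage depend on the site's LSF_UNIT_FOR_LIMITS
# setting (assumed to be MB here, to match the resource summaries in this
# message).
bsub -q "$QUEUE" -P "$PROJECT" \
     -W 0:15 \
     -M 1300 -R "rusage[mem=1300]" \
     -o grid_stat.%J.out \
     grid_stat "$FCST_FILE" "$OBS_FILE" GridStatConfig -outdir "$OUT_DIR" -v 2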

Sep  9 18:45:47 v65c51 kernel: WARNING  grid_stat invoked oom-killer:
gfp_mask=0xd0, order=0, oom_score_adj=0
Sep  9 18:45:47 v65c51 kernel: INFO  grid_stat cpuset=/ mems_allowed=0-1
Sep  9 18:45:47 v65c51 kernel: WARNING  CPU: 1 PID: 126686 Comm: grid_stat
Tainted: G        W  OE  ------------   3.10.0-862.20.2.el7.x86_64 #1
Sep  9 18:45:47 v65c51 kernel: WARNING  Hardware name: Dell Inc. PowerEdge
C6320/082F9M, BIOS 2.8.0 05/28/2018
Sep  9 18:45:47 v65c51 kernel: WARNING  Call Trace:
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97b138b4>]
dump_stack+0x19/0x1b
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97b0ea7f>]
dump_header+0x90/0x229
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff9759a826>] ?
find_lock_task_mm+0x56/0xc0
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff9760f878>] ?
try_get_mem_cgroup_from_mm+0x28/0x60
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff9759acd4>]
oom_kill_process+0x254/0x3d0
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97613686>]
mem_cgroup_oom_synchronize+0x546/0x570
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97612b00>] ?
mem_cgroup_charge_common+0xc0/0xc0
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff9759b564>]
pagefault_out_of_memory+0x14/0x90
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97b0cc21>]
mm_fault_error+0x6a/0x157
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97b20846>]
__do_page_fault+0x496/0x4f0
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97b208d5>]
do_page_fault+0x35/0x90
Sep  9 18:45:47 v65c51 kernel: WARNING  [<ffffffff97b1c758>]
page_fault+0x28/0x30
Sep  9 18:45:47 v65c51 kernel: INFO  Task in
/lsf/venus/job.8590410.125604.1568054702 killed as a result of limit of
/lsf/venus/job.8590410.125604.1568054702
Sep  9 18:45:47 v65c51 kernel: INFO  memory: usage 716800kB, limit
716800kB, failcnt 8745
Sep  9 18:45:47 v65c51 kernel: INFO  memory+swap: usage 716800kB, limit
716800kB, failcnt 0
Sep  9 18:45:47 v65c51 kernel: INFO  kmem: usage 0kB, limit
9007199254740988kB, failcnt 0
Sep  9 18:45:47 v65c51 kernel: INFO  Memory cgroup stats for
/lsf/venus/job.8590410.125604.1568054702: cache:20KB rss:716780KB
rss_huge:579584KB mapped_file:4KB swap:0KB inactive_anon:8KB
active_anon:716756KB inactive_file:0KB active_file:0KB unevictable:0KB
Sep  9 18:45:47 v65c51 kernel: INFO  [ pid ]   uid  tgid total_vm      rss
nr_ptes swapents oom_score_adj name
Sep  9 18:45:47 v65c51 kernel: INFO  [125604] 10394 125604     6180
 1533      15        0             0 res
Sep  9 18:45:47 v65c51 kernel: INFO  [125979] 10394 125979    28335
364      12        0             0 1568054683.8590
Sep  9 18:45:47 v65c51 kernel: INFO  [125981] 10394 125981    28403
504      13        0             0 hourly_met_veri
Sep  9 18:45:47 v65c51 kernel: INFO  [126686] 10394 126686   228332
 178457     412        0             0 grid_stat
Sep  9 18:45:47 v65c51 kernel: ERR  Memory cgroup out of memory: Kill
process 126686 (grid_stat) score 998 or sacrifice child
Sep  9 18:45:47 v65c51 kernel: ERR  Killed process 126686 (grid_stat)
total-vm:913328kB, anon-rss:710772kB, file-rss:0kB, shmem-rss:3056kB


TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 130.

Resource usage summary:

    CPU time :                                   53.24 sec.
    Max Memory :                                 700 MB
    Average Memory :                             298.00 MB
    Total Requested Memory :                     700.00 MB
    Delta Memory :                               0.00 MB
    Max Swap :                                   -
    Max Processes :                              5
    Max Threads :                                6
    Run time :                                   66 sec.
    Turnaround time :                            85 sec.

____________________________________________________________


Resource usage summary:

    CPU time :                                   33.58 sec.
    Max Memory :                                 1062 MB
    Average Memory :                             703.00 MB
    Total Requested Memory :                     1300.00 MB
    Delta Memory :                               238.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                5
    Run time :                                   67 sec.
    Turnaround time :                            82 sec.


Repeating a job with the same requested resources will result in different
memory usage.  I have also reproduced this issue with different inputs and
config files.
I realize the difference in resources is not that large, but I intend to
run multiple jobs and threads across multiple nodes and want to make sure
that I can do this repeatedly without error.  Please let me know if others
have encountered this issue and whether there is a workaround for it,
other than just running with more resources.  If you need any more
information from me, please let me know.
Thanks
John


-- 
John Wagner
Verification Task Lead
COR Task Manager
NOAA/National Weather Service
Meteorological Development Laboratory
Digital Forecast Services Branch
SSMC2 Room 10106
Silver Spring, MD 20910
(301) 427-9471 (office)
(908) 902-4155 (cell/text)


----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: grid_stat memory usage
From: John Halley Gotway
Time: Tue Sep 10 19:14:13 2019

John,

Hmmm, the behavior you describe is interesting.  I do have access to the
development side of WCOSS.  Can you please point me to the Grid-Stat
script or command which results in this inconsistent memory usage?  I'd
like to look at the input files and configuration file to see if there's
any reasonable explanation why the memory usage would differ from run to
run.

I am actually in town visiting NCWCP in Greenbelt, MD this week.  I'll be
there on Wednesday and Thursday.  I'm not sure if you sit at NCWCP or in
Silver Spring, but if it works out, we could potentially meet up to
discuss.

Thanks,
John Halley Gotway

On Tue, Sep 10, 2019 at 11:17 AM Randy Bullock via RT
<met_help at ucar.edu>
wrote:

>
> Tue Sep 10 11:16:37 2019: Request 91969 was acted upon.
> Transaction: Given to johnhg (John Halley Gotway) by bullock
>        Queue: met_help
>      Subject: grid_stat memory usage
>        Owner: johnhg
>   Requestors: john.l.wagner at noaa.gov
>       Status: new
>  Ticket <URL:
https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=91969 >
>
>
> This transaction appears to have no content
>

------------------------------------------------
Subject: grid_stat memory usage
From: John L Wagner - NOAA Federal
Time: Wed Sep 11 05:18:53 2019

John
My scripts are in
/gpfs/dell3/mdl/mdlverif/noscrub/usr/John.L.Wagner/qpfvs/processing.  I'm
using bsub_gridverif.sh to submit hourly_met_verif_urma.sh.
My run directories and mpmd files can be found in my ptmp area:
/gpfs/dell3/ptmp/John.L.Wagner
I work out of Silver Spring, but am actually working from home today.  A
face-to-face meeting doesn't seem likely this week.  We frequently use
Google Hangouts to meet virtually and screen share, and I'm always happy
to do that when you get back from your trip.
Thanks
John

--
John Wagner
Verification Task Lead
COR Task Manager
NOAA/National Weather Service
Meteorological Development Laboratory
Digital Forecast Services Branch
SSMC2 Room 10106
Silver Spring, MD 20910
(301) 427-9471 (office)
(908) 902-4155 (cell/text)

------------------------------------------------
Subject: grid_stat memory usage
From: John Halley Gotway
Time: Thu Sep 19 13:47:03 2019

John,

Sorry for the long delay in getting back to this.  I logged onto WCOSS
this morning and took a look at your setup.  I looked through the
configuration of your Grid-Stat runs and didn't see any obvious reason why
the memory usage would vary in a seemingly random way.

After looking through your setup, I do have a few suggestions to consider.
All paths are listed relative to this top-level directory:
/gpfs/dell3/mdl/mdlverif/noscrub/usr/John.L.Wagner/qpfvs

(1) In met_util/grid_stat_config, set "mask.grid" to an empty list instead
of just commenting it out.  Grid-Stat first reads its default config file
and then overrides that with yours.  The default config file sets
"grid = [ "FULL" ];" to compute stats over the entire model domain.  Just
commenting out "//   grid = [ "G212" ];" won't override the default
setting.  So I'd suggest using:

mask = {
   poly = [ "${MASKS}" ];
   grid = [ ];
}

(2) In met_util/grid_stat_config, define "cnt_thresh" only for the obs.
The cnt_thresh setting is used to filter the matched pairs which are
included in the continuous stats.  You have very specific bins listed, and
those settings are applied to both the fcst and obs data.  The "cnt_logic"
is set to UNION, meaning that a pair is included in the stats if the fcst
OR the obs value falls in that range.  That doesn't nicely subdivide your
pairs into buckets; the same pair can end up in multiple buckets since it's
the forecast OR the observation that has to match.  Instead, specify the
cnt_thresh for the obs only, like this:

cnt_logic = INTERSECTION;

fcst = {
   field = [
      { name       = "APCP_06";
        level      = "(*,*)";
        cnt_thresh = [ NA, NA, NA, NA, NA, NA, NA, NA ];
        cat_thresh = [ >=0.254, >=2.54, >=6.35, >=12.7, >=25.4, >=50.8,
                       >=76.2, >=127 ];
      }
   ];
}

obs = {
   field = [
      { name       = "APCP_06";
        level      = "(*,*)";
        cnt_thresh = [ >=0&&<0.254, >=0.254&&<2.54, >=2.54&&<6.35,
                       >=6.35&&<12.7, >=12.7&&<25.4, >=25.4&&<76.2,
                       >=76.2&&<127, >=127 ];
        cat_thresh = [ >=0.254, >=2.54, >=6.35, >=12.7, >=25.4, >=50.8,
                       >=76.2, >=127 ];
      }
   ];
}

(3) I see that for CONUS you're computing stats for 133 different
subregions:
   /mdlverif/noscrub/usr/mdl.verif/masks
I would recommend making sure that the dimensions of each mask exactly
match the dimensions of the forecast grid, so as to avoid regridding those
133 masking regions at run time.  I'm pretty sure they do, but I couldn't
actually figure out the input model data.
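
If any of them don't already match, one option would be to regenerate the
masks directly on the forecast grid, so Grid-Stat never has to regrid them
at run time.  Here's a rough sketch using the gen_vx_mask tool; it assumes
the regions are defined as lat/lon polyline (.poly) files, and the sample
forecast file and output directory are placeholders:

#!/bin/bash
# Sketch only: paths are placeholders, and the regions are assumed to be
# ASCII .poly files.
FCST_FILE=/path/to/sample_forecast.grb2    # any file on the forecast grid
MASK_DIR=/mdlverif/noscrub/usr/mdl.verif/masks
OUT_DIR=./masks_on_fcst_grid

mkdir -p "$OUT_DIR"
for poly in "$MASK_DIR"/*.poly; do
   region=$(basename "$poly" .poly)
   # gen_vx_mask writes a 0/1 mask on the grid of its first argument, so
   # the output matches the forecast grid exactly.
   gen_vx_mask "$FCST_FILE" "$poly" "$OUT_DIR/${region}_mask.nc" \
      -type poly -name "$region"
done

The resulting NetCDF masks could then be listed in "mask.poly" in place of
the original files, since that option accepts the gen_vx_mask output as
well as ASCII poly files.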

So I don't see any obvious issues with memory.  However, there's always
the possibility of a problem in the code, like a memory leak.  If you'd
like I could try replicating a single run of Grid-Stat locally using all
133 of your sub-regions... and profile the code to look for memory leaks.
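
For reference, the sort of check I have in mind is running that single
case under valgrind.  This is just a sketch; the grid_stat path and input
files below are placeholders for one of your runs:

# Sketch only: executable path, inputs, and output directory are placeholders.
valgrind --tool=memcheck --leak-check=full --show-leak-kinds=definite \
   /path/to/met-8.1/bin/grid_stat fcst.grb2 obs.grb2 GridStatConfig \
   -outdir ./vg_out -v 3 2> grid_stat_valgrind.log

Any "definitely lost" blocks reported at exit would point to a problem in
the code itself rather than in your configuration.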

But let me know if you have any more info on this issue.

Thanks,
John


------------------------------------------------
Subject: grid_stat memory usage
From: John L Wagner - NOAA Federal
Time: Fri Sep 20 05:46:40 2019

Thanks John, I appreciate you looking into this.  Hopefully, we'll get the
WCOSS back later this morning and I will be able to test your suggestions.
Otherwise, I likely won't be able to test until next Friday, as the dev
machine will be down most of next week.  Either way, I'll let you know how
it goes.
Thanks
John


--
John Wagner
Verification Task Lead
COR Task Manager
NOAA/National Weather Service
Meteorological Development Laboratory
Digital Forecast Services Branch
SSMC2 Room 10106
Silver Spring, MD 20910
(301) 427-9471 (office)
(908) 902-4155 (cell/text)

------------------------------------------------

