[Met_help] [rt.rap.ucar.edu #98698] History for Bad stat files from grid_stat

Fri Mar 12 11:01:02 MST 2021

----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Good morning MET help desk
I continue to have issues with the stat files that come from grid_stat on
occasion.  Please see the attached stat file.  I am getting lines with
extra characters or missing characters.  I wrote a script that attempts to
clean the stat files so that stat_analysis can read them, which is why the
bad lines are either at the top or bottom of the file.
I'm certain that it's something in the way I'm running grid_stat on WCOSS
with mpirun that is causing the issue.  If I rerun the same case, I am
unable to produce the same bad file.  I can't figure out what may be
causing it though.  If you have any insights on anything in the writing of
these files that may cause an error when running in parallel and likely
writing multiple files at once (in different directories), please let me
know.
Thanks
John

-- 
John Wagner
Verification Task Lead
NOAA/National Weather Service
Meteorological Development Laboratory
Digital Forecast Services Division
SSMC2 Room 10106
Silver Spring, MD 20910
(301) 427-9471 (office)
(908) 902-4155 (cell/text)

----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: Bad stat files from grid_stat
From: John Halley Gotway
Time: Wed Feb 17 11:46:41 2021

Hi John,

This is John Halley Gotway. We've been talking about this issue over the
last couple of days. Unfortunately, I don't know of any great way to fix
it, or even begin to debug it, since it's not reproducible when you rerun.
I really doubt that it's a bug in the MET code that can be isolated and
fixed, and I suspect it has something to do with the environment, but
again, I don't know how to prove that. I definitely see how it would be
very frustrating to have low confidence in the output that Grid-Stat is
creating.

Running the sample file through MET's Stat-Analysis tool did prompt one
idea. When you do that, Stat-Analysis errors out:

/Volumes/d1/projects/MET/MET_development/MET-develop/met/bin/stat_analysis -lookin grid_stat_1680000L_20210116_000000V.stat -job filter -dump_row /tmp/tmp.dump
ERROR  : AsciiHeader::read() -> trouble reading file:
ERROR  : /Volumes/d1/projects/MET/MET_development/MET-develop/met/share/met/table_files/met_header_columns_V9.1V9.txt

One option to consider is enhancing Grid-Stat to detect when it has written
bad data. Rather than doing so silently, have Grid-Stat return a bad
status. And when it does, that could trigger the calling script to resubmit
that job.

But we certainly wouldn't want to enable this logic all the time since this
is the only time I've heard about this problem.
We could add a config file option to ConfigConstants like
"validate_stat_output = TRUE/FALSE". If true, then after writing its output
but before exiting, Grid-Stat could leverage code from Stat-Analysis to
read its output back in. If any errors were detected reading its own
output, then Grid-Stat could return bad status.

So rather than silently creating bad output, it'd error out with bad
status. While it wouldn't fix the underlying problem, it'd provide an easy
way for you to handle it. The downside of course is that it'd require
development on the MET side.

Another approach would be writing a script to do essentially the same
thing. Whatever script calls Grid-Stat could next run a Stat-Analysis job
to read the Grid-Stat output. If that Stat-Analysis job fails, have the run
script fail with bad status as well... or have the run script re-run the
previous call to Grid-Stat.
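
The run-then-validate-then-retry idea could be sketched roughly as follows.
This is only an illustration, not part of MET: the function name
run_with_validation and its parameters are hypothetical, and it assumes
that both tools signal failure through a nonzero exit status. Whether a
given Stat-Analysis job catches every kind of corruption is a separate
question that this sketch does not answer.

```python
import subprocess
from typing import Callable, Sequence


def run_with_validation(
    run_cmd: Sequence[str],
    validate_cmd: Sequence[str],
    max_attempts: int = 3,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run run_cmd, then validate_cmd; repeat both if either one fails.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if no attempt produces output that validates.
    The runner argument exists only so the logic can be tested
    without the real tools installed.
    """
    for attempt in range(1, max_attempts + 1):
        # Step 1: the tool that writes the .stat file (e.g. grid_stat).
        if runner(list(run_cmd)).returncode != 0:
            continue
        # Step 2: read the output back with a second tool
        # (e.g. a stat_analysis filter job on the new .stat file).
        if runner(list(validate_cmd)).returncode == 0:
            return attempt
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

Here validate_cmd could be the command shown earlier in this message
(stat_analysis with -lookin, -job filter and -dump_row), and run_cmd would
be whatever Grid-Stat command line the existing run script already uses.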

What do you think about these ideas?

Thanks,
John

On Wed, Feb 17, 2021 at 10:02 AM George McCabe via RT
<met_help at ucar.edu>
wrote:

>
> Wed Feb 17 10:01:59 2021: Request 98698 was acted upon.
> Transaction: Given to johnhg (John Halley Gotway) by mccabe
>        Queue: met_help
>      Subject: Bad stat files from grid_stat
>        Owner: johnhg
>   Requestors: john.l.wagner at noaa.gov
>       Status: new
>  Ticket <URL: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=98698 >
>
>
> This transaction appears to have no content
>

------------------------------------------------
Subject: Bad stat files from grid_stat
From: John Halley Gotway
Time: Wed Feb 17 12:04:56 2021

John,

I realize that using the Stat-Analysis "filter" job is a pretty weak
validation step! After fixing "V9.1V9.1" to just be "V9.1", the job no
longer complains about the content or errors out. So using this approach,
we'd definitely need to add some logic to do more validation. At a minimum:
(1) make sure all the timestamps follow an expected format:
    YYYYMMDD[_HH[MMSS]]
(2) make sure all the lines have the expected number of columns

So calling Stat-Analysis from a script wouldn't find most of these
problems.
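
The two minimum checks listed above might look something like the sketch
below. It is a rough illustration under stated assumptions, not MET code:
the function name check_stat_lines is hypothetical, the first line is
assumed to be the header row that names the leading columns, the timestamp
columns are assumed to be the ones whose names end in _BEG or _END, and the
expected column count for each line type is supplied by the caller rather
than known to the script.

```python
import re
from typing import Dict, List

# YYYYMMDD[_HH[MMSS]], the timestamp format named in check (1).
TIMESTAMP_RE = re.compile(r"^\d{8}(_\d{2}(\d{4})?)?$")


def check_stat_lines(lines: List[str], expected_cols: Dict[str, int]) -> List[str]:
    """Return a list of human-readable problems found in .stat text lines.

    lines[0] is assumed to be the header row naming the leading columns.
    expected_cols maps a line type (e.g. "CTS") to its total column count;
    these numbers come from the caller and are not hard-coded here.
    """
    header = lines[0].split()
    # Assumption: timestamp columns are the ones named *_BEG or *_END.
    time_idx = [i for i, name in enumerate(header)
                if name.endswith(("_BEG", "_END"))]
    type_idx = header.index("LINE_TYPE")
    problems = []
    for num, line in enumerate(lines[1:], start=2):
        cols = line.split()
        if len(cols) <= type_idx:
            problems.append(f"line {num}: too few columns ({len(cols)})")
            continue
        # Check (2): column count for this line type.
        line_type = cols[type_idx]
        want = expected_cols.get(line_type)
        if want is None:
            problems.append(f"line {num}: unknown line type {line_type!r}")
        elif len(cols) != want:
            problems.append(f"line {num}: {len(cols)} columns, expected {want}")
        # Check (1): timestamp format.
        for i in time_idx:
            if i < len(cols) and not TIMESTAMP_RE.match(cols[i]):
                problems.append(f"line {num}: bad timestamp {cols[i]!r}")
    return problems
```

An empty list means neither check found anything; it does not mean the
file is otherwise valid.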

The file you sent contains CTC, CTS, and MCTC line types. The corrupted
data seems to appear only in the CTS line type... perhaps that's a clue?

John

------------------------------------------------
Subject: Bad stat files from grid_stat
From: John L Wagner - NOAA Federal
Time: Wed Feb 17 12:58:41 2021

Thanks for the response John.  Comments below.

On Wed, Feb 17, 2021 at 2:05 PM John Halley Gotway via RT
<met_help at ucar.edu>
wrote:

> John,
>
> I realize that using the Stat-Analysis "filter" job is a pretty weak
> validation step! After fixing "V9.1V9.1" to just be "V9.1", the job no
> longer complains about the content or errors out. So using this approach,
> we'd definitely need to add some logic to do more validation. At a minimum:
> (1) make sure all the timestamps follow an expected format:
> YYYYMMDD[_HH[MMSS]]
> (2) make sure all the lines have the expected number of columns
>
> So calling Stat-Analysis from a script wouldn't find most of these
> problems.
>
> The file you sent contains CTC, CTS, and MCTC line types. It appears the
> corrupted data only appears in the CTS line type... perhaps that's a clue?
>

CTS is definitely a clue.  I have only noticed these issues for elements
where I need the contingency table scores (wind direction, qpf06, and sky
cover are the most common).  Elements that just use the continuous scores
(temperature, dewpoint, etc.) never seem to have an issue.

>
> John
>
> On Wed, Feb 17, 2021 at 11:46 AM John Halley Gotway <johnhg at ucar.edu>
> wrote:
>
> > Hi John,
> >
> > This is John Halley Gotway. We've been talking about this issue over the
> > last couple of days. Unfortunately, I don't know of any great way to fix
> > it, or even begin to debug it since it's not reproducible when you rerun.
> > I really doubt that it's a bug in the MET code that can be isolated and
> > fixed, and I suspect it has something to do with the environment, but
> > again, I don't know how to prove that. I definitely see how it would be
> > very frustrating to have low confidence in the output that Grid-Stat is
> > creating.
>

I'm fairly certain this is a WCOSS issue.  It seems like we get slower
write speeds when the machine is busy (there has certainly been a lot of
competition for resources lately).  It seems at times like one file is not
finished writing before another file is opened for writing.  I'm not sure
of the order in which grid-stat writes files, but if the CTS file and the
stat files are the last to be written, perhaps that's why they are the ones
most often affected.
Of all of the jobs I've run on WCOSS over the years, grid-stat and
point-stat are the only ones that I've seen this kind of behavior from.

> >
> > Running the sample file through MET's Stat-Analysis tool, did prompt one
> > idea. When you do that, Stat-Analysis errors out:
> >
> > /Volumes/d1/projects/MET/MET_development/MET-develop/met/bin/stat_analysis -lookin grid_stat_1680000L_20210116_000000V.stat -job filter -dump_row /tmp/tmp.dump
> > ERROR  : AsciiHeader::read() -> trouble reading file:
> > ERROR  : /Volumes/d1/projects/MET/MET_development/MET-develop/met/share/met/table_files/met_header_columns_V9.1V9.txt
> >
> > One option to consider is enhancing Grid-Stat to detect when it has
> > written bad data. Rather than doing so silently, have Grid-Stat return a
> > bad status. And when it does, that could trigger the calling script to
> > resubmit that job.
> >
> > But we certainly wouldn't want to enable this logic all the time since
> > this is the only time I've heard about this problem.
> > We could add a config file option to ConfigConstants like
> > "validate_stat_output = TRUE/FALSE". If true, then after writing its
> > output but before exiting, Grid-Stat could leverage code from
> > Stat-Analysis to read its output back in. If any errors were detected
> > reading its own output, then Grid-Stat could return bad status.
> >
> > So rather than silently creating bad output, it'd error out with bad
> > status. While it wouldn't fix the underlying problem, it'd provide an
> > easy way for you to handle it. The downside of course is that it'd
> > require development on the MET side.
> >
> > Another approach would be writing a script to do essentially the same
> > thing. Whatever script calls Grid-Stat could next run a Stat-Analysis
> > job to read the Grid-Stat output. If that Stat-Analysis job fails, have
> > the run script fail with bad status as well... or have the run script
> > re-run the previous call to Grid-Stat.
>

I've attempted to write scripts that do this.  I've tried counting the
number of columns or characters in each line based on line type.  There are
too many variations based on the element and number of thresholds for this
to be effective.  It also won't catch the variations of "V9.V9.1" that I've
seen.
Is there a formatted read that stat-analysis uses?  If so, could you point
me to it in the code?  It seems like doing a formatted read that mimics
stat-analysis is the only effective way to catch these issues.  I have no
clue if this script would be quicker than calling stat-analysis and
checking for errors.
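
One way a script might approximate a formatted read, without counting
columns per element, is to parse each token instead. The sketch below is
only a guess at what such a check could look like; it does not reproduce
how stat-analysis actually reads its input. The names first_problem and
n_header_cols are hypothetical, the number of leading header columns is
supplied by the caller, and it assumes that every data column after the
header columns is either numeric or "NA", which may hold for the CTC, CTS
and MCTC lines discussed here but not for line types that carry strings.

```python
import re
from typing import Optional

# A single version token such as "V9.1".
# Rejects doubled tokens such as "V9.V9.1" and "V9.1V9.1".
VERSION_RE = re.compile(r"^V\d+(\.\d+)*$")


def is_number_or_na(token: str) -> bool:
    """True if token is the string 'NA' or parses as a Python float."""
    if token == "NA":
        return True
    try:
        float(token)
    except ValueError:
        return False
    return True


def first_problem(line: str, n_header_cols: int) -> Optional[str]:
    """Return a description of the first problem in one data line, or None.

    n_header_cols is the caller-supplied number of leading header
    columns (everything up to and including the line type).
    """
    cols = line.split()
    if not cols or not VERSION_RE.match(cols[0]):
        return "bad version token"
    if len(cols) < n_header_cols:
        return "truncated header columns"
    # Assumption: every column after the header columns is numeric or NA.
    for tok in cols[n_header_cols:]:
        if not is_number_or_na(tok):
            return f"non-numeric data value {tok!r}"
    return None
```

Because it never compares against a fixed column count, this check is not
affected by the number of thresholds, but for the same reason it cannot
detect a line that is merely missing whole columns.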



------------------------------------------------