[Met_help] [rt.rap.ucar.edu #82515] History for question on dealing with calculation of bad statistics

John Halley Gotway via RT met_help at ucar.edu
Mon Oct 30 16:39:05 MDT 2017


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Hi,

I'm using pb2nc and point_stat to create matched pairs (mpr file) and stats
of wind speed between ASCAT data (point data from the prepbufr file)
and GFS data.  I'm also using wind thresholds, calculating stats (cts
and cnt files), and then creating tables from the output of those files.

I have pb2nc and point_stat running in a python script, within a cronjob,
which I've been running since Sept 20th.  On Oct 8th, not all the data was
retrieved from the prepbufr file, and point_stat processing took place with
the smaller dataset (1,039 matched points, instead of >100,000 matched
points).  I've attached the 3 images (ASCAT, GFS, and the difference of the
wind speed field) of the region where the data was located.  Looking at
these images, there are 2 small regions with differences, but not really
big ones.  Anyway, when plotting a time series of the CSI, POD, and FAR
stats, I noticed a big dip in the time series.  I've attached that plot,
too.

I took a closer look at the CSI values for the contingency table, and I saw
these values:
Total = 339
FY_OY = 14
FY_ON = 0
FN_OY = 66
FN_ON = 259
for a calculated CSI = FY_OY / (FY_OY + FY_ON + FN_OY) = 14/80 = 0.175

My question is: if there had been a full dataset, would the data have been
smoothed out more, and these differences been less noticeable?

Then, if it's the smaller dataset, is there a way to stop processing when
there is only a limited amount of data?  Or would I need to add something
to my python processing script to alert me to such problems?

Thanks!

Roz



-- 
Rosalyn MacCracken
Support Scientist

Ocean Applications Branch
NOAA/NWS Ocean Prediction Center
NCWCP
5830 University Research Ct
College Park, MD  20740-3818

(p) 301-683-1551
rosalyn.maccracken at noaa.gov


----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: question on dealing with calculation of bad statistics
From: John Halley Gotway
Time: Thu Oct 26 15:03:00 2017

Roz,

If I recall correctly, you're running Point-Stat to generate matched pairs
files and then running STAT-Analysis to aggregate them and compute
contingency tables.  Those tools don't have any idea how many data points
they "should" be processing.  They just process the data you pass to them.

One option that might be helpful is the STAT-Analysis "-column_thresh"
option.  Here's some potential logic you could use (sketched in the example
below):

(1) Run a STAT-Analysis "aggregate_stat" job to read the MPR lines, apply
thresholds, and write a CTC output line, using the "-out_stat" command line
option to write a .stat file.
(2) Run a 2nd STAT-Analysis "aggregate_stat" job to read the CTC output of
the first job and write a CTS statistics output line, but add the option
"-column_thresh TOTAL >1000".  That tells STAT-Analysis to only process CTC
lines where the TOTAL column exceeds 1000.
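
Since you're already driving things from a python script, those two jobs
might look something like this sketch (the paths, wind thresholds, and
output file names here are just illustrative, not from your setup):

   import subprocess

   # Job 1: read MPR lines, apply wind speed thresholds, and write an
   # aggregated CTC line to a .stat file.
   subprocess.run(["stat_analysis", "-lookin", "met_out",
                   "-job", "aggregate_stat",
                   "-line_type", "MPR", "-out_line_type", "CTC",
                   "-fcst_var", "WIND",
                   "-out_fcst_thresh", "ge10", "-out_obs_thresh", "ge10",
                   "-out_stat", "wind_ctc.stat"], check=True)

   # Job 2: read the CTC output of job 1 and write CTS statistics, but
   # only for CTC lines where the TOTAL column exceeds 1000.
   subprocess.run(["stat_analysis", "-lookin", "wind_ctc.stat",
                   "-job", "aggregate_stat",
                   "-line_type", "CTC", "-out_line_type", "CTS",
                   "-column_thresh", "TOTAL", ">1000",
                   "-out_stat", "wind_cts.stat"], check=True)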

Hopefully you'll find that logic, or some variant of it, to be
helpful.

Another idea would be to add a config file option for Point-Stat and
Grid-Stat... something like "min_total = 1000;".  When processing a
verification task, if fewer than the required minimum number of matched
pairs are found, we could skip it and not write any output.  Do you think
that logic would be helpful for you?
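
In the meantime, you could add a guard like that to your own python script.
Here's a minimal sketch, assuming you read the counts from one of your
Point-Stat "_ctc.txt" files (the file name and the 1000 cutoff are
illustrative):

   import pandas as pd

   MIN_TOTAL = 1000  # illustrative cutoff

   # MET's .txt output is whitespace-delimited with a single header row.
   ctc = pd.read_csv("point_stat_ctc.txt", delim_whitespace=True)

   # Warn about (or skip) contingency tables built from too few pairs.
   for _, row in ctc.iterrows():
       if row["TOTAL"] < MIN_TOTAL:
           print(f"WARNING: only {row['TOTAL']} matched pairs "
                 f"for threshold {row['FCST_THRESH']}; skipping stats")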

Thanks,
John


------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: Rosalyn MacCracken - NOAA Affiliate
Time: Fri Oct 27 06:34:56 2017

Hi John,

I'm actually not using stat_analysis.  So, after I run point_stat, I have
all those hourly output files, out to the 96th forecast hour.  We don't
have the disk space to keep all of them after maybe 6 months (maybe
shorter... can't remember right now... 3 months maybe?), so what I do is
use python to strip out the variables that I want and write them to a csv
file.  We figured out that if I make a csv file, we can make a dynamic
table for our website.
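
To give you an idea, the stripping step boils down to something like this
(the file name and the particular columns kept here are just illustrative):

   import pandas as pd

   # MET's .txt output is whitespace-delimited with a single header row.
   cts = pd.read_csv("point_stat_cts.txt", delim_whitespace=True)

   # Keep only the columns needed for the website tables.
   keep = ["FCST_LEAD", "FCST_VALID_BEG", "FCST_THRESH",
           "PODY", "FAR", "CSI"]
   cts[keep].to_csv("wind_cts.csv", index=False)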

So, my time series are made from reading in the hourly csv files and
writing them to a file that I can use to create a time series plot.  Oh,
and all of those tables I showed yesterday in that one plot (with the
contingency table, the continuous stats table, and the one with the
thresholds) are created from the hourly csv files.  I also create a daily
average, and will create a weekly average.  That's, again, from reading in
the csv files and creating the averages.

You have to be creative when you don't have a ton of disk space, like
WCOSS has.  What would be nice (maybe for later releases of MET) is for
the user to be able to choose which stats they want output, just in case
they don't use everything in the ctc file, or everything in the cts file,
etc.

So, that min_total option... does that exist now?  That would be useful.
But, here's the thing.  I agree with what you said, that MET just takes
the data and doesn't care how many points are there.  So, if those matched
points are close in forecast/observation value, the stats will be good.  I
think what happened in this case was that the wind speed values weren't
close, ~5 knots difference, and the stats reflected that.  I'm actually
glad that that case surfaced, since those are the errors that we're
looking for.

Roz



------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: John Halley Gotway
Time: Fri Oct 27 16:15:15 2017

Roz,

Thanks for describing the logic you use.  I get a sense for the general
flow of data through your system, but don't fully understand the
specifics.

It does seem to me that you may find the STAT-Analysis tool extremely
useful.  It has the ability to read MPR lines from .stat files, aggregate
them in very flexible ways, and derive a variety of output statistics line
types.

As for supporting user-specified subsets of output columns for the line
types, there would be a lot of details involved.  There's a lot of logic
in MET for reading/writing the various line types.  If we talked some more
about it and brainstormed, I suspect we could find some solutions that
wouldn't require that functionality.

To answer your question, no, the min_total option does not currently
exist.  I mentioned it as a potential enhancement for MET.  I'll create a
development issue in Jira for it.  It'd probably be wise to define it as a
threshold instead, since someone might want both an upper and a lower
bound:  n_obs_thresh = '>100&&<1000';

There's one really nice feature coming in met-6.1 in STAT-Analysis:
filtering lines based on the difference of columns.  For example, this
job:

   stat_analysis -lookin met_out -job filter -line_type MPR \
      -dump_row big_errors.stat -fcst_var WIND -column_thresh abs(FCST-OBS) ge5

This job would look in a directory named "met_out" for files ending in
".stat".  It'll read all the MPR lines where FCST_VAR = WIND and only keep
lines where the absolute value of FCST - OBS is greater than or equal to
5.  Any lines it finds are written to an output file named
big_errors.stat.

So this job will help you identify points where there are large errors in
your model.

Thanks,
John



------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: Rosalyn MacCracken - NOAA Affiliate
Time: Mon Oct 30 06:51:46 2017

Hi John,

That does sound like a nice enhancement for 6.1, and then maybe later
working on an upper and lower threshold bound.

As for stat_analysis, yes, I think there are some very useful things it
can do.  I only have two issues.  One is file/directory size after a while
of running these things.  Maybe I could use stat_analysis as a first step,
get the output I want, then read it into a python script and subset what I
really need to keep.  That might be easily doable.

The other issue I have is the directory structure of my 6-hourly files.
Currently, the *.stat and *.txt output from point_stat goes into a
directory structure like:
<full_path>/$YYYY$MM$DD$HH
So, they are separated by year, month, day, and hour.  To aggregate a
single day of files (like yesterday), I need to look in the directories
2017102900, 2017102906, 2017102912, and 2017102918 (4 separate
directories).  A week's worth of data would be 28 separate directories,
etc.

How do I use -lookin with 4 directories, or 28 directories, etc.?  I guess
4 directories might not be so bad on the command line:
-lookin <full_path>/+year+mon+day+"00"  <full_path>/+year+mon+day+"06"
<full_path>/+year+mon+day+"12" <full_path>/+year+mon+day+"18"

but a week's or month's worth of data could be a little much to put on a
single command line.  How would you do that?

I guess if I can figure out how to get stat_analysis to work with my
directory structure, then I can take advantage of that tool.
Roz


------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: John Halley Gotway
Time: Mon Oct 30 09:33:13 2017

Roz,

I have a few points to make.

First, I'd suggest using "mpr = STAT" in the output_flag section of the
Point-Stat configuration file.  You mentioned having both "_mpr.txt" and
".stat" files in your output directory.  Point-Stat writes all the output
to the ".stat" file and *duplicate* output to the ".txt" output files,
sorted by line type.  If you're writing both .stat and .txt output files,
then the output is twice as large as it needs to be.  If you incorporate
STAT-Analysis into your processing logic, you'd only need to write the
.stat output files.

Second, STAT-Analysis searches the "-lookin" directories *recursively*.
Suppose all of your output is in directories named
"met_output/YYYYMMDDHH".  You could just use "-lookin met_output" and it'd
search recursively through all the date subdirectories looking for files
ending in ".stat".  However, that isn't a great idea, because
STAT-Analysis would spend a lot of time reading through data that it'll
skip over anyway.

Instead, you might consider using output directories for month and day:
"met_output/YYYYMM/YYYYMMDD/YYYYMMDDHH".  Then processing one month's
worth of data would be as simple as "-lookin met_output/YYYYMM".

Also, be aware that the "-lookin" option can take multiple arguments,
enabling you to use wildcards.  So all the days in January 2017 would be:
-lookin met_output/201701*

And lastly, be aware that you can use the job command options of
STAT-Analysis to filter your data down even more.  For example, let's say
you've passed in data for all days in January, as shown above, but you
actually only want to process data from January 10th through January 25th.
In your job, you'd use the "-fcst_init_beg" and "-fcst_init_end" options:
   -fcst_init_beg 20170110_00 -fcst_init_end 20170125_18
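
Putting those pieces together in a script, a job over one month, narrowed
to that init-time window, might look something like this sketch (the
directory layout follows the suggestion above; the variable, thresholds,
and output name are illustrative):

   import subprocess

   subprocess.run(["stat_analysis",
                   "-lookin", "met_output/201701",  # month, searched recursively
                   "-job", "aggregate_stat",
                   "-line_type", "MPR", "-out_line_type", "CTS",
                   "-fcst_var", "WIND",
                   "-out_fcst_thresh", "ge10", "-out_obs_thresh", "ge10",
                   "-fcst_init_beg", "20170110_00",
                   "-fcst_init_end", "20170125_18",
                   "-out_stat", "wind_cts_201701.stat"], check=True)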

Hope that helps.

Thanks,
John


------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: Rosalyn MacCracken - NOAA Affiliate
Time: Mon Oct 30 10:19:45 2017

Hi John,

That's all really good info!  You might want to consider adding those
examples, with all that detail, to the User's Guide, for people like me
who don't get how to use stat_analysis.

So, I can see how I could use stat_analysis fairly easily if I save the
.stat file to a directory with month and day, like your example
(met_output/YYYYMM/YYYYMMDD/YYYYMMDDHH), and then I wouldn't need to keep
the other files after I plot them or create my tables with them.  If I
only output the *.stat file, I would have to go back through all my python
scripts (and functions) and change the logic of how I create my images and
tables.  That would be a pain.  Oh, but if I saved the *.stat files to a
directory like stat_output/YYYYMM, without day and hour, that would be the
easiest to search through, right?

Oh, and the one thing I was worried about with using the *.stat table was
how to search through it and the column headings with my python scripts.
I wouldn't have to worry about that if I'm only saving the *.stat files to
use with stat_analysis, right?  And the output from stat_analysis has just
the same column headings as the cts or cnt files, or even mpr files,
correct?

So, tell me, what are the most popular uses of stat_analysis?  Time series
plots, and aggregating statistics to look at things like forecast lead
time, etc.?  I think I looked into using stat_analysis when I was creating
the initial processing, because I have a list of stat_analysis commands
that I compiled.  Now I'm wondering if I've basically covered all the most
popular ways of using stat_analysis.

Ok... maybe I'll use this for the last bits of my processing... it might
make the last of my scripts easier to implement.

Roz


------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: John Halley Gotway
Time: Mon Oct 30 10:55:28 2017

Roz,

I can't really answer your question about the most popular ways of
using STAT-Analysis.  Most of the STAT-Analysis functionality
(aggregating statistics through time) is handled by METViewer.  In our
application of MET to testing and evaluation projects in the DTC, we
almost always use METViewer instead of STAT-Analysis.

But METViewer does not have logic for processing the individual MPR
lines, like STAT-Analysis does.  I can think of two T&E projects where
we created MPR lines and then used STAT-Analysis to aggregate them
through time and compute contingency table counts (CTC lines) that we
then loaded into METViewer.

We have also used it when creating that point statistics plot that
Perry Shafran was asking about last week during the MET+ tutorial.  We
computed stats separately for each unique station ID through time with
a job like this:
   stat_analysis -job aggregate_stat -line_type MPR -out_line_type CNT
      -fcst_var TMP -fcst_lev Z2 -by OBS_SID -lookin met_dir

And then we ran an NCL script to plot the output of that STAT-Analysis
job.
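
If it helps, here's a rough sketch of how a job like that could be
driven from a Python script, since your processing already runs in
python.  The input directory and output file name are placeholders:

   import subprocess

   # Run the per-station CNT job shown above; "met_dir" and the -out
   # file name are placeholders for your real paths.
   cmd = [
       "stat_analysis",
       "-job", "aggregate_stat",
       "-line_type", "MPR",
       "-out_line_type", "CNT",
       "-fcst_var", "TMP",
       "-fcst_lev", "Z2",
       "-by", "OBS_SID",
       "-lookin", "met_dir",
       "-out", "cnt_by_station.txt",
   ]
   subprocess.run(cmd, check=True)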

I can say that STAT-Analysis is extremely flexible.  And since you're
processing MPR lines, I figured you'd find it useful.

But as the saying goes, if it ain't broke, don't fix it!

John

------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: Rosalyn MacCracken - NOAA Affiliate
Time: Mon Oct 30 11:11:23 2017

Yeah, that makes sense about not fixing what isn't broken, but I might
have more options available to me if I did use it... And, I already
have the command to aggregate through time and compute CTC counts....
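
It's something like this (the variable name, threshold, and paths are
placeholders for my real ones):

   stat_analysis -job aggregate_stat -line_type MPR -out_line_type CTC
      -fcst_var WIND -out_fcst_thresh ge18 -out_obs_thresh ge18
      -out_stat wind_ctc.stat -lookin met_output/YYYYMM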

It's so hard to know which way to go... well, I could just copy a
subset of my *.stat data to a directory and play around with it.
Maybe that would make the decision of which way to go a little
easier...

Roz

--
Rosalyn MacCracken
Support Scientist

Ocean Applilcations Branch
NOAA/NWS Ocean Prediction Center
NCWCP
5830 University Research Ct
College Park, MD  20740-3818

(p) 301-683-1551
rosalyn.maccracken at noaa.gov

------------------------------------------------
Subject: question on dealing with calculation of bad statistics
From: Rosalyn MacCracken - NOAA Affiliate
Time: Mon Oct 30 11:36:53 2017

And then, after I decided that it would be a really useful tool, I
started checking into how much data would be created in a month (~470
GB), which could be close to 3 TB within 6 months.  So... that's not a
feasible solution.  Ok... back to the original way that I was creating
tables and such: just cut out what I need from the cycle file, and use
python to manipulate that.  The "smaller, sleeker, more compact"
version... not the "big monster truck" version.
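
Something like this minimal sketch is what I mean by cutting out what
I need (the file name and column list are placeholders; the columns
would come from the MET header row):

   import csv

   # Keep a few columns from a whitespace-delimited MET cts .txt file
   # (one header row, then one row per data line).  Names here are
   # placeholders and would be adjusted to the real output.
   keep = ["FCST_LEAD", "FCST_VALID_BEG", "TOTAL", "CSI", "FAR"]

   with open("point_stat_example_cts.txt") as f:
       header = f.readline().split()
       idx = [header.index(c) for c in keep]
       rows = [[line.split()[i] for i in idx]
               for line in f if line.strip()]

   with open("cts_subset.csv", "w", newline="") as f:
       w = csv.writer(f)
       w.writerow(keep)
       w.writerows(rows)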

Well, it was a really good idea...but...

Roz

On Mon, Oct 30, 2017 at 5:11 PM, Rosalyn MacCracken - NOAA Affiliate <
rosalyn.maccracken at noaa.gov> wrote:

> Yeah, that makes sense about not fixing what isn't broken, but, I
might
> have more options available to me if I did use it...And, I already
have the
> command to aggregate through time and compute CTC counts....
>
> It's so hard to know which is the way to go...well, I could just
copy a
> subset of my *.stat data to a driectory and play around with it.
Maybe,
> that would make the decision of which way to go a little easier...
>
> Roz
>
> On Mon, Oct 30, 2017 at 4:55 PM, John Halley Gotway via RT <
> met_help at ucar.edu> wrote:
>
>> Roz,
>>
>> I can't really answer your question about the most popular ways of
using
>> STAT-Analysis.  Most of the STAT-Analysis functionality
(aggregating
>> statistics through time) is handled by METViewer.  In our
application of
>> MET to testing and evaluation projects in the DTC, we almost always
use
>> METViewer instead of STAT-Analysis.
>>
>> But METViewer does not have logic for processing the individual MPR
lines,
>> like STAT-Analysis does.  I can think of two T&E projects where we
created
>> MPR lines and then used STAT-Analysis to aggregate them through
time and
>> compute contingency table counts (CTC lines) that we then loaded
into
>> METViewer.
>>
>> We also have used when creating that point statistics plot that
Perry
>> Shafran was asking about last week during the MET+ tutorial.  We
computed
>> stats separately for each unique station ID through time with a job
like
>> this:
>>    stat_analysis -job aggregate_stat -line_type MPR -out_line_type
CNT
>> -fcst_var TMP -fcst_lev Z2 -by OBS_SID -lookin met_dir
>>
>> And then we ran an NCL script to plot the output of that STAT-
Analysis
>> job.
>>
>> I can say that STAT-Analysis is extremely flexible.  And since
you're
>> processing MPR lines, I figured you'd find it useful.
>>
>> But as the say goes, if it ain't broke, don't fix it!
>>
>> John
>>
>> On Mon, Oct 30, 2017 at 10:19 AM, Rosalyn MacCracken - NOAA
Affiliate via
>> RT <met_help at ucar.edu> wrote:
>>
>> >
>> > <URL: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=82515 >
>> >
>> > Hi John,
>> >
>> > That's all really good info!  You might want to consider adding
those
>> > examples with all that detail to the Users Guide, for people like
me
>> that
>> > don't get how to use stat_analysis.
>> >
>> > So, I can see how I could use stat_analysis fairly easily if I
save that
>> > file to a directory with month and day, like your example:
>> > met_output/YYYYMM/YYYYMMDD/YYYYMMDDHH
>> > and, then I wouldn't need to keep the other files after I plot
them, or
>> > create my tables with them.  If I only outputed the *.stat file,
I would
>> > have to go back through all my python scripts (and functions) and
change
>> > the logic of how to create my images and tables.  That would be a
pain.
>> > Oh, but, I copied save the *stat files to a directory like
>> > stat_output/YYYYMM, without day and year, and that would be the
easiest
>> to
>> > search through, right?
>> >
>> > Oh, so, the one thing I was worried about using the *.stat table,
was
>> how
>> > to search through it and the column headings with my python
scripts. I
>> > wouldn't have to worry about that if I'm only saving the *.stat
files to
>> > use with stat_analyis, right?  And, my output from stat_anlaysis
is just
>> > the same columns headings as the cts or cnt files, or even mpr
files,
>> > correct?
>> >
>> > So, tell me, what are the most popular uses of stat_analysis?
Time
>> series
>> > plots and to aggregate statistics to look at things like forecast
lead
>> > time, etc?  I think I've looked into using stat_analysis when I
was
>> > creating the initial processing, because I have a list of
stat_analysis
>> > commands that I compiled.  Now, I'm wondering if I've basically
covered
>> all
>> > the most popular ways of using stat_analysis.
>> >
>> > Ok...maybe I'll use this for the last bits of my processing....it
might
>> > make the last of my scripts easier to implement.
>> >
>> > Roz
>> >
>> > On Mon, Oct 30, 2017 at 3:33 PM, John Halley Gotway via RT <
>> > met_help at ucar.edu> wrote:
>> >
>> > > Roz,
>> > >
>> > > I have a few points to make.
>> > >
>> > > First, I'd suggest using "mpr = STAT" in the output_flag
section of
>> the
>> > > Point-Stat configuration file.  You mentioned having both
"_mpr.txt"
>> and
>> > > ".stat" files in your output directory.  Point-Stat writes all
the
>> output
>> > > to the ".stat" file and *duplicate* output to the ".txt" output
files,
>> > > sorted by line type.  If you're writing both .stat and .txt
output
>> files,
>> > > then the output is twice as large as it needs to be.  If you
>> incorporate
>> > > STAT-Analysis into your processing logic, you'd only need to
write the
>> > > .stat output files.
>> > >
>> > > Second, STAT-Analysis searches the "-lookin" directories
>> *recursively*.
>> > > Suppose all of your output is in directories named
>> > > "met_output/YYYYMMDDHH".  You could just use "-lookin
met_output" and
>> > it'd
>> > > search recursively through all the date subdirectories looking
for
>> files
>> > > ending in ".stat".  However, that isn't a great idea because
>> > STAT-Analysis
>> > > would spend a lot of time reading through data that it'll skip
over
>> > anyway.
>> > >
>> > > Instead, you might consider using output directories for month
and
>> day:
>> > > "met_output/YYYYMM/YYYYMMDD/YYYYMMDDHH".  Then processing one
>> month's of
>> > > data would be as simple as "-lookin met_output/YYYYMM".
>> > >
>> > > Also, be aware that the "-lookin" option can take multiple
arguments,
>> > > enabling you to you use wildcards.  So all the days in January
2017
>> would
>> > > be: -lookin 201701*
>> > >
>> > > And lastly, be aware that you can use the job command options
of
>> > > STAT-Analysis to filter your data down even more.  For example,
lets
>> say
>> > > you've passed in data for all days in January, as shown above.
But
>> you
>> > > actually only want to process data from January 10th through
January
>> > 25th.
>> > > In you job, you'd use the "-fcst_init_beg" and "-fcst_init_end"
>> options:
>> > >    -fcst_init_beg 20170110_00 -fcst_init_end 20170125_18
>> > >
>> > > Hope that helps.
>> > >
>> > > Thanks,
>> > > John
>> > >
>> > > On Mon, Oct 30, 2017 at 6:51 AM, Rosalyn MacCracken - NOAA
Affiliate
>> via
>> > RT
>> > > <met_help at ucar.edu> wrote:
>> > >
>> > > >
>> > > > <URL: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=82515
>
>> > > >
>> > > > Hi John,
>> > > >
>> > > > That does sound like a nice enhancement for 6.1, and then,
maybe
>> later
>> > > > working on an upper and lower threshold bound.
>> > > >
>> > > > As for stat_analysis, yes, I think there are some very useful
>> things it
>> > > can
>> > > > do.  I only have two issues.  One is file/directory size
after a
>> while
>> > of
>> > > > running these things.  Maybe I could use stat_analysis as a
first
>> step,
>> > > get
>> > > > the output I want, then, read it into a python script and
subset
>> what I
>> > > > really need to keep.  That might be easily doable.
>> > > >
>> > > > The other issue I have is the directory structure of my 6
hourly
>> files.
>> > > > Currently, the *.stat and *.txt output from point_stat goes
into a
>> > > > directory structure like:
>> > > > <full_path>/$YYYY$MM$DD$HH
>> > > > So, they are separated by yea, month, day and hour.  So, to
>> aggregate a
>> > > > single day of files (like yesterday), I will need to look in
the
>> > > > directories: 2017102900 2017102906, 2017102912 and 2017102918
(4
>> > > separate
>> > > > directories).  A week's worth of data would be 28 separate
>> directories,
>> > > > etc.
>> > > >
>> > > > How do I use -lookin to find with 4 directories, or 28
directories,
>> > > etc?  I
>> > > > guess 4 directories might not be so bad on the command line:
>> > > > -lookin <full_path>/+year+mon+day+"00"
>> <full_path>/+year+mon+day+"06"
>> > > > <full_path>/+year+mon+day+"12" <full_path>/+year+mon+day+"18"
>> > > >
>> > > > but, a week or months worth of data could be a little much to
put
>> on a
>> > > > single command line.  How would you do that?
>> > > >
>> > > > I guess if I can figure out how to get stat_analysis to work
with my
>> > > > directory structure, then, I can take advantage of that tool.
>> > > >
>> > > > Roz
>> > > >
>> > > > On Fri, Oct 27, 2017 at 10:15 PM, John Halley Gotway via RT <
>> > > > met_help at ucar.edu> wrote:
>> > > >
>> > > > > Roz,
>> > > > >
>> > > > > Thanks for describing the logic you use.  I get a sense for
the
>> > general
>> > > > > flow of data through your system, but don't fully
understand the
>> > > > specifics.
>> > > > >
>> > > > > It does seems to me that you may find the STAT-Analysis
tool to be
>> > > > > extremely useful.  I has that ability to read MPR lines
from .stat
>> > > files,
>> > > > > aggregate them in very flexible ways, and derive a variety
of
>> output
>> > > > > statistics line type.
>> > > > >
>> > > > > As for supporting user-specified subsets of output columns
for the
>> > line
>> > > > > types, there would be a lot of details involved there.
There's a
>> lot
>> > > of
>> > > > > logic in MET for reading/writing the various line types.
If we
>> > talked
>> > > > some
>> > > > > more about it and brainstormed, I suspect we could find
some
>> > solutions
>> > > > that
>> > > > > wouldn't require that functionality.
>> > > > >
>> > > > > To answer your question, no, the min_total option does not
>> currently
>> > > > > exist.  I mentioned it as a potential enhancement for MET.
I'll
>> > > create a
>> > > > > development issue in JiRA for it.  It'd probably be wise to
>> define it
>> > > as
>> > > > a
>> > > > > threshold instead.  Perhaps someone would like both an
upper and
>> > lower
>> > > > > bound:  n_obs_thresh = '>100&&<1000';
>> > > > >
>> > > > > There's one really nice feature coming in met-6.1 in
>> STAT-Analysis to
>> > > > > filter lines based on the difference of columns.  For
example,
>> this
>> > > job:
>> > > > >    stat_analysis -lookin met_out -job filter -line_type MPR
>> -dump_row
>> > > > > big_errors.stat -fcst_var WIND -column_thresh abs(FCST-OBS)
ge5
>> > > > >
>> > > > > This job would look in a directory named "met_out" for
files
>> ending
>> > in
>> > > > > ".stat".  It'll read all the MPR lines where FCST_VAR =
WIND and
>> only
>> > > > keep
>> > > > > lines where the absolute value of FCST - OBS is greater
than or
>> equal
>> > > to
>> > > > > 5.  Any lines it finds are written to an output file named
>> > > > big_errors.stat.
>> > > > >
>> > > > > So this job will help you identify points where there are
large
>> > errors
>> > > in
>> > > > > your model.
>> > > > >
>> > > > > Thanks,
>> > > > > John
>> > > > >
>> > > > >
>> > > > > On Fri, Oct 27, 2017 at 6:34 AM, Rosalyn MacCracken - NOAA
>> Affiliate
>> > > via
>> > > > RT
>> > > > > <met_help at ucar.edu> wrote:
>> > > > >
>> > > > > >
>> > > > > > <URL:
https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=82515 >
>> > > > > >
>> > > > > > Hi John,
>> > > > > >
>> > > > > > I'm actually not using stat_analysis.  So, after I run
>> point_stat,
>> > I
>> > > > have
>> > > > > > all those hourly output files, out to the 96 forecast
hour.  We
>> > don't
>> > > > > have
>> > > > > > the disk space to keep all of them after maybe 6 months
(maybe
>> > > > > > shorter...can't remember right now...3 months maybe?),
so, what
>> I
>> > do
>> > > is
>> > > > > use
>> > > > > > python to strip out the variables that I want, and write
them
>> to a
>> > > csv
>> > > > > > file.  We figured out that if I make a csv file, we can
make a
>> > > dynamic
>> > > > > > table for our website.
>> > > > > >
>> > > > > > So, my time series are made from reading in the hourly
csv
>> files,
>> > and
>> > > > > > writing them to a file that I can use to create a time
series
>> plot.
>> > > > Oh,
>> > > > > > and all of those tables I showed yesterday in that one
plot
>> (with
>> > the
>> > > > > > contingency table, the continuous stats table, and the
one with
>> the
>> > > > > > thresholds), are created from the hourly csv files.  I
also
>> create
>> > a
>> > > > > daily
>> > > > > > average, and will create a weekly average.  That's again,
from
>> > > reading
>> > > > in
>> > > > > > the csv files, and creating the averages.
>> > > > > >
>> > > > > > You have to be creative when you don't have a ton of disk
space
>> > has,
>> > > > like
>> > > > > > WCOSS has.  What would be nice (maybe for later releases
of
>> MET),
>> > is
>> > > > for
>> > > > > > the user to be able what stats they want outputed, just
in case
>> > they
>> > > > > don't
>> > > > > > use everything in the ctc file, or everything in the cts
file,
>> etc.
>> > > > > >
>> > > > > > So, that min_total option...does that exist now?  That
would be
>> > > useful.
>> > > > > > But, here's the thing.  I agree with what you said about
that
>> MET
>> > > just
>> > > > > > takes the output and doesn't care how many points are
there.
>> So,
>> > if
>> > > > > those
>> > > > > > matched points are close in forecasted/observation value,
the
>> stats
>> > > > will
>> > > > > be
>> > > > > > good.  I think what happened in this case, was that the
wind
>> speed
>> > > > values
>> > > > > > weren't close, ~5 knots difference, and the stats
reflected
>> that.
>> > > I'm
>> > > > > > actually glad that that case surfaced, since that's the
errors
>> that
>> > > > we're
>> > > > > > looking for.
>> > > > > >
>> > > > > > Roz
>> > > > > >
>> > > > > >



--
Rosalyn MacCracken
Support Scientist

Ocean Applications Branch
NOAA/NWS Ocean Prediction Center
NCWCP
5830 University Research Ct
College Park, MD  20740-3818

(p) 301-683-1551
rosalyn.maccracken at noaa.gov

------------------------------------------------


More information about the Met_help mailing list