[Met_help] [rt.rap.ucar.edu #83451] History for question about Stat-Analysis

John Halley Gotway via RT met_help at ucar.edu
Tue Jul 9 12:03:58 MDT 2019


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

I'm trying to perform an analysis of MET Grid-Stat output which will produce a single value for the FSS score for a given 24 hour WRF run. I want to generate the same FSS score for other WRF runs in order to compare one run against another to see which configuration has the best FSS score. The attached files are a Stat-Analysis run script and the configuration file I used which produced a single value of FSS over all the threshold values and neighborhood sizes which were used in generating the Grid-Stat output. Is the way I used Stat-Analysis to generate the FSS value statistically sound or is there a better way to generate it? 

Thanks.

R/
John Raby 


----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: question about Stat-Analysis
From: John Halley Gotway
Time: Tue Jan 02 14:37:03 2018

John,

Hope the new year is treating you well.  Thanks for sending your
STAT-Analysis config file.  I see that you've used Grid-Stat to compute
FSS for many thresholds (5 to 40, every 5) and neighborhood sizes.
You'd like to come up with a single number that summarizes these FSS
results across all thresholds and neighborhood sizes.  And I'm guessing
you'd look at the mean of all those FSS values.  And finally, you'd
compare the mean FSS value for one WRF configuration to the mean FSS
values for another WRF configuration.

First, one quick point to make.  You're running 2 jobs... a "filter"
job to dump the .stat lines that were used to an output file... and a
"summary" job where you're summarizing the FSS values.  This isn't
necessary.  You're already using the "-dump_row" option for the
"summary" job, and the contents of those two dump_row files should be
identical.  Unless I'm missing something, I'd suggest removing the
"filter" job.
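
For reference, here is a minimal sketch of running just the single
summary job.  It assumes your Grid-Stat .stat files live under
out/grid_stat (a hypothetical path) and that FSS is read from the
NBRCNT line type; adjust the filtering options to match what is in your
config file.

    import subprocess

    # Hypothetical paths -- point these at your own Grid-Stat output
    # and the dump_row / output files you want to create.
    cmd = [
        "stat_analysis",
        "-lookin", "out/grid_stat",            # directory of Grid-Stat .stat files
        "-job", "summary",                     # single summary job, no separate filter job
        "-line_type", "NBRCNT",                # FSS is reported in the NBRCNT line type
        "-column", "FSS",
        "-dump_row", "fss_summary_dump.stat",  # the .stat lines actually used by the job
        "-out", "fss_summary.out",             # summary output written here
    ]
    subprocess.run(cmd, check=True)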

As for the summary job, I see several shortcomings in this approach.
By computing a single mean FSS value, you're losing the spatial
information that FSS was designed to capture in the first place.  It
may be the case that one configuration of WRF does very well for lower
precip accumulations and poorly for intense precip... or vice-versa.
And this information would also be lost in the averaging.  Ultimately,
the analysis you do should be based on the questions you're trying to
answer.  If all of the thresholds (5 to 40) and all of the spatial
scales are of equal importance to you, then perhaps the mean of them
all is sufficient for determining which configuration is "better".  But
you won't know *how* it is better without looking at the thresholds
separately.

Here are 3 ideas about all this:

(1) For FSS, we usually make a plot with the spatial scale on the
X-axis and the FSS on the Y-axis.  Then we plot multiple lines, one for
each threshold of interest.  Or multiple lines for multiple models.
Then you can see how model performance varies by threshold and spatial
scale.
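
Here is a rough sketch of that kind of plot in Python, with purely
illustrative FSS values standing in for the numbers you would pull out
of the NBRCNT lines or dump_row files:

    import matplotlib.pyplot as plt

    # Illustrative placeholder values -- replace with the FSS and
    # neighborhood sizes read from your own Grid-Stat output.
    nbr_width = [1, 3, 5, 9, 15]
    fss_by_thresh = {
        ">=5.0":  [0.42, 0.55, 0.63, 0.71, 0.78],
        ">=25.0": [0.18, 0.27, 0.34, 0.45, 0.52],
    }

    for thresh, fss in fss_by_thresh.items():
        plt.plot(nbr_width, fss, marker="o", label=thresh)

    plt.xlabel("Neighborhood width (grid squares)")  # spatial scale on the X-axis
    plt.ylabel("FSS")                                # FSS on the Y-axis
    plt.legend(title="Threshold")
    plt.savefig("fss_by_scale.png")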

(2) The Air Force does something similar to what you're describing with
their GO Index, which is supported in STAT-Analysis.  The GO Index
value is computed once for each model initialization.  It is defined by
selecting a number of model variables, levels, lead times, statistics,
and weights.  The combinations they use can be found in this file:
met-6.1/share/met/config/STATAnalysisConfig_GO_Index

The GO Index is used to compare 2 models, model A versus model B.  If
it is > 1, then model A is better.  If it is < 1, then model B is
better.  For each of the terms (the GO Index uses 48), STAT-Analysis
gets the RMSE value for models A and B and computes a skill score as
1 - RMSE(A)/RMSE(B).  The final GO Index is just a weighted average of
the skill scores for the 48 terms which describe it.  The Air Force
chose that particular combination of variables/levels/lead
times/statistics/weights because they are important to them
operationally.

If you're trying to aggregate model performance into a single number,
you could do something similar.  Just define the terms that you'd like
to go into your large weighted average.
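
As a sketch of what that bookkeeping might look like for FSS (the
terms, weights, and values below are illustrative placeholders, and the
"positive favors model A" convention here is mine, not the Air Force
definition):

    # Sketch of a GO-Index-style weighted aggregate, using FSS terms
    # instead of RMSE.  All numbers here are illustrative placeholders.
    terms = [
        # (threshold, neighborhood width, weight, FSS model A, FSS model B)
        (">=5.0",  3, 2.0, 0.63, 0.58),
        (">=5.0",  9, 2.0, 0.74, 0.70),
        (">=25.0", 3, 1.0, 0.31, 0.35),
        (">=25.0", 9, 1.0, 0.44, 0.41),
    ]

    # Per-term skill taken here as FSS(A) - FSS(B); positive favors model A.
    num = sum(w * (fss_a - fss_b) for _, _, w, fss_a, fss_b in terms)
    den = sum(w for _, _, w, _, _ in terms)
    print(f"Weighted FSS index: {num / den:+.3f} (positive favors model A)")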

(3) Rather than comparing the means for models A and B, you could first
compute the pairwise difference:  FSS model A - FSS model B for each
date, threshold, and neighborhood size.  Then test to see if the
distribution of differences includes 0.  We use METViewer to check for
statistically significant differences.  That's how we typically decide
if one model is better than another.
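
For example, here is a minimal sketch of that pairwise check, assuming
you have matched FSS values for the two models over the same dates,
thresholds, and neighborhood sizes (the values are placeholders, and
METViewer does this more carefully):

    import numpy as np

    # Matched FSS pairs: same date / threshold / neighborhood size for
    # both models.  Placeholder values -- substitute the FSS columns
    # pulled from your dump_row files.
    fss_a = np.array([0.62, 0.71, 0.33, 0.45, 0.58, 0.49])
    fss_b = np.array([0.58, 0.70, 0.36, 0.41, 0.55, 0.50])
    diff = fss_a - fss_b

    # Simple percentile bootstrap CI on the mean difference; if the
    # interval excludes 0, the difference is unlikely to be just noise.
    rng = np.random.default_rng(0)
    boot = [rng.choice(diff, size=diff.size, replace=True).mean()
            for _ in range(10000)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"Mean FSS(A)-FSS(B) = {diff.mean():+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")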

Hope that helps!  If you have more questions, Tressa Fowler might have
some additional thoughts on this.

Thanks,
John


On Tue, Jan 2, 2018 at 1:30 PM, Raby, John W USA CIV via RT <
met_help at ucar.edu> wrote:

>
> Tue Jan 02 13:30:37 2018: Request 83451 was acted upon.
> Transaction: Ticket created by john.w.raby2.civ at mail.mil
>        Queue: met_help
>      Subject: question about Stat-Analysis
>        Owner: Nobody
>   Requestors: john.w.raby2.civ at mail.mil
>       Status: new
>  Ticket <URL:
https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=83451 >
>
>
> I'm trying to perform an analysis of MET Grid-Stat output which will
> produce a single value for the FSS score for a given 24 hour WRF run.
> I want to generate the same FSS score for other WRF runs in order to
> compare one run against another to see which configuration has the
> best FSS score.  The attached files are a Stat-Analysis run script and
> the configuration file I used which produced a single value of FSS
> over all the threshold values and neighborhood sizes which were used
> in generating the Grid-Stat output.  Is the way I used Stat-Analysis
> to generate the FSS value statistically sound or is there a better way
> to generate it?
>
> Thanks.
>
> R/
> John Raby
>
>

------------------------------------------------
Subject: RE: [Non-DoD Source] Re: [rt.rap.ucar.edu #83451] question about Stat-Analysis
From: Raby, John W USA CIV
Time: Wed Jan 03 08:53:56 2018

John -

My new year is going very well so far, thanks. Glad not to have to
deal with the cold weather affecting so much of the country north and
east of here. I hope yours is going well too.

Thanks for the explanation, which offers alternative ways to consider
for accomplishing the verification.  I'm starting a discussion with the
modelers to see what we want to do.  Thanks for looking over my config
file and noting the duplication.  I did notice the two files which
contained the same info.  The reason I did the filter job was that I
thought the results of the filter were passed to subsequent jobs
through the use of a temporary file.  This was my interpretation of the
following paragraph, which is from the MET V5.2 tutorial at:
https://dtcenter.org/met/users/support/online_tutorial/METv5.2/tutorial.php?name=stat_analysis&category=configure

"The Stat-Analysis configuration file has two main sections. The items
in the first section are used to filter the STAT data being processed.
Only those lines which meet the filtering requirements specified are
retained and passed down to the second section. The second section
defines the analysis jobs to be performed on the filtered data. When
defining analysis jobs, additional filtering parameters may be defined
to further refine the STAT data with which to perform that particular
job."

R/
John

------------------------------------------------
Subject: question about Stat-Analysis
From: Tressa Fowler
Time: Wed Jan 03 10:10:20 2018

Hi John,

I think that the plot showing the different FSS values by threshold and
neighborhood size is a good idea.

I would discourage you from averaging the FSS, especially for the
largest neighborhood.  When the whole domain is included, the FSS is
really just a measure of the frequency bias.  When bias is included
with error, such as with RMSE, it often dominates the calculation.
Results could thus be misleading.
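
A quick sketch of why that is, using the standard FSS formula: at the
whole-domain scale every point sees the same forecast and observed
fractions (the two base rates), so the score collapses to a function of
their ratio alone.

    # At the whole-domain neighborhood the forecast fraction is just the
    # forecast base rate f and the observed fraction is the observed base
    # rate o, so FSS = 1 - (f - o)^2 / (f^2 + o^2) depends only on f/o,
    # i.e. on the frequency bias.
    def whole_domain_fss(f, o):
        return 1.0 - (f - o) ** 2 / (f ** 2 + o ** 2)

    print(whole_domain_fss(0.10, 0.10))  # unbiased -> 1.0
    print(whole_domain_fss(0.20, 0.10))  # bias of 2 -> 0.8
    print(whole_domain_fss(0.02, 0.01))  # same bias of 2, rarer event -> still 0.8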

Thanks,

Tressa


------------------------------------------------

