[Met_help] [rt.rap.ucar.edu #64284] History for Optimization

John Halley Gotway via RT met_help at ucar.edu
Thu Nov 14 11:11:26 MST 2013


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Hello, I have an optimization question, and perhaps there is no better solution than the one that I am currently using, but I thought I would ask....

I am doing daily evaluations for 7 parameters at a time resolution of 3 h out to 24 h.  I have also broken my domain into 40 subdomains.  Within my point-stat config file, I have defined these subregions as follows:


   sid = ["1.txt","2.txt","3.txt...... "40.txt"]

where each text file has a space-separated list of the stations at which matched pairs are to be found.  Each region has, on average, around 35 stations.
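
For example, one of these files just contains a space-separated list of station IDs like this (the IDs shown here are only placeholders):

   KDEN KBOU KCOS KPUB KALS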

So far, everything works correctly.  There are approximately 1400 observations in my domain, so 7 params * 9 time points * 1400 stations gives ~88,000 stat lines.  With this many STAT lines, every STAT-Analysis aggregation job takes approximately 2 seconds.

I want to aggregate statistics for each time point, parameter, and region separately.  That means 7 params * 9 time points * 40 regions = 2520 aggregation jobs.  At 2 seconds per job, that comes to about an hour and a half, and that is only for a single day (24 hours).  I haven't even mentioned yet that I would like to do this for several different models.  Given the current computational time, I wanted to look into optimization before developing my script any further.  I have looked into forking my script to split up the aggregation jobs, but I first wanted to ask whether there is anything I could do within MET to speed up the calculation, or at least allow the STAT jobs to run a bit faster.
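
For reference, my current approach is essentially a triple loop over parameter, lead time, and region, roughly like the sketch below (the variable names, lead times, region names, and paths are just placeholders for what my script actually uses):

   #!/bin/sh
   # One aggregation job per parameter x lead time x region:
   # 7 * 9 * 40 = 2520 stat_analysis calls in total.
   STAT_DIR=/path/to/point_stat/output                # placeholder path

   for VAR in TMP DPT RH UGRD VGRD PRES APCP_03; do   # 7 parameters (placeholders)
     for LEAD in 00 03 06 09 12 15 18 21 24; do       # 9 time points, every 3 h
       for REG in $(seq 1 40); do                     # 40 regions; must match VX_MASK
         stat_analysis -lookin $STAT_DIR \
            -job aggregate_stat -line_type MPR -out_line_type CNT \
            -fcst_var $VAR -fcst_lead ${LEAD}0000 -vx_mask $REG
       done
     done
   done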

Thank you,

Andrew

----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: Re: [rt.rap.ucar.edu #64284] Optimization
From: John Halley Gotway
Time: Wed Nov 13 10:44:00 2013

Andrew,

Good question and thanks for providing the details about your
processing logic.  That sure is a lot of subdomains!  Here are some
considerations that may impact your logic...

First, with "7 params * 9 time points * 1400 stations, which means
~88000 stat lines" it sounds like you're using Point-Stat to simply
dump out matched pair (MPR) values.  Is that true?  You could
easily also turn on the output of statistics for each region (like
continuous statistics CNT and partial sums SL1L2).  Would that reduce
the number of STAT-Analysis jobs you need to run?  I guess I
don't really understand what types of STAT-Analysis jobs you're
running.  You say you want to "aggregate statistics for each time
point, parameter, and region separately", but isn't that exactly what
Point-Stat can do each time you run it?
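
For example, in the Point-Stat config file the output line types are
controlled by the "output_flag" dictionary.  Something along these
lines would write CNT and SL1L2 lines for every masking region in
addition to the matched pairs (this is only a sketch; check the
default config file that ships with your MET version for the exact
entry names):

   output_flag = {
      // ... leave the other entries as they are ...
      cnt   = STAT;   // continuous statistics, one line per region
      sl1l2 = STAT;   // scalar partial sums, one line per region
      mpr   = BOTH;   // keep writing the matched pairs as well
   }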

Second, if you're running METv4.1, you could consider using the newly
added "-by case" option for STAT-Analysis.  This option allows you to
define "case" information and then run the same job once for
each unique set of case information.  For example, since you have 40
different masking regions, you could try running this job:
   -job aggregate_stat -line_type MPR -out_line_type CNT -fcst_var TMP -fcst_lev Z2 -by VX_MASK

This job will compute continuous statistics for 2-meter temperature
from the input matched pair lines.  But it will run the same job once
for each unique entry it finds in the VX_MASK column.  In your
case you should get 40 output lines rather than 1.  You can also add
more case information like this: -by VX_MASK -by FCST_VAR -by
FCST_LEV.  Then it'll run the same job for each unique combination of
those things.
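
A complete command might look something like the following, where the
-lookin path and the -out file name are just placeholders:

   stat_analysis \
      -lookin /path/to/point_stat/20131113 \
      -out tmp_cnt_by_case.txt \
      -job aggregate_stat -line_type MPR -out_line_type CNT \
      -fcst_var TMP -fcst_lev Z2 \
      -by VX_MASK -by FCST_LEAD

Since FCST_LEAD is also a header column, grouping by it as well
should let one job cover every lead time and region in a single pass.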

Third, the "slow" parts of STAT-Analysis are likely I/O, computing
bootstrap confidence intervals, and computing rank correlation
statistics.  If your jobs operate on MPR lines, the latter two may be
slowing you down.  You could disable bootstrapping and rank
correlation stats using the "-n_boot_rep 0" and "-rank_corr_flag
FALSE" job command options.  As for the I/O, a large portion of
STAT-Analysis is just reading .stat files and filtering them down to
the subset of data relevant to the current job.  But parsing each
input stat line and checking all the job filtering logic is much
slower than the unix "grep" command.  In your script, you could try
reducing the amount of data you pass to STAT-Analysis by smart uses of
grep.  For example, if you want to run a bunch of jobs for 1
of the 40 masking regions, you could "grep" out that relevant subset
of data to temporary file, and then pass that temp file to STAT-
Analysis.
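
Putting those last two ideas together, a job for a single region
might look roughly like this; the region name and paths below are
placeholders, and the grep pattern should be whatever string actually
appears in the VX_MASK column of your .stat files:

   # -h suppresses the file name prefix so the .stat format is kept
   grep -h "REGION_01" /path/to/point_stat/20131113/*.stat > /tmp/region_01.stat

   stat_analysis \
      -lookin /tmp/region_01.stat \
      -job aggregate_stat -line_type MPR -out_line_type CNT \
      -fcst_var TMP -fcst_lev Z2 \
      -n_boot_rep 0 -rank_corr_flag FALSE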

Hopefully, these give you some things to play around with when trying
to optimize the speed.

Thanks,
John Halley Gotway
met_help at ucar.edu


------------------------------------------------
Subject: Optimization
From: Andrew J.
Time: Thu Nov 14 10:28:19 2013

Hi,

Yes, I suppose I should have mentioned that I need to keep the matched
pair files.  I am making a spatial representation of errors by
station.  For this, I have written a Perl script that searches the
matched pair files, calculates the errors, and plots them by matching
latitudes and longitudes.  Perhaps there is an easier way to do this,
but in any case, I prefer to keep the matched pair files because there
is often a high demand for me to check the results at individual
stations.

But I have employed both the "-by" case option and also turned off
bootstrapping and rank correlation statistics.  Wow!  Yesterday it
took me 4.5 hours to evaluate 3 models; today, 5 minutes!  Such a big
difference.  Thank you so much for your suggestions.  I only wish I
had tried this a bit earlier, but I am definitely happy that I asked
so that I can save time from now on!

Thank you again!

- Andrew

------------------------------------------------


More information about the Met_help mailing list