[Met_help] [rt.rap.ucar.edu #65100] History for Re: Optimization

John Halley Gotway via RT met_help at ucar.edu
Tue Feb 18 12:03:37 MST 2014


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------



Hello met_help,

About a month ago, I sent you the question below about optimizing stat_aggregate, and you gave several solutions which really sped up my processing.  I was wondering if you also had a few pointers to speed up point_stat under the same conditions I mentioned in the previous question.


Note:  I am currently outputting matched pairs for plotting and quality-control purposes, and I would like to keep outputting matched pairs if it is not too costly.
To be honest, considering the amount of data and the number of regions that I am using, point_stat is already running pretty quickly.  I just wanted to check whether there were any quick optimization tricks that would allow it to run even more quickly.

Thank you,

Andrew




---------------------------
Original Question
---------------------------


Andrew J. <andrewwx at yahoo.com> wrote on Wednesday, 13 November 2013, at 16:42:
 
Hello, I have an optimization question, and perhaps there is no better solution than the one that I am currently using, but I thought I would ask....

I am doing daily evaluations for 7 parameters at a 3-hour time resolution out to 24 hours.  I have also broken my domain into 40 subdomains.  Within my Point-Stat config file, I have defined these subregions as follows:


   sid = ["1.txt","2.txt","3.txt...... "40.txt"]

where each text file has a space-separated list of stations at which matched pairs are to be found.  Each region has, on average, around 35 stations.
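
For example, one of those region files might look something like this (the station IDs here are just placeholders):

   1.txt:   KAAA KBBB KCCC KDDD

i.e., one file per subregion, listing the station IDs that define it.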

So far, everything works correctly.  There are approximately 1400 observations in my domain, which means 7 params * 9 time points * 1400 stations, or roughly 88,000 stat lines.  With this many STAT lines, every stat_aggregate job takes approximately 2 seconds.

I want to aggregate statistics for each time point, parameter, and region separately.  That means 7 params * 9 time points * 40 regions = 2520 aggregate jobs.  Multiplying each job by 2 seconds, you can see that this would take about an hour and a half.  Remember, this is only for one single day (24 hours).  And I haven't even mentioned yet that I would like to do this for several different models.  Considering the current computational time, I wanted to look into optimization before I developed my script any further.  I have looked into forking my script to split up the stat_agg jobs, but I first wanted to ask if there is anything I could do within MET to facilitate a speedier calculation, or at least allow the STAT jobs to run a bit faster.
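
For illustration, the kind of forking I have in mind is a minimal sketch along these lines, assuming the jobs are run through MET's stat_analysis tool; the variable names, lead times, directory, and job options below are placeholders for my actual setup:

   import subprocess
   from itertools import product
   from multiprocessing import Pool

   PARAMS  = ["TMP", "DPT", "WIND"]             # 7 parameters in practice
   LEADS   = ["030000", "060000", "090000"]     # 9 lead times (HHMMSS) in practice
   REGIONS = ["%d.txt" % i for i in range(1, 41)]   # the 40 subregion masks

   def run_job(args):
       """Run one aggregation job and return its exit status."""
       param, lead, region = args
       cmd = ["stat_analysis",
              "-lookin", "point_stat_output",   # directory of .stat files
              "-job", "aggregate_stat",
              "-line_type", "MPR",
              "-out_line_type", "CNT",
              "-fcst_var", param,
              "-fcst_lead", lead,
              "-vx_mask", region]
       return subprocess.call(cmd)

   if __name__ == "__main__":
       jobs = list(product(PARAMS, LEADS, REGIONS))   # 7 * 9 * 40 combinations
       with Pool(processes=8) as pool:                # adjust to available cores
           pool.map(run_job, jobs)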


Thank you,

Andrew

----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: Re: [rt.rap.ucar.edu #65100] Re: Optimization
From: John Halley Gotway
Time: Mon Feb 03 10:25:58 2014

Andrew,

I'm sorry, I believe I missed this question from last week.  There are
a few things you can look at to speed up Point-Stat.

(1) Turn off bootstrapping.  In the "boot" section of the
PointStatConfig file, set "n_rep" to 0.  If you're interested in the
individual matched pairs, I'm guessing you won't be as interested in
bootstrap confidence intervals from any given run of Point-Stat.
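
For example, in the PointStatConfig file that looks something like the
following (the other boot entries can stay at their defaults; the exact
entries may vary a bit by MET version):

   boot = {
      interval = PCTILE;
      rep_prop = 1.0;
      n_rep    = 0;        // 0 disables bootstrap confidence intervals
      rng      = "mt19937";
      seed     = "";
   }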

(2) Turn off rank correlation statistics.  Near the end of the
PointStatConfig file, set "rank_corr_flag" to FALSE.  That will
disable the computation of Kendall's Tau and Spearman's rank
correlation
coefficients which are computed over the ranks of the matched pairs
rather than their actual values.  If you have a lot of matched pairs,
that ranking process can be slow.
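
That is a one-line change in the config file:

   rank_corr_flag = FALSE;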

(3) Another option is to reduce the number of point observations
you're passing to Point-Stat.  The fewer observations it has to
process, the faster it'll run.  For example, suppose you're verifying
with a global PREPBUFR dataset but your model domain is only over
Brazil.  If you're running PB2NC, you could retain only the
observations over Brazil and therefore pass fewer observations to
Point-Stat.
Or if you're only verifying upper-air variables, you could throw out
all the surface observations in PB2NC.  Just give your verification
process some thought to see if there's an opportunity to
discard observations you never plan to use.
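
As a rough sketch, that filtering could be done in the PB2NCConfig file
along these lines (the polyline file name here is just a placeholder,
and entry names may differ slightly between MET versions):

   message_type = [ "ADPUPA" ];        // e.g., keep only upper-air reports
   mask = {
      grid = "";
      poly = "south_america.poly";     // lat/lon polyline bounding the model domain
   }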

Lastly, we are aware of a slowdown in the Point-Stat tool from
METv4.0 to METv4.1.  We have a development task defined to investigate
it and try to speed it back up.  But we haven't made any progress
on that yet.

Hope that helps.

Thanks,
John Halley Gotway
met_help at ucar.edu


On 01/24/2014 10:39 AM, Andrew J. via RT wrote:
>
> Fri Jan 24 10:39:04 2014: Request 65100 was acted upon.
> Transaction: Ticket created by andrewwx at yahoo.com
>         Queue: met_help
>       Subject: Re: Optimization
>         Owner: Nobody
>    Requestors: andrewwx at yahoo.com
>        Status: new
>   Ticket <URL:
https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=65100 >

------------------------------------------------


More information about the Met_help mailing list