[Met_help] stat_analysis question on Pearson CC

John Halley Gotway johnhg at rap.ucar.edu
Thu Apr 30 15:07:39 MDT 2009


Great.  Thanks for letting me know.

You're welcome to use MET however you like, but I think you've experienced
the downside of dumping the matched pair lines.  Not only does it take up
a lot of space, but it slows Stat-Analysis to a crawl!

If you'd like, you can easily disable the MPR output by modifying the
"output_flag" in the Point-Stat config file.


> John,
> I created a temporary directory with only 1 file in it that excluded all
> MPR lines from the 80 forecasts I'm summarizing.  This whittled down the
> total *.stat data from 5.5GB to a single file at about 100MB.
> Stat_analysis is now running considerably faster.
> Thanks for the help,
> Jon
>> -----Original Message-----
>> From: John Halley Gotway [mailto:johnhg at rap.ucar.edu]
>> Sent: Thursday, April 30, 2009 11:14 AM
>> To: Case, Jonathan (MSFC-VP61)[Other]
>> Subject: Re: stat_analysis question on Pearson CC
>> Jonathan,
>> By the arguments in the job you sent, it looks like you're using
>> METv1.1.  Is that correct, or are you using METv2.0 now?
>> In METv2.0, the job would be something like:
>> -job aggregate_stat -line_type SL1L2 -out_line_type CNT -fcst_lead 03
>> -fcst_var TMP -obtype ADPSFC
>> I really don't think that the Pearson CC is slowing things down.  It's
>> a pretty quick direct computation.  It's the Spearman's and Kendall's
>> Tau that are slow, but when aggregating SL1L2 data and
>> computing CNT info, we can't compute Spearman's or Kendall's Tau
>> anyway.  There's no way of turning off Pearson since it's a quick
>> computation.
>> I'm very surprised at how slow Stat-Analysis is running for you!  My
>> guess would be that it's spending a lot of time searching directory
>> trees for STAT files, doing I/O, and parsing the STAT data.  Do
>> you have any sense of how many STAT lines it's processing?  Are you by
>> any chance dumping out the matched pair lines to your Point-Stat
>> output?  If you are, that would explode the size of the STAT
>> files and likely make Stat-Analysis run much slower since it has to
>> parse each of those lines.
>> If you're dumping MPR lines, you could try a test of excluding those
>> MPR lines from the STAT files by first running "egrep -vh MPR *.stat >
>> tmp.stat" (the -h keeps egrep from prefixing each line with its file
>> name).  Try running Stat-Analysis on the data with and
>> without the MPR lines to see if that makes a difference.
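
The same MPR-stripping test can be sketched in Python for cases where a plain pattern match is too blunt. This is only a sketch under assumptions: it assumes each STAT file is whitespace-delimited and begins with a header row containing a LINE_TYPE column, and the function names strip_line_type and merge_stat_files are hypothetical helpers, not part of MET.

```python
def strip_line_type(lines, line_type="MPR"):
    """Yield STAT lines whose LINE_TYPE differs from line_type.

    Assumes a header row naming a LINE_TYPE column (an assumption about
    the STAT layout). Repeated header rows from later files are dropped.
    """
    col = None  # index of the LINE_TYPE column, learned from the header
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if "LINE_TYPE" in fields:
            if col is None:
                col = fields.index("LINE_TYPE")
                yield line  # keep only the first header row
            continue
        if col is None:
            # No header seen yet: fall back to matching any whole token.
            is_match = line_type in fields
        else:
            is_match = len(fields) > col and fields[col] == line_type
        if not is_match:
            yield line


def merge_stat_files(paths, out_path, line_type="MPR"):
    """Concatenate the given STAT files into out_path, minus line_type rows."""
    def all_lines():
        for path in paths:
            with open(path) as handle:
                yield from handle

    with open(out_path, "w") as out:
        out.writelines(strip_line_type(all_lines(), line_type))
```

Write the merged file to a separate directory (as Jon did) so that it is never picked up as one of its own inputs, then point Stat-Analysis at that directory.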
>> Also, if you're parsing a large amount of data, I'd suggest setting up
>> jobs for Stat-Analysis using the config file instead of doing it on the
>> command line.  In the config file, use the filtering
>> options in the beginning to filter down to the subset of data over
>> which you'll perform your jobs.  That reduces the amount of data that
>> Stat-Analysis has to parse for each job, and would therefore speed
>> it up.
>> For example, suppose you have 1,000 input STAT lines, and you want to
>> run SL1L2->CNT jobs for several variables.  Set "line_type[] =
>> ["SL1L2"];" to keep only the SL1L2 lines, of which there might be
>> 100.  Then define several SL1L2->CNT jobs for your
>> variables (perhaps PRES, TMP, RH).  For each job, Stat-Analysis will
>> only have to parse those 100 SL1L2 STAT lines, rather than all 1,000
>> of them.  On the command line, by contrast, you can only run 1 job at
>> a time, so Stat-Analysis has to parse all 1,000 lines for each job.
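
The arithmetic in that example can be checked with a small Python sketch. It is a toy model, not MET code: rows are (line_type, fcst_var) tuples, the function rows_examined is hypothetical, and it assumes the config-file filter costs one pass over all input lines before the jobs run.

```python
def rows_examined(rows, variables, prefilter, line_type="SL1L2"):
    """Count how many rows are looked at across one job per variable.

    rows: list of (line_type, fcst_var) tuples (a toy stand-in for STAT lines).
    prefilter: True models filtering once up front, False models running
    each job against the full input.
    """
    examined = 0
    pool = rows
    if prefilter:
        # One pass over everything to keep only the wanted line type.
        examined += len(rows)
        pool = [row for row in rows if row[0] == line_type]
    for var in variables:
        # Each job scans whatever pool it was handed.
        examined += len(pool)
        _selected = [row for row in pool if row == (line_type, var)]
    return examined


# 100 SL1L2 rows plus 900 rows of another line type, as in the example.
rows = [("SL1L2", "TMP")] * 100 + [("MPR", "TMP")] * 900
jobs = ["PRES", "TMP", "RH"]
print(rows_examined(rows, jobs, prefilter=False))  # 3 jobs x 1000 rows
print(rows_examined(rows, jobs, prefilter=True))   # 1000 + 3 jobs x 100 rows
```

Under these assumptions the saving grows with the number of jobs, since the one-time filtering pass is shared by all of them.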
>> Hope that helps.
>> Thanks,
>> John
>> Case, Jonathan (MSFC-VP61)[Other] wrote:
>> > John,
>> >
>> > I'm trying to run stat_analysis to summarize point_stat output over
>> 80 forecasts (x 2 experiments), 27 forecast hours, and several
>> variables.   There are approximately 400 surface stations fcst/obs
>> pairs at each time.  An example  job I'm running is:
>> >
>> >
>> > Stat_analysis is running very slowly, taking ~25 minutes just to
>> process a single variable at a single forecast hour.   I suspect that
>> the calculation of the Pearson correlation coefficient is causing the
>> slow performance.   However, I can't seem to turn off the calculation
>> of the Pearson CC.   I tried passing "-rank_corr_flag 0" to the command
>> line in my script, and I also tried passing the STAT_Analysis config
>> file by specifying "-config ./STATAnalysisConfig_pointstat" after the
>> job name.  But, the Pearson CC output continues to show up in the
>> output files.  So, what do I need to do to turn off the CC calculations
>> in order to speed up stat_analysis performance?
>> >
>> > Thanks for your time,
>> > Jonathan
>> >
>> > ***********************************************************
>> > Jonathan Case, ENSCO, Inc.
>> > Aerospace Sciences & Engineering Division
>> > Short-term Prediction Research and Transition Center
>> > 320 Sparkman Drive, Room 3062
>> > Huntsville, AL 35805-1912
>> > Voice: (256) 961-7504   Fax: (256) 961-7788
>> > Emails: Jonathan.Case-1 at nasa.gov
>> >              case.jonathan at ensco.com
>> > ***********************************************************
>> >
>> >
