[Met_help] [rt.rap.ucar.edu #41321] History for Error during runs of point_stat

RAL HelpDesk {for John Halley Gotway} met_help at ucar.edu
Thu Nov 11 14:28:53 MST 2010


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Hello again John,

I've come across the following error that shows up when doing multiple point_stat runs simultaneously:

humboldt[95]$ cat  STDIN.e187708
GSL_RNG_TYPE=mt19937
GSL_RNG_SEED=1635888204


ERROR: compute_cnt_stats_ci_perc() -> can't delete the temporary file:
tmp/tmp_28973_cnt_r.txt

These problems only affect a small percentage of dozens of simultaneous runs. A snapshot of my temp directory typically shows only 1-3 tmp files, some empty, some not.

I'm just wondering if you have any idea what the problem might be. Could it be that more than one run uses the same tmp file or could it be a system problem on my end? The end result is empty point_stat output files.

Thanks.

John 

----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: John Halley Gotway
Time: Wed Oct 06 11:11:19 2010

John,

I went ahead and made a ticket for this in our met_help system.

This is interesting.  It's the first time we've heard of this issue.

When Point-Stat runs and computes bootstrap confidence intervals, it
writes temporary files that include the process id (PID) in their
names.  If you have multiple instances of Point-Stat running
concurrently, they should be writing temp files with different names
since their PIDs should be different.  One possibility is that two
instances of Point-Stat are using the same temp file name which
could cause the error you're seeing.  When Point-Stat can't remove the
temp file it created, it prints an error message and exits - that's
why you're getting zero output.
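
As a rough illustration of the naming scheme described above, here is a
minimal sketch. The helper names are hypothetical and this is not MET's
actual code; it only mirrors the pattern visible in the error message
("tmp/tmp_28973_cnt_r.txt"). One assumption worth flagging: a PID is only
unique among live processes on a single host, so if jobs run on different
cluster nodes but share one temp directory, two of them could in principle
receive the same PID. The thread does not establish that this is what
happened here.

```cpp
#include <string>

// Hypothetical helper: builds a temp file name of the form seen in the
// error message, e.g. "tmp/tmp_28973_cnt_r.txt". Not MET's actual code.
std::string make_pid_temp_name(const std::string &dir, long pid,
                               const std::string &suffix) {
    return dir + "/tmp_" + std::to_string(pid) + "_" + suffix + ".txt";
}

// Two jobs collide when they produce the same name. With a PID-only
// scheme that happens exactly when the PIDs match, which is possible for
// processes on different hosts writing into a shared directory.
bool names_collide(const std::string &dir, long pid_a, long pid_b,
                   const std::string &suffix) {
    return make_pid_temp_name(dir, pid_a, suffix) ==
           make_pid_temp_name(dir, pid_b, suffix);
}
```

Calling make_pid_temp_name("tmp", 28973, "cnt_r") reproduces the file name
from the error report.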

I talked this over with another software developer and here are our
ideas:
   - In the short run, if you turn off bootstrapping in Point-Stat
(set "n_boot_rep = 0;" in the config file), you should no longer
encounter this error.
   - In the long run, we can work on adding more code to make the temp
file names *more* unique, and we can print out more diagnostic
information about why it was unable to remove the temp file.
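
The second long-run idea, reporting why the temp file could not be
removed, might look something like the sketch below. The function name is
hypothetical and this is not the code in MET. It assumes a POSIX system,
where std::remove sets errno on failure (the C++ standard itself does not
promise that).

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper, not MET's code: try to delete a temp file and, on
// failure, return a message that includes the system's stated reason.
// Returns an empty string on success.
std::string remove_temp_file(const std::string &path) {
    errno = 0;
    if (std::remove(path.c_str()) == 0) return "";
    const int err = errno;  // save before any other call can change it
    return "can't delete the temporary file \"" + path + "\": " +
           std::strerror(err);
}
```

A message such as "No such file or directory" would suggest that another
process had already deleted the file, whereas "Permission denied" would
point at the file system instead.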

Thanks,
John

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Wed Oct 06 11:18:46 2010

  John,

Thanks for your quick response. Is the generation of the PIDs
system-dependent? It's disconcerting to know that there is even the
possibility of multiple jobs using the same PID... Is there any
possibility that submitting multiple jobs simultaneously using scripts
to a queuing system could trample on each other...meaning a workaround
would be simply to wait a second or so to submit? Perhaps there could
be a bug in my submission scripts? Again, though, the problem affects
only a very small fraction of jobs.

Thanks.

John

------------------------------------------------
Subject: Error during runs of point_stat
From: Deidre Brucker
Time: Wed Oct 06 11:24:58 2010

Hi John,
What's this ticket in reference to?

Thanks!
deidre

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Wed Oct 06 13:37:22 2010

In the meantime, I'll disable the bootstrapping CI method.

Thanks for the advice.

John

------------------------------------------------
Subject: Error during runs of point_stat
From: John Halley Gotway
Time: Wed Oct 06 15:19:45 2010

John,

I modified the temp file naming convention a bit to avoid ever writing
to the same file name.
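
The revised compute_ci.cc is not reproduced in this thread, so the
following is only one possible way to avoid name clashes, not the change
that was actually made. It is a sketch that assumes a POSIX system and
uses mkstemp(), which fills in the trailing "XXXXXX" of a template and
creates the file exclusively. The helper name and the prefix argument are
hypothetical.

```cpp
#include <stdlib.h>   // mkstemp (POSIX)
#include <unistd.h>   // close
#include <string>
#include <vector>

// Hypothetical helper, not the change made to compute_ci.cc: ask the OS to
// create a uniquely named file in `dir`. mkstemp() replaces the trailing
// "XXXXXX" and opens the file with O_EXCL, so two processes cannot end up
// with the same file even if their PIDs happen to match.
// Returns the path that was created, or an empty string on failure.
std::string create_unique_temp_file(const std::string &dir,
                                    const std::string &prefix) {
    const std::string pattern = dir + "/" + prefix + "_XXXXXX";
    std::vector<char> buf(pattern.begin(), pattern.end());
    buf.push_back('\0');               // mkstemp needs a writable C string
    const int fd = mkstemp(buf.data());
    if (fd == -1) return "";
    close(fd);                         // caller reopens by name as needed
    return std::string(buf.data());
}
```

Because the file is created at the moment the name is chosen, this avoids
the window in which two processes could pick the same name before either
one has written to it.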

Could you please try running with the attached version of
"METv3.0/lib/vx_met_util/compute_ci.cc", turn bootstrapping back on,
and let me know if you see any more of these types of errors after
some reasonable amount of time running?

If you're able to run long enough to be confident that the problem is
fixed, we can post the changes as a bugfix.

But I'll wait to hear back from you first.

Thanks,
John

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Wed Oct 06 15:37:00 2010

  John,

I will try your new code; however, the wait time on our in-house
cluster is significant right now, so it will be days before I'll be
able to report back. I will try ASAP, though.

Thanks.

John

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: John Halley Gotway
Time: Wed Oct 06 15:45:43 2010

John,

OK, that's fine.  Whenever is fine.  I'll just sit on the changes
until I hear back from you.

John

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Sun Oct 17 13:58:41 2010

  Hello again John,

I apologize for not getting back to you regarding the new code you
have provided. With the old code, however, I seem to be getting
similar errors when running stat_analysis:

ERROR: do_job() -> can't open the temporary file "tmp/tmp_17462.stat"
for reading!
ERROR: clean_up() -> can't remove temporary file "tmp/tmp_17462.stat"

n_boot_rep is zero for these runs, so is there perhaps another
situation whereby I could be trampling on myself?

Thanks.

John


------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: John Halley Gotway
Time: Mon Oct 18 09:29:42 2010

John,

Ah yes, STAT-Analysis is another place where MET creates temp files.
And it looks like we do it in the PB2NC tool as well.

So MET is using temp files in 3 spots:
(1) confidence interval library code
(2) STAT-Analysis
(3) PB2NC

I'd like to come up with a set of changes that will fix the naming
conventions in all three places.

Hopefully, I can put together a set of fixes as a tar file and then
have you test them out.  I'll let you know when I have something
ready.

Thanks,
John


On 10/17/2010 01:58 PM, RAL HelpDesk {for jhenders at aer.com} wrote:
>
> <URL: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=41321 >
>
>   Hello again John,
>
> I apologize for not getting back to you regarding the new code you
have
> provided. With the old code, however, I seem to getting similar
errors
> when running stat_analysis:
>
> ERROR: do_job() -> can't open the temporary file
"tmp/tmp_17462.stat"
> for reading!
> ERROR: clean_up() -> can't remove temporary file
"tmp/tmp_17462.stat"
>
> n_boot_rep is zero for these runs, so is there perhaps another
situation
> whereby I could be trampling on myself?
>
> Thanks.
>
> John

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Mon Oct 18 09:48:14 2010

  John,

Thanks for your prompt attention. I've had to stop running stat_analysis
until this is fixed.

I will test your fixes right away.

Thanks again.

John


------------------------------------------------
Subject: Error during runs of point_stat
From: John Halley Gotway
Time: Mon Oct 18 11:39:39 2010

John,

Please try using the attached patches:
(1) Copy this file into the top-level METv3.0 directory.
(2) Unzip the file:
   gunzip temp_file_patches.tar.gz
(3) Untar the file:
   tar -xvf temp_file_patches.tar
(4) Rebuild MET:
   make clean
   make

The fix is a new library routine that builds a temp file name to be
used.  It basically uses the process id in the temp file name (as we
were doing before), but checks to see if that file already
exists.  If it does, it appends an "_1" at the end, and checks to see
if that file exists.  If so, it tries "_2", then "_3", and so on,
until it finds a temp file name to use that doesn't already exist.
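
As an illustration only (this is not the code from the attached patch),
the naming logic described above might be sketched in C++ as follows.
The names make_temp_file_name and file_exists, the prefix/suffix
arguments, and the placement of the "_1", "_2", ... counter before the
file extension are all assumptions made for the sketch.

```cpp
#include <fstream>
#include <string>
#include <unistd.h>   // getpid(), unlink()

// Returns true if a file with this name can be opened for reading.
static bool file_exists(const std::string &path) {
    std::ifstream in(path.c_str());
    return in.good();
}

// Hypothetical helper: start with "<prefix>_<pid><suffix>".  If that
// file already exists, try "<prefix>_<pid>_1<suffix>", then "_2",
// "_3", and so on, until a name is found that is not yet in use.
std::string make_temp_file_name(const std::string &prefix,
                                const std::string &suffix) {
    const std::string base =
        prefix + "_" + std::to_string(static_cast<long>(getpid()));
    std::string name = base + suffix;
    for (int i = 1; file_exists(name); ++i) {
        name = base + "_" + std::to_string(i) + suffix;
    }
    return name;
}
```

A call such as make_temp_file_name("tmp/tmp", ".stat") would return
"tmp/tmp_<pid>.stat" when that name is free.  Note that checking for a
file and then creating it are two separate steps, so two processes
could in principle still choose the same name in between; the POSIX
mkstemp() call, which creates the file atomically, is one way to close
that gap.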

Hopefully that logic will solve the problem you're having.

Please let me know how it goes.  Once we determine that this fixes
your problems, we'll add it to the development version of the code and
post it as a bug fix.

Thanks,
John


------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Mon Oct 18 12:06:24 2010

  Thanks John. I haven't upgraded to the v3 code - partly because I want
the continuity of using the same executables for an ongoing project.
Will the attached patches work for v2?

John


------------------------------------------------
Subject: Error during runs of point_stat
From: John Halley Gotway
Time: Mon Oct 18 14:17:51 2010

John,

No, those changes would not work for METv2.0.  There are differences
in these files between METv2.0 and METv3.0.

Turns out the first set of patches I sent you was incomplete anyway.
I had forgotten to include 2 new files.

Please use the attached patches.  There's one version for METv2.0 and
another version for METv3.0.  I understand why you'd like to continue
using METv2.0 for your ongoing project.  Changing from
METv2.0 to METv3.0 would also require you to update the configuration
files you're using since they've changed some.

However, from my point of view, I'd really like to have the fixes for
METv3.0 tested out to make sure they resolve the issue.  If you do get
a chance to try out the METv3.0 patches, please let me know.

Thanks,
John



------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: jhenders at aer.com
Time: Thu Nov 04 11:31:06 2010

Hello again John,

My apologies for taking so long to respond. I can confirm that your
fixes for v2 seem to work just fine. I have had absolutely no empty
files with the related error messages. Unfortunately, it won't be
possible to test v3 since we must remain locked into v2 given
upcoming deadlines, but I suspect that the changes for v3 are quite
similar in structure.

I have an unrelated question for you. I'm anticipating a request for
how exactly my statistics from MET are computed. The concept of doing
a spatial average at one time level and then computing bias, RMSE, MAE,
etc. is fairly straightforward. However, the final statistics from MET
that I will provide from stat_analysis are, of course, based on a time
series (really an aggregation). So my final RMSE values, for example,
will be an estimate of the spread of a number of mean errors that
represent spatial averages. You and your colleagues must have had a
good basis for computing things this way, and I was hoping to simply
pass your explanation (and/or references) along to whoever asks me. I
suspect that the people who will ask me would simply aggregate
individual obs/forecast pairs from multiple times for a model domain
and then compute statistics once.

Thanks, and please let me know if I haven't been clear or have stated
some of the procedure incorrectly.

John



------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #41321] Error during runs of point_stat
From: John Halley Gotway
Time: Fri Nov 05 12:11:27 2010

John,

Thanks for following up.  I'll go ahead and commit that logic for the
temp file naming conventions to our development repository.  So we'll
include it in future versions of MET.

Regarding your other question...

In order to compute an aggregated RMSE value through time,
STAT-Analysis first aggregates together the scalar partial sums lines
(SL1L2) and recomputes the RMSE from the aggregated partial sums.  We
inherited this method from NCEP.  Each SL1L2 line contains columns for
the number of matched pairs, the mean fcst value, the mean obs value,
the mean fcst*obs value, the mean of the squared fcst value, and the
mean of the squared obs value.  Many, but not all, continuous
statistics can be expressed in terms of these values.  RMSE is one
that can.  Take a look in "METv2.0/lib/vx_met_util/met_stat.cc" for
the routine:
   void compute_cntinfo(const SL1L2Info &s, int aflag, CNTInfo &cnt_info)

When multiple SL1L2 lines are aggregated together, MET aggregates
those means as a weighted mean where the weights are proportional to
the "TOTAL" count in each line.  This functionality can be found in
the same file, in the routine:
   SL1L2Info & SL1L2Info::operator+=(const SL1L2Info &c)
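
To make the arithmetic concrete, here is a minimal C++ sketch of the
idea, not the MET source itself.  The PartialSums struct and the
aggregate() and rmse() functions are hypothetical stand-ins for
SL1L2Info and the routines named above, and the sketch assumes every
line has a TOTAL greater than zero.  It relies on the identity
MSE = mean(fcst^2) - 2*mean(fcst*obs) + mean(obs^2).

```cpp
#include <cmath>

// Hypothetical stand-in for one SL1L2 line (not MET's SL1L2Info).
struct PartialSums {
    int    total;  // number of matched pairs (TOTAL)
    double fbar;   // mean fcst value
    double obar;   // mean obs value
    double fobar;  // mean fcst*obs value
    double ffbar;  // mean of the squared fcst value
    double oobar;  // mean of the squared obs value
};

// Mean of two means, weighted by the pair counts behind each one.
static double wmean(double a, int na, double b, int nb) {
    return (a * na + b * nb) / static_cast<double>(na + nb);
}

// Aggregate two lines: counts add, means combine as weighted means.
PartialSums aggregate(const PartialSums &a, const PartialSums &b) {
    PartialSums s;
    s.total = a.total + b.total;
    s.fbar  = wmean(a.fbar,  a.total, b.fbar,  b.total);
    s.obar  = wmean(a.obar,  a.total, b.obar,  b.total);
    s.fobar = wmean(a.fobar, a.total, b.fobar, b.total);
    s.ffbar = wmean(a.ffbar, a.total, b.ffbar, b.total);
    s.oobar = wmean(a.oobar, a.total, b.oobar, b.total);
    return s;
}

// RMSE recomputed from the partial sums alone.
double rmse(const PartialSums &s) {
    const double mse = s.ffbar - 2.0 * s.fobar + s.oobar;
    return std::sqrt(mse < 0.0 ? 0.0 : mse);  // guard against round-off
}
```

Because the weights are the pair counts, the aggregated means equal the
means taken over all of the individual pairs, so in this sketch the
RMSE recomputed from the aggregated sums matches the RMSE obtained by
pooling every obs/forecast pair from all times and computing the
statistic once.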

Hopefully that helps clarify.  I'm not aware of any references that
explicitly list the equations for those derivations.  But if that's
necessary, I'd refer you to our statistician and MET project manager,
Tressa Fowler (tressa at ucar.edu).

John

On 11/04/2010 11:31 AM, RAL HelpDesk {for jhenders at aer.com} wrote:
>
> <URL: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=41321 >
>
> Hello again John,
>
> My apologies for taking so long to respond. I can confirm that your
> fixes for v2 seem to work just fine. I have had absolutely no empty
> files with the related error messages. Unfortunately it won't be
> possible to test v3 since we must remain locked into use of v2 given
> upcoming deadlines, but I suspect that the changes for v3 are quite
> similar in structure.
>
> I have an unrelated question for you. I'm anticipating receiving a
> request for how exactly my statistics from MET are computed. The concept
> of doing a spatial average at one time level then computing bias, RMSE,
> MAE, etc. is fairly straightforward. However, the final statistics from
> MET that I will provide from stat_anal, of course, are based on a time
> series (really an aggregation). So, my final RMSE values, for example,
> will be an estimation of the spread of a number of mean errors that
> represent spatial averages. You and your colleagues must have had a good
> basis for computing things this way and I was hoping to simply pass your
> explanation (and/or references) along to whomever asks me. I suspect
> that the people who will ask me would simply aggregate individual
> obs/forecast pairs from multiple times for a model domain then compute
> statistics once.
>
> Thanks and please let me know if I haven't either been clear or have
> some of the procedure stated incorrectly.
>
> John
>
>
> On 10/18/10 4:17 PM, RAL HelpDesk {for John Halley Gotway} wrote:
>> John,
>>
>> No, those changes would not work for METv2.0.  There are differences
>> in these files between METv2.0 and METv3.0.
>>
>> Turns out the first set of patches I sent you was incomplete anyway.
>> I had forgotten to include 2 new files.
>>
>> Please use the attached patches.  There's one version for METv2.0 and
>> another version for METv3.0.  I understand why you'd like to continue
>> using METv2.0 for your ongoing project.  Changing from METv2.0 to
>> METv3.0 would also require you to update the configuration files
>> you're using since they've changed some.
>>
>> However, from my point of view, I'd really like to have the fixes for
>> METv3.0 tested out to make sure they resolve the issue.  If you do get
>> a chance to try out the METv3.0 patches, please let me know.
>>
>> Thanks,
>> John
>>
>>
>> On 10/18/2010 12:06 PM, RAL HelpDesk {for jhenders at aer.com} wrote:
>>> graded to the v3 code - partly because I want
>>> the continuity of using the same executables for an ongoing project.
>>> Will the attached patches work for v2?

------------------------------------------------

