[Met_help] [rt.rap.ucar.edu #62086] History for point_stat 0 rutime problem on slave nodes

Tue Aug 27 10:38:05 MDT 2013

----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Hi,

We have a cluster with 16 slave nodes, and point_stat runs nicely on the master node. However, it does not run on any of the slave nodes. The problem is that point_stat starts and then just hangs without ever finishing. The Linux "top" command shows that the process is running, but it makes no progress.

The first several log messages on screen show that the GRIB1 file has been opened and that point_stat is then trying to open the NetCDF file.

Also, stat_analysis runs nicely on the slave nodes. I know that point_stat uses /tmp as its temporary working folder. Is there any possible reason that would make the NetCDF (obs) files unopenable?

Your suggestions are appreciated.

Thanks!

Fuquan

----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------

Subject: Re: [rt.rap.ucar.edu #62086] point_stat 0 rutime problem on slave nodes
From: Julie Prestopnik
Time: Fri Jul 05 15:34:09 2013

Hello.  My name is Julie Prestopnik, and I am part of the MET team. I
apologize for the delayed response.  NCAR was closed for the
Independence Day holiday yesterday.  Our lead developer, John, is on
vacation but will return on Monday.  I have limited experience with
the code and am not able to answer your question at this time.  I will
consult with John on Monday and one of us will respond at that time.
Thank you for your patience in advance.

Regards,
Julie

On 07/04/2013 09:00 AM, Fuquan Yang via RT wrote:
>
> Thu Jul 04 09:00:27 2013: Request 62086 was acted upon.
> Transaction: Ticket created by fuquany at novusenv.com
>         Queue: met_help
>       Subject: point_stat 0 rutime problem on slave nodes
>         Owner: Nobody
>    Requestors: fuquany at novusenv.com
>        Status: new
>   Ticket <URL: https://rt.rap.ucar.edu/rt/Ticket/Display.html?id=62086 >
>

--
Julie Prestopnik
National Center for Atmospheric Research
Research Applications Laboratory
Phone: 303.497.8399
Email: jpresto at ucar.edu

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #62086] point_stat 0 rutime problem on slave nodes
From: John Halley Gotway
Time: Mon Jul 08 14:51:06 2013

Fuquan,

As you've guessed, I also suspect it's an I/O problem.  Could you try
running Point-Stat using a high verbosity level (use -v 5 for
verbosity level 5) and write its output to a log file (-log option)?
Then take a look at the log to see if you can zero in on where it's
hanging.  When you run top, how does the memory look?  Is Point-Stat
using up all the memory?
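
One way to make the hang reproducible and to see where the output
stops is to run the command under a timeout.  The following is a
minimal Python sketch, not part of MET.  The helper name
run_with_timeout, the timeout value, and the file names in the usage
comment are assumptions for illustration; only the -v and -log
options come from the suggestion above.

```python
import subprocess
import sys


def run_with_timeout(cmd, log_path, timeout_s):
    """Run cmd, send stdout/stderr to log_path, and report whether it finished.

    Returns (finished, tail), where finished is False if the process had to
    be killed after timeout_s seconds and tail holds the last few log lines.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        try:
            proc.wait(timeout=timeout_s)
            finished = True
        except subprocess.TimeoutExpired:
            # Treat the run as hung: stop it and keep whatever was logged.
            proc.kill()
            proc.wait()
            finished = False
    with open(log_path) as log:
        tail = log.read().splitlines()[-5:]
    return finished, tail


# Hypothetical usage on a slave node (all file names are placeholders):
# finished, tail = run_with_timeout(
#     ["point_stat", "fcst.grb", "obs.nc", "PointStatConfig", "-v", "5"],
#     "point_stat_screen.log", timeout_s=600)
# print("finished" if finished else "hung", *tail, sep="\n")
```

The last lines captured before the timeout show the final message
printed before the hang, which can be compared with the same run on
the master node.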

It's possible that it's a permissions problem or shared memory
problem.  But it's likely to be very system-specific.  You may find
that a sys admin could help you debug the specifics of this problem.

Thanks,
John Halley Gotway
met_help at ucar.edu

------------------------------------------------
Subject: RE: [rt.rap.ucar.edu #62086] point_stat 0 rutime problem on slave nodes
From: Fuquan Yang
Time: Fri Jul 26 14:16:26 2013

Hello John,

I tested with -v 5 and found that point_stat opens the GRIB files
without trouble and then just hangs.  In addition:

1) Memory use is very low; the top command shows very little memory
usage.
2) I checked the /tmp directory and it seems to work fine.
stat_analysis uses the same /tmp and works fine on the slave nodes.

It seems that point_stat cannot load the NetCDF file into memory and
just stays there.

Also, I tried compiling point_stat with both the PGI compiler and the
GNU compiler, and the problem persists.  It is not caused by a
compiler difference.

Any hints?

Thanks!

------------------------------------------------
Subject: Re: [rt.rap.ucar.edu #62086] point_stat 0 rutime problem on slave nodes
From: John Halley Gotway
Time: Mon Jul 29 10:59:51 2013

Fuquan,

I talked to another engineer here and unfortunately we don't have many
good suggestions.  The only thought is perhaps your system is
configured to prevent multiple processes from accessing the same
file concurrently.  I'm not even sure if that's what's going on in your
case - are you running multiple instances of Point-Stat concurrently
that are all trying to access the same NetCDF observation file?
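
A quick way to check whether file locking behaves differently on a
slave node is to try a non-blocking lock on the observation file from
that node.  This is a minimal Python sketch under assumptions: the
helper name probe_lock is hypothetical, and it only tests advisory
flock behavior on the filesystem.  Whether MET or the NetCDF library
takes any locks at all is not established in this thread.

```python
import fcntl
import os


def probe_lock(path):
    """Try to take a non-blocking shared lock on path.

    Returns "ok" if the lock was granted, "busy" if another open file
    holds a conflicting lock, or "error: ..." for any other OS error
    (for example a filesystem that does not support locking).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            return "busy"
        except OSError as exc:
            return f"error: {exc}"
        # Lock granted: release it again before reporting success.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return "ok"
    finally:
        os.close(fd)


# Hypothetical usage (the path is a placeholder):
# print(probe_lock("/path/to/obs.nc"))
```

Running the same probe on the master node and on a slave node gives a
simple comparison; a different result on the slave node would point
toward the filesystem or its mount options rather than Point-Stat.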

To debug further, you could recompile MET using the "-g" option in
user_defs.mk.  Then run Point-Stat on one of the slave nodes through a
debugger.  Then you could step through the calls to see
exactly where the hang is occurring.  But I don't know if you're
familiar with running a debugger.  For GNU compilations, we use DDD
(which is a wrapper for the GDB debugger).
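
If stepping through interactively is not practical on a compute node,
an alternative is to attach GDB to the already hung process and print
a backtrace.  The sketch below only builds that command line; the
helper name gdb_backtrace_cmd is hypothetical, the process ID has to
be taken from top or ps, and the backtrace is only readable if MET
was compiled with "-g" as described above.

```python
import shlex


def gdb_backtrace_cmd(pid):
    """Build a gdb command that attaches to pid, prints the backtrace of
    every thread, and exits without entering interactive mode."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


# Hypothetical usage: print the command for a hung process with PID 12345,
# then run it on the slave node as the user who owns the process.
# print(shlex.join(gdb_backtrace_cmd(12345)))
```

The top frames of the backtrace name the function in which the
process is waiting, which is usually enough to tell a file open or
read call from something else.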

Sorry I can't be of more help.

Thanks,
John

------------------------------------------------