[Met_help] [rt.rap.ucar.edu #94744] History for Re: [Ticket#1007353] netCDF issues when submitting MET jobs to bsub

Julie Prestopnik via RT met_help at ucar.edu
Tue Mar 31 09:39:41 MDT 2020


----------------------------------------------------------------
  Initial Request
----------------------------------------------------------------

Dear John L Wagner - NOAA Federal,

Thank you for your request.

I don't believe any action is required from the WCOSS SA team on this.
Please let us know if the ticket in the WCOSS helpdesk may be closed.

IBM WCOSS Systems Administration Team

--

01/31/2020 09:00 (America/New_York) - John L Wagner - NOAA Federal wrote: 
Greetings,

Sorry to email both the MET and WCOSS help desks, but I wasn't sure
where to send this ticket.  We have recently been encountering errors creating
and reading netCDF files on WCOSS (mars).  These errors occur when we
call the MET programs in jobs submitted to the LSF queue.

I was running a series of MET grid_stat jobs yesterday.  All was going well
until some time after 19Z, when all jobs began to fail.  Here is the error
that I got:

terminate called after throwing an instance of 'netCDF::exceptions::NcHdfErr'
what():  NetCDF: HDF error
file: ncCheck.cpp  line:92

A sample of the netCDF files that I was using can be found in:

/gpfs/dell3/ptmp/John.L.Wagner/matching_urma_tt_2019100509.134086/match_co_2019100509.134086/009

I do not have any issues reading data from these files using ncdump.  They
don't appear to be corrupted.  They are copies of files that I created two
months ago and have been testing with for some time now.
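
As a complement to the ncdump check described above, the files can also be screened by their leading magic bytes from inside the batch job itself, so the check runs on the same nodes as the failing MET calls. The following is a minimal sketch and not part of the original report. The function names nc_format and check_dir are hypothetical. It assumes the HDF5 signature sits at offset 0, which is the usual case for netCDF-4 files, and it inspects only the file header, so it cannot rule out damage deeper in the file.

```shell
#!/bin/bash
# Classify a file by its first four bytes. This is a header check only;
# it does not validate the rest of the HDF5/netCDF structure.
nc_format() {
    local file=$1 sig
    # First four bytes as lowercase hex, e.g. "89484446" for "\x89HDF".
    sig=$(head -c 4 "$file" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    case $sig in
        89484446) echo "netcdf4-hdf5" ;;                             # HDF5 signature
        43444601|43444602|43444605) echo "netcdf-classic" ;;          # "CDF" + version byte
        *) echo "unknown" ;;
    esac
}

# Hypothetical helper: report the detected format of every regular file
# in the directory given as the first argument.
check_dir() {
    local dir=$1 f
    for f in "$dir"/*; do
        [ -f "$f" ] && printf '%s %s\n' "$(nc_format "$f")" "$f"
    done
    return 0
}
```

Calling check_dir on the sample directory from within the script submitted through bsub, and again from an interactive shell, would show whether the files look the same in both environments.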

Here are the bsub settings I'm using to submit this job:

export NTASK=104
export PTILE=28
export OMP_NUM_THREAD=20
bsub -J ${flg}_${src}_${valid_date}_${elem} \
         -W 3:00 \
         -oo $logdir/${flg}_${src}_mpmd_${valid_date}_${elem}.log \
         -eo $logdir/${flg}_${src}_mpmd_${valid_date}_${elem}.log \
         -P MDLST-T2O \
         -M 3000 \
         -q "dev" \
         -cwd $PWD \
         -R "affinity[core(1)]" \
         -n $NTASK \
         -R "span[ptile=$PTILE]" \
         -w "$regrid_dep" \
         $procdir/met_creeper_linden.sh -s $src -t $valid_date -g $flg \
         -f $force -e $elem
  
 A sample mpmd file that is used for CFP can be found here:
  
 /gpfs/dell3/ptmp/John.L.Wagner/matching_urma_tt_2019100509.134086/mpmd_file_matching_urma_2019100509_tt

  
Erin Thead and I have also encountered issues creating netCDF files
using MET's regrid_data_plane program on WCOSS.  This issue has only been
occurring for the past week or two and, again, only occurs when a job is
submitted with bsub.  It's not clear to us whether something on WCOSS has
changed (netCDF/HDF library), whether MET is having issues reading/writing
netCDF files when jobs are run in parallel, or whether it is something else.

If you need any more information from us, please let me know.  Again, sorry
for emailing both help desks at once.

Thanks,
John

--
John Wagner
Verification Task Lead
COR Task Manager
NOAA/National Weather Service
Meteorological Development Laboratory
Digital Forecast Services Branch
SSMC2 Room 10106
Silver Spring, MD 20910
(301) 427-9471 (office)
(908) 902-4155 (cell/text)



----------------------------------------------------------------
  Complete Ticket History
----------------------------------------------------------------


