From bala@llnl.gov Fri Jul 5 20:10:35 2002
From: bala@llnl.gov (Bala Govindasamy)
Date: Fri, 05 Jul 2002 12:10:35 -0700
Subject: [ccm-users] land model gives error message for pe < 16?
Message-ID: <3D25EF2B.3EB6C7F2@llnl.gov>

Dear CAM users,

When I run CAM2 on fewer than 16 IBM processors (e.g. 8 or 4), the model stops running.

For example, on 8 processors, I get the following error message:
water balance  nstep =         1 point =  4742 imbalance =  468.14 mm
 clm model is stopping
 ENDRUN IS BEING CALLED

On 4 processors, I get a similar message at nstep = 4.

But there is no problem on 16, 32 and 64 processors.

Any ideas?
Thanks,

-- 
Bala
----------------------------------------
Bala Govindasamy
L-103, Atmospheric Science Division
Lawrence Livermore National Laboratory
Livermore
CA 94550
Ph.: 925 423 0771
Fax: 925 422 6388
Email: bala@LLNL.GOV; bala_indu@yahoo.com
http://en-env.llnl.gov/cccm/balacv.html
-----------------------------------------

From dpierce@ucsd.edu Mon Jul 8 17:29:04 2002
From: dpierce@ucsd.edu (Dave Pierce)
Date: Mon, 8 Jul 2002 09:29:04 -0700 (PDT)
Subject: [ccm-users] Bug in CCSM2

Hi folks,

there seems to be a bug in ccsm's file
models/atm/cam/src/control/ccsm_msg.F90, around line 1485. Right now the
code looks like this:

#if (defined SPMD)
      do n=1,nrcv
         do lat=1,plat
            arget_buf(:,n,lat) = arget(:,lat,n)
         end do
      end do

Problem is, arget_buf is only allocated for the master processor. You
might expect this would cause strange errors on some platforms, with
rather hard to trace and non-reproducible results. I think it should
instead be:

#if (defined SPMD)
      if ( masterproc ) then
         do n=1,nrcv
            do lat=1,plat
               arget_buf(:,n,lat) = arget(:,lat,n)
            end do
         end do
      endif

Perhaps one of the model coders could verify this conjecture.

Also, in file models/ice/csim4/src/source/ice_itd.F, around line 163, the
original code is like this:

      if (my_task.eq.master_task) then
         write (6,*) ''
         write (6,*) 'hin_max(nc-1) < Cat nc < hin_max(nc)'

For some reason (probably a compiler bug) this causes a failure on the PGI
compilers version 3.2-4 (haven't tried the version 4 compilers yet). What
happens is that the ice model halts with an I/O (permission denied) error
to the output file. It works if you instead have:

      if (my_task.eq.master_task) then
         write (6,*) ' '
         write (6,*) 'hin_max(nc-1) < Cat nc < hin_max(nc)'

Note that the difference is writing a single space to the output file in
the second line, rather than a null string.

Regards,

--Dave

---------------------------------------------------------------
David W. Pierce / Climate Research Division
Scripps Institution of Oceanography / (858) 534-8276 (voice)
dpierce@ucsd.edu / (858) 534-8561 (fax)
---------------------------------------------------------------

From erik@ucar.edu Thu Jul 11 20:30:49 2002
From: erik@ucar.edu (Erik Kluzek)
Date: Thu, 11 Jul 2002 13:30:49 -0600 (MDT)
Subject: [ccm-users] Bug in CCSM2

On Mon, 8 Jul 2002, Dave Pierce wrote:

>
> there seems to be a bug in ccsm's file
> models/atm/cam/src/control/ccsm_msg.F90, around line 1485. Right now the
> code looks like this:
>
> #if (defined SPMD)
>       do n=1,nrcv
>          do lat=1,plat
>             arget_buf(:,n,lat) = arget(:,lat,n)
>          end do
>       end do
>
> Problem is, arget_buf is only allocated for the master processor. You
> might expect this would cause strange errors on some platforms, with
> rather hard to trace and non-reproducible results. I think it should
> instead be:
>
> #if (defined SPMD)
>       if ( masterproc ) then
>          do n=1,nrcv
>             do lat=1,plat
>                arget_buf(:,n,lat) = arget(:,lat,n)
>             end do
>          end do
>       endif
>
> Perhaps one of the model coders could verify this conjecture.
>

Yes, the above is a recognized problem in ccsm_msg.F90. It will be fixed
in the CAM2.0.1 and CCSM2.0.1 releases, which are scheduled for later this
month.

> Also, in file models/ice/csim4/src/source/ice_itd.F, around line 163, the
> original code is like this:
>
>       if (my_task.eq.master_task) then
>          write (6,*) ''
>          write (6,*) 'hin_max(nc-1) < Cat nc < hin_max(nc)'
>
> For some reason (probably a compiler bug) this causes a failure on the PGI
> compilers version 3.2-4 (haven't tried the version 4 compilers yet). What
> happens is that the ice model halts with an I/O (permission denied) error
> to the output file. It works if you instead have:
>
>       if (my_task.eq.master_task) then
>          write (6,*) ' '
>          write (6,*) 'hin_max(nc-1) < Cat nc < hin_max(nc)'
>
> Note that the difference is writing a single space to the output file in
> the second line, rather than a null string.
>

I'll report this to the CSIM folks. Obviously it's something simple to
fix...

Erik Kluzek, (CGD at NCAR)
National Center for Atmospheric Research
Boulder CO, (off) (303)497-1326 (fax) (303)497-1324
--------- Home page and public PGP key---------------
http://www.cgd.ucar.edu/~erik
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

From erik@ucar.edu Thu Jul 11 20:54:56 2002
From: erik@ucar.edu (Erik Kluzek)
Date: Thu, 11 Jul 2002 13:54:56 -0600 (MDT)
Subject: [ccm-users] Moving CCM-users over to cam-users...

All,

I will be moving everyone that is currently on ccm-users@ucar.edu over to
the new CAM users e-mail list "cam-users@ucar.edu". Messages regarding
both CAM and the CCM are being sent to both lists, and I think it would be
cleaner (and easier for me) to have a single list to manage. Messages
regarding the CCM can still be sent to the "cam-users" list. Once everyone
is moved over, I'll deactivate the "ccm-users" list and disallow anyone
from signing on to it. The CCM3 web-page will also refer to the CAM-users
e-mail list for new questions.

If you don't want to be on the ccm-users list, either unsubscribe by going
to

http://mailman.ucar.edu/mailman/listinfo/cam-users/

or send me e-mail and I'll take you off the list.

Thanks,

Erik Kluzek

From erik@ucar.edu Thu Jul 11 20:58:35 2002
From: erik@ucar.edu (Erik Kluzek)
Date: Thu, 11 Jul 2002 13:58:35 -0600 (MDT)
Subject: [ccm-users] ccm-users and cam-users now unmoderated for list subscribers...

All,

In the past the ccm-users list was moderated to eliminate spam. I've now
opened up the list to allow messages from list members. If we start having
problems with inappropriate messages or spam on the list again, I'll lock
it down again.

Thanks,

Erik Kluzek

From erik@ucar.edu Thu Jul 11 21:07:16 2002
From: erik@ucar.edu (Erik Kluzek)
Date: Thu, 11 Jul 2002 14:07:16 -0600 (MDT)
Subject: [ccm-users] land model gives error message for pe < 16?
In-Reply-To: <3D25EF2B.3EB6C7F2@llnl.gov>

On Fri, 5 Jul 2002, Bala Govindasamy wrote:

>
> When I run CAM2 on fewer than 16 IBM processors (e.g. 8 or 4), the model
> stops running.
>
> For example, on 8 processors, I get the following error message:
> water balance  nstep =         1 point =  4742 imbalance =  468.14 mm
>  clm model is stopping
>  ENDRUN IS BEING CALLED
>
> On 4 processors, I get a similar message at nstep = 4.
>
> But there is no problem on 16, 32 and 64 processors.
>

Bala,

I just did some simple tests with 8 processors (2 nodes and 4 threads
each) -- it worked for me. Can you send more specifics? Your
config_cache.xml file, namelist, and the commands you are using to invoke
the executable (environment variables and poe command line) would be
helpful in reproducing the problem. Any specifics on the machine you are
running on might also be useful: the machine name, compiler version, and
OS version.

Erik Kluzek