[mpas-developers] dmpar_abort

Michael Duda duda at ucar.edu
Tue Apr 27 15:53:08 MDT 2010


Hi, Phil.

Thanks very much for the input. I like the idea of having a stack
trace of sorts when an error is encountered, especially when that
error is inside a routine that is called from several different
higher-level routines. As you pointed out, we'd need to be careful
about leaving other MPI tasks hanging if one encounters an error;
nonetheless, I think this is an approach worth further
consideration.
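
For anyone following along, here is a rough sketch of the kind of
error-trace module Phil describes below. It is not the actual POP
code: the module name error_trace_sketch, the sketch_* routine names,
the fixed-size trace buffer, and the demo program are all made up for
illustration. Only the set/check/print calling pattern is taken from
Phil's description.

```fortran
! Hypothetical sketch of an error-trace module; NOT the actual POP code.
module error_trace_sketch
   implicit none
   private
   public :: sketch_success, sketch_fail
   public :: sketch_error_set, sketch_error_check, sketch_error_print

   integer, parameter :: sketch_success = 0, sketch_fail = 1
   integer, parameter :: max_depth = 32, max_len = 128

   ! Messages accumulated as an error propagates up the call stack.
   character(len=max_len), save :: trace(max_depth)
   integer, save :: depth = 0

contains

   ! Mark errorCode as failed and record the first message of the trace.
   subroutine sketch_error_set(errorCode, rtnName, errMsg)
      integer, intent(out) :: errorCode
      character(len=*), intent(in) :: rtnName, errMsg
      errorCode = sketch_fail
      call push(rtnName, errMsg)
   end subroutine sketch_error_set

   ! Return .true. if errorCode signals failure, and in that case
   ! append the caller's own message to the trace.
   logical function sketch_error_check(errorCode, rtnName, errMsg) result(failed)
      integer, intent(in) :: errorCode
      character(len=*), intent(in) :: rtnName, errMsg
      failed = (errorCode /= sketch_success)
      if (failed) call push(rtnName, errMsg)
   end function sketch_error_check

   ! Print the accumulated trace, innermost routine first.
   subroutine sketch_error_print()
      integer :: i
      do i = 1, depth
         write(*,'(a)') trim(trace(i))
      end do
   end subroutine sketch_error_print

   ! Append "rtnName: errMsg" to the trace; silently drop it if full.
   subroutine push(rtnName, errMsg)
      character(len=*), intent(in) :: rtnName, errMsg
      if (depth < max_depth) then
         depth = depth + 1
         trace(depth) = trim(rtnName) // ': ' // trim(errMsg)
      end if
   end subroutine push

end module error_trace_sketch

program demo
   use error_trace_sketch
   implicit none
   integer :: errorCode

   call outer(errorCode)
   if (errorCode /= sketch_success) then
      call sketch_error_print()
      ! Only the top level terminates the program.
      stop 1
   end if

contains

   subroutine inner(errorCode)
      integer, intent(out) :: errorCode
      errorCode = sketch_success
      ! Pretend something failed here: record it and return.
      call sketch_error_set(errorCode, 'inner', 'something went wrong')
      return
   end subroutine inner

   subroutine outer(errorCode)
      integer, intent(out) :: errorCode
      call inner(errorCode)
      if (sketch_error_check(errorCode, 'outer', 'error in inner')) return
   end subroutine outer

end program demo
```

The demo prints the message from inner followed by the one from
outer, then stops with a nonzero status. In an MPI code the top-level
handler would still have to deal with the hanging-task concern, for
instance by aborting rather than returning when only one task sees
the error.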

Cheers,
Michael


On Mon, Apr 26, 2010 at 03:53:54PM -0600, Philip Jones wrote:
> 
> Michael,
> 
> This is what I was implementing in POP before we switched
> to MPAS: all routines passed an error code up to the calling
> routine, and the program was terminated only at the highest
> level.  It's nice from a component standpoint because it lets
> everyone shut down cleanly in response to an error code.
> But you have to treat threaded regions a bit carefully, and
> there's always a chance of hanging MPI if one MPI task returns
> an error and no one else does.  In MPI-specific routines, you
> might want to abort if you're between a send/recv pair or
> something, and only use this error approach elsewhere.
> 
> Anyway, if you're interested in that approach, I have an
> error module that keeps what amounts to an internal stack
> trace as errors are propagated upward.  Before exiting
> a routine with the error code, you can call
>     call POP_ErrorSet(errorCode, rtnName, errMsg)
> and then return.  And in the calling routine, you can do
>     if (POP_ErrorCheck(errorCode, rtnName, errMsg)) return
> The error module keeps track of all the errMsg strings to form
> an error trace that is output with an ErrorPrint call.
> 
> Phil
> 
> On Apr 26, 2010, at 3:21 PM, Michael Duda wrote:
> 
> > Hi, Xylar.
> >
> > One approach might be to return a status code from routines that
> > might encounter errors, and allow a routine higher up in the call
> > stack to handle the error with a dmpar_abort if it were deemed
> > appropriate. Depending on the nature of the subroutine, this might
> > be the preferable approach -- allow higher-level code to determine
> > whether the error can be recovered from or whether it is fatal.
> > However, this would entail either adding an error code argument
> > to the subroutine, which is one thing we'd like to avoid, or
> > converting the subroutine into a function, which wouldn't be an
> > option if the routine were in fact already a function.
> >
> > Another approach, and one that would be very simple to implement,
> > would be to add a dmpar_global_abort(mesg) routine that is
> > callable from any code that uses the dmpar module, and that prints
> > the message mesg before calling MPI_Abort with MPI_COMM_WORLD. The
> > current dmpar_abort only needs the dminfo argument to get the
> > communicator to abort on, and I'd be hard-pressed to find a case
> > where it would be desirable to abort on a communicator other than
> > the global one. Adding a dmpar_global_abort routine would obviate
> > the need to pass dminfo into any subroutine that might need to
> > abort, and adding it as a new subroutine would allow us to migrate
> > from existing calls to dmpar_abort on an as-needed basis.
> >
> > I'd support adding a dmpar_global_abort routine in the dmpar
> > module, but I'd also suggest considering whether the error being
> > checked for is one that can be recovered from, in which case a
> > return error code might be the cleanest approach in that
> > particular case.
> >
> > Cheers,
> > Michael
> >
> >
> > On Mon, Apr 26, 2010 at 02:10:10PM -0600, Xylar Asay-Davis wrote:
> > >> I'm trying to use dmpar_abort as a way to stop the code with an error
> > >> message when things go wrong with the code I'm testing.  I could just
> > >> use stop, but I figured dmpar_abort was the "proper" way.  The problem
> > >> is that dminfo, the argument needed by dmpar_abort, is a member of the
> > >> domain, which is not available in many subroutines.  And it's
> > >> inconvenient to have to pass around any extra arguments to my
> > >> subroutines just in case I might want to abort.
> >>
> >> Any suggestions?
> >>
> >> -Xylar
> >>
> >> -- 
> >>
> >> ***********************
> >> Xylar S. Asay-Davis
> >> E-mail: xylar at lanl.gov
> >> Phone: (505) 606-0025
> >> Fax: (505) 665-2659
> >> CNLS, MS B258
> >> Los Alamos National Laboratory
> >> Los Alamos, NM 87545
> >> ***********************
> >>
> >>
> >> _______________________________________________
> >> mpas-developers mailing list
> >> mpas-developers at mailman.ucar.edu
> >> http://mailman.ucar.edu/mailman/listinfo/mpas-developers
> 
> ---
> Correspondence/TSPA/DUSA AOE
> ------------------------------------------------------------
> Philip Jones                                pwjones at lanl.gov
> T-3 MS B216                                 Ph: 505-667-6387
> Los Alamos National Lab                    Fax: 505-665-5926
> Los Alamos, NM 87545-1663
> 
> 
> 

