<p><b>pwjones@lanl.gov</b> 2011-09-02 10:49:27 -0600 (Fri, 02 Sep 2011)</p><p><br>

Edits to the performance design doc.<br>

</p><hr noshade><pre><font color="gray">Modified: branches/ocean_projects/performance/PerformanceBranchDesign.pdf

===================================================================

(Binary files differ)

Modified: branches/ocean_projects/performance/PerformanceBranchDesign.tex

===================================================================

--- branches/ocean_projects/performance/PerformanceBranchDesign.tex        2011-09-02 15:06:40 UTC (rev 974)

+++ branches/ocean_projects/performance/PerformanceBranchDesign.tex        2011-09-02 16:49:27 UTC (rev 975)

@@ -17,17 +17,17 @@

 \chapter{Summary}

-This documents contains the requiremens and design specifications for use while optimizing MPAS. The overall outcome of this document will be a version of MPAS that has a higher level of parallelism available to users, as well as a more modular design. To begin requiremens regarding parallelism are laid out, followed by requiremens for the modularity of the code and potential enhancements that could be performed at a later time.

+This document contains the requirements and design specifications for use while optimizing MPAS. The overall outcome of this document will be a version of MPAS that has a higher level of parallelism available to users, as well as a more modular design. To begin, requirements regarding parallelism are laid out, followed by requirements for the modularity of the code and potential enhancements that could be performed at a later time.

 %-----------------------------------------------------------------------

 \chapter{Requirements}

-\section{Performance Portable}

+\section{Performance Portability}

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-After this design document has been implemented the code within MPAS-Ocean is expected to be performance portable, meaning the code is not going to be optimized for a specific archetecture. All optimized code should remain general, and as readable as possible.

+The MPAS-Ocean code should be performance portable, meaning the code is not optimized for a specific architecture. All optimized code should remain general and as readable as possible.  Should any architecture-specific implementations be necessary, they should sit behind generic abstractions and be confined to a small number of isolated modules that can be selected at compile time.  This may be required, for example, in initial accelerated systems where standards and abstractions have not yet been developed.  Portability in general requires adherence to both language standards and widely-used libraries whenever possible, with non-standard approaches again isolated using conditional compilation or preprocessing.

 \section{Parallelism}

 Date last modified: 2011/09/01 \\

@@ -41,17 +41,23 @@

         \item Accelerated Architecture Parallelism - TBD (Cuda, OpenCL)

 \end{itemize}

+\section{Scalability}

+Date last modified: 2011/09/01 \\

+Contributors: (Doug Jacobsen, Phil Jones) \\

+

+MPAS-Ocean must achieve high scalability appropriate for a given problem size.  Near-linear scalability down to tens of horizontal grid points per processing element is desired.  For future exascale machines, as many as a billion processing elements may be required, so exposing as much parallelism as possible is desired.

+

 \section{Modularity}

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-The modularity of MPAS-Ocean is to be enhanced to improve readability of code as well as improving turn around time for implementing new parameterizations. Modules will include error feedback to be used for error checking.

+The modularity of MPAS-Ocean is to be enhanced to improve readability, enable encapsulation, and simplify the implementation of new parameterizations. Modules will include error feedback to be used for error checking.

-\section{Further Encapsulation}

+\section{Encapsulation}

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-As part of this design specification, the encapsulation of MPAS-Ocean should be increased at a later time to aid the overall performance as well. This will also lower the memory footprint of specific modules or routines included in MPAS-Ocean.

+Encapsulation within MPAS-Ocean is highly desired.  Encapsulation refers to keeping data and operators local and private within a module whenever possible and only exposing (making public) the interfaces or data needed by calling routines.  This practice helps to prevent namespace conflicts and inadvertent side effects (e.g. a module inadvertently writing into another array that should not be public). In addition, this practice helps to reduce the overall memory footprint.

 %-----------------------------------------------------------------------

@@ -61,8 +67,10 @@

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-Performance optimizations will be implemented using standard general techniques. These techniques will include things like loop fusing, and removing branching statements within loops. Loop fusing can be seen in the following context, although there are more examples of this.

+Performance optimizations will be implemented using standard general techniques. These techniques will include things like loop fusing and removing branching statements within loops. Another general rule of thumb is that memory operations (loads/stores) are more expensive than flops and that the memory available to each processing element is likely to decrease in the future, implying that temporary arrays should be avoided and overall memory use minimized.

+Loop fusing can be seen in the following context, although there are more examples of this.

+

 \begin{verbatimtab}

 do iEdge=1,nEdges

         cell1 = cellsOnEdge(1,iEdge)

@@ -202,24 +210,28 @@

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-In order to implement the three levels of parallelism, various the code requires a variety of modifications. The distrubuted memory parallelism is already implemented using MPI commands, but some optimization for these can be explored. OpenMP directives can be added around loops and where it appears to be useful, though it is unclear if this will actually improve performance of MPAS-Ocean or not, some exploration can be done on this to see if the benefit is real or not. However, OpenMP directives might be useful on computers with multi-core CPU's that don't currently have MPI installed, and can be thought of as a potential alternative to MPI rather than begin used in addition to MPI. \\

+In order to implement the three levels of parallelism, the code requires a variety of modifications. The distributed memory parallelism is already implemented using MPI commands, but some optimization for these can be explored. \\ 

+OpenMP (or another threading paradigm) is necessary on computers with multi-core CPUs, especially as core counts per node increase.  MPI implementations across cores can often result in bus contention if the vendor has not optimized for local shared memory. However, an appropriate OpenMP implementation will require some experimentation.  If enough subdomains are assigned to an MPI task, OpenMP parallelism may be desirable at a high level in the code, threading over subdomains similar to distributed memory parallelism.  Alternatively, directives can be added around loops wherever they appear to be useful. Care must be taken to evaluate the approaches and identify ways of maintaining data locality with the threads (vendors often provide some capability at run time, though this may require care if ``first touch'' is the mechanism for pinning memory).

+

 The third level of parallelism will take the most work. To begin, a suitable method of parallelization for accelerated architectures needs to be identified. In the event CUDA or OpenCL is chosen to perform a set of tasks on GPUs, some major modifications will need to be made to algorithms suitable for programming in this fashion, at least if portable code is still a major goal. \\

-In order to maintain portable code and use CUDA or OpenCL it is likely that some algorithms, or modules, will need to be ported to having a Fortran interface on top of C code, to allow the use of multiple compilers, as CUDA currently is only supported in Fortan using the PGI compilers, and OpenCL is not supported in Fortran at all. However it remains unclear if other avenues of parallelism on this level are available.

+In order to maintain portable code and use CUDA or OpenCL, it is likely that some algorithms, or modules, will need to be ported to a Fortran interface on top of C code to allow the use of multiple compilers, as CUDA is currently only supported in Fortran using the PGI compilers, and OpenCL is not supported in Fortran at all. As we explore these specific implementations, suitable abstractions may become apparent and these abstractions can call specific CUDA or OpenCL code underneath.  This is a significant research area at present.

 \section{Modularity}

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-In order to aid the parallel development of MPAS-Ocean by multiple developers, and increase overall performance of MPAS-Ocean, the modularity of MPAS-Ocean will be increased. Idenfied components within currently existing portions of code can be extracted and put into their own modules, as seen fit.

+In order to aid the parallel development of MPAS-Ocean by multiple developers, and increase overall performance of MPAS-Ocean, the modularity of MPAS-Ocean will be increased. Identified components within currently existing portions of code can be extracted and put into their own modules, as seen fit.

 As an example, within module\_time\_integration.F the horizontal mixing of tracers and momentum has a $\nabla^2$ and $\nabla^4$ option hard coded. This would be removed, and put in its own module named something like module\_OcnHmixMom.F. This module would include possible parameterizations, and handle the selection of each. In the described case two sub-modules would be created each named something like module\_OcnHmixMomDel2.F and module\_OcnHmixMomDel4.F, representing each of the two options. Each of these would contain the portion of code relevant to computing the tendencies for the momentum equation related to the parameterization.

-This modular programming fashion allows several things. First, array masking can be applied by passing in pointers to arrays which may or may not be modifed, which allows the compiler to optimize computations involved these more effectively. Second, it allows parameterizations to be explored and implemented in an easier fashion that is currently available in MPAS-Ocean. Third, it should reduce the overall memory footprint of the MPAS-Ocean at any point in time, as anywhere in the code less arrays will be used. And finally, it increases the ability for encapsulation within MPAS-Ocean.

+This modular programming fashion allows several things.  First, it allows parameterizations to be explored and implemented in an easier fashion than is currently available in MPAS-Ocean.  Second, the pointers inherent in the current MPAS data structures and registry present a barrier to compiler optimization.  By modularizing tendency calculations, it is possible to pass pointers into arrays so that compilers at the module level can optimize the underlying array code while retaining the flexibility of the pointers and structures at the top levels and for configuration.

+Third, it should reduce the overall memory footprint of MPAS-Ocean at any point in time, as only the arrays required by the current modules will be allocated. And finally, it increases the ability for encapsulation within MPAS-Ocean.

-Modularity can be seen in figure \ref{fig:modules}.

+Modularity can be seen in figure \ref{fig:modules}.  This figure is only a conceptual figure at present and will be replaced by a more detailed module layout soon.

+

 \begin{figure}

         \includegraphics[scale=0.35]{NewArchitecture.eps}

         \label{fig:modules}

@@ -230,7 +242,7 @@

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-Although not a short term goal of this specification, encapsulation should be explored within MPAS-Ocean.

+Encapsulation is currently implemented by making all module variables and interfaces private by default (with a private statement in the module header).  Only data or interfaces that need to be public are given the public attribute.  In addition, most modules will contain their own initialization routine to initialize module variables and options.  In the future, current registry variables will need to reflect the module granularity of input namelist options.

 %-----------------------------------------------------------------------

</font>

</pre>