<p><b>dwj07@fsu.edu</b> 2011-09-02 11:04:35 -0600 (Fri, 02 Sep 2011)</p><p><br>

        Some small edits to the design doc<br>

</p><hr noshade><pre><font color="gray">Modified: branches/ocean_projects/performance/PerformanceBranchDesign.pdf

===================================================================

(Binary files differ)

Modified: branches/ocean_projects/performance/PerformanceBranchDesign.tex

===================================================================

--- branches/ocean_projects/performance/PerformanceBranchDesign.tex        2011-09-02 16:49:27 UTC (rev 975)

+++ branches/ocean_projects/performance/PerformanceBranchDesign.tex        2011-09-02 17:04:35 UTC (rev 976)

@@ -3,6 +3,7 @@

 \usepackage{epsf,amsmath,amsfonts}

 \usepackage{graphicx}

 \usepackage{moreverb}

+\usepackage{placeins}

 \begin{document}

@@ -17,7 +18,12 @@

 \chapter{Summary}

-This document contains the requiremens and design specifications for use while optimizing MPAS. The overall outcome of this document will be a version of MPAS that has a higher level of parallelism available to users, as well as a more modular design. To begin requiremens regarding parallelism are laid out, followed by requiremens for the modularity of the code and potential enhancements that could be performed at a later time.

+This document contains the requirements and design specifications for use while

+optimizing MPAS. The overall outcome of this document will be a version of MPAS

+that has a higher level of parallelism available to users, as well as a more

+modular design. To begin, requirements regarding parallelism are laid out,

+followed by requirements for the modularity of the code and potential

+enhancements that could be performed at a later time.

 %-----------------------------------------------------------------------

@@ -204,20 +210,23 @@

 end do

 \end{verbatimtab}

-Which replaces the branch with a 6 multiplies and two logicial nots, and allows the loop to be vectorized easier. Other performance enhancements are to be implemented as seen fit.

+Which replaces the branch with 6 multiplies and two logical not statements, and allows the loop to be vectorized more easily. Other performance enhancements are to be implemented as seen fit.

 \section{Parallelism}

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

-In order to implement the three levels of parallelism, code requires a variety of modifications. The distrubuted memory parallelism is already implemented using MPI commands, but some optimization for these can be explored. \\ 

+In order to implement the three levels of parallelism, the code requires a variety of modifications. The distributed memory parallelism is already implemented using MPI commands, but some optimization for these can be explored.

-OpenMP (or other threading paradigm) is necessary on computers with multi-core CPU's, especially as core counts per node increase.  MPI implementations across cores can often result in bus contention if the vendor has not optimized for local shared memory. However, an appropriate OpenMP implementation will require some experimentation.  If enough subdomains are assigned to an MPI task, OpenMP parallelism may be desireable at a high level in the code, threading over subdomains similar to distributed memory parallism.  Alternatively, directives can be added around loops and where it appears to be useful. Care must be taken to evaluate the approaches and identify ways of maintaining data locality with the threads (often vendors provide some capability at run time, though may require care if ``first touch'' is a mechanism for pinning memory). 

+OpenMP (or another threading paradigm) is necessary on computers with multi-core CPUs, especially as core counts per node increase.  MPI implementations across cores can often result in bus contention if the vendor has not optimized for local shared memory. However, an appropriate OpenMP implementation will require some experimentation.  If enough subdomains are assigned to an MPI task, OpenMP parallelism may be desirable at a high level in the code, threading over subdomains similar to distributed memory parallelism.  Alternatively, directives can be added around loops where it appears to be useful. Care must be taken to evaluate the approaches and identify ways of maintaining data locality with the threads (often vendors provide some capability at run time, though this may require care if ``first touch'' is a mechanism for pinning memory).

-The third level of parallelism will take the most work. To begin, a suitable method of parallization for accelerated architectures needs to be identified. In the event CUDA or OpenCL are chosen to perform a set of tasks on GPUs some major modifications will need to be done to algorithms suitable for programing in this fashion, at least if portable code is still a major goal. \\

+The third level of parallelism will take the most work. To begin, a suitable method of parallelization for accelerated architectures needs to be identified. In the event CUDA or OpenCL are chosen to perform a set of tasks on GPUs, some major modifications will need to be made to algorithms suitable for programming in this fashion, at least if portable code is still a major goal.

-In order to maintain portable code and use CUDA or OpenCL it is likely that some algorithms, or modules, will need to be ported to having a Fortran interface on top of C code, to allow the use of multiple compilers, as CUDA currently is only supported in Fortan using the PGI compilers, and OpenCL is not supported in Fortran at all. As we explore these specific implementations, suitable abstractions may become apparent and these abstractions can call specific CUDA or OpenCL code underneath.  This is a significant research area at present. 

+In order to maintain portable code and use CUDA or OpenCL it is likely that some algorithms, or modules, will need to be ported to having a Fortran interface on top of C code, to allow the use of multiple compilers. CUDA is currently only supported with the use of the PGI Fortran compiler and OpenCL is not supported in Fortran at all. As we explore these specific implementations, suitable abstractions may become apparent and these abstractions can call specific CUDA or OpenCL code underneath.  This is a significant research area at present. 

+.

+

+\FloatBarrier

 \section{Modularity}

 Date last modified: 2011/09/01 \\

 Contributors: (Doug Jacobsen, Phil Jones) \\

@@ -226,7 +235,7 @@

 As an example, within module\_time\_integration.F the horizontal mixing of tracers and momentum has a $\nabla^2$ and $\nabla^4$ option hard coded. This would be removed, and put in its own module named something like module\_OcnHmixMom.F. This module would include possible parameterizations, and handle the selection of each. In the described case two sub-modules would be created each named something like module\_OcnHmixMomDel2.F and module\_OcnHmixMomDel4.F, representing each of the two options. Each of these would contain the portion of code relevant to computing the tendencies for the momentum equation related to the parameterization.</font><font color="red">

-This modular programming fashion allows several things.  First, it allows parameterizations to be explored and implemented in an easier fashion that is currently available in MPAS-Ocean.  Second, the pointers inherent in the current MPAS data structures and registry present a barrier to compiler optimization.  By modularizing tendency calculations, it is possible to pass pointers into arrays so that compilers at the module level can optimize the underlying array code while retaining the flexibility of the pointers and structures at the top levels and for configuration. 

+This modular programming fashion allows several things.  First, it allows parameterizations to be explored and implemented in an easier fashion than is currently available in MPAS-Ocean.  Second, the pointers inherent in the current MPAS data structures and registry present a barrier to compiler optimization.  By modularizing tendency calculations, it is possible to pass pointers into arrays so that compilers at the module level can optimize the underlying array code while retaining the flexibility of the pointers and structures at the top levels and for configuration. 

 Third, it should reduce the overall memory footprint of the MPAS-Ocean at any point in time, as only the arrays required by the current modules will be allocated. And finally, it increases the ability for encapsulation within MPAS-Ocean.

</font>

</pre>