﻿id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
747	Important: Updating MPI communication options and NetCDF-4 Parallel I/O	arango		"This update is very important because it includes changes that allow the user to select the most appropriate MPI library routines for ROMS '''distribute.F''' internal message-passing routines: '''mp_assemble''', '''mp_boundary''', '''mp_collect''', and '''mp_reduce'''.  We can use either high-level ('''mpi_allgather''' or '''mpi_allreduce''') or low-level ('''mpi_isend''' and '''mpi_irecv''') library routines. On some computers, the low-level routines are more efficient than the high-level ones, or vice versa.  Proprietary MPI libraries offered by supercomputer and cluster vendors are optimized and tuned to the computer architecture, resulting in very efficient high-level communication routines.  In the past, we selected the communication methodology for you. With this update, we no longer make such selections; instead, several new C-preprocessing options let the user choose the methodology appropriate for their computer architecture.  Users need to benchmark their computer(s) and select the most efficient strategy.

Notice that when '''mpi_allreduce''' is used, a summation reduction is implied. The data is stored in a global array of the same size in each parallel tile. The global array is always initialized to zero, and each MPI node operates only on its own portion of it.  Since every node computes only its piece, a summation reduction across the communicator is equivalent to each tile gathering the data from all other members.
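
For illustration, here is a minimal stand-alone sketch of this equivalence. It is not ROMS code; the array size and all names are made up:
 {{{
      PROGRAM gather_via_allreduce
!
!  Each rank zeroes the global-size work array, fills only its own
!  slice, and the MPI_SUM reduction then assembles the full array on
!  every rank, which is equivalent to a gather.
!
      USE mpi
      IMPLICIT NONE
      integer, parameter :: Npts = 8          ! hypothetical global size
      integer :: MyError, MyRank, Nnodes, i, chunk, istr, iend
      real(kind=8) :: Awrk(Npts), Aout(Npts)

      CALL mpi_init (MyError)
      CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)
      CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError)
      Awrk=0.0_8                              ! always zero-initialized
      chunk=Npts/Nnodes                       ! assume Npts is divisible
      istr=MyRank*chunk+1
      iend=istr+chunk-1
      DO i=istr,iend                          ! fill only my portion
        Awrk(i)=REAL(i,kind=8)
      END DO
      CALL mpi_allreduce (Awrk, Aout, Npts, MPI_DOUBLE_PRECISION,       &
     &                    MPI_SUM, MPI_COMM_WORLD, MyError)
      CALL mpi_finalize (MyError)             ! Aout holds all slices
      END PROGRAM gather_via_allreduce
 }}}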

Affected communication routines:

* '''mp_assemble''' is used in the nesting algorithms to gather data from all the MPI communicator members (parallel tiles) for the expensive two-way exchange between nested grids. There are three different methods for gathering the data, using either low-level (default) or high-level routines. The user may activate the C-preprocessing option '''ASSEMBLE_ALLGATHER''' or '''ASSEMBLE_ALLREDUCE''' to use the high-level routines '''mpi_allgather''' or '''mpi_allreduce''', respectively. If neither option is activated, ROMS will use the low-level routines '''mpi_isend''' and '''mpi_irecv''' to perform the task. A schematic of this selection logic is shown after the interface block below.
 {{{
      INTERFACE mp_assemble
        MODULE PROCEDURE mp_assemblef_1d
        MODULE PROCEDURE mp_assemblef_2d
        MODULE PROCEDURE mp_assemblef_3d
        MODULE PROCEDURE mp_assemblei_1d
        MODULE PROCEDURE mp_assemblei_2d
      END INTERFACE mp_assemble
}}}
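
Schematically, the selection looks as follows. This is a simplified sketch, not the actual '''distribute.F''' source ('''MP_FLOAT''' and '''OCN_COMM_WORLD''' are ROMS conventions; the buffer names are made up), and the same pattern applies to the other communication routines:
 {{{
# ifdef ASSEMBLE_ALLGATHER
!  High-level: gather every tile's contribution with mpi_allgather.
      CALL mpi_allgather (Asend, Npts, MP_FLOAT,                        &
     &                    Arecv, Npts, MP_FLOAT,                        &
     &                    OCN_COMM_WORLD, MyError)
# elif defined ASSEMBLE_ALLREDUCE
!  High-level: a summation reduction over the zero-initialized global
!  array is equivalent to a gather (see above).
      CALL mpi_allreduce (Asend, Arecv, Npts, MP_FLOAT, MPI_SUM,        &
     &                    OCN_COMM_WORLD, MyError)
# else
!  Low-level (default): explicit point-to-point exchange with
!  mpi_irecv/mpi_isend, completed with mpi_wait.
      ...
# endif
 }}}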

* '''mp_boundary''' is used to exchange open lateral boundary condition data between parallel tiles. Each parallel tile is associated with a different MPI node. Two different methods of exchange are currently available, both using high-level MPI routines. The user may activate the C-preprocessing option '''BOUNDARY_ALLREDUCE''' to exchange with '''mpi_allreduce'''.  Otherwise, ROMS will use '''mpi_allgather''' by default.

* '''mp_collect''' is used to gather 1D integer or floating-point arrays. It is used in several ROMS algorithms: floats, stations, 4D-Var observations, and parallel I/O. Again, there are three different methods for gathering this type of data, using either low-level (default) or high-level routines. The user may activate the C-preprocessing option '''COLLECT_ALLGATHER''' or '''COLLECT_ALLREDUCE''' to use the high-level routines '''mpi_allgather''' or '''mpi_allreduce''', respectively.  If neither option is activated, ROMS will use the low-level routines '''mpi_isend''' and '''mpi_irecv''' to perform the task, as sketched after the interface block below.
 {{{
      INTERFACE mp_collect
        MODULE PROCEDURE mp_collect_f
        MODULE PROCEDURE mp_collect_i
      END INTERFACE mp_collect
}}}
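
A minimal sketch of the default low-level pattern mentioned above (not the ROMS routine; argument names and tags are made up): every rank posts nonblocking receives for the other ranks' segments, sends its own segment, and waits for completion.
 {{{
      SUBROUTINE lowlevel_collect (Npts, Nnodes, MyRank, A)
!
!  Gather the 1D array "A" (Npts elements per rank) on all ranks
!  using nonblocking point-to-point messages.
!
      USE mpi
      IMPLICIT NONE
      integer, intent(in) :: Npts, Nnodes, MyRank
      real(kind=8), intent(inout) :: A(Npts*Nnodes)
      integer :: MyError, rank, Nreq
      integer :: request(2*(Nnodes-1))
      integer :: status(MPI_STATUS_SIZE,2*(Nnodes-1))

      Nreq=0
      DO rank=0,Nnodes-1                      ! receive other segments
        IF (rank.ne.MyRank) THEN
          Nreq=Nreq+1
          CALL mpi_irecv (A(rank*Npts+1), Npts, MPI_DOUBLE_PRECISION,   &
     &                    rank, rank, MPI_COMM_WORLD,                   &
     &                    request(Nreq), MyError)
        END IF
      END DO
      DO rank=0,Nnodes-1                      ! send my segment
        IF (rank.ne.MyRank) THEN
          Nreq=Nreq+1
          CALL mpi_isend (A(MyRank*Npts+1), Npts, MPI_DOUBLE_PRECISION, &
     &                    rank, MyRank, MPI_COMM_WORLD,                 &
     &                    request(Nreq), MyError)
        END IF
      END DO
      CALL mpi_waitall (Nreq, request, status, MyError)
      END SUBROUTINE lowlevel_collect
 }}}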

* '''mp_reduce''' is used for global reductions according to the operation handle '''MIN''', '''MAX''', or '''SUM'''. Several reduction operators can be applied at once when the data is aggregated into a single vector for efficiency. Three different methods are available for reducing this type of data, using either low-level (default) or high-level routines. The user may activate the C-preprocessing option '''REDUCE_ALLGATHER''' or '''REDUCE_ALLREDUCE''' to use the high-level routines '''mpi_allgather''' or '''mpi_allreduce''', respectively.  If neither option is activated, ROMS will use the low-level routines '''mpi_isend''' and '''mpi_irecv''' to perform the task.
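
For example, with '''REDUCE_ALLGATHER''' the aggregated vector can be reduced by gathering every rank's copy and combining locally. A schematic sketch (not the actual ROMS routine; names are made up):
 {{{
      SUBROUTINE sketch_reduce (Asize, Nnodes, A, op_handle)
!
!  Gather every rank's copy of the packed vector "A", then combine
!  the copies locally according to each element's operation handle.
!
      USE mpi
      IMPLICIT NONE
      integer, intent(in) :: Asize, Nnodes
      real(kind=8), intent(inout) :: A(Asize)
      character(len=*), intent(in) :: op_handle(Asize)
      integer :: MyError, i, rank
      real(kind=8) :: Arecv(Asize*Nnodes)

      CALL mpi_allgather (A, Asize, MPI_DOUBLE_PRECISION,               &
     &                    Arecv, Asize, MPI_DOUBLE_PRECISION,           &
     &                    MPI_COMM_WORLD, MyError)
      DO i=1,Asize
        A(i)=Arecv(i)
        DO rank=1,Nnodes-1
          SELECT CASE (op_handle(i))
            CASE ('MIN')
              A(i)=MIN(A(i), Arecv(rank*Asize+i))
            CASE ('MAX')
              A(i)=MAX(A(i), Arecv(rank*Asize+i))
            CASE ('SUM')
              A(i)=A(i)+Arecv(rank*Asize+i)
          END SELECT
        END DO
      END DO
      END SUBROUTINE sketch_reduce
 }}}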

If you want to use the same methodology as in previous versions of the code, check your version of the routine '''distribute.F''', where you will see something similar to:
 {{{
#include ""cppdefs.h""
      MODULE distribute_mod
#ifdef DISTRIBUTE

# undef  ASSEMBLE_ALLGATHER /* use mpi_allgather in mp_assemble */
# undef  ASSEMBLE_ALLREDUCE /* use mpi_allreduce in mp_assemble */
# define BOUNDARY_ALLREDUCE /* use mpi_allreduce in mp_boundary */
# undef  COLLECT_ALLGATHER  /* use mpi_allgather in mp_collect  */
# undef  COLLECT_ALLREDUCE  /* use mpi_allreduce in mp_collect  */
# define REDUCE_ALLGATHER   /* use mpi_allgather in mp_reduce   */
# undef  REDUCE_ALLREDUCE   /* use mpi_allreduce in mp_reduce   */
...
}}}
Activate the same C-preprocessing options in your header ('''*.h''') file or in your build script if you want to use the same setup.

Usually, in my applications I activate:
 {{{
#define COLLECT_ALLREDUCE
#define REDUCE_ALLGATHER
}}}

which reports to standard output the following information:
 {{{
!ASSEMBLE_ALL...    Using mpi_isend/mpi_irecv in mp_assemble routine.
...
!BOUNDARY_ALLREDUCE Using mpi_allgather in mp_boundary routine.
...
COLLECT_ALLREDUCE   Using mpi_allreduce in mp_collect routine.
...
REDUCE_ALLGATHER    Using mpi_allgather in mp_reduce routine.
}}}

I also made a few changes to several of the NetCDF managing routines to avoid opening too many files.  The routines '''netcdf_check_dim''' and '''netcdf_inq_var''' now have an additional optional argument, '''ncid''', to pass the NetCDF file ID, so it is not necessary to reopen a NetCDF file that is already open.
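
A hypothetical usage sketch (argument names like '''ng''', '''iNLM''', and '''ncname''' follow common ROMS conventions; check '''mod_netcdf.F''' for the exact interfaces):
 {{{
!  Once the ID of an already open file is in hand, pass it via the
!  new optional "ncid" argument so the routine queries the open file
!  instead of reopening it.
      CALL netcdf_inq_var (ng, iNLM, ncname,                            &
     &                     ncid = ncid)
      CALL netcdf_check_dim (ng, iNLM, ncname,                          &
     &                       ncid = ncid)
 }}}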

----

The second part of this update includes changes to the parallel I/O in ROMS using the NetCDF-4 library.  Parallel I/O only makes sense on High-Performance Computing (HPC) systems with an appropriate parallel file system. Otherwise, the I/O is still processed serially, with no improvement in computational speed-up.

In the NetCDF-4 library, parallel I/O is done in either '''independent mode''' or '''collective mode'''.  In independent mode, each processor accesses the data directly from the file and neither depends on nor is affected by the other processors.  In contrast, in '''collective mode''', all processors participate in the I/O, using MPI-I/O and HDF5 to access the tiled data.
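
A minimal stand-alone sketch of the two access modes with the NetCDF-4 Fortran-90 API (not ROMS code; the file, variable, and size are made up):
 {{{
      PROGRAM par_access_sketch
!
!  Each rank writes its own slice of a shared variable.  The access
!  mode is selected per variable with nf90_var_par_access.
!
      USE mpi
      USE netcdf
      IMPLICIT NONE
      integer, parameter :: Npts=4            ! hypothetical slice size
      integer :: MyError, MyRank, Nnodes
      integer :: ncid, dimid, varid, status
      real(kind=8) :: A(Npts)

      CALL mpi_init (MyError)
      CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)
      CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError)
!  Create a netCDF-4 file for MPI-I/O access.
      status=nf90_create('out.nc', IOR(NF90_NETCDF4, NF90_MPIIO),       &
     &                   ncid, comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
      status=nf90_def_dim(ncid, 'x', Npts*Nnodes, dimid)
      status=nf90_def_var(ncid, 'A', NF90_DOUBLE, (/ dimid /), varid)
      status=nf90_enddef(ncid)
!  Collective mode: all ranks participate in each I/O operation.
!  Use NF90_INDEPENDENT instead for independent mode.
      status=nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)
      A=REAL(MyRank,kind=8)
      status=nf90_put_var(ncid, varid, A,                               &
     &                    start=(/ MyRank*Npts+1 /), count=(/ Npts /))
      status=nf90_close(ncid)
      CALL mpi_finalize (MyError)
      END PROGRAM par_access_sketch
 }}}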

To have parallel I/O in ROMS, you need to:

* Activate the C-preprocessing options '''HDF5''' and '''PARALLEL_IO''', as shown in the example after this list.
* Compile with the parallel NetCDF-4 library. If using the build script, make sure that '''USE_PARALLEL_IO''' is activated so that the appropriate library is selected.
* Make sure that all input NetCDF files are of type netCDF-4. Use '''ncdump -k filename.nc''' to inquire about the file type:
 {{{
% ncdump -k grid_doppio.nc
netCDF-4
}}}
* Convert input files to netCDF-4, if necessary.  Use one of the following commands to convert from '''classic''' or '''64-bit offset''' format:
 {{{
nccopy -k 'netCDF-4' filetype3.nc filetype4.nc
or
nccopy -4 filetype3.nc filetype4.nc
}}}
 Make sure that '''nccopy''' is the version provided by the NetCDF-4 library.
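
For example, the two options in the first step above can be activated in your application header ('''*.h''') file as:
 {{{
#define HDF5           /* NetCDF-4/HDF5 format files */
#define PARALLEL_IO    /* parallel I/O with NetCDF-4 */
 }}}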
"	upgrade	closed	major	Release ROMS/TOMS 3.7	Parallelism	3.7	Done		
