Opened 2 years ago

Closed 2 years ago

#747 closed upgrade (Done)

Important: Updating MPI communication options and NetCDF-4 Parallel I/O

Reported by: arango Owned by:
Priority: major Milestone: Release ROMS/TOMS 3.7
Component: Parallelism Version: 3.7
Keywords: Cc:

Description (last modified by arango)

This update is very important because it includes changes that allow the user to select the most appropriate MPI library routines for ROMS distribute.F internal message-passing routines: mp_assemble, mp_boundary, mp_collect, and mp_reduce. We can either use high-level (mpi_allgather or mpi_allreduce) or low-level (mpi_isend and mpi_irecv) library routines. In some computers, the low-level routines are more efficient than the high-level routines or vice-versa. There are proprietary MPI libraries offered by the supercomputer and cluster vendors that are optimized and tuned to the computer architecture resulting in very efficient high-order communication routines. In the past, we decided for you and selected the communication methodology. In this update, we no longer make such selections and provide the user with several new C-preprocessing options to choose the appropriate methodology for your computer architecture. The user needs to benchmark their computer(s) and select the most efficient strategy.

Notice that when mpi_allreduce is used, a summation reduction is implied. The array data is stored in a global array (same size) in each parallel tile. The global array is initialized to zero always, and then each MPI node only operates on a different portion of it. Therefore, the computations are carried out by each node; a summation reduction is equivalent to each tile gathering data from other members of the MPI communicator.

Affected communication routines:

  • mp_assemble is used in the nesting algorithms to gather data from all the MPI communicator members (parallel tiles) for the expensive two-way exchange between nested grids. There are three different methods for gathering data using low-level (default) or high-level routines. The user may use C-preprocessing ASSEMBLE_ALLGATHER or ASSEMBLE_ALLREDUCE to use high-level routines mpi_allgather or mpi_allreduce, respectively. If neither option is activated, ROMS will use the low-level functions mpi_isend and mpi_irecv routines to perform the task.
         INTERFACE mp_assemble
           MODULE PROCEDURE mp_assemblef_1d
           MODULE PROCEDURE mp_assemblef_2d
           MODULE PROCEDURE mp_assemblef_3d
           MODULE PROCEDURE mp_assemblei_1d
           MODULE PROCEDURE mp_assemblei_2d
         END INTERFACE mp_assemble
    
  • mp_boundary is used to exchange open lateral boundary conditions data between parallel tiles. Each parallel tile is associated with a different MPI node. Two different methods of exchange are currently available using only high-level MPI routines. The user may use C-preprocessing option BOUNDARY_ALLREDUCE to activate exchanges with mpi_allreduce. Otherwise, ROMS will use mpi_allghater as default.
  • mp_collect is used for gathering data for 1D integer or floating-point arrays. It is used in several ROMS algorithms: floats, stations, 4D-Var observation, and parallel I/O. Again, there are three different methods for gathering this type of data using low-level (default) or high-level routines. The user may use C-preprocessing COLLECT_ALLGATHER or COLLECT_ALLREDUCE to use high-level routines mpi_allgather or mpi_allreduce, respectively. If neither option is activated, ROMS will use the low-level functions mpi_isend and mpi_irecv routines to perform the task.
         INTERFACE mp_collect
           MODULE PROCEDURE mp_collect_f
           MODULE PROCEDURE mp_collect_i
         END INTERFACE mp_collect
    
  • mp_reduce is used for global reductions according to the operation handle MIN, MAX, or SUM. Various reduction operators are possible when the data is aggregated into a single vector for efficiency. Three different methods are availabe for gathering this type of data using low-level (default) or high-level routines. The user may use C-preprocessing REDUCE_ALLGATHER or REDUCE_ALLREDUCE to use high-level routines mpi_allgather or mpi_allreduce, respectively. If neither option is activated, ROMS will use the low-level functions mpi_isend and mpi_irecv routines to perform the task.

If you want to use the same methodology as in previous versions of the code, check your version of the routine distribute.F and you will see something similar to:

#include "cppdefs.h"
      MODULE distribute_mod
#ifdef DISTRIBUTE

# undef  ASSEMBLE_ALLGATHER /* use mpi_allgather im mp_assemble */
# undef  ASSEMBLE_ALLREDUCE /* use mpi_allreduce in mp_assemble */
# define BOUNDARY_ALLREDUCE /* use mpi_allreduce in mp_boundary */
# undef  COLLECT_ALLGATHER  /* use mpi_allgather in mp_collect  */
# undef  COLLECT_ALLREDUCE  /* use mpi_allreduce in mp_collect  */
# define REDUCE_ALLGATHER   /* use mpi_allgather in mp_reduce   */
# undef  REDUCE_ALLREDUCE   /* use mpi_allreduce in mp_reduce   */
...

Activate the same C-preprocessing options in your header (*.h) or build script file if you want to use the same set-up.

Usually, in my applications I activate:

#define COLLECT_ALLREDUCE
#define REDUCE_ALLGATHER

which reports to standard output the following information:

!ASSEMBLE_ALL...    Using mpi_isend/mpi_recv in mp_assemble routine.
...
!BOUNDARY_ALLGATHER Using mpi_allreduce in mp_boundary routine.
...
COLLECT_ALLREDUCE   Using mpi_allreduce in mp_collect routine.
...
REDUCE_ALLGATHER    Using mpi_allgather in mp_reduce routine.

I also I made few changes to several of the NetCDF managing routines to avoid opening too many files. The routines netcdf_check_dim and netcdf_inq_var have an additional optional argument ncid to pass the NetCDF file ID, so it is not necessary to open a NetCDF file it is already open.


The second part of this update includes changes to the parallel I/O in ROMS using the NetCDF4 library. Parallel I/O only makes sense in High-Performance Computers (HPC) with the appropriate parallel file system. Otherwise, the I/O data is still processed serially with no improvement on the computation speed up.

In the NetCDF-4 library, the parallel I/O is either done in independent mode or collective mode. In independent mode, each processor accesses the data directly from the file and does not depend on or be affected by other processors. Contrarily in collective mode, all processors participate in doing IO using MPI-I/O and HDF5 for accessing of the tiled data.

To have parallel I/O in ROMS you need to:

  • Activate C-preprocessing option HDF5 and PARALLEL_IO.
  • Compile with the parallel NetCDF4 library. If using the build script make sure that USE_PARALLEL_IO is activated, so the appropriate library is selected.
  • Make sure that all input NetCDF files are of type netCDF-4. Use ncdump -k filename.nc to inquire about the file type:
    % ncdump -k grid_doppio.nc
    netCDF-4
    
  • Convert input files to netCDF-4, if necessary. Use the following command to convert from classic or 64-bit offset:
    ncccopy -k 'netCDF-4' filetype3.nc filetype4.nc
    or
    nccopy -4 filetype3.nc  filetype4.nc
    
    Make sure that the nccopy version is that from the NetCDF-4 library.

Change History (1)

comment:1 Changed 2 years ago by arango

  • Description modified (diff)
  • Resolution set to Done
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.