From WikiROMS
Jump to: navigation, search

ROMS uses NetCDF for all its input and output data management. The NetCDF files can be processed using the standard library developed by UNIDATA, the Parallel-IO (PIO) library developed at NCAR (Hartnett and Edwards, 2021; unpublished paper), or the Software for Cashing Output and Reads for Parallel I/O (SCORPIO) library intended for the DOE's Energy Exascale Earth Model System (E3SM). Furthermore, another parallel I/O strategy has been available in ROMS for several years with the NetCDF4/HDF5 libraries by activating the PARALLEL_IO and HDF5 CPP options.

The SCORPIO library was forked from the PIO library several years ago and evolved separately. The generic interface for parallel I/O in ROMS works for both the PIO or SCORPIO libraries and available by activating the PIO_LIB CPP option. However, we recommend using the PIO library because it is more efficient in processing I/O in our benchmark tests.

Generally, writing is usually a more frequent and more complicated operation than reading. There are four strategies for writing (Mendez et al., 2019; ​Preprint):

  1. Single file, single writer: Serial I/O in non-parallel or parallel applications. It is the default strategy in ROMS using the NetCDF3 or NetCDF4 libraries.
  2. Single file, multiple writers: Parallel I/O with each partition tile writing its data to a single file. In ROMS, this capability is achieved by activating PARALLEL_IO and HDF5. It is only possible with the NetCDF4/HDF5 libraries.
  3. Single file, collective writers: Parallel I/O with either one or a subset of processes performing I/O operations. The I/O operations can be synchronous or asynchronous. In ROMS, this capability uses the PIO or SCORPIO libraries and available when PIO_LIB is activated.
  4. Multiple files, multiple writers: Parallel I/O in which each distributed-memory or shared-memory tile decomposition writes its data into the partition file. Still, post-processing is required to pack the data into a single file. Currently, this capability is not available in ROMS but can be easily implemented within its current I/O infrastructure. However, this strategy is cumbersome and undesired in variational data assimilation (4D-Var) algorithms that require reading forwards and backward, in time, the state trajectories. As a consequence, reading becomes the bottleneck.

Serial I/O

Serial I/O is the standard option that has been in ROMS since the beginning. In this setup, all input and output data flows through the master parallel process. All data for output is collected from all processes by the master process and written to disk. Likewise, all data for input is read by the master process and distributed to the rest of the processes. When using serial I/O, files can be written in NetCDF classic/64-bit offset (NetCDF-3) or NetCDF-4/HDF5 (HDF5 CPP option) formats. File compression is available in the NetCDF-4/HDF5 library.

Parallel I/O with NetCDF-4

Parallel I/O using parallel HDF5 and NetCDF-4 has been available in ROMS for many years. This I/O option requires parallel enabled HDF5 and NetCDF-4 and is activated by defining PARALLEL_IO and HDF5 CPP options. Each parallel tile reads and writes the decomposed data. This approach does not scale well because it requires every process to participate in reading and writing, which quickly overloads the file system with requests as the number of tiles (NtileI x NtileJ) increases.

Parallel I/O with PIO or SCORPIO

PIO and SCORPIO libraries are primarily intended for ROMS distributed-memory (MPI) applications running on a large number of processes in an HPC system with a Parallel File System (like Lustre, GPFS, and so on) for high-performance I/O. The PIO and SCORPIO library uses the MPI-IO interface to facilitate the partitioning of the data across computational or dedicated I/O processes. For example, in an HPC cluster environment with a Parallel File System, the user can instruct PIO or SCORPIO to designate which processes per node to perform I/O. This is a much more reasonable approach for larger applications running on hundreds of processors. To use this Parallel I/O strategy, the PIO or SCORPIO library must be linked to ROMS at compile time by defining the PIO_LIB CPP option. It is only available in distributed-memory applications since it uses MPI-IO.

There are two modes of parallel I/O in PIO and SCORPIO:

  1. Synchronous: MPI intra-communications mode. A subset or all processes does I/O and also computations. The user specifies the total number of I/O tasks and how they are distributed across the HPC nodes as a function of the ROMS MPI-communicator object, OCN_COMM_WORLD. It is often desirable to shift the first I/O task away from the first computation task since it has higher memory requirements than other processes. If the MPI processes are spread over several computer nodes, it is highly recommended to spread all I/O tasks over all nodes. Avoid all I/O processes occupying the same node.
  2. Asynchronous: MPI inter-communications mode. The I/O tasks are a disjointed set of dedicated I/O processes and do not perform computations. It is possible to have groups of computational units running separate models (coupling) where all the I/O data are sent to dedicated processes. In ROMS, the asynchronous mode is possible by activation either ASYNCHRONOUS_PIO or ASYNCHRONOUS_SCORPIO. However, this capability is still under development and not recommended for use at this time.

The Parallel I/O configuration options are set in

  • Choose the input and output NetCDF library to use. For example, the user could choose to use the PIO library for writing but still use the standard library for reading. To use this Parallel I/O strategy, the PIO or SCORPIO library must be linked to ROMS at compile time and the PIO_LIB CPP option needs to be activated. It is only available in distributed-memory applications since it uses MPI-IO.
    ! [1] Standard NetCDF-3 or NetCDF-4 library
    ! [2] Parallel-IO from PIO or SCORPIO library (MPI, MPI-IO applications)

    INP_LIB = 2
    OUT_LIB = 2
  • PIO and SCORPIO offer several methods for reading/writing NetCDF files. SCORPIO also offers ADIOS but that is not implemented in ROMS. Depending on the build of the PIO or SCORPIO libraries, not all the I/O types are available. If the NetCDF library does not support parallel I/O, methods 3 and 4 are not available. Currently, NetCDF4/HDF5 data compression is possible with method 3 during serial write.
    ! [0] parallel read and parallel write of PnetCDF (CDF-5 type files, not recommended because of post-processing)
    ! [1] parallel read and parallel write of NetCDF3 (64-bit offset)
    ! [2] serial read and serial write of NetCDF3 (64-bit offset)
    ! [3] parallel read and serial write of NetCDF4/HDF5
    ! [4] parallel read and parallel write of NETCDF4/HDF5

    PIO_METHOD = 2
  • Parallel-IO tasks control parameters. Typically, it is advantageous and highly recommended to define the I/O decomposition in smaller number of processes for efficiency and to avoid MPI communications bottlenecks.
    PIO_IOTASKS = 1  ! number of I/O processes to define
    PIO_STRIDE = 1  ! stride in the MPI-rank between I/O processes
    PIO_BASE = 0  ! offset for the first I/O process
    PIO_AGGREG = 1  ! number of MPI-aggregators to use
  • Parallel-IO (PIO or SCORPIO) rearranger methods for moving data between computational and I/O processes. It provides the ability to rearrange data between computational and parallel I/O decompositions. Usually the Box rearrangement is more efficient.
    ! [1] Box rearrangement
    ! [2] Subset rearrangement

    PIO_REARR = 1
    • In the box method, data is rearranged from computational to I/O processes in a continuous manner to the data ordering in the file. Since the ordering of data between computational and I/O partitions may be different, the rearrangement will require all-to-all MPI communications. Also, notice that each computing tile may transfer data to one or more I/O processes.
    • In the subset method, each I/O process is associated with a subset of computing processes. The computing tile sends its data to a unique I/O process. The data on I/O processes may be more fragmented to the ordering on disk, which may increase the communications to the storage medium. However, the rearrangement scales better since all-to-all MPI communications are not required.
  • Parallel-IO (PIO or SCORPIO) rearranger flag for MPI communication between computational and I/O processes. In some systems, the Point-to-Point communications is more efficient.
    ! [0] Point-to-Point communications
    ! [1] Collective communications

  • Parallel-IO (PIO or SCORPIO) rearranger flow-control direction flag for MPI communications between computational and I/O processes. The flow algorithm controls the rate and volume of messages sent to any destination MPI process. Optimally, the MPI communications should be designed to send a modest number of messages evenly distributed across a number of processes. An excessive number of messages to a single MPI process can exhaust the buffer space which can affect efficiency or failure due to the slowdown in the retransmitting of dropped messages. It only sends messages (Isend) when the receiver is ready and has sufficient resources.
    ! [0] Enable computational to I/O processes, and vice versa
    ! [1] Enable computational to I/O processes only
    ! [2] Enable I/O to computational processes only
    ! [3] Disable flow control

  • Parallel-IO (PIO or SCORPIO) rearranger options for MPI communications from computational to I/O processes (C2I).
    PIO_C2I_HS = T  ! Enable C2I handshake (T/F)
    PIO_C2I_Send = F  ! Enable C2I Isends (T/F)
    PIO_I2C_HS = 64  ! Maximum pending C2I requests
  • Parallel-IO (PIO or SCORPIO) rearranger options for MPI communications from I/O to computational processes (I2C).
    PIO_I2C_HS = F  ! Enable I2C handshake (T/F)
    PIO_I2C_Send = T  ! Enable I2C Isends (T/F)
    PIO_I2C_Preq = 65  ! Maximum pending I2C requests