From WikiROMS

ROMS uses NetCDF for all its input and output data management. The NetCDF files can be processed with the standard library developed by UNIDATA, the Parallel-IO (PIO) library developed at NCAR (Hartnett and Edwards, 2021; unpublished paper), or the Software for Caching Output and Reads for Parallel I/O (SCORPIO) library intended for the DOE's Energy Exascale Earth System Model (E3SM). In addition, another parallel I/O strategy has been available in ROMS for several years through the NetCDF4/HDF5 libraries by activating the PARALLEL_IO and HDF5 CPP options.

The SCORPIO library was forked from the PIO library several years ago and evolved separately. The generic interface for parallel I/O in ROMS works with either the PIO or SCORPIO library and is available by activating the PIO_LIB CPP option. However, we recommend the PIO library because it processed I/O more efficiently in our benchmark tests.

Writing is usually a more frequent and more complicated operation than reading. There are four strategies for writing (Mendez et al., 2019; Preprint):

  1. Single file, single writer: Serial I/O in non-parallel or parallel applications. It is the default strategy in ROMS using the NetCDF3 or NetCDF4 libraries.
  2. Single file, multiple writers: Parallel I/O with each partition tile writing its data to a single file. In ROMS, this capability is achieved by activating PARALLEL_IO and HDF5. It is only possible with the NetCDF4/HDF5 libraries.
  3. Single file, collective writers: Parallel I/O with either one or a subset of processes performing the I/O operations, which can be synchronous or asynchronous. In ROMS, this capability uses the PIO or SCORPIO libraries and is available when PIO_LIB is activated.
  4. Multiple files, multiple writers: Parallel I/O in which each distributed-memory or shared-memory tile writes its data into its own partition file. Post-processing is then required to pack the data into a single file. Currently, this capability is not available in ROMS, although it could be implemented easily within the current I/O infrastructure. However, this strategy is cumbersome and undesirable in variational data assimilation (4D-Var) algorithms, which require reading the state trajectories forwards and backwards in time. As a consequence, reading becomes the bottleneck.

Serial I/O

Serial I/O is the standard option that has been in ROMS since the beginning. In this setup, all input and output data flows through the master parallel process. All data for output is collected from all processes by the master process and written to disk. Likewise, all data for input is read by the master process and distributed to the rest of the processes. When using serial I/O, files can be written in NetCDF classic/64-bit offset (NetCDF-3) or NetCDF-4/HDF5 (HDF5 CPP option) formats. File compression is available in the NetCDF-4/HDF5 library.

Parallel I/O with NetCDF-4

Parallel I/O using parallel HDF5 and NetCDF-4 has been available in ROMS for many years. This option requires parallel-enabled HDF5 and NetCDF-4 libraries and is activated by defining the PARALLEL_IO and HDF5 CPP options. Each parallel tile reads and writes its decomposed data. This approach does not scale well because it requires every process to participate in reading and writing, which quickly overloads the file system with requests as the number of tiles (NtileI x NtileJ) increases.
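For instance, the number of processes issuing simultaneous file-system requests grows directly with the tile partition; a trivial sketch (the tile counts below are hypothetical):

```python
# With PARALLEL_IO and HDF5 defined, every tile is a reader/writer,
# so the concurrent file-system load scales with the partition size.
NtileI, NtileJ = 16, 24          # hypothetical domain partition
writers = NtileI * NtileJ        # processes hitting the file system
print(writers)                   # 384
```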

Parallel I/O with PIO or SCORPIO

PIO and SCORPIO are primarily intended for ROMS distributed-memory (MPI) applications running on a large number of processes in an HPC system with a parallel file system (like Lustre, GPFS, and so on) for high-performance I/O. Both libraries use the MPI-IO interface to facilitate partitioning the data across computational or dedicated I/O processes. For example, in an HPC cluster environment with a parallel file system, the user can instruct PIO or SCORPIO to designate which processes on each node perform I/O. This is a much more reasonable approach for larger applications running on hundreds of processes. To use this parallel I/O strategy, the PIO or SCORPIO library must be linked to ROMS at compile time by defining the PIO_LIB CPP option. It is only available in distributed-memory applications since it uses MPI-IO.

There are two modes of parallel I/O in PIO and SCORPIO:

  1. Synchronous: MPI intra-communications mode. A subset of processes, or all of them, perform both I/O and computations. The user specifies the total number of I/O tasks and how they are distributed across the HPC nodes as a function of the ROMS MPI-communicator object, OCN_COMM_WORLD. It is often desirable to shift the first I/O task away from the first computation task since the latter has higher memory requirements than the other processes. If the MPI processes are spread over several computer nodes, it is highly recommended to spread all the I/O tasks over all the nodes as well; avoid having all the I/O processes on the same node.
  2. Asynchronous: MPI inter-communications mode. The I/O tasks are a disjoint set of dedicated I/O processes that do not perform computations. It is possible to have groups of computational units running separate models (coupling) where all the I/O data are sent to dedicated processes. In ROMS, the asynchronous mode is enabled by activating either ASYNCHRONOUS_PIO or ASYNCHRONOUS_SCORPIO. However, this capability is still under development and not recommended for use at this time.

The Parallel I/O configuration options are set in the ROMS standard input script:

  • Choose the input and output NetCDF library to use. For example, the user could choose the PIO library for writing but still use the standard library for reading. Recall that the PIO or SCORPIO library must be linked to ROMS at compile time and the PIO_LIB CPP option activated.
    ! [1] Standard NetCDF-3 or NetCDF-4 library
    ! [2] Parallel-IO from PIO or SCORPIO library (MPI, MPI-IO applications)

    INP_LIB = 2
    OUT_LIB = 2
  • PIO and SCORPIO offer several methods for reading/writing NetCDF files. SCORPIO also offers ADIOS, but that is not implemented in ROMS. Depending on how the PIO or SCORPIO libraries were built, not all I/O types are available. If the NetCDF library does not support parallel I/O, methods 3 and 4 are not available. Currently, NetCDF4/HDF5 data compression is possible with method 3 during serial write.
    ! [0] parallel read and parallel write of PnetCDF (CDF-5 type files, not recommended because of post-processing)
    ! [1] parallel read and parallel write of NetCDF3 (64-bit offset)
    ! [2] serial read and serial write of NetCDF3 (64-bit offset)
    ! [3] parallel read and serial write of NetCDF4/HDF5
    ! [4] parallel read and parallel write of NetCDF4/HDF5

    PIO_METHOD = 2
  • Parallel-IO tasks control parameters. Typically, it is advantageous and highly recommended to define the I/O decomposition over a smaller number of processes for efficiency and to avoid MPI communication bottlenecks.
    PIO_IOTASKS = 1  ! number of I/O processes to define
    PIO_STRIDE = 1  ! stride in the MPI-rank between I/O processes
    PIO_BASE = 0  ! offset for the first I/O process
    PIO_AGGREG = 1  ! number of MPI-aggregators to use
  • Parallel-IO (PIO or SCORPIO) rearranger method for moving data between the computational and parallel I/O decompositions. Usually, the box rearrangement is more efficient.
    ! [1] Box rearrangement
    ! [2] Subset rearrangement

    PIO_REARR = 1
    • In the box method, data is rearranged from computational to I/O processes so that it is contiguous with the data ordering in the file. Since the ordering of data between the computational and I/O partitions may differ, the rearrangement requires all-to-all MPI communications. Also, notice that each computing tile may transfer data to one or more I/O processes.
    • In the subset method, each I/O process is associated with a subset of computing processes, and each computing tile sends its data to a unique I/O process. The data on I/O processes may be more fragmented relative to the ordering on disk, which may increase the communications to the storage medium. However, this rearrangement scales better since all-to-all MPI communications are not required.
  • Parallel-IO (PIO or SCORPIO) rearranger flag for MPI communication between computational and I/O processes. In some systems, Point-to-Point communications are more efficient.
    ! [0] Point-to-Point communications
    ! [1] Collective communications

  • Parallel-IO (PIO or SCORPIO) rearranger flow-control direction flag for MPI communications between computational and I/O processes. The flow-control algorithm limits the rate and volume of messages sent to any destination MPI process. Optimally, the MPI communications should send a modest number of messages evenly distributed across processes. An excessive number of messages to a single MPI process can exhaust its buffer space, which can degrade efficiency or cause failure due to the slowdown from retransmitting dropped messages. With flow control enabled, messages are only sent (Isend) when the receiver is ready and has sufficient resources.
    ! [0] Enable computational to I/O processes, and vice versa
    ! [1] Enable computational to I/O processes only
    ! [2] Enable I/O to computational processes only
    ! [3] Disable flow control

  • Parallel-IO (PIO or SCORPIO) rearranger options for MPI communications from computational to I/O processes (C2I).
    PIO_C2I_HS = T  ! Enable C2I handshake (T/F)
    PIO_C2I_Send = F  ! Enable C2I Isends (T/F)
    PIO_C2I_Preq = 64  ! Maximum pending C2I requests
  • Parallel-IO (PIO or SCORPIO) rearranger options for MPI communications from I/O to computational processes (I2C).
    PIO_I2C_HS = F  ! Enable I2C handshake (T/F)
    PIO_I2C_Send = T  ! Enable I2C Isends (T/F)
    PIO_I2C_Preq = 65  ! Maximum pending I2C requests
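The task-placement parameters above (PIO_IOTASKS, PIO_STRIDE, PIO_BASE) and the subset rearrangement can be illustrated with a small sketch. This is plain Python with no MPI, and the contiguous-block assignment in subset_io_task is a simplification for illustration, not the exact PIO algorithm:

```python
def pio_io_ranks(iotasks, stride, base):
    """MPI ranks acting as I/O tasks: base, base + stride, ..."""
    return [base + i * stride for i in range(iotasks)]

def subset_io_task(rank, nprocs, niotasks):
    """Toy subset rearrangement: contiguous blocks of compute ranks
    each send their data to a single I/O task (illustrative only)."""
    block = -(-nprocs // niotasks)  # ceiling division
    return rank // block

# 16 MPI processes, 4 I/O tasks with stride 4 and base 1, which
# shifts the first I/O task away from rank 0 (the master process):
print(pio_io_ranks(4, 4, 1))                        # [1, 5, 9, 13]

# 8 compute ranks feeding 2 I/O tasks in the subset method:
print([subset_io_task(r, 8, 2) for r in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Note how a stride larger than 1 spreads the I/O tasks across nodes, as recommended above.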

Output Multi-Files

Sometimes, it is advantageous to time-split ROMS output data (averages, diagnostics, history, and quicksave) into multiple NetCDF files to avoid creating huge files on disk in applications with large grids. Smaller files are easier to handle and can be concatenated in OPeNDAP catalogs.

For example, suppose an application for a particular region with a substantial grid size needs to be run for one year. The ROMS timestep is DT = 300 seconds, as shown below. Then, one could split the history and quicksave output into daily NetCDF files containing single or multiple time records written every 3, 6, or 24 hours. In this case, ROMS will generate a sequence of files with a suffix appended to each filename. Therefore, we need the following parameters in the standard input script:

  • Timestepping parameters.
    NTIMES = 105120  ! Number of timesteps (288 steps per day; 365 days simulation)
    DT == 300.0d0  ! Timestep size (seconds)
    NDTFAST == 30  ! Number of barotropic steps
  • Flags controlling the frequency of output.
    NRREC = 0  ! Model restart flag
    LcycleRST == T  ! Switch to recycle restart time records (file with only 2 cycling time records)
    NRST == 288  ! Number of timesteps between writing restart records (daily)
    NSTA == 1  ! Number of timesteps between stations records
    NFLT == 1  ! Number of timesteps between floats records
    NINFO == 1  ! Number of timesteps between printing information diagnostics
  • Output history, average, and diagnostic file parameters.
    LDEFOUT == T  ! File creation/append switch
    NHIS == 72  ! Number of timesteps between writing history records (every 6 hours)
    NDEFHIS == 288  ! Number of timesteps between the creation of new history files (daily files)
    NQCK == 36  ! Number of timesteps between writing quicksave records (every 3 hours)
    NDEFQCK == 288  ! Number of timesteps between the creation of new quicksave files (daily files)
    NTSAVG == 1  ! Starting averages timestep
    NAVG == 288  ! Number of timesteps between writing averages records (daily averages)
    NDEFAVG == 288  ! Number of timesteps between the creation of new averages file (daily, single record files)
    NTSDIA == 1  ! Starting diagnostics timestep
    NDIA == 288  ! Number of timesteps between writing diagnostics records (daily averages)
    NDEFDIA == 288  ! Number of timesteps between the creation of new diagnostics file (daily, single record files)
  • Timestamp assigned for model initialization, reference time origin for tidal forcing, and model reference time for output NetCDF units attribute.
    DSTART = 365.0d0  ! days (ROMS is initialized on 2007-01-01 00:00:00Z)
    TIDE_START = 0.0d0  ! days (zero phase date is set when preparing the tidal data for 2006-01-01)
    TIME_REF = 20060101.0d0  ! yyyymmdd.dd (Very important: ROMS time is seconds since 2006-01-01 00:00:00Z)
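The arithmetic behind these settings is easy to verify; a quick check (Python, using the values above) of the steps-per-day count, NTIMES, the output intervals, and the DSTART offset:

```python
from datetime import date

DT = 300.0                          # timestep size (seconds)
steps_per_day = 86400 / DT          # 288 steps per day
NTIMES = int(steps_per_day) * 365   # one-year simulation

print(int(steps_per_day))  # 288
print(NTIMES)              # 105120

# output intervals implied by the frequency parameters (hours):
print(72 * DT / 3600)      # 6.0   NHIS:  history every 6 hours
print(36 * DT / 3600)      # 3.0   NQCK:  quicksave every 3 hours
print(288 * DT / 3600)     # 24.0  NAVG, NDIA, NRST: daily

# DSTART: days from TIME_REF (2006-01-01) to initialization (2007-01-01)
DSTART = (date(2007, 1, 1) - date(2006, 1, 1)).days
print(DSTART)              # 365 (2006 is not a leap year)
```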

Notice that all the parameters are exact integer multiples of each other:

    MOD(NDEFHIS, NHIS) = MOD(288, 72)  = 0
    MOD(NDEFQCK, NQCK) = MOD(288, 36)  = 0
    MOD(NDEFAVG, NAVG) = MOD(288, 288) = 0
    MOD(NDEFDIA, NDIA) = MOD(288, 288) = 0

where the Fortran intrinsic function MOD(X,Y) computes the remainder of the division of X by Y and must always be zero for the ROMS multi-file option to work. Notice that the first file in the history data series will contain 5 time records because of the initial conditions, and the rest of the files will have 4 time records. Similarly, the first file in the quicksave series will contain 9 time records, and the rest will have 8 time records. The time-averaged data in the averages and diagnostics files are single-record files representing daily averaged fields. The above analysis also holds when converted to time in seconds since every parameter is multiplied by the DT timestep.
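The record counts quoted above follow from the NDEF*/N* ratios, plus one extra record in the first file of each series for the initial conditions; a quick check with a hypothetical helper:

```python
def records_per_file(ndef, nout, first=False):
    """Time records per multi-file: ndef/nout writes per file, plus
    the initial-conditions record in the first file of the series."""
    return ndef // nout + (1 if first else 0)

print(records_per_file(288, 72, first=True))   # 5: first history file
print(records_per_file(288, 72))               # 4: later history files
print(records_per_file(288, 36, first=True))   # 9: first quicksave file
print(records_per_file(288, 36))               # 8: later quicksave files
print(records_per_file(288, 288))              # 1: averages/diagnostics
```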

WARNING: If running your application on a supercomputer with limited-time job queues, long simulations will require a restart. You cannot change the values of NTIMES or DSTART, or the multi-file sequence will fail. If DSTART is changed, it will reset the internal timestep counter iic(ng), and we are in deep trouble! Notice that the value of NRST is crucial for restarting ROMS with the multi-file option. In addition, we need the following mathematical equalities:

    MOD(NRST, NHIS) = MOD(288, 72)  = 0
    MOD(NRST, NQCK) = MOD(288, 36)  = 0
    MOD(NRST, NAVG) = MOD(288, 288) = 0
    MOD(NRST, NDIA) = MOD(288, 288) = 0

Also, it is trivial to restart if NAVG = NDIA = NRST because of the accumulation sums used when computing the time-averaged fields. If balancing terms and budgets from the output data, you will also need NAVG = NDIA.
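The restart constraints described above can be spot-checked the same way; a sketch (Python, values from the example configuration):

```python
NRST, NHIS, NQCK, NAVG, NDIA = 288, 72, 36, 288, 288

# the restart step must coincide with every output-record boundary
ok = all(NRST % n == 0 for n in (NHIS, NQCK, NAVG, NDIA))
print(ok)                    # True

# trivial restart of the accumulation sums for time-averaged fields
print(NAVG == NDIA == NRST)  # True
```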