id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
424	Tuned parallel I/O capabilities	arango	arango	"As the '''NetCDF-4/HDF5''' libraries continue to evolve, I continue looking at the '''parallel I/O''' interface in ROMS. The '''NetCDF 4.1.1''' made few changes and optimizations. However, the MPI interface in ROMS is efficient with low overhead making the '''serial I/O''' by the '''master''' thread very effective.

In the past, the efficiency of '''parallel I/O''' in ROMS has been affected by various non-tiled variables that are written into output NetCDF files.  I looked at this issue again and discovered that it is very inefficient to write characters in '''parallel I/O'''. Therefore, I changed the character variables representing logical switches to integers ('''0=.FALSE.''' and '''1=.TRUE.''').  This improved the performance.  I now get nearly the same performance whether or not I write these non-tiled variables.

I added a routine interface, '''netcdf_get_lvar''', to module '''mod_netcdf.F''' to read logical variables into ROMS. It checks whether the input variable is an integer ('''0''' or '''1''') or a character ('''F''' or '''T''') and processes the data accordingly.  The only logical variable that ROMS needs at input is the '''spherical''' switch.
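
The decoding rule described above can be sketched as follows. This is only an illustration and not the actual '''netcdf_get_lvar''' code: the function names '''switch_from_int''' and '''switch_from_char''' are hypothetical, and the NetCDF type inquiry and read calls are omitted.
{{{
      program decode_switch_sketch
        implicit none
        ! Integer encoding: 1 = .TRUE., 0 = .FALSE.
        print *, switch_from_int(1), switch_from_int(0)
        ! Character encoding: 'T' = .TRUE., 'F' = .FALSE.
        print *, switch_from_char('T'), switch_from_char('F')
      contains
        ! Hypothetical helper: decode a switch stored as an integer.
        logical function switch_from_int(ival)
          integer, intent(in) :: ival
          switch_from_int = (ival == 1)
        end function switch_from_int
        ! Hypothetical helper: decode a switch stored as a character.
        ! Assumes cval holds at least one character.
        logical function switch_from_char(cval)
          character (len=*), intent(in) :: cval
          switch_from_char = (cval(1:1) == 'T')
        end function switch_from_char
      end program decode_switch_sketch
}}}
In a real reader, the choice between the two branches would follow from the external type of the variable in the input file.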

I also set the default parallel access to '''collective''' for both non-tiled and tiled variables.  Now, we have in '''mod_netcdf.F''':
{{{
      integer, parameter :: IO_nontiled_access = 1   ! nf90_collective
      integer, parameter :: IO_tiled_access    = 1   ! nf90_collective
}}}
The parallel access flags '''nf90_independent''' and '''nf90_collective''' were missing in module '''netcdf.mod''' in early versions of the '''NetCDF 4.x''' library.  Their usual values are:
{{{
             nf_independent = 0,    nf90_independent = 0
             nf_collective  = 1,    nf90_collective  = 1
}}}
Recall that two modes of '''parallel I/O''' access are possible: '''Independent''' and '''Collective'''. '''Independent I/O''' access means that each process performs its I/O without depending on, or being affected by, the other parallel processes (nodes). In contrast, '''Collective I/O''' access implies that all parallel processes participate in the operation. This is the case for tiled variables: each node in the group reads/writes its own tile data when '''parallel I/O''' is activated.
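
As a minimal sketch of how an access mode is applied to a variable, the program below has each MPI process write its own segment of a one-dimensional variable collectively. It assumes a NetCDF-4 Fortran library built with parallel HDF5 support and an MPI launcher. The file name '''sketch.nc''', the variable '''field''', and the 4-point tile size are made up for illustration, and this is not the ROMS implementation.
{{{
      program par_access_sketch
        use mpi
        use netcdf
        implicit none
        ! Literal value as in mod_netcdf.F (1 = nf90_collective).
        integer, parameter :: IO_tiled_access = 1
        integer :: ierr, rank, nprocs, ncid, dimid, varid
        integer :: start(1), total(1)
        real    :: tile(4)

        call MPI_Init (ierr)
        call MPI_Comm_rank (MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size (MPI_COMM_WORLD, nprocs, ierr)

        ! All processes create and define the file together.
        call check (nf90_create('sketch.nc',                           &
     &                          IOR(nf90_netcdf4, nf90_mpiio), ncid,   &
     &                          comm = MPI_COMM_WORLD,                 &
     &                          info = MPI_INFO_NULL))
        call check (nf90_def_dim(ncid, 'x', 4*nprocs, dimid))
        call check (nf90_def_var(ncid, 'field', nf90_float, dimid,     &
     &                           varid))
        call check (nf90_enddef(ncid))

        ! Select collective access for this variable.
        call check (nf90_var_par_access(ncid, varid, IO_tiled_access))

        ! Each process writes only its own 4-point segment (its tile).
        tile = real(rank)
        start(1) = 4*rank + 1
        total(1) = 4
        call check (nf90_put_var(ncid, varid, tile, start = start,     &
     &                           count = total))

        call check (nf90_close(ncid))
        call MPI_Finalize (ierr)
      contains
        ! Stop with the library error message if a call fails.
        subroutine check (stat)
          integer, intent(in) :: stat
          if (stat /= nf90_noerr) then
            print *, trim(nf90_strerror(stat))
            stop 1
          end if
        end subroutine check
      end program par_access_sketch
}}}
With collective access, every process in the communicator has to make the write call, even if the segments differ. Setting the flag to '''0''' ('''nf90_independent''') would let each process write without coordinating with the others.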

I ran the ROMS benchmark with a 512x64x30 grid on my desktop Linux box (2 cores, Xeon chip) using 8 processors with a 4x2 partition, 500 steps, and I/O every 50 steps to both the history and averages files. I get the following timings when the files are written on my desktop disk:
{{{
serial I/O     817.344u 15.729s 1:45.21 791.8% 0+0k 0+0io 18pf+0w
parallel I/O   833.523u  4.140s 1:51.01 754.5% 0+0k 0+0io 17pf+0w
}}}
The '''serial I/O''' is '''5.80''' seconds faster in elapsed time than '''parallel I/O''' (1:45.21 versus 1:51.01).

Now if I write to another disk through a network file system, I get the following timings:
{{{
serial I/O     892.785u 15.661s 1:55.96 783.4% 0+0k 0+0io 19pf+0w
parallel I/O  1149.900u  4.899s 2:31.70 761.2% 0+0k 0+0io 20pf+0w
}}}
The '''serial I/O''' is '''35.74''' seconds faster in elapsed time than '''parallel I/O''' (1:55.96 versus 2:31.70). These values may fluctuate depending on the network traffic.

'''Therefore, when benchmarking parallel I/O in ROMS you need to take into account the file system'''."	upgrade	closed	major	Parallel Input/Output	Parallelism	3.4	Done