Opened 14 years ago

Closed 14 years ago

#424 closed upgrade (Done)

Tuned parallel I/O capabilities

Reported by: arango
Owned by: arango
Priority: major
Milestone: Parallel Input/Output
Component: Parallelism
Version: 3.4
Keywords:
Cc:

Description

As the NetCDF-4/HDF5 libraries continue to evolve, I keep looking at the parallel I/O interface in ROMS. NetCDF 4.1.1 made a few changes and optimizations. However, the MPI interface in ROMS is efficient and has low overhead, which makes serial I/O by the master thread very effective.

In the past, the efficiency of parallel I/O in ROMS has been affected by various non-tiled variables that are written into output NetCDF files. I looked at this issue again and discovered that writing character variables with parallel I/O is very inefficient. Therefore, I changed the character variables representing logical switches to integers (0=.FALSE. and 1=.TRUE.). This improved the performance: I now get nearly the same performance whether or not I write these non-tiled variables.

I added a routine interface, netcdf_get_lvar, to module mod_netcdf.F to read logical variables into ROMS. It checks whether the input variable is an integer (0 or 1) or a character (F or T) and processes the data accordingly. The only logical variable that ROMS needs at input is the spherical switch.
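The decoding rule described above can be sketched as follows. This is a minimal Python illustration of the logic only, not the Fortran code in mod_netcdf.F; the function name decode_logical and its error handling are hypothetical.

```python
def decode_logical(value):
    """Decode a logical switch stored as an integer (0/1) or a character (F/T).

    Hypothetical helper: it mirrors the rule described for netcdf_get_lvar,
    not the actual ROMS implementation.
    """
    # Character data read from a NetCDF file may arrive as bytes.
    if isinstance(value, bytes):
        value = value.decode("ascii")

    if isinstance(value, str):
        flag = value.strip().upper()
        if flag in ("T", "F"):
            return flag == "T"
        raise ValueError(f"expected 'T' or 'F', got {value!r}")

    # Integer encoding: 0 = .FALSE., 1 = .TRUE.
    if value in (0, 1):
        return bool(value)
    raise ValueError(f"expected 0 or 1, got {value!r}")


if __name__ == "__main__":
    # Older files store the spherical switch as a character,
    # newer files store it as an integer.
    print(decode_logical("T"), decode_logical(0))
```

Accepting both encodings is what keeps input files written before the switch to integers readable.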

I also set the default parallel access to collective for both non-tiled and tiled variables. Now, we have in mod_netcdf.F:

      integer, parameter :: IO_nontiled_access = 1   ! nf90_collective
      integer, parameter :: IO_tiled_access    = 1   ! nf90_collective

The parallel access flags nf90_independent and nf90_collective were missing from module netcdf.mod in early versions of the NetCDF 4.x library. Usually, their values are:

             nf_independent = 0,    nf90_independent = 0
             nf_collective  = 1,    nf90_collective  = 1

Recall that two modes of parallel I/O access are possible: Independent and Collective. Independent I/O access means that the processing done by one parallel process (node) does not depend on and is not affected by the others. In contrast, Collective I/O access implies that all parallel processes participate in the operation. This is the case for tiled variables: each node in the group reads/writes its own tile data when parallel I/O is activated.

I ran the ROMS benchmark with a 512x64x30 grid on my desktop Linux box (2 cores, Xeon chip) using 8 processors with a 4x2 partition, 500 steps, and I/O every 50 steps to both the history and averages files. I get the following timings when the files are written to my desktop disk:

serial I/O     817.344u 15.729s 1:45.21 791.8% 0+0k 0+0io 18pf+0w
parallel I/O   833.523u  4.140s 1:51.01 754.5% 0+0k 0+0io 17pf+0w

Serial I/O is 5.80 seconds of elapsed time faster than parallel I/O (1:45.21 versus 1:51.01).

Now if I write to another disk through a network file system, I get the following timings:

serial I/O     892.785u 15.661s 1:55.96 783.4% 0+0k 0+0io 19pf+0w
parallel I/O  1149.900u  4.899s 2:31.70 761.2% 0+0k 0+0io 20pf+0w

Serial I/O is 35.74 seconds of elapsed time faster than parallel I/O (1:55.96 versus 2:31.70). These values may fluctuate depending on network traffic.
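The elapsed-time differences can be checked with a short calculation. The sketch below assumes the third field of each timing line is the elapsed wall-clock time in minutes:seconds, as printed by the csh time built-in; the helper names are hypothetical.

```python
def elapsed_seconds(field):
    """Convert an elapsed-time field such as '1:45.21' to seconds."""
    seconds = 0.0
    for part in field.split(":"):
        seconds = 60.0 * seconds + float(part)
    return seconds


def parallel_penalty(serial, parallel):
    """Extra elapsed seconds taken by the parallel I/O run."""
    return round(elapsed_seconds(parallel) - elapsed_seconds(serial), 2)


# Elapsed-time fields copied from the timings above: (serial, parallel).
runs = {
    "local disk": ("1:45.21", "1:51.01"),
    "network file system": ("1:55.96", "2:31.70"),
}

if __name__ == "__main__":
    for label, (serial, parallel) in runs.items():
        extra = parallel_penalty(serial, parallel)
        relative = 100.0 * extra / elapsed_seconds(serial)
        print(f"{label}: parallel I/O is {extra:.2f} s slower ({relative:.1f}%)")
```

Expressing the penalty relative to the serial run makes the effect of the file system easier to compare across the two cases.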

Therefore, when benchmarking parallel I/O in ROMS, you need to take the file system into account.

Change History (1)

comment:1 by arango, 14 years ago

Resolution: Done
Status: new → closed