Opened 15 years ago
Closed 15 years ago
#424 closed upgrade (Done)
Tuned parallel I/O capabilities
Reported by: | arango | Owned by: | arango |
---|---|---|---|
Priority: | major | Milestone: | Parallel Input/Output |
Component: | Parallelism | Version: | 3.4 |
Keywords: | | Cc: | |
Description
As the NetCDF-4/HDF5 libraries continue to evolve, I continue looking at the parallel I/O interface in ROMS. NetCDF 4.1.1 made a few changes and optimizations. However, the MPI interface in ROMS is efficient and has low overhead, which makes serial I/O by the master thread very effective.
In the past, the efficiency of parallel I/O in ROMS has been affected by various non-tiled variables that are written into output NetCDF files. I looked at this issue again and discovered that writing characters in parallel I/O is very inefficient. Therefore, I changed the character variables representing logical switches to integers (0=.FALSE. and 1=.TRUE.). This improved the performance: I now get nearly the same performance whether or not I write these non-tiled variables.
I added a routine interface, netcdf_get_lvar, to module mod_netcdf.F to read logical variables into ROMS. It checks whether the input variable is an integer (0 or 1) or a character (F or T) and processes the data accordingly. The only logical variable that ROMS needs at input is the spherical switch.
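The type check described above can be sketched as follows. This is only an illustration of the logic, written in Python for brevity rather than in the Fortran of mod_netcdf.F. The function name decode_logical is hypothetical and is not the code of netcdf_get_lvar.

```python
def decode_logical(value):
    """Hypothetical helper: map a stored switch to a Python bool.

    Accepts the integer convention (0 or 1) and the older character
    convention (F or T). Anything else is rejected.
    """
    # bool is a subclass of int in Python, so handle it first.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if value in (0, 1):
            return value == 1
        raise ValueError(f"integer switch must be 0 or 1, got {value}")
    if isinstance(value, (str, bytes)):
        text = value.decode("ascii") if isinstance(value, bytes) else value
        flag = text.strip().upper()
        if flag == "T":
            return True
        if flag == "F":
            return False
        raise ValueError(f"character switch must be F or T, got {text!r}")
    raise TypeError(f"unsupported switch type: {type(value).__name__}")


# The spherical switch read from an old (character) or new (integer) file.
print(decode_logical("T"), decode_logical(0))
```

Accepting both conventions is what allows files written before and after the change to be read by the same code.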
I also set as default the parallel access to collective for both non-tiled and tiled variables. Now, we have in mod_netcdf.F:
    integer, parameter :: IO_nontiled_access = 1   ! nf90_collective
    integer, parameter :: IO_tiled_access = 1      ! nf90_collective
The parallel access flags nf90_independent and nf90_collective were missing in module netcdf.mod in early versions of the NetCDF 4.x library. Usually,
    nf_independent = 0, nf90_independent = 0
    nf_collective = 1, nf90_collective = 1
Recall that two modes of parallel I/O access are possible: independent and collective. Independent I/O access means that the processing does not depend on, and is not affected by, other parallel processes (nodes). In contrast, collective I/O access implies that all parallel processes participate during processing. This is the case for tiled variables: each node in the group reads/writes its own tile data when parallel I/O is activated.
I ran the ROMS benchmark with a 512x64x30 grid on my desktop Linux box (2 cores, Xeon chip) using 8 processors with a 4x2 partition, 500 steps, and I/O every 50 steps to both the history and averages files. I get the following timings when the files are written on my desktop disk:
    serial I/O     817.344u 15.729s 1:45.21 791.8% 0+0k 0+0io 18pf+0w
    parallel I/O   833.523u  4.140s 1:51.01 754.5% 0+0k 0+0io 17pf+0w
Serial I/O is 5.80 seconds faster in elapsed time than parallel I/O.
Now if I write to another disk through a network file system, I get the following timings:
    serial I/O     892.785u 15.661s 1:55.96 783.4% 0+0k 0+0io 19pf+0w
    parallel I/O  1149.900u  4.899s 2:31.70 761.2% 0+0k 0+0io 20pf+0w
Serial I/O is 35.74 seconds faster in elapsed time than parallel I/O. These values may fluctuate depending on the network traffic.
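For anyone repeating this comparison, the elapsed-time differences can be recomputed directly from the timing lines. The sketch below assumes the lines use the default output format of the csh/tcsh time built-in (user seconds, system seconds, elapsed time, CPU percentage). The function names and the RUNS dictionary are illustrative and are not part of ROMS.

```python
def parse_elapsed(field):
    """Convert an elapsed-time field such as '1:45.21' to seconds."""
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60.0 + float(part)
    return seconds


def parse_time_line(line):
    """Extract the first four fields of a csh-style timing line."""
    user, system, elapsed, cpu = line.split()[:4]
    return {
        "user": float(user.rstrip("u")),
        "system": float(system.rstrip("s")),
        "elapsed": parse_elapsed(elapsed),
        "cpu_percent": float(cpu.rstrip("%")),
    }


def elapsed_gap(run):
    """Elapsed seconds by which parallel I/O is slower than serial I/O."""
    serial = parse_time_line(run["serial"])
    parallel = parse_time_line(run["parallel"])
    return parallel["elapsed"] - serial["elapsed"]


# Timing lines copied from this ticket.
RUNS = {
    "local disk": {
        "serial": "817.344u 15.729s 1:45.21 791.8% 0+0k 0+0io 18pf+0w",
        "parallel": "833.523u 4.140s 1:51.01 754.5% 0+0k 0+0io 17pf+0w",
    },
    "network file system": {
        "serial": "892.785u 15.661s 1:55.96 783.4% 0+0k 0+0io 19pf+0w",
        "parallel": "1149.900u 4.899s 2:31.70 761.2% 0+0k 0+0io 20pf+0w",
    },
}

for name, run in RUNS.items():
    print(f"{name}: parallel I/O slower by {elapsed_gap(run):.2f} s elapsed")
```

Comparing elapsed time rather than user time matters here, because the user time is summed over all processes while the elapsed time is what the run actually costs on the wall clock.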
Therefore, when benchmarking parallel I/O in ROMS, you need to take the file system into account.