Parallel I/O benchmarks?
Hernan, do you have benchmark tests that illustrate the performance increases with the new NetCDF4 parallel I/O?
- arango
- Site Admin
- Posts: 1394
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
- Contact:
Re: Parallel I/O benchmarks?
We are currently working on benchmarking the parallel I/O. I don't have access to a computer with a parallel I/O architecture, so I cannot use the full extent of the MPI I/O layer; I can only simulate it. We hope that users can help us by running the BENCHMARK test on different computers so we can collect the data and fine-tune how the output files are written.
My strategy was to first code the parallel I/O infrastructure and then work on the necessary performance improvements. This may be computer dependent, so we all need to learn and analyze.
Re: Parallel I/O benchmarks?
I'd love to know the answer to this question too. It likely depends on your application - the last time I ran BENCHMARK, there was no I/O at all!
When we bought our Sun Linux cluster, the two competing vendors told us that parallel I/O is important for scalability. Sun didn't mention that - because they don't support it. Now we have a Cray system that claims to have parallel I/O so I'm hoping to try it out (if I can get it to work).
Re: Parallel I/O benchmarks?
I can't see why it would depend on the application. I guess it would depend on hardware, compiler, etc.
The previous benchmark explicitly left out I/O, so as to test only the computational engine. For an I/O benchmark, we would need a very large domain that did as little computation as possible (diagnostic, no nonlinear EOS, etc.). But I guess it should be the same as a NetCDF4 benchmark, as the actual writing of the NetCDF file depends almost exclusively on the NetCDF library.
- arango
Re: Parallel I/O benchmarks?
Yes, I am now trying to run all three benchmarks with I/O. We write both history and averages NetCDF files frequently and run the benchmark for a longer period of time. Recall that the benchmarks can be large, and we can have more input scripts than the ones provided. Notice that you just need to activate BENCHMARK and AVERAGES.
We need to set NTIMES to a much larger number than 200, and NHIS and NAVG to a divisor of NTIMES, say NHIS=NAVG=NTIMES/5.
Code:
  Benchmark1:   512 x  64 x 30       ROMS/Export/ocean_benchmark1.in
  Benchmark2:  1024 x 128 x 30       ROMS/Export/ocean_benchmark2.in
  Benchmark3:  2048 x 256 x 30       ROMS/Export/ocean_benchmark3.in
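To get a feel for how much data such a run writes, here is a rough sketch in Python. It uses the three grid sizes listed above and the NHIS=NAVG=NTIMES/5 suggestion. The number of 3D fields per record, the bytes per value, and the NTIMES value are assumptions for illustration only, not ROMS settings; boundary points and 2D fields are ignored.

```python
# Rough output-volume estimate for the three benchmark grids.
# Grid sizes come from the listing above; everything else is an assumption.
BENCHMARK_GRIDS = {
    "benchmark1": (512, 64, 30),
    "benchmark2": (1024, 128, 30),
    "benchmark3": (2048, 256, 30),
}


def output_interval(ntimes, records=5):
    """Steps between writes so that ntimes steps yield `records` records."""
    if ntimes % records != 0:
        raise ValueError("NTIMES should be divisible by the number of records")
    return ntimes // records


def bytes_per_record(grid, n3d_fields=5, bytes_per_value=4):
    """Approximate size of one output record (3D fields only, no halos)."""
    nx, ny, nz = grid
    return nx * ny * nz * n3d_fields * bytes_per_value


if __name__ == "__main__":
    ntimes = 1000  # hypothetical value, "much larger than 200"
    nhis = navg = output_interval(ntimes)
    print(f"NTIMES={ntimes}, NHIS=NAVG={nhis}")
    for name, grid in BENCHMARK_GRIDS.items():
        megabytes = bytes_per_record(grid) / 1e6
        print(f"{name}: about {megabytes:.0f} MB per record")
```

Changing `n3d_fields` and `bytes_per_value` to match the fields and precision actually written gives a closer estimate.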
Re: Parallel I/O benchmarks?
It appears to be running for me with PARALLEL_IO, but it's so freakishly slow I thought it was hanging. This is without the collective option, which caused the HDF layer to complain and crash:
Code:
!!    integer, parameter :: IO_collective = 1       ! nf90_collective
      integer, parameter :: IO_collective = 0
Hernan, the HDF layer isn't used at all in the old way, right? Is there a switch for making HDF files in serial?
Time for me to run some BENCHMARK3 numbers. I'm thinking NTIMES=200 is good, but maybe NAVG, NHIS, NRST=100.
- arango
Re: Parallel I/O benchmarks?
Yes, just try the NETCDF4 option. For serial I/O you need to turn off the PARALLEL_IO option. If you are using OpenMP or serial with partitions, you need to build both the NetCDF-4 and HDF5 libraries with no parallel support. Otherwise, you will need to link ROMS with the MPI library.
I have made a few runs with the ROMS benchmarks and I am getting all kinds of numbers on my Linux box. The serial I/O is faster for me. I guess that we still need to do a lot of fine-tuning. However, the parallel I/O becomes advantageous as the grid size grows, since it reduces the memory required by the temporary global I/O arrays to just the tile size.
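The memory argument can be made concrete with a little arithmetic. The sketch below compares a buffer holding one full 3D field with a buffer holding a single tile of it. It assumes the Benchmark3 grid, a 16x16 tiling, 8-byte values, and tiles that divide the grid evenly with no halo points; these are illustrative assumptions, not a description of how ROMS allocates its I/O arrays.

```python
def tile_shape(grid, ntile_i, ntile_j):
    """Interior points per tile; assumes the tiling divides the grid evenly."""
    nx, ny, nz = grid
    if nx % ntile_i or ny % ntile_j:
        raise ValueError("grid is not evenly divisible by the tiling")
    return (nx // ntile_i, ny // ntile_j, nz)


def field_bytes(shape, bytes_per_value=8):
    """Bytes needed to hold one field of the given shape."""
    count = 1
    for extent in shape:
        count *= extent
    return count * bytes_per_value


if __name__ == "__main__":
    grid = (2048, 256, 30)  # Benchmark3 grid
    tile = tile_shape(grid, 16, 16)  # hypothetical 16x16 tiling
    global_bytes = field_bytes(grid)
    tile_bytes = field_bytes(tile)
    print(f"global buffer: {global_bytes / 1e6:.1f} MB")
    print(f"tile buffer:   {tile_bytes / 1e6:.2f} MB")
    print(f"ratio:         {global_bytes // tile_bytes}")
```

Under these assumptions the ratio equals the number of tiles, so the saving grows with both the grid size and the process count.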
Re: Parallel I/O benchmarks?
I believe the NETCDF4 option alone gives you a classic netcdf3 file.
In addition to the choices mentioned above, I turned on AVERAGES for the output cases and asked for these Hout options:
Code:
          T  Hout(idFsur)    Write out free-surface.
          T  Hout(idUbar)    Write out 2D U-momentum component.
          T  Hout(idVbar)    Write out 2D V-momentum component.
          T  Hout(idUvel)    Write out 3D U-momentum component.
          T  Hout(idVvel)    Write out 3D V-momentum component.
          T  Hout(idWvel)    Write out W-momentum component.
          T  Hout(idTvar)    Write out tracer 01: temp
          T  Hout(idTvar)    Write out tracer 02: salt
          T  Hout(idUsms)    Write out surface U-momentum stress.
          T  Hout(idVsms)    Write out surface V-momentum stress.
          T  Hout(idTsur)    Write out surface net heat flux.
          T  Hout(idTsur)    Write out surface net salt flux.
          T  Hout(idSrad)    Write out shortwave radiation flux.
          T  Hout(idLrad)    Write out longwave radiation flux.
          T  Hout(idLhea)    Write out latent heat flux.
          T  Hout(idShea)    Write out sensible heat flux.
Numbers soon, I hope.
- arango
Re: Parallel I/O benchmarks?
No, it will create a NetCDF-4/HDF5 format file. That is, the output file(s) will be HDF5 file(s). You can use h5dump to convince yourself that this is the case. Notice that in mod_netcdf.F I have:
Code:
!
!  Netcdf file creation mode flag.
!
#ifdef NETCDF4
      integer :: CMODE = nf90_netcdf4      ! NetCDF-4/HDF5 format file
#else
      integer :: CMODE = nf90_clobber      ! NetCDF classic format file
#endif
This does the trick. This is the reason why I pretty much rewrote the entire I/O structure in ROMS.
Re: Parallel I/O benchmarks?
OK, interesting. Not to be confused with USE_NETCDF4 in the makefile.
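Besides h5dump, a quick way to tell which kind of file a given build produced is to look at its first bytes: HDF5 files start with an 8-byte signature, while classic NetCDF files start with "CDF" followed by a version byte. The sketch below is a minimal check under that assumption. The function name is hypothetical, and it only inspects offset 0, so an HDF5 file with a user block in front of the signature would be reported as unknown.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_file_kind(path):
    """Classify a file by its leading bytes (offset 0 only)."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head == HDF5_SIGNATURE:
        return "NetCDF-4/HDF5"
    # Version byte 1 is the classic format, 2 is the 64-bit offset variant.
    if head[:3] == b"CDF" and head[3:4] in (b"\x01", b"\x02"):
        return "NetCDF classic"
    return "unknown"


if __name__ == "__main__":
    # Demonstrate on two stand-in files that contain only the magic bytes.
    with tempfile.TemporaryDirectory() as workdir:
        for name, magic in (("a.nc", HDF5_SIGNATURE), ("b.nc", b"CDF\x01")):
            path = os.path.join(workdir, name)
            with open(path, "wb") as handle:
                handle.write(magic + b"\x00" * 16)
            print(name, "->", netcdf_file_kind(path))
```

Pointing `netcdf_file_kind` at a real history file gives the same answer h5dump would imply, without needing the HDF5 tools installed.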
Re: Parallel I/O benchmarks?
Some numbers from pingo (Cray XT5 supercomputer):
Code:
32 cores:
        no I/O:                           428 seconds
        creating classic Netcdf3 files:   489 seconds
        creating serial Netcdf4 files:    477 seconds
        creating parallel Netcdf4 files: 1481 seconds
256 cores:
        no I/O:                            90 seconds
        creating classic Netcdf3 files:   138 seconds
        creating serial Netcdf4 files:    124 seconds
        creating parallel Netcdf4 files:  817 seconds
So output has a cost, and classic NetCDF3 costs a little more than NetCDF4/HDF5. However, the potential benefits of MPI-I/O aren't showing up here.
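The I/O cost implied by these timings can be read off by subtracting the no-I/O run from each of the others. The sketch below does that with the numbers quoted above; the dictionary labels are just illustrative names for the four cases.

```python
# Timings in seconds, copied from the pingo runs quoted above.
TIMINGS = {
    32: {"no I/O": 428, "classic NetCDF3": 489,
         "serial NetCDF4": 477, "parallel NetCDF4": 1481},
    256: {"no I/O": 90, "classic NetCDF3": 138,
          "serial NetCDF4": 124, "parallel NetCDF4": 817},
}


def io_overhead(timings):
    """Seconds added by output, relative to the run with no I/O."""
    base = timings["no I/O"]
    return {case: t - base for case, t in timings.items() if case != "no I/O"}


if __name__ == "__main__":
    for cores, timings in TIMINGS.items():
        print(f"{cores} cores:")
        for case, seconds in io_overhead(timings).items():
            print(f"    {case}: +{seconds} s")
```

By this measure the parallel NetCDF4 overhead is roughly 21 times the serial NetCDF4 overhead at both core counts.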
- arango
Re: Parallel I/O benchmarks?
What are those times? Elapsed time or total time?
I am also finding that the serial I/O in ROMS is more efficient. It seems that the ROMS distributed-memory communications for scattering/gathering I/O data are much faster than the similar operations in the HDF5 library. The overhead is too high.
Muqun Yang (from the HDF Group) mentioned to me that there is a lot of overhead when writing the several ROMS header scalar variables. I experimented by commenting out
Code:
!!      CALL wrt_info (ng, iNLM, ncHISid(ng), ncname)
        IF (exit_flag.ne.NoError) RETURN
in the Build/def_*.f90 files to see if this is the case. However, the serial I/O is still faster. Perhaps you should try this and report what numbers you get.
Re: Parallel I/O benchmarks?
Wallclock time for the whole thing. The tilings are 4x8 and 16x16, respectively.
Re: Parallel I/O benchmarks?
We've just had a very nice class from some Cray specialists. They had me try a different Lustre striping for the directory (stripe of one instead of four) which sped things up by about a factor of two with 32 procs. They actually recommend having a subset of all the procs doing the I/O, maybe sqrt(nprocs) or maybe one per node (our Cray has 8 procs/node). You can set up an MPI communicator for just the I/O. I know, easier said than done.
Edit: I should add that the reason you don't want *all* the procs to be writing is because they all have to talk to the metadata server for Lustre, which means the metadata server becomes the bottleneck.
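As a rough illustration of the two suggestions (one writer per node, or about sqrt(nprocs) writers), the sketch below picks which ranks would do the I/O. It assumes ranks are laid out contiguously on nodes, and all function names are hypothetical. The MPI side is not shown; the value returned by `io_color` is the kind of color one would pass to MPI_Comm_split to build the I/O communicator.

```python
import math


def writers_one_per_node(nprocs, procs_per_node=8):
    """First rank on each node, assuming ranks fill nodes contiguously."""
    return [rank for rank in range(nprocs) if rank % procs_per_node == 0]


def writers_sqrt(nprocs):
    """About sqrt(nprocs) writers, spread evenly over the rank range."""
    nwriters = max(1, math.isqrt(nprocs))
    stride = nprocs // nwriters
    return [i * stride for i in range(nwriters)]


def io_color(rank, writers):
    """0 for ranks that write, 1 for ranks that only compute."""
    return 0 if rank in writers else 1


if __name__ == "__main__":
    for nprocs in (32, 256):
        per_node = writers_one_per_node(nprocs)
        by_sqrt = writers_sqrt(nprocs)
        print(f"{nprocs} procs: {len(per_node)} writers (one per node), "
              f"{len(by_sqrt)} writers (sqrt rule)")
```

Either rule cuts the number of processes talking to the Lustre metadata server; which one works better is something only timing runs on the machine in question can show.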


