IMPORTANT: Parallel Global Reduction and Volume Conservation

Archive of important messages sent via the ROMS mailing list

Moderators: arango, robertson

Post Reply
Message
Author
User avatar
arango
Site Admin
Posts: 1084
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University
Contact:

IMPORTANT: Parallel Global Reduction and Volume Conservation

#1 Post by arango » Thu Mar 05, 2015 1:28 am

WARNING: You cannot replay or post comments in this forum thread. If you have any comments, use the following :arrow: post instead.

In ROMS, we have the option to impose a volume conservation in applications with open boundaries if tidal forcing is not anabled:

Code: Select all

! Set lateral open boundary edge volume conservation switch for
! nonlinear model and adjoint-based algorithms. Usually activated
! with radiation boundary conditions to enforce global mass
! conservation, except if tidal forcing is enabled. [1:Ngrids].

   VolCons(west)  ==  T                            ! western  boundary
   VolCons(east)  ==  T                            ! eastern  boundary
   VolCons(south) ==  T                            ! southern boundary
   VolCons(north) ==  F                            ! northern boundary

ad_VolCons(west)  ==  T                            ! western  boundary
ad_VolCons(east)  ==  T                            ! eastern  boundary
ad_VolCons(south) ==  T                            ! southern boundary
ad_VolCons(north) ==  F                            ! northern boundary
The routines in obc_volcons.F are used to impose volume conservation. The flux integral along the open boundary (bc_flux) and its area (bc_flux) are computed in routine obc_flux_tile. Both values are used to compute the barotropic velocity correction (normal component to open boundary), ubar_xs = bc_flux / bc_area, so the volume integral along the open boundary is conserved. They are scalar values since it represents the accumulative sum along the open boundary.

The computation of bc_flux and bc_area are simple but they are subject to round-off in parallel and serial applications with tile partitions. Both values are usually large numbers, specially in big domains and deep bathymetry. This makes the round-off problem worse.

In parallel applications, we need to have a special code to carry out the global reduction operation (summation):

Code: Select all

      IF (ANY(VolCons(:,ng))) THEN
#ifdef DISTRIBUTE
        NSUB=1                           ! distributed-memory
#else
        IF (DOMAIN(ng)%SouthWest_Corner(tile).and.                      &
     &      DOMAIN(ng)%NorthEast_Corner(tile)) THEN
          NSUB=1                         ! non-tiled application
        ELSE
          NSUB=NtileX(ng)*NtileE(ng)     ! tiled application
        END IF
#endif
!$OMP CRITICAL (OBC_VOLUME)
        IF (tile_count.eq.0) THEN
          bc_flux=0.0_r8
          bc_area=0.0_r8
        END IF
        bc_area=bc_area+my_area
        bc_flux=bc_flux+my_flux
        tile_count=tile_count+1
        IF (tile_count.eq.NSUB) THEN
          tile_count=0
#ifdef DISTRIBUTE
          buffer(1)=bc_area
          buffer(2)=bc_flux
          op_handle(1)='SUM'
          op_handle(2)='SUM'
          CALL mp_reduce (ng, iNLM, 2, buffer, op_handle)
          bc_area=buffer(1)
          bc_flux=buffer(2)
#endif
          ubar_xs=bc_flux/bc_area
        END IF
!$OMP END CRITICAL (OBC_VOLUME)
      END IF
Well, here is where the round-off problem comes from and we get different solutions for different parallel partitions. Therefore, we have a replicability problem. In general, collective parallel floating-point operations can produce different results on different tile (processor) configurations. That is, for example, configuring your application on 2x5 tiles accumulate round-off errors differently than on 4x8.

Notice that in any computer, (A + B + C + D) will not give the same results as (A + D + C + B). Therefore, when we do floating-point parallel reduce operations the order of the reduction between chunks of numbers matters. This order is not deterministic in routine like mpi_allreduce, which is called by the ROMS routine mp_reduce. As matter of fact, there are three different ways in mp_reduce to compute this reduction operation using either mpi_allreduce, mpi_allgather, and mpi_isend/mpi_irecv. All three methods are subject to round-off problems.

What To Do:
  • Currently, there is not much that we can do. Sasha suggested to carry the summation-by-pairs to reduce the round-off error. This summation adds only comparable numbers from the chunk of parallel data.
  • If you are using frequent data from larger scale models, you can suppress volume conservation to see what happens. Specially, you need frequent free-surface (zeta) data. You may get into trouble if running for long (years) periods.
  • Impose tidal forcing at the open boundary since we need to avoid volume conservation.
  • Notice that we not longer can compare NetCDF files byte by byte to check for parallel bugs when volume conservation is activated:

    Code: Select all

    % diff r1/ocean_avg.nc r2/ocean_avg.nc
    Binary files r1/ocean_avg.nc and r2/ocean_avg.nc differ
    
    % diff r1/ocean_his.nc r2/ocean_his.nc
    Binary files r1/ocean_his.nc and r2/ocean_his.nc differ
    
    % diff r1/ocean_rst.nc r2/ocean_rst.nc
    Binary files r1/ocean_rst.nc and r2/ocean_rst.nc differ
    To compute binary differences, you need to activate CPP options: DEBUGGING, OUT_DOUBLE, and POSITIVE_ZERO.

Post Reply