Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#735 closed upgrade (Done)

Very IMPORTANT: ROMS Profiling Overhaul

Reported by: arango
Owned by:
Priority: major
Milestone: Release ROMS/TOMS 3.7
Component: Nonlinear
Version: 3.7
Keywords:
Cc:

Description (last modified by arango)

This is an important update: I revised the entire profiling infrastructure before starting to look for ways to improve computational efficiency. In particular, I have been experimenting with ways to accelerate the ROMS nesting algorithms. Currently, I am concentrating on routines mp_assemble and mp_aggregate of distribute.F.

This update also includes a correction to the management of ntend(ng) and a few changes to the arguments of routines wclock_on and wclock_off.

What is new?

  • The routine mp_assemble is a multidimensional version of mp_collect; they are used in the nesting and 4D-Var algorithms, respectively. Both routines assemble/collect elements of arrays from all the MPI nodes. Each node processes the parts of these arrays computed from its tiled state variables.

The assembly/collection operation can be coded with high-level MPI functions like mpi_allgather or mpi_allreduce (a summation, since all arrays are initialized to zero). Alternatively, one could use the lower-level routines mpi_irecv, mpi_isend, and mpi_bcast, similarly to what is done in the tile-halo exchanges (mp_exchange.F); see the sketch after the CPP options below. It turns out that the lower-level functions are actually more efficient than the higher-level ones in our case, using generic MPI libraries (like OpenMPI). The high-level collectives are usually optimized by the vendors for their multi-million-dollar supercomputers and compilers.

Notice that at the top of distribute.F we have the following internal CPP options to select the desired communication scheme. The default is to have:

# undef  ASSEMBLE_ALLGATHER /* use mpi_allgather in mp_assemble */
# undef  ASSEMBLE_ALLREDUCE /* use mpi_allreduce in mp_assemble */
# define BOUNDARY_ALLREDUCE /* use mpi_allreduce in mp_boundary */
# undef  COLLECT_ALLGATHER  /* use mpi_allgather in mp_collect  */
# undef  COLLECT_ALLREDUCE  /* use mpi_allreduce in mp_collect  */
# define REDUCE_ALLGATHER   /* use mpi_allgather in mp_reduce   */
# undef  REDUCE_ALLREDUCE   /* use mpi_allreduce in mp_reduce   */
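
To make the low-level strategy concrete, here is a minimal, self-contained sketch (not the actual mp_assemble code; the array name, equal-sized segments, and message tags are assumptions made for this example). Each rank fills its own segment of a work array, posts non-blocking receives for the other segments, and then sends its own segment to every other rank:

!  Illustrative sketch only: assemble a distributed array with
!  non-blocking point-to-point calls instead of mpi_allgather.
      PROGRAM assemble_sketch
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: Npts = 8   ! points per rank (assumed)
      INTEGER :: MyError, MyRank, Nranks, i, rank
      INTEGER, ALLOCATABLE :: Rrequest(:)
      REAL(KIND=8), ALLOCATABLE :: Awrk(:)

      CALL mpi_init (MyError)
      CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)
      CALL mpi_comm_size (MPI_COMM_WORLD, Nranks, MyError)
      ALLOCATE ( Awrk(Npts*Nranks), Rrequest(0:Nranks-1) )
      Awrk=0.0_8
!
!  Fill the segment owned by this rank (dummy values).
!
      DO i=1,Npts
        Awrk(MyRank*Npts+i)=REAL(MyRank,8)
      END DO
!
!  Post non-blocking receives for the segments owned by other ranks.
!
      DO rank=0,Nranks-1
        IF (rank.ne.MyRank) THEN
          CALL mpi_irecv (Awrk(rank*Npts+1), Npts,                      &
     &                    MPI_DOUBLE_PRECISION, rank, rank+5,           &
     &                    MPI_COMM_WORLD, Rrequest(rank), MyError)
        END IF
      END DO
!
!  Send this rank's segment to every other rank.  Every rank posts
!  all its receives before issuing any blocking call, so these
!  sends cannot deadlock.
!
      DO rank=0,Nranks-1
        IF (rank.ne.MyRank) THEN
          CALL mpi_send (Awrk(MyRank*Npts+1), Npts,                     &
     &                   MPI_DOUBLE_PRECISION, rank, MyRank+5,          &
     &                   MPI_COMM_WORLD, MyError)
        END IF
      END DO
!
!  Wait for all posted receives to complete.
!
      DO rank=0,Nranks-1
        IF (rank.ne.MyRank) THEN
          CALL mpi_wait (Rrequest(rank), MPI_STATUS_IGNORE, MyError)
        END IF
      END DO
      CALL mpi_finalize (MyError)
      END PROGRAM assemble_sketch

The actual mp_assemble is more elaborate (unequal tile segments, special-value handling, etc.), but the communication pattern is analogous.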
  • The ROMS internal profiling was modified to include more regions (Pregions) in mod_strings.F:
          character (len=50), dimension(Nregion) :: Pregion =             &
       &    (/'Allocation and array initialization ..............',       & !01
       &      'Ocean state initialization .......................',       & !02
       &      'Reading of input data ............................',       & !03
       &      'Processing of input data .........................',       & !04
       &      'Processing of output time averaged data ..........',       & !05
       &      'Computation of vertical boundary conditions ......',       & !06
       &      'Computation of global information integrals ......',       & !07
       &      'Writing of output data ...........................',       & !08
       &      'Model 2D kernel ..................................',       & !09
       &      'Lagrangian floats trajectories ...................',       & !10
       &      'Tidal forcing ....................................',       & !11
       &      '2D/3D coupling, vertical metrics .................',       & !12
       &      'Omega vertical velocity ..........................',       & !13
       &      'Equation of state for seawater ...................',       & !14
       &      'Biological module, source/sink terms .............',       & !15
       &      'Sediment transport module, source/sink terms .....',       & !16
       &      'Atmosphere-Ocean bulk flux parameterization ......',       & !17
       &      'KPP vertical mixing parameterization .............',       & !18
       &      'GLS vertical mixing parameterization .............',       & !19
       &      'My2.5 vertical mixing parameterization ...........',       & !20
       &      '3D equations right-side terms ....................',       & !21
       &      '3D equations predictor step ......................',       & !22
       &      'Pressure gradient ................................',       & !23
       &      'Harmonic mixing of tracers, S-surfaces ...........',       & !24
       &      'Harmonic mixing of tracers, geopotentials ........',       & !25
       &      'Harmonic mixing of tracers, isopycnals ...........',       & !26
       &      'Biharmonic mixing of tracers, S-surfaces .........',       & !27
       &      'Biharmonic mixing of tracers, geopotentials ......',       & !28
       &      'Biharmonic mixing of tracers, isopycnals .........',       & !29
       &      'Harmonic stress tensor, S-surfaces ...............',       & !30
       &      'Harmonic stress tensor, geopotentials ............',       & !31
       &      'Biharmonic stress tensor, S-surfaces .............',       & !32
       &      'Biharmonic stress tensor, geopotentials ..........',       & !33
       &      'Corrector time-step for 3D momentum ..............',       & !34
       &      'Corrector time-step for tracers ..................',       & !35
       &      'Nesting algorithm ................................',       & !36
       &      'Bottom boundary layer module .....................',       & !37
       &      'GST Analysis eigenproblem solution ...............',       & !38
       &      'Two-way coupling to Atmosphere Model .............',       & !39
       &      'Two-way coupling to Sea Ice Model ................',       & !40
       &      'Two-way coupling to Wave Model ...................',       & !41
       &      'Reading model state vector .......................',       & !42
       &      '4D-Var minimization solver .......................',       & !43
       &      'Background error covariance matrix ...............',       & !44
       &      'Posterior error covariance matrix ................',       & !45
       &      'Unused 01 ........................................',       & !46
       &      'Unused 02 ........................................',       & !47
       &      'Unused 03 ........................................',       & !48
       &      'Unused 04 ........................................',       & !49
       &      'Unused 05 ........................................',       & !50
       &      'Unused 06 ........................................',       & !51
       &      'Unused 07 ........................................',       & !52
       &      'Unused 08 ........................................',       & !53
       &      'Unused 09 ........................................',       & !54
       &      'Unused 10 ........................................',       & !55
       &      'Unused 11 ........................................',       & !56
       &      'Unused 12 ........................................',       & !57
       &      'Unused 13 ........................................',       & !58
       &      'Unused 14 ........................................',       & !59
       &      'Message Passage: 2D halo exchanges ...............',       & !60
       &      'Message Passage: 3D halo exchanges ...............',       & !61
       &      'Message Passage: 4D halo exchanges ...............',       & !62
       &      'Message Passage: lateral boundary exchanges ......',       & !63
       &      'Message Passage: data broadcast ..................',       & !64
       &      'Message Passage: data reduction ..................',       & !65
       &      'Message Passage: data gathering ..................',       & !66
       &      'Message Passage: data scattering..................',       & !67
       &      'Message Passage: boundary data gathering .........',       & !68
       &      'Message Passage: point data gathering ............',       & !69
       &      'Message Passage: nesting point data gathering ....',       & !70
       &      'Message Passage: nesting array data gathering ....',       & !71
       &      'Message Passage: synchronization barrier .........',       & !72
       &      'Message Passage: multi-model coupling ............'/)        !73
    

Notice that we now have 73 regions, including 14 unused regions reserved for future use. We need to separate the Message Passage (MPI) regions from the rest, which required tediously renumbering all the regions. The MPI regions need to be located at indices Mregion=60 to Nregion=72. In wclock_off, we have:

# ifdef DISTRIBUTE
          DO imodel=1,4
            DO iregion=Mregion,Nregion
              ...
            END DO
          END DO
# endif

to process all the MPI regions. Notice that region indices 36, 39, 40, 41, 42, 43, 44, 45, 70, 71, and 72 are the new regions introduced here to refine the profiling and identify the bottleneck areas.
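
As an illustration only (Tmpi and Tregion below are hypothetical names, not the actual ROMS variables), the elided loop body conceptually accumulates each MPI region's elapsed time into the message-passage totals reported in the profile:

          DO imodel=1,4
            DO iregion=Mregion,Nregion
!  Hypothetical bookkeeping: add this region's elapsed time to the
!  per-model message-passage total.
              Tmpi(imodel)=Tmpi(imodel)+Tregion(iregion,imodel)
            END DO
          END DO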

  • There are two additional arguments to routines wclock_on and wclock_off:
         SUBROUTINE wclock_on  (ng, model, region, line, routine)
         SUBROUTINE wclock_off (ng, model, region, line, routine)
    
    so in the calling routine, we have for example:
         CALL wclock_on  (ng, iNLM, 9, __LINE__, __FILE__)
         CALL wclock_off (ng, iNLM, 9, __LINE__, __FILE__)
    
    and the C-preprocessing code will yield:
         CALL wclock_on  (ng, iNLM, 9, 39,  "ROMS/Nonlinear/step2d_LF_AM3.h")
         CALL wclock_off (ng, iNLM, 9, 116, "ROMS/Nonlinear/step2d_LF_AM3.h")
    
    The new arguments line and routine will be used in the future for more elaborate profiling with third-party libraries.

Below are the profiling statistics for a 5-day simulation with two nested grids, using the low-level mpi_irecv/mpi_isend/mpi_bcast in routines mp_assemble and mp_collect. The simulation was run on my latest Mac using 4 CPUs.

 Elapsed CPU time (seconds):

 Node   #  0 CPU:    4150.677
 Node   #  3 CPU:    4209.386
 Node   #  1 CPU:    4209.369
 Node   #  2 CPU:    4209.324
 Total:             16778.756

 Nonlinear model elapsed CPU time profile, Grid: 01

  Allocation and array initialization ..............         1.185  ( 0.0071 %)
  Ocean state initialization .......................         0.837  ( 0.0050 %)
  Reading of input data ............................        63.005  ( 0.3755 %)
  Processing of input data .........................        27.495  ( 0.1639 %)
  Processing of output time averaged data ..........        91.075  ( 0.5428 %)
  Computation of vertical boundary conditions ......         0.636  ( 0.0038 %)
  Computation of global information integrals ......        18.845  ( 0.1123 %)
  Writing of output data ...........................       103.600  ( 0.6174 %)
  Model 2D kernel ..................................       405.983  ( 2.4196 %)
  Tidal forcing ....................................        23.336  ( 0.1391 %)
  2D/3D coupling, vertical metrics .................        58.658  ( 0.3496 %)
  Omega vertical velocity ..........................        35.253  ( 0.2101 %)
  Equation of state for seawater ...................        44.661  ( 0.2662 %)
  Atmosphere-Ocean bulk flux parameterization ......        53.266  ( 0.3175 %)
  GLS vertical mixing parameterization .............       851.479  ( 5.0747 %)
  3D equations right-side terms ....................        62.280  ( 0.3712 %)
  3D equations predictor step ......................       148.728  ( 0.8864 %)
  Pressure gradient ................................        45.963  ( 0.2739 %)
  Harmonic mixing of tracers, geopotentials ........        96.407  ( 0.5746 %)
  Harmonic stress tensor, S-surfaces ...............        38.824  ( 0.2314 %)
  Corrector time-step for 3D momentum ..............        79.510  ( 0.4739 %)
  Corrector time-step for tracers ..................       105.353  ( 0.6279 %)
  Nesting algorithm ................................       205.556  ( 1.2251 %)
  Reading model state vector .......................         0.785  ( 0.0047 %)
                                              Total:      2562.721   15.2736

 Nonlinear model message Passage profile, Grid: 01

  Message Passage: 2D halo exchanges ...............        51.440  ( 0.3066 %)
  Message Passage: 3D halo exchanges ...............        93.536  ( 0.5575 %)
  Message Passage: 4D halo exchanges ...............        36.680  ( 0.2186 %)
  Message Passage: data broadcast ..................       117.041  ( 0.6976 %)
  Message Passage: data reduction ..................         1.395  ( 0.0083 %)
  Message Passage: data gathering ..................        20.711  ( 0.1234 %)
  Message Passage: data scattering..................         0.912  ( 0.0054 %)
  Message Passage: boundary data gathering .........         0.904  ( 0.0054 %)
  Message Passage: point data gathering ............         0.573  ( 0.0034 %)
  Message Passage: nesting point data gathering ....       708.861  ( 4.2248 %)
                                              Total:      1032.054    6.1510

 Nonlinear model elapsed CPU time profile, Grid: 02

  Allocation and array initialization ..............         1.185  ( 0.0071 %)
  Ocean state initialization .......................         0.851  ( 0.0051 %)
  Reading of input data ............................         6.918  ( 0.0412 %)
  Processing of input data .........................        24.180  ( 0.1441 %)
  Processing of output time averaged data ..........       610.139  ( 3.6364 %)
  Computation of vertical boundary conditions ......         3.645  ( 0.0217 %)
  Computation of global information integrals ......        93.566  ( 0.5576 %)
  Writing of output data ...........................       187.852  ( 1.1196 %)
  Model 2D kernel ..................................      2680.264  (15.9742 %)
  Tidal forcing ....................................         0.038  ( 0.0002 %)
  2D/3D coupling, vertical metrics .................       175.925  ( 1.0485 %)
  Omega vertical velocity ..........................       131.463  ( 0.7835 %)
  Equation of state for seawater ...................       213.727  ( 1.2738 %)
  Atmosphere-Ocean bulk flux parameterization ......       274.219  ( 1.6343 %)
  GLS vertical mixing parameterization .............      4496.748  (26.8002 %)
  3D equations right-side terms ....................       414.284  ( 2.4691 %)
  3D equations predictor step ......................       758.085  ( 4.5181 %)
  Pressure gradient ................................       242.797  ( 1.4471 %)
  Harmonic mixing of tracers, geopotentials ........       503.073  ( 2.9983 %)
  Harmonic stress tensor, S-surfaces ...............       219.733  ( 1.3096 %)
  Corrector time-step for 3D momentum ..............       362.818  ( 2.1624 %)
  Corrector time-step for tracers ..................       418.174  ( 2.4923 %)
  Nesting algorithm ................................       842.104  ( 5.0189 %)
  Reading model state vector .......................         1.443  ( 0.0086 %)
                                              Total:     12663.231   75.4718

 Nonlinear model message Passage profile, Grid: 02

  Message Passage: 2D halo exchanges ...............       223.769  ( 1.3336 %)
  Message Passage: 3D halo exchanges ...............       254.173  ( 1.5149 %)
  Message Passage: 4D halo exchanges ...............       113.998  ( 0.6794 %)
  Message Passage: data broadcast ..................       115.124  ( 0.6861 %)
  Message Passage: data reduction ..................         5.276  ( 0.0314 %)
  Message Passage: data gathering ..................        34.649  ( 0.2065 %)
  Message Passage: data scattering..................         0.336  ( 0.0020 %)
  Message Passage: point data gathering ............         0.348  ( 0.0021 %)
  Message Passage: nesting point data gathering ....       565.746  ( 3.3718 %)
  Message Passage: nesting array data gathering ....       485.481  ( 2.8934 %)
                                              Total:      1798.901   10.7213

  Unique code regions profiled .....................     15225.952   90.7454 %
  Residual, non-profiled code ......................      1552.804    9.2546 %


 All percentages are with respect to total time =        16778.756

Notice that the most expensive algorithms in this particular profiling are the GLS vertical mixing parameterization (31.6%), the 2D kernel (19.2%), and nesting (6.2%). I don't think there is much we can do about the GLS since it involves several fractional powers that are very expensive to compute.

On average, the low-level MPI functions yield code that is 1-3% faster than using mpi_allreduce in mp_assemble and mp_aggregate, and 6-9% faster than using mpi_allgather in those same routines.

One must be careful when examining these numbers: they depend on the type of computer hardware, the compiler, the number of parallel nodes, intra-node connectivity, node speed, and so on. We always need to investigate the optimal number of nodes for a particular ROMS application; too many nodes may slow the computation because of the communication overhead.

Therefore, the default is to use the low-level MPI functions in routines mp_assemble, mp_aggregate, and mp_collect. See the top of distribute.F.
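
For example, to benchmark the high-level collective in mp_assemble on a different platform, one would flip the corresponding internal options at the top of distribute.F and recompile:

# define ASSEMBLE_ALLREDUCE /* use mpi_allreduce in mp_assemble */
# undef  ASSEMBLE_ALLGATHER /* use mpi_allgather in mp_assemble */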

Change History (2)

comment:1 by arango, 7 years ago

Resolution: Done
Status: new → closed

comment:2 by arango, 7 years ago

Description: modified (diff)