﻿id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
735	Very IMPORTANT:  ROMS Profiling Overhaul	arango		"This is an important update because I revised the entire profiling setup before starting to look for ways to improve the computational efficiency.  In particular, I have been experimenting with ways to accelerate the ROMS nesting algorithms.  Currently, I am concentrating on routines '''mp_assemble''' and '''mp_aggregate''' of '''distribute.F'''.

This update also includes a correction to the management of '''ntend(ng)''' and a few changes to the arguments of routines '''wclock_on''' and '''wclock_off'''.

== '''What is new?''' ==


 * The routine '''mp_assemble''' is a multidimensional version of '''mp_collect'''; they are used in the nesting and 4D-Var algorithms, respectively.  Both routines assemble/collect elements of arrays from all the MPI nodes.  Each node processes the parts of these arrays computed from its tiled state variables.

 The assembly/collection operation can be coded with high-level MPI functions like '''mpi_allgather''' or '''mpi_allreduce''' (summation works because all arrays are initialized to zero). Alternatively, one could use the lower-level routines '''mpi_irecv''', '''mpi_isend''', and '''mpi_bcast''', similarly to what is done in the tile-halo exchanges ('''mp_exchange.F''').  It turns out that the lower-level functions are actually more efficient than the higher-level ones, at least when using generic MPI libraries (like OpenMPI); the high-level functions are usually optimized by vendors for their multimillion-dollar supercomputers and compilers.

 Notice that at the top of '''distribute.F''', we have the following internal CPP options to set the desired communication options.  The default is to have:
 {{{
# undef  ASSEMBLE_ALLGATHER /* use mpi_allgather in mp_assemble */
# undef  ASSEMBLE_ALLREDUCE /* use mpi_allreduce in mp_assemble */
# define BOUNDARY_ALLREDUCE /* use mpi_allreduce in mp_boundary */
# undef  COLLECT_ALLGATHER  /* use mpi_allgather in mp_collect  */
# undef  COLLECT_ALLREDUCE  /* use mpi_allreduce in mp_collect  */
# define REDUCE_ALLGATHER   /* use mpi_allgather in mp_reduce   */
# undef  REDUCE_ALLREDUCE   /* use mpi_allreduce in mp_reduce   */
}}}
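 The reduction-based variants work because every node initializes the full work array to zero and fills in only its own tile, so an element-wise sum across nodes reconstructs the assembled array.  A small serial Python sketch of that idea (the array size, tile decomposition, and helper names are illustrative, not ROMS code):

```python
# Serial sketch of the "assemble by summation" idea used when
# mp_assemble is built on mpi_allreduce: each simulated node holds a
# zero-initialized copy of the global array and fills only its own tile,
# so an element-wise sum over nodes reproduces the assembled array.

NPTS = 12       # global 1-D array size (illustrative)
NNODES = 4      # number of simulated MPI nodes

def node_contribution(rank):
    """Zero-filled global array with only this node's tile populated."""
    tile = NPTS // NNODES
    work = [0.0] * NPTS
    for i in range(rank * tile, (rank + 1) * tile):
        work[i] = float(i)          # stand-in for tiled computation
    return work

# "mpi_allreduce(..., MPI_SUM, ...)" analogue: element-wise sum over nodes.
assembled = [0.0] * NPTS
for rank in range(NNODES):
    contrib = node_contribution(rank)
    assembled = [a + c for a, c in zip(assembled, contrib)]

print(assembled)   # identical to gathering the tiles in rank order
```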

 * The ROMS internal profiling was modified to include more regions ('''Pregions''') in '''mod_strings.F''':
  {{{
        character (len=50), dimension(Nregion) :: Pregion =             &
     &    (/'Allocation and array initialization ..............',       & !01
     &      'Ocean state initialization .......................',       & !02
     &      'Reading of input data ............................',       & !03
     &      'Processing of input data .........................',       & !04
     &      'Processing of output time averaged data ..........',       & !05
     &      'Computation of vertical boundary conditions ......',       & !06
     &      'Computation of global information integrals ......',       & !07
     &      'Writing of output data ...........................',       & !08
     &      'Model 2D kernel ..................................',       & !09
     &      'Lagrangian floats trajectories ...................',       & !10
     &      'Tidal forcing ....................................',       & !11
     &      '2D/3D coupling, vertical metrics .................',       & !12
     &      'Omega vertical velocity ..........................',       & !13
     &      'Equation of state for seawater ...................',       & !14
     &      'Biological module, source/sink terms .............',       & !15
     &      'Sediment transport module, source/sink terms .....',       & !16
     &      'Atmosphere-Ocean bulk flux parameterization ......',       & !17
     &      'KPP vertical mixing parameterization .............',       & !18
     &      'GLS vertical mixing parameterization .............',       & !19
     &      'My2.5 vertical mixing parameterization ...........',       & !20
     &      '3D equations right-side terms ....................',       & !21
     &      '3D equations predictor step ......................',       & !22
     &      'Pressure gradient ................................',       & !23
     &      'Harmonic mixing of tracers, S-surfaces ...........',       & !24
     &      'Harmonic mixing of tracers, geopotentials ........',       & !25
     &      'Harmonic mixing of tracers, isopycnals ...........',       & !26
     &      'Biharmonic mixing of tracers, S-surfaces .........',       & !27
     &      'Biharmonic mixing of tracers, geopotentials ......',       & !28
     &      'Biharmonic mixing of tracers, isopycnals .........',       & !29
     &      'Harmonic stress tensor, S-surfaces ...............',       & !30
     &      'Harmonic stress tensor, geopotentials ............',       & !31
     &      'Biharmonic stress tensor, S-surfaces .............',       & !32
     &      'Biharmonic stress tensor, geopotentials ..........',       & !33
     &      'Corrector time-step for 3D momentum ..............',       & !34
     &      'Corrector time-step for tracers ..................',       & !35
     &      'Nesting algorithm ................................',       & !36
     &      'Bottom boundary layer module .....................',       & !37
     &      'GST Analysis eigenproblem solution ...............',       & !38
     &      'Two-way coupling to Atmosphere Model .............',       & !39
     &      'Two-way coupling to Sea Ice Model ................',       & !40
     &      'Two-way coupling to Wave Model ...................',       & !41
     &      'Reading model state vector .......................',       & !42
     &      '4D-Var minimization solver .......................',       & !43
     &      'Background error covariance matrix ...............',       & !44
     &      'Posterior error covariance matrix ................',       & !45
     &      'Unused 01 ........................................',       & !46
     &      'Unused 02 ........................................',       & !47
     &      'Unused 03 ........................................',       & !48
     &      'Unused 04 ........................................',       & !49
     &      'Unused 05 ........................................',       & !50
     &      'Unused 06 ........................................',       & !51
     &      'Unused 07 ........................................',       & !52
     &      'Unused 08 ........................................',       & !53
     &      'Unused 09 ........................................',       & !54
     &      'Unused 10 ........................................',       & !55
     &      'Unused 11 ........................................',       & !56
     &      'Unused 12 ........................................',       & !57
     &      'Unused 13 ........................................',       & !58
     &      'Unused 14 ........................................',       & !59
     &      'Message Passage: 2D halo exchanges ...............',       & !60
     &      'Message Passage: 3D halo exchanges ...............',       & !61
     &      'Message Passage: 4D halo exchanges ...............',       & !62
     &      'Message Passage: lateral boundary exchanges ......',       & !63
     &      'Message Passage: data broadcast ..................',       & !64
     &      'Message Passage: data reduction ..................',       & !65
     &      'Message Passage: data gathering ..................',       & !66
     &      'Message Passage: data scattering..................',       & !67
     &      'Message Passage: boundary data gathering .........',       & !68
     &      'Message Passage: point data gathering ............',       & !69
     &      'Message Passage: nesting point data gathering ....',       & !70
     &      'Message Passage: nesting array data gathering ....',       & !71
     &      'Message Passage: synchronization barrier .........',       & !72
     &      'Message Passage: multi-model coupling ............'/)        !73
}}}

 Notice that we now have '''73 regions''', including '''14 unused regions''' reserved for later use.  We need to separate the Message Passage (MPI) regions from the rest, which required the tedious renumbering of all the regions.  The MPI regions need to be located at indices '''Mregion=60''' to '''Nregion=72'''. In '''wclock_off''', we have:
 {{{
# ifdef DISTRIBUTE
          DO imodel=1,4
            DO iregion=Mregion,Nregion
              ...
            END DO
          END DO
# endif
}}}

 to process all the MPI regions.  Notice that region indices '''36''', '''39''', '''40''', '''41''', '''42''', '''43''', '''44''', '''45''', '''70''', '''71''', and '''72''' are the new regions introduced here to refine the profiling and identify the bottleneck areas.

 * There are two additional arguments, '''line''' and '''routine''', to routines '''wclock_on''' and '''wclock_off''':
 {{{
      SUBROUTINE wclock_on  (ng, model, region, line, routine)
      SUBROUTINE wclock_off (ng, model, region, line, routine)
}}}
 so in the calling routine, we have for example:
 {{{
      CALL wclock_on  (ng, iNLM, 9, __LINE__, __FILE__)
      CALL wclock_off (ng, iNLM, 9, __LINE__, __FILE__)
}}}
 and the C-preprocessing code will yield:
 {{{
      CALL wclock_on  (ng, iNLM, 9, 39,  ""ROMS/Nonlinear/step2d_LF_AM3.h"")
      CALL wclock_off (ng, iNLM, 9, 116, ""ROMS/Nonlinear/step2d_LF_AM3.h"")
}}}
 The new arguments '''line''' and '''routine''' will be used in the future for more elaborate profiling with third-party libraries.
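 The wall-clock bookkeeping behind these calls can be sketched as a per-region accumulator.  The following schematic Python analogue (not ROMS code; the dictionary names are made up) shows how each region accumulates elapsed time while '''line''' and '''routine''' merely tag where the timer was toggled:

```python
import time

# Schematic analogue of ROMS wclock_on/wclock_off (illustrative only):
# each profiled region accumulates elapsed wall-clock time, and the
# caller passes the source line/file for later fine-grained profiling.

Cstr = {}    # region -> start time of the currently open interval
Csum = {}    # region -> accumulated elapsed seconds

def wclock_on(region, line, routine):
    Cstr[region] = time.perf_counter()

def wclock_off(region, line, routine):
    elapsed = time.perf_counter() - Cstr.pop(region)
    Csum[region] = Csum.get(region, 0.0) + elapsed

# Usage: bracket a code section, mimicking
#   CALL wclock_on (ng, iNLM, 9, __LINE__, __FILE__)
wclock_on(9, 39, "step2d_LF_AM3.h")
total = sum(i * i for i in range(100_000))   # stand-in for the 2D kernel
wclock_off(9, 116, "step2d_LF_AM3.h")

print(f"region 9 elapsed: {Csum[9]:.6f} s")
```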

----

Below are the profiling statistics for a 5-day simulation with two nested grids, using the low-level '''mpi_irecv'''/'''mpi_isend'''/'''mpi_bcast''' approach in routines '''mp_assemble''' and '''mp_collect'''.  The simulation was run on my latest Mac with 4 CPUs.


{{{
 Elapsed CPU time (seconds):

 Node   #  0 CPU:    4150.677
 Node   #  3 CPU:    4209.386
 Node   #  1 CPU:    4209.369
 Node   #  2 CPU:    4209.324
 Total:             16778.756

 Nonlinear model elapsed CPU time profile, Grid: 01

  Allocation and array initialization ..............         1.185  ( 0.0071 %)
  Ocean state initialization .......................         0.837  ( 0.0050 %)
  Reading of input data ............................        63.005  ( 0.3755 %)
  Processing of input data .........................        27.495  ( 0.1639 %)
  Processing of output time averaged data ..........        91.075  ( 0.5428 %)
  Computation of vertical boundary conditions ......         0.636  ( 0.0038 %)
  Computation of global information integrals ......        18.845  ( 0.1123 %)
  Writing of output data ...........................       103.600  ( 0.6174 %)
  Model 2D kernel ..................................       405.983  ( 2.4196 %)
  Tidal forcing ....................................        23.336  ( 0.1391 %)
  2D/3D coupling, vertical metrics .................        58.658  ( 0.3496 %)
  Omega vertical velocity ..........................        35.253  ( 0.2101 %)
  Equation of state for seawater ...................        44.661  ( 0.2662 %)
  Atmosphere-Ocean bulk flux parameterization ......        53.266  ( 0.3175 %)
  GLS vertical mixing parameterization .............       851.479  ( 5.0747 %)
  3D equations right-side terms ....................        62.280  ( 0.3712 %)
  3D equations predictor step ......................       148.728  ( 0.8864 %)
  Pressure gradient ................................        45.963  ( 0.2739 %)
  Harmonic mixing of tracers, geopotentials ........        96.407  ( 0.5746 %)
  Harmonic stress tensor, S-surfaces ...............        38.824  ( 0.2314 %)
  Corrector time-step for 3D momentum ..............        79.510  ( 0.4739 %)
  Corrector time-step for tracers ..................       105.353  ( 0.6279 %)
  Nesting algorithm ................................       205.556  ( 1.2251 %)
  Reading model state vector .......................         0.785  ( 0.0047 %)
                                              Total:      2562.721   15.2736

 Nonlinear model message Passage profile, Grid: 01

  Message Passage: 2D halo exchanges ...............        51.440  ( 0.3066 %)
  Message Passage: 3D halo exchanges ...............        93.536  ( 0.5575 %)
  Message Passage: 4D halo exchanges ...............        36.680  ( 0.2186 %)
  Message Passage: data broadcast ..................       117.041  ( 0.6976 %)
  Message Passage: data reduction ..................         1.395  ( 0.0083 %)
  Message Passage: data gathering ..................        20.711  ( 0.1234 %)
  Message Passage: data scattering..................         0.912  ( 0.0054 %)
  Message Passage: boundary data gathering .........         0.904  ( 0.0054 %)
  Message Passage: point data gathering ............         0.573  ( 0.0034 %)
  Message Passage: nesting point data gathering ....       708.861  ( 4.2248 %)
                                              Total:      1032.054    6.1510

 Nonlinear model elapsed CPU time profile, Grid: 02

  Allocation and array initialization ..............         1.185  ( 0.0071 %)
  Ocean state initialization .......................         0.851  ( 0.0051 %)
  Reading of input data ............................         6.918  ( 0.0412 %)
  Processing of input data .........................        24.180  ( 0.1441 %)
  Processing of output time averaged data ..........       610.139  ( 3.6364 %)
  Computation of vertical boundary conditions ......         3.645  ( 0.0217 %)
  Computation of global information integrals ......        93.566  ( 0.5576 %)
  Writing of output data ...........................       187.852  ( 1.1196 %)
  Model 2D kernel ..................................      2680.264  (15.9742 %)
  Tidal forcing ....................................         0.038  ( 0.0002 %)
  2D/3D coupling, vertical metrics .................       175.925  ( 1.0485 %)
  Omega vertical velocity ..........................       131.463  ( 0.7835 %)
  Equation of state for seawater ...................       213.727  ( 1.2738 %)
  Atmosphere-Ocean bulk flux parameterization ......       274.219  ( 1.6343 %)
  GLS vertical mixing parameterization .............      4496.748  (26.8002 %)
  3D equations right-side terms ....................       414.284  ( 2.4691 %)
  3D equations predictor step ......................       758.085  ( 4.5181 %)
  Pressure gradient ................................       242.797  ( 1.4471 %)
  Harmonic mixing of tracers, geopotentials ........       503.073  ( 2.9983 %)
  Harmonic stress tensor, S-surfaces ...............       219.733  ( 1.3096 %)
  Corrector time-step for 3D momentum ..............       362.818  ( 2.1624 %)
  Corrector time-step for tracers ..................       418.174  ( 2.4923 %)
  Nesting algorithm ................................       842.104  ( 5.0189 %)
  Reading model state vector .......................         1.443  ( 0.0086 %)
                                              Total:     12663.231   75.4718

 Nonlinear model message Passage profile, Grid: 02

  Message Passage: 2D halo exchanges ...............       223.769  ( 1.3336 %)
  Message Passage: 3D halo exchanges ...............       254.173  ( 1.5149 %)
  Message Passage: 4D halo exchanges ...............       113.998  ( 0.6794 %)
  Message Passage: data broadcast ..................       115.124  ( 0.6861 %)
  Message Passage: data reduction ..................         5.276  ( 0.0314 %)
  Message Passage: data gathering ..................        34.649  ( 0.2065 %)
  Message Passage: data scattering..................         0.336  ( 0.0020 %)
  Message Passage: point data gathering ............         0.348  ( 0.0021 %)
  Message Passage: nesting point data gathering ....       565.746  ( 3.3718 %)
  Message Passage: nesting array data gathering ....       485.481  ( 2.8934 %)
                                              Total:      1798.901   10.7213

  Unique code regions profiled .....................     15225.952   90.7454 %
  Residual, non-profiled code ......................      1552.804    9.2546 %


 All percentages are with respect to total time =        16778.756
}}}

Notice that the most expensive algorithms in this particular profile are the GLS vertical mixing parameterization ('''31.6%'''), the 2D kernel ('''19.2%'''), and nesting ('''6.2%'''). I don't think that there is much we can do about the GLS since it involves several fractional powers that are very expensive to compute.

On average, the low-level MPI functions yield 1-3% faster code than using '''mpi_allreduce''' in '''mp_assemble''' and '''mp_aggregate'''. Similarly, the low-level MPI functions yield 6-9% faster code than using '''mpi_allgather''' in '''mp_assemble''' and '''mp_aggregate'''.
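The quoted speedups are simple relative differences in total elapsed wall-clock time.  A quick Python helper (the timings below are hypothetical, for illustration only) makes the arithmetic explicit:

```python
def percent_faster(t_baseline, t_optimized):
    """Relative improvement of the optimized run over the baseline, in %."""
    return 100.0 * (t_baseline - t_optimized) / t_baseline

# Hypothetical totals (seconds) for illustration only:
t_allgather = 17900.0     # e.g., mp_assemble built on mpi_allgather
t_lowlevel  = 16778.756   # low-level mpi_irecv/mpi_isend/mpi_bcast
print(f"{percent_faster(t_allgather, t_lowlevel):.1f}% faster")  # -> 6.3% faster
```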

One must be careful when examining these numbers: they depend on the computer hardware, the compiler, the number of parallel nodes, intra-node connectivity, node speed, and so on.  We always need to investigate the optimal number of nodes for a particular ROMS application; too many nodes may slow the computations because of the communication overhead.

Therefore, the default is to use the low-level MPI functions in routines '''mp_assemble''', '''mp_aggregate''', and '''mp_collect'''.  See top of '''distribute.F'''."	upgrade	closed	major	Release ROMS/TOMS 3.7	Nonlinear	3.7	Done		
