id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
789	VERY IMPORTANT: ROMS single precision numerical kernel	arango		"ROMS uses '''double-precision''' arithmetic in its numerical kernel by default. I have been asked several times what is needed to run in '''single precision''' to accelerate the computations. I always answered that it is straightforward since ROMS defines the numerical precision with '''KIND''' type parameters in '''mod_kinds.F''' globally.  That is, of course, very far from reality because we need to look at the accuracy in various computations like:
 * time managing variables, 
 * bounded time interpolation between snapshots,
 * vertical stretching of terrain-following coordinates,
 * area and volume conservation variables,
 * time variables representation in output NetCDF files, and
 * processing of ROMS unique ''namelist'' like parameters from input scripts
Therefore, implementing a '''single-precision''' numerical kernel was not as trivial as I thought since we need to keep/compute a few variables in '''double precision''' with lots of files being modified.

== '''What Is New:''' ==

* A new C-preprocessing '''SINGLE_PRECISION''' is introduced to convert the numerical kernel to '''single precision'''. In '''globaldef.h''', we now have:
 {{{
/*
** Turn ON/OFF double precision arithmetic in numerical kernel (default)
** and floating-point type variables and associated intrinsic functions.
*/

#ifdef SINGLE_PRECISION
# ifdef OUT_DOUBLE
#   undef OUT_DOUBLE
# endif
# ifndef RST_SINGLE
#   define RST_SINGLE
# endif
#else
# define DOUBLE_PRECISION
#endif
}}}
 Notice that if '''OUT_DOUBLE''' and '''RST_SINGLE''' are activated, they are turned off for consistency in the numerical kernel.

* In '''mod_kinds.F''', we now have
 {{{
        integer, parameter :: i1b= SELECTED_INT_KIND(1)        !  8-bit
        integer, parameter :: i2b= SELECTED_INT_KIND(2)        !  8-bit
        integer, parameter :: i4b= SELECTED_INT_KIND(4)        ! 16-bit
        integer, parameter :: i8b= SELECTED_INT_KIND(8)        ! 32-bit
        integer, parameter :: c8 = SELECTED_REAL_KIND(6,30)    ! 32-bit
        integer, parameter :: dp = SELECTED_REAL_KIND(12,300)  ! 64-bit
        integer, parameter :: r4 = SELECTED_REAL_KIND(6,30)    ! 32-bit
# ifdef SINGLE_PRECISION
        integer, parameter :: r8 = SELECTED_REAL_KIND(6,30)    ! 32-bit
# else
        integer, parameter :: r8 = SELECTED_REAL_KIND(12,300)  ! 64-bit
# endif
}}}
 Notice that the '''!r8''' parameter is defined for '''32'''-bit ('''4'''-bytes) floating-point variables in '''single precision''' computations. A new parameter, '''dp''', is introduced to keep few variable in '''double precision'''.  For example, in '''mod_scalars.F''' we have:
 {{{
        real(dp), allocatable :: tdays(:)                ! days
        real(dp), allocatable :: time(:)                 ! seconds

        real(dp), allocatable :: dt(:)                   ! seconds
        real(dp), allocatable :: dtfast(:)               ! seconds

        real(dp), allocatable :: TimeEnd(:)              ! seconds
        real(dp), allocatable :: AVGtime(:)              ! seconds
        real(dp), allocatable :: DIAtime(:)              ! seconds
        real(dp), allocatable :: IMPtime(:)              ! seconds
        real(dp), allocatable :: ObsTime(:)              ! seconds
        real(dp), allocatable :: FrcTime(:)              ! seconds

        real(dp) :: dstart = 0.0_dp                      ! days
        real(dp) :: io_time = 0.0_dp                     ! seconds
        real(dp) :: run_time = 0.0_dp                    ! seconds
        real(dp) :: tide_start = 0.0_dp                  ! days
}}}

* A new module, '''inp_decode.F''', is introduced to process and decode ROMS '''!KeyWord''' parameters from input script files. It has the following module procedures:
 {{{
      INTERFACE load_i
        MODULE PROCEDURE load_1d_i       ! 1D integer array
        MODULE PROCEDURE load_2d_i       ! 2D integer array
        MODULE PROCEDURE load_3d_i       ! 3D integer array
      END INTERFACE load_i

      INTERFACE load_l
        MODULE PROCEDURE load_1d_l       ! 1D logical array
        MODULE PROCEDURE load_2d_l       ! 2D logical array
        MODULE PROCEDURE load_3d_l       ! 3D logical array
      END INTERFACE load_l

      INTERFACE load_r
#ifdef SINGLE_PRECISION
        MODULE PROCEDURE load_1d_dp      ! 1D real(dp) array
        MODULE PROCEDURE load_2d_dp      ! 2D real(dp) array
        MODULE PROCEDURE load_3d_dp      ! 3D real(dp) array
#endif
        MODULE PROCEDURE load_1d_r8      ! 1D real(r8) array
        MODULE PROCEDURE load_2d_r8      ! 2D real(r8) array
        MODULE PROCEDURE load_3d_r8      ! 3D real(r8) array
      END INTERFACE load_r
}}}
 to process '''1D''', '''2D''', and '''3D''' variables since we are using the strong typing properties of modules.  Before we didn't use modules because all the arrays arguments were reshaped to '''1D''' arrays ('''F77''' style).  This part is very tricky but more robust since we need to consider single and double precision variables that are written to input scripts with '''8'''-bytes double representation.  I considered polymorphic objects (available in Fortran 2003, 2008, 2018), but I rejected it because of backward compatibility with older Fortran compilers.

 '''WARNING:''' The input script processing routines ('''read_asspar.F''', '''read_biopar.F''', '''read_phypar.F''', and '''read_stapar.F''') were modified, and the functions '''load_i''', '''load_l''', and '''load_r''' have a different number of arguments depending if we are processing '''1D''', '''2D''', or '''3D''' variables. For Example for '''load_r''' we can have:
 {{{
            CASE ('DT')
              Npts=load_r(Nval, Rval, Ngrids, dt)
            CASE ('TNU2')
              Npts=load_r(Nval, Rval, NAT+NPT, Ngrids, Rtracer)
            CASE ('HdecayB(isTvar)')
              Npts=load_r(Nval, Rval, 4, MT, Ngrids, Rboundary)
}}}
 for processing '''dt(:)''', '''Rtracer(:,:,:)''', and '''Rboundary(:,:,:)''' respectively.

----

'''NOTICE:''' on principle '''single-precision''' computations run twice as fast as '''double-precision''' computations. However, since some double precision computations are carried out for accuracy in '''SINGLE_PRECISION''' solutions the gain is less than '''50''' percent.
For example, in a 3-nested grid application on 4 CPUs, I get
 {{{
 Elapsed CPU time (seconds):

 Node   #    0 CPU:    4923.242
 Node   #    1 CPU:    4953.730
 Node   #    2 CPU:    4953.504
 Node   #    3 CPU:    4953.539
 Total:               19784.015    for double-precision computations

 Elapsed CPU time (seconds):

 Node   #    0 CPU:    2628.964
 Node   #    1 CPU:    2683.894
 Node   #    3 CPU:    2683.867
 Node   #    2 CPU:    2683.818
 Total:               10680.543    for single-precision computations
}}}
That is, the single-precision is '''46 percent''' faster than the double-precision simulation.

In 4D-Var simulations, this number get much smaller because the computations are dominated by I/O."	upgrade	new	major	Release ROMS/TOMS 3.7	Nonlinear	3.7