Probable bug in I/O when MPI and NS_PERIODIC are used

General scientific issues regarding ROMS

Moderators: arango, robertson

Paul_Budgell
Posts: 18
Joined: Wed Apr 23, 2003 1:34 pm
Location: IMR, Bergen, Norway

Probable bug in I/O when MPI and NS_PERIODIC are used

#1 Post by Paul_Budgell » Thu Feb 05, 2009 10:35 pm

I have found that version #306 crashes when MPI is used in combination with NS_PERIODIC. I have been able to replicate the problem with one of the analytical test cases, RIVERPLUME1. If I run the RIVERPLUME1 case as it comes with the version 306 distribution, it runs with no problem. But if I read the grid parameters in from a grid file, the program crashes with the same error I've been getting in my global application. I created the grid file using the ROMS Matlab tools routines c_grid.m and w_grid.m, with parameters taken from the ana_grid.h and ana_mask.h settings for RIVERPLUME1.

I turned on floating-point exception checking in the ifort compiler and found that, with a 2x2 tiling on 4 processors, pm has zero values at (0,Mm) and (21,Mm), which produces a divide by zero in metrics.F. It looks like information is not being transferred properly in Utility/get_grid.F. Perhaps there is a problem in the new mp_exchange2d routine when NSperiodic is true. It is likely that the same error occurs with EW_PERIODIC, but I haven't had time to check that yet.
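A quick way to locate such points before the division is to scan the metric array for exact zeros. The following is a minimal Python sketch, not ROMS code: the function name `find_zero_metric`, the list-of-rows layout, the 0..40 x 0..68 index range, and the uniform value 1/1500 are all assumptions made for illustration. Only the two zero locations are taken from the report above (with Mm = 67).

```python
def find_zero_metric(pm, i_start=0, j_start=0):
    """Return the (i, j) grid indices at which pm is exactly zero.

    pm is a list of rows: the row index maps to j, the column index to i.
    i_start and j_start give the grid index of the first column and row.
    """
    zeros = []
    for j, row in enumerate(pm, start=j_start):
        for i, value in enumerate(row, start=i_start):
            if value == 0.0:
                zeros.append((i, j))
    return zeros


# Illustrative field only: a uniform metric on an assumed 0..40 x 0..68 grid.
PM_VALUE = 1.0 / 1500.0
pm = [[PM_VALUE for i in range(41)] for j in range(69)]

# Plant zeros at the two reported points, (0,Mm) and (21,Mm).
pm[67][0] = 0.0
pm[67][21] = 0.0

bad_points = find_zero_metric(pm)
print(bad_points)
```

Running a check like this on each tile right after the grid is read would show whether the zeros are already present before metrics.F uses pm as a divisor.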

I would attach the grid file I created, but the .nc file extension does not seem to be permitted for upload on this forum. The output from the run is:

------------

 Process Information:

 Node #  0 (pid=    8430) is active.
 Node #  2 (pid=    8428) is active.
 Node #  1 (pid=    8427) is active.
 Node #  3 (pid=    8429) is active.

 Model Input Parameters:  ROMS/TOMS version 3.2
                          Thursday - February 5, 2009 -  9:37:48 PM
 -----------------------------------------------------------------------------

 River Plume Test

 Operating system : Linux
 CPU/hardware     : x86_64
 Compiler system  : ifort
 Compiler command : /usr/local/mpich2-1.0.5p4/bin/mpif90
 Compiler flags   : -heap-arrays -g -fpe0 -traceback -free

 Input Script  : Apps/RIVERPLUME1/ocean_riverplume1.in

 SVN Root URL  : https://www.myroms.org/svn/omlab/branches/kate
 SVN Revision  :

 Local Root    : /heim/paul/roms_3.1k_912
 Header Dir    : Apps/RIVERPLUME1
 Header file   : riverplume1.h
 Analytical Dir: /heim/paul/roms_3.1k_912/ROMS/Functionals

 Resolution, Grid 01: 0039x0067x013,  Parallel Nodes:   4,  Tiling: 002x002


 Physical Parameters, Grid: 01
 =============================

      21600  ntimes          Number of timesteps for 3-D equations.
    120.000  dt              Timestep size (s) for 3-D equations.
         20  ndtfast         Number of timesteps for 2-D equations between
                               each 3D timestep.
          1  ERstr           Starting ensemble/perturbation run number.
          1  ERend           Ending ensemble/perturbation run number.
          0  nrrec           Number of restart records to read from disk.
          T  LcycleRST       Switch to recycle time-records in restart file.
        360  nRST            Number of timesteps between the writing of data
                               into restart fields.
          1  ninfo           Number of timesteps between print of information
                               to standard output.
          T  ldefout         Switch to create a new output NetCDF file(s).
        360  nHIS            Number of timesteps between the writing fields
                               into history file.
          1  ntsAVG          Starting timestep for the accumulation of output
                               time-averaged data.
        360  nAVG            Number of timesteps between the writing of
                               time-averaged data into averages file.
 0.0000E+00  tnu2(01)        Horizontal, harmonic mixing coefficient (m2/s)
                               for tracer 01: temp
 0.0000E+00  tnu2(02)        Horizontal, harmonic mixing coefficient (m2/s)
                               for tracer 02: salt
 5.0000E-06  Akt_bak(01)     Background vertical mixing coefficient (m2/s)
                               for tracer 01: temp
 5.0000E-06  Akt_bak(02)     Background vertical mixing coefficient (m2/s)
                               for tracer 02: salt
 5.0000E-06  Akv_bak         Background vertical mixing coefficient (m2/s)
                               for momentum.
 3.0000E-04  rdrg            Linear bottom drag coefficient (m/s).
 3.0000E-03  rdrg2           Quadratic bottom drag coefficient.
 2.0000E-02  Zob             Bottom roughness (m).
          1  lmd_Jwt         Jerlov water type.
 3.0000E+00  theta_s         S-coordinate surface control parameter.
 4.0000E-01  theta_b         S-coordinate bottom  control parameter.
     50.000  Tcline          S-coordinate surface/bottom layer width (m) used
                               in vertical coordinate stretching.
   1025.000  rho0            Mean density (kg/m3) for Boussinesq approximation.
      0.000  dstart          Time-stamp assigned to model initialization (days).
       0.00  time_ref        Reference time for units attribute (yyyymmdd.dd)
 0.0000E+00  Tnudg(01)       Nudging/relaxation time scale (days)
                               for tracer 01: temp
 0.0000E+00  Tnudg(02)       Nudging/relaxation time scale (days)
                               for tracer 02: salt
 0.0000E+00  Znudg           Nudging/relaxation time scale (days)
                               for free-surface.
 0.0000E+00  M2nudg          Nudging/relaxation time scale (days)
                               for 2D momentum.
 0.0000E+00  M3nudg          Nudging/relaxation time scale (days)
                               for 3D momentum.
 0.0000E+00  obcfac          Factor between passive and active
                               open boundary conditions.
      4.000  T0              Background potential temperature (C) constant.
     32.000  S0              Background salinity (PSU) constant.
     -1.000  gamma2          Slipperiness variable: free-slip (1.0) or
                                                    no-slip (-1.0).
          T  Hout(idFsur)    Write out free-surface.
          T  Hout(idUbar)    Write out 2D U-momentum component.
          T  Hout(idVbar)    Write out 2D V-momentum component.
          T  Hout(idUvel)    Write out 3D U-momentum component.
          T  Hout(idVvel)    Write out 3D V-momentum component.
          T  Hout(idWvel)    Write out W-momentum component.
          T  Hout(idOvel)    Write out omega vertical velocity.
          T  Hout(idTvar)    Write out tracer 01: temp
          T  Hout(idTvar)    Write out tracer 02: salt
          T  Hout(idDano)    Write out density anomaly.
          T  Hout(idVvis)    Write out vertical viscosity coefficient.
          T  Hout(idTdif)    Write out vertical T-diffusion coefficient.
          T  Hout(idSdif)    Write out vertical S-diffusion coefficient.
          T  Hout(idHsbl)    Write out depth of surface boundary layer.
          T  Hout(idHbbl)    Write out depth of bottom boundary layer.

 Output/Input Files:

             Output Restart File:  ocean_rst.nc
             Output History File:  ocean_his.nc
            Output Averages File:  ocean_avg.nc
                 Input Grid File:  Apps/RIVERPLUME1/riverplume1_grid.nc
    IO Variable Information File:  ROMS/External/varinfo.dat

 Tile partition information for Grid 01:  0039x0067x0013  tiling: 002x002

 Number of tracers:            2

     tile     Istr     Iend     Jstr     Jend     Npts

        0        1       20        1       34     8840
        1       21       39        1       34     8398
        2        1       20       35       67     8580
        3       21       39       35       67     8151

 Tile minimum and maximum fractional grid coordinates:
   (interior points only)

     tile     Xmin     Xmax     Ymin     Ymax     grid

        0     0.50    20.50     0.50    34.50  RHO-points
        1    20.50    39.50     0.50    34.50  RHO-points
        2     0.50    20.50    34.50    67.50  RHO-points
        3    20.50    39.50    34.50    67.50  RHO-points

        0     1.00    20.50     0.50    34.50    U-points
        1    20.50    39.00     0.50    34.50    U-points
        2     1.00    20.50    34.50    67.50    U-points
        3    20.50    39.00    34.50    67.50    U-points

        0     0.50    20.50     1.00    34.50    V-points
        1    20.50    39.50     1.00    34.50    V-points
        2     0.50    20.50    34.50    67.00    V-points
        3    20.50    39.50    34.50    67.00    V-points

 Maximum halo size in XI and ETA directions:

               HaloSizeI(1) =      93
               HaloSizeJ(1) =     141
                TileSide(1) =      41
                TileSize(1) =    1025


 Activated C-preprocessing Options:

  RIVERPLUME1        River Plume Test
  ANA_BSFLUX         Analytical kinematic bottom salinity flux.
  ANA_BTFLUX         Analytical kinematic bottom temperature flux.
  ANA_INITIAL        Analytical initial conditions.
  ANA_PSOURCE        Analytical point sources and sinks.
  ANA_SMFLUX         Analytical kinematic surface momentum flux.
  ANA_SRFLUX         Analytical kinematic shortwave radiation flux.
  ANA_SSFLUX         Analytical kinematic surface salinity flux.
  ANA_STFLUX         Analytical kinematic surface temperature flux.
  ASSUMED_SHAPE      Using assumed-shape arrays.
  AVERAGES           Writing out time-averaged fields.
  AVERAGES_AKS       Writing out time-averaged vertical S-diffusion.
  AVERAGES_AKT       Writing out time-averaged vertical T-diffusion.
  DJ_GRADPS          Parabolic Splines density Jacobian (Shchepetkin, 2002).
  DOUBLE_PRECISION   Double precision arithmetic.
  EASTERN_WALL       Wall boundary at Eastern edge.
  LMD_BKPP           KPP bottom boundary layer mixing.
  LMD_CONVEC         LMD convective mixing due to shear instability.
  LMD_MIXING         Large/McWilliams/Doney interior mixing.
  LMD_NONLOCAL       LMD convective nonlocal transport.
  LMD_RIMIX          LMD diffusivity due to shear instability.
  LMD_SKPP           KPP surface boundary layer mixing.
  MASKING            Land/Sea masking.
  MIX_GEO_TS         Mixing of tracers along geopotential surfaces.
  MPI                MPI distributed-memory configuration.
  NONLINEAR          Nonlinear Model.
  NONLIN_EOS         Nonlinear Equation of State for seawater.
  NS_PERIODIC        North-South periodic boundaries.
  POWER_LAW          Power-law shape time-averaging barotropic filter.
  PROFILE            Time profiling activated .
  !RST_SINGLE        Double precision fields in restart NetCDF file.
  SALINITY           Using salinity.
  SOLVE3D            Solving 3D Primitive Equations.
  SPLINES            Conservative parabolic spline reconstruction.
  TS_A4HADVECTION    Fourth-order Akima horizontal advection of tracers.
  TS_A4VADVECTION    Fourth-order Akima vertical advection of tracers.
  TS_DIF2            Harmonic mixing of tracers.
  TS_PSOURCE         Tracers point sources and sinks.
  UV_ADV             Advection of momentum.
  UV_COR             Coriolis term.
  UV_U3HADVECTION    Third-order upstream horizontal advection of 3D momentum.
  UV_C4VADVECTION    Fourth-order centered vertical advection of momentum.
  UV_QDRAG           Quadratic bottom stress.
  UV_PSOURCE         Mass point sources and sinks.
  VAR_RHO_2D         Variable density barotropic mode.
  WESTERN_WALL       Wall boundary at Western edge.

 INITIAL: Configuring and initializing forward nonlinear model ...


 Vertical S-coordinate System:

 level   S-coord     Cs-curve          at_hmin  over_slope     at_hmax

    13   0.0000000   0.0000000           0.000       0.000       0.000
    12  -0.0769231  -0.0253369          -1.154      -3.371      -5.588
    11  -0.1538462  -0.0568884          -2.308      -7.285     -12.263
    10  -0.2307692  -0.0971871          -3.462     -11.965     -20.469
     9  -0.3076923  -0.1484861          -4.615     -17.608     -30.600
     8  -0.3846154  -0.2119251          -5.769     -24.313     -42.856
     7  -0.4615385  -0.2867031          -6.923     -32.010     -57.096
     6  -0.5384615  -0.3700543          -8.077     -40.457     -72.836
     5  -0.6153846  -0.4585665          -9.231     -49.355     -89.480
     4  -0.6923077  -0.5502087         -10.385     -58.528    -106.671
     3  -0.7692308  -0.6456884         -11.538     -68.036    -124.534
     2  -0.8461538  -0.7485087         -12.692     -78.187    -143.681
     1  -0.9230769  -0.8642669         -13.846     -89.470    -165.093
     0  -1.0000000  -1.0000000         -15.000    -102.500    -190.000

 Time Splitting Weights: ndtfast =  20    nfast =  29

    Primary            Secondary            Accumulated to Current Step

  1-0.0009651193358779 0.0500000000000000-0.0009651193358779 0.0500000000000000
  2-0.0013488780126037 0.0500482559667939-0.0023139973484816 0.1000482559667939
  3-0.0011514592651645 0.0501156998674241-0.0034654566136461 0.1501639558342180
  4-0.0003735756740661 0.0501732728306823-0.0038390322877122 0.2003372286649003
  5 0.0009829200513762 0.0501919516143856-0.0028561122363360 0.2505291802792859
  6 0.0029141799764308 0.0501428056118168 0.0000580677400948 0.3006719858911028
  7 0.0054132615310267 0.0499970966129953 0.0054713292711215 0.3506690825040981
  8 0.0084687837865133 0.0497264335364439 0.0139401130576348 0.4003955160405420
  9 0.0120633394191050 0.0493029943471183 0.0260034524767397 0.4496985103876603
 10 0.0161716623600090 0.0486998273761630 0.0421751148367487 0.4983983377638233
 11 0.0207585511322367 0.0478912442581626 0.0629336659689855 0.5462895820219859
 12 0.0257765478740990 0.0468533167015507 0.0887102138430845 0.5931428987235365
 13 0.0311633730493854 0.0455644893078458 0.1198735868924699 0.6387073880313823
 14 0.0368391158442262 0.0440063206553765 0.1567127027366961 0.6827137086867587
 15 0.0427031802506397 0.0421643648631652 0.1994158829873358 0.7248780735499240
 16 0.0486309868367617 0.0400292058506332 0.2480468698240975 0.7649072794005571
 17 0.0544704302037592 0.0375976565087951 0.3025173000278567 0.8025049359093522
 18 0.0600380921294286 0.0348741349986072 0.3625553921572853 0.8373790709079594
 WPB:metrics:pm=0 at i,j =           21          67
 19 0.0651152103984763 0.0318722303921357 0.4276706025557617 0.8692513013000951
 20 0.0694434033194840 0.0286164698722119 0.4971140058752457 0.8978677711723070
 21 0.0727201499285570 0.0251442997062377 0.5698341558038027 0.9230120708785448
 22 0.0745940258796570 0.0215082922098099 0.6444281816834597 0.9445203630883546
 23 0.0746596950216180 0.0177785909158270 0.7190878767050777 0.9622989540041816
 24 0.0724526566618460 0.0140456061647461 0.7915405333669236 0.9763445601689278
 25 0.0674437485167025 0.0104229733316538 0.8589842818836262 0.9867675335005817
 26 0.0590334053485719 0.0070507859058187 0.9180176872321981 0.9938183194064003
 27 0.0465456732896125 0.0040991156383901 0.9645633605218106 0.9979174350447904
 28 0.0292219798521904 0.0017718319739095 0.9937853403740009 0.9996892670186999
 29 0.0062146596259991 0.0003107329813000 1.0000000000000000 0.9999999999999998

 ndtfast, nfast =   20  29   nfast/ndtfast = 1.45000

 WPB:metrics:pm=0 at i,j =            0          67
 Centers of gravity and integrals (values must be 1, 1, approx 1/2, 1, 1):

    1.000000000000 1.060707743385 0.530353871693 1.000000000000 1.000000000000
Power filter parameters, Fgamma, gamma =  0.28400   0.14200
rank 3 in job 11 valborg_34829 caused collective abort of all ranks
exit status of rank 3: killed by signal 6
rank 2 in job 11 valborg_34829 caused collective abort of all ranks
exit status of rank 2: killed by signal 6

------------------------------------------------------------------------------------------

And the run time error output (from using the -g -fpe0 -traceback ifort options) is:

------------------------------------------------------------------------------------------

/heim/paul/roms_3.2_trunk % mpiexec -n 4 oceanG Apps/RIVERPLUME1/ocean_riverplume1.in > riverplume1.out
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
oceanG 00000000005780D2 metrics_mod_mp_me 197 metrics.f90
oceanG 00000000005758F6 metrics_mod_mp_me 57 metrics.f90
oceanG 00000000004053ED initial_ 129 initial.f90
oceanG 0000000000404867 Unknown Unknown Unknown
oceanG 000000000040462D MAIN__ 97 master.f90
oceanG 00000000004044EA Unknown Unknown Unknown
libc.so.6 0000003DA8B1C40B Unknown Unknown Unknown
oceanG 000000000040442A Unknown Unknown Unknown
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
oceanG 00000000005780D2 metrics_mod_mp_me 197 metrics.f90
oceanG 00000000005758F6 metrics_mod_mp_me 57 metrics.f90
oceanG 00000000004053ED initial_ 129 initial.f90
oceanG 0000000000404867 Unknown Unknown Unknown
oceanG 000000000040462D MAIN__ 97 master.f90
oceanG 00000000004044EA Unknown Unknown Unknown
libc.so.6 0000003DA8B1C40B Unknown Unknown Unknown
oceanG 000000000040442A Unknown Unknown Unknown
[cli_0]: aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(140).............................: MPI_Wait(request=0x7fbfffe010, status0x96a414) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=104:Connection reset by peer)
[cli_1]: aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(140).............................: MPI_Wait(request=0x7fbfffe010, status0x96a414) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=104:Connection reset by peer)
10 /heim/paul/roms_3.2_trunk %


--------------------------------------------------------------------

I used to be able to run MPI with the NS_PERIODIC option and grid files, but I'm not sure when this problem arose. I will run with some earlier versions to try to identify when the problem first cropped up.

Paul_Budgell
Posts: 18
Joined: Wed Apr 23, 2003 1:34 pm
Location: IMR, Bergen, Norway

Re: Probable bug in I/O when MPI and NS_PERIODIC are used

#2 Post by Paul_Budgell » Fri Feb 06, 2009 1:52 am

I just ran the same setup on version 298 and it worked fine. So it looks as though the bug(s) were introduced in the last major upgrade, with the changes in parallel I/O.
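With version 298 known to work and version 306 known to fail, the first failing revision could be narrowed down by bisection. The sketch below is a generic binary search in Python, not a ROMS or SVN tool. The names `first_bad_revision` and `pretend_test` are hypothetical, and the revision at which `pretend_test` starts failing is made up purely so the example runs. In practice the test would check out a revision, build it, and run the grid-file RIVERPLUME1 case.

```python
def first_bad_revision(good, bad, is_good):
    """Binary search for the earliest failing revision.

    Assumes revision `good` passes, revision `bad` fails, and every
    revision after the first failing one also fails.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_good(mid):
            good = mid   # failure was introduced after mid
        else:
            bad = mid    # failure was introduced at or before mid
    return bad


def pretend_test(revision):
    # Hypothetical stand-in for "check out, build, run the test case".
    # The threshold 303 is invented for the demo, not a known fact.
    return revision < 303


culprit = first_bad_revision(298, 306, pretend_test)
print(culprit)
```

For the eight revisions between 298 and 306 this needs only three build-and-run cycles.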

arango
Site Admin
Posts: 1131
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University

Re: Probable bug in I/O when MPI and NS_PERIODIC are used

#3 Post by arango » Fri Feb 06, 2009 3:14 am

I just ran RIVERPLUME1 using a 2x2 partition, as you did, and it is working for me with pgi and ifort. I also used both MPICH1 and MPICH2. I was not able to reproduce your problem. I also checked pm in the debugger at the bottom of metrics.F and it looks fine. I don't have any zeros; instead I have:

(17,66)    0.000666666666666667
(18,66)    0.000666666666666667
(19,66)    0.000666666666666667
(20,66)    0.000666666666666667
(21,66)    0.000666666666666667
(22,66)    0.000666666666666667
(0,67)     0.000666666666666667
(1,67)     0.000666666666666667
(2,67)     0.000666666666666667
(3,67)     0.000666666666666667
(4,67)     0.000666666666666667
(5,67)     0.000666666666666667

Are you sure that you updated the code and are using the latest version? Yes, version 306 is the latest one. It was loaded on 2009-02-03 16:47:10 -0500 (Tue, 03 Feb 2009). By the way, I used ifort with -check uninit -ftrapuv -traceback. So I don't know where your problem is.

By the way, it also works with gfortran. Which version of ifort are you using? I am using:

ifort (IFORT) 10.1 20080801

Paul_Budgell
Posts: 18
Joined: Wed Apr 23, 2003 1:34 pm
Location: IMR, Bergen, Norway

Re: Probable bug in I/O when MPI and NS_PERIODIC are used

#4 Post by Paul_Budgell » Fri Feb 06, 2009 3:37 am

Did you #undef ANA_GRID and read in a grid file? The default RIVERPLUME1 configuration works fine because it doesn't input the grid data.

arango
Site Admin
Posts: 1131
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University

Re: Probable bug in I/O when MPI and NS_PERIODIC are used

#5 Post by arango » Fri Feb 06, 2009 6:28 am

Yes, I corrected the problem. I was using the wrong indices in mp_scatter2d and mp_scatter3d. I need to use the indices in the IOBOUNDS(ng) structure. See the following trac ticket for more details.
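To illustrate the general failure mode (not the actual ROMS routines), here is a minimal Python sketch of scattering a global field into one tile with two different index ranges. Everything in it is an assumption for illustration: the helpers `new_tile` and `scatter`, the dictionary storage, the uniform value 1/1500, and the choice of which boundary points the tile stores. Only the interior bounds of tile 2 (I = 1..20, J = 35..67) come from the log above. If the copy covers only the interior range, a stored boundary point such as (0,67) keeps its initial zero.

```python
def new_tile(i_lo, i_hi, j_lo, j_hi):
    """Allocate a tile as a dict keyed by (i, j), initialised to zero."""
    return {(i, j): 0.0
            for j in range(j_lo, j_hi + 1)
            for i in range(i_lo, i_hi + 1)}


def scatter(global_field, tile, i_lo, i_hi, j_lo, j_hi):
    """Copy global values into the tile over an inclusive index range."""
    for j in range(j_lo, j_hi + 1):
        for i in range(i_lo, i_hi + 1):
            tile[(i, j)] = global_field[(i, j)]


PM_VALUE = 1.0 / 1500.0  # illustrative uniform metric
global_pm = {(i, j): PM_VALUE for j in range(69) for i in range(41)}

# Assumed storage for tile 2: interior points plus the i=0 column and j=68 row.
tile_interior_only = new_tile(0, 20, 35, 68)
scatter(global_pm, tile_interior_only, 1, 20, 35, 67)   # interior bounds only

tile_with_boundary = new_tile(0, 20, 35, 68)
scatter(global_pm, tile_with_boundary, 0, 20, 35, 68)   # bounds include edges

print(tile_interior_only[(0, 67)], tile_with_boundary[(0, 67)])
```

The first tile still holds a zero at (0,67), so any later division by that value would trap under -fpe0, while the second tile is fully populated.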

Thanks for reporting this problem. Please update.
