Problems running re-start solutions.

Bug reports, work arounds and fixes

Moderators: arango, robertson

colucix
Posts: 7
Joined: Fri Jan 26, 2007 7:27 pm
Location: Università Politecnica delle Marche, ITALY

Problems running re-start solutions.

#1 Unread post by colucix »

Dear ROMS community,

I have a problem running the current stable release 3.3 from a re-start solution (NRREC ≠ 0). We have been successfully running an operational forecasting system in the Adriatic Sea for over two years using release 3.0, and I have never encountered the problem I am about to describe.

The short story: I ran a long simulation from 28-May-2007 to 30-Nov-2009. The model (latest revision 442) was compiled with the PERFECT_RESTART directive. I then attempted a restart from 29-Nov-2009 in order to have a 24-hour overlap and check the differences in the output. Unfortunately (and this is the problem) the model blows up at the very first time step, right after the initialization phase.

To investigate, I re-compiled the model without the PERFECT_RESTART directive, ran it, and it completed successfully! At this point I suspect a bug, since the model equations have already been solved over the interval from 29-Nov-2009 to 30-Nov-2009 (both in the long simulation and in the "no perfect restart" build), and I can reasonably exclude a problem in my implementation.

I currently use the Nonlinear model, and the same issue occurs with the latest revisions of release 3.2, which I tried previously.

Here is my question: has anyone noticed the same behavior when trying to run a hot restart?

Any hint to track down the alleged bug is welcome. I can provide many more details if needed. In the meantime, my hunt continues...

Thank you,
Alex

arango
Site Admin
Posts: 1350
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Problems running re-start solutions.

#2 Unread post by arango »

Well, this looks like a field is missing in the perfect restart. It all depends on the C-preprocessing options used in your application. Perfect restart requires several fields, and it is possible that we don't have all the restart fields required for every combination of ROMS options. This should be fairly straightforward to track down by examining the standard output carefully: look for the range (min, max) of the fields processed during initialization. Also check the fields in the restart file.

You haven't provided enough information for us to get an idea of where the problem is. A debugging session will help a lot here, since the model blows up right away.
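For example, a small stand-alone Fortran program along the following lines can print the range of a restart field for comparison with the initialization ranges in the standard output (a sketch only; the file name ocean_rst.nc and the variable name temp are placeholders, and the variable is assumed to be 4-D):

Code: Select all

      PROGRAM check_rst_range
!
!  Stand-alone sketch: print the min/max of one 4-D variable in a
!  restart NetCDF, for comparison with the ranges reported in the
!  ROMS standard output during initialization.  The file name and
!  variable name are placeholders; adjust them to your application.
!
      USE netcdf
      IMPLICIT NONE
      CHARACTER (len=*), PARAMETER :: fname='ocean_rst.nc'
      CHARACTER (len=*), PARAMETER :: vname='temp'
      INTEGER :: ncid, varid, i, status
      INTEGER :: dimids(4), dimlen(4)
      REAL (kind=8), ALLOCATABLE :: field(:,:,:,:)

      status=nf90_open(fname, NF90_NOWRITE, ncid)
      IF (status.ne.NF90_NOERR) STOP 'cannot open restart file'
      status=nf90_inq_varid(ncid, vname, varid)
      IF (status.ne.NF90_NOERR) STOP 'variable not found'
      status=nf90_inquire_variable(ncid, varid, dimids=dimids)
      DO i=1,4
        status=nf90_inquire_dimension(ncid, dimids(i), len=dimlen(i))
      END DO
      ALLOCATE (field(dimlen(1),dimlen(2),dimlen(3),dimlen(4)))
      status=nf90_get_var(ncid, varid, field)
      PRINT *, vname, ': min =', MINVAL(field)
      PRINT *, vname, ': max =', MAXVAL(field)
      status=nf90_close(ncid)
      END PROGRAM check_rst_range

Compile it against the same NetCDF library used to build ROMS and repeat for each field stored in the restart file.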

colucix
Posts: 7
Joined: Fri Jan 26, 2007 7:27 pm
Location: Università Politecnica delle Marche, ITALY

Re: Problems running re-start solutions.

#3 Unread post by colucix »

Dear Prof. Arango,

thank you for your suggestions. Indeed, I checked the min/max values in the standard output against the min/max of the actual fields in the restart file, and they look fine. Moreover, I didn't notice anything strange in the restart fields after inspecting them one by one, nor can I see anything missing (though that could be my oversight).

There is news, however. The production machine is a Linux cluster that uses the Portland Fortran compiler. I tried compiling with the DEBUG flag in the makefile, and after that the re-start run went straight to completion!

I then played with the compilation options and, in addition, tried compiling on a multi-processor workstation that uses Intel Fortran. Here is a summary of my results (all with the PERFECT_RESTART directive activated):

1) Using DEBUG on Linux cluster: it RUNS.
Compiler system : pgi
Compiler command : /usr/bin/mpif90
Compiler flags : -g -C -Mfree

2) Changing the optimization to level 1 on Linux cluster: it RUNS
Compiler system : pgi
Compiler command : /usr/bin/mpif90
Compiler flags : -O1 -tp k8-64 -Mfree

3) Changing the optimization to level 2 (or 3 as per default) on Linux cluster: BLOW-UP
Compiler system : pgi
Compiler command : /usr/bin/mpif90
Compiler flags : -O2 -tp k8-64 -Mfree

4) Trying a shared memory configuration on Linux cluster with OpenMP: BLOW-UP
Compiler system : pgi
Compiler command : /usr/local/pgi/linux86-64/last/bin/pgf90
Compiler flags : -mp -O3 -tp k8-64 -Mfree

5) Using Intel Fortran on Linux Workstation with OpenMP: it RUNS!
Compiler system : ifort
Compiler command : /opt/intel/fce/10.0.025/bin/ifort
Compiler flags : -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free

At this point I feel I was looking for the problem in the wrong place. Maybe it is not actually a bug in the code, but something related to the optimization scheme of a specific Fortran compiler. Unfortunately I cannot try Intel Fortran on the Linux cluster right now.

I would have to dig deeply into the way the compiler handles the ROMS code and perhaps play with the compiler options, but I'm afraid that is far beyond my knowledge.

For completeness, here is the model set-up:

Code: Select all

#define PERFECT_RESTART

#define UV_ADV
#define UV_COR
#define UV_PSOURCE
#define DJ_GRADPS
#define TS_MPDATA
#define TS_DIF2
#define MIX_GEO_TS
#define TS_PSOURCE
#define NONLIN_EOS
#define SALINITY
#define MASKING
#define SOLVE3D
#define SPLINES
#define CURVGRID

#define AVERAGES
#define AVERAGES_AKV
#define AVERAGES_AKT
#define AVERAGES_AKS
#define AVERAGES_FLUXES

#define NOSEDBBL
#ifdef  NOSEDBBL
# undef  SEDIMENT
# undef  SUSPLOAD
# undef  ANA_SEDIMENT
# undef  ANA_WWAVE
# undef  RIVER_SEDIMENT
#else
# define SEDIMENT
# define SUSPLOAD
# undef  ANA_SEDIMENT
# undef  ANA_WWAVE
# define RIVER_SEDIMENT
#endif

#define UV_LOGDRAG

#define GLS_MIXING
#ifdef GLS_MIXING
# define KANTHA_CLAYSON
# define N2S2_HORAVG
# define CRAIG_BANNER
# define CHARNOK
#endif

#define ANA_SSFLUX
#undef  ANA_SPFLUX
#define ANA_BSFLUX
#define ANA_BTFLUX
#undef  ANA_BPFLUX

#define BULK_FLUXES
#ifdef BULK_FLUXES
# define LONGWAVE
# define SOLAR_SOURCE
# undef  ANA_RAIN
# undef  COOL_SKIN
# define EMINUSP
# undef  ATM_PRESS
#endif

#define WESTERN_WALL
#define NORTHERN_WALL
#define EASTERN_WALL
#define RADIATION_2D
#define SOUTH_M3RADIATION
#define SOUTH_TRADIATION

#define SOUTH_TNUDGING
#define SOUTH_M3NUDGING

#define SSH_TIDES
#ifdef SSH_TIDES
# define SOUTH_FSCHAPMAN
# define ANA_FSOBC
#else
# define SOUTH_FSGRADIENT
#endif

#define UV_TIDES
#ifdef UV_TIDES
# define SOUTH_M2FLATHER
# define ANA_M2OBC
#else
# define SOUTH_M2RADIATION
#endif

Any hint is highly appreciated (I feel a bit lost now...).

Sorry for the long post. Thank you,
Alex

ggerbi
Posts: 14
Joined: Thu Jun 12, 2008 6:03 pm
Location: University of Maine

Re: Problems running re-start solutions.

#4 Unread post by ggerbi »

Alex,

I've had a nearly identical problem recently. I would run the model for two weeks, then use perfect restart to run for a few more days.

I was only using PGI. When I optimized at level 3 (or 2) it blew up (sometimes it would get through a couple of time steps, I believe).

When I optimized at level 1 it ran fine. My runs were relatively short, so I stuck with that solution.

Greg

colucix
Posts: 7
Joined: Fri Jan 26, 2007 7:27 pm
Location: Università Politecnica delle Marche, ITALY

Re: Problems running re-start solutions.

#5 Unread post by colucix »

Greg,

thank you for reporting this. It now definitely looks like a problem related to PGI optimization. Furthermore, I succeeded in compiling with Intel Fortran on the Linux cluster (shared-memory only) and it runs flawlessly:

Using Intel Fortran on Linux Cluster with OpenMP: it RUNS!
Compiler system : ifort
Compiler command : /lhome/swan/intel/Compiler/11.0/074/bin/intel64/ifort
Compiler flags : -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free

Now I'm investigating whether some specific compiler option can solve this issue, and at the same time I'm looking for the structure in the code (maybe a loop or a floating point exception) that triggers this error when PERFECT_RESTART is used.

Alex

robertson
Site Admin
Posts: 219
Joined: Wed Feb 26, 2003 3:12 pm
Location: IMCS, Rutgers University

Re: Problems running re-start solutions.

#6 Unread post by robertson »

A question and a suggestion.

Question(s):
Do your Linux cluster compute nodes have AMD Opteron processors? If not, you should remove the '-tp k8-64' flag or replace 'k8-64' with a more appropriate value. Even if they are Opterons, you might get better performance with 'k8-64e'. Check the pgf90 man pages for more details.

The '-tp' flag sets the target processor type to aid optimization; it refers to the processor you intend to RUN the model on. In certain circumstances, getting this value right will improve the computing speed of your model. If you remove the flag, PGI will optimize for the type of processor used to compile the model. This probably isn't causing your problem, but it could be slowing your code down, especially if you have a newer Intel processor (Xeon 5500 series, for example).

I'm also curious why you have added '-Mfree' to your FFLAGS.

Suggestion:
You might want to add some strictness to your floating point operations by using '-Kieee' together with '-O3' in your FFLAGS. This might help, because debug mode may only be working because it disables floating point optimizations (read: short-cuts). The '-Kieee' flag forces the compiler to follow the IEEE 754 standard for floating point math, but will not prevent other optimizations such as loop vectorization.

Dave

colucix
Posts: 7
Joined: Fri Jan 26, 2007 7:27 pm
Location: Università Politecnica delle Marche, ITALY

Re: Problems running re-start solutions.

#7 Unread post by colucix »

Dear Dave,

thank you for your suggestions. After digging into the meaning of the compiler options, I realized the k8-64 flag is not correct: this machine has Intel Xeon 5500 CPUs. I changed the processor target to p7-64, but the result is the same. Moreover, I tried the -Kieee flag to conform to IEEE 754, and it does not help either. Nevertheless, my main concern is still floating point optimization, as you suggest.

Regarding the -Mfree option, I admit I simply inherited it from the available configuration template "Linux-pgi.mk". As far as I can see, it is applied only to some source files, to permit the use of long strings; it is reported along with the other FFLAGS in the standard output, but I did not apply it to the whole source code.

I'm now going to try the Portland debugger, to see if I can retrieve some useful information about this mystery.

Alex

colucix
Posts: 7
Joined: Fri Jan 26, 2007 7:27 pm
Location: Università Politecnica delle Marche, ITALY

Re: Problems running re-start solutions.

#8 Unread post by colucix »

Dear ROMS people,

after a while I resumed looking at this problem, and I can finally answer my own questions. Here we go.

Synopsis – I use a Nonlinear implementation of ROMS in the Adriatic Sea, and every time I run a restart solution using PERFECT_RESTART the model blows up at the very first time step. This applies to all recent ROMS releases and occurs when the following conditions are met:

1. PERFECT RESTART is defined
2. Fortran compiler is PGI or gfortran (Intel fortran works instead)
3. Optimization level is -O2 or higher
4. GLS_MIXING is active (MY25 works instead)
5. NRREC ≠ 0

Diagnosis – I compiled the model with debugging options, specifically by means of

Code: Select all

      FFLAGS += -O3 -Ktrap=fp -g
in Linux-pgi.mk, and this revealed a floating point exception arising in the module gls_corstep. In particular, the first non-finite number appears in the dissipation term BCK.

A little off-topic: let me infinitely thank Davide Cesari (the Genius) from the Hydro-Meteo-Clima Service of ARPA Emilia Romagna, who helped me a lot with the core analysis and the debugging work.

Back on topic. A careful analysis revealed that Inf and NaN values appear only at land points, and they are due to zero values in the turbulent kinetic energy (tke) and the turbulent generic length scale (gls). The fractional and power terms in the equation for BCK generate the offending exception.

However, the model's blow-up arises later in the computation, due to NaNs propagating into the vertical diffusion coefficient for tracers (Akt). Indeed, if I apply the land/sea mask to Akt toward the end of gls_corstep and assign a low value of 1.0E-08 to the land points, the model runs flawlessly:

Code: Select all

!-----------------------------------------------------------------------
!  Apply Land/Sea mask and set a minimum value in land points.
!-----------------------------------------------------------------------
      DO k=1,N(ng)-1
        DO j=Jstr,Jend
          DO i=Istr,Iend
            DO itrc=1,NAT
              Akt(i,j,k,itrc)=MAX(Akt(i,j,k,itrc)*                      &
     &                        rmask(i,j),1.0E-08_r8)
            END DO
          END DO
        END DO
      END DO

Please note that, in this build, the masked product followed by the MAX with 1.0E-08 yields a finite value at land points (under strict IEEE arithmetic, multiplying a NaN by 0 would itself give NaN).

Questions & Answers

Q. Why does the model blow-up in a restart solution using PERFECT RESTART?
A.
The module mod_mixing initializes the mixing variables and sets tke and gls to their minimum values, as defined by the parameters gls_Kmin and gls_Pmin, respectively. After the initialization the model calls get_state, which reads the fields from the restart file. The subroutine nf_fread4d assigns 0 to the elements matching the _FillValue in the NetCDF file (the land points). These zero values propagate through the calculations inside gls_prestep and gls_corstep and produce the floating point exceptions mentioned above. I don't know whether this can be called a ROMS bug, but it is undoubtedly the cause of the problem.

Q. Why does the model run if using the Mellor-Yamada scheme?
A.
In the Mellor-Yamada formulation there are no divisions by zero or zeros raised to a negative power (as there are in the Generic Length Scale scheme), so the floating point exceptions do not occur.
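
As a trivial stand-alone illustration (a sketch only; the exponent below is illustrative, not the actual GLS exponent), a zero length scale raised to a negative power reproduces the kind of exception trapped above:

Code: Select all

      PROGRAM zero_power
!
!  Sketch: a zero value raised to a negative exponent, as happens in
!  the GLS dissipation term when tke and gls are zero at land points,
!  gives Infinity (or a trapped divide-by-zero exception when the
!  code is compiled with -Ktrap=fp).  The exponent is illustrative.
!
      IMPLICIT NONE
      REAL (kind=8) :: gls_land, diss

      gls_land=0.0d0                   ! land point zeroed after reading
      diss=gls_land**(-1.5d0)          ! a division by zero in disguise
      PRINT *, '0**(-1.5) = ', diss    ! Infinity when not trapped
      END PROGRAM zero_power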

Q. Why does the level of optimization trigger the model blow-up? And why does the Intel fortran compiler work at any level of optimization?
A.
This is the interesting one. I should have asked myself this question earlier, since the real issue is why Intel works, not why PGI doesn't.

At some point in the module gls_corstep we have:

Code: Select all

!
!  Compute turbulent length scale (m).
!
            tke(i,j,k,nnew)=MAX(tke(i,j,k,nnew),gls_Kmin(ng))
            gls(i,j,k,nnew)=MAX(gls(i,j,k,nnew),gls_Pmin(ng))
Well... if one of the arguments of the MAX intrinsic function is a NaN, the Portland and GNU compilers return NaN as the result. This happens only at optimization levels -O2 and -O3. Moreover, I suspect it is related to the MPI libraries and/or some odd interaction with the math libraries. The result is that the NaNs are left untouched and propagate into the calculation of the vertical mixing coefficients.

The Intel compiler, on the contrary, returns the finite argument, so the minimum gls_Kmin or gls_Pmin is restored at land points and the NaNs disappear from the subsequent calculations. In short, the exceptions arise even in the model compiled with Intel, but the offending values disappear before the model can blow up.
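
A tiny stand-alone test (a sketch, independent of ROMS) makes this easy to check on any machine; the Fortran standard leaves MAX with a NaN argument processor dependent, so the printed results can legitimately differ between compilers and optimization levels:

Code: Select all

      PROGRAM nan_probe
!
!  Sketch: generate a quiet NaN and evaluate the two expressions
!  discussed above.  Compile with different compilers and -O levels
!  (for example pgf90 -O1 versus -O3, ifort, gfortran) and compare
!  the printed results.
!
      USE, INTRINSIC :: ieee_arithmetic
      IMPLICIT NONE
      REAL (kind=8) :: nan, masked, clipped

      nan=ieee_value(nan, ieee_quiet_nan)   ! quiet NaN
      masked=nan*0.0d0                      ! masking by rmask=0
      clipped=MAX(nan, 1.0d-8)              ! clip to a minimum value
      PRINT *, 'NaN*0        = ', masked
      PRINT *, 'MAX(NaN,eps) = ', clipped
      END PROGRAM nan_probe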

Workarounds – Right now I work around the problem by reassigning gls_Kmin and gls_Pmin to tke and gls after they are read from the restart file and before they enter the calculations. In gls_prestep.F I added the following lines immediately after the variable declarations:

Code: Select all

# ifdef PERFECT_RESTART
      DO k=0,N(ng)
        DO j=Jstr,Jend
          DO i=Istr,Iend
            tke(i,j,k,1)=MAX(tke(i,j,k,1),gls_Kmin(ng))
            tke(i,j,k,2)=MAX(tke(i,j,k,2),gls_Kmin(ng))
            gls(i,j,k,1)=MAX(gls(i,j,k,1),gls_Pmin(ng))
            gls(i,j,k,2)=MAX(gls(i,j,k,2),gls_Pmin(ng))
          END DO
        END DO
      END DO
# endif
but there must be a more standard way to do this following the ROMS coding conventions.

Another workaround might be to force the compilation of gls_corstep at optimization level -O1, for example by adding the following lines to Linux-pgi.mk:

Code: Select all

#
# Change optimization level for gls_corstep module
#
$(SCRATCH_DIR)/gls_corstep.o: FFLAGS := $(FFLAGS:-O3=-O1)
but I tend to reject this solution, since it does not avoid the floating point exceptions and it might change other results, not to mention possible performance losses.

Again, sorry for the very long post. Any comments and suggestions are greatly appreciated.

Kind regards,
Alex

jcwarner
Posts: 1180
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: Problems running re-start solutions.

#9 Unread post by jcwarner »

WOW!
It looks like you spent some time in the code! If we handed out star awards for debugging, you would be at the top of the list. I remember when it was decided to add all these _FillValue attributes, and many people thought it would be straightforward to apply them. Over the years many issues have arisen from the fill values, for example with wetting/drying and with morphology, and this appears to be another case.

I see in wrt_rst that there is a way to write out the bathymetry with SetFillVal=.false. Maybe there should be a similar mechanism when variables are read back in, so that the fill values are not replaced with zeros. In mod_mixing the Akt and Akv are initialized to the background values, so if those fields are read from a NetCDF restart file they should not be set to zero on land; they should be set to the minimum _bak values. Please post a bug ticket and ask that the Akv and Akt values over land be set to the _bak values rather than to zero.
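
For illustration only, the kind of change being asked for might look roughly like the fragment below, applied after the restart fields have been read (a sketch, not the actual ROMS fix; the loop bounds and array names follow the earlier snippets in this thread, and Akv_bak/Akt_bak stand for the background values set in the input script):

Code: Select all

!
!  Sketch only (not the official fix): reset the vertical mixing
!  coefficients at land points to the background values instead of
!  the zeros coming from the restart file _FillValue elements.
!
      DO k=0,N(ng)
        DO j=Jstr,Jend
          DO i=Istr,Iend
            IF (rmask(i,j).eq.0.0_r8) THEN
              Akv(i,j,k)=Akv_bak(ng)
              DO itrc=1,NAT
                Akt(i,j,k,itrc)=Akt_bak(itrc,ng)
              END DO
            END IF
          END DO
        END DO
      END DO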

colucix
Posts: 7
Joined: Fri Jan 26, 2007 7:27 pm
Location: Università Politecnica delle Marche, ITALY

Re: Problems running re-start solutions.

#10 Unread post by colucix »

Dear JCWarner,

thank you for your comments. Actually, I had forgotten about the SetFillVal=.FALSE. trick for WET/DRY and dynamic morphology. I tried setting it in def_rst and wrt_rst for all the variables related to the vertical mixing processes (although I think tke and gls would be enough), and it works without any other modification.

We still get some zero values in the restart NetCDF, but they are limited to the boundaries and do not affect the calculations, which are carried out only on interior points (at least for vertical mixing). They come from the subroutine tkebc, where the land/sea mask is applied.

Following your advice I will open a bug ticket. Thank you.

Alex
