The perils of floating-point time-step arithmetic

Bug reports, work arounds and fixes


m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

The perils of floating-point time-step arithmetic

#1 Post by m.hadfield » Tue Jan 24, 2012 5:09 am

I have a simulation with the non-linear, forward model driven by the M2 tide only. Accordingly, I have made the simulation length a multiple of the M2 period, which is 44714.1 s. The simulation ends by writing a record to the restart file. The parameters are:

Code:

      NTIMES =  23040
          DT == 7.7628819D0
     NDTFAST == 1
        NRST == 23040
This worked OK a year ago, but with the current version of ROMS the simulation ends at 23039 time steps and therefore fails to write a restart record. I strongly suspect the problem is in the code controlling the time-step loop in main2d.F (it is the same in main3d.F):

Code:

      my_StepTime=0.0_r8
      STEP_LOOP : DO WHILE (my_StepTime.le.RunInterval)
        my_StepTime=my_StepTime+MAXVAL(dt)
        ...
      END DO STEP_LOOP
Testing a floating-point accumulator against an exact boundary like this is generally a bad idea. How about a more tolerant test, like the following?

Code:

      my_StepTime=0.0_r8
      STEP_LOOP : DO WHILE (my_StepTime.le.RunInterval+0.5_r8*MAXVAL(dt))
        my_StepTime=my_StepTime+MAXVAL(dt)
        ...
      END DO STEP_LOOP
It fixed the problem in my case.
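The off-by-one is easy to reproduce outside ROMS. Below is a minimal Python sketch of the same loop logic — illustrative values only, not the actual parameters above: with dt = 0.1 and RunInterval = 0.3, the accumulated clock overshoots the end time by one round-off and the final step is lost, while the proposed half-step tolerance recovers it.

```python
# Illustrative sketch (not ROMS code) of the STEP_LOOP logic discussed above.
# dt = 0.1 and run_interval = 0.3 are chosen because their round-off behavior
# in IEEE-754 doubles is well known; they are not the parameters of the report.

def count_steps(run_interval, dt, tol=0.0):
    """Count iterations of a main2d/main3d-style DO WHILE stepping loop."""
    t = 0.0
    n = 0
    while t <= run_interval + tol:  # the test at the top of STEP_LOOP
        t += dt                     # accumulate the time step
        n += 1
    return n

# 0.1 + 0.1 + 0.1 accumulates to 0.30000000000000004 > 0.3,
# so the loop quits one step early:
naive = count_steps(0.3, 0.1)
# The half-step tolerance proposed above restores the missing step:
fixed = count_steps(0.3, 0.1, tol=0.5 * 0.1)
print(naive, fixed)  # 3 4
```

The same mechanism explains the 23039-versus-23040 behavior: after thousands of additions the accumulated clock can land a few ulps above RunInterval, and an exact `.le.` test then drops the last step.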

arango
Site Admin
Posts: 1114
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University
Contact:

Re: The perils of floating-point time-step arithmetic

#2 Post by arango » Wed Jan 25, 2012 9:27 pm

Indeed, the misery of floating-point round-off in computers... I made the suggested change to the stepping loop so it is more robust. See the following trac ticket.

Thank you for bringing this to my attention...

shchepet
Posts: 185
Joined: Fri Nov 14, 2003 4:57 pm

Re: The perils of floating-point time-step arithmetic

#3 Post by shchepet » Thu Jan 26, 2012 12:26 am

Hernan and Mark,

Are you serious about counting with REAL-type numbers, as proposed in the latest fix,

Code: Select all

      my_StepTime=0.0_r8
      MaxDT=MAXVAL(dt)

      STEP_LOOP : DO WHILE (my_StepTime.le.(RunInterval+0.5_r8*MaxDT))

        my_StepTime=my_StepTime+MaxDT

        ...

      END DO STEP_LOOP
?

It is a bad idea in any case, because you are accumulating round-off errors no matter how you fix it.

...Interestingly enough, a person in our group recently ran into trouble by doing something like that
in Matlab, basically because he kept adding a small "dt" to a large number.

In addition, I myself recently had a lot of fun investigating an MPI deadlock, which turned out
to be caused by a logic error due to round-off of a REAL-type number -- just comparing with .gt.
and .lt. operations -- I absolutely never use .eq. or .ge. or .le. on computed real numbers, but
still... a small round-off error flipped a logical switch differently on different MPI nodes, resulting
in loss of synchronization, an MPI deadlock, and a lot of fun finding the cause.

Counters should always be of an integer type.

The above code should be rewritten as

Code: Select all

      StartTime=0.0_r8   ! <-- not necessarily 0.0_r8; may be the time of the initial condition
      MaxDT=MAXVAL(dt)
      ntimes=(RunInterval+0.5_r8*MaxDT)/MaxDT  !<-- 0.5*MaxDT is added to round upward
      MaxDT=RunInterval/DBLE(ntimes)           !<-- this adjusts "dt" a little to make sure
                                               !    that at the end of time stepping
      STEP_LOOP : DO itime=1,ntimes            !    my_StepTime is exactly StartTime+RunInterval
        my_StepTime=StartTime + DBLE(itime)*MaxDT

        ...
      END DO STEP_LOOP
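The difference between the two approaches can be seen in a few lines of Python (an illustration of the argument above, not ROMS code): repeated addition of a time step lets round-off accumulate, while multiplying an integer counter by dt incurs at most one rounding per value.

```python
dt = 0.1

# Accumulating dt, as in the while-loop fix: round-off piles up.
t_acc = 0.0
for _ in range(10):
    t_acc += dt
print(t_acc)  # 0.9999999999999999 -- not 1.0

# Integer counting, as recommended above: at most one rounding per value.
t_int = 10 * dt
print(t_int)  # 1.0

# Over many more steps the accumulated clock drifts further, while the
# integer-counter form stays within one rounding of the true time.
n = 1_000_000
t_acc_long = 0.0
for _ in range(n):
    t_acc_long += dt
assert abs(t_acc_long - 100000.0) > abs(n * dt - 100000.0)
```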

arango

Re: The perils of floating-point time-step arithmetic

#4 Post by arango » Thu Jan 26, 2012 7:32 pm

Thank you for the comment. I will revisit this soon when I add the nesting calls to main2d and main3d, and I will take your suggestion into account. You are right about the issue of dealing with large numbers!

Yes, the round-off behavior in Matlab is very delicate. I struggled with this about a month ago. Just try these simple statements and you will be surprised by the results:

Code:

>> format long eng
>> double(single(1.0d+12))

ans = 999.999995904000e+009

>> eps(1.0d+12)

ans = 122.070312500000e-006 
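These particular values are a consequence of the IEEE-754 single and double formats rather than anything Matlab-specific, and can be reproduced with the Python standard library: a struct round-trip forces single precision, and math.ulp plays the role of Matlab's eps.

```python
import math
import struct

# Round-trip 1.0e12 through single precision, as in double(single(1.0d+12)):
# the nearest binary32 value is 15258789 * 2**16 = 999999995904.
x = struct.unpack('f', struct.pack('f', 1.0e12))[0]
print(x)  # 999999995904.0

# Spacing of double-precision numbers at 1.0e12, as in eps(1.0d+12):
# 2**(39 - 52) = 2**-13 = 0.0001220703125, i.e. 122.0703125e-06.
print(math.ulp(1.0e12))  # 0.0001220703125
```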
In the script to read NetCDF data that we distribute, matlab/utility/nc_read.m, I had to put in the following logic to process the _FillValue and missing_value attributes:

Code:

if (got.FillValue || got.missing_value),
  if (iscellstr(f) || ischar(f)),
    ind=find(f == spval);
    f(ind)=spval;
  else
    ind=find(abs(f-spval) < 4*eps(double(spval)));
  end
end
Notice that to find the special values correctly, I need abs(f-spval) < 4*eps(double(spval)) to get the correct solution, accounting for Matlab round-off. They have made a lot of decisions to accelerate computations that compromise precision. There is nothing like IEEE floating-point arithmetic in Matlab. Here f is the data read from the NetCDF file and spval is the attribute value in its native file precision.

In Fortran, we need a function with capabilities similar to eps to do these comparisons according to the computer round-off.

I was completely surprised that all the Matlab interfaces and scripts out there for reading NetCDF data do not take into account the round-off inherent to Matlab. This gives surprising results sometimes. See the following ticket for details about processing NetCDF data.

m.hadfield

Re: The perils of floating-point time-step arithmetic

#5 Post by m.hadfield » Thu Jan 26, 2012 7:57 pm

shchepet wrote:Counters should always be of an integer type.
Quite right. The quick fix should do for now. If it doesn't (I have some runs with a much larger number of time steps) I'll send Hernan some code for all the different drivers.

shchepet

Re: The perils of floating-point time-step arithmetic

#6 Post by shchepet » Fri Jan 27, 2012 8:51 pm

As of right now, ROMS expects to read the 3D-mode time step "dt" and the mode-splitting ratio
"ndtfast" from the input script.

A potentially attractive alternative would be to specify "dt" and the fast-time step "dtfast", and let
the model figure out the mode-splitting ratio within the code (along with a slight adjustment to
"dtfast" so that ndtfast*dtfast=dt, made by the model via a simple algorithm like the one described above).

This would make things a bit simpler to handle: users often change "dt" experimentally because it
is hard to predict a priori for a given configuration (its limit depends on stratification, topography,
and tides, if any). In contrast, "dtfast" is very predictable -- basically it depends on the maximum
depth and the grid resolution. When a user changes "dt", he should also change "ndtfast" to keep
"dtfast" about the same. Specifying "dtfast" in the input script would be a bit more straightforward.
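The proposed alternative is straightforward to sketch. The Python below is only an illustration of the idea (the function name and interface are hypothetical, not ROMS code): read dt and a requested dtfast, derive the nearest integer split ratio, then adjust dtfast so the two clocks meet exactly.

```python
def split_ratio(dt, dtfast_target):
    """Derive the mode-splitting ratio from dt and a requested fast step,
    then adjust dtfast slightly so that ndtfast * dtfast lands on dt.
    Hypothetical helper illustrating the proposal above; not ROMS code."""
    ndtfast = max(1, round(dt / dtfast_target))  # nearest integer ratio
    dtfast = dt / ndtfast                        # slight adjustment of dtfast
    return ndtfast, dtfast

# A user picks dt experimentally, but dtfast ~ 9 s is predictable from the
# maximum depth and grid resolution:
ndtfast, dtfast = split_ratio(300.0, 9.0)
print(ndtfast)  # 33
print(abs(ndtfast * dtfast - 300.0) < 1e-9)  # True
```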

m.hadfield

Re: The perils of floating-point time-step arithmetic

#7 Post by m.hadfield » Sun Jan 29, 2012 8:34 pm

shchepet wrote:A potentially attractive alternative would be to specify "dt" and fast-time step "dtfast", and let the model figure out the mode-splitting ratio within the code (along with a slight adjustment to "dtfast", so ndtfast*dtfast=dt made by the model via a simple algorithm described above).
+1 to that.

I was explaining to a new user the other day that dtfast is limited by the barotropic CFL criterion, but you don't specify that, you specify dt and ndtfast. We both thought that was a bit odd.

gta
Posts: 1
Joined: Tue Oct 12, 2010 9:12 pm
Location: ACRI-ST

Re: The perils of floating-point time-step arithmetic

#8 Post by gta » Mon Jan 30, 2012 10:42 am

Very interesting topic!
shchepet wrote:

Code:

      StartTime=0.0_r8   ! <-- not necessarily 0.0_r8, may be just time from the initial condition
      MaxDT=MAXVAL(dt)
      ntimes=(RunInterval+0.5*MaxDT)/MaxDT     !<-- 0.5*MaxDT is added to make roundoff upward. 
      MaxDT=RunInterval/dble(ntimes)           !<-- this adjusts "dt" a little bit to make sure
                                               !    that at the end of time stepping my_StepTime
      STEP_LOOP : DO itime=1,ntimes            !    will be exactly  StartTime+RunInterval
        my_StepTime=StartTime + dble(itime)*MaxDT

        ...
      END DO STEP_LOOP
I was just wondering how you would generalize your formulation if you don't know the number of time steps a priori, for instance in the case of a varying time step.
Perhaps it could be a subject of interest for automatic time-step monitoring during a simulation in a future ROMS release.
Thank you for your insight!
Best Regards
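One standard technique for the varying-time-step case — an outside suggestion, not something proposed in this thread — is compensated summation, which bounds the accumulated round-off without knowing the step count in advance. Python exposes an exactly rounded version as math.fsum:

```python
import math

# Ten steps of 0.1 s: naive accumulation misses the target time...
steps = [0.1] * 10
naive = 0.0
for dt in steps:
    naive += dt
print(naive == 1.0)  # False (naive is 0.9999999999999999)

# ...while compensated summation of the same step history recovers it exactly.
print(math.fsum(steps) == 1.0)  # True
```

In a real stepping loop one would maintain a Kahan-style running compensation rather than re-summing the whole history each step; the sketch only demonstrates the precision difference.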

arango

Re: The perils of floating-point time-step arithmetic

#9 Post by arango » Mon Jan 30, 2012 4:32 pm

Well, the rationale for dt and ndtfast was to force the user to think and to make the baroclinic time step (dt) an integer multiple of the barotropic time step (dtfast). That is, dt = ndtfast * dtfast. We wanted a whole number. This has been in ROMS since the beginning.

kate
Posts: 3780
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: The perils of floating-point time-step arithmetic

#10 Post by kate » Tue Mar 06, 2012 6:45 pm

I've been wondering about the new main-loop code since you brought it to our attention. It seems that if different grids don't have the same time step, they will all run for the number of steps needed by the biggest time step. So I whipped up an upwelling input file for two domains. I had to make this change to get it to go:

Code:

diff --git a/ROMS/Utility/inp_par.F b/ROMS/Utility/inp_par.F
index e4c3f92..50221da 100644
--- a/ROMS/Utility/inp_par.F
+++ b/ROMS/Utility/inp_par.F
@@ -1777,7 +1777,7 @@
 !  symbol in the input "line" is used to advance the counters.
 !
       IF (.not.load) THEN
-        IF (ifile.le.nFfiles(igrid)) THEN
+        IF (ifile.lt.nFfiles(igrid)) THEN
           ifile=ifile+MIN(1,Icont)
         ELSE
           ifile=1
Indeed, it ran both domains for the smaller number of time steps and then quit:

Code:

   1439     4 23:55:00  2.181924E-02  6.585713E+02  6.585931E+02  3.884376E+11
            (01,01,11)  1.751682E-01  2.616835E-03  2.911273E-02  6.295591E-01
   1439     3 07:56:40  1.230684E-02  6.585685E+02  6.585808E+02  1.894817E+11
            (01,01,08)  8.060289E-02  4.904456E-03  9.185044E-02  5.322489E-01
   1440     5 00:00:00  2.184275E-02  6.585713E+02  6.585931E+02  3.884376E+11
            (01,01,11)  1.752055E-01  2.609796E-03  2.904356E-02  6.296896E-01
   1440     3 08:00:00  1.232267E-02  6.585685E+02  6.585808E+02  1.894817E+11
            (01,01,08)  8.062519E-02  4.899209E-03  9.192351E-02  5.324433E-01
      WRT_HIS   - wrote history  fields (Index=1,1) into time record = 0000021
      WRT_AVG   - wrote averaged fields into time record =             0000020
      WRT_DIAGS - wrote diagnostics fields into time record =          0000020
      WRT_RST   - wrote re-start fields (Index=1,1) into time record = 0000001

 Elapsed CPU time (seconds): 

 Thread #  0 CPU:     587.431
I had asked for:

Code:

      NTIMES == 1440 2160 
          DT == 300.0d0 200.0d0

arango

Re: The perils of floating-point time-step arithmetic

#11 Post by arango » Tue Mar 06, 2012 7:55 pm

I already changed this logic; it will be available in the next ROMS release. I am almost done coding nesting Phase III. This stuff is quite complex. Once I am done coding, the testing will start. The plan is to release this as ROMS 3.7 soon, so everybody may start planning for what is coming your way. I am very happy with the design. The fully tested version will be released as ROMS 4.0. I really hope that you guys help me test these capabilities.

If you are interested, I gave a talk at NOAA last week about this. It is quite technical, and it is going to require some training. However, you can now check out the full capabilities of ROMS nesting and start thinking about how they can be used in your applications (old and new). The nesting presentation can be downloaded from the ROMS website. Some parts have ActiveX controls that only work on Windows. You may deactivate them, but then you will not be able to unroll the attached code windows.

Now, main3d.F looks like:

Code:

      ...

      NEST_LAYER : DO nl=1,NestLayers
!
!  In composite and mosaic grids, it is assumed that the donor and
!  receiver grids have the same time-step size. This is done to avoid
!  the time interpolation between donor and receiver grids. Only
!  spatial interpolations are possible in the current nesting design.
!
!  In grid refinement, it is assumed that the donor and receiver grids
!  differ by an integer factor in grid size and time-step size.
!
!  Determine number of time steps to compute in each nested grid layer
!  based on the specified time interval (seconds), RunInterval. Notice
!  that RunInterval is set in the calling driver. Its value may span
!  the full period of the simulation, or a multi-model coupling
!  interval, or just a single step for the main grid.
!
        Nsteps=0
        DO ig=1,GridsInLayer(nl)
          ng=GridNumber(ig,nl)
          Nsteps=MAX(Nsteps, INT((RunInterval+0.5_r8*dt(ng))/dt(ng))+1)
        END DO
!
!  Time-step governing equations for Nsteps.
!
        STEP_LOOP : DO istep=1,Nsteps
!
!  Set time indices and time clock.
!
          DO ig=1,GridsInLayer(nl)
            ng=GridNumber(ig,nl)
            iic(ng)=iic(ng)+1
            nstp(ng)=1+MOD(iic(ng)-ntstart(ng),2)
            nnew(ng)=3-nstp(ng)
            nrhs(ng)=nstp(ng)
            time(ng)=time(ng)+dt(ng)
            tdays(ng)=time(ng)*sec2day
            CALL time_string (time(ng), time_code(ng))
          END DO
          ...
Notice that we now have an integer operation, which is not subject to floating-point round-off. I don't understand the point that Kate is bringing up. Are you running in a nested grid set-up? How? There are still missing parts of the code in the version that you are using to do this correctly.
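The robustness of that integer count can be checked in isolation. The Python below evaluates the same INT((RunInterval+0.5*dt)/dt)+1 expression with the DT and NTIMES from the original report (the +1 follows the listing's convention): the half-step offset places the quotient near 23040.5, so perturbations of a few ulps can no longer flip the integer result.

```python
import math

dt = 7.7628819
run_interval = 23040 * dt  # simulation length from the original report

# The step-count formula from the main3d.F listing above:
nsteps = int((run_interval + 0.5 * dt) / dt) + 1
print(nsteps)  # 23041

# The quotient sits near 23040.5, halfway between integers, so a few ulps
# of round-off in RunInterval cannot change INT() of it:
for wiggle in (-3, -2, -1, 1, 2, 3):
    perturbed = run_interval + wiggle * math.ulp(run_interval)
    assert int((perturbed + 0.5 * dt) / dt) + 1 == nsteps
```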

kate

Re: The perils of floating-point time-step arithmetic

#12 Post by kate » Tue Mar 06, 2012 8:08 pm

I'm not running nested, just two independent copies of UPWELLING - but using different timesteps.

The patch I included is for the forcing files to transition from one grid to the next in read_phypar.

shchepet

Re: The perils of floating-point time-step arithmetic

#13 Post by shchepet » Tue Mar 06, 2012 8:28 pm

Hernan,

by adding a small increment dt to time, as in your code,

Code:

...
time(ng)=time(ng)+dt(ng)
...
(see near the end of the listing), you are back to the original problem: you accumulate round-off errors in counting time.


What you should do instead is change the above into something like

Code:

...
time(ng)=time_init(ng) + dt(ng)*(iic(ng)-ntstart(ng))
...
where time_init(ng) is the time corresponding to the initial condition.

arango

Re: The perils of floating-point time-step arithmetic

#14 Post by arango » Tue Mar 06, 2012 10:03 pm

OK, Sasha, I see what you mean. Let me see what I can do. I have so much testing in my way. I am stuck now on how to do the MPI communications efficiently in the fine2coarse coupling (two-way). It is trivial in serial and shared-memory computations because we have access to the finer grid's global array. It gets complicated for higher refinement ratios like 1:5 and 1:7, because the data to be averaged in the fine2coarse step may lie in different tiles. The 1:3 case is trivial. Right now, I am only allowing refinement ratios of 1:3, 1:5, and 1:7. I don't think that it is advisable to use higher ratios.

I am trying to find an elegant way to do this. It is kind of a tricky MPI reduction operation. I welcome any suggestions, if you have any. Everything that I have tried so far involves collecting and communicating the global arrays for each state variable that needs to be averaged in the interior of the finer grid and sent back to the coarse grid. Not trivial at all. This is the last thing that I need to figure out...

jcwarner
Posts: 855
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: The perils of floating-point time-step arithmetic

#15 Post by jcwarner » Thu Mar 08, 2012 7:58 pm

Just to bring everyone up to speed: last year I developed a fully functional two-way, mass-conserving grid-refinement method for ROMS as part of my COAWST modeling system. It is available to anyone who asks, and I have made it available to Rutgers. I also struggled with many of these same issues, such as the floating-point problem, and settled on a method that uses integer counts to keep track of the grid stepping. As for the exchanges described above as "the last thing that I need to figure out": what I coded has the parent master do an mp_gather and mpi_bcast of the child data to all parent tiles. Each parent tile has a flag that identifies whether it contains any child ranges; if so, it averages the child data and replaces the parent values.

For now, I am comfortable with this. I am looking into making it more efficient, potentially with the MCT sparse-matrix multiplication tools or something of the sort. Not sure. I think there are other issues that are more important.

The timing of these calls in the main3d loop is critical: it will violate the constancy-preservation and mass-conservation properties if not done correctly. My time stepping computes the mass flux from the parent, adjusts the 2D barotropic fields at the grid-connectivity regions to ensure 100% mass (volume) exchange from parent to child, and performs a 3D, vertically distributed tracer correction to the parent to ensure the tracer conservation properties.

When you reach the next hurdle, be sure to ask. I already have this working.

-john
