Nasty bug in the computation of ROMS grid metrics

Bug reports, work arounds and fixes

Moderators: arango, robertson

arango
Site Admin
Posts: 1114
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University


#1 Post by arango » Mon Feb 14, 2005 4:55 am

We just discovered a nasty bug in all versions of ROMS/TOMS that affects curvilinear coordinates applications. The code is fine for any uniform grid application. This bug is in both the Rutgers and UCLA versions of ROMS. As a matter of fact, it is also in SCRUM. I haven't checked SPEM yet. So the bug has been in all these models for more than 10 years. I can't believe that nobody spotted this one before :o.

Many thanks to John Warner for his curiosity in discovering this bug :D.

I haven't explored the repercussions of this bug yet, but it affects viscosity and diffusion. It can break the symmetries of the stress and diffusion tensors, and it can affect momentum and tracer balances.

I need to analyze the entire code and check every place where the metrics pm and pn are averaged horizontally, to make sure the result is consistent with the meaning of grid spacing. Recall that pm and pn are the inverse grid spacings at RHO-points:

dx = 1 / pm
dy = 1 / pn

This always annoyed me but we kept it like that for historical reasons.

Please use the following metrics.F from now on:

http://www.myroms.org/links/metrics.F.gz

This bug will also affect all tangent linear and adjoint applications. Fortunately, we do not need to take the adjoint of this routine. In the UCLA version of ROMS these metrics are computed in setup_grid1.F.


By the way, we need to examine all the grid generation programs to see if this bug is also there.

Now, let's examine this bug by looking at two contiguous grid cells centered at RHO-points with unusually different grid sizes, say 50 and 80 km:

Code: Select all

 j       ----- v(i-1)----------- v(i  )-------
         |                 |                 |
         |                 |                 |
 j     u(i-1)   r(i-1)   u(i  )   r(i  )   u(i+1)
         |                 |                 |
         |                 |                 |
 j-1     ----- v(i-1)----------- v(i  )-------


xr:            125               190                     km
xu:    100                150                230         km

dx_r:           50                80                     km  (1/pm)
dx_u:                      65                            km  (om_u)
Let's look at the computation of the grid spacing at u-points, om_u. In metrics, we had:

Code: Select all

        om_u(i,j) = 2 / (pm(i-1,j) + pm(i,j))
                  = 2 / (1/50 + 1/80)
                  = 2 / (0.02 + 0.0125)
                  = 2 / 0.0325
                  = 61.538462
Oops... obviously, 65 is not equal to 61.538462.
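The arithmetic is easy to check in a few lines (a Python sketch; the variable names are illustrative, not the ones in metrics.F):

```python
# Two adjacent RHO-cells with grid spacings 50 and 80 km, as in the example.
dx_left, dx_right = 50.0, 80.0

# Inverse metrics at RHO-points, as stored in a ROMS grid file.
pm_left, pm_right = 1.0 / dx_left, 1.0 / dx_right

# Old metrics.F: average pm, then invert -- a harmonic mean of dx.
om_u_old = 2.0 / (pm_left + pm_right)

# Proposed fix: arithmetic mean of dx itself.
om_u_new = 0.5 * (dx_left + dx_right)

print(om_u_old)  # ~61.538 km
print(om_u_new)  # 65.0 km
```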

Notice that if the contiguous cells have the same grid spacing, this formula yields the correct om_u. Also notice that the error grows as the curvilinear grid gets more distorted. The error is grid dependent and can be small on grids that are almost uniform. I wonder if this bug exists in other curvilinear models out there. It is very easy to make this mistake because the computations need the inverse grid spacing.

The same goes for other terms in metrics.F:

on_u, om_v, on_v, pmon_u, pnom_u, pmon_v, pmon_p, pnom_p

Well, I am speechless about this bug. Please replace metrics.F with the new routine immediately :!:

Good luck, H
Last edited by arango on Mon Feb 14, 2005 9:03 pm, edited 1 time in total.


#2 Post by arango » Mon Feb 14, 2005 5:02 pm

There are other issues that we need to consider about this problem. We need to conserve first and second moments, and we need to get the same interior volume when computing it from rho-, psi-, u-, and v-points. I am testing this aspect right now.

I wrote a Matlab script to test the area integral using the old and new ways of computing the metrics. Please use the following script:

http://www.myroms.org/links/check_metrics.m.gz

> [G]=check_metrics(gname);

Here G is a structure array containing several quantities that can be plotted, and gname is an input NetCDF grid file containing the variables h, pm, and pn.

For a uniform grid, I get:

Code: Select all

 Domain area (km2) computed using inverse metrics pm, pn (old way):

    Interior rho-points = 2000000
    Interior psi-points = 2000000
    Interior   u-points = 2000000
    Interior   v-points = 2000000

 Domain area (km2) computed using grid spacing dx, dy (new way):

    Interior rho-points = 2000000
    Interior psi-points = 2000000
    Interior   u-points = 2000000
    Interior   v-points = 2000000
For a curvilinear grid, I get:

Code: Select all

 Domain area (km2) computed using inverse metrics pm, pn (old way):

    Interior rho-points = 74137.8015
    Interior psi-points = 74105.706
    Interior   u-points = 74114.6482
    Interior   v-points = 74128.829

 Domain area (km2) computed using grid spacing dx, dy (new way):

    Interior rho-points = 74137.8015
    Interior psi-points = 74132.7916
    Interior   u-points = 74133.1297
    Interior   v-points = 74137.4592
I don't know yet why I don't get exact values. The error is too big to be accumulated roundoff in the area integrals. Perhaps I still have a bug in my script. I will try other applications.
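One possible contributor can be seen in a one-dimensional analogue of the area check (a Python sketch with made-up grid parameters, not the actual check_metrics.m): on a stretched grid, harmonically averaged u-spacings always sum to less than arithmetically averaged ones, because a harmonic mean never exceeds the arithmetic mean.

```python
import math

# One-dimensional analogue: exponentially stretched grid, made-up parameters.
N, M = 10.0, 20
x_u = [math.exp(k / N) for k in range(M)]            # U-point coordinates
dx_r = [x_u[k + 1] - x_u[k] for k in range(M - 1)]   # exact RHO-point spacings

# Interior U-point spacings by the two averaging rules:
old = [2.0 / (1.0 / dx_r[k] + 1.0 / dx_r[k + 1]) for k in range(M - 2)]  # harmonic
new = [0.5 * (dx_r[k] + dx_r[k + 1]) for k in range(M - 2)]              # arithmetic

print(sum(old) < sum(new))  # True: harmonic totals are systematically smaller
```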

shchepet
Posts: 185
Joined: Fri Nov 14, 2003 4:57 pm

#3 Post by shchepet » Tue Feb 15, 2005 12:59 am

Dear All,

I was well aware of this feature of the SPEM/SCRUM/ROMS codes for years (at least 5 years) and do not consider it a bug. Nor do I consider the latest fix an urgent measure to be taken immediately, and I see no reason for panic or for statements like "I cannot believe that we missed it for so many years". Furthermore, implementing this fix alone (just replacing metrics.F, as proposed by Hernan) without changing several places throughout the rest of the code is dangerous, because it may destroy the conservation properties in the momentum equations and ultimately threatens the stability of the code (see below).

The SPEM/SCRUM/ROMS codes (all of them) have indeed been very sloppy in handling curvilinear grids. There are numerous examples where consecutive averaging of already-averaged quantities takes place, resulting in a potential loss of accuracy. For example, inside the grid generation software (this is seagrid, for those who remember), the metric coefficients pm=1/dx and pn=1/dy are first computed at their natural locations on the C-grid (at U- and V-points, respectively) and then averaged onto RHO-points by algebraic averaging of pm, pn, e.g.,

Code: Select all

            pm_(i,j) = [ pm_(i,j) + pm_(i+1,j) ]/2
and hence harmonic averaging of the associated grid spacings dx, dy. These averaged values of pm, pn are then stored in the grid file and, whenever necessary, averaged again; see for example the routine metrics (old or new, it does not matter). Effectively, we are computing dx(i+1/2,j) using a 3-grid-interval stencil rather than the natural 1-interval stencil.

Regarding the bug fix proposed by Hernan: basically, what we are talking about here is replacing harmonic averaging of dx_ and dy_ with algebraic averaging,

Code: Select all

                              dx_(i,j) + dx_(i+1,j)
              dx_(i+1/2,j) = -----------------------
                                       2
      vs.

                                          2
              dx_(i+1/2,j) = ---------------------------
                              1/dx_(i,j) + 1/dx_(i+1,j)
These two agree with each other to second order of accuracy, and the difference between them is asymptotically small with the controlling smallness parameter

Code: Select all

                      dx_(i,j) - dx_(i+1,j)
           epsilon = -----------------------
                      dx_(i,j) + dx_(i+1,j)
(second-order agreement means that the difference vanishes as the second power of the epsilon above).
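This scaling is easy to verify numerically (a Python sketch; mean_diff is an illustrative helper, not a ROMS routine). In fact, for two values the relative difference works out to exactly epsilon squared:

```python
def mean_diff(dx1, dx2):
    """Relative difference between the algebraic and harmonic means of dx1, dx2."""
    algebraic = 0.5 * (dx1 + dx2)
    harmonic = 2.0 / (1.0 / dx1 + 1.0 / dx2)
    return (algebraic - harmonic) / algebraic

# For two values the relative difference is exactly epsilon**2:
# (a+b)/2 - 2ab/(a+b) = (a-b)^2 / (2(a+b)); dividing by (a+b)/2
# gives ((a-b)/(a+b))^2.
eps = (80.0 - 50.0) / (80.0 + 50.0)   # the 50/80 km cells from the first post
print(mean_diff(50.0, 80.0))          # = eps**2, about 0.0533
print(eps ** 2)
```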

From the accuracy point of view, neither formula has a decisive advantage over the other. Consider, for example, a grid with exponential stretching:

Code: Select all

                x_(j)= X * exp (j/N)
where j is the grid index, X is a length scale, and N controls the stretching. Then

Code: Select all

           dx_(j) = 2X * sinh(1/(2N)) * exp(j/N)
and, similarly, the exact value at the halfway midpoint is

Code: Select all

           dx_(j+1/2) = 2X * sinh(1/(2N)) * exp((j+1/2)/N)
On the other hand, algebraic averaging of dx_(j) and dx_(j+1) yields

Code: Select all

      dx_(j)+dx_(j+1)
     ----------------- = 2X * sinh(1/(2N)) * cosh(1/(2N)) * exp((j+1/2)/N)
            2
which has an extra multiplier cosh(1/(2N)); while harmonic averaging yields

Code: Select all

            2                 2X * sinh(1/(2N))
   ----------------------- = ------------------- * exp((j+1/2)/N)
    1/dx_(j) + 1/dx_(j+1)       cosh(1/(2N))
where the same term appears in the denominator. One cannot say that either approximation is better than the other; however, one can verify that their mean produces a more accurate approximation than either of the two:

Code: Select all

  algebraic + harmonic                            [cosh(1/(2N))]^2 + 1
 ---------------------- = [dx_(j+1/2)]_"exact" * ----------------------
            2                                        2 * cosh(1/(2N))
where the extra multiplier on the right behaves like

Code: Select all

        [cosh(1/(2N))]^2 + 1         1
       ---------------------- = 1 + --- * (1/(2N))^4 + ...
          2 * cosh(1/(2N))           8
In this case, the truncation errors of the algebraic and harmonic averages are comparable in magnitude but have opposite signs, so they cancel each other.
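The three factors above can be confirmed numerically (a Python sketch; the values of X, N, and j are made up for illustration):

```python
import math

# Exponentially stretched grid x_(j) = X * exp(j/N); made-up parameters.
X, N, j = 1.0, 4.0, 3

def dx(k):
    # dx_(k) = 2X sinh(1/(2N)) exp(k/N), from the formula in the text
    return 2.0 * X * math.sinh(1.0 / (2.0 * N)) * math.exp(k / N)

exact = 2.0 * X * math.sinh(1.0 / (2.0 * N)) * math.exp((j + 0.5) / N)
algebraic = 0.5 * (dx(j) + dx(j + 1))
harmonic = 2.0 / (1.0 / dx(j) + 1.0 / dx(j + 1))
c = math.cosh(1.0 / (2.0 * N))

print(algebraic / exact)                     # = cosh(1/(2N)), slightly > 1
print(harmonic / exact)                      # = 1/cosh(1/(2N)), slightly < 1
print(0.5 * (algebraic + harmonic) / exact)  # = (cosh^2+1)/(2 cosh), much closer to 1
```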

One feature of harmonic averaging is that it produces values which are systematically smaller than those of algebraic averaging (the two are equal only if the numbers being averaged are the same). As a consequence, the sum of the areas (as seen from above) of the U- and V-control volumes is systematically smaller than the sum of the RHO-areas. In principle, this effect does not jeopardize the conservation properties of the ROMS code, because we multiply by an area and divide by exactly the same number; for example, in the U-equation it is

Code: Select all

a multiplier of             om_u(i,j)*on_u(i,j)
followed by multiplier

Code: Select all

             0.25*(pm(i,j)+pm(i-1,j))*(pn(i,j)+pn(i-1,j))
See, for example, step2d and the computation of rhs_ubar and its subsequent use to compute the updated value of ubar(i,j,knew). As long as it is guaranteed that the product of the two multipliers above is equal to unity, the conservation properties are respected, regardless of the details of how the multipliers are computed. The old way of doing it is to define

Code: Select all

               om_u(i,j)=2./(pm(i,j)+pm(i-1,j))
and
               on_u(i,j)=2./(pn(i,j)+pn(i-1,j))
However, conservation properties are lost if, for example, one replaces the metrics.F routine with the one proposed by Hernan while leaving step2d unchanged.

Code: Select all

recall that     0.25*(pm(i,j)+pm(i-1,j))*(1./pm(i,j)+1./pm(i-1,j)) > 1
with 1 achieved only if pm(i,j)=pm(i-1,j). This kind of multiplier may show up if one implements metrics.F as suggested by Hernan while keeping the rest of the code as before. Depending on the specific version of the code, it may be equivalent to multiplying u and ubar by this factor at every time step, hence instability. I checked ROMS 2.1, and in that code only the right-hand sides are multiplied and divided by horizontal areas, while ubar and u are not; this is in contrast to the algorithm for the tracer equations, where t(:,:,:,:) is also multiplied and then divided during time stepping. So such an instability would not occur in ROMS 2.1. The conservation properties, however, will be lost in any code.
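The inequality is easy to check for the 50/80 km example from the first post (a Python sketch; mixed_factor is an illustrative name, not a ROMS symbol):

```python
# If metrics.F switches om_u to an algebraic mean of dx while step2d still
# divides by the algebraic mean of pm, each u control volume picks up this
# spurious factor:
def mixed_factor(pm1, pm2):
    return 0.25 * (pm1 + pm2) * (1.0 / pm1 + 1.0 / pm2)

print(mixed_factor(1.0 / 50.0, 1.0 / 80.0))  # 1.05625 for the 50/80 km cells
print(mixed_factor(1.0 / 50.0, 1.0 / 50.0))  # 1 (to roundoff) when pm values match
```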

Where do we go from here?

It is my vision that the correct way to address the problem is to reconsider the whole technological chain of grid generation and the format used to store grid data in files. The current strategy stores a lot of redundant information but at the same time allows irreversible loss of information, for example the averaging of pm and pn onto RHO-points inside the grid-generation package.

Grid generation:

We should recall that the SPEM/SCRUM/ROMS codes employ orthogonal curvilinear coordinates. This means that there exists a conformal mapping between the physical coordinates X, Y and the corresponding logical coordinates xi, eta (basically these are grid-index coordinates, with, perhaps, a single scaling factor deciding the aspect ratio of the logical coordinate space).

Since this is a conformal mapping, X(xi,eta) and Y(xi,eta) are harmonic functions of the logical coordinates,

Code: Select all

           d^2 X      d^2 X                     d^2 Y      d^2 Y
          -------- + -------- = 0      and     -------- + -------- = 0
           d xi^2     d eta^2                   d xi^2     d eta^2
which are naturally discretized as follows:

Code: Select all

at RHO-points,


     Xu_(i+1/2,j) + Xu_(i-1/2,j) - 2 Xr_(i,j)
    ----------------------------------------- 
                    dxi^2

                    Xv_(i,j+1/2) + Xv_(i,j-1/2) - 2 Xr_(i,j)
                 + ----------------------------------------- = 0
                                     deta^2
where dxi and deta are globally defined constants (actually only the ratio dxi/deta matters, because there is no right-hand side and the equations can be scaled by an arbitrary number).

Similarly,

Code: Select all

U-points

     Xr_(i+1,j) + Xr_(i,j) - 2 Xu_(i+1/2,j)
    --------------------------------------- 
                    dxi^2

               Xp_(i+1/2,j+1/2) + Xp_(i+1/2,j-1/2) - 2 Xu_(i+1/2,j)
            + ------------------------------------------------------ = 0
                                    deta^2


V-points

     Xp_(i+1/2,j+1/2) + Xp_(i-1/2,j+1/2) - 2 Xv_(i,j+1/2)
    ----------------------------------------------------- 
                            dxi^2

                         Xr_(i,j+1) + Xr_(i,j) - 2 Xv_(i,j+1/2)
                      + ---------------------------------------- = 0
                                         deta^2

PSI-points

     Xv_(i+1,j+1/2) + Xv_(i,j+1/2) - 2 Xp_(i+1/2,j+1/2)
    --------------------------------------------------- 
                    dxi^2

                  Xu_(i+1/2,j+1) + Xu_(i+1/2,j) - 2 Xp_(i+1/2,j+1/2)
               + ---------------------------------------------------- = 0
                                     deta^2
The above notation places the four relevant variables on the grid as follows:

Code: Select all

         Xu_(i-1/2,j+1)     Xr_(i,j+1)     Xu_(i+1/2,j+1)

         Xp_(i-1/2,j+1/2)   Xv_(i,j+1/2)   Xp_(i+1/2,j+1/2)

         Xu_(i-1/2,j)       Xr_(i,j)       Xu_(i+1/2,j)

         Xp_(i-1/2,j-1/2)   Xv_(i,j-1/2)   Xp_(i+1/2,j-1/2)
Basically, one can define an array of twice the size in each direction and just solve a regular 5-point Laplacian. (Y is discretized in a similar way.)

This leads to natural interpolation rules for U-, V-, and PSI-points, as well as the subsequent computation of all other metric terms: suppose we store X at both RHO- and PSI-points in the file. Then X at U- and V-points can be reconstructed as

Code: Select all

               [Xr_(i,j)+Xr_(i+1,j)]/dxi^2 + [Xp_(i+1/2,j+1/2)+Xp_(i+1/2,j-1/2)]/deta^2
Xu(i+1/2,j) = ---------------------------------------------------------------------------
                             2/dxi^2 + 2/deta^2

similarly

               [Xp_(i+1/2,j+1/2)+Xp_(i-1/2,j+1/2)]/dxi^2 + [Xr_(i,j)+Xr_(i,j+1)]/deta^2
Xv(i,j+1/2) = ---------------------------------------------------------------------------
                             2/dxi^2 + 2/deta^2
which are exact in the sense that they identically respect the harmonic property (zero Laplacian) defined on the 5-point stencil specified above.

Once all four values, Xr, Xu, Xv, Xp are known, all relevant grid spacings can be computed by simple differencing without any averaging at all, for example,

Code: Select all

               om_u(i+1/2,j)=Xr(i+1,j)-Xr(i,j)

                 pm_u=1./[Xr(i+1,j)-Xr(i,j)]
Basically, it all translates into storing four arrays, say Xr, Xp, Yr, Yp (alternatively Xu, Xv, Yu, Yv), in the grid file and... that is it. Everything else (pm, pn, Coriolis, angles, areas, etc.) can be computed from them on the fly. In the case of spherical coordinates, either lon-lat coordinates or their Mercator projections can be stored.
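The reconstruction rule can be sanity-checked in a few lines (a Python sketch with made-up neighbor values; by construction the reconstructed Xu zeroes the 5-point Laplacian):

```python
# Made-up values of X at the four neighbors of one interior U-point.
Xr_l, Xr_r = 1.3, 2.7        # Xr_(i,j), Xr_(i+1,j)
Xp_dn, Xp_up = 0.9, 3.1      # Xp_(i+1/2,j-1/2), Xp_(i+1/2,j+1/2)
dxi, deta = 1.0, 1.5         # only the ratio dxi/deta matters

# Reconstruction formula from the text: the unique value that zeroes the
# 5-point Laplacian at the U-point.
Xu = ((Xr_l + Xr_r) / dxi**2 + (Xp_dn + Xp_up) / deta**2) \
     / (2.0 / dxi**2 + 2.0 / deta**2)

# Check: the discrete Laplacian at the U-point vanishes identically.
lap = (Xr_l + Xr_r - 2.0 * Xu) / dxi**2 + (Xp_dn + Xp_up - 2.0 * Xu) / deta**2
print(lap)  # 0 to roundoff
```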

Doing everything accurately is possible and does not add any computational burden inside the hydrodynamic code of ROMS, but once again, it is not just the code: the whole supporting infrastructure (grid generation, data file formats, pre- and post-processing Matlab scripts, etc.) should adopt a different standard.

Alexander Shchepetkin


#4 Post by arango » Tue Feb 15, 2005 2:50 am

Thank you, Sasha, for the detailed explanation. I learned something new today. Yes, I was aware that the changes to metrics.F would involve other changes in the code, as I mentioned in my first posting.

Please disregard the earlier changes to metrics.F and use the old routine, for the reasons explained very well by Sasha. We need to think about how to proceed from here. We need the code and the pre- and post-processing software to be consistent.

kate
Posts: 3780
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

#5 Post by kate » Wed Feb 16, 2005 8:18 pm

Hernan's original example of dx=50 next to dx=80 brings up another point. We have known since the days of John Wilkin's thesis that sudden jumps in dx can lead to spurious results. The whole discretization depends on the grid spacing varying smoothly. Just as there is a metric for allowable transitions in depth, there is doubtless a metric for allowable transitions in dx. I've seen results from a grid with a uniform field of dx=5 adjacent to a uniform field of dx=10. Lo and behold, there was a boundary current at the join. :roll:

Then again, my dad would say that if you have a smooth transition from coarse to fine, there will still be waves that are supported on one side and not the other. Those waves will be reflected regardless of the grid details.


#6 Post by arango » Sun Feb 20, 2005 6:46 pm

Yes, my example is exaggerated so that I could have nice inverse numbers. It is not advisable to have such distorted grids anyway. I am still thinking about this grid-spacing issue. In the past, I did not have to worry about the relationship between the grid spacings dx and dy and their inverses pm and pn at different staggered grid locations. Nowadays I need to consider it, because of variational data assimilation (nonlinear and adjoint observation operators) and term diagnostics in a specified subregion involving horizontal integrals:

B=integral(A dx), from x1 to x2

C=integral(A dy), from y1 to y2

We can get different values depending on whether A is at u-points or v-points, which can affect balances. Other issues will continue to appear as our applications expand.

Is it worth modifying the entire code in order to have a consistent meaning of the grid spacing? I have not decided that yet. I am planning to re-read the long Arakawa and Lamb technical report. I read it when I was in graduate school.
