This time the question came up about perfect restarts. The option PERFECT_RESTART promises that a run from beginning to end all in one go should match one made with a restart in the middle. With this option, more fields need to be written to the restart field, but were we saving enough? I had done this exercise before for the sea ice model, but not with WET_DRY as well.
The way to test for perfect restarts is to run for some number of timesteps X, save a restart file, run for one more step and save a history file. Then in another directory, copy the first restart file and use it to start at timestep X and run for one step, saving a history file. Now compare the history files with ncdiff and ncview.
What value of X to use? I didn’t try X=1 because ROMS doesn’t get into its default timestep until timestep 3. I therefore started with X=3 and found lots of differences between the two history files.
Time to go to a debugger – actually, duelling debuggers. One debugger starts from time 0, the other from step X. To do this, you don’t want to try a large X because then you’d be running all those steps in the debugger for the first debugger window. I found that for X=3, the u and v fields matched perfectly at step X, but the value of nrhs differed. nrhs is used in the computation of W, the vertical velocity. With the vertical velocities mismatched, everything evolved differently.
Rather than fix this, I decided to try again with an even value for X, say X=2. After all, I usually save after an even number of timesteps and I just wanted to see if something else was off. Indeed, the differences were more localized, but there still were differences, especially along the shallow parts where WET_DRY would be involved.
After much digging around, I found two things going on. The first is that the wet-dry masks are computed during initialization and also in calls to wetdry.F from step2d. One needs to make sure that on restart, one either needs to have saved all those masks or to recompute them consistently.
The other source of trouble comes from how ROMS writes out fields to the NetCDF files. Most fields are written with masked areas getting the _FillValue. For WET_DRY, many are being saved with rmask_io, which masks out the dry cells. This can cause a loss of information of the state of the dry cells. For instance, tracer values in the dry cells are timestepped in ROMS, including advection terms if flow happens to be entering a dry cell. If these dry cell tracers are read from masked out values in the restart file, they will be set to zero.
In the end, I didn’t have to save any more fields, but simply to update the mask initialization and to change the masks used in the saving of restart fields. I also found that I had a lingering ice restart bug, also fixed by the writing of restart fields. You might find it to be interesting, so I’ll try to describe it:
- The ice is timestepped before the call to output because it contributes to the computation of surface fluxes.
- For PERFECT_RESTART, I wrote a routine to overwrite surface fluxes from the saved ones in the restart file, skipping the call to seaice on the first restart step.
- I found a mismatch at one lone point next to a land mask (not near the WET_DRY cells though).
- The mismatch stemmed from the computation of the vertical mixing (GLS_MIXING) near the surface.
- Near-surface mixing gets a contribution from the surface stresses.
- The stresses are averaged from velocity points to rho points.
- On restart, the stresses were set to zero for the velocity points at the land-sea mask boundary.
- Turning off the masking while writing the stresses fixed this restart issue.
One more thing: ROMS has some funny features with the writing of history files. If you run for X steps, asking for a history save at step X+1, you won’t get any history file at all. Then on restart, it will complain that it can’t find one and fail.
If you run for X+1 steps, asking for a history record at step X+1, you will get one record for the initial conditions, one for time X+1. For my runs, I ask for one file per record.
If you restart from step X, giving it a history file to build on, it will save to the first file. I then need to compare file 0002 from one run with file 0001 from the other run.