Yes, debugging can be fun. It’s a job where you know there’s an answer to be found in some finite amount of time. It’s not always as quick as you expect, though.
Our Arctic domain has been run for years and doesn’t look too horrible, but there’s a bit too much ice in the summers. I’ve tried a couple of different albedo schemes, then found one more to try. It’s by Ebert and Curry and is the only one to include the effects of melt ponds. Why hadn’t I ever tried it before? A quick test with PGI Fortran showed me why – the thing blew up in a few timesteps. Maybe that had happened before and I found it easier to just go back to another albedo option.
This time, I recompiled for the debugger and fired up totalview. The thing ran this time, no problem. My favorite trick in this situation is to try another compiler. I’ve also got gfortran, so I gave that a whirl. That one ran fine too – but then it blew up three steps after a restart. How strange to run for hundreds of days, then fail right after a restart. Both this and the earlier PGI failure reported trouble from the ice_frazil routine, so I looked to the temperature field.
Totalview and gfortran don’t play nice, so it’s back to PGI and this time it failed in the same way as gfortran, in the debugger even. I focused on the i,j point reported by the ice_frazil error message and I discovered that in the very first timestep, the surface salinity value became nan, then the entire water column of salinity was set to zero. I’m not quite sure how that loss of nan happened… Those zero salinities with sub-zero temperatures generated a bunch of ice at just the one point and the water column then became unstable.
The nan at the surface led me to check the surface tracer forcing, which indeed was nan for salt. On restart, the ice concentration contained a very tiny non-zero value which introduced trouble to the computation of the ice albedo there. I have an ice_limit routine which keeps ice fields well-behaved, but I wasn’t running it from within get_state. I looked into calling it there, but ice_limit works on the MPI tile while get_state does not. I managed to add a call to ice_limit from the ini_fields routine instead. It’s running again – in both PGI and gfortran. The problem this time was due to a single precision restart file, but I already knew I wanted a call to ice_limit on startup.
While I was poking around in there, I also fixed it so that trouble in ice_frazil really does cause the job to end. I had added a check on exit_flag, but was then broadcasting the value from process 0, not broadcasting the bad value to the rest of the nodes – unless process 0 was the one in trouble. It needed a “MAX” reduction so that any process in trouble made that trouble known to the head node.
Anyway, I learned something, which is always fun.