Ocean Modeling Discussion

ROMS/TOMS

Search for:
It is currently Thu Mar 23, 2017 8:11 pm




Post new topic Reply to topic  [ 18 posts ] 

All times are UTC

Author Message
PostPosted: Thu Mar 16, 2017 1:46 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
I am running MetROMS (ROMS coupled with CICE) and since updating from ROMS 3.6 to the 3.7 trunk, the model has stopped being bit reproducible. i.e. if I run exactly the same simulation twice it gives me different answers. Not a good sign...

I'm almost positive that I've ruled out CICE or the coupling being the issue, by writing to ocean_his every timestep (there are 6 ocean timesteps to every CICE coupling timestep). The first ocean_his output (timestep 0) is always the same. But the second one (i.e. after the first full timestep) always diverges, even though CICE coupling doesn't happen until after the sixth ROMS timestep. There is an initial coupling call but I did a test run where ROMS ignores these fields from CICE, and it still wasn't bit-reproducible. So unless something really funny is going on with memory, I think the problem is with ROMS.

This is interesting - only some fields are affected in that first "real" output (i.e. time index 2) to ocean_his. Obviously if I leave it for long enough all variables diverge because they interact. But at that first timestep, this is what happens:

Affected (i.e. different between the two runs): zeta, ubar, vbar, u, w, temp, rho, Hsbl, AKv, AKt, AKs, shflux, lwrad, bustr, bvstr
Unaffected: m (ice shelf melt rate), salt, ssflux, swrad, sustr, svstr

I looked at this further and discovered the anomalies were only happening along the northern boundary: the northernmost 3 rows for temp, 8 cells for u and v, 9 for zeta. Again, just for the first timestep, the anomalies propogate south later on. This is why m was unaffected even though it depends on temperature, because the ice shelf cavities are so far south in the domain. For temperature it was just 2 cells affected off the east coast of Australia; for velocity it was most of the northern boundary (makes sense for it to propogate more, because of the 30 barotropic timesteps for every baroclinic).

I thought the issue might be my northern boundary sponge because I coded it myself. So I turned it off, and it was still not bit-reproducible, but temperature was affected in the northernmost 2 rows instead of 3, u 7 instead of 8, v 9 instead of 8, zeta 7 instead of 9.

Okay, so not the sponge. I kept it turned off anyway, for now. Next I set all the northern boundary conditions to closed in my .in file. Previously they were:

Code:
   
   LBC(isFsur) ==   Per     Clo     Per     Cha         ! free-surface
   LBC(isUbar) ==   Per     Clo     Per     Cla         ! 2D U-momentum
   LBC(isVbar) ==   Per     Clo     Per     Fla         ! 2D V-momentum
   LBC(isUvel) ==   Per     Clo     Per     Cla         ! 3D U-momentum
   LBC(isVvel) ==   Per     Clo     Per     RadNud         ! 3D V-momentum
   LBC(isMtke) ==   Per     Clo     Per     Rad         ! mixing TKE

   LBC(isTvar) ==   Per     Clo     Per     RadNud \       ! temperature
                    Per     Clo     Per     RadNud         ! salinity


With all of the northern boundary options (rightmost of the 4 columns) set to "Clo", suddenly temperature was affected in the northernmost 26 rows! which included those cells off Australia as well as the 3 westernmost and 3 easternmost columns (i.e. along the periodic boundary). Also, salinity was affected whereas it wasn't before (northernmost 20 rows). 33 rows for u, 34 for v, 35 for zeta.

Then I tried setting all the northern boundary options to "Gra". Now it wasn't the northern boundary for temperature, but the periodic boundary: 7 columns on either side. For salinity, 2 columns on the west side and 1 on the east.

What is going on??!!


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 3:18 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
Possibly related issue: Since I updated to 3.7 I have also occasionally had segfaults that don't reproduce themselves. That is, it will segfault, I resubmit the job without even recompiling, and then it doesn't segfault. I understand segfaults that show up inconsistently are often related to variables not being initialised. Might also be causing the lack of bit reproducibility?


Last edited by k.alexander on Thu Mar 16, 2017 5:04 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 3:58 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
Something else interesting: if I turn off UV_ADV, temperature and some other variables are now unaffected (i.e. bit-reproducible at least in time index 2 of ocean_his). Here is the full list:

Affected: ubar, u, v, w, AKt, AKs
Unaffected: m, zeta, vbar, temp, salt, rho, Hsbl, AKv, shflux, ssflux, lwrad, swrad, sustr, svstr, bustr, bvstr

ubar and u now only diverge at the very northernmost row (all depths of u are affected). v is affected at exactly one cell, one away from the northern boundary, 6 layers below the surface. w is affected at 4 cells in the northernmost 2 rows. AKt and AKs are affected in lots of cells in the northernmost 4 rows.

So at least this narrows down the part(s) of the code where this is originating from.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 4:46 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
I just switched to a single processor for ROMS (NtileI=1, NtileJ=1 instead of 32 and 16 as I usually have) and now all variables are bit reproducible.

So it is something about the parallel decomposition. Note that USE_MPI, DISTRIBUTE etc are still defined, so all the same calls are made to mp_exchange2d etc.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 6:22 am 
Offline
User avatar

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3023
Location: IMS/UAF, USA
On only one tile, that call to mp_exchange2d isn't going to be doing a whole lot.

It sounds like time to be adding strategic print statements and running parallel/serial to be looking for output differences. I have been known to run the debugger with 1x4 tiling vs. 4x1 tiling in two windows. The last time I went down that path I beat back everything until answers did match, but that wasn't metroms. I guess I could try that with metroms for a few steps.

I know of a model where setting VERBOSITY = 9 makes it spill its guts, creating lots of output for comparing to later.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 7:02 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
I understand parallel vs serial will usually give output differences though? But parallel vs parallel with the same tiling should be the same.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 7:09 am 
Offline
User avatar

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3023
Location: IMS/UAF, USA
No, they should *all* be the same.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 7:48 am 
Offline
User avatar

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3023
Location: IMS/UAF, USA
I just ran a few steps of my branch of metroms with 1x10 and 10x1 tilings. After four steps, there are no diffs for T,S or u_eastward. This is for a pan-Arctic domain. What code is this, you ask? It's a metroms branch of my roms github repo on the ROMS side and a metroms branch of my CICE repo on the other side. I was going to fork the official github CICE once they put it out there because mine should be a fork of the official code. Well, maybe I'll just have to put mine on github because Someone is taking too long.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 16, 2017 10:06 pm 
Offline
User avatar

Joined: Tue Jul 01, 2003 4:12 am
Posts: 470
Location: NIWA
k.alexander wrote:
Possibly related issue: Since I updated to 3.7 I have also occasionally had segfaults that don't reproduce themselves. That is, it will segfault, I resubmit the job without even recompiling, and then it doesn't segfault. I understand segfaults that show up inconsistently are often related to variables not being initialised. Might also be causing the lack of bit reproducibility?


I trust you have tried running with bounds checking and the other debug checks turned on?

ROMS segfaults are often due to silly little issues: badly formed input files and/or incomplete checking by the routines that read them.


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 2:19 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
So this is interesting, some parallel tilings are okay and others are not.

The following have exactly the same output as 1x1:
2x1
1x2
2x2
16x1
1x16
8x2
24x1
30x1
31x1
8x4
12x4
15x4
16x4
16x8

The following don't:
16x16
16x2
32x1
32x16 (this is what I usually run with)

I was confused as to why 16x4 and 16x8 were fine and 16x2 wasn't. So I ran 16x2 again and it was fine this time. And then a third time and it was not. And then 16x4 and 16x8 again and they were still fine. If it matters, I have Lm=1440 and Mm=428.

I think it must be something which is more likely to be triggered with increasing numbers of processors. If it has to do with an uninitialised variable it makes sense that it's inconsistent.

When I run with -check all (and 32x16 processors), first of all I get tons of these warnings during initialisation:

Code:
forrtl: warning (402): fort: (1): In call to LOAD_L, an array temporary was created for argument #4


but no indication of what file they come from. Then during the first timestep, the code dies with this error:

Code:
forrtl: severe (408): fort: (3): Subscript #2 of the array T has value -1 which is less than the lower bound of 0


The traceback points to line 77 in pre_step3d.f90. Here is that line:

Code:
      CALL pre_step3d_tile (ng, tile,                                   &
     &                      LBi, UBi, LBj, UBj,                         &
     &                      IminS, ImaxS, JminS, JmaxS,                 &
     &                      nrhs(ng), nstp(ng), nnew(ng),               &
     &                      GRID(ng) % rmask,                           &
     &                      GRID(ng) % umask,                           &
     &                      GRID(ng) % vmask,                           &
     &                      GRID(ng) % pm,                              &
     &                      GRID(ng) % pn,                              &
     &                      GRID(ng) % Hz,                              &
     &                      GRID(ng) % Huon,                            &
     &                      GRID(ng) % Hvom,                            &
     &                      GRID(ng) % z_r,                             &
     &                      GRID(ng) % z_w,                             &
     &                      FORCES(ng) % btflx,                         &
     &                      FORCES(ng) % bustr,                         &
     &                      FORCES(ng) % bvstr,                         &
     &                      FORCES(ng) % stflx,                         &
     &                      FORCES(ng) % sustr,                         &
     &                      FORCES(ng) % svstr,                         &
     &                      MIXING(ng) % Akt,                           &
     &                      MIXING(ng) % Akv,                           &
     &                      MIXING(ng) % ghats,                         &
     &                      OCEAN(ng) % W,                              &
     &                      OCEAN(ng) % ru,                             &
     &                      OCEAN(ng) % rv,                             &
     &                      OCEAN(ng) % t,                              &
     &                      OCEAN(ng) % u,                              &
     &                      OCEAN(ng) % v)


I think it's complaining about OCEAN(ng) % t which is dimension (LBi:UBi,LBj:UBj,N(ng),3,NT(ng)) as stated in mod_ocean. So subscript 2 is LBj:UBj, and it's saying it has lower bound -1. But this doesn't make sense because in get_bounds LBj should have lower bound Jmin which is either 0 or 1 depending on the grid type, since my grid is not north-south periodic. Also, shouldn't OCEAN(ng)%t have been accessed before the first call to pre_step3d?


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 3:04 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
Update: this bounds error also happens when running with a single processor. I'm wondering if it's a red herring.


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 4:32 am 
Offline
User avatar

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3023
Location: IMS/UAF, USA
Maybe we need to ask which compiler? Is there another compiler you can try?


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 5:10 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
I'm using the mpif90 compiler from openmpi 1.8.4, which is a wrapper for ifort, using intel-fc version 12.1.9.293. However I used the same compiler for ROMS 3.6 which was bit-reproducible even for 32x16 tiling. It's possible that the intel-fc version has changed on the HPC cluster since then, but the sysadmins are usually really good about testing those things.

My compiler flags are:

Code:
-r8 -i4 -align all -w -ftz -convert big_endian -assume byterecl -no-vec -xHost -g -traceback -O3


again, the same as for ROMS 3.6. And yes I have tried -O0, and it doesn't fix the problem.


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 6:01 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
Just FYI, I will have no internet access for the next week (off to see the Great Barrier Reef while it's still there!) so there won't be any updates from me here.


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 6:08 am 
Offline
User avatar

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3023
Location: IMS/UAF, USA
I'm also using ifort as my default compiler for ROMS, but I have the option of trying gfortran as well. Do you?


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 17, 2017 6:10 am 
Offline

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 40
Location: CCRC (UNSW), ARCCSS, ACE CRC
I've never tried it, I expect it would involve a lot of messing with compiler flags, plus gfortran runs more slowly than ifort. I don't think the compiler is the issue anyway.


Top
 Profile  
Reply with quote  
PostPosted: Sat Mar 18, 2017 11:19 am 
Offline

Joined: Thu Oct 01, 2015 4:41 am
Posts: 3
Location: KAUST
I've seen the same behavior with ROMS svn 797, i.e. sensitivity to the number of tiles. We also now that identical setups are not exactly reproducible on a old and new hardware.
As you've mentioned, serial runs are bit-reproducible, while DM (distributed memory parallelism) are not. The way I think about DM approach sensitivity to the decomposition is the following. Consider 1D array being averaged on multiple cores (say velocity profile). Having different order for the each element summation will result in slightly different answer due to truncation error. This error is small and present for each x and y. Take inverse matrix of that, which is often approximated with iterative methods, and now you have a much larger error. Keep integrating forward in time and this error will propagate. So the end result would be identical to the epsilon perturbation of the initial conditions. I'm not saying this is representative for ROMS, that is just a general idea.


Top
 Profile  
Reply with quote  
PostPosted: Sun Mar 19, 2017 3:58 pm 
Offline

Joined: Mon Oct 26, 2009 3:06 am
Posts: 9
Location: Portland State University
There is some literature on maintaining bit-reproducible simulations under MPI, e.g.,
http://www.sciencedirect.com/science/ar ... 9114000507

While the lack of bit-level reproducibility is annoying, hopefully the divergence of solutions is not physically significant.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 18 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 3 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group