Ocean Modeling Discussion

ROMS/TOMS

PostPosted: Thu Mar 16, 2017 1:46 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I am running MetROMS (ROMS coupled with CICE), and since updating from ROMS 3.6 to the 3.7 trunk, the model has stopped being bit-reproducible: if I run exactly the same simulation twice, it gives me different answers. Not a good sign...

I'm almost positive I've ruled out CICE or the coupling as the issue, by writing to ocean_his every timestep (there are 6 ocean timesteps to every CICE coupling timestep). The first ocean_his record (timestep 0) is always the same, but the second one (i.e. after the first full timestep) always diverges, even though CICE coupling doesn't happen until after the sixth ROMS timestep. There is an initial coupling call, but I did a test run where ROMS ignores those fields from CICE, and it still wasn't bit-reproducible. So unless something really funny is going on with memory, I think the problem is in ROMS.
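(In case it helps anyone repeating this kind of test: the only output setting that matters is the history frequency in the ROMS input file. A minimal sketch, assuming the standard parameter names:)

Code:
   NHIS == 1          ! write a history record every baroclinic timestep
   NDEFHIS == 0       ! keep all records in a single ocean_his file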

This is interesting - only some fields are affected in that first "real" output (i.e. time index 2) to ocean_his. Obviously if I leave it for long enough all variables diverge because they interact. But at that first timestep, this is what happens:

Affected (i.e. different between the two runs): zeta, ubar, vbar, u, w, temp, rho, Hsbl, AKv, AKt, AKs, shflux, lwrad, bustr, bvstr
Unaffected: m (ice shelf melt rate), salt, ssflux, swrad, sustr, svstr

I looked at this further and discovered the anomalies were only happening along the northern boundary: the northernmost 3 rows for temp, 8 for u and v, 9 for zeta. Again, that's just for the first timestep; the anomalies propagate south later on. This is why m was unaffected even though it depends on temperature: the ice shelf cavities are so far south in the domain. Within those rows, temperature was affected at just 2 cells, off the east coast of Australia; for velocity it was most of the northern boundary (it makes sense for velocity to propagate more, given the 30 barotropic timesteps for every baroclinic one).

I thought the issue might be my northern boundary sponge because I coded it myself. So I turned it off, and it was still not bit-reproducible, but temperature was affected in the northernmost 2 rows instead of 3, u 7 instead of 8, v 9 instead of 8, zeta 7 instead of 9.

Okay, so not the sponge. I kept it turned off anyway, for now. Next I set all the northern boundary conditions to closed in my .in file. Previously they were:

Code:
   
   LBC(isFsur) ==   Per     Clo     Per     Cha         ! free-surface
   LBC(isUbar) ==   Per     Clo     Per     Cla         ! 2D U-momentum
   LBC(isVbar) ==   Per     Clo     Per     Fla         ! 2D V-momentum
   LBC(isUvel) ==   Per     Clo     Per     Cla         ! 3D U-momentum
   LBC(isVvel) ==   Per     Clo     Per     RadNud         ! 3D V-momentum
   LBC(isMtke) ==   Per     Clo     Per     Rad         ! mixing TKE

   LBC(isTvar) ==   Per     Clo     Per     RadNud \       ! temperature
                    Per     Clo     Per     RadNud         ! salinity


With all of the northern boundary options (rightmost of the 4 columns) set to "Clo", suddenly temperature was affected in the northernmost 26 rows, which included those cells off Australia as well as the 3 westernmost and 3 easternmost columns (i.e. along the periodic boundary). Also, salinity was now affected where it wasn't before (northernmost 20 rows), along with 33 rows for u, 34 for v, and 35 for zeta.

Then I tried setting all the northern boundary options to "Gra". Now it wasn't the northern boundary for temperature, but the periodic boundary: 7 columns on either side. For salinity, 2 columns on the west side and 1 on the east.

What is going on??!!


PostPosted: Thu Mar 16, 2017 3:18 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Possibly related issue: Since I updated to 3.7 I have also occasionally had segfaults that don't reproduce themselves. That is, it will segfault, I resubmit the job without even recompiling, and then it doesn't segfault. I understand segfaults that show up inconsistently are often related to variables not being initialised. Might also be causing the lack of bit reproducibility?


Last edited by k.alexander on Thu Mar 16, 2017 5:04 am, edited 1 time in total.

PostPosted: Thu Mar 16, 2017 3:58 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Something else interesting: if I turn off UV_ADV, temperature and some other variables are now unaffected (i.e. bit-reproducible at least in time index 2 of ocean_his). Here is the full list:

Affected: ubar, u, v, w, AKt, AKs
Unaffected: m, zeta, vbar, temp, salt, rho, Hsbl, AKv, shflux, ssflux, lwrad, swrad, sustr, svstr, bustr, bvstr

ubar and u now only diverge at the very northernmost row (all depths of u are affected). v is affected at exactly one cell, one away from the northern boundary, 6 layers below the surface. w is affected at 4 cells in the northernmost 2 rows. AKt and AKs are affected in lots of cells in the northernmost 4 rows.

So at least this narrows down the part(s) of the code where this is originating from.


PostPosted: Thu Mar 16, 2017 4:46 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I just switched to a single processor for ROMS (NtileI=1, NtileJ=1 instead of 32 and 16 as I usually have) and now all variables are bit reproducible.

So it is something about the parallel decomposition. Note that USE_MPI, DISTRIBUTE etc are still defined, so all the same calls are made to mp_exchange2d etc.
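(For reference, the tiling is set in the same input file; this is the only thing that changes between the two runs:)

Code:
   NtileI == 1        ! I-direction partition (32 in my usual runs)
   NtileJ == 1        ! J-direction partition (16 in my usual runs)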


PostPosted: Thu Mar 16, 2017 6:22 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
On only one tile, that call to mp_exchange2d isn't going to be doing a whole lot.

It sounds like it's time to add strategic print statements and run parallel vs. serial, looking for output differences. I have been known to run the debugger with 1x4 tiling vs. 4x1 tiling in two windows. The last time I went down that path I beat back everything until the answers did match, but that wasn't metroms. I guess I could try that with metroms for a few steps.

I know of a model where setting VERBOSITY = 9 makes it spill its guts, creating lots of output for comparing to later.
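Something like the fragment below, dropped into a suspect tiled routine, is the kind of print statement I mean (just a sketch; the array, its time index, and the loop bounds have to be adjusted to whatever routine you paste it into):

Code:
      real(r8) :: chksum
      integer  :: i, j
!
!  Tile checksum of (for example) zeta, tagged with MPI rank and tile
!  number, so the sorted output of two runs or two tilings can be
!  diffed directly.
!
      chksum=0.0_r8
      DO j=Jstr,Jend
        DO i=Istr,Iend
          chksum=chksum+ABS(zeta(i,j,knew))
        END DO
      END DO
      WRITE (stdout,10) MyRank, tile, chksum
 10   FORMAT (' rank ',i4,'  tile ',i4,'  zeta checksum = ',1p,e23.15)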


PostPosted: Thu Mar 16, 2017 7:02 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
My understanding was that parallel vs. serial will usually give output differences, though? But parallel vs. parallel with the same tiling should certainly give the same answer.


PostPosted: Thu Mar 16, 2017 7:09 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
No, they should *all* be the same.


PostPosted: Thu Mar 16, 2017 7:48 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
I just ran a few steps of my branch of metroms with 1x10 and 10x1 tilings. After four steps, there are no diffs for T,S or u_eastward. This is for a pan-Arctic domain. What code is this, you ask? It's a metroms branch of my roms github repo on the ROMS side and a metroms branch of my CICE repo on the other side. I was going to fork the official github CICE once they put it out there because mine should be a fork of the official code. Well, maybe I'll just have to put mine on github because Someone is taking too long.


PostPosted: Thu Mar 16, 2017 10:06 pm 

Joined: Tue Jul 01, 2003 4:12 am
Posts: 481
Location: NIWA
k.alexander wrote:
Possibly related issue: Since I updated to 3.7 I have also occasionally had segfaults that don't reproduce themselves. That is, it will segfault, I resubmit the job without even recompiling, and then it doesn't segfault. I understand segfaults that show up inconsistently are often related to variables not being initialised. Might also be causing the lack of bit reproducibility?


I trust you have tried running with bounds checking and the other debug checks turned on?

ROMS segfaults are often due to silly little issues: badly formed input files and/or incomplete checking by the routines that read them.
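With ifort that usually means something along these lines (a sketch; exact flag spellings vary between compiler versions), or simply building with ROMS's USE_DEBUG option and seeing what your Compilers/*.mk file adds:

Code:
-g -traceback -O0 -check all -fpe0 -ftrapuv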


PostPosted: Fri Mar 17, 2017 2:19 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
So this is interesting: some parallel tilings are okay and others are not.

The following have exactly the same output as 1x1:
2x1
1x2
2x2
16x1
1x16
8x2
24x1
30x1
31x1
8x4
12x4
15x4
16x4
16x8

The following don't:
16x16
16x2
32x1
32x16 (this is what I usually run with)

I was confused as to why 16x4 and 16x8 were fine and 16x2 wasn't. So I ran 16x2 again and it was fine this time. And then a third time and it was not. And then 16x4 and 16x8 again and they were still fine. If it matters, I have Lm=1440 and Mm=428.

I think it must be something which is more likely to be triggered with increasing numbers of processors. If it has to do with an uninitialised variable it makes sense that it's inconsistent.

When I run with -check all (and 32x16 processors), first of all I get tons of these warnings during initialisation:

Code:
forrtl: warning (402): fort: (1): In call to LOAD_L, an array temporary was created for argument #4


but no indication of what file they come from. Then during the first timestep, the code dies with this error:

Code:
forrtl: severe (408): fort: (3): Subscript #2 of the array T has value -1 which is less than the lower bound of 0


The traceback points to line 77 in pre_step3d.f90, which is the start of this call:

Code:
      CALL pre_step3d_tile (ng, tile,                                   &
     &                      LBi, UBi, LBj, UBj,                         &
     &                      IminS, ImaxS, JminS, JmaxS,                 &
     &                      nrhs(ng), nstp(ng), nnew(ng),               &
     &                      GRID(ng) % rmask,                           &
     &                      GRID(ng) % umask,                           &
     &                      GRID(ng) % vmask,                           &
     &                      GRID(ng) % pm,                              &
     &                      GRID(ng) % pn,                              &
     &                      GRID(ng) % Hz,                              &
     &                      GRID(ng) % Huon,                            &
     &                      GRID(ng) % Hvom,                            &
     &                      GRID(ng) % z_r,                             &
     &                      GRID(ng) % z_w,                             &
     &                      FORCES(ng) % btflx,                         &
     &                      FORCES(ng) % bustr,                         &
     &                      FORCES(ng) % bvstr,                         &
     &                      FORCES(ng) % stflx,                         &
     &                      FORCES(ng) % sustr,                         &
     &                      FORCES(ng) % svstr,                         &
     &                      MIXING(ng) % Akt,                           &
     &                      MIXING(ng) % Akv,                           &
     &                      MIXING(ng) % ghats,                         &
     &                      OCEAN(ng) % W,                              &
     &                      OCEAN(ng) % ru,                             &
     &                      OCEAN(ng) % rv,                             &
     &                      OCEAN(ng) % t,                              &
     &                      OCEAN(ng) % u,                              &
     &                      OCEAN(ng) % v)


I think it's complaining about OCEAN(ng) % t, which has dimensions (LBi:UBi,LBj:UBj,N(ng),3,NT(ng)) as stated in mod_ocean. So subscript #2 is the LBj:UBj one, and it's saying the index has value -1, below the lower bound of 0. But this doesn't make sense, because in get_bounds LBj should be Jmin, which is either 0 or 1 depending on the grid type, and my grid is not north-south periodic. Also, shouldn't OCEAN(ng)%t have been accessed well before the first call to pre_step3d?


PostPosted: Fri Mar 17, 2017 3:04 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Update: this bounds error also happens when running with a single processor. I'm wondering if it's a red herring.


PostPosted: Fri Mar 17, 2017 4:32 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
Maybe we need to ask which compiler? Is there another compiler you can try?


PostPosted: Fri Mar 17, 2017 5:10 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I'm using the mpif90 compiler from openmpi 1.8.4, which is a wrapper for ifort, using intel-fc version 12.1.9.293. However I used the same compiler for ROMS 3.6 which was bit-reproducible even for 32x16 tiling. It's possible that the intel-fc version has changed on the HPC cluster since then, but the sysadmins are usually really good about testing those things.

My compiler flags are:

Code:
-r8 -i4 -align all -w -ftz -convert big_endian -assume byterecl -no-vec -xHost -g -traceback -O3


Again, these are the same as for ROMS 3.6. And yes, I have tried -O0; it doesn't fix the problem.


PostPosted: Fri Mar 17, 2017 6:01 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Just FYI, I will have no internet access for the next week (off to see the Great Barrier Reef while it's still there!) so there won't be any updates from me here.


PostPosted: Fri Mar 17, 2017 6:08 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
I'm also using ifort as my default compiler for ROMS, but I have the option of trying gfortran as well. Do you?


PostPosted: Fri Mar 17, 2017 6:10 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I've never tried it; I expect it would involve a lot of messing with compiler flags, and gfortran runs more slowly than ifort. I don't think the compiler is the issue anyway.


PostPosted: Sat Mar 18, 2017 11:19 am 

Joined: Thu Oct 01, 2015 4:41 am
Posts: 4
Location: KAUST
I've seen the same behavior with ROMS svn 797, i.e. sensitivity to the number of tiles. We also know that identical setups are not exactly reproducible across old and new hardware.
As you've mentioned, serial runs are bit-reproducible while DM (distributed-memory parallel) runs are not. The way I think about the sensitivity of the DM approach to the decomposition is the following. Consider a 1D array (say a velocity profile) being averaged across multiple cores. Summing the elements in a different order gives a slightly different answer due to truncation error. This error is small but present at every x and y. Take the inverse of a matrix built from that, which is often approximated with iterative methods, and you now have a much larger error. Keep integrating forward in time and this error propagates. The end result is equivalent to an epsilon perturbation of the initial conditions. I'm not saying this is exactly what happens in ROMS; it's just the general idea.
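To make the truncation-error point concrete, here is a standalone illustration (not ROMS code): floating-point addition is not associative, so the same numbers combined in a different order can give a different result.

Code:
program sum_order
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp) :: big, small, s1, s2
   big   = 1.0e20_dp
   small = 1.0_dp
   s1 = (big + small) - big   ! small is absorbed into big first: 0.0
   s2 = (big - big) + small   ! big cancels first, small survives: 1.0
   print *, s1, s2            ! same three numbers, different sums
end program sum_order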


PostPosted: Sun Mar 19, 2017 3:58 pm 

Joined: Mon Oct 26, 2009 3:06 am
Posts: 9
Location: Portland State University
There is some literature on maintaining bit-reproducible simulations under MPI, e.g.,
http://www.sciencedirect.com/science/ar ... 9114000507

While the lack of bit-level reproducibility is annoying, hopefully the divergence of solutions is not physically significant.


PostPosted: Mon Mar 27, 2017 4:38 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Thanks for the replies - interesting to see that others are having the same issues. The explanation of how truncation errors propagate makes sense; however, I don't think it applies here, because ROMS 3.6 was bit-reproducible for me (and the current version is bit-reproducible for certain parallel tilings but not others). Also, the initial conditions are always read the same way: the first time index in ocean_his (written before a full timestep has passed) is always identical between runs, and it only diverges at index 2.


PostPosted: Mon Mar 27, 2017 5:16 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Just to confirm, I have svn 820. I took a look at the updates since then (the trunk is now at svn 839) and I don't see anything relevant - mainly just 4D-Var stuff that I have turned off anyway. I have now posted this as a bug report, since we know it's not just me (Serega.Osipov has had the same problems).


PostPosted: Mon Mar 27, 2017 11:27 pm 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I asked the advice of one of my colleagues who is a software engineer working with different ocean models, and he had this to say:

Quote:
Those results are somewhat reassuring, if you can get any parallel run
to match the 1x1 results.

It is possible that your collectives are not reproducible, like
MPI_Reduce (used for sums, max, min, etc). They use binary trees over
the MPI ranks, like this:

sum = s1 + s2

s1 = s11 + s12
s2 = s21 + s22

s11 = rank 1 + rank 2
s12 = rank 3 + rank 4
s21 = rank 5 + rank 6
s22 = rank 7 + rank 8

and it may be that most of your layouts are "normal" and fit evenly over
the tree, but that the others have a slightly different layout.

It's possible that ROMS changed some of its collectives, and didn't
check this out. There are tricks to making these reproducible, e.g.
doing it manually with point-to-point. You also might be able to
control this within the MPI library.


Does this sound likely to any of the ROMS developers here?
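For concreteness, here is roughly what the "doing it manually with point-to-point" trick he mentions could look like (a sketch, not anything from ROMS; it fixes the summation order for a given number of ranks, though the answer would still change if the tiling changes):

Code:
subroutine ordered_global_sum (local, global, comm)
!
!  Sketch of a reproducible global sum: gather every rank's partial
!  sum to rank 0, add them in rank order, then broadcast the result.
!  Unlike MPI_Allreduce, the summation order never depends on how the
!  library happens to build its reduction tree.
!
  use mpi
  implicit none
  real(8), intent(in)  :: local
  real(8), intent(out) :: global
  integer, intent(in)  :: comm
  real(8), allocatable :: parts(:)
  integer :: rank, nprocs, ierr, i

  call MPI_Comm_rank (comm, rank, ierr)
  call MPI_Comm_size (comm, nprocs, ierr)
  allocate ( parts(nprocs) )
  call MPI_Gather (local, 1, MPI_DOUBLE_PRECISION,                      &
                   parts, 1, MPI_DOUBLE_PRECISION, 0, comm, ierr)
  global = 0.0d0
  if (rank == 0) then
    do i = 1, nprocs                 ! fixed summation order
      global = global + parts(i)
    end do
  end if
  call MPI_Bcast (global, 1, MPI_DOUBLE_PRECISION, 0, comm, ierr)
  deallocate (parts)
end subroutine ordered_global_sum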


PostPosted: Tue Mar 28, 2017 12:21 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
Doing gitk on distribute.F, the last real change was ticket 653 in 2014.

Ticket 584 changed calls to mpi_alltoall to mpi_allgather (2013).

There was ticket 554 in 2012 adding aggregate functions.


PostPosted: Tue Mar 28, 2017 1:00 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
For the ROMS 3.6 distribution that was bit-reproducible for me, distribute.F is stamped with

Code:
!svn $Id: distribute.F 645 2013-01-22 23:21:54Z arango $


and yes, it uses mpi_alltoall rather than mpi_allgather.


PostPosted: Tue Mar 28, 2017 4:32 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
All three of those tickets have something to do with moving to the nested code: 584, for instance.


PostPosted: Tue Mar 28, 2017 5:05 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Hmmm, very interesting. That ticket says "We get identical nesting solutions in serial with partitions, shared-memory (OpenMP), and distributed-memory (MPI) for tile partitions of 1x1, 2x2, and 3x3." So some tile partitions have been tested, but it's not an exhaustive list. It also looks like, even when nesting is off (as it is in my configuration), the kernel does not reduce to the way it was in 3.6.


PostPosted: Tue Mar 28, 2017 5:47 am 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
Well, can you simply go back to the old distribute.F?


PostPosted: Tue Mar 28, 2017 5:51 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I will try that tomorrow and let you know what happens. Hopefully I can just revert that file without having everything else break.


PostPosted: Wed Mar 29, 2017 3:29 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
It turns out you can just switch to the old version of distribute.F and nothing breaks! Unfortunately it doesn't fix the problem; it's still not bit-reproducible.


PostPosted: Mon Apr 10, 2017 2:31 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I finally got around to setting up standalone ROMS with no modifications. Same grid, forcing, boundary conditions as before, but no sea ice, no ice shelves, none of the other miscellaneous things I've messed with. It is the most recent version of the code as of today.

As before, ROMS is bit-reproducible for serial simulations (1x1) but not for certain parallel simulations (32x16, I didn't bother trying anything else).

So basically I have just confirmed the problem is actually in the ROMS source code and not any of the MetROMS modifications or my own modifications.


PostPosted: Mon Apr 10, 2017 12:20 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3264
Location: IMS/UAF, USA
With that, you should be able to do "git bisect" to find out when things changed. You can get a git version of the history from "git svn clone" (which will take a while to download the whole history), or do the same operations manually from an svn repo (which will be slow at every step).


PostPosted: Tue Apr 11, 2017 4:53 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
I seem to have resolved the problem by switching from openmpi 1.8.4 to openmpi 1.10.2. I haven't tested all possible tilings, of course, but it is now bit-reproducible for 32x16. I have confirmed this twice.

Comparing 32x16 to 1x1, all the variables are the same except for swrad on one patch of the periodic boundary. This is true right from the first timestep, and strangely enough doesn't seem to impact any of the other variables. So there is still something funny going on. But as long as I keep a consistent tiling it should be okay.

Given that ROMS 3.6 was bit-reproducible with openmpi 1.8.4 (at least for me), I expect some part of the 3.7 kernel update uses features that didn't work properly in older versions of openmpi. Or something like that.


PostPosted: Fri Apr 14, 2017 6:46 pm 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1019
Location: IMCS, Rutgers University
Well, it seems that there is a problem with that version of OpenMPI. I wonder what they changed. This kind of problem is difficult to track down and frustrating. Compilers and libraries are known to have bugs. One thing to check is whether the OpenMPI library was compiled with the same compiler that you are using for ROMS.

Now, exact reproduction of an application independent of the parallel partition is not possible if we have global reductions. A couple of years ago I posted a warning about this - please check it out.


PostPosted: Tue Apr 18, 2017 12:37 am 

Joined: Tue Jun 28, 2016 2:08 pm
Posts: 54
Location: CCRC (UNSW), ARCCSS, ACE CRC
Thanks, Hernan, for the link to your warning; I hadn't seen it before. I'm not running with volume conservation enabled, but the same principles hold for any global reduction.

