Opened 14 years ago
Closed 14 years ago
#512 closed bug (Fixed)
Corrected shared-memory bug in step3d_uv.F
Reported by: | arango | Owned by: | arango |
---|---|---|---|
Priority: | major | Milestone: | Release ROMS/TOMS 3.5 |
Component: | Nonlinear | Version: | 3.5 |
Keywords: | Cc: |
Description
For several months I have been looking, on and off, for a shared-memory parallel bug in North-South periodic (NS_PERIODIC) applications. The bug appears when TS_MPDATA is activated: I do not get identical solutions with different partitions. I rewrote and analyzed mpdata_adiff.F several times but still got the same problem, so I concluded that the problem was somewhere else. This has been a very difficult bug to track in the debugger.
Currently, I am removing all the CPP options associated with lateral boundary conditions, which allows me to look at the code in great detail. I found the problem in step3d_uv.F. We need to have the following conditionals around line 1122:
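The "identical solutions with different partitions" criterion can be sketched as a small test. This is a toy model, not ROMS code: the stencil, the helper names (`smooth_tile`, `tile_bounds`, `run_partitioned`, `identical_across_partitions`) and the tiling scheme are all assumptions made for illustration. It computes the same north-south periodic stencil tile by tile under several partitions and requires the results to match the single-tile run exactly.

```python
def smooth_tile(field, istr, iend, jstr, jend):
    """5-point average on one tile (hypothetical stencil).

    j wraps around (north-south periodic); i is clamped at the
    closed western and eastern boundaries.
    """
    nj, ni = len(field), len(field[0])
    out = {}
    for j in range(jstr, jend + 1):
        jm, jp = (j - 1) % nj, (j + 1) % nj
        for i in range(istr, iend + 1):
            im, ip = max(i - 1, 0), min(i + 1, ni - 1)
            out[(j, i)] = (field[j][i] + field[jm][i] + field[jp][i]
                           + field[j][im] + field[j][ip]) / 5.0
    return out


def tile_bounds(n, ntiles):
    """Split indices 0..n-1 into ntiles contiguous, inclusive ranges."""
    size, rem = divmod(n, ntiles)
    bounds, start = [], 0
    for t in range(ntiles):
        length = size + (1 if t < rem else 0)
        bounds.append((start, start + length - 1))
        start += length
    return bounds


def run_partitioned(field, ntile_i, ntile_j):
    """Compute the stencil tile by tile and assemble the full result."""
    nj, ni = len(field), len(field[0])
    result = [[0.0] * ni for _ in range(nj)]
    for jstr, jend in tile_bounds(nj, ntile_j):
        for istr, iend in tile_bounds(ni, ntile_i):
            tile_out = smooth_tile(field, istr, iend, jstr, jend)
            for (j, i), value in tile_out.items():
                result[j][i] = value
    return result


def identical_across_partitions(field, partitions):
    """True if every partition reproduces the 1x1 result exactly."""
    reference = run_partitioned(field, 1, 1)
    return all(run_partitioned(field, ni, nj) == reference
               for ni, nj in partitions)
```

The comparison is exact equality rather than a tolerance, because a partition-dependent answer of any size indicates a tiling bug rather than round-off.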

    # if !defined EW_PERIODIC && !defined COMPOSED_GRID
          IF (DOMAIN(ng)%Western_Edge(tile)) THEN
            ...
          END IF
          IF (DOMAIN(ng)%Eastern_Edge(tile)) THEN
            ...
          END IF
    # endif
    # if !defined NS_PERIODIC && !defined COMPOSED_GRID
          IF (DOMAIN(ng)%Southern_Edge(tile)) THEN
            ...
          END IF
          IF (DOMAIN(ng)%Northern_Edge(tile)) THEN
            ...
          END IF
    # endif

instead of

    # if !defined EW_PERIODIC && !defined COMPOSED_GRID
          IF (Istr.eq.1)THEN
            ...
          END IF
          IF (Iend.eq.Lm(ng)) THEN
            ...
          END IF
    # endif
    # if !defined NS_PERIODIC && !defined COMPOSED_GRID
          IF (j.eq.0) THEN
            ...
          END IF
          IF (j.eq.Mm(ng)+1) THEN
            ...
          END IF
    # endif

when replacing the incorrect vertical mean with the more accurate barotropic component at the boundary points only. I introduced this bug a long time ago when removing redundant operations that are illegal in the adjoint algorithms. This bug only affected NS_PERIODIC applications. Recall that the I- and J-ranges are different in periodic applications. Shared-memory code is very delicate.
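The pattern used in the corrected conditionals, where each tile asks whether it touches a physical boundary instead of comparing loop indices, can be sketched as follows. This is a toy model and does not reproduce the bug or the ROMS implementation: the row-major tile numbering and the names `TileEdges`, `tile_edges` and `boundary_tiles` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileEdges:
    """Which physical boundaries a tile touches (hypothetical type)."""
    western: bool
    eastern: bool
    southern: bool
    northern: bool


def tile_edges(tile, ntile_i, ntile_j):
    """Derive edge flags from the tile's position in the partition.

    Assumes tiles are numbered row by row: tile = jtile * ntile_i + itile.
    """
    jtile, itile = divmod(tile, ntile_i)
    return TileEdges(
        western=(itile == 0),
        eastern=(itile == ntile_i - 1),
        southern=(jtile == 0),
        northern=(jtile == ntile_j - 1),
    )


def boundary_tiles(ntile_i, ntile_j, ew_periodic=False, ns_periodic=False):
    """List, for each tile, the boundaries it must update.

    A periodic direction has no physical boundary, so its edges are
    skipped, mirroring the '# if !defined ..._PERIODIC' guards.
    """
    work = {}
    for tile in range(ntile_i * ntile_j):
        edges = tile_edges(tile, ntile_i, ntile_j)
        sides = []
        if not ew_periodic:
            if edges.western:
                sides.append("west")
            if edges.eastern:
                sides.append("east")
        if not ns_periodic:
            if edges.southern:
                sides.append("south")
            if edges.northern:
                sides.append("north")
        work[tile] = sides
    return work
```

Because the flags depend only on the tile's position in the partition, the decision does not change when the loop index ranges change.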
After the above fix, I can get identical solutions with TS_MPDATA in shared-memory applications. However, there is still a bug in serial with partitions when both NS_PERIODIC and TS_MPDATA are activated together. In the past, I thought that both problems were related... Well, I will continue looking. This type of bug takes a lot of time to fix. Sometimes we are lucky and find them pretty fast. I think that this one is related to the 3 ghost points required for TS_MPDATA or, more probably, to an illegal call inside a parallel region. The latter would be even nastier because it implies moving the call to mpdata_adiff.F into a different parallel region. This is the kind of solution that was required in the past for serial with partitions bugs...
I also corrected a small bug in ana_smflux.h that appears when LMD_TEST is activated. Many thanks to Chris Edwards for reporting this bug.