Shared-memory parallel bug with TS_MPDATA and NS_PERIODIC

#1 Post by arango » Wed Sep 22, 2010 9:25 pm

There is a parallel bug in shared-memory and serial-with-partitions runs when both TS_MPDATA and NS_PERIODIC are activated. I have been looking for this bug for a while and have not been able to find and fix it. It is a very weird bug: it shows up in all of the N-S periodic test cases distributed with ROMS, and only in those. It behaves differently in each test case, and each one gives us a very different clue, which makes it even harder to track. The only common feature is that the bug does not appear right away but only after several timesteps, which makes it much harder to find. I have never seen a parallel bug of this kind before, so this is new territory for me. It is really weird to have a parallel bug associated with round-off; in my experience, parallel bugs always appear right away. All the clues point to mpdata_adiff.F, but I am starting to suspect that the problem is somewhere else. I have done the backward parallel-dependency analysis for the private and global arrays in this routine several times and rewritten the index ranges several times, and I still get the parallel bug.
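
For readers unfamiliar with what a backward parallel-dependency analysis guards against, here is a minimal, generic Python sketch. It is not ROMS code and not a diagnosis of this bug: `step_tile` and `run` are hypothetical names, and a simple 1-D periodic diffusion stencil stands in for the MPDATA operators. Tiles are processed one after another, as in a serial-with-partitions run. If each tile reads its halo from a shared array that earlier tiles have already overwritten, the result depends on the partition, while a single tile, or reading from a copy of the old time level, gives identical answers. With periodic wrap-around the last tile's halo reads the first tile's points, so the first and last tiles are neighbors even though they are far apart in the tile loop. Unlike the bug described above, the discrepancy in this toy appears on the very first step.

```python
import math


def step_tile(src, j0, j1, k):
    """Return updated values for points j0..j1-1 of a periodic 1-D field.

    The tile reads a one-point halo on each side (j0-1 and j1); indices wrap
    modulo n, so the first and last tiles are neighbors (hypothetical helper).
    """
    n = len(src)
    # Private working copy of the tile plus its halo, taken from `src` now.
    work = [src[j % n] for j in range(j0 - 1, j1 + 1)]
    # Explicit diffusion stencil standing in for the real advection operator.
    return [work[i] + k * (work[i + 1] - 2.0 * work[i] + work[i - 1])
            for i in range(1, len(work) - 1)]


def run(q0, ntiles, k=0.25, nsteps=10, buffered=True):
    """Advance the stencil nsteps times, processing tiles one after another."""
    q = list(q0)
    n = len(q)
    bounds = [t * n // ntiles for t in range(ntiles + 1)]
    for _ in range(nsteps):
        # buffered=True: every tile reads the old time level (safe).
        # buffered=False: tiles read the shared array that earlier tiles have
        # already overwritten (a read-after-write hazard across tile halos).
        src = list(q) if buffered else q
        for t in range(ntiles):
            j0, j1 = bounds[t], bounds[t + 1]
            q[j0:j1] = step_tile(src, j0, j1, k)
    return q


n = 32
q0 = [math.sin(2 * math.pi * j / n) + 0.1 * math.cos(6 * math.pi * j / n)
      for j in range(n)]
ref = run(q0, ntiles=1)
print(run(q0, ntiles=4) == ref)                   # True
print(run(q0, ntiles=1, buffered=False) == ref)   # True
print(run(q0, ntiles=4, buffered=False) == ref)   # False
```

The buffered version is bitwise identical for any number of tiles because every point is computed from the same operands in the same order; the unbuffered version only agrees with the reference when there is a single tile.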

TS_MPDATA still works fine with any other type of boundary condition. If you are using TS_MPDATA with NS_PERIODIC, you need to run either serially with no partitions or with distributed memory (MPI). We will continue hunting for this elusive parallel bug :shock:

By the way, TS_MPDATA with EW_PERIODIC works fine for any partition in shared-memory, distributed-memory, and serial-with-partitions runs.