Pair and MPI version has problem with regridding

Bug reports, work arounds and fixes

Moderators: arango, robertson

jivica
Posts: 169
Joined: Mon May 05, 2003 2:41 pm
Location: The University of Western Australia, Perth, Australia

Pair and MPI version has problem with regridding

#1 Post by jivica »

I am having a problem with an application that uses Pair on the atmospheric model's native grid, and it shows up *ONLY* in the MPI-parallel version of the latest code.
The serial version regrids and runs OK, but the parallel version still has the problem.

Trying to nail that down, I dug in and think I've found a bug in ./Modules/mod_forces.F at line 509.
It should read as follows, since PairG holds 2 time records:
# ifndef ANA_PAIR
allocate ( FORCES(ng) % PairG(LBi:UBi,LBj:UBj,2) )
Dmem(ng)=Dmem(ng)+2.0_r8*size2d
# endif
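
For context, the third dimension holds the two snapshots that bracket the current model time; a minimal sketch of the time interpolation they feed (w1, w2, t1, t2 are illustrative names, not the actual ROMS variables):

! Sketch only: PairG(:,:,1:2) stores the two bracketing snapshots
! read from the forcing file; the field applied at the model time
! is the linear blend of the two records.
w2=(time-t1)/(t2-t1)
w1=1.0_r8-w2
Pair(i,j)=w1*PairG(i,j,1)+w2*PairG(i,j,2)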


This still does not fix the parallel tile problem with Pair :(

Cheers
Ivica

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Pair and MPI version has problem with regridding

#2 Post by arango »

Yes, Dmem is a diagnostic quantity used to estimate the memory requirements of an application. It has nothing to do with the numerical kernel.
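
For what it is worth, here is a sketch of that accounting (assuming size2d is computed as in mod_forces.F); the kernel never dereferences any of it:

size2d=REAL((UBi-LBi+1)*(UBj-LBj+1),r8)   ! points in one 2-D tile array
Dmem(ng)=Dmem(ng)+2.0_r8*size2d           ! two time records of PairG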

It is on my TODO list to look at your problem in the debugger. The issue you describe sounds like a parallel bug. However, the regrid subroutine is generic for all variables, so it cannot interpolate correctly for every other forcing variable and fail only for Pair. It doesn't make sense.

The problem must be somewhere else. The fact that it happens when you use the new option PRESS_COMPENSATE tells me that a parallel exchange is missing for Pair. Notice that the pressure is averaged at U- and V-points in u2dbc_im.F and v2dbc_im.F. I need to look at what is going on in set_2dfield.F when Pair is time-interpolated from snapshots. The MPI exchange is always done at the bottom of that subroutine.
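
For reference, the averaging I mean has this stencil (a sketch, not the exact code; Pavg is an illustrative name):

! Pair averaged to a U-point; on a tile's western edge the i-1
! neighbor lives in the halo, so a missing exchange corrupts this term:
Pavg=0.5_r8*(Pair(i-1,j)+Pair(i,j))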

jivica
Posts: 169
Joined: Mon May 05, 2003 2:41 pm
Location: The University of Western Australia, Perth, Australia

Re: Pair and MPI version has problem with regridding

#3 Post by jivica »

Hernan,

I know you are quite busy and that this bug is on your TODO list;
this post was mainly to make others aware of the problem in the latest version of the ROMS code (if they use MPI and Pair as I do).

Adding the boundary pressure correction option PRESS_COMPENSATE (on top of ATM_PRESS) doesn't change anything.

I am confused by the regridding as well: sustr/svstr are regridded OK, and only Pair has the problem (!).
It smells like a wrong memory allocation. I don't have TotalView, so I am stuck here.

For example, the serial ROMS Pair field at the first time step:

[image attachment]

The MPI version of ROMS, same Pair field at the first time step:

[image attachment]

The MPI version of ROMS, sustr field, which is OK:

[image attachment]


Thanks for your time!
Ivica

jcwarner
Posts: 1172
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: Pair and MPI version has problem with regridding

#4 Post by jcwarner »

What happens if you run MPI with one processor?
I can look at PRESS_COMPENSATE; we want to use that for our hurricane simulations.
-j

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Pair and MPI version has problem with regridding

#5 Post by arango »

I took a look in the debugger with our US East Coast application, and I cannot find anything wrong. I activated both ATM_PRESS and PRESS_COMPENSATE. I am also using BULK_FLUXES, which also needs Pair. I don't see a parallel bug.
[Attachments: Pair.png, shflux.png]
The Pair field is kind of jagged, but that is because of the coarse resolution of the NCEP dataset. I cannot reproduce your parallel problem and am clueless about what is going on in your application. Are you also activating BULK_FLUXES? What is the range of longitude in your Pair data?

jivica
Posts: 169
Joined: Mon May 05, 2003 2:41 pm
Location: The University of Western Australia, Perth, Australia

Re: Pair and MPI version has problem with regridding

#6 Post by jivica »

John,

I tried mpirun -np 1 and it works OK; it gives results identical to the serial run.

It is a Southern Hemisphere system, with NO BULK_FLUXES, only a storm surge case with surface wind stress and pressure.

Reading of the original data is OK as well (in all cases, MPI or serial), with reasonable values within range.

I will try other tile configurations, i.e., different NX * NY.

jivica
Posts: 169
Joined: Mon May 05, 2003 2:41 pm
Location: The University of Western Australia, Perth, Australia

Re: Pair and MPI version has problem with regridding

#7 Post by jivica »

It is getting even more interesting:
it works for certain tile configurations (2x2, 3x2), crashes for 6x4, and then works for 32, 36, and 48 tiles but produces the wrong pressure, as I wrote.
I recompiled ROMS with the debug option using gfortran + OpenMPI, and the bomb turns out to be in inp_par.f90, where line 77 is the IF statement; the runtime complains about a logical of kind=4:

!-----------------------------------------------------------------------
! Set lower and upper bounds indices per domain partition for all
! nested grids.
!-----------------------------------------------------------------------
!
! Determine the number of ghost-points in the halo region.
!
NghostPoints=2
IF (ANY(CompositeGrid).or.ANY(RefinedGrid)) THEN
NghostPoints=MAX(3,NghostPoints)
END IF
!

The error:

inp_par.f90:77: runtime error: load of null pointer of type 'logical(kind=4)'
ASAN:DEADLYSIGNAL
=================================================================
==2257==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x564f18654538 bp 0x7ffff1babab0 sp 0x7ffff1ba7d50 T0)
==2257==The signal is caused by a READ memory access.
==2257==Hint: address points to the zero page.
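
For what it is worth, that report is the classic signature of referencing a logical array through a null or unassociated pointer. A hypothetical standalone sketch (not ROMS code) that reproduces the same sanitizer message:

program null_any
  logical, pointer :: CompositeGrid(:) => null()
  ! ANY() has to read the array; with CompositeGrid unassociated this
  ! is a load through a null pointer, which the sanitizers report as
  ! "load of null pointer of type 'logical(kind=4)'".
  IF (ANY(CompositeGrid)) PRINT *, 'nested'
end program null_any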

jivica
Posts: 169
Joined: Mon May 05, 2003 2:41 pm
Location: The University of Western Australia, Perth, Australia

Re: Pair and MPI version has problem with regridding

#8 Post by jivica »

I am not sure if I am right, but I think the problem is in the new version of regrid and the new variable "MyXout".

After compiling in debug mode (with MPI), I managed to trap it for a 6x6 tile configuration (4x4 and 8x8 work?!):

At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 335

Error termination. Backtrace:
At line 155 of file regrid.f90
Fortran runtime error: Index '171' of dimension 1 of array 'myxout' above upper bound of 170

Error termination. Backtrace:
At line 155 of file regrid.f90At line 155 of file regrid.f90
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 167

Error termination. Backtrace:
At line 155 of file regrid.f90At line 155 of file regrid.f90At line 155 of file regrid.f90At line 155 of file regrid.f90
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 671
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 503

Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 319

Error termination. Backtrace:

Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 399

Error termination. Backtrace:
At line 155 of file regrid.f90#0 0x7ff967e11d1d in ???
#1 0x7ff967e12825 in ???
#2 0x7ff967e12bca in ???
#0 0x7fe6e95fcd1d in ???

Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 399

Error termination. Backtrace:
#3 0x55f7829eb993 in regrid_
at /home/ivica/NORTH_TC/Build/regrid.f90:155
#4 0x55f7829cc6df in __nf_fread2d_mod_MOD_nf_fread2d
at /home/ivica/NORTH_TC/Build/nf_fread2d.f90:309
#5 0x55f782670b2d in get_2dfld_
at /home/ivica/NORTH_TC/Build/get_2dfld.f90:227
#6 0x55f7823095da in get_data_
at /home/ivica/NORTH_TC/Build/get_data.f90:95
#7 0x55f782230117 in initial_
at /home/ivica/NORTH_TC/Build/initial.f90:229
#8 0x55f781e11ee2 in __ocean_control_mod_MOD_roms_initialize
at /home/ivica/NORTH_TC/Build/ocean_control.f90:133
#9 0x55f781e0e43d in ocean
at /home/ivica/NORTH_TC/Build/master.f90:95
#10 0x55f781e0eab2 in main
at /home/ivica/NORTH_TC/Build/master.f90:50


and so on....


Error termination. Backtrace:

Line 155 of regrid.f90 is MyXout(i,j)=Xout(i,j):
DO j=Jmin,Jmax
DO i=Imin,Imax
MyXout(i,j)=Xout(i,j) ! range [-180 180]
END DO
END DO
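
If I read the bound messages right, they are consistent with MyXout being dimensioned with one tile's bounds while the copy loop runs over a different range. A hypothetical illustration using the values from the messages above:

! If MyXout were declared as, say, MyXout(335:504,LBj:UBj) on an
! interior tile, but Imin came out as 1 (the data-grid origin),
! the very first iteration would trip the bound check:
DO j=Jmin,Jmax
  DO i=Imin,Imax
    MyXout(i,j)=Xout(i,j)     ! i=1 < lower bound 335 -> runtime error
  END DO
END DO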



If you want, I can put the example on my server so you can grab it.

Thanks!

Ivica

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Pair and MPI version has problem with regridding

#9 Post by arango »

It doesn't make sense to me. MyXout is a tiled state variable; it is allocated like the others, and the pointer is passed correctly. That is the only way this can be done. I bet the problem is not in regrid. It seems like memory corruption somewhere else.

Yes, you can put the application somewhere for me to access. I don't know what I can do beyond compiling with the strict flags in ifort and gfortran. I cannot debug with that many processors. We need to put print statements for Imin, Imax, Jmin, Jmax, LBi, UBi, LBj, and UBj to check what gets corrupted with so many processes.
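
A minimal sketch of such prints (one line per rank; MyRank assumed available from the parallel module):

      WRITE (*,'(a,i4,8(2x,a,i6))') 'rank', MyRank,                     &
     &      'Imin', Imin, 'Imax', Imax, 'Jmin', Jmin, 'Jmax', Jmax,     &
     &      'LBi', LBi, 'UBi', UBi, 'LBj', LBj, 'UBj', UBj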

jivica
Posts: 169
Joined: Mon May 05, 2003 2:41 pm
Location: The University of Western Australia, Perth, Australia

Re: Pair and MPI version has problem with regridding

#10 Post by jivica »

I've sent you the link privately by email.

Ivica

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Pair and MPI version has problem with regridding

#11 Post by arango »

I updated the code to correct the bug in regrid.F. Check trac ticket src:ticket:808 for more details. The parallel bug has been corrected. Good luck.
