Ocean Modeling Discussion

ROMS/TOMS


All times are UTC

PostPosted: Fri Mar 08, 2019 4:49 am 

Joined: Mon May 05, 2003 2:41 pm
Posts: 122
Location: The University of Western Australia, Perth, Australia
I am having a problem with an application where I use Pair on the atmospheric model's native grid, and *ONLY* the MPI-parallel version of the latest code fails.
The serial version regrids and runs OK, but the parallel version still has the problem.

Trying to nail that down, by digging I think I've found a bug in ./Modules/mod_forces.F at line 509. It should read as follows, since PairG holds two time snapshots:

# ifndef ANA_PAIR
      allocate ( FORCES(ng) % PairG(LBi:UBi,LBj:UBj,2) )
      Dmem(ng)=Dmem(ng)+2.0_r8*size2d
# endif
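For context, the third dimension of size 2 holds the two bracketing time records that the forcing code interpolates between. A minimal Python sketch of that two-snapshot linear interpolation (names here are hypothetical, for illustration only, not ROMS code):

```python
import numpy as np

def time_interp(snapshots, t0, t1, t):
    """Linearly interpolate between two stored forcing snapshots
    (as in PairG, which keeps two time levels) to model time t."""
    w = (t - t0) / (t1 - t0)                 # interpolation weight in [0, 1]
    return (1.0 - w) * snapshots[0] + w * snapshots[1]

# Two 3x3 surface-pressure records (Pa), e.g. hourly snapshots
pair_g = np.stack([np.full((3, 3), 101000.0),
                   np.full((3, 3), 101200.0)])
pair_now = time_interp(pair_g, t0=0.0, t1=3600.0, t=1800.0)
```

Halfway between the snapshots this yields 101100.0 everywhere, which is why the allocation must carry both time levels.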


This still does not fix the parallel tile problem with Pair, though :(

Cheers
Ivica


PostPosted: Fri Mar 08, 2019 5:16 am 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1081
Location: IMCS, Rutgers University
Yes, Dmem is a diagnostic quantity used to estimate the memory requirements of an application. It has nothing to do with the numerical kernel.

It is on my TODO list to look at your problem in the debugger. The issue you describe sounds like a parallel bug. However, the regrid subroutine is generic for all variables, so it should be impossible for it to interpolate all the other forcing variables correctly and fail only for Pair. It doesn't make sense.

The problem must be somewhere else. The fact that it happens when you use the new PRESS_COMPENSATE option tells me that a parallel exchange may be missing for Pair. Notice that the pressure is averaged at U- and V-points in u2dbc_im.F and v2dbc_im.F. I need to look at what is going on in set_2dfield.F when Pair is time-interpolated from snapshots. The MPI exchange is always done at the bottom of the subroutine.
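To illustrate why a missing exchange shows up only near tile boundaries: averaging to U-points needs one neighbor value, and in the halo (ghost) region that value must be copied from the adjacent tile. A small Python sketch (1-D, two tiles, one ghost point; purely illustrative, not ROMS code):

```python
import numpy as np

# Serial reference: a field on 8 points, averaged to "U-points"
p = np.arange(8, dtype=float)
serial_avg = 0.5 * (p[:-1] + p[1:])

# Split into two tiles, each with one ghost cell at the interior edge
left = np.empty(5);  left[:4]  = p[:4]    # 4 owned points + 1 ghost
right = np.empty(5); right[1:] = p[4:]    # 1 ghost + 4 owned points

# The halo exchange: copy the neighbor's edge value into the ghost cell.
# Skip these two lines and the ghost cells hold garbage, corrupting the
# averages along the tile seam -- exactly a "parallel bug" signature.
left[4]  = right[1]                       # first owned point of right tile
right[0] = left[3]                        # last owned point of left tile

left_avg  = 0.5 * (left[:-1] + left[1:])
right_avg = 0.5 * (right[:-1] + right[1:])
parallel_avg = np.concatenate([left_avg, right_avg[1:]])
```

With the exchange in place, `parallel_avg` matches `serial_avg` exactly; without it, only the values at the seam differ, matching the symptom of a correct serial run and a corrupted MPI run.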


PostPosted: Fri Mar 08, 2019 7:09 am 

Joined: Mon May 05, 2003 2:41 pm
Posts: 122
Location: The University of Western Australia, Perth, Australia
Hernan,

I know you are quite busy and that this bug is on your TODO list;
this post was more to make others aware of the problem in the latest version of the ROMS code (if they use MPI and Pair as I do).

Adding the boundary pressure-correction option PRESS_COMPENSATE (in addition to ATM_PRESS) doesn't change anything.

I am confused by the regridding as well: sustr/svstr are regridded OK, and only Pair has the problem (!).
It smells like a wrong memory allocation? I don't have TotalView, so I am stuck here.

For example, the serial ROMS Pair field at the first time step:

[image]

The MPI version of ROMS, same Pair field at the first time step:

[image]

The MPI version of ROMS, sustr field, which is OK:

[image]

Thanks for your time !
Ivica


PostPosted: Fri Mar 08, 2019 2:23 pm 

Joined: Wed Dec 31, 2003 6:16 pm
Posts: 788
Location: USGS, USA
What happens if you run MPI with one processor?
I can look at PRESS_COMPENSATE; we want to use that for our hurricane simulations.
-j


PostPosted: Fri Mar 08, 2019 9:04 pm 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1081
Location: IMCS, Rutgers University
I took a look in the debugger with our US East Coast application and I cannot find anything wrong. I activated both ATM_PRESS and PRESS_COMPENSATE. I am also using BULK_FLUXES, which also needs Pair. I don't see a parallel bug.

Attachment: Pair.png

Attachment: shflux.png


The Pair field is kind of jagged, but that is due to the coarse resolution of the NCEP dataset. I cannot reproduce your parallel problem, and I am clueless about what is going on in your application. Are you also activating BULK_FLUXES? What is the longitude range of your Pair data?


PostPosted: Sat Mar 09, 2019 5:09 am 

Joined: Mon May 05, 2003 2:41 pm
Posts: 122
Location: The University of Western Australia, Perth, Australia
John,

I tried mpirun -np 1 and it works OK; it gives results identical to the serial run.

It is a Southern Hemisphere system, with NO BULK_FLUXES, only a storm-surge case with surface wind stress and pressure.

Reading of the original data is OK as well (in all cases, MPI or serial), with reasonable values within range.

I will try other configurations, i.e. different NX * NY tilings.


PostPosted: Sat Mar 09, 2019 6:20 am 

Joined: Mon May 05, 2003 2:41 pm
Posts: 122
Location: The University of Western Australia, Perth, Australia
It is getting even more interesting:
it works for certain tile configurations (2x2, 3x2), crashes for 6x4, and then works for 32, 36, and 48 tiles, but with the wrong pressure as I wrote.
I recompiled ROMS with the debug option using gfortran + OpenMPI, and the bomb turns out to be in inp_par.f90, where line 77 is the IF statement complaining about a logical of kind=4:

!-----------------------------------------------------------------------
!  Set lower and upper bounds indices per domain partition for all
!  nested grids.
!-----------------------------------------------------------------------
!
!  Determine the number of ghost-points in the halo region.
!
      NghostPoints=2
      IF (ANY(CompositeGrid).or.ANY(RefinedGrid)) THEN
        NghostPoints=MAX(3,NghostPoints)
      END IF
!

The error:

inp_par.f90:77: runtime error: load of null pointer of type 'logical(kind=4)'
ASAN:DEADLYSIGNAL
=================================================================
==2257==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x564f18654538 bp 0x7ffff1babab0 sp 0x7ffff1ba7d50 T0)
==2257==The signal is caused by a READ memory access.
==2257==Hint: address points to the zero page.
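For what it is worth, a load through a null pointer of logical(kind=4) inside that IF would be consistent with CompositeGrid/RefinedGrid being referenced before they are allocated on that rank; the rule itself is trivial, as this Python transliteration of the quoted Fortran shows (illustration only, not ROMS code):

```python
def ghost_points(composite_grid, refined_grid):
    """Mirror of the quoted logic: 2 halo points by default,
    widened to 3 when any grid is composite or refined (nesting)."""
    n = 2
    if any(composite_grid) or any(refined_grid):
        n = max(3, n)
    return n

# No nesting: the default halo width of 2
default_halo = ghost_points([False], [False])
# With a refined grid: the wider halo nesting needs
nested_halo = ghost_points([False], [True])
```

The crash, then, says nothing about the rule itself; it says those logical arrays were not in a valid state when ANY() touched them.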


PostPosted: Sat Mar 09, 2019 12:09 pm 

Joined: Mon May 05, 2003 2:41 pm
Posts: 122
Location: The University of Western Australia, Perth, Australia
I am not sure if I am right, but I think the problem is in the new version of regrid and the new variable MyXout.

After compiling in debug mode (with MPI) I managed to trap it with a 6x6 tile configuration (4x4 and 8x8 work?!):

At line 155 of file regrid.f90   (repeated across several MPI ranks)
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 335
Fortran runtime error: Index '171' of dimension 1 of array 'myxout' above upper bound of 170
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 167
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 671
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 503
Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 319
Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 399

Error termination. Backtrace:
#3  0x55f7829eb993 in regrid_
        at /home/ivica/NORTH_TC/Build/regrid.f90:155
#4  0x55f7829cc6df in __nf_fread2d_mod_MOD_nf_fread2d
        at /home/ivica/NORTH_TC/Build/nf_fread2d.f90:309
#5  0x55f782670b2d in get_2dfld_
        at /home/ivica/NORTH_TC/Build/get_2dfld.f90:227
#6  0x55f7823095da in get_data_
        at /home/ivica/NORTH_TC/Build/get_data.f90:95
#7  0x55f782230117 in initial_
        at /home/ivica/NORTH_TC/Build/initial.f90:229
#8  0x55f781e11ee2 in __ocean_control_mod_MOD_roms_initialize
        at /home/ivica/NORTH_TC/Build/ocean_control.f90:133
#9  0x55f781e0e43d in ocean
        at /home/ivica/NORTH_TC/Build/master.f90:95
#10 0x55f781e0eab2 in main
        at /home/ivica/NORTH_TC/Build/master.f90:50

and so on for the other ranks.

Line 155 of regrid.f90 is MyXout(i,j)=Xout(i,j):

      DO j=Jmin,Jmax
        DO i=Imin,Imax
          MyXout(i,j)=Xout(i,j)              ! range [-180 180]
        END DO
      END DO
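Given the inline comment about the [-180 180] range, one way bounds like these can blow up is a longitude-convention mismatch between the model tile and the forcing file: the search for source-grid columns covering the tile comes back empty or shifted, and indices derived from it then fall outside the array bounds. A hypothetical Python sketch of that failure mode (not the actual regrid logic):

```python
import numpy as np

def cols_covering(src_lon, lon_min, lon_max):
    """Indices of source-grid columns covering a tile's longitude range.
    If conventions differ (tile in [-180, 180], source in [0, 360]),
    the mask comes back empty, and min/max bounds computed from an
    empty result are what produce out-of-range indices downstream."""
    mask = (src_lon >= lon_min) & (src_lon <= lon_max)
    return np.flatnonzero(mask)

src = np.arange(0.0, 360.0, 2.5)          # forcing file in [0, 360)
ok  = cols_covering(src, 110.0, 120.0)    # consistent conventions
bad = cols_covering(src, -50.0, -40.0)    # tile expressed in [-180, 180]
# ok is non-empty; bad is empty
```

This is only one possible mechanism, and it would also depend on the tile layout, which could explain why some NX * NY configurations survive and others crash.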



If you want, I can put the example on my server so you can grab it.

Thanks!

Ivica


PostPosted: Sat Mar 09, 2019 5:53 pm 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1081
Location: IMCS, Rutgers University
It doesn't make sense to me. MyXout is a tiled state variable; it is allocated like the others, and the pointer is passed correctly. That is the only way this can be done. I bet the problem is not in regrid. It seems like memory corruption somewhere else.

Yes, you can put the application somewhere for me to access. I don't know what more I can do beyond compiling with the strict flags in ifort and gfortran. I cannot debug with that many processors. We need to add print statements for Imin, Imax, Jmin, Jmax, LBi, UBi, LBj, and UBj to check what gets corrupted with so many processors.


PostPosted: Sun Mar 10, 2019 4:07 am 

Joined: Mon May 05, 2003 2:41 pm
Posts: 122
Location: The University of Western Australia, Perth, Australia
I've sent you the link privately by email.

Ivica


PostPosted: Sun Mar 10, 2019 10:38 pm 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1081
Location: IMCS, Rutgers University
I updated the code to correct the bug in regrid.F. Check trac ticket src:ticket:808 for more details. The parallel bug is corrected. Good luck.


Powered by phpBB® Forum Software © phpBB Group