Ocean Modeling Discussion

ROMS/TOMS

PostPosted: Tue Aug 24, 2010 11:42 am 

Joined: Thu Apr 16, 2009 1:21 am
Posts: 7
Location: Instituto Espanol de Oceanografia
Hello,

I have always run the ROMS model in parallel with OpenMP, but I recently switched to MPI. To my surprise, depending on the number of processors and the domain decomposition, I get blowups and/or different results. With OpenMP I had been using a 4x4 decomposition. If I use the same decomposition with MPI, I get a blowup. 5x4 seems to work, but if I use more processors I get blowups again. The only decomposition that worked with more than 20 processors (5x4) is 1x35.

When it blows up, it doesn't matter if I change the time step or the VISC2/VISC4/TNU2/TNU4 parameters; it keeps blowing up.

With the different configurations there is normally no difference in the mean kinetic energy (up to the point where it blows up), as you can see in the attached figure (the blue and green lines overlap). But with some configurations using 40 processors (both 5x8 and 1x40) the mean kinetic energy increases, owing to a strong current near the southern boundary that you can see in the plot. The 1x35 configuration worked; 1x40 did not.

The only parameters that changed across configurations were NtileI and NtileJ; everything else stayed the same (boundary conditions, forcings, time step, theta_b, theta_s, grid, ...). I am using ROMS version 3.3. The only things I did to use MPI were to set "USE_MPI ?= on" and unset "USE_OpenMP ?=" in the makefile, and to set the compiler "FC := mpiifort" in "Compilers/Linux-ifort.mk". With this setup ROMS compiled and ran with the 5x4 decomposition, giving the same results as with OpenMP.
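
For reference, a minimal sketch of how the decomposition maps to the MPI process count when launching a run (oceanM and ocean.in are the usual ROMS names for the MPI executable and input script and may differ in a given setup):

    # In ocean.in the tile decomposition must match the MPI process count:
    #   NtileI == 5
    #   NtileJ == 4
    # NtileI * NtileJ must equal the -np value passed to mpirun.
    mpirun -np 20 ./oceanM ocean.in > roms_5x4.log 2>&1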

Am I missing something? Am I doing anything wrong? Has anybody had issues like this? Any comment would be a great help.

Many thanks,

Marcos Cobas Garcia

Centro de Supercomputacion de Galicia


Attachments:
MeanKineticEnergy.MPI.png: Mean kinetic energy for several domain decompositions.
ubar.MPI35strips.png: 1x35 decomposition, no strong current at the southern boundary.
ubar.MPI40strips.png: 1x40 decomposition, note the strong current at the southern boundary.
PostPosted: Tue Aug 24, 2010 5:28 pm 

Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3667
Location: IMS/UAF, USA
Rather than running it until it blows up, I would run for, say, 5-10 steps, saving a history record every time step. Do this for a good case and a bad case, then run ncdiff on the two outputs. The correct answer is all zeroes, but clearly there is a bug somewhere.

What are your values of Lm and Mm? It's possible that we've only really tested the case where Lm/NtileI and Mm/NtileJ are integers.
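
A minimal sketch of that comparison with the NCO tools (the file names are placeholders; setting NHIS = 1 in ocean.in writes one history record per time step):

    # Run the same short simulation with a "good" tiling (5x4) and a "bad"
    # one (4x4), writing a history record every time step (NHIS == 1).
    mpirun -np 20 ./oceanM ocean_5x4.in > log_5x4 2>&1
    mpirun -np 16 ./oceanM ocean_4x4.in > log_4x4 2>&1

    # Difference the two history files; every field should be identically zero.
    ncdiff roms_his_5x4.nc roms_his_4x4.nc his_diff.nc

    # Print one field from the difference file (or browse his_diff.nc in ncview).
    ncks -H -v temp his_diff.nc | head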


PostPosted: Wed Aug 25, 2010 9:20 am 

Joined: Thu Apr 16, 2009 1:21 am
Posts: 7
Location: Instituto Espanol de Oceanografia
Hello Kate and everybody,

Thanks for your fast reply, Kate.

So far I had only compared the solutions visually, and they looked equal to me. But today I took the two configurations with 35 processors (5x7 and 1x35) and compared the temperature at every layer. The difference is significant in the top layers from the very beginning, as you can see in the attached figures. The differences in the bottom layers are much smaller until about 2 days before the blowup (the output is saved roughly every 12 hours).

The values are Lm = 258 and Mm = 508. I picked some bad numbers: 508 is only divisible by 2, 4, 127 and 254. If I had chosen 500, or even 507, I would have more options. 258 is only divisible by 2, 3, 6, and much bigger numbers. I will try a 6x4 decomposition, which gives integer values of Lm/NtileI and Mm/NtileJ, as you suggest.
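
A quick shell loop (just a sketch, using the Lm and Mm values above) shows which tile counts divide the grid evenly:

    # List the tile counts from 1 to 40 that divide Lm = 258 and Mm = 508 exactly.
    Lm=258; Mm=508
    for n in $(seq 1 40); do
      [ $((Lm % n)) -eq 0 ] && echo "NtileI = $n divides Lm = $Lm"
      [ $((Mm % n)) -eq 0 ] && echo "NtileJ = $n divides Mm = $Mm"
    done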

Thanks a lot,

Marcos Cobas Garcia

Centro de Supercomputacion de Galicia


Attachments:
TemperatureDifferences10TopLayers.png
TemperatureDifferences20BottomLayers.png
PostPosted: Wed Aug 25, 2010 1:50 pm 

Joined: Wed Dec 31, 2003 6:16 pm
Posts: 795
Location: USGS, USA
One approach is to #undef all the extra options like UV_VIS2, the tracer mixing, tides, etc. Run the model for just a few time steps and save every time step. Then slowly turn these things back on and compare solutions. One note of caution: running the model with the mixing coefficients set to zero may not be the same as #undef-ing that mixing and rebuilding. One would hope so, but I suggest that you undef these things and slowly add them back in.
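
One rough way to script that elimination test (a sketch only; the header name, the CPP option, and the build command are placeholders for whatever your application actually uses):

    # Turn one CPP option off in the application header, rebuild, and rerun a
    # short test; repeat for each suspect option, then re-enable them one by one.
    HEADER=my_application.h     # placeholder: your ROMS application header
    OPTION=UV_VIS2              # placeholder: the option to disable this round

    sed -i "s/^#define ${OPTION}/#undef ${OPTION}/" "$HEADER"
    ./build.sh                  # or rebuild however you normally compile ROMS
    mpirun -np 20 ./oceanM ocean.in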


PostPosted: Thu Aug 26, 2010 10:26 am 

Joined: Thu Apr 16, 2009 1:21 am
Posts: 7
Location: Instituto Espanol de Oceanografia
Hello,

Thanks for the replies.

The run with the 6x4 domain decomposition didn't work either. Besides, I noticed there are differences between the OpenMP 4x4 and MPI 1x35 runs: the maximum temperature difference is over 2.5 degrees. The differences in temperature, salinity, zeta, u and v are concentrated mainly near the shelf break and show a similar pattern.

I will try as jcwarner suggests, and keep posting.

Thanks a lot,

Marcos Cobas

Centro de Supercomputacion de Galicia


Attachments:
SalinityDifferences.OMP-MPI1x35.TopLayerLastSnapshot.png: Salinity differences between OpenMP 4x4 and MPI 1x35.
TemperatureDifferences.OMP-MPI1x35.TopLayerLastSnapshot.png: Temperature differences between OpenMP 4x4 and MPI 1x35.
PostPosted: Wed Sep 15, 2010 6:10 pm 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1081
Location: IMCS, Rutgers University
I don't know what the problem is here. It looks like a partition-dependent parallel bug. However, your choice of partitions is not optimal. I also think that this application is not large enough to justify that many processors; you will be penalized by excessive MPI communication. There is always an optimal number of partitions per application, as I have mentioned many times in this forum. The fact that you have so many processors available does not mean that you need to use all of them to run your application. I bet that your application will be more efficient with fewer processors.

My experience with distributed memory is that I get the same behavior with MPICH1, MPICH2, and OpenMPI. In some cases you can get identical solutions, depending on the compiler, compiler flags, and MPI libraries. We always use MPI libraries compiled with the same version of the compiler as ROMS. Notice that in the distributed-memory exchanges ROMS uses the lower-level MPI communication routines. Nothing fancy.

As far as I know, the ROMS code, as distributed, is free of distributed-memory (MPI) parallel bugs. When a parallel bug is found in an application, it is usually associated with the user's customization of the code. The MPI paradigm is the easier one; shared-memory and serial with partitions are more difficult in ROMS. I sometimes forget this when coding, since I am so used to MPI nowadays. All the adjoint-based algorithms only work in MPI.

Now, OpenMP is a shared-memory protocol. I just fixed a parallel bug today for the biharmonic stress tensor in shared-memory and serial with partitions; see the ticket and update your code. This will be a problem for you if you are using shared memory (OpenMP).

I always recommend that users use the build.sh or build.bash script instead of modifying the makefile.

When configuring a large application, it is always a good idea to choose grid dimensions that are multiples of 2 or 3. I actually use powers of 2, which are best. This allows many choices for tile partitioning and for balancing the tiles for efficiency. You cannot select the number of grid points capriciously in the parallel computing world.
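
As an illustration, a small sketch that enumerates the evenly balanced tilings for a given grid (the dimensions here are just an example of a power-of-2 grid, not this application's):

    # For a grid with Lm x Mm interior points, list the NtileI x NtileJ
    # combinations (powers of 2 up to 8) that divide both dimensions exactly.
    Lm=256; Mm=512              # example power-of-2 grid, as suggested above
    for i in 1 2 4 8; do
      for j in 1 2 4 8; do
        if [ $((Lm % i)) -eq 0 ] && [ $((Mm % j)) -eq 0 ]; then
          echo "NtileI=$i NtileJ=$j -> tiles of $((Lm / i)) x $((Mm / j)) points"
        fi
      done
    done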


PostPosted: Mon Oct 04, 2010 11:40 am 

Joined: Sun Dec 07, 2008 7:57 pm
Posts: 1
Location: MeteoGalicia
Hi Marcos

We now have exactly the same problem; our ROMS version is 511.


Did you find a solution?


Attachments:
FIG1.png
FIG2.png
FIG3.png