Ocean Modeling Discussion

ROMS/TOMS

Post new topic Reply to topic  [ 5 posts ] 

All times are UTC

PostPosted: Fri Oct 22, 2010 5:55 pm 

Joined: Fri Apr 30, 2004 6:43 pm
Posts: 9
Location: PMEL, USA
We had been having severe issues with ROMS (plus biology) on a new 384-core AMD cluster, using Intel Fortran:
/opt/intel/Compiler/11.1/069/bin/intel64/ifort

When we used the standard options in Linux-ifort.mk:
FFLAGS = -heap-arrays -fp-model precise

we were getting random hangups of the execution (the job ceased to progress at some random point, but didn't die, either). This happened with both small (5-layer) and large (60-layer) runs of the same code.

If we eliminated those two options in Linux-ifort.mk:
FFLAGS =

then the small (5-layer) version of the model ran just fine (and blazingly fast) on 192 processors, but the large (60-layer) version of the model yielded a segmentation fault during initialization:

> -----------------------------------------
> NLM: GET_STATE - Read state initial conditions, t = 36166 12:00:00
> (File: feastIC_99_31dyes_full.nc, Rec=0001, Index=1)
> - free-surface
> (Min = -9.12851274E-01 Max = 8.12097967E-01)
> - vertically integrated u-momentum component
> (Min = -6.04284108E-01 Max = 4.89444733E-01)
> - vertically integrated v-momentum component
> (Min = -5.01668155E-01 Max = 6.81821764E-01)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 50 with PID 23298 on node compute-0-3 exited on signal 11 (Segmentation fault).
> -----------------------------------------

We looked around on the ROMS and other web pages and, as suggested there, tried to increase the stack size before submitting the job:
> -------------------------
> ulimit -s unlimited
> mpirun -np 192 --hostfile myhosts oceanM roms.in
> ---------------------------

but unfortunately this yielded the same result (seg fault) as before.
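
In hindsight, a quick way to see why this didn't help is to ask every node what its limit actually is; the ulimit in the submission shell only affects the head node. A diagnostic sketch, assuming the same mpirun/hostfile setup as above:

```shell
# Print the soft stack limit as seen on each node; remote nodes may
# still report the old (small) default even after "ulimit -s unlimited"
# on the head node.
mpirun -np 192 --hostfile myhosts \
    sh -c 'echo "$(hostname): stack limit = $(ulimit -s)"'
```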

It turns out the "ulimit -s unlimited" command is the right sort of fix, but ONLY if it is applied on ALL the nodes (not just the head node). On some systems (like ours) this requires a bit of superuser manipulation. A web search turned up the tricks needed to make it the default configuration on our system; for example, see this thread:

https://lists.sdsc.edu/pipermail/npaci- ... 24157.html

Once our vendor had changed things globally, so that all nodes used "stacksize unlimited" as the default setting, we were able to run without the "-heap-arrays" option, with no complaints. The "-fp-model precise" option was restored as well, without complaint.

The model (~200x200x60, with 15 bio tracers) now scales well out to at least 192 processors.

We hope this is useful to other large-memory users who may be experiencing similar problems...

-Al Hermann and Kerim Aydin


PostPosted: Fri Oct 22, 2010 6:31 pm 
Site Admin

Joined: Wed Feb 26, 2003 4:41 pm
Posts: 1079
Location: IMCS, Rutgers University
Thank you for the info. We set the -fp-model precise option to get standard IEEE floating-point operations in any application. Otherwise, we would not be able to check for parallel partition bugs: without this option, ifort gives different results with the same executable and the same partition, and it becomes completely impossible for us to check parallel bugs with different partitions by comparing NetCDF files byte by byte. I was shocked about this :evil: It turns out that one way for ifort to accelerate computations is to compromise (approximate) floating-point operations. The most annoying aspect of this is the randomness of the round-off: you get different results each time. I don't know if this is a compiler bug. We have updated to newer compiler versions and I haven't checked this problem again.
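
The byte-by-byte check can be as simple as the following sketch (the two history file names, for runs of the same case with different tile partitions, are hypothetical):

```shell
# Compare history files from a 2x2 and a 4x4 partition of the same run;
# with '-fp-model precise' they should be bit-for-bit identical.
if cmp -s ocean_his_2x2.nc ocean_his_4x4.nc; then
    echo "identical: partitions agree bit for bit"
else
    echo "DIFFER: possible parallel partition bug"
fi
```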

It seems to me that your grid is too small for 192 partitions. I have a hard time believing that this scales very well, given the excessive communication between tiles. Your grid is square, and I have mentioned several times the danger of square grids: there is no way to check for transposed array dimensions. Also, 200 is a multiple of 2, 4, 5, 8, 10, 20, 25, 40, and 50. None of these gives you balanced tile partitions with 192 nodes.
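
The balance argument can be checked mechanically; a sketch for a hypothetical 200x200 horizontal grid split into 192 tiles:

```shell
# For each NtileI x NtileJ factorization of 192, test whether both
# factors divide 200; only then does every tile get the same size.
for i in 1 2 3 4 6 8 12 16 24 32 48 64 96 192; do
    j=$((192 / i))
    if [ $((200 % i)) -eq 0 ] && [ $((200 % j)) -eq 0 ]; then
        echo "$i x $j: even tiles of $((200 / i)) x $((200 / j))"
    else
        echo "$i x $j: uneven tiles"
    fi
done
# Every factorization comes out uneven: gcd(192, 200) = 8, but the
# cofactor 192/8 = 24 does not divide 200.
```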

I have run this type of application, with the same memory requirements or even larger, on 8-16 nodes and never ran into memory problems. Our adjoint-based applications use several copies of the state for four different models running simultaneously. It seems to me that your cluster has restricted memory, or the problem is somewhere else. This does not make sense to me.


PostPosted: Fri Oct 22, 2010 6:57 pm 

Joined: Fri Apr 30, 2004 6:43 pm
Posts: 9
Location: PMEL, USA
Hi Hernan,

Just to clarify, the precise dimensions of our grid are 182x258x60. It seems our problem was more with the stack vs. the heap, irrespective of the fp-model issue. Apparently our cluster did have a restricted-memory problem (a restricted stack size), which the new settings have overcome. Not a ROMS problem per se, although other ROMS users may run into this hardware issue.

-Al


PostPosted: Sat Aug 25, 2018 1:13 am 

Joined: Fri Dec 15, 2017 5:58 pm
Posts: 2
Location: Nanjing university
hermann wrote:
Hi Hernan,

Just to clarify, the precise dimensions of our grid are 182x258x60. It seems our problem was more with the stack vs. the heap, irrespective of the fp-model issue. Apparently our cluster did have a restricted-memory problem (a restricted stack size), which the new settings have overcome. Not a ROMS problem per se, although other ROMS users may run into this hardware issue.

-Al

Hi,
we have run into the same problem :(


PostPosted: Sat Aug 25, 2018 1:27 am 

Joined: Fri Dec 15, 2017 5:58 pm
Posts: 2
Location: Nanjing university
arango wrote:
Thank you for the info. We set the -fp-model precise option to get standard IEEE floating-point operations in any application. Otherwise, we would not be able to check for parallel partition bugs: without this option, ifort gives different results with the same executable and the same partition, and it becomes completely impossible for us to check parallel bugs with different partitions by comparing NetCDF files byte by byte. I was shocked about this :evil: It turns out that one way for ifort to accelerate computations is to compromise (approximate) floating-point operations. The most annoying aspect of this is the randomness of the round-off: you get different results each time. I don't know if this is a compiler bug. We have updated to newer compiler versions and I haven't checked this problem again.

It seems to me that your grid is too small for 192 partitions. I have a hard time believing that this scales very well, given the excessive communication between tiles. Your grid is square, and I have mentioned several times the danger of square grids: there is no way to check for transposed array dimensions. Also, 200 is a multiple of 2, 4, 5, 8, 10, 20, 25, 40, and 50. None of these gives you balanced tile partitions with 192 nodes.

I have run this type of application, with the same memory requirements or even larger, on 8-16 nodes and never ran into memory problems. Our adjoint-based applications use several copies of the state for four different models running simultaneously. It seems to me that your cluster has restricted memory, or the problem is somewhere else. This does not make sense to me.


Dear arango,

Thanks for your answer. I'm a graduate student from China and a beginner with ROMS. I have run into the same problem; my grid is 117*137*32, with 2 partitions. I have tried several of the suggestions on myroms.org, but none of them had any effect. What should I try next?

Thank you!

