stack, heap, and segmentation faults on large simulations

Report or discuss software problems and other woes

Moderators: arango, robertson

hermann
Posts: 9
Joined: Fri Apr 30, 2004 6:43 pm
Location: PMEL, USA

stack, heap, and segmentation faults on large simulations

#1 Post by hermann » Fri Oct 22, 2010 5:55 pm

We had been having severe issues with ROMS (plus biology) on a new 384-core AMD cluster, using Intel Fortran:
/opt/intel/Compiler/11.1/069/bin/intel64/ifort

When we used the standard options in Linux-ifort.mk:
FFLAGS = -heap-arrays -fp-model precise

we were getting random hang-ups of the execution (the job ceased to progress at some random point, but didn't die either). This happened with both small (5-layer) and large (60-layer) runs of the same code.

If we eliminated those two options in Linux-ifort.mk:
FFLAGS =

then the small (5-layer) version of the model ran just fine (and blazingly fast) on 192 processors, but the large (60-layer) version of the model yielded a segmentation fault during initialization:

> -----------------------------------------
> NLM: GET_STATE - Read state initial conditions, t = 36166 12:00:00
> (File: feastIC_99_31dyes_full.nc, Rec=0001, Index=1)
> - free-surface
> (Min = -9.12851274E-01 Max = 8.12097967E-01)
> - vertically integrated u-momentum component
> (Min = -6.04284108E-01 Max = 4.89444733E-01)
> - vertically integrated v-momentum component
> (Min = -5.01668155E-01 Max = 6.81821764E-01)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 50 with PID 23298 on node compute-0-3 exited on signal 11 (Segmentation fault).
> -----------------------------------------

We looked around on the ROMS and other web pages and, as suggested there, tried to increase the stack size before submitting the job:
> -------------------------
> ulimit -s unlimited
> mpirun -np 192 --hostfile myhosts oceanM roms.in
> ---------------------------

but unfortunately this yielded the same result (seg fault) as before.

It turns out the "ulimit -s unlimited" command is the right sort of fix, but ONLY if it is applied on ALL the nodes (not just the head node). On some systems (like ours) this requires a bit of superuser manipulation. A web search turned up appropriate tricks for making it the default configuration on our system; for example, see this thread:

https://lists.sdsc.edu/pipermail/npaci- ... 24157.html
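Since the soft stack limit is per-process on whichever node a rank lands on, one quick way to confirm what every rank actually sees is to launch a trivial script through mpirun in place of the model. This is a sketch, not ROMS code; the script name and host file are hypothetical:

```shell
#!/bin/sh
# check_stack.sh -- hypothetical helper: print the soft stack limit as
# seen on the node where this rank actually runs, e.g.
#   mpirun -np 192 --hostfile myhosts ./check_stack.sh
# A node that still reports a small finite limit (commonly 8192 kB)
# is where the segmentation fault will originate.
echo "$(hostname): stack soft limit = $(ulimit -S -s)"
```

If the head node prints "unlimited" while compute nodes print a finite number, the limit was never propagated past the submission shell.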

Once our vendor had changed things globally, so that all nodes used "stacksize unlimited" as the default setting, we were able to run without the "-heap-arrays" option, with no complaints. The "-fp-model precise" option was restored as well, without complaint.
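For the record, one common place to make this the default on Linux clusters is /etc/security/limits.conf on every compute node. The exact mechanism varies with the distribution, the PAM configuration, and whether the MPI launcher goes through a login shell, so treat this as a sketch rather than a recipe:

```
# /etc/security/limits.conf (on every compute node)
*   soft   stack   unlimited
*   hard   stack   unlimited
```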

The model (~200x200x60 with 15 bio tracers) now scales well out to at least 192 processors.

We hope this is useful to other large memory users, who may be experiencing similar problems...

-Al Hermann and Kerim Aydin

arango
Site Admin
Posts: 1085
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University

Re: stack, heap, and segmentation faults on large simulations

#2 Post by arango » Fri Oct 22, 2010 6:31 pm

Thank you for the info. We set the -fp-model precise option to get standard IEEE floating-point behavior in any application. Otherwise we would not be able to check for parallel partition bugs: without this option, ifort gives different results with the same executable and the same partition, and it becomes completely impossible to check for parallel bugs across different partitions by comparing NetCDF files byte by byte. I was shocked by this :evil: It turns out that one way ifort accelerates computations is to compromise (approximate) floating-point operations. The more annoying aspect is the randomness of the round-off: you get different results each time. I didn't know if this was a compiler bug. We have since updated to newer compiler versions and I haven't re-checked this problem.
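The byte-by-byte check described above can be sketched as follows. The file names and run commands are hypothetical; two identical stand-in files are created here so the comparison logic is runnable as-is, whereas in practice the two files would be history files from runs with different tile partitions:

```shell
# In a real check the two files would come from, e.g.:
#   mpirun -np 4 oceanM roms_2x2.in   # -> his_2x2.nc
#   mpirun -np 8 oceanM roms_4x2.in   # -> his_4x2.nc
# Stand-in files so the logic below can be executed directly:
printf 'same bytes' > his_2x2.nc
printf 'same bytes' > his_4x2.nc

if cmp -s his_2x2.nc his_4x2.nc; then
    echo "partitions agree bit for bit"
else
    echo "partition-dependent result: suspect a tiling or fp-model issue"
fi
```

Any byte-level difference between the two outputs means either a genuine tiling bug or value-changing floating-point optimizations.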

It seems to me that your grid is too small for 192 partitions. I have a hard time believing that this scales well, given the excessive communication between tiles. Your grid is square, and I have mentioned several times the danger of square grids: there is no way to catch transposed array dimensions. Also, 200 is divisible by 2, 4, 5, 8, 10, 20, 25, 40, and 50, and none of these gives you a balanced tile partition with 192 nodes.

I have run this type of application with the same memory requirements, or even larger, on 8-16 nodes and never ran into memory problems. Our adjoint-based applications use several copies of the state for four different models running simultaneously. It seems to me that your cluster has restricted memory, or the problem is somewhere else. This does not make sense to me.

hermann
Posts: 9
Joined: Fri Apr 30, 2004 6:43 pm
Location: PMEL, USA

Re: stack, heap, and segmentation faults on large simulations

#3 Post by hermann » Fri Oct 22, 2010 6:57 pm

Hi Hernan,

Just to clarify: the precise dimensions of our grid are 182x258x60. Our problem seems to have been with the stack rather than the heap, irrespective of the fp-model issue. Apparently our cluster did have a restricted-memory problem (a restricted stack size), which the new settings have overcome. Not a ROMS problem per se, though other ROMS users may run into this hardware issue.
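For what it's worth, the balance question can be checked mechanically by enumerating the NtileI x NtileJ factorizations of 192 against the 182x258 interior grid. A quick sketch (not ROMS code; integer division only, so remainders show up as uneven edge tiles):

```shell
# Enumerate NtileI x NtileJ pairs with NtileI * NtileJ = 192 and show
# the approximate tile size for a 182 x 258 horizontal grid.
# 182 = 2*7*13 and 258 = 2*3*43 share few factors with 192 = 2^6 * 3,
# so most choices divide the domain unevenly.
np=192; Lm=182; Mm=258
i=1
while [ "$i" -le "$np" ]; do
    if [ $((np % i)) -eq 0 ]; then
        j=$((np / i))
        echo "NtileI=$i NtileJ=$j  tile ~ $((Lm / i)) x $((Mm / j)) points"
    fi
    i=$((i + 1))
done
```

For example, NtileI=8 x NtileJ=24 gives tiles of roughly 22 x 10 interior points, with the leftover rows and columns absorbed unevenly.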

-Al

cym
Posts: 2
Joined: Fri Dec 15, 2017 5:58 pm
Location: Nanjing university

Re: stack, heap, and segmentation faults on large simulation

#4 Post by cym » Sat Aug 25, 2018 1:13 am

hermann wrote:Hi Hernan,

Just to clarify: the precise dimensions of our grid are 182x258x60. Our problem seems to have been with the stack rather than the heap, irrespective of the fp-model issue. Apparently our cluster did have a restricted-memory problem (a restricted stack size), which the new settings have overcome. Not a ROMS problem per se, though other ROMS users may run into this hardware issue.

-Al
Hi,
we have hit the same problem :(

cym
Posts: 2
Joined: Fri Dec 15, 2017 5:58 pm
Location: Nanjing university

Re: stack, heap, and segmentation faults on large simulation

#5 Post by cym » Sat Aug 25, 2018 1:27 am

arango wrote:Thank you for the info. We set the -fp-model precise option to get standard IEEE floating-point behavior in any application. Otherwise we would not be able to check for parallel partition bugs: without this option, ifort gives different results with the same executable and the same partition, and it becomes completely impossible to check for parallel bugs across different partitions by comparing NetCDF files byte by byte. I was shocked by this :evil: It turns out that one way ifort accelerates computations is to compromise (approximate) floating-point operations. The more annoying aspect is the randomness of the round-off: you get different results each time. I didn't know if this was a compiler bug. We have since updated to newer compiler versions and I haven't re-checked this problem.

It seems to me that your grid is too small for 192 partitions. I have a hard time believing that this scales well, given the excessive communication between tiles. Your grid is square, and I have mentioned several times the danger of square grids: there is no way to catch transposed array dimensions. Also, 200 is divisible by 2, 4, 5, 8, 10, 20, 25, 40, and 50, and none of these gives you a balanced tile partition with 192 nodes.

I have run this type of application with the same memory requirements, or even larger, on 8-16 nodes and never ran into memory problems. Our adjoint-based applications use several copies of the state for four different models running simultaneously. It seems to me that your cluster has restricted memory, or the problem is somewhere else. This does not make sense to me.
Dear arango,

Thanks for your answer. I'm a graduate student from China and a beginner with ROMS. I have run into the same problem; my grid is 117*137*32, with 2 partitions. I have tried some of the suggestions on myroms.org, but none had any effect. What should I try next?

Thank you!
