Segmentation fault after 10 model years of running

Report or discuss software problems and other woes

Moderators: arango, robertson

Timh37
Posts: 15
Joined: Thu May 09, 2019 3:25 pm
Location: NIOZ

Segmentation fault after 10 model years of running

#1 Post by Timh37

Dear all,

I'm attempting to run a ROMS configuration for the North Sea with ERA-Interim atmospheric forcing and GLORYS ocean boundary conditions, covering 1993 to 2014. I've run this for just 1 year (1993) without any problems. I'm running on an HPC cluster using MPI, with NetCDF-4 and gfortran as the compiler (gfortran is the one I got to work without errors on this cluster). The cluster has 12 compute nodes (each with 96 GB of memory and 40 2.0 GHz cores with 2 threads/core). My grid has the following dimensions:

Lm == 120 ! Number of I-direction INTERIOR RHO-points
Mm == 108 ! Number of J-direction INTERIOR RHO-points
N == 30 ! Number of vertical levels


and I used:

NtileI == 10 ! I-direction partition
NtileJ == 8 ! J-direction partition


NTIMES == 140256
DT == 225.0d0
NDTFAST == 30


and the following sbatch script:

#!/bin/sh
#SBATCH --partition=normal # default "normal", if not specified
#SBATCH --time=2-00:00:00 # run time in days-hh:mm:ss
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=80 # (by default, "ntasks"="cpus")
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

mpirun -np 80 ./romsM ocean_northsea4.in > log_northsea4
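
As a sanity check on the setup above: in a distributed-memory (MPI) ROMS run, the number of processes given to mpirun has to equal NtileI x NtileJ, and here 10 x 8 = 80 does match -np 80. A minimal sketch of automating that check is below. The function names `tile_count` and `check_decomposition` are made up for illustration, and the parsing assumes the `KEY == value ! comment` layout shown above with a single value per keyword (no nested grids).

```shell
#!/bin/sh
# tile_count FILE: print NtileI * NtileJ as read from a ROMS-style input file.
# Hypothetical helper; assumes lines such as "NtileI == 10 ! comment".
tile_count() {
    awk '
        $1 == "NtileI" && $2 == "==" { ni = $3 }
        $1 == "NtileJ" && $2 == "==" { nj = $3 }
        END { print ni * nj }
    ' "$1"
}

# check_decomposition FILE NP: succeed only if NP equals the tile count.
check_decomposition() {
    tiles=$(tile_count "$1")
    if [ "$tiles" -eq "$2" ]; then
        echo "OK: $2 MPI processes for $tiles tiles"
    else
        echo "MISMATCH: $2 MPI processes but $tiles tiles" >&2
        return 1
    fi
}
```

In the sbatch script this could guard the launch, for example `check_decomposition ocean_northsea4.in 80 && mpirun -np 80 ./romsM ocean_northsea4.in > log_northsea4`.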


After this run, I increased the number of model years (raising NTIMES to 2932276) and kept all other settings the same. The model ran successfully for the first ~11 years (~1993-2004), but then crashed with the error message below:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7fc7cddde27f in ???
#1 0x7fc7bca7b660 in ???
#2 0x7fc7ccb8626b in ???
#3 0x7fc7cef06514 in ???
#4 0x7fc7cef5bcef in ???
#5 0x7fc7cef5c0b3 in ???
#6 0x7fc7b6fa485a in ???
#7 0x7fc7cef201b5 in ???
#8 0x7fc7cf211634 in ???
#9 0x42b028 in ???
#10 0x671561 in ???
#11 0x61a0ac in ???
#12 0x60f258 in ???
#13 0x60e389 in ???
#14 0x578d2b in ???
#15 0x4bfb56 in ???
#16 0x46c78c in ???
#17 0x403fb5 in ???
#18 0x403bfb in ???
#19 0x40381c in ???
#20 0x7fc7cddca3d4 in ???
#21 0x40386b in ???
#22 0xffffffffffffffff in ???
--------------------------------------------------------------------------
mpirun noticed that process rank 37 with PID 31000 on node no72 exited on signal 11 (Segmentation fault).
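
The `???` frames mean the backtrace could not be symbolized. If the executable is rebuilt with debug symbols (`-g`), the low addresses (here `0x42b028` down to `0x40381c`, most likely inside the executable itself, while the `0x7fc7...` ones belong to shared libraries) can be translated with binutils `addr2line`. A rough sketch, assuming the backtrace was saved to a file and the binary is `./romsM`; the helper names `exe_frames` and `resolve_frames` are hypothetical.

```shell
#!/bin/sh
# exe_frames FILE: print backtrace addresses outside the 0x7f... shared-library
# range (and skipping the 0xffff... sentinel), i.e. likely in the executable.
exe_frames() {
    awk '/^#[0-9]+ +0x[0-9a-f]+ in / && $2 !~ /^0x7f/ && $2 !~ /^0xffff/ { print $2 }' "$1"
}

# resolve_frames FILE EXE: map those addresses to function name and file:line.
# Only informative if EXE was compiled with -g; otherwise addr2line prints ??.
resolve_frames() {
    for addr in $(exe_frames "$1"); do
        addr2line -f -e "$2" "$addr"
    done
}
```

Calling `resolve_frames` with the job's `.err` file and `./romsM` would then list one function and source line per frame, provided the symbols are there.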


As a first step, I tried rerunning the experiment from the restart file, starting just before the model time of the crash. That run is in progress and has already passed the time of the crash, so the problem does not seem to be tied to this specific date. What else could be the problem? Is it related to using too much memory? I would very much appreciate your help.

Kind regards,
Tim

ulimit -a gives:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 381854
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Segmentation fault after 10 model years of running

#2 Post by kate

It's possible that there's a memory leak, so that the program grows over time until it hits some limit. I just ran something in debug mode and got a *ton* of these messages:


==6434==ERROR: LeakSanitizer: detected memory leaks 

Direct leak of 34816 byte(s) in 8 object(s) allocated from: 
    #0 0x7f1d004860fa in __interceptor_malloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:62
    #1 0x7f1cfcbd6f7c in mca_btl_openib_endpoint_connect_eager_rdma (/usr/local/pkg/mpi/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libmpi.so.12+0xf7f7c)


=================================================================
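
One way to test the leak hypothesis, sketched here under the assumption of a Linux compute node, is to log the resident memory (VmRSS) of one romsM rank at intervals and see whether it keeps growing over the run. The helper names `rss_kb` and `watch_rss` are made up for this example.

```shell
#!/bin/sh
# rss_kb PID: print the resident set size of PID in kB.
# Reads /proc on Linux; falls back to ps where /proc is not available.
rss_kb() {
    if [ -r "/proc/$1/status" ]; then
        awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
    else
        ps -o rss= -p "$1" | tr -d ' '
    fi
}

# watch_rss PID INTERVAL: print "epoch_seconds rss_kb" every INTERVAL seconds
# for as long as PID is alive.
watch_rss() {
    while kill -0 "$1" 2>/dev/null; do
        echo "$(date +%s) $(rss_kb "$1")"
        sleep "$2"
    done
}
```

Started in the background on the compute node, e.g. `watch_rss "$(pgrep -n romsM)" 600 > rss.log &`, this gives a time series: steady growth would point towards a leak, a flat line away from it.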

Timh37
Posts: 15
Joined: Thu May 09, 2019 3:25 pm
Location: NIOZ

Re: Segmentation fault after 10 model years of running

#3 Post by Timh37

kate wrote: It's possible that there's a memory leak, so that the program grows over time until it hits some limit.

Dear Kate,

Thanks for your reply. When I ran in debug mode earlier, I got similar messages from LeakSanitizer, although that run completed successfully over a shorter time period.

==212741==ERROR: LeakSanitizer: detected memory leaks
==212722==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2048 byte(s) in 1 object(s) allocated from:
==212720==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2048 byte(s) in 1 object(s) allocated from:

Direct leak of 2048 byte(s) in 1 object(s) allocated from:
#0 0x7ff61da05ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#0 0x7fc46e017ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#0 0x7f364c9c6ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#1 0x7ff610949fe0 (<unknown module>)


Is there a way to fix this myself, or am I forced to use restarts?

Kind regards,
Tim

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Segmentation fault after 10 model years of running

#4 Post by kate

I don't know how to fix this. I have always used restarts because I operate on a supercomputer with queue length restrictions. You might want to be using PERFECT_RESTART.
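
For splitting the 1993 to 2014 run into yearly restart segments, the number of steps per calendar year follows from the time step: with DT = 225 s there are 86400 / 225 = 384 steps per day, so 140160 steps in a 365-day year and 140544 in a leap year (the 140256 used above corresponds to 365.25 days). A small sketch of that arithmetic follows. The function names are illustrative only, and how these counts translate into NTIMES on a restart is something to check against your ROMS version; the sketch makes no assumption about it.

```shell
#!/bin/sh
# is_leap YEAR: succeed if YEAR is a Gregorian leap year.
is_leap() {
    [ $(( $1 % 4 )) -eq 0 ] && { [ $(( $1 % 100 )) -ne 0 ] || [ $(( $1 % 400 )) -eq 0 ]; }
}

# steps_in_year YEAR DT: number of time steps of DT seconds in YEAR.
# DT must be a whole number of seconds that divides 86400 evenly.
steps_in_year() {
    if is_leap "$1"; then days=366; else days=365; fi
    echo $(( days * 86400 / $2 ))
}

# Print the step count for each year of the 1993-2014 run with DT = 225 s.
year=1993
while [ "$year" -le 2014 ]; do
    echo "$year $(steps_in_year "$year" 225)"
    year=$(( year + 1 ))
done
```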

Timh37
Posts: 15
Joined: Thu May 09, 2019 3:25 pm
Location: NIOZ

Re: Segmentation fault after 10 model years of running

#5 Post by Timh37

Hi,

I've started splitting my runs into years and running them with restarts. However, I occasionally still get similar errors at runtime, which is really annoying. Debugging with -fcheck=all gives me a series of unhelpful error messages:

Direct leak of 2048 byte(s) in 1 object(s) allocated from:
#0 0x7fd8dc8a4ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#1 0x7fd8cf849fe0 (<unknown module>)


I've also tried to set my stacksize to unlimited using ulimit -S -s unlimited, but to no avail. Any other ideas?
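
A limit raised with `ulimit` in a login shell does not necessarily carry over to the processes a batch job starts on a compute node, so it may be worth printing the limit from inside the job itself. A minimal sketch, assuming an sh-compatible shell on the node; the helper name `report_stack_limit` is hypothetical.

```shell
#!/bin/sh
# report_stack_limit: print the soft and hard stack limits (in kB, or
# "unlimited") as seen by the shell that runs this, plus the node name.
report_stack_limit() {
    echo "host=$(uname -n) soft_kb=$(ulimit -S -s) hard_kb=$(ulimit -H -s)"
}

report_stack_limit
```

Placed just before the mpirun line of the sbatch script, this shows what the job shell actually has. To see what each rank inherits, something like `mpirun -np 80 sh -c 'ulimit -s'` could be used as well.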

Timh37
Posts: 15
Joined: Thu May 09, 2019 3:25 pm
Location: NIOZ

Re: Segmentation fault after 10 model years of running

#6 Post by Timh37

Dear all,

I would like to come back to this problem, for which I still haven't found a solution. As mentioned before, I do not get any meaningful error message, so I find myself unable to solve it.

Any help is greatly appreciated!
