Ocean Modeling Discussion

ROMS/TOMS

All times are UTC
PostPosted: Tue Jun 04, 2019 2:40 pm 

Joined: Thu May 09, 2019 3:25 pm
Posts: 4
Location: NIOZ
Dear all,

I'm attempting to run a ROMS configuration for the North Sea with ERA-Interim atmospheric forcing and GLORYS ocean boundary conditions, covering 1993 to 2014. I've run this for just one year (1993) without any problems. I'm running on an HPC cluster with MPI, using netCDF4 and gfortran (the compiler I got to work without errors on this cluster). The cluster has 12 compute nodes (each with 96 GB memory and 40 2.0 GHz cores at 2 threads/core). My grid has the following dimensions:

Lm == 120 ! Number of I-direction INTERIOR RHO-points
Mm == 108 ! Number of J-direction INTERIOR RHO-points
N == 30 ! Number of vertical levels


and I used:

NtileI == 10 ! I-direction partition
NtileJ == 8 ! J-direction partition


NTIMES == 140256
DT == 225.0d0
NDTFAST == 30
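
For reference, the run length and tile decomposition implied by these settings can be double-checked with a short script (a sketch using only the values quoted above):

```python
# Sanity-check the ROMS settings quoted above.
DT = 225.0          # baroclinic time step in seconds
NTIMES = 140256     # number of steps in the one-year test run
NtileI, NtileJ = 10, 8

seconds = NTIMES * DT
print(f"Run length: {seconds / 86400} days")     # 365.25 days, i.e. one year
print(f"MPI ranks needed: {NtileI * NtileJ}")    # 80, matching mpirun -np 80
```

NtileI * NtileJ must equal the number of MPI processes passed to mpirun, which it does here.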


and the following sbatch script:

#!/bin/sh
#SBATCH --partition=normal # default "normal", if not specified
#SBATCH --time=2-00:00:00 # run time in days-hh:mm:ss
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=80 # (by default, "ntasks"="cpus")
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

mpirun -np 80 ./romsM ocean_northsea4.in > log_northsea4


After this run, I increased the number of model years (raising NTIMES to 2932276) and kept all other settings the same. The model ran successfully for the first ~11 years (~1993-2004), but then crashed with the error message below:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7fc7cddde27f in ???
#1 0x7fc7bca7b660 in ???
#2 0x7fc7ccb8626b in ???
#3 0x7fc7cef06514 in ???
#4 0x7fc7cef5bcef in ???
#5 0x7fc7cef5c0b3 in ???
#6 0x7fc7b6fa485a in ???
#7 0x7fc7cef201b5 in ???
#8 0x7fc7cf211634 in ???
#9 0x42b028 in ???
#10 0x671561 in ???
#11 0x61a0ac in ???
#12 0x60f258 in ???
#13 0x60e389 in ???
#14 0x578d2b in ???
#15 0x4bfb56 in ???
#16 0x46c78c in ???
#17 0x403fb5 in ???
#18 0x403bfb in ???
#19 0x40381c in ???
#20 0x7fc7cddca3d4 in ???
#21 0x40386b in ???
#22 0xffffffffffffffff in ???
--------------------------------------------------------------------------
mpirun noticed that process rank 37 with PID 31000 on node no72 exited on signal 11 (Segmentation fault).


As a first step, I reran the experiment from the restart file, starting just before the model time of the crash. That run is still going and has already passed the time of the crash, so the problem does not seem to be tied to this specific date. What else could be the problem? Could it be related to using too much memory? I would very much appreciate your help.

Kind regards,
Tim

ulimit -a gives:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 381854
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
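
In case the 8 MB stack limit matters, I suppose I could raise the limits in the batch script before launching ROMS. An untested sketch (whether `ulimit -s unlimited` is permitted depends on the cluster's hard limits):

```shell
# Untested sketch: raise per-process limits before launching ROMS.
ulimit -s unlimited   # lift the 8192 kB stack cap, if the hard limit allows
ulimit -c unlimited   # enable core dumps so a segfault can be inspected
mpirun -np 80 ./romsM ocean_northsea4.in > log_northsea4
```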


PostPosted: Tue Jun 04, 2019 10:00 pm 
Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3633
Location: IMS/UAF, USA
It's possible that there's a memory leak, so that the program grows over time until it hits some limit. I just ran something in debug mode and got a *ton* of these messages:
Code:
==6434==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 34816 byte(s) in 8 object(s) allocated from:
    #0 0x7f1d004860fa in __interceptor_malloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:62
    #1 0x7f1cfcbd6f7c in mca_btl_openib_endpoint_connect_eager_rdma (/usr/local/pkg/mpi/OpenMPI/1.10.3-GCC-5.4.0-2.26/lib/libmpi.so.12+0xf7f7c)


=================================================================


PostPosted: Wed Jun 05, 2019 8:31 am 

Joined: Thu May 09, 2019 3:25 pm
Posts: 4
Location: NIOZ
kate wrote:
It's possible that there's a memory leak, so that the program grows over time until it hits some limit.

Dear Kate,

Thanks for your reply. When I ran in debug mode before, I got similar error messages from LeakSanitizer (output from several MPI ranks is interleaved below), although that run completed successfully over a shorter time period.

==212741==ERROR: LeakSanitizer: detected memory leaks
==212722==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2048 byte(s) in 1 object(s) allocated from:
==212720==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2048 byte(s) in 1 object(s) allocated from:

Direct leak of 2048 byte(s) in 1 object(s) allocated from:
#0 0x7ff61da05ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#0 0x7fc46e017ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#0 0x7f364c9c6ac8 in __interceptor_calloc ../../../../libsanitizer/asan/asan_malloc_linux.cc:70
#1 0x7ff610949fe0 (<unknown module>)


Is there a way to fix this myself, or am I forced to use restarts?

Kind regards,
Tim


PostPosted: Wed Jun 05, 2019 5:34 pm 
Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3633
Location: IMS/UAF, USA
I don't know how to fix this. I have always used restarts because I operate on a supercomputer with queue length restrictions. You might want to be using PERFECT_RESTART.
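
A rough sketch of what restart chaining can look like (illustrative only: the per-year input files are hypothetical, and each segment's ININAME/NRREC would have to point at the previous segment's restart file):

```shell
#!/bin/sh
# Illustrative sketch: run the simulation as one-year segments,
# each restarting from the previous segment's restart file.
# Assumes PERFECT_RESTART is defined and a separate, pre-edited
# input file (ININAME, NRREC, NTIMES) exists for each year.
for year in $(seq 1993 2014); do
    mpirun -np 80 ./romsM ocean_northsea_${year}.in > log_${year} || exit 1
done
```

With PERFECT_RESTART the restart files carry enough state that the chained run reproduces the uninterrupted one, so nothing is lost by splitting the job.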

