Model hangs unexpectedly

Report or discuss software problems and other woes

Moderators: arango, robertson

petalas
Posts: 12
Joined: Mon Jul 04, 2016 1:31 pm
Location: University of the Aegean, GR

Model hangs unexpectedly

#1 Unread post by petalas »

Greetings, I have the following problem:

My model setup hangs right after initialization without producing output that is usable for debugging.
The application runs on an HPC infrastructure, and I am using the GNU compiler and a serial setup so that I can use the ROMS debugger.
The messages I get from the machine (Slurm error file) and from the log file are, respectively:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2B317B308347
#1  0x2B317B30895E
#2  0x2B317BD2050F
#3  0x40A550 in inp_par_
#4  0x403A73 in __ocean_control_mod_MOD_roms_initialize
#5  0x402F98 in MAIN__ at master.f90:0
srun: error: node190: task 0: Segmentation fault
srun: Terminating job step 456240.0
and

Model Input Parameters:  ROMS/TOMS version 3.7
                          Tuesday - January 23, 2018 -  6:20:09 PM
 -----------------------------------------------------------------------------

 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

 Operating system : Linux
 CPU/hardware     : x86_64
 Compiler system  : gfortran
 Compiler command : /apps/compilers/gnu/4.9.2/bin/gfortran
 Compiler flags   : -frepack-arrays -g -fbounds-check -ffree-form -ffree-line-length-none -ffree-form -ffree-line-length-none -ffree-form -ffree-line-length-none

 SVN Root URL  : ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
 SVN Revision  : ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

 Local Root    : ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
 Header Dir    : ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
 Header file   : ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
 Analytical Dir: ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$

When I try with the Intel compiler, the log file is the same and the error file is:

forrtl: severe (408): fort: (2): Subscript #1 of the array GRIDNUMBER has value 1 which is greater than the upper bound of 0

Image              PC                Routine            Line        Source
libnetcdff.so.6    00002B5FA9DBA773  Unknown               Unknown  Unknown
oceanG             0000000002104AF2  ntimesteps_                80  ntimestep.f90
oceanG             0000000000740998  main3d_                    82  main3d.f90
oceanG             0000000000404DB8  ocean_control_mod         160  ocean_control.f90
oceanG             0000000000403320  MAIN__                     86  master.f90
oceanG             0000000000402C9E  Unknown               Unknown  Unknown
libc.so.6          00002B5FAC8FAD1D  Unknown               Unknown  Unknown
oceanG             0000000000402BA9  Unknown               Unknown  Unknown
srun: error: node204: task 0: Exited with exit code 152
srun: Terminating job step 455883.0

The parameters Ngrids, GridsInLayer and NestLayers are all set to 1 in my input file.

Facts:
1) The model runs on my local computer, but not on the HPC infrastructure.
2) Another ROMS setup I have runs on the HPC infrastructure.
3) I have tried running the model using only analytical headers for forcing and initial conditions, and closed boundaries, with no success.

From 1) and 2) I can conclude that there is no problem in the setup of my model AND there is no problem with the setup of ROMS in the HPC infrastructure.
From 3) I can conclude that it is not the input files causing the problem.

I am out of ideas and in great need of help.
Thank you.

Aminrahdarian
Posts: 7
Joined: Wed Jan 25, 2017 5:26 pm
Location: University Of Waikato

Re: Model hangs unexpectedly

#2 Unread post by Aminrahdarian »

Hi,
I think you have asked for too much memory. Is your domain size very big?

petalas
Posts: 12
Joined: Mon Jul 04, 2016 1:31 pm
Location: University of the Aegean, GR

Re: Model hangs unexpectedly

#3 Unread post by petalas »

Thank you for your answer, Aminrahdarian. My domain is indeed large (1024 x 512 x 30), but as I described in my question, I already ran this configuration without any problems on my home cluster, which has 8G of memory on each of its three nodes.

By contrast, the HPC machine I am trying to run the configuration on allocates 56G of memory on each node, so I believe memory shouldn't be a problem.
Any other suggestions?

Aminrahdarian
Posts: 7
Joined: Wed Jan 25, 2017 5:26 pm
Location: University Of Waikato

Re: Model hangs unexpectedly

#4 Unread post by Aminrahdarian »

Hi again,
I hope the link below helps (Common Causes of Segmentation Faults (Segfaults)):

https://www.nas.nasa.gov/hecc/support/k ... )_524.html

The usual remedy is to increase the stack size and re-run your program. For example, to set the stack size to unlimited, run:
For csh: unlimit stacksize
For bash: ulimit -s unlimited
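
A limit set in a login shell does not necessarily carry over into a batch job, so it can help to raise the limit inside the job script itself and print the value that results. The fragment below is only a sketch for bash: the function name raise_stack_limit is made up for illustration, and the commented launch line uses a hypothetical input file name (ocean.in) that is not taken from this thread.

```shell
#!/bin/bash
# Hypothetical job-script fragment (bash). Raise the stack limit inside the
# job, then print it so the value the model will see is recorded in the log.
raise_stack_limit() {
    # Try "unlimited" first; if the system refuses, fall back to the hard limit.
    ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)" 2>/dev/null
    echo "stack limit: $(ulimit -s)"
}

raise_stack_limit
# srun ./oceanG ocean.in   # hypothetical launch line; use your own executable and input file
```

If the printed value is not what you expected, the limit is being capped somewhere between your shell and the compute node, which is worth checking with the system administrators.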
Last edited by Aminrahdarian on Thu Jan 17, 2019 2:22 pm, edited 1 time in total.

petalas
Posts: 12
Joined: Mon Jul 04, 2016 1:31 pm
Location: University of the Aegean, GR

Re: Model hangs unexpectedly

#5 Unread post by petalas »

Thanks again for the time and input. I had already set the stack size to unlimited and it still didn't work, so this was not causing the problem.

PROBLEM SOLVED:
It turned out that I had to put all my files on the same file system.
The HPC infrastructure that I'm using has two separate file systems (a "fast" one and a "safe" one). I had all my input files and executables on the "safe" file system, and directed the model to write its output to the "fast" file system. Once I moved everything onto the same file system, the model ran.
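
For anyone who runs into the same thing, a quick way to see whether the run directory and the output directory live on different file systems is to compare what df reports for each path. This is a minimal sketch for bash, not part of ROMS: the helper names fs_of and same_fs and the variables RUN_DIR and OUT_DIR are hypothetical and only serve the illustration.

```shell
#!/bin/bash
# Print the file system (device or mount source) that holds a given path.
fs_of() {
    df -P "$1" | awk 'NR==2 { print $1 }'
}

# Return 0 if every path is on the same file system as the first one,
# 1 if any path is on a different file system, 2 if a path does not exist.
same_fs() {
    local first p
    [ -e "$1" ] || return 2
    first="$(fs_of "$1")"
    shift
    for p in "$@"; do
        [ -e "$p" ] || return 2
        [ "$(fs_of "$p")" = "$first" ] || return 1
    done
    return 0
}

# RUN_DIR and OUT_DIR are hypothetical; both default to the current directory.
if same_fs "${RUN_DIR:-.}" "${OUT_DIR:-.}"; then
    echo "same file system"
else
    echo "different file systems (or a missing path)"
fi
```

Set RUN_DIR to the directory holding the executable and input files and OUT_DIR to the directory the model writes to before running the check.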

Regards,
Stamatis
