severe (174): SIGSEGV, segmentation fault occurred. libpthre

Report or discuss software problems and other woes

Moderators: arango, robertson

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

severe (174): SIGSEGV, segmentation fault occurred. libpthre

#1 Post by flcastej

Dear all,

I know that this error (the one I am most afraid of getting) can be caused by many things, so I will try to give as much information as possible. I have been working with ROMS for a while; I am now using ROMS/TOMS version 3.7, revision 921. After the latest updates, when I try to run the model I get:

--------------------------------------------------------------------------------
Model Input Parameters: ROMS/TOMS version 3.7
Wednesday - September 19, 2018 - 5:10:14 PM
--------------------------------------------------------------------------------
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
oceanM             00000000008544E5  Unknown            Unknown     Unknown
oceanM             0000000000852107  Unknown            Unknown     Unknown
oceanM             0000000000801784  Unknown            Unknown     Unknown
oceanM             0000000000801596  Unknown            Unknown     Unknown
oceanM             00000000007B40A6  Unknown            Unknown     Unknown
oceanM             00000000007B7CA0  Unknown            Unknown     Unknown
libpthread.so.0    00007F9460F46790  Unknown            Unknown     Unknown
oceanM             000000000086E313  Unknown            Unknown     Unknown
oceanM             00000000007FCE18  Unknown            Unknown     Unknown
oceanM             000000000041A869  Unknown            Unknown     Unknown
oceanM             00000000004249A2  Unknown            Unknown     Unknown
oceanM             0000000000412CEC  Unknown            Unknown     Unknown
oceanM             000000000040BAD2  Unknown            Unknown     Unknown
oceanM             000000000040B59C  Unknown            Unknown     Unknown
oceanM             000000000040B45E  Unknown            Unknown     Unknown
libc.so.6          00007F946093DD5D  Unknown            Unknown     Unknown
oceanM             000000000040B369  Unknown            Unknown     Unknown


I am able to run the model with an older ROMS revision (e.g., ROMS/TOMS version 3.7, SVN revision 836M). I have verified with "ncdump -k" that all the input files are NetCDF-4: it returns "netCDF-4" for each of them.
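
For reference, this is roughly how I checked the file formats (a sketch only; the file names below are placeholders for my actual inputs):

Code:

# Check the on-disk format of every input file; "netCDF-4" means an
# HDF5-based file, "classic" a NetCDF-3 one.  File names are placeholders.
for f in grid.nc ini.nc bry.nc frc.nc; do
    printf '%s: ' "$f"
    ncdump -k "$f"
done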

I am using ifort to compile the code. I am sorry, but I am not able to use another compiler because I am not the admin of the system. The build script is set up as follows:

setenv USE_MPI on # distributed-memory parallelism
# setenv USE_MPIF90 on # compile with mpif90 script
#setenv which_MPI mpich # compile with MPICH library
setenv which_MPI mpich2 # compile with MPICH2 library
## setenv which_MPI openmpi # compile with OpenMPI library

#setenv USE_OpenMP on # shared-memory parallelism

setenv FORT ifort
#setenv FORT gfortran
#setenv FORT pgi

#setenv USE_DEBUG on # use Fortran debugging flags
setenv USE_LARGE on # activate 64-bit compilation
setenv USE_NETCDF4 on # compile with NetCDF-4 library
setenv USE_PARALLEL_IO on # Parallel I/O with NetCDF-4/HDF5
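
Since USE_PARALLEL_IO is on, something I still need to verify is whether the linked NetCDF-4 library was itself built with parallel I/O support (a sketch; the --has-parallel flag is assumed to exist in this nc-config version):

Code:

# Prints "yes" only if the NetCDF-4/HDF5 stack supports parallel I/O.
nc-config --has-parallel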



I get the same result running the serial executable (oceanS). When I try to activate USE_DEBUG, I get this error:

ld: cannot find -ldl


So I am not able to pinpoint exactly which file is getting me into trouble.
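
One thing I can still try (a sketch, assuming the stock Compilers/Linux-ifort.mk of a ROMS checkout) is to keep -O3 but add ifort's traceback flags, so the "Unknown" frames above resolve to routine names and line numbers:

Code:

# Append ifort traceback flags to the build configuration (path and
# build procedure assumed), then rebuild and rerun.
echo 'FFLAGS += -g -traceback' >> Compilers/Linux-ifort.mk
make clean && make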

I have been able to run the upwelling test case in parallel with MPI, so I am fairly sure that my problem is related to the NetCDF files I am using.

I would really appreciate it if you could point me to the next steps to follow.

Thanks a lot!

-Francisco


kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#2 Post by kate

I have a similar problem; I am running with gfortran for one domain and with an older version of the code for the other. Sorry I don't have a third fix.

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#3 Post by flcastej

Thanks a lot, Kate. Let's see if someone can give us a clue about what's happening. Meanwhile, I will try to use the old code as you suggested.

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#4 Post by arango

Nowadays, severe segmentation errors are usually associated with the stack size, which is used for allocating automatic arrays. They are allocated on the stack or the heap according to your choice of compiler options. I mentioned this in the last trac ticket.
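
A minimal standalone illustration of the mechanism (a sketch, not ROMS code; it assumes ifort is available and that the hard stack limit allows raising the soft one):

Code:

# An automatic array overflows a small stack; a larger stack or
# ifort's -heap-arrays option avoids the SIGSEGV.
cat > autoarr.f90 <<'EOF'
program autoarr
  implicit none
  call work(2000)               ! 2000x2000 doubles, about 32 MB
contains
  subroutine work(n)
    integer, intent(in) :: n
    real(8) :: a(n,n)           ! automatic array: on the stack by default
    a = 1.0d0
    print *, 'sum =', sum(a)
  end subroutine work
end program autoarr
EOF
ifort autoarr.f90 -o autoarr
( ulimit -s 8192; ./autoarr )        # 8 MB stack: segmentation fault
( ulimit -s unlimited; ./autoarr )   # unlimited stack: runs
ifort -heap-arrays autoarr.f90 -o autoarr
./autoarr                            # arrays on the heap: runs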

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#5 Post by flcastej

Dear Arango,

Thanks a lot for your answer. I will talk with the administrator of our HPC system to try to configure the stack size properly.

Regards,

-Francisco

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#6 Post by arango

It is very simple, as I have mentioned several times before. You just need to edit your login script and add one of the lines below:

Code:

# .cshrc, .tcshrc, etc.
limit stacksize unlimited

# .bashrc
ulimit -s unlimited

I wrote lots of information in a previous trac ticket.

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#7 Post by flcastej

Dear Arango,

I am sorry for not explaining it properly. I followed your advice from the ticket, adding:


# .cshrc, .tcshrc, etc.
limit stacksize unlimited

# .bashrc
ulimit -s unlimited


But I got the same error. The next step was to compile the model with the -heap-arrays option (I am using ifort), so I asked the administrator to do so. Although, as you point out in the ticket, it may slow down the computations, I hope it will help to detect where the problem is so that it can be solved in a better way.
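
To rule out the login-script route completely, I will also check that the limit actually reaches the MPI ranks on the compute nodes (a sketch; the mpirun invocation may differ under MPICH2 and our batch system):

Code:

# Report the stack limit seen by each rank; every line should say
# "unlimited".  Batch schedulers do not always inherit login limits.
mpirun -np 32 bash -c 'echo "$(hostname): $(ulimit -s)"' | sort -u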

Thanks a lot, I really appreciate your help.

-Francisco

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#8 Post by flcastej

Today I have been able to run the model without errors using the -heap-arrays option, but it hurts the performance a lot. Below you will find a comparison:


oceanM version 3.7 rev. 922 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 30138.171 sec
oceanM version 3.7 rev. 836 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time =  9568.668 sec


I would like to keep ROMS updated, but the performance penalty is too high. The memory configuration used was:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256669
max locked memory (kbytes, -l) 4086160
max memory size (kbytes, -m) 65536000
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


I will keep working to find the problem with my NetCDF file, which I was able to use in the older version but which gives me this error in the latest one. Any clues are really welcome.

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#9 Post by arango

Yes, your problem is the stack size per CPU, and it seems to be associated with the automatic arrays used in distributed-memory I/O operations. This is not a direct ROMS problem but a computer problem: there is not enough memory to handle the automatic arrays, which are allocated on either the stack or the heap for the scattering/gathering of I/O.

I see that you are using 32 CPUs. How big are all your grids? You said that you have two nested grids.

I updated the code today to report memory requirements. See the trac ticket for more information.

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#10 Post by flcastej

Dear Arango,

I have updated the code to the latest revision. Now I am able to run the model without the -heap-arrays option, but it still takes more time than the older revision:

oceanM version 3.7 rev. 923 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 28580.836 sec
oceanM version 3.7 rev. 922 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 30138.171 sec
oceanM version 3.7 rev. 836 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time =  9568.668 sec

The memory report shows:

Code:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

 Dynamic and Automatic memory (MB) usage for Grid 01:  332x332x10  tiling: 4x8

     tile          Dynamic        Automatic            USAGE      MPI-Buffers

        0            44.95            19.63            64.59             8.92
        1            45.33            19.63            64.97             8.92
        2            45.33            19.63            64.97             8.92
        3            45.14            19.63            64.78             8.92
        4            46.48            19.63            66.11             8.92
        5            46.90            19.63            66.53             8.92
        6            46.90            19.63            66.53             8.92
        7            46.69            19.63            66.32             8.92
        8            46.48            19.63            66.11             8.92
        9            46.90            19.63            66.53             8.92
       10            46.90            19.63            66.53             8.92
       11            46.69            19.63            66.32             8.92
       12            46.48            19.63            66.11             8.92
       13            46.90            19.63            66.53             8.92
       14            46.90            19.63            66.53             8.92
       15            46.69            19.63            66.32             8.92
       16            46.48            19.63            66.11             8.92
       17            46.90            19.63            66.53             8.92
       18            46.90            19.63            66.53             8.92
       19            46.69            19.63            66.32             8.92
       20            46.48            19.63            66.11             8.92
       21            46.90            19.63            66.53             8.92
       22            46.90            19.63            66.53             8.92
       23            46.69            19.63            66.32             8.92
       24            46.48            19.63            66.11             8.92
       25            46.90            19.63            66.53             8.92
       26            46.90            19.63            66.53             8.92
       27            46.69            19.63            66.32             8.92
       28            45.33            19.63            64.97             8.92
       29            45.72            19.63            65.36             8.92
       30            45.72            19.63            65.36             8.92
       31            45.53            19.63            65.16             8.92

      SUM          1484.82           628.28          2113.11           285.58

 Dynamic and Automatic memory (MB) usage for Grid 02:  222x189x10  tiling: 4x8

     tile          Dynamic        Automatic            USAGE      MPI-Buffers

        0            24.61             9.00            33.60             9.00
        1            24.61             9.00            33.60             9.00
        2            24.61             9.00            33.60             9.00
        3            24.48             9.00            33.48             9.00
        4            24.61             9.00            33.60             9.00
        5            24.61             9.00            33.60             9.00
        6            24.61             9.00            33.60             9.00
        7            24.48             9.00            33.48             9.00
        8            24.61             9.00            33.60             9.00
        9            24.61             9.00            33.60             9.00
       10            24.61             9.00            33.60             9.00
       11            24.48             9.00            33.48             9.00
       12            24.61             9.00            33.60             9.00
       13            24.61             9.00            33.60             9.00
       14            24.61             9.00            33.60             9.00
       15            24.48             9.00            33.48             9.00
       16            24.61             9.00            33.60             9.00
       17            24.61             9.00            33.60             9.00
       18            24.61             9.00            33.60             9.00
       19            24.48             9.00            33.48             9.00
       20            24.61             9.00            33.60             9.00
       21            24.61             9.00            33.60             9.00
       22            24.61             9.00            33.60             9.00
       23            24.48             9.00            33.48             9.00
       24            24.61             9.00            33.60             9.00
       25            24.61             9.00            33.60             9.00
       26            24.61             9.00            33.60             9.00
       27            24.48             9.00            33.48             9.00
       28            24.06             9.00            33.06             9.00
       29            24.06             9.00            33.06             9.00
       30            24.06             9.00            33.06             9.00
       31            23.94             9.00            32.94             9.00

      SUM           784.22           287.92          1072.14           287.92

    TOTAL          2269.04           916.20          3185.24           573.50

I have been reviewing old model outputs, and I realized that in the older version the -heap-arrays option was activated. Below you will find the compiler options used:

Code:

 Operating system : Linux

 CPU/hardware     : x86_64
 Compiler system  : ifort
 Compiler command : /opt/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiifort
 Compiler flags   : -heap-arrays -fp-model precise -ip -O3 -free -free -free

 SVN Root URL  : https://www.myroms.org/svn/src/trunk
 SVN Revision  : 836M
==============================================================

 Operating system : Linux
 CPU/hardware     : x86_64
 Compiler system  : ifort
 Compiler command : /opt/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiifort
 Compiler flags   : -fp-model precise -ip -O3
 MPI Communicator : 1140850688  PET size = 32

 SVN Root URL  : https://www.myroms.org/svn/src/trunk
 SVN Revision  : 923M

arango wrote: I see that you are using 32 CPUs. How big are all your grids? You said that you have two nested grids.
I used to run 1 donor grid with 3 refined grids. After reading some of your tickets explaining the importance of testing for the best core configuration, I decided to set up a test case with only 1 donor grid (Lm=332 and Mm=332) and 1 refined grid (Lm=222 and Mm=189), and to run some tests changing the number of cores used and the domain decomposition parameters. But then I started to get the segmentation fault error.

Thanks a lot for your help,

-Francisco

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#11 Post by arango

I think that you need to read the following trac ticket and choose the MPI communication options that are most efficient in the computer environment where you are running. You should check the profiling information that ROMS reports to standard output to see in which regions of the code it is slower. If -heap-arrays is faster, then use it. However, in our experience the -heap-arrays option for ifort is less efficient.
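
For example, the per-region timing summaries of two runs can be compared directly (a sketch; the log file names are placeholders and the header text may vary by revision):

Code:

# Pull the timing profile out of each run's standard-output log.
grep -A 30 'elapsed time profile' log_rev836.txt log_rev923.txt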

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#12 Post by flcastej

Dear Arango,

I really appreciate your help. I was testing the different configurations to try to speed up my runs, but then I ran into trouble with the segmentation fault.

I will talk with the administrator about analyzing the output to see in which regions of the code it is slower.

Regarding the -heap-arrays option, it is quite strange. The latest revision (922) took three times longer than the older one (836), both using -heap-arrays. After the last update (923) I am able to run the model without -heap-arrays, but I am still getting worse performance than with 836M (nearly three times slower).

Regards,

-Francisco

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#13 Post by arango

I am going to try again for the last time. Read carefully trac ticket 747. In the older versions of the code, we chose either lower- or higher-level MPI functions for the exchanges. We no longer do that in the newer versions; you need to experiment and select which options are more efficient on your computer. The computer administrator cannot help you with that. You need to select the appropriate ROMS CPP options. If you don't know what I am talking about, you need to learn a little about the distributed-memory paradigm.
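
Such options are passed through MY_CPP_FLAGS in the build script. As a sketch only (the option name below is illustrative; take the actual list from the ticket for your revision):

Code:

# Inside build.bash (csh users: build.sh), append one exchange-related
# option per experiment, rebuild, and compare otherwise identical runs.
export MY_CPP_FLAGS="${MY_CPP_FLAGS} -DCOLLECT_ALLREDUCE"
./build.bash -j 4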

flcastej
Posts: 68
Joined: Tue Nov 10, 2009 6:42 pm
Location: Technical University of Cartagena,Murcia, Spain

Re: severe (174): SIGSEGV, segmentation fault occurred. libp

#14 Post by flcastej

Hi Arango,

I am sorry for bothering you. I was posting the results in the forum just in case they could help other users, and perhaps to get some feedback. I have started the performance tests with the different configurations explained in the ticket, and I am trying to learn a little about the distributed-memory paradigm. I hope to reach the same performance with the new revision as with the older one.

Thanks a lot,

-Francisco
