Ocean Modeling Discussion

ROMS/TOMS


All times are UTC

Posted: Tue Feb 15, 2011 3:33 pm
Quote:
Compiler flags are not optimal: -heap-arrays ---> -no-heap-arrays, but you have to adjust the stacksize limit.
Instruction set: -msse2 --> -xSSE4.1, because your processor is a Xeon 5400-series, not a Pentium4 "Northwood"
or an Opteron 248.
OK. On the HP I have tried the new compiler options:
Quote:
Operating system : Linux
CPU/hardware : x86_64
Compiler system : ifort
Compiler command : /opt/intel/composerxe-2011.0.084/bin/intel64/ifort
Compiler flags : -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xSSE4.1 -free
and it runs until it has to write the avg file. It stops without any explanation or complaint just BEFORE the output
Quote:
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000030
WRT_AVG - wrote averaged fields into time record = 0000001
WRT_RST - wrote re-start fields (Index=1,1) into time record = 0000001
but it had written to the his and rst files previously. The his file contains 30 records and the avg file contains none.

I have tried limiting the stack size to 16 MB and also setting it to unlimited. In case the problem was with netcdf, I recompiled the netcdf libraries following the Intel instructions, replacing -xT with -xSSE4.1:
Quote:
$ export CC=icc
$ export CXX=icpc
$ export CFLAGS='-O3 -xSSE4.1 -ip -no-prec-div -static'
$ export CXXFLAGS='-O3 -xSSE4.1 -ip -no-prec-div -static'

$ export F77=ifort
$ export FC=ifort
$ export F90=ifort
$ export FFLAGS='-O3 -xSSE4.1 -ip -no-prec-div -static'

$ export CPP='icc -E'
$ export CXXCPP='icpc -E'


It always stops at the same point but does not give any reason, and it writes nothing else to the output file. I attach the redirected output file log3.txt.

With "Compiler flags : -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xSSE4.1 -free" it runs fine as usual. I recompiled with the -heap-arrays option and restarted the run from the last saved rst; log3_rst.txt is the output of that run, which is still going.
Sorry, I have no experience with this. What am I doing wrong? Thanks for your help.


Attachments:
log3_rst.txt [104.36 KiB]
log3.txt [1.34 MiB]

Posted: Tue Feb 15, 2011 7:16 pm
Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3667
Location: IMS/UAF, USA
If you redirected the standard output, did you also redirect the standard error? It might have written something there, say from the netcdf library.

Did the restarted run write to the avg file? I'm guessing no. For debugging, you can ask it to only average the first few steps, but you still need a way to see what is going wrong.
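
For reference, a minimal shell sketch of capturing both streams. The function `fake_model` is a hypothetical stand-in for the real model executable and input file, which are not named in this thread; it only prints one line to each stream so that the effect of the redirection can be seen.

```shell
#!/bin/sh
# Hypothetical stand-in for the model run: one line to stdout, one to stderr.
fake_model() {
    echo "WRT_HIS - wrote history fields"
    echo "some error message" >&2
}

# Send standard output and standard error to separate files ...
fake_model > run.out 2> run.err

# ... or merge both streams into a single log (2>&1 must come after "> run.log").
fake_model > run.log 2>&1

echo "stderr captured: $(cat run.err)"
```

With the real executable in place of `fake_model`, anything a library prints to standard error should end up in run.err rather than being lost.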


Posted: Tue Feb 15, 2011 8:53 pm
No, I did not redirect the standard error! I forgot. Of course I will, tomorrow. Thanks.
The restarted run with the -heap-arrays option does write to the averages file. I still have to check it, but I know from previous restarts after an interruption that the first record of averaged values after a restart seems to be incorrect and unrelated to its assigned time.
The restarted run with the -no-heap-arrays option stops at the very same point and does not write to the avg file.


Posted: Thu Feb 17, 2011 9:42 am
Standard error says "Segmentation fault".
I asked the program to write to the avg file every day instead of every month, and it stops just before writing to the avg file after calculating the first day.

I checked the run previously restarted with the -heap-arrays option, and it writes to the avg file at day 30 (30th Jan) (360 days/year climatological run). In the nc file the time is given as the 15th of Feb!? OK, this is another thing I don't understand, but it is not the point here.

Thanks


Posted: Thu Feb 17, 2011 5:47 pm
Joined: Wed Jul 02, 2003 5:29 pm
Posts: 3667
Location: IMS/UAF, USA
Try again with -O2 instead of -O3 and see if that runs. Compiler bugs are often in the optimizer phase of the compile. You want to find the fastest options that give the correct answer. You can always report compiler bugs if you have a current license, but they'll want a short (<100 lines) program that demonstrates the error. This can be unbelievably hard to obtain, probably not worth the trouble.


Posted: Fri Feb 18, 2011 11:47 am
Thanks, Kate.
-O2 stops in the same way. It only runs fine with -heap-arrays.
The problem seems to be somewhere inside set_avg_tile. I modified set_avg.F:
Code:
       write(*,*)'entra en set_avg_tile',ng,tile
      CALL set_avg_tile (ng, tile,                                      &
     &                   LBi, UBi, LBj, UBj,                            &
     &                   IminS, ImaxS, JminS, JmaxS,                    &
# ifdef SOLVE3D
     &                   NOUT,                                          &
# endif
     &                   KOUT)
      write(*,*)'sale de set_avg_tile',ng,tile

Now I ask the program to write avgs every 10 time steps. I attach the output "log6.txt".
If I try to do the same inside set_avg_tile, for instance printing iic(ng) or just saying "hello", it stops when it reaches the write statement.

The compiler complains about some things, and I don't know whether they are important. Just in case:
Quote:
WS> build.sh -j > build.log
makefile:241: INCLUDING FILE /home/balbin/make_macros.mk WHICH CONTAINS APPLICATION-DEPENDENT MAKE DEFINITIONS
makefile:237: INCLUDING FILE /media/Data/projects/medsea/build/make_macros.mk WHICH CONTAINS APPLICATION-DEPENDENT MAKE DEFINITIONS
set_weights.f90(199): remark #8290: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+3'.
30 FORMAT (/,1x,'ndtfast, nfast = ',2i4,3x,'nfast/ndtfast = ',f7.5)
------------------------------------------------------------------^
ar: creating /media/Data/projects/medsea/build/libNLM_bio.a
ar: creating /media/Data/projects/medsea/build/libNLM_sed.a
ar: creating /media/Data/projects/medsea/build/libMODS.a
ar: creating /media/Data/projects/medsea/build/libANA.a
ar: creating /media/Data/projects/medsea/build/libNLM.a
ar: creating /media/Data/projects/medsea/build/libUTIL.a
ifort: command line remark #10010: option '-Vaxlib' is deprecated and will be removed in a future release. See '-help deprecated'
I also include build.log


Attachments:
log6.txt [46.31 KiB]
build.log [215.14 KiB]

Posted: Fri Feb 18, 2011 9:59 pm
Joined: Fri Nov 14, 2003 4:57 pm
Posts: 185
It appears that you are looking at a very basic segmentation fault, which may be
associated either with ROMS itself or with the particular netCDF version you are
using. set_avg_tile is a long routine. Most likely the breaking point occurs
inside the
Code:
!  Convert accumulated sums into time-averages, if appropriate.
!-----------------------------------------------------------------------
!
      IF ((iic(ng).gt.ntsAVG(ng)).and.                                  &
     &    (MOD(iic(ng)-1,nAVG(ng)).eq.0) .....


part of the code, starting at line 2170 and ending at line 2888, since this
segment of the code is executed only when MOD(iic-1,nAVG)==0, which is
the final stage of averaging. Still, too much code to pinpoint the problem easily.

1. Is there any way to recompile and run the code with the compiler flags
extended by
Code:
-g -check all

or
Code:
-g -check arg_temp_created,bounds,pointers,uninit,format,output_conversion

while keeping -openmp -fpp -no-heap-arrays -xSSE4.1 -free in place (i.e., run
it in parallel, with the proper instruction set, and using stack instead of heap)?

This may pinpoint the breaking point.

Suppress flag -O3 (with ifort, -g switches the default optimization to -O0 only when no -O option is given explicitly, so -O3 has to be removed);

Suppress flag -ip (interprocedural analysis -- I observed at least once that this
flag caused a problem because of a compiler bug, but that was associated with arithmetic
precision and not with a memory issue, so here it is probably irrelevant).

2. What version of netCDF do you use and how did you compile it? [General
advice here is to use either netcdf-3.6.3 (the final release of the version 3 generation)
or netcdf-4.1.1, but to stay away from anything in between, i.e., 4.0.x should be
avoided. Also, 4.1.1 can be compiled with or without HDF support. Do you use it?]


Posted: Mon Feb 21, 2011 5:57 pm
Q.2:
I used netcdf v 3.6.2 compiled following intel recipe for linux http://software.intel.com/en-us/article ... compilers/
and including a line, "#include <cstring>", at ncvalues.cpp and sfc_pres_temp_rd.cpp as explained at http://www.unidata.ucar.edu/support/hel ... 09331.html. But still there was a warning:
Quote:
netcdf.cpp(1267): warning #68: integer conversion resulted in a change of sign
t[5] = -1;
^

netcdf.cpp(1270): warning #68: integer conversion resulted in a change of sign
if (t[j] == -1) {
^
I did not care but this morning I fixed it by copying netcdf.cpp from version 4.1.1 (maybe not a good idea). I got the same error running roms.

After that I compiled v 3.6.3 following intel instructions and there was no need to modify anything. log.1,2,3,and 4 are the outputs of ./configure, make, make check and sudo make install respectively. err.2 and 3 are the related error messages.
My program runs exactly the same way as yesterday. It stops when reaching set_avg_tile.

Q.1:
I compiled with -g -check all and no -ip
I included a line into set_avg_tile
Code:
# include "set_bounds.h"

            print*,'hola'
!
!-----------------------------------------------------------------------
!  Return if time-averaging window is zero.
!-----------------------------------------------------------------------
!
      IF (nAVG(ng).eq.0) RETURN


And it stops as soon as it gets there.
The output shows some errors related to TIME_REF = -1 (the output format seems to complain when writing day -1) and to the reading of the T/F flags for writing output fields. I include the complete output log8.txt.

Thanks again


Attachments:
log8.txt [40.71 KiB]
log.1.txt [12.7 KiB]
log.2.txt [27.72 KiB]
log.3.txt [54.54 KiB]
log.4.txt [20.28 KiB]
err.2.txt [480 Bytes]
err.3.txt [854 Bytes]

Posted: Tue Feb 22, 2011 2:25 am
Joined: Fri Nov 14, 2003 4:57 pm
Posts: 185
Now it looks like your problem is different from before (when the code was compiled
without -g -extra_flags): the code now terminates immediately when it attempts to call
set_avg for the very first time, not when finalizing the averaging.

Based on the fact that the word 'hola' never gets printed in log8.txt,
Code:
            print*,'hola'
....
      IF (nAVG(ng).eq.0) RETURN

it appears that the segmentation fault occurs at the very moment when set_avg_tile is called
by its driver, CALL set_avg_tile (ng, tile,....), starting at line 60 of set_avg.F. This means that
some of the arguments of the routine being called are not valid pointers: either an allocatable array
was not properly allocated (this is unlikely because, after all, the code runs with the -heap-arrays
compiler flag), or the compiler decided to create a temporary copy-in/copy-out array for
one of the arguments and there is not enough space on the stack to allocate it, if it must go on the stack
(i.e., when the code is compiled with the -no-heap-arrays flag).

To verify that this is the case, place another print*,'hola 1' statement just before the
CALL set_avg_tile (ng, tile,... line, and see whether this message shows up while the one
from the inside does not.

What about undefining the CPP switch AVERAGES completely? Would it still terminate?
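
If stack space is the suspect, two separate limits may be worth looking at. The sketch below is only an illustration under assumptions and has not been tried on this case: it prints the shell's stack limit and sets OMP_STACKSIZE, the standard OpenMP environment variable for the stack of each thread created by the OpenMP runtime, which is set independently of ulimit -s. The value 512M is an arbitrary example, not a recommendation.

```shell
#!/bin/sh
# Stack limit of the shell; the initial thread of a program started from
# this shell inherits it. The value is in KiB, or the word "unlimited".
echo "shell stack limit: $(ulimit -s)"

# Stack size for threads created by the OpenMP runtime, independent of
# ulimit -s. 512M is an arbitrary illustrative value.
OMP_STACKSIZE=512M
export OMP_STACKSIZE

echo "OMP_STACKSIZE=$OMP_STACKSIZE"
```

Both settings have to be made in the same shell (or job script) that launches the model, since they are inherited by the child process.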


Posted: Tue Feb 22, 2011 2:55 pm
I am checking all these things under the UPWELLING case.
I reproduce the problem using 8 threads and:
Code:
! C-preprocessing Flag.

    MyAppCPP = UPWELLING

! Input variable information file name.  This file needs to be processed
! first so all information arrays can be initialized properly.

     VARNAME = /home/balbin/roms/ROMS/External/varinfo.dat

! Grid dimension parameters. See notes below in the Glossary for how to set
! these parameters correctly.
          Lm == 160           ! Number of I-direction INTERIOR RHO-points
          Mm == 320           ! Number of J-direction INTERIOR RHO-points
           N == 30            ! Number of vertical levels
...
      NtileI == 1                               ! I-direction partition
      NtileJ == 16                               ! J-direction partition
I also played with the number of threads and NtileJ. With the original 41x80x16 grid the code runs fine.
If I undef AVERAGES the code runs fine.
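
A rough size comparison of the two grids may help here. The sketch below assumes 8-byte reals and counts interior points only (ghost points are ignored), so the numbers are approximate; the helper name `array_bytes` is hypothetical. It estimates how much memory one full 3D array would need if a copy of it had to fit on the stack.

```shell
#!/bin/sh
# Bytes needed by one 3D array of 8-byte reals with Lm x Mm x N interior points.
array_bytes() {
    echo $(( $1 * $2 * $3 * 8 ))
}

small=$(array_bytes 41 80 16)     # original UPWELLING grid
large=$(array_bytes 160 320 30)   # grid that reproduces the crash

echo "41x80x16:   $small bytes ($(( small / 1024 )) KiB)"
echo "160x320x30: $large bytes ($(( large / 1024 )) KiB)"
```

Under these assumptions a single 3D array is about 0.4 MiB on the small grid and about 11.7 MiB on the large one, so a stack limit that is comfortable for the first could be too small for the second.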

1.- Checking how it reaches set_avg_tile with -g -extra_flags
Code:
       print*,'going into set_avg_tile',tile
      CALL set_avg_tile (ng, tile,                                      &
     &                   LBi, UBi, LBj, UBj,                            &
     &                   IminS, ImaxS, JminS, JmaxS,                    &
# ifdef SOLVE3D
     &                   NOUT,                                          &
# endif
     &                   KOUT)
       print*,'out of set_avg_tile',tile
It goes into and out of set_avg_tile until it has to compute mean values. See log1.txt


2.- Checking inside set_avg_tile with -g -extra_flags
Code:
!-----------------------------------------------------------------------
!  Return if time-averaging window is zero.
!-----------------------------------------------------------------------
!
      print*,'hola'
      IF (nAVG(ng).eq.0) RETURN
!
!-----------------------------------------------------------------------
!  Compute vorticity fields.
!-----------------------------------------------------------------------
Then it does not get as far as printing anything. See log2.txt


Attachments:
log1.txt [37.93 KiB]
log2.txt [33.72 KiB]

Posted: Tue Feb 22, 2011 5:22 pm
My MacBook Pro reproduces the UPWELLING problem

MacBookPro6.2, Intel Core i7, 2.66GHz, 2x2GB DDR3@1067 MHz

netcdf library version "3.6.3" of Feb 22 2011 16:31:32
MacBook> uname -a
Darwin Mac-Book-Pro-de-Rosa-Balbin.local 10.6.0 Darwin Kernel Version 10.6.0: Wed Nov 10 18:11:58 PST 2010; root:xnu-1504.9.26~3/RELEASE_X86_64 x86_64
MacBook> ifort -v
Version 12.0.2

See attached output log3.txt

I will check the MacPro and let you know.


Attachments:
log3.txt [37.62 KiB]

Posted: Wed Feb 23, 2011 11:43 am
Yes, the MacPro also reproduces the UPWELLING problem.
Playing around with compilation options, it sometimes says "Illegal instruction" instead of "Segmentation fault".

Should I try to recompile netcdf with different options?

