Problem with large size initial files

General scientific issues regarding ROMS

Moderators: arango, robertson

bhatt.vihang
Posts: 11
Joined: Thu Aug 19, 2010 12:51 pm
Location: Indian Institute of Science

Problem with large size initial files

#1 Post by bhatt.vihang » Fri Jan 24, 2014 11:42 am

Hi,

I am planning to run ROMS for the Indian Ocean region at a resolution of 1/36 degree. I prepared the input files with great difficulty, as MATLAB struggles to handle a netCDF file of this size.

Here are the dimensions of the file
dimensions:
one = 1 ;
s_rho = 40 ;
time = 1 ;
eta_rho = 2268 ;
xi_rho = 3241 ;
eta_u = 2268 ;
xi_u = 3240 ;
eta_v = 2267 ;
xi_v = 3241 ;
s_w = 41 ;
I have also defined the HDF5 CPP flag to enable the HDF5/netCDF-4 format, which handles large files.

Just to give an idea, the file size of roms_ini.nc is 9.5 GB.
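A quick back-of-the-envelope check (a Python sketch; the list of state fields is an assumption) shows why a large-file format is needed for this grid:

```python
# Size of one double-precision 3D rho-point field, using the
# dimensions from the ncdump output above.
eta_rho, xi_rho, s_rho = 2268, 3241, 40

bytes_3d = eta_rho * xi_rho * s_rho * 8      # 8 bytes per float64
gib_3d = bytes_3d / 2**30

# With several 3D fields (e.g. u, v, temp, salt) plus the 2D fields,
# the total quickly exceeds the size limits of classic netCDF-3,
# hence the need for HDF5/netCDF-4.
print(f"one 3D field: {gib_3d:.2f} GiB")     # about 2.19 GiB
```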

The program exits with the following error:
NLM: GET_STATE - Read state initial conditions, t = 0 00:00:00
(Grid 01, File: roms_ini.nc, Rec=0001, Index=1)
- free-surface
(Min = -4.11881377E-01 Max = 8.11087419E-01)
- vertically integrated u-momentum component
(Min = -4.07339808E-01 Max = 5.99377847E-01)
- vertically integrated v-momentum component
(Min = -7.66673929E-01 Max = 1.03701301E+00)
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
I am compiling the model with netcdf-4.1.3 and HDF5-1.8.9.

I hope to get some suggestions to resolve this issue.

ptimko
Posts: 34
Joined: Wed Mar 02, 2011 6:46 pm
Location: University of Michigan

Re: Problem with large size initial files

#2 Post by ptimko » Sat Jan 25, 2014 1:38 am

What is the physical memory of the system you are using?

Patrick

bhatt.vihang
Posts: 11
Joined: Thu Aug 19, 2010 12:51 pm
Location: Indian Institute of Science

Re: Problem with large size initial files

#3 Post by bhatt.vihang » Sat Jan 25, 2014 10:45 am

The computer I am trying to run the code on is a relatively small cluster.

It has 176 compute nodes with 12 processors on each node.

The command qnodes gives the following information on memory and CPUs:
status = state=free,netload=578004143,gres=,ncpus=12,physmem=24659208kb,availmem=49408976kb,totmem=49825020kb,idletime=450897, #1 SMP Thu Jan 13 15:51:15 EST 2011 x86_64,opsys=linux
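For readers skimming the thread, the packed status line can be decoded with a small Python sketch (a hypothetical helper, not part of any ROMS or PBS tool):

```python
# Parse the "key=value" pairs in the qnodes status string and
# convert the kb-suffixed memory figure to GiB.
status = ("state=free,netload=578004143,gres=,ncpus=12,"
          "physmem=24659208kb,availmem=49408976kb,totmem=49825020kb")

fields = dict(item.split("=", 1) for item in status.split(",") if "=" in item)
physmem_gib = int(fields["physmem"].removesuffix("kb")) / 2**20

print(f"physical memory per node: {physmem_gib:.1f} GiB")   # about 23.5 GiB
```

So each node has roughly 24 GB of physical memory, which matters for the memory discussion that follows.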

ptimko
Posts: 34
Joined: Wed Mar 02, 2011 6:46 pm
Location: University of Michigan

Re: Problem with large size initial files

#4 Post by ptimko » Sun Jan 26, 2014 4:21 am

With ncpus=12, are you running the code with OpenMP on a single node?

You might want to try compiling under MPI and using 2 or more nodes.

It might depend on your job script...

bhatt.vihang
Posts: 11
Joined: Thu Aug 19, 2010 12:51 pm
Location: Indian Institute of Science

Re: Problem with large size initial files

#5 Post by bhatt.vihang » Sun Jan 26, 2014 9:53 am

Hi,

Thank you for your reply. I am running the code on 512 CPUs. I have noticed in the ".err" file generated by MPI a segmentation fault when the model starts reading a 3D variable; it appears related to memory allocation. I tried various flags along with -heap-arrays, i.e. -mcmodel=large -i-dynamic. However, the program stops at the same location.

Assuming the problem is with I/O, I wrote a small program to read the roms_ini.nc file in serial mode. It works perfectly fine, without any segmentation fault.

I am amazed to see the program bringing my workstation (two Xeons with 16 cores each and 64 GB RAM) to its knees: it occupies all the RAM and starts swapping. This is the same workstation I used to prepare the data.

I am curious what additional flags I need to use, or whether netCDF parallel I/O could help, given that Unidata does not guarantee the reliability of its parallel I/O library.

bhatt.vihang
Posts: 11
Joined: Thu Aug 19, 2010 12:51 pm
Location: Indian Institute of Science

Re: Problem with large size initial files

#6 Post by bhatt.vihang » Sun Jan 26, 2014 12:39 pm

In debug mode I received the following information. Can anyone suggest what it means? It looks like the code is unable to allocate memory, or something similar.
forrtl: severe (408): fort: (2): Subscript #1 of the array GRIDNUMBER has value 1 which is greater than the upper bound of 0

Image PC Routine Line Source
libintlc.so.5 00002B722FFCD9AA Unknown Unknown Unknown
libintlc.so.5 00002B722FFCC4A6 Unknown Unknown Unknown
libifcore.so.5 00002B722F2E87AC Unknown Unknown Unknown
libifcore.so.5 00002B722F25BE42 Unknown Unknown Unknown
libifcore.so.5 00002B722F25C3C3 Unknown Unknown Unknown
oceanG 0000000001C029F9 ntimesteps_ 67 ntimestep.f90
oceanG 00000000006FFBC8 main3d_ 78 main3d.f90
oceanG 0000000000404F2C ocean_control_mod 151 ocean_control.f90
oceanG 00000000004032B5 MAIN__ 86 master.f90
oceanG 0000000000402BDC Unknown Unknown Unknown
libc.so.6 000000379E81D994 Unknown Unknown Unknown
oceanG 0000000000402AE9 Unknown Unknown Unknown
Line 67 in ntimestep.f90 is:
! Loop over all grids in current nesting layer.
!
DO ig=1,GridsInLayer(nl)
ng=GridNumber(ig,nl)
!
! Determine number of steps in time-interval window.
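The bound of 0 in the traceback suggests GridNumber was allocated with zero size, i.e. the nesting parameters came in as zero (this is confirmed later in the thread). A minimal Python stand-in for the same failure (hypothetical names mirroring the Fortran):

```python
# GridNumber allocated with NestLayers = 0 gives an empty array;
# the loop body then indexes past the upper bound, just like the
# forrtl severe (408) subscript-out-of-bounds error above.
GridsInLayer = []          # size 0: as if NestLayers were 0
GridNumber = []            # size 0: no valid subscript at all

try:
    ng = GridNumber[0]     # analogue of Fortran's GridNumber(1, nl)
    raised = False
except IndexError:
    raised = True

print("out of bounds:", raised)   # prints: out of bounds: True
```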

ptimko
Posts: 34
Joined: Wed Mar 02, 2011 6:46 pm
Location: University of Michigan

Re: Problem with large size initial files

#7 Post by ptimko » Mon Jan 27, 2014 12:57 am

I'm kinda stuck; I'm currently out of my office and don't have access to the source code to look at the routines themselves. It appears there is a problem with the way the variable GridNumber is defined. Is it possible that you are missing, or have incorrectly specified, something in your input file(s) that is preventing GridNumber or GridsInLayer from being defined properly?

Without access to the source code I can't check to see how those variables are defined. You could try grepping the source code files to look for how those variables are defined and what might be missing or incorrectly specified in the input file(s).

Unfortunately I don't expect to be back in my office in the near future. Hopefully we can get the attention of one of the ROMS developers to resolve this issue.

Patrick

shchepet
Posts: 185
Joined: Fri Nov 14, 2003 4:57 pm

Re: Problem with large size initial files

#8 Post by shchepet » Mon Jan 27, 2014 1:43 am

The following message
forrtl: severe (408): fort: (2): Subscript #1 of the array GRIDNUMBER has value 1 which is greater than the upper bound of 0
says it all: you either have a simple code bug or an uninitialized index (most likely also due to a code bug, or, alternatively, an allocation/addressing error because you are asking for too much memory).

Do you use Intel ifort compiler?

If you do, then do you have

-mcmodel=medium -i-dynamic

among all the other compiler flags?

Normally, when compiled with default options, the Intel compiler generates code whose static data cannot exceed 2 GB, due to the default 32-bit addressing. The -mcmodel=medium -i-dynamic flags are designed to circumvent this limitation.

This limit is typically not an issue for MPI codes since the sub-problems are usually small. But I suspect that it is no longer true in your case because of the enormous size of your grid. So add the flags, recompile, and see whether the problem remains.
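Where exactly the flags go depends on the build setup; in the ROMS build system they would typically be appended to the Fortran flags in the relevant Compilers/*.mk file (the file name below is an assumption for an ifort build):

```make
# Compilers/Linux-ifort.mk (sketch): allow static data beyond 2 GB.
FFLAGS += -mcmodel=medium -i-dynamic
```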

ptimko
Posts: 34
Joined: Wed Mar 02, 2011 6:46 pm
Location: University of Michigan

Re: Problem with large size initial files

#9 Post by ptimko » Tue Jan 28, 2014 12:54 am

thanks for the input...

The previous message specified that the compile flags were:

-mcmodel=large and -i-dynamic

that's why I was suggesting an error in the input file and/or a code bug.

You could be right about memory limitations. I've run larger grids for global tide simulations using another model (HYCOM) at TACC in Texas and haven't encountered such an error. I'm unable to access the source at this time, so I'm unable to trace it. I think this needs to be looked into, since I'm considering similar grid sizes in my own research once I manage to get back to my office...

arango
Site Admin
Posts: 1117
Joined: Wed Feb 26, 2003 4:41 pm
Location: IMCS, Rutgers University

Re: Problem with large size initial files

#10 Post by arango » Tue Jan 28, 2014 2:12 am

It is possible that either the input nesting parameter NestLayers or GridsInLayer was set to zero in the ocean.in script. That may explain the out-of-bounds access in GridNumber. I updated the code today to overwrite such a zero value in non-nesting applications.

In non-nesting applications, we always need the following values:

Code: Select all

! Number of nested grids.

      Ngrids =  1

! Number of grid nesting layers.  This parameter is used to allow refinement
! and composite grid combinations.

  NestLayers =  1

! Number of grids in each nesting layer [1:NestLayers].

GridsInLayer =  1
This is explained in the documentation.

Now, it seems to me that running this application on 512 processes may be inefficient. You just need to experiment with the values to obtain an optimal parallel partition; perhaps fewer processes are more efficient. In distributed-memory runs you are penalized by the communications needed to fill the ghost-point data. With 512 processes (2^9) you have different possibilities for tile partitions (powers of 2), for example NtileI=2^4=16 and NtileJ=2^5=32. This implies that your horizontal tile size is around 141x101, but it is not exact. In such large applications, I always choose Lm and Mm to be exact powers of two or divisible by nice numbers, so we get an equal partition of tiles.
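The tile arithmetic above can be sketched in Python (assuming interior sizes Lm = xi_rho - 2 = 3239 and Mm = eta_rho - 2 = 2266, derived from the ncdump output; adjust to your actual grid input):

```python
# Enumerate power-of-two tile partitions for 512 MPI processes and
# the approximate horizontal tile size each one implies.
Lm, Mm, nprocs = 3239, 2266, 512

for NtileI in (8, 16, 32, 64):
    NtileJ = nprocs // NtileI
    print(f"NtileI={NtileI:2d}  NtileJ={NtileJ:2d}  "
          f"tile ~ {Lm / NtileI:.0f} x {Mm / NtileJ:.0f}")
```

Since neither 3239 nor 2266 divides evenly by these tile counts, the partitions are only approximate, which is exactly why choosing Lm and Mm with nicer factors helps.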

ptimko
Posts: 34
Joined: Wed Mar 02, 2011 6:46 pm
Location: University of Michigan

Re: Problem with large size initial files

#11 Post by ptimko » Wed Jan 29, 2014 12:44 am

Thanks for the input, I have plans to run similar size grids myself when I get back to work! :)
