ROMS simulation is killed after using too much RAM


rgawde
Posts: 11
Joined: Sat Oct 10, 2015 1:04 am
Location: UMCES Horn Point Lab

ROMS simulation is killed after using too much RAM

#1 Unread post by rgawde »

Hello,

I am trying to run the ROMS model for the Choptank River system. It has been compiled and a test run for the year 2010 was set up. After launching it from the command line with

mpirun -np 8 ./oceanM choproms.in > myrun.log &

the model runs successfully up to about 24,000 timesteps. After that point, however, the run is simply killed with the error:

24778 11051.86782 1.087116E-02 5.255789E+01 5.256876E+01 9.304978E+09 0
24779 11051.86794 1.088490E-02 5.255786E+01 5.256874E+01 9.304784E+09 0
24780 11051.86806 1.089875E-02 5.255783E+01 5.256873E+01 9.304592E+09 0
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 25824 on node cbeps2 exited on signal 9 (Killed).

This was accompanied by a message, "memory space (RAM) on cbeps2 exceeded", sent to the person in charge of the server the simulation was run on. Could anyone please help with this error? Has anyone else experienced this before?
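
From what I understand, signal 9 usually means the kernel's out-of-memory killer stepped in. Would asking the admin to check the kernel log on cbeps2 with something like the following (assuming a Linux node and dmesg access) confirm that?

Code:

# look for out-of-memory kills in the kernel log on the node that died
dmesg | grep -i -E "out of memory|killed process"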

Thanks!

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS simulation is killed after using too much RAM

#2 Unread post by kate »

I have not had a job get killed that far into a run, but I have had them get killed during initialization, when all the memory allocation happens. Was the model about to write some output? Can you get access to more memory, perhaps by using more processors? How many grid points do you have, anyway? My Arctic with 688x1088x50 points fits onto the 32 GB that comes with 16 cores, at least long enough for some debugging.

rgawde
Posts: 11
Joined: Sat Oct 10, 2015 1:04 am
Location: UMCES Horn Point Lab

Re: ROMS simulation is killed after using too much RAM

#3 Unread post by rgawde »

Thanks for your reply, Kate. The server actually has 128 GB of RAM and, I believe, 24 processors, which is a substantial amount of memory. The model itself has 501x261x20 grid points but uses a timestep of 10 s, so even after 24,000 steps the simulation has generated only about two days of output. Also, it doesn't give me a chance to debug... it's simply killed between one timestep and the next.
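
As a sanity check, here is my rough back-of-envelope estimate for one double-precision 3-D array on this grid (how many such arrays ROMS actually allocates is only a guess on my part, but even a few hundred of them would stay far below 128 GB):

Code:

# size of one 3-D double-precision array on the 501x261x20 grid, in MB
echo $(( 501 * 261 * 20 * 8 / 1024 / 1024 ))   # prints 19, i.e. roughly 20 MB per array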

Any suggestions?

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS simulation is killed after using too much RAM

#4 Unread post by kate »

It sounds like the system has more than enough memory for your problem. Now you need to find out if you can access all of it. There are system limits per job and/or per user:
pacman3 851% limit
cputime       8:00:00
filesize      unlimited
datasize      unlimited
stacksize     unlimited
coredumpsize  0 kbytes
memoryuse     unlimited
vmemoryuse    33554432 kbytes
descriptors   1024
memorylocked  unlimited
maxproc       512
pacman3 852% bash
bash-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515220
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) 28800
max user processes              (-u) 512
virtual memory          (kbytes, -v) 33554432
file locks                      (-x) unlimited
limit and ulimit are built into the shell - use the one supported by your shell. I have had things get killed by the cputime limit.
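
If one of these does turn out to be the limit you are hitting, you can usually raise the soft limit in the same shell that launches the job, provided the hard limit allows it. A bash sketch (limits on remote nodes may instead need to be set by the admin or in shell startup files):

Code:

# raise soft limits in the launching shell, then start the run as before
ulimit -v unlimited    # virtual memory
ulimit -t unlimited    # cpu time
mpirun -np 8 ./oceanM choproms.in > myrun.log &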

I asked about output because the model creates and destroys temporary arrays during the output (other times too).

rgawde
Posts: 11
Joined: Sat Oct 10, 2015 1:04 am
Location: UMCES Horn Point Lab

Re: ROMS simulation is killed after using too much RAM

#5 Unread post by rgawde »

First, I would like to say thanks for all your suggestions! I checked with the person in charge of the server, and there are no limits on memory or CPU time for my account. I'm not sure what you mean by "limit and ulimit are built into the shell", though. Could you explain that a bit more?

Also, the output is being written constantly, but my concern is the one you pointed out: I suspect the temporary arrays being created are not deallocated once the output is written, and that is eating up the server's memory.

I'm not certain what to do at this point.

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS simulation is killed after using too much RAM

#6 Unread post by kate »

Is there another compiler you can try? I have access to three different ones here.

Are there system tools for looking at your memory use? Does it grow linearly, or does it grow with each output? Memory leaks are notoriously difficult to debug, and I don't have experience doing so because I haven't found ROMS to be horrible in that regard. Then again, I'm limited to two days of wallclock time by the machine queues and have to restart all the time.
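
Even something as simple as a loop over ps will tell you whether the resident memory grows steadily or jumps at each output. A sketch (adjust the command name if your executable is called something else):

Code:

# log the resident memory (RSS, in kbytes) of every oceanM rank once a minute
while true; do
    date
    ps -C oceanM -o pid=,rss=
    sleep 60
done >> memlog.txt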

smchen
Posts: 11
Joined: Sat Mar 21, 2015 12:38 am
Location: TORI, Taiwan

Re: ROMS simulation is killed after using too much RAM

#7 Unread post by smchen »

ulimit (for bash) and limit (for tcsh) are shell built-ins that show the resource limits on your account. My ulimit output is below, and our model with a 600x400x20 grid runs for hundreds of thousands of steps on 32 MPI processes (2 servers, 16 cores each).

Code:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256300
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
You can use either ulimit -a or limit to check the limitations.

rgawde
Posts: 11
Joined: Sat Oct 10, 2015 1:04 am
Location: UMCES Horn Point Lab

Re: ROMS simulation is killed after using too much RAM

#8 Unread post by rgawde »

Kate: I am not sure how I would go about changing compilers; I am relatively new to ROMS. I believe I am currently compiling with mpif90, and I think I also have access to ifort. Do you have any experience with the two?
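
From what I have read, mpif90 is only a wrapper around a real Fortran compiler, so I will try asking it which one it wraps; the exact option seems to depend on the MPI distribution:

Code:

mpif90 --version    # the wrapper passes this through to the underlying compiler
mpif90 -show        # MPICH and friends: print the underlying compile command
mpif90 --showme     # Open MPI: same idea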

Smchen: Thanks for your reply. I did check the limits. Here's my ulimit output:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1031350
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Does the max locked memory being 64 kbytes have anything to do with my problem?

Thanks!

smchen
Posts: 11
Joined: Sat Mar 21, 2015 12:38 am
Location: TORI, Taiwan

Re: ROMS simulation is killed after using too much RAM

#9 Unread post by smchen »

Max locked memory is not related to your problem, in my opinion. My max locked memory is set to unlimited only because our MPI software complains when it is too small; our model still runs with a 64-kbyte max locked memory, it just leaves annoying messages in the log.

I would suggest running ROMS in OpenMP or even serial mode to confirm whether the problem is due to MPI. Also, sitting in front of the terminal and using the "top" command to monitor memory usage is an idea. Of course, you don't need to sit there from the start of the run; just watch it before the model terminates.
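
For the serial test, ROMS builds a differently named executable when MPI is switched off in the build script (in my setup it is oceanS, and it reads its input from standard input rather than as an argument; your names may differ). You can also leave top running in batch mode instead of watching it live:

Code:

# run the serial build
./oceanS < choproms.in > myrun_serial.log &

# record memory use unattended: batch-mode top, one sample per minute
top -b -d 60 | grep --line-buffered ocean >> memwatch.txt &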

kate
Posts: 4088
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS simulation is killed after using too much RAM

#10 Unread post by kate »

To find out about the compilers, you need to ask your local system people. On the system here, it is handled with the "module" command, so there's a PrgEnv-gnu, a PrgEnv-pgi and a PrgEnv-intel. In all cases the Fortran compiler is invoked with "mpif90". The modules may or may not come with appropriate NetCDF libraries, depending. Anyway, I like having access to more than one compiler on the off chance that there are compiler bugs.
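
On a machine that uses environment modules it looks something like this; the module names will certainly differ on your system:

Code:

module avail                          # list the installed compiler environments
module list                           # show what is currently loaded
module swap PrgEnv-pgi PrgEnv-gnu     # switch compilers (Cray-style names)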

bjhaupt
Posts: 3
Joined: Fri Jun 27, 2008 5:37 pm
Location: Pennstate University

Re: ROMS simulation is killed after using too much RAM

#11 Unread post by bjhaupt »

I once had a job that ran for several model years and then supposedly ran out of memory. I couldn't explain it because I had done similar runs before. I removed the output files and ran the same run again without any problems. I learned later that one of the nodes I had requested had an issue. Can you check with your sysadmins?

Bernd

rgawde
Posts: 11
Joined: Sat Oct 10, 2015 1:04 am
Location: UMCES Horn Point Lab

Re: ROMS simulation is killed after using too much RAM

#12 Unread post by rgawde »

Thank you all for your input.
Smchen: I have started the run in serial mode. That is the first thing to check.

Kate: I am looking into the possibility of it being a compiler issue. The setup was previously compiled with gfortran and ran successfully; my version was compiled with ifort and is the one running into memory problems. I'm checking on that now.
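
If I understand the build script correctly, the compiler is chosen there, so switching back should just be a matter of editing it and rebuilding. A sketch, assuming the standard ROMS build.bash is being used:

Code:

# in build.bash, change the compiler selection, e.g.
#   export FORT=ifort   ->   export FORT=gfortran
# then rebuild from scratch
./build.bash -j 4 > build.log 2>&1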

Bernd: I don't believe it is a node issue, since others are running their simulations successfully on the same system.
