ROMS 3.4 Stalling

Report or discuss software problems and other woes

Moderators: arango, robertson

sarahgid
Posts: 7
Joined: Thu Feb 03, 2011 11:46 pm
Location: University of Washington

ROMS 3.4 Stalling

#1 by sarahgid

I updated to ROMS 3.4 (revision 566) last week, and my model run is now stalling. It is an MPI run compiled with ifort. The job reports that it is still running, but it no longer writes history files or updates the output log file. My colleague, running the same model, had the same problem with revision 558. When she restarted the model from the previous rst file it restarted without trouble, but it continued to stall every day or so. The model is not blowing up, and the identical model on the previous ROMS version 3.3 (revision 534) ran perfectly for a full one-year simulation (about 7.5 days of run time with our grid) without any stalling.

Any thoughts? We like to keep up to date with the newest ROMS and are especially excited about the ability to input multiple forcing files without needing to restart... but if we have to restart frequently because of stalls anyway, it defeats the purpose!

Thanks for any ideas/feedback!
Sarah

nganju
Posts: 82
Joined: Mon Aug 16, 2004 8:47 pm
Location: U.S. Geological Survey, Woods Hole

Re: ROMS 3.4 Stalling

#2 by nganju

Is it stalling on the first time step, at some random spot, or just before writing to a history file? I recall having a similar problem...

ljzhong
Posts: 15
Joined: Tue Nov 25, 2003 3:36 pm
Location: CSIRO

Re: ROMS 3.4 Stalling

#3 by ljzhong

I have the same problem with revision 566. It seems to stall when it reads the initial NetCDF file.

arango
Site Admin
Posts: 1347
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: ROMS 3.4 Stalling

#4 by arango

I haven't observed this. Stalling in distributed-memory applications is usually due to a parallel bug: not all the nodes are at the same place in the code when calling one of the MPI library routines. All the parallel tiles (nodes) need to participate in the MPI communications. One way to check this is to run in serial or with a single node (1x1 partition) to see if the code passes the stalling point. There is not enough information here to diagnose the problem. It would be nice to know exactly where the code is stalling. Some print statements in the Build/*.f90 code will help.

However, the stalling cannot be random if it is due to a parallel bug. It will always be at the same place. If you are seeing random behavior, it may be due to the computer or the compiler. I usually test ROMS with the latest versions of the ifort, pgf90, and gfortran compilers and of the MPICH, OpenMPI, and NetCDF libraries.
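
For readers unfamiliar with this kind of hang, here is a minimal sketch of the idea in plain Python. It is only an analogy: it uses threads and a `threading.Barrier` in place of MPI ranks and a collective call, it is not ROMS code, and every name in it (`simulate`, `skip_rank`, the marker strings, the timeout) is a hypothetical choice made for illustration. Each worker records the last marker it reached, which plays the role of the print statements suggested above.

```python
import threading


def simulate(n_ranks, skip_rank=None, timeout=0.5):
    """Run n_ranks workers that should all meet at one collective point.

    Returns the last progress marker each worker reached. The timeout
    stands in for a person noticing the job has hung; a real MPI
    collective would simply wait forever.
    """
    barrier = threading.Barrier(n_ranks)
    last_marker = ["start"] * n_ranks

    def worker(rank):
        last_marker[rank] = "before_collective"
        if rank == skip_rank:
            # Simulated bug: this rank takes another branch and never
            # joins the collective call.
            last_marker[rank] = "skipped_collective"
            return
        try:
            barrier.wait(timeout=timeout)
            last_marker[rank] = "after_collective"
        except threading.BrokenBarrierError:
            # The others never all arrived, so this rank was stuck waiting.
            last_marker[rank] = "stalled_in_collective"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return last_marker


if __name__ == "__main__":
    print("all ranks participate:", simulate(4))
    print("rank 2 skips the call:", simulate(4, skip_rank=2))
    print("single rank (1x1):    ", simulate(1))
```

In this toy version the stall shows up at the same marker on every run, which matches the point that a parallel bug of this kind is reproducible. A single worker passes the collective point trivially, which is why a 1x1 run is a useful first check.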

sarahgid
Posts: 7
Joined: Thu Feb 03, 2011 11:46 pm
Location: University of Washington

Re: ROMS 3.4 Stalling

#5 by sarahgid

Thanks so much for everyone's replies,

The stalling does indeed appear to be random. It has stalled for us during the creation of a history file, during the reading of climatology values, and in the middle of a time step, and the time and place at which it stalls change with each run.

Our version of ifort is not the newest, but we have upgraded to the latest OpenMPI, and the stalling occurred with both the older OpenMPI and the latest. Two days ago I recompiled with PGI, and so far it appears to be running without stalling. The old ifort with the older ROMS did not stall. Could it be that running the new ROMS 3.4 with ifort requires upgrading our ifort compiler? We will probably stick with PGI if it continues to run without stalling, but I'm just trying to understand the issue.

Thanks again,
Sarah
