ROMS 2.0 with MPI

Report or discuss software problems and other woes

Moderators: arango, robertson

ljzhong
Posts: 14
Joined: Tue Nov 25, 2003 3:36 pm
Location: CSIRO

ROMS 2.0 with MPI

#1 Post by ljzhong » Thu Jan 11, 2007 7:21 pm

Hi,

I am running ROMS 2.0 on a 12-node Linux cluster. I have no problems with an 80x120x20 grid, but when I doubled the horizontal resolution to 160x240x20, the run failed during the initialization phase. Here is the model output:

....

INITIAL: Configurating and initializing ...

Node # 0 (pid= 21711) is active.
Node # 2 (pid= 15312) is active.
Node # 3 (pid= 12698) is active.
Node # 1 (pid= 3404) is active.

.....

Centers of gravity and integrals (values must be 1, 1, approx 1/2, 1, 1):

1.000000000000 1.033944429488 0.516972214744 1.000000000000 1.000000000000
rank 3 in job 1697 master_4268 caused collective abort of all ranks
exit status of rank 3: killed by signal 11
rank 2 in job 1697 master_4268 caused collective abort of all ranks
exit status of rank 2: killed by signal 11

------
I tried increasing the stack size with "ulimit -s unlimited", but it did not help.

I also tried running the model with OpenMP. It did not work on the Linux cluster, but it did work on a dual-processor Linux workstation.

Can anyone offer a suggestion?

By the way, for the 80x160x20 grid I can use 12 processors if I run only the hydrodynamic part, but with the bio-model included I can use at most 4 processors. Does anyone know why?

I use ifort 9.0 and MPICH2.

Liejun

schen
Posts: 29
Joined: Wed Feb 09, 2005 6:34 pm
Location: WHOI

#2 Post by schen » Fri Jan 12, 2007 3:35 pm

Liejun,
if the problem is stack size, setting "ulimit -s unlimited" on the command line may not reach the MPI processes, since they are spawned from non-interactive shells. Try putting it in your .bashrc to see if that solves the problem. ---- Shih-Nan
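P.S. A minimal sketch of what to add to ~/.bashrc on each node (the verification line is optional; exact placement may depend on how your cluster sources its startup files):

```shell
# Raise the stack limit for every shell, including the non-interactive
# shells that mpd/mpiexec spawn on the compute nodes.
ulimit -s unlimited

# Optional sanity check at login; should print "unlimited".
ulimit -s
```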

ljzhong
Posts: 14
Joined: Tue Nov 25, 2003 3:36 pm
Location: CSIRO

#3 Post by ljzhong » Tue Jan 16, 2007 9:30 pm

Shih-Nan, I did as you suggested, but it did not help.

I made a little more progress debugging the code, but I still have no idea how to solve the problem.

Here is what I found. The program aborted when it either read in 3D variables (u, v, T, S) in nf_fread.F or wrote out 3D variables in nf_fwrite.F. Note that these two routines are also called for 2D variables (sea level, ubar, vbar) without any problem. Digging into them, I found that the errors came from CALL mp_scatter (in nf_fread) and CALL mp_gather (in nf_fwrite).

mp_scatter and mp_gather are the routines the master node uses to distribute data to, and collect data from, each tiled node. I have no experience with parallel coding and do not understand why they work for 2D variables but fail for 3D ones.
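One thing the buffer-size arithmetic alone suggests (a rough back-of-the-envelope sketch, not an analysis of the actual ROMS code; grid sizes are from this thread and 8-byte reals are assumed):

```python
# Size of the global scratch buffer that a master-node gather/scatter
# would need for one field at the 160x240x20 resolution (8-byte reals).
Lm, Mm, N = 160, 240, 20   # horizontal points and vertical levels

bytes_2d = Lm * Mm * 8          # one 2D field (e.g. free-surface)
bytes_3d = Lm * Mm * N * 8      # one 3D field (e.g. temperature)

print(f"2D field: {bytes_2d / 1024:.0f} KiB")     # ~300 KiB
print(f"3D field: {bytes_3d / 1024**2:.1f} MiB")  # ~5.9 MiB

# If buffers like these are automatic (stack) arrays, a few ~6 MiB 3D
# buffers can overflow a default 8-10 MiB stack, while the ~300 KiB 2D
# buffers fit easily -- which would match a signal 11 that appears only
# for 3D variables at the doubled resolution.
```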

I appreciate any help.

Liejun

kate
Posts: 3809
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

#4 Post by kate » Tue Jan 16, 2007 10:48 pm

Have you tried getting a newer version of ROMS? 2.0 may have had bugs there that have been fixed since.

ljzhong
Posts: 14
Joined: Tue Nov 25, 2003 3:36 pm
Location: CSIRO

#5 Post by ljzhong » Thu Jan 18, 2007 2:11 pm

Unfortunately, the same error persisted in ROMS 3.0. I did not debug the code, but the output message seems to point to the same problem:

.....

INITIAL: Configurating and initializing forward nonlinear model ...


NLM: GET_STATE - Read state initial conditions, t = 0.0000
(Iter=0001, File: cpb_ini_1996_160x240.nc, Rec=0001, Index=1)
- free-surface
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated u-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated v-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
rank 3 in job 1778 master_4268 caused collective abort of all ranks
exit status of rank 3: killed by signal 11
rank 2 in job 1778 master_4268 caused collective abort of all ranks
exit status of rank 2: killed by signal 11
rank 1 in job 1778 master_4268 caused collective abort of all ranks
exit status of rank 1: killed by signal 11

Paul_Budgell
Posts: 18
Joined: Wed Apr 23, 2003 1:34 pm
Location: IMR, Bergen, Norway

I/O on large arrays?

#6 Post by Paul_Budgell » Fri Jan 19, 2007 11:00 am

Have you tried:

#define INLINE_2DIO

Perhaps there are some issues with reading in large 3D arrays all at once. Reading them in level by level might help.
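As I understand it, the point of level-by-level I/O is that the scratch buffer only ever has to hold one horizontal slab instead of the whole 3D field. A hypothetical sketch of the pattern (a stand-in for the netCDF calls, not the actual ROMS code):

```python
# Sketch of one-level-at-a-time I/O: process a Lm x Mm x N field while
# buffering only a single Lm x Mm level at a time.
# (read_level is a hypothetical stand-in for a per-level netCDF read.)
Lm, Mm, N = 160, 240, 20

def read_level(k):
    # stand-in: read one horizontal slab (level k) from the file
    return [[float(k)] * Mm for _ in range(Lm)]

peak_elements = 0
for k in range(N):
    level = read_level(k)                    # buffer holds one level
    peak_elements = max(peak_elements, Lm * Mm)
    # ... scatter/process this level, then reuse the buffer ...

print(peak_elements)   # 38400 elements, vs 768000 for the full field
```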

ljzhong
Posts: 14
Joined: Tue Nov 25, 2003 3:36 pm
Location: CSIRO

#7 Post by ljzhong » Fri Jan 19, 2007 3:38 pm

I just tried this CPP flag. It solved my I/O problem. Thank you, Paul.

There are still other problems, but I think they are related to the configuration changes involved in switching from ROMS 2.0 to 3.0. INLINE_2DIO is not available in ROMS 2.0, so I will have to carry my ROMS 2.0 modifications over to ROMS 3.0.
